author     Jan Beulich <jbeulich@novell.com>   2012-05-23 15:59:25 +0200
committer  Jan Beulich <jbeulich@novell.com>   2012-05-23 15:59:25 +0200
commit     201b6a1a6f5a5b66dbf9ed9013a1a39ddf99fd2f (patch)
tree       f45bfbedbec297c9cf507bf5613ada53188ace0f
parent     526e2c1490f9804f8dd5c47f7cfaba74e1b80a4d (diff)
- Update Xen patches to 3.4-final and c/s 1177.
(tags: rpm-3.4.0-2--openSUSE-12.2-Beta1, rpm-3.4.0-2)
-rw-r--r--  Documentation/vm/frontswap.txt                120
-rw-r--r--  arch/x86/include/mach-xen/asm/hypervisor.h     33
-rw-r--r--  arch/x86/include/mach-xen/asm/processor.h       2
-rw-r--r--  arch/x86/kernel/irq_64.c                        6
-rw-r--r--  arch/x86/kernel/setup-xen.c                    20
-rw-r--r--  drivers/xen/blktap/blktap.c                    42
-rw-r--r--  drivers/xen/blktap2-new/control.c               1
-rw-r--r--  drivers/xen/blktap2/control.c                   1
-rw-r--r--  drivers/xen/core/gnttab.c                      14
-rw-r--r--  drivers/xen/gntdev/gntdev.c                   392
-rw-r--r--  drivers/xen/pci.c                               2
-rw-r--r--  include/linux/frontswap.h                      11
-rw-r--r--  mm/frontswap.c                                 76
13 files changed, 413 insertions, 307 deletions
diff --git a/Documentation/vm/frontswap.txt b/Documentation/vm/frontswap.txt
index 5a1a00c68231..a9f731af0fac 100644
--- a/Documentation/vm/frontswap.txt
+++ b/Documentation/vm/frontswap.txt
@@ -2,6 +2,13 @@ Frontswap provides a "transcendent memory" interface for swap pages.
In some environments, dramatic performance savings may be obtained because
swapped pages are saved in RAM (or a RAM-like device) instead of a swap disk.
+(Note, frontswap -- and cleancache (merged at 3.0) -- are the "frontends"
+and the only necessary changes to the core kernel for transcendent memory;
+all other supporting code -- the "backends" -- is implemented as drivers.
+See the LWN.net article "Transcendent memory in a nutshell" for a detailed
+overview of frontswap and related kernel parts:
+https://lwn.net/Articles/454795/ )
+
Frontswap is so named because it can be thought of as the opposite of
a "backing" store for a swap device. The storage is assumed to be
a synchronous concurrency-safe page-oriented "pseudo-RAM device" conforming
@@ -31,6 +38,12 @@ a disk write and, if the data is later read back, a disk read are avoided.
If a put returns failure, transcendent memory has rejected the data, and the
page can be written to swap as usual.
+If a backend chooses, frontswap can be configured as a "writethrough
+cache" by calling frontswap_writethrough(). In this mode, the reduction
+in swap device writes is lost (along with a non-trivial performance advantage)
+in order to allow the backend to arbitrarily "reclaim" space used to
+store frontswap pages to more completely manage its memory usage.
+
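As an illustrative sketch (not part of this patch), a backend that wants
this freedom would flip the mode on when it registers; only
frontswap_writethrough() and frontswap_register_ops() are real interfaces
from this patch, the backend itself is hypothetical:

/* Hedged sketch: a hypothetical backend opting in to writethrough. */
#include <linux/frontswap.h>

static struct frontswap_ops my_ops = {
	/* .init, .put_page, .get_page, ... supplied by the backend */
};

static int __init my_backend_init(void)
{
	frontswap_register_ops(&my_ops);
	/* Every put now also reaches the real swap device, so any page
	 * we hold can later be reclaimed without data loss. */
	frontswap_writethrough(true);
	return 0;
}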
Note that if a page is put and the page already exists in transcendent memory
(a "duplicate" put), either the put succeeds and the data is overwritten,
or the put fails AND the page is invalidated. This ensures stale data may
@@ -63,21 +76,46 @@ but-much-faster-than-disk "pseudo-RAM device" and the frontswap (and
cleancache) interface to transcendent memory provides a nice way to read
and write -- and indirectly "name" -- the pages.
+Frontswap -- and cleancache -- with a fairly small impact on the kernel,
+provides a huge amount of flexibility for more dynamic, flexible RAM
+utilization in various system configurations:
+
+In the single kernel case, aka "zcache", pages are compressed and
+stored in local memory, thus increasing the total anonymous pages
+that can be safely kept in RAM. Zcache essentially trades off CPU
+cycles used in compression/decompression for better memory utilization.
+Benchmarks have shown little or no impact when memory pressure is
+low while providing a significant performance improvement (25%+)
+on some workloads under high memory pressure.
+
+"RAMster" builds on zcache by adding "peer-to-peer" transcendent memory
+support for clustered systems. Frontswap pages are locally compressed
+as in zcache, but then "remotified" to another system's RAM. This
+allows RAM to be dynamically load-balanced back-and-forth as needed,
+i.e. when system A is overcommitted, it can swap to system B, and
+vice versa. RAMster can also be configured as a memory server so
+many servers in a cluster can swap, dynamically as needed, to a single
+server configured with a large amount of RAM... without pre-configuring
+how much of the RAM is available for each of the clients!
+
In the virtual case, the whole point of virtualization is to statistically
multiplex physical resources across the varying demands of multiple
virtual machines. This is really hard to do with RAM and efforts to do
it well with no kernel changes have essentially failed (except in some
-well-publicized special-case workloads). Frontswap -- and cleancache --
-with a fairly small impact on the kernel, provides a huge amount
-of flexibility for more dynamic, flexible RAM multiplexing.
+well-publicized special-case workloads).
Specifically, the Xen Transcendent Memory backend allows otherwise
"fallow" hypervisor-owned RAM to not only be "time-shared" between multiple
virtual machines, but the pages can be compressed and deduplicated to
optimize RAM utilization. And when guest OS's are induced to surrender
-underutilized RAM (e.g. with "self-ballooning"), sudden unexpected
+underutilized RAM (e.g. with "selfballooning"), sudden unexpected
memory pressure may result in swapping; frontswap allows those pages
-to be swapped to and from hypervisor RAM if overall host system memory
-conditions allow.
+to be swapped to and from hypervisor RAM (if overall host system memory
+conditions allow), thus mitigating the potentially awful performance impact
+of unplanned swapping.
+
+A KVM implementation is underway and has been RFC'ed to lkml. And,
+using frontswap, investigation is also underway on the use of NVM as
+a memory extension technology.
2) Sure there may be performance advantages in some situations, but
what's the space/time overhead of frontswap?
@@ -104,6 +142,12 @@ the existing eight bits, but let's worry about that minor optimization
later.) For very large swap disks (which are rare) on a standard
4K pagesize, this is 1MB per 32GB swap.
+When swap pages are stored in transcendent memory instead of written
+out to disk, there is a side effect that this may create more memory
+pressure that can potentially outweigh the other advantages. A
+backend, such as zcache, must implement policies to carefully (but
+dynamically) manage memory limits to ensure this doesn't happen.
+
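For a quick sanity check of the 1MB-per-32GB figure above (a userspace
sketch, not kernel code):

/* One bit of frontswap_map per 4K swap page. */
#include <stdio.h>

int main(void)
{
	unsigned long long swap_bytes = 32ULL << 30;   /* 32GB swap device */
	unsigned long long pages = swap_bytes / 4096;  /* 4K pages */
	unsigned long long map_bytes = pages / 8;      /* one bit per page */

	printf("%llu pages -> %llu KB of map\n", pages, map_bytes >> 10);
	/* prints: 8388608 pages -> 1024 KB of map */
	return 0;
}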
3) OK, how about a quick overview of what this frontswap patch does
in terms that a kernel hacker can grok?
@@ -145,19 +189,24 @@ put" and (possibly) a "frontswap backend get", which are presumably much
faster.
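Schematically, the writeout-side hook looks something like the sketch
below; this is a simplification rather than the patch's actual code, and
swap_writepage_bio() is a hypothetical stand-in for the normal disk path:

/* Simplified sketch of the "frontswap backend put" hook. */
static int swap_writepage_sketch(struct page *page,
				 struct writeback_control *wbc)
{
	if (__frontswap_put_page(page) == 0) {
		/* Backend accepted the page: no disk write needed. */
		set_page_writeback(page);
		unlock_page(page);
		end_page_writeback(page);
		return 0;
	}
	/* Backend rejected the page: fall back to real block I/O. */
	return swap_writepage_bio(page, wbc);	/* hypothetical */
}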
4) Can't frontswap be configured as a "special" swap device that is
- just higher priority than any real swap device (e.g. like zswap)?
+ just higher priority than any real swap device (e.g. like zswap,
+ or maybe swap-over-nbd/NFS)?
-No. Recall that acceptance of any swap page by the frontswap
-backend is entirely unpredictable. This is critical to the definition
-of frontswap because it grants completely dynamic discretion to the
-backend. But since any "put" might fail, there must always be a real
-slot on a real swap device to swap the page. Thus frontswap must be
-implemented as a "shadow" to every swapon'd device with the potential
-capability of holding every page that the swap device might have held
-and the possibility that it might hold no pages at all.
-On the downside, this also means that frontswap cannot contain more
-pages than the total of swapon'd swap devices. For example, if NO
-swap device is configured on some installation, frontswap is useless.
+No. First, the existing swap subsystem doesn't allow for any kind of
+swap hierarchy. Perhaps it could be rewritten to accommodate a hierarchy,
+but this would require fairly drastic changes. Even if it were
+rewritten, the existing swap subsystem uses the block I/O layer which
+assumes a swap device is fixed size and any page in it is linearly
+addressable. Frontswap barely touches the existing swap subsystem,
+and works around the constraints of the block I/O subsystem to provide
+a great deal of flexibility and dynamicity.
+
+For example, the acceptance of any swap page by the frontswap backend is
+entirely unpredictable. This is critical to the definition of frontswap
+backends because it grants completely dynamic discretion to the
+backend. In zcache, one cannot know a priori how compressible a page is.
+"Poorly" compressible pages can be rejected, and "poorly" can itself be
+defined dynamically depending on current memory constraints.
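For instance, a zcache-style put handler could exercise that discretion
as in the sketch below; compress_page(), current_threshold() and
store_compressed() are made-up helpers, not real zcache functions:

/* Hypothetical sketch of dynamic "poorly compressible" rejection. */
static int zlike_put_page(unsigned type, pgoff_t offset,
			  struct page *page)
{
	size_t clen = compress_page(page);	/* hypothetical */

	/* "Poorly" is defined dynamically from memory pressure. */
	if (clen > current_threshold())		/* hypothetical */
		return -1;	/* reject; page goes to the real swap slot */

	return store_compressed(type, offset, clen);	/* hypothetical */
}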
Further, frontswap is entirely synchronous whereas a real swap
device is, by definition, asynchronous and uses block I/O. The
@@ -166,14 +215,30 @@ that are inappropriate for a RAM-oriented device including delaying
the write of some pages for a significant amount of time. Synchrony is
required to ensure the dynamicity of the backend and to avoid thorny race
conditions that would unnecessarily and greatly complicate frontswap
-and/or the block I/O subsystem.
+and/or the block I/O subsystem. That said, only the initial "put"
+and "get" operations need be synchronous. A separate asynchronous thread
+is free to manipulate the pages stored by frontswap. For example,
+the "remotification" thread in RAMster uses standard asynchronous
+kernel sockets to move compressed frontswap pages to a remote machine.
+Similarly, a KVM guest-side implementation could do in-guest compression
+and use "batched" hypercalls.
In a virtualized environment, the dynamicity allows the hypervisor
(or host OS) to do "intelligent overcommit". For example, it can
choose to accept pages only until host-swapping might be imminent,
-then force guests to do their own swapping. In zcache, "poorly"
-compressible pages can be rejected, where "poorly" can itself be defined
-dynamically depending on current memory constraints.
+then force guests to do their own swapping.
+
+There is a downside to the transcendent memory specifications for
+frontswap: Since any "put" might fail, there must always be a real
+slot on a real swap device to swap the page. Thus frontswap must be
+implemented as a "shadow" to every swapon'd device with the potential
+capability of holding every page that the swap device might have held
+and the possibility that it might hold no pages at all. This means
+that frontswap cannot contain more pages than the total of swapon'd
+swap devices. For example, if NO swap device is configured on some
+installation, frontswap is useless. Swapless portable devices
+can still use frontswap but a backend for such devices must configure
+some kind of "ghost" swap device and ensure that it is never used.
5) Why this weird definition about "duplicate puts"? If a page
has been previously successfully put, can't it always be
@@ -195,9 +260,12 @@ When the (non-frontswap) swap subsystem swaps out a page to a real
swap device, that page is only taking up low-value pre-allocated disk
space. But if frontswap has placed a page in transcendent memory, that
page may be taking up valuable real estate. The frontswap_shrink
-routine allows code outside of the swap subsystem (such as Xen tmem
-or zcache or some future tmem backend) to force pages out of the memory
-managed by frontswap and back into kernel-addressable memory.
+routine allows code outside of the swap subsystem to force pages out
+of the memory managed by frontswap and back into kernel-addressable memory.
+For example, in RAMster, a "suction driver" thread will attempt
+to "repatriate" pages sent to a remote machine back to the local machine;
+this is driven using the frontswap_shrink mechanism when memory pressure
+subsides.
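A caller of that mechanism might look like the sketch below; the halving
policy is purely illustrative, but frontswap_shrink() and
frontswap_curr_pages() are the interfaces declared in
include/linux/frontswap.h later in this patch:

/* Sketch: ease memory pressure by repatriating half the pages. */
static void relieve_pressure_sketch(void)
{
	unsigned long target = frontswap_curr_pages() / 2;

	/* Force pages out of frontswap until only 'target' remain. */
	frontswap_shrink(target);
}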
7) Why does the frontswap patch create the new include file swapfile.h?
@@ -207,4 +275,4 @@ static and global. This seemed a reasonable compromise: Define
them as global but declare them in a new include file that isn't
included by the large number of source files that include swap.h.
-Dan Magenheimer, last updated September 12, 2011
+Dan Magenheimer, last updated April 9, 2012
diff --git a/arch/x86/include/mach-xen/asm/hypervisor.h b/arch/x86/include/mach-xen/asm/hypervisor.h
index f668981d2768..c167535f99c9 100644
--- a/arch/x86/include/mach-xen/asm/hypervisor.h
+++ b/arch/x86/include/mach-xen/asm/hypervisor.h
@@ -39,30 +39,31 @@
#include <xen/interface/xen.h>
#include <xen/interface/sched.h>
#include <xen/interface/vcpu.h>
+#include <asm/percpu.h>
#include <asm/ptrace.h>
#include <asm/pgtable_types.h>
-#include <asm/smp-processor-id.h>
extern shared_info_t *HYPERVISOR_shared_info;
-#ifdef CONFIG_XEN_VCPU_INFO_PLACEMENT
+#if defined(CONFIG_XEN_VCPU_INFO_PLACEMENT)
DECLARE_PER_CPU(struct vcpu_info, vcpu_info);
-#define vcpu_info(cpu) (&per_cpu(vcpu_info, cpu))
-#define current_vcpu_info() (&__get_cpu_var(vcpu_info))
-#define vcpu_info_read(fld) percpu_read(vcpu_info.fld)
-#define vcpu_info_write(fld, val) percpu_write(vcpu_info.fld, val)
-#define vcpu_info_xchg(fld, val) percpu_xchg(vcpu_info.fld, val)
+# define vcpu_info(cpu) (&per_cpu(vcpu_info, cpu))
+# define current_vcpu_info() (&__get_cpu_var(vcpu_info))
+# define vcpu_info_read(fld) percpu_read(vcpu_info.fld)
+# define vcpu_info_write(fld, val) percpu_write(vcpu_info.fld, val)
+# define vcpu_info_xchg(fld, val) percpu_xchg(vcpu_info.fld, val)
void setup_vcpu_info(unsigned int cpu);
void adjust_boot_vcpu_info(void);
-#else
-#define vcpu_info(cpu) (HYPERVISOR_shared_info->vcpu_info + (cpu))
-#ifdef CONFIG_SMP
-#define current_vcpu_info() vcpu_info(smp_processor_id())
-#else
-#define current_vcpu_info() vcpu_info(0)
-#endif
-#define vcpu_info_read(fld) (current_vcpu_info()->fld)
-#define vcpu_info_write(fld, val) (current_vcpu_info()->fld = (val))
+#elif defined(CONFIG_XEN)
+# define vcpu_info(cpu) (HYPERVISOR_shared_info->vcpu_info + (cpu))
+# ifdef CONFIG_SMP
+# include <asm/smp-processor-id.h>
+# define current_vcpu_info() vcpu_info(smp_processor_id())
+# else
+# define current_vcpu_info() vcpu_info(0)
+# endif
+# define vcpu_info_read(fld) (current_vcpu_info()->fld)
+# define vcpu_info_write(fld, val) (current_vcpu_info()->fld = (val))
static inline void setup_vcpu_info(unsigned int cpu) {}
#endif
diff --git a/arch/x86/include/mach-xen/asm/processor.h b/arch/x86/include/mach-xen/asm/processor.h
index 6bc8580ae46e..11ee22b2407c 100644
--- a/arch/x86/include/mach-xen/asm/processor.h
+++ b/arch/x86/include/mach-xen/asm/processor.h
@@ -924,6 +924,7 @@ extern int set_tsc_mode(unsigned int val);
extern int amd_get_nb_id(int cpu);
+#ifndef CONFIG_XEN
struct aperfmperf {
u64 aperf, mperf;
};
@@ -952,6 +953,7 @@ unsigned long calc_aperfmperf_ratio(struct aperfmperf *old,
return ratio;
}
+#endif
/*
* AMD errata checking
diff --git a/arch/x86/kernel/irq_64.c b/arch/x86/kernel/irq_64.c
index d04d3ecded62..916a7dec64ef 100644
--- a/arch/x86/kernel/irq_64.c
+++ b/arch/x86/kernel/irq_64.c
@@ -39,7 +39,9 @@ static inline void stack_overflow_check(struct pt_regs *regs)
{
#ifdef CONFIG_DEBUG_STACKOVERFLOW
#define STACK_TOP_MARGIN 128
+#ifndef CONFIG_X86_NO_TSS
struct orig_ist *oist;
+#endif
u64 irq_stack_top, irq_stack_bottom;
u64 estack_top, estack_bottom;
u64 curbase = (u64)task_stack_page(current);
@@ -58,11 +60,15 @@ static inline void stack_overflow_check(struct pt_regs *regs)
if (regs->sp >= irq_stack_top && regs->sp <= irq_stack_bottom)
return;
+#ifndef CONFIG_X86_NO_TSS
oist = &__get_cpu_var(orig_ist);
estack_top = (u64)oist->ist[0] - EXCEPTION_STKSZ + STACK_TOP_MARGIN;
estack_bottom = (u64)oist->ist[N_EXCEPTION_STACKS - 1];
if (regs->sp >= estack_top && regs->sp <= estack_bottom)
return;
+#else
+ estack_top = estack_bottom = 0;
+#endif
WARN_ONCE(1, "do_IRQ(): %s has overflown the kernel stack (cur:%Lx,sp:%lx,irq stk top-bottom:%Lx-%Lx,exception stk top-bottom:%Lx-%Lx)\n",
current->comm, curbase, regs->sp,
diff --git a/arch/x86/kernel/setup-xen.c b/arch/x86/kernel/setup-xen.c
index 56aca1c6ee54..0e75af8588b1 100644
--- a/arch/x86/kernel/setup-xen.c
+++ b/arch/x86/kernel/setup-xen.c
@@ -187,19 +187,25 @@ static void __init
setup_pfn_to_mfn_frame_list(typeof(__alloc_bootmem) *__alloc_bootmem)
{
unsigned long i, j, size;
- unsigned int k, fpp = PAGE_SIZE / sizeof(unsigned long);
+ unsigned int k, order, fpp = PAGE_SIZE / sizeof(unsigned long);
size = (max_pfn + fpp - 1) / fpp;
size = (size + fpp - 1) / fpp;
++size; /* include a zero terminator for crash tools */
size *= sizeof(unsigned long);
+ order = get_order(size);
if (__alloc_bootmem)
- pfn_to_mfn_frame_list_list = alloc_bootmem_pages(size);
- if (size > PAGE_SIZE
- && xen_create_contiguous_region((unsigned long)
- pfn_to_mfn_frame_list_list,
- get_order(size), 0))
- BUG();
+ pfn_to_mfn_frame_list_list =
+ alloc_bootmem_pages(PAGE_SIZE << order);
+ if (order) {
+ if (xen_create_contiguous_region((unsigned long)
+ pfn_to_mfn_frame_list_list,
+ order, 0))
+ pr_err("List of P2M frame lists is not contiguous, %s will not work",
+ is_initial_xendomain()
+ ? "kdump" : "save/restore");
+ memset(pfn_to_mfn_frame_list_list, 0, size);
+ }
size -= sizeof(unsigned long);
if (__alloc_bootmem)
pfn_to_mfn_frame_list = alloc_bootmem(size);
diff --git a/drivers/xen/blktap/blktap.c b/drivers/xen/blktap/blktap.c
index eafd5db5cecb..bfa203abbe4c 100644
--- a/drivers/xen/blktap/blktap.c
+++ b/drivers/xen/blktap/blktap.c
@@ -124,7 +124,7 @@ typedef struct tap_blkif {
[req id, idx] tuple */
blkif_t *blkif; /*Associate blkif with tapdev */
struct domid_translate_ext trans; /*Translation from domid to bus. */
- struct vm_foreign_map foreign_map; /*Mapping page */
+ struct vm_foreign_map *foreign_maps; /*Mapping pages */
} tap_blkif_t;
static struct tap_blkif *tapfds[MAX_TAP_DEV];
@@ -340,7 +340,7 @@ static pte_t blktap_clear_pte(struct vm_area_struct *vma,
pg = idx_to_page(mmap_idx, pending_idx, seg);
ClearPageReserved(pg);
- info->foreign_map.map[offset + RING_PAGES] = NULL;
+ info->foreign_maps->map[offset + RING_PAGES] = NULL;
khandle = &pending_handle(mmap_idx, pending_idx, seg);
@@ -388,12 +388,17 @@ static pte_t blktap_clear_pte(struct vm_area_struct *vma,
static void blktap_vma_open(struct vm_area_struct *vma)
{
tap_blkif_t *info;
+ unsigned long idx;
+ struct vm_foreign_map *foreign_map;
+
if (vma->vm_file == NULL)
return;
info = vma->vm_file->private_data;
- vma->vm_private_data =
- &info->foreign_map.map[(vma->vm_start - info->rings_vstart) >> PAGE_SHIFT];
+ idx = (vma->vm_start - info->rings_vstart) >> PAGE_SHIFT;
+ foreign_map = &info->foreign_maps[idx];
+ foreign_map->map = &info->foreign_maps->map[idx];
+ vma->vm_private_data = foreign_map;
}
/* tricky part
@@ -403,7 +408,6 @@ static void blktap_vma_open(struct vm_area_struct *vma)
*/
static void blktap_vma_close(struct vm_area_struct *vma)
{
- tap_blkif_t *info;
struct vm_area_struct *next = vma->vm_next;
if (next == NULL ||
@@ -413,9 +417,7 @@ static void blktap_vma_close(struct vm_area_struct *vma)
vma->vm_file != next->vm_file)
return;
- info = vma->vm_file->private_data;
- next->vm_private_data =
- &info->foreign_map.map[(next->vm_start - info->rings_vstart) >> PAGE_SHIFT];
+ blktap_vma_open(next);
}
static struct vm_operations_struct blktap_vm_ops = {
@@ -653,8 +655,9 @@ static int blktap_release(struct inode *inode, struct file *filp)
mm = xchg(&info->mm, NULL);
if (mm)
mmput(mm);
- kfree(info->foreign_map.map);
- info->foreign_map.map = NULL;
+ kfree(info->foreign_maps->map);
+ kfree(info->foreign_maps);
+ info->foreign_maps = NULL;
/* Free the ring page. */
ClearPageReserved(virt_to_page(info->ufe_ring.sring));
@@ -743,14 +746,19 @@ static int blktap_mmap(struct file *filp, struct vm_area_struct *vma)
}
/* Mark this VM as containing foreign pages, and set up mappings. */
- info->foreign_map.map = kzalloc(((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) *
- sizeof(*info->foreign_map.map), GFP_KERNEL);
- if (info->foreign_map.map == NULL) {
+ info->foreign_maps = kcalloc(size, sizeof(*info->foreign_maps),
+ GFP_KERNEL);
+ if (info->foreign_maps)
+ info->foreign_maps->map =
+ kcalloc(size, sizeof(*info->foreign_maps->map),
+ GFP_KERNEL);
+ if (!info->foreign_maps || !info->foreign_maps->map) {
+ kfree(info->foreign_maps);
WPRINTK("Couldn't alloc VM_FOREIGN map.\n");
goto fail;
}
- vma->vm_private_data = &info->foreign_map;
+ vma->vm_private_data = info->foreign_maps;
vma->vm_flags |= VM_FOREIGN;
vma->vm_flags |= VM_DONTCOPY;
@@ -1258,7 +1266,7 @@ static int blktap_read_ufe_ring(tap_blkif_t *info)
pg = idx_to_page(mmap_idx, pending_idx, j);
ClearPageReserved(pg);
offset = (uvaddr - info->rings_vstart) >> PAGE_SHIFT;
- info->foreign_map.map[offset] = NULL;
+ info->foreign_maps->map[offset] = NULL;
}
fast_flush_area(pending_req, pending_idx, usr_idx, info);
make_response(blkif, pending_req->id, res.operation,
@@ -1559,7 +1567,7 @@ static void dispatch_rw_block_io(blkif_t *blkif,
FOREIGN_FRAME(map[i].dev_bus_addr
>> PAGE_SHIFT));
offset = (uvaddr - info->rings_vstart) >> PAGE_SHIFT;
- info->foreign_map.map[offset] = pg;
+ info->foreign_maps->map[offset] = pg;
}
} else {
for (i = 0; i < nseg; i++) {
@@ -1585,7 +1593,7 @@ static void dispatch_rw_block_io(blkif_t *blkif,
offset = (uvaddr - info->rings_vstart) >> PAGE_SHIFT;
pg = idx_to_page(mmap_idx, pending_idx, i);
- info->foreign_map.map[offset] = pg;
+ info->foreign_maps->map[offset] = pg;
}
}
diff --git a/drivers/xen/blktap2-new/control.c b/drivers/xen/blktap2-new/control.c
index 615df7469648..39460cd3976b 100644
--- a/drivers/xen/blktap2-new/control.c
+++ b/drivers/xen/blktap2-new/control.c
@@ -314,3 +314,4 @@ module_init(blktap_init);
module_exit(blktap_exit);
MODULE_LICENSE("Dual BSD/GPL");
MODULE_ALIAS("devname:" BLKTAP2_DEV_DIR "control");
+MODULE_ALIAS("xen-backend:tap2");
diff --git a/drivers/xen/blktap2/control.c b/drivers/xen/blktap2/control.c
index f44714361707..a6fe24ef250d 100644
--- a/drivers/xen/blktap2/control.c
+++ b/drivers/xen/blktap2/control.c
@@ -283,3 +283,4 @@ module_init(blktap_init);
module_exit(blktap_exit);
MODULE_LICENSE("Dual BSD/GPL");
MODULE_ALIAS("devname:" BLKTAP2_DEV_DIR "control");
+MODULE_ALIAS("xen-backend:tap2");
diff --git a/drivers/xen/core/gnttab.c b/drivers/xen/core/gnttab.c
index 7f1b507b24d8..d15fe5f62abf 100644
--- a/drivers/xen/core/gnttab.c
+++ b/drivers/xen/core/gnttab.c
@@ -905,7 +905,7 @@ int __devinit
#endif
gnttab_init(void)
{
- int i;
+ int i, ret;
unsigned int max_nr_glist_frames, nr_glist_frames;
unsigned int nr_init_grefs;
@@ -928,12 +928,16 @@ gnttab_init(void)
nr_glist_frames = nr_freelist_frames(nr_grant_frames);
for (i = 0; i < nr_glist_frames; i++) {
gnttab_list[i] = (grant_ref_t *)__get_free_page(GFP_KERNEL);
- if (gnttab_list[i] == NULL)
+ if (gnttab_list[i] == NULL) {
+ ret = -ENOMEM;
goto ini_nomem;
+ }
}
- if (gnttab_resume() < 0)
- return -ENODEV;
+ if (gnttab_resume() < 0) {
+ ret = -ENODEV;
+ goto ini_nomem;
+ }
nr_init_grefs = nr_grant_frames * ENTRIES_PER_GRANT_FRAME;
@@ -967,7 +971,7 @@ gnttab_init(void)
for (i--; i >= 0; i--)
free_page((unsigned long)gnttab_list[i]);
kfree(gnttab_list);
- return -ENOMEM;
+ return ret;
}
#ifdef CONFIG_XEN
diff --git a/drivers/xen/gntdev/gntdev.c b/drivers/xen/gntdev/gntdev.c
index 3ec5dbc96b38..16092ddc5b0b 100644
--- a/drivers/xen/gntdev/gntdev.c
+++ b/drivers/xen/gntdev/gntdev.c
@@ -80,7 +80,6 @@ typedef struct gntdev_grant_info {
grant_ref_t ref;
grant_handle_t kernel_handle;
grant_handle_t user_handle;
- uint64_t dev_bus_addr;
} valid;
} u;
} gntdev_grant_info_t;
@@ -288,35 +287,27 @@ static void compress_free_list(gntdev_file_private_data_t *private_data)
static int find_contiguous_free_range(gntdev_file_private_data_t *private_data,
uint32_t num_slots)
{
- uint32_t i, start_index = private_data->next_fit_index;
- uint32_t range_start = 0, range_length;
-
- /* First search from the start_index to the end of the array. */
- range_length = 0;
- for (i = start_index; i < private_data->grants_size; ++i) {
- if (private_data->grants[i].state == GNTDEV_SLOT_INVALID) {
- if (range_length == 0) {
- range_start = i;
- }
- ++range_length;
- if (range_length == num_slots) {
- return range_start;
- }
- }
- }
-
- /* Now search from the start of the array to the start_index. */
- range_length = 0;
- for (i = 0; i < start_index; ++i) {
- if (private_data->grants[i].state == GNTDEV_SLOT_INVALID) {
- if (range_length == 0) {
- range_start = i;
- }
- ++range_length;
- if (range_length == num_slots) {
- return range_start;
- }
+ /* First search from next_fit_index to the end of the array. */
+ uint32_t start_index = private_data->next_fit_index;
+ uint32_t end_index = private_data->grants_size;
+
+ for (;;) {
+ uint32_t i, range_start = 0, range_length = 0;
+
+ for (i = start_index; i < end_index; ++i) {
+ if (private_data->grants[i].state == GNTDEV_SLOT_INVALID) {
+ if (range_length == 0)
+ range_start = i;
+ if (++range_length == num_slots)
+ return range_start;
+ } else
+ range_length = 0;
}
+ /* Now search from the start of the array to next_fit_index. */
+ if (!start_index)
+ break;
+ end_index = start_index;
+ start_index = 0;
}
return -ENOMEM;
@@ -457,14 +448,15 @@ static int gntdev_mmap (struct file *flip, struct vm_area_struct *vma)
{
struct gnttab_map_grant_ref op;
unsigned long slot_index = vma->vm_pgoff;
- unsigned long kernel_vaddr, user_vaddr;
- uint32_t size = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+ unsigned long kernel_vaddr, user_vaddr, mfn;
+ unsigned long size = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
uint64_t ptep;
int ret, exit_ret;
- int flags;
- int i;
+ unsigned int i, flags;
struct page *page;
gntdev_file_private_data_t *private_data = flip->private_data;
+ gntdev_grant_info_t *grants;
+ struct vm_foreign_map *foreign_map;
if (unlikely(!private_data)) {
pr_err("file's private data is NULL\n");
@@ -473,17 +465,19 @@ static int gntdev_mmap (struct file *flip, struct vm_area_struct *vma)
/* Test to make sure that the grants array has been initialised. */
down_read(&private_data->grants_sem);
- if (unlikely(!private_data->grants)) {
- up_read(&private_data->grants_sem);
+ grants = private_data->grants;
+ up_read(&private_data->grants_sem);
+
+ if (unlikely(!grants)) {
pr_err("attempted to mmap before ioctl\n");
return -EINVAL;
}
- up_read(&private_data->grants_sem);
+ grants += slot_index;
- if (unlikely((size <= 0) ||
- (size + slot_index) > private_data->grants_size)) {
+ if (unlikely(size + slot_index <= slot_index ||
+ size + slot_index > private_data->grants_size)) {
pr_err("Invalid number of pages or offset"
- "(num_pages = %d, first_slot = %ld)\n",
+ "(num_pages = %lu, first_slot = %lu)\n",
size, slot_index);
return -ENXIO;
}
@@ -493,15 +487,21 @@ static int gntdev_mmap (struct file *flip, struct vm_area_struct *vma)
return -EINVAL;
}
+ foreign_map = kmalloc(sizeof(*foreign_map), GFP_KERNEL);
+ if (!foreign_map) {
+ pr_err("couldn't allocate mapping structure for VM area\n");
+ return -ENOMEM;
+ }
+ foreign_map->map = &private_data->foreign_pages[slot_index];
+
/* Slots must be in the NOT_YET_MAPPED state. */
down_write(&private_data->grants_sem);
for (i = 0; i < size; ++i) {
- if (private_data->grants[slot_index + i].state !=
- GNTDEV_SLOT_NOT_YET_MAPPED) {
- pr_err("Slot (index = %ld) is in the wrong "
- "state (%d)\n", slot_index + i,
- private_data->grants[slot_index + i].state);
+ if (grants[i].state != GNTDEV_SLOT_NOT_YET_MAPPED) {
+ pr_err("Slot (index = %lu) is in the wrong state (%d)\n",
+ slot_index + i, grants[i].state);
up_write(&private_data->grants_sem);
+ kfree(foreign_map);
return -EINVAL;
}
}
@@ -510,13 +510,8 @@ static int gntdev_mmap (struct file *flip, struct vm_area_struct *vma)
vma->vm_ops = &gntdev_vmops;
/* The VM area contains pages from another VM. */
+ vma->vm_private_data = foreign_map;
vma->vm_flags |= VM_FOREIGN;
- vma->vm_private_data = kzalloc(size * sizeof(struct page *),
- GFP_KERNEL);
- if (vma->vm_private_data == NULL) {
- pr_err("couldn't allocate mapping structure for VM area\n");
- return -ENOMEM;
- }
/* This flag prevents Bad PTE errors when the memory is unmapped. */
vma->vm_flags |= VM_RESERVED;
@@ -544,14 +539,11 @@ static int gntdev_mmap (struct file *flip, struct vm_area_struct *vma)
flags |= GNTMAP_readonly;
kernel_vaddr = get_kernel_vaddr(private_data, slot_index + i);
- user_vaddr = get_user_vaddr(vma, i);
page = private_data->foreign_pages[slot_index + i];
gnttab_set_map_op(&op, kernel_vaddr, flags,
- private_data->grants[slot_index+i]
- .u.valid.ref,
- private_data->grants[slot_index+i]
- .u.valid.domid);
+ grants[i].u.valid.ref,
+ grants[i].u.valid.domid);
/* Carry out the mapping of the grant reference. */
ret = HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref,
@@ -562,114 +554,90 @@ static int gntdev_mmap (struct file *flip, struct vm_area_struct *vma)
pr_err("Error mapping the grant reference "
"into the kernel (%d). domid = %d; ref = %d\n",
op.status,
- private_data->grants[slot_index+i]
- .u.valid.domid,
- private_data->grants[slot_index+i]
- .u.valid.ref);
+ grants[i].u.valid.domid,
+ grants[i].u.valid.ref);
else
/* Propagate eagain instead of trying to fix it up */
exit_ret = -EAGAIN;
goto undo_map_out;
}
- /* Store a reference to the page that will be mapped into user
- * space.
- */
- ((struct page **) vma->vm_private_data)[i] = page;
-
/* Mark mapped page as reserved. */
SetPageReserved(page);
/* Record the grant handle, for use in the unmap operation. */
- private_data->grants[slot_index+i].u.valid.kernel_handle =
- op.handle;
- private_data->grants[slot_index+i].u.valid.dev_bus_addr =
- op.dev_bus_addr;
+ grants[i].u.valid.kernel_handle = op.handle;
- private_data->grants[slot_index+i].state = GNTDEV_SLOT_MAPPED;
- private_data->grants[slot_index+i].u.valid.user_handle =
- GNTDEV_INVALID_HANDLE;
+ grants[i].state = GNTDEV_SLOT_MAPPED;
+ grants[i].u.valid.user_handle = GNTDEV_INVALID_HANDLE;
/* Now perform the mapping to user space. */
- if (!xen_feature(XENFEAT_auto_translated_physmap)) {
-
- /* NOT USING SHADOW PAGE TABLES. */
- /* In this case, we map the grant(s) straight into user
- * space.
- */
-
- /* Get the machine address of the PTE for the user
- * page.
- */
- if ((ret = create_lookup_pte_addr(vma->vm_mm,
- vma->vm_start
- + (i << PAGE_SHIFT),
- &ptep)))
- {
- pr_err("Error obtaining PTE pointer (%d)\n",
- ret);
- goto undo_map_out;
- }
-
- /* Configure the map operation. */
-
- /* The reference is to be used by host CPUs. */
- flags = GNTMAP_host_map;
-
- /* Specifies a user space mapping. */
- flags |= GNTMAP_application_map;
-
- /* The map request contains the machine address of the
- * PTE to update.
- */
- flags |= GNTMAP_contains_pte;
-
- if (!(vma->vm_flags & VM_WRITE))
- flags |= GNTMAP_readonly;
-
- gnttab_set_map_op(&op, ptep, flags,
- private_data->grants[slot_index+i]
- .u.valid.ref,
- private_data->grants[slot_index+i]
- .u.valid.domid);
-
- /* Carry out the mapping of the grant reference. */
- ret = HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref,
- &op, 1);
- BUG_ON(ret);
- if (op.status != GNTST_okay) {
- pr_err("Error mapping the grant "
- "reference into user space (%d). domid "
- "= %d; ref = %d\n", op.status,
- private_data->grants[slot_index+i].u
- .valid.domid,
- private_data->grants[slot_index+i].u
- .valid.ref);
- /* This should never happen after we've mapped into
- * the kernel space. */
- BUG_ON(op.status == GNTST_eagain);
- goto undo_map_out;
- }
-
- /* Record the grant handle, for use in the unmap
- * operation.
- */
- private_data->grants[slot_index+i].u.
- valid.user_handle = op.handle;
-
- /* Update p2m structure with the new mapping. */
- set_phys_to_machine(__pa(kernel_vaddr) >> PAGE_SHIFT,
- FOREIGN_FRAME(private_data->
- grants[slot_index+i]
- .u.valid.dev_bus_addr
- >> PAGE_SHIFT));
- } else {
+ user_vaddr = get_user_vaddr(vma, i);
+
+ if (xen_feature(XENFEAT_auto_translated_physmap)) {
/* USING SHADOW PAGE TABLES. */
/* In this case, we simply insert the page into the VM
* area. */
ret = vm_insert_page(vma, user_vaddr, page);
+ if (!ret)
+ continue;
+ exit_ret = ret;
+ goto undo_map_out;
+ }
+
+ /* NOT USING SHADOW PAGE TABLES. */
+ /* In this case, we map the grant(s) straight into user
+ * space.
+ */
+ mfn = op.dev_bus_addr >> PAGE_SHIFT;
+
+ /* Get the machine address of the PTE for the user page. */
+ if ((ret = create_lookup_pte_addr(vma->vm_mm,
+ user_vaddr,
+ &ptep)))
+ {
+ pr_err("Error obtaining PTE pointer (%d)\n", ret);
+ goto undo_map_out;
}
+ /* Configure the map operation. */
+
+ /* Specifies a user space mapping. */
+ flags |= GNTMAP_application_map;
+
+ /* The map request contains the machine address of the
+ * PTE to update.
+ */
+ flags |= GNTMAP_contains_pte;
+
+ gnttab_set_map_op(&op, ptep, flags,
+ grants[i].u.valid.ref,
+ grants[i].u.valid.domid);
+
+ /* Carry out the mapping of the grant reference. */
+ ret = HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref,
+ &op, 1);
+ BUG_ON(ret);
+ if (op.status != GNTST_okay) {
+ pr_err("Error mapping the grant reference "
+ "into user space (%d). domid = %d; ref = %d\n",
+ op.status,
+ grants[i].u.valid.domid,
+ grants[i].u.valid.ref);
+ /* This should never happen after we've mapped into
+ * the kernel space. */
+ BUG_ON(op.status == GNTST_eagain);
+ goto undo_map_out;
+ }
+
+ /* Record the grant handle, for use in the unmap
+ * operation.
+ */
+ grants[i].u.valid.user_handle = op.handle;
+
+ /* Update p2m structure with the new mapping. */
+ set_phys_to_machine(__pa(kernel_vaddr) >> PAGE_SHIFT,
+ FOREIGN_FRAME(mfn));
}
exit_ret = 0;
@@ -681,7 +649,8 @@ undo_map_out:
* by do_mmap_pgoff(), which will eventually call gntdev_clear_pte().
* All we need to do here is free the vma_private_data.
*/
- kfree(vma->vm_private_data);
+ vma->vm_flags &= ~VM_FOREIGN;
+ kfree(foreign_map);
/* THIS IS VERY UNPLEASANT: do_mmap_pgoff() will set the vma->vm_file
* to NULL on failure. However, we need this in gntdev_clear_pte() to
@@ -698,9 +667,11 @@ undo_map_out:
static pte_t gntdev_clear_pte(struct vm_area_struct *vma, unsigned long addr,
pte_t *ptep, int is_fullmm)
{
- int slot_index, ret;
+ int ret;
+ unsigned int nr;
+ unsigned long slot_index;
pte_t copy;
- struct gnttab_unmap_grant_ref op;
+ struct gnttab_unmap_grant_ref op[2];
gntdev_file_private_data_t *private_data;
/* THIS IS VERY UNPLEASANT: do_mmap_pgoff() will set the vma->vm_file
@@ -725,60 +696,53 @@ static pte_t gntdev_clear_pte(struct vm_area_struct *vma, unsigned long addr,
/* Only unmap grants if the slot has been mapped. This could be being
* called from a failing mmap().
*/
- if (private_data->grants[slot_index].state == GNTDEV_SLOT_MAPPED) {
+ if (private_data->grants[slot_index].state != GNTDEV_SLOT_MAPPED)
+ return xen_ptep_get_and_clear_full(vma, addr, ptep, is_fullmm);
- /* First, we clear the user space mapping, if it has been made.
- */
- if (private_data->grants[slot_index].u.valid.user_handle !=
- GNTDEV_INVALID_HANDLE &&
- !xen_feature(XENFEAT_auto_translated_physmap)) {
- /* NOT USING SHADOW PAGE TABLES. */
-
- /* Copy the existing value of the PTE for returning. */
- copy = *ptep;
-
- gnttab_set_unmap_op(&op, ptep_to_machine(ptep),
- GNTMAP_contains_pte,
- private_data->grants[slot_index]
- .u.valid.user_handle);
- ret = HYPERVISOR_grant_table_op(
- GNTTABOP_unmap_grant_ref, &op, 1);
- BUG_ON(ret);
- if (op.status != GNTST_okay)
- pr_warning("User unmap grant status = %d\n",
- op.status);
- } else {
- /* USING SHADOW PAGE TABLES. */
- copy = xen_ptep_get_and_clear_full(vma, addr, ptep, is_fullmm);
- }
+ /* Clear the user space mapping, if it has been made. */
+ if (private_data->grants[slot_index].u.valid.user_handle !=
+ GNTDEV_INVALID_HANDLE) {
+ /* NOT USING SHADOW PAGE TABLES (and user handle valid). */
+
+ /* Copy the existing value of the PTE for returning. */
+ copy = *ptep;
- /* Finally, we unmap the grant from kernel space. */
- gnttab_set_unmap_op(&op,
- get_kernel_vaddr(private_data, slot_index),
- GNTMAP_host_map,
+ gnttab_set_unmap_op(&op[0], ptep_to_machine(ptep),
+ GNTMAP_contains_pte,
private_data->grants[slot_index].u.valid
- .kernel_handle);
- ret = HYPERVISOR_grant_table_op(GNTTABOP_unmap_grant_ref,
- &op, 1);
- BUG_ON(ret);
- if (op.status != GNTST_okay)
- pr_warning("Kernel unmap grant status = %d\n",
- op.status);
+ .user_handle);
+ nr = 1;
+ } else {
+ /* USING SHADOW PAGE TABLES (or user handle invalid). */
+ copy = xen_ptep_get_and_clear_full(vma, addr, ptep, is_fullmm);
+ nr = 0;
+ }
+ /* We always unmap the grant from kernel space. */
+ gnttab_set_unmap_op(&op[nr],
+ get_kernel_vaddr(private_data, slot_index),
+ GNTMAP_host_map,
+ private_data->grants[slot_index].u.valid
+ .kernel_handle);
- /* Return slot to the not-yet-mapped state, so that it may be
- * mapped again, or removed by a subsequent ioctl.
- */
- private_data->grants[slot_index].state =
- GNTDEV_SLOT_NOT_YET_MAPPED;
+ ret = HYPERVISOR_grant_table_op(GNTTABOP_unmap_grant_ref, op, nr + 1);
+ BUG_ON(ret);
+ if (nr && op[0].status != GNTST_okay)
+ pr_warning("User unmap grant status = %d\n", op[0].status);
+ if (op[nr].status != GNTST_okay)
+ pr_warning("Kernel unmap grant status = %d\n", op[nr].status);
+
+ /* Return slot to the not-yet-mapped state, so that it may be
+ * mapped again, or removed by a subsequent ioctl.
+ */
+ private_data->grants[slot_index].state = GNTDEV_SLOT_NOT_YET_MAPPED;
+
+ if (!xen_feature(XENFEAT_auto_translated_physmap)) {
/* Invalidate the physical to machine mapping for this page. */
set_phys_to_machine(
page_to_pfn(private_data->foreign_pages[slot_index]),
INVALID_P2M_ENTRY);
-
- } else {
- copy = xen_ptep_get_and_clear_full(vma, addr, ptep, is_fullmm);
}
return copy;
@@ -787,9 +751,8 @@ static pte_t gntdev_clear_pte(struct vm_area_struct *vma, unsigned long addr,
/* "Destructor" for a VM area.
*/
static void gntdev_vma_close(struct vm_area_struct *vma) {
- if (vma->vm_private_data) {
+ if (vma->vm_flags & VM_FOREIGN)
kfree(vma->vm_private_data);
- }
}
/* Called when an ioctl is made on the device.
@@ -903,7 +866,8 @@ private_data_initialised:
return -EFAULT;
start_index = op.index >> PAGE_SHIFT;
- if (start_index + op.count > private_data->grants_size)
+ if (start_index + op.count < start_index ||
+ start_index + op.count > private_data->grants_size)
return -EINVAL;
down_write(&private_data->grants_sem);
@@ -912,26 +876,28 @@ private_data_initialised:
* state.
*/
for (i = 0; i < op.count; ++i) {
- if (unlikely
- (private_data->grants[start_index + i].state
- != GNTDEV_SLOT_NOT_YET_MAPPED)) {
- if (private_data->grants[start_index + i].state
- == GNTDEV_SLOT_INVALID) {
- pr_err("Tried to remove an invalid "
- "grant at offset 0x%x.",
- (start_index + i)
- << PAGE_SHIFT);
- rc = -EINVAL;
- } else {
- pr_err("Tried to remove a grant which "
- "is currently mmap()-ed at "
- "offset 0x%x.",
- (start_index + i)
- << PAGE_SHIFT);
- rc = -EBUSY;
- }
- goto unmap_out;
+ const char *what;
+
+ switch (private_data->grants[start_index + i].state) {
+ case GNTDEV_SLOT_NOT_YET_MAPPED:
+ continue;
+ case GNTDEV_SLOT_INVALID:
+ what = "invalid";
+ rc = -EINVAL;
+ break;
+ case GNTDEV_SLOT_MAPPED:
+ what = "currently mmap()-ed";
+ rc = -EBUSY;
+ break;
+ default:
+ what = "in an invalid state";
+ rc = -ENXIO;
+ break;
}
+ pr_err("%s[%d] tried to remove a grant which is %s at %#x+%#x\n",
+ current->comm, current->pid,
+ what, start_index, i);
+ goto unmap_out;
}
down_write(&private_data->free_list_sem);
diff --git a/drivers/xen/pci.c b/drivers/xen/pci.c
index 19f694bff354..346209cb6e2b 100644
--- a/drivers/xen/pci.c
+++ b/drivers/xen/pci.c
@@ -68,7 +68,7 @@ static int xen_add_device(struct device *dev)
#ifdef CONFIG_ACPI
handle = DEVICE_ACPI_HANDLE(&pci_dev->dev);
- if (!handle)
+ if (!handle && pci_dev->bus->bridge)
handle = DEVICE_ACPI_HANDLE(pci_dev->bus->bridge);
#ifdef CONFIG_PCI_IOV
if (!handle && pci_dev->is_virtfn)
diff --git a/include/linux/frontswap.h b/include/linux/frontswap.h
index 3e46c31f250a..68ff7af5c5fb 100644
--- a/include/linux/frontswap.h
+++ b/include/linux/frontswap.h
@@ -13,11 +13,12 @@ struct frontswap_ops {
void (*invalidate_area)(unsigned);
};
-extern int frontswap_enabled;
+extern bool frontswap_enabled;
extern struct frontswap_ops
frontswap_register_ops(struct frontswap_ops *ops);
extern void frontswap_shrink(unsigned long);
extern unsigned long frontswap_curr_pages(void);
+extern void frontswap_writethrough(bool);
extern void __frontswap_init(unsigned type);
extern int __frontswap_put_page(struct page *page);
@@ -27,9 +28,9 @@ extern void __frontswap_invalidate_area(unsigned);
#ifdef CONFIG_FRONTSWAP
-static inline int frontswap_test(struct swap_info_struct *sis, pgoff_t offset)
+static inline bool frontswap_test(struct swap_info_struct *sis, pgoff_t offset)
{
- int ret = 0;
+ bool ret = false;
if (frontswap_enabled && sis->frontswap_map)
ret = test_bit(offset, sis->frontswap_map);
@@ -63,9 +64,9 @@ static inline unsigned long *frontswap_map_get(struct swap_info_struct *p)
#define frontswap_enabled (0)
-static inline int frontswap_test(struct swap_info_struct *sis, pgoff_t offset)
+static inline bool frontswap_test(struct swap_info_struct *sis, pgoff_t offset)
{
- return 0;
+ return false;
}
static inline void frontswap_set(struct swap_info_struct *sis, pgoff_t offset)
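To make the ops interface concrete, here is a hedged sketch of a minimal
"null" backend registered against this header: it rejects every put, which
is legal because any put may fail. Only invalidate_area is visible in the
hunk above; the remaining field names are assumed from the frontswap_ops
layout of this era, not shown in this diff:

/* Sketch: a null backend; every page still goes to the real device. */
#include <linux/module.h>
#include <linux/frontswap.h>

static void null_init(unsigned type) { }
static int null_put_page(unsigned type, pgoff_t offset, struct page *p)
{
	return -1;	/* reject: swap subsystem writes to disk as usual */
}
static int null_get_page(unsigned type, pgoff_t offset, struct page *p)
{
	return -1;	/* unreachable: no put ever succeeded */
}
static void null_invalidate_page(unsigned type, pgoff_t offset) { }
static void null_invalidate_area(unsigned type) { }

static struct frontswap_ops null_ops = {
	.init			= null_init,		/* assumed field */
	.put_page		= null_put_page,	/* assumed field */
	.get_page		= null_get_page,	/* assumed field */
	.invalidate_page	= null_invalidate_page,	/* assumed field */
	.invalidate_area	= null_invalidate_area,
};

static int __init null_backend_init(void)
{
	struct frontswap_ops old = frontswap_register_ops(&null_ops);

	if (old.put_page)
		pr_info("frontswap: replaced an existing backend\n");
	return 0;
}
module_init(null_backend_init);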
diff --git a/mm/frontswap.c b/mm/frontswap.c
index d98c13e60e63..8c0a5f8683f0 100644
--- a/mm/frontswap.c
+++ b/mm/frontswap.c
@@ -5,7 +5,7 @@
* "backend" driver implementation of frontswap. See
* Documentation/vm/frontswap.txt for more information.
*
- * Copyright (C) 2009-2010 Oracle Corp. All rights reserved.
+ * Copyright (C) 2009-2012 Oracle Corp. All rights reserved.
* Author: Dan Magenheimer
*
* This work is licensed under the terms of the GNU GPL, version 2.
@@ -35,12 +35,23 @@ static struct frontswap_ops frontswap_ops __read_mostly;
* has not been registered, so is preferred to the slower alternative: a
* function call that checks a non-global.
*/
-int frontswap_enabled __read_mostly;
+bool frontswap_enabled __read_mostly;
EXPORT_SYMBOL(frontswap_enabled);
/*
+ * If enabled, frontswap_put will return failure even on success. As
+ * a result, the swap subsystem will always write the page to swap, in
+ * effect converting frontswap into a writethrough cache. In this mode,
+ * there is no direct reduction in swap writes, but a frontswap backend
+ * can unilaterally "reclaim" any pages in use with no data loss, thus
+ * providing increased control over maximum memory usage due to frontswap.
+ */
+static bool frontswap_writethrough_enabled __read_mostly;
+
+#ifdef CONFIG_DEBUG_FS
+/*
* Counters available via /sys/kernel/debug/frontswap (if debugfs is
- * properly configured. These are for information only so are not protected
+ * properly configured). These are for information only so are not protected
* against increment races.
*/
static u64 frontswap_gets;
@@ -48,21 +59,50 @@ static u64 frontswap_succ_puts;
static u64 frontswap_failed_puts;
static u64 frontswap_invalidates;
+static inline void inc_frontswap_gets(void) {
+ frontswap_gets++;
+}
+static inline void inc_frontswap_succ_puts(void) {
+ frontswap_succ_puts++;
+}
+static inline void inc_frontswap_failed_puts(void) {
+ frontswap_failed_puts++;
+}
+static inline void inc_frontswap_invalidates(void) {
+ frontswap_invalidates++;
+}
+#else
+static inline void inc_frontswap_gets(void) { }
+static inline void inc_frontswap_succ_puts(void) { }
+static inline void inc_frontswap_failed_puts(void) { }
+static inline void inc_frontswap_invalidates(void) { }
+#endif
/*
* Register operations for frontswap, returning previous thus allowing
- * detection of multiple backends and possible nesting
+ * detection of multiple backends and possible nesting.
*/
struct frontswap_ops frontswap_register_ops(struct frontswap_ops *ops)
{
struct frontswap_ops old = frontswap_ops;
frontswap_ops = *ops;
- frontswap_enabled = 1;
+ frontswap_enabled = true;
return old;
}
EXPORT_SYMBOL(frontswap_register_ops);
-/* Called when a swap device is swapon'd */
+/*
+ * Enable/disable frontswap writethrough (see above).
+ */
+void frontswap_writethrough(bool enable)
+{
+ frontswap_writethrough_enabled = enable;
+}
+EXPORT_SYMBOL(frontswap_writethrough);
+
+/*
+ * Called when a swap device is swapon'd.
+ */
void __frontswap_init(unsigned type)
{
struct swap_info_struct *sis = swap_info[type];
@@ -80,7 +120,7 @@ EXPORT_SYMBOL(__frontswap_init);
* swaptype and offset. Page must be locked and in the swap cache.
* If frontswap already contains a page with matching swaptype and
* offset, the frontswap implementation may either overwrite the data and
- * return success or invalidate the page from frontswap and return failure
+ * return success or invalidate the page from frontswap and return failure.
*/
int __frontswap_put_page(struct page *page)
{
@@ -97,7 +137,7 @@ int __frontswap_put_page(struct page *page)
ret = (*frontswap_ops.put_page)(type, offset, page);
if (ret == 0) {
frontswap_set(sis, offset);
- frontswap_succ_puts++;
+ inc_frontswap_succ_puts();
if (!dup)
atomic_inc(&sis->frontswap_pages);
} else if (dup) {
@@ -107,9 +147,12 @@ int __frontswap_put_page(struct page *page)
*/
frontswap_clear(sis, offset);
atomic_dec(&sis->frontswap_pages);
- frontswap_failed_puts++;
+ inc_frontswap_failed_puts();
} else
- frontswap_failed_puts++;
+ inc_frontswap_failed_puts();
+ if (frontswap_writethrough_enabled)
+ /* report failure so swap also writes to swap device */
+ ret = -1;
return ret;
}
EXPORT_SYMBOL(__frontswap_put_page);
@@ -117,7 +160,7 @@ EXPORT_SYMBOL(__frontswap_put_page);
/*
* "Get" data from frontswap associated with swaptype and offset that were
* specified when the data was put to frontswap and use it to fill the
- * specified page with data. Page must be locked and in the swap cache
+ * specified page with data. Page must be locked and in the swap cache.
*/
int __frontswap_get_page(struct page *page)
{
@@ -132,7 +175,7 @@ int __frontswap_get_page(struct page *page)
if (frontswap_test(sis, offset))
ret = (*frontswap_ops.get_page)(type, offset, page);
if (ret == 0)
- frontswap_gets++;
+ inc_frontswap_gets();
return ret;
}
EXPORT_SYMBOL(__frontswap_get_page);
@@ -150,7 +193,7 @@ void __frontswap_invalidate_page(unsigned type, pgoff_t offset)
(*frontswap_ops.invalidate_page)(type, offset);
atomic_dec(&sis->frontswap_pages);
frontswap_clear(sis, offset);
- frontswap_invalidates++;
+ inc_frontswap_invalidates();
}
}
EXPORT_SYMBOL(__frontswap_invalidate_page);
@@ -254,19 +297,18 @@ EXPORT_SYMBOL(frontswap_curr_pages);
static int __init init_frontswap(void)
{
- int err = 0;
-
#ifdef CONFIG_DEBUG_FS
struct dentry *root = debugfs_create_dir("frontswap", NULL);
if (root == NULL)
return -ENXIO;
debugfs_create_u64("gets", S_IRUGO, root, &frontswap_gets);
debugfs_create_u64("succ_puts", S_IRUGO, root, &frontswap_succ_puts);
- debugfs_create_u64("puts", S_IRUGO, root, &frontswap_failed_puts);
+ debugfs_create_u64("failed_puts", S_IRUGO, root,
+ &frontswap_failed_puts);
debugfs_create_u64("invalidates", S_IRUGO,
root, &frontswap_invalidates);
#endif
- return err;
+ return 0;
}
module_init(init_frontswap);