author     Johannes Thumshirn <jthumshirn@suse.de>    2018-07-19 12:52:00 +0200
committer  Johannes Thumshirn <jthumshirn@suse.de>    2018-07-19 12:52:00 +0200
commit     16640d25489b95cff369c3ba5f3d5f86a3f90f81 (patch)
tree       41066c58140cacc76184fff2f47d21eb565af984
parent     6d39c55e7cddf8003f194f57d8ffab024843b859 (diff)
parent     b7d6a07544c9508e1bd64898de9d4275abe32d66 (diff)

Merge remote-tracking branch 'origin/users/mhocko/SLE12-SP4/for-next' into SLE12-SP4

Merge pagecache limit updates from Michal Hocko

suse-commit: 3c93f8dd9bc91a30907fb47183750aad019d1b5a
-rw-r--r--  Documentation/vm/pagecache-limit          60
-rw-r--r--  include/linux/mmzone.h                     8
-rw-r--r--  include/linux/pagemap.h                    1
-rw-r--r--  include/linux/swap.h                       5
-rw-r--r--  include/linux/vmstat.h                    12
-rw-r--r--  include/trace/events/pagecache-limit.h    99
-rw-r--r--  include/trace/events/vmscan.h              2
-rw-r--r--  kernel/sysctl.c                           32
-rw-r--r--  mm/filemap.c                               3
-rw-r--r--  mm/page_alloc.c                           50
-rw-r--r--  mm/shmem.c                                16
-rw-r--r--  mm/vmscan.c                              344
-rw-r--r--  mm/vmstat.c                                7
13 files changed, 638 insertions(+), 1 deletion(-)
diff --git a/Documentation/vm/pagecache-limit b/Documentation/vm/pagecache-limit
new file mode 100644
index 000000000000..a3c1dbbfb72e
--- /dev/null
+++ b/Documentation/vm/pagecache-limit
@@ -0,0 +1,60 @@
+Functionality:
+-------------
+The patch introduces two new tunables in the proc filesystem:
+
+/proc/sys/vm/pagecache_limit_mb
+
+This tunable sets a limit to the unmapped pages in the pagecache in megabytes.
+If non-zero, it should not be set below 4 (4 MB), or the system might behave erratically. In real life, much larger limits (a few percent of system RAM, i.e. hundreds of MB) are useful.
+
+Examples:
+echo 512 >/proc/sys/vm/pagecache_limit_mb
+
+This sets a baseline limit for the page cache (not the buffer cache!) of 0.5 GiB.
+As we only consider pagecache pages that are unmapped, currently mapped pages (files that are mmap'ed, e.g. binaries and libraries, as well as SysV shared memory) are not limited by this.
+NOTE: The real limit depends on the amount of free memory: the page cache may grow beyond the set baseline by 8x the amount of free memory. As soon as the free memory is needed, we free up page cache. (A worked example of this arithmetic follows after this file.)
+
+
+/proc/sys/vm/pagecache_limit_ignore_dirty
+
+The default for this setting is 1; this means that we don't consider dirty memory to be part of the limited pagecache, as we cannot easily free up dirty memory (we'd need to do writes for this). By setting this to 0, we actually consider dirty (unmapped) memory to be freeable and do a third pass in shrink_page_cache() where we schedule the pages for writeout. Values larger than 1 are also possible and result in a fraction of the dirty pages being considered non-freeable.
+
+
+
+
+How it works:
+------------
+The heart of this patch is a new function called shrink_page_cache(). It is called from balance_pgdat (which is the worker for kswapd) if the pagecache is above the limit.
+The function is also called in __alloc_pages_slowpath.
+
+shrink_page_cache() calculates the number of pages the cache is over its limit. It reduces this number by a factor (so you have to call it several times to get down to the target), then shrinks the pagecache (using the kernel's LRU lists).
+
+shrink_page_cache does several passes:
+- The first pass reclaims only from inactive pagecache memory.
+  This is fast -- but it might not find enough free pages; if that happens,
+  the second pass will happen.
+- In the second pass, pages from the active list will also be considered.
+- The third pass will only happen if pagecache_limit_ignore_dirty is not 1.
+  In that case, the third pass is a repetition of the second pass, but this
+  time we allow pages to be written out.
+
+In all passes, only unmapped pages will be considered.
+
+
+How it changes memory management:
+--------------------------------
+If the pagecache_limit_mb is set to zero (default), nothing changes.
+
+If set to a positive value, there will be three different operating modes:
+(1) If we still have plenty of free pages, the pagecache limit will NOT be enforced. Memory management decisions are taken as normal.
+(2) However, as soon as someone consumes those free pages, we'll start freeing pagecache -- as those pages are returned to the free page pool, freeing a few pages from pagecache will return us to state (1) -- if, however, someone consumes these free pages quickly, we'll continue freeing up pages from the pagecache until we reach pagecache_limit_mb.
+(3) Once we are at or below the low watermark, pagecache_limit_mb, the pages in the page cache are governed by normal paging memory management decisions; if the page cache starts growing above the limit (corrected by the free pages), we'll free some up again.
+
+This feature is useful for machines that have large workloads, carefully sized to eat most of the memory. Depending on the application's page access pattern, the kernel may too easily swap the application memory out in favor of pagecache. This can happen even for low values of swappiness. With this feature, the admin can tell the kernel that only a certain amount of pagecache is really considered useful and that it should otherwise favor the application's memory.
+
+
+Foreground vs. background shrinking:
+-----------------------------------
+
+Usually, the Linux kernel reclaims its memory using the kernel thread kswapd. It reclaims memory in the background. If it can't reclaim memory fast enough, it retries with a higher priority, and if this still doesn't succeed, the direct reclaim path is used.
+
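To make the documented arithmetic concrete, here is a userspace sketch (an editor's illustration, NOT part of the patch) that mirrors the effective-limit calculation of pagecache_over_limit() in simplified form. It assumes 4 KiB pages and uses the patch's FREE_TO_PAGECACHE_RATIO of 8; the machine sizes in main() are hypothetical.

/*
 * Editor's illustration -- not part of the patch. Mirrors the effective
 * page cache limit described above, assuming 4 KiB pages and the patch's
 * FREE_TO_PAGECACHE_RATIO of 8. The machine sizes in main() are made up.
 */
#include <stdio.h>

#define PAGE_SIZE		4096UL
#define FREE_TO_PAGECACHE_RATIO	8

static unsigned long effective_limit_pages(unsigned long limit_mb,
					   unsigned long totalram_pages,
					   unsigned long free_pages,
					   unsigned long swap_used_pages)
{
	/* Bonus only for free memory above ~6% of RAM, minus half the used swap */
	unsigned long reserve = totalram_pages / 16 + swap_used_pages / 2;
	unsigned long bonus = free_pages > reserve ? free_pages - reserve : 0;

	return limit_mb * ((1024 * 1024UL) / PAGE_SIZE) +
	       FREE_TO_PAGECACHE_RATIO * bonus;
}

int main(void)
{
	/* Hypothetical box: 64 GiB RAM, 8 GiB free, no swap used, limit 512 MB */
	unsigned long total = (64UL << 30) / PAGE_SIZE;
	unsigned long free = (8UL << 30) / PAGE_SIZE;
	unsigned long limit = effective_limit_pages(512, total, free, 0);

	printf("effective limit: %lu pages (~%lu MB)\n",
	       limit, limit * PAGE_SIZE >> 20);
	return 0;
}

With these numbers the page cache may grow to roughly 33 GB before shrink_page_cache() starts trimming it, which is exactly the "free pages allow the limit to grow" behaviour described in the NOTE above.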
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ff1209154ee9..bc6e47c3bc4d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -724,6 +724,14 @@ typedef struct pglist_data {
unsigned long flags;
+ /*
+ * This atomic counter is set when there is pagecache limit
+ * reclaim going on on this particular node. Other potential
+ * reclaimers should back off to prevent heavy lru_lock
+ * bouncing.
+ */
+ atomic_t pagecache_reclaim;
+
ZONE_PADDING(_pad2_)
/* Per-node vmstats */
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 52959bc16d97..9271fe7d33a4 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -12,6 +12,7 @@
#include <linux/uaccess.h>
#include <linux/gfp.h>
#include <linux/bitops.h>
+#include <linux/swap.h>
#include <linux/hardirq.h> /* for in_interrupt() */
#include <linux/hugetlb_inline.h>
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4658dae50279..146526cfc4b8 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -336,6 +336,11 @@ extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
unsigned long *nr_scanned);
extern unsigned long shrink_all_memory(unsigned long nr_pages);
extern int vm_swappiness;
+#define FREE_TO_PAGECACHE_RATIO 8
+extern unsigned long pagecache_over_limit(void);
+extern void shrink_page_cache(gfp_t mask, struct page *page);
+extern unsigned int vm_pagecache_limit_mb;
+extern unsigned int vm_pagecache_ignore_dirty;
extern int remove_mapping(struct address_space *mapping, struct page *page);
extern unsigned long vm_total_pages;
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 1057fe957f9e..be76b41891c1 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -380,6 +380,18 @@ static inline void __mod_zone_freepage_state(struct zone *zone, int nr_pages,
__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages);
}
+enum pagecache_limit_stat_item {
+ NR_PAGECACHE_LIMIT_THROTTLED, /* Number of tasks throttled by the
+ * page cache limit.
+ */
+ NR_PAGECACHE_LIMIT_BLOCKED, /* Number of tasks blocked waiting for
+ * the page cache limit reclaim.
+ */
+ NR_PAGECACHE_LIMIT_ITEMS,
+};
+
+void all_pagecache_limit_counters(unsigned long *);
+
extern const char * const vmstat_text[];
#endif /* _LINUX_VMSTAT_H */
diff --git a/include/trace/events/pagecache-limit.h b/include/trace/events/pagecache-limit.h
new file mode 100644
index 000000000000..4a519d49bcd9
--- /dev/null
+++ b/include/trace/events/pagecache-limit.h
@@ -0,0 +1,99 @@
+
+/*
+ * This file defines pagecache limit specific tracepoints and should only be
+ * included through include/trace/events/vmscan.h, never directly.
+ */
+
+TRACE_EVENT(mm_shrink_page_cache_start,
+
+ TP_PROTO(gfp_t mask),
+
+ TP_ARGS(mask),
+
+ TP_STRUCT__entry(
+ __field(gfp_t, mask)
+ ),
+
+ TP_fast_assign(
+ __entry->mask = mask;
+ ),
+
+ TP_printk("mask=%s",
+ show_gfp_flags(__entry->mask))
+);
+
+TRACE_EVENT(mm_shrink_page_cache_end,
+
+ TP_PROTO(unsigned long nr_reclaimed),
+
+ TP_ARGS(nr_reclaimed),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, nr_reclaimed)
+ ),
+
+ TP_fast_assign(
+ __entry->nr_reclaimed = nr_reclaimed;
+ ),
+
+ TP_printk("nr_reclaimed=%lu",
+ __entry->nr_reclaimed)
+);
+
+TRACE_EVENT(mm_pagecache_reclaim_start,
+
+ TP_PROTO(unsigned long nr_pages, int pass, int prio, gfp_t mask,
+ bool may_write),
+
+ TP_ARGS(nr_pages, pass, prio, mask, may_write),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, nr_pages )
+ __field(int, pass )
+ __field(int, prio )
+ __field(gfp_t, mask )
+ __field(bool, may_write )
+ ),
+
+ TP_fast_assign(
+ __entry->nr_pages = nr_pages;
+ __entry->pass = pass;
+ __entry->prio = prio;
+ __entry->mask = mask;
+ __entry->may_write = may_write;
+ ),
+
+ TP_printk("nr_pages=%lu pass=%d prio=%d mask=%s may_write=%d",
+ __entry->nr_pages,
+ __entry->pass,
+ __entry->prio,
+ show_gfp_flags(__entry->mask),
+ (int) __entry->may_write)
+);
+
+TRACE_EVENT(mm_pagecache_reclaim_end,
+
+ TP_PROTO(unsigned long nr_scanned, unsigned long nr_reclaimed,
+ unsigned int nr_zones),
+
+ TP_ARGS(nr_scanned, nr_reclaimed, nr_zones),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, nr_scanned )
+ __field(unsigned long, nr_reclaimed )
+ __field(unsigned int, nr_zones )
+ ),
+
+ TP_fast_assign(
+ __entry->nr_scanned = nr_scanned;
+ __entry->nr_reclaimed = nr_reclaimed;
+ __entry->nr_zones = nr_zones;
+ ),
+
+ TP_printk("nr_scanned=%lu nr_reclaimed=%lu nr_scanned_zones=%u",
+ __entry->nr_scanned,
+ __entry->nr_reclaimed,
+ __entry->nr_zones)
+);
+
+
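Once a kernel with this patch is running, the tracepoints above can be enabled through tracefs like any other vmscan event. The sketch below is illustrative only; it assumes tracefs is mounted at /sys/kernel/debug/tracing and that the events register under the "vmscan" group, since this header is pulled in via include/trace/events/vmscan.h.

/*
 * Editor's illustration -- enable the pagecache-limit tracepoints from
 * userspace. Assumes tracefs at /sys/kernel/debug/tracing and the events
 * living in the vmscan group; needs root.
 */
#include <stdio.h>

static int enable_event(const char *name)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/kernel/debug/tracing/events/vmscan/%s/enable", name);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fputs("1\n", f);
	fclose(f);
	return 0;
}

int main(void)
{
	enable_event("mm_shrink_page_cache_start");
	enable_event("mm_shrink_page_cache_end");
	enable_event("mm_pagecache_reclaim_start");
	enable_event("mm_pagecache_reclaim_end");
	return 0;
}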
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 27e8a5c77579..e666b0948894 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -37,6 +37,8 @@
(RECLAIM_WB_ASYNC) \
)
+#include "pagecache-limit.h"
+
TRACE_EVENT(mm_vmscan_kswapd_sleep,
TP_PROTO(int nid),
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 98caf74882ac..104ef9059983 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1244,6 +1244,9 @@ static struct ctl_table kern_table[] = {
{ }
};
+int pc_limit_proc_dointvec(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos);
+
static struct ctl_table vm_table[] = {
{
.procname = "overcommit_memory",
@@ -1370,6 +1373,20 @@ static struct ctl_table vm_table[] = {
.extra1 = &zero,
.extra2 = &one_hundred,
},
+ {
+ .procname = "pagecache_limit_mb",
+ .data = &vm_pagecache_limit_mb,
+ .maxlen = sizeof(vm_pagecache_limit_mb),
+ .mode = 0644,
+ .proc_handler = &pc_limit_proc_dointvec,
+ },
+ {
+ .procname = "pagecache_limit_ignore_dirty",
+ .data = &vm_pagecache_ignore_dirty,
+ .maxlen = sizeof(vm_pagecache_ignore_dirty),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
#ifdef CONFIG_HUGETLB_PAGE
{
.procname = "nr_hugepages",
@@ -2450,6 +2467,21 @@ static int do_proc_douintvec(struct ctl_table *table, int write,
buffer, lenp, ppos, conv, data);
}
+int pc_limit_proc_dointvec(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ int ret = do_proc_dointvec(table, write, buffer, lenp, ppos,
+ NULL, NULL);
+ if (write && !ret) {
+ printk(KERN_WARNING "pagecache limit set to %d. "
+ "Feature is supported only for SLES for SAP appliance\n",
+ vm_pagecache_limit_mb);
+ if (num_possible_cpus() > 16)
+ printk(KERN_WARNING "Using page cache limit on large machines is strongly discouraged. See TID 7021211\n");
+ }
+ return ret;
+}
+
/**
* proc_dointvec - read a vector of integers
* @table: the sysctl table
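Writes to the new sysctl go through pc_limit_proc_dointvec() and therefore trigger the warning printk added above. For completeness, here is a hedged userspace example of setting the limit; it is equivalent to the echo shown in the documentation, and the path follows from the vm_table entry.

/*
 * Editor's illustration -- set vm.pagecache_limit_mb from userspace,
 * equivalent to "echo 512 > /proc/sys/vm/pagecache_limit_mb". Needs root.
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/vm/pagecache_limit_mb", "w");

	if (!f) {
		perror("pagecache_limit_mb");
		return 1;
	}
	fprintf(f, "%d\n", 512);	/* limit unmapped page cache to 512 MB */
	fclose(f);
	return 0;
}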
diff --git a/mm/filemap.c b/mm/filemap.c
index 5701861fbbe0..b3f8beb45c8e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -900,6 +900,9 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
void *shadow = NULL;
int ret;
+ if (unlikely(vm_pagecache_limit_mb) && pagecache_over_limit() > 0)
+ shrink_page_cache(gfp_mask, page);
+
__SetPageLocked(page);
ret = __add_to_page_cache_locked(page, mapping, offset,
gfp_mask, &shadow);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cc597174ff08..23bbeb106f15 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7790,6 +7790,56 @@ void zone_pcp_reset(struct zone *zone)
local_irq_restore(flags);
}
+/* Returns a number that's positive if the pagecache is above
+ * the set limit. Note that we allow the pagecache to grow
+ * larger if there's plenty of free pages.
+ */
+unsigned long pagecache_over_limit(void)
+{
+ /* We only want to limit unmapped and non-shmem page cache pages;
+ * normally all shmem pages are mapped as well, but that does
+ * not seem to be guaranteed. (Maybe this was just an oprofile
+ * bug?).
+ * (FIXME: Do we need to subtract NR_FILE_DIRTY here as well?) */
+ unsigned long pgcache_pages = global_page_state(NR_FILE_PAGES)
+ - max_t(unsigned long,
+ global_page_state(NR_FILE_MAPPED),
+ global_page_state(NR_SHMEM));
+ /* We certainly can't free more than what's on the LRU lists
+ * minus the dirty ones. (FIXME: pages accounted for in NR_WRITEBACK
+ * are not on the LRU lists any more, right?) */
+ unsigned long pgcache_lru_pages = global_page_state(NR_ACTIVE_FILE)
+ + global_page_state(NR_INACTIVE_FILE);
+ unsigned long free_pages = global_page_state(NR_FREE_PAGES);
+ unsigned long swap_pages = total_swap_pages - get_nr_swap_pages();
+ unsigned long limit;
+
+ if (vm_pagecache_ignore_dirty != 0)
+ pgcache_lru_pages -= global_page_state(NR_FILE_DIRTY)
+ /vm_pagecache_ignore_dirty;
+ /* Paranoia */
+ if (unlikely(pgcache_lru_pages > LONG_MAX))
+ return 0;
+ /* We give a bonus for free pages above 6% of total (minus half swap used) */
+ free_pages -= totalram_pages/16;
+ if (likely(swap_pages <= LONG_MAX))
+ free_pages -= swap_pages/2;
+ if (free_pages > LONG_MAX)
+ free_pages = 0;
+
+ /* Limit it to 94% of LRU (not all there might be unmapped) */
+ pgcache_lru_pages -= pgcache_lru_pages/16;
+ pgcache_pages = min_t(unsigned long, pgcache_pages, pgcache_lru_pages);
+
+ /* Effective limit is corrected by effective free pages */
+ limit = vm_pagecache_limit_mb * ((1024*1024UL)/PAGE_SIZE) +
+ FREE_TO_PAGECACHE_RATIO * free_pages;
+
+ if (pgcache_pages > limit)
+ return pgcache_pages - limit;
+ return 0;
+}
+
#ifdef CONFIG_MEMORY_HOTREMOVE
/*
* All pages in the range must be in a single zone and isolated
diff --git a/mm/shmem.c b/mm/shmem.c
index 12973e9d24f3..e8e58ec40942 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1211,6 +1211,17 @@ int shmem_unuse(swp_entry_t swap, struct page *page)
/* No radix_tree_preload: swap entry keeps a place for page in tree */
error = -EAGAIN;
+ /*
+ * Try to shrink the page cache proactively even though the page
+ * might already be in the page cache, in which case the shrinking
+ * is not necessary, but this is much easier than dropping the
+ * lock in shmem_unuse_inode before add_to_page_cache_lru.
+ * GFP_NOWAIT makes sure that we do not shrink when adding
+ * to the page cache.
+ */
+ if (unlikely(vm_pagecache_limit_mb) && pagecache_over_limit() > 0)
+ shrink_page_cache(GFP_KERNEL, NULL);
+
mutex_lock(&shmem_swaplist_mutex);
list_for_each_safe(this, next, &shmem_swaplist) {
info = list_entry(this, struct shmem_inode_info, swaplist);
@@ -1229,8 +1240,11 @@ int shmem_unuse(swp_entry_t swap, struct page *page)
if (error != -ENOMEM)
error = 0;
mem_cgroup_cancel_charge(page, memcg, false);
- } else
+ } else {
mem_cgroup_commit_charge(page, memcg, true, false);
+ if (unlikely(vm_pagecache_limit_mb) && pagecache_over_limit() > 0)
+ shrink_page_cache(GFP_KERNEL, page);
+ }
out:
unlock_page(page);
put_page(page);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0a401a00c2c1..571f2eb70fee 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -149,6 +149,8 @@ struct scan_control {
* From 0 .. 100. Higher means more swappy.
*/
int vm_swappiness = 60;
+unsigned int vm_pagecache_limit_mb __read_mostly = 0;
+unsigned int vm_pagecache_ignore_dirty __read_mostly = 1;
/*
* The total number of pages which are beyond the high watermark within all
* zones.
@@ -3152,6 +3154,8 @@ static void clear_pgdat_congested(pg_data_t *pgdat)
clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
}
+static void __shrink_page_cache(gfp_t mask);
+
/*
* Prepare kswapd for sleeping. This verifies that there are no processes
* waiting in throttle_direct_reclaim() and that watermarks have been met.
@@ -3260,6 +3264,10 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
};
count_vm_event(PAGEOUTRUN);
+ /* This reclaims from all nodes, so don't count it towards sc.nr_reclaimed */
+ if (unlikely(vm_pagecache_limit_mb) && pagecache_over_limit() > 0)
+ __shrink_page_cache(GFP_KERNEL);
+
do {
unsigned long nr_reclaimed = sc.nr_reclaimed;
bool raise_priority = true;
@@ -3426,6 +3434,12 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o
prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
}
+ /* We do not need to loop_again if we have not achieved our
+ * pagecache target (i.e. pagecache_over_limit() > 0) because
+ * the limit will be checked next time a page is added to the page
+ * cache. This might cause a short stall but we should rather not
+ * keep kswapd awake.
+ */
/*
* After a short sleep, check if it was a premature sleep. If not, then
* go fully to sleep until explicitly woken up.
@@ -3626,6 +3640,336 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
}
#endif /* CONFIG_HIBERNATION */
+/*
+ * This should probably go into mm/vmstat.c but there is no intention to
+ * spread any knowledge outside of this single user so let's stay here
+ * and be quiet so that nobody notices us.
+ *
+ * A new counter has to be added to enum pagecache_limit_stat_item and
+ * its name to vmstat_text.
+ *
+ * The pagecache limit reclaim is also a slow path so we can go without
+ * per-cpu accounting for now.
+ *
+ * No kernel path should _ever_ depend on these counters. They are solely
+ * for userspace debugging via /proc/vmstat
+ */
+static atomic_t pagecache_limit_stats[NR_PAGECACHE_LIMIT_ITEMS];
+
+void all_pagecache_limit_counters(unsigned long *ret)
+{
+ int i;
+
+ for (i = 0; i < NR_PAGECACHE_LIMIT_ITEMS; i++)
+ ret[i] = atomic_read(&pagecache_limit_stats[i]);
+}
+
+static void inc_pagecache_limit_stat(enum pagecache_limit_stat_item item)
+{
+ atomic_inc(&pagecache_limit_stats[item]);
+}
+
+static void dec_pagecache_limit_stat(enum pagecache_limit_stat_item item)
+{
+ atomic_dec(&pagecache_limit_stats[item]);
+}
+
+/*
+ * Returns non-zero if the lock has been acquired, zero if somebody
+ * else is holding the lock.
+ */
+static int pagecache_reclaim_lock_node(struct pglist_data *pgdat)
+{
+ return atomic_add_unless(&pgdat->pagecache_reclaim, 1, 1);
+}
+
+static void pagecache_reclaim_unlock_node(struct pglist_data *pgdat)
+{
+ BUG_ON(atomic_dec_return(&pgdat->pagecache_reclaim));
+}
+
+/*
+ * Potential page cache reclaimers who are not able to take
+ * reclaim lock on any node are sleeping on this waitqueue.
+ * So this is basically a congestion wait queue for them.
+ */
+DECLARE_WAIT_QUEUE_HEAD(pagecache_reclaim_wq);
+
+/*
+ * Similar to shrink_node but it has a different consumer - pagecache limit
+ * so we cannot reuse the original function - and we do not want to clobber
+ * that code path so we have to live with this code duplication.
+ *
+ * In short this simply scans through the given lru for all cgroups for the
+ * given node.
+ *
+ * returns true if we managed to cumulatively reclaim (via nr_reclaimed)
+ * the given nr_to_reclaim pages, false otherwise. The caller knows that
+ * it doesn't have to touch other nodes if the target was hit already.
+ *
+ * DO NOT USE OUTSIDE of shrink_all_nodes unless you have a really really
+ * really good reason.
+ */
+static bool shrink_node_per_memcg(struct pglist_data *pgdat, enum lru_list lru,
+ unsigned long nr_to_scan, unsigned long nr_to_reclaim,
+ unsigned long *nr_reclaimed, struct scan_control *sc)
+{
+ struct mem_cgroup *root = sc->target_mem_cgroup;
+ struct mem_cgroup *memcg;
+ struct mem_cgroup_reclaim_cookie reclaim = {
+ .pgdat = pgdat,
+ .priority = sc->priority,
+ };
+
+ memcg = mem_cgroup_iter(root, NULL, &reclaim);
+ do {
+ struct lruvec *lruvec;
+
+ lruvec = mem_cgroup_lruvec(pgdat, memcg);
+ *nr_reclaimed += shrink_list(lru, nr_to_scan, lruvec, memcg, sc);
+ if (*nr_reclaimed >= nr_to_reclaim) {
+ mem_cgroup_iter_break(root, memcg);
+ return true;
+ }
+
+ memcg = mem_cgroup_iter(root, memcg, &reclaim);
+ } while (memcg);
+
+ return false;
+}
+
+/*
+ * We had to resurrect this function for __shrink_page_cache (upstream has
+ * removed it and reworked shrink_all_memory by 7b51755c).
+ *
+ * Tries to reclaim 'nr_pages' pages from LRU lists system-wide, for given
+ * pass.
+ *
+ * For pass > 3 we also try to shrink the LRU lists that contain only a few pages
+ */
+static int shrink_all_nodes(unsigned long nr_pages, int pass,
+ struct scan_control *sc)
+{
+ unsigned long nr_reclaimed = 0;
+ unsigned int nr_locked_zones = 0;
+ DEFINE_WAIT(wait);
+ int nid;
+
+ prepare_to_wait(&pagecache_reclaim_wq, &wait, TASK_INTERRUPTIBLE);
+ trace_mm_pagecache_reclaim_start(nr_pages, pass, sc->priority, sc->gfp_mask,
+ sc->may_writepage);
+
+ for_each_online_node(nid) {
+ struct pglist_data *pgdat = NODE_DATA(nid);
+ enum lru_list lru;
+
+ /*
+ * Back off if somebody is already reclaiming this node
+ * for the pagecache reclaim.
+ */
+ if (!pagecache_reclaim_lock_node(pgdat))
+ continue;
+
+ /*
+ * This reclaimer might scan a node so it will never
+ * sleep on pagecache_reclaim_wq
+ */
+ finish_wait(&pagecache_reclaim_wq, &wait);
+ nr_locked_zones++;
+
+ for_each_evictable_lru(lru) {
+ enum zone_stat_item ls = NR_LRU_BASE + lru;
+ unsigned long lru_pages = node_page_state(pgdat, ls);
+
+ /* For pass = 0, we don't shrink the active list */
+ if (pass == 0 && (lru == LRU_ACTIVE_ANON ||
+ lru == LRU_ACTIVE_FILE))
+ continue;
+
+ /* Original code relied on nr_saved_scan which is no
+ * longer present so we are just considering LRU pages.
+ * This means that the zone has to have quite a large
+ * LRU list for the default priority and the minimum nr_pages
+ * size (8*SWAP_CLUSTER_MAX). In the end we will tend
+ * to reclaim more from large zones than from small ones.
+ * This should be OK because shrink_page_cache is called
+ * when we are getting into a short-memory condition, so
+ * LRUs tend to be large.
+ */
+ if (((lru_pages >> sc->priority) + 1) >= nr_pages || pass > 3) {
+ unsigned long nr_to_scan;
+ struct reclaim_state reclaim_state;
+ unsigned long scanned = sc->nr_scanned;
+ struct reclaim_state *old_rs = current->reclaim_state;
+
+ nr_to_scan = min(nr_pages, lru_pages);
+
+ /*
+ * A bit of a hack but the code has always been
+ * updating sc->nr_reclaimed once per shrink_all_nodes
+ * rather than accumulating it for all calls to shrink
+ * lru. This costs us an additional argument to
+ * shrink_node_per_memcg but well...
+ *
+ * Let's stick with this for bug-to-bug compatibility
+ */
+ while (nr_to_scan > 0) {
+ /* shrink_list takes lru_lock with IRQ off so we
+ * should be careful about really huge nr_to_scan
+ */
+ unsigned long batch = min_t(unsigned long, nr_to_scan, SWAP_CLUSTER_MAX);
+
+ if (shrink_node_per_memcg(pgdat, lru,
+ batch, nr_pages, &nr_reclaimed, sc)) {
+ pagecache_reclaim_unlock_node(pgdat);
+ goto out_wakeup;
+ }
+ nr_to_scan -= batch;
+ }
+
+ current->reclaim_state = &reclaim_state;
+ reclaim_state.reclaimed_slab = 0;
+ shrink_slab(sc->gfp_mask, nid, NULL,
+ sc->nr_scanned - scanned, lru_pages);
+ sc->nr_reclaimed += reclaim_state.reclaimed_slab;
+ current->reclaim_state = old_rs;
+ }
+ }
+ pagecache_reclaim_unlock_node(pgdat);
+ }
+
+ /*
+ * We have to go to sleep because all the zones are already reclaimed.
+ * One of the reclaimer will wake us up or __shrink_page_cache will
+ * do it if there is nothing to be done.
+ */
+ if (!nr_locked_zones) {
+ inc_pagecache_limit_stat(NR_PAGECACHE_LIMIT_BLOCKED);
+ schedule();
+ dec_pagecache_limit_stat(NR_PAGECACHE_LIMIT_BLOCKED);
+ finish_wait(&pagecache_reclaim_wq, &wait);
+ goto out;
+ }
+
+out_wakeup:
+ wake_up_interruptible(&pagecache_reclaim_wq);
+out:
+ sc->nr_reclaimed = nr_reclaimed;
+ trace_mm_pagecache_reclaim_end(sc->nr_scanned, nr_reclaimed,
+ nr_locked_zones);
+ return nr_locked_zones;
+}
+
+/*
+ * Function to shrink the page cache
+ *
+ * This function calculates the number of pages (nr_pages) the page
+ * cache is over its limit and shrinks the page cache accordingly.
+ *
+ * In one call this function targets roughly half of the excess
+ * (but at least 8*SWAP_CLUSTER_MAX pages), so it may require a
+ * number of calls to actually get down to vm_pagecache_limit_mb.
+ *
+ * This function is similar to shrink_all_memory, except that it may never
+ * swap out mapped pages and only does two passes.
+ */
+static void __shrink_page_cache(gfp_t mask)
+{
+ unsigned long ret = 0;
+ int pass = 0;
+ struct scan_control sc = {
+ .gfp_mask = mask,
+ .may_swap = 0,
+ .may_unmap = 0,
+ .may_writepage = 0,
+ .target_mem_cgroup = NULL,
+ .reclaim_idx = gfp_zone(mask),
+ };
+ long nr_pages;
+
+ /* We might sleep during direct reclaim, so calling this from atomic
+ * context is certainly a bug.
+ */
+ BUG_ON(!(mask & __GFP_DIRECT_RECLAIM));
+
+retry:
+ /* How many pages are we over the limit?
+ * But don't enforce limit if there's plenty of free mem */
+ nr_pages = pagecache_over_limit();
+
+ /* We don't need to get there in one step; as the freed
+ * pages are counted FREE_TO_PAGECACHE_RATIO times, this
+ * is still more than minimally needed. */
+ nr_pages /= 2;
+
+ /*
+ * Return early if there's no work to do.
+ * Wake up reclaimers that couldn't scan any node due to congestion.
+ * There is apparently nothing to do so they do not have to sleep.
+ * This makes sure that no sleeping reclaimer will stay behind.
+ * Allow breaching the limit if the task is on the way out.
+ */
+ if (nr_pages <= 0 || fatal_signal_pending(current)) {
+ wake_up_interruptible(&pagecache_reclaim_wq);
+ return;
+ }
+
+ /* But do a few at least */
+ nr_pages = max_t(unsigned long, nr_pages, 8*SWAP_CLUSTER_MAX);
+ inc_pagecache_limit_stat(NR_PAGECACHE_LIMIT_THROTTLED);
+ trace_mm_shrink_page_cache_start(mask);
+
+ /*
+ * Shrink the LRU in 2 passes:
+ * 0 = Reclaim from inactive_list only (fast)
+ * 1 = Reclaim from active list but don't reclaim mapped (not that fast)
+ * 2 = Reclaim from active list but don't reclaim mapped (2nd pass)
+ */
+ for (; pass < 2; pass++) {
+ for (sc.priority = DEF_PRIORITY; sc.priority >= 0; sc.priority--) {
+ unsigned long nr_to_scan = nr_pages - ret;
+
+ sc.nr_scanned = 0;
+
+ /*
+ * No node reclaimed because of too many reclaimers. Retry whether
+ * there is still something to do
+ */
+ if (!shrink_all_nodes(nr_to_scan, pass, &sc)) {
+ dec_pagecache_limit_stat(NR_PAGECACHE_LIMIT_THROTTLED);
+ goto retry;
+ }
+
+ ret += sc.nr_reclaimed;
+ if (ret >= nr_pages)
+ goto out;
+ }
+
+ if (pass == 1) {
+ if (vm_pagecache_ignore_dirty == 1 ||
+ (mask & (__GFP_IO | __GFP_FS)) != (__GFP_IO | __GFP_FS) )
+ break;
+ else
+ sc.may_writepage = 1;
+ }
+ }
+out:
+ trace_mm_shrink_page_cache_end(ret);
+ dec_pagecache_limit_stat(NR_PAGECACHE_LIMIT_THROTTLED);
+}
+
+void shrink_page_cache(gfp_t mask, struct page *page)
+{
+ /* FIXME: As we only want to get rid of non-mapped pagecache
+ * pages and we know we have too many of them, we should not
+ * need kswapd. */
+ /*
+ wakeup_kswapd(page_zone(page), 0);
+ */
+
+ __shrink_page_cache(mask);
+}
+
/* It's optimal to keep kswapds on the same CPUs as their memory, but
not required for correctness. So if the last cpu in a node goes
away, we get changed to run anywhere: as the first one comes back,
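The per-node exclusion used throughout the vmscan.c changes is a simple try-lock: atomic_add_unless(&pgdat->pagecache_reclaim, 1, 1) succeeds only when no other task is reclaiming that node, and tasks that find every node busy sleep on pagecache_reclaim_wq. A minimal userspace analogue of the lock/unlock pair follows (editor's illustration only, using C11 atomics instead of the kernel's atomic_t and omitting the wait queue):

/*
 * Editor's illustration of pagecache_reclaim_lock_node()/unlock_node();
 * not kernel code. atomic_add_unless(v, 1, 1) succeeds only when the
 * counter is 0, so at most one reclaimer works on a node at a time.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <assert.h>

struct fake_node {
	atomic_int pagecache_reclaim;	/* 0 = free, 1 = reclaim in progress */
};

static bool reclaim_trylock(struct fake_node *node)
{
	int expected = 0;

	/* Equivalent of atomic_add_unless(&..., 1, 1): add 1 unless it is 1 */
	return atomic_compare_exchange_strong(&node->pagecache_reclaim,
					      &expected, 1);
}

static void reclaim_unlock(struct fake_node *node)
{
	/* Must drop back to 0; anything else indicates a lock imbalance */
	int old = atomic_fetch_sub(&node->pagecache_reclaim, 1);

	assert(old == 1);
}

int main(void)
{
	struct fake_node node = { .pagecache_reclaim = 0 };

	if (reclaim_trylock(&node)) {			/* first caller wins */
		assert(!reclaim_trylock(&node));	/* second one backs off */
		reclaim_unlock(&node);
	}
	return 0;
}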
diff --git a/mm/vmstat.c b/mm/vmstat.c
index aabd596fc554..62bc1db33bc7 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1208,6 +1208,10 @@ const char * const vmstat_text[] = {
"vmacache_full_flushes",
#endif
#endif /* CONFIG_VM_EVENTS_COUNTERS */
+
+ /* Pagecache limit counters */
+ "nr_pagecache_limit_throttled",
+ "nr_pagecache_limit_blocked",
};
#endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA */
@@ -1639,7 +1643,10 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
all_vm_events(v);
v[PGPGIN] /= 2; /* sectors -> kbytes */
v[PGPGOUT] /= 2;
+ v += NR_VM_EVENT_ITEMS;
#endif
+ all_pagecache_limit_counters(v);
+
return (unsigned long *)m->private + *pos;
}