author     Johannes Thumshirn <jthumshirn@suse.de>  2018-07-19 12:52:00 +0200
committer  Johannes Thumshirn <jthumshirn@suse.de>  2018-07-19 12:52:00 +0200
commit     3c93f8dd9bc91a30907fb47183750aad019d1b5a (patch)
tree       347cdca515bd7b71bfceafdc060b2e5421896fe6
parent     10bee809a266b18d36d9daf702317b185187e1f8 (diff)
parent     1541884e95fe764f1a13e6c2d53f308bc9c5c1c1 (diff)
Merge remote-tracking branch 'origin/users/mhocko/SLE12-SP4/for-next' into SLE12-SP4
Merge pagecache limit updates from Michal Hocko
-rw-r--r--  patches.suse/pagecache-limit-dirty.diff                          135
-rw-r--r--  patches.suse/pagecache-limit-fix-get_nr_swap_pages.patch          23
-rw-r--r--  patches.suse/pagecache-limit-fix-shmem-deadlock.patch             85
-rw-r--r--  patches.suse/pagecache-limit-fix-wrong-reclaimed-count.patch      44
-rw-r--r--  patches.suse/pagecache-limit-reduce-zone-lrulock-bouncing.patch  234
-rw-r--r--  patches.suse/pagecache-limit-tracepoints.patch                   164
-rw-r--r--  patches.suse/pagecache-limit-unmapped.diff                        80
-rw-r--r--  patches.suse/pagecache-limit-vmstat_counters.patch               179
-rw-r--r--  patches.suse/pagecache-limit-warn-on-usage.patch                  58
-rw-r--r--  patches.suse/pagecache-limit.patch                               456
-rw-r--r--  patches.suse/pagecachelimit_batch_huge_nr_to_scan.patch           61
-rw-r--r--  series.conf                                                       15
12 files changed, 1534 insertions, 0 deletions
diff --git a/patches.suse/pagecache-limit-dirty.diff b/patches.suse/pagecache-limit-dirty.diff
new file mode 100644
index 0000000000..05cdce31f1
--- /dev/null
+++ b/patches.suse/pagecache-limit-dirty.diff
@@ -0,0 +1,135 @@
+From: Kurt Garloff <garloff@suse.de>
+Subject: Make pagecache limit behavior w.r.t. dirty pages configurable
+References: FATE309111
+Patch-mainline: Never, SUSE specific
+
+The last fixes to this patchset ensured that we don't end up calling
+shrink_page_cache() [from add_to_page_cache()] again and again without
+the ability to actually free something. For this reason we subtracted
+the dirty pages from the list of freeable unmapped pages in the
+calculation.
+
+With this additional patch, a new sysctl
+/proc/sys/vm/pagecache_limit_ignore_dirty
+is introduced. With the default setting (1), behavior does not change.
+When setting it to 0, we actually consider all of the dirty pages
+freeable -- we then allow for a third pass in shrink_page_cache, where
+we allow writing out pages (if the gfp_mask allows it).
+The value can be set to values above 1 as well; with the value set to 2,
+we consider half of the dirty pages freeable etc.
+
+Signed-off-by: Kurt Garloff <garloff@suse.de>
+
+---
+ Documentation/vm/pagecache-limit | 13 +++++++++++--
+ include/linux/swap.h | 1 +
+ kernel/sysctl.c | 7 +++++++
+ mm/page_alloc.c | 6 ++++--
+ mm/vmscan.c | 9 +++++++++
+ 5 files changed, 32 insertions(+), 4 deletions(-)
+
+--- a/Documentation/vm/pagecache-limit
++++ b/Documentation/vm/pagecache-limit
+@@ -1,6 +1,6 @@
+ Functionality:
+ -------------
+-The patch introduces a new tunable in the proc filesystem:
++The patch introduces two new tunables in the proc filesystem:
+
+ /proc/sys/vm/pagecache_limit_mb
+
+@@ -15,6 +15,13 @@ As we only consider pagecache pages that
+ NOTE: The real limit depends on the amount of free memory. Every existing free page allows the page cache to grow 8x the amount of free memory above the set baseline. As soon as the free memory is needed, we free up page cache.
+
+
++/proc/sys/vm/pagecache_limit_ignore_dirty
++
+The default for this setting is 1; this means that we don't consider dirty memory to be part of the limited pagecache, as we cannot easily free up dirty memory (we'd need to do writes for this). By setting this to 0, we actually consider dirty (unmapped) memory to be freeable and do a third pass in shrink_page_cache() where we schedule the pages for writeout. Values larger than 1 are also possible and result in a fraction of the dirty pages being considered non-freeable.
++
++
++
++
+ How it works:
+ ------------
+ The heart of this patch is a new function called shrink_page_cache(). It is called from balance_pgdat (which is the worker for kswapd) if the pagecache is above the limit.
+@@ -27,7 +34,9 @@ shrink_page_cache does several passes:
+ This is fast -- but it might not find enough free pages; if that happens,
+ the second pass will happen
+ - In the second pass, pages from active list will also be considered.
+-- The third pass is just another round of the second pass
+- The third pass will only happen if pagecache_limit_ignore_dirty is not 1.
++ In that case, the third pass is a repetition of the second pass, but this
++ time we allow pages to be written out.
+
+ In all passes, only unmapped pages will be considered.
+
+--- a/include/linux/swap.h
++++ b/include/linux/swap.h
+@@ -336,6 +336,7 @@ extern int vm_swappiness;
+ extern unsigned long pagecache_over_limit(void);
+ extern void shrink_page_cache(gfp_t mask, struct page *page);
+ extern unsigned int vm_pagecache_limit_mb;
++extern unsigned int vm_pagecache_ignore_dirty;
+ extern int remove_mapping(struct address_space *mapping, struct page *page);
+ extern unsigned long vm_total_pages;
+
+--- a/kernel/sysctl.c
++++ b/kernel/sysctl.c
+@@ -1377,6 +1377,13 @@ static struct ctl_table vm_table[] = {
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
++ {
++ .procname = "pagecache_limit_ignore_dirty",
++ .data = &vm_pagecache_ignore_dirty,
++ .maxlen = sizeof(vm_pagecache_ignore_dirty),
++ .mode = 0644,
++ .proc_handler = &proc_dointvec,
++ },
+ #ifdef CONFIG_HUGETLB_PAGE
+ {
+ .procname = "nr_hugepages",
+--- a/mm/page_alloc.c
++++ b/mm/page_alloc.c
+@@ -7809,12 +7809,14 @@ unsigned long pagecache_over_limit()
+ * minus the dirty ones. (FIXME: pages accounted for in NR_WRITEBACK
+ * are not on the LRU lists any more, right?) */
+ unsigned long pgcache_lru_pages = global_page_state(NR_ACTIVE_FILE)
+- + global_page_state(NR_INACTIVE_FILE)
+- - global_page_state(NR_FILE_DIRTY);
++ + global_page_state(NR_INACTIVE_FILE);
+ unsigned long free_pages = global_page_state(NR_FREE_PAGES);
+ unsigned long swap_pages = total_swap_pages - atomic_long_read(&nr_swap_pages);
+ unsigned long limit;
+
++ if (vm_pagecache_ignore_dirty != 0)
++ pgcache_lru_pages -= global_page_state(NR_FILE_DIRTY)
++ /vm_pagecache_ignore_dirty;
+ /* Paranoia */
+ if (unlikely(pgcache_lru_pages > LONG_MAX))
+ return 0;
+--- a/mm/vmscan.c
++++ b/mm/vmscan.c
+@@ -150,6 +150,7 @@ struct scan_control {
+ */
+ int vm_swappiness = 60;
+ unsigned int vm_pagecache_limit_mb __read_mostly = 0;
++unsigned int vm_pagecache_ignore_dirty __read_mostly = 1;
+ /*
+ * The total number of pages which are beyond the high watermark within all
+ * zones.
+@@ -3817,6 +3818,14 @@ static void __shrink_page_cache(gfp_t ma
+ if (ret >= nr_pages)
+ return;
+ }
++
++ if (pass == 1) {
++ if (vm_pagecache_ignore_dirty == 1 ||
++ (mask & (__GFP_IO | __GFP_FS)) != (__GFP_IO | __GFP_FS) )
++ break;
++ else
++ sc.may_writepage = 1;
++ }
+ }
+ }
+
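For a concrete feel of what the vm_pagecache_ignore_dirty divisor above does to the freeable-page estimate, here is a rough userspace sketch of the same arithmetic (plain C, not part of the patch series; the function name and sample numbers are made up for illustration):

/* Userspace sketch of the dirty-page accounting in pagecache_over_limit():
 * with ignore_dirty == 0 all dirty pages stay in the freeable set (the
 * third reclaim pass may write them out); with ignore_dirty == N one Nth
 * of the dirty pages is treated as non-freeable and subtracted.
 */
#include <stdio.h>

static unsigned long freeable_lru_pages(unsigned long active_file,
					unsigned long inactive_file,
					unsigned long dirty,
					unsigned int ignore_dirty)
{
	unsigned long lru = active_file + inactive_file;

	if (ignore_dirty != 0)
		lru -= dirty / ignore_dirty;
	return lru;
}

int main(void)
{
	/* 4000 active + 2000 inactive file pages, 1000 of them dirty */
	printf("ignore_dirty=1: %lu\n", freeable_lru_pages(4000, 2000, 1000, 1)); /* 5000 */
	printf("ignore_dirty=2: %lu\n", freeable_lru_pages(4000, 2000, 1000, 2)); /* 5500 */
	printf("ignore_dirty=0: %lu\n", freeable_lru_pages(4000, 2000, 1000, 0)); /* 6000 */
	return 0;
}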
diff --git a/patches.suse/pagecache-limit-fix-get_nr_swap_pages.patch b/patches.suse/pagecache-limit-fix-get_nr_swap_pages.patch
new file mode 100644
index 0000000000..ca801c61de
--- /dev/null
+++ b/patches.suse/pagecache-limit-fix-get_nr_swap_pages.patch
@@ -0,0 +1,23 @@
+From: John Jolly <jjolly@suse.de>
+Subject: Fix !CONFIG_SWAP build error with nr_swap_pages
+Patch-mainline: Never, SUSE specific
+References: bnc#882108
+
+With !CONFIG_SWAP set, the build fails with nr_swap_pages undeclared.
+
+The proper method of accessing nr_swap_pages is via the get_nr_swap_pages() function.
+---
+ mm/page_alloc.c | 2 +-
+ 1 file changed, 1 insertion(+), 1 deletion(-)
+
+--- a/mm/page_alloc.c
++++ b/mm/page_alloc.c
+@@ -7811,7 +7811,7 @@ unsigned long pagecache_over_limit()
+ unsigned long pgcache_lru_pages = global_page_state(NR_ACTIVE_FILE)
+ + global_page_state(NR_INACTIVE_FILE);
+ unsigned long free_pages = global_page_state(NR_FREE_PAGES);
+- unsigned long swap_pages = total_swap_pages - atomic_long_read(&nr_swap_pages);
++ unsigned long swap_pages = total_swap_pages - get_nr_swap_pages();
+ unsigned long limit;
+
+ if (vm_pagecache_ignore_dirty != 0)
diff --git a/patches.suse/pagecache-limit-fix-shmem-deadlock.patch b/patches.suse/pagecache-limit-fix-shmem-deadlock.patch
new file mode 100644
index 0000000000..2a04a9df69
--- /dev/null
+++ b/patches.suse/pagecache-limit-fix-shmem-deadlock.patch
@@ -0,0 +1,85 @@
+From: Michal Hocko <mhocko@suse.cz>
+Subject: pagecache limit: Fix the shmem deadlock
+Patch-mainline: never, SUSE specific
+References: bnc#755537
+
+SLE12->SLE12-SP2
+- __GFP_WAIT -> __GFP_DIRECT_RECLAIM
+
+See the original patch description for SLE11-SP2 below:
+This patch is strictly not needed in SLE11-SP3 because we no longer call
+add_to_page_cache under the info->lock spinlock, but shmem_unuse_inode
+is still called with shmem_swaplist_mutex held and uses GFP_NOWAIT when adding
+to the page cache. We have taken a conservative approach and rather shrink
+the cache proactively in this case.
+
+Original patch description for reference:
+
+shmem_getpage uses info->lock spinlock to make sure (among other things)
+that the shmem inode information is synchronized with the page cache
+status so we are calling add_to_page_cache_lru with the lock held.
+Unfortunately add_to_page_cache_lru calls add_to_page_cache which handles
+page cache limit and it might end up in the direct reclaim which in turn might
+sleep even though the given gfp_mask is GFP_NOWAIT so we end up sleeping
+in an atomic context -> kaboom.
+
+Let's fix this by enforcing that add_to_page_cache doesn't go into the reclaim
+if the gfp_mask says it should be atomic. Caller is then responsible for the
+shrinking and we are doing that when we preallocate a page for shmem. Other
+callers are not relying on GFP_NOWAIT when adding to page cache.
+
+I really do _hate_ abusing page (NULL) parameter but I was too lazy to
+prepare a cleanup patch which would get rid of __shrink_page_cache and merge
+it with shrink_page_cache so we would get rid of the page argument which is
+of no use.
+
+Also be strict and BUG_ON when we get into __shrink_page_cache with an atomic
+gfp_mask.
+
+Please also note that this change might lead to a more extensive reclaim if we
+have more threads fighting for the same shmem page because then the shrinking
+is not linearized by the lock and so they might race with the limit evaluation
+and start reclaiming all at once. The risk is not that big though because we
+would end up reclaiming NR_CPUs * over_limit pages at maximum.
+
+Signed-off-by: Michal Hocko <mhocko@suse.cz>
+
+---
+ mm/shmem.c | 11 +++++++++++
+ mm/vmscan.c | 5 +++++
+ 2 files changed, 16 insertions(+)
+
+--- a/mm/shmem.c
++++ b/mm/shmem.c
+@@ -1211,6 +1211,17 @@ int shmem_unuse(swp_entry_t swap, struct
+ /* No radix_tree_preload: swap entry keeps a place for page in tree */
+ error = -EAGAIN;
+
++ /*
++ * try to shrink the page cache proactively even though
++ * we might already have the page in so the shrinking is
++ * not necessary but this is much easier than dropping
++ * the lock in shmem_unuse_inode before add_to_page_cache_lru.
++ * GFP_NOWAIT makes sure that we do not shrink when adding
++ * to page cache
++ */
++ if (unlikely(vm_pagecache_limit_mb) && pagecache_over_limit() > 0)
++ shrink_page_cache(GFP_KERNEL, NULL);
++
+ mutex_lock(&shmem_swaplist_mutex);
+ list_for_each_safe(this, next, &shmem_swaplist) {
+ info = list_entry(this, struct shmem_inode_info, swaplist);
+--- a/mm/vmscan.c
++++ b/mm/vmscan.c
+@@ -3786,6 +3786,11 @@ static void __shrink_page_cache(gfp_t ma
+ };
+ long nr_pages;
+
+	/* We might sleep during direct reclaim, so calling this from an
+	 * atomic context is certainly a bug.
+	 */
++ BUG_ON(!(mask & __GFP_DIRECT_RECLAIM));
++
+ /* How many pages are we over the limit?
+ * But don't enforce limit if there's plenty of free mem */
+ nr_pages = pagecache_over_limit();
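The guard above relies on the fact that an atomic caller's gfp_mask (e.g. GFP_NOWAIT) does not carry the direct-reclaim bit while GFP_KERNEL does. A minimal userspace sketch of that check, with illustrative flag values rather than the real kernel definitions:

/* Sketch only: the flag values below are made up for illustration and do
 * not match the kernel's gfp.h. The point is the mask test that mirrors
 * BUG_ON(!(mask & __GFP_DIRECT_RECLAIM)) in __shrink_page_cache().
 */
#include <assert.h>
#include <stdio.h>

#define SKETCH_DIRECT_RECLAIM	0x1u
#define SKETCH_IO		0x2u
#define SKETCH_FS		0x4u
#define SKETCH_GFP_KERNEL	(SKETCH_DIRECT_RECLAIM | SKETCH_IO | SKETCH_FS)
#define SKETCH_GFP_NOWAIT	0x0u	/* caller may not sleep: no direct reclaim */

static void shrink_page_cache_sketch(unsigned int mask)
{
	/* a sleeping reclaim from an atomic context would be a bug */
	assert(mask & SKETCH_DIRECT_RECLAIM);
	printf("reclaiming with mask 0x%x\n", mask);
}

int main(void)
{
	shrink_page_cache_sketch(SKETCH_GFP_KERNEL);	/* fine, caller may sleep */
	/* shrink_page_cache_sketch(SKETCH_GFP_NOWAIT); would trip the assert */
	return 0;
}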
diff --git a/patches.suse/pagecache-limit-fix-wrong-reclaimed-count.patch b/patches.suse/pagecache-limit-fix-wrong-reclaimed-count.patch
new file mode 100644
index 0000000000..b1bf159cec
--- /dev/null
+++ b/patches.suse/pagecache-limit-fix-wrong-reclaimed-count.patch
@@ -0,0 +1,44 @@
+From: Vlastimil Babka <vbabka@suse.cz>
+Subject: pagecache limit: fix wrong nr_reclaimed count
+Patch-mainline: never, SUSE specific
+References: FATE#309111, bnc#924701
+
+During development of tracepoints for pagecache limit reclaim, it was found out
+that the total accumulated nr_reclaimed count in __shrink_page_cache() stored
+in variable "ret" can be higher than the real value when more than one
+shrink_all_zones() passes are performed during the reclaim. This happens
+because shrink_all_zones() uses sc.nr_reclaimed to accumulate work from all
+passes, but __shrink_page_cache() assumes the value corresponds to a single
+pass and performs own accumulation in the "ret" variable. This may result in
+multiple accumulation and thus premature exits from __shrink_page_cache()
+instead of retrying with higher priority as intended.
+
+The issue is limited in practice, as the number of pages to reclaim is
+capped at 8*SWAP_CLUSTER_MAX anyway. However, the implementation should work
+as intended, and we want to report the accumulated count in a tracepoint, so
+it should also be corrected to allow proper performance analysis.
+
+The patch fixes the issue by removing the accumulation in shrink_all_zones()
+and leaving it only in __shrink_page_cache() where it is needed. After the patch
+shrink_all_zones() will set sc->nr_reclaimed according to how much was
+reclaimed during a single call of the function.
+
+Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
+Acked-by: Michal Hocko <mhocko@suse.cz>
+
+---
+ mm/vmscan.c | 2 +-
+ 1 file changed, 1 insertion(+), 1 deletion(-)
+
+--- a/mm/vmscan.c
++++ b/mm/vmscan.c
+@@ -3851,8 +3851,8 @@ static int shrink_all_nodes(unsigned lon
+
+ out_wakeup:
+ wake_up_interruptible(&pagecache_reclaim_wq);
+- sc->nr_reclaimed += nr_reclaimed;
+ out:
++ sc->nr_reclaimed = nr_reclaimed;
+ return nr_locked_zones;
+ }
+
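To make the double accounting described above concrete, consider two passes that really reclaim 10 and 5 pages. A small standalone C illustration (not taken from the patch; the variables only mimic sc->nr_reclaimed and the caller's "ret"):

/* Sketch of the over-accounting fixed above: before the fix the callee
 * accumulated across calls and the caller accumulated again on top.
 */
#include <stdio.h>

int main(void)
{
	unsigned long per_pass[2] = { 10, 5 };	/* really reclaimed per pass */
	unsigned long sc_nr_reclaimed = 0;	/* plays sc->nr_reclaimed */
	unsigned long ret_buggy = 0, ret_fixed = 0;
	int pass;

	for (pass = 0; pass < 2; pass++) {
		/* before: shrink_all_zones() kept adding across calls */
		sc_nr_reclaimed += per_pass[pass];
		ret_buggy += sc_nr_reclaimed;	/* pass 0 gets counted twice */

		/* after: the callee reports a single call's work only */
		ret_fixed += per_pass[pass];
	}

	/* prints: buggy=25 fixed=15 (the real total is 15) */
	printf("buggy=%lu fixed=%lu\n", ret_buggy, ret_fixed);
	return 0;
}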
diff --git a/patches.suse/pagecache-limit-reduce-zone-lrulock-bouncing.patch b/patches.suse/pagecache-limit-reduce-zone-lrulock-bouncing.patch
new file mode 100644
index 0000000000..5f9bdac7ef
--- /dev/null
+++ b/patches.suse/pagecache-limit-reduce-zone-lrulock-bouncing.patch
@@ -0,0 +1,234 @@
+From: Michal Hocko <mhocko@suse.cz>
+Subject: pagecachelimit: reduce lru_lock contention for heavy parallel reclaim
+Patch-mainline: never, SUSE specific
+References: bnc#878509, bnc#864464
+
+mhocko@suse.com:
+move per-zone to per-node handling for SLE12-SP4 because the memory reclaim is
+per-node rather than per-zone now.
+
+More customers have started complaining about hard lockups detected during
+heavy pagecache limit reclaim.
+
+All the collected vmcore files showed us the same class of problem. There is no
+hard lockup in the system. It is just irq aware lru_lock bouncing all over the
+place like crazy. There were many CPUs fighting over the single node's lru_lock
+to isolate some pages + some other lru_lock users who try to free memory as a
+result of munmap or exit.
+
+All those systems were configured to use 4G page_cache although the machine was
+equipped with much more memory. pagecache_over_limit tries to be clever and
+relax the limit a bit but 4G on a 1TB machine still sounds too low and
+increases the risk of parallel page cache reclaim. If we add NUMA effects and
+hundreds of CPUs then the lock bouncing is simply unavoidable problem.
+
+This patch reduces the problem by reducing the number of page cache
+reclaimers. Only one such reclaimer is allowed to scan one node.
+shrink_all_zones, which is used only by the pagecache reclaim, iterates over all
+available zones. We have added a per-node atomic counter and use it as a lock
+(we cannot use a spinlock because reclaim is a sleepable context and a mutex
+sounds too heavy). Please note that new contention might show up on
+prepare_to_wait now but this hasn't been seen in the representative SAP
+workload during testing.
+
+Only one reclaimer is allowed to lock the node and try to reclaim it.
+Others will back off to other currently unlocked zones. If all the zones are
+locked for a reclaimer it is put to sleep on the pagecache_reclaim_wq
+waitqueue which is woken up after any of the current reclaimers is done
+with the work. The sleeper retries __shrink_page_cache along with re-evaluating
+the page cache limit and attempts a new round only if it is still applicable.
+
+This patch potentially breaks kABI on some architectures but x86_64 should be
+safe because it is put before padding and after 3 ints so there should be 32b
+available even without the padding. If other architectures have a problem with
+that we can use suse_kabi_padding at the end of the structure.
+This will be sorted out before the patch gets merged into our tree.
+
+Signed-off-by: Michal Hocko <mhocko@suse.cz>
+Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
+
+---
+ include/linux/mmzone.h | 8 ++++
+ mm/vmscan.c | 89 ++++++++++++++++++++++++++++++++++++++++++++-----
+ 2 files changed, 88 insertions(+), 9 deletions(-)
+
+--- a/include/linux/mmzone.h
++++ b/include/linux/mmzone.h
+@@ -724,6 +724,14 @@ typedef struct pglist_data {
+
+ unsigned long flags;
+
++ /*
++ * This atomic counter is set when there is pagecache limit
++ * reclaim going on on this particular node. Other potential
+	 * reclaimers should back off to prevent heavy lru_lock
++ * bouncing.
++ */
++ atomic_t pagecache_reclaim;
++
+ ZONE_PADDING(_pad2_)
+
+ /* Per-node vmstats */
+--- a/mm/vmscan.c
++++ b/mm/vmscan.c
+@@ -3640,6 +3640,26 @@ unsigned long shrink_all_memory(unsigned
+ }
+ #endif /* CONFIG_HIBERNATION */
+
++/*
++ * Returns non-zero if the lock has been acquired, false if somebody
++ * else is holding the lock.
++ */
++static int pagecache_reclaim_lock_node(struct pglist_data *pgdat)
++{
++ return atomic_add_unless(&pgdat->pagecache_reclaim, 1, 1);
++}
++
++static void pagecache_reclaim_unlock_node(struct pglist_data *pgdat)
++{
++ BUG_ON(atomic_dec_return(&pgdat->pagecache_reclaim));
++}
++
++/*
++ * Potential page cache reclaimers who are not able to take
++ * reclaim lock on any node are sleeping on this waitqueue.
++ * So this is basically a congestion wait queue for them.
++ */
++DECLARE_WAIT_QUEUE_HEAD(pagecache_reclaim_wq);
+
+ /*
+ * Similar to shrink_node but it has a different consumer - pagecache limit
+@@ -3693,16 +3713,34 @@ static bool shrink_node_per_memcg(struct
+ *
+ * For pass > 3 we also try to shrink the LRU lists that contain a few pages
+ */
+-static void shrink_all_nodes(unsigned long nr_pages, int pass,
++static int shrink_all_nodes(unsigned long nr_pages, int pass,
+ struct scan_control *sc)
+ {
+ unsigned long nr_reclaimed = 0;
++ unsigned int nr_locked_zones = 0;
++ DEFINE_WAIT(wait);
+ int nid;
+
++ prepare_to_wait(&pagecache_reclaim_wq, &wait, TASK_INTERRUPTIBLE);
++
+ for_each_online_node(nid) {
+ struct pglist_data *pgdat = NODE_DATA(nid);
+ enum lru_list lru;
+
++ /*
++ * Back off if somebody is already reclaiming this node
++ * for the pagecache reclaim.
++ */
++ if (!pagecache_reclaim_lock_node(pgdat))
++ continue;
++
++ /*
++ * This reclaimer might scan a node so it will never
++ * sleep on pagecache_reclaim_wq
++ */
++ finish_wait(&pagecache_reclaim_wq, &wait);
++ nr_locked_zones++;
++
+ for_each_evictable_lru(lru) {
+ enum zone_stat_item ls = NR_LRU_BASE + lru;
+ unsigned long lru_pages = node_page_state(pgdat, ls);
+@@ -3744,8 +3782,8 @@ static void shrink_all_nodes(unsigned lo
+ */
+ if (shrink_node_per_memcg(pgdat, lru,
+ nr_to_scan, nr_pages, &nr_reclaimed, sc)) {
+- sc->nr_reclaimed += nr_reclaimed;
+- return;
++ pagecache_reclaim_unlock_node(pgdat);
++ goto out_wakeup;
+ }
+
+ current->reclaim_state = &reclaim_state;
+@@ -3756,8 +3794,25 @@ static void shrink_all_nodes(unsigned lo
+ current->reclaim_state = old_rs;
+ }
+ }
++ pagecache_reclaim_unlock_node(pgdat);
+ }
++
++ /*
+	 * We have to go to sleep because all the zones are already being reclaimed.
+	 * One of the reclaimers will wake us up or __shrink_page_cache will
++ * do it if there is nothing to be done.
++ */
++ if (!nr_locked_zones) {
++ schedule();
++ finish_wait(&pagecache_reclaim_wq, &wait);
++ goto out;
++ }
++
++out_wakeup:
++ wake_up_interruptible(&pagecache_reclaim_wq);
+ sc->nr_reclaimed += nr_reclaimed;
++out:
++ return nr_locked_zones;
+ }
+
+ /*
+@@ -3776,7 +3831,7 @@ static void shrink_all_nodes(unsigned lo
+ static void __shrink_page_cache(gfp_t mask)
+ {
+ unsigned long ret = 0;
+- int pass;
++ int pass = 0;
+ struct scan_control sc = {
+ .gfp_mask = mask,
+ .may_swap = 0,
+@@ -3791,6 +3846,7 @@ static void __shrink_page_cache(gfp_t ma
+ */
+ BUG_ON(!(mask & __GFP_DIRECT_RECLAIM));
+
++retry:
+ /* How many pages are we over the limit?
+ * But don't enforce limit if there's plenty of free mem */
+ nr_pages = pagecache_over_limit();
+@@ -3800,9 +3856,18 @@ static void __shrink_page_cache(gfp_t ma
+ * is still more than minimally needed. */
+ nr_pages /= 2;
+
+- /* Return early if there's no work to do */
+- if (nr_pages <= 0)
++ /*
++ * Return early if there's no work to do.
++ * Wake up reclaimers that couldn't scan any node due to congestion.
++ * There is apparently nothing to do so they do not have to sleep.
++ * This makes sure that no sleeping reclaimer will stay behind.
++ * Allow breaching the limit if the task is on the way out.
++ */
++ if (nr_pages <= 0 || fatal_signal_pending(current)) {
++ wake_up_interruptible(&pagecache_reclaim_wq);
+ return;
++ }
++
+ /* But do a few at least */
+ nr_pages = max_t(unsigned long, nr_pages, 8*SWAP_CLUSTER_MAX);
+
+@@ -3812,13 +3877,19 @@ static void __shrink_page_cache(gfp_t ma
+ * 1 = Reclaim from active list but don't reclaim mapped (not that fast)
+ * 2 = Reclaim from active list but don't reclaim mapped (2nd pass)
+ */
+- for (pass = 0; pass < 2; pass++) {
++ for (; pass < 2; pass++) {
+ for (sc.priority = DEF_PRIORITY; sc.priority >= 0; sc.priority--) {
+ unsigned long nr_to_scan = nr_pages - ret;
+
+ sc.nr_scanned = 0;
+- /* sc.swap_cluster_max = nr_to_scan; */
+- shrink_all_nodes(nr_to_scan, pass, &sc);
++
++ /*
++ * No node reclaimed because of too many reclaimers. Retry whether
++ * there is still something to do
++ */
++ if (!shrink_all_nodes(nr_to_scan, pass, &sc))
++ goto retry;
++
+ ret += sc.nr_reclaimed;
+ if (ret >= nr_pages)
+ return;
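The per-node exclusion above is essentially a try-lock built from an atomic counter: atomic_add_unless(&pgdat->pagecache_reclaim, 1, 1) succeeds only while the counter is still zero. A minimal userspace sketch of the same pattern using C11 atomics (the helper names are illustrative, not from the patch):

#include <stdatomic.h>
#include <stdio.h>

/* one counter per node: 0 = unlocked, 1 = some reclaimer owns the node */
static atomic_int pagecache_reclaim;

/* mirrors atomic_add_unless(..., 1, 1): take the lock only if it is 0 */
static int reclaim_trylock(atomic_int *lock)
{
	int expected = 0;

	return atomic_compare_exchange_strong(lock, &expected, 1);
}

static void reclaim_unlock(atomic_int *lock)
{
	atomic_store(lock, 0);
}

int main(void)
{
	if (reclaim_trylock(&pagecache_reclaim)) {
		/* a second reclaimer backs off instead of piling on lru_lock */
		printf("second trylock while held: %d\n",
		       reclaim_trylock(&pagecache_reclaim));	/* prints 0 */
		reclaim_unlock(&pagecache_reclaim);
	}
	return 0;
}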
diff --git a/patches.suse/pagecache-limit-tracepoints.patch b/patches.suse/pagecache-limit-tracepoints.patch
new file mode 100644
index 0000000000..578090cacb
--- /dev/null
+++ b/patches.suse/pagecache-limit-tracepoints.patch
@@ -0,0 +1,164 @@
+From: Vlastimil Babka <vbabka@suse.cz>
+Subject: pagecache limit: add tracepoints
+Patch-mainline: never, SUSE specific
+References: bnc#924701
+
+Add tracepoints to the pagecache limit reclaim path: mm_shrink_page_cache_start/end
+around __shrink_page_cache() and mm_pagecache_reclaim_start/end around each
+shrink_all_nodes() pass, so the behavior and cost of the limit can be analyzed
+with standard tracing tools.
+
+Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
+
+---
+ include/trace/events/pagecache-limit.h | 99 +++++++++++++++++++++++++++++++++
+ include/trace/events/vmscan.h | 2
+ mm/vmscan.c | 6 ++
+ 3 files changed, 107 insertions(+)
+
+--- /dev/null
++++ b/include/trace/events/pagecache-limit.h
+@@ -0,0 +1,99 @@
++
++/*
++ * This file defines pagecache limit specific tracepoints and should only be
++ * included through include/trace/events/vmscan.h, never directly.
++ */
++
++TRACE_EVENT(mm_shrink_page_cache_start,
++
++ TP_PROTO(gfp_t mask),
++
++ TP_ARGS(mask),
++
++ TP_STRUCT__entry(
++ __field(gfp_t, mask)
++ ),
++
++ TP_fast_assign(
++ __entry->mask = mask;
++ ),
++
++ TP_printk("mask=%s",
++ show_gfp_flags(__entry->mask))
++);
++
++TRACE_EVENT(mm_shrink_page_cache_end,
++
++ TP_PROTO(unsigned long nr_reclaimed),
++
++ TP_ARGS(nr_reclaimed),
++
++ TP_STRUCT__entry(
++ __field(unsigned long, nr_reclaimed)
++ ),
++
++ TP_fast_assign(
++ __entry->nr_reclaimed = nr_reclaimed;
++ ),
++
++ TP_printk("nr_reclaimed=%lu",
++ __entry->nr_reclaimed)
++);
++
++TRACE_EVENT(mm_pagecache_reclaim_start,
++
++ TP_PROTO(unsigned long nr_pages, int pass, int prio, gfp_t mask,
++ bool may_write),
++
++ TP_ARGS(nr_pages, pass, prio, mask, may_write),
++
++ TP_STRUCT__entry(
++ __field(unsigned long, nr_pages )
++ __field(int, pass )
++ __field(int, prio )
++ __field(gfp_t, mask )
++ __field(bool, may_write )
++ ),
++
++ TP_fast_assign(
++ __entry->nr_pages = nr_pages;
++ __entry->pass = pass;
++ __entry->prio = prio;
++ __entry->mask = mask;
++ __entry->may_write = may_write;
++ ),
++
++ TP_printk("nr_pages=%lu pass=%d prio=%d mask=%s may_write=%d",
++ __entry->nr_pages,
++ __entry->pass,
++ __entry->prio,
++ show_gfp_flags(__entry->mask),
++ (int) __entry->may_write)
++);
++
++TRACE_EVENT(mm_pagecache_reclaim_end,
++
++ TP_PROTO(unsigned long nr_scanned, unsigned long nr_reclaimed,
++ unsigned int nr_zones),
++
++ TP_ARGS(nr_scanned, nr_reclaimed, nr_zones),
++
++ TP_STRUCT__entry(
++ __field(unsigned long, nr_scanned )
++ __field(unsigned long, nr_reclaimed )
++ __field(unsigned int, nr_zones )
++ ),
++
++ TP_fast_assign(
++ __entry->nr_scanned = nr_scanned;
++ __entry->nr_reclaimed = nr_reclaimed;
++ __entry->nr_zones = nr_zones;
++ ),
++
++ TP_printk("nr_scanned=%lu nr_reclaimed=%lu nr_scanned_zones=%u",
++ __entry->nr_scanned,
++ __entry->nr_reclaimed,
++ __entry->nr_zones)
++);
++
++
+--- a/include/trace/events/vmscan.h
++++ b/include/trace/events/vmscan.h
+@@ -37,6 +37,8 @@
+ (RECLAIM_WB_ASYNC) \
+ )
+
++#include "pagecache-limit.h"
++
+ TRACE_EVENT(mm_vmscan_kswapd_sleep,
+
+ TP_PROTO(int nid),
+--- a/mm/vmscan.c
++++ b/mm/vmscan.c
+@@ -3756,6 +3756,8 @@ static int shrink_all_nodes(unsigned lon
+ int nid;
+
+ prepare_to_wait(&pagecache_reclaim_wq, &wait, TASK_INTERRUPTIBLE);
++ trace_mm_pagecache_reclaim_start(nr_pages, pass, sc->priority, sc->gfp_mask,
++ sc->may_writepage);
+
+ for_each_online_node(nid) {
+ struct pglist_data *pgdat = NODE_DATA(nid);
+@@ -3853,6 +3855,8 @@ out_wakeup:
+ wake_up_interruptible(&pagecache_reclaim_wq);
+ out:
+ sc->nr_reclaimed = nr_reclaimed;
++ trace_mm_pagecache_reclaim_end(sc->nr_scanned, nr_reclaimed,
++ nr_locked_zones);
+ return nr_locked_zones;
+ }
+
+@@ -3912,6 +3916,7 @@ retry:
+ /* But do a few at least */
+ nr_pages = max_t(unsigned long, nr_pages, 8*SWAP_CLUSTER_MAX);
+ inc_pagecache_limit_stat(NR_PAGECACHE_LIMIT_THROTTLED);
++ trace_mm_shrink_page_cache_start(mask);
+
+ /*
+ * Shrink the LRU in 2 passes:
+@@ -3948,6 +3953,7 @@ retry:
+ }
+ }
+ out:
++ trace_mm_shrink_page_cache_end(ret);
+ dec_pagecache_limit_stat(NR_PAGECACHE_LIMIT_THROTTLED);
+ }
+
diff --git a/patches.suse/pagecache-limit-unmapped.diff b/patches.suse/pagecache-limit-unmapped.diff
new file mode 100644
index 0000000000..b7f7e7fbde
--- /dev/null
+++ b/patches.suse/pagecache-limit-unmapped.diff
@@ -0,0 +1,80 @@
+From: Kurt Garloff <garloff@suse.de>
+Subject: Fix calculation of unmapped page cache size
+References: FATE309111
+Patch-mainline: Never, SUSE specific
+
+Remarks from sle11sp3->sle12 porting by mhocko@suse.cz:
+- nr_swap_pages is atomic now
+
+Original changelog as per 11sp3:
+--------------------------------
+Unfortunately, the assumption that NR_FILE_PAGES - NR_FILE_MAPPED
+is easily freeable was wrong -- this could lead to us repeatedly
+calling shrink_page_cache() from add_to_page_cache() without
+making much progress and thus slowing down the system needlessly
+(bringing it down to a crawl in the worst case).
+
+Calculating the unmapped page cache pages is not obvious, unfortunately.
+There are two upper limits:
+* It can't be larger than the overall pagecache size minus the max
+ from mapped and shmem pages. (Those two overlap, unfortunately,
+ so we can't just subtract the sum of those two ...)
+* It can't be larger than the inactive plus active FILE LRU lists.
+
+So we take the smaller of those two and divide by two to approximate
+the number we're looking for.
+
+Signed-off-by: Kurt Garloff <garloff@suse.de>
+
+---
+ mm/page_alloc.c | 33 +++++++++++++++++++++++++++++++--
+ 1 file changed, 31 insertions(+), 2 deletions(-)
+
+--- a/mm/page_alloc.c
++++ b/mm/page_alloc.c
+@@ -7796,14 +7796,43 @@ void zone_pcp_reset(struct zone *zone)
+ */
+ unsigned long pagecache_over_limit()
+ {
+- /* We only want to limit unmapped page cache pages */
++ /* We only want to limit unmapped and non-shmem page cache pages;
++ * normally all shmem pages are mapped as well, but that does
++ * not seem to be guaranteed. (Maybe this was just an oprofile
++ * bug?).
++ * (FIXME: Do we need to subtract NR_FILE_DIRTY here as well?) */
+ unsigned long pgcache_pages = global_page_state(NR_FILE_PAGES)
+- - global_page_state(NR_FILE_MAPPED);
++ - max_t(unsigned long,
++ global_page_state(NR_FILE_MAPPED),
++ global_page_state(NR_SHMEM));
++ /* We certainly can't free more than what's on the LRU lists
++ * minus the dirty ones. (FIXME: pages accounted for in NR_WRITEBACK
++ * are not on the LRU lists any more, right?) */
++ unsigned long pgcache_lru_pages = global_page_state(NR_ACTIVE_FILE)
++ + global_page_state(NR_INACTIVE_FILE)
++ - global_page_state(NR_FILE_DIRTY);
+ unsigned long free_pages = global_page_state(NR_FREE_PAGES);
++ unsigned long swap_pages = total_swap_pages - atomic_long_read(&nr_swap_pages);
+ unsigned long limit;
+
++ /* Paranoia */
++ if (unlikely(pgcache_lru_pages > LONG_MAX))
++ return 0;
++ /* We give a bonus for free pages above 6% of total (minus half swap used) */
++ free_pages -= totalram_pages/16;
++ if (likely(swap_pages <= LONG_MAX))
++ free_pages -= swap_pages/2;
++ if (free_pages > LONG_MAX)
++ free_pages = 0;
++
++ /* Limit it to 94% of LRU (not all there might be unmapped) */
++ pgcache_lru_pages -= pgcache_lru_pages/16;
++ pgcache_pages = min_t(unsigned long, pgcache_pages, pgcache_lru_pages);
++
++ /* Effective limit is corrected by effective free pages */
+ limit = vm_pagecache_limit_mb * ((1024*1024UL)/PAGE_SIZE) +
+ FREE_TO_PAGECACHE_RATIO * free_pages;
++
+ if (pgcache_pages > limit)
+ return pgcache_pages - limit;
+ return 0;
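The effective limit computed above combines the configured megabyte value with a free-memory bonus (free pages above roughly 6% of RAM, after subtracting half of the used swap, each counted FREE_TO_PAGECACHE_RATIO times) and caps the unmapped estimate at about 94% of the file LRU. A rough userspace sketch of that arithmetic, assuming 4KiB pages and made-up input values (not part of the patch series):

/* Sketch of the pagecache_over_limit() arithmetic; all values in pages. */
#include <stdio.h>

#define FREE_TO_PAGECACHE_RATIO	8
#define PAGES_PER_MB		(1024 * 1024 / 4096)	/* assumes 4KiB pages */

static unsigned long over_limit(unsigned long limit_mb,
				unsigned long pgcache_pages,	/* unmapped, non-shmem estimate */
				unsigned long lru_file_pages,	/* active + inactive file LRU */
				unsigned long free_pages,
				unsigned long totalram_pages,
				unsigned long swap_used_pages)
{
	unsigned long limit;

	/* bonus only for free memory above ~6% of RAM, minus half the used swap */
	if (free_pages > totalram_pages / 16 + swap_used_pages / 2)
		free_pages -= totalram_pages / 16 + swap_used_pages / 2;
	else
		free_pages = 0;

	/* only ~94% of the file LRU is assumed to be unmapped and freeable */
	lru_file_pages -= lru_file_pages / 16;
	if (pgcache_pages > lru_file_pages)
		pgcache_pages = lru_file_pages;

	limit = limit_mb * PAGES_PER_MB + FREE_TO_PAGECACHE_RATIO * free_pages;
	return pgcache_pages > limit ? pgcache_pages - limit : 0;
}

int main(void)
{
	/* 1G limit, 2G of unmapped cache, 16G RAM, plenty of free memory: 0 over */
	printf("%lu\n", over_limit(1024, 524288, 600000, 800000, 4194304, 0));
	/* same cache but free memory down to ~1G: ~262144 pages (1G) over the limit */
	printf("%lu\n", over_limit(1024, 524288, 600000, 262144, 4194304, 0));
	return 0;
}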
diff --git a/patches.suse/pagecache-limit-vmstat_counters.patch b/patches.suse/pagecache-limit-vmstat_counters.patch
new file mode 100644
index 0000000000..f177305cd4
--- /dev/null
+++ b/patches.suse/pagecache-limit-vmstat_counters.patch
@@ -0,0 +1,179 @@
+From: Michal Hocko <mhocko@suse.cz>
+Subject: pagecache limit: export debugging counters via /proc/vmstat
+Patch-mainline: never, SUSE specific
+References: bnc#924701
+
+Pagecache limit has proven to be hard to tune historically (which is not
+entirely unexpected). The primary motivation for the knob was to prevent
+heavy pagecache users from interfering with the rest of the system
+and pushing memory out to swap. There have been many changes done in
+the reclaim path to help with that but that still doesn't seem sufficient
+and some customers still seem to benefit from the pagecache_limit_mb knob.
+
+As the pagecache limit reclaim doesn't scale well with the growing
+number of CPUs it has to be throttled one way or another and that might
+lead to long stalls when the limit is set too low. What is too low, however,
+doesn't have a simple answer and it highly depends on the workload.
+
+This patch helps in a way by exporting 2 counters via /proc/vmstat:
+ - nr_pagecache_limit_throttled - tells the administrator how many tasks
+ are throttled because they have hit the pagecache limit. Some of
+ those tasks will be performing pagecache limit direct reclaim
+ before they are allowed to get a new pagecache page.
+ - nr_pagecache_limit_blocked - tells the administrator how many tasks
+ are blocked waiting for the pagecache limit reclaim to make some
+ progress but they cannot perform the reclaim themselves.
+
+A high number of the first (throttled) signals there is strong pressure on
+the pagecache limit. This itself doesn't necessarily imply very long stalls.
+The memory reclaim might still be effective enough to finish in a reasonable
+time and the processes will only see the throttling which is the main point
+of the pagecache_limit_mb knob.
+
+But a high number of the latter (blocked) is a clear signal that the pagecache
+limit is under provisioned and the demand for the page cache is much higher
+than the system manages to reclaim. Long stalls are basically unavoidable in
+such a case. Increasing the limit in such a case is essential if the latencies
+incurred by the pagecache limit are not acceptable.
+
+Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
+Signed-off-by: Michal Hocko <mhocko@suse.cz>
+
+---
+ include/linux/vmstat.h | 12 ++++++++++++
+ mm/vmscan.c | 45 +++++++++++++++++++++++++++++++++++++++++++--
+ mm/vmstat.c | 7 +++++++
+ 3 files changed, 62 insertions(+), 2 deletions(-)
+
+--- a/include/linux/vmstat.h
++++ b/include/linux/vmstat.h
+@@ -380,6 +380,18 @@ static inline void __mod_zone_freepage_s
+ __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages);
+ }
+
++enum pagecache_limit_stat_item {
++ NR_PAGECACHE_LIMIT_THROTTLED, /* Number of tasks throttled by the
++ * page cache limit.
++ */
++ NR_PAGECACHE_LIMIT_BLOCKED, /* Number of tasks blocked waiting for
++ * the page cache limit reclaim.
++ */
++ NR_PAGECACHE_LIMIT_ITEMS,
++};
++
++void all_pagecache_limit_counters(unsigned long *);
++
+ extern const char * const vmstat_text[];
+
+ #endif /* _LINUX_VMSTAT_H */
+--- a/mm/vmscan.c
++++ b/mm/vmscan.c
+@@ -3641,6 +3641,40 @@ unsigned long shrink_all_memory(unsigned
+ #endif /* CONFIG_HIBERNATION */
+
+ /*
++ * This should probably go into mm/vmstat.c but there is no intention to
++ * spread any knowledge outside of this single user so let's stay here
++ * and be quiet so that nobody notices us.
++ *
++ * A new counter has to be added to enum pagecache_limit_stat_item and
++ * its name to vmstat_text.
++ *
++ * The pagecache limit reclaim is also a slow path so we can go without
++ * per-cpu accounting for now.
++ *
++ * No kernel path should _ever_ depend on these counters. They are solely
++ * for userspace debugging via /proc/vmstat
++ */
++static atomic_t pagecache_limit_stats[NR_PAGECACHE_LIMIT_ITEMS];
++
++void all_pagecache_limit_counters(unsigned long *ret)
++{
++ int i;
++
++ for (i = 0; i < NR_PAGECACHE_LIMIT_ITEMS; i++)
++ ret[i] = atomic_read(&pagecache_limit_stats[i]);
++}
++
++static void inc_pagecache_limit_stat(enum pagecache_limit_stat_item item)
++{
++ atomic_inc(&pagecache_limit_stats[item]);
++}
++
++static void dec_pagecache_limit_stat(enum pagecache_limit_stat_item item)
++{
++ atomic_dec(&pagecache_limit_stats[item]);
++}
++
++/*
+ * Returns non-zero if the lock has been acquired, false if somebody
+ * else is holding the lock.
+ */
+@@ -3808,7 +3842,9 @@ static int shrink_all_nodes(unsigned lon
+ * do it if there is nothing to be done.
+ */
+ if (!nr_locked_zones) {
++ inc_pagecache_limit_stat(NR_PAGECACHE_LIMIT_BLOCKED);
+ schedule();
++ dec_pagecache_limit_stat(NR_PAGECACHE_LIMIT_BLOCKED);
+ finish_wait(&pagecache_reclaim_wq, &wait);
+ goto out;
+ }
+@@ -3875,6 +3911,7 @@ retry:
+
+ /* But do a few at least */
+ nr_pages = max_t(unsigned long, nr_pages, 8*SWAP_CLUSTER_MAX);
++ inc_pagecache_limit_stat(NR_PAGECACHE_LIMIT_THROTTLED);
+
+ /*
+ * Shrink the LRU in 2 passes:
+@@ -3892,12 +3929,14 @@ retry:
+ * No node reclaimed because of too many reclaimers. Retry whether
+ * there is still something to do
+ */
+- if (!shrink_all_nodes(nr_to_scan, pass, &sc))
++ if (!shrink_all_nodes(nr_to_scan, pass, &sc)) {
++ dec_pagecache_limit_stat(NR_PAGECACHE_LIMIT_THROTTLED);
+ goto retry;
++ }
+
+ ret += sc.nr_reclaimed;
+ if (ret >= nr_pages)
+- return;
++ goto out;
+ }
+
+ if (pass == 1) {
+@@ -3908,6 +3947,8 @@ retry:
+ sc.may_writepage = 1;
+ }
+ }
++out:
++ dec_pagecache_limit_stat(NR_PAGECACHE_LIMIT_THROTTLED);
+ }
+
+ void shrink_page_cache(gfp_t mask, struct page *page)
+--- a/mm/vmstat.c
++++ b/mm/vmstat.c
+@@ -1208,6 +1208,10 @@ const char * const vmstat_text[] = {
+ "vmacache_full_flushes",
+ #endif
+ #endif /* CONFIG_VM_EVENTS_COUNTERS */
++
++ /* Pagecache limit counters */
++ "nr_pagecache_limit_throttled",
++ "nr_pagecache_limit_blocked",
+ };
+ #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA */
+
+@@ -1639,7 +1643,10 @@ static void *vmstat_start(struct seq_fil
+ all_vm_events(v);
+ v[PGPGIN] /= 2; /* sectors -> kbytes */
+ v[PGPGOUT] /= 2;
++ v += NR_VM_EVENT_ITEMS;
+ #endif
++ all_pagecache_limit_counters(v);
++
+ return (unsigned long *)m->private + *pos;
+ }
+
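The two counters end up as ordinary lines in /proc/vmstat, so any grep over that file will show them; as a minimal monitoring sketch (purely illustrative, not part of the patch series), a few lines of C that print just these entries:

/* Print the pagecache limit counters exported to /proc/vmstat above. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[128];
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "nr_pagecache_limit_", 19))
			fputs(line, stdout);
	fclose(f);
	return 0;
}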
diff --git a/patches.suse/pagecache-limit-warn-on-usage.patch b/patches.suse/pagecache-limit-warn-on-usage.patch
new file mode 100644
index 0000000000..ba43801c98
--- /dev/null
+++ b/patches.suse/pagecache-limit-warn-on-usage.patch
@@ -0,0 +1,58 @@
+From: Michal Hocko <mhocko@suse.cz>
+Subject: Warn on pagecache limit usage
+Patch-mainline: never, SUSE specific
+References: FATE309111, FATE325794
+
+Let's be verbose about page cache limit usage for support purposes.
+The feature is supported only for the SLES for SAP appliance and we
+should be aware of the fact that it is used anyway.
+
+Signed-off-by: Michal Hocko <mhocko@suse.cz>
+
+---
+ kernel/sysctl.c | 20 +++++++++++++++++++-
+ 1 file changed, 19 insertions(+), 1 deletion(-)
+
+--- a/kernel/sysctl.c
++++ b/kernel/sysctl.c
+@@ -1244,6 +1244,9 @@ static struct ctl_table kern_table[] = {
+ { }
+ };
+
++int pc_limit_proc_dointvec(struct ctl_table *table, int write,
++ void __user *buffer, size_t *lenp, loff_t *ppos);
++
+ static struct ctl_table vm_table[] = {
+ {
+ .procname = "overcommit_memory",
+@@ -1375,7 +1378,7 @@ static struct ctl_table vm_table[] = {
+ .data = &vm_pagecache_limit_mb,
+ .maxlen = sizeof(vm_pagecache_limit_mb),
+ .mode = 0644,
+- .proc_handler = &proc_dointvec,
++ .proc_handler = &pc_limit_proc_dointvec,
+ },
+ {
+ .procname = "pagecache_limit_ignore_dirty",
+@@ -2455,6 +2458,21 @@ static int do_proc_douintvec(struct ctl_
+ buffer, lenp, ppos, conv, data);
+ }
+
++int pc_limit_proc_dointvec(struct ctl_table *table, int write,
++ void __user *buffer, size_t *lenp, loff_t *ppos)
++{
++ int ret = do_proc_dointvec(table,write,buffer,lenp,ppos,
++ NULL,NULL);
++ if (write && !ret) {
+		printk(KERN_WARNING "pagecache limit set to %d. "
++ "Feature is supported only for SLES for SAP appliance\n",
++ vm_pagecache_limit_mb);
++ if (num_possible_cpus() > 16)
++ printk(KERN_WARNING "Using page cache limit on large machines is strongly discouraged. See TID 7021211\n");
++ }
++ return ret;
++}
++
+ /**
+ * proc_dointvec - read a vector of integers
+ * @table: the sysctl table
diff --git a/patches.suse/pagecache-limit.patch b/patches.suse/pagecache-limit.patch
new file mode 100644
index 0000000000..e4f16d7029
--- /dev/null
+++ b/patches.suse/pagecache-limit.patch
@@ -0,0 +1,456 @@
+From: Markus Guertler <mguertler@novell.com>
+Subject: Introduce (optional) pagecache limit
+References: FATE309111, FATE325794
+Patch-mainline: Never, SUSE specific
+
+SLE12-SP3->SLE12-SP4
+- move from zone to the node reclaim
+
+SLE12->SLE12-SP2
+- move the slab shrinking into shrink_all_zones because slab shrinkers
+ are numa aware now so we need a to do zone to get the proper node.
+
+SLE11-SP3->SLE12 changes & remarks by mhocko@suse.cz:
+- The feature should be deprecated and dropped eventually and replaced
+ by Memory cgroup controller.
+- vm_swappiness was always broken because shrink_list didn't and doesn't
+ consider it
+- sc->nr_to_reclaim is updated once per shrink_all_zones which means
+ that we might end up reclaiming more than expected.
+
+Notes on forward port to SLE12
+- shrink_all_zones has to be memcg aware now because there is no global
+ LRU anymore. Put this into shrink_zone_per_memcg
+- priority is a part of scan_control now
+- swappiness is no longer in scan_control but as mentioned above it didn't
+ have any effect anyway
+-
+
+Original changelog (as per 11sp3):
+----------------------------------
+There are apps that consume lots of memory and touch some of their
+pages very infrequently; yet those pages are very important for the
+overall performance of the app and should not be paged out in favor
+of pagecache. The kernel can't know this and takes the wrong decisions,
+even with low swappiness values.
+
+This sysctl allows setting a limit for the non-mapped page cache;
+non-mapped meaning that it will not affect shared memory or files
+that are mmap()ed -- just anonymous file system cache.
+Above this limit, the kernel will always consider removing pages from
+the page cache first.
+
+The limit that ends up being enforced is dependent on free memory;
+if we have lots of it, the effective limit is much higher -- only when
+free memory gets scarce do we become strict about anonymous
+page cache. This should make the setting much more attractive to use.
+
+[Reworked by Kurt Garloff and Nick Piggin]
+
+Signed-off-by: Kurt Garloff <garloff@suse.de>
+Signed-off-by: Nick Piggin <npiggin@suse.de>
+Acked-by: Michal Hocko <mhocko@suse.cz>
+
+---
+ Documentation/vm/pagecache-limit | 51 +++++++++
+ include/linux/pagemap.h | 1
+ include/linux/swap.h | 4
+ kernel/sysctl.c | 7 +
+ mm/filemap.c | 3
+ mm/page_alloc.c | 19 +++
+ mm/shmem.c | 5
+ mm/vmscan.c | 207 +++++++++++++++++++++++++++++++++++++++
+ 8 files changed, 296 insertions(+), 1 deletion(-)
+
+--- /dev/null
++++ b/Documentation/vm/pagecache-limit
+@@ -0,0 +1,51 @@
++Functionality:
++-------------
++The patch introduces a new tunable in the proc filesystem:
++
++/proc/sys/vm/pagecache_limit_mb
++
+This tunable sets a limit, in megabytes, on the unmapped pages in the pagecache.
+If non-zero, it should not be set below 4 (4MB), or the system might behave erratically. In real life, much larger limits (a few percent of system RAM / a hundred MBs) will be useful.
++
++Examples:
++echo 512 >/proc/sys/vm/pagecache_limit_mb
++
+This sets a baseline limit for the page cache (not the buffer cache!) of 0.5GiB.
+As we only consider pagecache pages that are unmapped, currently mapped pages (files that are mmap'ed, such as binaries and libraries, as well as SysV shared memory) are not limited by this.
++NOTE: The real limit depends on the amount of free memory. Every existing free page allows the page cache to grow 8x the amount of free memory above the set baseline. As soon as the free memory is needed, we free up page cache.
++
++
++How it works:
++------------
++The heart of this patch is a new function called shrink_page_cache(). It is called from balance_pgdat (which is the worker for kswapd) if the pagecache is above the limit.
++The function is also called in __alloc_pages_slowpath.
++
+shrink_page_cache() calculates the number of pages the cache is over its limit. It reduces this number by a factor (so you have to call it several times to get down to the target) and then shrinks the pagecache (using the kernel LRUs).
++
++shrink_page_cache does several passes:
++- Just reclaiming from inactive pagecache memory.
++ This is fast -- but it might not find enough free pages; if that happens,
++ the second pass will happen
++- In the second pass, pages from active list will also be considered.
++- The third pass is just another round of the second pass
++
++In all passes, only unmapped pages will be considered.
++
++
++How it changes memory management:
++--------------------------------
++If the pagecache_limit_mb is set to zero (default), nothing changes.
++
++If set to a positive value, there will be three different operating modes:
++(1) If we still have plenty of free pages, the pagecache limit will NOT be enforced. Memory management decisions are taken as normally.
+(2) However, as soon as someone consumes those free pages, we'll start freeing pagecache -- as those are returned to the free page pool, freeing a few pages from pagecache will return us to state (1) -- if however someone consumes these free pages quickly, we'll continue freeing up pages from the pagecache until we reach pagecache_limit_mb.
++(3) Once we are at or below the low watermark, pagecache_limit_mb, the pages in the page cache will be governed by normal paging memory management decisions; if it starts growing above the limit (corrected by the free pages), we'll free some up again.
++
+This feature is useful for machines that have large workloads, carefully sized to eat most of the memory. Depending on the application's page access pattern, the kernel may too easily swap the application memory out in favor of pagecache. This can happen even for low values of swappiness. With this feature, the admin can tell the kernel that only a certain amount of pagecache is really considered useful and that it otherwise should favor the application's memory.
++
++
++Foreground vs. background shrinking:
++-----------------------------------
++
++Usually, the Linux kernel reclaims its memory using the kernel thread kswapd. It reclaims memory in the background. If it can't reclaim memory fast enough, it retries with higher priority and if this still doesn't succeed it uses a direct reclaim path.
++
+--- a/include/linux/pagemap.h
++++ b/include/linux/pagemap.h
+@@ -12,6 +12,7 @@
+ #include <linux/uaccess.h>
+ #include <linux/gfp.h>
+ #include <linux/bitops.h>
++#include <linux/swap.h>
+ #include <linux/hardirq.h> /* for in_interrupt() */
+ #include <linux/hugetlb_inline.h>
+
+--- a/include/linux/swap.h
++++ b/include/linux/swap.h
+@@ -332,6 +332,10 @@ extern unsigned long mem_cgroup_shrink_n
+ unsigned long *nr_scanned);
+ extern unsigned long shrink_all_memory(unsigned long nr_pages);
+ extern int vm_swappiness;
++#define FREE_TO_PAGECACHE_RATIO 8
++extern unsigned long pagecache_over_limit(void);
++extern void shrink_page_cache(gfp_t mask, struct page *page);
++extern unsigned int vm_pagecache_limit_mb;
+ extern int remove_mapping(struct address_space *mapping, struct page *page);
+ extern unsigned long vm_total_pages;
+
+--- a/kernel/sysctl.c
++++ b/kernel/sysctl.c
+@@ -1370,6 +1370,13 @@ static struct ctl_table vm_table[] = {
+ .extra1 = &zero,
+ .extra2 = &one_hundred,
+ },
++ {
++ .procname = "pagecache_limit_mb",
++ .data = &vm_pagecache_limit_mb,
++ .maxlen = sizeof(vm_pagecache_limit_mb),
++ .mode = 0644,
++ .proc_handler = &proc_dointvec,
++ },
+ #ifdef CONFIG_HUGETLB_PAGE
+ {
+ .procname = "nr_hugepages",
+--- a/mm/filemap.c
++++ b/mm/filemap.c
+@@ -900,6 +900,9 @@ int add_to_page_cache_lru(struct page *p
+ void *shadow = NULL;
+ int ret;
+
++ if (unlikely(vm_pagecache_limit_mb) && pagecache_over_limit() > 0)
++ shrink_page_cache(gfp_mask, page);
++
+ __SetPageLocked(page);
+ ret = __add_to_page_cache_locked(page, mapping, offset,
+ gfp_mask, &shadow);
+--- a/mm/page_alloc.c
++++ b/mm/page_alloc.c
+@@ -7790,6 +7790,25 @@ void zone_pcp_reset(struct zone *zone)
+ local_irq_restore(flags);
+ }
+
++/* Returns a number that's positive if the pagecache is above
++ * the set limit. Note that we allow the pagecache to grow
++ * larger if there's plenty of free pages.
++ */
++unsigned long pagecache_over_limit()
++{
++ /* We only want to limit unmapped page cache pages */
++ unsigned long pgcache_pages = global_page_state(NR_FILE_PAGES)
++ - global_page_state(NR_FILE_MAPPED);
++ unsigned long free_pages = global_page_state(NR_FREE_PAGES);
++ unsigned long limit;
++
++ limit = vm_pagecache_limit_mb * ((1024*1024UL)/PAGE_SIZE) +
++ FREE_TO_PAGECACHE_RATIO * free_pages;
++ if (pgcache_pages > limit)
++ return pgcache_pages - limit;
++ return 0;
++}
++
+ #ifdef CONFIG_MEMORY_HOTREMOVE
+ /*
+ * All pages in the range must be in a single zone and isolated
+--- a/mm/shmem.c
++++ b/mm/shmem.c
+@@ -1229,8 +1229,11 @@ int shmem_unuse(swp_entry_t swap, struct
+ if (error != -ENOMEM)
+ error = 0;
+ mem_cgroup_cancel_charge(page, memcg, false);
+- } else
++ } else {
+ mem_cgroup_commit_charge(page, memcg, true, false);
++ if (unlikely(vm_pagecache_limit_mb) && pagecache_over_limit() > 0)
++ shrink_page_cache(GFP_KERNEL, page);
++ }
+ out:
+ unlock_page(page);
+ put_page(page);
+--- a/mm/vmscan.c
++++ b/mm/vmscan.c
+@@ -149,6 +149,7 @@ struct scan_control {
+ * From 0 .. 100. Higher means more swappy.
+ */
+ int vm_swappiness = 60;
++unsigned int vm_pagecache_limit_mb __read_mostly = 0;
+ /*
+ * The total number of pages which are beyond the high watermark within all
+ * zones.
+@@ -3152,6 +3153,8 @@ static void clear_pgdat_congested(pg_dat
+ clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
+ }
+
++static void __shrink_page_cache(gfp_t mask);
++
+ /*
+ * Prepare kswapd for sleeping. This verifies that there are no processes
+ * waiting in throttle_direct_reclaim() and that watermarks have been met.
+@@ -3260,6 +3263,10 @@ static int balance_pgdat(pg_data_t *pgda
+ };
+ count_vm_event(PAGEOUTRUN);
+
++ /* this reclaims from all zones so don't count to sc.nr_reclaimed */
++ if (unlikely(vm_pagecache_limit_mb) && pagecache_over_limit() > 0)
++ __shrink_page_cache(GFP_KERNEL);
++
+ do {
+ unsigned long nr_reclaimed = sc.nr_reclaimed;
+ bool raise_priority = true;
+@@ -3426,6 +3433,12 @@ static void kswapd_try_to_sleep(pg_data_
+ prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
+ }
+
++ /* We do not need to loop_again if we have not achieved our
++ * pagecache target (i.e. && pagecache_over_limit(0) > 0) because
++ * the limit will be checked next time a page is added to the page
++ * cache. This might cause a short stall but we should rather not
++ * keep kswapd awake.
++ */
+ /*
+ * After a short sleep, check if it was a premature sleep. If not, then
+ * go fully to sleep until explicitly woken up.
+@@ -3626,6 +3639,200 @@ unsigned long shrink_all_memory(unsigned
+ }
+ #endif /* CONFIG_HIBERNATION */
+
++
++/*
++ * Similar to shrink_node but it has a different consumer - pagecache limit
++ * so we cannot reuse the original function - and we do not want to clobber
++ * that code path so we have to live with this code duplication.
++ *
++ * In short this simply scans through the given lru for all cgroups for the
++ * given node.
++ *
++ * returns true if we managed to cumulatively reclaim (via nr_reclaimed)
++ * the given nr_to_reclaim pages, false otherwise. The caller knows that
++ * it doesn't have to touch other nodes if the target was hit already.
++ *
++ * DO NOT USE OUTSIDE of shrink_all_nodes unless you have a really really
++ * really good reason.
++ */
++static bool shrink_node_per_memcg(struct pglist_data *pgdat, enum lru_list lru,
++ unsigned long nr_to_scan, unsigned long nr_to_reclaim,
++ unsigned long *nr_reclaimed, struct scan_control *sc)
++{
++ struct mem_cgroup *root = sc->target_mem_cgroup;
++ struct mem_cgroup *memcg;
++ struct mem_cgroup_reclaim_cookie reclaim = {
++ .pgdat = pgdat,
++ .priority = sc->priority,
++ };
++
++ memcg = mem_cgroup_iter(root, NULL, &reclaim);
++ do {
++ struct lruvec *lruvec;
++
++ lruvec = mem_cgroup_lruvec(pgdat, memcg);
++ *nr_reclaimed += shrink_list(lru, nr_to_scan, lruvec, memcg, sc);
++ if (*nr_reclaimed >= nr_to_reclaim) {
++ mem_cgroup_iter_break(root, memcg);
++ return true;
++ }
++
++ memcg = mem_cgroup_iter(root, memcg, &reclaim);
++ } while (memcg);
++
++ return false;
++}
++
++/*
+ * We had to resurrect this function for __shrink_page_cache (upstream has
++ * removed it and reworked shrink_all_memory by 7b51755c).
++ *
++ * Tries to reclaim 'nr_pages' pages from LRU lists system-wide, for given
++ * pass.
++ *
++ * For pass > 3 we also try to shrink the LRU lists that contain a few pages
++ */
++static void shrink_all_nodes(unsigned long nr_pages, int pass,
++ struct scan_control *sc)
++{
++ unsigned long nr_reclaimed = 0;
++ int nid;
++
++ for_each_online_node(nid) {
++ struct pglist_data *pgdat = NODE_DATA(nid);
++ enum lru_list lru;
++
++ for_each_evictable_lru(lru) {
++ enum zone_stat_item ls = NR_LRU_BASE + lru;
++ unsigned long lru_pages = node_page_state(pgdat, ls);
++
++ /* For pass = 0, we don't shrink the active list */
++ if (pass == 0 && (lru == LRU_ACTIVE_ANON ||
++ lru == LRU_ACTIVE_FILE))
++ continue;
++
++ /* Original code relied on nr_saved_scan which is no
++ * longer present so we are just considering LRU pages.
++ * This means that the zone has to have quite large
++ * LRU list for default priority and minimum nr_pages
++ * size (8*SWAP_CLUSTER_MAX). In the end we will tend
++ * to reclaim more from large zones wrt. small.
++ * This should be OK because shrink_page_cache is called
++ * when we are getting to short memory condition so
++ * LRUs tend to be large.
++ */
++ if (((lru_pages >> sc->priority) + 1) >= nr_pages || pass > 3) {
++ unsigned long nr_to_scan;
++ struct reclaim_state reclaim_state;
++ unsigned long scanned = sc->nr_scanned;
++ struct reclaim_state *old_rs = current->reclaim_state;
++
++ /* shrink_list takes lru_lock with IRQ off so we
++ * should be careful about really huge nr_to_scan
++ */
++ nr_to_scan = min(nr_pages, lru_pages);
++
++ /*
++ * A bit of a hack but the code has always been
++ * updating sc->nr_reclaimed once per shrink_all_nodes
++ * rather than accumulating it for all calls to shrink
++ * lru. This costs us an additional argument to
++ * shrink_node_per_memcg but well...
++ *
++ * Let's stick with this for bug-to-bug compatibility
++ */
++ if (shrink_node_per_memcg(pgdat, lru,
++ nr_to_scan, nr_pages, &nr_reclaimed, sc)) {
++ sc->nr_reclaimed += nr_reclaimed;
++ return;
++ }
++
++ current->reclaim_state = &reclaim_state;
++ reclaim_state.reclaimed_slab = 0;
++ shrink_slab(sc->gfp_mask, nid, NULL,
++ sc->nr_scanned - scanned, lru_pages);
++ sc->nr_reclaimed += reclaim_state.reclaimed_slab;
++ current->reclaim_state = old_rs;
++ }
++ }
++ }
++ sc->nr_reclaimed += nr_reclaimed;
++}
++
++/*
++ * Function to shrink the page cache
++ *
++ * This function calculates the number of pages (nr_pages) the page
++ * cache is over its limit and shrinks the page cache accordingly.
++ *
++ * The maximum number of pages, the page cache shrinks in one call of
++ * this function is limited to SWAP_CLUSTER_MAX pages. Therefore it may
++ * require a number of calls to actually reach the vm_pagecache_limit_kb.
++ *
++ * This function is similar to shrink_all_memory, except that it may never
++ * swap out mapped pages and only does two passes.
++ */
++static void __shrink_page_cache(gfp_t mask)
++{
++ unsigned long ret = 0;
++ int pass;
++ struct scan_control sc = {
++ .gfp_mask = mask,
++ .may_swap = 0,
++ .may_unmap = 0,
++ .may_writepage = 0,
++ .target_mem_cgroup = NULL,
++ .reclaim_idx = gfp_zone(mask),
++ };
++ long nr_pages;
++
++ /* How many pages are we over the limit?
++ * But don't enforce limit if there's plenty of free mem */
++ nr_pages = pagecache_over_limit();
++
++ /* Don't need to go there in one step; as the freed
++ * pages are counted FREE_TO_PAGECACHE_RATIO times, this
++ * is still more than minimally needed. */
++ nr_pages /= 2;
++
++ /* Return early if there's no work to do */
++ if (nr_pages <= 0)
++ return;
++ /* But do a few at least */
++ nr_pages = max_t(unsigned long, nr_pages, 8*SWAP_CLUSTER_MAX);
++
++ /*
++ * Shrink the LRU in 2 passes:
++ * 0 = Reclaim from inactive_list only (fast)
++ * 1 = Reclaim from active list but don't reclaim mapped (not that fast)
++ * 2 = Reclaim from active list but don't reclaim mapped (2nd pass)
++ */
++ for (pass = 0; pass < 2; pass++) {
++ for (sc.priority = DEF_PRIORITY; sc.priority >= 0; sc.priority--) {
++ unsigned long nr_to_scan = nr_pages - ret;
++
++ sc.nr_scanned = 0;
++ /* sc.swap_cluster_max = nr_to_scan; */
++ shrink_all_nodes(nr_to_scan, pass, &sc);
++ ret += sc.nr_reclaimed;
++ if (ret >= nr_pages)
++ return;
++ }
++ }
++}
++
++void shrink_page_cache(gfp_t mask, struct page *page)
++{
++ /* FIXME: As we only want to get rid of non-mapped pagecache
++ * pages and we know we have too many of them, we should not
++ * need kswapd. */
++ /*
++ wakeup_kswapd(page_zone(page), 0);
++ */
++
++ __shrink_page_cache(mask);
++}
++
+ /* It's optimal to keep kswapds on the same CPUs as their memory, but
+ not required for correctness. So if the last cpu in a node goes
+ away, we get changed to run anywhere: as the first one comes back,
diff --git a/patches.suse/pagecachelimit_batch_huge_nr_to_scan.patch b/patches.suse/pagecachelimit_batch_huge_nr_to_scan.patch
new file mode 100644
index 0000000000..3d95e71012
--- /dev/null
+++ b/patches.suse/pagecachelimit_batch_huge_nr_to_scan.patch
@@ -0,0 +1,61 @@
+From: Michal Hocko <mhocko@suse.cz>
+Subject: pagecache_limit: batch large nr_to_scan targets
+Patch-mainline: never, SUSE specific
+References: bnc#895221
+
+Although pagecache_limit is expected to be set before the load is started there
+seems to be a user who sets the limit after the machine is short on memory and
+has a large amount of page cache already. Although such a usage is dubious
+at best we still shouldn't fall flat under such conditions.
+
+We had a report where a machine with 512GB of RAM crashed as a result of the hard
+lockup detector (which is on by default) when the limit was set to 1G because
+the page cache reclaimer got a target of 22M pages and isolate_lru_pages simply
+starved other spinners for too long.
+
+This patch batches the scan target in SWAP_CLUSTER_MAX chunks to prevent
+such issues. It should still be noted that setting the limit to a
+very small value wrt. an already large page cache target is dangerous and can
+lead to big reclaim storms and page cache over-reclaim.
+
+Signed-off-by: Michal Hocko <mhocko@suse.cz>
+
+---
+ mm/vmscan.c | 19 ++++++++++++-------
+ 1 file changed, 12 insertions(+), 7 deletions(-)
+
+--- a/mm/vmscan.c
++++ b/mm/vmscan.c
+@@ -3766,9 +3766,6 @@ static int shrink_all_nodes(unsigned lon
+ unsigned long scanned = sc->nr_scanned;
+ struct reclaim_state *old_rs = current->reclaim_state;
+
+- /* shrink_list takes lru_lock with IRQ off so we
+- * should be careful about really huge nr_to_scan
+- */
+ nr_to_scan = min(nr_pages, lru_pages);
+
+ /*
+@@ -3780,10 +3777,18 @@ static int shrink_all_nodes(unsigned lon
+ *
+ * Let's stick with this for bug-to-bug compatibility
+ */
+- if (shrink_node_per_memcg(pgdat, lru,
+- nr_to_scan, nr_pages, &nr_reclaimed, sc)) {
+- pagecache_reclaim_unlock_node(pgdat);
+- goto out_wakeup;
++ while (nr_to_scan > 0) {
++ /* shrink_list takes lru_lock with IRQ off so we
++ * should be careful about really huge nr_to_scan
++ */
++ unsigned long batch = min_t(unsigned long, nr_to_scan, SWAP_CLUSTER_MAX);
++
++ if (shrink_node_per_memcg(pgdat, lru,
++ batch, nr_pages, &nr_reclaimed, sc)) {
++ pagecache_reclaim_unlock_node(pgdat);
++ goto out_wakeup;
++ }
++ nr_to_scan -= batch;
+ }
+
+ current->reclaim_state = &reclaim_state;
diff --git a/series.conf b/series.conf
index 2850aa0292..aa5dfb044a 100644
--- a/series.conf
+++ b/series.conf
@@ -15058,6 +15058,21 @@
# below here.
########################################################
+ # Pagecache limit is not supported in SLE15 but we are sharing the code
+ # base so put it here after all the patches to reduce any potential merge
+ # conflicts
+ patches.suse/pagecache-limit.patch
+ patches.suse/pagecache-limit-unmapped.diff
+ patches.suse/pagecache-limit-dirty.diff
+ patches.suse/pagecache-limit-warn-on-usage.patch
+ patches.suse/pagecache-limit-fix-shmem-deadlock.patch
+ patches.suse/pagecache-limit-fix-get_nr_swap_pages.patch
+ patches.suse/pagecache-limit-reduce-zone-lrulock-bouncing.patch
+ patches.suse/pagecachelimit_batch_huge_nr_to_scan.patch
+ patches.suse/pagecache-limit-vmstat_counters.patch
+ patches.suse/pagecache-limit-fix-wrong-reclaimed-count.patch
+ patches.suse/pagecache-limit-tracepoints.patch
+
########################################################
# Patches cherry-picked from SLE12 codestream, for which
# the upstream status couldn't be automatically determined