[patch 0/5] mm: per-zone dirty limiting

July 25th, 2011 - 04:30 pm ET by Johannes Weiner
Hello!

Writing back single file pages during reclaim exhibits bad IO
patterns, but we can't just stop doing that before the VM has other
means to ensure the pages in a zone are reclaimable.

Over time there were several suggestions of at least doing
write-around of the pages in inode-proximity when the need arises to
clean pages during memory pressure. But even that would interrupt
writeback from the flushers, without any guarantees that the nearby
inode-pages are even sitting on the same troubled zone.

The reason why dirty pages reach the end of LRU lists in the first
place is in part because the dirty limits are a global restriction
while most systems have more than one LRU list, and those lists differ
in size. Multiple nodes have multiple zones, which have multiple file
lists, but at the same time there is nothing to balance the dirty pages
between the lists except for reclaim writing them out upon encounter.

With around 4G of RAM, an x86_64 machine of mine has a DMA32 zone of a
bit over 3G, a Normal zone of 500M, and a DMA zone of 15M.

A linear writer can quickly fill up the Normal zone, then the DMA32
zone, throttled by the dirty limit initially. The flushers catch up,
the zones are now mostly full of clean pages and memory reclaim kicks
in on subsequent allocations. The pages it frees from the Normal zone
are quickly filled with dirty pages (unthrottled, as the much bigger
DMA32 zone allows for a huge number of dirty pages in comparison to
the Normal zone). As there are also anon and active file pages on the
Normal zone, it is not unlikely that a significant amount of its
inactive file pages are now dirty [ foo=zone(global) ]:

reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive2313(821289) active=9942(10039) isolated=27(27) dirty=59709(146944) writeback=739(4017)
reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive1102(806876) active=9925(10022) isolated=32(32) dirty=72125(146914) writeback=957(3972)
reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive0493(803374) active=9871(9978) isolated=32(32) dirty=57274(146618) writeback=4088(4088)
reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive1957(806559) active=9871(9978) isolated=32(32) dirty=65125(147329) writeback=456(3866)
reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive0601(803978) active=9860(9973) isolated=27(27) dirty=63792(146590) writeback=61(4276)
reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive1786(804032) active=9860(9973) isolated=0(64) dirty=64310(146998) writeback82(3847)
reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive1643(805651) active=9860(9982) isolated=32(32) dirty=63778(147217) writeback27(4156)
reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive1678(804709) active=9859(10112) isolated=27(27) dirty673(148224) writeback=29(4233)

[ These prints occur only once per reclaim invocation, so the actual
->writepage calls are more frequent than the timestamp may suggest. ]

In the scenario without the Normal zone, first the DMA32 zone fills
up, then the DMA zone. When reclaim kicks in, it is presented with a
DMA zone whose inactive pages are all dirty -- and dirtied most
recently at that, so the flushers really had abysmal chances at making
some headway:

reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=776(430813) active=2(2931) isolated=32(32) dirty4(68649) writeback=0(18765)
reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=726(430344) active=2(2931) isolated=32(32) dirty=764(67790) writeback=0(17146)
reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=729(430838) active=2(2931) isolated=32(32) dirty=293(65303) writeback=468(20122)
reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=757(431181) active=2(2931) isolated=32(32) dirty=63(68851) writeback=731(15926)
reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=758(432808) active=2(2931) isolated=32(32) dirty=645(64106) writeback=0(19666)
reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=726(431018) active=2(2931) isolated=32(32) dirty=740(65770) writeback(17907)
reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=697(430467) active=2(2931) isolated=32(32) dirty=743(63757) writeback=0(18826)
reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=693(430951) active=2(2931) isolated=32(32) dirty=626(54529) writeback=91(16198)

The idea behind this patch set is to take the ratio the global dirty
limits have to the global memory state and put it into proportion to
the individual zone. The allocator ensures that pages allocated for
being written to in the page cache are distributed across zones such
that there are always enough clean pages on a zone to begin with.

I am not yet really satisfied as it's not really orthogonal or
integrated with the other writeback throttling much, and has rough
edges here and there, but test results do look rather promising so
far:

Copying 8G to fuse-ntfs on USB stick in 4G machine

3.0:

Performance counter stats for 'dd if=/dev/zero of=zeroes bs=32k count=262144' (6 runs):

140,671,831 cache-misses # 4.923 M/sec ( +- 0.198% ) (scaled from 82.80%)
726,265,014 cache-references # 25.417 M/sec ( +- 1.104% ) (scaled from 83.06%)
144,092,383 branch-misses # 4.157 % ( +- 0.493% ) (scaled from 83.17%)
3,466,608,296 branches # 121.319 M/sec ( +- 0.421% ) (scaled from 67.89%)
17,882,351,343 instructions # 0.417 IPC ( +- 0.457% ) (scaled from 84.73%)
42,848,633,897 cycles # 1499.554 M/sec ( +- 0.604% ) (scaled from 83.08%)
236 page-faults # 0.000 M/sec ( +- 0.323% )
8,026 CPU-migrations # 0.000 M/sec ( +- 6.291% )
2,372,358 context-switches # 0.083 M/sec ( +- 0.003% )
28574.255540 task-clock-msecs # 0.031 CPUs ( +- 0.409% )

912.625436885 seconds time elapsed ( +- 3.851% )

nr_vmscan_write 667839

3.0-per-zone-dirty:

Performance counter stats for 'dd if=/dev/zero of=zeroes bs=32k count=262144' (6 runs):

140,791,501 cache-misses # 3.887 M/sec ( +- 0.186% ) (scaled from 83.09%)
816,474,193 cache-references # 22.540 M/sec ( +- 0.923% ) (scaled from 83.16%)
154,500,577 branch-misses # 4.302 % ( +- 0.495% ) (scaled from 83.15%)
3,591,344,338 branches # 99.143 M/sec ( +- 0.402% ) (scaled from 67.32%)
18,713,190,183 instructions # 0.338 IPC ( +- 0.448% ) (scaled from 83.96%)
55,285,320,107 cycles # 1526.208 M/sec ( +- 0.588% ) (scaled from 83.28%)
237 page-faults # 0.000 M/sec ( +- 0.302% )
28,028 CPU-migrations # 0.001 M/sec ( +- 3.070% )
2,369,897 context-switches # 0.065 M/sec ( +- 0.006% )
36223.970238 task-clock-msecs # 0.060 CPUs ( +- 1.062% )

605.909769823 seconds time elapsed ( +- 0.783% )

nr_vmscan_write 0

That's roughly a third less elapsed time (about 50% more throughput)
and no writeback interference from reclaim.

As not every other allocation has to reclaim from a Normal zone full
of dirty pages anymore, the patched kernel is also more responsive in
general during the copy.

I am also running fs_mark on XFS on a 2G machine, but the final
results are not in yet. The preliminary results appear to be in this
ballpark:

fs_mark -d fsmark-one -d fsmark-two -D 100 -N 150 -n 150 -L 25 -t 1 -S 0 -s $((10 << 20))

3.0:

real 20m43.901s
user 0m8.988s
sys 0m58.227s
nr_vmscan_write 3347

3.0-per-zone-dirty:

real 20m8.012s
user 0m8.862s
sys 1m2.585s
nr_vmscan_write 161

Patch #1 is more or less an unrelated fix that subsequent patches
depend upon as they modify the same code. It should go upstream
immediately, methinks.

#2 and #3 are boring cleanup, guess they can go in right away as well.

#4 adds per-zone dirty throttling for __GFP_WRITE allocators, #5
passes __GFP_WRITE from the grab_cache_page* functions in the hope of
catching most writers and no readers; I haven't checked all sites yet.

Discuss! :-)

include/linux/gfp.h | 4 +-
include/linux/pagemap.h | 6 +-
include/linux/writeback.h | 5 +-
mm/filemap.c | 8 +-
mm/page-writeback.c | 225 ++++++++++++++++++++++++++++++--
mm/page_alloc.c | 27 ++++++
6 files changed, 196 insertions(+), 79 deletions(-)

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Replies

#1 Johannes Weiner
July 25th, 2011 - 04:30 pm ET
From: Johannes Weiner

Allow allocators to pass __GFP_WRITE when they know in advance that
the allocated page will be written to and become dirty soon.

The page allocator will then attempt to distribute those allocations
across zones, such that no single zone will end up full of dirty and
thus more or less unreclaimable pages.

The global dirty limits are put in proportion to the respective zone's
amount of dirtyable memory and the allocation denied when the limit of
that zone is reached.

Before the allocation fails, the allocator slowpath has a stage before
compaction and reclaim, where the flusher threads are kicked and the
allocator ultimately has to wait for writeback if still none of the
zones has become eligible for allocation again in the meantime.

Signed-off-by: Johannes Weiner
---

include/linux/gfp.h | 4 +-
include/linux/writeback.h | 3 +
mm/page-writeback.c | 132 +++++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 27 +++++++++
4 files changed, 149 insertions(+), 17 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 3a76faf..78d5338 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -36,6 +36,7 @@ struct vm_area_struct;
 #endif
 #define ___GFP_NO_KSWAPD 0x400000u
 #define ___GFP_OTHER_NODE 0x800000u
+#define ___GFP_WRITE 0x1000000u

 /*
  * GFP bitmasks..
@@ -85,6 +86,7 @@ struct vm_area_struct;

 #define __GFP_NO_KSWAPD ((__force gfp_t)___GFP_NO_KSWAPD)
 #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
+#define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Will be dirtied soon */

 /*
  * This may seem redundant, but it's a way of annotating false positives vs.
@@ -92,7 +94,7 @@ struct vm_area_struct;
  */
 #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)

-#define __GFP_BITS_SHIFT 24 /* Room for N __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 25 /* Room for N __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))

 /* This equals 0, but use constants in case they ever change */
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 8c63f3a..9312e25 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -93,6 +93,9 @@ void laptop_mode_timer_fn(unsigned long data);
 static inline void laptop_sync_completion(void) { }
 #endif
 void throttle_vm_writeout(gfp_t gfp_mask);
+bool zone_dirty_ok(struct zone *zone);
+void try_to_writeback_pages(struct zonelist *zonelist, gfp_t gfp_mask,
+			    nodemask_t *nodemask);

 /* These are exported to sysctl. */
 extern int dirty_background_ratio;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 41dc871..ce673ec 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -154,6 +154,18 @@ static unsigned long determine_dirtyable_memory(void)
 	return x + 1; /* Ensure that we never return 0 */
 }

+static unsigned long zone_dirtyable_memory(struct zone *zone)
+{
+	unsigned long x = 1; /* Ensure that we never return 0 */
+
+	if (is_highmem(zone) && !vm_highmem_is_dirtyable)
+		return x;
+
+	x += zone_page_state(zone, NR_FREE_PAGES);
+	x += zone_reclaimable_pages(zone);
+	return x;
+}
+
 /*
  * Scale the writeback cache size proportional to the relative writeout speeds.
  *
@@ -378,6 +390,24 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned max_ratio)
 }
 EXPORT_SYMBOL(bdi_set_max_ratio);

+static void sanitize_dirty_limits(unsigned long *pbackground,
+				  unsigned long *pdirty)
+{
+	unsigned long background = *pbackground;
+	unsigned long dirty = *pdirty;
+	struct task_struct *tsk;
+
+	if (background >= dirty)
+		background = dirty / 2;
+	tsk = current;
+	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
+		background += background / 4;
+		dirty += dirty / 4;
+	}
+	*pbackground = background;
+	*pdirty = dirty;
+}
+
 /*
  * global_dirty_limits - background-writeback and dirty-throttling thresholds
  *
@@ -389,33 +419,52 @@ EXPORT_SYMBOL(bdi_set_max_ratio);
  */
 void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
 {
-	unsigned long background;
-	unsigned long dirty;
 	unsigned long uninitialized_var(available_memory);
-	struct task_struct *tsk;

 	if (!vm_dirty_bytes || !dirty_background_bytes)
 		available_memory = determine_dirtyable_memory();

 	if (vm_dirty_bytes)
-		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
+		*pdirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
 	else
-		dirty = (vm_dirty_ratio * available_memory) / 100;
+		*pdirty = vm_dirty_ratio * available_memory / 100;

 	if (dirty_background_bytes)
-		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
+		*pbackground = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
 	else
-		background = (dirty_background_ratio * available_memory) / 100;
+		*pbackground = dirty_background_ratio * available_memory / 100;

-	if (background >= dirty)
-		background = dirty / 2;
-	tsk = current;
-	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
-		background += background / 4;
-		dirty += dirty / 4;
-	}
-	*pbackground = background;
-	*pdirty = dirty;
+	sanitize_dirty_limits(pbackground, pdirty);
+}
+
+static void zone_dirty_limits(struct zone *zone, unsigned long *pbackground,
+			      unsigned long *pdirty)
+{
+	unsigned long uninitialized_var(global_memory);
+	unsigned long zone_memory;
+
+	zone_memory = zone_dirtyable_memory(zone);
+
+	if (vm_dirty_bytes || dirty_background_bytes)
+		global_memory = determine_dirtyable_memory();
+
+	if (vm_dirty_bytes) {
+		unsigned long dirty_pages;
+
+		dirty_pages = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
+		*pdirty = zone_memory * dirty_pages / global_memory;
+	} else
+		*pdirty = zone_memory * vm_dirty_ratio / 100;
+
+	if (dirty_background_bytes) {
+		unsigned long dirty_pages;
+
+		dirty_pages = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
+		*pbackground = zone_memory * dirty_pages / global_memory;
+	} else
+		*pbackground = zone_memory * dirty_background_ratio / 100;
+
+	sanitize_dirty_limits(pbackground, pdirty);
 }

 /*
@@ -661,6 +710,57 @@ void throttle_vm_writeout(gfp_t gfp_mask)
 	}
 }

+bool zone_dirty_ok(struct zone *zone)
+{
+	unsigned long background_thresh, dirty_thresh;
+	unsigned long nr_reclaimable, nr_writeback;
+
+	zone_dirty_limits(zone, &background_thresh, &dirty_thresh);
+
+	nr_reclaimable = zone_page_state(zone, NR_FILE_DIRTY) +
+		zone_page_state(zone, NR_UNSTABLE_NFS);
+	nr_writeback = zone_page_state(zone, NR_WRITEBACK);
+
+	return nr_reclaimable + nr_writeback <= dirty_thresh;
+}
+
+void try_to_writeback_pages(struct zonelist *zonelist, gfp_t gfp_mask,
+			    nodemask_t *nodemask)
+{
+	unsigned int nr_exceeded = 0;
+	unsigned int nr_zones = 0;
+	struct zoneref *z;
+	struct zone *zone;
+
+	for_each_zone_zonelist_nodemask(zone, z, zonelist, gfp_zone(gfp_mask),
+					nodemask) {
+		unsigned long background_thresh, dirty_thresh;
+		unsigned long nr_reclaimable, nr_writeback;
+
+		nr_zones++;
+
+		zone_dirty_limits(zone, &background_thresh, &dirty_thresh);
+
+		nr_reclaimable = zone_page_state(zone, NR_FILE_DIRTY) +
+			zone_page_state(zone, NR_UNSTABLE_NFS);
+		nr_writeback = zone_page_state(zone, NR_WRITEBACK);
+
+		if (nr_reclaimable + nr_writeback <= background_thresh)
+			continue;
+
+		if (nr_reclaimable > nr_writeback)
+			wakeup_flusher_threads(nr_reclaimable - nr_writeback);
+
+		if (nr_reclaimable + nr_writeback <= dirty_thresh)
+			continue;
+
+		nr_exceeded++;
+	}
+
+	if (nr_zones == nr_exceeded)
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
+}
+
 /*
  * sysctl handler for /proc/sys/vm/dirty_writeback_centisecs
  */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4e8985a..1fac154 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1666,6 +1666,9 @@ zonelist_scan:
 			!cpuset_zone_allowed_softwall(zone, gfp_mask))
 				goto try_next_zone;

+		if ((gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
+			goto this_zone_full;
+
 		BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
 		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
 			unsigned long mark;
@@ -1863,6 +1866,22 @@ out:
 	return page;
 }

+static struct page *
+__alloc_pages_writeback(gfp_t gfp_mask, unsigned int order,
+			struct zonelist *zonelist, enum zone_type high_zoneidx,
+			nodemask_t *nodemask, int alloc_flags,
+			struct zone *preferred_zone, int migratetype)
+{
+	if (!(gfp_mask & __GFP_WRITE))
+		return NULL;
+
+	try_to_writeback_pages(zonelist, gfp_mask, nodemask);
+
+	return get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
+				      high_zoneidx, alloc_flags,
+				      preferred_zone, migratetype);
+}
+
 #ifdef CONFIG_COMPACTION
 /* Try memory compaction for high-order allocations before reclaim */
 static struct page *
@@ -2135,6 +2154,14 @@ rebalance:
 	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
 		goto nopage;

+	/* Try writing back pages if per-zone dirty limits are reached */
+	page = __alloc_pages_writeback(gfp_mask, order, zonelist,
+				       high_zoneidx, nodemask,
+				       alloc_flags, preferred_zone,
+				       migratetype);
+	if (page)
+		goto got_pg;
+
 	/*
 	 * Try direct compaction. The first pass is asynchronous. Subsequent
 	 * attempts after direct reclaim are synchronous
1.7.6
