[RFC] [PATCH v2 0/8] Provide cgroup isolation for buffered writes.
March 22nd, 2011 - 06:20 pm ET by Justin TerAvest
This patchset adds tracking to the page_cgroup structure for which cgroup has
dirtied a page, and uses that information to provide isolation between
cgroups performing writeback.
I know that there is some discussion to remove request descriptor limits
entirely, but I included a patch to introduce per-cgroup limits to enable
this functionality. Without it, we didn't see much isolation improvement.
I think most of this material has been discussed on lkml previously; this is
just another attempt to make a patchset that handles buffered writes for CFQ.
There was a lot of previous discussion at:
http://thread.gmane.org/gmane.linux.kernel/1007922
Thanks to Andrea Righi, Kamezawa Hiroyuki, Munehiro Ikeda, Nauman Rafique,
and Vivek Goyal for work on previous versions of these patches.
For version 2:
- I collected more statistics and provided data in the cover sheet
- blkio id is now stored inside "flags" in page_cgroup, with cmpxchg
- I cleaned up some patch names
- Added symmetric reference wrappers in cfq-iosched
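
As an illustration of the "id stored inside flags, with cmpxchg" item above,
here is a minimal userspace sketch using C11 atomics. It is not the code from
the patches: the struct name, the 16-bit id width, the bit position, and the
helper names are all hypothetical, chosen only to show how an id can share a
word with flag bits and be updated with a compare-and-swap loop.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical layout: the top ID_BITS of the word hold the id,
 * the remaining low bits hold ordinary flag bits. */
#define ID_BITS   16u
#define ID_SHIFT  (sizeof(unsigned long) * 8u - ID_BITS)
#define ID_MASK   (((1ul << ID_BITS) - 1ul) << ID_SHIFT)

/* Stand-in for a per-page tracking structure (name is an assumption). */
struct tracked_page {
    _Atomic unsigned long flags;
};

/* Replace the id field without disturbing concurrently changing flag bits. */
static void set_tracking_id(struct tracked_page *pg, unsigned long id)
{
    unsigned long old = atomic_load(&pg->flags);
    unsigned long new;

    assert(id < (1ul << ID_BITS));
    do {
        /* Keep the flag bits seen in 'old', swap in the new id. */
        new = (old & ~ID_MASK) | (id << ID_SHIFT);
        /* On failure, 'old' is refreshed with the current value and we retry. */
    } while (!atomic_compare_exchange_weak(&pg->flags, &old, new));
}

/* Read the id field back out of the shared word. */
static unsigned long get_tracking_id(struct tracked_page *pg)
{
    return (atomic_load(&pg->flags) & ID_MASK) >> ID_SHIFT;
}
```

The point of the loop is that a plain read-modify-write could lose a flag bit
set by another CPU between the read and the write; the compare-and-swap only
commits when the word is unchanged since it was read.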
There are a couple of lingering issues in this patchset; it's meant to be an
RFC to discuss the overall design for tracking of buffered writes.
I have at least a couple of patches to finish to make absolutely sure that
refcounts and locking are handled properly; I just need to do more testing.
 Documentation/block/biodoc.txt |   10 +
 block/blk-cgroup.c             |  203 +++++++++++++++++-
 block/blk-cgroup.h             |    9 +-
 block/blk-core.c               |  218 +++++++++++++
 block/blk-settings.c           |    2 +-
 block/blk-sysfs.c              |   59 +++
 block/cfq-iosched.c            |  473 ++++++++++++++++++++++++++++++-
 block/cfq.h                    |    6 +-
 block/elevator.c               |    7 +-
 fs/buffer.c                    |    2 +
 fs/direct-io.c                 |    2 +
 include/linux/blk_types.h      |    2 +
 include/linux/blkdev.h         |   81 +++++++-
 include/linux/blkio-track.h    |   89 ++++++++
 include/linux/elevator.h       |   14 +-
 include/linux/iocontext.h      |    1 +
 include/linux/memcontrol.h     |    6 +
 include/linux/mmzone.h         |    4 +-
 include/linux/page_cgroup.h    |   38 +++-
 init/Kconfig                   |   16 ++
 mm/Makefile                    |    3 +-
 mm/bounce.c                    |    2 +
 mm/filemap.c                   |    2 +
 mm/memcontrol.c                |    6 +
 mm/memory.c                    |    6 +
 mm/page-writeback.c            |   14 +-
 mm/page_cgroup.c               |   29 ++-
 mm/swap_state.c                |    2 +
 28 files changed, 1066 insertions(+), 240 deletions(-)
8f0b0f4 cfq: Don't allow preemption across cgroups
a47cdc6 block: Per cgroup request descriptor counts
8dd7adb cfq: add per cgroup writeout done by flusher stat
1fa0b6d cfq: Fix up tracked async workload length.
e9e85d3 block: Modify CFQ to use IO tracking information.
f8ffb19 cfq-iosched: Make async queues per cgroup
1d9ee09 block,fs,mm: IO cgroup tracking for buffered write
31c7321 cfq-iosched: add symmetric reference wrappers
= Isolation experiment results
For isolation testing, we run a test that's available at:
git://google3-2.osuosl.org/tests/blkcgroup.git
It creates containers, runs workloads, and checks to see how well we meet
isolation targets. For the purposes of this patchset, I only ran
tests among buffered writers.
Before patches
==============
10:32:06 INFO experiment 0 achieved DTFs: 666, 333
10:32:06 INFO experiment 0 FAILED: max observed error is 167, allowed is 150
10:32:51 INFO experiment 1 achieved DTFs: 647, 352
10:32:51 INFO experiment 1 FAILED: max observed error is 253, allowed is 150
10:33:35 INFO experiment 2 achieved DTFs: 298, 701
10:33:35 INFO experiment 2 FAILED: max observed error is 199, allowed is 150
10:34:19 INFO experiment 3 achieved DTFs: 445, 277, 277
10:34:19 INFO experiment 3 FAILED: max observed error is 155, allowed is 150
10:35:05 INFO experiment 4 achieved DTFs: 418, 104, 261, 215
10:35:05 INFO experiment 4 FAILED: max observed error is 232, allowed is 150
10:35:53 INFO experiment 5 achieved DTFs: 213, 136, 68, 102, 170, 136, 170
10:35:53 INFO experiment 5 PASSED: max observed error is 73, allowed is 150
10:36:04 INFO --ran 6 experiments, 1 passed, 5 failed
After patches
=============
11:05:22 INFO experiment 0 achieved DTFs: 501, 498
11:05:22 INFO experiment 0 PASSED: max observed error is 2, allowed is 150
11:06:07 INFO experiment 1 achieved DTFs: 874, 125
11:06:07 INFO experiment 1 PASSED: max observed error is 26, allowed is 150
11:06:53 INFO experiment 2 achieved DTFs: 121, 878
11:06:53 INFO experiment 2 PASSED: max observed error is 22, allowed is 150
11:07:46 INFO experiment 3 achieved DTFs: 589, 205, 204
11:07:46 INFO experiment 3 PASSED: max observed error is 11, allowed is 150
11:08:34 INFO experiment 4 achieved DTFs: 616, 109, 109, 163
11:08:34 INFO experiment 4 PASSED: max observed error is 34, allowed is 150
11:09:29 INFO experiment 5 achieved DTFs: 139, 139, 139, 139, 140, 141, 160
11:09:29 INFO experiment 5 PASSED: max observed error is 1, allowed is 150
11:09:46 INFO --ran 6 experiments, 6 passed, 0 failed
Summary
Isolation between buffered writers is clearly better with this patch.
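
To make the pass/fail lines above easier to read, here is a small sketch of
how the "max observed error" figures can be reproduced. It rests on
assumptions that are not stated in this post: that DTF values are shares in
parts per thousand, that the targets for experiments 0, 1 and 2 were 500/500,
900/100 and 100/900 (inferred from the reported errors), and that a run
passes when the error does not exceed the allowed value. The function names
are hypothetical and the test harness itself may compute this differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Largest absolute gap between achieved and target shares (per mille). */
static int max_share_error(const int *achieved, const int *target, size_t n)
{
    int worst = 0;

    assert(achieved != NULL && target != NULL);
    for (size_t i = 0; i < n; i++) {
        int err = abs(achieved[i] - target[i]);
        if (err > worst)
            worst = err;
    }
    return worst;
}

/* Assumed pass rule: the worst error must not exceed the allowed error. */
static int experiment_passes(const int *achieved, const int *target,
                             size_t n, int allowed)
{
    return max_share_error(achieved, target, n) <= allowed;
}
```

With the assumed 500/500 target, the "before" result 666, 333 gives an error
of 167 and the "after" result 501, 498 gives 2, matching the log lines for
experiment 0.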
= Read latency results
To test read latency, I created two containers:
- One called "readers", with weight 900
- One called "writers", with weight 100
I ran this fio workload in "readers":
[global]
directory=/mnt/iostestmnt/fio
runtime=30
time_based=1
group_reporting=1
exec_prerun='echo 3 > /proc/sys/vm/drop_caches'
cgroup_nodelete=1
bs=4K
size=512M

[iostest-read]
description="reader"
numjobs=16
rw=randread
new_group=1

and this fio workload in "writers":
[global]
directory=/mnt/iostestmnt/fio
runtime=30
time_based=1
group_reporting=1
exec_prerun='echo 3 > /proc/sys/vm/drop_caches'
cgroup_nodelete=1
bs=4K
size=512M

[iostest-write]
description="writer"
cgroup=writers
numjobs=3
rw=write
new_group=1
I've pasted the results from the "read" workload inline.
Before patches
==============
Starting 16 processes
Jobs: 14 (f=14): [_rrrrrr_rrrrrrrr] [36.2% done] [352K/0K /s] [86 /0 iops] [eta 01m:00s]
iostest-read: (groupid=0, jobs=16): err= 0: pid=20606
  Description : ["reader"]
  read : io=13532KB, bw=455814 B/s, iops=111 , runt= 30400msec
    clat (usec): min=2190 , max=30399K, avg=30395175.13, stdev= 0.20
     lat (usec): min=2190 , max=30399K, avg=30395177.07, stdev= 0.20
    bw (KB/s) : min= 0, max= 260, per=0.00%, avg= 0.00, stdev= 0.00
  cpu : usr=0.00%, sys=0.03%, ctx691, majf=2, minf=468
  IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=3383/0/0, short=0/0/0
     lat (msec): 4=0.03%, 10=2.66%, 20=74.84%, 50=21.90%, 100=0.09%
     lat (msec): 250=0.06%, >=2000=0.41%
Run status group 0 (all jobs):
   READ: io=13532KB, aggrb=445KB/s, minb=455KB/s, maxb=455KB/s, mint=30400msec, maxt=30400msec
Disk stats (read/write):
  sdb: ios=3744/18, merge=0/16, ticks=542713/1675, in_queue=550714, util=99.15%
After patches
=============
Starting 16 processes
Jobs: 16 (f=16): [rrrrrrrrrrrrrrrr] [100.0% done] [557K/0K /s] [136 /0 iops] [eta 00m:00s]
iostest-read: (groupid=0, jobs=16): err= 0: pid183
  Description : ["reader"]
  read : io=14940KB, bw=506105 B/s, iops=123 , runt= 30228msec
    clat (msec): min=2 , max=29866 , avg=463.42, stdev1.84
     lat (msec): min=2 , max=29866 , avg=463.42, stdev1.84
    bw (KB/s) : min= 0, max= 198, per1.69%, avg6.52, stdev.83
  cpu : usr=0.01%, sys=0.03%, ctx=4274, majf=2, minf=464
  IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=3735/0/0, short=0/0/0
     lat (msec): 4=0.05%, 10=0.32%, 20=32.99%, 50=64.61%, 100=1.26%
     lat (msec): 250=0.11%, 500=0.11%, 750=0.16%, 1000=0.05%, >=2000=0.35%
Run status group 0 (all jobs):
   READ: io=14940KB, aggrb=494KB/s, minb=506KB/s, maxb=506KB/s, mint=30228msec, maxt=30228msec
Disk stats (read/write):
  sdb: ios=4189/0, merge=0/0, ticks=96428/0, in_queue=478798, util=100.00%
Summary
Read latencies are a bit worse, but this overhead is only imposed when users
ask for this feature by turning on CONFIG_BLKIOTRACK. We expect there to be
something of a latency vs isolation tradeoff.