[PATCH 0/8] idle page tracking / working set estimation

September 16th, 2011 - 11:50 pm ET by Michel Lespinasse
Please comment on the following patches (which are against the v3.0 kernel).
We are using these to collect memory utilization statistics for each cgroup
across many machines, and optimize job placement accordingly.

The statistics are intended to be compared across many machines - we
don't just want to know which cgroup to reclaim from on an individual
machine, we also need to know which machine is best to target a job onto
within a large cluster. Also, we try to have a low impact on the normal
MM algorithms - we think they already do a fine job balancing resources
on individual machines, so we are not trying to interfere with that here.

Patch 1 introduces no functionality; it modifies the page_referenced API
so that it can be more easily extended in patch 3.

Patch 2 documents the proposed features, and adds a configuration option
for these. When the features are compiled in, they are still disabled
until the administrator sets up the desired scanning interval; however
the configuration option seems necessary as the features make use of
3 extra page flags - there is plenty of space for these in 64-bit builds,
but less so in 32-bit builds...

Patch 3 introduces page_referenced_kstaled(), which is similar to
page_referenced() but is used for idle page tracking rather than
for memory reclaim. Since both functions clear the pte_young bits
and we don't want them to interfere with each other, two new page flags
are introduced that track when young pte references have been cleared by
each of the page_referenced variants. The page_referenced functions are also
extended to return the dirty status of any pte references encountered.

Patch 4 introduces the 'kstaled' thread that handles idle page tracking.
The thread starts disabled; one enables it by setting a scanning interval
in /sys/kernel/mm/kstaled/scan_seconds. It then scans all physical memory
pages, looking for idle pages - pages that have not been touched since the
previous scan interval. These pages are further classified into idle_clean
(which are immediately reclaimable), idle_dirty_swap (which are reclaimable
if swap is enabled on the system), and idle_dirty_file (which are reclaimable
after writeback occurs). These statistics are published for each cgroup in
a new /dev/cgroup/*/memory.idle_page_stats file. We did not use the
memory.stat file there because we thought these stats are different -
first, they are meaningless until one sets the scan_seconds value, and
then they are only updated once per scan interval, whereas the memory.stat
values are continually updated.
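
The three-way classification described for patch 4 can be sketched in user space as follows. This is a toy model under stated assumptions, not the kernel code: the `Page` record and its fields (`cgroup`, `referenced`, `dirty`, `file_backed`) are hypothetical stand-ins for what a scan would derive from page flags and PTE bits, and an anonymous page with no up-to-date copy in swap is simply treated as dirty here.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical per-page summary; not a kernel structure."""
    cgroup: str
    referenced: bool   # touched since the previous scan
    dirty: bool        # contents not yet saved to backing store
    file_backed: bool  # True for page cache, False for anonymous memory


def classify(page):
    """Return the idle bucket for a page, or None if it was touched."""
    if page.referenced:
        return None
    if not page.dirty:
        return "idle_clean"        # immediately reclaimable
    if page.file_backed:
        return "idle_dirty_file"   # reclaimable after writeback
    return "idle_dirty_swap"       # reclaimable only if swap is enabled


def scan(pages):
    """One full scan: per-cgroup counts of idle pages by bucket."""
    stats = {}
    for page in pages:
        bucket = classify(page)
        if bucket is not None:
            stats.setdefault(page.cgroup, Counter())[bucket] += 1
    return stats
```

Each call to `scan()` corresponds to one completed scan interval, which is why the resulting numbers only change once per interval.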

Patch 5 is a small optimization skipping over memory holes.

Patch 6 rate limits the idle page scanning so that it occurs in small
chunks over the length of the scan interval, rather than all at once.
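
As a rough illustration of the chunking idea in patch 6, the sketch below splits one node's page range into one slice per second of the scan interval. The even split and the name `scan_schedule` are assumptions made for the example; the patch itself may compute its per-second quota differently.

```python
def scan_schedule(node_pages, scan_seconds):
    """Split a node's pages into one (start, end) slice per second.

    Slices are half-open ranges of page indexes; the remainder is
    spread over the first slices so sizes differ by at most one page.
    """
    if scan_seconds <= 0:
        raise ValueError("scan_seconds must be positive")
    base, extra = divmod(node_pages, scan_seconds)
    schedule = []
    start = 0
    for second in range(scan_seconds):
        count = base + (1 if second < extra else 0)
        schedule.append((start, start + count))
        start += count
    return schedule
```

With this layout a given page is visited at roughly the same offset within every interval, instead of all pages being visited in one burst at the start.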

Patch 7 adds extra functionality to track how long a given page has been
idle, so that memory.idle_page_stats can report pages that have been
idle for 1, 2, 5, 15, 30, 60, 120 or 240 consecutive scan intervals.

Patch 8 adds extra functionality in the form of an incremental update
feature. Here we only report immediately reclaimable idle pages; however
we don't want to wait for the end of a scan interval to update this number
if the system experiences a rapid increase in memory pressure.

Michel Lespinasse (8):
  page_referenced: replace vm_flags parameter with struct pr_info
  kstaled: documentation and config option.
  kstaled: page_referenced_kstaled() and supporting infrastructure.
  kstaled: minimalistic implementation.
  kstaled: skip non-RAM regions.
  kstaled: rate limit pages scanned per second.
  kstaled: add histogram sampling functionality
  kstaled: add incrementally updating stale page count

 Documentation/cgroups/memory.txt  | 103 ++++++++-
 arch/x86/include/asm/page_types.h |   8 +
 arch/x86/kernel/e820.c            |  45 ++++
 include/linux/ksm.h               |   9 +-
 include/linux/mmzone.h            |  11 +
 include/linux/page-flags.h        |  50 ++++
 include/linux/pagemap.h           |  11 +-
 include/linux/rmap.h              |  82 ++++++-
 mm/Kconfig                        |  10 +
 mm/internal.h                     |   1 +
 mm/ksm.c                          |  15 +-
 mm/memcontrol.c                   | 492 +++++++++++++++++++++++++++++++++++++
 mm/memory_hotplug.c               |   6 +
 mm/mlock.c                        |   1 +
 mm/rmap.c                         | 136 ++++++--
 mm/swap.c                         |   1 +
 mm/vmscan.c                       |  20 +-
 17 files changed, 904 insertions(+), 97 deletions(-)

1.7.3.1

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Replies

#16 Michel Lespinasse
September 23rd, 2011 - 06:20 am ET
On Thu, Sep 22, 2011 at 4:15 PM, Andrew Morton wrote:
> On Fri, 16 Sep 2011 20:39:11 -0700
> Michel Lespinasse wrote:
>
>> Scan some number of pages from each node every second, instead of trying to
>> scan the entire memory at once and being idle for the rest of the configured
>> interval.
>
> Well...  why?  The amount of work done per scan interval is the same
> (actually, it will be slightly increased due to cache evictions).
>
> I think we should see a good explanation of what observed problem this
> hackery^Wtweak is trying to solve.  Once that is revealed, we can
> compare the proposed solution with one based on thread policy/priority
> (for example).

There are two aspects to this:

- some people might find it nicer to have a small amount of load
during the entire scan interval, rather than some spike when we
trigger the scanning and some idle time afterwards. That part is
highly debatable and there are probably better ways to achieve this.

- jitter reduction - if we were to scan the entire memory at once
without sleeping, the pages that are scanned first would have a fairly
constant interval between times they are looked at; however if the
time to scan pages is not constant (it could vary depending on CPU
load and pages getting allocated and freed) the pages that are scanned
towards the end of each scan would have a bit more jitter. This effect
is reduced by trying to scan a fixed number of pages per second.

> This is all rather unpleasing.

Yeah, this is not my favourite patch in the series :/

Would it help if I reordered it last in the series, as it seems more
controversial and the later ones don't functionally depend on it?

Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
#17 Michel Lespinasse
September 23rd, 2011 - 06:30 am ET
On Thu, Sep 22, 2011 at 4:15 PM, Andrew Morton wrote:
> On Fri, 16 Sep 2011 20:39:12 -0700
> Michel Lespinasse wrote:
>
>> add statistics for pages that have been idle for 1, 2, 5, 15, 30, 60, 120 or
>> 240 scan intervals into /dev/cgroup/*/memory.idle_page_stats
>
> Why?  What's the use case for this feature?

In the fakenuma implementation of kstaled, we were able to configure a
different scan rate for each container (which was represented in the
kernel as a set of fakenuma nodes, rather than a memory cgroup). This
was used to reclaim memory more aggressively from some containers than
others, by varying the interval after which pages would be considered
idle.

In the memcg implementation, scanning is done globally so we can't
configure a per-cgroup rate. Instead, we track the number of scan
cycles that each page has been observed to be idle for. At that point,
we could have a per-cgroup configurable threshold and report pages
that have been idle for longer than that number of scans; however it
seemed nicer to provide a full histogram since the information is
actually available.
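
The per-page idle counting and the resulting histogram can be modelled roughly as below. This is a sketch under assumptions: the helper names are made up for the example, each bucket is read as "idle for at least N scans", and the counter is capped at the largest bucket. None of this is taken from the patch itself.

```python
BUCKETS = (1, 2, 5, 15, 30, 60, 120, 240)


def update_ages(ages, touched):
    """Advance one scan cycle.

    ages[i] is the number of consecutive scans page i has been idle;
    touched[i] says whether page i was referenced since the last scan.
    Touched pages restart at 0, idle pages age by one (capped).
    """
    cap = BUCKETS[-1]
    return [0 if t else min(age + 1, cap) for age, t in zip(ages, touched)]


def histogram(ages):
    """Count pages idle for at least N scans, for each bucket N."""
    return {n: sum(1 for age in ages if age >= n) for n in BUCKETS}


def idle_over(ages, threshold):
    """The single-threshold alternative: pages idle for >= threshold scans."""
    return sum(1 for age in ages if age >= threshold)
```

A per-cgroup threshold would amount to calling `idle_over()` with a different value for each cgroup, whereas the histogram exposes all of the bucket counts at once.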

Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
#18 Rik van Riel
September 23rd, 2011 - 03:30 pm ET
On 09/16/2011 11:39 PM, Michel Lespinasse wrote:
> Extend memory cgroup documentation to describe the optional idle page
> tracking features, and add the corresponding configuration option.
>
> Signed-off-by: Michel Lespinasse
>
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -370,3 +370,13 @@ config CLEANCACHE
>           in a negligible performance hit.
>
>           If unsure, say Y to enable cleancache
> +
> +config KSTALED
> +       depends on CGROUP_MEM_RES_CTLR

Looking at patch #3, I wonder if this needs to be dependent
on 64 bit, or at least make sure this is not selected when
a user builds a 32 bit kernel with NUMA.

The reason is that on a 32 bit system we could run out of
page flags + zone bits + node bits.

> +       bool "Per-cgroup idle page tracking"
> +       help
> +         This feature allows the kernel to report the amount of user pages
> +         in a cgroup that have not been touched in a given time.
> +         This information may be used to size the cgroups and/or for
> +         job placement within a compute cluster.
> +         See Documentation/cgroups/memory.txt for a more complete description.

All rights reversed
#19 Balbir Singh
September 27th, 2011 - 06:10 am ET
On Sat, Sep 17, 2011 at 9:09 AM, Michel Lespinasse wrote:
> Please comment on the following patches (which are against the v3.0 kernel).
> We are using these to collect memory utilization statistics for each cgroup
> across many machines, and optimize job placement accordingly.
>
> The statistics are intended to be compared across many machines - we
> don't just want to know which cgroup to reclaim from on an individual
> machine, we also need to know which machine is best to target a job onto
> within a large cluster. Also, we try to have a low impact on the normal
> MM algorithms - we think they already do a fine job balancing resources
> on individual machines, so we are not trying to interfere with that here.
>
> Patch 1 introduces no functionality; it modifies the page_referenced API
> so that it can be more easily extended in patch 3.
>
> Patch 2 documents the proposed features, and adds a configuration option
> for these. When the features are compiled in, they are still disabled
> until the administrator sets up the desired scanning interval; however
> the configuration option seems necessary as the features make use of
> 3 extra page flags - there is plenty of space for these in 64-bit builds,
> but less so in 32-bit builds...
>
> Patch 3 introduces page_referenced_kstaled(), which is similar to
> page_referenced() but is used for idle page tracking rather than
> for memory reclaim. Since both functions clear the pte_young bits
> and we don't want them to interfere with each other, two new page flags
> are introduced that track when young pte references have been cleared by
> each of the page_referenced variants.

Sorry, I have trouble parsing this sentence, could you elaborate on "when"?


Balbir Singh
#20 Michel Lespinasse
September 27th, 2011 - 06:30 am ET
On Tue, Sep 27, 2011 at 3:03 AM, Balbir Singh wrote:
> On Sat, Sep 17, 2011 at 9:09 AM, Michel Lespinasse wrote:
>> Patch 3 introduces page_referenced_kstaled(), which is similar to
>> page_referenced() but is used for idle page tracking rather than
>> for memory reclaim. Since both functions clear the pte_young bits
>> and we don't want them to interfere with each other, two new page flags
>> are introduced that track when young pte references have been cleared by
>> each of the page_referenced variants.
>
> Sorry, I have trouble parsing this sentence, could you elaborate on "when"?

page_referenced() indicates if a page was accessed since the previous
page_referenced() call.

page_referenced_kstaled() indicates if a page was accessed since the
previous page_referenced_kstaled() call.

Both of the functions need to clear PTE young bits; however we don't
want the two functions to interfere with each other. To achieve this,
we add two page bits to indicate when a young PTE has been observed by
one of the functions but not by the other.
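
The interaction between the two checkers can be illustrated with a small user-space model. It is only a sketch of the idea described above: `reclaim_check()` and `kstaled_check()` stand in for page_referenced() and page_referenced_kstaled(), a single `pte_young` boolean stands in for all PTEs mapping the page, and the flag names `young_for_reclaim` and `young_for_kstaled` are hypothetical, not the names used in the patch.

```python
class Page:
    """Toy page with one young bit and two hand-off flags (hypothetical names)."""

    def __init__(self):
        self.pte_young = False          # set by the hardware on access
        self.young_for_reclaim = False  # kstaled cleared a young PTE reclaim has not seen
        self.young_for_kstaled = False  # reclaim cleared a young PTE kstaled has not seen

    def touch(self):
        """Simulate a memory access through the PTE."""
        self.pte_young = True


def reclaim_check(page):
    """Was the page accessed since the previous reclaim_check() call?"""
    referenced = page.pte_young or page.young_for_reclaim
    if page.pte_young:
        page.young_for_kstaled = True   # remember the access for the other checker
    page.pte_young = False
    page.young_for_reclaim = False
    return referenced


def kstaled_check(page):
    """Was the page accessed since the previous kstaled_check() call?"""
    referenced = page.pte_young or page.young_for_kstaled
    if page.pte_young:
        page.young_for_reclaim = True   # remember the access for the other checker
    page.pte_young = False
    page.young_for_kstaled = False
    return referenced
```

After a single `touch()`, each checker reports the access exactly once regardless of which one runs first, which is the non-interference property the two flags are meant to provide.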

Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.