[PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

May 27th, 2011 - 08:40 am ET by Ankita Garg
Hi,

Modern systems offer higher CPU performance and larger amounts of memory in
each generation in order to support application demands. The memory subsystem
has begun to offer a wide range of capabilities for managing power consumption,
which is driving the need to take a fresh look at the way memory is managed by
the operating system. The Linux VM subsystem has sophisticated algorithms to
optimally manage scarce resources for best overall system performance.
Apart from the capacity and location of memory areas, the VM subsystem tracks
special addressability restrictions as zones and relative distance from the CPU
as NUMA nodes where necessary. Power management capabilities in the memory
subsystem, and the inclusion of different classes of main memory such as PCM or
non-volatile RAM, bring in new boundaries and attributes that need to be tagged
within the Linux VM subsystem for exploitation by the kernel and applications.

This patchset proposes a generic memory regions infrastructure that can be
used to tag the boundaries of memory blocks that belong to a specific memory
power management domain, and so enable exploitation of platform memory
power management capabilities.

How can Linux VM help memory power savings?

o Consolidate memory allocations and/or references so that they are not
spread across the entire memory address space. An area of memory that is
not being referenced can reside in a low power state.

o Support targeted memory reclaim, where areas of memory that can be
easily freed are offlined, allowing them to be put into lower power
states.

What is a Memory Region?
------------------------

Memory regions are a generic memory management framework that enables the
virtual memory manager to consider memory characteristics when making memory
allocation and deallocation decisions. They form a layer of abstraction under
the real NUMA nodes that encapsulates knowledge of the underlying memory
hardware. This layer is created at boot time, with information from firmware
regarding the granularity at which memory power can be managed on the
platform. For example, on platforms with support for Partial Array
Self-Refresh (PASR) [1], regions could be aligned to the memory units that can
be independently put into self-refresh or turned off (content-destructive
power off). On platforms where multiple memory controllers each control the
power state of their memory, one memory region could be created for all the
memory under a single memory controller.

The aim of the alignment is to ensure that memory allocations, deallocations
and reclaim are performed within a defined hardware boundary. By creating
zones under regions, the buddy allocator would operate at the level of
regions. The proposed data structure is as shown in the Figure below:


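        -----------------------------
        |N0 |N1 |N2 |N3 |.. |.. |Nn |
        -----------------------------
         / \ \
        /   \  \
       /     \   \
 ------------ |  ------------
 | Mem Rgn0 | |  | Mem Rgn3 |
 ------------ |  ------------
    |         |        |
    |    ------------  | ---------
    |    | Mem Rgn1 |  ->| zones |
    |    ------------    ---------
    |         |     ---------
    |         ----->| zones |
    | ---------     ---------
    ->| zones |
      ---------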
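
As a rough illustration of the hierarchy in the figure, here is a minimal
user-space sketch in C. The names zone_sketch, region_sketch, node_sketch,
the array bounds and the two helper functions are hypothetical and are not
the structures this series adds to include/linux/mmzone.h; the only thing
taken from the description above is the nesting node -> memory region ->
zones.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_ZONES   3	/* assumed bound, for illustration only */
#define SKETCH_MAX_REGIONS 4	/* assumed bound, for illustration only */

/* A zone keeps its usual role; only its parent changes. */
struct zone_sketch {
	unsigned long start_pfn;
	unsigned long spanned_pages;
	unsigned long free_pages;
};

/* A region covers one independently power-managed block of memory. */
struct region_sketch {
	unsigned long start_pfn;
	unsigned long spanned_pages;
	int nr_zones;
	struct zone_sketch zones[SKETCH_MAX_ZONES];
};

/* A node no longer owns zones directly; it owns regions. */
struct node_sketch {
	int nr_regions;
	struct region_sketch regions[SKETCH_MAX_REGIONS];
};

/* Free pages in one region: the sum over the zones it contains. */
static unsigned long region_free_pages(const struct region_sketch *r)
{
	unsigned long total = 0;

	assert(r->nr_zones >= 0 && r->nr_zones <= SKETCH_MAX_ZONES);
	for (int i = 0; i < r->nr_zones; i++)
		total += r->zones[i].free_pages;
	return total;
}

/* Free pages in a node: walk regions, then zones, in a fixed order. */
static unsigned long node_free_pages(const struct node_sketch *n)
{
	unsigned long total = 0;

	assert(n->nr_regions >= 0 && n->nr_regions <= SKETCH_MAX_REGIONS);
	for (int i = 0; i < n->nr_regions; i++)
		total += region_free_pages(&n->regions[i]);
	return total;
}
```

Keeping a per-region total like this is what would let the VM tell, for each
hardware boundary, whether the memory behind it is completely free.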


Memory regions enable the following:

o Sequential allocation of memory in the order of memory regions, thus
ensuring that a greater number of memory regions are devoid of allocations
to begin with.
o Over time, however, memory allocations will tend to spread across
different regions. The notion of a region boundary, together with
region-level memory statistics, will enable specific regions to be
evacuated using targeted allocation and reclaim.
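
The first point can be pictured with a toy model. The sketch below is not
the buddy allocator and not code from this series: region_counter,
alloc_in_region_order() and untouched_regions() are hypothetical names, and
each region is reduced to a single free-page counter. It only shows that
draining regions in index order leaves the later regions untouched.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-region free-page counter; not a kernel structure. */
struct region_counter {
	unsigned long free_pages;
};

/*
 * Take up to nr_pages from the regions in index order, draining region 0
 * before touching region 1, and so on. Returns the number of pages that
 * could not be satisfied (0 on full success).
 */
static unsigned long alloc_in_region_order(struct region_counter *regions,
					   size_t nr_regions,
					   unsigned long nr_pages)
{
	assert(regions != NULL || nr_regions == 0);
	for (size_t i = 0; i < nr_regions && nr_pages > 0; i++) {
		unsigned long take = regions[i].free_pages < nr_pages ?
				     regions[i].free_pages : nr_pages;

		regions[i].free_pages -= take;
		nr_pages -= take;
	}
	return nr_pages;
}

/* Count regions that hold no allocations, given each region's size. */
static size_t untouched_regions(const struct region_counter *regions,
				const unsigned long *region_size,
				size_t nr_regions)
{
	size_t n = 0;

	for (size_t i = 0; i < nr_regions; i++)
		if (regions[i].free_pages == region_size[i])
			n++;
	return n;
}
```

With three regions of 100 pages each, a request for 150 pages empties
region 0, takes 50 pages from region 1 and leaves region 2 untouched.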

Lumpy reclaim and other memory compaction work by Mel Gorman would further
aid in consolidation of memory [4].

Memory regions are just a base infrastructure that would make the Linux VM
aware of physical memory hardware characteristics, a prerequisite for
implementing other, more sophisticated algorithms and techniques that actually
conserve power.

Advantages

The memory regions framework works with existing memory management data
structures and adds only one more layer of abstraction, which is required to
capture special boundaries and properties. Most VM code paths work as they do
in the current implementation, with an additional traversal of zone data
structures in a pre-defined order.

Alternative Approach:

There are other ways in which memory belonging to the same power domain could
be grouped together. Fake NUMA nodes under a real NUMA node could encapsulate
information about the memory hardware units that can be independently power
managed. With minimal code changes, the same functionality as memory regions
can be achieved. However, fake NUMA nodes are a non-intuitive solution that
breaks NUMA semantics and is not generic in nature. They would present an
incorrect view of the system to the administrator, by showing a greater
number of NUMA nodes than are actually present.

Challenges
----------

o Memory interleaving is typically used on all platforms to increase
memory bandwidth and hence memory performance. However, in the presence of
interleaving, the amount of idle memory within a hardware domain shrinks,
which reduces power savings. For a given platform, it is important to select
an interleaving scheme that gives good performance with optimum power savings.
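
To see why interleaving works against idle memory, consider a simplified
address-to-bank mapping. This is a sketch under stated assumptions: real
memory controllers use platform-specific mappings, and the function names,
the round-robin scheme and the 32-bank limit below are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bank serving an address when consecutive granules rotate across banks. */
static unsigned int bank_interleaved(uint64_t addr, uint64_t granule,
				     unsigned int nr_banks)
{
	assert(granule > 0 && nr_banks > 0);
	return (unsigned int)((addr / granule) % nr_banks);
}

/* Bank serving an address when each bank holds one contiguous range. */
static unsigned int bank_linear(uint64_t addr, uint64_t bank_size,
				unsigned int nr_banks)
{
	uint64_t bank;

	assert(bank_size > 0 && nr_banks > 0);
	bank = addr / bank_size;
	return (unsigned int)(bank < nr_banks ? bank : nr_banks - 1);
}

/*
 * Number of distinct banks touched by [start, start + len).
 * 'unit' is the interleave granule or the bank size, depending on the
 * layout; every unit-sized block maps to exactly one bank in both cases.
 */
static unsigned int banks_touched(uint64_t start, uint64_t len,
				  uint64_t unit, unsigned int nr_banks,
				  int interleaved)
{
	uint32_t seen = 0;
	unsigned int count = 0;
	uint64_t blk, last;

	assert(unit > 0 && nr_banks > 0 && nr_banks <= 32);
	if (len == 0)
		return 0;

	last = (start + len - 1) / unit;
	for (blk = start / unit; blk <= last && count < nr_banks; blk++) {
		uint64_t addr = blk * unit;
		unsigned int b = interleaved ?
			bank_interleaved(addr, unit, nr_banks) :
			bank_linear(addr, unit, nr_banks);

		if (!(seen & (UINT32_C(1) << b))) {
			seen |= UINT32_C(1) << b;
			count++;
		}
	}
	return count;
}
```

In this model, with four banks and a 4KB interleave granule, a 16KB buffer
touches all four banks, whereas the same buffer in a linear layout with
256MB banks touches only one, leaving the other three idle.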

This is an RFC patchset with minimal functionality to demonstrate the
requirement and proposed implementation options. It has been tested on the TI
OMAP4 Panda board with 1GB RAM and the Samsung Exynos 4210 board. The patch
applies on kernel version 2.6.39-rc5, compiled with the default config files
for the two platforms. I have turned off cgroup, memory hotplug and kexec to
begin with. Support for these frameworks can easily be added. The u-boot
bootloader does not yet export information regarding the physical memory bank
boundaries, so the regions are not correctly aligned to hardware and are
hard coded for test/demo purposes. Also, the code assumes that at least
one region is present in the node. Compile-time exclusion of memory regions is
a todo.

Results
-------
I ran pagetest, a simple C program that allocates and touches a required number
of pages, on a Samsung Exynos 4210 board with ~2GB RAM, booted with 4 memory
regions of ~512MB each. The allocation size used was 512MB. Below are the
free page statistics while running the benchmark:


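---------------------------------------
|          | start  | ~480MB | 512MB  |
---------------------------------------
| Region 0 | 124013 |   1129 |    484 |
| Region 1 | 131072 | 131072 | 130824 |
| Region 2 | 131072 | 131072 | 131072 |
| Region 3 |  57332 |  57332 |  57332 |
---------------------------------------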


(The total number of pages in Region 3 is 57332, as it contains all the
remaining pages and hence the region size is not 512MB).

Column 1 shows the number of free pages in each region at the start of the
benchmark, column 2 at about 480MB of allocation and column 3 at 512MB of
allocation. The memory in regions 1, 2 and 3 stays almost entirely free
(region 1 gives up only 248 pages at the 512MB mark) and essentially only
region 0 is utilized. So if the regions are aligned to the hardware memory
units, free regions could potentially be put into a low power state or turned
off. It may be possible to allocate from lower addresses without regions, but
once page reclaim comes into play, page allocations will tend to get spread
around.
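
The figures in the table can be cross-checked with a few lines of C. This
is only a sketch: the helper names are hypothetical, and the 4KB page size
is an assumption, chosen because it matches the 131072 pages reported for a
512MB region.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12	/* assumes 4KB pages */

/* Convert a size in megabytes to a number of pages. */
static unsigned long long mb_to_pages(unsigned long long mb)
{
	return (mb << 20) >> SKETCH_PAGE_SHIFT;
}

/* Pages consumed in a region between two samples of its free count. */
static unsigned long pages_used(unsigned long free_before,
				unsigned long free_after)
{
	assert(free_after <= free_before);
	return free_before - free_after;
}

/* Count regions whose free count did not change between two samples. */
static size_t idle_regions(const unsigned long *before,
			   const unsigned long *after, size_t n)
{
	size_t idle = 0;

	for (size_t i = 0; i < n; i++)
		if (pages_used(before[i], after[i]) == 0)
			idle++;
	return idle;
}
```

Fed with the three columns of the table, this reports three untouched
regions at the ~480MB sample and two at the 512MB sample, where region 0
has given up 123529 pages and region 1 has given up 248.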

References
----------

[1] Partial Array Self Refresh
http://focus.ti.com/general/docs/wt...vigationId037
[2] TI OMAP4 Panda board
http://pandaboard.org/node/224/#manual
[3] Memory Regions discussion at Ubuntu Development Summit, May 2011
https://wiki.linaro.org/Specs/Kerne...egions.odp
[4] Memory compaction
http://lwn.net/Articles/368869/

Ankita Garg (10):
mm: Introduce the memory regions data structure
mm: Helper routines
mm: Init zones inside memory regions
mm: Refer to zones from memory regions
mm: Create zonelists
mm: Verify zonelists
mm: Modify vmstat
mm: Modify vmscan
mm: Reflect memory region changes in zoneinfo
mm: Create memory regions at boot-up

include/linux/mm.h | 25 +++-
include/linux/mmzone.h | 58 +++++++--
include/linux/vmstat.h | 22 ++-
mm/mm_init.c | 51 ++++
mm/mmzone.c | 36 ++++-
mm/page_alloc.c | 368 +++++++++++++++++++++++++++++++--
mm/vmscan.c | 284 ++++++++++++++++++++--
mm/vmstat.c | 77 ++++++-
8 files changed, 581 insertions(+), 340 deletions(-)

1.7.4

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Replies

#11 Dave Hansen
May 27th, 2011 - 05:40 pm ET
On Fri, 2011-05-27 at 23:50 +0530, Vaidyanathan Srinivasan wrote:
> The overall idea is to have a VM data structure that can capture
> various boundaries of memory, and enable the allocations and reclaim
> logic to target certain areas based on the boundaries and properties
> required.



It's worth noting that we already do targeted reclaim on boundaries
other than zones. The lumpy reclaim and memory compaction logically do
the same thing. So, it's at least possible to do this without having
the global LRU designed around the way you want to reclaim.

Also, if you get _too_ dependent on the global LRU, what are you going
to do if our cgroup buddies manage to get cgroup'd pages off the global
LRU?


#12 Andrew Morton
May 28th, 2011 - 04:00 am ET
On Fri, 27 May 2011 18:01:28 +0530 Ankita Garg wrote:

> This patchset proposes a generic memory regions infrastructure that can be
> used to tag the boundaries of memory blocks that belong to a specific memory
> power management domain, and so enable exploitation of platform memory
> power management capabilities.



A couple of quick thoughts...

I'm seeing no estimate of how much energy we might save when this work
is completed. But saving energy is the entire point of the entire
patchset! So please spend some time thinking about that and update and
maintain the [patch 0/n] description so others can get some idea of the
benefit we might get from all of this. That estimate should include an
estimate of what proportion of machines are likely to have hardware
which can use this feature and in what timeframe.

IOW, if it saves one microwatt on 0.001% of machines, not interested ;)


Also, all this code appears to be enabled on all machines? So machines
which don't have the requisite hardware still carry any additional
overhead which is added here. I can see that ifdeffing a feature like
this would be ghastly but please also have a think about the
implications of this and add that discussion also.

If possible, it would be good to think up some microbenchmarks which
probe the worst-case performance impact and describe those and present
the results. So others can gain an understanding of the runtime costs.


#13 Ankita Garg
May 28th, 2011 - 09:20 am ET
Hi Andrew,

On Sat, May 28, 2011 at 12:56:40AM -0700, Andrew Morton wrote:
> On Fri, 27 May 2011 18:01:28 +0530 Ankita Garg wrote:
>
> > This patchset proposes a generic memory regions infrastructure that can be
> > used to tag the boundaries of memory blocks that belong to a specific memory
> > power management domain, and so enable exploitation of platform memory
> > power management capabilities.
>
> A couple of quick thoughts...
>
> I'm seeing no estimate of how much energy we might save when this work
> is completed. But saving energy is the entire point of the entire
> patchset! So please spend some time thinking about that and update and
> maintain the [patch 0/n] description so others can get some idea of the
> benefit we might get from all of this. That estimate should include an
> estimate of what proportion of machines are likely to have hardware
> which can use this feature and in what timeframe.




This patchset is definitely not for inclusion. The intention of this RFC
series is to convey the idea and demonstrate the intricacies of the VM
design. Partial Array Self-Refresh (PASR) is an upcoming technology that
is supported on some platforms today, and it will be an important feature
in future platforms for conserving the idle power consumed by the memory
subsystem. Mobile devices that are predominantly in the standby state can
exploit PASR to partially turn off areas of memory that are free.

Unfortunately, at this point we are unable to provide an estimate of the
power savings, as the hardware platforms do not yet export information
about the underlying memory hardware topology. We are working on this
and hope to have some estimates in a month or two. In the meantime, we will
evaluate the performance impact of the changes and share the results.

> IOW, if it saves one microwatt on 0.001% of machines, not interested ;)
>
>
> Also, all this code appears to be enabled on all machines? So machines
> which don't have the requisite hardware still carry any additional
> overhead which is added here. I can see that ifdeffing a feature like
> this would be ghastly but please also have a think about the
> implications of this and add that discussion also.
>
> If possible, it would be good to think up some microbenchmarks which
> probe the worst-case performance impact and describe those and present
> the results. So others can gain an understanding of the runtime costs.




Regards,
Ankita Garg
Linux Technology Center
IBM India Systems & Technology Labs,
Bangalore, India
#14 Jean-Christophe PLAGNIOL-VILLARD
May 28th, 2011 - 11:00 am ET
On 18:01 Fri 27 May , Ankita Garg wrote:
> Memory regions are created at boot up time, from the information obtained
> from the firmware. This patchset was developed on ARM platform, on which at
> present u-boot bootloader does not export information about memory units that
> can be independently power managed. For the purpose of demonstration, 2 hard
> coded memory regions are created, of 256MB each on the Panda board with 512MB
> RAM.
>
> Signed-off-by: Ankita Garg
> ---
> include/linux/mmzone.h | 8 +++--
> mm/page_alloc.c | 29 +++++++++++++++++++++++++++++
> 2 files changed, 32 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index bc3e3fd..5dbe1e1 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -627,14 +627,12 @@ typedef struct mem_region_list_data {
> */
> struct bootmem_data;
> typedef struct pglist_data {
> -/* The linkage to node_zones is now removed. The new hierarchy introduced
> - * is pg_data_t -> mem_region -> zones
> - * struct zone node_zones[MAX_NR_ZONES];
> - */
> struct zonelist node_zonelists[MAX_ZONELISTS];
> int nr_zones;
> #ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
> - struct page *node_mem_map;
> + strs pg_data_t -> mem_region -> zones
> + * struct zone node_zones[MAX_NR_ZONES];
> + */uct page *node_mem_map;

what is time?

> #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> struct page_cgroup *node_page_cgroup;
> #endif
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index da8b045..3d994e8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4285,6 +4285,34 @@ static inline int pageblock_default_order(unsigned int order)
>
> #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
>
> +#define REGIONS_SIZE (512 << 20) >> PAGE_SHIFT

fix a region size why?

> +
> +static void init_node_memory_regions(struct pglist_data *pgdat)
> +{


Best Regards,
J.
#15 Ankita Garg
May 29th, 2011 - 04:20 am ET
Hi Dave,

On Fri, May 27, 2011 at 02:31:52PM -0700, Dave Hansen wrote:
> On Fri, 2011-05-27 at 23:50 +0530, Vaidyanathan Srinivasan wrote:
> > The overall idea is to have a VM data structure that can capture
> > various boundaries of memory, and enable the allocations and reclaim
> > logic to target certain areas based on the boundaries and properties
> > required.
>
> It's worth noting that we already do targeted reclaim on boundaries
> other than zones. The lumpy reclaim and memory compaction logically do
> the same thing. So, it's at least possible to do this without having
> the global LRU designed around the way you want to reclaim.




My understanding may be incorrect, but don't both lumpy reclaim and
memory compaction still work within zone boundaries? While trying to free
up higher-order pages, lumpy reclaim checks that the pages it selects
do not cross a zone boundary. Further, compaction walks through the pages
in a zone and tries to rearrange them.

> Also, if you get _too_ dependent on the global LRU, what are you going
> to do if our cgroup buddies manage to get cgroup'd pages off the global
> LRU?




Regards,
Ankita Garg
Linux Technology Center
IBM India Systems & Technology Labs,
Bangalore, India