[PATCH 0/2] [RFC] Volatile ranges (v4)

March 16th, 2012 - 06:00 pm ET by John Stultz | Report spam
Ok. So here's another iteration of the fadvise volatile range code.

I realize this is still a way off from being ready, but I wanted to post
what I have to share with folks working on the various range/interval
management ideas as well as update folks who've provided feedback on the
volatile range code.

So just on the premise: Ideally, I want delayed reclaim based hole
punching.

Application has a possibly shared mmapped cache file, which it can mark
chunks of which volatile or nonvolatile as it uses it. If the kernel
needs memory, it can zap any ranges that are currently marked volatile.

Some examples would be rendering of images or web pages that are not
on-screen. This allows the application to volunteer memory for
reclaiming, and the kernel to grab it only when it needs. This differs
from some of the memory notification schemes, in that it allows the
kernel to immediately reclaim what it needs, rather then having to
request applications to give up memory (which may add further memory
load) until enough is free. However, unlike the notification scheme,
it does require applications to mark and unmark pages as volatile as
they use them.

Current use cases (ie: users of Android's ashmem) only use shmfs/tmpfs.
However, I don't see right off why it should be limited to shm. As long
as punching a hole in a file can be done w/ minimal memory overhead
this could be useful and have somewhat sane behavior.

We could also only zap the page cache, not writing any dirty data out.
However, w/ non-shm files, discarding dirty data without hole punching
would obviously leave persistent files in a non-coherent state. This
may further re-inforce that the design should be shm only if we don't
do hole punching.

On the topic of hole punching, the kernel doesn't seem completely
unified in this as well. As I understand, there are two methods to do
hole punching: FALLOCATE_FL_PUNCH_HOLE vs MADV_REMOVE, and they don't
necessarily overlap in support. For the most part, it seems persistent
filesystems require FALLOCATE_FL_PUNCH, where as shmfs/tmpfs uses
MADV_REMOVE. But I may be misunderstanding the subtle difference here,
so if anyone wants to clarify this, it would be great.

One concern was that if the design is shm only, fadvise might not be
the right interface, as it should be generic. The
madvise(MADV_REMOVE,...) interface gives some precedence to shmfs/tmpfs
only calls, but I'd like to get some further feedback as to what folks
think of this. If we are shm/tmpfs only, I could rework this design to
use madvise instead of fadvise if folks would prefer.

Also, there's still the issue that lockdep doesn't like me calling
vmtruncate_range from the shrinker due to any allocations being done
while the i_mutex is taken could cause the shrinker to run and need the
i_mutex again. I did try using invalidate_inode_pages2_range() but it
always returns EBUSY in this context, so I suspect I want something
else. I'm currently reading shmem_truncate_range() and zap_page_range()
to get a better idea of how to this might be best accomplished.

Regarding feedback suggesting dropping the LRU ranges, and instead
keeping the volatile/purged data in radix tags and to manage things at
writeout time. My concern there is having the LRU behavior on the
entire range from when it was marked volatile instead of the actual
last page access (you might have ranges that have frequent use areas
and non-frequent use). Also sorting out how to evict the entire range
when one page is dropped might be funky. However, I'll likely revisit
this soon, but for this iteration I didn't get to it.

I still also realize I have the issue of bloating the address_space
structure to handle, and I suspect if I continue w/ this approach
I'll use a separate hash table to store the range-tree roots in my
next revision.

Anyway, thanks for the continued advice and feedback!
-john

CC: Andrew Morton <akpm@linux-foundation.org>
CC: Android Kernel Team <kernel-team@android.com>
CC: Robert Love <rlove@google.com>
CC: Mel Gorman <mel@csn.ul.ie>
CC: Hugh Dickins <hughd@google.com>
CC: Dave Hansen <dave@linux.vnet.ibm.com>
CC: Rik van Riel <riel@redhat.com>
CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Neil Brown <neilb@suse.de>
CC: Andrea Righi <andrea@betterlinux.com>
CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

John Stultz (2):
[RFC] Range tree implementation
[RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags

fs/inode.c | 4 +
include/linux/fadvise.h | 5 +
include/linux/fs.h | 2 +
include/linux/rangetree.h | 53 ++++++++
include/linux/volatile.h | 14 ++
lib/Makefile | 2 +-
lib/rangetree.c | 105 +++++++++++++++
mm/Makefile | 2 +-
mm/fadvise.c | 16 ++-
mm/volatile.c | 313 +++++++++++++++++++++++++++++++++++++++++++++
10 files changed, 513 insertions(+), 3 deletions(-)
create mode 100644 include/linux/rangetree.h
create mode 100644 include/linux/volatile.h
create mode 100644 lib/rangetree.c
create mode 100644 mm/volatile.c

1.7.3.2.146.gca209

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
email Follow the discussionReplies 9 repliesReplies Make a reply

Similar topics

Replies

#1 John Stultz
March 16th, 2012 - 06:00 pm ET | Report spam
This patch provides new fadvise flags that can be used to mark
file pages as volatile, which will allow it to be discarded if the
kernel wants to reclaim memory.

This is useful for userspace to allocate things like caches, and lets
the kernel destructively (but safely) reclaim them when there's memory
pressure.

It's different from FADV_DONTNEED since the pages are not immediately
discarded; they are only discarded under pressure.

This is very much influenced by the Android Ashmem interface by
Robert Love so credits to him and the Android developers.
In many cases the code & logic come directly from the ashmem patch.
The intent of this patch is to allow for ashmem-like behavior, but
embeds the idea a little deeper into the VM code, instead of isolating
it into a specific driver.

I'm very much a newbie at the VM code, so At this point, I just want
to try to get some input on the patch, so if you have another idea
for using something other then fadvise, or other thoughts on how the
volatile ranges are stored, I'd be really interested in hearing them.
So let me know if you have any comments for feedback!

Also many thanks to Dave Hansen who helped design and develop the
initial version of this patch, and has provided continued review and
mentoring for me in the VM code.

v2:
* After the valid critique that just dropping pages would poke holes
in volatile ranges, and instead we should zap an entire range if we
drop any of it, I changed the code to more closely mimic the ashmem
implementation, which zaps entire ranges via a shrinker using an lru
list that tracks which range has been marked volatile the longest.

v3:
* Reworked to use range tree implementation.

v4:
* Renamed functions to avoid confusion.
* More consistant PAGE_CACHE_SHIFT usage, suggested by Dmitry
Adamushko
* Fixes exit without unlocking issue found by Dmitry Adamushko
* Migrate to rbtree based rangetree implementation
* Simplified locking to use global lock (we were grabbing global
lru lock every time anyway).
* Avoid ENOMEM isses by allocating before we get into complicated
code.
* Add some documentation to the volatile.c file from Neil Brown

Known issues:
* Lockdep doesn't like calling vmtruncate_range() from a shrinker.
Any help here on how to address this would be appreciated.
I've tried switching to invalidate_inode_pages2_range, but
that always returns EBUSY in my testing, and I don't really
want to launder dirty pages, instead I want to zap them.
* Concern over bloating the address_space struct

CC: Andrew Morton
CC: Android Kernel Team
CC: Robert Love
CC: Mel Gorman
CC: Hugh Dickins
CC: Dave Hansen
CC: Rik van Riel
CC: Dmitry Adamushko
CC: Dave Chinner
CC: Neil Brown
CC: Andrea Righi
CC: Aneesh Kumar K.V
Signed-off-by: John Stultz

fs/inode.c | 4 +
include/linux/fadvise.h | 5 +
include/linux/fs.h | 2 +
include/linux/volatile.h | 14 ++
mm/Makefile | 2 +-
mm/fadvise.c | 16 +++-
mm/volatile.c | 313 ++++++++++++++++++++++++++++++++++++++++++++++
7 files changed, 354 insertions(+), 2 deletions(-)
create mode 100644 include/linux/volatile.h
create mode 100644 mm/volatile.c

diff --git a/fs/inode.c b/fs/inode.c
index d3ebdbe..f602dc2 100644
a/fs/inode.c
+++ b/fs/inode.c
@@ -27,6 +27,7 @@
#include <linux/cred.h>
#include <linux/buffer_head.h> /* for inode_has_buffers */
#include <linux/ratelimit.h>
+#include <linux/volatile.h>
#include "internal.h"

/*
@@ -254,6 +255,7 @@ void __destroy_inode(struct inode *inode)
if (inode->i_default_acl && inode->i_default_acl != ACL_NOT_CACHED)
posix_acl_release(inode->i_default_acl);
#endif
+ mapping_clear_volatile_ranges(&inode->i_data);
this_cpu_dec(nr_inodes);
}
EXPORT_SYMBOL(__destroy_inode);
@@ -360,6 +362,8 @@ void address_space_init_once(struct address_space *mapping)
spin_lock_init(&mapping->private_lock);
INIT_RAW_PRIO_TREE_ROOT(&mapping->i_mmap);
INIT_LIST_HEAD(&mapping->i_mmap_nonlinear);
+ range_tree_init(&mapping->volatile_root);
+
}
EXPORT_SYMBOL(address_space_init_once);

diff --git a/include/linux/fadvise.h b/include/linux/fadvise.h
index e8e7471..443951c 100644
a/include/linux/fadvise.h
+++ b/include/linux/fadvise.h
@@ -18,4 +18,9 @@
#define POSIX_FADV_NOREUSE 5 /* Data will be accessed once. */
#endif

+#define POSIX_FADV_VOLATILE 8 /* _can_ toss, but don't toss now */
+#define POSIX_FADV_NONVOLATILE 9 /* Remove VOLATILE flag */
+
+
+
#endif /* FADVISE_H_INCLUDED */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 69cd5bb..2e20be1 100644
a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -10,6 +10,7 @@
#include <linux/ioctl.h>
#include <linux/blk_types.h>
#include <linux/types.h>
+#include <linux/rangetree.h>

/*
* It's silly to have NR_OPEN bigger than NR_FILE, but you can change
@@ -655,6 +656,7 @@ struct address_space {
spinlock_t private_lock; /* for use by the address_space */
struct list_head private_list; /* ditto */
struct address_space *assoc_mapping; /* ditto */
+ struct range_tree_root volatile_root; /* volatile range list */
} __attribute__((aligned(sizeof(long))));
/*
* On most architectures that alignment is already the case; but
diff --git a/include/linux/volatile.h b/include/linux/volatile.h
new file mode 100644
index 0000000..5460d7b
/dev/null
+++ b/include/linux/volatile.h
@@ -0,0 +1,14 @@
+#ifndef _LINUX_VOLATILE_H
+#define _LINUX_VOLATILE_H
+
+#include <linux/fs.h>
+
+extern long mapping_range_volatile(struct address_space *mapping,
+ pgoff_t start_index, pgoff_t end_index);
+extern long mapping_range_nonvolatile(struct address_space *mapping,
+ pgoff_t start_index, pgoff_t end_index);
+extern long mapping_range_isvolatile(struct address_space *mapping,
+ pgoff_t start_index, pgoff_t end_index);
+extern void mapping_clear_volatile_ranges(struct address_space *mapping);
+
+#endif /* _LINUX_VOLATILE_H */
diff --git a/mm/Makefile b/mm/Makefile
index 50ec00e..7b6c7a8 100644
a/mm/Makefile
+++ b/mm/Makefile
@@ -13,7 +13,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
readahead.o swap.o truncate.o vmscan.o shmem.o \
prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
page_isolation.o mm_init.o mmu_context.o percpu.o \
- $(mmu-y)
+ volatile.o $(mmu-y)
obj-y += init-mm.o

ifdef CONFIG_NO_BOOTMEM
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 469491e0..3e33845 100644
a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -17,6 +17,7 @@
#include <linux/fadvise.h>
#include <linux/writeback.h>
#include <linux/syscalls.h>
+#include <linux/volatile.h>

#include <asm/unistd.h>

@@ -106,7 +107,7 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
nrpages = end_index - start_index + 1;
if (!nrpages)
nrpages = ~0UL;
-
+
ret = force_page_cache_readahead(mapping, file,
start_index,
nrpages);
@@ -128,6 +129,19 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
invalidate_mapping_pages(mapping, start_index,
end_index);
break;
+ case POSIX_FADV_VOLATILE:
+ /* First and last PARTIAL page! */
+ start_index = offset >> PAGE_CACHE_SHIFT;
+ end_index = endbyte >> PAGE_CACHE_SHIFT;
+ ret = mapping_range_volatile(mapping, start_index, end_index);
+ break;
+ case POSIX_FADV_NONVOLATILE:
+ /* First and last PARTIAL page! */
+ start_index = offset >> PAGE_CACHE_SHIFT;
+ end_index = endbyte >> PAGE_CACHE_SHIFT;
+ ret = mapping_range_nonvolatile(mapping, start_index,
+ end_index);
+ break;
default:
ret = -EINVAL;
}
diff --git a/mm/volatile.c b/mm/volatile.c
new file mode 100644
index 0000000..e412a8b
/dev/null
+++ b/mm/volatile.c
@@ -0,0 +1,313 @@
+/* mm/volatile.c
+ *
+ * Volatile page range managment.
+ * Copyright 2011 Linaro
+ *
+ * Based on mm/ashmem.c
+ * by Robert Love
+ * Copyright (C) 2008 Google, Inc.
+ *
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ *
+ * The goal behind volatile ranges is to allow applications to interact
+ * with the kernel's cache management infrastructure. In particular an
+ * application can say "this memory contains data that might be useful in
+ * the future, but can be reconstructed if necessary, so if the kernel
+ * needs, it can zap and reclaim this memory without having to swap it out.
+ *
+ * The proposed mechanism - at a high level - is for user-space to be able
+ * to say "This memory is volatile" and then later "this memory is no longer
+ * volatile". If the content of the memory is still available the second
+ * request succeeds. If not, the memory is marked non-volatile and an
+ * error is returned to denote that the contents have been lost.
+ *
+ * Credits to Neil Brown for the above description.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/pagemap.h>
+#include <linux/volatile.h>
+
+static DEFINE_MUTEX(volatile_mutex);
+
+struct volatile_range {
+ struct list_head lru;
+ struct range_tree_node range_node;
+
+ unsigned int purged;
+ struct address_space *mapping;
+};
+
+/* LRU list of volatile page ranges */
+static LIST_HEAD(volatile_lru_list);
+
+/* Count of pages on our LRU list */
+static u64 lru_count;
+
+
+static inline u64 range_size(struct volatile_range *range)
+{
+ return range->range_node.end - range->range_node.start + 1;
+}
+
+static inline void lru_add(struct volatile_range *range)
+{
+ list_add_tail(&range->lru, &volatile_lru_list);
+ lru_count += range_size(range);
+}
+
+static inline void lru_del(struct volatile_range *range)
+{
+ list_del(&range->lru);
+ lru_count -= range_size(range);
+}
+
+#define range_on_lru(range) (!(range)->purged)
+
+
+static inline void volatile_range_resize(struct volatile_range *range,
+ pgoff_t start_index, pgoff_t end_index)
+{
+ size_t pre = range_size(range);
+
+ range->range_node.start = start_index;
+ range->range_node.end = end_index;
+
+ if (range_on_lru(range))
+ lru_count -= pre - range_size(range);
+}
+
+static struct volatile_range *vrange_alloc(void)
+{
+ struct volatile_range *new;
+
+ new = kzalloc(sizeof(struct volatile_range), GFP_KERNEL);
+ if (!new)
+ return 0;
+ range_tree_node_init(&new->range_node);
+ return new;
+}
+
+static void vrange_del(struct volatile_range *vrange)
+{
+ struct address_space *mapping;
+ mapping = vrange->mapping;
+
+ if (range_on_lru(vrange))
+ lru_del(vrange);
+ range_tree_remove(&mapping->volatile_root, &vrange->range_node);
+ kfree(vrange);
+}
+
+
+
+/*
+ * Mark a region as volatile, allowing dirty pages to be purged
+ * under memory pressure
+ */
+long mapping_range_volatile(struct address_space *mapping,
+ pgoff_t start_index, pgoff_t end_index)
+{
+ struct volatile_range *new;
+ struct range_tree_node *node;
+
+ u64 start, end;
+ int purged = 0;
+ start = (u64)start_index;
+ end = (u64)end_index;
+
+ new = vrange_alloc();
+ if (!new)
+ return -ENOMEM;
+
+ mutex_lock(&volatile_mutex);
+
+ node = range_tree_in_range_adjacent(&mapping->volatile_root,
+ start, end);
+ while (node) {
+ struct volatile_range *vrange;
+
+ /* Already entirely marked volatile, so we're done */
+ if (node->start < start && node->end > end) {
+ /* don't need the allocated value */
+ kfree(new);
+ goto out;
+ }
+
+ /* Grab containing volatile range */
+ vrange = container_of(node, struct volatile_range, range_node);
+
+ /* resize range */
+ start = min_t(u64, start, node->start);
+ end = max_t(u64, end, node->end);
+ purged |= vrange->purged;
+
+
+ vrange_del(vrange);
+
+ /* get the next possible overlap */
+ node = range_tree_in_range(&mapping->volatile_root, start, end);
+ }
+
+ new->mapping = mapping;
+ new->range_node.start = start;
+ new->range_node.end = end;
+ new->purged = purged;
+
+ range_tree_add(&mapping->volatile_root, &new->range_node);
+ if (range_on_lru(new))
+ lru_add(new);
+
+out:
+ mutex_unlock(&volatile_mutex);
+
+ return 0;
+}
+
+/*
+ * Mark a region as nonvolatile, returns 1 if any pages in the region
+ * were purged.
+ */
+long mapping_range_nonvolatile(struct address_space *mapping,
+ pgoff_t start_index, pgoff_t end_index)
+{
+ struct volatile_range *new;
+ struct range_tree_node *node;
+ int ret = 0;
+ u64 start, end;
+ int used_new = 0;
+
+
+ start = (u64)start_index;
+ end = (u64)end_index;
+
+ /* create new node */
+ new = vrange_alloc();
+ if (!new)
+ return -ENOMEM;
+
+ mutex_lock(&volatile_mutex);
+ node = range_tree_in_range(&mapping->volatile_root, start, end);
+ while (node) {
+ struct volatile_range *vrange;
+ vrange = container_of(node, struct volatile_range, range_node);
+
+
+ ret |= vrange->purged;
+
+ if (start <= node->start && end >= node->end) {
+ vrange_del(vrange);
+ } else if (node->start >= start) {
+ volatile_range_resize(vrange, end+1, node->end);
+ } else if (node->end <= end) {
+ volatile_range_resize(vrange, node->start, start-1);
+ } else {
+ /* we only do this once */
+ used_new = 1;
+ new->mapping = mapping;
+ new->range_node.start = end + 1;
+ new->range_node.end = node->end;
+ volatile_range_resize(vrange, node->start, start-1);
+ range_tree_add(&mapping->volatile_root,
+ &new->range_node);
+ if (range_on_lru(new))
+ lru_add(new);
+
+ break;
+ }
+ node = range_tree_in_range(&mapping->volatile_root, start, end);
+ }
+ mutex_unlock(&volatile_mutex);
+
+ if (!used_new)
+ kfree(new);
+
+ return ret;
+}
+
+
+/*
+ * Cleans up any volatile ranges.
+ */
+void mapping_clear_volatile_ranges(struct address_space *mapping)
+{
+ struct volatile_range *tozap;
+
+ mutex_lock(&volatile_mutex);
+ while (!range_tree_empty(&mapping->volatile_root)) {
+ struct range_tree_node *tmp;
+ tmp = range_tree_root_node(&mapping->volatile_root);
+ tozap = container_of(tmp, struct volatile_range, range_node);
+ vrange_del(tozap);
+
+ }
+ mutex_unlock(&volatile_mutex);
+}
+
+/*
+ * Purges volatile ranges when under memory pressure
+ */
+static int volatile_shrink(struct shrinker *ignored, struct shrink_control *sc)
+{
+ struct volatile_range *range, *next;
+ unsigned long nr_to_scan = sc->nr_to_scan;
+ const gfp_t gfp_mask = sc->gfp_mask;
+
+ if (nr_to_scan && !(gfp_mask & __GFP_FS))
+ return -1;
+ if (!nr_to_scan)
+ return lru_count;
+
+ mutex_lock(&volatile_mutex);
+ list_for_each_entry_safe(range, next, &volatile_lru_list, lru) {
+ struct inode *inode;
+ loff_t start, end;
+
+ inode = range->mapping->host;
+
+ start = range->range_node.start << PAGE_CACHE_SHIFT;
+ end = ((range->range_node.end + 1) << PAGE_CACHE_SHIFT) - 1;
+
+ /*
+ * XXX - calling vmtruncate_range from a shrinker causes
+ * lockdep warnings. Revisit this!
+ */
+ if (!vmtruncate_range(inode, start, end)) {
+ lru_del(range);
+ range->purged = 1;
+ nr_to_scan -= range_size(range);
+ }
+
+ if (nr_to_scan <= 0)
+ break;
+ }
+ mutex_unlock(&volatile_mutex);
+
+ return lru_count;
+}
+
+static struct shrinker volatile_shrinker = {
+ .shrink = volatile_shrink,
+ .seeks = DEFAULT_SEEKS,
+};
+
+static int __init volatile_init(void)
+{
+ register_shrinker(&volatile_shrinker);
+ return 0;
+}
+
+arch_initcall(volatile_init);
1.7.3.2.146.gca209

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Replies Reply to this message
#2 John Stultz
March 16th, 2012 - 06:00 pm ET | Report spam
After Andrew suggested something like his mumbletree idea
to better store a list of ranges, I worked on a few different
approaches, and this is what I've finally managed to get working.

I suspect range-tree isn't a totally accurate name, but I
couldn't quite make out the difference between range trees
and interval trees, so I just picked one to call it. Do
let me know if you have a better name.

The idea of storing ranges in a tree is nice, but has a number
of complications. When adding a range, its possible that a
large range will consume and merge a number of smaller ranges.
When removing a range, its possible you may end up splitting an
existing range, causing one range to become two. This makes it
very difficult to provide generic list_head like behavior, as
the parent structures would need to be duplicated and removed,
and that has lots of memory ownership issues.

So, this is a much simplified and more list_head like
implementation. You can add a node to a tree, or remove a node
to a tree, but the generic implementation doesn't do the
merging or splitting for you. But it does provide helpers to
find overlapping and adjacent ranges.

Andrew also really wanted this range-tree implementation to be
resuable so we don't duplicate the file locking logic. I'm not
totally convinced that the requirements between the volatile
ranges and file locking are really equivelent, but this reduced
impelementation may make it possible.

Do let me know what you think or if you have other ideas for
better ways to do the same.

Changelog:
v2:
* Reworked code to use an rbtree instead of splaying

CC: Andrew Morton
CC: Android Kernel Team
CC: Robert Love
CC: Mel Gorman
CC: Hugh Dickins
CC: Dave Hansen
CC: Rik van Riel
CC: Dmitry Adamushko
CC: Dave Chinner
CC: Neil Brown
CC: Andrea Righi
CC: Aneesh Kumar K.V
Signed-off-by: John Stultz

include/linux/rangetree.h | 53 +++++++++++++++++++++++
lib/Makefile | 2 +-
lib/rangetree.c | 105 +++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 159 insertions(+), 1 deletions(-)
create mode 100644 include/linux/rangetree.h
create mode 100644 lib/rangetree.c

diff --git a/include/linux/rangetree.h b/include/linux/rangetree.h
new file mode 100644
index 0000000..ca03821
/dev/null
+++ b/include/linux/rangetree.h
@@ -0,0 +1,53 @@
+#ifndef _LINUX_RANGETREE_H
+#define _LINUX_RANGETREE_H
+
+#include <linux/types.h>
+#include <linux/rbtree.h>
+
+struct range_tree_node {
+ struct rb_node rb;
+ u64 start;
+ u64 end;
+};
+
+struct range_tree_root {
+ struct rb_root head;
+};
+
+static inline void range_tree_init(struct range_tree_root *root)
+{
+ root->head = RB_ROOT;
+}
+
+static inline void range_tree_node_init(struct range_tree_node *node)
+{
+ rb_init_node(&node->rb);
+ node->start = 0;
+ node->end = 0;
+}
+
+static inline int range_tree_empty(struct range_tree_root *root)
+{
+ return RB_EMPTY_ROOT(&root->head);
+}
+
+static inline
+struct range_tree_node *range_tree_root_node(struct range_tree_root *root)
+{
+ struct range_tree_node *ret;
+ ret = container_of(root->head.rb_node, struct range_tree_node, rb);
+ return ret;
+}
+
+extern struct range_tree_node *range_tree_in_range(struct range_tree_root *root,
+ u64 start, u64 end);
+extern struct range_tree_node *range_tree_in_range_adjacent(
+ struct range_tree_root *root,
+ u64 start, u64 end);
+extern void range_tree_add(struct range_tree_root *root,
+ struct range_tree_node *node);
+extern void range_tree_remove(struct range_tree_root *root,
+ struct range_tree_node *node);
+#endif
+
+
diff --git a/lib/Makefile b/lib/Makefile
index 18515f0..f43ef0d 100644
a/lib/Makefile
+++ b/lib/Makefile
@@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
idr.o int_sqrt.o extable.o prio_tree.o \
sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
proportions.o prio_heap.o ratelimit.o show_mem.o \
- is_single_threaded.o plist.o decompress.o
+ is_single_threaded.o plist.o decompress.o rangetree.o

lib-$(CONFIG_MMU) += ioremap.o
lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/rangetree.c b/lib/rangetree.c
new file mode 100644
index 0000000..0f6208a
/dev/null
+++ b/lib/rangetree.c
@@ -0,0 +1,105 @@
+#include <linux/rangetree.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+
+
+/**
+ * range_tree_in_range - Returns the first node that overlaps with the
+ * given range
+ * @root: range_tree root
+ * @start: range start
+ * @end: range end
+ *
+ */
+struct range_tree_node *range_tree_in_range(struct range_tree_root *root,
+ u64 start, u64 end)
+{
+ struct rb_node **p = &root->head.rb_node;
+ struct range_tree_node *candidate;
+
+ while (*p) {
+ candidate = rb_entry(*p, struct range_tree_node, rb);
+ if (end < candidate->start)
+ p = &(*p)->rb_left;
+ else if (start > candidate->end)
+ p = &(*p)->rb_right;
+ else
+ return candidate;
+ }
+
+ return 0;
+}
+
+
+/**
+ * range_tree_in_range - Returns the first node that overlaps or is adjacent
+ * with the given range
+ * @root: range_tree root
+ * @start: range start
+ * @end: range end
+ *
+ */
+struct range_tree_node *range_tree_in_range_adjacent(
+ struct range_tree_root *root,
+ u64 start, u64 end)
+{
+ struct rb_node **p = &root->head.rb_node;
+ struct range_tree_node *candidate;
+
+ while (*p) {
+ candidate = rb_entry(*p, struct range_tree_node, rb);
+ if (end+1 < candidate->start)
+ p = &(*p)->rb_left;
+ else if (start > candidate->end + 1)
+ p = &(*p)->rb_right;
+ else
+ return candidate;
+ }
+ return 0;
+}
+
+/**
+ * range_tree_add - Add a node to a range tree
+ * @root: range tree to be added to
+ * @node: range_tree_node to be added
+ *
+ * Adds a node to the range tree.
+ */
+void range_tree_add(struct range_tree_root *root,
+ struct range_tree_node *node)
+{
+ struct rb_node **p = &root->head.rb_node;
+ struct rb_node *parent = NULL;
+ struct range_tree_node *ptr;
+
+ WARN_ON_ONCE(!RB_EMPTY_NODE(&node->rb));
+
+ while (*p) {
+ parent = *p;
+ ptr = rb_entry(parent, struct range_tree_node, rb);
+ if (node->start < ptr->start)
+ p = &(*p)->rb_left;
+ else
+ p = &(*p)->rb_right;
+ }
+ rb_link_node(&node->rb, parent, p);
+ rb_insert_color(&node->rb, &root->head);
+
+}
+
+
+/**
+ * range_tree_remove: Removes a given node from the tree
+ * @root: root of tree
+ * @node: Node to be removed
+ *
+ * Removes a node and splays the tree
+ */
+void range_tree_remove(struct range_tree_root *root,
+ struct range_tree_node *node)
+{
+ WARN_ON_ONCE(RB_EMPTY_NODE(&node->rb));
+
+ rb_erase(&node->rb, &root->head);
+ RB_CLEAR_NODE(&node->rb);
+}
1.7.3.2.146.gca209

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Replies Reply to this message
#3 John Stultz
March 16th, 2012 - 07:50 pm ET | Report spam
On 03/16/2012 03:51 PM, John Stultz wrote:
This patch provides new fadvise flags that can be used to mark
file pages as volatile, which will allow it to be discarded if the
kernel wants to reclaim memory.


Right after sending this I realized I had forgotten to include
some fixes for issues Dmitry pointed out. So I've included them
here.

Signed-off-by: John Stultz

mm/volatile.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/volatile.c b/mm/volatile.c
index e412a8b..f40c02e 100644
a/mm/volatile.c
+++ b/mm/volatile.c
@@ -220,11 +220,12 @@ long mapping_range_nonvolatile(struct address_space *mapping,
new->mapping = mapping;
new->range_node.start = end + 1;
new->range_node.end = node->end;
- volatile_range_resize(vrange, node->start, start-1);
+ new->purged = vrange->purged;
range_tree_add(&mapping->volatile_root,
&new->range_node);
if (range_on_lru(new))
lru_add(new);
+ volatile_range_resize(vrange, node->start, start-1);

break;
}
@@ -263,7 +264,7 @@ void mapping_clear_volatile_ranges(struct address_space *mapping)
static int volatile_shrink(struct shrinker *ignored, struct shrink_control *sc)
{
struct volatile_range *range, *next;
- unsigned long nr_to_scan = sc->nr_to_scan;
+ s64 nr_to_scan = sc->nr_to_scan;
const gfp_t gfp_mask = sc->gfp_mask;

if (nr_to_scan&& !(gfp_mask& __GFP_FS))
1.7.3.2.146.gca209



To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Replies Reply to this message
#4 Dmitry Adamushko
March 17th, 2012 - 11:30 am ET | Report spam
Hi John,

[ ... ]

+/*
+ * Mark a region as volatile, allowing dirty pages to be purged
+ * under memory pressure
+ */
+long mapping_range_volatile(struct address_space *mapping,
+                               pgoff_t start_index, pgoff_t end_index)
+{
+       struct volatile_range *new;
+       struct range_tree_node *node;
+
+       u64 start, end;
+       int purged = 0;
+       start = (u64)start_index;
+       end = (u64)end_index;
+
+       new = vrange_alloc();
+       if (!new)
+               return -ENOMEM;
+
+       mutex_lock(&volatile_mutex);
+
+       node = range_tree_in_range_adjacent(&mapping->volatile_root,
+                                               start, end);
+       while (node) {
+               struct volatile_range *vrange;
+
+               /* Already entirely marked volatile, so we're done */
+               if (node->start < start && node->end > end) {
+                       /* don't need the allocated value */
+                       kfree(new);
+                       goto out;
+               }
+
+               /* Grab containing volatile range */
+               vrange = container_of(node, struct volatile_range, range_node);
+
+               /* resize range */
+               start = min_t(u64, start, node->start);
+               end = max_t(u64, end, node->end);
+               purged |= vrange->purged;
+
+
+               vrange_del(vrange);
+
+               /* get the next possible overlap */
+               node = range_tree_in_range(&mapping->volatile_root, start, end);



I guess range_tree_in_range_adjacent() should be used here again.
There can be 2 adjacent regions (left and right), and we'll miss one
of them with range_tree_in_range().

Also (as I had already mentioned before), I think that new ranges must
not be merged with the existing "vrange->purged == 1" ranges.
Otherwise, for some use cases, the whole idea of 'volatility' gets
broken. For example, when an application is processing a big buffer in
small consequent chunks (marking a chunk as volatile when done with
it), and the range gets 'purged' by the kernel early in this process
(when it's still small).

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Replies Reply to this message
#5 Dmitry Adamushko
March 18th, 2012 - 04:20 am ET | Report spam
On 17 March 2012 17:21, Dmitry Adamushko wrote:
Hi John,

[ ... ]

+/*
+ * Mark a region as volatile, allowing dirty pages to be purged
+ * under memory pressure
+ */
+long mapping_range_volatile(struct address_space *mapping,
+                               pgoff_t start_index, pgoff_t end_index)
+{
+       struct volatile_range *new;
+       struct range_tree_node *node;
+
+       u64 start, end;
+       int purged = 0;
+       start = (u64)start_index;
+       end = (u64)end_index;
+
+       new = vrange_alloc();
+       if (!new)
+               return -ENOMEM;
+
+       mutex_lock(&volatile_mutex);
+
+       node = range_tree_in_range_adjacent(&mapping->volatile_root,
+                                               start, end);
+       while (node) {
+               struct volatile_range *vrange;
+
+               /* Already entirely marked volatile, so we're done */
+               if (node->start < start && node->end > end) {
+                       /* don't need the allocated value */
+                       kfree(new);
+                       goto out;
+               }
+
+               /* Grab containing volatile range */
+               vrange = container_of(node, struct volatile_range, range_node);
+
+               /* resize range */
+               start = min_t(u64, start, node->start);
+               end = max_t(u64, end, node->end);
+               purged |= vrange->purged;
+
+
+               vrange_del(vrange);
+
+               /* get the next possible overlap */
+               node = range_tree_in_range(&mapping->volatile_root, start, end);



I guess range_tree_in_range_adjacent() should be used here again.
There can be 2 adjacent regions (left and right), and we'll miss one
of them with range_tree_in_range().

Also (as I had already mentioned before), I think that new ranges must
not be merged with the existing "vrange->purged == 1" ranges.
Otherwise, for some use cases, the whole idea of 'volatility' gets
broken. For example, when an application is processing a big buffer in
small consequent chunks (marking a chunk as volatile when done with
it), and the range gets 'purged' by the kernel early in this process
(when it's still small).



Alternatively, we could immediately truncate purged==0 ranges
(including the one for which mapping_range_volatile() is called) when
merging them with purged==1 ranges. This would result in a more
consistent, but perhaps too aggressive behavior.

Let's consider the following use case:

[1, 10] is an existing purged==1 volatile region, and an application
declares [11, 12] as volatile.

1) current approach: [1, 12] a single purged==1 region, where [11, 12]
was not really truncated (and it could have been [11, 100]);

2) aggressive purge-it-all approach: a single [1, 12] purged==1 region.
The newly added region gets truncated even when there is no shortage
of memory at the moment of addition.

3) do-not-merge approach: [1, 10] purged==1 and [11, 12] purged==0
regions; the later is on the lru list.
it adds extra complexities though (e.g. the need to merge purged
ranges in the shrinker code).

But what's more important, do we have a model of application behavior
that is expected to be observed in, say, 90+% of cases? What patterns
are more common?

For example,

1) make_volatile [1, 10] ; ... ; make_volatile [5, 15] //
overlapping volatile regions
2) make_volatile [1, 10] ; ... ; make_volatile [1, 15] // explicit merge
3) make_volatile [1, 10] ; ... ; make_volatile [11, 15] // adjacent
volatile regions

I guess (2) and (3) would be more common, and (3) even more so with
independently used regions (say, by different threads). For (3), do we
really want to merge purged==0 regions when they are adjacent? What if
they are used independently? Let's consider this case:

(a) make_volatile [1, 100] ; ... ; (z) make_volatile [101, 102]

(z) region is used much more frequently by the application [
make_nonvolatile -> do-smth -> make_volatile ], and (a) is not used
often - it's volatile most of the time. If we merge both regions when
they are still purged==0, the whole [1, 102] will tend to be at the
end of the LRU list =>
- we miss an opportunity to truncate (a);
- other regions that are more frequently used than (a) get truncated.

In this light, (3) would be better off behaving as if (a) and (z) were
not adjacent, i.e. no merge...

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Replies Reply to this message
Help Create a new topicNext page Replies Make a reply
Search Make your own search