[PATCH 00/11] kmem controller for memcg: stripped down version

June 25th, 2012 - 10:20 am ET by Glauber Costa
Hi,

What I am proposing with this series is a stripped down version of the
kmem controller for memcg that would allow us to merge significant parts
of the infrastructure, while leaving out, for now, the contentious bits about
the slab while it is being reworked by Christoph.

My reasoning is that after the last change, which introduced a gfp
flag to mark kernel allocations, it became clear to me that tracking other
resources like the stack would follow extremely naturally. I figured
that at some point we'd have to solve the issue pointed out by David and avoid
testing the slab flag in the page allocator, since it would soon be made
more generic. I do that by having the callers mark it explicitly.
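
To illustrate the idea of callers marking their allocations, here is a
minimal userspace sketch. It is not code from this series: the flag values,
struct sketch_group and sketch_alloc_pages() are hypothetical names invented
for the example. The point is only that the allocator decides whether to
account by looking at a flag the caller passed, not by inspecting what kind
of caller it is.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values, chosen for this sketch only. */
#define SKETCH_GFP_KERNEL 0x1u
#define SKETCH_GFP_KMEMCG 0x2u /* caller asks for the pages to be accounted */

/* Stand-in for a group with a page limit (assumption, not the memcg struct). */
struct sketch_group {
	size_t charged_pages;
	size_t limit_pages;
};

/*
 * Returns true if the allocation may proceed. Accounting happens only
 * when the caller set SKETCH_GFP_KMEMCG; unmarked requests are untouched.
 */
static bool sketch_alloc_pages(struct sketch_group *grp, unsigned int gfp,
			       unsigned int order)
{
	size_t nr = (size_t)1 << order;

	if (!(gfp & SKETCH_GFP_KMEMCG))
		return true;	/* not marked: no accounting at all */

	if (grp->charged_pages + nr > grp->limit_pages)
		return false;	/* marked and over the limit: refuse */

	grp->charged_pages += nr;
	return true;
}
```

A caller that wants its pages tracked ORs SKETCH_GFP_KMEMCG into the flags
it already passes. Everything else keeps working as before.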

So to demonstrate how it would work, I am introducing a stack tracker here,
which is already a feature in its own right: it successfully stops fork bombs.
(Sorry for doing all your work, Frederic =p ). Note that once all the
memcg infrastructure is deployed, it becomes very easy to track anything.
The last patch of this series is extremely simple.
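
As a rough picture of why charging stacks stops a fork bomb, consider the
userspace sketch below. It is an assumption-laden model, not the code in the
last patch: SK_THREAD_SIZE is an arbitrary 8 KiB, and struct sk_kmem,
sk_fork() and sk_fork_bomb() are hypothetical names. Each simulated fork
charges one stack against the group's kmem limit and fails once the limit
would be exceeded.

```c
#include <assert.h>

/* Assumed stack size for the sketch; real values are per-architecture. */
#define SK_THREAD_SIZE 8192UL

/* Stand-in for a kmem counter with a limit, both in bytes. */
struct sk_kmem {
	unsigned long usage;
	unsigned long limit;
};

/* Charge one stack; return 0 on success, -1 if the limit would be hit. */
static int sk_fork(struct sk_kmem *k)
{
	if (k->usage + SK_THREAD_SIZE > k->limit)
		return -1;
	k->usage += SK_THREAD_SIZE;
	return 0;
}

/* Fork until refused and report how many tasks were created. */
static unsigned long sk_fork_bomb(struct sk_kmem *k)
{
	unsigned long n = 0;

	while (sk_fork(k) == 0)
		n++;
	return n;
}
```

With a 1 MiB limit the loop terminates after 128 successful forks, since
each one consumes 8 KiB of the budget.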

The infrastructure is exactly the same as the one we had in memcg, but stripped
of the slab parts. And because what we have after those patches is a feature
in its own right, I think it could be considered for merging.

Let me know what you think.

Glauber Costa (9):
  memcg: change defines to an enum
  kmem slab accounting basic infrastructure
  Add a __GFP_KMEMCG flag
  memcg: kmem controller infrastructure
  mm: Allocate kernel pages to the right memcg
  memcg: disable kmem code when not in use.
  memcg: propagate kmem limiting information to children
  memcg: allow a memcg with kmem charges to be destructed.
  protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs

Suleiman Souhlal (2):
  memcg: Make it possible to use the stock for more than one page.
  memcg: Reclaim when more than one page needed.

include/linux/gfp.h | 11 +-
include/linux/memcontrol.h | 46 +++++
include/linux/thread_info.h | 6 +
kernel/fork.c | 4 +-
mm/memcontrol.c | 395 ++++++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 27 +++
6 files changed, 464 insertions(+), 25 deletions(-)

1.7.10.2

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Replies

#1 Glauber Costa
June 25th, 2012 - 10:20 am ET
The current memcg slab cache management fails to present satisfactory hierarchical
behavior in the following scenario:

-> /cgroups/memory/A/B/C

* kmem limit set at A
* A and B empty taskwise
* bash in C does find /

Because kmem_accounted is a boolean that was not set for C, no accounting
would be done. This is, however, not what we expect.
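
The failure can be modelled in a few lines of userspace C. This is only a
sketch of the scenario above, not kernel code: struct sketch_cg and
sketch_charge() are hypothetical, and the charge propagation to parents is a
simplification. The check looks only at the group the task lives in, so a
limit set at A is invisible to a task in C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for a cgroup with the old boolean field. */
struct sketch_cg {
	struct sketch_cg *parent;
	bool kmem_accounted;	/* set only where the limit was written */
	unsigned long charged;
};

/*
 * Charge as a check on the boolean alone would: if the task's own group
 * is not marked, nothing is accounted anywhere in the hierarchy.
 */
static void sketch_charge(struct sketch_cg *cg, unsigned long pages)
{
	if (!cg->kmem_accounted)
		return;
	for (; cg; cg = cg->parent)
		cg->charged += pages;
}
```

Setting kmem_accounted on A and then charging from C leaves A's counter at
zero, which is the behavior described above.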

The basic idea is that when a cgroup is limited, we walk the tree
(something Kame and I already thought about doing for other purposes)
and make sure that we store the information about the parent being limited in
kmem_accounted (which is turned into a bitmap: two booleans would not be space
efficient). The code for that is taken from sched/core.c. My reason for not
putting it into a common place is to dodge the type issues that would arise
from a common implementation between memcg and the scheduler - but I think
that it should ultimately happen, so if you want me to do it now, let me
know.

We do the reverse operation when a formerly limited cgroup becomes unlimited.
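
The two-bit scheme, including the reverse operation, can be sketched in
userspace as follows. This is a simplified model and not the code in the
diff below: it uses plain bit operations instead of set_bit()/clear_bit(),
scans a flat array instead of using for_each_mem_cgroup_tree(), and checks
every ancestor up to the root when clearing. All sk_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_ACCOUNTED_THIS   (1UL << 0)	/* limit set on this group */
#define SK_ACCOUNTED_PARENT (1UL << 1)	/* limit set on some ancestor */

struct sk_cg {
	struct sk_cg *parent;
	unsigned long kmem_accounted;
};

/* True if root is a proper ancestor of cg. */
static bool sk_is_descendant(const struct sk_cg *cg, const struct sk_cg *root)
{
	for (cg = cg->parent; cg; cg = cg->parent)
		if (cg == root)
			return true;
	return false;
}

/* True if any ancestor of cg still has its own limit set. */
static bool sk_has_limited_ancestor(const struct sk_cg *cg)
{
	for (cg = cg->parent; cg; cg = cg->parent)
		if (cg->kmem_accounted & SK_ACCOUNTED_THIS)
			return true;
	return false;
}

/* Mark cg as limited or unlimited and update every descendant in all[]. */
static void sk_set_limited(struct sk_cg *all, size_t n, struct sk_cg *cg,
			   bool limited)
{
	size_t i;

	if (limited)
		cg->kmem_accounted |= SK_ACCOUNTED_THIS;
	else
		cg->kmem_accounted &= ~SK_ACCOUNTED_THIS;

	for (i = 0; i < n; i++) {
		struct sk_cg *it = &all[i];

		if (!sk_is_descendant(it, cg))
			continue;
		if (limited)
			it->kmem_accounted |= SK_ACCOUNTED_PARENT;
		else if (!sk_has_limited_ancestor(it))
			it->kmem_accounted &= ~SK_ACCOUNTED_PARENT;
	}
}

/* A group is accounted if either bit is set. */
static bool sk_should_account(const struct sk_cg *cg)
{
	return cg->kmem_accounted != 0;
}
```

With the A/B/C chain from the scenario, limiting A makes
sk_should_account() true for C. If B is limited as well and A is then
unlimited, C keeps its parent bit because B is still limited.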

Signed-off-by: Glauber Costa
CC: Christoph Lameter
CC: Pekka Enberg
CC: Michal Hocko
CC: Kamezawa Hiroyuki
CC: Johannes Weiner
CC: Suleiman Souhlal

mm/memcontrol.c | 86 +++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 71 insertions(+), 15 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fe5388e..a6a440b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -287,7 +287,11 @@ struct mem_cgroup {
* Should the accounting and control be hierarchical, per subtree?
*/
bool use_hierarchy;
- bool kmem_accounted;
+ /*
+ * bit0: accounted by this cgroup
+ * bit1: accounted by a parent.
+ */
+ volatile unsigned long kmem_accounted;

bool oom_lock;
atomic_t under_oom;
@@ -340,6 +344,9 @@ struct mem_cgroup {
#endif
};

+#define KMEM_ACCOUNTED_THIS 0
+#define KMEM_ACCOUNTED_PARENT 1
+
/* Stuffs for move charges at task migration. */
/*
* Types of charges to be moved. "move_charge_at_immitgrate" is treated as a
@@ -589,7 +596,7 @@ EXPORT_SYMBOL(__mem_cgroup_free_kmem_page);

static void disarm_kmem_keys(struct mem_cgroup *memcg)
{
- if (memcg->kmem_accounted)
+ if (test_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted))
static_key_slow_dec(&mem_cgroup_kmem_enabled_key);
}
#else
@@ -4027,6 +4034,66 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
len = scnprintf(str, sizeof(str), "%llu", (unsigned long long)val);
return simple_read_from_buffer(buf, nbytes, ppos, str, len);
}
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+static void mem_cgroup_update_kmem_limit(struct mem_cgroup *memcg, u64 val)
+{
+ struct mem_cgroup *iter;
+
+ mutex_lock(&set_limit_mutex);
+ if (!test_and_set_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted) &&
+ val != RESOURCE_MAX) {
+
+ /*
+ * Once enabled, can't be disabled. We could in theory
+ * disable it if we haven't yet created any caches, or
+ * if we can shrink them all to death.
+ *
+ * But it is not worth the trouble
+ */
+ static_key_slow_inc(&mem_cgroup_kmem_enabled_key);
+
+ if (!memcg->use_hierarchy)
+ goto out;
+
+ for_each_mem_cgroup_tree(iter, memcg) {
+ if (iter == memcg)
+ continue;
+ set_bit(KMEM_ACCOUNTED_PARENT, &iter->kmem_accounted);
+ }
+
+ } else if (test_and_clear_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted)
+ && val == RESOURCE_MAX) {
+
+ if (!memcg->use_hierarchy)
+ goto out;
+
+ for_each_mem_cgroup_tree(iter, memcg) {
+ struct mem_cgroup *parent;
+ if (iter == memcg)
+ continue;
+ /*
+ * We should only have our parent bit cleared if none of
+ * our parents are accounted. The traversal order of
+ * our iter function forces us to always look at the
+ * parents.
+ */
+ parent = parent_mem_cgroup(iter);
+ while (parent && (parent != memcg)) {
+ if (test_bit(KMEM_ACCOUNTED_THIS, &parent->kmem_accounted))
+ goto noclear;
+
+ parent = parent_mem_cgroup(parent);
+ }
+ clear_bit(KMEM_ACCOUNTED_PARENT, &iter->kmem_accounted);
+noclear:
+ continue;
+ }
+ }
+out:
+ mutex_unlock(&set_limit_mutex);
+}
+#endif
/*
* The user of this function is...
* RES_LIMIT.
@@ -4064,19 +4131,8 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
ret = res_counter_set_limit(&memcg->kmem, val);
if (ret)
break;
- /*
- * Once enabled, can't be disabled. We could in theory
- * disable it if we haven't yet created any caches, or
- * if we can shrink them all to death.
- *
- * But it is not worth the trouble
- */
- mutex_lock(&set_limit_mutex);
- if (!memcg->kmem_accounted && val != RESOURCE_MAX) {
- static_key_slow_inc(&mem_cgroup_kmem_enabled_key);
- memcg->kmem_accounted = true;
- }
- mutex_unlock(&set_limit_mutex);
+ mem_cgroup_update_kmem_limit(memcg, val);
+ break;
}
#endif
else
1.7.10.2
