[PATCH 00/16] Series short description

June 27th, 2012 - 11:00 pm ET by Paul Turner | Report spam
Hi all,

Please find attached the latest version for CFS load-tracking.

It implements load-tracking on a per-sched_entity (currently SCHED_NORMAL, but
could be extended to RT as well) basis. This results in a bottom-up
load-computation in which entities contribute to their parents' load, as
opposed to the current top-down where the parent averages its children. In
particular this allows us to correctly migrate load with their accompanying
entities and provides the necessary inputs for intelligent load-balancing and
power-management.

Modulo review I think this is now close to a mergable state.

Notable updates:
-
- Few stability issues fixed; in particular on systems with tasks having idle
periods spanning several days. Minor code cleanups.
- 32 bit performance improved (now reported faster[*] on 32-bit ARM than
without, presumably due to better load-balance or reduced overhead. Thanks
to Vincent Guittot and others for looking at this)
- All configs delinted
- Config dependencies seperated so that load-tracking can be used without
FAIR_GROUP_SCHED so that we may generally use it for power governors and
per-entity load-balance (in progress).

Thanks to Peter, Vincent, Morten, Nikunj, and others for testing and comments.

Rewinding some context since I've been buried with internal work and it's been
a while since my last posting:

Background:
-
We currently track the load average at the parenting cfs_rq level. We divide
execution into a series of consecutive "windows, w_i". Within each we track:
\Sum load_i * t_i where \Sum t_i = w and each load_i a disjoint load level.

The load average is then \Sum w_j / 2^n.

This this works reasonably well but there's a few problems
1) Since decay occurs at boundary window boundary there are 'skews':
At a window boundary the 'foreground' time has a bias against the
time immediately preceding it (as a result of the folding division)
e.g. __xx_|_yyyy___ vs __xx_yyyy_|___ (where '|' is a window boundary).

The accounting here is 2x + 4y/2 or 2x + 4y, depending on which side of the
window your load lands on.

2) Everything within a window 'w' is accounted equally, we only fold at
the boundaries. This actually means you can't set 'w' large enough to
really have meaningful coverage of the sched period without throwing
decay out the window. But then with w/2 << sched_period (currently
w/2=5ms) the average ends up having fairly drastic swings even with
stable loads.

(Note: Before we even looked at migrating to per-entity tracking we evaluating
foreground window into the considered average until it was "complete", this
represented a measurable improvement in stability under predictable load.)

3) Since the load sum average is per-cfs_rq and not per-entity when a task
entity migrates we lose its contribution to load-sum and effectively double
count it while it former sum decays.

New approach:
-
Instead of tracking load on a per-cfs_rq basis we do it on a per-sched_entity
basis. The load sum for a cfs_rq is then just the sum of its childrens' load
averages. The obvious immediately nice property is that when we migrate thread
A from cpu1-->cpu2, we carry its load with it; leaving the global load sum
unmodified.

The 'windows' above are replaced with more fine-grained tracking, that is (for
an entity j):

runnable_j = u_i * y^i , load_avg_j = runnable_j * weight_j [*]

Where: u_i is the usage in the last i`th ~ms and y is chosen such that
y^~sched_period = 1/2 (we currently choose k2).This means that load tracked 1
sched_period ago contributes about ~50% as current execution.

Now, the real challenge in tracking at an entity basis is handling blocked
entities. Obviously runnable entities will be updated naturally as we iterate
through them but there could be O(many) blocked entities so we can't afford to
iterate through them and update their tracking.

That our decay for a unit period is exponential introduces a particularly nice
property here:
We can separate the contributed load on a cfs_rq into blocked and runnable.
Runnable load is just the sum of all load_avg_j above, maintained by the
entities themselves as they run and self update (when they update their
load_avg they update the cumulative sum also).

Blocked load then looks like:
load_avg_j = weight_k * \Sum u[k]_n * y^n

To account an entirely idle period we then only need to multiply by y.

This ends up being orders of magnitude more accurate than the current
tracking schema, even with the old shares_window massively dilated to
better capture a stable load-state. It's also obviously stable under
migration.

As referenced above this also allows us to potentially improve decisions within
the load-balancer, for both distribution and power-management.

Example: consider 1x80% task and 2x40% tasks on a 2-core machine. It's
currently a bit of a gamble as to whether you get an {AB, B} or {A, BB} split
since they have equal weight (assume 1024). With per-task tracking we can
actually consider them at their contributed weight and see a stable ~{800,{400,
400}} load-split. Likewise within balance_tasks we can consider the load
migrated to be that actually contributed. We have some patches looking at this
and will post them as they mature.

Thanks,

- Paul



Ben Segall (1):
sched: maintain per-rq runnable averages

Paul Turner (15):
sched: track the runnable average on a per-task entitiy basis
sched: aggregate load contributed by task entities on parenting cfs_rq
sched: maintain the load contribution of blocked entities
sched: add an rq migration call-back to sched_class
sched: account for blocked load waking back up
sched: aggregate total task_group load
sched: compute load contribution by a group entity
sched: normalize tg load contributions against runnable time
sched: maintain runnable averages across throttled periods
sched: replace update_shares weight distribution with per-entity computation
sched: refactor update_shares_cpu() -> update_blocked_avgs()
sched: update_cfs_shares at period edge
sched: make __update_entity_runnable_avg() fast
sched: implement usage tracking
sched: introduce temporary FAIR_GROUP_SCHED dependency for load-tracking


include/linux/sched.h | 18 +
kernel/sched/core.c | 5
kernel/sched/debug.c | 39 ++
kernel/sched/fair.c | 793 +++++++++++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 60 ++--
5 files changed, 720 insertions(+), 195 deletions(-)

Signature

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
email Follow the discussionReplies 39 repliesReplies Make a reply

Replies

#1 Paul Turner
June 27th, 2012 - 11:00 pm ET | Report spam
We are currently maintaining:
runnable_load(cfs_rq) = \Sum task_load(t)

For all running children t of cfs_rq. While this can be naturally updated for
tasks in a runnable state (as they are scheduled); this does not account for
the load contributed by blocked task entities.

This can be solved by introducing a separate accounting for blocked load:
blocked_load(cfs_rq) = \Sum runnable(b) * weight(b)

Obviously we do not want to iterate over all blocked entities to account for
their decay, we instead observe that:
runnable_load(t) = \Sum p_i*y^i

and that to account for an additional idle period we only need to compute:
y*runnable_load(t).

This means that we can compute all blocked entities at once by evaluating:
blocked_load(cfs_rq)` = y * blocked_load(cfs_rq)

Finally we maintain a decay counter so that when a sleeping entity re-awakens
we can determine how much of its load should be removed from the blocked sum.

Signed-off-by: Paul Turner
Signed-off-by: Ben Segall

include/linux/sched.h | 1
kernel/sched/core.c | 3 +
kernel/sched/debug.c | 3 +
kernel/sched/fair.c | 130 ++++++++++++++++++++++++++++++++++++++++++++--
kernel/sched/sched.h | 4 +-
5 files changed, 126 insertions(+), 15 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0c54ce0..842c4df 100644
a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1139,6 +1139,7 @@ struct load_weight {
struct sched_avg {
u32 runnable_avg_sum, runnable_avg_period;
u64 last_runnable_update;
+ s64 decay_count;
unsigned long load_avg_contrib;
};

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9bb7d28..aeb8e56 100644
a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1713,6 +1713,9 @@ static void __sched_fork(struct task_struct *p)
p->se.vruntime = 0;
INIT_LIST_HEAD(&p->se.group_node);

+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+ p->se.avg.decay_count = 0;
+#endif
#ifdef CONFIG_SCHEDSTATS
memset(&p->se.statistics, 0, sizeof(p->se.statistics));
#endif
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index aeb74e3..2aa60cf 100644
a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -95,6 +95,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
P(se->avg.runnable_avg_sum);
P(se->avg.runnable_avg_period);
P(se->avg.load_avg_contrib);
+ P(se->avg.decay_count);
#endif
#undef PN
#undef P
@@ -230,6 +231,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
atomic_read(&cfs_rq->tg->load_weight));
SEQ_printf(m, " .%-30s: %lld", "runnable_load_avg",
cfs_rq->runnable_load_avg);
+ SEQ_printf(m, " .%-30s: %lld", "blocked_load_avg",
+ cfs_rq->blocked_load_avg);
#endif

print_cfs_group_stats(m, cpu, cfs_rq->tg);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8229766..6200d20 100644
a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1085,6 +1085,20 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
return decayed;
}

+/* Synchronize an entity's decay with its parentin cfs_rq.*/
+static inline void __synchronize_entity_decay(struct sched_entity *se)
+{
+ struct cfs_rq *cfs_rq = cfs_rq_of(se);
+ u64 decays = atomic64_read(&cfs_rq->decay_counter);
+
+ decays -= se->avg.decay_count;
+ if (!decays)
+ return;
+
+ se->avg.load_avg_contrib = decay_load(se->avg.load_avg_contrib, decays);
+ se->avg.decay_count += decays;
+}
+
/* Compute the current contribution to load_avg by se, return any delta */
static long __update_entity_load_avg_contrib(struct sched_entity *se)
{
@@ -1100,8 +1114,18 @@ static long __update_entity_load_avg_contrib(struct sched_entity *se)
return se->avg.load_avg_contrib - old_contrib;
}

+static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
+ long load_contrib)
+{
+ if (likely(load_contrib < cfs_rq->blocked_load_avg))
+ cfs_rq->blocked_load_avg -= load_contrib;
+ else
+ cfs_rq->blocked_load_avg = 0;
+}
+
/* Update a sched_entity's runnable average */
-static inline void update_entity_load_avg(struct sched_entity *se)
+static inline void update_entity_load_avg(struct sched_entity *se,
+ int update_cfs_rq)
{
struct cfs_rq *cfs_rq = cfs_rq_of(se);
long contrib_delta;
@@ -1111,8 +1135,34 @@ static inline void update_entity_load_avg(struct sched_entity *se)
return;

contrib_delta = __update_entity_load_avg_contrib(se);
+
+ if (!update_cfs_rq)
+ return;
+
if (se->on_rq)
cfs_rq->runnable_load_avg += contrib_delta;
+ else
+ subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
+}
+
+/*
+ * Decay the load contributed by all blocked children and account this so that
+ * they their contribution may appropriately discounted when they wake up.
+ */
+static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq)
+{
+ u64 now = rq_of(cfs_rq)->clock_task >> 20;
+ u64 decays;
+
+ decays = now - cfs_rq->last_decay;
+ if (!decays)
+ return;
+
+ cfs_rq->blocked_load_avg = decay_load(cfs_rq->blocked_load_avg,
+ decays);
+ atomic64_add(decays, &cfs_rq->decay_counter);
+
+ cfs_rq->last_decay = now;
}

static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
@@ -1122,26 +1172,56 @@ static inline void update_rq_runnable_avg(struct rq *rq, int runnable)

/* Add the load generated by se into cfs_rq's child load-average */
static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
- struct sched_entity *se)
-{
- update_entity_load_avg(se);
+ struct sched_entity *se,
+ int wakeup)
+{
+ /* we track migrations using entity decay_count == 0 */
+ if (unlikely(!se->avg.decay_count)) {
+ se->avg.last_runnable_update = rq_of(cfs_rq)->clock_task;
+ se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
+ wakeup = 0;
+ } else {
+ __synchronize_entity_decay(se);
+ }
+
+ if (wakeup)
+ subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
+
+ update_entity_load_avg(se, 0);
cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
+ update_cfs_rq_blocked_load(cfs_rq);
}

-/* Remove se's load from this cfs_rq child load-average */
+/*
+ * Remove se's load from this cfs_rq child load-average, if the entity is
+ * transitioning to a blocked state we track its projected decay using
+ * blocked_load_avg.
+ */
static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
- struct sched_entity *se)
+ struct sched_entity *se,
+ int sleep)
{
- update_entity_load_avg(se);
+ update_entity_load_avg(se, 1);
+
cfs_rq->runnable_load_avg -= se->avg.load_avg_contrib;
+ if (sleep) {
+ cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
+ se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
+ } else {
+ se->avg.decay_count = 0;
+ }
}
#else
-static inline void update_entity_load_avg(struct sched_entity *se) {}
+static inline void update_entity_load_avg(struct sched_entity *se,
+ int update_cfs_rq) {}
static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
- struct sched_entity *se) {}
+ struct sched_entity *se,
+ int wakeup) {}
static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
- struct sched_entity *se) {}
+ struct sched_entity *se,
+ int sleep) {}
+static inline void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq) {}
#endif

static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -1270,7 +1350,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
*/
update_curr(cfs_rq);
update_cfs_load(cfs_rq, 0);
- enqueue_entity_load_avg(cfs_rq, se);
+ enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP);
account_entity_enqueue(cfs_rq, se);
update_cfs_shares(cfs_rq);

@@ -1345,7 +1425,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* Update run-time statistics of the 'current'.
*/
update_curr(cfs_rq);
- dequeue_entity_load_avg(cfs_rq, se);
+ dequeue_entity_load_avg(cfs_rq, se, flags & DEQUEUE_SLEEP);

update_stats_dequeue(cfs_rq, se);
if (flags & DEQUEUE_SLEEP) {
@@ -1516,7 +1596,7 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
/* Put 'current' back into the tree. */
__enqueue_entity(cfs_rq, prev);
/* in !on_rq case, update occurred at dequeue */
- update_entity_load_avg(prev);
+ update_entity_load_avg(prev, 1);
}
cfs_rq->curr = NULL;
}
@@ -1532,7 +1612,8 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
/*
* Ensure that runnable average is periodically updated.
*/
- update_entity_load_avg(curr);
+ update_entity_load_avg(curr, 1);
+ update_cfs_rq_blocked_load(cfs_rq);

/*
* Update share accounting for long-running entities.
@@ -2391,6 +2472,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)

update_cfs_load(cfs_rq, 0);
update_cfs_shares(cfs_rq);
+ update_entity_load_avg(se, 1);
}

if (!se) {
@@ -2452,6 +2534,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)

update_cfs_load(cfs_rq, 0);
update_cfs_shares(cfs_rq);
+ update_entity_load_avg(se, 1);
}

if (!se) {
@@ -3557,6 +3640,7 @@ static int update_shares_cpu(struct task_group *tg, int cpu)

update_rq_clock(rq);
update_cfs_load(cfs_rq, 1);
+ update_cfs_rq_blocked_load(cfs_rq);

/*
* We need to update shares after updating tg->load_weight in
@@ -5379,6 +5463,21 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
place_entity(cfs_rq, se, 0);
se->vruntime -= cfs_rq->min_vruntime;
}
+
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+ /*
+ * Remove our load from contribution when we leave sched_fair
+ * and ensure we don't carry in an old decay_count if we
+ * switch back.
+ */
+ if (p->se.avg.decay_count) {
+ struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
+ __synchronize_entity_decay(&p->se);
+ subtract_blocked_load_contrib(cfs_rq,
+ p->se.avg.load_avg_contrib);
+ p->se.avg.decay_count = 0;
+ }
+#endif
}

/*
@@ -5425,6 +5524,9 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
#ifndef CONFIG_64BIT
cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
#endif
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+ atomic64_set(&cfs_rq->decay_counter, 1);
+#endif
}

#ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 26cc36f..a96adf1 100644
a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -229,7 +229,9 @@ struct cfs_rq {
* This allows for the description of both thread and group usage (in
* the FAIR_GROUP_SCHED case).
*/
- u64 runnable_load_avg;
+ u64 runnable_load_avg, blocked_load_avg;
+ atomic64_t decay_counter;
+ u64 last_decay;
#endif
#ifdef CONFIG_FAIR_GROUP_SCHED
struct rq *rq; /* cpu runqueue to which this cfs_rq is attached */


To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Similar topics