[RFC PATCH V1 0/2] cpuidle: global registration of idle states with per-cpu statistics

March 22nd, 2011 - 07:50 am ET by Trinabh Gupta | Report spam
This patch series is an early RFC to discuss the feasibility of
avoiding registering of idle states from each cpu.

The core change is to split the cpuidle_device structure into parts
that can be global and parts that has to remain per-cpu. The per-cpu
pieces are mostly generic statistics that can be independent of
current running driver.

Motivation:
* Simplify the cpuidle subsystem framework and have
registration/unregistration done by single cpu.

* Minimise the data structure that needs to be maintained for multiple
cpuidle drivers

* Reference: https://lkml.org/lkml/2011/2/10/37

Advantages:
* Make the cpuidle framework simple for most use cases where C-States
are symmetric. In case there are asymmetric C-States detected,
fallback mechanism should be incorporated to maintain the system
functional
https://lkml.org/lkml/2011/2/10/257
https://lkml.org/lkml/2011/2/10/37

* Non x86 archs that does not have asymmetric C-States like POWER, may
not need the fallback mechanism and hence the framework will be
simple for most use cases.

Disadvantages:
* Asymmetric C-States are part of x86 ACPI specification. Incorrect
handling may functionally affect the system

* Incorporating per-cpu masks for each state to allow/dis-allow global
states on subset of CPUs may result in an implementation that is
not better than current solution of having per-cpu states.

This patch series applies on top of the pm_idle cleanup patch
https://lkml.org/lkml/2011/3/22/150 (cpuidle: Cleanup pm_idle and
include driver/cpuidle.c in-kernel)

This patch series is tested on x86 Nehalem system with multiple ACPI
C-States.

This patch series has limitations of not handling multiple driver
registration and switching between drivers on all CPUs mainly due to
incomplete handling of per-cpu enable/disable and driver_data.

Please let us know your comments and suggest possible approaches to
the problem.



Trinabh Gupta (2):
cpuidle: API changes in callers using new cpuidle_state_stats
cpuidle: Data structure changes for global cpuidle device


drivers/acpi/processor_driver.c | 19 +
drivers/acpi/processor_idle.c | 45 ++++++-
drivers/cpuidle/cpuidle.c | 169 ++++++++++++++++++++++--
drivers/cpuidle/cpuidle.h | 4 -
drivers/cpuidle/driver.c | 24 --
drivers/cpuidle/governor.c | 10 +-
drivers/cpuidle/governors/Makefile | 2
drivers/cpuidle/governors/menu.c | 16 ++-
drivers/cpuidle/sysfs.c | 62 +++++++
drivers/idle/default_driver.c | 39 +++--
include/linux/cpuidle.h | 70 ++++++++++--
11 files changed, 245 insertions(+), 215 deletions(-)

-Trinabh
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
email Follow the discussionReplies 6 repliesReplies Make a reply

Similar topics

Replies

#1 Len Brown
March 25th, 2011 - 02:30 am ET | Report spam
On Tue, 22 Mar 2011, Trinabh Gupta wrote:

This patch series is an early RFC to discuss the feasibility of
avoiding registering of idle states from each cpu.

The core change is to split the cpuidle_device structure into parts
that can be global and parts that has to remain per-cpu. The per-cpu
pieces are mostly generic statistics that can be independent of
current running driver.

Motivation:
* Simplify the cpuidle subsystem framework and have
registration/unregistration done by single cpu.

* Minimise the data structure that needs to be maintained for multiple
cpuidle drivers

* Reference: https://lkml.org/lkml/2011/2/10/37

Advantages:
* Make the cpuidle framework simple for most use cases where C-States
are symmetric. In case there are asymmetric C-States detected,
fallback mechanism should be incorporated to maintain the system
functional
https://lkml.org/lkml/2011/2/10/257
https://lkml.org/lkml/2011/2/10/37

* Non x86 archs that does not have asymmetric C-States like POWER, may
not need the fallback mechanism and hence the framework will be
simple for most use cases.

Disadvantages:
* Asymmetric C-States are part of x86 ACPI specification. Incorrect
handling may functionally affect the system



I think this is a non-issue.

* Incorporating per-cpu masks for each state to allow/dis-allow global
states on subset of CPUs may result in an implementation that is
not better than current solution of having per-cpu states.



I don't think this is needed.

This patch series applies on top of the pm_idle cleanup patch
https://lkml.org/lkml/2011/3/22/150 (cpuidle: Cleanup pm_idle and
include driver/cpuidle.c in-kernel)

This patch series is tested on x86 Nehalem system with multiple ACPI
C-States.

This patch series has limitations of not handling multiple driver
registration and switching between drivers on all CPUs mainly due to
incomplete handling of per-cpu enable/disable and driver_data.



I think this series is more important than the feature of multiple
driver/system support, and thus should come first.

thanks,
Len Brown, Intel Open Source Technology Center

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Replies Reply to this message
#2 Len Brown
March 25th, 2011 - 03:20 am ET | Report spam
I agree it is silly to allocate a cpuidle_device
for every cpu in the system as we do today.

Yes, splitting the counters out of cpuidle_device
is a necessary part of fixing that.

However, cpuidle_device.cpuidle_state[] is currently not per-driver,
it is per-cpu, and it is writable.

In particular, the cpuidle_device->prepare() mechanism
causes updates to the cpuidle_state[].flags,
setting and clearing CPUIDLE_FLAG_IGNORE to
tell the governor not to chose a state
on a per-cpu basis at run-time.

I don't like that mechanism.
I'd like to see it replaced, and when replaced,
cpuidle_state[] can be per system-wide driver.

I think the real problem that prepare() was trying to solve
is that the driver today does not have the ability to over-rule
the choice made by the governor. The driver may discover
in the course of trying to satisfy the request of the governor
that it needs to demote to a shallower state; or it may
do its best to satisfy the governor's request, and the hardware
may demote its request to a shallower state.

Unfortunately, when this happens, the driver dutifully
returns the time spent in the state to cpuidle_idle_call(),
who then updates the wrong last_residency, time, and usage counters.

Sure is ironic for the driver to allocate the data structures and
then hand the timer to the uppper layer, just to have the upper layer
update the wrong data structures...

Surely the driver enter routine should update the counters
that the driver was obligated to allocate, and it should return
the state actually entered (for tracing), rather than the time spent
there.

The generic cpuidle code should simply handle where the counters live
in the sysfs namespace, not updating the counters.

This needs to be addressed before cpuidle_device.cpuidle_state[]
can be made one/system.

cheers,
Len Brown, Intel Open Source Technology Center

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Replies Reply to this message
#3 Vaidyanathan Srinivasan
March 25th, 2011 - 12:30 pm ET | Report spam
* Len Brown [2011-03-25 03:28:59]:

On Tue, 22 Mar 2011, Trinabh Gupta wrote:

> This patch series is an early RFC to discuss the feasibility of
> avoiding registering of idle states from each cpu.
>
> The core change is to split the cpuidle_device structure into parts
> that can be global and parts that has to remain per-cpu. The per-cpu
> pieces are mostly generic statistics that can be independent of
> current running driver.
>
> Motivation:
> * Simplify the cpuidle subsystem framework and have
> registration/unregistration done by single cpu.
>
> * Minimise the data structure that needs to be maintained for multiple
> cpuidle drivers
>
> * Reference: https://lkml.org/lkml/2011/2/10/37
>
> Advantages:
> * Make the cpuidle framework simple for most use cases where C-States
> are symmetric. In case there are asymmetric C-States detected,
> fallback mechanism should be incorporated to maintain the system
> functional
> https://lkml.org/lkml/2011/2/10/257
> https://lkml.org/lkml/2011/2/10/37
>
> * Non x86 archs that does not have asymmetric C-States like POWER, may
> not need the fallback mechanism and hence the framework will be
> simple for most use cases.
>
> Disadvantages:
> * Asymmetric C-States are part of x86 ACPI specification. Incorrect
> handling may functionally affect the system

I think this is a non-issue.



Hi Len,

Thanks for the confirmation. This gives us the right direction to
proceed with the cleanup.

> * Incorporating per-cpu masks for each state to allow/dis-allow global
> states on subset of CPUs may result in an implementation that is
> not better than current solution of having per-cpu states.

I don't think this is needed.

> This patch series applies on top of the pm_idle cleanup patch
> https://lkml.org/lkml/2011/3/22/150 (cpuidle: Cleanup pm_idle and
> include driver/cpuidle.c in-kernel)
>
> This patch series is tested on x86 Nehalem system with multiple ACPI
> C-States.
>
> This patch series has limitations of not handling multiple driver
> registration and switching between drivers on all CPUs mainly due to
> incomplete handling of per-cpu enable/disable and driver_data.

I think this series is more important than the feature of multiple
driver/system support, and thus should come first.



Removing pm_idle() gets us to the situation where we have to support
multiple registration to allow default_idle driver to register and do
mwait. Your cleanup patch to x86 idle will make the transition much
simple. Lets us have removal of pm_idle() and removal of per-cpu
registration in the same patch series as the next step.


To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Replies Reply to this message
#4 Vaidyanathan Srinivasan
March 25th, 2011 - 12:50 pm ET | Report spam
* Len Brown [2011-03-25 04:12:03]:

I agree it is silly to allocate a cpuidle_device
for every cpu in the system as we do today.

Yes, splitting the counters out of cpuidle_device
is a necessary part of fixing that.

However, cpuidle_device.cpuidle_state[] is currently not per-driver,
it is per-cpu, and it is writable.

In particular, the cpuidle_device->prepare() mechanism
causes updates to the cpuidle_state[].flags,
setting and clearing CPUIDLE_FLAG_IGNORE to
tell the governor not to chose a state
on a per-cpu basis at run-time.

I don't like that mechanism.
I'd like to see it replaced, and when replaced,
cpuidle_state[] can be per system-wide driver.



Thanks for the detailed review. I agree that we should rework
handling of the cpuidle_state[].flags. However, is the prepare()
mechanism used at all? Can we remove the option completely?

I think the real problem that prepare() was trying to solve
is that the driver today does not have the ability to over-rule
the choice made by the governor. The driver may discover
in the course of trying to satisfy the request of the governor
that it needs to demote to a shallower state; or it may
do its best to satisfy the governor's request, and the hardware
may demote its request to a shallower state.

Unfortunately, when this happens, the driver dutifully
returns the time spent in the state to cpuidle_idle_call(),
who then updates the wrong last_residency, time, and usage counters.



I did not get this scenario. Are you saying

target_state->enter(dev, target_state) can enter a different state
than the one suggested by target_state?

I understand the hardware demotion part, but can we really detect the
target 'demoted' state in that case? I guess not.

Sure is ironic for the driver to allocate the data structures and
then hand the timer to the uppper layer, just to have the upper layer
update the wrong data structures...

Surely the driver enter routine should update the counters
that the driver was obligated to allocate, and it should return
the state actually entered (for tracing), rather than the time spent
there.



Can we do something like this:

last_state = target_state->enter(dev, target_state)

dev->last_state and dev->last_residency are updated inside
target_state->enter()

The returned last_state is just for tracing, actual data is already
updated in the cpuidle_dev structure and used for sysfs display.

The generic cpuidle code should simply handle where the counters live
in the sysfs namespace, not updating the counters.
This needs to be addressed before cpuidle_device.cpuidle_state[]
can be made one/system.



Agreed.

Thanks again for the recommendations.


To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Replies Reply to this message
#5 Len Brown
April 01st, 2011 - 08:10 pm ET | Report spam
> I think the real problem that prepare() was trying to solve
> is that the driver today does not have the ability to over-rule
> the choice made by the governor. The driver may discover
> in the course of trying to satisfy the request of the governor
> that it needs to demote to a shallower state; or it may
> do its best to satisfy the governor's request, and the hardware
> may demote its request to a shallower state.
>
> Unfortunately, when this happens, the driver dutifully
> returns the time spent in the state to cpuidle_idle_call(),
> who then updates the wrong last_residency, time, and usage counters.

I did not get this scenario. Are you saying

target_state->enter(dev, target_state) can enter a different state
than the one suggested by target_state?



Yes.

I understand the hardware demotion part, but can we really detect the
target 'demoted' state in that case? I guess not.



In addition to HW demotion, SW may also notice at the last minute
that some criteria for entering a state is not met, and need
to demote in SW.

Effectively it is the same as HW demotion, except SW can see
that it is going to happen ahead of time.

The example is MRST S0i3, where a number of dependencies on
SOC device state must be met to enter the state, and those
dependencies can change state at any time.

> Sure is ironic for the driver to allocate the data structures and
> then hand the timer to the uppper layer, just to have the upper layer
> update the wrong data structures...
>
> Surely the driver enter routine should update the counters
> that the driver was obligated to allocate, and it should return
> the state actually entered (for tracing), rather than the time spent
> there.

Can we do something like this:

last_state = target_state->enter(dev, target_state)

dev->last_state and dev->last_residency are updated inside
target_state->enter()

The returned last_state is just for tracing, actual data is already
updated in the cpuidle_dev structure and used for sysfs display.



dev->last_state and
dev->last_residency
seem like part of hacky support for run-time changing of c-states.

I think if we move the couter incrementing into the driver,
then the upper layer shouldn't need to care about it.
The driver will always have a better idea of what happened
than cpuidle_idle_call().

> The generic cpuidle code should simply handle where the counters live
> in the sysfs namespace, not updating the counters.
> This needs to be addressed before cpuidle_device.cpuidle_state[]
> can be made one/system.

Agreed.



thanks,
Len Brown, Intel Open Source Technolgy Center

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Replies Reply to this message
Help Create a new topicNext page Replies Make a reply
Search Make your own search