
Device Power Management


Device power management in the kernel is made possible by the new
driver model in the 2.5 kernel. In fact, the driver model was inspired
by the requirement to implement decent power management in the kernel.
The new driver model allows generic kernel code to communicate with
every device in the system, regardless of the bus the device resides
on or the class it belongs to.

The driver model also provides a hierarchical representation of the
devices in the system. This is key to power management, since the
kernel cannot power down a device while another device that is still
powered up relies on it for power. For example, the system cannot
power down a parent device whose children are still powered up and
depend on their parent for power.


In its simplest form, device power management consists of a
description of the state a device is in, and a mechanism for
controlling those states. Device power states are described as 'D'
states, and consist of states D0-D3, inclusive. This device state
representation is inspired by the PCI device specification and the
ACPI specification [ACPI]. Though not all device types define power
states in this way, this representation can map on to all known
device types. 

Each D state represents a tradeoff between the amount of power a
device is consuming and how functional a device is. In a lower power
state (represented by a higher digit following D), some amount of
power to a device is lost. This means that some of the device's
operating state is lost, and must be restored by its driver when
returning to the D0 state. 

D0 represents the state when the device is fully powered on and ready
for, or in, use. This state is implicitly supported by every device,
since every device may be powered on at some point while the system is
running. In this state, all units of a device are powered on, and no
device state is lost.

D3 represents the state when the device is off. This state is also
implicitly supported by every device, since every device is implicitly
powered off when the system is powered off. In this state, all device
context is lost and must be restored before using the device
again. This usually means the device must also be completely
reinitialized.

The PCI Power Management spec goes on to define D3hot as a D3 state
that is entered via driver control, and D3cold as one that is entered
when the entire system is powered down. In D3hot, the device may not
lose all operating power, so less restoration is required. This is,
however, device-dependent. The kernel does not distinguish between
the two, though a driver theoretically could take extra steps to do
so.

D1 and D2 are intermediate power states that are optionally supported
by a device. In each case, the device is not functional, but is not
entirely powered off either. Bringing the device back to an operating
state requires less work than reviving it from D3. In D1, more power
is consumed than in D2, but more device context is preserved.

A device's power management information is stored in struct
device_pm:

struct device_pm {
#ifdef CONFIG_PM
        dev_power_t     power_state;
        u8              * saved_state;
        atomic_t        depend;
        atomic_t        disable;
        struct kobject  kobj;
#endif
};

struct device contains a statically allocated device_pm object. The
configuration dependency on CONFIG_PM guarantees the overhead for the
structure is nil when power management support is not compiled in. 

The kernel defines the following power states in include/linux/pm.h:

typedef enum {
        DEVICE_PM_ON,
        DEVICE_PM_INT1,
        DEVICE_PM_INT2,
        DEVICE_PM_OFF,
        DEVICE_PM_UNKNOWN,
} dev_power_t;

When a device is registered, its initial power state is set to
DEVICE_PM_UNKNOWN. The device driver may query the device and
initialize the known power state using

void device_pm_init_power_state(struct device * dev, dev_power_t state);
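
As a rough illustration of the query-then-initialize step, the sketch
below decodes the two low-order bits of a PCI power management
control/status register value (which encode D0-D3) into a dev_power_t
and records it. It is a userspace model, not kernel code: the structs
are cut-down stand-ins, the 'pm' field name and the helper
pmcsr_to_power_state() are hypothetical, and the body of
device_pm_init_power_state() is an assumption about its behaviour.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the kernel types described above (simplified). */
typedef enum {
        DEVICE_PM_ON,
        DEVICE_PM_INT1,
        DEVICE_PM_INT2,
        DEVICE_PM_OFF,
        DEVICE_PM_UNKNOWN,
} dev_power_t;

struct device_pm {
        dev_power_t     power_state;
};

struct device {
        struct device_pm pm;    /* field name is hypothetical */
};

/* Assumed behaviour: record the state reported by the driver. */
void device_pm_init_power_state(struct device * dev, dev_power_t state)
{
        dev->pm.power_state = state;
}

/*
 * Hypothetical helper: bits 1:0 of the PCI PM control/status register
 * hold the current D state, which lines up with dev_power_t.
 */
dev_power_t pmcsr_to_power_state(uint16_t pmcsr)
{
        switch (pmcsr & 0x3) {
        case 0:  return DEVICE_PM_ON;
        case 1:  return DEVICE_PM_INT1;
        case 2:  return DEVICE_PM_INT2;
        default: return DEVICE_PM_OFF;
        }
}
```

A driver's probe routine would read the register from the device and
pass the decoded value to device_pm_init_power_state().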


Controlling a Device's State

A device's power state may be controlled by the suspend() and resume()
methods in struct device_driver:

  int     (*suspend)      (struct device * dev, u32 state, u32 level);
  int     (*resume)       (struct device * dev, u32 level);

These methods may be initialized by the low-level device driver,
though they are typically initialized at registration time by the bus
driver that the driver belongs to. The bus's functions should forward
power management requests to the bus-specific driver, modifying the
semantics where necessary. 

This model is used to provide the easiest route when converting to the
new driver model. However, a device driver's explicit initialization
of these methods will be honored. 

The same methods are called during individual device power management
transitions and system power management transitions.


There are two steps to suspending a device and two steps to resuming
it. In order to suspend a device, two separate calls are made to the
suspend() method - one to save state, and another to power the device
down. Conversely, one call is made to the resume() method to power the
device up, and another to restore device state.

These steps are encoded as follows:

enum {
        SUSPEND_SAVE_STATE,
        SUSPEND_POWER_DOWN,
};

enum {
        RESUME_POWER_ON,
        RESUME_RESTORE_STATE,
};

and are passed as the 'level' parameter to each method. 

During the SUSPEND_SAVE_STATE call, the driver is expected to stop all
device requests and save all relevant device context based on the
state the device is entering. 

This call is made in process context, so the driver may sleep and
allocate memory to save state. However, during system suspend, backing
swap devices may have already been powered down, so drivers should
use GFP_ATOMIC when allocating memory.

SUSPEND_POWER_DOWN is used only to physically power the device
down. This call has some caveats, and drivers must be aware of
them. Interrupts will be disabled when this method is called. However,
during run-time device power management, interrupts will be re-enabled
once the call returns. Some devices are known to cause problems once
they are powered down and interrupts are re-enabled - e.g. flooding
the system with interrupts. Drivers should be careful not to service
power management requests for devices known to be buggy.

During system power management, interrupts are disabled and remain
disabled while powering down all devices in the system.

The resume sequence is the reverse of the suspend sequence. The
RESUME_POWER_ON stage is performed first, with interrupts disabled.
The driver is expected to power the device on. Interrupts are then
enabled and the RESUME_RESTORE_STATE stage is performed, in which the
driver is expected to restore device state and free memory that was
previously allocated.

A driver may use the struct device_pm::saved_state field to store a
pointer to device state when the device is powered down.
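
The two-stage protocol can be sketched as a pair of driver methods
that switch on 'level'. This is a userspace model under stated
assumptions: the 'foo' device, its register array and its 'powered'
flag are hypothetical, struct device is reduced to what the sketch
needs, and malloc() stands in for a kernel allocation (which, per the
note above, should use GFP_ATOMIC during system suspend).

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned int u32;
typedef unsigned char u8;

enum { SUSPEND_SAVE_STATE, SUSPEND_POWER_DOWN };
enum { RESUME_POWER_ON, RESUME_RESTORE_STATE };

#define FOO_NREGS 4     /* hypothetical device with four registers */

/* Cut-down stand-in for struct device; not the kernel definition. */
struct device {
        u8      * saved_state;          /* mirrors device_pm::saved_state */
        u8      regs[FOO_NREGS];        /* pretend hardware registers */
        int     powered;                /* pretend power rail */
};

int foo_suspend(struct device * dev, u32 state, u32 level)
{
        (void)state;    /* a real driver would save context based on it */

        switch (level) {
        case SUSPEND_SAVE_STATE:
                /* Process context: stop requests, then save context. */
                dev->saved_state = malloc(FOO_NREGS);
                if (!dev->saved_state)
                        return -ENOMEM;
                memcpy(dev->saved_state, dev->regs, FOO_NREGS);
                break;
        case SUSPEND_POWER_DOWN:
                /* Interrupts disabled: only cut power; context is lost. */
                memset(dev->regs, 0, FOO_NREGS);
                dev->powered = 0;
                break;
        }
        return 0;
}

int foo_resume(struct device * dev, u32 level)
{
        switch (level) {
        case RESUME_POWER_ON:
                /* Interrupts disabled: only restore power. */
                dev->powered = 1;
                break;
        case RESUME_RESTORE_STATE:
                /* Restore context and free the buffer saved earlier. */
                if (dev->saved_state) {
                        memcpy(dev->regs, dev->saved_state, FOO_NREGS);
                        free(dev->saved_state);
                        dev->saved_state = NULL;
                }
                break;
        }
        return 0;
}
```

The caller is expected to invoke foo_suspend() with SUSPEND_SAVE_STATE
and then SUSPEND_POWER_DOWN, and later foo_resume() with
RESUME_POWER_ON followed by RESUME_RESTORE_STATE.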


Power Dependencies

Devices that are children of other devices (e.g. devices behind a PCI
bridge) depend on their parent devices being powered up, either to
supply them with power, to forward I/O transactions to them, or both.

The system must respect the power dependencies of devices and must not
attempt to power down a device that another device depends on being
on. Put another way, all child devices must be powered down before
their parent can be powered down. Conversely, the parent device must
be powered up before any child devices may be accessed.
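
The ordering rule can be pictured as a walk over the device tree:
children first when powering down, parent first when powering up. The
sketch below is only a model of that ordering. The node layout, the
power_seq counter and both function names are hypothetical and are not
taken from the PM core.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4

/* Hypothetical, simplified device node; not the kernel's struct device. */
struct device {
        struct device   * children[MAX_CHILDREN];
        int             nr_children;
        int             powered;
        int             last_seq;       /* order of last power change */
};

static int power_seq;

/* Power a subtree down, children first, so that no child is left
 * powered up beneath an unpowered parent. */
void power_down_tree(struct device * dev)
{
        int i;

        assert(dev->nr_children <= MAX_CHILDREN);
        for (i = 0; i < dev->nr_children; i++)
                power_down_tree(dev->children[i]);
        dev->powered = 0;
        dev->last_seq = ++power_seq;
}

/* Power a subtree up, parent first, so that each child can be
 * accessed by the time it is powered up. */
void power_up_tree(struct device * dev)
{
        int i;

        assert(dev->nr_children <= MAX_CHILDREN);
        dev->powered = 1;
        dev->last_seq = ++power_seq;
        for (i = 0; i < dev->nr_children; i++)
                power_up_tree(dev->children[i]);
}
```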

Expressing this type of dependency is simple, since it is easy to
determine whether a device has any children. However, there are more
interesting power dependencies that are more difficult to express.

On a PCI Hotplug system, the hotplug controller that controls power to
a range of slots may reside on the primary PCI bus. However, the slots
it controls may reside behind a PCI-PCI bridge that is a peer of the
hotplug controller. The devices in the slots depend on the hotplug
controller being on to operate, but it is not the devices' parent. 
There are similar transversal relationships on some embedded platforms
in which PCI devices, several layers deep, depend on an I/O controller
that resides near the system root in order to communicate properly.

Both types of power dependencies are represented using the struct
device_pm::depend field. Implicit dependencies, like parent-child
relationships, are handled by incrementing the parent's depend count
when a child is registered with the PM core. When that child device is
powered down or removed, its parent's depend count is decremented.
A device may be powered down only when its depend count is 0.

Explicit power dependencies can be imposed on devices using 

int device_pm_get(struct device *);
void device_pm_put(struct device *);

device_pm_get() will increment a device's dependency count, and
device_pm_put() will decrement it. It is up to the driver to properly
manage the dependency counts on device discovery, removal, and power
management requests. 
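
A minimal model of the counting scheme is sketched below, using the
hotplug controller example. It makes several assumptions: a plain int
replaces atomic_t, the 'pm' field name is hypothetical, the function
bodies and the 0 return value of device_pm_get() are guesses at the
behaviour described above, and device_pm_may_power_down() is a
hypothetical helper that does not appear in the interface.

```c
#include <assert.h>

/* Simplified stand-in; the kernel uses atomic_t for the count. */
struct device_pm {
        int     depend;
};

struct device {
        struct device_pm pm;    /* field name is hypothetical */
};

/* Assumed behaviour: record that something depends on dev. */
int device_pm_get(struct device * dev)
{
        dev->pm.depend++;
        return 0;
}

/* Assumed behaviour: drop a dependency taken by device_pm_get(). */
void device_pm_put(struct device * dev)
{
        dev->pm.depend--;
}

/* Hypothetical helper: may dev be powered down right now? */
int device_pm_may_power_down(struct device * dev)
{
        return dev->pm.depend == 0;
}
```

In the hotplug case, the driver for a device in a slot would call
device_pm_get() on the hotplug controller when the device is
discovered, and device_pm_put() when the device is powered down or
removed.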


Disabling Power Management

There are circumstances in which a driver must refuse a power
management request. This is usually because the driver author does not
know the proper reinitialization sequence, or because the user is
performing an uninterruptible operation like burning a CD. 

It is valid for a driver to return an error from a suspend() method
call. However, a driver may know a priori that it cannot handle the
request. Declaring this in advance works to the system's benefit,
since the PM core can check whether any devices have disabled power
management before starting a suspend transition.

To disable power management, and to enable it again, a driver may call

int device_pm_disable(struct device *);
void device_pm_enable(struct device *);

The former increments the struct device_pm::disable count, and the
latter decrements it. If the count is positive, system power
management is disabled completely, as is device power management on
that device.

These calls should be used judiciously, since they have a global
impact on system power management.
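
The check that the PM core can make before a system transition might
look roughly like the following. This is a userspace sketch under
assumptions: a plain int replaces atomic_t, the 'pm' field name is
hypothetical, the function bodies and return value are guesses at the
behaviour described above, and system_pm_allowed() is a hypothetical
name for the pre-flight check.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in; the kernel uses atomic_t for the count. */
struct device_pm {
        int     disable;
};

struct device {
        struct device_pm pm;    /* field name is hypothetical */
};

/* Assumed behaviour: bump the disable count; 0 means success. */
int device_pm_disable(struct device * dev)
{
        dev->pm.disable++;
        return 0;
}

/* Assumed behaviour: undo one earlier device_pm_disable(). */
void device_pm_enable(struct device * dev)
{
        dev->pm.disable--;
}

/*
 * Hypothetical pre-flight check: a system transition may start only
 * if no device currently has power management disabled.
 */
int system_pm_allowed(struct device ** devs, size_t n)
{
        size_t i;

        for (i = 0; i < n; i++)
                if (devs[i]->pm.disable > 0)
                        return 0;
        return 1;
}
```

A CD writer driver, for instance, could call device_pm_disable() when
a burn starts and device_pm_enable() when it completes.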

