
System Power Management 


System power management (SPM) is the process of placing the entire
system into a low-power state. In a low-power state, the system
consumes only a small amount of power while still maintaining a
relatively low response latency to the user. The exact amount of power
and response latency depend on the state the system is in.


Power States

The states a system can enter depend on the underlying platform and
differ across architectures, and even across generations of the same
architecture. Three states tend to be found on most architectures
that support a form of SPM, though. The kernel explicitly supports
these states (Standby, Suspend, and Hibernate) and provides a
mechanism for a platform driver (an architectural port of the kernel)
to define new states.


typedef enum {
        POWER_ON        = 0,
        POWER_STANDBY   = 0x01,
        POWER_SUSPEND   = 0x02,
        POWER_HIBERNATE = 0x04,
} pm_system_state_t;


Standby is a low-latency power state that is sometimes referred to as
'power-on suspend'. In this state, the system conserves power by
placing the CPU in a halt state and the devices in the D1 state. The
power savings are not significant, but the response latency is minimal
-- typically less than 1 second. 

Suspend is also commonly known as 'suspend-to-RAM'. In this state, all
devices are placed in the D3 state and the entire system, except main
memory, is expected to lose power. Memory is placed in self-refresh
mode, so its contents are not lost. Response latency is higher than
Standby, yet still very low -- typically 3 to 5 seconds.

Hibernate conserves the most power by turning off the entire system,
after saving state to a persistent medium, usually a disk. All devices
are powered off unconditionally. The response latency is the highest
-- about 30 seconds -- but still quicker than performing a full boot
sequence. 

Most platforms support these states, though some platforms may support
other states or have requirements that don't match the assumptions
above. For example, some PPC laptops support Suspend, but because of a
lack of documentation, the video devices cannot be fully reinitialized
and hence may not enter the D3 state. The hardware will supply enough
power to devices for them to stay in the D2 state, which the drivers
are capable of recovering from. 

Instead of cluttering the code with a lot of conditional policy to
determine the correct state for devices to enter, the PM subsystem
abstracts system state information into dynamically registered
objects.

struct pm_state {
        struct pm_driver        * drv;
        pm_system_state_t       sys;
        pm_device_state_t       dev;
        struct kobject          kobj;
};


The drv field is a pointer to the platform-specific object configured
to handle the power state. The sys field is the low-level power state
that the system will enter. The dev field is the lowest power state
that devices may enter. The kobj field is the generic object for
managing an instance's lifetime. 

The kernel defines default power state objects representing the
assumptions above:


struct pm_state pm_state_standby;
struct pm_state pm_state_suspend;
struct pm_state pm_state_hibernate;

Platform drivers may also define and register additional power states
that they support using:


int pm_state_register(struct pm_state *);
void pm_state_unregister(struct pm_state *);


The PM sysfs interface 

The PM infrastructure registers a top-level subsystem with the kobject
core, which provides the /sys/power/ directory in sysfs. By default,
there is one file in the directory: 


/sys/power/state 


Reading from this file displays the states that are currently
registered with the system; e.g.:


# cat /sys/power/state 
standby suspend hibernate 


Writing the name of a state to this file causes the system to perform
a power state transition, as described in the section Power State
Transitions.
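
A userspace tool may want to check that a state is registered before
requesting it. The following is a minimal sketch that assumes the
file holds a single line of space-separated names, as in the listing
above. The function name pm_state_listed() is hypothetical, and the
path is passed in so that the helper can be tried against an ordinary
file.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Return 1 if 'name' appears in the space-separated list stored in
 * the file at 'path', 0 if it does not, and -1 if the file cannot be
 * read or is empty.
 */
int pm_state_listed(const char *path, const char *name)
{
        char buf[256];
        char *tok;
        FILE *f;

        assert(path != NULL && name != NULL);

        f = fopen(path, "r");
        if (!f)
                return -1;
        if (!fgets(buf, sizeof(buf), f)) {
                fclose(f);
                return -1;
        }
        fclose(f);

        /* Split on spaces and on the trailing newline. */
        for (tok = strtok(buf, " \n"); tok; tok = strtok(NULL, " \n"))
                if (strcmp(tok, name) == 0)
                        return 1;
        return 0;
}
```

A caller would pass "/sys/power/state" as the path on a system that
provides this interface.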

Each power state that is registered receives a directory in
/sys/power, and three attribute files: 



# tree /sys/power/suspend/
/sys/power/suspend/
|-- devices
|-- driver
`-- system


The 'devices' file and the 'system' file describe the power state
that the devices in the computer and the computer itself are to
enter, respectively. The 'driver' file displays which low-level
platform PM driver is configured to handle the power transition.
Writing to this file sets the driver internally.


Power Management Platform Drivers


The process of transitioning the OS into a low-power state is largely
platform-agnostic. However, the low-level mechanism for actually
transitioning the hardware to a low-power state is very
platform-specific, and even dependent on the generation of the
hardware.

On some platforms, there may be multiple ways to enter a low-power
state, presenting a policy decision for the user to make. This
usually arises only during a Hibernation transition, when choosing
between entering a minimal power state and turning the system
completely off.

To cope with these variations, the PM core defines a simple driver
model: 


struct pm_driver {
        u32                     states;
        int     (*prepare)      (u32 state);
        int     (*save)         (u32 state);
        int     (*sleep)        (u32 state);
        int     (*restore)      (u32 state);
        int     (*cleanup)      (u32 state);
        struct kobject          kobj;
};

int pm_driver_register(struct pm_driver *);
void pm_driver_unregister(struct pm_driver *);

The states field of struct pm_driver is a bitwise OR of the states the
driver supports. The methods are platform-specific calls that the PM
core executes during a power state transition. They are designed to
perform the following:

* prepare - Verify that the platform can enter the requested state
and perform any necessary preparation for entering the state. 

* save - Save low-level state of the platform and the CPU(s). 

* sleep - Enter the requested state. 

* restore - Restore low-level register state of the platform and
CPU(s). 

* cleanup - Perform any necessary actions to leave the sleep state. 


A platform should initialize and register a driver on startup:


static struct pm_driver acpi_pm_driver = {
        .states         = POWER_STANDBY | POWER_SUSPEND | POWER_HIBERNATE,
        .prepare        = acpi_enter_sleep_state_prep,
        .sleep          = acpi_pm_sleep,
        .cleanup        = acpi_leave_sleep_state,
        .kobj           = { .name = "acpi" },
};

static int __init acpi_sleep_init(void)
{
	return pm_driver_register(&acpi_pm_driver);
}


Each registered PM driver receives a directory in sysfs in
/sys/power. Each driver receives one default attribute file named
'states', which displays the power states the driver supports. This
file is not writable by userspace. 

# tree /sys/power/acpi/
/sys/power/acpi/
`-- states
# cat /sys/power/acpi/states 
standby suspend hibernate 
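
The contents of 'states' can be derived directly from the driver's
states mask. The sketch below shows one way such a listing might be
produced in plain C. It is not the kernel's code: the name table and
the function pm_format_states() are hypothetical, and the output
format (each name followed by a space, no trailing newline) is only
inferred from the listing above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum {
        POWER_STANDBY   = 0x01,
        POWER_SUSPEND   = 0x02,
        POWER_HIBERNATE = 0x04,
};

/* Hypothetical table mapping each state bit to its sysfs name. */
static const struct {
        unsigned int    bit;
        const char      *name;
} pm_state_names[] = {
        { POWER_STANDBY,   "standby"   },
        { POWER_SUSPEND,   "suspend"   },
        { POWER_HIBERNATE, "hibernate" },
};

/*
 * Write the names of the bits set in 'states' into buf, each followed
 * by a space. Returns the number of characters written, not counting
 * the terminating NUL.
 */
size_t pm_format_states(unsigned int states, char *buf, size_t size)
{
        size_t len = 0, i;
        int n;

        assert(buf != NULL && size > 0);
        buf[0] = '\0';
        for (i = 0; i < sizeof(pm_state_names) / sizeof(pm_state_names[0]); i++) {
                if (!(states & pm_state_names[i].bit))
                        continue;
                n = snprintf(buf + len, size - len, "%s ",
                             pm_state_names[i].name);
                if (n < 0 || (size_t)n >= size - len) {
                        buf[len] = '\0';  /* drop a partially written name */
                        break;
                }
                len += (size_t)n;
        }
        return len;
}
```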


Platform drivers may define and export their own attributes. 

struct pm_attribute {
        struct attribute        attr;
        ssize_t (*show)(struct pm_driver *, char *);
        ssize_t (*store)(struct pm_driver *, const char *, size_t);
};

int pm_attribute_create(struct pm_driver *, struct pm_attribute *);
void pm_attribute_remove(struct pm_driver *, struct pm_attribute *);

Attributes of a pm_driver follow the same semantics as other sysfs
attributes. Please see the kernel sysfs documentation for more
information.
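
As an example of a driver-specific attribute, a platform could expose
the Hibernation policy choice mentioned earlier (minimal power state
versus turning the system completely off). The sketch below is a
userspace mock: the 'poweroff' attribute, the poweroff field in
struct pm_driver, and the reduced struct attribute are assumptions
made for illustration. It follows the usual sysfs convention that
show() returns the number of bytes written and store() returns the
number of bytes consumed.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>          /* ssize_t */

/* Reduced mocks of the kernel structures. */
struct attribute {
        const char      *name;
};

struct pm_driver {
        int             poweroff;       /* hypothetical policy flag */
};

struct pm_attribute {
        struct attribute        attr;
        ssize_t (*show)(struct pm_driver *, char *);
        ssize_t (*store)(struct pm_driver *, const char *, size_t);
};

/* Report the current policy as "0\n" or "1\n". */
static ssize_t poweroff_show(struct pm_driver *drv, char *buf)
{
        assert(drv != NULL && buf != NULL);
        return sprintf(buf, "%d\n", drv->poweroff);
}

/* Accept a write that begins with '0' or '1'; reject anything else. */
static ssize_t poweroff_store(struct pm_driver *drv, const char *buf,
                              size_t n)
{
        assert(drv != NULL && buf != NULL);
        if (n == 0 || (buf[0] != '0' && buf[0] != '1'))
                return -1;      /* kernel code would return -EINVAL */
        drv->poweroff = buf[0] - '0';
        return (ssize_t)n;      /* the whole write was consumed */
}

static struct pm_attribute pm_attr_poweroff = {
        .attr   = { .name = "poweroff" },
        .show   = poweroff_show,
        .store  = poweroff_store,
};
```

In the kernel, such an attribute would be passed to
pm_attribute_create() together with the driver it belongs to.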


Power State Transitions

Transitioning the system to a low-power state is, unfortunately, not
as simple as telling the platform to enter the requested low-power
state. The file drivers/power/suspend.c contains the entire sequence,
and should be used as reference material for the official process. A
synopsis is provided here.

The first step is to verify that the system can enter the power
state. The PM core must have a driver that supports the requested
state, the driver must return success from its prepare() method, and
the driver core must return success from device_pm_check(). Next, the
PM core quiesces the running system by disabling preemption and
'freezing' all processes.

Next, system state is saved by calling device_suspend() to save device
state, and the driver's save() method to save low-level system state. 
If we're entering a variant of the Hibernation state, the contents of
memory must be saved to a persistent medium. pm_hibernate_save() is
called to perform this, which is described in the section
Hibernation. 

Once state is saved, the PM core disables interrupts and calls
device_power_down() to place each device in the specified low power
state. Finally, it calls the driver's sleep() method to transition the
system to the low-power state. 

The resume sequence has two variants, depending on whether the system
is returning from a Hibernation state or not. If it is not, the
platform is responsible for returning execution to the correct place
(after the return from the driver's sleep() method). This may be a
function of the processor, the firmware, or the low-level platform
driver. 

If we're returning from Hibernation, the system detects it during the
boot process in the function pm_resume(). pm_resume() is a
late_initcall, which means it is called after most subsystems and
drivers have been registered and initialized, including all
non-modular PM drivers. It calls pm_hibernate_load(), which is
responsible for attempting to read, load, and restore a saved memory
image. Doing this replaces the currently running system with the
saved one, and execution returns to the point after the call to
pm_hibernate_save().

One way or another, the PM core proceeds to power on all devices and
restore interrupts. The driver's restore() method is called to restore
low-level system state, and device_pm_resume() is called to restore
device context. Finally, the driver's cleanup() method is called,
processes are 'thawed', and preemption is reenabled. 
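
The ordering described in this section can be summarized with a small
userspace mock. This is only a sketch of the synopsis, not the code in
drivers/power/suspend.c: every step merely appends a label to a trace
string, the labels (for example "device_power_up" and "irq_off") are
placeholders, the name pm_enter_sketch() is hypothetical, and error
unwinding is left out. It also assumes that a method the driver
leaves unset is skipped, and it treats a Hibernation wakeup like any
other, ignoring the separate boot-time path through pm_resume().

```c
#include <assert.h>
#include <string.h>

enum {
        POWER_STANDBY   = 0x01,
        POWER_SUSPEND   = 0x02,
        POWER_HIBERNATE = 0x04,
};

struct pm_driver {
        unsigned int    states;
        int     (*prepare)      (unsigned int state);
        int     (*save)         (unsigned int state);
        int     (*sleep)        (unsigned int state);
        int     (*restore)      (unsigned int state);
        int     (*cleanup)      (unsigned int state);
};

static char pm_trace[512];

/* Append one step label to the trace; every mocked step succeeds. */
static int step(const char *name)
{
        assert(strlen(pm_trace) + strlen(name) + 2 <= sizeof(pm_trace));
        strcat(pm_trace, name);
        strcat(pm_trace, " ");
        return 0;
}

/* Assumption: a method the driver leaves unset is simply skipped. */
static int call(int (*fn)(unsigned int), unsigned int state)
{
        return fn ? fn(state) : 0;
}

int pm_enter_sketch(struct pm_driver *drv, unsigned int state)
{
        assert(drv != NULL);
        pm_trace[0] = '\0';

        /* Verify that the system can enter the state. */
        if (!(drv->states & state))
                return -1;
        if (call(drv->prepare, state) || step("device_pm_check"))
                return -1;

        /* Quiesce the running system. */
        step("preempt_off");
        step("freeze");

        /* Save device state, then low-level system state. */
        step("device_suspend");
        call(drv->save, state);
        if (state == POWER_HIBERNATE)
                step("pm_hibernate_save");

        /* Power down; sleep() returns when the system wakes up. */
        step("irq_off");
        step("device_power_down");
        call(drv->sleep, state);

        /* Resume: undo the steps above in reverse order. */
        step("device_power_up");
        step("irq_on");
        call(drv->restore, state);
        step("device_pm_resume");
        call(drv->cleanup, state);
        step("thaw");
        step("preempt_on");
        return 0;
}

/* Mock driver whose methods only record that they were called. */
static int mock_prepare(unsigned int s) { (void)s; return step("prepare"); }
static int mock_save(unsigned int s)    { (void)s; return step("save"); }
static int mock_sleep(unsigned int s)   { (void)s; return step("sleep"); }
static int mock_restore(unsigned int s) { (void)s; return step("restore"); }
static int mock_cleanup(unsigned int s) { (void)s; return step("cleanup"); }

static struct pm_driver mock_driver = {
        .states         = POWER_STANDBY | POWER_SUSPEND | POWER_HIBERNATE,
        .prepare        = mock_prepare,
        .save           = mock_save,
        .sleep          = mock_sleep,
        .restore        = mock_restore,
        .cleanup        = mock_cleanup,
};
```

Calling pm_enter_sketch(&mock_driver, POWER_SUSPEND) and then reading
pm_trace shows the order in which the steps ran.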

A suspend transition is triggered by writing the name of the
requested state to the sysfs file /sys/power/state. Once the
transition is complete, execution returns to the process that wrote
the value.
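
From a userspace program, the request could look roughly like the
sketch below. The function name pm_request_state() is hypothetical,
and the path is a parameter so that the helper can be tried against an
ordinary file; on a real system it would be "/sys/power/state", and
the call would not return until the system has resumed.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write 'name' to the state file at 'path'. Returns 0 on success and
 * -1 if the file cannot be opened or the write fails.
 */
int pm_request_state(const char *path, const char *name)
{
        FILE *f;
        int ok;

        assert(path != NULL && name != NULL);

        f = fopen(path, "w");
        if (!f)
                return -1;
        ok = fputs(name, f) >= 0;
        /*
         * stdio buffers the data, so the write may only reach the file
         * when the stream is closed; a failure can surface here.
         */
        if (fclose(f) != 0)
                ok = 0;
        return ok ? 0 : -1;
}
```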

