
Hibernation


This section describes the Hibernate power state; specifically, the
process the PM core uses to save memory to a persistent medium, and
the model for implementing a low-level backend driver to read and
write saved state on a specific medium.

As mentioned in the previous section, Hibernate is a low-power state
in which system memory state is saved to a persistent medium before
the system is powered off, and restored during the system boot
sequence.

Hibernate is the only low-power state that can be used in the absence
of any platform support for power management. Instead of entering a
low-power state, the configured PM driver may simply turn the system
off. This mechanism provides perfect power savings (by not consuming
any), and can be used to work around broken power management firmware
or hardware. The PM core registers a default platform driver that
supplies this mechanism. It is named 'shutdown' and supports the
Hibernate state only.

Hibernation can also add value in situations which would otherwise
ignore standard power management concepts. For example, system state
can be saved and restored should a battery backup become critically
low. Or, system state could be saved when the kernel Oopses or hits a
BUG(). The system could be rebooted and the state examined later.


Hibernation Backend Drivers

Hibernate is commonly referred to as 'suspend-to-disk', implying that
the medium that system state is saved to is a physical disk. This
assumption overlooks the possibility that another type of media may
be used to capture state, and it makes no distinction of how the
state is stored on disk, since it could theoretically be stored on a
dedicated partition, in free swap space, or in a regular file on an
arbitrary filesystem.

The PM subsystem offers the ability to configure the type of medium
that state is saved to.


struct pm_backend {
        int     (*open) (void);
        void    (*close)(void);
        int     (*read_image)(void);
        int     (*write_image)(void);
        struct kobject  kobj;
};

int pm_backend_register(struct pm_backend *);
void pm_backend_unregister(struct pm_backend * );



The PM core provides a default backend driver named 'pmdisk' that uses
a dedicated partition type to save state. The internals of pmdisk are
discussed later. 

Backend drivers are registered as children of the Hibernate pm_state
object, and are represented by directories in sysfs. 

They may also define and export attributes using the following
interface: 


struct pm_backend_attr {
        struct attribute attr;
        ssize_t (*show)(struct pm_backend *, char *);
        ssize_t (*store)(struct pm_backend *, const char *, size_t);
};

int pm_backend_attr_create(struct pm_backend *, struct pm_backend_attr *);
void pm_backend_attr_remove(struct pm_backend *, struct pm_backend_attr *);



Snapshotting Memory


The Hibernate core 'snapshots' system memory by indexing and copying
every active page in the system. Once a snapshot is complete, the
saved image and index are passed to the backend driver to store
persistently.

The snapshot process has one critical requirement: at least half of
memory must be free. This imposes a strict limitation on the use of
the current Hibernate implementation during periods of high memory
usage. However, this design decision simplifies the requirements of
the implementation itself.

The snapshot sequence is a three-step process: first, all of the
active pages in the system are indexed; next, enough new pages are
allocated to clone these pages; finally, each page is copied into its
clone, or 'shadow'.

Active pages are detected by iterating over each page frame number
('pfn') in the system and determining whether it should be saved. A
page's saveability is initially determined by whether or not the
PageNosave bit is set, and then by whether or not the page is
free. Reserved pages may or may not be saveable, depending on whether
they fall within the '__nosave' data section.

Pages marked 'Nosave' or declared in the '__nosave' section (with the
'__nosavedata' suffix) are volatile data and variables internal to
the Hibernate core. They are used and modified during the snapshot
process, and are not saved.

Saveable pages are indexed in page-sized arrays called pm_chapters:


#define PG_PER_CHAPT    (PAGE_SIZE / sizeof(pgoff_t))

struct pm_chapter {
        pgoff_t         c_pages[PG_PER_CHAPT];
};


pm_chapters are dynamically allocated based on the number of saveable
pages in the system. The addresses of the allocated chapters are
stored in another page-sized array, called a pm_volume:

#define CHAPT_PER_VOL   (PAGE_SIZE / sizeof(struct pm_chapter *))

struct pm_volume {
        struct pm_chapter * v_chapters[CHAPT_PER_VOL];
};

There are two statically allocated pm_volumes in the Hibernate core -
one for the memory index (pm_mem_index), and one for the snapshot
(pm_mem_shadow). This imposes an upper limit on the amount of memory
that can be snapshotted by the Hibernate core:

       CHAPT_PER_VOL * PG_PER_CHAPT * PAGE_SIZE / 2

is the number of bytes that can be saved, assuming half of memory must
be free to store the snapshot. On a 32-bit x86 machine with 4K-sized
pages, this works out to be: 

	1024 * 1024 * 4096 / 2
	= 2,147,483,648 bytes
	= 2 GB

which is more than enough, since accessing memory above 1GB requires
4M-sized pages. 

After memory has been indexed, but before it has been copied, the
contents of pm_mem_index and pm_mem_shadow are copied to pm_mem_clone
and pm_shadow_clone. The latter are also statically allocated objects,
but are not declared 'nosave'. The purpose of the clones is to save
the addresses of the dynamically allocated chapter pages so we can
free them once the saved image has been restored.

At this stage, the Hibernate core calls a required architecture-
specific function:

	 int pm_arch_hibernate(pm_system_state_t state);

The state parameter should be set to POWER_HIBERNATE. This call is
responsible for saving low-level register state _and_ calling
pm_hibernate_save(), which copies each indexed page in pm_mem_index to
its corresponding page in pm_mem_shadow.


Restoring Memory

During a resume sequence, the Hibernate core calls the backend's
open() method, which is responsible for setting pm_num_pages; the
Hibernate core uses this value to pre-allocate pm_mem_index and
pm_mem_shadow.

The backend's read_image() method is called, which populates
pm_mem_index with the target location of each saved page, and
pm_mem_shadow, which contains the saved pages.

When the image is restored, it replaces the memory of the freshly
booted system, which is in a different state than when the image was
saved. The pages that have been allocated to hold the image read in
from the backend may therefore conflict with pages in the saved image
that are to be restored. The Hibernate backend must guarantee that
none of the pages currently pointed to by pm_mem_shadow conflict with
the pages indexed by pm_mem_index. To do this, it loops through each
page address in pm_mem_shadow and compares it with each page address
in pm_mem_index. If a match is found, a new page is allocated and the
contents copied.

To replace memory, the Hibernate core calls 

	   pm_arch_hibernate(POWER_ON);

The architecture is responsible for iterating over the pages in
pm_mem_shadow and copying each one to its destination, as indexed in
pm_mem_index. It is also responsible for restoring low-level register
state once memory has been replaced. 

This burden is placed on the architecture so it can implement a
replacement algorithm without using the stack for variable
storage. The saved memory image contains the saved stack, while the
current stack pointer register will point to a location on the stack
in the memory being replaced. These will likely not match and cause
the system to crash very quickly. 

Once the memory image is restored, the architecture must restore
register context to get the stack pointer pointing to the right
place. This is the reason that the same function is called to both
save and restore of low-level registers. 

Returning from pm_arch_hibernate() once memory has been replaced will
restore execution to the point in hibernate_write() where
pm_arch_hibernate() was called, in the saving sequence. To detect
this, the Hibernate core declares:

static int in_suspend __nosavedata = 0;

and sets it to one during the save path. Since it is not saved, it
will be 0 during the restore path, allowing the Hibernate core to
behave appropriately. The cloned volumes are copied back into
pm_mem_index and pm_mem_shadow, and the dynamically allocated pages
are freed.



Backend Driver Semantics


The Hibernate core calls the backend driver's open() method before
any Hibernate operation. It is the backend's responsibility to verify
the existence of the media and to open any necessary communication
channels to it. The backend driver is responsible for reading image
metadata from the medium and setting pm_num_pages to the number of
saved pages if a saved image exists. The Hibernate core will use this
value to pre-allocate storage for the saved pages.

It may also use this opportunity to verify there is enough free space
on the device. The maximum requirement is the total amount of memory
in the system, as indicated by:

       num_physpages * PAGE_SIZE

This check is optional at this stage, since the size of the saved
memory image may be much smaller than this, and may fit on a device
with less free space than the total size of memory. 

When the Hibernate core is done, it will call the backend's close()
method. The backend is responsible for closing any communication
channels to the storage medium and freeing any memory it had
allocated. 


After the Hibernate core has shadowed memory, it calls the backend's
write_image() method. It does not pass any parameters; pm_mem_index
and pm_mem_shadow must be used directly. The backend must save each
page pointed to in each chapter of pm_mem_shadow. It must also save
each chapter page of pm_mem_index. The exact format in which these
are saved is up to the driver.

When restoring a memory image, after the Hibernate core has allocated
storage for the saved memory, the backend's read_image() method is
called. pm_mem_index contains enough allocated chapters to store the
saved chapters and pm_mem_shadow contains enough allocated chapters
and pages to store all of the saved pages. The backend must populate
all of these. 



pmdisk

pmdisk is a simple hibernate backend driver. It uses a dedicated
partition with a custom format for storing system state. Internally,
pmdisk uses the bio layer to read and write pages directly to/from the
disk. 

A pmdisk partition may be created using a utility called 'pmdisk',
which can be found here:

      http://kernel.org/pub/linux/kernel/people/mochel/power/

This utility simply writes a pmdisk header to a partition, which is
defined as:


#define PM_HIBERNATE_SIG        "PMHibernate"
#define PM_HIBERNATE_VER        1


#define PM_UNUSED_SPACE (PAGE_SIZE - (4 * sizeof(unsigned long) + 16))

struct pmdisk_header {
        char                            h_unused[PM_UNUSED_SPACE];
        unsigned long                   h_version;
        unsigned long                   h_chksum;
        unsigned long                   h_pages;
        unsigned long                   h_chapters;
        char                            h_sig[16];
} __attribute__((packed));


Internally, the pmdisk backend driver reads the header from the first
page of the configured partition when its open() method is called. It
verifies that it is a pmdisk partition, and sets pm_num_pages if there
is an image stored on the disk. 

On a close() call, pmdisk sets the h_pages, h_chapters, and h_chksum
fields of the header and writes it to the first page on the disk. Note
that on a memory restore operation, pm_num_pages will be 0, signifying
the memory image on the disk is no longer valid.


A saved memory image on a pmdisk partition is laid out like:

    0:          pmdisk header
    1  to Nc:   Saved chapters of pm_mem_index
    Nc to Np:   Saved pages from pm_mem_shadow


On a write_image() call, pmdisk will first initialize an internal
checksum variable. It will then write each chapter from pm_mem_index
to disk, then each page from pm_mem_shadow. As it writes each page,
it passes it to a checksum function. The checksum function is simple
and definitely not cryptographically secure, but it does provide an
easy verification that an image on disk is valid.

On a read_image() call, pmdisk reads each chapter into pm_mem_index
and each page into pm_mem_shadow. As it reads each page, it checksums
them. Once all pages have been read, it compares the current checksum
with the h_chksum field of the header. It returns success only if they
match.

The pmdisk driver exports a sysfs attribute file named 'dev', which
userspace must use to tell the kernel the correct pmdisk partition to
use. There is currently no way for pmdisk to automatically detect any
valid partitions in the system.

The value that userspace must write is a 16-bit dev_t value in
hexadecimal format containing the major/minor number pair of the
device to use. This format is not favored, but is the only current
method for obtaining a reference to a specific block device at the
time of writing. This interface will change in the future. 


