|  | The memory API | 
|  | ============== | 
|  |  | 
|  | The memory API models the memory and I/O buses and controllers of a QEMU | 
|  | machine.  It attempts to allow modelling of: | 
|  |  | 
|  | - ordinary RAM | 
|  | - memory-mapped I/O (MMIO) | 
|  | - memory controllers that can dynamically reroute physical memory regions | 
|  | to different destinations | 
|  |  | 
|  | The memory model provides support for | 
|  |  | 
|  | - tracking RAM changes by the guest | 
|  | - setting up coalesced memory for kvm | 
|  | - setting up ioeventfd regions for kvm | 
|  |  | 
|  | Memory is modelled as an acyclic graph of MemoryRegion objects.  Sinks | 
|  | (leaves) are RAM and MMIO regions, while other nodes represent | 
|  | buses, memory controllers, and memory regions that have been rerouted. | 
|  |  | 
|  | In addition to MemoryRegion objects, the memory API provides AddressSpace | 
|  | objects for every root and possibly for intermediate MemoryRegions too. | 
|  | These represent memory as seen from the CPU or a device's viewpoint. | 
|  |  | 
|  | Types of regions | 
|  | ---------------- | 
|  |  | 
|  | There are four types of memory regions (all represented by a single C type | 
|  | MemoryRegion): | 
|  |  | 
|  | - RAM: a RAM region is simply a range of host memory that can be made available | 
|  | to the guest. | 
|  |  | 
|  | - MMIO: a range of guest memory that is implemented by host callbacks; | 
|  | each read or write causes a callback to be called on the host. | 
|  |  | 
|  | - container: a container simply includes other memory regions, each at | 
|  | a different offset.  Containers are useful for grouping several regions | 
|  | into one unit.  For example, a PCI BAR may be composed of a RAM region | 
|  | and an MMIO region. | 
|  |  | 
|  | A container's subregions are usually non-overlapping.  In some cases it is | 
|  | useful to have overlapping regions; for example a memory controller that | 
|  | can overlay a subregion of RAM with MMIO or ROM, or a PCI controller | 
|  | that does not prevent card from claiming overlapping BARs. | 
|  |  | 
|  | - alias: a subsection of another region.  Aliases allow a region to be | 
|  | split apart into discontiguous regions.  Examples of uses are memory banks | 
|  | used when the guest address space is smaller than the amount of RAM | 
|  | addressed, or a memory controller that splits main memory to expose a "PCI | 
|  | hole".  Aliases may point to any type of region, including other aliases, | 
|  | but an alias may not point back to itself, directly or indirectly. | 
|  |  | 
|  | It is valid to add subregions to a region which is not a pure container | 
|  | (that is, to an MMIO, RAM or ROM region). This means that the region | 
|  | will act like a container, except that any addresses within the container's | 
|  | region which are not claimed by any subregion are handled by the | 
|  | container itself (ie by its MMIO callbacks or RAM backing). However | 
|  | it is generally possible to achieve the same effect with a pure container | 
|  | one of whose subregions is a low priority "background" region covering | 
|  | the whole address range; this is often clearer and is preferred. | 
|  | Subregions cannot be added to an alias region. | 
|  |  | 
|  | Region names | 
|  | ------------ | 
|  |  | 
|  | Regions are assigned names by the constructor.  For most regions these are | 
|  | only used for debugging purposes, but RAM regions also use the name to identify | 
|  | live migration sections.  This means that RAM region names need to have ABI | 
|  | stability. | 
|  |  | 
|  | Region lifecycle | 
|  | ---------------- | 
|  |  | 
|  | A region is created by one of the constructor functions (memory_region_init*()) | 
|  | and destroyed by the destructor (memory_region_destroy()).  In between, | 
|  | a region can be added to an address space by using memory_region_add_subregion() | 
|  | and removed using memory_region_del_subregion().  Region attributes may be | 
|  | changed at any point; they take effect once the region becomes exposed to the | 
|  | guest. | 
|  |  | 
|  | Overlapping regions and priority | 
|  | -------------------------------- | 
|  | Usually, regions may not overlap each other; a memory address decodes into | 
|  | exactly one target.  In some cases it is useful to allow regions to overlap, | 
|  | and sometimes to control which of an overlapping regions is visible to the | 
|  | guest.  This is done with memory_region_add_subregion_overlap(), which | 
|  | allows the region to overlap any other region in the same container, and | 
|  | specifies a priority that allows the core to decide which of two regions at | 
|  | the same address are visible (highest wins). | 
|  | Priority values are signed, and the default value is zero. This means that | 
|  | you can use memory_region_add_subregion_overlap() both to specify a region | 
|  | that must sit 'above' any others (with a positive priority) and also a | 
|  | background region that sits 'below' others (with a negative priority). | 
|  |  | 
|  | If the higher priority region in an overlap is a container or alias, then | 
|  | the lower priority region will appear in any "holes" that the higher priority | 
|  | region has left by not mapping subregions to that area of its address range. | 
|  | (This applies recursively -- if the subregions are themselves containers or | 
|  | aliases that leave holes then the lower priority region will appear in these | 
|  | holes too.) | 
|  |  | 
|  | For example, suppose we have a container A of size 0x8000 with two subregions | 
|  | B and C. B is a container mapped at 0x2000, size 0x4000, priority 1; C is | 
|  | an MMIO region mapped at 0x0, size 0x6000, priority 2. B currently has two | 
|  | of its own subregions: D of size 0x1000 at offset 0 and E of size 0x1000 at | 
|  | offset 0x2000. As a diagram: | 
|  |  | 
|  | 0      1000   2000   3000   4000   5000   6000   7000    8000 | 
|  | |------|------|------|------|------|------|------|-------| | 
|  | A:    [                                                       ] | 
|  | C:    [CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC] | 
|  | B:                  [                          ] | 
|  | D:                  [DDDDD] | 
|  | E:                                [EEEEE] | 
|  |  | 
|  | The regions that will be seen within this address range then are: | 
|  | [CCCCCCCCCCCC][DDDDD][CCCCC][EEEEE][CCCCC] | 
|  |  | 
|  | Since B has higher priority than C, its subregions appear in the flat map | 
|  | even where they overlap with C. In ranges where B has not mapped anything | 
|  | C's region appears. | 
|  |  | 
|  | If B had provided its own MMIO operations (ie it was not a pure container) | 
|  | then these would be used for any addresses in its range not handled by | 
|  | D or E, and the result would be: | 
|  | [CCCCCCCCCCCC][DDDDD][BBBBB][EEEEE][BBBBB] | 
|  |  | 
|  | Priority values are local to a container, because the priorities of two | 
|  | regions are only compared when they are both children of the same container. | 
|  | This means that the device in charge of the container (typically modelling | 
|  | a bus or a memory controller) can use them to manage the interaction of | 
|  | its child regions without any side effects on other parts of the system. | 
|  | In the example above, the priorities of D and E are unimportant because | 
|  | they do not overlap each other. It is the relative priority of B and C | 
|  | that causes D and E to appear on top of C: D and E's priorities are never | 
|  | compared against the priority of C. | 
|  |  | 
|  | Visibility | 
|  | ---------- | 
|  | The memory core uses the following rules to select a memory region when the | 
|  | guest accesses an address: | 
|  |  | 
|  | - all direct subregions of the root region are matched against the address, in | 
|  | descending priority order | 
|  | - if the address lies outside the region offset/size, the subregion is | 
|  | discarded | 
|  | - if the subregion is a leaf (RAM or MMIO), the search terminates, returning | 
|  | this leaf region | 
|  | - if the subregion is a container, the same algorithm is used within the | 
|  | subregion (after the address is adjusted by the subregion offset) | 
|  | - if the subregion is an alias, the search is continued at the alias target | 
|  | (after the address is adjusted by the subregion offset and alias offset) | 
|  | - if a recursive search within a container or alias subregion does not | 
|  | find a match (because of a "hole" in the container's coverage of its | 
|  | address range), then if this is a container with its own MMIO or RAM | 
|  | backing the search terminates, returning the container itself. Otherwise | 
|  | we continue with the next subregion in priority order | 
|  | - if none of the subregions match the address then the search terminates | 
|  | with no match found | 
|  |  | 
|  | Example memory map | 
|  | ------------------ | 
|  |  | 
|  | system_memory: container@0-2^48-1 | 
|  | | | 
|  | +---- lomem: alias@0-0xdfffffff ---> #ram (0-0xdfffffff) | 
|  | | | 
|  | +---- himem: alias@0x100000000-0x11fffffff ---> #ram (0xe0000000-0xffffffff) | 
|  | | | 
|  | +---- vga-window: alias@0xa0000-0xbfffff ---> #pci (0xa0000-0xbffff) | 
|  | |      (prio 1) | 
|  | | | 
|  | +---- pci-hole: alias@0xe0000000-0xffffffff ---> #pci (0xe0000000-0xffffffff) | 
|  |  | 
|  | pci (0-2^32-1) | 
|  | | | 
|  | +--- vga-area: container@0xa0000-0xbffff | 
|  | |      | | 
|  | |      +--- alias@0x00000-0x7fff  ---> #vram (0x010000-0x017fff) | 
|  | |      | | 
|  | |      +--- alias@0x08000-0xffff  ---> #vram (0x020000-0x027fff) | 
|  | | | 
|  | +---- vram: ram@0xe1000000-0xe1ffffff | 
|  | | | 
|  | +---- vga-mmio: mmio@0xe2000000-0xe200ffff | 
|  |  | 
|  | ram: ram@0x00000000-0xffffffff | 
|  |  | 
|  | This is a (simplified) PC memory map. The 4GB RAM block is mapped into the | 
|  | system address space via two aliases: "lomem" is a 1:1 mapping of the first | 
|  | 3.5GB; "himem" maps the last 0.5GB at address 4GB.  This leaves 0.5GB for the | 
|  | so-called PCI hole, that allows a 32-bit PCI bus to exist in a system with | 
|  | 4GB of memory. | 
|  |  | 
|  | The memory controller diverts addresses in the range 640K-768K to the PCI | 
|  | address space.  This is modelled using the "vga-window" alias, mapped at a | 
|  | higher priority so it obscures the RAM at the same addresses.  The vga window | 
|  | can be removed by programming the memory controller; this is modelled by | 
|  | removing the alias and exposing the RAM underneath. | 
|  |  | 
|  | The pci address space is not a direct child of the system address space, since | 
|  | we only want parts of it to be visible (we accomplish this using aliases). | 
|  | It has two subregions: vga-area models the legacy vga window and is occupied | 
|  | by two 32K memory banks pointing at two sections of the framebuffer. | 
|  | In addition the vram is mapped as a BAR at address e1000000, and an additional | 
|  | BAR containing MMIO registers is mapped after it. | 
|  |  | 
|  | Note that if the guest maps a BAR outside the PCI hole, it would not be | 
|  | visible as the pci-hole alias clips it to a 0.5GB range. | 
|  |  | 
|  | Attributes | 
|  | ---------- | 
|  |  | 
|  | Various region attributes (read-only, dirty logging, coalesced mmio, ioeventfd) | 
|  | can be changed during the region lifecycle.  They take effect once the region | 
|  | is made visible (which can be immediately, later, or never). | 
|  |  | 
|  | MMIO Operations | 
|  | --------------- | 
|  |  | 
|  | MMIO regions are provided with ->read() and ->write() callbacks; in addition | 
|  | various constraints can be supplied to control how these callbacks are called: | 
|  |  | 
|  | - .valid.min_access_size, .valid.max_access_size define the access sizes | 
|  | (in bytes) which the device accepts; accesses outside this range will | 
|  | have device and bus specific behaviour (ignored, or machine check) | 
|  | - .valid.aligned specifies that the device only accepts naturally aligned | 
|  | accesses.  Unaligned accesses invoke device and bus specific behaviour. | 
|  | - .impl.min_access_size, .impl.max_access_size define the access sizes | 
|  | (in bytes) supported by the *implementation*; other access sizes will be | 
|  | emulated using the ones available.  For example a 4-byte write will be | 
|  | emulated using four 1-byte writes, if .impl.max_access_size = 1. | 
|  | - .impl.valid specifies that the *implementation* only supports unaligned | 
|  | accesses; unaligned accesses will be emulated by two aligned accesses. | 
|  | - .old_portio and .old_mmio can be used to ease porting from code using | 
|  | cpu_register_io_memory() and register_ioport().  They should not be used | 
|  | in new code. |