qemu-kvm source code walkthrough: memory virtualization

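Introduction to memory virtualization

For a program running on the host, address translation is HVA (host virtual address) --MMU--> HPA (host physical address).

A virtual machine running on the host needs two levels of translation:

GVA (guest virtual address) --MMU--> GPA (guest physical address)

GPA (guest physical address) --VMM--> HPA (host physical address)

Guest memory translation used to rely on shadow page tables; today it mainly relies on EPT.

  1. Inside the guest, GVA (guest virtual address) --MMU--> GPA (guest physical address): the guest OS cannot tell that it is virtualized, so it simply performs the normal MMU translation.
  2. The CPU does know that it is running a guest, so it automatically performs an additional walk of the EPT tables to translate the GPA (guest physical address) to an HPA (host physical address).
  3. The EPT tables are built and maintained by the VMM. The GPA-to-HPA mappings are installed by KVM when an EPT lookup fails and the resulting EPT violation causes a VM exit.

Whether the current system supports EPT can be checked with the command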

cat /proc/cpuinfo |grep -E 'ept|pdpe1gb'

https://i-blog.csdnimg.cn/direct/283cc18fc432493eac6dc4b095a25a9d.png

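Generally speaking, EPT uses the IA-32e paging layout: a 48-bit address is walked through four levels of page tables, each level is indexed by 9 bits of the address, and the final 12 bits give the offset inside a 4 KB page.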

https://i-blog.csdnimg.cn/direct/a8f821217332477cacefd2b07446e35e.png
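The 9+9+9+9+12 split can be written down directly. The following is a minimal sketch, not code from KVM: the WalkIndices type, its field names, and split_gpa are hypothetical names made up for this illustration, and only 4 KB pages are considered.

```c
#include <assert.h>
#include <stdint.h>

/* Indices used by a 4-level walk of a 48-bit address with 4 KB pages.
 * The type and field names are illustrative only. */
typedef struct {
    unsigned pml4;   /* bits 47..39: index into the top-level table */
    unsigned pdpt;   /* bits 38..30 */
    unsigned pd;     /* bits 29..21 */
    unsigned pt;     /* bits 20..12: index into the last-level table */
    unsigned offset; /* bits 11..0 : byte offset inside the 4 KB page */
} WalkIndices;

static WalkIndices split_gpa(uint64_t gpa)
{
    WalkIndices w;

    assert((gpa >> 48) == 0);      /* only 48 address bits are walked */
    w.pml4   = (gpa >> 39) & 0x1ff; /* 9 bits per level */
    w.pdpt   = (gpa >> 30) & 0x1ff;
    w.pd     = (gpa >> 21) & 0x1ff;
    w.pt     = (gpa >> 12) & 0x1ff;
    w.offset = gpa & 0xfff;         /* 12 bits of page offset */
    return w;
}
```

For example, split_gpa(0xfebf0400) yields pml4 = 0, pdpt = 3, pd = 0x1f5, pt = 0x1f0 and offset = 0x400.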

Start qemu and open a monitor socket on port 4444 (the -qmp option; HMP commands are issued through it below)

/home/xiyanxiyan10/project/qemu/build/qemu-system-x86_64 -m 10240 -enable-kvm -cpu host -s \
    -kernel /home/xiyanxiyan10/project/linux-source-6.2.0/arch/x86/boot/bzImage \
    -hda ./rootfs.img -nographic \
    -append "root=/dev/sda rw console=ttyS0 nokaslr" \
    -qmp tcp:127.0.0.1:4444,server,nowait

Use the script shipped with qemu to enter an interactive HMP command line and query the guest's memory layout

xiyanxiyan10@xiyanxiyan10:~/project/qemu$ scripts/qmp/qmp-shell -H localhost:4444
Welcome to the HMP shell!
Connected to QEMU 8.2.50

(QEMU) info mtree
address-space: VGA
  0000000000000000-ffffffffffffffff (prio 0, i/o): bus master container

...

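How qemu organizes memory data

MemoryRegion

As the name suggests, a MemoryRegion describes a region of memory. Its two most important members are addr and size, which give the region's start address and its size.

From the point of view of the memory model, the key property of MemoryRegion is that the regions form a tree.

A simple diagram to illustrate: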

                            struct MemoryRegion
                            +------------------------+                                         
                            |name                    |                                         
                            |  (const char *)        |                                         
                            +------------------------+                                         
                            |addr                    |                                         
                            |  (hwaddr)              |                                         
                            |size                    |                                         
                            |  (Int128)              |                                         
                            +------------------------+                                         
                            |subregions              |                                         
                            |    QTAILQ_HEAD()       |                                         
                            +------------------------+                                         
                                       |
                                       |
               ----+-------------------+---------------------+----
                   |                                         |
                   |                                         |
                   |                                         |

     struct MemoryRegion                            struct MemoryRegion
     +------------------------+                     +------------------------+
     |name                    |                     |name                    |
     |  (const char *)        |                     |  (const char *)        |
     +------------------------+                     +------------------------+
     |addr                    |                     |addr                    |
     |  (hwaddr)              |                     |  (hwaddr)              |
     |size                    |                     |size                    |
     |  (Int128)              |                     |  (Int128)              |
     +------------------------+                     +------------------------+
     |subregions              |                     |subregions              |
     |    QTAILQ_HEAD()       |                     |    QTAILQ_HEAD()       |
     +------------------------+                     +------------------------+

Now let's see what a MemoryRegion tree looks like in practice.

Each address-space points to a root MemoryRegion, and beneath that root is a tree made of MemoryRegion nodes.

(QEMU) info mtree
address-space: VGA
  0000000000000000-ffffffffffffffff (prio 0, i/o): bus master container

address-space: piix3-ide
  0000000000000000-ffffffffffffffff (prio 0, i/o): bus master container
    0000000000000000-ffffffffffffffff (prio 0, i/o): alias bus master @system 0000000000000000-ffffffffffffffff

address-space: e1000
  0000000000000000-ffffffffffffffff (prio 0, i/o): bus master container

address-space: cpu-memory-0
address-space: memory
  0000000000000000-ffffffffffffffff (prio 0, i/o): system
    0000000000000000-00000000bfffffff (prio 0, ram): alias ram-below-4g @pc.ram 0000000000000000-00000000bfffffff
    0000000000000000-ffffffffffffffff (prio -1, i/o): pci
      00000000000a0000-00000000000bffff (prio 1, i/o): vga-lowmem
      00000000000c0000-00000000000dffff (prio 1, rom): pc.rom
      00000000000e0000-00000000000fffff (prio 1, rom): alias isa-bios @pc.bios 0000000000020000-000000000003ffff
      00000000fd000000-00000000fdffffff (prio 1, ram): vga.vram
      00000000febc0000-00000000febdffff (prio 1, i/o): e1000-mmio
      00000000febf0000-00000000febf0fff (prio 1, i/o): vga.mmio
        00000000febf0000-00000000febf017f (prio 0, i/o): edid
        00000000febf0400-00000000febf041f (prio 0, i/o): vga ioports remapped
        00000000febf0500-00000000febf0515 (prio 0, i/o): bochs dispi interface
        00000000febf0600-00000000febf0607 (prio 0, i/o): qemu extended regs
      00000000fffc0000-00000000ffffffff (prio 0, rom): pc.bios
    00000000000a0000-00000000000bffff (prio 1, i/o): alias smram-region @pci 00000000000a0000-00000000000bffff
    00000000000c0000-00000000000c3fff (prio 1, ram): alias pam-rom @pc.ram 00000000000c0000-00000000000c3fff
    00000000000c4000-00000000000c7fff (prio 1, ram): alias pam-rom @pc.ram 00000000000c4000-00000000000c7fff
    00000000000c8000-00000000000cbfff (prio 1, ram): alias pam-rom @pc.ram 00000000000c8000-00000000000cbfff
    00000000000cb000-00000000000cdfff (prio 1000, ram): alias kvmvapic-rom @pc.ram 00000000000cb000-00000000000cdfff
    00000000000cc000-00000000000cffff (prio 1, ram): alias pam-rom @pc.ram 00000000000cc000-00000000000cffff
    00000000000d0000-00000000000d3fff (prio 1, ram): alias pam-rom @pc.ram 00000000000d0000-00000000000d3fff
    00000000000d4000-00000000000d7fff (prio 1, ram): alias pam-rom @pc.ram 00000000000d4000-00000000000d7fff
    00000000000d8000-00000000000dbfff (prio 1, ram): alias pam-rom @pc.ram 00000000000d8000-00000000000dbfff
    00000000000dc000-00000000000dffff (prio 1, ram): alias pam-rom @pc.ram 00000000000dc000-00000000000dffff
    00000000000e0000-00000000000e3fff (prio 1, ram): alias pam-rom @pc.ram 00000000000e0000-00000000000e3fff
    00000000000e4000-00000000000e7fff (prio 1, ram): alias pam-ram @pc.ram 00000000000e4000-00000000000e7fff
    00000000000e8000-00000000000ebfff (prio 1, ram): alias pam-ram @pc.ram 00000000000e8000-00000000000ebfff
    00000000000ec000-00000000000effff (prio 1, ram): alias pam-ram @pc.ram 00000000000ec000-00000000000effff
    00000000000f0000-00000000000fffff (prio 1, ram): alias pam-rom @pc.ram 00000000000f0000-00000000000fffff
    00000000fec00000-00000000fec00fff (prio 0, i/o): kvm-ioapic
    00000000fed00000-00000000fed003ff (prio 0, i/o): hpet
    00000000fee00000-00000000feefffff (prio 4096, i/o): kvm-apic-msi
    0000000100000000-00000002bfffffff (prio 0, ram): alias ram-above-4g @pc.ram 00000000c0000000-000000027fffffff
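In the output above, nested regions such as "vga ioports remapped" are printed with absolute addresses, while in the tree each region only stores its offset inside its container. The sketch below shows how the absolute address falls out of walking up the tree. It is a toy model: ToyRegion, toy_add_subregion and toy_absolute_addr are hypothetical names, loosely modelled on MemoryRegion and memory_region_add_subregion, and 64-bit sizes are assumed instead of Int128.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for MemoryRegion: only the fields needed for the walk. */
typedef struct ToyRegion {
    const char *name;
    uint64_t addr;               /* offset inside the container */
    uint64_t size;
    struct ToyRegion *container; /* parent region, NULL for a root */
} ToyRegion;

/* Attach child to parent at the given offset. */
static void toy_add_subregion(ToyRegion *parent, uint64_t offset,
                              ToyRegion *child)
{
    assert(child->container == NULL);             /* not attached yet */
    assert(offset + child->size <= parent->size); /* must fit inside */
    child->addr = offset;
    child->container = parent;
}

/* Sum the offsets of the region and all of its containers. */
static uint64_t toy_absolute_addr(const ToyRegion *mr)
{
    uint64_t abs = 0;

    for (; mr != NULL; mr = mr->container) {
        abs += mr->addr;
    }
    return abs;
}
```

Attaching a 0x1000-byte "vga.mmio" region at offset 0xfebf0000 of a "pci" root, and a 0x20-byte region at offset 0x400 inside it, gives an absolute address of 0xfebf0400 for the inner region, which matches the info mtree line above.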

FlatView/FlatRange

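A FlatView is, well, a flat view. A flat view of what? You're smart, you already guessed it: of the MemoryRegion tree. As we just saw, MemoryRegions form a tall and majestic tree, but when it is time to actually use it, a flattened layout is much more comfortable to work with.

As before, let's take a look at what this data structure looks like.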

FlatView (An array of FlatRange)
+----------------------+
|nr                    |
|nr_allocated          |
|   (unsigned)         |         FlatRange             FlatRange
+----------------------+         
|ranges                | ------> +---------------------+---------------------+
|   (FlatRange *)      |         |offset_in_region     |offset_in_region     |
+----------------------+         |                     |                     |
                                 +---------------------+---------------------+
                                 |addr(AddrRange)      |addr(AddrRange)      |
                                 |    +----------------|    +----------------+
                                 |    |start (Int128)  |    |start (Int128)  |
                                 |    |size  (Int128)  |    |size  (Int128)  |
                                 +----+----------------+----+----------------+
                                 |mr                   |mr                   |
                                 | (MemoryRegion *)    | (MemoryRegion *)    |
                                 +---------------------+---------------------+
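Because a flat view is just a sorted array of non-overlapping ranges, finding the range that contains an address is a binary search. The sketch below illustrates the idea on a toy structure; ToyFlatRange, ToyFlatView and toy_flatview_lookup are hypothetical names, and plain 64-bit integers are assumed in place of Int128.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy FlatRange: an address range plus the region it came from. */
typedef struct {
    uint64_t start;
    uint64_t size;
    const char *mr_name; /* stands in for the MemoryRegion pointer */
} ToyFlatRange;

/* Toy FlatView: ranges sorted by start address, never overlapping. */
typedef struct {
    const ToyFlatRange *ranges;
    size_t nr;
} ToyFlatView;

/* Binary search for the range containing addr; NULL if it is a hole. */
static const ToyFlatRange *toy_flatview_lookup(const ToyFlatView *fv,
                                               uint64_t addr)
{
    size_t lo = 0, hi = fv->nr;

    assert(fv->nr == 0 || fv->ranges != NULL);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const ToyFlatRange *r = &fv->ranges[mid];

        if (addr < r->start) {
            hi = mid;                 /* look in the lower half */
        } else if (addr - r->start >= r->size) {
            lo = mid + 1;             /* look in the upper half */
        } else {
            return r;                 /* start <= addr < start + size */
        }
    }
    return NULL;
}
```

An address that falls between two ranges returns NULL, which is the toy equivalent of hitting unassigned memory.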

Address-space

Next, let's see how these pieces relate to one another.

AddressSpace               
+-------------------------+
|name                     |
|   (char *)              |          FlatView (An array of FlatRange)
+-------------------------+          +----------------------+
|current_map              | -------->|nr                    |
|   (FlatView *)          |          |nr_allocated          |
+-------------------------+          |   (unsigned)         |         FlatRange             FlatRange
|                         |          +----------------------+         
|                         |          |ranges                | ------> +---------------------+---------------------+
|                         |          |   (FlatRange *)      |         |offset_in_region     |offset_in_region     |
|                         |          +----------------------+         |                     |                     |
|                         |                                           +---------------------+---------------------+
|                         |                                           |addr(AddrRange)      |addr(AddrRange)      |
|                         |                                           |    +----------------|    +----------------+
|                         |                                           |    |start (Int128)  |    |start (Int128)  |
|                         |                                           |    |size  (Int128)  |    |size  (Int128)  |
|                         |                                           +----+----------------+----+----------------+
|                         |                                           |mr                   |mr                   |
|                         |                                           | (MemoryRegion *)    | (MemoryRegion *)    |
|                         |                                           +---------------------+---------------------+
|                         |
|                         |
|                         |
|                         |          MemoryRegion(system_memory/system_io)
+-------------------------+          +----------------------+
|root                     |          |                      | root of a MemoryRegion
|   (MemoryRegion *)      | -------->|                      | tree
+-------------------------+          +----------------------+

RAMBlock

The RAMBlock data structure describes the memory on the host that backs the guest. qemu chains the RAMBlocks together in a linked list.

  ram_list (RAMList)
  +------------------------------+
  |dirty_memory[]                |
  |    (unsigned long *)         |
  +------------------------------+
  |blocks                        |
  |    QLIST_HEAD                |
  +------------------------------+
   |
   |     RAMBlock                                     RAMBlock
   |     +---------------------------+                +---------------------------+
   +---> |next                       | -------------> |next                       |
         |    QLIST_ENTRY(RAMBlock)  |                |    QLIST_ENTRY(RAMBlock)  |
         +---------------------------+                +---------------------------+
         |offset                     |                |offset                     |
         |used_length                |                |used_length                |
         |max_length                 |                |max_length                 |
         |    (ram_addr_t)           |                |    (ram_addr_t)           |
         +---------------------------+                +---------------------------+

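The GPA -> HVA mapping is obtained by pairing MemoryRegion->addr with RAMBlock->host.

The association between MemoryRegion and RAMBlock therefore establishes the mapping between guest memory and virtual addresses on the host.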

        RAMBlock                                                     RAMBlock
         +---------------------------+                                +---------------------------+
         |next                       | -----------------------------> |next                       |
         |    QLIST_ENTRY(RAMBlock)  |                                |    QLIST_ENTRY(RAMBlock)  |
         +---------------------------+                                +---------------------------+
         |offset                     |                                |offset                     |
         |used_length                |                                |used_length                |
         |max_length                 |                                |max_length                 |
         |    (ram_addr_t)           |                                |    (ram_addr_t)           |
         +---------------------------+                                +---------------------------+
         |host                       |  virtual address of a ram      |host                       |  
         |    (uint8_t *)            |  in host (mmap)                |    (uint8_t *)            |
         +---------------------------+                                +---------------------------+
         |mr                         |                                |mr                         |
         |    (struct MemoryRegion *)|                                |    (struct MemoryRegion *)|
         +---------------------------+                                +---------------------------+
          |                                                            |
          |                                                            |
          |                                                            |
          |   struct MemoryRegion                                      |   struct MemoryRegion
          +-->+------------------------+                               +-->+------------------------+
              |name                    |                                   |name                    |
              |  (const char *)        |                                   |  (const char *)        |
              +------------------------+                                   +------------------------+
              |addr                    |  physical address in guest        |addr                    |
              |  (hwaddr)              |  (offset in RAMBlock)             |  (hwaddr)              |
              |size                    |                                   |size                    |
              |  (Int128)              |                                   |  (Int128)              |
              +------------------------+                                   +------------------------+
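The translation the diagram implies is simple arithmetic: subtract the region's guest base, then add the host pointer. A minimal sketch, assuming one flat RAM region with a single backing block; ToyRAMBlock, ToyRamRegion and toy_gpa_to_hva are hypothetical names and ignore containers, aliases and everything else the real lookup has to handle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy RAMBlock: where the guest RAM lives in the host process. */
typedef struct {
    uint8_t *host;        /* host virtual address of the backing memory */
    uint64_t used_length; /* number of bytes in use */
} ToyRAMBlock;

/* Toy RAM MemoryRegion: where the same RAM appears to the guest. */
typedef struct {
    uint64_t addr;        /* guest physical base address */
    ToyRAMBlock *ram_block;
} ToyRamRegion;

/* GPA -> HVA: host + (gpa - base), or NULL if gpa is outside the region. */
static uint8_t *toy_gpa_to_hva(const ToyRamRegion *mr, uint64_t gpa)
{
    assert(mr->ram_block != NULL);
    if (gpa < mr->addr || gpa - mr->addr >= mr->ram_block->used_length) {
        return NULL;
    }
    return mr->ram_block->host + (gpa - mr->addr);
}
```

With a 4 KB block mapped at guest address 0x100000, GPA 0x100010 resolves to host + 0x10, and anything below 0x100000 or at 0x101000 and above resolves to NULL.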

MemoryListener

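For EPT to work properly, the guest's memory layout also has to be reported to KVM, and every later change has to be reported to KVM as well. This is done through MemoryListener, whose main callbacks are shown below.

  MemoryListener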
  +---------------------------+
  |begin                      |
  |commit                     |
  +---------------------------+
  |region_add                 |
  |region_del                 |
  +---------------------------+
  |eventfd_add                |
  |eventfd_del                |
  +---------------------------+
  |log_start                  |
  |log_stop                   |
  +---------------------------+

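Each AddressSpace has a set of MemoryListeners attached that care about changes to it. When the AddressSpace is updated, every Listener attached to it is called. One of them is the Listener that reports the guest's memory layout to KVM (kvm_region_add, kvm_region_del), and further Listeners that care about memory changes can be added in the same way.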

https://i-blog.csdnimg.cn/direct/0438d214c26e49fa93ccdca831cb6391.png
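The listener mechanism is the classic observer pattern. The sketch below is a toy version with only region_add and region_del: ToyListener, ToyAddressSpace and the toy_* functions are hypothetical names, and the callback shape (listener first, section second) is an assumption borrowed from how the kvm callbacks are named in this article.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct ToySection {
    uint64_t gpa;
    uint64_t size;
} ToySection;

/* Toy MemoryListener: optional callbacks plus a link to the next one. */
typedef struct ToyListener {
    void (*region_add)(struct ToyListener *l, const ToySection *s);
    void (*region_del)(struct ToyListener *l, const ToySection *s);
    struct ToyListener *next;
    unsigned added, deleted; /* bookkeeping for the demo callbacks */
} ToyListener;

typedef struct {
    ToyListener *listeners; /* head of the listener list */
} ToyAddressSpace;

static void toy_listener_register(ToyAddressSpace *as, ToyListener *l)
{
    assert(l->next == NULL);
    l->next = as->listeners;
    as->listeners = l;
}

/* Tell every listener that implements region_add about a new section. */
static void toy_notify_add(ToyAddressSpace *as, const ToySection *s)
{
    for (ToyListener *l = as->listeners; l != NULL; l = l->next) {
        if (l->region_add) {
            l->region_add(l, s);
        }
    }
}

/* Tell every listener that implements region_del about a removal. */
static void toy_notify_del(ToyAddressSpace *as, const ToySection *s)
{
    for (ToyListener *l = as->listeners; l != NULL; l = l->next) {
        if (l->region_del) {
            l->region_del(l, s);
        }
    }
}

/* Demo callbacks: a real one would forward the section to KVM. */
static void toy_count_add(ToyListener *l, const ToySection *s)
{
    (void)s;
    l->added++;
}

static void toy_count_del(ToyListener *l, const ToySection *s)
{
    (void)s;
    l->deleted++;
}
```

A listener that leaves a callback as NULL is simply skipped for that event, so each listener only pays for the events it cares about.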

Data structures

/**
 * struct AddressSpace: describes a mapping of addresses to #MemoryRegion objects
 */
struct AddressSpace {
    /* private: */
    struct rcu_head rcu;
    char *name;
    // points to the root of the MemoryRegion tree
    MemoryRegion *root;

    // the flat view obtained by flattening the MemoryRegion tree
    /* Accessed via RCU.  */
    struct FlatView *current_map;

    int ioeventfd_nb;
    int ioeventfd_notifiers;
    struct MemoryRegionIoeventfd *ioeventfds;
    // list of callbacks invoked when the memory layout changes
    QTAILQ_HEAD(, MemoryListener) listeners;

    // AddressSpaces are themselves chained together in a list
    QTAILQ_ENTRY(AddressSpace) address_spaces_link;
};

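Depending on which attributes are filled in, the common kinds of MemoryRegion are:

  • RAM: a range of host virtual memory that is actually allocated to the guest as its physical memory.
  • MMIO: a range of guest memory that has no backing virtual memory on the host. Accesses to it are intercepted and routed to the matching read/write functions; this is what device emulation uses.
  • ROM: similar to RAM, except that the memory is read-only and cannot be written.
  • ROM device: for reads it behaves like RAM and can be read directly; for writes it behaves like MMIO, and a write invokes the corresponding write callback.
  • container: holds several MemoryRegions, each at a different offset inside the container. A container is mainly used to merge several MemoryRegions into one; the PCI MemoryRegion, for example, contains both RAM and MMIO. The regions in a container usually do not overlap, but there are exceptions.
  • alias: a window onto part of another region, which makes it possible to split one region into several non-contiguous pieces.

/** MemoryRegion: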
 *
 * A struct representing a memory region.
 */
struct MemoryRegion {
    Object parent_obj;

    /* private: */
    
    /* The following fields should fit in a cache line */
    bool romd_mode;
    bool ram;
    bool subpage;
    bool readonly; /* For RAM regions */
    bool nonvolatile;
    bool rom_device;
    bool flush_coalesced_mmio;
    bool unmergeable;
    uint8_t dirty_log_mask;
    bool is_iommu;
    
    // ram_block describes the memory that actually backs this region,
    // i.e. host virtual memory, stored as a RAMBlock
    RAMBlock *ram_block;
    Object *owner;
    /* owner as TYPE_DEVICE. Used for re-entrancy checks in MR access hotpath */
    DeviceState *dev;
    // ops is a set of callbacks invoked when the MemoryRegion is accessed, e.g. for MMIO reads and writes
    const MemoryRegionOps *ops;
    void *opaque;
    MemoryRegion *container;
    int mapped_via_alias; /* Mapped via an alias, container might be NULL */
    Int128 size;
    // addr is where this MemoryRegion sits in guest physical address space, expressed as an offset within its container
    hwaddr addr;
    void (*destructor)(MemoryRegion *mr);
    uint64_t align;
    bool terminates;
    bool ram_device;
    bool enabled;
    bool warning_printed; /* For reservations */
    uint8_t vga_logging_count;
    MemoryRegion *alias;
    hwaddr alias_offset;
    // priority gives the precedence of this MemoryRegion
    int32_t priority;
    // subregions links together the child MemoryRegions of this MemoryRegion
    QTAILQ_HEAD(, MemoryRegion) subregions;
    // subregions_link links this MemoryRegion to its siblings under the same parent
    QTAILQ_ENTRY(MemoryRegion) subregions_link;
    QTAILQ_HEAD(, CoalescedMemoryRange) coalesced;
    const char *name;
    unsigned ioeventfd_nb;
    MemoryRegionIoeventfd *ioeventfds;
    RamDiscardManager *rdm; /* Only for RAM */

    /* For devices designed to perform re-entrant IO into their own IO MRs */
    bool disable_reentrancy_guard;
};

/*
 * Memory region callbacks
 */
struct MemoryRegionOps {
    /* Read from the memory region. @addr is relative to @mr; @size is
     * in bytes. */
    uint64_t (*read)(void *opaque,
                     hwaddr addr,
                     unsigned size);
    /* Write to the memory region. @addr is relative to @mr; @size is
     * in bytes. */
    void (*write)(void *opaque,
                  hwaddr addr,
                  uint64_t data,
                  unsigned size);

    MemTxResult (*read_with_attrs)(void *opaque,
                                   hwaddr addr,
                                   uint64_t *data,
                                   unsigned size,
                                   MemTxAttrs attrs);
    MemTxResult (*write_with_attrs)(void *opaque,
                                    hwaddr addr,
                                    uint64_t data,
                                    unsigned size,
                                    MemTxAttrs attrs);

    //...
};
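To see how the read/write callbacks are meant to be used, here is a toy MMIO dispatch. It is only a sketch: ToyRegionOps copies the shape of the read and write members above (with uint64_t standing in for hwaddr), while ToyDevice, ToyMmioRegion and the toy_* functions are hypothetical and model a device with four 32-bit registers.

```c
#include <assert.h>
#include <stdint.h>

/* Same callback shape as the read/write members of MemoryRegionOps. */
typedef struct {
    uint64_t (*read)(void *opaque, uint64_t addr, unsigned size);
    void (*write)(void *opaque, uint64_t addr, uint64_t data, unsigned size);
} ToyRegionOps;

/* Hypothetical device: four 32-bit registers. */
typedef struct {
    uint32_t regs[4];
    unsigned writes; /* how many register writes were seen */
} ToyDevice;

/* addr is relative to the start of the region, as in the real callbacks. */
static uint64_t toy_dev_read(void *opaque, uint64_t addr, unsigned size)
{
    ToyDevice *d = opaque;

    assert(size == 4 && addr % 4 == 0 && addr / 4 < 4);
    return d->regs[addr / 4];
}

static void toy_dev_write(void *opaque, uint64_t addr, uint64_t data,
                          unsigned size)
{
    ToyDevice *d = opaque;

    assert(size == 4 && addr % 4 == 0 && addr / 4 < 4);
    d->regs[addr / 4] = (uint32_t)data;
    d->writes++;
}

static const ToyRegionOps toy_dev_ops = {
    .read = toy_dev_read,
    .write = toy_dev_write,
};

/* Toy MMIO region: a guest address range bound to ops + opaque. */
typedef struct {
    uint64_t base;
    uint64_t size;
    const ToyRegionOps *ops;
    void *opaque;
} ToyMmioRegion;

/* An intercepted guest write: convert GPA to region offset, then call ops. */
static void toy_mmio_write(const ToyMmioRegion *mr, uint64_t gpa,
                           uint64_t data, unsigned size)
{
    assert(gpa >= mr->base && gpa - mr->base < mr->size);
    mr->ops->write(mr->opaque, gpa - mr->base, data, size);
}

static uint64_t toy_mmio_read(const ToyMmioRegion *mr, uint64_t gpa,
                              unsigned size)
{
    assert(gpa >= mr->base && gpa - mr->base < mr->size);
    return mr->ops->read(mr->opaque, gpa - mr->base, size);
}
```

The opaque pointer is what lets one set of callbacks serve many device instances: the region carries the device state, and the callbacks never need a global.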

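How qemu allocates guest memory

The physical memory a guest uses is mapped from host virtual memory.

So when qemu allocates memory for a guest, what really happens is that the host process requests virtual memory.

The memory is requested when the RAMBlock structure is initialized.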

pc_memory_init()
  memory_region_allocate_system_memory()
    allocate_system_memory_nonnuma()
      memory_region_init_ram_nomigrate()
        memory_region_init_ram_shared_nomigrate()
        {
          mr->ram = true;
          mr->destructor = memory_region_destructor_ram;
          // allocate the RAMBlock
          mr->ram_block = qemu_ram_alloc(size, share, mr, errp);
        }

RAMBlock *qemu_ram_alloc(ram_addr_t size, uint32_t ram_flags,
                         MemoryRegion *mr, Error **errp)
{
    assert((ram_flags & ~(RAM_SHARED | RAM_NORESERVE)) == 0);
    return qemu_ram_alloc_internal(size, size, NULL, NULL, ram_flags, mr, errp);
}

static
RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
                                  void (*resized)(const char*,
                                                  uint64_t length,
                                                  void *host),
                                  void *host, uint32_t ram_flags,
                                  MemoryRegion *mr, Error **errp)
{
    // ...

    size = HOST_PAGE_ALIGN(size);
    max_size = HOST_PAGE_ALIGN(max_size);
    new_block = g_malloc0(sizeof(*new_block));
    new_block->mr = mr;
    new_block->resized = resized;
   //...
    ram_block_add(new_block, &local_err);
   //...
    return new_block;
}

static void ram_block_add(RAMBlock *new_block, Error **errp)
{
    //...

    if (!new_block->host) {
        if (xen_enabled()) {
           //...
        } else {
            new_block->host = qemu_anon_ram_alloc(new_block->max_length,
                                                  &new_block->mr->align,
                                                  shared, noreserve);
           //...
        }
    }

   //...
}

// allocation path on POSIX hosts
/* alloc shared memory pages */
void *qemu_anon_ram_alloc(size_t size, uint64_t *alignment, bool shared,
                          bool noreserve)
{
    const uint32_t qemu_map_flags = (shared ? QEMU_MAP_SHARED : 0) |
                                    (noreserve ? QEMU_MAP_NORESERVE : 0);
    size_t align = QEMU_VMALLOC_ALIGN;
    void *ptr = qemu_ram_mmap(-1, size, align, qemu_map_flags, 0);

    //...
    return ptr;
}

void *qemu_ram_mmap(int fd,
                    size_t size,
                    size_t align,
                    uint32_t qemu_map_flags,
                    off_t map_offset)
{
   //...

    ptr = mmap_activate(guardptr + offset, size, fd, qemu_map_flags,
                        map_offset);
  //...

    return ptr;
}

static void *mmap_activate(void *ptr, size_t size, int fd,
                           uint32_t qemu_map_flags, off_t map_offset)
{
   // ...
    // as we can see, the memory pages are ultimately requested with mmap
    activated_ptr = mmap(ptr, size, prot, flags | map_sync_flags, fd,
                         map_offset);
    // ...
    return activated_ptr;
}
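Stripped of alignment, guard pages and flag translation, the end of this call chain is an anonymous mmap. The sketch below shows just that step; toy_alloc_guest_ram and toy_free_guest_ram are hypothetical helper names, and a private (non-shared) mapping is assumed.

```c
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS on strict compilers */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Reserve size bytes of zero-filled anonymous memory in this process.
 * Returns NULL on failure. Pages are only backed by physical memory
 * once they are touched. */
static void *toy_alloc_guest_ram(size_t size)
{
    void *p;

    assert(size > 0);
    p = mmap(NULL, size, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Release memory obtained from toy_alloc_guest_ram; 0 on success. */
static int toy_free_guest_ram(void *p, size_t size)
{
    return munmap(p, size);
}
```

The pointer returned here plays the role of RAMBlock->host: it is an ordinary virtual address in the qemu process that the guest will later see as physical memory.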

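Building the memory dispatch table

Memory dispatch in QEMU means: given an AddressSpace and an address, quickly find the MemoryRegionSection that contains the address, and from it the corresponding MemoryRegion. The data structure involved is AddressSpaceDispatch, which records the dispatch information of an AddressSpace and is reached through a dispatch member (in recent QEMU versions that member lives in the FlatView that current_map points to).

In short

  • phys_map plays a role like CR3
  • nodes is the page table, stored as an array of nodes that refer to each other by index
  • sections is what the ptr of a leaf entry in nodes refers to; each section contains a MemoryRegion

To look up a MemoryRegion, nodes and phys_map are used together to quickly find the matching section in sections, and the section leads to the associated MemoryRegion.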

        AddressSpaceDispatch
        +-------------------------+
        |as                       |
        |   (AddressSpace *)      |
        +-------------------------+
        |mru_section              |
        |   (MemoryRegionSection*)|
        |                         |
        |                         |
        |                         |
        |                         |
        |                         |
        +-------------------------+
        |map(PhysPageMap)         |         MemoryRegionSection[]
        |   +---------------------+         +---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+
        |   |sections             |-------->|mr = io_mem_unassigned     |mr = io_mem_notdirty       |mr = io_mem_rom            |mr = io_mem_watch          |mr  = one mr in tree       |mr  = subpage_t->iomem     |
        |   | MemoryRegionSection*|         |   (MemoryRegion *)        |   (MemoryRegion *)        |   (MemoryRegion *)        |   (MemoryRegion *)        |   (MemoryRegion *)        |   (MemoryRegion *)        |
        |   |                     |         |                           |                           |                           |                           |                           |                           |
        |   +---------------------+         +---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+
        |   |sections_nb          |         |fv                         |fv                         |fv                         |fv                         |fv                         |fv                         |
        |   |sections_nb_alloc    |         |   (FlatView *)            |   (FlatView *)            |   (FlatView *)            |   (FlatView *)            |   (FlatView *)            |   (FlatView *)            |
        |   |   (unsigned)        |         +---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+
        |   +---------------------+         |size (Int128)              |size (Int128)              |size (Int128)              |size (Int128)              |size (Int128)              |size (Int128)              |
        |   |                     |         +---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+
        |   |                     |         |offset_within_region       |offset_within_region       |offset_within_region       |offset_within_region       |offset_within_region       |offset_within_region       |
        |   |                     |         |   (hwaddr)                |   (hwaddr)                |   (hwaddr)                |   (hwaddr)                |   (hwaddr)                |   (hwaddr)                |
        |   |                     |         |offset_within_address_space|offset_within_address_space|offset_within_address_space|offset_within_address_space|offset_within_address_space|offset_within_address_space|
        |   |                     |         |   (hwaddr)  GPA           |   (hwaddr)  GPA           |   (hwaddr)  GPA           |   (hwaddr)  GPA           |   (hwaddr)                |   (hwaddr)                |
        |   |                     |         +---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+
        |   |                     |                                                                                                                                  ^
        |   |                     |                                                           nodes[1]                                                               |
        |   |                     |                                                     +---->+------------------+                                                   |
        |   |                     |                                                     |     |u32 skip:6        | = 0                                               |
        |   |                     |                                                     |     |u32 ptr:26        | = 4      -----------------------------------------+
        |   |                     |         P_L2_LEVELS = 6                             |     +------------------+
        |   |                     |         nodes[0] = PhysPageEntry[P_L2_SIZE = 2^9]   |     |                  |
        |   +---------------------+         +------------------+                        |     |  ...             |
        |   |nodes                | ------->|u32 skip:6        | = 1                    |     |                  |
        |   |  (Node *)           |         |u32 ptr:26        | = 1       -------------+     +------------------+
        |   +---------------------+         +------------------+                              |u32 skip:6        | = 0
        |   |nodes_nb             |         |                  |                              |u32 ptr:26        | = PHYS_SECTION_UNASSIGNED
        |   |nodes_nb_alloc       |         |  ...             |                              +------------------+
        |   |  (unsigned)         |         |                  |
        |   +---------------------+         +------------------+
        |   |                     |         |u32 skip:6        | = 1
        |   |                     |         |u32 ptr:26        | = PHYS_MAP_NODE_NIL          nodes[2]
        |   |                     |         +------------------+                        +---->+------------------+
        |   |                     |         |u32 skip:6        | = 1                    |     |u32 skip:6        |
        |   |                     |         |u32 ptr:26        | = 2       -------------+     |u32 ptr:26        |
        |   |                     |         +------------------+                              +------------------+
        |   |                     |              ^                                            |                  |
        |   |                     |              |                                            |  ...             |
        |   |                     |              |                                            |                  |
        +---+---------------------+              |                                            +------------------+
        |phys_map(PhysPageEntry)  |              |                                            |u32 skip:6        |
        |   +---------------------+              |                                            |u32 ptr:26        |
        |   |u32 skip:6           | = 1          |                                            +------------------+
        |   |u32 ptr:26           | = 0  --------+
        +---+---------------------+

The corresponding data structures are as follows

typedef PhysPageEntry Node[P_L2_SIZE];

typedef struct PhysPageMap {
    struct rcu_head rcu;

    unsigned sections_nb;
    unsigned sections_nb_alloc;
    unsigned nodes_nb;
    unsigned nodes_nb_alloc;
    Node *nodes;
    MemoryRegionSection *sections;
} PhysPageMap;

struct AddressSpaceDispatch {
    MemoryRegionSection *mru_section;
    /* This is a multi-level map on the physical address space.
     * The bottom level has pointers to MemoryRegionSections.
     */
    PhysPageEntry phys_map;
    PhysPageMap map;
};

struct MemoryRegionSection {
    Int128 size;
    MemoryRegion *mr;
    FlatView *fv;
    hwaddr offset_within_region;
    hwaddr offset_within_address_space;
    bool readonly;
    bool nonvolatile;
    bool unmergeable;
};
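The lookup through phys_map and nodes is a radix-tree walk, 9 bits at a time. The following toy version keeps only that idea. It is a sketch under strong simplifications: two levels instead of six, no skip field, a fixed pool of nodes, and section index 0 reserved to mean "unassigned". All Toy*/TOY_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_BITS 12
#define TOY_L2_BITS   9
#define TOY_L2_SIZE   (1u << TOY_L2_BITS)
#define TOY_LEVELS    2            /* the real table uses more levels */
#define TOY_MAX_NODES 8
#define TOY_NIL       0xffffffffu  /* "no node / no section here" */
#define TOY_SECTION_UNASSIGNED 0u

/* One node: 512 entries. Upper-level entries hold a node index,
 * last-level entries hold a section index. */
typedef uint32_t ToyNode[TOY_L2_SIZE];

typedef struct {
    uint32_t root;                 /* plays the role of phys_map */
    ToyNode nodes[TOY_MAX_NODES];
    unsigned nodes_nb;
} ToyMap;

static void toy_map_init(ToyMap *m)
{
    m->root = TOY_NIL;
    m->nodes_nb = 0;
}

static uint32_t toy_node_alloc(ToyMap *m)
{
    uint32_t idx;

    assert(m->nodes_nb < TOY_MAX_NODES);
    idx = m->nodes_nb++;
    for (unsigned i = 0; i < TOY_L2_SIZE; i++) {
        m->nodes[idx][i] = TOY_NIL;
    }
    return idx;
}

/* Map the 4 KB page containing addr to the given section index. */
static void toy_map_set_page(ToyMap *m, uint64_t addr, uint32_t section)
{
    uint64_t page = addr >> TOY_PAGE_BITS;
    uint32_t *slot = &m->root;

    assert(page < (1ull << (TOY_LEVELS * TOY_L2_BITS)));
    assert(section != TOY_NIL);
    for (int level = TOY_LEVELS - 1; level >= 0; level--) {
        if (*slot == TOY_NIL) {
            *slot = toy_node_alloc(m); /* grow the tree on demand */
        }
        slot = &m->nodes[*slot][(page >> (level * TOY_L2_BITS))
                                & (TOY_L2_SIZE - 1)];
    }
    *slot = section;
}

/* Walk from the root, consuming 9 bits of the page number per level. */
static uint32_t toy_map_lookup(const ToyMap *m, uint64_t addr)
{
    uint64_t page = addr >> TOY_PAGE_BITS;
    uint32_t ptr = m->root;

    if (page >= (1ull << (TOY_LEVELS * TOY_L2_BITS))) {
        return TOY_SECTION_UNASSIGNED;
    }
    for (int level = TOY_LEVELS - 1; level >= 0; level--) {
        if (ptr == TOY_NIL) {
            return TOY_SECTION_UNASSIGNED;
        }
        ptr = m->nodes[ptr][(page >> (level * TOY_L2_BITS))
                            & (TOY_L2_SIZE - 1)];
    }
    return ptr == TOY_NIL ? TOY_SECTION_UNASSIGNED : ptr;
}
```

Nodes are referenced by index rather than by pointer, as in the diagram above, which is what allows the node array to be reallocated without fixing up every entry.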

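Committing memory to KVM

In QEMU, memory_region_transaction_commit is a key function: it commits a memory transaction, making sure that changes made to the guest memory map at the MemoryRegion level are correctly synchronized to the underlying memory subsystem or to hardware acceleration (such as KVM). It is called whenever the memory map changes, specifically: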

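When the memory layout is initialized

  • When the virtual machine starts, QEMU initializes the guest's memory layout, which includes:

    • setting up the RAM regions.
    • registering device memory (such as MMIO regions).
  • Once the changes to the memory regions are complete, memory_region_transaction_commit is called to commit them.

When memory regions are added or removed

  • When memory regions are added or removed dynamically (for example memory or device hotplug), QEMU first modifies the memory map and then calls memory_region_transaction_commit to commit the update.

When the guest address space is adjusted

  • When a guest address space (such as the PCI address space) changes, the mapping of the memory regions has to be updated.
  • These adjustments are usually triggered by the device model, which calls memory_region_transaction_commit once it has finished adjusting the device's address space.

On snapshot restore or migration

  • On snapshot restore or virtual machine migration, the memory map has to be rebuilt.
  • QEMU calls memory_region_transaction_commit to make sure the new memory layout takes effect.

When access permissions or attributes change

  • If the access permissions of a memory region (such as read/write permissions) or its attributes (such as the caching policy) need to be adjusted, these changes are committed through the transaction mechanism.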

The functions in the source that call memory_region_transaction_commit can be recognized by name, for example

static void memory_region_finalize(Object *obj)
void memory_region_set_log(MemoryRegion *mr, bool log, unsigned client)
void memory_region_set_dirty(MemoryRegion *mr, hwaddr addr,
                             hwaddr size)
void memory_region_set_readonly(MemoryRegion *mr, bool readonly)
void memory_region_del_eventfd(MemoryRegion *mr,
                               hwaddr addr,
                               unsigned size,
                               bool match_data,
                               uint64_t data,
                               EventNotifier *e)
...

The main flow of memory_region_transaction_commit is as follows:

memory_region_transaction_commit(), update topology or ioeventfds
     flatviews_reset()
         flatviews_init()
             flat_views = g_hash_table_new_full()
             empty_view = generate_memory_topology(NULL);
         generate_memory_topology()
     MEMORY_LISTENER_CALL_GLOBAL(begin, Forward)
     address_space_set_flatview()
         address_space_update_topology_pass(false)
         address_space_update_topology_pass(true)
     address_space_update_ioeventfds()
         address_space_add_del_ioeventfds()
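     MEMORY_LISTENER_CALL_GLOBAL(commit, Forward)

  • flatviews_reset: rebuilds the FlatView of every AddressSpace.
  • MEMORY_LISTENER_CALL_GLOBAL(begin, Forward): invokes the begin callback of each registered listener.
  • address_space_set_flatview: adds and removes regions according to the differences between the old and new views.
  • address_space_update_ioeventfds: adds and removes eventfds according to the changes.
  • MEMORY_LISTENER_CALL_GLOBAL(commit, Forward): invokes the commit callback of each registered listener.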

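The first steps mainly update QEMU's own memory-mapping structures in response to the change; the updated memory map is then synchronized to KVM through the listener callbacks.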
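
How a changed view turns into region_del and region_add callbacks can be sketched as a merge walk over the old and new sorted range lists, run twice: once with adding == false to report removals, then with adding == true to report additions, mirroring the two address_space_update_topology_pass calls in the flow above. This is a simplified model; ToyRange, ToyListener, and the logging callbacks are hypothetical, and the real FlatRange comparison covers more attributes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t start, size; int mr_id; } ToyRange;

typedef struct {
    void (*region_add)(const ToyRange *r);
    void (*region_del)(const ToyRange *r);
} ToyListener;

static bool toy_same(const ToyRange *a, const ToyRange *b)
{
    return a->start == b->start && a->size == b->size && a->mr_id == b->mr_id;
}

/* One pass over both sorted lists; reports deletions or additions only. */
static void toy_update_pass(const ToyRange *old, size_t n_old,
                            const ToyRange *new_, size_t n_new,
                            const ToyListener *l, bool adding)
{
    size_t i = 0, j = 0;
    while (i < n_old || j < n_new) {
        const ToyRange *o = i < n_old ? &old[i] : NULL;
        const ToyRange *n = j < n_new ? &new_[j] : NULL;
        if (o && (!n || o->start < n->start ||
                  (o->start == n->start && !toy_same(o, n)))) {
            /* In old but not in new, or attributes changed: remove it. */
            if (!adding) {
                l->region_del(o);
            }
            i++;
        } else if (o && n && toy_same(o, n)) {
            /* Unchanged: no callback needed. */
            i++;
            j++;
        } else {
            /* Only in new: add it. */
            if (adding) {
                l->region_add(n);
            }
            j++;
        }
    }
}

/* Logging listener used for illustration: records "a<id>" or "d<id>". */
static char toy_log[32];
static size_t toy_log_len;

static void toy_log_event(char kind, const ToyRange *r)
{
    if (toy_log_len + 2 < sizeof(toy_log)) {
        toy_log[toy_log_len++] = kind;
        toy_log[toy_log_len++] = (char)('0' + r->mr_id % 10);
    }
}

static void toy_log_add(const ToyRange *r) { toy_log_event('a', r); }
static void toy_log_del(const ToyRange *r) { toy_log_event('d', r); }
```

Running the deletion pass before the addition pass means a resized range is first removed and then re-added, so a consumer such as KVM never sees two overlapping slots at the same time.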

For QEMU started in KVM mode, the listener callbacks registered for this purpose are:

 kml->listener.region_add = kvm_region_add;
 kml->listener.region_del = kvm_region_del;

Taking region_add as an example, the call chain is shown below (in the QEMU 8.2 build debugged later in this article, kvm_set_phys_mem is reached through kvm_region_commit instead, as the backtrace there shows):

kvm_region_add
        kvm_set_phys_mem
            kvm_set_user_memory_region
                kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem);

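The call kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem) is what hands the memory map built by QEMU over to KVM.

Look at kvm_set_user_memory_region: it copies the fields of a KVMSlot into a struct kvm_userspace_memory_region and submits that structure to KVM.

The KVMSlot *slot argument is prepared by the caller, which converts the MemoryRegionSection information into a KVMSlot so that it matches the format KVM expects.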

static int kvm_set_user_memory_region(KVMMemoryListener *kml, KVMSlot *slot, bool new)
{
    KVMState *s = kvm_state;
    struct kvm_userspace_memory_region mem;
    int ret;

    mem.slot = slot->slot | (kml->as_id << 16);
    mem.guest_phys_addr = slot->start_addr;
    mem.userspace_addr = (unsigned long)slot->ram;
    mem.flags = slot->flags;

   //...
    mem.memory_size = slot->memory_size;
    ret = kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem);
    slot->old_flags = mem.flags;
    //...
    return ret;
}
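
The line mem.slot = slot->slot | (kml->as_id << 16) packs two values into one 32-bit field: the slot number in the low 16 bits and the address-space id in the high 16 bits. The sketch below illustrates that packing and its inverse. ToySlot and ToyUserMemRegion are hypothetical stand-ins that mirror only the fields used in the listing above; they are not the real QEMU or kernel definitions.

```c
#include <stdint.h>

/* Hypothetical mirror of the fields the listing reads from KVMSlot. */
typedef struct {
    int slot;
    uint32_t flags;
    uint64_t start_addr;   /* guest physical address of the slot */
    uint64_t memory_size;
    void *ram;             /* host virtual address backing the slot */
} ToySlot;

/* Hypothetical mirror of the fields of struct kvm_userspace_memory_region. */
typedef struct {
    uint32_t slot;
    uint32_t flags;
    uint64_t guest_phys_addr;
    uint64_t memory_size;
    uint64_t userspace_addr;
} ToyUserMemRegion;

/* Build the request the same way the listing does, without the ioctl. */
static ToyUserMemRegion toy_fill_region(uint32_t as_id, const ToySlot *slot)
{
    ToyUserMemRegion mem;
    mem.slot = (uint32_t)slot->slot | (as_id << 16);
    mem.flags = slot->flags;
    mem.guest_phys_addr = slot->start_addr;
    mem.memory_size = slot->memory_size;
    mem.userspace_addr = (uint64_t)(uintptr_t)slot->ram;
    return mem;
}

/* Inverse of the packing: recover the two halves of the slot field. */
static uint32_t toy_slot_id(uint32_t packed) { return packed & 0xffffu; }
static uint32_t toy_as_id(uint32_t packed)   { return packed >> 16; }
```

Each request therefore carries the GPA range (guest_phys_addr, memory_size) together with the HVA that backs it (userspace_addr), which is all KVM needs to resolve guest faults in that range later.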

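Building the EPT in KVM

After a vCPU has been created, kvm_mmu_setup is called during its initialization to set up the MMU; the relevant function is init_kvm_mmu (the listing below shows its TDP variant, init_kvm_tdp_mmu).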

static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu,
                 union kvm_cpu_role cpu_role)
{
    // ...
    context->root_role.word = root_role.word;
    // kvm_tdp_page_fault handles EPT page faults; the remaining fields are then set according to the mode the vCPU is in
    context->page_fault = kvm_tdp_page_fault;
    //...
}

int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
    // ...
    return direct_page_fault(vcpu, fault);
}

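Execution then continues into direct_page_fault, whose main tasks are:

  • Locate the guest physical address (GPA) that faulted.
  • Use the kvm_memory_slot to find the corresponding host memory (HPA).
  • Update the second-level page table (EPT/NPT) to establish the mapping.

static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)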
{
    bool is_tdp_mmu_fault = is_tdp_mmu(vcpu->arch.mmu);

    unsigned long mmu_seq;
    int r;
    
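    // guest frame number: the faulting GPA shifted right by PAGE_SHIFT
    fault->gfn = fault->addr >> PAGE_SHIFT;
    // look up the memslot that covers this gfn; it ties the guest address to the host address it must be mapped to
    // the slot is the address information that QEMU reported to KVM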
    fault->slot = kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn);
    // ...
    if (is_tdp_mmu_fault) {
        r = kvm_tdp_mmu_map(vcpu, fault);
    } else {
        r = make_mmu_pages_available(vcpu);
        if (r)
            goto out_unlock;
        // build the EPT or NPT mapping, depending on the CPU architecture
        r = __direct_map(vcpu, fault);
    }

    //...
    return r;
}
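
The first two steps of direct_page_fault (GPA to gfn, then gfn to memslot) can be sketched in a few lines. This is a toy model: ToyMemslot keeps only three fields, the lookup is a linear scan rather than whatever search structure the kernel actually uses, and the example addresses are made up. The last step in the real kernel, resolving the host virtual address to a host page frame through the host's own page tables, is left out.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12  /* assumption: 4 KiB pages */

/* Hypothetical, reduced view of a memory slot. */
typedef struct {
    uint64_t base_gfn;        /* first guest frame covered by the slot */
    uint64_t npages;          /* number of pages in the slot */
    uint64_t userspace_addr;  /* HVA that backs base_gfn */
} ToyMemslot;

static uint64_t toy_gpa_to_gfn(uint64_t gpa)
{
    return gpa >> TOY_PAGE_SHIFT;
}

/* Return the slot containing gfn, or NULL if no slot covers it. */
static const ToyMemslot *toy_gfn_to_memslot(const ToyMemslot *slots,
                                            size_t n, uint64_t gfn)
{
    for (size_t i = 0; i < n; i++) {
        if (gfn >= slots[i].base_gfn &&
            gfn < slots[i].base_gfn + slots[i].npages) {
            return &slots[i];
        }
    }
    return NULL;
}

/* HVA of the page that backs gfn inside the given slot. */
static uint64_t toy_gfn_to_hva(const ToyMemslot *slot, uint64_t gfn)
{
    return slot->userspace_addr + ((gfn - slot->base_gfn) << TOY_PAGE_SHIFT);
}
```

For instance, the GPA 1008168 that appears as cr2_or_gpa in the debugging trace in this article corresponds to gfn 246. A gfn for which no slot is found is one that QEMU never registered as RAM.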

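MMIO handling

Debugging

Debugging QEMU as it submits memory information to KVM.

As the call stack shows, kvm_region_commit is triggered when the virtual machine starts, and the memory information QEMU has organized is ultimately submitted to KVM.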

(gdb) bt
#0  kvm_set_user_memory_region (slot=0x7ffff4c79010, new=new@entry=true, kml=<optimized out>) at ../accel/kvm/kvm-all.c:282
#1  0x0000555555cba6f2 in kvm_set_phys_mem (kml=kml@entry=0x555556dd0040, section=section@entry=0x555557406ec0, add=<optimized out>, add@entry=true) at ../accel/kvm/kvm-all.c:1365
#2  0x0000555555cbab4c in kvm_region_commit (listener=0x555556dd0040) at ../accel/kvm/kvm-all.c:1574
#3  0x0000555555c56bae in memory_region_transaction_commit () at ../system/memory.c:1137
#4  memory_region_transaction_commit () at ../system/memory.c:1117
#5  0x0000555555b7a525 in pc_memory_init
    (pcms=pcms@entry=0x55555700fc80, system_memory=system_memory@entry=0x555557019600, rom_memory=rom_memory@entry=0x555556dc8dc0, pci_hole64_size=pci_hole64_size@entry=2147483648)
    at ../hw/i386/pc.c:961
#6  0x0000555555b604d3 in pc_init1 (machine=0x55555700fc80, pci_type=0x555555ef9ac7 "i440FX", host_type=0x555555ef9ae6 "i440FX-pcihost") at ../hw/i386/pc_piix.c:243
#7  0x0000555555908201 in machine_run_board_init (machine=0x55555700fc80, mem_path=<optimized out>, errp=<optimized out>, errp@entry=0x555556d3f378 <error_fatal>)
    at ../hw/core/machine.c:1541
#8  0x0000555555abe1c6 in qemu_init_board () at ../system/vl.c:2614

Debugging KVM page-table construction shows that a guest page fault triggers the page-table construction function on the AMD CPU.

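kvm_tdp_mmu_map

  • Role

    • The mapping routine of the TDP MMU: it walks the TDP page table for the faulting gfn and installs the missing page-table levels and the final leaf entry.
    • It is not a general-purpose interface for pre-mapping an arbitrary GPA range.
  • Trigger conditions

    • Called from direct_page_fault when the vCPU uses the TDP MMU (the is_tdp_mmu_fault branch in the listing above), so it also depends on guest behavior.
  • Typical scenarios

    • The guest first touches a GPA that lies in a registered kvm_memory_slot but has no TDP entry yet; creating or changing a kvm_memory_slot does not by itself populate the TDP page table.
  • Inputs and responsibilities

    • Takes the vCPU and a struct kvm_page_fault, as the breakpoint output below shows.
    • Updates the TDP page table for the faulting address only; the guest is resumed afterwards.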

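direct_page_fault

  • Passive operation

    • Handles TDP page faults triggered while the guest is running.
    • Runs after the VM exit that the hardware raises when the guest accesses an address that is unmapped or for which it lacks permission (a VM exit, not a hardware interrupt).
  • Trigger conditions

    • Depends on guest behavior.
    • The hardware detects a missing TDP mapping or a permission problem and exits to the host, where KVM runs its fault handler.
  • Typical scenarios

    • The guest accesses a GPA that has not been mapped yet.
    • The guest attempts an operation it lacks permission for, such as writing to a read-only page.
  • Inputs and responsibilities

    • Inputs are the faulting GPA and the fault information (such as read, write, or execute access).
    • Locates the kvm_memory_slot for the GPA, resolves the host page, updates the TDP page table, and resumes guest execution.

Breakpoint 3, kvm_tdp_mmu_map (vcpu=vcpu@entry=0xffff88810a168000, fault=fault@entry=0xffffc90000b83bd0) at arch/x86/kvm/mmu/tdp_mmu.c:1159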
1159        {
(gdb) c
Continuing.

Breakpoint 2, direct_page_fault (vcpu=vcpu@entry=0xffff88810a168000, fault=fault@entry=0xffffc90000b83bd0) at arch/x86/kvm/mmu/mmu.c:4267
4267        {
(gdb) bt
#0  direct_page_fault (vcpu=vcpu@entry=0xffff88810a168000, fault=fault@entry=0xffffc90000b83bd0) at arch/x86/kvm/mmu/mmu.c:4267
#1  0xffffffffa027b3ad in kvm_tdp_page_fault (vcpu=vcpu@entry=0xffff88810a168000, fault=fault@entry=0xffffc90000b83bd0) at arch/x86/kvm/mmu/mmu.c:4393
#2  0xffffffffa027b6bd in kvm_mmu_do_page_fault (prefetch=false, err=4, cr2_or_gpa=1008168, vcpu=0xffff88810a168000) at arch/x86/kvm/mmu/mmu_internal.h:291
#3  kvm_mmu_page_fault (vcpu=0xffff88810a168000, cr2_or_gpa=1008168, error_code=4294967300, insn=0x0 <fixed_percpu_data>, insn_len=0) at arch/x86/kvm/mmu/mmu.c:5592
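
The error_code=4294967300 in frame #3 is 0x100000004, and the err=4 in frame #2 is its low 32 bits. A small decoder makes the individual bits visible. The bit positions below follow the PFERR_* definitions in the kernel's arch/x86/include/asm/kvm_host.h as we understand them; treat them as an assumption and check them against your own kernel tree. The TOY_* names and the ToyFault type are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit layout, modeled on the kernel's PFERR_* masks. */
#define TOY_PFERR_PRESENT     (1ULL << 0)   /* fault on a present entry */
#define TOY_PFERR_WRITE       (1ULL << 1)   /* write access */
#define TOY_PFERR_USER        (1ULL << 2)   /* user-mode access */
#define TOY_PFERR_RSVD        (1ULL << 3)   /* reserved bit set in an entry */
#define TOY_PFERR_FETCH       (1ULL << 4)   /* instruction fetch */
#define TOY_PFERR_GUEST_FINAL (1ULL << 32)  /* fault on the final guest address */
#define TOY_PFERR_GUEST_PAGE  (1ULL << 33)  /* fault during the guest page walk */

typedef struct {
    bool present, write, user, rsvd, fetch;
    bool guest_final, guest_page;
} ToyFault;

static ToyFault toy_decode_error_code(uint64_t ec)
{
    ToyFault f;
    f.present     = (ec & TOY_PFERR_PRESENT) != 0;
    f.write       = (ec & TOY_PFERR_WRITE) != 0;
    f.user        = (ec & TOY_PFERR_USER) != 0;
    f.rsvd        = (ec & TOY_PFERR_RSVD) != 0;
    f.fetch       = (ec & TOY_PFERR_FETCH) != 0;
    f.guest_final = (ec & TOY_PFERR_GUEST_FINAL) != 0;
    f.guest_page  = (ec & TOY_PFERR_GUEST_PAGE) != 0;
    return f;
}
```

Under these assumptions the value from the trace decodes to a read access to a not-present entry (present and write both clear) on the final guest physical address, which fits a first touch of a page that has no second-level mapping yet.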

References