qemu-kvm Source Code Analysis: Memory Virtualization
Introduction to memory virtualization
On the host, a program's addresses are translated as HVA (host virtual address) –MMU–> HPA (host physical address).
A virtual machine running on that host needs two levels of translation:
GVA (guest virtual address) –MMU–> GPA (guest physical address)
GPA (guest physical address) –VMM–> HPA (host physical address)
Guest memory translation used to rely on shadow page tables; today it relies mainly on EPT.
- Inside the guest, GVA (guest virtual address) –MMU–> GPA (guest physical address): the guest OS does not know it is virtualized, so the MMU performs this translation exactly as it would on bare metal.
- The CPU, however, does know it is running a guest, and automatically walks the EPT to translate GPA (guest physical address) to HPA (host physical address).
- The EPT page tables are maintained by the VMM: it builds the GPA-to-HPA mappings and installs them as EPT entries, which KVM does when an EPT walk fails and causes an EPT-violation VM exit.
To check whether the current system supports EPT:
cat /proc/cpuinfo |grep -E 'ept|pdpe1gb'
In general, EPT uses the IA-32e paging mode: a 48-bit physical address is resolved through four levels of page tables, each level indexed by 9 bits of the address, with the final 12 bits giving the offset within a 4 KB page.
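As a quick illustration of that split, the standalone C sketch below (not QEMU/KVM code; the address used is just an example) breaks a 48-bit guest-physical address into its four 9-bit table indices plus the 12-bit page offset:
#include <stdint.h>
#include <stdio.h>

/* Split a 48-bit physical address the way an IA-32e/EPT walk does:
 * four levels of 9 bits each, plus a 12-bit offset within the 4 KB page. */
static void ept_walk_indices(uint64_t gpa)
{
    unsigned pml4 = (gpa >> 39) & 0x1ff;  /* level-4 table index */
    unsigned pdpt = (gpa >> 30) & 0x1ff;  /* level-3 table index */
    unsigned pd   = (gpa >> 21) & 0x1ff;  /* level-2 table index */
    unsigned pt   = (gpa >> 12) & 0x1ff;  /* level-1 table index */
    unsigned off  = gpa & 0xfff;          /* offset inside the 4 KB page */

    printf("PML4=%u PDPT=%u PD=%u PT=%u offset=0x%03x\n",
           pml4, pdpt, pd, pt, off);
}

int main(void)
{
    ept_walk_indices(0xfebf0400ULL);  /* e.g. the vga ioports GPA seen in the mtree dump below */
    return 0;
}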
Start QEMU with a QMP server on port 4444 (we will drive it with HMP commands through qmp-shell -H):
/home/xiyanxiyan10/project/qemu/build/qemu-system-x86_64 -m 10240 -enable-kvm -cpu host -s \
    -kernel /home/xiyanxiyan10/project/linux-source-6.2.0/arch/x86/boot/bzImage \
    -hda ./rootfs.img -nographic -append "root=/dev/sda rw console=ttyS0 nokaslr" \
    -qmp tcp:127.0.0.1:4444,server,nowait
Use the qmp-shell script shipped with QEMU to open an interactive HMP shell and query the guest's memory layout:
xiyanxiyan10@xiyanxiyan10:~/project/qemu$ scripts/qmp/qmp-shell -H localhost:4444
Welcome to the HMP shell!
Connected to QEMU 8.2.50
(QEMU) info mtree
address-space: VGA
0000000000000000-ffffffffffffffff (prio 0, i/o): bus master container
...
QEMU memory data organization
MemoryRegion
As the name suggests, a MemoryRegion describes a region of memory. Its two most important members are addr and size, the start address and length of the region.
From the memory model's point of view, the key property of MemoryRegion is that regions form a tree.
A simple diagram:
struct MemoryRegion
+------------------------+
|name |
| (const char *) |
+------------------------+
|addr |
| (hwaddr) |
|size |
| (Int128) |
+------------------------+
|subregions |
| QTAILQ_HEAD() |
+------------------------+
|
|
----+-------------------+---------------------+----
| |
| |
| |
struct MemoryRegion struct MemoryRegion
+------------------------+ +------------------------+
|name | |name |
| (const char *) | | (const char *) |
+------------------------+ +------------------------+
|addr | |addr |
| (hwaddr) | | (hwaddr) |
|size | |size |
| (Int128) | | (Int128) |
+------------------------+ +------------------------+
|subregions | |subregions |
| QTAILQ_HEAD() | | QTAILQ_HEAD() |
+------------------------+ +------------------------+
Now let's see what a real MemoryRegion tree looks like.
Each address-space points to a root MemoryRegion, and under that root hangs a tree of MemoryRegion nodes.
(QEMU) info mtree
address-space: VGA
0000000000000000-ffffffffffffffff (prio 0, i/o): bus master container
address-space: piix3-ide
0000000000000000-ffffffffffffffff (prio 0, i/o): bus master container
0000000000000000-ffffffffffffffff (prio 0, i/o): alias bus master @system 0000000000000000-ffffffffffffffff
address-space: e1000
0000000000000000-ffffffffffffffff (prio 0, i/o): bus master container
address-space: cpu-memory-0
address-space: memory
0000000000000000-ffffffffffffffff (prio 0, i/o): system
0000000000000000-00000000bfffffff (prio 0, ram): alias ram-below-4g @pc.ram 0000000000000000-00000000bfffffff
0000000000000000-ffffffffffffffff (prio -1, i/o): pci
00000000000a0000-00000000000bffff (prio 1, i/o): vga-lowmem
00000000000c0000-00000000000dffff (prio 1, rom): pc.rom
00000000000e0000-00000000000fffff (prio 1, rom): alias isa-bios @pc.bios 0000000000020000-000000000003ffff
00000000fd000000-00000000fdffffff (prio 1, ram): vga.vram
00000000febc0000-00000000febdffff (prio 1, i/o): e1000-mmio
00000000febf0000-00000000febf0fff (prio 1, i/o): vga.mmio
00000000febf0000-00000000febf017f (prio 0, i/o): edid
00000000febf0400-00000000febf041f (prio 0, i/o): vga ioports remapped
00000000febf0500-00000000febf0515 (prio 0, i/o): bochs dispi interface
00000000febf0600-00000000febf0607 (prio 0, i/o): qemu extended regs
00000000fffc0000-00000000ffffffff (prio 0, rom): pc.bios
00000000000a0000-00000000000bffff (prio 1, i/o): alias smram-region @pci 00000000000a0000-00000000000bffff
00000000000c0000-00000000000c3fff (prio 1, ram): alias pam-rom @pc.ram 00000000000c0000-00000000000c3fff
00000000000c4000-00000000000c7fff (prio 1, ram): alias pam-rom @pc.ram 00000000000c4000-00000000000c7fff
00000000000c8000-00000000000cbfff (prio 1, ram): alias pam-rom @pc.ram 00000000000c8000-00000000000cbfff
00000000000cb000-00000000000cdfff (prio 1000, ram): alias kvmvapic-rom @pc.ram 00000000000cb000-00000000000cdfff
00000000000cc000-00000000000cffff (prio 1, ram): alias pam-rom @pc.ram 00000000000cc000-00000000000cffff
00000000000d0000-00000000000d3fff (prio 1, ram): alias pam-rom @pc.ram 00000000000d0000-00000000000d3fff
00000000000d4000-00000000000d7fff (prio 1, ram): alias pam-rom @pc.ram 00000000000d4000-00000000000d7fff
00000000000d8000-00000000000dbfff (prio 1, ram): alias pam-rom @pc.ram 00000000000d8000-00000000000dbfff
00000000000dc000-00000000000dffff (prio 1, ram): alias pam-rom @pc.ram 00000000000dc000-00000000000dffff
00000000000e0000-00000000000e3fff (prio 1, ram): alias pam-rom @pc.ram 00000000000e0000-00000000000e3fff
00000000000e4000-00000000000e7fff (prio 1, ram): alias pam-ram @pc.ram 00000000000e4000-00000000000e7fff
00000000000e8000-00000000000ebfff (prio 1, ram): alias pam-ram @pc.ram 00000000000e8000-00000000000ebfff
00000000000ec000-00000000000effff (prio 1, ram): alias pam-ram @pc.ram 00000000000ec000-00000000000effff
00000000000f0000-00000000000fffff (prio 1, ram): alias pam-rom @pc.ram 00000000000f0000-00000000000fffff
00000000fec00000-00000000fec00fff (prio 0, i/o): kvm-ioapic
00000000fed00000-00000000fed003ff (prio 0, i/o): hpet
00000000fee00000-00000000feefffff (prio 4096, i/o): kvm-apic-msi
0000000100000000-00000002bfffffff (prio 0, ram): alias ram-above-4g @pc.ram 00000000c0000000-000000027fffffff
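Trees like the one above are assembled with the MemoryRegion API. Below is a minimal, hedged sketch of wiring a container and a child into the system memory tree; the names my-container/my-ram and the addresses are made up, only the calls are real QEMU API:
MemoryRegion *sysmem = get_system_memory();   /* root MemoryRegion of the "memory" address space */
MemoryRegion *container = g_new(MemoryRegion, 1);
MemoryRegion *ram = g_new(MemoryRegion, 1);

/* a pure container region: it has no storage of its own */
memory_region_init(container, NULL, "my-container", 0x10000);

/* a RAM-backed child placed at offset 0 inside the container */
memory_region_init_ram(ram, NULL, "my-ram", 0x8000, &error_fatal);
memory_region_add_subregion(container, 0x0, ram);

/* hook the container into the system memory tree at a guest physical
 * address; overlaps are resolved by priority, which is the "prio"
 * value shown in the mtree output */
memory_region_add_subregion_overlap(sysmem, 0xfeb00000, container, 1);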
FlatView/FlatRange
A FlatView is, literally, a flat view. A flat view of what? Of the MemoryRegion tree, of course. As we just saw, MemoryRegions form a tall and stately tree, but when the memory actually has to be used, a flattened layout is far more convenient.
As before, let's take a look at the shape of this data structure.
FlatView (An array of FlatRange)
+----------------------+
|nr |
|nr_allocated |
| (unsigned) | FlatRange FlatRange
+----------------------+
|ranges | ------> +---------------------+---------------------+
| (FlatRange *) | |offset_in_region |offset_in_region |
+----------------------+ | | |
+---------------------+---------------------+
|addr(AddrRange) |addr(AddrRange) |
| +----------------| +----------------+
| |start (Int128) | |start (Int128) |
| |size (Int128) | |size (Int128) |
+----+----------------+----+----------------+
|mr |mr |
| (MemoryRegion *) | (MemoryRegion *) |
+---------------------+---------------------+
AddressSpace
Next, let's see how these pieces relate to each other.
AddressSpace
+-------------------------+
|name |
| (char *) | FlatView (An array of FlatRange)
+-------------------------+ +----------------------+
|current_map | -------->|nr |
| (FlatView *) | |nr_allocated |
+-------------------------+ | (unsigned) | FlatRange FlatRange
| | +----------------------+
| | |ranges | ------> +---------------------+---------------------+
| | | (FlatRange *) | |offset_in_region |offset_in_region |
| | +----------------------+ | | |
| | +---------------------+---------------------+
| | |addr(AddrRange) |addr(AddrRange) |
| | | +----------------| +----------------+
| | | |start (Int128) | |start (Int128) |
| | | |size (Int128) | |size (Int128) |
| | +----+----------------+----+----------------+
| | |mr |mr |
| | | (MemoryRegion *) | (MemoryRegion *) |
| | +---------------------+---------------------+
| |
| |
| |
| | MemoryRegion(system_memory/system_io)
+-------------------------+ +----------------------+
|root | | | root of a MemoryRegion
| (MemoryRegion *) | -------->| | tree
+-------------------------+ +----------------------+
RAMBlock
The RAMBlock structure describes the host memory that backs the guest's RAM; QEMU keeps all RAMBlocks on a linked list.
ram_list (RAMList)
+------------------------------+
|dirty_memory[] |
| (unsigned long *) |
+------------------------------+
|blocks |
| QLIST_HEAD |
+------------------------------+
|
| RAMBlock RAMBlock
| +---------------------------+ +---------------------------+
+---> |next | -------------> |next |
| QLIST_ENTRY(RAMBlock) | | QLIST_ENTRY(RAMBlock) |
+---------------------------+ +---------------------------+
|offset | |offset |
|used_length | |used_length |
|max_length | |max_length |
| (ram_addr_t) | | (ram_addr_t) |
+---------------------------+ +---------------------------+
The GPA -> HVA mapping goes from MemoryRegion->addr to RAMBlock->host.
The link between a MemoryRegion and its RAMBlock therefore establishes the mapping between guest memory and host virtual addresses.
RAMBlock RAMBlock
+---------------------------+ +---------------------------+
|next | -----------------------------> |next |
| QLIST_ENTRY(RAMBlock) | | QLIST_ENTRY(RAMBlock) |
+---------------------------+ +---------------------------+
|offset | |offset |
|used_length | |used_length |
|max_length | |max_length |
| (ram_addr_t) | | (ram_addr_t) |
+---------------------------+ +---------------------------+
|host | virtual address of a ram |host |
| (uint8_t *) | in host (mmap) | (uint8_t *) |
+---------------------------+ +---------------------------+
|mr | |mr |
| (struct MemoryRegion *)| | (struct MemoryRegion *)|
+---------------------------+ +---------------------------+
| |
| |
| |
| struct MemoryRegion | struct MemoryRegion
+-->+------------------------+ +-->+------------------------+
|name | |name |
| (const char *) | | (const char *) |
+------------------------+ +------------------------+
|addr | physical address in guest |addr |
| (hwaddr) | (offset in RAMBlock) | (hwaddr) |
|size | |size |
| (Int128) | | (Int128) |
+------------------------+ +------------------------+
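Conceptually, translating a guest physical address that lands in a RAM-backed region boils down to the sketch below. This is only an illustration of the fields drawn above; real QEMU goes through address_space_translate() and memory_region_get_ram_ptr():
/* Illustration only: GPA -> HVA for a RAM-backed MemoryRegion. */
static uint8_t *gpa_to_hva(MemoryRegion *mr, hwaddr gpa)
{
    hwaddr offset = gpa - mr->addr;        /* offset inside the region         */
    return mr->ram_block->host + offset;   /* host pointer obtained via mmap() */
}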
MemoryListener
For EPT to work, the guest's memory layout must be reported to KVM, and every subsequent change must be pushed to KVM as well. This is done through MemoryListener, outlined below.
MemoryListener
+---------------------------+
|begin |
|commit |
+---------------------------+
|region_add |
|region_del |
+---------------------------+
|eventfd_add |
|eventfd_del |
+---------------------------+
|log_start |
|log_stop |
+---------------------------+
Each AddressSpace carries a set of MemoryListeners interested in changes to that AddressSpace. Whenever the AddressSpace is updated, every listener attached to it is invoked, including the KVM listener (kvm_region_add, kvm_region_del) that reports the guest's memory layout to KVM; further listeners can be added for anything else that needs to track memory changes.
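For instance, a hypothetical listener that merely logs added regions could be attached to the system address space as sketched below; the KVM listener is registered through the same memory_listener_register() call:
/* Sketch of a custom listener inside QEMU (names are made up). */
static void my_region_add(MemoryListener *listener,
                          MemoryRegionSection *section)
{
    /* invoked for every section that becomes visible in the address space */
    qemu_log("region add: gpa 0x%" PRIx64 " size 0x%" PRIx64 "\n",
             section->offset_within_address_space,
             int128_get64(section->size));
}

static MemoryListener my_listener = {
    .name       = "my-listener",
    .region_add = my_region_add,
    .priority   = 10,
};

/* attach it to an AddressSpace, e.g. the global address_space_memory */
memory_listener_register(&my_listener, &address_space_memory);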
Data structures
/**
* struct AddressSpace: describes a mapping of addresses to #MemoryRegion objects
*/
struct AddressSpace {
/* private: */
struct rcu_head rcu;
char *name;
// points to the root of the MemoryRegion tree
MemoryRegion *root;
// the MemoryRegion tree flattened into a linear view
/* Accessed via RCU. */
struct FlatView *current_map;
int ioeventfd_nb;
int ioeventfd_notifiers;
struct MemoryRegionIoeventfd *ioeventfds;
// list of listeners called back when the memory map changes
QTAILQ_HEAD(, MemoryListener) listeners;
// AddressSpaces themselves are chained together on a list
QTAILQ_ENTRY(AddressSpace) address_spaces_link;
};
Depending on which fields are filled in, the common kinds of MemoryRegion are:
- RAM: a chunk of host virtual memory actually allocated to serve as the guest's physical memory.
- MMIO: a range of guest memory with no backing host memory; accesses to it are trapped and dispatched to read/write callbacks, which is how device emulation works.
- ROM: like RAM, except the region is read-only and cannot be written.
- ROM device: reads behave like RAM (served directly), writes behave like MMIO (the write callback is invoked).
- container: holds several MemoryRegions, each at its own offset inside the container. Containers are mainly used to combine multiple MemoryRegions into one; the PCI MemoryRegion, for example, contains both RAM and MMIO. Normally the regions inside a container do not overlap, though there are exceptions.
- alias: a window onto part of another region, which lets a single region be split into several non-contiguous pieces.
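The ram-below-4g / ram-above-4g entries in the mtree output above are exactly such aliases of pc.ram. A hedged sketch of how they could be created (the sizes match the 10 GiB guest above, but this is illustrative rather than the verbatim pc_memory_init() code):
MemoryRegion *sysmem = get_system_memory();
MemoryRegion *pc_ram   = g_new(MemoryRegion, 1);
MemoryRegion *below_4g = g_new(MemoryRegion, 1);
MemoryRegion *above_4g = g_new(MemoryRegion, 1);

memory_region_init_ram(pc_ram, NULL, "pc.ram", 10 * GiB, &error_fatal);

/* alias of the first 3 GiB of pc.ram, mapped at GPA 0 */
memory_region_init_alias(below_4g, NULL, "ram-below-4g", pc_ram,
                         0, 3 * GiB);
memory_region_add_subregion(sysmem, 0, below_4g);

/* alias of the remaining 7 GiB of pc.ram, mapped starting at GPA 4 GiB */
memory_region_init_alias(above_4g, NULL, "ram-above-4g", pc_ram,
                         3 * GiB, 7 * GiB);
memory_region_add_subregion(sysmem, 4 * GiB, above_4g);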
/** MemoryRegion:
*
* A struct representing a memory region.
*/
struct MemoryRegion {
Object parent_obj;
/* private: */
/* The following fields should fit in a cache line */
bool romd_mode;
bool ram;
bool subpage;
bool readonly; /* For RAM regions */
bool nonvolatile;
bool rom_device;
bool flush_coalesced_mmio;
bool unmergeable;
uint8_t dirty_log_mask;
bool is_iommu;
// ram_block is the actually allocated backing memory, i.e. host virtual
// memory, stored as a RAMBlock
RAMBlock *ram_block;
Object *owner;
/* owner as TYPE_DEVICE. Used for re-entrancy checks in MR access hotpath */
DeviceState *dev;
// ops is a set of callbacks invoked when the MemoryRegion is accessed, e.g. for MMIO read/write requests
const MemoryRegionOps *ops;
void *opaque;
MemoryRegion *container;
int mapped_via_alias; /* Mapped via an alias, container might be NULL */
Int128 size;
// addr is the guest physical address at which this MemoryRegion starts
hwaddr addr;
void (*destructor)(MemoryRegion *mr);
uint64_t align;
bool terminates;
bool ram_device;
bool enabled;
bool warning_printed; /* For reservations */
uint8_t vga_logging_count;
MemoryRegion *alias;
hwaddr alias_offset;
// priority decides which region wins when regions overlap
int32_t priority;
// subregions links the child MemoryRegions belonging to this region
QTAILQ_HEAD(, MemoryRegion) subregions;
// subregions_link chains siblings under the same parent MemoryRegion
QTAILQ_ENTRY(MemoryRegion) subregions_link;
QTAILQ_HEAD(, CoalescedMemoryRange) coalesced;
const char *name;
unsigned ioeventfd_nb;
MemoryRegionIoeventfd *ioeventfds;
RamDiscardManager *rdm; /* Only for RAM */
/* For devices designed to perform re-entrant IO into their own IO MRs */
bool disable_reentrancy_guard;
};
/*
* Memory region callbacks
*/
struct MemoryRegionOps {
/* Read from the memory region. @addr is relative to @mr; @size is
* in bytes. */
uint64_t (*read)(void *opaque,
hwaddr addr,
unsigned size);
/* Write to the memory region. @addr is relative to @mr; @size is
* in bytes. */
void (*write)(void *opaque,
hwaddr addr,
uint64_t data,
unsigned size);
MemTxResult (*read_with_attrs)(void *opaque,
hwaddr addr,
uint64_t *data,
unsigned size,
MemTxAttrs attrs);
MemTxResult (*write_with_attrs)(void *opaque,
hwaddr addr,
uint64_t data,
unsigned size,
MemTxAttrs attrs);
//...
};
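For an MMIO region, a device supplies these callbacks and binds them to a MemoryRegion with memory_region_init_io(). A minimal sketch with an invented device (the state pointer s and the register behavior are hypothetical):
static uint64_t mydev_read(void *opaque, hwaddr addr, unsigned size)
{
    /* addr is the offset of the access inside the MMIO region */
    return 0x12345678;              /* value the guest will read */
}

static void mydev_write(void *opaque, hwaddr addr,
                        uint64_t data, unsigned size)
{
    /* react to the guest's store, e.g. latch a device register */
}

static const MemoryRegionOps mydev_ops = {
    .read       = mydev_read,
    .write      = mydev_write,
    .endianness = DEVICE_LITTLE_ENDIAN,
};

/* in the device's realize function: a 4 KB MMIO window backed by the ops */
memory_region_init_io(&s->iomem, OBJECT(s), &mydev_ops, s,
                      "mydev-mmio", 0x1000);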
How QEMU allocates guest memory
The guest's physical memory is backed by host virtual memory.
Allocating guest memory in QEMU is therefore just allocating host virtual memory,
which happens when the RAMBlock structure is initialized.
pc_memory_init()
memory_region_allocate_system_memory()
allocate_system_memory_nonnuma()
memory_region_init_ram_nomigrate()
memory_region_init_ram_shared_nomigrate()
{
mr->ram = true;
mr->destructor = memory_region_destructor_ram;
// allocate the backing RAMBlock
mr->ram_block = qemu_ram_alloc(size, share, mr, errp);
}
RAMBlock *qemu_ram_alloc(ram_addr_t size, uint32_t ram_flags,
MemoryRegion *mr, Error **errp)
{
assert((ram_flags & ~(RAM_SHARED | RAM_NORESERVE)) == 0);
return qemu_ram_alloc_internal(size, size, NULL, NULL, ram_flags, mr, errp);
}
static
RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
void (*resized)(const char*,
uint64_t length,
void *host),
void *host, uint32_t ram_flags,
MemoryRegion *mr, Error **errp)
{
// ...
size = HOST_PAGE_ALIGN(size);
max_size = HOST_PAGE_ALIGN(max_size);
new_block = g_malloc0(sizeof(*new_block));
new_block->mr = mr;
new_block->resized = resized;
//...
ram_block_add(new_block, &local_err);
//...
return new_block;
}
static void ram_block_add(RAMBlock *new_block, Error **errp)
{
//...
if (!new_block->host) {
if (xen_enabled()) {
//...
} else {
new_block->host = qemu_anon_ram_alloc(new_block->max_length,
&new_block->mr->align,
shared, noreserve);
//...
}
}
//...
}
// the POSIX allocation path
/* alloc shared memory pages */
void *qemu_anon_ram_alloc(size_t size, uint64_t *alignment, bool shared,
bool noreserve)
{
const uint32_t qemu_map_flags = (shared ? QEMU_MAP_SHARED : 0) |
(noreserve ? QEMU_MAP_NORESERVE : 0);
size_t align = QEMU_VMALLOC_ALIGN;
void *ptr = qemu_ram_mmap(-1, size, align, qemu_map_flags, 0);
//...
return ptr;
}
void *qemu_ram_mmap(int fd,
size_t size,
size_t align,
uint32_t qemu_map_flags,
off_t map_offset)
{
//...
ptr = mmap_activate(guardptr + offset, size, fd, qemu_map_flags,
map_offset);
//...
return ptr;
}
static void *mmap_activate(void *ptr, size_t size, int fd,
uint32_t qemu_map_flags, off_t map_offset)
{
// ...
// ultimately the memory pages come from mmap()
activated_ptr = mmap(ptr, size, prot, flags | map_sync_flags, fd,
map_offset);
// ...
return activated_ptr;
}
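Stripped of QEMU's alignment and guard-page handling, the backing memory therefore ends up as a plain anonymous mapping, roughly equivalent to the standalone sketch below (an illustration, not QEMU code):
#include <stddef.h>
#include <sys/mman.h>

/* Guest RAM backing in a nutshell: anonymous host virtual memory from
 * mmap(). QEMU additionally handles alignment, guard pages and the
 * MAP_SHARED / MAP_NORESERVE variants. */
static void *alloc_guest_ram(size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}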
Building the memory dispatch table
Memory dispatch in QEMU means: given an AddressSpace and an address, quickly find the MemoryRegionSection that contains it, and through it the MemoryRegion. The data structure behind this is AddressSpaceDispatch, which records the dispatch information for an AddressSpace (in current QEMU it is reached through the AddressSpace's FlatView).
In short:
- phys_map plays the role of CR3;
- nodes is the page table itself, stored as an array of Node entries (each Node is an array of 512 PhysPageEntry);
- sections is the array that the leaf ptr entries in nodes point into, and each section carries a MemoryRegion.
When looking up a MemoryRegion, phys_map and nodes are walked together to quickly locate the matching section in sections, which then yields the associated MemoryRegion.
AddressSpaceDispatch
+-------------------------+
|as |
| (AddressSpace *) |
+-------------------------+
|mru_section |
| (MemoryRegionSection*)|
| |
| |
| |
| |
| |
+-------------------------+
|map(PhysPageMap) | MemoryRegionSection[]
| +---------------------+ +---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+
| |sections |-------->|mr = io_mem_unassigned |mr = io_mem_notdirty |mr = io_mem_rom |mr = io_mem_watch |mr = one mr in tree |mr = subpage_t->iomem |
| | MemoryRegionSection*| | (MemoryRegion *) | (MemoryRegion *) | (MemoryRegion *) | (MemoryRegion *) | (MemoryRegion *) | (MemoryRegion *) |
| | | | | | | | | |
| +---------------------+ +---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+
| |sections_nb | |fv |fv |fv |fv |fv |fv |
| |sections_nb_alloc | | (FlatView *) | (FlatView *) | (FlatView *) | (FlatView *) | (FlatView *) | (FlatView *) |
| | (unsigned) | +---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+
| +---------------------+ |size (Int128) |size (Int128) |size (Int128) |size (Int128) |size (Int128) |size (Int128) |
| | | +---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+
| | | |offset_within_region |offset_within_region |offset_within_region |offset_within_region |offset_within_region |offset_within_region |
| | | | (hwaddr) | (hwaddr) | (hwaddr) | (hwaddr) | (hwaddr) | (hwaddr) |
| | | |offset_within_address_space|offset_within_address_space|offset_within_address_space|offset_within_address_space|offset_within_address_space|offset_within_address_space|
| | | | (hwaddr) GPA | (hwaddr) GPA | (hwaddr) GPA | (hwaddr) GPA | (hwaddr) | (hwaddr) |
| | | +---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+
| | | ^
| | | nodes[1] |
| | | +---->+------------------+ |
| | | | |u32 skip:6 | = 0 |
| | | | |u32 ptr:26 | = 4 -----------------------------------------+
| | | P_L2_LEVELS = 6 | +------------------+
| | | nodes[0] = PhysPageEntry[P_L2_SIZE = 2^9] | | |
| +---------------------+ +------------------+ | | ... |
| |nodes | ------->|u32 skip:6 | = 1 | | |
| | (Node *) | |u32 ptr:26 | = 1 -------------+ +------------------+
| +---------------------+ +------------------+ |u32 skip:6 | = 0
| |nodes_nb | | | |u32 ptr:26 | = PHYS_SECTION_UNASSIGNED
| |nodes_nb_alloc | | ... | +------------------+
| | (unsigned) | | |
| +---------------------+ +------------------+
| | | |u32 skip:6 | = 1
| | | |u32 ptr:26 | = PHYS_MAP_NODE_NIL nodes[2]
| | | +------------------+ +---->+------------------+
| | | |u32 skip:6 | = 1 | |u32 skip:6 |
| | | |u32 ptr:26 | = 2 -------------+ |u32 ptr:26 |
| | | +------------------+ +------------------+
| | | ^ | |
| | | | | ... |
| | | | | |
+---+---------------------+ | +------------------+
|phys_map(PhysPageEntry) | | |u32 skip:6 |
| +---------------------+ | |u32 ptr:26 |
| |u32 skip:6 | = 1 | +------------------+
| |u32 ptr:26 | = 0 --------+
+---+---------------------+
The corresponding data structures:
typedef PhysPageEntry Node[P_L2_SIZE];
typedef struct PhysPageMap {
struct rcu_head rcu;
unsigned sections_nb;
unsigned sections_nb_alloc;
unsigned nodes_nb;
unsigned nodes_nb_alloc;
Node *nodes;
MemoryRegionSection *sections;
} PhysPageMap;
struct AddressSpaceDispatch {
MemoryRegionSection *mru_section;
/* This is a multi-level map on the physical address space.
* The bottom level has pointers to MemoryRegionSections.
*/
PhysPageEntry phys_map;
PhysPageMap map;
};
struct MemoryRegionSection {
Int128 size;
MemoryRegion *mr;
FlatView *fv;
hwaddr offset_within_region;
hwaddr offset_within_address_space;
bool readonly;
bool nonvolatile;
bool unmergeable;
};
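Putting phys_map, nodes and sections together, a lookup walks the levels roughly as in the sketch below, modelled on phys_page_find() in QEMU's physmem.c (simplified: the real code also verifies that the found section actually covers the address and handles compressed leaves):
/* Simplified multi-level walk: phys_map -> nodes[] -> sections[]. */
static MemoryRegionSection *lookup_section(AddressSpaceDispatch *d, hwaddr addr)
{
    PhysPageEntry lp = d->phys_map;              /* plays the role of CR3 */
    Node *nodes = d->map.nodes;
    hwaddr index = addr >> TARGET_PAGE_BITS;     /* page frame number     */
    int i;

    /* walk down; a leaf entry has skip == 0 */
    for (i = P_L2_LEVELS; lp.skip && (i -= lp.skip) >= 0; ) {
        if (lp.ptr == PHYS_MAP_NODE_NIL) {
            return &d->map.sections[PHYS_SECTION_UNASSIGNED];
        }
        /* pick the next-level table, then the 9-bit slot for this level */
        lp = nodes[lp.ptr][(index >> (i * P_L2_BITS)) & (P_L2_SIZE - 1)];
    }
    /* at a leaf, ptr is an index into the sections[] array */
    return &d->map.sections[lp.ptr];
}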
Committing memory to KVM
In QEMU, memory_region_transaction_commit is the key function for committing a memory transaction: it ensures that changes made at the MemoryRegion level are correctly synchronized to the underlying memory subsystem and to hardware acceleration such as KVM. It is called whenever the memory map changes, typically in the following situations.
When the memory layout is initialized
When the VM starts, QEMU builds the guest's memory layout, which includes:
- setting up the RAM regions;
- registering device memory (e.g. MMIO regions).
Once the region changes are complete, memory_region_transaction_commit is called to commit them.
When memory regions are added or removed
- When memory regions are added or removed dynamically (e.g. memory or device hotplug), QEMU first modifies the memory map and then calls memory_region_transaction_commit to commit the update.
When the guest address space is adjusted
- When a guest address space (such as the PCI address space) changes, the region mappings must be updated.
- These adjustments are usually triggered by the device model, which calls memory_region_transaction_commit once the device's address space has been reprogrammed.
On snapshot restore or migration
- During snapshot restore or live migration the memory map has to be rebuilt.
- QEMU calls memory_region_transaction_commit to bring the new layout into effect.
When access permissions or attributes change
- Changes to a region's access permissions (e.g. read/write) or attributes (e.g. caching policy) are likewise committed through the transaction mechanism.
Some of the functions in the source that call memory_region_transaction_commit:
static void memory_region_finalize(Object *obj)
void memory_region_set_log(MemoryRegion *mr, bool log, unsigned client)
void memory_region_set_dirty(MemoryRegion *mr, hwaddr addr,
hwaddr size)
void memory_region_set_readonly(MemoryRegion *mr, bool readonly)
void memory_region_del_eventfd(MemoryRegion *mr,
hwaddr addr,
unsigned size,
bool match_data,
uint64_t data,
EventNotifier *e)
...
The main flow of memory_region_transaction_commit is:
memory_region_transaction_commit()            // update topology or ioeventfds
    flatviews_reset()
        flatviews_init()
            flat_views = g_hash_table_new_full()
            empty_view = generate_memory_topology(NULL);
        generate_memory_topology()
    MEMORY_LISTENER_CALL_GLOBAL(begin, Forward)
    address_space_set_flatview()
        address_space_update_topology_pass(false)
        address_space_update_topology_pass(true)
    address_space_update_ioeventfds()
        address_space_add_del_ioeventfds()
    MEMORY_LISTENER_CALL_GLOBAL(commit, Forward)
- flatviews_reset: rebuild the FlatViews of all AddressSpaces
- MEMORY_LISTENER_CALL_GLOBAL(begin, Forward)
- address_space_set_flatview: add or delete regions according to the changes
- address_space_update_ioeventfds: add or delete eventfds according to the changes
- MEMORY_LISTENER_CALL_GLOBAL(commit, Forward)
The first steps update QEMU's own memory-map structures; the final step pushes the updated memory map to KVM through the listener callbacks.
For QEMU running with KVM acceleration, the listener commit callbacks are:
kml->listener.region_add = kvm_region_add;
kml->listener.region_del = kvm_region_del;
Taking region_add as an example, the call chain is:
kvm_region_add
    kvm_set_phys_mem
        kvm_set_user_memory_region
            kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem);
kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem) is what hands the memory map that QEMU has built over to KVM.
Look at kvm_set_user_memory_region: it packs a KVMSlot into a struct kvm_userspace_memory_region and submits the memory information to KVM.
The KVMSlot *slot parameter is produced by the callers, which convert a MemoryRegionSection into a KVMSlot so that it matches the format KVM expects.
static int kvm_set_user_memory_region(KVMMemoryListener *kml, KVMSlot *slot, bool new)
{
KVMState *s = kvm_state;
struct kvm_userspace_memory_region mem;
int ret;
mem.slot = slot->slot | (kml->as_id << 16);
mem.guest_phys_addr = slot->start_addr;
mem.userspace_addr = (unsigned long)slot->ram;
mem.flags = slot->flags;
//...
mem.memory_size = slot->memory_size;
ret = kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem);
slot->old_flags = mem.flags;
//...
return ret;
}
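Outside of QEMU the same ioctl can be exercised directly. The hedged sketch below (plain KVM userspace, not QEMU code; error handling trimmed) registers one RAM slot, which is exactly the GPA-to-HVA information KVM will later use when building EPT entries:
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <stddef.h>

static int register_guest_ram(int vm_fd, unsigned long long gpa, size_t size)
{
    /* host virtual memory that will back the guest-physical range */
    void *hva = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (hva == MAP_FAILED) {
        return -1;
    }

    struct kvm_userspace_memory_region mem = {
        .slot            = 0,                    /* slot id, unique per VM       */
        .flags           = 0,                    /* e.g. KVM_MEM_LOG_DIRTY_PAGES */
        .guest_phys_addr = gpa,                  /* GPA as seen by the guest     */
        .memory_size     = size,
        .userspace_addr  = (unsigned long)hva,   /* backing HVA                  */
    };
    /* This GPA -> HVA mapping is what KVM later consults when it builds
     * EPT/NPT entries on a guest page fault. */
    return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &mem);
}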
KVM builds the EPT
After a VCPU is created, its initialization calls kvm_mmu_setup to set up the MMU; the related call is init_kvm_mmu.
static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu,
union kvm_cpu_role cpu_role)
{
// ...
context->root_role.word = root_role.word;
// kvm_tdp_page_fault handles EPT page-access faults; the remaining fields are then set according to the VCPU's mode
context->page_fault = kvm_tdp_page_fault;
//...
}
int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
// ...
return direct_page_fault(vcpu, fault);
}
It then calls direct_page_fault, whose main tasks are to:
- locate the faulting guest address range (GPA);
- use the kvm_memory_slot to find the corresponding host physical memory (HPA);
- update the second-level page tables (EPT/NPT) to install the mapping.
static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
bool is_tdp_mmu_fault = is_tdp_mmu(vcpu->arch.mmu);
unsigned long mmu_seq;
int r;
// guest frame number (gfn) of the faulting address
fault->gfn = fault->addr >> PAGE_SHIFT;
// look up the host-side slot for this gfn, i.e. the guest-to-host address
// information that QEMU registered with KVM
fault->slot = kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn);
// ...
if (is_tdp_mmu_fault) {
r = kvm_tdp_mmu_map(vcpu, fault);
} else {
r = make_mmu_pages_available(vcpu);
if (r)
goto out_unlock;
// build the EPT or NPT mapping, depending on the CPU architecture
r = __direct_map(vcpu, fault);
}
//...
return r;
}
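The memslot found here is precisely what QEMU registered through KVM_SET_USER_MEMORY_REGION, so resolving the fault conceptually reduces to the sketch below (not the kernel code; the real path goes through __gfn_to_hva_memslot() and the host MMU to pin a pfn):
/* Conceptual only: how a memslot maps a guest frame number to the host
 * virtual address backing it. */
static unsigned long gfn_to_hva_sketch(struct kvm_memory_slot *slot, gfn_t gfn)
{
    /* base_gfn and userspace_addr were supplied by QEMU in
     * KVM_SET_USER_MEMORY_REGION (guest_phys_addr / userspace_addr) */
    return slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE;
}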
MMIO handling
Debugging
Debugging QEMU as it submits memory information to KVM.
As the call stack below shows, kvm_region_commit fires during VM startup and eventually submits the assembled memory layout to KVM:
(gdb) bt
#0 kvm_set_user_memory_region (slot=0x7ffff4c79010, new=new@entry=true, kml=<optimized out>) at ../accel/kvm/kvm-all.c:282
#1 0x0000555555cba6f2 in kvm_set_phys_mem (kml=kml@entry=0x555556dd0040, section=section@entry=0x555557406ec0, add=<optimized out>, add@entry=true) at ../accel/kvm/kvm-all.c:1365
#2 0x0000555555cbab4c in kvm_region_commit (listener=0x555556dd0040) at ../accel/kvm/kvm-all.c:1574
#3 0x0000555555c56bae in memory_region_transaction_commit () at ../system/memory.c:1137
#4 memory_region_transaction_commit () at ../system/memory.c:1117
#5 0x0000555555b7a525 in pc_memory_init
(pcms=pcms@entry=0x55555700fc80, system_memory=system_memory@entry=0x555557019600, rom_memory=rom_memory@entry=0x555556dc8dc0, pci_hole64_size=pci_hole64_size@entry=2147483648)
at ../hw/i386/pc.c:961
#6 0x0000555555b604d3 in pc_init1 (machine=0x55555700fc80, pci_type=0x555555ef9ac7 "i440FX", host_type=0x555555ef9ae6 "i440FX-pcihost") at ../hw/i386/pc_piix.c:243
#7 0x0000555555908201 in machine_run_board_init (machine=0x55555700fc80, mem_path=<optimized out>, errp=<optimized out>, errp@entry=0x555556d3f378 <error_fatal>)
at ../hw/core/machine.c:1541
#8 0x0000555555abe1c6 in qemu_init_board () at ../system/vl.c:2614
Debugging KVM's page-table construction: when the guest takes a page fault, the TDP (EPT/NPT) page-table construction functions are hit.
kvm_tdp_mmu_map
Active operation:
- Typically called when guest memory is configured or updated, for example when mappings are initialized or changed.
- Can be invoked directly to establish mappings for a given range of guest addresses.
Trigger condition:
- Does not depend on guest behavior; it is invoked by the host (KVM) itself.
Typical scenarios:
- When a kvm_memory_slot is created or adjusted.
- When mappings are pre-established before the guest runs, to improve performance.
Inputs and responsibilities:
- Inputs include the GPA range, the starting HPA, the mapping permissions, and so on.
- Updates the TDP page tables directly, possibly mapping many pages in one batch.
direct_page_fault
Passive operation:
- Handles TDP page faults raised while the guest is running.
- Triggered by hardware when the guest accesses an unmapped address, or one it lacks permission for.
Trigger condition:
- Depends on guest behavior.
- The hardware detects a TDP fault or permission violation, the VM exits to the host, and the fault handler runs.
Typical scenarios:
- The guest accesses a GPA that has not yet been mapped.
- The guest attempts an operation without sufficient permission, e.g. writing to a read-only page.
Inputs and responsibilities:
- Inputs include the faulting GPA and the fault information (read/write/execute).
- Locates the kvm_memory_slot for the GPA, computes the HPA, updates the TDP page tables, and resumes the guest.
Breakpoint 3, kvm_tdp_mmu_map (vcpu=vcpu@entry=0xffff88810a168000, fault=fault@entry=0xffffc90000b83bd0) at arch/x86/kvm/mmu/tdp_mmu.c:1159
1159 {
(gdb) c
Continuing.
Breakpoint 2, direct_page_fault (vcpu=vcpu@entry=0xffff88810a168000, fault=fault@entry=0xffffc90000b83bd0) at arch/x86/kvm/mmu/mmu.c:4267
4267 {
(gdb) bt
#0 direct_page_fault (vcpu=vcpu@entry=0xffff88810a168000, fault=fault@entry=0xffffc90000b83bd0) at arch/x86/kvm/mmu/mmu.c:4267
#1 0xffffffffa027b3ad in kvm_tdp_page_fault (vcpu=vcpu@entry=0xffff88810a168000, fault=fault@entry=0xffffc90000b83bd0) at arch/x86/kvm/mmu/mmu.c:4393
#2 0xffffffffa027b6bd in kvm_mmu_do_page_fault (prefetch=false, err=4, cr2_or_gpa=1008168, vcpu=0xffff88810a168000) at arch/x86/kvm/mmu/mmu_internal.h:291
#3 kvm_mmu_page_fault (vcpu=0xffff88810a168000, cr2_or_gpa=1008168, error_code=4294967300, insn=0x0 <fixed_percpu_data>, insn_len=0) at arch/x86/kvm/mmu/mmu.c:5592