These are my study notes from reading the KVM source code. Given my limited experience, errors are likely; corrections are welcome.
At the time of writing, the latest Linux kernel release is v5.18-rc3.
My email: szhou12@stu.pku.edu.cn
References:
[1] Linux: v5.16.20, v5.17.3
[2] https://www.cnblogs.com/LoyenWang/p/13943005.html
0 Preface
The Linux 5.17 kernel introduced red-black trees to organize memory slots, whereas versions 5.0 through 5.16 organized memory slots in an array. Besides the structures themselves, the related memory allocation functions were also adjusted for the new slot organization. This article focuses on the KVM code in 5.16 and discusses the 5.17 changes at the end. Unless otherwise noted, all code in this article comes from Linux v5.16.20, and all architecture-specific code is taken from /arch/x86/.
Also, this article covers only the KVM code inside the Linux kernel, not the QEMU source.
1 Macros and structures
The KVM code involves many structure and macro definitions, and sorting out the relationships among them goes a long way toward understanding the code.
1.1 KVM address types
Both the guest OS and the host OS have their own virtual and physical addresses. The guest translates its virtual addresses through the MMU into what it believes are physical addresses; each guest physical address corresponds to a host virtual address, which the host MMU in turn maps onto physical memory. Alternatively, shadow page tables can translate guest virtual addresses directly to host physical memory.
This article focuses on kvm_memory_slot and the functions around it; shadow page tables and their implementation are out of scope.
The header kvm_types.h defines the following six address types:
// include/linux/kvm_types.h
/*
 * Address types:
 *
 *     gva - guest virtual address
 *     gpa - guest physical address
 *     gfn - guest frame number
 *     hva - host virtual address
 *     hpa - host physical address
 *     hfn - host frame number
 */
typedef unsigned long gva_t;
typedef u64 gpa_t;
typedef u64 gfn_t;
#define GPA_INVALID (~(gpa_t)0)
typedef unsigned long hva_t;
typedef u64 hpa_t;
typedef u64 hfn_t;
gva_t: GVA, guest virtual address
gpa_t: GPA, guest physical address
gfn_t: GFN, guest page frame number
hva_t: HVA, host virtual address
hpa_t: HPA, host physical address
hfn_t: HFN, host page frame number
1.2 The struct kvm structure
The kvm structure represents a single KVM virtual machine instance and holds the VM's data and control fields.
// include/linux/kvm_host.h
struct kvm {
    ...
    struct mm_struct *mm; /* userspace tied to this vm */
    struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];
    struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
    ...
};
Since this structure contains a great many fields, only the parts related to memory management are excerpted here; the various locks taken during memory operations are also omitted.
struct mm_struct: the userspace address space tied to this VM; it contains the struct vm_area_struct entries describing the different regions of the process's virtual address space.
struct kvm_memslots: can be thought of as the VM's memory slots; KVM_ADDRESS_SPACE_NUM defaults to 1 (x86 defines it as 2, to support SMM).
// include/linux/kvm_host.h
#ifndef KVM_ADDRESS_SPACE_NUM
#define KVM_ADDRESS_SPACE_NUM 1
#endif
1.3 The struct kvm_memslots structure
// include/linux/kvm_host.h
/*
 * Note:
 * memslots are not sorted by id anymore, please use id_to_memslot()
 * to get the memslot by its id.
 */
struct kvm_memslots {
    u64 generation;
    /* The mapping table from slot id to the index in memslots[]. */
    short id_to_index[KVM_MEM_SLOTS_NUM];
    atomic_t last_used_slot;
    int used_slots;
    struct kvm_memory_slot memslots[];
};
As the comment in the code says, memslots are no longer sorted by id; use id_to_memslot() to map a slot id to its index in the array. The structure also records the last-used slot and the number of slots in use.
atomic_t declares an integer with atomic operation semantics; see the kernel documentation for details.
1.4 The struct kvm_memory_slot structure
// include/linux/kvm_host.h
struct kvm_memory_slot {
    gfn_t base_gfn;
    unsigned long npages;
    unsigned long *dirty_bitmap;
    struct kvm_arch_memory_slot arch;
    unsigned long userspace_addr;
    u32 flags;
    short id;
    u16 as_id;
};
gfn_t base_gfn: the first guest page frame covered by this slot
unsigned long npages: the number of pages the slot contains (its size)
dirty_bitmap: bitmap of the slot's dirty pages
arch: architecture-specific memslot data
unsigned long userspace_addr: the host userspace address the slot is backed by
u32 flags: flag bits
id: the memslot id, matching id_to_index and id_to_memslot() from section 1.3
as_id: the id of the address space the slot belongs to; the generic default is a single address space, and x86 defines a second one for SMM
1.5 The struct kvm_userspace_memory_region structure
// include/uapi/linux/kvm.h
/* for KVM_SET_USER_MEMORY_REGION */
struct kvm_userspace_memory_region {
    __u32 slot;
    __u32 flags;
    __u64 guest_phys_addr;
    __u64 memory_size; /* bytes */
    __u64 userspace_addr; /* start of the userspace allocated memory */
};
This structure lives in userspace and is used by QEMU to describe the mapping between a range of guest physical addresses and host virtual addresses.
1.6 Relationships among the structures
1.7 Flag macros in KVM
// include/uapi/linux/kvm.h
/*
 * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
 * other bits are reserved for kvm internal use which are defined in
 * include/linux/kvm_host.h.
 */
#define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
#define KVM_MEM_READONLY (1UL << 1)

// include/linux/kvm_host.h
/*
 * The bit 16 ~ bit 31 of kvm_memory_region::flags are internally used
 * in kvm, other bits are visible for userspace which are defined in
 * include/linux/kvm_h.
 */
#define KVM_MEMSLOT_INVALID (1UL << 16)
Three flags are defined so far: dirty-page logging, read-only, and invalid. The first two live under the uapi/ directory, so they are visible to userspace.
1UL << 1: U marks the constant as unsigned and L as long; shifted left by one bit, the value is 0b000...0010.
1.8 The kvm_mr_change enum: memory operations
// include/linux/kvm_host.h
/*
 * KVM_SET_USER_MEMORY_REGION ioctl allows the following operations:
 * - create a new memory slot
 * - delete an existing memory slot
 * - modify an existing memory slot
 *   -- move it in the guest physical memory space
 *   -- just change its flags
 *
 * Since flags can be changed by some of these operations, the following
 * differentiation is the best we can do for __kvm_set_memory_region():
 */
enum kvm_mr_change {
    KVM_MR_CREATE,
    KVM_MR_DELETE,
    KVM_MR_MOVE,
    KVM_MR_FLAGS_ONLY,
};
This enum lists the four memory operations available under the KVM_SET_USER_MEMORY_REGION branch of the ioctl:
1. Create a new slot
2. Delete an existing slot
3. Move a slot within guest physical memory
4. Change the flags of an existing slot
1.9 Error codes defined by Linux
// include/uapi/asm-generic/errno-base.h
#define EPERM 1 /* Operation not permitted */
#define ENOENT 2 /* No such file or directory */
#define ESRCH 3 /* No such process */
#define EINTR 4 /* Interrupted system call */
#define EIO 5 /* I/O error */
#define ENXIO 6 /* No such device or address */
#define E2BIG 7 /* Argument list too long */
#define ENOEXEC 8 /* Exec format error */
#define EBADF 9 /* Bad file number */
#define ECHILD 10 /* No child processes */
#define EAGAIN 11 /* Try again */
#define ENOMEM 12 /* Out of memory */
#define EACCES 13 /* Permission denied */
#define EFAULT 14 /* Bad address */
#define ENOTBLK 15 /* Block device required */
#define EBUSY 16 /* Device or resource busy */
#define EEXIST 17 /* File exists */
#define EXDEV 18 /* Cross-device link */
#define ENODEV 19 /* No such device */
#define ENOTDIR 20 /* Not a directory */
#define EISDIR 21 /* Is a directory */
#define EINVAL 22 /* Invalid argument */
#define ENFILE 23 /* File table overflow */
#define EMFILE 24 /* Too many open files */
#define ENOTTY 25 /* Not a typewriter */
#define ETXTBSY 26 /* Text file busy */
#define EFBIG 27 /* File too large */
#define ENOSPC 28 /* No space left on device */
#define ESPIPE 29 /* Illegal seek */
#define EROFS 30 /* Read-only file system */
#define EMLINK 31 /* Too many links */
#define EPIPE 32 /* Broken pipe */
#define EDOM 33 /* Math argument out of domain of func */
#define ERANGE 34 /* Math result not representable */
Functions use these error definitions when their sanity checks fail.
2 Memory management functions
2.1 kvm_vm_ioctl()
// virt/kvm/kvm_main.c
static long kvm_vm_ioctl(struct file *filp,
                         unsigned int ioctl, unsigned long arg)
{
    struct kvm *kvm = filp->private_data;
    void __user *argp = (void __user *)arg;
    int r;

    if (kvm->mm != current->mm || kvm->vm_dead)
        return -EIO;
    switch (ioctl) {
    ...
    case KVM_SET_USER_MEMORY_REGION: {
        struct kvm_userspace_memory_region kvm_userspace_mem;

        r = -EFAULT;
        if (copy_from_user(&kvm_userspace_mem, argp,
                           sizeof(kvm_userspace_mem)))
            goto out;

        r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
        break;
    }
    ...
out:
    return r;
}
kvm_vm_ioctl() is the entry point for KVM's VM-level ioctls and takes three arguments:
struct file *filp: the file pointer for the VM; its private_data field points to the struct kvm representing this guest
unsigned int ioctl: the ioctl number
unsigned long arg: a pointer to the userspace arguments
The function's control flow is shown in the figure below:
After receiving its arguments, the function jumps to the KVM_SET_USER_MEMORY_REGION branch, copies the userspace arguments into a kvm_userspace_memory_region structure, and then calls kvm_vm_ioctl_set_memory_region().
2.2 kvm_vm_ioctl_set_memory_region()
// virt/kvm/kvm_main.c
static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
                                          struct kvm_userspace_memory_region *mem)
{
    if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
        return -EINVAL;

    return kvm_set_memory_region(kvm, mem);
}
This function receives the kvm structure and the kvm_userspace_memory_region structure; after validating the slot value it calls kvm_set_memory_region().
KVM_USER_MEM_SLOTS is defined as follows:
// include/linux/kvm_host.h
#ifndef KVM_PRIVATE_MEM_SLOTS
#define KVM_PRIVATE_MEM_SLOTS 0
#endif
#define KVM_MEM_SLOTS_NUM SHRT_MAX
#define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_PRIVATE_MEM_SLOTS)
// include/vdso/limits.h
#define USHRT_MAX ((unsigned short)~0U)
#define SHRT_MAX ((short)(USHRT_MAX >> 1))
By default the number of private memslots is 0, so KVM_USER_MEM_SLOTS is simply the maximum value representable by a short.
2.3 kvm_set_memory_region()
// virt/kvm/kvm_main.c
int kvm_set_memory_region(struct kvm *kvm,
                          const struct kvm_userspace_memory_region *mem)
{
    int r;

    mutex_lock(&kvm->slots_lock);
    r = __kvm_set_memory_region(kvm, mem);
    mutex_unlock(&kvm->slots_lock);
    return r;
}
After taking the slots_lock mutex in the kvm structure, this function calls __kvm_set_memory_region().
2.4 __kvm_set_memory_region()
// virt/kvm/kvm_main.c
/*
 * Allocate some memory and give it an address in the guest physical address
 * space.
 *
 * Discontiguous memory is allowed, mostly for framebuffers.
 *
 * Must be called holding kvm->slots_lock for write.
 */
int __kvm_set_memory_region(struct kvm *kvm,
                            const struct kvm_userspace_memory_region *mem)
{
    struct kvm_memory_slot old, new;
    struct kvm_memory_slot *tmp;
    enum kvm_mr_change change;
    int as_id, id;
    int r;

    r = check_memory_region_flags(mem);
    if (r)
        return r;

    as_id = mem->slot >> 16;
    id = (u16)mem->slot;

    /* General sanity checks */
    if ((mem->memory_size & (PAGE_SIZE - 1)) ||
        (mem->memory_size != (unsigned long)mem->memory_size))
        return -EINVAL;
    if (mem->guest_phys_addr & (PAGE_SIZE - 1))
        return -EINVAL;
    /* We can read the guest memory with __xxx_user() later on. */
    if ((mem->userspace_addr & (PAGE_SIZE - 1)) ||
        (mem->userspace_addr != untagged_addr(mem->userspace_addr)) ||
         !access_ok((void __user *)(unsigned long)mem->userspace_addr,
                    mem->memory_size))
        return -EINVAL;
    if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
        return -EINVAL;
    if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
        return -EINVAL;

    /*
     * Make a full copy of the old memslot, the pointer will become stale
     * when the memslots are re-sorted by update_memslots(), and the old
     * memslot needs to be referenced after calling update_memslots(), e.g.
     * to free its resources and for arch specific behavior.
     */
    tmp = id_to_memslot(__kvm_memslots(kvm, as_id), id);
    if (tmp) {
        old = *tmp;
        tmp = NULL;
    } else {
        memset(&old, 0, sizeof(old));
        old.id = id;
    }

    if (!mem->memory_size)
        return kvm_delete_memslot(kvm, mem, &old, as_id);

    new.as_id = as_id;
    new.id = id;
    new.base_gfn = mem->guest_phys_addr >> PAGE_SHIFT;
    new.npages = mem->memory_size >> PAGE_SHIFT;
    new.flags = mem->flags;
    new.userspace_addr = mem->userspace_addr;

    if (new.npages > KVM_MEM_MAX_NR_PAGES)
        return -EINVAL;

    if (!old.npages) {
        change = KVM_MR_CREATE;
        new.dirty_bitmap = NULL;
    } else { /* Modify an existing slot. */
        if ((new.userspace_addr != old.userspace_addr) ||
            (new.npages != old.npages) ||
            ((new.flags ^ old.flags) & KVM_MEM_READONLY))
            return -EINVAL;

        if (new.base_gfn != old.base_gfn)
            change = KVM_MR_MOVE;
        else if (new.flags != old.flags)
            change = KVM_MR_FLAGS_ONLY;
        else /* Nothing to change. */
            return 0;

        /* Copy dirty_bitmap from the current memslot. */
        new.dirty_bitmap = old.dirty_bitmap;
    }

    if ((change == KVM_MR_CREATE) || (change == KVM_MR_MOVE)) {
        /* Check for overlaps */
        kvm_for_each_memslot(tmp, __kvm_memslots(kvm, as_id)) {
            if (tmp->id == id)
                continue;
            if (!((new.base_gfn + new.npages <= tmp->base_gfn) ||
                  (new.base_gfn >= tmp->base_gfn + tmp->npages)))
                return -EEXIST;
        }
    }

    /* Allocate/free page dirty bitmap as needed */
    if (!(new.flags & KVM_MEM_LOG_DIRTY_PAGES))
        new.dirty_bitmap = NULL;
    else if (!new.dirty_bitmap && !kvm->dirty_ring_size) {
        r = kvm_alloc_dirty_bitmap(&new);
        if (r)
            return r;

        if (kvm_dirty_log_manual_protect_and_init_set(kvm))
            bitmap_set(new.dirty_bitmap, 0, new.npages);
    }

    r = kvm_set_memslot(kvm, mem, &new, as_id, change);
    if (r)
        goto out_bitmap;

    if (old.dirty_bitmap && !new.dirty_bitmap)
        kvm_destroy_dirty_bitmap(&old);
    return 0;

out_bitmap:
    if (new.dirty_bitmap && !old.dirty_bitmap)
        kvm_destroy_dirty_bitmap(&new);
    return r;
}
EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
This is the main function through which KVM establishes the GPA-to-HVA mapping; its flow is shown in the figure below:
__kvm_set_memory_region() performs the page-alignment and overflow checks, determines the kind of operation (kvm_mr_change) from the old and new slot contents, and finally calls kvm_set_memslot().
2.5 kvm_set_memslot()
// virt/kvm/kvm_main.c
static int kvm_set_memslot(struct kvm *kvm,
                           const struct kvm_userspace_memory_region *mem,
                           struct kvm_memory_slot *new, int as_id,
                           enum kvm_mr_change change)
{
    struct kvm_memory_slot *slot, old;
    struct kvm_memslots *slots;
    int r;

    /*
     * Released in install_new_memslots.
     *
     * Must be held from before the current memslots are copied until
     * after the new memslots are installed with rcu_assign_pointer,
     * then released before the synchronize srcu in install_new_memslots.
     *
     * When modifying memslots outside of the slots_lock, must be held
     * before reading the pointer to the current memslots until after all
     * changes to those memslots are complete.
     *
     * These rules ensure that installing new memslots does not lose
     * changes made to the previous memslots.
     */
    mutex_lock(&kvm->slots_arch_lock);

    slots = kvm_dup_memslots(__kvm_memslots(kvm, as_id), change);
    if (!slots) {
        mutex_unlock(&kvm->slots_arch_lock);
        return -ENOMEM;
    }

    if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
        /*
         * Note, the INVALID flag needs to be in the appropriate entry
         * in the freshly allocated memslots, not in @old or @new.
         */
        slot = id_to_memslot(slots, new->id);
        slot->flags |= KVM_MEMSLOT_INVALID;

        /*
         * We can re-use the memory from the old memslots.
         * It will be overwritten with a copy of the new memslots
         * after reacquiring the slots_arch_lock below.
         */
        slots = install_new_memslots(kvm, as_id, slots);

        /* From this point no new shadow pages pointing to a deleted,
         * or moved, memslot will be created.
         *
         * validation of sp->gfn happens in:
         * - gfn_to_hva (kvm_read_guest, gfn_to_pfn)
         * - kvm_is_visible_gfn (mmu_check_root)
         */
        kvm_arch_flush_shadow_memslot(kvm, slot);

        /* Released in install_new_memslots. */
        mutex_lock(&kvm->slots_arch_lock);

        /*
         * The arch-specific fields of the memslots could have changed
         * between releasing the slots_arch_lock in
         * install_new_memslots and here, so get a fresh copy of the
         * slots.
         */
        kvm_copy_memslots(slots, __kvm_memslots(kvm, as_id));
    }

    /*
     * Make a full copy of the old memslot, the pointer will become stale
     * when the memslots are re-sorted by update_memslots(), and the old
     * memslot needs to be referenced after calling update_memslots(), e.g.
     * to free its resources and for arch specific behavior. This needs to
     * happen *after* (re)acquiring slots_arch_lock.
     */
    slot = id_to_memslot(slots, new->id);
    if (slot) {
        old = *slot;
    } else {
        WARN_ON_ONCE(change != KVM_MR_CREATE);
        memset(&old, 0, sizeof(old));
        old.id = new->id;
        old.as_id = as_id;
    }

    /* Copy the arch-specific data, again after (re)acquiring slots_arch_lock. */
    memcpy(&new->arch, &old.arch, sizeof(old.arch));

    r = kvm_arch_prepare_memory_region(kvm, new, mem, change);
    if (r)
        goto out_slots;

    update_memslots(slots, new, change);
    slots = install_new_memslots(kvm, as_id, slots);

    kvm_arch_commit_memory_region(kvm, mem, &old, new, change);

    /* Free the old memslot's metadata. Note, this is the full copy!!! */
    if (change == KVM_MR_DELETE)
        kvm_free_memslot(kvm, &old);

    kvfree(slots);
    return 0;

out_slots:
    if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
        slot = id_to_memslot(slots, new->id);
        slot->flags &= ~KVM_MEMSLOT_INVALID;
        slots = install_new_memslots(kvm, as_id, slots);
    } else {
        mutex_unlock(&kvm->slots_arch_lock);
    }
    kvfree(slots);
    return r;
}
In short, kvm_set_memslot() duplicates the active memslots, and for DELETE and MOVE first marks the target slot KVM_MEMSLOT_INVALID, publishes that intermediate copy, and flushes the shadow pages; it then applies the actual update and publishes the final memslots with install_new_memslots(). I ran out of steam for a longer write-up, so please make do with my hand-drawn table.
Afterword
The KVM codebase is vast, and the architecture-specific parts have to be studied against the hardware manuals. I am genuinely worn out from writing; I will add more detail when I get the chance, and the comparison with 5.17 will likewise have to wait.