Anonymous page faults are among the most common page faults in the kernel. A page fault triggered by accessing memory obtained through malloc, or an anonymous mapping obtained through mmap, is an anonymous page fault. Such faults are handled by the kernel's default path; the handler is the do_anonymous_page function.
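As a quick userspace illustration (my own sketch, not taken from the kernel sources discussed here), the program below touches malloc'd memory and a private anonymous mmap region; the first write to each previously untouched page goes through the anonymous fault path described in this article.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 4 * 4096;

        /* Heap memory: the pages behind malloc() are anonymous as well. */
        char *heap = malloc(len);
        if (!heap)
                return 1;

        /* Private anonymous mapping: no backing file, initially unpopulated. */
        char *anon = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (anon == MAP_FAILED)
                return 1;

        /* The first write to each untouched page triggers an anonymous
         * (minor) page fault handled by do_anonymous_page. */
        memset(heap, 0x5a, len);
        memset(anon, 0x5a, len);

        printf("touched %zu bytes of heap and %zu bytes of anonymous mmap\n",
               len, len);

        munmap(anon, len);
        free(heap);
        return 0;
}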
do_anonymous_page
The do_anonymous_page interface is as follows:
vm_fault_t do_anonymous_page(struct vm_fault *vmf)
Parameters:
- struct vm_fault *vmf: the page fault parameters assembled by handle_mm_fault, together with state used by the later processing steps (an abridged field sketch follows this list); see 《linux那些事之page fault(AMD64架构)(handle_mm_fault)(3)》 for details.
Return value:
- vm_fault_t: the result/error code of the page fault handling; see 《linux那些事之page fault(AMD64架构)(user space)(2)》 for details.
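For orientation, the struct vm_fault fields that do_anonymous_page actually uses are sketched below. This is an abridged, hand-written excerpt in the spirit of include/linux/mm.h; the real layout and field order differ between kernel versions.
/* Abridged sketch; see include/linux/mm.h for the full definition. */
struct vm_fault {
        struct vm_area_struct *vma;     /* target VMA */
        unsigned int flags;             /* FAULT_FLAG_xxx, e.g. FAULT_FLAG_WRITE */
        unsigned long address;          /* faulting virtual address */
        pmd_t *pmd;                     /* PMD entry covering the address */
        pte_t *pte;                     /* PTE mapped and locked by the handler */
        spinlock_t *ptl;                /* page table lock protecting the PTE */
};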
do_anonymous_page processing flow
The logic of do_anonymous_page is relatively simple: it distinguishes the read case, where the zero page can be used, from the write case (and the case where the zero page is not available), and handles each accordingly. The flow is walked through in the source analysis below.
do_anonymous_page source analysis
Going through the do_anonymous_page source along the flow described above:
/*
 * We enter with non-exclusive mmap_lock (to exclude vma changes,
 * but allow concurrent faults), and pte mapped but not yet locked.
 * We return with mmap_lock still held, but pte unmapped and unlocked.
 */
static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
{
        struct vm_area_struct *vma = vmf->vma;
        struct page *page;
        vm_fault_t ret = 0;
        pte_t entry;

        /* File mapping without ->vm_ops ? */
        if (vma->vm_flags & VM_SHARED)
                return VM_FAULT_SIGBUS;

        /*
         * Use pte_alloc() instead of pte_alloc_map(). We can't run
         * pte_offset_map() on pmds where a huge pmd might be created
         * from a different thread.
         *
         * pte_alloc_map() is safe to use under mmap_write_lock(mm) or when
         * parallel threads are excluded by other means.
         *
         * Here we only have mmap_read_lock(mm).
         */
        if (pte_alloc(vma->vm_mm, vmf->pmd))
                return VM_FAULT_OOM;

        /* See the comment in pte_alloc_one_map() */
        if (unlikely(pmd_trans_unstable(vmf->pmd)))
                return 0;

        /* Use the zero-page for reads */
        if (!(vmf->flags & FAULT_FLAG_WRITE) &&
                        !mm_forbids_zeropage(vma->vm_mm)) {
                entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
                                                vma->vm_page_prot));
                vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
                                vmf->address, &vmf->ptl);
                if (!pte_none(*vmf->pte)) {
                        update_mmu_tlb(vma, vmf->address, vmf->pte);
                        goto unlock;
                }
                ret = check_stable_address_space(vma->vm_mm);
                if (ret)
                        goto unlock;
                /* Deliver the page fault to userland, check inside PT lock */
                if (userfaultfd_missing(vma)) {
                        pte_unmap_unlock(vmf->pte, vmf->ptl);
                        return handle_userfault(vmf, VM_UFFD_MISSING);
                }
                goto setpte;
        }

        /* Allocate our own private page. */
        if (unlikely(anon_vma_prepare(vma)))
                goto oom;
        page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
        if (!page)
                goto oom;

        if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL))
                goto oom_free_page;
        cgroup_throttle_swaprate(page, GFP_KERNEL);

        /*
         * The memory barrier inside __SetPageUptodate makes sure that
         * preceding stores to the page contents become visible before
         * the set_pte_at() write.
         */
        __SetPageUptodate(page);

        entry = mk_pte(page, vma->vm_page_prot);
        entry = pte_sw_mkyoung(entry);
        if (vma->vm_flags & VM_WRITE)
                entry = pte_mkwrite(pte_mkdirty(entry));

        vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
                        &vmf->ptl);
        if (!pte_none(*vmf->pte)) {
                update_mmu_cache(vma, vmf->address, vmf->pte);
                goto release;
        }

        ret = check_stable_address_space(vma->vm_mm);
        if (ret)
                goto release;

        /* Deliver the page fault to userland, check inside PT lock */
        if (userfaultfd_missing(vma)) {
                pte_unmap_unlock(vmf->pte, vmf->ptl);
                put_page(page);
                return handle_userfault(vmf, VM_UFFD_MISSING);
        }

        inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
        page_add_new_anon_rmap(page, vma, vmf->address, false);
        lru_cache_add_active_or_unevictable(page, vma);
setpte:
        set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);

        /* No need to invalidate - it was non-present before */
        update_mmu_cache(vma, vmf->address, vmf->pte);
unlock:
        pte_unmap_unlock(vmf->pte, vmf->ptl);
        return ret;
release:
        put_page(page);
        goto unlock;
oom_free_page:
        put_page(page);
oom:
        return VM_FAULT_OOM;
}
The main points of the analysis:
- vm_flags: check whether VM_SHARED is set. Shared anonymous mappings are backed by shmem and carry their own vm_ops, so they never reach do_anonymous_page; a shared mapping without ->vm_ops (as the code comment notes) is not a legal case here, and VM_FAULT_SIGBUS is returned.
- pte_alloc: allocate a PTE table under vmf->pmd if one does not exist yet (it may already have been allocated during the handle_mm_fault walk). If the allocation fails, physical memory is exhausted and VM_FAULT_OOM is returned, triggering OOM handling.
- pmd_trans_unstable: with Transparent Hugepage Support enabled, the PMD must be re-checked before walking down to the PTE level, because another thread may be splitting or collapsing a huge PMD concurrently; if the PMD is unstable the handler simply returns 0 and the fault is retried. Without THP this check is effectively a no-op and processing continues.
- If the page fault was triggered by a read (FAULT_FLAG_WRITE not set) and the zero page is not forbidden for this mm, the read is serviced specially by mapping the shared zero page; otherwise the fault is handled the same way as a write (a userspace sketch after this list makes the resulting read-then-write behavior visible).
- For a write-triggered fault, anon_vma_prepare is called first to prepare the vma, mainly creating the anon_vma and anon_vma_chain needed later for reverse mapping.
- alloc_zeroed_user_highpage_movable allocates the physical page, preferring the movable region of the highmem zone; 64-bit systems have no highmem zone, so the page comes from the movable region of ZONE_NORMAL instead.
- Once the allocation succeeds, mem_cgroup_charge charges the new page to the memory cgroup (memcg) so it is accounted and managed there.
- __SetPageUptodate: sets the PG_uptodate flag, meaning the page contents are now valid.
- The PTE entry is built according to the vma protections (mk_pte, plus pte_mkdirty/pte_mkwrite for writable vmas), and pte_offset_map_lock locks the PTE table to prevent concurrent updates before the new virtual-to-physical mapping is installed.
- check_stable_address_space: checks whether the mm's address space is still usable (for example, it may have been marked unstable by the OOM reaper); if not, the mapping is not installed.
- inc_mm_counter_fast: increments the mm's MM_ANONPAGES counter, the number of anonymous pages mapped by this process.
- page_add_new_anon_rmap: adds the newly allocated page to the anonymous reverse mapping.
- lru_cache_add_active_or_unevictable: adds the page to the LRU lists.
- set_pte_at: installs the new PTE.
- update_mmu_cache: updates the MMU/architecture caches for the new mapping.
- pte_unmap_unlock: releases the PTE lock, then the function returns.
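The userspace sketch below (my own illustration, not part of the kernel code above) makes the two paths visible through the minor-fault counter reported by getrusage: the first read of an untouched anonymous page faults once and maps the shared zero page, and the following write faults again to replace it with a private page.
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>

static long minor_faults(void)
{
        struct rusage ru;

        getrusage(RUSAGE_SELF, &ru);
        return ru.ru_minflt;
}

int main(void)
{
        long before, after_read, after_write;
        volatile char c;
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return 1;

        before = minor_faults();

        /* Read fault: do_anonymous_page maps the shared zero page read-only. */
        c = p[0];
        after_read = minor_faults();

        /* Write fault: the zero-page mapping is replaced by a private page
         * (copy-on-write of the zero page). */
        p[0] = 1;
        after_write = minor_faults();

        printf("minor faults: read +%ld, write +%ld (read value %d)\n",
               after_read - before, after_write - after_read, c);

        munmap(p, 4096);
        return 0;
}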
alloc_zeroed_user_highpage_movable
The alloc_zeroed_user_highpage_movable function obtains a free page from the buddy allocator, with the mempolicy subsystem applying the allocation policy along the way. The path an anonymous mapping takes to obtain physical memory is roughly as follows:
- After the anonymous page fault is triggered, do_anonymous_page obtains physical memory through alloc_zeroed_user_highpage_movable.
- alloc_zeroed_user_highpage_movable asks by default for a page from the movable region of the highmem zone; since 64-bit systems have no highmem zone, the page actually comes from ZONE_NORMAL. It calls alloc_pages_vma, which enters the mempolicy subsystem.
- The mempolicy subsystem decides where and with what policy the memory is allocated; on a NUMA system it also selects the NUMA node. From there the request reaches the buddy system through get_page_from_freelist.
- get_page_from_freelist takes a free page out of the buddy system.
The overall structure of physical memory management is:
- Physical memory is divided into ZONE_DMA and ZONE_NORMAL, plus ZONE_HIGHMEM on 32-bit systems (see 《linux内核那些事之ZONE》).
- Within each zone, physical memory is managed by the buddy allocator in blocks of different orders (see 《linux内核那些事之buddy》); a minimal allocation sketch follows this list.
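As a minimal sketch of asking the buddy allocator directly for the same kind of page the anonymous fault path wants (a zeroed, movable, highmem-capable page), the hypothetical kernel module below calls alloc_pages with GFP_HIGHUSER_MOVABLE | __GFP_ZERO; it only illustrates the buddy interface and is not part of the fault path itself.
#include <linux/module.h>
#include <linux/gfp.h>
#include <linux/mm.h>

static struct page *demo_page;

static int __init buddy_demo_init(void)
{
        /* Order-0 request: one zeroed page from the movable area,
         * highmem-capable, the same flag combination the anonymous
         * fault path ends up using. */
        demo_page = alloc_pages(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0);
        if (!demo_page)
                return -ENOMEM;

        pr_info("buddy_demo: got pfn %lu\n", page_to_pfn(demo_page));
        return 0;
}

static void __exit buddy_demo_exit(void)
{
        __free_pages(demo_page, 0);
}

module_init(buddy_demo_init);
module_exit(buddy_demo_exit);
MODULE_LICENSE("GPL");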
alloc_zeroed_user_highpage_movable source
The alloc_zeroed_user_highpage_movable function source:
/**
 * alloc_zeroed_user_highpage_movable - Allocate a zeroed HIGHMEM page for a VMA that the caller knows can move
 * @vma: The VMA the page is to be allocated for
 * @vaddr: The virtual address the page will be inserted into
 *
 * This function will allocate a page for a VMA that the caller knows will
 * be able to migrate in the future using move_pages() or reclaimed
 */
static inline struct page *
alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
                                   unsigned long vaddr)
{
        return __alloc_zeroed_user_highpage(__GFP_MOVABLE, vma, vaddr);
}
It passes __GFP_MOVABLE so the page is taken from a movable area. What __alloc_zeroed_user_highpage expands to depends on the architecture: when __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE is defined (as it is on x86/AMD64), the architecture supplies its own macro, which adds __GFP_ZERO so the allocator hands back an already zeroed page:
#define __alloc_zeroed_user_highpage(movableflags, vma, vaddr) \
        alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr)
When the architecture does not provide its own version, the generic implementation below is used instead; it allocates the page first and then zeroes it explicitly with clear_user_highpage:
/**
 * __alloc_zeroed_user_highpage - Allocate a zeroed HIGHMEM page for a VMA with caller-specified movable GFP flags
 * @movableflags: The GFP flags related to the pages future ability to move like __GFP_MOVABLE
 * @vma: The VMA the page is to be allocated for
 * @vaddr: The virtual address the page will be inserted into
 *
 * This function will allocate a page for a VMA but the caller is expected
 * to specify via movableflags whether the page will be movable in the
 * future or not
 *
 * An architecture may override this function by defining
 * __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE and providing their own
 * implementation.
 */
static inline struct page *
__alloc_zeroed_user_highpage(gfp_t movableflags,
                             struct vm_area_struct *vma,
                             unsigned long vaddr)
{
        struct page *page = alloc_page_vma(GFP_HIGHUSER | movableflags,
                                           vma, vaddr);

        if (page)
                clear_user_highpage(page, vaddr);

        return page;
}
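In both variants the effective flag combination is GFP_HIGHUSER plus __GFP_MOVABLE (and __GFP_ZERO), i.e. the same thing as GFP_HIGHUSER_MOVABLE | __GFP_ZERO. The definitions below are paraphrased from include/linux/gfp.h and may differ slightly between kernel versions:
/* Paraphrased from include/linux/gfp.h (exact definitions vary by version). */
#define GFP_USER                (__GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
#define GFP_HIGHUSER            (GFP_USER | __GFP_HIGHMEM)
#define GFP_HIGHUSER_MOVABLE    (GFP_HIGHUSER | __GFP_MOVABLE)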
The alloc_page_vma macro (include/linux/gfp.h) is defined as follows:
#define alloc_page_vma(gfp_mask, vma, addr) \
        alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id(), false)
It calls alloc_pages_vma with order 0, entering the mempolicy subsystem (to be covered in a later article), and passes numa_node_id(), the NUMA node the task is currently running on.
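To check which NUMA node the mempolicy subsystem actually placed a freshly faulted anonymous page on, a userspace sketch using the move_pages(2) syscall with a NULL nodes array (query-only mode, via libnuma's numaif.h) could look like the following; this is my own illustration and assumes a NUMA-capable kernel plus the libnuma headers.
#include <stdio.h>
#include <sys/mman.h>
#include <numaif.h>             /* move_pages(); link with -lnuma */

int main(void)
{
        int status = -1;
        void *pages[1];
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return 1;

        /* Touch the page so do_anonymous_page allocates physical memory. */
        p[0] = 1;

        /* With a NULL nodes array, move_pages() only queries the node
         * each page currently resides on and reports it in status[]. */
        pages[0] = p;
        if (move_pages(0, 1, pages, NULL, &status, 0) == 0)
                printf("anonymous page resides on NUMA node %d\n", status);

        munmap(p, 4096);
        return 0;
}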