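Continuing from 《linux那些事之LRU（1）》, this post uses anonymous pages as the example and follows a page through the lists it sits on at different stages of its life, to explain how the page LRU works.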

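System power-on: memblock -> buddy free list

After power-on, during early initialization, the buddy allocator has not been set up yet and memory is managed by memblock (see 《linux内核那些事之early boot memory-memblock》 and 《linux内核那些事之buddy》). memblock hands physical memory over to buddy page by page through memblock_free_pages, keeping physically contiguous pages together at the largest possible order; __free_pages_core then puts each block on the matching free list: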

void __free_pages_core(struct page *page, unsigned int order)
{
	unsigned int nr_pages = 1 << order;
	struct page *p = page;
	unsigned int loop;

	prefetchw(p);
	for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
		prefetchw(p + 1);
		__ClearPageReserved(p);
		set_page_count(p, 0);
	}
	__ClearPageReserved(p);
	set_page_count(p, 0);

	atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
	set_page_refcounted(page);
	__free_pages(page, order);
}

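Once buddy initialization is complete, management of physical memory has moved from memblock to the buddy system.
At this point the lru field of struct page links the page into the free list of its order: its prev and next pointers both point into that same-order free list.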
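
The "largest possible order" rule can be sketched in user space. This is a minimal model, not the kernel's loop: SKETCH_MAX_ORDER, sketch_block_order and sketch_free_range are hypothetical names, the value 11 is an assumed default, and the counter update stands in for the call to __free_pages_core.

```c
#include <assert.h>

/* Assumption: 11 orders (0..10), a common default configuration. */
#define SKETCH_MAX_ORDER 11

/* Largest order whose block is aligned at pfn and still fits below end. */
static unsigned int sketch_block_order(unsigned long pfn, unsigned long end)
{
	unsigned int order = SKETCH_MAX_ORDER - 1;

	assert(pfn < end);
	/* A block of 2^order pages must start on a 2^order boundary. */
	while (order > 0 && (pfn & ((1UL << order) - 1)))
		order--;
	/* The block must not run past the end of the range. */
	while (order > 0 && pfn + (1UL << order) > end)
		order--;
	return order;
}

/* Hand [start, end) over block by block; counts[o] = blocks of order o. */
static void sketch_free_range(unsigned long start, unsigned long end,
			      unsigned long counts[SKETCH_MAX_ORDER])
{
	for (unsigned int o = 0; o < SKETCH_MAX_ORDER; o++)
		counts[o] = 0;

	while (start < end) {
		unsigned int order = sketch_block_order(start, end);

		counts[order]++;	/* stands in for __free_pages_core(page, order) */
		start += 1UL << order;
	}
}
```

For the page frame range [3, 20) this yields one order-0 block (pfn 3), an order-2 block (4..7), an order-3 block (8..15) and a second order-2 block (16..19): alignment limits the order at the start of the range and the remaining length limits it at the end.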

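buddy free list -> active/unevictable LRU

After a program obtains virtual memory through malloc or mmap, the first write to that memory raises a page fault, which allocates physical memory from buddy. The rest of this section takes x86 as the example.

For an anonymous page the fault is handled in do_anonymous_page (see 《linux那些事之page fault(do_anonymous_page)（4）》). Two calls in it matter here:

alloc_zeroed_user_highpage_movable: eventually allocates a zeroed physical page from buddy.
lru_cache_add_active_or_unevictable: adds the newly allocated page to the active or the unevictable LRU.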

lru_cache_add_active_or_unevictable

lru_cache_add_active_or_unevictable is an interface exported by the swap code (mm/swap.c) that adds a page to the appropriate LRU list:

void lru_cache_add_active_or_unevictable(struct page *page,
					 struct vm_area_struct *vma)
{
	VM_BUG_ON_PAGE(PageLRU(page), page);

	if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
		SetPageActive(page);
	else if (!TestSetPageMlocked(page)) {
		/*
		 * We use the irq-unsafe __mod_zone_page_stat because this
		 * counter is not modified from interrupt context, and the pte
		 * lock is held(spinlock), which implies preemption disabled.
		 */
		__mod_zone_page_state(page_zone(page), NR_MLOCK,
				    hpage_nr_pages(page));
		count_vm_event(UNEVICTABLE_PGMLOCKED);
	}
	lru_cache_add(page);
}

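The function first checks whether the vma is pinned, that is, VM_LOCKED is set and no VM_SPECIAL flag is. If it is not pinned, SetPageActive sets PG_active, so the page will later be added to the active list.
If the vma is pinned, TestSetPageMlocked sets PG_mlocked on the page. If the flag was not already set, the page was out of step with its vma, so the NR_MLOCK counter and the UNEVICTABLE_PGMLOCKED event are updated as well.
Finally lru_cache_add puts the page on its LRU list: a page that is not pinned goes to the active list first, while a pinned page must not be swapped out and goes to the unevictable list (LRU_UNEVICTABLE).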
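
The flag test is compact enough to be easy to misread, so here is a small user-space sketch of just that decision. The SK_ flag bits are illustrative values, SK_VM_SPECIAL is a reduced stand-in for the kernel's VM_SPECIAL mask, and sketch_pick_lru is a hypothetical helper.

```c
#include <assert.h>

/* Illustrative flag bits; the real values live in include/linux/mm.h. */
#define SK_VM_LOCKED	0x1UL
#define SK_VM_IO	0x2UL
#define SK_VM_PFNMAP	0x4UL
/* Reduced stand-in: the kernel's VM_SPECIAL combines more flags. */
#define SK_VM_SPECIAL	(SK_VM_IO | SK_VM_PFNMAP)

enum sketch_lru { SKETCH_ACTIVE, SKETCH_UNEVICTABLE };

/* Same comparison as in lru_cache_add_active_or_unevictable. */
static enum sketch_lru sketch_pick_lru(unsigned long vm_flags)
{
	if ((vm_flags & (SK_VM_LOCKED | SK_VM_SPECIAL)) != SK_VM_LOCKED)
		return SKETCH_ACTIVE;
	return SKETCH_UNEVICTABLE;
}

static void sketch_pick_lru_demo(void)
{
	assert(sketch_pick_lru(0) == SKETCH_ACTIVE);
	assert(sketch_pick_lru(SK_VM_LOCKED) == SKETCH_UNEVICTABLE);
	/* A locked but special mapping still takes the active branch. */
	assert(sketch_pick_lru(SK_VM_LOCKED | SK_VM_IO) == SKETCH_ACTIVE);
}
```

Masking with both VM_LOCKED and VM_SPECIAL and comparing against VM_LOCKED alone means the unevictable branch is taken only when the lock bit is set and every special bit is clear.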

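active LRU -> inactive LRU

Once a page is on the active LRU, a later memory shortage triggers reclaim, which moves rarely used pages from the active to the inactive LRU list. To understand how a page makes that move, we first need to know what criterion the kernel uses to decide whether a page is active. If that decision is poor, pages bounce back and forth between the active and inactive LRU lists and performance suffers.

Determining PAGE ACTIVE

To follow the state transitions of a page we need to know how the kernel judges page activity. Most hardware architectures provide no per-page access counter; they only provide an ACCESS bit that says whether the page has been accessed. The activity decision therefore combines hardware and software:

Hardware: from the x86 address translation process (see 《linux那些事之 page translation(硬件篇）》), the PTE contains an ACCESS bit (_PAGE_BIT_ACCESSED) that records whether the page has been accessed. This bit can be used to tell whether the page was touched.

Software: the page flags PG_active and PG_referenced are used together to judge whether a page has been accessed. PG_active records which LRU list the page is on. PG_referenced records whether the page has been accessed recently, which in turn is decided with the help of the PTE access bit.

PTE access bit

When a program accesses memory, the hardware sets the access bit to 1 automatically during VA->PA translation, marking the page as recently accessed. The hardware never clears the bit, though; software has to clear it and then check after some interval whether it has been set again. If it has, the page was accessed recently and should not be swapped out. In the kernel, ptep_test_and_clear_young() tests the access bit and, if it is set, clears it so that a later check can tell whether the page was accessed again: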

int ptep_test_and_clear_young(struct vm_area_struct *vma,
			      unsigned long addr, pte_t *ptep)
{
	int ret = 0;

	if (pte_young(*ptep))
		ret = test_and_clear_bit(_PAGE_BIT_ACCESSED,
					 (unsigned long *) &ptep->pte);

	return ret;
}

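pte_young: checks whether the PTE access bit is set; returns 1 if the page has been accessed, otherwise 0.
test_and_clear_bit: tests the access bit again and clears it in the same step; it is an atomic operation, which guards against concurrent updates.
ptep_test_and_clear_young: returns 1 if the page had been accessed (the bit is now cleared), and 0 if it had not.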
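
The test-then-clear pattern can be tried outside the kernel with C11 atomics. This is only a sketch on a simulated PTE word: sketch_test_and_clear_young and SK_PAGE_ACCESSED are hypothetical names, and atomic_fetch_and plays the role that test_and_clear_bit plays in the kernel code.

```c
#include <assert.h>
#include <stdatomic.h>

/* Bit 5 is the accessed bit of an x86 PTE (_PAGE_BIT_ACCESSED). */
#define SK_PAGE_ACCESSED (1UL << 5)

static int sketch_test_and_clear_young(_Atomic unsigned long *pte)
{
	unsigned long old;

	/* Cheap read first: skip the atomic write if the bit is clear. */
	if (!(atomic_load(pte) & SK_PAGE_ACCESSED))
		return 0;

	/* Atomically clear the bit and learn whether it was still set. */
	old = atomic_fetch_and(pte, ~SK_PAGE_ACCESSED);
	return (old & SK_PAGE_ACCESSED) != 0;
}

/* Two scans of one simulated PTE, then an access, then a third scan. */
static void sketch_young_demo(void)
{
	_Atomic unsigned long pte = 0x67;	/* accessed bit set */

	assert(sketch_test_and_clear_young(&pte) == 1);	/* was accessed */
	assert(sketch_test_and_clear_young(&pte) == 0);	/* idle since last scan */
	atomic_fetch_or(&pte, SK_PAGE_ACCESSED);	/* "hardware" touches it */
	assert(sketch_test_and_clear_young(&pte) == 1);
}
```

The second scan returning 0 is the whole point: only a page that stays untouched between two scans looks idle to reclaim.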

The kernel also has a ptep_clear_flush_young_notify macro that tests and clears the PTE access bit:

#define ptep_clear_flush_young_notify(__vma, __address, __ptep)		\
({									\
	int __young;							\
	struct vm_area_struct *___vma = __vma;				\
	unsigned long ___address = __address;				\
	__young = ptep_clear_flush_young(___vma, ___address, __ptep);	\
	__young |= mmu_notifier_clear_flush_young(___vma->vm_mm,	\
						  ___address,		\
						  ___address +		\
							PAGE_SIZE);	\
	__young;							\
})

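ptep_clear_flush_young: returns whether the pte access bit was set and clears it.
mmu_notifier_clear_flush_young: invokes the MMU notifiers so that secondary MMUs (KVM, for example) test and clear their own accessed state for the same address range; the result is OR-ed into __young.

PG_active and PG_referenced

The kernel uses the two flags PG_active and PG_referenced to express how active a page is (section 18.6.3 of 《professional linux kernel architecture》 gives a reasonable explanation of why a single PG_active flag is not enough).

PG_active says which LRU list the page belongs on: with PG_active set the page goes on the active LRU list, without it on the inactive LRU list.
PG_referenced says whether the page has been accessed recently.

LRU movement as a whole follows the second-chance rule: PG_active and PG_referenced together drive a page's moves between the active and inactive LRU lists.

The state diagram this description is based on comes from the 2.6 kernel, but it still applies to the 5.8 kernel; only the APIs have changed slightly: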

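buddy free list -> active list: through lru_cache_add.
buddy free list -> inactive list: through lru_cache_add.
active -> active list: page_referenced reports that the page has been accessed, so it stays on the active list. A page that has not been accessed is moved to the tail of the active LRU list to wait for its second chance; if it is still not accessed it goes to the inactive list. A page at the tail that has been accessed is moved to the head of the active list and is not moved to the inactive list.
active -> inactive list: for a page at the tail of the active list, page_referenced checks whether it has been accessed; if not, shrink_active_list moves it from the active to the inactive list.
inactive -> inactive list: a page at the head of the inactive list that has not been accessed is moved to the tail (its second chance).
inactive -> active list: a page at the head of the inactive list that has been accessed is added to the active list.
inactive -> swap out: a page at the tail of the inactive list that is not in use is swapped out to disk.

The key functions in these state transitions are:

page_referenced: checks whether a page has been accessed; where it has, the PTE access bit is cleared.
mark_page_accessed: records an access to a page.
shrink_active_list: scans pages on the active LRU list and moves those still unaccessed after their second chance to the inactive list.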
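
mark_page_accessed is only named here, so a small model of how it combines the two flags may help. The sketch follows the three-step promotion described in the comment above mark_page_accessed in mm/swap.c as I understand it; struct sketch_page and sketch_mark_accessed are hypothetical, and unevictable pages, per-CPU batching and workingset accounting are left out.

```c
#include <assert.h>
#include <stdbool.h>

struct sketch_page {
	bool active;		/* models PG_active */
	bool referenced;	/* models PG_referenced */
};

/* One software-visible access to the page. */
static void sketch_mark_accessed(struct sketch_page *p)
{
	if (!p->referenced) {
		/* First touch on either list: just remember it. */
		p->referenced = true;
	} else if (!p->active) {
		/* Second touch while inactive: promote, start over. */
		p->active = true;
		p->referenced = false;
	}
	/* active and referenced: nothing more to record. */
}

static void sketch_mark_accessed_demo(void)
{
	struct sketch_page p = { .active = false, .referenced = false };

	sketch_mark_accessed(&p);	/* inactive,unreferenced -> inactive,referenced */
	assert(!p.active && p.referenced);
	sketch_mark_accessed(&p);	/* inactive,referenced -> active,unreferenced */
	assert(p.active && !p.referenced);
	sketch_mark_accessed(&p);	/* active,unreferenced -> active,referenced */
	assert(p.active && p.referenced);
}
```

Requiring two touches before promotion is what one flag alone could not express: a page read once during a streaming scan gets PG_referenced but never PG_active.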

page_referenced

page_referenced is a wrapper around page_referenced_one; the kernel calls it whenever it needs to know whether a page has been accessed:


/**
 * page_referenced - test if the page was referenced
 * @page: the page to test
 * @is_locked: caller holds lock on the page
 * @memcg: target memory cgroup
 * @vm_flags: collect encountered vma->vm_flags who actually referenced the page
 *
 * Quick test_and_clear_referenced for all mappings to a page,
 * returns the number of ptes which referenced the page.
 */
int page_referenced(struct page *page,
		    int is_locked,
		    struct mem_cgroup *memcg,
		    unsigned long *vm_flags)
{
	int we_locked = 0;
	struct page_referenced_arg pra = {
		.mapcount = total_mapcount(page),
		.memcg = memcg,
	};
	struct rmap_walk_control rwc = {
		.rmap_one = page_referenced_one,
		.arg = (void *)&pra,
		.anon_lock = page_lock_anon_vma_read,
	};

	*vm_flags = 0;
	if (!pra.mapcount)
		return 0;

	if (!page_rmapping(page))
		return 0;

	if (!is_locked && (!PageAnon(page) || PageKsm(page))) {
		we_locked = trylock_page(page);
		if (!we_locked)
			return 1;
	}

	/*
	 * If we are reclaiming on behalf of a cgroup, skip
	 * counting on behalf of references from different
	 * cgroups
	 */
	if (memcg) {
		rwc.invalid_vma = invalid_page_referenced_vma;
	}

	rmap_walk(page, &rwc);
	*vm_flags = pra.vm_flags;

	if (we_locked)
		unlock_page(page);

	return pra.referenced;
}

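pra.mapcount: the number of mappings of the page. If it is 0 the page is not mapped by any process and the function returns 0 at once; a value greater than 0 means one or more processes map it.
page_rmapping: checks that the page carries reverse-mapping information; if no mapping is recorded, the function returns 0.
!is_locked && (!PageAnon(page) || PageKsm(page)): for a file-backed or KSM page the reverse-map walk needs the page lock, so if the caller does not hold it the function tries trylock_page. If that fails it returns 1, conservatively treating the page as referenced.
rmap_walk: walks the reverse mappings of the page and calls page_referenced_one for every vma that maps it. pra.referenced counts the mappings through which the page was accessed; a value greater than 0 means at least one process accessed it.

page_referenced_one

page_referenced_one handles one vma of the reverse-map walk. Summed over all vmas, its result tells how many mappings referenced the physical page, and so whether the page is currently active. In the kernel a page is often used by several processes at once (shared memory or shared mappings, for example).

Each page maintains a reverse mapping that links it to every vma using it. rmap_walk visits those vmas and calls page_referenced_one on each, which looks up the corresponding pte and checks its access bit: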

/*
 * arg: page_referenced_arg will be passed
 */
static bool page_referenced_one(struct page *page, struct vm_area_struct *vma,
			unsigned long address, void *arg)
{
	struct page_referenced_arg *pra = arg;
	struct page_vma_mapped_walk pvmw = {
		.page = page,
		.vma = vma,
		.address = address,
	};
	int referenced = 0;

	while (page_vma_mapped_walk(&pvmw)) {
		address = pvmw.address;

		if (vma->vm_flags & VM_LOCKED) {
			page_vma_mapped_walk_done(&pvmw);
			pra->vm_flags |= VM_LOCKED;
			return false; /* To break the loop */
		}

		if (pvmw.pte) {
			if (ptep_clear_flush_young_notify(vma, address,
						pvmw.pte)) {
				/*
				 * Don't treat a reference through
				 * a sequentially read mapping as such.
				 * If the page has been used in another mapping,
				 * we will catch it; if this other mapping is
				 * already gone, the unmap path will have set
				 * PG_referenced or activated the page.
				 */
				if (likely(!(vma->vm_flags & VM_SEQ_READ)))
					referenced++;
			}
		} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
			if (pmdp_clear_flush_young_notify(vma, address,
						pvmw.pmd))
				referenced++;
		} else {
			/* unexpected pmd-mapped page? */
			WARN_ON_ONCE(1);
		}

		pra->mapcount--;
	}

	if (referenced)
		clear_page_idle(page);
	if (test_and_clear_page_young(page))
		referenced++;

	if (referenced) {
		pra->referenced++;
		pra->vm_flags |= vma->vm_flags;
	}

	if (!pra->mapcount)
		return false; /* To break the loop */

	return true;
}

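page_vma_mapped_walk: locates, within the given vma, the page-table entries that map the page.
vma->vm_flags & VM_LOCKED: if the vma is locked, VM_LOCKED is recorded in pra->vm_flags and the function returns false, which ends the walk.
If pvmw.pte is not NULL, ptep_clear_flush_young_notify tests the pte access bit and clears it; if it was set (and the vma is not VM_SEQ_READ), the page was accessed recently and referenced is incremented.
After the loop, a non-zero referenced increments pra->referenced and merges the vma's flags into pra->vm_flags. The return value only steers the walk: false when pra->mapcount has dropped to 0 and no mappings remain, true to continue with the next vma.

Whether the page was accessed recently is therefore read from pra->referenced, not from the return value. Because the walk clears the pte access bits, a later call to page_referenced can tell whether the page has been accessed again in the meantime.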
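
The division of labour between the walk and its per-vma callback can be modeled in a few lines. This is a simplified sketch, not kernel code: an array stands in for the reverse mapping, a bool stands in for the PTE access bit, SK_VM_LOCKED is an illustrative bit value, and all sketch_ names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_VM_LOCKED 0x1UL	/* illustrative bit, not the kernel's value */

struct sketch_mapping {
	unsigned long vm_flags;
	bool young;		/* models the PTE access bit of this mapping */
};

struct sketch_referenced_arg {
	int mapcount;		/* mappings still to visit */
	int referenced;		/* mappings that had the access bit set */
	unsigned long vm_flags;
};

/* Per-mapping step; returns true to continue the walk, false to stop. */
static bool sketch_referenced_one(struct sketch_mapping *m,
				  struct sketch_referenced_arg *pra)
{
	if (m->vm_flags & SK_VM_LOCKED) {
		pra->vm_flags |= SK_VM_LOCKED;
		return false;
	}
	if (m->young) {
		m->young = false;	/* test and clear */
		pra->referenced++;
		pra->vm_flags |= m->vm_flags;
	}
	pra->mapcount--;
	return pra->mapcount != 0;
}

/* Walk driver; returns how many mappings referenced the page. */
static int sketch_page_referenced(struct sketch_mapping *maps, size_t n,
				  unsigned long *vm_flags)
{
	struct sketch_referenced_arg pra = { .mapcount = (int)n };

	assert(maps != NULL || n == 0);
	*vm_flags = 0;
	if (!pra.mapcount)
		return 0;
	for (size_t i = 0; i < n; i++)
		if (!sketch_referenced_one(&maps[i], &pra))
			break;
	*vm_flags = pra.vm_flags;
	return pra.referenced;
}
```

Calling sketch_page_referenced twice on the same mappings without an access in between returns a positive count the first time and 0 the second time, which is exactly the signal reclaim relies on.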

shrink_active_list

shrink_active_list moves pages that have not been accessed recently from the active LRU list to the inactive LRU list. Its interface is:

void shrink_active_list(unsigned long nr_to_scan,
			       struct lruvec *lruvec,
			       struct scan_control *sc,
			       enum lru_list lru)

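unsigned long nr_to_scan: the number of pages to scan on the active list, which bounds how many can be moved to the inactive list.
struct lruvec *lruvec: the LRU lists of the node (pgdat) being reclaimed.
struct scan_control *sc: carries the parameters later steps need and holds some of the results of shrink_active_list.
enum lru_list lru: the LRU list to move pages from.

shrink_active_list flow

The flow of shrink_active_list is fairly involved, but its core idea is simple: pick suitable pages (pages not used recently) from the active LRU and move them to the inactive LRU.

shrink_active_list source code analysis

With that in mind, here is the source of shrink_active_list: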


static void shrink_active_list(unsigned long nr_to_scan,
			       struct lruvec *lruvec,
			       struct scan_control *sc,
			       enum lru_list lru)
{
	unsigned long nr_taken;
	unsigned long nr_scanned;
	unsigned long vm_flags;
	LIST_HEAD(l_hold);	/* The pages which were snipped off */
	LIST_HEAD(l_active);
	LIST_HEAD(l_inactive);
	struct page *page;
	unsigned nr_deactivate, nr_activate;
	unsigned nr_rotated = 0;
	int file = is_file_lru(lru);
	struct pglist_data *pgdat = lruvec_pgdat(lruvec);

	lru_add_drain();

	spin_lock_irq(&pgdat->lru_lock);

	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
				     &nr_scanned, sc, lru);

	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);

	__count_vm_events(PGREFILL, nr_scanned);
	__count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned);

	spin_unlock_irq(&pgdat->lru_lock);

	while (!list_empty(&l_hold)) {
		cond_resched();
		page = lru_to_page(&l_hold);
		list_del(&page->lru);

		if (unlikely(!page_evictable(page))) {
			putback_lru_page(page);
			continue;
		}

		if (unlikely(buffer_heads_over_limit)) {
			if (page_has_private(page) && trylock_page(page)) {
				if (page_has_private(page))
					try_to_release_page(page, 0);
				unlock_page(page);
			}
		}

		if (page_referenced(page, 0, sc->target_mem_cgroup,
				    &vm_flags)) {
			/*
			 * Identify referenced, file-backed active pages and
			 * give them one more trip around the active list. So
			 * that executable code get better chances to stay in
			 * memory under moderate memory pressure.  Anon pages
			 * are not likely to be evicted by use-once streaming
			 * IO, plus JVM can create lots of anon VM_EXEC pages,
			 * so we ignore them here.
			 */
			if ((vm_flags & VM_EXEC) && page_is_file_lru(page)) {
				nr_rotated += hpage_nr_pages(page);
				list_add(&page->lru, &l_active);
				continue;
			}
		}

		ClearPageActive(page);	/* we are de-activating */
		SetPageWorkingset(page);
		list_add(&page->lru, &l_inactive);
	}

	/*
	 * Move pages back to the lru list.
	 */
	spin_lock_irq(&pgdat->lru_lock);

	nr_activate = move_pages_to_lru(lruvec, &l_active);
	nr_deactivate = move_pages_to_lru(lruvec, &l_inactive);
	/* Keep all free pages in l_active list */
	list_splice(&l_inactive, &l_active);

	__count_vm_events(PGDEACTIVATE, nr_deactivate);
	__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);

	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
	spin_unlock_irq(&pgdat->lru_lock);

	mem_cgroup_uncharge_list(&l_active);
	free_unref_page_list(&l_active);
	trace_mm_vmscan_lru_shrink_active(pgdat->node_id, nr_taken, nr_activate,
			nr_deactivate, nr_rotated, sc->priority, file);
}

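shrink_active_list is central to memory reclaim. Its key steps are:

lru_add_drain: flushes the pages cached in the current CPU's lru_pvecs to their LRU lists, because the following steps need to take pages off those lists.
isolate_lru_pages: takes pages off the given lru list, starting from its tail and scanning at most nr_to_scan pages, and puts them on l_hold. The pages are unlinked from the LRU, so while on l_hold they belong to no LRU list; hence the name isolate.
The pages on l_hold are then examined one by one. Unevictable pages are handed back through putback_lru_page. Recently accessed file-backed pages that are mapped executable (code segments) go to l_active and return to the active list without taking part in swap out. Every other page has PG_active cleared and is added to l_inactive.
Finally the pages on l_inactive are added to the inactive LRU and the pages on l_active are restored to the active LRU.
Both moves are done by move_pages_to_lru.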
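
The per-page decision inside the l_hold loop can be isolated into a small sketch. The struct fields are hypothetical stand-ins for the kernel checks named in the comments, and sketch_classify is not a kernel function; the branch order follows the 5.8 listing above.

```c
#include <assert.h>
#include <stdbool.h>

struct sketch_lru_page {
	bool evictable;		/* what page_evictable() would report */
	bool referenced;	/* page_referenced() returned non-zero */
	bool exec_mapping;	/* a referencing vma had VM_EXEC */
	bool file_backed;	/* what page_is_file_lru() would report */
	bool active;		/* models PG_active */
};

enum sketch_dest { SK_PUTBACK, SK_KEEP_ACTIVE, SK_DEACTIVATE };

/* Decide where one isolated page goes. */
static enum sketch_dest sketch_classify(struct sketch_lru_page *p)
{
	if (!p->evictable)
		return SK_PUTBACK;		/* putback_lru_page() */
	if (p->referenced && p->exec_mapping && p->file_backed)
		return SK_KEEP_ACTIVE;		/* list_add to l_active */
	p->active = false;			/* ClearPageActive() */
	return SK_DEACTIVATE;			/* list_add to l_inactive */
}

static void sketch_classify_demo(void)
{
	struct sketch_lru_page code = { true, true, true, true, true };
	struct sketch_lru_page anon = { true, true, false, false, true };

	assert(sketch_classify(&code) == SK_KEEP_ACTIVE);
	/* A referenced anonymous page is still deactivated in this listing. */
	assert(sketch_classify(&anon) == SK_DEACTIVATE);
	assert(!anon.active);
}
```

The anonymous-page case is worth noting: in this version of the function, being referenced keeps only executable file-backed pages on the active list, while everything else moves to the inactive list.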

move_pages_to_lru

move_pages_to_lru moves each page on the given list to its LRU list:

static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
						     struct list_head *list)
{
	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
	int nr_pages, nr_moved = 0;
	LIST_HEAD(pages_to_free);
	struct page *page;
	enum lru_list lru;

	while (!list_empty(list)) {
		page = lru_to_page(list);
		VM_BUG_ON_PAGE(PageLRU(page), page);
		if (unlikely(!page_evictable(page))) {
			list_del(&page->lru);
			spin_unlock_irq(&pgdat->lru_lock);
			putback_lru_page(page);
			spin_lock_irq(&pgdat->lru_lock);
			continue;
		}
		lruvec = mem_cgroup_page_lruvec(page, pgdat);

		SetPageLRU(page);
		lru = page_lru(page);

		nr_pages = hpage_nr_pages(page);
		update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
		list_move(&page->lru, &lruvec->lists[lru]);

		if (put_page_testzero(page)) {
			__ClearPageLRU(page);
			__ClearPageActive(page);
			del_page_from_lru_list(page, lruvec, lru);

			if (unlikely(PageCompound(page))) {
				spin_unlock_irq(&pgdat->lru_lock);
				destroy_compound_page(page);
				spin_lock_irq(&pgdat->lru_lock);
			} else
				list_add(&page->lru, &pages_to_free);
		} else {
			nr_moved += nr_pages;
			if (PageActive(page))
				workingset_age_nonresident(lruvec, nr_pages);
		}
	}

	/*
	 * To save our caller's stack, now use input list for pages to free.
	 */
	list_splice(&pages_to_free, list);

	return nr_moved;
}
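
Note that move_pages_to_lru takes no destination argument: page_lru(page) derives the target list from the page flags, which is why clearing PG_active in shrink_active_list is enough to send a page to the inactive list. A minimal model of that lookup, with hypothetical sketch_ names and an enum kept in the same order as the kernel's enum lru_list:

```c
#include <assert.h>
#include <stdbool.h>

/* Same order as enum lru_list in the 5.8 kernel. */
enum sketch_lru_list {
	SK_LRU_INACTIVE_ANON,
	SK_LRU_ACTIVE_ANON,
	SK_LRU_INACTIVE_FILE,
	SK_LRU_ACTIVE_FILE,
	SK_LRU_UNEVICTABLE,
};

struct sketch_flags {
	bool unevictable;	/* models PG_unevictable */
	bool active;		/* models PG_active */
	bool file_backed;	/* file page rather than anonymous */
};

/* Pick the LRU list from the flags alone. */
static enum sketch_lru_list sketch_page_lru(const struct sketch_flags *f)
{
	int lru;

	if (f->unevictable)
		return SK_LRU_UNEVICTABLE;
	lru = f->file_backed ? SK_LRU_INACTIVE_FILE : SK_LRU_INACTIVE_ANON;
	if (f->active)
		lru += 1;	/* the active list follows its inactive one */
	return (enum sketch_lru_list)lru;
}

static void sketch_page_lru_demo(void)
{
	struct sketch_flags anon = { .active = true };

	assert(sketch_page_lru(&anon) == SK_LRU_ACTIVE_ANON);
	anon.active = false;	/* what ClearPageActive() does before the move */
	assert(sketch_page_lru(&anon) == SK_LRU_INACTIVE_ANON);
}
```

Because the list is recomputed from the flags at the moment of the move, the same function serves both calls in shrink_active_list, one for l_active and one for l_inactive.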