Linux Kernel Notes: ZONE (zonelist) (2)

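Following on from "Linux Kernel Notes: ZONE": the physical memory of a node is divided into different zones as needed, and each node is managed through a pg_data_t (for background, see "Linux Kernel Notes: Physical Memory Model, SPARSE (3)"). On a NUMA system: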

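Each node has its own local memory, and therefore its own pg_data_t and its own zones. So that an allocation can reach memory on any node, the zones of all nodes in the system are additionally arranged into zonelist structures.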

NUMA zonelist

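For the NUMA system described above, the kernel organizes its memory-management data structures as follows.

Each pg_data_t holds the zones of its own node together with the zonelists that define which zone is tried next when an allocation fails and has to fall back: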

typedef struct pglist_data {
	/*
	 * node_zones contains just the zones for THIS node. Not all of the
	 * zones may be populated, but it is the full list. It is referenced by
	 * this node's node_zonelists as well as other node's node_zonelists.
	 */
	struct zone node_zones[MAX_NR_ZONES];

	/*
	 * node_zonelists contains references to all zones in all nodes.
	 * Generally the first zones will be references to this node's
	 * node_zones.
	 */
	struct zonelist node_zonelists[MAX_ZONELISTS];

	int nr_zones; /* number of populated zones in this node */

    ... ...
} pg_data_t;

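The zone-related members of pg_data_t on a NUMA system are:

  • struct zone node_zones[MAX_NR_ZONES]: holds every possible zone of this node, indexed by zone type from lowest to highest.
  • int nr_zones: records the number of populated zones in this node.
  • struct zonelist node_zonelists[MAX_ZONELISTS]: defines the order in which zones are tried when memory is allocated.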

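The kernel documentation's description of zonelists (see the references at the end) is important for understanding what a zonelist is for.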

struct zonelist

MAX_ZONELISTS is the number of struct zonelist entries in the node_zonelists array. It is the last value of the following enum:

enum {
	ZONELIST_FALLBACK,	/* zonelist with fallback */
#ifdef CONFIG_NUMA
	/*
	 * The NUMA zonelists are doubled because we need zonelists that
	 * restrict the allocations to a single node for __GFP_THISNODE.
	 */
	ZONELIST_NOFALLBACK,	/* zonelist without fallback (__GFP_THISNODE) */
#endif
	MAX_ZONELISTS
};

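The enum defines two lists, ZONELIST_FALLBACK and ZONELIST_NOFALLBACK:

  • ZONELIST_FALLBACK: present on both NUMA and non-NUMA (UMA) systems. When an allocation from one zone fails, the next zone to try is taken from this list in order. A non-NUMA system has a single node, so the list simply runs from high to low: ZONE_HIGHMEM, ZONE_NORMAL, ZONE_DMA. On a NUMA system the zones of the local node come first, followed by the zones of the remote nodes ordered from nearest to farthest. When the local node has no free memory, the allocator therefore takes memory from the nearest remote node that has some. This ordering determines the allocation policy.
  • ZONELIST_NOFALLBACK: on a NUMA system, some allocations must come from one particular node and must not spill over to remote nodes. The NOFALLBACK zonelist therefore contains only the zones of the node itself, ordered from high to low.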
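How an allocation ends up on one list or the other can be shown with a small user-space sketch. It is loosely modeled on the kernel's gfp_zonelist() and node_zonelist() helpers, but every name prefixed with sketch_ or SKETCH_ is hypothetical, and the flag value is made up for illustration (the kernel's real __GFP_THISNODE bit is different).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bit for this sketch; not the kernel's real value. */
#define SKETCH_GFP_THISNODE 0x1u

enum { ZONELIST_FALLBACK, ZONELIST_NOFALLBACK, MAX_ZONELISTS };

/* Stand-ins for struct zonelist and pg_data_t, reduced to what we need. */
struct sketch_zonelist { const char *desc; };
struct sketch_pgdat {
	struct sketch_zonelist node_zonelists[MAX_ZONELISTS];
};

/* Map allocation flags to an index into node_zonelists[]. */
static int sketch_gfp_zonelist(unsigned int flags)
{
	if (flags & SKETCH_GFP_THISNODE)
		return ZONELIST_NOFALLBACK;
	return ZONELIST_FALLBACK;
}

/* Return the zonelist of the given node that the flags select. */
static const struct sketch_zonelist *
sketch_node_zonelist(const struct sketch_pgdat *pgdat, unsigned int flags)
{
	assert(pgdat != NULL);
	return &pgdat->node_zonelists[sketch_gfp_zonelist(flags)];
}
```

The point of the design is that the caller never walks nodes itself: it only picks one of the two precomputed lists, and the list already encodes the whole policy.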

struct zonelist is defined as follows:

/*
 * One allocation request operates on a zonelist. A zonelist
 * is a list of zones, the first one is the 'goal' of the
 * allocation, the other zones are fallback zones, in decreasing
 * priority.
 *
 * To speed the reading of the zonelist, the zonerefs contain the zone index
 * of the entry being read. Helper functions to access information given
 * a struct zoneref are
 *
 * zonelist_zone()	- Return the struct zone * for an entry in _zonerefs
 * zonelist_zone_idx()	- Return the index of the zone for an entry
 * zonelist_node_idx()	- Return the index of the node for an entry
 */
struct zonelist {
	struct zoneref _zonerefs[MAX_ZONES_PER_ZONELIST + 1];
};

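struct zonelist holds the zones in allocation order (the kernel works this order out at initialization time, nearest first, based on the system topology). Each entry is a struct zoneref, which records the zone it refers to and is defined as follows: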

/*
 * This struct contains information about a zone in a zonelist. It is stored
 * here to avoid dereferences into large structures and lookups of tables
 */
struct zoneref {
	struct zone *zone;	/* Pointer to actual zone */
	int zone_idx;		/* zone_idx(zoneref->zone) */
};
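
  • struct zone *zone: pointer to the zone this entry refers to.
  • int zone_idx: the index (zone type) of that zone, cached here so that a walk over the list does not have to dereference the zone.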
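The cached zone_idx and the NULL terminator are what make walking a zonelist cheap. The following user-space sketch shows the idea: skip every entry whose zone index is above the highest zone the caller may use, and stop at the terminator. The sketch_ types and functions are hypothetical simplifications, not the kernel's own iterator helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-ins for struct zone and struct zoneref. */
struct sketch_zone { const char *name; };
struct sketch_zoneref {
	struct sketch_zone *zone;	/* NULL marks the end of the list */
	int zone_idx;			/* cached index of the zone */
};

/*
 * Return the first entry at or after z whose zone index does not exceed
 * highest_zoneidx, or the terminating entry if there is none.
 */
static struct sketch_zoneref *
sketch_next_zoneref(struct sketch_zoneref *z, int highest_zoneidx)
{
	assert(z != NULL);
	while (z->zone != NULL && z->zone_idx > highest_zoneidx)
		z++;
	return z;
}

/* Count the entries an allocation limited to highest_zoneidx may use. */
static int sketch_count_usable(struct sketch_zoneref *refs, int highest_zoneidx)
{
	int n = 0;
	struct sketch_zoneref *z = sketch_next_zoneref(refs, highest_zoneidx);

	while (z->zone != NULL) {
		n++;
		z = sketch_next_zoneref(z + 1, highest_zoneidx);
	}
	return n;
}
```

Only zone_idx is read while skipping entries, so zones that the request cannot use are never touched.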

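Building the zonelists

The zonelists are built during kernel startup, ordered from nearest to farthest according to the system topology. Every node has its own zonelists, and they determine the order in which memory is allocated.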

---->start_kernel
    ---->build_all_zonelists
        ---->build_all_zonelists_init
            ---->__build_all_zonelists
                ---->build_zonelists
                    ---->build_zonelists_in_node_order
                    ---->build_thisnode_zonelists

At the end of this initialization call chain, build_zonelists calls build_zonelists_in_node_order to initialize the ZONELIST_FALLBACK zonelist and build_thisnode_zonelists to initialize the ZONELIST_NOFALLBACK zonelist.

build_zonelists

build_zonelists has separate implementations for NUMA and non-NUMA systems. The NUMA version is:


/*
 * Build zonelists ordered by zone and nodes within zones.
 * This results in conserving DMA zone[s] until all Normal memory is
 * exhausted, but results in overflowing to remote node while memory
 * may still exist in local DMA zone.
 */

static void build_zonelists(pg_data_t *pgdat)
{
	static int node_order[MAX_NUMNODES];
	int node, load, nr_nodes = 0;
	nodemask_t used_mask = NODE_MASK_NONE;
	int local_node, prev_node;

	/* NUMA-aware ordering of nodes */
	local_node = pgdat->node_id;
	load = nr_online_nodes;
	prev_node = local_node;

	memset(node_order, 0, sizeof(node_order));
	while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
		/*
		 * We don't want to pressure a particular node.
		 * So adding penalty to the first node in same
		 * distance group to make it round-robin.
		 */
		if (node_distance(local_node, node) !=
		    node_distance(local_node, prev_node))
			node_load[node] = load;

		node_order[nr_nodes++] = node;
		prev_node = node;
		load--;
	}

	build_zonelists_in_node_order(pgdat, node_order, nr_nodes);
	build_thisnode_zonelists(pgdat);
}
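
  • find_next_best_node is called repeatedly to walk the NUMA topology as seen from this pg_data_t's node. The nodes it returns, nearest first, are saved in node_order.
  • build_zonelists_in_node_order: initializes the ZONELIST_FALLBACK zonelist, following the node order in node_order.
  • build_thisnode_zonelists: builds the zonelist that contains only this node's zones, ordered from high to low.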
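The nearest-first ordering can be illustrated with a simplified user-space sketch. It keeps only the core idea of find_next_best_node (the local node first, then the closest node not yet used) and deliberately leaves out the node_load penalty and any other tie-breaking the kernel applies. The distance matrix, the SKETCH_MAX_NODES limit and all sketch_ names are assumptions made for this example.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define SKETCH_MAX_NODES 4

/* Pick the unused node closest to local_node; return -1 when none is left. */
static int sketch_find_next_best_node(int local_node, int nr_nodes,
		int dist[][SKETCH_MAX_NODES], bool *used)
{
	int best = -1;
	int best_dist = INT_MAX;

	/* The local node always comes first. */
	if (!used[local_node]) {
		used[local_node] = true;
		return local_node;
	}

	for (int n = 0; n < nr_nodes; n++) {
		if (used[n])
			continue;
		if (dist[local_node][n] < best_dist) {
			best_dist = dist[local_node][n];
			best = n;
		}
	}
	if (best >= 0)
		used[best] = true;
	return best;
}

/* Fill node_order[] nearest first and return the number of nodes written. */
static int sketch_order_nodes(int local_node, int nr_nodes,
		int dist[][SKETCH_MAX_NODES], int *node_order)
{
	bool used[SKETCH_MAX_NODES] = { false };
	int node;
	int count = 0;

	assert(nr_nodes > 0 && nr_nodes <= SKETCH_MAX_NODES);
	assert(local_node >= 0 && local_node < nr_nodes);

	while ((node = sketch_find_next_best_node(local_node, nr_nodes,
						  dist, used)) >= 0)
		node_order[count++] = node;
	return count;
}
```

In this sketch, nodes at equal distance are taken in index order. The kernel instead uses node_load to rotate among such nodes so that one of them is not always preferred.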

build_zonelists_in_node_order

build_zonelists_in_node_order builds the zonelist for ZONELIST_FALLBACK:

/*
 * Build zonelists ordered by node and zones within node.
 * This results in maximum locality--normal zone overflows into local
 * DMA zone, if any--but risks exhausting DMA zone.
 */
static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order,
		unsigned nr_nodes)
{
	struct zoneref *zonerefs;
	int i;

	zonerefs = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs;

	for (i = 0; i < nr_nodes; i++) {
		int nr_zones;

		pg_data_t *node = NODE_DATA(node_order[i]);

		nr_zones = build_zonerefs_node(node, zonerefs);
		zonerefs += nr_zones;
	}
	zonerefs->zone = NULL;
	zonerefs->zone_idx = 0;
}
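
  • Following the order in node_order, the zones of each node are appended to the zonelist by calling build_zonerefs_node. When an allocation fails and has to fall back, it first tries the other zones of the local node. Only if no local zone can satisfy the request does it move on to other nodes (memory obtained that way is slower to access).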
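The shape of the resulting list is easier to see on a small example. The user-space sketch below concatenates the already ordered zones of each node in node_order and appends a terminator, which mirrors the cursor-advancing loop above. The sketch_ structures, the use of -1 as the terminator's node number and the two-node layout are assumptions of this example, not kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_ZONES 3
#define SKETCH_MAX_NODES 2

/* One entry of the flattened fallback list; node == -1 terminates it. */
struct sketch_entry { int node; int zone_idx; };

/* Per-node input: populated zone indices, already sorted high to low. */
struct sketch_node {
	int nr_zones;
	int zone_idx[SKETCH_MAX_ZONES];
};

/*
 * Append the zones of every node, in node_order, to out[] and terminate
 * the list. out[] needs room for all zones plus one terminator.
 * Returns the number of real entries written.
 */
static int sketch_build_fallback(const struct sketch_node *nodes,
		const int *node_order, int nr_nodes, struct sketch_entry *out)
{
	struct sketch_entry *cur = out;

	assert(nr_nodes >= 0 && nr_nodes <= SKETCH_MAX_NODES);

	for (int i = 0; i < nr_nodes; i++) {
		const struct sketch_node *n = &nodes[node_order[i]];

		for (int z = 0; z < n->nr_zones; z++) {
			cur->node = node_order[i];
			cur->zone_idx = n->zone_idx[z];
			cur++;
		}
	}
	cur->node = -1;		/* terminator, like zonerefs->zone = NULL */
	cur->zone_idx = 0;
	return (int)(cur - out);
}
```

For a node 1 that has only a Normal zone and a node 0 that has Normal and DMA, the list built for node 1 is: node 1 Normal, node 0 Normal, node 0 DMA, terminator.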

build_thisnode_zonelists

build_thisnode_zonelists builds the zonelist for ZONELIST_NOFALLBACK:

/*
 * Build gfp_thisnode zonelists
 */
static void build_thisnode_zonelists(pg_data_t *pgdat)
{
	struct zoneref *zonerefs;
	int nr_zones;

	zonerefs = pgdat->node_zonelists[ZONELIST_NOFALLBACK]._zonerefs;
	nr_zones = build_zonerefs_node(pgdat, zonerefs);
	zonerefs += nr_zones;
	zonerefs->zone = NULL;
	zonerefs->zone_idx = 0;
}

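The ZONELIST_NOFALLBACK zonelist contains only this node's zones, ordered from highest to lowest.

ZONELIST_NOFALLBACK serves allocations that must come from the local node or from one specific node. When that node runs out of memory, the caller wants to learn about it immediately instead of having the FALLBACK mechanism quietly continue on other nodes. For example, a caller can take the node returned by numa_node_id() or cpu_to_node() and request memory from it with __GFP_THISNODE, so that only that node is used.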
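The difference in behavior between the two lists can be sketched in user space as follows. The allocator here is reduced to "take pages from the first zone in the list that has enough free pages"; the real page allocator also checks watermarks, cpusets and more. All sketch_ names and the page counts are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-ins: a zone knows its node and its free page count. */
struct sketch_zone { int node; long free_pages; };
struct sketch_zoneref { struct sketch_zone *zone; };	/* NULL terminates */

/*
 * Walk the zonelist in order and take nr_pages from the first zone that
 * can satisfy the request. Returns the node used, or -1 on failure.
 */
static int sketch_alloc(struct sketch_zoneref *zonelist, long nr_pages)
{
	assert(zonelist != NULL && nr_pages > 0);

	for (struct sketch_zoneref *z = zonelist; z->zone != NULL; z++) {
		if (z->zone->free_pages >= nr_pages) {
			z->zone->free_pages -= nr_pages;
			return z->zone->node;
		}
	}
	return -1;
}
```

With a local zone holding 1 free page and a remote zone holding 8, a 4-page request fails at once on a NOFALLBACK-style list that contains only the local zone, while the same request on a FALLBACK-style list is served from the remote node.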

build_zonerefs_node

build_zonerefs_node() adds the zones of the given node to the zonelist, from highest to lowest:

/*
 * Builds allocation fallback zone lists.
 *
 * Add all populated zones of a node to the zonelist.
 */
static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
{
	struct zone *zone;
	enum zone_type zone_type = MAX_NR_ZONES;
	int nr_zones = 0;

	do {
		zone_type--;
		zone = pgdat->node_zones + zone_type;
		if (managed_zone(zone)) {
			zoneref_set_zone(zone, &zonerefs[nr_zones++]);
			check_highest_zone(zone_type);
		}
	} while (zone_type);

	return nr_zones;
}

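Starting from the highest zone type the system supports, the populated zones in the node_zones array of the given pg_data_t are added to the zonerefs from highest to lowest, so that allocations are served from the higher zone types first.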
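The descending do/while walk can be tried out in user space with the sketch below. It uses a reduced, hypothetical set of zone types and a plain bool array in place of managed_zone(), and it writes zone type numbers instead of zonerefs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Reduced, hypothetical zone type set; the kernel's enum has more entries. */
enum sketch_zone_type {
	SKETCH_ZONE_DMA,
	SKETCH_ZONE_NORMAL,
	SKETCH_ZONE_MOVABLE,
	SKETCH_MAX_NR_ZONES
};

/*
 * Write the populated zone types of one node into out[], highest first,
 * and return how many were written. managed[t] plays the role of
 * managed_zone() for zone type t.
 */
static int sketch_build_zonerefs_node(const bool *managed, int *out)
{
	int zone_type = SKETCH_MAX_NR_ZONES;
	int nr_zones = 0;

	assert(managed != NULL && out != NULL);

	/* Decrement first, test last: the loop body also runs for type 0. */
	do {
		zone_type--;
		if (managed[zone_type])
			out[nr_zones++] = zone_type;
	} while (zone_type);

	return nr_zones;
}
```

A node with DMA and Normal populated but no Movable zone yields the list Normal, DMA: unpopulated zones are skipped rather than left as holes.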

References

https://www.kernel.org/doc/Documentation/vm/numa

https://www.codetd.com/en/article/10692056
