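This post follows on from "linux内核那些事之ZONE": the physical memory of a node is divided into zones as needed, and each node is managed through a pg_data_t (for background, see "linux内核那些事之物理内存模型之SPARSE(3)"). On a NUMA system:
every node has its own local memory, and therefore its own pg_data_t and its own zones. So that memory on all nodes can be used conveniently, the zones of every node in the system are also arranged into a zonelist structure.
NUMA zonelist
For the NUMA system described above, the kernel organizes its memory management data structures as follows:
pg_data_t holds the zones of the local node, together with the zonelists that define which zone to fall back to next when an allocation fails: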
typedef struct pglist_data {
	/*
	 * node_zones contains just the zones for THIS node. Not all of the
	 * zones may be populated, but it is the full list. It is referenced by
	 * this node's node_zonelists as well as other node's node_zonelists.
	 */
	struct zone node_zones[MAX_NR_ZONES];
	/*
	 * node_zonelists contains references to all zones in all nodes.
	 * Generally the first zones will be references to this node's
	 * node_zones.
	 */
	struct zonelist node_zonelists[MAX_ZONELISTS];
	int nr_zones; /* number of populated zones in this node */
	... ...
} pg_data_t;
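On a NUMA system, the zone-related members of pg_data_t are:
- struct zone node_zones[MAX_NR_ZONES]: all zones this node may contain, indexed by zone type from low to high.
- int nr_zones: the number of populated zones in this node.
- struct zonelist node_zonelists[MAX_ZONELISTS]: the order in which zones are tried when memory is allocated.
The kernel's own description of the zonelist follows; it is essential for understanding what a zonelist is for.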
struct zonelist
MAX_ZONELISTS is the number of struct zonelist entries in the node_zonelists array, and comes from the following enum:
enum {
	ZONELIST_FALLBACK,	/* zonelist with fallback */
#ifdef CONFIG_NUMA
	/*
	 * The NUMA zonelists are doubled because we need zonelists that
	 * restrict the allocations to a single node for __GFP_THISNODE.
	 */
	ZONELIST_NOFALLBACK,	/* zonelist without fallback (__GFP_THISNODE) */
#endif
	MAX_ZONELISTS
};
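The two entries are ZONELIST_FALLBACK and ZONELIST_NOFALLBACK:
- ZONELIST_FALLBACK: present on both NUMA and non-NUMA systems. When an allocation cannot be satisfied from one zone, the allocator takes the next zone from this list, in order. On an SMP system with a single memory node, the zones are ordered from high to low: ZONE_HIGHMEM, ZONE_NORMAL, ZONE_DMA. On a NUMA system, the zones of the local node come first, followed by the zones of remote nodes ordered from nearest to farthest. When the local node has no memory left, the allocator therefore tries the nearest remote node first. This list is what determines the allocation fallback policy.
- ZONELIST_NOFALLBACK: present only on NUMA systems. Sometimes a caller wants memory from one particular node only and never from a remote node. The NOFALLBACK zonelist therefore contains only that node's own zones, ordered from high to low.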
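How an allocation picks one of the two lists can be sketched as follows. This is a minimal user-space model, loosely patterned on the kernel's gfp_zonelist() helper. The names prefixed with sketch_ or SKETCH_ and the flag value are assumptions made for illustration, not the kernel's definitions.

```c
/* Mirrors the enum quoted above, assuming CONFIG_NUMA is enabled. */
enum {
    ZONELIST_FALLBACK,      /* local zones first, then remote nodes */
    ZONELIST_NOFALLBACK,    /* zones of a single node only */
    MAX_ZONELISTS
};

/*
 * Hypothetical flag bit for this sketch only. The real __GFP_THISNODE
 * value is defined by the kernel and is not reproduced here.
 */
#define SKETCH_GFP_THISNODE 0x1u

/* Return the index into node_zonelists[] that an allocation should walk. */
static int sketch_gfp_zonelist(unsigned int flags)
{
    if (flags & SKETCH_GFP_THISNODE)
        return ZONELIST_NOFALLBACK;
    return ZONELIST_FALLBACK;
}
```

A request without the flag walks the fallback list and may end up on a remote node; a request with it can only be served by the chosen node.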
struct zonelist is defined as follows:
/*
 * One allocation request operates on a zonelist. A zonelist
 * is a list of zones, the first one is the 'goal' of the
 * allocation, the other zones are fallback zones, in decreasing
 * priority.
 *
 * To speed the reading of the zonelist, the zonerefs contain the zone index
 * of the entry being read. Helper functions to access information given
 * a struct zoneref are
 *
 * zonelist_zone()	- Return the struct zone * for an entry in _zonerefs
 * zonelist_zone_idx()	- Return the index of the zone for an entry
 * zonelist_node_idx()	- Return the index of the node for an entry
 */
struct zonelist {
	struct zoneref _zonerefs[MAX_ZONES_PER_ZONELIST + 1];
};
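struct zonelist holds the zones in allocation order (the kernel computes this order at initialization from the system topology, nearest first). Each entry is a struct zoneref, which records the zone it refers to and is defined as follows: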
/*
 * This struct contains information about a zone in a zonelist. It is stored
 * here to avoid dereferences into large structures and lookups of tables
 */
struct zoneref {
	struct zone *zone;	/* Pointer to actual zone */
	int zone_idx;		/* zone_idx(zoneref->zone) */
};
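- struct zone *zone: pointer to the zone this entry refers to.
- int zone_idx: the index of that zone, cached in the entry so that a walk over the list does not have to dereference the zone.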
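The cached index makes it cheap to skip entries while walking a zonelist. The sketch below is a simplified user-space model: the sketch_ types and the function are hypothetical, and the kernel's own iterator also filters by a node mask, which is left out here. It returns the first entry whose zone index does not exceed a given limit, and relies on the NULL terminator that ends every zonelist.

```c
#include <stddef.h>

/* Stand-ins for the kernel structures, reduced to what the walk needs. */
struct sketch_zone {
    const char *name;
};

struct sketch_zoneref {
    struct sketch_zone *zone;   /* NULL marks the end of the list */
    int zone_idx;               /* cached zone type of *zone */
};

/*
 * Starting at z, return the first entry whose zone index is at most
 * highest_idx. Only the cached zone_idx is compared; the zone itself is
 * never dereferenced. Returns the terminator if nothing matches.
 */
static struct sketch_zoneref *
sketch_next_zoneref(struct sketch_zoneref *z, int highest_idx)
{
    while (z->zone != NULL && z->zone_idx > highest_idx)
        z++;
    return z;
}
```

A caller that may only use low zones passes a small highest_idx and automatically skips the high zones at the front of the list.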
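Building the zonelists
The zonelists are built early during kernel startup, nearest node first, according to the system topology. Every node has its own zonelists, which determine its allocation order.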
---->start_kernel
    ---->build_all_zonelists
        ---->build_all_zonelists_init
            ---->__build_all_zonelists
                ---->build_zonelists
                    ---->build_zonelists_in_node_order
                    ---->build_thisnode_zonelists
As the call chain above shows, zonelist initialization ends in build_zonelists, which calls build_zonelists_in_node_order to initialize the ZONELIST_FALLBACK zonelist and build_thisnode_zonelists to initialize the ZONELIST_NOFALLBACK zonelist.
build_zonelists
build_zonelists has separate implementations for NUMA and non-NUMA (SMP) systems. The NUMA version is:
/*
 * Build zonelists ordered by zone and nodes within zones.
 * This results in conserving DMA zone[s] until all Normal memory is
 * exhausted, but results in overflowing to remote node while memory
 * may still exist in local DMA zone.
 */
static void build_zonelists(pg_data_t *pgdat)
{
	static int node_order[MAX_NUMNODES];
	int node, load, nr_nodes = 0;
	nodemask_t used_mask = NODE_MASK_NONE;
	int local_node, prev_node;
	/* NUMA-aware ordering of nodes */
	local_node = pgdat->node_id;
	load = nr_online_nodes;
	prev_node = local_node;
	memset(node_order, 0, sizeof(node_order));
	while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
		/*
		 * We don't want to pressure a particular node.
		 * So adding penalty to the first node in same
		 * distance group to make it round-robin.
		 */
		if (node_distance(local_node, node) !=
		    node_distance(local_node, prev_node))
			node_load[node] = load;
		node_order[nr_nodes++] = node;
		prev_node = node;
		load--;
	}
	build_zonelists_in_node_order(pgdat, node_order, nr_nodes);
	build_thisnode_zonelists(pgdat);
}
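- find_next_best_node is called repeatedly to obtain, from the NUMA topology, the nodes as seen from this pg_data_t's node, nearest to farthest. They are stored in node_order in that order.
- build_zonelists_in_node_order: initializes the ZONELIST_FALLBACK zonelist, following the node order in node_order.
- build_thisnode_zonelists: builds the zonelist that contains only this node's zones, ordered from high to low.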
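The nearest-first ordering can be illustrated with a small user-space model. This is a sketch under stated assumptions: the distance table is invented for the example, and sketch_next_best_node considers distance only, whereas the loop above shows that the kernel also applies a load-based penalty through node_load.

```c
#include <stdbool.h>

#define SKETCH_NR_NODES 4

/* Hypothetical node distance table; 10 means "local", larger is farther. */
static const int sketch_distance[SKETCH_NR_NODES][SKETCH_NR_NODES] = {
    { 10, 20, 30, 40 },
    { 20, 10, 40, 30 },
    { 30, 40, 10, 20 },
    { 40, 30, 20, 10 },
};

/* Return the nearest node not yet used and mark it used, or -1 if none. */
static int sketch_next_best_node(int local, bool used[SKETCH_NR_NODES])
{
    int best = -1;

    for (int n = 0; n < SKETCH_NR_NODES; n++) {
        if (used[n])
            continue;
        if (best < 0 ||
            sketch_distance[local][n] < sketch_distance[local][best])
            best = n;
    }
    if (best >= 0)
        used[best] = true;
    return best;
}

/* Fill order[] with all nodes, nearest to local first; return the count. */
static int sketch_build_node_order(int local, int order[SKETCH_NR_NODES])
{
    bool used[SKETCH_NR_NODES] = { false };
    int nr_nodes = 0;
    int node;

    while ((node = sketch_next_best_node(local, used)) >= 0)
        order[nr_nodes++] = node;
    return nr_nodes;
}
```

With this table, node 2 sees the order 2, 3, 0, 1: itself first, because the local distance is the smallest, then the remaining nodes by increasing distance.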
build_zonelists_in_node_order
build_zonelists_in_node_order builds the ZONELIST_FALLBACK zonelist:
/*
 * Build zonelists ordered by node and zones within node.
 * This results in maximum locality--normal zone overflows into local
 * DMA zone, if any--but risks exhausting DMA zone.
 */
static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order,
		unsigned nr_nodes)
{
	struct zoneref *zonerefs;
	int i;
	zonerefs = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs;
	for (i = 0; i < nr_nodes; i++) {
		int nr_zones;
		pg_data_t *node = NODE_DATA(node_order[i]);
		nr_zones = build_zonerefs_node(node, zonerefs);
		zonerefs += nr_zones;
	}
	zonerefs->zone = NULL;
	zonerefs->zone_idx = 0;
}
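- Following the order given in node_order, the zones of each node are appended to the zonelist by calling build_zonerefs_node, and the list is closed with a NULL entry. When an allocation falls back, it first tries the other zones of the local node. Only if no local zone can satisfy the request does it move on to other nodes (memory obtained from a remote node is slower to access).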
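The effect of that ordering on an allocation can be sketched with a toy allocator. It is a user-space illustration only: the sketch_ names and the free_pages bookkeeping are assumptions, and the real page allocator checks watermarks and much more before it takes pages from a zone.

```c
#include <stddef.h>

/* A zone reduced to the fields this sketch needs. */
struct sketch_zone {
    int node;           /* owning node */
    int zone_idx;       /* zone type, higher means "higher" zone */
    long free_pages;    /* pages still available */
};

/*
 * Walk a NULL-terminated list of zones in order and take nr_pages from
 * the first zone that has enough free pages. Returns that zone, or NULL
 * if every zone in the list is too small.
 */
static struct sketch_zone *
sketch_alloc(struct sketch_zone **zonelist, long nr_pages)
{
    for (; *zonelist != NULL; zonelist++) {
        if ((*zonelist)->free_pages >= nr_pages) {
            (*zonelist)->free_pages -= nr_pages;
            return *zonelist;
        }
    }
    return NULL;
}
```

With a list laid out as local Normal, local DMA, remote Normal, requests drain the local zones first and reach the remote node only once both local zones are too small. A list holding only the two local entries, as a NOFALLBACK zonelist would, makes the same request fail instead.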
build_thisnode_zonelists
build_thisnode_zonelists builds the ZONELIST_NOFALLBACK zonelist:
/*
 * Build gfp_thisnode zonelists
 */
static void build_thisnode_zonelists(pg_data_t *pgdat)
{
	struct zoneref *zonerefs;
	int nr_zones;
	zonerefs = pgdat->node_zonelists[ZONELIST_NOFALLBACK]._zonerefs;
	nr_zones = build_zonerefs_node(pgdat, zonerefs);
	zonerefs += nr_zones;
	zonerefs->zone = NULL;
	zonerefs->zone_idx = 0;
}
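The ZONELIST_NOFALLBACK zonelist contains only this node's zones, ordered from high to low.
ZONELIST_NOFALLBACK serves allocation paths that want memory only from a specified node or from the local node. When that node runs short of memory, the caller learns of the failure immediately instead of having the FALLBACK mechanism continue the allocation on other nodes. A typical case is a caller that obtains a node id from numa_node_id() or cpu_to_node() and requests memory from that node only by passing __GFP_THISNODE.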
build_zonerefs_node
build_zonerefs_node() adds the zones of the given node to a zonelist, from high to low:
/*
 * Builds allocation fallback zone lists.
 *
 * Add all populated zones of a node to the zonelist.
 */
static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
{
	struct zone *zone;
	enum zone_type zone_type = MAX_NR_ZONES;
	int nr_zones = 0;
	do {
		zone_type--;
		zone = pgdat->node_zones + zone_type;
		if (managed_zone(zone)) {
			zoneref_set_zone(zone, &zonerefs[nr_zones++]);
			check_highest_zone(zone_type);
		}
	} while (zone_type);
	return nr_zones;
}
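Starting from the highest zone type the system supports, the function walks node_zones in the given node's pg_data_t from high to low and adds every zone that passes managed_zone() to the zoneref array. As a result, allocations prefer the higher zone types.
References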
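The high-to-low walk can be reproduced in user space. The sketch below mirrors the loop structure of build_zonerefs_node, but everything prefixed with sketch_ or SKETCH_ is a stand-in invented for the example: the three-zone layout is an assumption, and a simple managed_pages > 0 test takes the place of managed_zone().

```c
#include <stddef.h>

/* Hypothetical zone layout for this sketch, lowest zone type first. */
enum sketch_zone_type {
    SKETCH_ZONE_DMA,
    SKETCH_ZONE_DMA32,
    SKETCH_ZONE_NORMAL,
    SKETCH_MAX_NR_ZONES
};

struct sketch_zone {
    unsigned long managed_pages;    /* 0 means the zone is unused */
};

struct sketch_zoneref {
    struct sketch_zone *zone;
    int zone_idx;
};

struct sketch_node {
    struct sketch_zone node_zones[SKETCH_MAX_NR_ZONES];
};

/*
 * Append the node's usable zones to zonerefs[], highest zone type first.
 * Returns the number of entries written; the caller adds the terminator.
 */
static int sketch_build_zonerefs_node(struct sketch_node *node,
                                      struct sketch_zoneref *zonerefs)
{
    int zone_type = SKETCH_MAX_NR_ZONES;
    int nr_zones = 0;

    do {
        zone_type--;
        struct sketch_zone *zone = &node->node_zones[zone_type];
        if (zone->managed_pages > 0) {  /* stands in for managed_zone() */
            zonerefs[nr_zones].zone = zone;
            zonerefs[nr_zones].zone_idx = zone_type;
            nr_zones++;
        }
    } while (zone_type);

    return nr_zones;
}
```

For a node whose DMA32 zone is empty, the result has two entries, Normal followed by DMA, so the scarce low zone is tried last.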
https://www.kernel.org/doc/Documentation/vm/numa
https://www.codetd.com/en/article/10692056