Implementation of work_mem and temporary-file spilling in PostgreSQL

李雨喵 2022-12-14

  In PostgreSQL, when work_mem is insufficient (including during materialization, hash join, and certain table functions such as generate_series), spilling to temporary files is implemented in src/backend/utils/sort/tuplestore.c. It is essentially a simplified version of tuplesort.c (which relies on logtape.c for multi-way merge support), while src/backend/utils/sort/sharedtuplestore.c is the parallel-aware version.

Creating the temporary file and writing tuples to it

static void
tuplestore_puttuple_common(Tuplestorestate *state, void *tuple)
{
    TSReadPointer *readptr;
    int         i;
    ResourceOwner oldowner;

    state->tuples++;

    switch (state->status)
    {
        case TSS_INMEM: // initial state; once spilling starts it becomes TSS_WRITEFILE

            /*
             * Update read pointers as needed; see API spec above.
             */
            readptr = state->readptrs;
            for (i = 0; i < state->readptrcount; readptr++, i++)
            {
                if (readptr->eof_reached && i != state->activeptr)
                {
                    readptr->eof_reached = false;
                    readptr->current = state->memtupcount;
                }
            }

            /*
             * Grow the array as needed.  Note that we try to grow the array
             * when there is still one free slot remaining --- if we fail,
             * there'll still be room to store the incoming tuple, and then
             * we'll switch to tape-based operation.
             */
            if (state->memtupcount >= state->memtupsize - 1)
            {
                // If the tuple count exceeds the allocated slots (default 1024,
                // initialized in tuplestore_begin_common), grow the array,
                // but never beyond roughly work_mem / sizeof(MinimalTuple).
                (void) grow_memtuples(state);
                Assert(state->memtupcount < state->memtupsize);
            }

            /* Stash the tuple in the in-memory array */
            state->memtuples[state->memtupcount++] = tuple;

            /*
             * Done if we still fit in available memory and have array slots.
             */
            if (state->memtupcount < state->memtupsize && !LACKMEM(state))
                return;

            /*
             * Nope; time to switch to tape-based operation.  Make sure that
             * the temp file(s) are created in suitable temp tablespaces.
             */
            PrepareTempTablespaces();

            /* associate the file with the store's resource owner */
            oldowner = CurrentResourceOwner;
            CurrentResourceOwner = state->resowner;

            // create the temp file; this LightDB variant adds extra debug hooks
            state->myfile = lt_BufFileCreateTempForWorkMemory(state->interXact);

            CurrentResourceOwner = oldowner;

            /*
             * Freeze the decision about whether trailing length words will be
             * used.  We can't change this choice once data is on tape, even
             * though callers might drop the requirement.
             */
            state->backward = (state->eflags & EXEC_FLAG_BACKWARD) != 0;
            state->status = TSS_WRITEFILE;
            // dumptuples writes the tuples to the file; internally it converts
            // tuple pointer offsets into file offsets
            dumptuples(state);
            break;
        case TSS_WRITEFILE: // subsequent tuples are written here (excerpt truncated)

Writing tuples to the temporary file

writetup_heap
dumptuples
tuplestore_puttuple_common
tuplestore_putvalues
ExecMakeTableFunctionResult   // writes in a loop
FunctionNext                  // the first access to the table function triggers materialization
ExecScanFetch
ExecScan                      // performs filtering and projection
ExecScanFetch
ExecScan
ExecProcNode
ExecModifyTable
ExecProcNode
ExecutePlan
standard_ExecutorRun
pgss_ExecutorRun
ltsa_ExecutorRun
pgsp_ExecutorRun
ProcessQuery
PortalRunMulti
PortalRun
exec_simple_query
PostgresMain
BackendRun
BackendStartup
ServerLoop
PostmasterMain
main

However, the actual records do not line up with what is stored in the temporary file (137609_ltsql_tmp137609.rar), as follows:

zjh@postgres=# insert into t select id,'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' from generate_series(1,200000) id;
INSERT 0 200000

The id column can be matched up (4 bytes), but the 36 'a's cannot: they occupy only 10 bytes, so the storage appears not to be byte-for-byte. Internally the usual data-type machinery still applies, so 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' is stored as a varattrib_4b varlena, i.e. in a compressed (packed) format.

In fact, the stored records are the output of generate_series, i.e. a single column, not the whole select list, so only the integer is stored: besides the 4 bytes of the datum itself there are 10 bytes of overhead (whereas a full MinimalTuple header is 16 bytes: (unsigned long) &(((MinimalTuple) 0)->t_bits) = 15, plus 1 byte).


 

In the MinimalTuple written out, the value of the t_hoff field is not correct.

uint8        t_hoff;            /* sizeof header incl. bitmap, padding */

t_hoff: uint8, 1 byte, offset to user data


MINIMAL_TUPLE_DATA_OFFSET == (unsigned long) &((MinimalTuple) 0)->t_infomask2 == 10

char *tupbody = (char *) tuple + MINIMAL_TUPLE_DATA_OFFSET;
unsigned int tupbodylen = tuple->t_len - MINIMAL_TUPLE_DATA_OFFSET;  // == 10
unsigned int tuplen = tupbodylen + sizeof(int);                      // == 14

BufFileWrite(state->myfile, (void *) &tuplen, sizeof(tuplen));  // writes the 4-byte total length, value 14
BufFileWrite(state->myfile, (void *) tupbody, tupbodylen);      // writes 10 bytes: the 6 trailing header bytes of the MinimalTuple plus the datum (an integer in this example)

So the total is exactly 14 bytes per tuple.

Next, the size of the temporary file during execution does not match the size shown on disk after completion, as follows:

[zjh@hs-10-20-30-193 keep]$ ll
total 2728
-rw------- 2 zjh zjh 2793472 Sep 11 15:51 137609_ltsql_tmp137609.10   -- during execution; this is file.curPos, i.e. the amount actually flushed

[zjh@hs-10-20-30-193 keep]$ ll
total 2736
-rw------- 1 zjh zjh 2800000 Sep 11 15:34 137609_ltsql_tmp137609.6    -- after completion

Even after ExecMakeTableFunctionResult has finished and execution has reached the ExecScanFetch fetch function, the size is still 2793472, although materialization is complete at that point. This is because in the PG implementation flushing is done in units of BLCKSZ: a buffer dump happens only once BLCKSZ bytes have accumulated. The last batch of records (2800000 - 2793472 = 6528 bytes) never reached BLCKSZ, so during execution the size is not the exact 2800000.

At 2800000 bytes and 14 bytes per row (i.e., the physical storage does not include the full MinimalTuple header), that is about 200000 rows, and judging from the file contents, 200000 rows do indeed take 2800000 bytes.


With work_mem = 4MB, the data should fit in memory, so why is a disk file needed? Because in memory each row includes the MinimalTuple header and is not in packed format: roughly 56 bytes per row (of which the data itself is 41 bytes = 4 + 1 (short varlena header) + 36, stored in packed form, as below), so only about 74,897 rows fit, hence the temporary file.

else if (VARLENA_ATT_IS_PACKABLE(att) &&
         VARATT_CAN_MAKE_SHORT(val))
{
    /* convert to short varlena -- no alignment */
    data_length = VARATT_CONVERTED_SHORT_SIZE(val);   // == VARSIZE(val) - VARHDRSZ + VARHDRSZ_SHORT
    SET_VARSIZE_SHORT(data, data_length);
    memcpy(data + 1, VARDATA(val), data_length - 1);  // VARDATA points at ((varattrib_4b *) val)->va_4byte.va_data
}

So what I am curious about is: where does the compression actually show up? When tuplestore_gettupleslot reads the tuple back from the temporary file, its size is not 41 either.

 

https://15721.courses.cs.cmu.edu/spring2020/slides/18-sortmergejoins.pdf

https://www.cs.cornell.edu/courses/cs432/2003fa/slides/ExternalSorting.pdf

https://wiki.postgresql.org/wiki/Parallel_External_Sort

