Preface
As we know, converting an ordinary table into a partitioned table is fairly troublesome. pg_pathman does provide functions for online, non-blocking data migration, but pg_pathman has quite a few problems and bugs, and with the release of PostgreSQL 14 it will no longer be updated. Fortunately, CYBERTEC recently released a new extension, pg_rewrite, which can efficiently convert an ordinary table into a partitioned table.
Installation
The repository is at https://github.com/cybertec-postgresql/pg_rewrite
Installation is straightforward:
[postgres@xiongcc ~]$ git clone https://github.com/cybertec-postgresql/pg_rewrite.git
Cloning into 'pg_rewrite'...
remote: Enumerating objects: 22, done.
remote: Counting objects: 100% (22/22), done.
remote: Compressing objects: 100% (17/17), done.
remote: Total 22 (delta 3), reused 22 (delta 3), pack-reused 0
Unpacking objects: 100% (22/22), done.
[postgres@xiongcc ~]$ cd pg_rewrite/
[postgres@xiongcc pg_rewrite]$ make
gcc -std=gnu99 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -g -O0 -ggdb -g3 -fPIC -I. -I./ -I/usr/pgsql-14/include/server -I/usr/pgsql-14/include/internal -D_GNU_SOURCE -c -o pg_rewrite.o pg_rewrite.c
gcc -std=gnu99 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -g -O0 -ggdb -g3 -fPIC -I. -I./ -I/usr/pgsql-14/include/server -I/usr/pgsql-14/include/internal -D_GNU_SOURCE -c -o concurrent.o concurrent.c
gcc -std=gnu99 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -g -O0 -ggdb -g3 -fPIC -shared -o pg_rewrite.so pg_rewrite.o concurrent.o -L/usr/pgsql-14/lib -Wl,--as-needed -Wl,-rpath,'/usr/pgsql-14/lib',--enable-new-dtags
[postgres@xiongcc pg_rewrite]$ make install
/bin/mkdir -p '/usr/pgsql-14/lib'
/bin/mkdir -p '/usr/pgsql-14/share/extension'
/bin/mkdir -p '/usr/pgsql-14/share/extension'
/bin/install -c -m 755 pg_rewrite.so '/usr/pgsql-14/lib/pg_rewrite.so'
/bin/install -c -m 644 .//pg_rewrite.control '/usr/pgsql-14/share/extension/'
/bin/install -c -m 644 .//pg_rewrite--1.0.sql '/usr/pgsql-14/share/extension/'
pg_rewrite provides a single function for converting a table into a partitioned table:
postgres=# create extension pg_rewrite ;
CREATE EXTENSION
postgres=# \df *partition_table*
List of functions
Schema | Name | Result data type | Argument data types | Type
--------+-----------------+------------------+----------------------------------------------------+------
public | partition_table | void | src_table text, dst_table text, src_table_new text | func
(1 row)
There are three configuration parameters in total:
postgres=# select setting,name from pg_settings where name like '%rewrite%';
setting | name
---------+---------------------------
on | rewrite.check_constraints
0 | rewrite.max_xlock_time
0 | rewrite.wait_after_load
(3 rows)
• rewrite.check_constraints: Before copying any data, pg_rewrite checks whether the destination table has the same constraints as the source table, and raises an ERROR if any difference is found. The problem is that if a constraint is (accidentally) missing from the destination table, data violating the source table's constraints would be allowed into the destination table once processing is finished. Even an extra constraint on the destination table is a problem, because pg_rewrite only assumes that all the data it copies satisfies the source table's constraints; it does not validate the data against the destination table's additional constraints. By setting rewrite.check_constraints to false, users can turn the constraint check off. Be very careful before doing so.
• rewrite.max_xlock_time: Although the table being processed stays available for reads and writes by other transactions most of the time, an exclusive lock is needed to finalize the processing (i.e. to process the remaining concurrent changes and to rename the tables). If the function seems to block access to the table too much, consider setting the rewrite.max_xlock_time GUC parameter. For example,
SET rewrite.max_xlock_time TO 100;
tells the function that the exclusive lock should not be held for more than 0.1 seconds (100 milliseconds). If the final stage needs more time, the function releases the exclusive lock, processes the changes committed by other transactions in the meantime, and retries the final stage. If the lock duration is exceeded a few more times, an error is reported. If that happens, you should either increase the setting or try to process the table at a time of lower write activity. The default value is 0, meaning the final stage may take as much time as it needs.
There is one more parameter that is not mentioned in the official documentation; you have to dig into the source code. Since the data migration is non-blocking and DML is allowed while it runs, this parameter controls how long to wait after the initial load has completed before starting to decode data changes made by other transactions.
/*
* Time (in seconds) to wait after the initial load has completed and before
* we start decoding of data changes introduced by other transactions. This
* helps to ensure defined order of steps when we test processing of the
* concurrent changes.
*/
int rewrite_wait_after_load = 0;
/*
* During regression tests, wait until the other transactions performed
* their data changes so that we can process them.
*
* Since this should only be used for tests, don't bother using
* WaitLatch().
*/
if (rewrite_wait_after_load > 0)
{
    LOCKTAG     tag;
    Oid         extension_id;
    LockAcquireResult lock_res PG_USED_FOR_ASSERTS_ONLY;

    extension_id = get_extension_oid("pg_rewrite", false);
    SET_LOCKTAG_OBJECT(tag, MyDatabaseId, ExtensionRelationId,
                       extension_id, 0);
    lock_res = LockAcquire(&tag, ExclusiveLock, false, false);
    Assert(lock_res == LOCKACQUIRE_OK);

    /*
     * Misuse lock on our extension to let the concurrent backend(s) check
     * that we're exactly here.
     */
    pg_usleep(rewrite_wait_after_load * 1000000L);
    LockRelease(&tag, ExclusiveLock, false);
}
Hands-on
First, create an ordinary table:
postgres=# CREATE TABLE t1 (
id int not null,
tm timestamptz not null
);
CREATE TABLE
postgres=# insert into t1 select extract(epoch from seq), seq from generate_series('2020-01-01'::timestamptz, '2020-05-31 23:59:59'::timestamptz, interval '10 seconds') as seq;
INSERT 0 1313280
Next, create the partitioned table that will be the target of the conversion:
postgres=# CREATE TABLE ptab01 (
id int not null,
tm timestamptz not null
) PARTITION BY RANGE (tm);
CREATE TABLE
postgres=# create table ptab01_202001 partition of ptab01 for values from ('2020-01-01') to ('2020-02-01');
CREATE TABLE
postgres=# create table ptab01_202002 partition of ptab01 for values from ('2020-02-01') to ('2020-03-01');
CREATE TABLE
postgres=# create table ptab01_202003 partition of ptab01 for values from ('2020-03-01') to ('2020-04-01');
CREATE TABLE
postgres=# create table ptab01_202004 partition of ptab01 for values from ('2020-04-01') to ('2020-05-01');
CREATE TABLE
postgres=# create table ptab01_202005 partition of ptab01 for values from ('2020-05-01') to ('2020-06-01');
CREATE TABLE
postgres=# \d
List of relations
Schema | Name | Type | Owner
--------+---------------+-------------------+----------
public | ptab01 | partitioned table | postgres
public | ptab01_202001 | table | postgres
public | ptab01_202002 | table | postgres
public | ptab01_202003 | table | postgres
public | ptab01_202004 | table | postgres
public | ptab01_202005 | table | postgres
public | t1 | table | postgres
(7 rows)
Then use the function provided by pg_rewrite to perform the conversion:
postgres=# select * from partition_table('t1','ptab01','t1_old');
ERROR: Table "t1" has no identity index
1. The first argument is the source table, i.e. the table to be converted into a partitioned table; here it is t1.
2. The second argument is the destination table, i.e. the partitioned table after the conversion; here it is ptab01.
3. The third argument is the new name for the source table; here it is t1_old. Note that the operation therefore needs up to twice the disk space at its peak.
The call fails: the table needs a replica identity. Here we simply add a primary key. There are four replica identity strategies in total:
• default: the default mode for non-system tables; if there is a primary key, its columns are used as the replica identity
• index: the columns of a qualifying unique index are used as the replica identity
• full: all columns of the whole row are used as the replica identity; only suitable as a last resort
• nothing: no replica identity is recorded, which means UPDATE and DELETE operations cannot be replicated to subscribers
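For reference, the replica identity of a table can be inspected and changed with plain DDL; a minimal sketch (the index name t1_tm_idx is illustrative, not from the original article):

```sql
-- Inspect the current replica identity:
-- 'd' = default, 'i' = index, 'f' = full, 'n' = nothing
SELECT relname, relreplident FROM pg_class WHERE relname = 't1';

-- Use a specific index as the replica identity
-- (it must be unique, non-partial, non-deferrable, and on NOT NULL columns):
CREATE UNIQUE INDEX t1_tm_idx ON t1 (tm);
ALTER TABLE t1 REPLICA IDENTITY USING INDEX t1_tm_idx;

-- Or, as a last resort, log the whole row:
ALTER TABLE t1 REPLICA IDENTITY FULL;
```

In this article we take the simplest route and just add a primary key, which satisfies the default mode.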
postgres=# alter table t1 add primary key(tm);
ALTER TABLE
postgres=# select * from partition_table('t1','ptab01','t1_old');
ERROR: logical decoding requires wal_level >= logical
As you can see, wal_level = logical is also required (changing it needs a server restart). After adjusting it, let's try again:
postgres=# select * from partition_table('t1','ptab01','t1_old');
ERROR: the source and destination relations have different primary key
Now it complains that the primary keys differ. This is the check_constraints parameter at work: when a constraint difference is found, an ERROR is raised.
postgres=# alter table ptab01 add primary key(tm);
ALTER TABLE
It took about 6 seconds for roughly 1.31 million rows; the conversion speed is quite acceptable:
postgres=# select * from partition_table('t1','ptab01','t1_old');
partition_table
-----------------
(1 row)
2021-12-11 21:22:39.697 CST [18011] LOG: logical decoding found consistent point at 0/F7A57AF8
2021-12-11 21:22:39.697 CST [18011] DETAIL: There are no running transactions.
2021-12-11 21:22:39.697 CST [18011] STATEMENT: select * from partition_table('t1','ptab01','t1_old');
2021-12-11 21:22:46.626 CST [18011] LOG: duration: 6936.349 ms statement: select * from partition_table('t1','ptab01','t1_old');
Let's look at the result of the conversion:
postgres=# \d+
List of relations
Schema | Name | Type | Owner | Persistence | Access method | Size | Description
--------+---------------+-------------------+----------+-------------+---------------+---------+-------------
public | ptab01_202001 | table | postgres | permanent | heap | 11 MB |
public | ptab01_202002 | table | postgres | permanent | heap | 11 MB |
public | ptab01_202003 | table | postgres | permanent | heap | 11 MB |
public | ptab01_202004 | table | postgres | permanent | heap | 11 MB |
public | ptab01_202005 | table | postgres | permanent | heap | 11 MB |
public | t1 | partitioned table | postgres | permanent | | 0 bytes |
public | t1_old | table | postgres | permanent | heap | 56 MB |
(7 rows)
postgres=# \d+ t1
Partitioned table "public.t1"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
--------+--------------------------+-----------+----------+---------+---------+-------------+--------------+-------------
id | integer | | not null | | plain | | |
tm | timestamp with time zone | | not null | | plain | | |
Partition key: RANGE (tm)
Indexes:
"ptab01_pkey" PRIMARY KEY, btree (tm)
Partitions: ptab01_202001 FOR VALUES FROM ('2020-01-01 00:00:00+08') TO ('2020-02-01 00:00:00+08'),
ptab01_202002 FOR VALUES FROM ('2020-02-01 00:00:00+08') TO ('2020-03-01 00:00:00+08'),
ptab01_202003 FOR VALUES FROM ('2020-03-01 00:00:00+08') TO ('2020-04-01 00:00:00+08'),
ptab01_202004 FOR VALUES FROM ('2020-04-01 00:00:00+08') TO ('2020-05-01 00:00:00+08'),
ptab01_202005 FOR VALUES FROM ('2020-05-01 00:00:00+08') TO ('2020-06-01 00:00:00+08')
As you can see, the former partition children are now partitions of t1. Just as we expected, t1 has become the partitioned table, and the original t1 has been renamed to t1_old.
Caveat 1
Because the tables must be renamed at the end, a very brief AccessExclusive lock is involved; the rewrite.max_xlock_time parameter determines how long it may be held.
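To bound that final lock, the GUC can be set at session level right before calling the function; a sketch using the tables from this article:

```sql
-- Release and retry if the final exclusive lock would be held
-- for more than 100 ms; error out after a few failed attempts.
SET rewrite.max_xlock_time TO 100;
SELECT partition_table('t1', 'ptab01', 't1_old');
```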
Let's redo the whole operation, this time first opening a transaction that queries t1:
postgres=# begin;
BEGIN
postgres=*# select id from t1 limit 1;
id
------------
1577808000
(1 row)
postgres=*# select pg_backend_pid();
pg_backend_pid
----------------
18011
(1 row)
--- transaction left uncommitted
In another session, run the conversion:
postgres=# select * from partition_table('t1','ptab01','t1_old');
--- blocked
Checking the locks shows that the conversion is blocked by the open query:
postgres=# SELECT
blocked_locks.pid AS blocked_pid,
blocked_activity.usename AS blocked_user,
now() - blocked_activity.query_start AS blocked_duration,
blocking_locks.pid AS blocking_pid,
blocking_activity.usename AS blocking_user,
now() - blocking_activity.query_start AS blocking_duration,
blocked_activity.query AS blocked_statement,
blocking_activity.query AS blocking_statement
FROM
pg_catalog.pg_locks AS blocked_locks
JOIN pg_catalog.pg_stat_activity AS blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks AS blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.DATABASE IS NOT DISTINCT FROM blocked_locks.DATABASE
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity AS blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE
NOT blocked_locks.granted;
-[ RECORD 1 ]------+-------------------------------------------------------
blocked_pid | 18009
blocked_user | postgres
blocked_duration | 00:00:29.163332
blocking_pid | 18011
blocking_user | postgres
blocking_duration | 00:00:36.395631
blocked_statement | select * from partition_table('t1','ptab01','t1_old');
blocking_statement | select id from t1 limit 1;
postgres=# select * from pg_locks where pid = 18009 and granted = 'f';
-[ RECORD 1 ]------+-----------------------------
locktype | relation
database | 13890
relation | 16847
page |
tuple |
virtualxid |
transactionid |
classid |
objid |
objsubid |
virtualtransaction | 3/24
pid | 18009
mode | AccessExclusiveLock
granted | f
fastpath | f
waitstart | 2021-12-11 21:39:17.37459+08
Caveat 2
This is really just a limitation of logical replication: sequence data is not replicated. Since pg_rewrite is built on logical replication, it also needs the corresponding configuration:
wal_level = logical
max_replication_slots = 1
# ... or add 1 to the current value.
shared_preload_libraries = 'pg_rewrite'
# ... or add the library to the existing ones.
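After editing postgresql.conf (the wal_level and shared_preload_libraries changes require a server restart), the effective settings can be double-checked; a minimal sketch:

```sql
SHOW wal_level;                  -- expect: logical
SHOW shared_preload_libraries;   -- expect the list to include pg_rewrite
SHOW max_replication_slots;      -- must leave at least one slot free
```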
The data in identity and SERIAL columns backed by a sequence is of course replicated as part of the table, but the sequence itself remains at its initial value on the subscriber side.
postgres=# CREATE TABLE t1 (
id serial not null,
tm timestamptz not null primary key
);
CREATE TABLE
postgres=# insert into t1(tm) select generate_series('2020-01-01'::timestamptz, '2020-05-31 23:59:59'::timestamptz, interval '10 seconds') as seq;
INSERT 0 1313280
postgres=# CREATE TABLE ptab01 (
id serial not null,
tm timestamptz not null
) PARTITION BY RANGE (tm);
CREATE TABLE
postgres=# create table ptab01_202001 partition of ptab01 for values from ('2020-01-01') to ('2020-02-01');
CREATE TABLE
postgres=# create table ptab01_202002 partition of ptab01 for values from ('2020-02-01') to ('2020-03-01');
CREATE TABLE
postgres=# create table ptab01_202003 partition of ptab01 for values from ('2020-03-01') to ('2020-04-01');
CREATE TABLE
postgres=# create table ptab01_202004 partition of ptab01 for values from ('2020-04-01') to ('2020-05-01');
CREATE TABLE
postgres=# create table ptab01_202005 partition of ptab01 for values from ('2020-05-01') to ('2020-06-01');
CREATE TABLE
postgres=# alter table ptab01 add primary key(tm);
ALTER TABLE
postgres=# select max(id) from t1;
max
---------
1313280
(1 row)
Now run the conversion again:
postgres=# select * from partition_table('t1','ptab01','t1_old');
partition_table
-----------------
(1 row)
postgres=# \d+
List of relations
Schema | Name | Type | Owner | Persistence | Access method | Size | Description
--------+---------------+-------------------+----------+-------------+---------------+------------+-------------
public | ptab01_202001 | table | postgres | permanent | heap | 11 MB |
public | ptab01_202002 | table | postgres | permanent | heap | 11 MB |
public | ptab01_202003 | table | postgres | permanent | heap | 11 MB |
public | ptab01_202004 | table | postgres | permanent | heap | 11 MB |
public | ptab01_202005 | table | postgres | permanent | heap | 11 MB |
public | ptab01_id_seq | sequence | postgres | permanent | | 8192 bytes |
public | t1 | partitioned table | postgres | permanent | | 0 bytes |
public | t1_id_seq | sequence | postgres | permanent | | 8192 bytes |
public | t1_old | table | postgres | permanent | heap | 56 MB |
(9 rows)
Insert a few rows:
postgres=# insert into t1(tm) values ('2020-03-01 23:11:32');
INSERT 0 1
postgres=# select * from t1 where tm = '2020-03-01 23:11:32';
id | tm
----+------------------------
4 | 2020-03-01 23:11:32+08
(1 row)
postgres=# insert into t1(tm) values ('2020-03-01 23:12:32');
INSERT 0 1
postgres=# select * from t1 where tm = '2020-03-01 23:12:32';
id | tm
----+------------------------
5 | 2020-03-01 23:12:32+08
(1 row)
postgres=# insert into t1_old(tm) values ('2020-03-01 23:12:32');
INSERT 0 1
postgres=# select * from t1_old where tm = '2020-03-01 23:12:32';
id | tm
---------+------------------------
1313281 | 2020-03-01 23:12:32+08
(1 row)
As you can see, the sequence's current value differs from the original table's. If the column is a primary key, the application may start failing with duplicate key errors, so pay close attention to this.
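One way to fix this up after the conversion is to advance the sequence behind the new table's id column to the current maximum; a minimal sketch (assumes the layout shown in the \d+ output above):

```sql
-- Move the sequence past the largest existing id so that
-- future inserts do not produce duplicate key values.
SELECT setval(pg_get_serial_sequence('t1', 'id'),
              (SELECT max(id) FROM t1));
```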
Summary
pg_rewrite supports converting an ordinary table into a partitioned table online; it only needs an exclusive lock at the very end, when the tables are renamed: https://github.com/cybertec-postgresql/pg_rewrite
1. It solves the long-lock problem of changing a non-partitioned table into a partitioned one.
2. It uses logical replication to incrementally copy data from the non-partitioned table to the partitioned table.
3. Only a brief exclusive lock is needed, to swap the table names after the data is in sync; the same principle as pg_repack.
4. The non-partitioned table must have a primary key, or more precisely a replica identity.
5. The constraints on the partitioned table should match those of the non-partitioned table (CHECK, NOT NULL, primary key, default values, etc.), otherwise the data may become inconsistent.
6. Remember to set up SERIAL columns properly as well: the sequence itself remains at its initial value on the subscriber side.
Finally, here is a nice trick I discovered recently. When we want to read source code, we usually clone it locally and import it into an IDE. If we want to read the pg_rewrite code at https://github.com/cybertec-postgresql/pg_rewrite, we can instead open https://github1s.com/cybertec-postgresql/pg_rewrite, i.e. just add `1s` to the URL. This is essentially a web version of VS Code, and I find it more convenient than Octotree. As the name suggests, it lets you open any GitHub repository in an online VS Code within one second, simply by appending 1s to the project URL.
References
https://github.com/cybertec-postgresql/pg_rewrite
https://www.cybertec-postgresql.com/en/pg_rewrite-postgresql-table-partitioning/
https://github.com/digoal/blog/blob/master/202112/20211209_01.md