Production Case Notes | A Rare Index Failure, Continued

Preface

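A while ago I wrote a post, 《罕见的索引失效》 (A Rare Index Failure), about an index created inside a transaction that could not be used. The cause was that pg_index.indcheckxmin had been set to true, so the planner would not use the index within that transaction.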

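If the index build is not HOT-safe, indcheckxmin must be set to true for the index in pg_index. The flag tells other transactions that the index may be unsafe for them. When a transaction builds an execution plan and finds an index whose indcheckxmin is true, it has to compare the order of the index-creating transaction and the current transaction to decide whether the index can be used.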
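The comparison can be pictured with a small Python sketch. It is a toy model of my reading of the planner check (get_relation_info in plancat.c), not PostgreSQL code: the function name index_usable and its parameters are made up for illustration, plain integer comparison stands in for the wraparound-aware xid comparison, and transaction_xmin stands for the oldest xmin among the current transaction's snapshots.

```python
def index_usable(indcheckxmin: bool, index_row_xmin: int, transaction_xmin: int) -> bool:
    """Toy version of the indcheckxmin test applied at planning time.

    index_row_xmin   -- xmin of the index's row in pg_index
    transaction_xmin -- oldest xmin of the current transaction's snapshots
    (both hypothetical parameter names; xid wraparound is ignored)
    """
    if not indcheckxmin:
        # No flag, no check: the index is always a candidate.
        return True
    # With the flag set, the pg_index row must be older than every
    # snapshot the current transaction could be using.
    return index_row_xmin < transaction_xmin


# Inside the creating transaction the row's xmin cannot be older than
# the transaction's own horizon, so a flagged index is skipped.
print(index_usable(True, 969, 969))    # False
# A later transaction whose horizon has moved past 969 may use it.
print(index_usable(True, 969, 970))    # True
# Without the flag, a "future" xmin does not matter.
print(index_usable(False, 1070, 1069)) # True
```

The point of the sketch is that xmin alone decides nothing: it only matters once indcheckxmin is true.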

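See the earlier post, A Rare Index Failure, for the details. Today, thanks to an accidental discovery by a colleague, I finally got to the bottom of the question left open in that post.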

Recap

Recall what we saw at the time: the transaction ID was 16415, but the xmin of the pg_index row inserted for the index was 16416.

postgres=# begin;
BEGIN
postgres=# create index myidx2 on t2(id);
CREATE INDEX
postgres=# select txid_current();
 txid_current 
--------------
        16415
(1 row)

postgres=# select txid_current_snapshot();
 txid_current_snapshot 
-----------------------
 16415:16415:
(1 row)

postgres=# select xmin,xmax,indcheckxmin from pg_index where indexrelid='myidx2'::regclass;
 xmin  | xmax | indcheckxmin 
-------+------+--------------
 16416 |   0  | t
(1 row)

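The analysis at the time was:

In the abnormal case, i.e. what we saw in production, when indcheckxmin is true the inserted xmin is, surprisingly, the current transaction ID plus 1. Under the visibility rules that is greater than the snapshot's xmax (16415) and therefore invisible, so the index cannot be used.

Looking back, that was wrong! Today's investigation showed that the +1 was caused by ON_ERROR_ROLLBACK!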

postgres=# begin;
BEGIN
postgres=*# create index my_index on t2(id);
CREATE INDEX
postgres=*# select xmin,xmax,indcheckxmin from pg_index where indexrelid='my_index'::regclass;
 xmin | xmax | indcheckxmin 
------+------+--------------
 1070 |    0 | f
(1 row)

postgres=*# select txid_current();
 txid_current 
--------------
         1069
(1 row)

postgres=*# explain select * from t2 where id = 99;
                    QUERY PLAN                    
--------------------------------------------------
 Seq Scan on t2  (cost=0.00..1.05 rows=1 width=4)
   Filter: (id = 99)
(2 rows)

postgres=*# set enable_seqscan to off;
SET
postgres=*# explain select * from t2 where id = 99;
                               QUERY PLAN                               
------------------------------------------------------------------------
 Index Only Scan using my_index on t2  (cost=0.13..8.15 rows=1 width=4)
   Index Cond: (id = 99)
(2 rows)

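As you can see, xmin is again one more than the transaction ID, but indcheckxmin is not true and the index can still be used! So the +1 is not what blocks the index; it is only when indcheckxmin is true that the index cannot be used inside the creating transaction.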

How it came up

How did we stumble on this? A colleague working on production sent over a snippet of SQL showing the following:

postgres=# create table test(id int);
CREATE TABLE
postgres=# begin;
BEGIN
postgres=*# select txid_current();
 txid_current 
--------------
         1063
(1 row)

postgres=*# insert into test values(1);
INSERT 0 1
postgres=*# insert into test values(2);
INSERT 0 1
postgres=*# commit ;
COMMIT
postgres=# select xmin,xmax,id from test;
 xmin | xmax | id 
------+------+----
 1064 |    0 |  1
 1065 |    0 |  2
(2 rows)

postgres=# select txid_current();
 txid_current 
--------------
         1066
(1 row)

As you can see, the transaction ID advanced inside the transaction as well: xmin became 1064 and 1065, whereas normally both rows would have xmin 1063.

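My first reaction was to ask whether an old version of the pg_show_plans extension was installed. I remember this one vividly because I once reported a bug about it: "Query consumes transaction ID", https://github.com/cybertec-postgresql/pg_show_plans/issues/20 (have a look if you are interested). With old versions of pg_show_plans even a plain select 1 consumes a transaction ID, so transaction IDs get used up very quickly. I checked, though, and the extension was not loaded, so that was not the cause.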

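With the extension ruled out, I looked for other possible causes. As it happens, I had been tidying up some articles on subtransactions just two days earlier, and subtransactions produce the same symptom. Here is what it looks like: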

[postgres@xiongcc ~]$ psql
psql (14rc1)
Type "help" for help.

postgres=# \set ON_ERROR_ROLLBACK on
postgres=# \echo :ON_ERROR_ROLLBACK
on
postgres=# 
postgres=# begin;
BEGIN
postgres=*# select txid_current();
 txid_current 
--------------
         1067
(1 row)

postgres=*# insert into test values(3);
INSERT 0 1
postgres=*# commit ;
COMMIT
postgres=# select xmin,xmax,id from test where id = 3;
 xmin | xmax | id 
------+------+----
 1068 |    0 |  3
(1 row)

xmin is again one more than the transaction ID, because ON_ERROR_ROLLBACK is set, and it is essentially implemented with savepoints.

ON_ERROR_ROLLBACK

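When set to on, if a statement in a transaction block generates an error, the error is ignored and the transaction continues. When set to interactive, such errors are only ignored in interactive sessions, and not when reading script files. When unset or set to off (the default), a statement in a transaction block that generates an error aborts the entire transaction. The error rollback mode works by issuing an implicit SAVEPOINT for you, just before each command that is in a transaction block, and then rolling back to the savepoint if the command fails.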
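To see why an implicit SAVEPOINT per command shifts xmin, here is a minimal Python sketch that replays the sessions above. It is a toy model under stated assumptions, not PostgreSQL code: ToyServer, ToyTransaction and replay are hypothetical names, xids come from one shared counter, the top-level transaction gets its xid lazily, and every writing statement that runs under an implicit savepoint gets a subtransaction xid of its own.

```python
class ToyServer:
    """Hands out transaction IDs from a single shared counter."""

    def __init__(self, next_xid):
        self.next_xid = next_xid

    def allocate_xid(self):
        xid = self.next_xid
        self.next_xid += 1
        return xid


class ToyTransaction:
    def __init__(self, server, on_error_rollback):
        self.server = server
        self.on_error_rollback = on_error_rollback
        self.top_xid = None  # assigned lazily, on first need

    def txid_current(self):
        # Forces assignment of the top-level xid, then reports it.
        if self.top_xid is None:
            self.top_xid = self.server.allocate_xid()
        return self.top_xid

    def insert(self):
        """Return the xmin stamped on the inserted row."""
        top = self.txid_current()  # the parent gets its xid before any child
        if not self.on_error_rollback:
            return top
        # Implicit SAVEPOINT: the statement runs in a subtransaction,
        # which takes its own xid because it writes.
        return self.server.allocate_xid()


def replay(on_error_rollback, first_xid=1063):
    """Replay: begin; select txid_current(); insert; insert; commit."""
    server = ToyServer(first_xid)
    tx = ToyTransaction(server, on_error_rollback)
    top = tx.txid_current()
    xmins = [tx.insert(), tx.insert()]
    return top, xmins, server.next_xid


print(replay(True))   # (1063, [1064, 1065], 1066)
print(replay(False))  # (1063, [1063, 1063], 1064)
```

With the savepoints on, the model gives the same numbers as the session my colleague sent: rows stamped 1064 and 1065, and 1066 as the next transaction ID.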

See digoal's article on the memory overhead of PostgreSQL savepoints and the subtransaction overflow problem, as well as SUBTRANSACTIONS AND PERFORMANCE IN POSTGRESQL:

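1. Each savepoint consumes a txid (more precisely, the subtransaction is assigned one once it modifies data).
2. Each savepoint consumes 8K of session-local memory (CurTransactionContext).
3. With concurrent transactions, it is best to keep each transaction to no more than 64 subtransactions; beyond that, SubtransControlLock wait events appear and performance may suffer.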

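So was ON_ERROR_ROLLBACK the culprit in our case? Printing the variable with \echo showed that it was set to interactive. (Setting ON_ERROR_ROLLBACK in a script is a way to keep an import from failing halfway; an import that errors out at 99% is truly maddening.)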

postgres=# \echo :ON_ERROR_ROLLBACK
interactive

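So a savepoint is added before every statement here too, consuming transaction IDs, though fortunately only in interactive sessions, i.e. when typing into psql. Let's check the psql startup file .psqlrc: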

[postgres@xiongcc ~]$ cat .psqlrc | grep ON_ERROR_ROLLBACK
\set ON_ERROR_ROLLBACK interactive

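I was stunned! It turns out every environment on our cloud loads this setting by default. That explains the earlier observation that the index's xmin was the transaction ID plus 1: this setting was behind it! JDBC has similar parameters: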

•autosave = String Specifies what the driver should do if a query fails. In autosave=always mode, JDBC driver sets a savepoint before each query, and rolls back to that savepoint in case of failure. In autosave=never mode (default), no savepoint dance is made ever. In autosave=conservative mode, savepoint is set for each query, however the rollback is done only for rare cases like 'cached statement cannot change return type' or 'statement XXX is not valid' so JDBC driver rollsback and retries. The default is never.
•cleanupSavepoints = boolean Determines if the SAVEPOINT created in autosave mode is released prior to the statement. This is done to avoid running out of shared buffers on the server in the case where 1000’s of queries are performed. The default is 'false'.

Index failure

In the earlier index-failure post I wrote that a broken HOT chain causes indcheckxmin to be set. Here is what that looks like:

postgres=# create table t1(id int,info text);
CREATE TABLE
postgres=# alter table t1 set (fillfactor = 70);
ALTER TABLE
postgres=# insert into t1 select n,'test' from generate_series(1,100)n;
INSERT 0 100
postgres=# begin;
BEGIN
postgres=*# select txid_current();
 txid_current 
--------------
          969
(1 row)

postgres=*# update t1 set id = 99 where id = 1;            -- HOT update
UPDATE 1
postgres=*# create index myidx on t1(info);
CREATE INDEX
postgres=*# select indcheckxmin,xmin from pg_index where indexrelid = 'myidx'::regclass;
 indcheckxmin | xmin 
--------------+------
 t            |  969
(1 row)

postgres=*# set enable_seqscan to off;
SET
postgres=*# explain select info from t1 where info ='haha';
                              QUERY PLAN                               
-----------------------------------------------------------------------
 Seq Scan on t1  (cost=10000000000.00..10000000002.25 rows=1 width=32)
   Filter: (info = 'haha'::text)
(2 rows)

postgres=*# commit ;
COMMIT
postgres=# explain select info from t1 where info ='haha';
                              QUERY PLAN                              
----------------------------------------------------------------------
 Index Only Scan using myidx on t1  (cost=0.14..8.16 rows=1 width=32)
   Index Cond: (info = 'haha'::text)
(2 rows)

There is one more case, found by my colleague, which is caused by the old_snapshot_threshold parameter:

postgres=# begin;
BEGIN
postgres=*# create index myindex on t1(id);
CREATE INDEX
postgres=*# select indcheckxmin from pg_index where indexrelid = 'myindex'::regclass;
 indcheckxmin 
--------------
 t
(1 row)

postgres=*# show old_snapshot_threshold ;
 old_snapshot_threshold 
------------------------
 1min
(1 row)

postgres=*# rollback ;
ROLLBACK

postgres=# begin;
BEGIN
postgres=*# create index myindex on t1(id);
CREATE INDEX
postgres=*# select indcheckxmin from pg_index where indexrelid = 'myindex'::regclass;
 indcheckxmin 
--------------
 f
(1 row)

postgres=*# show old_snapshot_threshold ;
 old_snapshot_threshold 
------------------------
 -1
(1 row)

postgres=*# rollback ;
ROLLBACK
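Putting the two cases together, the decision to set indcheckxmin at build time can be sketched roughly as follows. This is a simplified Python model based on my reading of index_build in the PostgreSQL source for versions that have old_snapshot_threshold, so treat it as an assumption rather than a reference: the function name sets_indcheckxmin and its parameters are made up for illustration, the threshold is expressed in minutes, and the extra conditions on which relations allow early pruning are left out.

```python
def sets_indcheckxmin(broken_hot_chain: bool,
                      old_snapshot_threshold: int,
                      is_reindex: bool = False,
                      concurrent: bool = False) -> bool:
    """Toy model of when a plain CREATE INDEX flags the new index.

    old_snapshot_threshold -- minutes; -1 means the feature is disabled
    (hypothetical helper, simplified from the real server logic)
    """
    # Early pruning is considered enabled whenever the threshold is not -1.
    early_pruning = old_snapshot_threshold >= 0
    return (broken_hot_chain or early_pruning) and not is_reindex and not concurrent


# old_snapshot_threshold = 1min, no broken HOT chain: flag is set.
print(sets_indcheckxmin(False, 1))   # True
# old_snapshot_threshold = -1, no broken HOT chain: flag stays off.
print(sets_indcheckxmin(False, -1))  # False
# A broken HOT chain sets the flag regardless of the threshold.
print(sets_indcheckxmin(True, -1))   # True
```

In this model either condition on its own is enough, which matches the two sessions above: the HOT update set the flag with the threshold untouched, and the threshold set it with no update at all.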