Production Case Notes | A Corrupted Index - CFANZ Programming Community

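Preface

A colleague recently ran into a rare case in production: the data returned through an index scan did not match the data returned by a sequential scan.

Here is a summary of what we learned.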

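The extra cost of index corruption

Corruption in a database usually shows up as errors when DDL or DML is run against the affected table. Typically these are read errors caused by missing files, which are fairly self-explanatory. A corrupted index is much less obvious than a corrupted table. To find out whether an index really is corrupted, you usually have to look at the behavior of the queries themselves: their execution time and their results.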

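Symptoms of index corruption

1. Queries that used to use an index now run more sequential scans.
2. DBAs who keep an eye on the pg_stat_user_tables view see the idx_scan column growing much more slowly than seq_scan (or not growing at all).
3. Users complain that some of the data they get back is not what they expected.

In cases 1 and 2 it is easy to be misled into concluding that your statistics are stale. Case 3 points more strongly at a corrupted index, but without further knowledge it is hard to be sure, and the "Google and go" approach rarely helps anyone looking for an answer to this kind of problem. That is why we decided to write it up as a blog post. The situation is very rare for a typical PostgreSQL user, and our goal is to make this elusive problem easier to solve.

Note: for privacy and compliance reasons this post does not use real user data; it uses a test case that demonstrates similar results.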

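Unexpected query results

Our user had a query that was part of his regular workflow: a job that purges old data according to the application's requirements. Over the last few runs, the user saw the number of rows purged by the job drop sharply, yet none of the usual suspects (such as the retention rules) had changed in a way that would explain it. On top of that, the table the job runs against appeared to be growing steadily.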

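Ruling out the obvious

Since no errors were recorded while the job was running, we started by looking through the database logs for errors or warnings. The job had run fine one day and suddenly stopped working the next. The database was still up, connections were intact, and the application was running fine. What had gone wrong?

We first ruled out the obvious:

1. Had the business logic or table structure changed?
2. Had the retention rules changed?
3. Had there been any bulk operations, such as a data load or a TRUNCATE?

The next step was to check the log entries generated during the last run of the purge job, where we found the following: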

"ERROR: could not read block xxx in file"

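What the error means

Let's take a quick look at what a block is and why failing to read one makes the query fail. In PostgreSQL, every table and index is stored as a collection of data pages (blocks). Pages are 8 KB by default, although a different size can be chosen when the server is compiled. Because the page size cannot change once it has been set at compile time, all pages of a table are logically equivalent. Indexes are different: the first page is usually reserved as a metapage, which holds control information, so a single index can contain several kinds of pages. The 8 KB page is what the error message calls a block.

So we can conclude that the query was looking for information in a block/page and for some reason could not read it. The good news? Because the error points at one specific block, it is reasonably likely that this block is the only damaged part of the table, rather than the whole table being damaged.

Note: for more about PostgreSQL's page structure, see the Database Page Layout section of the PostgreSQL documentation.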

Narrowing it down

Looking through the Postgres database log, we found entries like these:

2020-05-03 21:55:46 CEST [6067]: user=postgres,db=postgres,app=psql.bin,client=[local]STATEMENT:  explain analyze select * from pgbench_accounts where aid >100000000 ;

2020-05-03 21:56:59 CEST [6067]: user=postgres,db=postgres,app=psql.bin,client=[local]ERROR:  could not read block 274179 in file "base/13642/24643.2": read only 0 of 8192 bytes

A query like the following tells us which relation the file containing this block belongs to:

postgres=# select n.nspname AS schema, c.relname AS relation from pg_class c inner join pg_namespace n on (c.relnamespace = n.oid) where c.relfilenode = '24643';
 schema |       relation
--------+----------------------
 public | idx_pgbench_accounts
(1 row)

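Note: the query above ignores the '.2' part of the file name, because only the number before the '.' corresponds to the filenode. The first segment file is named after the filenode alone, and later segments add a numeric suffix. For details, see the Database File Layout section of the PostgreSQL documentation.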
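The file name and block number from the log line can be taken apart mechanically. The following is a minimal Python sketch, assuming the default 8 KB block size and 1 GB segment size (both are compile-time settings and may differ on your server); the helper names parse_relation_path and locate_block are made up for this illustration.

```python
import re

BLOCK_SIZE = 8192                 # assumption: default block size
SEGMENT_SIZE = 1024 ** 3          # assumption: default 1 GB segment size
BLOCKS_PER_SEGMENT = SEGMENT_SIZE // BLOCK_SIZE  # 131072 blocks per segment


def parse_relation_path(path):
    """Split 'base/<db oid>/<filenode>[.<segment>]' into three integers."""
    m = re.fullmatch(r"base/(\d+)/(\d+)(?:\.(\d+))?", path)
    if m is None:
        raise ValueError(f"unexpected path: {path!r}")
    db_oid, filenode, segment = m.groups()
    # A missing suffix means the first segment (segment 0).
    return int(db_oid), int(filenode), int(segment or 0)


def locate_block(block_number):
    """Return (segment number, byte offset inside that segment file)."""
    segment, block_in_segment = divmod(block_number, BLOCKS_PER_SEGMENT)
    return segment, block_in_segment * BLOCK_SIZE


# Values taken from the log line above.
print(parse_relation_path("base/13642/24643.2"))
print(locate_block(274179))
```

Under these assumptions block 274179 falls in segment 2, which matches the '.2' suffix in the error message, and the filenode to look up in pg_class is 24643.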

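A reliable way to check the integrity of the table is to run pg_dump against the table the index belongs to, because pg_dump does not use any index and reads the table data directly. Sample output: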

-bash-4.2$ /opt/PostgresPlus/9.5AS/bin/pg_dump -p 5446 -v -t pgbench_accounts postgres > backup_pgbench_accounts.sql >> backup_pgbench_accounts.log 2>&1
-bash-4.2$ tail backup_pgbench_accounts.log

CREATE INDEX idx_pgbench_accounts ON pgbench_accounts USING btree (aid);


-- Completed on 2020-05-03 22:53:49 EDT

--
-- EnterpriseDB database dump complete
--

-bash-4.2$

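The data in the table is fine. All we need to do is rebuild the index, which is covered later in this article.

A hypothetical complication

Let's add a hypothetical but plausible twist: there is no error in the log. What can we do next? There are no errors and the query runs normally, yet the user insists the data returned is wrong. We can generate the query's EXPLAIN ANALYZE plan, which shows where the data comes from.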

postgres=# explain analyze select * from pgbench_accounts where  aid>1 and aid <10000 ;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on pgbench_accounts  (cost=174.61..21830.00 rows=11126 width=97) (actual time=1.775..4.469 rows=9998 loops=1)
   Recheck Cond: ((aid > 1) AND (aid < 10000))
   Heap Blocks: exact=164
   ->  Bitmap Index Scan on idx_pgbench_accounts  (cost=0.00..171.83 rows=11126 width=0) (actual time=1.720..1.720 rows=9998 loops=1)
         Index Cond: ((aid > 1) AND (aid < 10000))
 Planning time: 0.242 ms
 Execution time: 5.286 ms
(7 rows)

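As you can see, the query uses the index.

Next, force the query to read the table directly and see whether the result differs. In a psql session, set enable_indexscan to off and run the query again. For a bitmap plan like the one above, enable_bitmapscan has to be turned off as well.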

postgres=# set enable_indexscan='off';
SET
postgres=# explain analyze select * from pgbench_accounts where aid >10000000;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on pgbench_accounts  (cost=0.00..4889345.00 rows=89592207 width=97) (actual time=1859.666..15070.333 rows=90000000 loops=1)
   Filter: (aid > 10000000)
   Rows Removed by Filter: 10000000
 Planning time: 0.161 ms
 Execution time: 18394.457 ms
(12 rows)

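If the same query returns a different number of rows under the two plans, you have every reason to believe something is wrong with your index. The original write-up cites the footers of the two outputs (7 for the index scan, 12 for the sequential scan), but those footers describe the EXPLAIN output itself. The figure to compare is the actual rows value, and the predicate has to be identical in both runs.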

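Fixing the problem

We now have enough evidence to conclude that the index is corrupted. To fix it we have to rebuild the index, and there are several ways to do that, each with its own pros and cons.

1. Use REINDEX. This allows reads but blocks all writes on the index's parent table. It also takes an exclusive lock on the index being processed, so reads that try to use that index are blocked for the duration. The best time for this method is a scheduled downtime window or a period of low load.
2. Use REINDEX CONCURRENTLY. This is the better option, but it is only available in Postgres 12 and later.
3. Use CREATE INDEX CONCURRENTLY and drop the old, corrupted index. If you are on a version older than 12, your best option is to build a new index with CREATE INDEX CONCURRENTLY, which does not block existing operations on the table, and then remove the corrupted index with DROP INDEX. The best time for this method is a period of low load, because CREATE INDEX CONCURRENTLY slows down when the table is receiving a lot of updates, inserts, and deletes.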
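The choice between the three options can be written down as a small decision helper. This is only a sketch: the function rebuild_plan, its parameters, and the "_new" suffix for the replacement index are hypothetical, identifiers are not quoted, and the final rename step is an addition that is not in the original text. It assumes server_version_num has the format PostgreSQL reports for that setting (120000 for version 12.0).

```python
def rebuild_plan(index_name, table, columns, server_version_num,
                 in_maintenance_window=False):
    """Return the SQL statements for rebuilding one B-tree index."""
    if in_maintenance_window:
        # Option 1: plain REINDEX, acceptable when writes may be blocked.
        return [f"REINDEX INDEX {index_name};"]
    if server_version_num >= 120000:
        # Option 2: available in PostgreSQL 12 and later.
        return [f"REINDEX INDEX CONCURRENTLY {index_name};"]
    # Option 3: build a replacement, drop the damaged index, rename.
    new_name = f"{index_name}_new"  # hypothetical naming convention
    return [
        f"CREATE INDEX CONCURRENTLY {new_name} ON {table} USING btree ({columns});",
        f"DROP INDEX {index_name};",
        f"ALTER INDEX {new_name} RENAME TO {index_name};",
    ]


# The 9.5 server from the pg_dump example would take option 3.
for statement in rebuild_plan("idx_pgbench_accounts", "pgbench_accounts", "aid", 90500):
    print(statement)
```

The helper only prints statements; running them, and checking that the replacement index is valid before dropping the old one, is left to the operator.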

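Recommendations for the future

The odds that a DBA never encounters corruption are slim. So if corruption is likely to happen at some point, and your recovery depends on spotting it early, what can you do? The answer is not one thing but many. Monitoring, checksums, and pg_catcheck are some of the ways to catch a problem when it occurs. Getting your environment healthy again, however, depends on the things we have covered in earlier posts: PITR backups and WAL streaming, standbys, and delayed standbys can all help in different ways when you hit a problem. Of all the strategies, staying cautious is the best way to keep your data intact... and your peace of mind!

The text above is translated from Index Corruption in PostgreSQL: The Hidden Cost of Your Queries

Brief summary

While trying to reproduce the problem myself, I also found a few interesting points:

1. From version 13 onward, B-tree indexes are optimized for duplicates. If an indexed column is not unique it may contain many identical values, and the corresponding B-tree index then contains many duplicate entries. PostgreSQL 13 borrows the approach of GIN indexes and chains together the ctids of the rows that share a key. This shrinks the index, avoids many unnecessary page splits, and speeds up index lookups (fewer index pages need to be read). This is deduplication.
2. PostgreSQL recognizes a "zero page" and raises an error, but it does not recognize a non-zero page whose contents are inconsistent; that kind of damage has to be found by actively checking for it.

De-duplicate

Run the following on both version 12 and version 13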

postgres=# create table test(id int,info text);
CREATE TABLE
postgres=# insert into test select n,'hello' from generate_series(1,100) as n;
INSERT 0 100
postgres=# create index myidx on test(info);
CREATE INDEX
postgres=# analyze test ;
ANALYZE

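In the version 12 B-tree index, the 100 rows produce 100 index entries, one per row even though the values are identical. Recalling the earlier article 《从源码出发，深度剖析字节对齐》 (a source-level look at byte alignment), the data column here consists of the varlena header (0d), the content hello (68 65 6c 6c 6f), and 2 trailing bytes of padding.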

postgres=# select * from bt_page_items('myidx',1) limit 10;
 itemoffset |  ctid  | itemlen | nulls | vars |          data           
------------+--------+---------+-------+------+-------------------------
          1 | (0,1)  |      16 | f     | t    | 0d 68 65 6c 6c 6f 00 00
          2 | (0,2)  |      16 | f     | t    | 0d 68 65 6c 6c 6f 00 00
          3 | (0,3)  |      16 | f     | t    | 0d 68 65 6c 6c 6f 00 00
          4 | (0,4)  |      16 | f     | t    | 0d 68 65 6c 6c 6f 00 00
          5 | (0,5)  |      16 | f     | t    | 0d 68 65 6c 6c 6f 00 00
          6 | (0,6)  |      16 | f     | t    | 0d 68 65 6c 6c 6f 00 00
          7 | (0,7)  |      16 | f     | t    | 0d 68 65 6c 6c 6f 00 00
          8 | (0,8)  |      16 | f     | t    | 0d 68 65 6c 6c 6f 00 00
          9 | (0,9)  |      16 | f     | t    | 0d 68 65 6c 6c 6f 00 00
         10 | (0,10) |      16 | f     | t    | 0d 68 65 6c 6c 6f 00 00
(10 rows)
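The data bytes in the output above can be decoded by hand. The following is a small Python sketch, assuming a little-endian machine and a value short enough to use the 1-byte varlena header, where the lowest bit of the header is 1 and the remaining bits hold the total length including the header byte. The function name decode_short_varlena is made up for this example.

```python
def decode_short_varlena(hex_bytes):
    """Decode an index key such as '0d 68 65 6c 6c 6f 00 00'.

    Returns (text payload, number of trailing padding bytes).
    """
    raw = bytes.fromhex(hex_bytes)    # fromhex skips the spaces
    header = raw[0]
    if header & 0x01 == 0:
        raise ValueError("not a 1-byte varlena header")
    total_len = header >> 1           # length including the header byte
    payload = raw[1:total_len]
    padding = raw[total_len:]
    return payload.decode("ascii"), len(padding)


print(decode_short_varlena("0d 68 65 6c 6c 6f 00 00"))
print(decode_short_varlena("0b 61 62 63 64 00 00 00"))
```

The second call uses the key bytes of the abcd rows that appear in the corruption test later in this article.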

Now look at the B-tree index on version 13: there is a single entry. Borrowing the GIN approach, the ctids of all rows with the same key are chained together.

postgres=# select * from bt_page_items('myidx',1) ;
-[ RECORD 1 ]-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
itemoffset | 1
ctid       | (16,8292)
itemlen    | 616
nulls      | f
vars       | t
data       | 0d 68 65 6c 6c 6f 00 00
dead       | f
htid       | (0,1)
tids       | {"(0,1)","(0,2)","(0,3)","(0,4)","(0,5)","(0,6)","(0,7)","(0,8)","(0,9)","(0,10)","(0,11)","(0,12)","(0,13)","(0,14)","(0,15)","(0,16)","(0,17)","(0,18)","(0,19)","(0,20)","(0,21)","(0,22)","(0,23)","(0,24)","(0,25)","(0,26)","(0,27)","(0,28)","(0,29)","(0,30)","(0,31)","(0,32)","(0,33)","(0,34)","(0,35)","(0,36)","(0,37)","(0,38)","(0,39)","(0,40)","(0,41)","(0,42)","(0,43)","(0,44)","(0,45)","(0,46)","(0,47)","(0,48)","(0,49)","(0,50)","(0,51)","(0,52)","(0,53)","(0,54)","(0,55)","(0,56)","(0,57)","(0,58)","(0,59)","(0,60)","(0,61)","(0,62)","(0,63)","(0,64)","(0,65)","(0,66)","(0,67)","(0,68)","(0,69)","(0,70)","(0,71)","(0,72)","(0,73)","(0,74)","(0,75)","(0,76)","(0,77)","(0,78)","(0,79)","(0,80)","(0,81)","(0,82)","(0,83)","(0,84)","(0,85)","(0,86)","(0,87)","(0,88)","(0,89)","(0,90)","(0,91)","(0,92)","(0,93)","(0,94)","(0,95)","(0,96)","(0,97)","(0,98)","(0,99)","(0,100)"}

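Deduplication is controlled by the deduplicate_items storage parameter, which is set when the B-tree index is created. Fortunately it defaults to on. Creating an index with it turned off gives one entry per row again: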

postgres=# create index myidx2 on test (info) with (deduplicate_items=off);
CREATE INDEX

postgres=# select * from bt_page_items('myidx2',1) limit 10;
 itemoffset |  ctid  | itemlen | nulls | vars |          data           | dead |  htid  | tids 
------------+--------+---------+-------+------+-------------------------+------+--------+------
          1 | (0,1)  |      16 | f     | t    | 0d 68 65 6c 6c 6f 00 00 | f    | (0,1)  | 
          2 | (0,2)  |      16 | f     | t    | 0d 68 65 6c 6c 6f 00 00 | f    | (0,2)  | 
          3 | (0,3)  |      16 | f     | t    | 0d 68 65 6c 6c 6f 00 00 | f    | (0,3)  | 
          4 | (0,4)  |      16 | f     | t    | 0d 68 65 6c 6c 6f 00 00 | f    | (0,4)  | 
          5 | (0,5)  |      16 | f     | t    | 0d 68 65 6c 6c 6f 00 00 | f    | (0,5)  | 
          6 | (0,6)  |      16 | f     | t    | 0d 68 65 6c 6c 6f 00 00 | f    | (0,6)  | 
          7 | (0,7)  |      16 | f     | t    | 0d 68 65 6c 6c 6f 00 00 | f    | (0,7)  | 
          8 | (0,8)  |      16 | f     | t    | 0d 68 65 6c 6c 6f 00 00 | f    | (0,8)  | 
          9 | (0,9)  |      16 | f     | t    | 0d 68 65 6c 6c 6f 00 00 | f    | (0,9)  | 
         10 | (0,10) |      16 | f     | t    | 0d 68 65 6c 6c 6f 00 00 | f    | (0,10) | 
(10 rows)

Index corruption

Now let's look at index corruption.

postgres=# create table t2(id int,info text);
CREATE TABLE
postgres=# insert into t2 select n,'hello' from generate_series(1,10000) as n;
INSERT 0 10000
postgres=# insert into t2 values(1,'abcd');
INSERT 0 1
postgres=# insert into t2 values(2,'abcd');
INSERT 0 1
postgres=# create index on t2(info) with (deduplicate_items=off);
CREATE INDEX
postgres=# analyze t2;
ANALYZE
postgres=# select relfilenode,relpages,relname from pg_class where relname in ('t2','t2_info_idx');
 relfilenode | relpages |   relname   
-------------+----------+-------------
       17135 |       55 | t2
       17137 |       28 | t2_info_idx
(2 rows)

postgres=# select pg_relation_filepath('t2_info_idx');
 pg_relation_filepath 
----------------------
 base/13578/17137
(1 row)

As you can see, the query for the abcd rows uses an index only scan.

postgres=# explain select info from t2 where info = 'abcd';
                                QUERY PLAN                                 
---------------------------------------------------------------------------
 Index Only Scan using t2_info_idx on t2  (cost=0.29..4.32 rows=2 width=5)
   Index Cond: (info = 'abcd'::text)
(2 rows)

postgres=# select info from t2 where info = 'abcd';
 info 
------
 abcd
 abcd
(2 rows)

Now let's simulate index corruption. The file behind t2_info_idx is base/13578/17137, so we open it in vim.

Opened directly, it is binary:

^@^@^@^@¨<94>Q\^@^@^@^@H^@ð^_ð^_^D ^@^@^@^@b1^E^@^D^@^@^@^C^@^@^@^A^@^@^@^C^@^@^@^A^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@<89>Ã@^A^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@

With hexdump we can see that the key=abcd entries sit at 0x2760 and 0x2770.

[postgres@xiongcc 13578]$ hexdump -C 17137 | grep -C 10 -w 'abcd'
000025e0  d8 88 20 00 c8 88 20 00  b8 88 20 00 a8 88 20 00  |.. ... ... ... .|
000025f0  98 88 20 00 88 88 20 00  78 88 20 00 68 88 20 00  |.. ... .x. .h. .|
00002600  58 88 20 00 48 88 20 00  38 88 20 00 28 88 20 00  |X. .H. .8. .(. .|
00002610  18 88 20 00 08 88 20 00  f8 87 20 00 e8 87 20 00  |.. ... ... ... .|
00002620  d8 87 20 00 c8 87 20 00  b8 87 20 00 a8 87 20 00  |.. ... ... ... .|
00002630  98 87 20 00 88 87 20 00  78 87 20 00 00 00 00 00  |.. ... .x. .....|
00002640  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00002750  00 00 00 00 00 00 00 00  00 00 36 00 0c 00 10 40  |..........6....@|
00002760  0b 61 62 63 64 00 00 00  00 00 36 00 0b 00 10 40  |.abcd.....6....@|
00002770  0b 61 62 63 64 00 00 00  00 00 02 00 14 00 10 40  |.abcd..........@|
00002780  0d 68 65 6c 6c 6f 00 00  00 00 02 00 13 00 10 40  |.hello.........@|
00002790  0d 68 65 6c 6c 6f 00 00  00 00 02 00 12 00 10 40  |.hello.........@|
000027a0  0d 68 65 6c 6c 6f 00 00  00 00 02 00 11 00 10 40  |.hello.........@|
000027b0  0d 68 65 6c 6c 6f 00 00  00 00 02 00 10 00 10 40  |.hello.........@|
000027c0  0d 68 65 6c 6c 6f 00 00  00 00 02 00 0f 00 10 40  |.hello.........@|
000027d0  0d 68 65 6c 6c 6f 00 00  00 00 02 00 0e 00 10 40  |.hello.........@|
000027e0  0d 68 65 6c 6c 6f 00 00  00 00 02 00 0d 00 10 40  |.hello.........@|
000027f0  0d 68 65 6c 6c 6f 00 00  00 00 02 00 0c 00 10 40  |.hello.........@|
00002800  0d 68 65 6c 6c 6f 00 00  00 00 02 00 0b 00 10 40  |.hello.........@|
00002810  0d 68 65 6c 6c 6f 00 00  00 00 02 00 0a 00 10 40  |.hello.........@|
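The offsets in the dump above can be mapped to an index block, and the 8 bytes in front of each key can be decoded as the index tuple header. This is a sketch in Python, assuming 8 KB pages and a little-endian machine; the function names are invented for illustration, and the sample header is the 8 bytes at 0x2758 that precede the first abcd key.

```python
import struct

BLOCK_SIZE = 8192  # assumption: default 8 KB pages


def block_and_offset(file_offset):
    """Map a byte offset in the relation file to (block number, offset in block)."""
    return divmod(file_offset, BLOCK_SIZE)


def decode_index_tuple_header(raw):
    """Decode the 8-byte header: heap TID (6 bytes) plus a 2-byte info word."""
    bi_hi, bi_lo, posid, t_info = struct.unpack("<HHHH", raw)
    return {
        "heap_block": (bi_hi << 16) | bi_lo,
        "heap_offset": posid,
        "size": t_info & 0x1FFF,            # tuple length in bytes
        "has_varwidth": bool(t_info & 0x4000),
        "has_nulls": bool(t_info & 0x8000),
    }


print(block_and_offset(0x2760))
print(decode_index_tuple_header(bytes.fromhex("00 00 36 00 0c 00 10 40")))
```

Both abcd keys therefore live in block 1 of the index, and the first of them points at heap tuple (54,12).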

After opening the file in vim, use :%!xxd to edit the binary content

0002720: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0002730: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0002740: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0002750: 0000 0000 0000 0000 0000 3600 0c00 1040  ..........6....@
0002760: 0b61 6263 6400 0000 0000 3600 0b00 1040  .abcd.....6....@
0002770: 0b61 6263 6400 0000 0000 0200 1400 1040  .abcd..........@
0002780: 0d68 656c 6c6f 0000 0000 0200 1300 1040  .hello.........@
0002790: 0d68 656c 6c6f 0000 0000 0200 1200 1040  .hello.........@
00027a0: 0d68 656c 6c6f 0000 0000 0200 1100 1040  .hello.........@

Here we replace abcd with mbcd by changing 61 to 6d, simulating an index that was written incorrectly

0002720: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0002730: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0002740: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0002750: 0000 0000 0000 0000 0000 3600 0c00 1040  ..........6....@
0002760: 0b6d 6263 6400 0000 0000 3600 0b00 1040  .mbcd.....6....@
0002770: 0b61 6263 6400 0000 0000 0200 1400 1040  .abcd..........@
0002780: 0d68 656c 6c6f 0000 0000 0200 1300 1040  .hello.........@
0002790: 0d68 656c 6c6f 0000 0000 0200 1200 1040  .hello.........@
00027a0: 0d68 656c 6c6f 0000 0000 0200 1100 1040  .hello.........@

After editing, use :%!xxd -r to convert back to binary and save with :wq. Looking again, the entry has become mbcd

[postgres@xiongcc 13578]$ hexdump -C 17137 | grep -C 10 -w 'abcd'
000025e0  d8 88 20 00 c8 88 20 00  b8 88 20 00 a8 88 20 00  |.. ... ... ... .|
000025f0  98 88 20 00 88 88 20 00  78 88 20 00 68 88 20 00  |.. ... .x. .h. .|
00002600  58 88 20 00 48 88 20 00  38 88 20 00 28 88 20 00  |X. .H. .8. .(. .|
00002610  18 88 20 00 08 88 20 00  f8 87 20 00 e8 87 20 00  |.. ... ... ... .|
00002620  d8 87 20 00 c8 87 20 00  b8 87 20 00 a8 87 20 00  |.. ... ... ... .|
00002630  98 87 20 00 88 87 20 00  78 87 20 00 00 00 00 00  |.. ... .x. .....|
00002640  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00002750  00 00 00 00 00 00 00 00  00 00 36 00 0c 00 10 40  |..........6....@|
00002760  0b 6d 62 63 64 00 00 00  00 00 36 00 0b 00 10 40  |.mbcd.....6....@|
00002770  0b 61 62 63 64 00 00 00  00 00 02 00 14 00 10 40  |.abcd..........@|
00002780  0d 68 65 6c 6c 6f 00 00  00 00 02 00 13 00 10 40  |.hello.........@|
00002790  0d 68 65 6c 6c 6f 00 00  00 00 02 00 12 00 10 40  |.hello.........@|
000027a0  0d 68 65 6c 6c 6f 00 00  00 00 02 00 11 00 10 40  |.hello.........@|
000027b0  0d 68 65 6c 6c 6f 00 00  00 00 02 00 10 00 10 40  |.hello.........@|
000027c0  0d 68 65 6c 6c 6f 00 00  00 00 02 00 0f 00 10 40  |.hello.........@|
000027d0  0d 68 65 6c 6c 6f 00 00  00 00 02 00 0e 00 10 40  |.hello.........@|
000027e0  0d 68 65 6c 6c 6f 00 00  00 00 02 00 0d 00 10 40  |.hello.........@|
000027f0  0d 68 65 6c 6c 6f 00 00  00 00 02 00 0c 00 10 40  |.hello.........@|
00002800  0d 68 65 6c 6c 6f 00 00  00 00 02 00 0b 00 10 40  |.hello.........@|
00002810  0d 68 65 6c 6c 6f 00 00  00 00 02 00 0a 00 10 40  |.hello.........@|

Query again at this point, and it still returns 2 rows

postgres=# explain select info from t2 where info = 'abcd';
                                QUERY PLAN                                 
---------------------------------------------------------------------------
 Index Only Scan using t2_info_idx on t2  (cost=0.29..4.32 rows=2 width=5)
   Index Cond: (info = 'abcd'::text)
(2 rows)

postgres=# select info from t2 where info = 'abcd';
 info 
------
 abcd
 abcd
(2 rows)

That is because of caching: shared hit=3

postgres=# explain(analyze,buffers) select info from t2 where info = 'abcd';
                                                     QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
 Index Only Scan using t2_info_idx on t2  (cost=0.29..4.32 rows=2 width=5) (actual time=0.028..0.031 rows=2 loops=1)
   Index Cond: (info = 'abcd'::text)
   Heap Fetches: 0
   Buffers: shared hit=3
 Planning Time: 0.084 ms
 Execution Time: 0.067 ms
(6 rows)

postgres=# select info from t2 where info = 'abcd';
 info 
------
 abcd
 abcd
(2 rows)

So restart the database and clear the operating system cache with echo 3 > /proc/sys/vm/drop_caches

[postgres@xiongcc ~]$ pg_ctl -D 13data/ stop
waiting for server to shut down.... done
server stopped
[root@xiongcc ~]# echo 3 > /proc/sys/vm/drop_caches
[postgres@xiongcc ~]$ pg_ctl -D 13data/ start
waiting for server to start....2021-12-07 10:30:13.629 CST [14348] LOG:  redirecting log output to logging collector process
2021-12-07 10:30:13.629 CST [14348] HINT:  Future log output will appear in directory "log".
 done
server started

[postgres@xiongcc ~]$ psql -p 5435
psql (13.2)
Type "help" for help.

postgres=# explain select info from t2 where info = 'abcd';
                                QUERY PLAN                                 
---------------------------------------------------------------------------
 Index Only Scan using t2_info_idx on t2  (cost=0.29..4.32 rows=2 width=5)
   Index Cond: (info = 'abcd'::text)
(2 rows)

postgres=# set enable_indexonlyscan to off;
SET
postgres=# set enable_indexscan to off;
SET
postgres=# set enable_bitmapscan to off;
SET
postgres=# explain select info from t2 where info = 'abcd';
                     QUERY PLAN                     
----------------------------------------------------
 Seq Scan on t2  (cost=0.00..180.03 rows=2 width=5)
   Filter: (info = 'abcd'::text)
(2 rows)

postgres=# select info from t2 where info = 'abcd';            ---the sequential scan returns 2 rows
 info 
------
 abcd
 abcd
(2 rows)

[postgres@xiongcc ~]$ psql -p 5435
psql (13.2)
Type "help" for help.

postgres=# explain select info from t2 where info = 'abcd';
                                QUERY PLAN                                 
---------------------------------------------------------------------------
 Index Only Scan using t2_info_idx on t2  (cost=0.29..4.32 rows=2 width=5)
   Index Cond: (info = 'abcd'::text)
(2 rows)

postgres=# select info from t2 where info = 'abcd';            ---the index scan returns 1 row
 info 
------
 abcd
(1 row)

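So the index scan returns only 1 row, while the sequential scan returns 2.

Simulating index corruption with dd

As mentioned earlier, PostgreSQL can reliably identify a "zero page". Let's simulate one with dd.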

postgres=# truncate table t2;
TRUNCATE TABLE
postgres=# insert into t2 values(1,'abcd');
INSERT 0 1
postgres=# insert into t2 values(2,'abcd');
INSERT 0 1
postgres=# insert into t2 select n,'hello' from generate_series(1,10000) as n;
INSERT 0 10000
postgres=# analyze t2;
ANALYZE
postgres=# select relfilenode,relpages,relname from pg_class where relname in ('t2','t2_info_idx');
 relfilenode | relpages |   relname   
-------------+----------+-------------
       17149 |       55 | t2
       17151 |       29 | t2_info_idx
(2 rows)

postgres=# select pg_relation_filepath('t2_info_idx');
 pg_relation_filepath 
----------------------
 base/13578/17151
(1 row)

postgres=# select * from bt_page_items('t2_info_idx',1) ;            ---the entries with key abcd are on index page 1
 itemoffset | ctid  | itemlen | nulls | vars |          data           | dead | htid  | tids 
------------+-------+---------+-------+------+-------------------------+------+-------+------
          1 | (0,1) |      16 | f     | t    | 0d 68 65 6c 6c 6f 00 00 |      |       | 
          2 | (0,1) |      16 | f     | t    | 0b 61 62 63 64 00 00 00 | f    | (0,1) | 
          3 | (0,2) |      16 | f     | t    | 0b 61 62 63 64 00 00 00 | f    | (0,2) | 
(3 rows)

postgres=# select * from bt_page_items('t2_info_idx',2) limit 5;    ---index page 2
 itemoffset |   ctid   | itemlen | nulls | vars |          data           | dead |  htid  | tids 
------------+----------+---------+-------+------+-------------------------+------+--------+------
          1 | (2,4097) |      24 | f     | t    | 0d 68 65 6c 6c 6f 00 00 |      | (2,22) | 
          2 | (0,3)    |      16 | f     | t    | 0d 68 65 6c 6c 6f 00 00 | f    | (0,3)  | 
          3 | (0,4)    |      16 | f     | t    | 0d 68 65 6c 6c 6f 00 00 | f    | (0,4)  | 
          4 | (0,5)    |      16 | f     | t    | 0d 68 65 6c 6c 6f 00 00 | f    | (0,5)  | 
          5 | (0,6)    |      16 | f     | t    | 0d 68 65 6c 6c 6f 00 00 | f    | (0,6)  | 
(5 rows)

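Use dd to wipe one block and construct a "zero page". Note that with seek=0 the command below zeroes the first 8 KB block of the file, which is block 0 (the B-tree metapage), not block 1 where the abcd entries live.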

postgres=# \! dd if=/dev/zero of=/home/postgres/13data/base/13578/17151 seek=0 bs=8192 count=1 conv=notrunc
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.00013355 s, 61.3 MB/s

postgres=# explain select info from t2 where info = 'abcd';
                                QUERY PLAN                                 
---------------------------------------------------------------------------
 Index Only Scan using t2_info_idx on t2  (cost=0.29..4.32 rows=2 width=5)
   Index Cond: (info = 'abcd'::text)
(2 rows)

postgres=# select info from t2 where info = 'abcd';
 info 
------
 abcd
 abcd
(2 rows)

postgres=# \q
[postgres@xiongcc ~]$ pg_ctl -D 13data/ stop            ---rule out the database cache
waiting for server to shut down.... done
server stopped
[root@xiongcc ~]# echo 3 > /proc/sys/vm/drop_caches     ---rule out the operating system cache
[postgres@xiongcc ~]$ pg_ctl -D 13data/ start
waiting for server to start....2021-12-07 10:43:12.841 CST [14756] LOG:  redirecting log output to logging collector process
2021-12-07 10:43:12.841 CST [14756] HINT:  Future log output will appear in directory "log".
 done
server started

Check again, and PostgreSQL now reports an error

[postgres@xiongcc ~]$ psql -p 5435
psql (13.2)
Type "help" for help.

postgres=# explain select info from t2 where info = 'abcd';
ERROR:  index "t2_info_idx" contains unexpected zero page at block 0
HINT:  Please REINDEX it.
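A zero page like the one behind this error can also be spotted offline by reading the relation file block by block. The sketch below assumes 8 KB pages; find_zero_pages is a made-up helper, and the demo runs against a throwaway temporary file rather than a real data directory. An all-zero page is only a candidate for closer inspection, not proof of corruption in every kind of relation.

```python
import os
import tempfile


def find_zero_pages(path, block_size=8192):
    """Return the block numbers of pages that consist only of zero bytes."""
    zero_page = bytes(block_size)
    zero_blocks = []
    with open(path, "rb") as f:
        block_number = 0
        while True:
            page = f.read(block_size)
            if not page:
                break
            if page == zero_page:
                zero_blocks.append(block_number)
            block_number += 1
    return zero_blocks


# Demo: a three-page file whose first and last pages are zeroed.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(8192))
    tmp.write(b"\x01" * 8192)
    tmp.write(bytes(8192))
demo_result = find_zero_pages(tmp.name)
os.unlink(tmp.name)
print(demo_result)
```

Run against a copy of the index file from this test, such a scan would be expected to list block 0, the block named in the error message.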

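We can check the index with the amcheck extension.

The amcheck module provides functions that let you verify the logical consistency of the structure of relations. If the structure appears valid, no error is raised.

As shown below, it reports the index error as well.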

postgres=# SELECT bt_index_check(index => c.oid, heapallindexed => i.indisunique),
               c.relname,
               c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE am.amname = 'btree' AND n.nspname = 'public'
-- Don't check temp tables, which may be from another session:
AND c.relpersistence != 't'
-- Function may throw an error when this is omitted:
AND c.relkind = 'i' AND i.indisready AND i.indisvalid
ORDER BY c.relpages DESC LIMIT 10;
ERROR:  index "t2_info_idx" contains unexpected zero page at block 0
HINT:  Please REINDEX it.

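So PostgreSQL can reliably identify a "zero page".

In addition, the earlier kind of corruption (the edited key) can also be detected, in this case by running the amcheck query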

postgres=# explain select info from t2 where info = 'abcd';
                                QUERY PLAN                                 
---------------------------------------------------------------------------
 Index Only Scan using t2_info_idx on t2  (cost=0.29..4.32 rows=2 width=5)
   Index Cond: (info = 'abcd'::text)
(2 rows)

postgres=# select info from t2 where info = 'abcd';
 info 
------
 abcd
(1 row)

postgres=# set enable_indexonlyscan to off;
SET
postgres=# set enable_bitmapscan to off;
SET
postgres=# set enable_indexscan to off;
SET

postgres=# select info from t2 where info = 'abcd';
 info 
------
 abcd
 abcd
(2 rows)

postgres=# SELECT bt_index_check(index => c.oid, heapallindexed => i.indisunique),
               c.relname,
               c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE am.amname = 'btree' AND n.nspname = 'public'
-- Don't check temp tables, which may be from another session:
AND c.relpersistence != 't'
-- Function may throw an error when this is omitted:
AND c.relkind = 'i' AND i.indisready AND i.indisvalid
ORDER BY c.relpages DESC LIMIT 10;
ERROR:  high key invariant violated for index "t2_info_idx"
DETAIL:  Index tid=(1,3) points to heap tid=(54,12) page lsn=0/5C95C778.

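Conclusion

This kind of case is very rare, and most DBAs do not need to work at such a low level. Still, when developers insist that the data is wrong, it is worth asking one more question: could it be the index?

Indexes also bloat, just like tables, as rows are inserted, updated, and deleted. B-tree indexes additionally go through page splits and merges, which leave holes in index pages. Index pages are also reused differently from heap pages: because an index is an ordered structure, an item can only be inserted into the page where it belongs in that order, whereas a heap tuple can go wherever there is free space. So once an index has bloated, it usually has to be rebuilt to shrink it. We can observe this with the pgstattuple extension; after the rebuild the index is back to normal.

So inspecting indexes for bloat on a regular basis, and running REINDEX on a schedule, is well worth doing. You never know when you might hit this.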