failed shard on node [XXX], failed recovery, failure RecoveryFailedException - CFANZ Programming Community

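Problem description

The machines in our server room lost power and were later brought back up. After that, the cluster status turned red.

Key error message: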

nested: IndexShardRecoveryException[failed recovery]; nested: ElasticsearchException[java.io.IOException: failed to read /home/wsn/es/es7.5/node_2/data/nodes/0/indices/QGft9wywTOeSNjcsz_UUHA/3/_state/retention-leases-91171.st]

 "index" : "device_search_2020",
  "shard" : 3,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2021-04-14T02:23:35.837Z",
    "failed_allocation_attempts" : 5,
    "details" : """failed shard on node [LwWiAwmdQCiEibtiF7oqxQ]: failed recovery, failure RecoveryFailedException[[device_search_20201204][3]: Recovery failed on {reading_10.10.2.75_node2}{LwWiAwmdQCiEibtiF7oqxQ}{YVadGK2FSDKbR69l0Wu0xg}{10.10.2.75}{10.10.2.75:9402}{dil}{ml.machine_memory=539647844352, xpack.installed=true, ml.max_open_jobs=20}]; nested: IndexShardRecoveryException[failed recovery]; nested: ElasticsearchException[java.io.IOException: failed to read /home/wsn/es/es7.5/node_2/data/nodes/0/indices/QGft9wywTOeSNjcsz_UUHA/3/_state/retention-leases-91171.st]; nested: IOException[failed to read /home/wsn/es/es7.5/node_2/data/nodes/0/indices/QGft9wywTOeSNjcsz_UUHA/3/_state/retention-leases-91171.st]; nested: IOException[org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=892219961 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/home/wsn/es/es7.5/node_2/data/nodes/0/indices/QGft9wywTOeSNjcsz_UUHA/3/_state/retention-leases-91171.st")))]; nested: CorruptIndexException[codec footer mismatch (file truncated?): actual footer=892219961 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/home/wsn/es/es7.5/node_2/data/nodes/0/indices/QGft9wywTOeSNjcsz_UUHA/3/_state/retention-leases-91171.st")))]; """,
    "last_allocation_status" : "no"
  }
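The `details` string is long, so it can help to pull out just the file that failed to load and the corruption message. Below is a minimal sketch, assuming the message format shown above. The function names are hypothetical and `DETAILS` is an abbreviated copy of the field quoted above.

```python
import re
from typing import List, Optional

PATH = ("/home/wsn/es/es7.5/node_2/data/nodes/0/indices/"
        "QGft9wywTOeSNjcsz_UUHA/3/_state/retention-leases-91171.st")

# Abbreviated copy of the "details" field (node attributes and repeats removed).
DETAILS = (
    "failed shard on node [LwWiAwmdQCiEibtiF7oqxQ]: failed recovery, failure "
    "RecoveryFailedException[[device_search_20201204][3]: Recovery failed on "
    "{reading_10.10.2.75_node2}{LwWiAwmdQCiEibtiF7oqxQ}]; "
    "nested: IndexShardRecoveryException[failed recovery]; "
    "nested: IOException[failed to read " + PATH + "]; "
    "nested: CorruptIndexException[codec footer mismatch (file truncated?): "
    "actual footer=892219961 vs expected footer=-1071082520 "
    '(resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="' + PATH + '")))]; '
)


def corrupted_files(details: str) -> List[str]:
    """Return the distinct paths named in 'failed to read <path>' messages."""
    paths = re.findall(r"failed to read ([^\]\s]+)", details)
    return list(dict.fromkeys(paths))  # de-duplicate, keep first-seen order


def corruption_reason(details: str) -> Optional[str]:
    """Return the text of the CorruptIndexException, without the resource part."""
    match = re.search(r"CorruptIndexException\[(.+?) \(resource=", details)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(corrupted_files(DETAILS))
    print(corruption_reason(DETAILS))
```

Here this narrows the whole message down to one file, retention-leases-91171.st, and one cause, a codec footer mismatch.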

Cluster status as shown in elasticsearch-head:

[Screenshots: elasticsearch-head cluster overview]

Checking the cluster status in Kibana

GET /_cluster/allocation/explain?pretty

Result: the response contains the unassigned_info block quoted above.

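Error analysis

Cause of the error: the cluster in the server room was forcibly shut down by the power cut, which left it in an abnormal state. When the cluster then tried to recover, it reported: IOException[failed to read /home/wsn/es/es7.5/node_2/data/nodes/0/indices/QGft9wywTOeSNjcsz_UUHA/3/_state/retention-leases-91171.st];

Because of the power cut, some files were never fully flushed to disk. During recovery, Elasticsearch checks whether these files are what it expects. The file had not been saved completely, so the check fails: the log reports a codec footer mismatch and asks whether the file was truncated. The cluster therefore refuses to accept this shard copy, and the cluster status shows red.

At this point it looked as if the file was corrupted, so how could the shard be recovered?

We searched online for a long time. Posts on overseas sites said that this error cannot be recovered from and that the data has to be restored from a snapshot.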
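To make the footer check more concrete, here is a minimal read-only sketch of the kind of validation that fails in the log. It assumes the 16-byte footer layout that, as far as we understand it, Lucene writes at the end of such files: a big-endian int32 magic value, an int32 algorithm id, and an int64 CRC32 of everything before the checksum. The function name `check_footer` is hypothetical. The magic value is the "expected footer" from the error message.

```python
import struct
import zlib

FOOTER_MAGIC = -1071082520  # "expected footer" value from the error message
FOOTER_LENGTH = 16          # assumed: int32 magic + int32 algorithm id + int64 CRC32


def check_footer(path: str) -> str:
    """Read a file and report whether its trailing footer looks intact."""
    with open(path, "rb") as f:
        data = f.read()  # state files are small, so reading them whole is fine
    if len(data) < FOOTER_LENGTH:
        return "too short to contain a footer"
    magic, _algorithm_id, stored_crc = struct.unpack(">iiq", data[-FOOTER_LENGTH:])
    if magic != FOOTER_MAGIC:
        # This is the situation reported as "codec footer mismatch (file truncated?)".
        return f"footer mismatch: actual footer={magic} vs expected footer={FOOTER_MAGIC}"
    # The checksum covers every byte before the 8-byte checksum field itself.
    if zlib.crc32(data[:-8]) != stored_crc:
        return "checksum mismatch"
    return "ok"
```

If a file is cut short, the last 16 bytes no longer start with the magic value, so the first comparison already fails. That matches a write interrupted by a power cut.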

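Solution

The solution we arrived at by trial and error: fix the shard by rerouting it.

Run the command below in Kibana. Replace the index name, shard number, and data node with your own values; how to find them is described further down. Note that the command only runs with "accept_data_loss" set to true.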

# index: the index that has the problem
# shard: the shard that has the problem
# node:  the data node that holds the shard copy
POST _cluster/reroute
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "device_search_20201204",
        "shard": 3,
        "node": "reading_10.10.2.75_node2",
        "accept_data_loss": true
      }
    }
  ]
}
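If you prefer a script to the Kibana console, the same request can be sketched in Python using only the standard library. The helper names `stale_primary_command` and `post_reroute` are hypothetical, and the base URL is an assumption: the post does not state the cluster's HTTP address, and the sketch assumes no authentication.

```python
import json
import urllib.request


def stale_primary_command(index: str, shard: int, node: str) -> dict:
    """Build the body of a _cluster/reroute request that promotes a stale primary."""
    return {
        "commands": [
            {
                "allocate_stale_primary": {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    # Must be set explicitly; the copy on disk may be out of date.
                    "accept_data_loss": True,
                }
            }
        ]
    }


def post_reroute(base_url: str, body: dict) -> dict:
    """POST the body to <base_url>/_cluster/reroute and return the parsed response."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/_cluster/reroute",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Index, shard, and node as reported by the allocation explain output.
    body = stale_primary_command("device_search_20201204", 3, "reading_10.10.2.75_node2")
    print(json.dumps(body, indent=2))
    # post_reroute("http://localhost:9200", body)  # assumed address; uncomment to send
```

The body is built separately from the send step so that it can be printed and checked before anything is posted to the cluster.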

How to find the problem index, shard, and node:

Run this command in Kibana:

GET /_cluster/allocation/explain?pretty

The output looks like this:

[Screenshot: allocation explain output in Kibana]
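The three values needed for the reroute command all appear in the `details` text of the explain output, in the part that reads `[[index][shard]: Recovery failed on {node name}{node id}...`. Below is a minimal sketch that extracts them, assuming the message keeps that shape. The function name `reroute_targets` is hypothetical.

```python
import re

# Matches "...[[<index>][<shard>]: Recovery failed on {<node name>}{<node id>}..."
RECOVERY_FAILED = re.compile(
    r"\[\[(?P<index>[^\]]+)\]\[(?P<shard>\d+)\]: "
    r"Recovery failed on \{(?P<node>[^}]+)\}"
)


def reroute_targets(details: str) -> dict:
    """Extract index, shard number, and node name from an explain 'details' string."""
    match = RECOVERY_FAILED.search(details)
    if match is None:
        raise ValueError("details does not look like a RecoveryFailedException message")
    return {
        "index": match.group("index"),
        "shard": int(match.group("shard")),
        "node": match.group("node"),
    }
```

For the message in this post, the extracted node name is the first `{...}` group after "Recovery failed on", which is reading_10.10.2.75_node2.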

After the repair command ran, the red shard turned green. The reroute skipped the check that failed in the error above, namely reading retention-leases-91171.st: IOException[failed to read /home/wsn/es/es7.5/node_2/data/nodes/0/indices/QGft9wywTOeSNjcsz_UUHA/3/_state/retention-leases-91171.st];

Rerouting regenerated this file.
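To confirm the result without looking at a dashboard, the cluster health API can be queried for the affected index. Below is a minimal sketch using only the standard library. The helper names are hypothetical, and the base URL is an assumption, since the post does not state the cluster's HTTP address.

```python
import json
import urllib.request


def fetch_index_health(base_url: str, index: str) -> dict:
    """GET <base_url>/_cluster/health/<index> and return the parsed response."""
    url = f"{base_url.rstrip('/')}/_cluster/health/{index}"
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def primaries_assigned(health: dict) -> bool:
    """Red means at least one primary is unassigned; yellow or green means none are."""
    return health.get("status") in ("green", "yellow")


def fully_green(health: dict) -> bool:
    """Green means every primary and every replica is assigned."""
    return health.get("status") == "green"


if __name__ == "__main__":
    # Assumed address; replace with your own cluster's HTTP endpoint.
    health = fetch_index_health("http://localhost:9200", "device_search_20201204")
    print(health.get("status"), "unassigned shards:", health.get("unassigned_shards"))
```

A yellow status right after the reroute would only mean that replicas are still being rebuilt from the newly assigned primary.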