Fixing the Replica Set Secondary Error "Could not find member to sync from"

Symptoms

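In the previous experiment, I shut down the Secondary to test how a Secondary outage affects the primary in a PSA (Primary-Secondary-Arbiter) architecture. After the test I restarted the Secondary and found that it could no longer sync, no matter what I tried. The mongod log showed the following errors: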

{"t":{"$date":"2022-07-05T11:24:18.939+08:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"Checkpointer","msg":"WiredTiger message","attr":{"message":"[1656991458:939195][24399:0x7fa0ebe2d700], WT_SESSION.checkpoint: [WT_VERB_CHECKPOINT_PROGRESS] saving checkpoint snapshot min: 2228, snapshot max: 2228 snapshot count: 0, oldest timestamp: (1656919948, 1) , meta checkpoint timestamp: (1656920248, 1) base write gen: 23803"}}
{"t":{"$date":"2022-07-05T11:24:42.973+08:00"},"s":"I", "c":"REPL", "id":21799, "ctx":"BackgroundSync","msg":"Sync source candidate chosen","attr":{"syncSource":"vm002:27017"}}
{"t":{"$date":"2022-07-05T11:24:42.974+08:00"},"s":"I", "c":"REPL", "id":5579708, "ctx":"ReplCoordExtern-0","msg":"We are too stale to use candidate as a sync source. Denylisting this sync source because our last fetched timestamp is before their earliest timestamp","attr":{"candidate":"vm002:27017","lastOpTimeFetchedTimestamp":{"$timestamp":{"t":1656920248,"i":1}},"remoteEarliestOpTimeTimestamp":{"$timestamp":{"t":1656922797,"i":63501}},"denylistDurationMinutes":1,"denylistUntil":{"$date":"2022-07-05T03:25:42.974Z"}}}
{"t":{"$date":"2022-07-05T11:24:42.974+08:00"},"s":"I", "c":"REPL", "id":21798, "ctx":"ReplCoordExtern-0","msg":"Could not find member to sync from"}


Analysis

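To be honest, for an operations engineer who has only recently started working with MongoDB, JSON-formatted logs like these are a headache to read, and at first I did not analyze the error content carefully. My attention stayed fixed on the single message "Could not find member to sync from".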
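Since the raw JSON lines are hard to scan by eye, a small filter can help. The following is a minimal sketch, assuming each log line is one complete JSON document; the helper name replMessages and the shortened sample lines are illustrative only, with the sample lines copied from the log above.

```javascript
// Keep only replication (REPL) entries from MongoDB structured log text.
// Assumption: every log line is a single, complete JSON document.
function replMessages(logText) {
  return logText
    .split("\n")
    .filter((line) => line.trim().startsWith("{"))
    .map((line) => JSON.parse(line))
    .filter((entry) => entry.c === "REPL")
    .map((entry) => ({
      time: entry.t.$date,
      id: entry.id,
      msg: entry.msg,
      attr: entry.attr || {}, // some entries carry no attr field
    }));
}

// Two lines taken from the log above.
const sampleLog = [
  '{"t":{"$date":"2022-07-05T11:24:42.973+08:00"},"s":"I","c":"REPL","id":21799,"ctx":"BackgroundSync","msg":"Sync source candidate chosen","attr":{"syncSource":"vm002:27017"}}',
  '{"t":{"$date":"2022-07-05T11:24:42.974+08:00"},"s":"I","c":"REPL","id":21798,"ctx":"ReplCoordExtern-0","msg":"Could not find member to sync from"}',
].join("\n");

for (const m of replMessages(sampleLog)) {
  console.log(m.time, m.id, m.msg, JSON.stringify(m.attr));
}
```

Reading the msg and attr fields side by side makes it easier to spot the entry that explains why no sync source was found, rather than only the final error.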

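I started with basic checks: the network, the hosts file, and the state of the Primary were all normal. Following the error message, I also ran rs.syncFrom("vm002:27017"), but the error persisted.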

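rs.syncFrom() temporarily overrides the sync source. The member reverts to the default sync behavior after mongod restarts, or once the chosen sync source falls more than 30 seconds behind another member of the replica set. The sync source must also be reachable, must build indexes (members[n].buildIndexes set to true), and must belong to the same replica set.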

After reading the errors more carefully, I noticed this entry:

{
"t": {
"$date": "2022-07-05T11:24:42.974+08:00"
},
"s": "I",
"c": "REPL",
"id": 5579708,
"ctx": "ReplCoordExtern-0",
"msg": "We are too stale to use candidate as a sync source. Denylisting this sync source because our last fetched timestamp is before their earliest timestamp",
"attr": {
"candidate": "vm002:27017",
"lastOpTimeFetchedTimestamp": {
"$timestamp": {
"t": 1656920248, <---Mon Jul 04 2022 15:37:28 GMT+0800 (CST)
"i": 1
}
},
"remoteEarliestOpTimeTimestamp": {
"$timestamp": {
"t": 1656922797, <---Mon Jul 04 2022 16:19:57 GMT+0800 (CST)
"i": 63501
}
}

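This entry says "We are too stale", which suggests that the oplog entries needed for recovery have already been purged from the sync source and can no longer be found. The current node's lastOpTimeFetchedTimestamp is Mon Jul 04 2022 15:37:28, while the earliest oplog entry on the remote node is Mon Jul 04 2022 16:19:57, so the node cannot sync.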
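The comparison behind "too stale" can be sketched in a few lines. This is an illustration of the reasoning, not MongoDB's internal code; the function names compareTimestamps and isTooStale are hypothetical, and the two timestamps are copied from the log entry above.

```javascript
// Order two { t, i } timestamps: seconds (t) first, then the increment (i).
function compareTimestamps(a, b) {
  return a.t !== b.t ? a.t - b.t : a.i - b.i;
}

// A member is too stale for a candidate when the last entry it fetched is
// older than the oldest entry the candidate still has in its oplog.
function isTooStale(lastFetched, remoteEarliest) {
  return compareTimestamps(lastFetched, remoteEarliest) < 0;
}

// Values from the "We are too stale" log entry.
const lastOpTimeFetched = { t: 1656920248, i: 1 };
const remoteEarliestOpTime = { t: 1656922797, i: 63501 };

console.log("too stale:", isTooStale(lastOpTimeFetched, remoteEarliestOpTime));
console.log("last fetched (UTC):", new Date(lastOpTimeFetched.t * 1000).toISOString());
console.log("remote earliest (UTC):", new Date(remoteEarliestOpTime.t * 1000).toISOString());
console.log("gap in seconds:", remoteEarliestOpTime.t - lastOpTimeFetched.t);
```

The printed times are in UTC, so they read eight hours earlier than the GMT+0800 values quoted above.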

The following output from the Primary also shows that the problem is caused by missing oplog entries:

rs0:PRIMARY> db.getReplicationInfo() 
{
"logSizeMB" : 1023.925537109375,
"usedMB" : 1021.3,
"timeDiff" : 68345,
"timeDiffHours" : 18.98,
"tFirst" : "Mon Jul 04 2022 16:19:57 GMT+0800 (CST)",
"tLast" : "Tue Jul 05 2022 11:19:02 GMT+0800 (CST)",
"now" : "Tue Jul 05 2022 11:19:09 GMT+0800 (CST)"
}

In rs.status(), the lastAppliedWallTime of the member with _id=1 also shows that db2 has only applied operations up to Mon Jul 04 2022 15:37:28.
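The same check can be written against the two pieces of output. This is a rough sketch under stated assumptions: the helper names are hypothetical, times are passed as ISO 8601 strings rather than the shell's display format, and the member list is a hand-built stand-in for rs.status().members that uses the hosts and times quoted in this article (the Primary's lastAppliedWallTime is assumed to equal tLast).

```javascript
// Length of the oplog window in seconds, from the first and last entry times.
function oplogWindowSeconds(tFirst, tLast) {
  return (Date.parse(tLast) - Date.parse(tFirst)) / 1000;
}

// Members whose last applied operation is older than the oldest oplog entry.
// Members without lastAppliedWallTime (assumed here for the arbiter) are skipped.
function membersBehindOplog(members, tFirst) {
  const oldest = Date.parse(tFirst);
  return members.filter(
    (m) =>
      m.lastAppliedWallTime !== undefined &&
      new Date(m.lastAppliedWallTime).getTime() < oldest
  );
}

// tFirst and tLast from db.getReplicationInfo(), rewritten as ISO 8601.
const tFirst = "2022-07-04T16:19:57+08:00";
const tLast = "2022-07-05T11:19:02+08:00";

// Illustrative stand-in for rs.status().members.
const statusMembers = [
  { _id: 0, name: "vm002:27017", lastAppliedWallTime: "2022-07-05T11:19:02+08:00" },
  { _id: 1, name: "vm002:27018", lastAppliedWallTime: "2022-07-04T15:37:28+08:00" },
  { _id: 2, name: "vm002:27019" },
];

console.log("oplog window (s):", oplogWindowSeconds(tFirst, tLast));
console.log("behind the oplog:", membersBehindOplog(statusMembers, tFirst).map((m) => m.name));
```

A member that appears in the second list cannot catch up from that sync source by normal replication, which matches the behavior seen here.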

Solution

Repair db2 by rebuilding it.

Step 1: Remove db2 from the current configuration

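Note that rs.remove("vm002:27018") cannot be used for the cleanup here. rs.remove() needs a majority write to complete, and with the Secondary down the Primary cannot finish the operation. Instead, force the change with rs.reconfig():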

cfg=rs.conf()
cfg.members =  // member list with the _id 1 entry removed
[
{
"_id" : 0,
"host" : "vm002:27017",
"arbiterOnly" : false,
"buildIndexes" : true,
"hidden" : false,
"priority" : 1,
"tags" : {

},
"secondaryDelaySecs" : NumberLong(0),
"votes" : 1
},
{
"_id" : 2,
"host" : "vm002:27019",
"arbiterOnly" : true,
"buildIndexes" : true,
"hidden" : false,
"priority" : 0,
"tags" : {

},
"secondaryDelaySecs" : NumberLong(0),
"votes" : 1
}
]

rs.reconfig(cfg,{force:true});
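Instead of retyping the member list by hand, the entry can be filtered out of the existing configuration. The following is a minimal sketch, assuming the stale member has _id 1 as in this article; withoutMember is a hypothetical helper and sampleCfg is a cut-down stand-in for the output of rs.conf().

```javascript
// Return a copy of a replica set config without the member with the given _id.
function withoutMember(cfg, memberId) {
  return Object.assign({}, cfg, {
    members: cfg.members.filter((m) => m._id !== memberId),
  });
}

// Cut-down stand-in for rs.conf(); only the fields needed here are shown.
const sampleCfg = {
  _id: "rs0",
  members: [
    { _id: 0, host: "vm002:27017" },
    { _id: 1, host: "vm002:27018" },
    { _id: 2, host: "vm002:27019", arbiterOnly: true },
  ],
};

console.log(withoutMember(sampleCfg, 1).members.map((m) => m.host));

// In the mongo shell, on the Primary, the equivalent call would be:
//   rs.reconfig(withoutMember(rs.conf(), 1), { force: true });
```

Filtering keeps every other field of the remaining members exactly as the server reported it, which avoids copy errors in a hand-written list.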

Step 2: Re-add the node to the replica set

Likewise, in a PSA architecture you cannot use rs.add() to add a Secondary or make similar changes; otherwise it fails with "errmsg": "Rejecting reconfig where the new config has a PSA topology and the secondary is electable, but the old config contains only one writable node. Refer to https://docs.mongodb.com/manual/reference/method/rs.reconfigForPSASet/ for next steps on reconfiguring a PSA set."

cfg = rs.conf();
cfg["members"] =  // member list with the new node added as _id 3
[
{
"_id" : 0,
"host" : "vm002:27017",
"arbiterOnly" : false,
"buildIndexes" : true,
"hidden" : false,
"priority" : 1,
"tags" : {

},
"secondaryDelaySecs" : NumberLong(0),
"votes" : 1
},
{
"_id" : 3,
"host" : "vm002:27018",
"arbiterOnly" : false,
"buildIndexes" : true,
"hidden" : false,
"priority" : 1,
"tags" : {

},
"secondaryDelaySecs" : NumberLong(0),
"votes" : 1
},
{
"_id" : 2,
"host" : "vm002:27019",
"arbiterOnly" : true,
"buildIndexes" : true,
"hidden" : false,
"priority" : 0,
"tags" : {

},
"secondaryDelaySecs" : NumberLong(0),
"votes" : 1
}
]

rs0:PRIMARY> rs.reconfigForPSASet(1, cfg);
Running first reconfig to give member at index 1 { votes: 1, priority: 0 }
Running second reconfig to give member at index 1 { priority: 1 }
{
"ok" : 1,
"$clusterTime" : {
"clusterTime" : Timestamp(1657005199, 1),
"signature" : {
"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
"keyId" : NumberLong(0)
}
},
"operationTime" : Timestamp(1657005199, 1)
}
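The new member entry and the index passed to rs.reconfigForPSASet() can also be derived from the current configuration. This is a sketch only: withNewMember is a hypothetical helper, psaCfg is a cut-down stand-in for rs.conf() after step 1, and unlike the listing above it appends the new member at the end of the array, so the index is 2 rather than 1.

```javascript
// Append a new data-bearing member and report its index in the members array.
// The new _id is one more than the highest _id currently in use.
function withNewMember(cfg, host) {
  const nextId = Math.max(...cfg.members.map((m) => m._id)) + 1;
  const members = cfg.members.concat([{ _id: nextId, host: host }]);
  return {
    cfg: Object.assign({}, cfg, { members: members }),
    memberIndex: members.length - 1,
  };
}

// Cut-down stand-in for rs.conf() after db2 has been removed.
const psaCfg = {
  _id: "rs0",
  members: [
    { _id: 0, host: "vm002:27017" },
    { _id: 2, host: "vm002:27019", arbiterOnly: true },
  ],
};

const added = withNewMember(psaCfg, "vm002:27018");
console.log(added.memberIndex, JSON.stringify(added.cfg.members));

// In the mongo shell, on the Primary, the equivalent call would be:
//   const r = withNewMember(rs.conf(), "vm002:27018");
//   rs.reconfigForPSASet(r.memberIndex, r.cfg);
```

The first argument of rs.reconfigForPSASet() is the position of the Secondary in the members array, not its _id, which is why the helper returns the index alongside the config.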

