1. Simulating a corrupted Standby NameNode
(1) Check the state of the NameNodes.
[hadoop@big82 current]$ hdfs haadmin -getAllServiceState
big81:9000 active
big82:9000 standby
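The state check above can also be scripted. The following is a minimal sketch, assuming the command prints one "host:port state" pair per line as in the output shown; the function name nn_in_state is made up for illustration.

```shell
#!/usr/bin/env bash
# Print the host:port of every NameNode reported in the requested HA state.
# Reads `hdfs haadmin -getAllServiceState`-style lines ("host:port state") on stdin.
# nn_in_state is a hypothetical helper name, not part of Hadoop.
nn_in_state() {
  local want="$1"
  # NF == 2 skips log lines and "Failed to connect" lines, which have more fields.
  awk -v want="$want" 'NF == 2 && $2 == want { print $1 }'
}

# Example using the output shown above (no cluster needed):
printf 'big81:9000 active\nbig82:9000 standby\n' | nn_in_state standby
# prints: big82:9000

# On a live cluster:
# hdfs haadmin -getAllServiceState 2>/dev/null | nn_in_state standby
```

Filtering on the field count keeps the helper usable even when one NameNode is down and the command prints an error line for it.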
(2) Corrupt the Standby NameNode
[hadoop@big82 current]$ pwd
/data02/current
[hadoop@big82 current]$ ll
total 2172
-rw-rw-r-- 1 hadoop hadoop 42 Apr 17 09:06 edits_0000000000000000001-0000000000000000002
-rw-rw-r-- 1 hadoop hadoop 42 Apr 17 09:06 edits_0000000000000000003-0000000000000000004
-rw-rw-r-- 1 hadoop hadoop 42 Apr 17 09:06 edits_0000000000000000005-0000000000000000006
-rw-rw-r-- 1 hadoop hadoop 42 Apr 17 09:06 edits_0000000000000000007-0000000000000000008
-rw-rw-r-- 1 hadoop hadoop 42 Apr 17 09:06 edits_0000000000000000009-0000000000000000010
-rw-rw-r-- 1 hadoop hadoop 42 Apr 17 09:06 edits_0000000000000000011-0000000000000000012
-rw-rw-r-- 1 hadoop hadoop 42 Apr 17 09:06 edits_0000000000000000013-0000000000000000014
-rw-rw-r-- 1 hadoop hadoop 42 Apr 17 09:06 edits_0000000000000000015-0000000000000000016
-rw-rw-r-- 1 hadoop hadoop 42 Apr 17 09:06 edits_0000000000000000017-0000000000000000018
-rw-rw-r-- 1 hadoop hadoop 42 Apr 17 09:06 edits_0000000000000000019-0000000000000000020
-rw-rw-r-- 1 hadoop hadoop 42 Apr 17 09:06 edits_0000000000000000021-0000000000000000022
-rw-rw-r-- 1 hadoop hadoop 116 Apr 17 09:06 edits_0000000000000000023-0000000000000000025
-rw-rw-r-- 1 hadoop hadoop 42 Apr 17 09:06 edits_0000000000000000026-0000000000000000027
-rw-rw-r-- 1 hadoop hadoop 42 Apr 17 09:06 edits_0000000000000000028-0000000000000000029
-rw-rw-r-- 1 hadoop hadoop 42 Apr 17 09:06 edits_0000000000000000030-0000000000000000031
-rw-rw-r-- 1 hadoop hadoop 42 Apr 17 09:06 edits_0000000000000000032-0000000000000000033
-rw-rw-r-- 1 hadoop hadoop 42 Apr 17 09:06 edits_0000000000000000034-0000000000000000035
-rw-rw-r-- 1 hadoop hadoop 42 Apr 17 09:06 edits_0000000000000000036-0000000000000000037
-rw-rw-r-- 1 hadoop hadoop 42 Apr 17 09:06 edits_0000000000000000038-0000000000000000039
-rw-rw-r-- 1 hadoop hadoop 42 Apr 17 09:06 edits_0000000000000000040-0000000000000000041
-rw-rw-r-- 1 hadoop hadoop 1048576 Apr 17 09:06 edits_0000000000000000042-0000000000000000042
-rw-rw-r-- 1 hadoop hadoop 42 Apr 17 09:06 edits_0000000000000000043-0000000000000000044
-rw-rw-r-- 1 hadoop hadoop 42 Apr 17 09:06 edits_0000000000000000045-0000000000000000046
-rw-rw-r-- 1 hadoop hadoop 42 Apr 17 09:06 edits_0000000000000000047-0000000000000000048
-rw-rw-r-- 1 hadoop hadoop 42 Apr 17 09:06 edits_0000000000000000049-0000000000000000050
-rw-rw-r-- 1 hadoop hadoop 42 Apr 17 09:06 edits_0000000000000000051-0000000000000000052
-rw-rw-r-- 1 hadoop hadoop 42 Apr 17 09:06 edits_0000000000000000053-0000000000000000054
-rw-rw-r-- 1 hadoop hadoop 42 Apr 17 09:06 edits_0000000000000000055-0000000000000000056
-rw-rw-r-- 1 hadoop hadoop 1048576 Apr 17 09:06 edits_inprogress_0000000000000000057
-rw-rw-r-- 1 hadoop hadoop 388 Apr 17 09:06 fsimage_0000000000000000000
-rw-rw-r-- 1 hadoop hadoop 62 Apr 17 09:06 fsimage_0000000000000000000.md5
-rw-rw-r-- 1 hadoop hadoop 3 Apr 17 09:06 seen_txid
-rw-rw-r-- 1 hadoop hadoop 216 Apr 17 09:06 VERSION
[hadoop@big82 current]$ rm -rf ./* --delete everything in the Standby NameNode metadata directory.
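Because rm -rf is irreversible, it may be worth archiving the directory before running a test like this. A minimal sketch follows; the helper name backup_nn_dir and the backup location /data02/backup are assumptions, not part of the original setup.

```shell
#!/usr/bin/env bash
# Archive a NameNode metadata directory before a destructive test.
# Usage: backup_nn_dir <metadata-dir> <backup-dir>
# backup_nn_dir is a hypothetical helper name.
backup_nn_dir() {
  local src="$1" dest="$2" archive
  archive="$dest/nn-meta-$(date +%Y%m%d-%H%M%S).tar.gz"
  mkdir -p "$dest" || return 1
  # -C keeps the paths inside the archive relative to the metadata directory.
  tar -czf "$archive" -C "$src" . || return 1
  # Print the archive path so the caller can record it.
  echo "$archive"
}

# Hypothetical call for this walkthrough (the backup path is an assumption):
# backup_nn_dir /data02/current /data02/backup
```

The backup directory should sit outside the metadata directory so that the archive does not end up inside itself.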
(3) Check the state of the Standby NameNode.
[hadoop@big82 current]$ jps
18594 ResourceManager
19732 NameNode --still alive.
17803 DFSZKFailoverController
20014 Jps
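Check through the web UI:
http://192.168.1.82:50070 or
http://192.168.1.82:50070/dfshealth.html#tab-overview
Although all of the Standby NameNode's on-disk data has been deleted, the process is still running from memory and has not yet noticed the damage at the operating-system level.
Now let's stop the Standby NameNode and see what happens.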
[hadoop@big82 current]$ hdfs --daemon stop namenode
[hadoop@big82 current]$ hdfs haadmin -getAllServiceState
big81:9000 active
2022-04-17 09:22:34,731 INFO ipc.Client: Retrying connect to server: big82/192.168.1.82:9000. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
big82:9000 Failed to connect: Call From big82/192.168.1.82 to big82:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
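A connection exception now appears: with the NameNode process on big82 stopped, the cluster reports the Standby NameNode as unreachable.
The DataNodes keep logging the following error: the IPC client cannot connect to the NameNode on big82, that is, the Standby NameNode: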
2022-04-17 09:23:29,841 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: big82/192.168.1.82:9000. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
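The "Failed to connect" line in the haadmin output above can be picked out automatically, for example in a monitoring script. This is a minimal sketch that assumes the error line keeps the "host:port Failed to connect: ..." shape shown above; nn_unreachable is a made-up helper name.

```shell
#!/usr/bin/env bash
# List NameNodes that `hdfs haadmin -getAllServiceState` could not reach.
# Reads the command's output on stdin; unreachable entries are assumed to look
# like "host:port Failed to connect: ...", as in the transcript above.
# nn_unreachable is a hypothetical helper name.
nn_unreachable() {
  awk '$2 == "Failed" && $3 == "to" && $4 == "connect:" { print $1 }'
}

# Example using lines shaped like the transcript above (no cluster needed):
printf '%s\n' \
  'big81:9000 active' \
  'big82:9000 Failed to connect: Call From big82/192.168.1.82 to big82:9000 failed on connection exception' \
  | nn_unreachable
# prints: big82:9000
```

On a live cluster the input would come from hdfs haadmin -getAllServiceState 2>&1, so that the line is captured whichever stream it is written to.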
(4) Repair the Standby NameNode
Copy all of the metadata from the active NameNode into the same directory on the Standby NameNode, then restart the Standby NameNode.
[hadoop@big82 current]$ pwd
/data02/current
[hadoop@big82 current]$ scp big81:/data02/current/* .
edits_0000000000000000001-0000000000000000002 100% 42 20.2KB/s 00:00
edits_0000000000000000003-0000000000000000004 100% 42 25.6KB/s 00:00
edits_0000000000000000005-0000000000000000006 100% 42 30.4KB/s 00:00
edits_0000000000000000007-0000000000000000008 100% 42 25.1KB/s 00:00
edits_0000000000000000009-0000000000000000010 100% 42 30.6KB/s 00:00
edits_0000000000000000011-0000000000000000012 100% 42 31.4KB/s 00:00
edits_0000000000000000013-0000000000000000014 100% 42 31.1KB/s 00:00
edits_0000000000000000015-0000000000000000016 100% 42 31.4KB/s 00:00
edits_0000000000000000017-0000000000000000018 100% 42 25.8KB/s 00:00
edits_0000000000000000019-0000000000000000020 100% 42 27.6KB/s 00:00
edits_0000000000000000021-0000000000000000022 100% 42 27.7KB/s 00:00
edits_0000000000000000023-0000000000000000025 100% 116 56.5KB/s 00:00
edits_0000000000000000026-0000000000000000027 100% 42 31.8KB/s 00:00
edits_0000000000000000028-0000000000000000029 100% 42 33.8KB/s 00:00
edits_0000000000000000030-0000000000000000031 100% 42 28.5KB/s 00:00
edits_0000000000000000032-0000000000000000033 100% 42 31.0KB/s 00:00
edits_0000000000000000034-0000000000000000035 100% 42 32.5KB/s 00:00
edits_0000000000000000036-0000000000000000037 100% 42 41.6KB/s 00:00
edits_0000000000000000038-0000000000000000039 100% 42 34.3KB/s 00:00
edits_0000000000000000040-0000000000000000041 100% 42 24.3KB/s 00:00
edits_0000000000000000042-0000000000000000042 100% 1024KB 70.5MB/s 00:00
edits_0000000000000000043-0000000000000000044 100% 42 23.4KB/s 00:00
edits_0000000000000000045-0000000000000000046 100% 42 28.2KB/s 00:00
edits_0000000000000000047-0000000000000000048 100% 42 18.8KB/s 00:00
edits_0000000000000000049-0000000000000000050 100% 42 12.2KB/s 00:00
edits_0000000000000000051-0000000000000000052 100% 42 13.7KB/s 00:00
edits_0000000000000000053-0000000000000000054 100% 42 14.4KB/s 00:00
edits_0000000000000000055-0000000000000000056 100% 42 29.6KB/s 00:00
edits_0000000000000000057-0000000000000000058 100% 42 32.7KB/s 00:00
edits_0000000000000000059-0000000000000000060 100% 42 29.9KB/s 00:00
edits_0000000000000000061-0000000000000000062 100% 42 29.2KB/s 00:00
edits_0000000000000000063-0000000000000000064 100% 42 27.4KB/s 00:00
edits_0000000000000000065-0000000000000000066 100% 42 29.6KB/s 00:00
edits_0000000000000000067-0000000000000000068 100% 42 26.5KB/s 00:00
edits_0000000000000000069-0000000000000000070 100% 42 30.6KB/s 00:00
edits_inprogress_0000000000000000071 100% 1024KB 54.8MB/s 00:00
fsimage_0000000000000000000 100% 388 113.2KB/s 00:00
fsimage_0000000000000000000.md5 100% 62 20.4KB/s 00:00
seen_txid 100% 3 1.4KB/s 00:00
VERSION
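After the copy, the fsimage can be checked against its .md5 file before the NameNode is started. The sketch below assumes the .md5 file uses the "<md5> *<file name>" layout that GNU md5sum -c accepts; that is worth confirming by looking at the file on your own cluster. The helper name verify_fsimages is made up for illustration.

```shell
#!/usr/bin/env bash
# Verify every fsimage in a metadata directory against its .md5 sidecar file.
# Assumption: each sidecar holds one "<md5> *<file name>" line, which is the
# format `md5sum -c` reads. verify_fsimages is a hypothetical helper name.
verify_fsimages() {
  local dir="$1" status=0 sidecar
  for sidecar in "$dir"/fsimage_*.md5; do
    # If the glob matched nothing, the literal pattern is left and does not exist.
    [ -e "$sidecar" ] || { echo "no fsimage .md5 files in $dir" >&2; return 1; }
    # md5sum -c resolves the file name relative to the current directory,
    # so run it from inside the metadata directory.
    (cd "$dir" && md5sum -c --quiet "$(basename "$sidecar")") || status=1
  done
  return "$status"
}

# Hypothetical call for this walkthrough:
# verify_fsimages /data02/current && echo "fsimage checksums match"
```

A non-zero return means at least one fsimage does not match its recorded checksum, in which case the copy should be repeated before starting the NameNode.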
(5) Start the Standby NameNode
[hadoop@big82 current]$ hdfs --daemon start namenode
[hadoop@big82 current]$ jps
18594 ResourceManager
20434 Jps
20392 NameNode --the Standby NameNode is alive again.
17803 DFSZKFailoverController
[hadoop@big82 current]$ hdfs haadmin -getAllServiceState --check the state.
big81:9000 active
big82:9000 standby
You can also open http://192.168.1.82:50070 to check the detailed state of the Standby NameNode.
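A restarted NameNode may take a moment to report its state, so the check in step (5) can be wrapped in a small polling loop. This is a minimal sketch; wait_for_state is a made-up helper, and the service ID nn2 in the commented call is an assumption (use the ID configured for big82 in your hdfs-site.xml).

```shell
#!/usr/bin/env bash
# Poll a state-reporting command until it prints the expected state.
# Usage: wait_for_state <expected> <attempts> <delay-seconds> <command...>
# wait_for_state is a hypothetical helper name.
wait_for_state() {
  local expected="$1" attempts="$2" delay="$3"
  shift 3
  local i state
  for ((i = 1; i <= attempts; i++)); do
    # Errors (for example "Connection refused") count as "not there yet".
    state="$("$@" 2>/dev/null || true)"
    if [ "$state" = "$expected" ]; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Hypothetical call: wait up to 30 x 2 seconds for the NameNode with service ID
# nn2 (an assumed ID) to report "standby".
# wait_for_state standby 30 2 hdfs haadmin -getServiceState nn2
```

Passing the command as arguments keeps the loop independent of Hadoop, so it can be tried out with any command that prints a single word.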