Exception information
A Hive query over an external table, or a cross-cluster HDFS copy, fails with: Cannot obtain block length for LocatedBlock. This typically means a file was never closed properly (for example, a writing client such as Flume died mid-write), leaving its last block in an under-construction state whose final length cannot be determined.
org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1647235016030_0015_1_00, diagnostics=[Task failed, taskId=task_1647235016030_0015_1_00_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : attempt_1647235016030_0015_1_00_000000_0:java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: java.io.IOException: Cannot obtain block length for LocatedBlock{BP-658896538-172.16.0.231-1618368143316:blk_1074121079_380284; getBlockSize()=1754; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[172.16.0.160:50010,DS-c2b56f6d-70e8-41c2-aa83-752ef9c283de,DISK], DatanodeInfoWithStorage[172.16.0.6:50010,DS-17d41643-588f-4e03-a460-30c3469511f5,DISK]]} at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:211) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374) at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73) at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844) at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61) at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.RuntimeException: java.io.IOException: java.io.IOException: Cannot obtain block length for LocatedBlock{BP-658896538-172.16.0.231-1618368143316:blk_1074121079_380284; getBlockSize()=1754; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[172.16.0.160:50010,DS-c2b56f6d-70e8-41c2-aa83-752ef9c283de,DISK], DatanodeInfoWithStorage[172.16.0.6:50010,DS-17d41643-588f-4e03-a460-30c3469511f5,DISK]]} at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:206) at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:152) at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:116) at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:62) at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:419) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:185) ... 14 more
Use the hdfs fsck command to locate the abnormal files
hdfs fsck / -openforwrite
This checks whether any files are stuck in the OPENFORWRITE state; the / argument is the root directory to scan.
If you only want to print the file names, you can use something like:
hadoop fsck / -openforwrite | egrep -v '^\.+$' | egrep "MISSING|OPENFORWRITE" | grep -o "/[^ ]*"
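To sketch what this pipeline extracts, here is the same filter chain applied to a single made-up line written in the style of fsck output (the path, size, and status below are invented for illustration; real fsck output will differ):

```shell
#!/bin/bash
# A fabricated sample line for illustration only -- not real fsck output.
sample='/ods/events/access/2021-10-19/flume.123.log 1754 bytes, 1 block(s), OPENFORWRITE:'
# Same filter chain as above: keep lines flagged MISSING or OPENFORWRITE,
# then pull out the first space-delimited token starting with "/" (the path).
echo "${sample}" \
  | egrep "MISSING|OPENFORWRITE" \
  | grep -o "/[^ ]*"
```

The `egrep -v '^\.+$'` stage in the full pipeline simply drops fsck's progress lines, which consist only of dots.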
Use the hdfs debug recoverLease command to release the lease:
hdfs debug recoverLease -path <path-of-the-file> [-retries <retry-times>]
Run the command above to repair an abnormal file. Note that the argument to -path must be the absolute path of a file; a directory is not accepted. For example:
hdfs debug recoverLease -path /ods/events/access/2021-10-20/flume.1634720226917.log -retries 5
Writing an automated repair script
In practice, Hive jobs often run on a daily schedule, and any file left in this abnormal state will break them. We can therefore write a daily scheduled script to repair such files automatically.
#!/bin/bash
# Today's date (YYYY-MM-DD)
MYDATE=`date +%F`
# Run fsck to list OPENFORWRITE files. The trailing `grep -v ${MYDATE}` skips
# files under today's date, since today's partition is normally still being
# written to (the tables here are partitioned by day; adjust as needed).
FILELIST=`hadoop fsck /ods/events/ -openforwrite | egrep -v '^\.+$' | egrep "MISSING|OPENFORWRITE" | grep -o "/[^ ]*" | sed -e "s/:$//" | grep -v ${MYDATE}`
# Repair each file in the list, one by one
for mypath in ${FILELIST}
do
  hdfs debug recoverLease -path ${mypath} -retries 5
done
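The `grep -v ${MYDATE}` guard is worth sketching in isolation: it drops any path containing today's date, so files still being written by today's ingestion are left alone. The two paths below are invented for illustration:

```shell
#!/bin/bash
MYDATE=$(date +%F)
# Two made-up paths: one dated today, one dated earlier.
# Only the earlier one survives the filter.
printf '%s\n' "/ods/events/access/${MYDATE}/flume.1.log" \
              "/ods/events/access/2021-10-20/flume.2.log" \
  | grep -v "${MYDATE}"
```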
Finally, schedule the script with crontab or your usual workflow scheduler; the details are not covered here.
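As one possible setup, a crontab entry could run the script daily before the Hive jobs start. The install and log paths below are hypothetical; substitute your own:

```shell
# Hypothetical paths -- adjust to where you keep the script and its log.
# Runs the repair script every day at 01:00, appending all output to a log.
0 1 * * * /opt/scripts/recover_openforwrite.sh >> /var/log/recover_openforwrite.log 2>&1
```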