How to fix Hadoop HDFS cluster with missing blocks after one node was reinstalled?

Question

I have a 5 slave Hadoop cluster (using CDH4)---slaves are where DataNode and TaskNode run. Each slave has 4 partitions dedicated to HDFS storage. One of the slaves needed a reinstall and this caused one of the HDFS partitions to be lost. At this point, HDFS was complaining about 35K missing blocks.

A few days later, the reinstall was complete and I brought the node back online to Hadoop. HDFS remains in safe-mode and the new server is not registering anywhere near the amount of blocks that the other nodes are. For instance, under DFS Admin, the new node shows it has 6K blocks, while the other nodes have about 400K blocks.

Currently, the new node's DataNode logs show it is doing some verification (or copying?) on a variety of blocks, some of which fail as already existing. I believe this is HDFS just replicating existing data to the new node. Example of verification:

2013-08-09 17:05:02,113 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Verification succeeded for BP-143510735-141.212.113.141-1343417513962:blk_6568189110100209829_1733272

Example of failure:

2013-08-09 17:04:48,100 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: meez02.eecs.umich.edu:50010:DataXceiver error processing REPLACE_BLOCK operation  src: /141.212.113.141:52192 dest: /141.212.113.65:50010
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-143510735-141.212.113.141-1343417513962:blk_-4515068373845130948_756319 already exists in state FINALIZED and thus cannot be created.
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:813)
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:92)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:155)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.replaceBlock(DataXceiver.java:846)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReplaceBlock(Receiver.java:137)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:70)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
    at java.lang.Thread.run(Thread.java:679)

In DFS Admin, I can also see that this new node is at 61% capacity (matching other nodes' approximate usage), even though its number of blocks is about 2% of the other nodes. I'm guessing this is just the old data that HDFS is not recognizing.

I suspect one of a few things happened: (a) HDFS abandoned this node's data because of staleness; (b) the reinstall changed some system parameter so HDFS treats it as a brand new node (i.e. not an existing one with data); or (c) somehow the drive mapping got messed up, thus causing the partitions mapping to be changed and HDFS not able to find the old data (although the drives have labels, and I am 95% sure we got this right).

Main question: How can I get HDFS to re-recognize the data on this drive?

answer: restart NameNode, and the nodes will re-report which blocks they have (see Update 1 below)

Sub-question 1: If my assumption of the new node's data usage is correct---that the 61% usage is ghost data---does it ever get cleaned up by HDFS, or do I need to manually remove this?

less of an issue: since a large portion of drive seems to be recognized (see Update 1 below)

Sub-question 2: Currently, I cannot run listCorruptFileBlocks to find the missing blocks due to "replication queues have not been initialized." Any idea how to fix this? Do I have to wait for the new node to rebalance (i.e. this verification/copying phase to end)?

answer: leaving Safe Mode let me run this (see Update 1 below)

Updates

Update 1: I thought I had fixed the issue by restarting my NameNode. This caused the new node's block count to jump up to approximately the same usage as the other nodes, and DFS changed its message to:

Safe mode is ON. The reported blocks 629047 needs additional 8172 blocks to reach the threshold 0.9990 of total blocks 637856. Safe mode will be turned off automatically.

I left it in this state for several hours, hoping that it would finally leave Safe Mode, but nothing has changed. I then manually turned off Safe Mode, and DFS's message changed to, "8800 blocks are missing". At this point, I was able to run hdfs fsk -list-corruptfileblocks, to see a large portion of files that are missing blocks.

Current remaining issue: how to get these missing blocks recovered... (should I spin this off in new question?)

What kind of replication did the cluster have? As you say, "one of the HDFS partitions" was lost. Was that data ever replicated to another DataNode? — Henk Langeveld, Aug 10 '13 at 16:05
*Most* files had replication=3 (the default), some larger datasets had replication=2, and it's possible some users used other replication (e.g. rep=1), but I'm not sure. With rep=2, HDFS should hopefully put that data on different servers, so I'm hoping that someone used rep=1 for some non-important data, and that's what the missing blocks are. I'll email the users to check — Dolan Antenucci, Aug 10 '13 at 16:14
`I left it in this state for several hours, hoping that it would finally leave Safe Mode, but nothing has changed`. Did you check the datanode logs and namenode log? — 030, Jul 04 '15 at 01:05
`At this point, HDFS was complaining about 35K missing blocks` This is possible if one of the nodes dies. The namenode will replicate the blocks. E.g., if the replication factor of a block is three and one block resided on the died node then a new copy has to be created. Did you check the namenode log and datanodes logs whether blocks were replicated and the number of missing blocks decreased? — 030, Jul 04 '15 at 01:08
Unfortunately, I can't recall what I was seeing in the datanode/namenode logs, which I'm pretty sure I would have checked. My "solution" though, which I I can't recall how I did (probably via `dfsadmin`), was to identify the files with missing blocks, and after concluding that we could live with their loss, I deleted the files from HDFS. In my case, there was a node/drive that died *and* replication was set to very low (possibly 1) on the troublesome files. I'll add this as an answer so as not to leave this question unanswered. — Dolan Antenucci, Jul 05 '15 at 13:34

score 2 · Accepted Answer · edited May 23 '17 at 12:41

I ended up having to delete the files with bad blocks, which after further investigation, I realized had a very low replication (rep=1 if I recall correctly).

This SO post has more information on finding the files with bad blocks, using something along the lines of:

hadoop fsck / | egrep -v '^\.+$' | grep -v eplica

So, to answer my own questions:

Can these files be recovered? Not unless the failed nodes/drives are brought back online with the missing data.
How do I get out of safe mode? Remove these troublesome files, and then leave safe mode via dfsadmin.

score 2 · Answer 2 · answered Apr 27 '18 at 14:58

We had a similar problem today. One of our nodes (out of 3, with replication=3) just died on us, and after restarting we started to see this on the affected datanodes' logs:

18/04/27 14:37:22 INFO datanode.DataNode: Receiving BP-1114663060-172.30.36.22-1516109725438:blk_1073743913_3089 src: /172.30.36.26:35300 dest: /172.30.36.25:50010
18/04/27 14:37:22 INFO datanode.DataNode: g500603svhcm:50010:DataXceiver error processing WRITE_BLOCK operation  src: /172.30.36.26:35300 dst: /172.30.36.25:50010; org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-1114663060-172.30.36.22-1516109725438:blk_1073743913_3089 already exists in state FINALIZED and thus cannot be created.

The namenodes' webui shows the datanode having only 92 blocks (out of the 13400 the rest had).

Fixed it by triggering a full block report on the datanode, which updated the namenode's data on it:

hdfs dfsadmin -triggerBlockReport g500603svhcm:50020

The result: the datanode was missing a couple of blocks which it happily accepted and restored the cluster.

How to fix Hadoop HDFS cluster with missing blocks after one node was reinstalled?

Updates

2 Answers2