I have had a 10-node HBase cluster up and running for the past 4 months. The cluster was set up on VMs in a corporate environment that I do not control, but everything had been working great... until today.
Today, every part of the system went down. I restarted everything and it came back up for a bit, but then it went down again (particularly HBase, though I think that was a consequence of the following issue).
There is an error in the HDFS logs that says:
HdfsCanaryCdh4{hdfs://hbase-1.internal:8020} for hdfs://hbase-1.internal:8020: Failed to read /tmp/.cloudera_health_monitoring_canary_files/.canary_file_2014_04_15-17_39_25. Error: org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): java.lang.NullPointerException
at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.sortLocatedBlocks(DatanodeManager.java:334)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1343)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:413)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:172)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44938)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1752)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1748)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1746)
When I jump onto the Name Node and run:
sudo -u hdfs hdfs dfs -cat /tmp/.cloudera_health_monitoring_canary_files/.canary_file_2014_04_15-17_39_25
I get back a single line that says: cat: java.lang.NullPointerException
I also double-checked that the disks weren't full (they aren't) and that I have connectivity (everything appears normal, and nobody else has touched this system, as I am the only one with access).
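For reference, checks of this kind can be sketched roughly as below. This is not a record of the exact commands I ran: the function name hdfs_health_checks and the default path are made up for illustration, and it assumes the hdfs CLI is on the PATH and that sudo -u hdfs is allowed on the node.

```shell
#!/usr/bin/env bash
# Sketch only: the kind of checks described above, not the exact commands that were run.
# Assumes the `hdfs` CLI is on PATH and that `sudo -u hdfs` is permitted on this host.

hdfs_health_checks() {
  # Hypothetical helper name; $1 is the HDFS path to inspect (defaults to /tmp).
  local target="${1:-/tmp}"

  # Local disk usage on this node.
  df -h

  # Cluster capacity plus live/dead DataNodes as reported by the NameNode.
  sudo -u hdfs hdfs dfsadmin -report

  # Per-file block status and replica locations under the target path.
  sudo -u hdfs hdfs fsck "$target" -files -blocks -locations
}
```

Calling hdfs_health_checks /tmp/.cloudera_health_monitoring_canary_files on the Name Node would cover the canary directory from the error above; the fsck output is the part most likely to show whether a block's replica locations look wrong.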
I am completely stumped as to what is happening here. I can provide more details if needed, but I'm not even sure where to go from here.
Update
Per Mark's request in the comments, the output of:
sudo -u hdfs hdfs dfs -lsr /tmp/
is
drwxrwxrwx - hdfs supergroup 0 2014-04-16 09:48 /tmp/.cloudera_health_monitoring_canary_files
-rw-rw-rw- 3 hdfs supergroup 56 2014-04-15 16:59 /tmp/.cloudera_health_monitoring_canary_files/.canary_file_2014_04_15-16_59_24
[continues like this for all the files in the directory]