
I'm having weird issues with the Hadoop Namenode and Secondary Namenode. Our HDFS cluster runs smoothly most of the time, but every now and then either the Primary Namenode freezes (taking the whole cluster down) or the Secondary Namenode freezes and stops making checkpoints.

Whenever this happens, I try to restart the hanging service, which fails because the port is still blocked:

# service hadoop-namenode restart
 * Stopping Hadoop namenode: 
no namenode to stop
 * Starting Hadoop namenode: 
starting namenode, logging to /var/log/hadoop/hadoop-hdfs-namenode-HOST.out
Error: Exception thrown by the agent : java.rmi.server.ExportException: Port already in use: 26100; nested exception is: 
        java.net.BindException: Address already in use
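
The "Exception thrown by the agent" part looks to me like the JMX/RMI agent failing to bind rather than the RPC port itself, but I'm only guessing. If that guess is right, something like the following should show where 26100 is configured (I'm assuming the usual /etc/hadoop/conf location here):

# grep -rn 26100 /etc/hadoop/conf/
# grep -n jmxremote /etc/hadoop/conf/hadoop-env.sh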

However, according to the output of ps auxw, no Namenode is running anymore. Checking which process is blocking the port, I get:

# netstat -tulpen | grep 26100
tcp        0      0 0.0.0.0:26100           0.0.0.0:*               LISTEN      6001       20067       -

which isn't helpful at all. It says the port is in use, but the owning process is shown as -.
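
Since netstat at least reports the socket inode (20067), I suppose the inode could be chased through /proc directly, along these lines (the backslashes are needed because -lname does shell-style pattern matching):

# find /proc -lname 'socket:\[20067\]' 2>/dev/null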

ss isn't any more helpful:

# ss -apne | grep 26100
tcp    LISTEN     34     50                     *:26100                 *:*      uid:6001 ino:20067 sk:000015c1 <->
tcp    CLOSE-WAIT 224    0              127.0.0.1:26100         127.0.0.1:56770  ino:0 sk:00000527 -->
...
tcp    CLOSE-WAIT 13     0              127.0.0.1:26100         127.0.0.1:56762  ino:0 sk:0000078f -->
tcp    CLOSE-WAIT 17     0              127.0.0.1:26100         127.0.0.1:56772  ino:0 sk:000007b1 -->

The only thing both commands tell me is that the socket is owned by UID 6001, which is the hdfs user. Checking ps auxw for processes owned by hdfs, I can see that there is one zombie process:

hdfs      4947  4.8  0.0      0     0 ?        Zl   Feb23 435:50 [java] <defunct>

So for some reason, restarting the Namenode service leaves behind a zombie that keeps the port blocked. Unfortunately, there is no way I can get rid of this process, because its only parent is init:

# pstree -ps 4947
init(1)───java(4947)───{java}(9957)
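
Since pstree shows one surviving thread (9957), my next idea would be to check what that task is stuck on in the kernel, roughly like this (assuming /proc/<tid>/stack is available on this kernel):

# grep State /proc/9957/status
# cat /proc/9957/stack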

The only workaround is either restarting the operating system (out of the question) or temporarily moving the Namenode (or Secondary Namenode) to another port.
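
For what it's worth, the temporary port change is just a different port number in the config before starting the service again. Assuming 26100 is the JMX agent port set via HADOOP_NAMENODE_OPTS in hadoop-env.sh (which I haven't verified), a sketch would be:

export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote.port=26101 $HADOOP_NAMENODE_OPTS"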

Do you have any idea what the reason for this weird bug may be? I couldn't find any hints in dmesg.

The cluster has 130 nodes, each running Ubuntu 14.04 Trusty with kernel 4.2.0-30-generic #35~14.04.1-Ubuntu. The Hadoop version is 2.7.1.
