
I am installing Hadoop 1.1.2 on CentOS 6.4.

I read all the Hadoop documentation at http://hadoop.apache.org/docs/stable/

After installing, I noticed there are many shell scripts in /usr/sbin/, but the documentation does not explain what most of them do.

For example:

hadoop-create-user.sh
hadoop-setup-conf.sh
hadoop-setup-hdfs.sh
hadoop-setup-single-node.sh
hadoop-validate-setup.sh
slaves.sh
start-balancer.sh
start-jobhistoryserver.sh
stop-balancer.sh
stop-jobhistoryserver.sh
update-hadoop-env.sh

Is there some supplemental documentation to get an explanation of these scripts?

davidjhp

1 Answer


hadoop-create-user.sh sets up the named user's home directory in HDFS under the /user path.
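
By hand, that amounts to roughly the following (a sketch, run as the HDFS superuser; "alice" is a placeholder username and the group may differ on your cluster):

    hadoop fs -mkdir /user/alice               # create the home directory in HDFS
    hadoop fs -chown alice:alice /user/alice   # hand ownership to the user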

hadoop-setup-conf.sh is used to bootstrap the cluster configuration on a new cluster.
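
I don't have a reference for its flags; going from memory of the 1.x packaging scripts, an invocation looks something like the sketch below, but treat every flag name here as an assumption and check the script's own usage block first:

    # Hypothetical flags; verify against the usage text inside the script itself.
    hadoop-setup-conf.sh --namenode-host=nn.example.com \
                         --jobtracker-host=jt.example.com \
                         --conf-dir=/etc/hadoop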

hadoop-setup-hdfs.sh is used to format HDFS and create the standard directory tree inside it. This is a destructive tool: run against an existing cluster, it can cause data loss.
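
The manual equivalent is roughly this (a sketch, and just as destructive; formatting wipes the namenode's existing metadata):

    hadoop namenode -format    # initialize a fresh namespace (destroys any existing one)
    start-dfs.sh               # bring up the HDFS daemons
    hadoop fs -mkdir /tmp      # standard directory tree
    hadoop fs -chmod 1777 /tmp # sticky and world-writable, like /tmp on Linux
    hadoop fs -mkdir /user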

hadoop-setup-single-node.sh is for setting up a single-node deployment, often known as a pseudo-distributed cluster. It runs all of the necessary daemons on one machine.
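
Once it has run, you can confirm everything is up with jps (ships with the JDK); on a Hadoop 1.x pseudo-distributed node you should see all five daemons:

    jps
    # Expect (PIDs will differ): NameNode, DataNode, SecondaryNameNode,
    # JobTracker, and TaskTracker, all on this one machine.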

hadoop-validate-setup.sh runs teragen, terasort, and teravalidate as a way to smoke-test your cluster and make sure it's working properly. It's a basic benchmark.
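
You can run the same smoke test by hand against the examples jar. The jar name below matches the 1.1.2 tarball; adjust the path for your install, and note the row count is kept small so the test stays quick:

    hadoop jar hadoop-examples-1.1.2.jar teragen 1000000 /bench/tera-in
    hadoop jar hadoop-examples-1.1.2.jar terasort /bench/tera-in /bench/tera-out
    hadoop jar hadoop-examples-1.1.2.jar teravalidate /bench/tera-out /bench/tera-report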

slaves.sh allows you to run a command on all slaves in a cluster (basically, the datanodes).
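
It reads the host list from conf/slaves and runs the command on each host over ssh, so anything non-interactive works:

    slaves.sh uptime    # quick health check across every slave
    slaves.sh df -h /   # check root-disk usage cluster-wide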

start-balancer.sh runs the HDFS balancer, which shuffles blocks between the datanodes so that they all end up using a (roughly) equal share of disk space. This is a housekeeping task that should be run periodically.
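
"Roughly equal" is tunable: the balancer accepts a -threshold argument, the maximum percentage-point deviation in disk usage it will tolerate (10 is the default):

    start-balancer.sh -threshold 5   # balance more aggressively than the default of 10
    stop-balancer.sh                 # safe to interrupt at any time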

start-jobhistoryserver.sh starts the job history server, which provides information on the jobs that have been run on the MapReduce side of the cluster.

stop-balancer.sh and stop-jobhistoryserver.sh are the opposite of the above two.
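
The start/stop pair is symmetric, so a quick check looks like:

    start-jobhistoryserver.sh
    jps | grep -i history     # confirm the daemon is up (exact name varies by version)
    stop-jobhistoryserver.sh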

update-hadoop-env.sh updates the hadoop-env.sh script, which is used to set up common environment variables needed by all hadoop tools and daemons in the cluster.
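
For reference, hadoop-env.sh is sourced by every daemon and CLI tool, and typically carries settings like these (the paths are examples):

    export JAVA_HOME=/usr/java/default      # point at your JDK install
    export HADOOP_HEAPSIZE=1000             # daemon max heap, in MB
    export HADOOP_LOG_DIR=/var/log/hadoop   # where daemon logs are written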

There's not really much in the way of documentation for some of this stuff. You just need to dig around in the scripts to see what they're really doing.
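
A few ways to dig without actually executing anything:

    head -n 40 /usr/sbin/hadoop-setup-hdfs.sh      # most scripts open with a comment/usage block
    grep -n usage /usr/sbin/hadoop-setup-conf.sh   # jump straight to the help text
    bash -n /usr/sbin/slaves.sh                    # parse only; runs nothing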

Travis Campbell