
I have Nutch/Hadoop running fine in pseudo-distributed mode. I want to add processing capacity by adding new nodes which are smaller than the master (HD 3 times smaller) and, of course, cheaper.

Since the default HDFS replication is 3, after balancing the data I will not get more space, but that is not my primary concern.

Do I still get more processing power?

I don't understand how map/reduce tasks interact with replication. How is it decided which node gets the work, given the different replicas?

– millebii

2 Answers


You will have to move from your pseudo-distributed setup to a real cluster setup, and by doing so you will indeed get more processing capacity out of your cluster by adding more nodes, i.e. you will be able to run more map and reduce tasks in parallel. The increase in processing capacity is, as you would expect, roughly linear in the number of nodes.
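As a rough sketch of that linear scaling (the slot values below are the usual Hadoop 1.x defaults for mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum, and the node counts are just hypothetical numbers):

    // Toy calculation: concurrent task capacity grows linearly with the
    // number of tasktracker nodes. Per-node slot counts are configurable;
    // 2 map + 2 reduce slots per node is the usual Hadoop 1.x default.
    public class SlotMath {
        public static void main(String[] args) {
            int mapSlotsPerNode = 2;
            int reduceSlotsPerNode = 2;
            for (int nodes = 1; nodes <= 4; nodes++) {
                System.out.printf("%d node(s): %d map slots, %d reduce slots%n",
                        nodes, nodes * mapSlotsPerNode, nodes * reduceSlotsPerNode);
            }
        }
    }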

Replication determines the number of replicas present in your cluster for each HDFS block. So let's assume you have a file that is split into 6 blocks; with a replication factor of 3, 18 block replicas will be spread out across your cluster. The more nodes you have, the better the coverage, and thus when it comes time to start the map phase, less data will have to be transferred between datanodes. And to answer your final question, Hadoop will always try to assign map tasks to nodes that serve as datanodes for the input of those map tasks. So in this case, the replicas make that job easier, since there is a larger pool of tasktrackers to choose from.
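To make the block arithmetic and the locality preference concrete, here is a minimal toy model (plain Java, not Hadoop code; the host names and block placement are invented for illustration): 6 blocks at replication 3 means 18 replicas in the cluster, and the jobtracker prefers a tasktracker that already holds a replica of a task's input block.

    import java.util.*;

    // Toy model of block replicas and data-local task assignment.
    // Hosts and placement are hypothetical; in reality the namenode
    // decides placement and the jobtracker does the scheduling.
    public class LocalitySketch {
        public static void main(String[] args) {
            int blocks = 6, replication = 3;
            System.out.println("block replicas in cluster: " + blocks * replication); // 18

            // Datanodes holding a replica of block 0 (invented layout).
            Set<String> replicaHosts = new HashSet<String>(Arrays.asList("node1", "node3", "node4"));
            // Tasktrackers that currently have a free map slot.
            List<String> freeTrackers = Arrays.asList("node2", "node3", "node5");

            // Prefer a tracker co-located with a replica; otherwise take any free one.
            String chosen = freeTrackers.get(0);
            for (String tracker : freeTrackers) {
                if (replicaHosts.contains(tracker)) {
                    chosen = tracker;
                    break;
                }
            }
            System.out.println("map task for block 0 scheduled on: " + chosen); // node3
        }
    }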

  • @diliop so the task is assigned to one and only one tasktracker out of the replicas, right? The jobtracker does that, presumably? – millebii May 22 '11 at 07:53
  • You seem to be confusing the associations between split, map task input and block replica. Every map task gets a single split as input and is run on a single tasktracker. Every split is usually (and by default) equal to the HDFS block size. Blocks are what get replicated, and for each block only a single replica will be used as input. Replication's main purpose is fault tolerance. And yes, the jobtracker is responsible for generating the splits and scheduling tasks on tasktrackers. –  May 23 '11 at 00:29
  • Was this helpful or are you still having issues? –  May 26 '11 at 00:32
  • Yes & no, I think I need to understand the concept of split better and read a few things about it. I just have a few priorities on top before I can test and tick back. – millebii May 26 '11 at 20:40
  • Just to help you in that process, I would strongly advise you to pick up a copy of Hadoop: The definitive guide (http://oreilly.com/catalog/9780596521981) and read the first 5-6 chapters. –  May 26 '11 at 20:44
  • Thx, bought that book; half of my problems are also solved by eventually getting a bigger HD. – millebii May 28 '11 at 13:07

Your question is a bit confusing. If you're running in pseudo-distributed mode, then all four processes (JobTracker, NameNode, DataNode, TaskTracker) are launched on the same (typically development) system.

The Hadoop xxx-site.xml configuration for pseudo-distributed mode has everything talking to localhost, so simply adding new nodes won't work.
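If it helps to check, the relevant keys (using the Hadoop 0.20/1.x names) can be printed with the Configuration API; in pseudo-distributed mode both typically come back as localhost addresses (e.g. hdfs://localhost:9000 and localhost:9001 in the quick-start docs), whereas in a real cluster they must name the master host so the new nodes can reach it. This is just a sketch, assuming your xxx-site.xml files are on the classpath:

    import org.apache.hadoop.mapred.JobConf;

    // Prints the addresses the cluster is configured to use.
    // JobConf extends Configuration and also picks up mapred-site.xml.
    public class ShowClusterConfig {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            // hdfs://localhost:9000 in a typical pseudo-distributed setup
            System.out.println("fs.default.name    = " + conf.get("fs.default.name"));
            // localhost:9001 in a typical pseudo-distributed setup
            System.out.println("mapred.job.tracker = " + conf.get("mapred.job.tracker"));
        }
    }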

Leaving that aside, if you add more nodes, each running both a DataNode and a TaskTracker, then you will get additional storage and CPU capacity.

Note that as you fill up HDFS, 3x replication will eventually no longer be possible once all of the smaller nodes are at capacity, so you'll start getting warnings/errors.
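A quick back-of-the-envelope example of why (the node sizes here are made up): with replication 3 every block needs three distinct datanodes, so once the smaller nodes are full the namenode can no longer place a third replica, no matter how much room is left on the big node.

    // Toy capacity calculation for a heterogeneous 3-node cluster with
    // replication 3. Node sizes are hypothetical; the point is that the
    // leftover space on the large node cannot hold the extra replicas
    // once the small nodes fill up.
    public class CapacitySketch {
        public static void main(String[] args) {
            long[] nodeGb = {600, 200, 200}; // one master-sized node, two smaller ones
            int replication = 3;

            long total = 0;
            for (long gb : nodeGb) total += gb;
            System.out.println("raw capacity:       " + total + " GB");               // 1000
            System.out.println("naive /3 estimate:  " + total / replication + " GB"); // 333

            // With replication equal to the number of nodes, every node must
            // hold a copy of every block, so usable data is bounded by the
            // smallest node, not by the total divided by 3.
            long smallest = nodeGb[0];
            for (long gb : nodeGb) smallest = Math.min(smallest, gb);
            System.out.println("actual usable data: " + smallest + " GB");            // 200
        }
    }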