
I'm new to Hadoop and trying to understand how it should be installed and configured. From the documentation I see that Hadoop is normally expected to be aware of the physical server layout (e.g. for replicating data across racks). What is not clear to me is how to achieve correct replication if Hadoop is installed on top of a hypervisor layer (e.g. using OpenStack). Could you please point me to articles/documentation on this?

Pavel

1 Answer


That depends on your topology.
If you are spreading your OpenStack environment across multiple racks/switches and plan on moving Hadoop HDFS nodes (VMs) around, then you can't specify a static topology (or you would have to update it after every VM migration).
The cluster will still work correctly; it will just be less efficient than on bare-metal servers and less resilient to outages.
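If you do know (and control) where your VMs land, you can still feed Hadoop a rack mapping via a topology script, wired up through the net.topology.script.file.name property in core-site.xml (topology.script.file.name on older Hadoop 1.x releases). Below is a minimal sketch of such a script; the IP-to-rack mapping and the script path are assumptions for illustration, not something specific to your OpenStack setup. Hadoop invokes the script with one or more host names/IPs as arguments and expects one rack path per argument on standard output.

```python
#!/usr/bin/env python3
# Minimal Hadoop rack topology script (a sketch -- the host-to-rack map below
# is an assumption; fill it in from your actual hypervisor/rack placement).
#
# Wire it up in core-site.xml, e.g.:
#   net.topology.script.file.name = /etc/hadoop/conf/topology.py
#
# Hadoop calls the script with one or more hostnames/IPs as arguments and
# reads one rack path per argument from stdout.
import sys

# Hypothetical mapping of DataNode hosts (or the hypervisors they run on) to racks.
RACK_MAP = {
    "10.0.1.11": "/rack1",
    "10.0.1.12": "/rack1",
    "10.0.2.21": "/rack2",
    "10.0.2.22": "/rack2",
}

DEFAULT_RACK = "/default-rack"  # fallback for hosts not in the map

def main():
    for host in sys.argv[1:]:
        print(RACK_MAP.get(host, DEFAULT_RACK))

if __name__ == "__main__":
    main()
```

Once the cluster is running, hdfs fsck with the -racks option will report which rack each block replica landed on, so you can check that the mapping is actually being honoured. Keep in mind that if VMs migrate between hypervisors, this static mapping goes stale, which is exactly the caveat above.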

For more information you should read http://wiki.apache.org/hadoop/Virtual%20Hadoop, which also states:

The most significant implication is in storage. A core architectural design of both Google's GFS and Hadoop's HDFS is that three-way replication onto local storage is a low-cost yet reliable way of storing petabytes of data. This design relies on Hadoop's awareness of the physical topology (rack and host) so that it can place data blocks intelligently across racks and hosts and survive host or rack failures. In some cloud vendors' infrastructure this design is no longer valid, because they do not expose physical topology information (even in abstracted form) to the customer. In that case you will be disappointed when one day all your data disappears, and please do not complain if this happens after reading this page: you have been warned. If your cloud vendor does expose this information in some way (or promises the machines are physical, not virtual), or if you own your cloud infrastructure, the situation is different and you can still have a Hadoop cluster that is as reliable as in a physical environment.

faker