
I have been enjoying reading ServerFault for a while and I have come across quite a few topics on Hadoop. I have had a little trouble finding out what it does from a global point of view.

So my question is quite simple: What is Hadoop? What does it do? What is it used for? Why does it kick ass?

Edit: If anyone happens to have demonstrations/explanations of use cases in which Hadoop was used, that would be fantastic.

Bart De Vos
Antoine Benkemoun
  • Facebook makes heavy use of Hadoop (well really Hive which is a layer on top of Hadoop). There is a good writeup of it on the Facebook Engineering page. http://www.facebook.com/note.php?note_id=89508453919 – John Meagher Jun 18 '09 at 17:30
  • Hadoop is a framework that makes processing large amounts of data (_big data_) simple by distributing the data across the nodes/servers of a cluster and running the processing in parallel. This programming model is known as MapReduce. – Mr_Green Jul 21 '14 at 06:03

3 Answers


Straight from the horse's mouth:

Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or reexecuted on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.

Map/Reduce is a programming paradigm that was made popular by Google, wherein a task is divided into small portions and distributed to a large number of nodes for processing (map), and the results are then summarized into the final answer (reduce). Google and Yahoo use this for their search engine technology, among other things.
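
To make the shape of the paradigm concrete, here is a minimal word-count sketch in plain Python. This is not Hadoop code, and the function names are my own; it just shows the two phases:

    from collections import defaultdict

    def map_phase(document):
        # Map: emit a (word, 1) pair for every word in the input fragment.
        for word in document.split():
            yield (word.lower(), 1)

    def reduce_phase(pairs):
        # Reduce: sum the counts emitted for each distinct word.
        counts = defaultdict(int)
        for word, count in pairs:
            counts[word] += count
        return dict(counts)

    # On a real cluster the map calls run on many nodes in parallel, and the
    # framework sorts/groups the pairs by key before handing them to reduce.
    documents = ["the quick brown fox", "the lazy dog jumps", "the fox"]
    pairs = [p for doc in documents for p in map_phase(doc)]
    print(reduce_phase(pairs))   # {'the': 3, 'fox': 2, ...}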

Hadoop is a generic framework for implementing this kind of processing scheme. As for why it kicks ass, mostly because it provides neat features such as fault tolerance and lets you bring together pretty much any kind of hardware to do the processing. It also scales extremely well, provided your problem fits the paradigm.

You can read all about it on the website.

As for some examples, Paul gave a few, but here are a few more that are not so web-centric:

  • Rendering a 3D film. The "map" step distributes the geometry for every frame to a different node, the nodes render it, and the rendered frames are recombined in the "reduce" step.
  • Computing the energy in a system in a molecular model. Each frame of a system trajectory is distributed to a node in the "map" step, the nodes compute the energy for each frame, and the results are summarized in the "reduce" step.

Essentially, the model works very well for any problem that can be broken down into similar discrete computations that are completely independent and can be recombined to produce a final result.
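
As a toy illustration of that shape (the frame_energy function here is hypothetical, standing in for a real force-field evaluation), Python's multiprocessing pool plays the "map" role and sum() plays the "reduce":

    from multiprocessing import Pool

    def frame_energy(frame):
        # Hypothetical stand-in for a real energy computation: each frame
        # is completely independent of the others, which is exactly what
        # makes the problem map/reduce-friendly.
        return sum(x * x for x in frame)

    if __name__ == "__main__":
        trajectory = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # toy "frames"
        with Pool() as pool:
            energies = pool.map(frame_energy, trajectory)  # "map" step
        print(sum(energies))                               # "reduce" step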

Kamil Kisiel

Cloudera have some great videos that explain the principles behind Map Reduce and Hadoop.

http://www.cloudera.com/hadoop-training-basic

One of the core ideas behind MapReduce is that for large data sets you are going to be I/O-bound on your disks, so HDFS splits the data up between lots of nodes, enabling Hadoop to process it in parallel.
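
As a rough local analogy (the file name and block size are made up; HDFS blocks were 64 MB by default at the time), splitting the input into independent blocks is what lets each one be handled by a different worker:

    from concurrent.futures import ProcessPoolExecutor

    BLOCK_SIZE = 1024 * 1024  # HDFS uses far larger blocks (64 MB by default)

    def split_blocks(path):
        # Read the file in fixed-size chunks, mimicking how HDFS stores a
        # large file as independent blocks spread over many machines.
        with open(path) as f:
            while True:
                chunk = f.read(BLOCK_SIZE)
                if not chunk:
                    break
                yield chunk

    def count_lines(chunk):
        return chunk.count("\n")

    if __name__ == "__main__":
        blocks = list(split_blocks("access.log"))  # hypothetical input file
        with ProcessPoolExecutor() as pool:
            print(sum(pool.map(count_lines, blocks)))  # one block per worker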

Some uses of Hadoop of interest to systems administrators often revolve around processing large log file sets. I can only post one link, but Google should find these (there is also a short streaming sketch after this list):

  1. Rackspace mail log query
  2. Apache log analysis with pig - see Cloudera blog
  3. Yahoo! fight spam
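
As a minimal sketch of that log-analysis pattern with Hadoop Streaming (the file names are mine, as is the assumption that the HTTP status code is the 9th whitespace-separated field of a common/combined-format Apache log):

    #!/usr/bin/env python
    # mapper.py - emit a (status_code, 1) pair per access-log line.
    import sys

    for line in sys.stdin:
        fields = line.split()
        if len(fields) > 8:             # status code is field 9 in the
            print("%s\t1" % fields[8])  # common/combined log formats

    #!/usr/bin/env python
    # reducer.py - Hadoop delivers mapper output sorted by key, so equal
    # keys arrive as one contiguous run and can be summed on the fly.
    import sys

    current, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = key, 0
        count += int(value)
    if current is not None:
        print("%s\t%d" % (current, count))

You can test the pair locally with: cat access.log | ./mapper.py | sort | ./reducer.py — and then submit it to the cluster through the streaming jar that ships with Hadoop.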

Hadoop was initially developed for handling large data sets in an OLAP environment.

With the introduction of HBase on top of Hadoop, it can be used for OLTP processing as well. Hadoop is a framework with subcomponents like MapReduce, HDFS, HBase, and Pig.

I found an article covering the basics of Hadoop in Why Hadoop is introduced.

In Hadoop, data is stored as files, not in tables and columns.

Deepak