Questions tagged [hadoop]

Hadoop is an open-source framework that provides a distributed, replicated file system and a production-grade MapReduce system, along with complementary projects such as Hive, Pig, and HBase that get more out of a Hadoop-powered cluster.

Hadoop is an Apache Software Foundation project, with commercial support provided by multiple vendors, including Cloudera, Hortonworks, and MapR. Apache documents a more complete list of commercial solutions.

Core components and complementary additions to Hadoop include:

  • Hadoop Distributed File System, HDFS (standard)
  • The MapReduce architecture (standard)
  • Hive, which provides a SQL-like interface to the MapReduce architecture
  • HBase, a distributed key-value service

262 questions
3
votes
3 answers

Hadoop - /usr/bin/hadoop: line 320: /usr/bin/java/bin/java: Not a directory

I am installing Hadoop on CentOS 6.4. Following these instructions http://hadoop.apache.org/docs/stable/single_node_setup.html wget http://apache.osuosl.org/hadoop/common/hadoop-1.1.2/hadoop-1.1.2-1.x86_64.rpm chmod 700 hadoop-1.1.2-1.x86_64.rpm rpm…
davidjhp
  • 630
  • 2
  • 7
  • 13
3
votes
2 answers

How does hadoop decide what its nodes hostnames are?

Currently the urls generated by the jobtracker & namenode return either hostnames like bubbles.local or just bubbles. These end up not resolving unless the client machine has specified these in their /etc/hosts file. When I run the hostname command…
Dan R
  • 2,275
  • 1
  • 19
  • 27
3
votes
3 answers

Hadoop ecosystem web dashboard

I am trying to find a tool, which would show me an overview of my Hadoop ecosystem - state, health, running tasks, etc. I tried to Google, but did not find any. Is there some nice useful tool?
Vojtech
  • 31
  • 2
3
votes
1 answer

starting hadoop on mac os lion

I want to start Hadoop on my MacBook Pro, and I followed all the steps in the Apache instructions. When I use the command "bin/start-all.sh", I get this: starting namenode, logging to…
AliBZ
  • 253
  • 1
  • 2
  • 10
3
votes
2 answers

Is there a way to get a list of Hadoop cluster machines from one of the data nodes?

I have access to a data node in a Hadoop cluster, and I'd like to find out the identity of the name nodes for the same cluster. Is there a way to do this?
Yuval
  • 217
  • 1
  • 6
  • 11
3
votes
1 answer

Best practice for administering a (hadoop) cluster

I've recently been playing with Hadoop. I have a six node cluster up and running - with HDFS, and having run a number of MapRed jobs. So far, so good. However I'm now looking to do this more systematically and with a larger number of nodes. Our base…
Alex
3
votes
4 answers

Interpreting exim log files after parsing

I'm parsing exim log files and, due to my processing method, lose the original order of all entries in this file. I rebuild the transactions by their transaction ID (i.e. 1OfiYX-0000Ev-7k) but still don't have a way to determine the original…
gnucom
  • 197
  • 8
3
votes
1 answer

Hadoop slaves file necessary?

I'm working on a team trying to create a system for creating Hadoop clusters on EC2 with minimal effort on the part of the user. Ideally, we would like slave instances to only require the hostname of the master instance as user data on boot. The…
Tim Yates
  • 225
  • 2
  • 6
2
votes
0 answers

Yarn error: Failed to create Spark client for Spark session

I'm a bit new to this and have little experience, would appreciate your help. I'm trying to install Hive on an existing Spark installation. I mostly followed the instructions in this page with no…
hnagaty
  • 131
  • 1
  • 4
2
votes
0 answers

Hadoop Streaming with Python 3.5: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 127

I'm trying to run my own mapper and reducer Python scripts using Hadoop Streaming on my cluster built on VMware Workstation VMs. Hadoop version - 2.7, Python - 3.5, OS - CentOS 7.2 on all the VMs. I have a separate machine which plays a role of a…
alex
  • 21
  • 1
  • 3
2
votes
0 answers

AWS-Hadoop Data Analytics Implementation for Multiple JSON Files

I am new to Hadoop and AWS. I have set up a multi-node (4 t2.large instances) AWS EC2 cluster with the Cloudera Hadoop distribution. I have tested the environment with basic examples using CSV files, such as word count. Now, my main project is to analyze…
Rash
  • 21
  • 1
2
votes
1 answer

Running HDFS with only 1 data node - appending fails

I'm trying to test some services that require HDFS using Docker Compose. Since the services being tested, namenode, and data node(s) will all be running on the same physical machine (dev laptop), it would be nice to reduce the memory usage by only…
2
votes
1 answer

Possible to ssh into a server without using -i flag for key?

I have 3 EC2 instances and they all use the same private key. I'm setting up a hadoop cluster between these nodes and they require passwordless entry for this to work. How can I use this private key to easily ssh into the servers with keyless entry?…
coderkid
  • 173
  • 1
  • 5
2
votes
2 answers

How to remove RAID option from HP DL360 Gen 9 for HDFS

I am setting up a brand new DL360 G9 Server for use in a Hadoop cluster proof-of-concept. As HDFS will be taking care of the RAID, I need to bypass this option in the G9 array controller (Smart Array P440ar). I just can't find where to do that - IF…
Sketch
  • 21
  • 1
  • 2
2
votes
1 answer

Should I deploy hadoop on physical machines or virtual machines?

We will deploy a Hadoop cluster on hundreds (say 300) of physical x86 nodes. Since we don't have much production deployment experience, we'd like to hear from experienced people on the question in the title. What are the best practices?…
John Wang
  • 97
  • 2
  • 12