Questions tagged [mapreduce]

13 questions
4 votes · 2 answers

hadoop-config.sh in bin/ and libexec/

While setting up Hadoop, I found that the hadoop-config.sh script is present in two directories, bin/ and libexec/. Both files are identical. While looking into the scripts, I found that if hadoop-config.sh is present in libexec, then that copy gets executed.…
krackoder
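
For context, a minimal sketch of the dispatch pattern this question describes, assuming the bin/ wrapper scripts follow the common Hadoop convention of preferring the libexec/ copy when it exists (paths are illustrative, not taken from the question):

```bash
# Hypothetical excerpt of a bin/ wrapper script: prefer libexec/hadoop-config.sh,
# fall back to the copy sitting next to the wrapper in bin/.
bin=$(cd -P -- "$(dirname -- "$0")" && pwd -P)

if [ -e "$bin/../libexec/hadoop-config.sh" ]; then
  . "$bin/../libexec/hadoop-config.sh"
else
  . "$bin/hadoop-config.sh"
fi
```
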
4 votes · 1 answer

How do I define the timeout for bootstrap actions on Amazon's Elastic MapReduce?

How do I change the timeout for bootstrap actions on Amazon's Elastic MapReduce?
user76542
3 votes · 1 answer

Best practice for administering a (hadoop) cluster

I've recently been playing with Hadoop. I have a six-node cluster up and running with HDFS and have run a number of MapReduce jobs. So far, so good. However, I'm now looking to do this more systematically and with a larger number of nodes. Our base…
Alex
2 votes · 0 answers

Hadoop Streaming with Python 3.5: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 127

I'm trying to run my own mapper and reducer Python scripts using Hadoop Streaming on my cluster, built on VMware Workstation VMs. Hadoop version: 2.7, Python: 3.5, OS: CentOS 7.2 on all the VMs. I have a separate machine which plays the role of a…
alex
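
Exit code 127 is the shell's conventional "command not found", so a failure like this usually points at the interpreter named in the mapper/reducer command rather than at the scripts themselves. A hedged sketch of a streaming invocation that pins the interpreter explicitly; the jar path, HDFS paths, and script names are assumptions, not values from the question:

```bash
# Ship the scripts to the worker nodes with -files and invoke them through an
# interpreter assumed to exist on every node (here: python3 on the PATH).
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -files mapper.py,reducer.py \
  -mapper "python3 mapper.py" \
  -reducer "python3 reducer.py" \
  -input /user/hadoop/input \
  -output /user/hadoop/output
```
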
1 vote · 0 answers

Sample output of Rumen or Input to Gridmix

I want to see the JobHistory logs that can be fed as input to Rumen. More specifically, I am interested in knowing the input format for Gridmix. I tried the following two things: 1) I found these files: . What is this file exactly? Is this…
PHcoDer
1 vote · 1 answer

Hadoop FileAlreadyExistsException: Output directory hdfs://:9000/input already exists

I have Hadoop set up in fully distributed mode with one master and 3 slaves. I am trying to execute a jar file named Tasks.jar, which takes arg[0] as the input directory and arg[1] as the output directory. In my Hadoop environment, I have the input files…
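
Since MapReduce refuses to write into an output directory that already exists, a common pattern is to delete (or pick a fresh) output path before each run. A sketch, assuming the main class and HDFS paths shown here rather than whatever the question actually used:

```bash
# Remove any previous output directory, then run the job with
# arg[0] = input dir and arg[1] = output dir, as the question describes.
hdfs dfs -rm -r -f /user/hadoop/output
hadoop jar Tasks.jar com.example.Tasks /user/hadoop/input /user/hadoop/output
```
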
1 vote · 2 answers

Updating group without log out or subshell

I'm trying to run Docker with Elastic MapReduce streaming but am running into a permissions issue. In my bootstrap script, I need the "hadoop" user to be part of the "docker" group (as described on the AWS Docker Basics page): sudo usermod -a…
Max
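
One common workaround is to add the user to the group and then start the relevant command under the new group with sg (or newgrp), so the membership takes effect without a fresh login. The command quoted in the question is truncated, so the full flag spelling and the user/group names below are assumptions:

```bash
# Add the hadoop user to the docker group (full form of the truncated command
# above is assumed here).
sudo usermod -a -G docker hadoop

# Run a command with the new group membership immediately, without logging
# out and back in.
sudo -u hadoop sg docker -c "docker info"
```
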
1 vote · 1 answer

MapReduce job is hung after 1 of 5 reducers completed on single-node environment

I have only one Data Node in my dev environment on EC2. I ran a heavy MR job and after 6 hours noticed that 100% of the mappers and 20% of the reducers had finished (1 of the reducers shows 100% completion, the others 0%). It looks like the job is hung between 2 reducer…
Marboni
1 vote · 0 answers

How to increase the performance on Amazon Elastic Mapreduce for job execution?

My task is: initially, I want to import the data from MS SQL Server into HDFS using Sqoop. Through Hive, I process the data and generate the result in one table. That result table from Hive is then exported back to MS SQL Server…
Bhavesh Shah
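
A hedged sketch of the first step of that pipeline (the Sqoop import from SQL Server into HDFS); the connection string, credentials, table name, and target directory are placeholders, not values from the question, and the SQL Server JDBC driver jar is assumed to be on Sqoop's classpath:

```bash
# Import one SQL Server table into HDFS. The later Hive processing and the
# export back to SQL Server would use hive queries and "sqoop export".
sqoop import \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=sales" \
  --username sqoop_user \
  --password 'changeme' \
  --table Orders \
  --target-dir /user/hadoop/staging/orders \
  --num-mappers 4
```
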
1 vote · 3 answers

Hadoop Rolling Small files

I am running Hadoop on a project and need a suggestion. By default, Hadoop has a block size of around 64 MB. There is also a recommendation to avoid many small files. I currently have very, very small files being put into HDFS due…
Arenstar
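
One widely used way to deal with many tiny files is to roll them up periodically into a Hadoop Archive (HAR), so the NameNode tracks one archive instead of thousands of small blocks. A sketch under assumed paths and archive name (the archive job itself runs as a MapReduce job):

```bash
# Pack the directory /data/incoming into a single archive under /data/archived.
hadoop archive -archiveName small-files.har -p /data incoming /data/archived

# Files inside the archive remain listable and readable via the har:// scheme.
hdfs dfs -ls har:///data/archived/small-files.har
```
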
0 votes · 1 answer

How to view status of recent AppEngine mapreduce jobs?

We recently upgraded our App Engine application to GAE SDK 1.9, and upgraded the older MapReduce library we'd been using to the most recent version hosted on GitHub. We now find that the old MapReduce status page…
0 votes · 0 answers

Distributing Master node ssh key

For the master node to SSH into the slaves without a password, the master needs to distribute its SSH key to the slaves. Copying the key using ssh-copy-id asks for the user's password. If there are hundreds of nodes in the system, it may not be a good idea…
krackoder
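
A hedged sketch of the brute-force approach the question is trying to avoid typing by hand: looping ssh-copy-id over a host list, supplying the password once via sshpass. The host file, user, and key path are assumptions, and sshpass must be installed on the master:

```bash
# Push the master's public key to every slave listed in slaves.txt.
# The same password is assumed to be valid on every slave.
read -s -p "Slave password: " SLAVE_PASS; echo
while read -r host; do
  sshpass -p "$SLAVE_PASS" ssh-copy-id -i ~/.ssh/id_rsa.pub "hadoop@${host}"
done < slaves.txt
```
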
0 votes · 1 answer

MongoDB Locking - Very, very slow to read

This is the output from db.currentOp(): > db.currentOp() { "inprog" : [ { "opid" : 2153, "active" : false, "op" : "update", "ns" : "", "query" : { "name" :…
StuR