
I am running Hadoop on a project and need a suggestion.

Generally, by default Hadoop has a block size of around 64 MB.
It is also recommended to avoid having many small files.

I currently have very small files being written into HDFS because of the application design of Flume.

The problem is that Hadoop <= 0.20 cannot append to files, so I end up with too many files for my MapReduce job to run efficiently.

There must be a proper way to simply roll/merge roughly 100 files into one, so that Hadoop is effectively reading 1 large file instead of 100 small ones.

Any suggestions?

Arenstar

3 Answers


Media6degrees has come up with a fairly good solution to combine small files in Hadoop. You can use their jar straight out of the box: http://www.jointhegrid.com/hadoop_filecrush/index.jsp

Aman

Have you considered using Hadoop Archives? Think of them as tar files for HDFS. http://hadoop.apache.org/common/docs/r0.20.2/hadoop_archives.html
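A HAR is normally built with the hadoop archive command described at that link, but the same tool can also be driven from Java. Here is a rough sketch, assuming Hadoop 0.20.x and that the org.apache.hadoop.tools.HadoopArchives tool can be handed a Configuration and run through ToolRunner; the archive name and paths are made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.tools.HadoopArchives;
    import org.apache.hadoop.util.ToolRunner;

    public class ArchiveSmallFiles {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Same arguments as the 0.20.2 command-line form:
        //   hadoop archive -archiveName <name>.har <src>* <dest>
        // The archive name and paths below are hypothetical.
        String[] harArgs = {
            "-archiveName", "flume-events.har",
            "/flume/events",   // directory full of small files
            "/archives"        // where the .har is written
        };

        int exitCode = ToolRunner.run(new HadoopArchives(conf), harArgs);
        System.exit(exitCode);
      }
    }

The equivalent command-line usage is documented at the link above.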


What you need to do is write a trivial concatenator program with an identity mapper and one or just a few identity reducers. This program will allow you to concatenate your small files into a few large files to ease the load on Hadoop.
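For reference, a minimal sketch of such a job using the old 0.20 "mapred" API might look like the following. The input and output paths are hypothetical, and note that TextOutputFormat also writes the byte-offset keys produced by TextInputFormat, so for a clean line-for-line concatenation you may want to substitute a mapper that drops the key:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.TextOutputFormat;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class Concatenator {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(Concatenator.class);
        conf.setJobName("concatenate-small-files");

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        // Identity map/reduce: records pass through untouched. The number of
        // reduce tasks controls how many (large) output files you get.
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        conf.setNumReduceTasks(2);

        // Hypothetical paths: a directory of small Flume files in, merged files out.
        FileInputFormat.setInputPaths(conf, new Path("/flume/small-files"));
        FileOutputFormat.setOutputPath(conf, new Path("/flume/merged"));

        // Note: the shuffle sorts records by key (byte offset), so lines from
        // different input files will interleave in the merged output.
        JobClient.runJob(conf);
      }
    }

Setting the reducer count to one or a few is what turns the pile of small files into the handful of large files described above.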

This can be quite a task to schedule and it wastes space, but it is necessary due to the design of HDFS. If HDFS were a first class file system, then this would be much easier to deal with.

Ted Dunning