
I am running Hadoop on a project and need a suggestion.

Generally, by default Hadoop has a block size of around 64 MB.
It is also recommended to avoid having many small files.

I currently have very small files being written into HDFS because of the application design of Flume.

The problem is that Hadoop <= 0.20 cannot append to files, so I end up with too many files for my MapReduce job to run efficiently.

There must be a proper way to simply roll/merge roughly 100 files into one, so that Hadoop is effectively reading 1 large file instead of 100 small ones.

Any suggestions?

Arenstar

3 Answers


Media6degrees has come up with a fairly good solution to combine small files in Hadoop. You can use their jar straight out of the box: http://www.jointhegrid.com/hadoop_filecrush/index.jsp

Aman

Have you considered using Hadoop Archives? Think of them as tar files for HDFS. http://hadoop.apache.org/common/docs/r0.20.2/hadoop_archives.html
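A HAR is normally built with the hadoop archive command described at that link, but the same tool can also be driven from Java. Here is a rough sketch, assuming Hadoop 0.20.x and that the org.apache.hadoop.tools.HadoopArchives tool can be handed a Configuration and run through ToolRunner; the archive name and paths are made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.tools.HadoopArchives;
    import org.apache.hadoop.util.ToolRunner;

    public class ArchiveSmallFiles {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Same arguments as the 0.20.2 command-line form:
        //   hadoop archive -archiveName <name>.har <src>* <dest>
        // The archive name and paths below are hypothetical.
        String[] harArgs = {
            "-archiveName", "flume-events.har",
            "/flume/events",   // directory full of small files
            "/archives"        // where the .har is written
        };

        int exitCode = ToolRunner.run(new HadoopArchives(conf), harArgs);
        System.exit(exitCode);
      }
    }

The equivalent command-line usage is documented at the link above.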


What you need to do is write a trivial concatenator program with an identity mapper and one or just a few identity reducers. This program will allow you to concatenate your small files into a few large files to ease the load on Hadoop.
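For reference, a minimal sketch of such a job using the old 0.20 "mapred" API might look like the following. The input and output paths are hypothetical, and note that TextOutputFormat also writes the byte-offset keys produced by TextInputFormat, so for a clean line-for-line concatenation you may want to substitute a mapper that drops the key:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.TextOutputFormat;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class Concatenator {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(Concatenator.class);
        conf.setJobName("concatenate-small-files");

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        // Identity map/reduce: records pass through untouched. The number of
        // reduce tasks controls how many (large) output files you get.
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        conf.setNumReduceTasks(2);

        // Hypothetical paths: a directory of small Flume files in, merged files out.
        FileInputFormat.setInputPaths(conf, new Path("/flume/small-files"));
        FileOutputFormat.setOutputPath(conf, new Path("/flume/merged"));

        // Note: the shuffle sorts records by key (byte offset), so lines from
        // different input files will interleave in the merged output.
        JobClient.runJob(conf);
      }
    }

Setting the reducer count to one or a few is what turns the pile of small files into the handful of large files described above.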

This can be quite a task to schedule and it wastes space, but it is necessary due to the design of HDFS. If HDFS were a first class file system, then this would be much easier to deal with.

Ted Dunning