
I have a large set of log files that I need to extract data from. Is it possible to use Flume to read these files and dump them into HDFS (or Cassandra, or some other data store) that I can then query?

The documentation seems to suggest it's all live, event-based log processing. I'm wondering if I'm missing some obvious way to have Flume read and process static log files from a directory.


1 Answer


Yes, this is the standard use case for Flume.

The server with the log files will run a flume-node, and another (or potentially the same) server will run a flume-master. The flume-nodes discover the flume-master, and from the flume-master you can execute commands like:

exec config my-config 'tail("/path/to/logfile")' 'collectorSink("hdfs://path/to/hdfs-folder", [options])'

This creates a configuration that tells Flume how to access the file (it can tail it or read the entire file; other source types are available) and where to put the data.
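Since the question is about static files rather than a live stream, it's worth noting that the user guide also lists sources other than tail, such as text(...) for a one-time read of a whole file and tailDir(...) for picking up files in a directory. As a rough sketch (the logical-node name, paths, and file prefix below are placeholders, not from the answer):

exec config static-logs 'text("/path/to/old/logfile")' 'collectorSink("hdfs://namenode/flume/static/", "log-")'

Swapping in tailDir("/path/to/logdir/") for text(...) would cover a whole directory of existing logs rather than a single file.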

Then it is a matter of mapping the configuration onto a particular server:

exec map (server-hostname) my-config
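Putting the two commands together, a session in the flume shell against the master might look roughly like this (hostnames and paths are placeholders; check the user guide for the exact shell syntax in your version):

flume shell
> connect master-hostname
> exec config my-config 'tail("/path/to/logfile")' 'collectorSink("hdfs://namenode/flume/", "log-")'
> exec map logserver-hostname my-config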

There is more information in the Flume User Guide: http://archive.cloudera.com/cdh/3/flume/UserGuide/index.html
