2

Looking for a high-performance, distributed, scalable solution for storing tons of log messages. We have multiple concurrent log sources (i.e., servers).

The interesting thing here is that performance is crucial, and we are even willing to lose a small percentage (let's say at most 2%) of all the daily messages if the logging system performs better.

We want to process the log messages daily with an online algorithm, so we don't need any fancy relational database features. We just want to run through the data sequentially and calculate some aggregates and trends.
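
To illustrate, here's a minimal sketch of the kind of single-pass processing we have in mind; the "timestamp severity value" line format is made up for the example and would be adapted to the real messages:

```python
#!/usr/bin/env python
# One sequential pass over a day's worth of log lines: count messages
# per severity and keep a running mean of a numeric field.
# The "timestamp severity value" format is a placeholder for this sketch.
import sys
from collections import defaultdict

counts = defaultdict(int)   # messages per severity
total, n = 0.0, 0           # running sum/count for the mean

for line in sys.stdin:
    fields = line.split()
    if len(fields) < 3:
        continue            # tolerate the occasional malformed line
    severity, value = fields[1], fields[2]
    counts[severity] += 1
    try:
        total += float(value)
        n += 1
    except ValueError:
        pass                # non-numeric payload; skip it

for severity in sorted(counts):
    print('%s\t%d' % (severity, counts[severity]))
if n:
    print('mean\t%f' % (total / n))
```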

This is what we need:

  • At least 98% of the messages must be stored. It's not a problem to lose a couple of messages.
  • Once a message is stored, it must be stored reliably (durable, i.e. the D from ACID), so replication is basically needed.
  • Multiple sources.
  • The messages must be stored sequentially, but exact ordering is not needed (we expect any two messages more than a couple of seconds apart to be in the correct order; messages close to each other can be in arbitrary order).
  • We must be able to process the daily data sequentially, ideally in some reliable way like map-reduce, so that machine failures are handled and processing on failed nodes is restarted (see the sketch after this list).
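
To make the last point concrete, a Hadoop Streaming job would give us exactly this sequential, failure-tolerant pass. A sketch, where the hourly-count aggregate and the ISO-8601 timestamp prefix are assumptions for the example:

```python
#!/usr/bin/env python
# mapper.py -- emit (hour, 1) for every log line.
# Assumes each line starts with an ISO-8601 timestamp, e.g.
# "2011-04-15T16:21:03 ..."; the first 13 characters identify the hour.
import sys

for line in sys.stdin:
    timestamp = line.split(' ', 1)[0]
    if len(timestamp) >= 13:
        print('%s\t1' % timestamp[:13])   # e.g. "2011-04-15T16"
```

```python
#!/usr/bin/env python
# reducer.py -- sum the counts per hour key (input arrives sorted by key).
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip('\n').split('\t', 1)
    if key != current_key:
        if current_key is not None:
            print('%s\t%d' % (current_key, count))
        current_key, count = key, 0
    count += int(value)
if current_key is not None:
    print('%s\t%d' % (current_key, count))
```

This would be launched with something like `hadoop jar hadoop-streaming-*.jar -input /logs/2011-04-15 -output /reports/2011-04-15 -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py` (the HDFS paths are hypothetical); failed tasks get retried on other nodes automatically.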

Any RDBMS is certainly not an option here, as it guarantees too many properties that are unnecessary for this task.

Karoly Horvath
  • Are you logging over the network (e.g. syslog)? If so, keep in mind that traditional syslog/UDP can easily lose far more than 2% of the UDP packets. – Stefan Lasiewski May 26 '11 at 23:36

2 Answers

3

I think you want Flume. It seems to hit most of the points you are looking for: multiple sources, reliability (an end-to-end delivery guarantee), and the ability to write to HDFS (distributed, fault-tolerant storage that integrates with Hadoop for map/reduce).
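
For example, if your servers already log via syslog, one possible wiring is to point the standard Python syslog handler at a Flume syslog source. This is just a sketch: the agent host `flume-agent` and port 5140 are assumptions about your deployment, and you'd need a matching syslog source configured on the Flume side.

```python
# Sketch: forward application logs to a Flume agent's syslog source.
# "flume-agent" and port 5140 are placeholders for your deployment;
# Flume must have a syslog source configured to listen there.
import logging
from logging.handlers import SysLogHandler

logger = logging.getLogger('webapp')
logger.setLevel(logging.INFO)
logger.addHandler(SysLogHandler(address=('flume-agent', 5140)))

logger.info('GET /index.html 200 12ms')  # flows on to HDFS via Flume
```

Note that SysLogHandler's default transport is UDP, which is exactly the cheap-but-lossy trade-off you said you can tolerate; Flume's reliability guarantees kick in once the event reaches the agent.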

Edit: I'd also like to mention Scribe as another possibility. It's C++-based and written by Facebook, but it seems to have been mostly abandoned upstream. Still, it has a much smaller footprint than Flume, especially once you count Flume's dependencies such as ZooKeeper. And it, too, can write to HDFS.
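
If you do go the Scribe route, clients talk to it over Thrift. A minimal Python sketch, assuming the Thrift-generated `scribe` bindings that ship with the Scribe source are on your path:

```python
# Sketch of a Scribe client over Thrift (port 1463 is Scribe's usual
# default, but verify it against your setup). Requires the `thrift`
# package plus the generated `scribe` bindings from the Scribe source.
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from scribe import scribe

socket = TSocket.TSocket(host='localhost', port=1463)
transport = TTransport.TFramedTransport(socket)
protocol = TBinaryProtocol.TBinaryProtocol(trans=transport,
                                           strictRead=False, strictWrite=False)
client = scribe.Client(iprot=protocol, oprot=protocol)

transport.open()
entry = scribe.LogEntry(category='app_logs', message='hello from a source')
client.Log(messages=[entry])  # returns OK, or TRY_LATER under pressure
transport.close()
```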

Scrivener
  • brilliant! I really like Flume's failure modes.. we will be able to balance between reliability and performance. Exactly what I was looking for. – Karoly Horvath Apr 18 '11 at 17:31
  • Glad to help. Be aware, there are some issues with it: for one, a hard dependency on some of the Hadoop libraries, and it runs on the JVM, so it carries the full weight of a JVM on each machine. But still, nothing else does what it does at the moment, and I've recently had to suck it up because of that fact. – Scrivener Apr 18 '11 at 20:49
  • Just for future viewers' sake, I should also mention that [Fluentd](http://fluentd.org/) is another alternative, and that Scribe has been steadily declining in popularity. – Scrivener May 17 '13 at 20:41
1

A distributed Splunk setup would fit your needs, but it sounds like you have a high volume of log data; Splunk is licensed based on the amount of data indexed per day, so it would not be cheap.

Shane Madden
  • You are right, we need a system that can handle a very high volume of log data (.. in the near future, so we can handle the rapid growth). The company might pay for some additional services around an open-source solution (e.g. online support, troubleshooting, feature requests, ...) but certainly won't pay on a daily basis for traffic. – Karoly Horvath Apr 15 '11 at 16:21
  • @yi_H To clarify, you buy a permanent license that allows a certain amount of data to be indexed per day; for instance, I think the ballpark number is around $10,000 for a permanent 1 GB-per-day license, and it gets cheaper per gigabyte as the license gets bigger (I know the prices have changed since my last conversation with their sales team, so take this with a grain of salt). – Shane Madden Apr 15 '11 at 16:25