I'm looking for a high-performance, distributed, scalable solution for storing very large volumes of log messages. We have multiple concurrent log sources (i.e. servers).
The interesting thing here is that performance is crucial: we are even willing to lose a small percentage (let's say at most 2%) of the daily messages if that makes the logging system perform better.
We want to process the log messages daily with an online algorithm, so we do not need any fancy relational database features. We just want to run through the data sequentially and calculate some aggregates and trends.
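To make "online algorithm" concrete, here is a minimal sketch of the kind of single-pass aggregation meant above. It assumes each message yields one numeric value (for example a response time); the names `RunningStats` and `aggregate` are made up for illustration, and the update rule is Welford's running mean/variance, chosen only as a typical example of such an aggregate.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RunningStats:
    """Single-pass count, mean and population variance (Welford's method)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        # Each message is looked at exactly once; nothing is kept in memory.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        return self.m2 / self.count if self.count else 0.0


def aggregate(values: Iterable[float]) -> RunningStats:
    """Run through the values sequentially and return the aggregates."""
    stats = RunningStats()
    for value in values:
        stats.update(value)
    return stats
```

Because the state is just three numbers, this kind of processing only needs a sequential read of the stored messages, not random access or queries.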
This is what we need:
- At least 98% of the messages must be stored. It's not a problem to lose a few messages.
- Once a message is stored, it must be stored reliably (durable, i.e. the D from ACID, so basically replication is needed)
- Multiple sources.
- The messages must be stored in roughly sequential order, but exact ordering is not needed (we expect any two messages that are more than a couple of seconds apart to be in the right order, but messages close to each other can be in arbitrary order)
- We must be able to process the daily data sequentially (ideally in some reliable way like map-reduce, so machine failures are handled and processing on nodes with failures is restarted)
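To pin down the ordering requirement from the list above, here is a minimal sketch of the check a stored sequence would have to pass. It assumes each message has a numeric timestamp in seconds; the function name `is_roughly_ordered` and the `tolerance_seconds` parameter are hypothetical, and "a couple of seconds" is left as a parameter rather than a fixed value.

```python
from typing import Iterable


def is_roughly_ordered(timestamps: Iterable[float], tolerance_seconds: float) -> bool:
    """Return True if no message is stored after one that is newer by more
    than tolerance_seconds. Messages closer together than the tolerance may
    appear in any order."""
    latest_seen = float("-inf")
    for ts in timestamps:
        # A message older than the newest one seen so far by more than the
        # tolerance violates the "far apart means correctly ordered" rule.
        if latest_seen - ts > tolerance_seconds:
            return False
        latest_seen = max(latest_seen, ts)
    return True
```

For example, with a tolerance of 2 seconds the stored order `[0, 2, 1, 3, 5, 4]` is acceptable, while `[0, 10, 1]` is not.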
An RDBMS is certainly not an option here, as it provides many guarantees that are unnecessary for this task.