This is a complex problem, as you well know. You tagged Logstash in your question, so I'm going to assume you have that.
Ingesting logs is what Logstash does. It has a file {} input plugin just for that:
input {
  file {
    path => [ '/opt/export/cdn_logs/*.csv' ]
    tags => [ 'cdnlogs' ]
  }
}
And a csv {} filter to ease ingesting CSV data.
filter {
  if 'cdnlogs' in [tags] {
    csv {
      source => "message"
      columns => [
        'cdndate',
        'host_server',
        [...]
        'response_time' ]
    }
  }
}
If you don't have CSV data (perhaps those lines are in a fairly normal-looking Apache format), all is not lost. You'll probably need to spend time with grok, which is its own thing.
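If your lines do look like Apache combined-format access logs, a minimal sketch along these lines would get you started; the conditional reuses the 'cdnlogs' tag from the input above, and the fields it produces are whatever the stock pattern emits, not anything specific to your CDN:
filter {
  if 'cdnlogs' in [tags] {
    grok {
      # Stock pattern for Apache combined-format access logs.
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
  }
}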
Date-ordering is less of a problem so long as you take care to preserve your timestamps and don't use statsd, which manifestly doesn't preserve them. If you haven't done so already, Logstash can take the date in the log-file and make it the date/time stamp of the event.
filter {
  date {
    match => [ "cdndate", "ISO8601" ]
  }
}
That sets the event's timestamp to the date/time stamp of the logline. Cool, now to get that into something useful.
The stock datastore for Logstash is elasticsearch, which Elastic (the company) is busy trying to bill as just as good a time-series datastore as the purpose-built tools like InfluxDB or OpenTSDB. It can be, though in my experience the purpose-built ones perform better. All of these can, assuming you ingest them right, store out-of-order events in the correct order so that later queries can assimilate the new information.
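If you go the elasticsearch route, the output stanza is short; the host and index name here are placeholders for illustration, not anything from your setup:
output {
  if 'cdnlogs' in [tags] {
    elasticsearch {
      hosts => [ 'localhost:9200' ]
      # Daily indexes make purging full-detail data later much easier.
      index => 'cdnlogs-%{+YYYY.MM.dd}'
    }
  }
}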
The graphite {} output from Logstash will preserve timestamps, which allows you to use graphite as your backing store for that if you wish.
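A rough sketch, assuming the host_server and response_time fields from the csv example above; the graphite host and the metric path are made up for illustration:
output {
  graphite {
    host => 'graphite.example.com'
    port => 2003
    # Metric names and values can be built from event fields via sprintf.
    metrics => { "cdn.%{host_server}.response_time" => "%{response_time}" }
  }
}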
The influxdb {} and opentsdb {} output plugins exist and will get your data into a true time-series database.
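To give a feel for the influxdb {} side, another rough sketch; option names have shifted between plugin versions, so treat the specifics as illustrative, and the host, database, and field names are placeholders:
output {
  influxdb {
    host => 'influxdb.example.com'
    db => 'cdn'
    # Map event fields onto the points written to InfluxDB.
    data_points => { "response_time" => "%{response_time}" }
  }
}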
From there, aggregation/summarization for near-term data (a few days, from your explanation) should be done at query time. A tool like grafana can front several of these datastores and makes display easier. Once you're past your risk-zone for late-arriving logs, you can run a later ETL process to generate in-database aggregations/summarizations based on the complete dataset, and then purge the full-detail logs as needed.
In short, the method:
- Ingests files using Logstash.
- Leverages filtering to extract the fields from the CDN log-files.
- Uses the date {} filter to pull the log's timestamp into the event's timestamp.
- Exports the data to a datastore (elasticsearch, graphite, or some other time-series database).
- Display tools use real-time aggregation queries to display data to consumers, at least for near-term data.
- After a period, probably a couple of days, a scripted or other automated process generates aggregations and inputs them into the datastore.
- After more time, the full-resolution data is purged leaving just the aggregated data.