We'd like to feed CDN logs into Graphite and aggregate the numbers found there (rates of the different HTTP status codes, average response sizes, average cache-hit ratio, etc.).

However, the logs are only uploaded to us occasionally and sometimes even out of order -- once in a while, a morning log can get uploaded in the evening, hours after the afternoon's log was uploaded and processed. Also, because the CDN (obviously) has multiple servers and data-centers, different logs can cover overlapping periods.

This means any aggregator needs to maintain access to all of the earlier stats to be able to augment the aggregations when processing a new log...

What -- if anything -- can do that? And how do I configure logstash to feed into it? Thanks!

Mikhail T.

1 Answer

This is a complex problem, as you well know. You tagged Logstash in your question, so I'm going to assume you have that.

Ingesting logs is what Logstash does. It has a file {} input plugin just for that:

input {
  file {
    path => [ '/opt/export/cdn_logs/*.csv' ]
    tags => [ 'cdnlogs' ]
  }
}

And a csv {} filter to ease ingesting CSV data.

filter {
  if 'cdnlogs' in [tags] {
    csv {
      source => "message"
      columns => [
        'cdndate',
        'host_server',
        [...]
        'response_time' ]
    }
  }
}

If you don't have CSV data (perhaps those lines are in a fairly normal-looking Apache format), all is not lost. You'll probably need to spend time with grok, which is its own thing.
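
If your lines really are close to the combined Apache format, a minimal sketch using the stock COMBINEDAPACHELOG pattern that ships with Logstash would look like this (assume your CDN's actual format needs tuning):

filter {
  if 'cdnlogs' in [tags] {
    grok {
      # COMBINEDAPACHELOG ships with Logstash; swap in your own pattern
      # if the CDN's lines deviate from the combined Apache format.
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
  }
}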

Date-ordering is less of a problem so long as you take care to preserve your timestamps and don't use statsd, which manifestly doesn't preserve them. If you haven't done so already, Logstash can take the date in the log-file and make it the date/time stamp of the event:

filter {
  date {
    match => [ "cdndate", "ISO8601" ]
  }
}

That makes the log line's date/time stamp the timestamp of the event. Cool, now to get that into something useful.

The stock datastore for Logstash is Elasticsearch, which Elastic (the company) is busy trying to bill as just as good a timeseries datastore as the purpose-built tools like InfluxDB or OpenTSDB. It can be, though in my experience the purpose-built ones perform better. All of these can, assuming you input them right, store out-of-order events in the correct order so that later queries can assimilate the new information.
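
If you do go the Elasticsearch route, the output side of Logstash is short. A minimal sketch (the hostname and index name here are placeholders, not from the question):

output {
  elasticsearch {
    # Placeholder host; point this at your own cluster.
    hosts => [ 'localhost:9200' ]
    # Daily indexes make the later purge of full-detail data easier.
    index => 'cdn-logs-%{+YYYY.MM.dd}'
  }
}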

The graphite {} output from Logstash will preserve timestamps, which allows you to use Graphite as your backing store for that if you wish.
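
A minimal sketch of that output (the hostname and metric key are illustrative; host_server is the column from the csv {} example above):

output {
  graphite {
    host => 'graphite.example.com'   # illustrative hostname
    port => 2003
    # Build the key from your own fields; putting the server identity in
    # the key keeps same-second hits from different servers from
    # overwriting each other.
    metrics => { 'cdn.%{host_server}.response_time' => '%{response_time}' }
  }
}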

The influxdb {} and opentsdb {} output plugins exist and will get your data into a true time-series database.
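
For InfluxDB, a hedged sketch (the database and measurement names are assumptions; check the plugin's docs for the exact options your version supports):

output {
  influxdb {
    host => 'influxdb.example.com'   # illustrative hostname
    db => 'cdn'                      # assumed database name
    measurement => 'cdn_logs'        # assumed measurement name
    # Map event fields onto InfluxDB fields.
    data_points => { 'response_time' => '%{response_time}' }
  }
}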

From there, aggregation/summarization for near-term data (a few days, by your description) should be done at query-time. A tool like Grafana can front several of these datastores and makes display easier. Once you're past your risk-zone for late-arriving logs, you can then run a later ETL process to generate in-database aggregations/summarizations based on the complete dataset. And then purge the full-detail logs as needed.
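
As one example of query-time summarization, Graphite can roll up on the fly at render time; a target expression like this (the metric path is illustrative) averages the raw points into hourly buckets:

summarize(cdn.*.response_time, "1hour", "avg")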

In short, the method:

  1. Ingests files using Logstash.
  2. Leverages filtering to extract the fields from the CDN log-files.
  3. Uses the date {} filter to pull the log's timestamp into the event's timestamp.
  4. Exports the data to something (Elasticsearch, Graphite, or some other time-series database).
  5. Display tools use real-time aggregation queries to display data to consumers, at least for near-term data.
  6. After a period, probably a couple of days, a scripted or other automated process generates aggregations and inputs them into the datastore.
  7. After more time, the full-resolution data is purged leaving just the aggregated data.
sysadmin1138
  • Thank you very much for the detailed answer. Some of this we've already figured out ourselves here -- indeed, Logstash is already feeding our ElasticSearch/Kibana system, for example. I wish I could "accept" a detailed answer like yours multiple times -- it is more useful than some "blog-posts" out there! – Mikhail T. Jun 25 '17 at 14:08
  • The bit your response is missing, though, is using the CDN's own POP-ID (or even the servers' IP-addresses) as part of the Graphite keys -- to prevent hits coming to different POPs/servers in the same second from overwriting each other... – Mikhail T. Jun 25 '17 at 14:20