Log transport and aggregation at scale

Question

How're you analysing log files from UNIX/Linux machines? We run several hundred servers which all generate their own log files, either directly or through syslog. I'm looking for a decent solution to aggregate these and pick out important events. This problem breaks down into 3 components:

1) Message transport

The classic way is to use syslog to log messages to a remote host. This works fine for applications that log to syslog but is less useful for apps that write to a local file. Solutions for this might include having the application log to a FIFO connected to a program that forwards each message via syslog, or writing something that will grep the local files and send the output to the central syslog host. However, if we go to the trouble of writing tools to get messages into syslog, would we be better off replacing the whole lot with something like Facebook's Scribe, which offers more flexibility and reliability than syslog?
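To make the "read the local file and send it to the central syslog host" option concrete, here is a minimal sketch in Python using the standard library's `logging.handlers.SysLogHandler`. The function names, the `applog` tag and the host and path in the usage note are made up for illustration; it does not handle log rotation or half-written lines, so treat it as a starting point rather than a finished shipper.

```python
import logging
import logging.handlers
import time
from typing import Iterable, Iterator, TextIO


def follow(handle: TextIO, poll_seconds: float = 1.0) -> Iterator[str]:
    """Yield lines appended to an open file, roughly like `tail -f`.

    Assumption: the file is not rotated or truncated while we read it.
    """
    handle.seek(0, 2)  # jump to the current end of the file
    while True:
        line = handle.readline()
        if line:
            yield line
        else:
            time.sleep(poll_seconds)  # nothing new yet, wait and retry


def make_syslog_logger(host: str, port: int = 514, tag: str = "applog") -> logging.Logger:
    """Build a logger that sends each record to a remote syslog host (UDP by default)."""
    logger = logging.getLogger(tag)
    logger.setLevel(logging.INFO)
    handler = logging.handlers.SysLogHandler(address=(host, port))
    handler.setFormatter(logging.Formatter(tag + ": %(message)s"))
    logger.addHandler(handler)
    return logger


def forward_lines(lines: Iterable[str], logger: logging.Logger) -> int:
    """Send every non-empty line through the logger; return how many were sent."""
    sent = 0
    for line in lines:
        text = line.rstrip("\n")
        if text:
            logger.info(text)
            sent += 1
    return sent
```

With a hypothetical log file and log host, it would be wired up as `forward_lines(follow(open("/var/log/myapp.log")), make_syslog_logger("loghost.example.com"))`. UDP syslog gives no delivery guarantee, which is part of the reliability concern raised above.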

2) Message aggregation

Log entries seem to fall into one of two types: per-host and per-service. Per-host messages are those which occur on one machine; think disk failures or suspicious logins. Per-service messages occur on most or all of the hosts running a service. For instance, we want to know when Apache finds an SSI error but we don't want the same error from 100 machines. In all cases we only want to see one of each type of message: we don't want 10 messages saying the same disk has failed, and we don't want a message each time a broken SSI is hit.

One approach to solving this is to aggregate multiple messages of the same type into one on each host, send the messages to a central server and then aggregate messages of the same kind into one overall event. SER can do this but it's awkward to use. Even after a couple of days of fiddling I had only rudimentary aggregations working and had to constantly look up the logic SER uses to correlate events. It's powerful but tricky stuff: I need something which my colleagues can pick up and use in the shortest possible time. SER rules don't meet that requirement.
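The per-host versus per-service idea can be sketched as a small suppression table keyed on scope plus message text. This is only an illustration of the logic described above, not how any particular correlation tool works; the class name, the 5-minute default window and the PID-stripping normalisation are all assumptions.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple

Key = Tuple[str, str, str]  # (scope kind, host or service name, normalised message)


@dataclass
class Event:
    key: Key
    first_seen: float
    count: int = 1
    hosts: Set[str] = field(default_factory=set)


class Aggregator:
    """Report the first message of each kind, suppress repeats inside a time window."""

    def __init__(self, window_seconds: float = 300.0) -> None:
        self.window = window_seconds
        self.open_events: Dict[Key, Event] = {}

    @staticmethod
    def normalise(message: str) -> str:
        # Assumption: only the PID in "prog[1234]" varies between repeats.
        return re.sub(r"\[\d+\]", "[PID]", message)

    def add(self, timestamp: float, host: str, service: str,
            message: str, per_service: bool) -> Optional[Event]:
        """Return a new Event if this message should be reported, else None."""
        if per_service:
            key = ("service", service, self.normalise(message))
        else:
            key = ("host", host, self.normalise(message))
        event = self.open_events.get(key)
        if event is not None and timestamp - event.first_seen < self.window:
            event.count += 1       # same kind of message, still inside the window
            event.hosts.add(host)
            return None            # suppressed
        event = Event(key=key, first_seen=timestamp, hosts={host})
        self.open_events[key] = event
        return event
```

Running one instance on each host with `per_service=False` and another on the central server with `per_service=True` would give the two-stage aggregation described above: ten "disk failed" lines from one machine become one event, and the same SSI error from 100 web servers becomes one event listing the affected hosts.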

3) Generating alerts

How do we tell our admins when something interesting happens? Mail the group inbox? Inject into Nagios?
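For the "inject into Nagios" option, one route is a passive check result written to Nagios's external command file. The sketch below only builds the `PROCESS_SERVICE_CHECK_RESULT` line; the host name, service name and command file path are placeholders, and it assumes the service is already defined in Nagios with passive checks enabled.

```python
import time
from typing import Optional

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def passive_result(host: str, service: str, state: int, output: str,
                   now: Optional[int] = None) -> str:
    """Format one passive service check result as an external command line."""
    stamp = int(time.time()) if now is None else now
    clean = output.replace("\n", " ")  # the command must stay on a single line
    return f"[{stamp}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{state};{clean}\n"


def submit(line: str, command_file: str) -> None:
    """Append the command to the Nagios command file (path depends on the install)."""
    with open(command_file, "a") as pipe:
        pipe.write(line)
```

An aggregated event would then be raised with something like `submit(passive_result("web1", "apache-ssi", CRITICAL, "SSI error seen on 37 hosts"), command_file)`, where `command_file` is whatever path your Nagios configuration sets.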

So, how're you solving this problem? I don't expect an answer on a plate; I can work out the details myself but some high-level discussion on what is surely a common problem would be great. At the moment we're using a mishmash of cron jobs, syslog and who knows what else to find events. This isn't extensible, maintainable or flexible and as such we miss a lot of stuff we shouldn't.

Updated: we're already using Nagios for monitoring, which is great for detecting down hosts, testing services, etc. but less useful for scraping log files. I know there are log plugins for Nagios but I'm interested in something more scalable and hierarchical than per-host alerts.

related - http://serverfault.com/questions/62687/alternatives-to-splunk :) — warren, Sep 25 '09 at 11:50

score 5 · Accepted Answer · answered May 01 '09 at 18:25

I've used three different systems for centralizing logs:

1) Syslog/syslog-ng forwarding to one host
2) Zenoss for aggregating and alerting events
3) Splunk for log aggregation and search

For #3, I typically use syslog-ng to forward the messages from each host directly into Splunk. Splunk can also parse log files directly, but that can be a bit of a pain.

Splunk is pretty awesome for searching and categorizing your logs. I haven't used Splunk for log alerting, but I think it's possible.

Gary Richardson

+1 for Splunk. You can have Splunk trigger external scripts when certain events are detected; either sending a mail or an SNMP trap. – Murali Suriar May 02 '09 at 10:06

score 2 · Answer 2 · answered Jun 10 '09 at 12:22

You can take a look at OSSEC, a complete open-source HIDS. It does log analysis and can trigger actions or send mail on alerts. Alerts are triggered by a set of simple XML-based rules; a lot of predefined ones for various log formats are included, and you can add your own rules.

http://www.ossec.net/

sebthebert · Answer 3 · 2012-09-04T21:28:22.230

Take a look at Octopussy. It's fully customizable and seems to answer all your needs...

PS: I'm the developer of this solution.

edited Sep 04 '12 at 21:28

answered May 05 '09 at 23:01

I wouldn't want to risk deploying or even recommending a product that has "pussy" in the name. That probably wouldn't go over well with most companies, particularly if there are women working within IT (pretty common these days). – Starfish Sep 06 '12 at 15:37

score 0 · Answer 4 · edited Apr 13 '17 at 12:13

You need to look into a monitoring system, for example Zenoss Core. Among other things, it says on the intro page:

Zenoss Event Monitoring and Management provides the ability to aggregate log and event information from various sources including availability monitoring, performance monitoring, syslog sources, SNMP trap sources, Windows Event log.

See what-tool-do-you-use-to-monitor-your-servers.

answered May 01 '09 at 13:17

gimel

I didn't realise Zenoss had log aggregation features. I'll take a look -- thanks. – markdrayton May 01 '09 at 13:33
