Summarizing gargantuan log files



I have several 2-8 GB log files from a service that I'm running. Normally, these log files are smaller than this (more on the order of 50-250 MB).

I would like to analyze them and summarize them to find out what's going on.

Are there any tools to help automate this, or at least give a first pass? I'm considering using head, awk, cut, grep, etc., but these aren't very automatic.


Posted 2011-03-28T13:11:16.100


What's the service? If it's a common one there may already be a log analyzer tool available for it. – Majenko – 2011-03-28T13:14:53.877

It's a custom in-house service. I was just wondering if there were any general-purpose techniques for summarizing huge files. – mskfisher – 2011-03-28T13:32:51.633



Have you tried Splunk?



Wow. That's an impressive tool. It's overkill for my one-off investigation, but it may be helpful with other types of investigations I need to do. Thanks. – mskfisher – 2011-03-28T15:46:45.073

Whilst this may theoretically answer the question, it would be preferable to include the essential parts of the answer here, and provide the link for reference. – Ivo Flipse – 2012-07-19T07:43:51.333


I've found that a combination of grep, cut, sort, uniq, head, and tail is helpful for an ad-hoc, one-time log inspection.

Inspect top of log file

Looks like each line starts with a date/time.

$ head porter10.log

03/10/2011 12:14:25 --------  (Port Control [Version 5.2 (Milestone 4)])  --------
03/10/2011 12:14:25 --------  LOG BEGINS  --------
03/10/2011 12:14:25 Common DLL [Version 5.2 (Milestone 4)] [Version Details: 5.2.4]
03/10/2011 12:14:25 Message DLL [Version 5.2 (Milestone 4)] [Version Details: 5.2.4]

Remove timestamp

I use the cut command, telling it to retain fields 3 and up, and to use a space as the delimiter.

$ cut -f3- -d' ' porter10.log | head

--------  (Port Control [Version 5.2 (Milestone 4)])  --------
--------  LOG BEGINS  --------
Common DLL [Version 5.2 (Milestone 4)] [Version Details: 5.2.4]
Message DLL [Version 5.2 (Milestone 4)] [Version Details: 5.2.4]

Trim to the unchanging portion of the line

I had a hunch that most of the excess output lines would have similar text, so I trimmed the output to the first 20 characters after the timestamp.

$ cut -f3- -d' ' porter10.log | cut -b-20 | head
--------  (Port Cont
--------  LOG BEGINS
Common DLL [Version
Message DLL [Version
Protocol DLL [Versio

Sort and find the largest counts

I then sorted, counted, and sorted the counts to find which lines occurred most often.

It appears that my naive timestamp-removal technique trimmed some useful (non-timestamp) information on a few lines, leaving me with some bare numbers instead.
However, it looks like they all occurred at the same frequency, and an order of magnitude more often than anything else, so I've found my suspects.

The 20-character range is a hunch, not a hard-and-fast rule. You may need to run this step multiple times to find the sweet spot that separates out the unusual lines.

$ cut -f3- -d' ' porter10.log | cut -b-20 | sort | uniq -c | sort -n

  13827 Error (266) to Remot
  13842 Error decode for dev
  17070 Error Writing to Ret
  46506 **** Checkpoint ****
 181820 (65)
 181820 (67)
 181821 (111)
 181821 (1555)
 181821 (55)
 181821 (66)
 181821 (77)
 181980 (107)
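
Since the 20-character cutoff is a guess, one way to compare several widths in a single pass is to count how many distinct prefixes each width produces. This is only a sketch: the function name sweep_widths is made up for illustration, and it assumes the same "date time message" layout that the cut -f3- -d' ' step relies on.

```shell
# Hypothetical helper: for each width given, report how many distinct
# line prefixes remain after dropping the first two space-separated fields.
# Usage: sweep_widths FILE WIDTH [WIDTH...]
sweep_widths() {
    file=$1
    shift
    for width in "$@"; do
        # sort -u keeps one copy of each prefix; wc -l counts them.
        # tr strips the padding some wc implementations add.
        distinct=$(cut -f3- -d' ' "$file" | cut -b "1-$width" | sort -u | wc -l | tr -d ' ')
        printf '%s\t%s\n' "$width" "$distinct"
    done
}
```

Calling sweep_widths porter10.log 10 20 30 prints one "width, distinct count" pair per line. A width that yields a short but not trivial list of prefixes tends to be a reasonable starting point for the sort-and-count step.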

Search for candidates in context

So, now that I have a list of potential candidates, I can look for them in context using grep and the -C# lines-of-context option:

$ grep -C3 '(1555)' porter10.log | head
03/10/2011 12:14:25.455 looking for DLC devices / start
Decoding tbl_pao_lite.cpp (107)
Decoding tbl_base.cpp (111)
Decoding dev_single.cpp (1555)
Decoding dev_dlcbase.cpp (77)
Decoding tbl_carrier.cpp (55)
Decoding tbl_route.cpp (66)
Decoding tbl_loadprofile.cpp (67)
Decoding tbl_pao_lite.cpp (107)
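
With several candidates to check, the same grep can be wrapped in a loop. The sketch below is an assumption-laden convenience, not part of the original investigation: show_context is a made-up name, -F is added so that parentheses and other punctuation in a candidate are always matched literally, and the output for each candidate is capped at 10 lines.

```shell
# Hypothetical helper: print the first few lines of context for each candidate.
# Usage: show_context FILE CANDIDATE [CANDIDATE...]
show_context() {
    file=$1
    shift
    for candidate in "$@"; do
        printf '== %s ==\n' "$candidate"
        # -F: fixed-string match; -C 3: three lines of context on each side;
        # --: keeps a candidate starting with "-" from being read as an option.
        grep -F -C 3 -- "$candidate" "$file" | head -n 10
    done
}
```

For the counts above, show_context porter10.log '(1555)' '(107)' '(65)' would print a labelled excerpt for each of the three suspects.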

Monte-Carlo approach - inspect middle of log file

If the above approach doesn't work, try looking at different spots in the file.

Looks like there are about 1.6 million lines in this file, so I looked at the ten lines ending at line 800,000, roughly the midpoint.
This confirmed the results of my sort-and-count approach.

$ wc -l porter10.log
1638656 porter10.log

$ head -800000 porter10.log | tail
Decoding dev_dlcbase.cpp (77)
Decoding tbl_carrier.cpp (55)
Decoding tbl_route.cpp (66)
Decoding dev_carrier.cpp (65)
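
To avoid working out the midpoint by hand, the line count can be fed straight into sed. This is a minimal sketch with a made-up function name (peek_middle); it prints a window that starts at the midpoint rather than one that ends there, and it quits as soon as the window has been printed so the rest of the file is not read.

```shell
# Hypothetical helper: print COUNT lines (default 10) starting at the
# midpoint of FILE.
# Usage: peek_middle FILE [COUNT]
peek_middle() {
    file=$1
    count=${2:-10}
    total=$(wc -l < "$file")
    start=$(( total / 2 ))
    if [ "$start" -lt 1 ]; then
        start=1
    fi
    end=$(( start + count - 1 ))
    # Print lines start..end, then quit at line end.
    sed -n "${start},${end}p;${end}q" "$file"
}
```

Changing the divisor (for example total / 4) moves the window to a different spot in the file.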


In this case, the output was due to some excess debug logging being left on in our configuration files.

You will need to adjust this approach to fit your particular log file, but the main keys are:

  1. Trim out timestamps
  2. Trim to some amount of the line that's likely to be unchanging
  3. Sort and count what's left
  4. Search for the biggest offenders in context
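
Steps 1 to 3 can be sketched as a single function. Unlike the plain cut used earlier, this version removes the timestamp only when a line actually starts with one, which should avoid the bare-number artifacts seen in the counts. The function name summarize_log and the timestamp pattern (MM/DD/YYYY followed by a time, with optional fractional seconds) are assumptions based on the sample output; adjust the pattern for your own log format.

```shell
# Hypothetical helper: strip a leading timestamp if present, trim each line
# to WIDTH bytes, and list the TOP most frequent prefixes.
# Usage: summarize_log FILE [WIDTH] [TOP]
summarize_log() {
    file=$1
    width=${2:-20}
    top=${3:-10}
    # Step 1: drop "MM/DD/YYYY HH:MM:SS[.mmm] " only where it occurs.
    sed -E 's|^[0-9]{2}/[0-9]{2}/[0-9]{4} [0-9:.]+ ||' "$file" \
        | cut -b "1-$width" \
        | sort | uniq -c | sort -rn | head -n "$top"
    # Step 2 is the cut; step 3 is sort | uniq -c | sort -rn, most frequent first.
}
```

Step 4 is then a matter of running grep with context on whichever prefixes come out on top.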




If you want to analyze your file while it is still growing, you can get neat results with logtop:

Most frequent requesting IPs:

tail -f /var/log/apache2/access.log | cut -d' ' -f1 | logtop

Most requested URLs (assuming the URL is the 7th field in your file):

tail -f /var/log/apache2/access.log | cut -d' ' -f7 | logtop
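
logtop gives a live, updating view. For a one-shot count over a file that has already been written, the same cut can feed sort and uniq instead. This is a sketch under assumptions, not a logtop feature: top_field is a made-up helper name, and the fields are assumed to be separated by single spaces as in the commands above.

```shell
# Hypothetical helper: list the TOP most frequent values of one
# space-separated field in FILE.
# Usage: top_field FILE FIELD [TOP]
top_field() {
    file=$1
    field=$2
    top=${3:-10}
    cut -d' ' -f"$field" "$file" | sort | uniq -c | sort -rn | head -n "$top"
}
```

For example, top_field /var/log/apache2/access.log 1 would list the ten most frequent values of the first field, and passing 7 instead of 1 would do the same for the seventh.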

Julien Palard
