73

I can use log analyzers, but often I need to parse recent web logs to see what's happening at the moment.

I sometimes do things like this to figure out the top 10 IPs that request a certain file:

cat foo.log | grep request_to_file_foo | awk '{print $1}' |  sort -n | uniq -c | sort -rn | head

What do you have in your toolbox?

Giacomo1968
  • 3,522
  • 25
  • 38
deadprogrammer
  • 1,661
  • 7
  • 24
  • 25
  • 1
    I actually had this big beautiful regex I had written by hand to parse all of my apache custom logs down to individual fields for submission into a database. I am kicking myself that I don't have it anymore. It was a one liner; gave you back one variable for each log element - then I was inserting into MySQL. If I find it I'll post it here. – Kyle Jun 03 '09 at 00:31

12 Answers

61

You can do pretty much anything with Apache log files with awk alone. Apache log files are basically whitespace separated; you can pretend the quotes don't exist and access whatever information you are interested in by column number. The only time this breaks down is if you have the combined log format and are interested in user agents, at which point you have to use quotes (") as the separator and run a separate awk command. The following will show you the IPs of every user who requests the index page, sorted by the number of hits:

awk -F'[ "]+' '$7 == "/" { ipcount[$1]++ }
    END { for (i in ipcount) {
        printf "%15s - %d\n", i, ipcount[i] } }' logfile.log

$7 is the requested URL. You can add whatever conditions you want at the beginning. Replace the '$7 == "/"' with whatever information you want.
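
For example (my variation, assuming the common/combined format, where the status code lands in $9 with this field separator), counting per-IP hits that returned a 404:

awk -F'[ "]+' '$9 == "404" { ipcount[$1]++ }
    END { for (i in ipcount) {
        printf "%15s - %d\n", i, ipcount[i] } }' logfile.log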

If you replace the $1 in (ipcount[$1]++), then you can group the results by other criteria. Using $7 would show what pages were accessed and how often. Of course then you would want to change the condition at the beginning. The following would show what pages were accessed by a user from a specific IP:

awk -F'[ "]+' '$1 == "1.2.3.4" { pagecount[$7]++ }
    END { for (i in pagecount) {
        printf "%15s - %d\n", i, pagecount[i] } }' logfile.log

You can also pipe the output through sort to get the results in order, either as part of the shell command or from within the awk script itself (note that the command has to be given as a quoted string inside awk):

awk -F'[ "]+' '$7 == "/" { ipcount[$1]++ }
    END { for (i in ipcount) {
        printf "%15s - %d\n", i, ipcount[i] | sort } }' logfile.log

The latter would be useful if you decided to expand the awk script to print out other information. It's all a matter of what you want to find out. These should serve as a starting point for whatever you are interested in.
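
As a rough sketch of that kind of expansion (my addition, untested, and assuming the response size lands in $10 with this separator), you could track hits and bytes per IP at the same time and still sort inside awk:

awk -F'[ "]+' '$7 == "/" { hits[$1]++; bytes[$1] += $10 }
    END { for (i in hits) {
        printf "%6d hits %12d bytes  %s\n", hits[i], bytes[i], i | "sort -rn" } }' logfile.log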

Mark
  • 2,846
  • 19
  • 13
  • Yah, it always seems strange to see crazy long cat/grep/awk pipelines. Once you're in awk, that's usually enough. The first three clauses of the original post could be trivially written as "awk '/request_to_file_foo/ {print $1}' foo.log". awk can take a file as input, and can use regex to know which lines to care about. – Zac Thompson Jun 04 '09 at 06:06
  • Elegant and simple. Good. – Olivier Dulac Feb 12 '15 at 06:53
  • Beware, spaces seem to be allowed in the "authuser" (3rd) field, which breaks everything, and I personally think it should be forbidden, to allow us doing this ;-) – Mandark Jun 16 '15 at 09:31
25

One thing I've never seen anyone else do, for reasons that I can't imagine, is to change the Apache log file format to a more easily parseable version with the information that actually matters to you.

For example, we never use HTTP basic auth, so we don't need to log those fields. I am interested in how long each request takes to serve, so we'll add that in. For one project, we also want to know (on our load balancer) if any servers are serving requests slower than others, so we log the name of the server we're proxying back to.

Here's an excerpt from one server's apache config:

# We don't want to log bots, they're our friends
BrowserMatch Pingdom.com robot

# Custom log format, for testing
#
#         date          proto   ipaddr  status  time    req     referer         user-agent
LogFormat "%{%F %T}t    %p      %a      %>s     %D      %r      %{Referer}i     %{User-agent}i" standard
CustomLog /var/log/apache2/access.log standard env=!robot

What you can't really tell from this is that between each field is a literal tab character (\t). This means that if I want to do some analysis in Python, maybe show non-200 statuses for example, I can do this:

for line in open("access.log"):
  fields = line.split("\t")
  if fields[3] != "200":
    print(fields)

Or if I wanted to do 'who is hotlinking images?' it would be:

if fields[6] in ("","-") and "/images" in fields[5]:

For IP counts in an access log, the previous example:

grep -o "[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}" logfile | sort -n | uniq -c | sort -n

becomes something like this:

cut -f 3 log | sort | uniq -c | sort -n

Easier to read and understand, and far less computationally expensive (no regex), which, on 9 GB logs, makes a huge difference in how long it takes. Where this gets REALLY neat is when you want to do the same thing for User-agents. If your logs are space-delimited, you have to do some regular expression matching or string searching by hand. With this format, it's simple:

cut -f 8 log | sort | uniq -c | sort -n

Exactly the same as above. In fact, any summary you want to run works in essentially the same way.
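
For instance (my addition, using the same tab-separated format above, where field 7 is the Referer), the top referrers are just:

cut -f 7 log | sort | uniq -c | sort -n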

Why on earth would I spend my system's CPU on awk and grep when cut will do exactly what I want orders of magnitude faster?

Dan Udey
  • 1,460
  • 12
  • 17
17

Forget about awk and grep. Check out asql. Why write unreadable scripts when you can use SQL-like syntax for querying the logfile? E.g.

asql v0.6 - type 'help' for help.
asql> load /home/skx/hg/engaging/logs/access.log
Loading: /home/skx/hg/engaging/logs/access.log
asql> select COUNT(id) FROM logs
46
asql> alias hits SELECT COUNT(id) FROM logs
ALIAS hits SELECT COUNT(id) FROM logs
asql> alias ips SELECT DISTINCT(source) FROM logs;
ALIAS ips SELECT DISTINCT(source) FROM logs;
asql> hits
46
asql> alias
ALIAS hits SELECT COUNT(id) FROM logs
ALIAS ips SELECT DISTINCT(source) FROM logs;
Vihang D
  • 632
  • 4
  • 7
  • Interesting, but you might run into problems if your logs are particularly large I would think. Also how well does it cope with custom log formats? – Vagnerr Jun 05 '09 at 13:39
  • I am trying it at the moment, the load time is so slow (at least in version 0.9). Loading a 200Mb log it's taking more than five minutes.. – aseques Dec 15 '11 at 11:17
  • I must say that after the load time (it took around 15 minutes) the syntax of this program is great, you can sort, count and group by. Really nice. – aseques Dec 15 '11 at 17:19
  • Apache HTTPD has a method with which you can effectively send the logs to a database. Yes, writes might take long, but a threaded proxy might do just the right thing sandwiched in the middle. Anyhow, that'll make querying logs in an SQL like syntax a lot faster. No loading involved too - the database server is perpetually "ON". – nearora May 28 '12 at 23:16
  • I had no idea something like this existed! For web devs that are already very experienced with SQL, this is a great option. – rinogo Sep 21 '20 at 20:20
8

Here is a script to find top URLs, top IPs and top user agents from the most recent N log entries:

#!/bin/bash
# Usage
# ls-httpd type count
# Eg: 
# ls-httpd url 1000
# will find top URLs in the last 1000 access log entries
# ls-httpd ip 1000
# will find top IPs in the last 1000 access log entries
# ls-httpd agent 1000
# will find top user agents in the last 1000 access log entries

type=$1
length=$2

if [ "$3" == "" ]; then
  log_file="/var/log/httpd/example.com-access_log"
else
  log_file="$3"
fi

if [ "$type" = "ip" ]; then
  tail -n $length $log_file | grep -o "[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}" | sort -n | uniq -c | sort -n
elif [ "$type" = "agent" ]; then
  tail -n $length $log_file | awk -F\" '{print $6}'| sort -n | uniq -c | sort -n
elif [ "$type" = "url" ]; then
  tail -n $length $log_file | awk -F\" '{print $2}'| sort -n | uniq -c | sort -n
fi

Source

Bart De Vos
  • 17,761
  • 6
  • 62
  • 81
anoopjohn
  • 201
  • 4
  • 5
5

For IP counts in an access log:

cat log | grep -o "[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}" | sort -n | uniq -c | sort -n

It's a bit ugly, but it works. I also use the following with netstat (to see active connections):

netstat -an | awk '{print $5}' | grep -o "[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}" | egrep -v "(`for i in \`ip addr | grep inet |grep eth0 | cut -d/ -f1 | awk '{print $2}'\`;do echo -n "$i|"| sed 's/\./\\\./g;';done`127\.|0\.0\.0)" | sort -n | uniq -c | sort -n

They're some of my favorite "one liners" :)
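
A gentler variant of the netstat one (my addition, skipping the self-IP exclusion) that just counts established connections per remote address:

netstat -tn | awk '$6 == "ESTABLISHED" { split($5, a, ":"); print a[1] }' | sort | uniq -c | sort -rn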

f4nt
  • 859
  • 10
  • 9
4

Here is my 'sed' example: it reads the default Apache log format and converts it into something more convenient for automated processing. The whole line is matched as a regular expression, the fields are captured and written to the output with '#' as the separator.

The simplified notation of the input is: %s %s %s [%s] "%s" %s %s "%s" "%s"

Example input line: xx.xx.xx.xx - - [29/Mar/2011:12:33:02 +0200] "GET /index.html HTTP/1.0" 200 9443 "-" "Mozilla/4.0"

Example output line: xx.xx.xx.xx#-#-#29/Mar/2011:12:33:02 +0200#GET /index.html HTTP/1.0#200#9443#-#Mozilla/4.0

cat access.log | \
  sed 's/^\(.*\) \(.*\) \(.*\) \[\(.*\)\] \"\(.*\)\" \(.*\) \(.*\) \"\(.*\)\" \"\(.*\)\"$/\1#\2#\3#\4#\5#\6#\7#\8#\9/g'

Feel the power of regular expressions :-)
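
To build on that (my addition): once the fields are '#'-separated, an awk stage with -F'#' gives quick per-field summaries, e.g. hits per status code (field 6 in the output above):

cat access.log | \
  sed 's/^\(.*\) \(.*\) \(.*\) \[\(.*\)\] \"\(.*\)\" \(.*\) \(.*\) \"\(.*\)\" \"\(.*\)\"$/\1#\2#\3#\4#\5#\6#\7#\8#\9/g' | \
  awk -F'#' '{ status[$6]++ } END { for (s in status) print status[s], s }' | sort -rn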

Kris
  • 41
  • 1
  • This made processing with AWK a breeze. Was looking for a quick way to setup a common deliminator and this nailed it. – Citricguy Jun 12 '14 at 03:07
  • I have felt the regex power and just wanted to pass along my own tweak, which cuts out the "HTML/1.1" and separates the protocol (in a probably non-standards compliant way) into its own field. Enjoy: ``` cat access.log | sed 's/^\(.*\) \(.*\) \(.*\) \[\(.*\)\] \"\([[:alpha:]]\+\) \(.*\) HTTP\/1\.1\" \(.*\) \(.*\) \"\(.*\)\" \"\(.*\)\"$/\1#\2#\3#\4#\5#\6#\7#\8#\9#\10/g' ``` – Josh Rumbut Oct 06 '15 at 18:30
3

Building a list of common questions would make a great index for the answers to this question. My common questions are:

  • why did the hitrate change?
  • why is overall response time rising?

I notice such changes by monitoring the server-status pages (via mod_status) for hitrate and approximate response time for active and recently completed requests (knowing full well that I miss a huge pile of data, but samples are good enough).

I use the following LogFormat directive (the %T is really useful):

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %T" custom

I'm looking for cause-effect and what happened first... usually about specific subsets of patterns in my logs, so I need to know the following for any given pattern/regular expression:

  • hitcounts per interval (minute or hour) for a given pattern (ip address or cgi string or parameters, etc)
  • histograms of approximate response time (using %T parameter)

I generally use perl, because eventually it gets complex enough to be worthwhile.


A non-perl example would be a quickie hitrate per minute for non-200 status codes:

tail -9000 access_log | grep -v '" 200 ' | cut -d: -f2,3 | uniq -c

Yes, I'm cheating with that grep, presuming that quote-space-200-space only matches HTTP status codes... you could use awk or perl to isolate the field instead, just keep in mind the quick version could be inaccurate.
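
An awk variant (my tweak, assuming the combined format, where the status/bytes section is the third "-delimited field) that isolates the field instead of pattern-matching the whole line:

tail -9000 access_log | awk -F'"' '$3 !~ /^ 200 /' | cut -d: -f2,3 | uniq -c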


A more complex example in perl might be to visualize a change in hitrate for a pattern.

There is a lot to chew on in the script below, especially if you are unfamiliar with perl.

  • reads stdin so you can use portions of your logs, use tail (especially with tail -f), with or without greps and other filtering...
  • cheats on epoch timestamp extraction with a hacky regex and Date::Manip
  • you could modify it only slightly to extract response time or other arbitrary data

code follows:

#!/usr/bin/perl
# script to show changes in hitrates for any regex pattern
# results displayed with arbitrary intervals
# and ascii indication of frequency
# gaps are also displayed properly
use Date::Manip;
use POSIX qw(strftime);
$pattern=shift || ".";
$ival=shift || 60;
$tick=shift || 10;
$minb=undef;
while (<>){
    next unless /$pattern/;
    $stamp="$1 $2" if m[(../.../....):(..:..:..)];
    $epoch = UnixDate(ParseDate($stamp),"%s");
    $bucket= int($epoch/$ival)*$ival;
    $minb=$bucket if $bucket<$minb || !defined($minb);
    $maxb=$bucket if $bucket>$maxb;
    $count{$bucket}++;
}
# loop thru the min/max range to expose any gaps
for($t=$minb;$t<=$maxb;$t+=$ival){
    printf "%s %s %4d %s\n",
            $t,
            strftime("%m/%d/%Y %H:%M:%S",localtime($t)),
            $count{$t}+0,
            substr("x"x100,0,$count{$t}/$tick
    );
}
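
A typical invocation (hypothetical script name), bucketing a pattern from the last 9000 lines into 60-second intervals with one 'x' per 10 hits, might look like:

tail -9000 access_log | ./hitrate.pl 'POST /login' 60 10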

If you just want to process standard metrics, check out

  • 'mergelog' to get all your logs together (if you have multiple apaches behind a load balancer) and
  • webalizer (or awstats or other common analyzer).
Jeff Atwood
  • 12,994
  • 20
  • 74
  • 92
ericslaw
  • 1,562
  • 2
  • 13
  • 15
2

Who is hot-linking your images:

awk -F\" '($2 ~ /\.(jpg|gif)/ && $4 !~ /^http:\/\/www\.mydomain\.com/){print $4}' access_log | sort | uniq -c | sort
quanta
  • 50,327
  • 19
  • 152
  • 213
rkthkr
  • 8,503
  • 26
  • 38
2

I use awk a lot, by tailing or cat'ing the file. Each night I deliver myself a web report for each server. Depending on your log file and your LogFormat, you'll need to edit some of the one-liners to work for you...

Here's a simple example:

If I want to tail the logs on my server for only 404/500 status codes I'd do this:

# in my log format, $8 is the status code and $6 is the requested page

tail -f ${APACHE_LOG} | awk '$8 ~ /(404|500)/ {print $6}'

<snip>

echo ""
#echo  "Hits by source IP:"
echo "======================================================================"

awk '{print $2}' "$1" | grep -ivE "(127.0.0.1|192.168.100.)" | sort | uniq -c | sort -rn | head -25

echo ""
echo ""
#echo "The 25 most popular pages:"
echo "======================================================================"

awk '{print $6}' "$1" | grep -ivE '(mod_status|favico|crossdomain|alive.txt)' | grep -ivE '(.gif|.jpg|.png)' | \
 sed 's/\/$//g' | sort | \
 uniq -c | sort -rn | head -25

echo ""    
echo ""
echo "The 25 most popular pages (no js or css):"
echo "======================================================================"

awk '{print $6}' "$1" | grep -ivE '(mod_status|favico|crossdomain|alive.txt)' | grep -ivE '(.gif|.jpg|.png|.js|.css)' | \
 sed 's/\/$//g' | sort | \
   uniq -c | sort -rn | head -25

   echo ""


#echo "The 25 most common referrer URLs:"
echo "======================================================================"

awk '{print $11}' "$1" | \
 grep -vE "(^"-"$|/www.$host|/$host)" | \
 sort | uniq -c | sort -rn | head -25

echo ""

#echo "Longest running requests"
echo "======================================================================"

awk  '{print $10,$6}' "$1" | grep -ivE '(.gif|.jpg|.png|.css|.js)'  | awk '{secs=0.000001*$1;req=$2;printf("%.2f minutes req time for %s\n", secs / 60,req )}' | sort -rn | head -50

exit 0

</snip>

1

The thing I tend to do most of the time is read sections of a log based on time, so I wrote the following script, which uses sed to pull out the period I'm interested in. It works on every log file I've come across and can handle archived logs as well.

#!/bin/bash
#This script should return a set of lines between 2 values, the main purpose is for searching a log file between 2 times
#Script usage: logship.sh "start" "stop" file

#If the file contains any "/" in the date range the following 2 lines add the escape character so that the search can be performed for those characters
start=$(echo "$1" | sed 's/\//\\\//g')
stop=$(echo "$2" | sed 's/\//\\\//g')

zipped=$(echo "$3" | grep -c "gz$")     #figures out if the file is zipped or not

if [ "$zipped" ==  "1" ]; then          #If the file is zipped then pass it through zcat before sed
        zcat $3 | sed -n "/$start/,/$stop/p";
else
        sed -n "/$start/,/$stop/p" $3;  #if it's not zipped just run sed
fi
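
For example (hypothetical timestamps and log path), to pull roughly the 10:00 to 11:00 hour out of a rotated, gzipped log:

./logship.sh "12/Jun/2015:10" "12/Jun/2015:11" /var/log/httpd/access_log.1.gz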
Chris
  • 246
  • 1
  • 4
1

While not sed or awk, there are two things I've found useful for handling apache and icecast log files.

AWStats has a very useful script called logresolvemerge.pl that will combine multiple compressed or uncompressed log files, strip dupes and sort by timestamp. It can also do DNS lookups and be configured to run multithreaded. It's particularly useful when used with awstats, because awstats cannot add log lines with timestamps older than the current database, so everything must be added in order; that's very easy, though, as you just chuck everything at logresolvemerge.pl and it all pops out nicely.

sed and awk are pretty bad at handling dates because they generally treat them as strings. awk has some time and date functions, but they are not up to much. For example extracting a range of lines between two timestamps is hard if those exact timestamps do not occur in the file (even if values between them do) - Chris' example has exactly this problem. To deal with that, I wrote a PHP script that reports log file timestamp ranges and can also extract a chunk by timestamp range, using any date or time format you like (it doesn't need to match the log file's timestamp format).

To keep this on-topic, here are a couple of useful awkisms. Get the total number of bytes served from an Apache or Icecast log:

cat access.log | awk '{ sum += $10 } END { print sum }'

Get the total number of seconds connected from an icecast log:

cat access.log | awk '{ sum += $13 } END { print sum }'
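
If you want those totals in something more readable than raw numbers, the same awk can do the conversion (my tweak, for the byte total):

awk '{ sum += $10 } END { printf "%.2f GiB served\n", sum / (1024*1024*1024) }' access.log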
Synchro
  • 2,983
  • 5
  • 25
  • 35
0

Recovering this old thread: after giving up on asql for big log files, I looked for a solution again (also on Server Fault) and found out about wtop here. It's an open-source tool capable of live monitoring or processing logs to get stats (top N), and it's very flexible and powerful. The official place is here.

aseques
  • 688
  • 4
  • 12
  • 26