One thing I've never seen anyone else do, for reasons that I can't imagine, is to change the Apache log file format to a more easily parseable version with the information that actually matters to you.
For example, we never use HTTP basic auth, so we don't need to log those fields. I am interested in how long each request takes to serve, so we'll add that in. For one project, we also want to know (on our load balancer) if any servers are serving requests slower than others, so we log the name of the server we're proxying back to.
Here's an excerpt from one server's apache config:
# We don't want to log bots; they're our friends
BrowserMatch Pingdom.com robot
# Custom log format, for testing
#
# date port ipaddr status time req referer user-agent
LogFormat "%{%F %T}t %p %a %>s %D %r %{Referer}i %{User-agent}i" standard
CustomLog /var/log/apache2/access.log standard env=!robot
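To make the field positions concrete, here's a hypothetical line in this format (the values are invented, but the layout matches the LogFormat above) and how Python splits it:

```python
# A hypothetical log line in the custom format above, fields joined by tabs.
# Layout: date port ipaddr status time req referer user-agent
sample = "\t".join([
    "2013-06-01 12:34:56",            # %{%F %T}t
    "80",                             # %p
    "192.0.2.10",                     # %a
    "404",                            # %>s
    "1523",                           # %D (microseconds)
    "GET /images/logo.png HTTP/1.1",  # %r
    "-",                              # Referer
    "Mozilla/5.0",                    # User-agent
])

fields = sample.split("\t")
print(fields[3])  # the status code: "404"
```

Note that the request line (%r) contains spaces but no tabs, so it stays in one field; that's exactly what makes tab a better delimiter than space here.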
What you can't really tell from this is that between each field is a literal tab character (\t). This means that if I want to do some analysis in Python, maybe show non-200 statuses for example, I can do this:
for line in open("access.log"):
    line = line.rstrip("\n").split("\t")
    if line[3] != "200":
        print(line)
Or if I wanted to answer 'who is hotlinking images?', the test inside that same loop would be
if line[6] in ("","-") and "/images" in line[5]:
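Fleshed out into a self-contained sketch (same field layout; the /images path prefix is just an example):

```python
def find_hotlinks(lines):
    """Return log lines for image requests with an empty or missing Referer."""
    hits = []
    for line in lines:
        line = line.rstrip("\n").split("\t")
        # Field 6 is the Referer, field 5 is the request line.
        if line[6] in ("", "-") and "/images" in line[5]:
            hits.append(line)
    return hits
```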
For IP counts in an access log, the previous example:
grep -o "[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}" logfile | sort -n | uniq -c | sort -n
becomes something like this:
cut -f 3 log | sort | uniq -c | sort -n
Easier to read and understand, and far less computationally expensive (no regex), which, on 9 GB logs, makes a huge difference in how long it takes. (Note that `uniq -c` only counts adjacent duplicates, hence the `sort` in front of it.) Where this gets REALLY neat is if you want to do the same thing for User-agents. If your logs are space-delimited, you have to do some regular expression matching or string searching by hand. With this format, it's simple:
cut -f 8 log | sort | uniq -c | sort -n
Exactly the same as the above. In fact, any summary you want to do is essentially exactly the same.
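The same 'count one column' pattern is just as short in Python, if you'd rather stay out of the shell (the field index is zero-based here, where cut's is one-based):

```python
from collections import Counter

def field_counts(lines, index):
    """Tally one tab-separated field, like cut -f N | sort | uniq -c."""
    return Counter(line.rstrip("\n").split("\t")[index] for line in lines)

# e.g. field_counts(open("access.log"), 2) counts IP addresses,
#      field_counts(open("access.log"), 7) counts user agents.
```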
Why on earth would I spend my system's CPU on awk and grep when cut will do exactly what I want orders of magnitude faster?