As Stack Overflow grows, we're starting to look closely at our IIS logs to identify problem HTTP clients -- things like rogue web spiders, users who have a large page set to refresh every second, poorly written one-off web scrapers, tricksy users who try to increment a page's view count a zillion times, and so forth.
I've come up with a few LogParser queries that help us identify most of the oddities and abnormalities when pointed at an IIS log file.
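For anyone following along, here's a minimal sketch of how one of the queries below gets pointed at a log from the command line (assuming Microsoft's LogParser 2.2 is on the path; I'm recalling the -i/-o switches from memory, so double-check them against LogParser -h):

rem Sketch: run one of the queries below directly against an IIS W3C extended log.
rem -i:IISW3C tells LogParser the input is an IIS W3C extended log file;
rem -o:DATAGRID pops the results into a sortable grid (use -o:CSV to save to a file instead).
LogParser.exe -i:IISW3C -o:DATAGRID ^
  "SELECT TOP 100 cs-uri-stem AS Url, COUNT(*) AS Hits FROM c:\working\sologs\u_ex090708.log GROUP BY cs-uri-stem ORDER BY Hits DESC"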
Top bandwidth usage by URL
SELECT TOP 50 DISTINCT
    SUBSTR(TO_LOWERCASE(cs-uri-stem), 0, 55) AS Url,
    COUNT(*) AS Hits,
    AVG(sc-bytes) AS AvgBytes,
    SUM(sc-bytes) AS ServedBytes
FROM {filename}
GROUP BY Url
HAVING Hits >= 20
ORDER BY ServedBytes DESC
url                                               hits  avgbyte served
------------------------------------------------- ----- ------- -------
/favicon.ico                                      16774 522     8756028
/content/img/search.png                           15342 446     6842532
Top hits by URL
SELECT TOP 100
    cs-uri-stem AS Url,
    COUNT(cs-uri-stem) AS Hits
FROM {filename}
GROUP BY cs-uri-stem
ORDER BY COUNT(cs-uri-stem) DESC
url                                               hits
------------------------------------------------- -----
/content/img/sf/vote-arrow-down.png               14076
/content/img/sf/vote-arrow-up.png                 14018
Top bandwidth and hits by IP / User-Agent
SELECT TOP 30
    c-ip AS Client,
    SUBSTR(cs(User-Agent), 0, 70) AS Agent,
    SUM(sc-bytes) AS TotalBytes,
    COUNT(*) AS Hits
FROM {filename}
GROUP BY c-ip, cs(User-Agent)
ORDER BY TotalBytes DESC
client        user-agent                                    totbytes  hits
------------- --------------------------------------------- --------- -----
66.249.68.47  Mozilla/5.0+(compatible;+Googlebot/2.1;       135131089 16640
194.90.190.41 omgilibot/0.3++omgili.com                     133805857 6447
Top bandwidth by hour by IP / User-Agent
SELECT TOP 30
    TO_STRING(time, 'h') AS Hour,
    c-ip AS Client,
    SUBSTR(cs(User-Agent), 0, 70) AS Agent,
    SUM(sc-bytes) AS TotalBytes,
    COUNT(*) AS Hits
FROM {filename}
GROUP BY c-ip, cs(User-Agent), Hour
ORDER BY SUM(sc-bytes) DESC
hr client        user-agent                                totbytes hits
-- ------------- ----------------------------------------- -------- ----
9  194.90.190.41 omgilibot/0.3++omgili.com                 30634860 1549
10 194.90.190.41 omgilibot/0.3++omgili.com                 29070370 1503
Top hits by hour by IP / User-Agent
SELECT TOP 30
    TO_STRING(time, 'h') AS Hour,
    c-ip AS Client,
    SUBSTR(cs(User-Agent), 0, 70) AS Agent,
    COUNT(*) AS Hits,
    SUM(sc-bytes) AS TotalBytes
FROM {filename}
GROUP BY c-ip, cs(User-Agent), Hour
ORDER BY Hits DESC
hr client        user-agent                                hits totbytes
-- ------------- ----------------------------------------- ---- --------
10 194.90.190.41 omgilibot/0.3++omgili.com                 1503 29070370
12 66.249.68.47  Mozilla/5.0+(compatible;+Googlebot/2.1    1363 13186302
The {filename}, of course, would be the path to an IIS log file, such as
c:\working\sologs\u_ex090708.log
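One detail worth mentioning: rather than hand-editing {filename} into each query, LogParser can (if I recall its parameter syntax correctly) substitute it for you when the query lives in a .sql file. Something like this, where topurls.sql is a hypothetical file containing the first query with %filename% written in place of {filename}:

rem Hypothetical sketch: save the "top bandwidth by URL" query as topurls.sql,
rem using %filename% where {filename} appears above, and let LogParser fill it in.
LogParser.exe -i:IISW3C file:topurls.sql?filename=c:\working\sologs\u_ex090708.log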
I did a lot of web searching for good IIS LogParser queries and found precious little. The five queries above have helped us tremendously in identifying serious problem clients. But I'm wondering -- what are we missing?
What other ways are there to slice and dice the IIS logs (preferably with LogParser queries) to mine them for statistical anomalies? Do you have any good IIS LogParser queries you run on your servers?