
As Stack Overflow grows, we're starting to look closely at our IIS logs to identify problem HTTP clients -- things like rogue web spiders, users who have a large page set to refresh every second, poorly written one-off web scrapers, tricksy users who try to increment page count a zillion times, and so forth.

I've come up with a few LogParser queries that help us identify most of the oddities and abnormalities when pointed at an IIS log file.

Top bandwidth usage by URL

SELECT top 50 DISTINCT 
SUBSTR(TO_LOWERCASE(cs-uri-stem), 0, 55) AS Url, 
Count(*) AS Hits, 
AVG(sc-bytes) AS AvgBytes, 
SUM(sc-bytes) as ServedBytes 
FROM {filename} 
GROUP BY Url 
HAVING Hits >= 20 
ORDER BY ServedBytes DESC
url                                                   hits  avgbyte  served
-------------------------------------------------     ----- -------  -------
/favicon.ico                                          16774 522      8756028
/content/img/search.png                               15342 446      6842532

Top hits by URL

SELECT TOP 100 
cs-uri-stem as Url, 
COUNT(cs-uri-stem) AS Hits 
FROM {filename} 
GROUP BY cs-uri-stem 
ORDER BY COUNT(cs-uri-stem) DESC
url                                                                    hits
-------------------------------------------------                      -----
/content/img/sf/vote-arrow-down.png                                    14076
/content/img/sf/vote-arrow-up.png                                      14018

Top bandwidth and hits by IP / User-Agent

SELECT TOP 30
c-ip as Client, 
SUBSTR(cs(User-Agent), 0, 70) as Agent, 
Sum(sc-bytes) AS TotalBytes, 
Count(*) as Hits 
FROM {filename} 
group by c-ip, cs(User-Agent) 
ORDER BY TotalBytes desc
client         user-agent                                      totbytes   hits
-------------  ---------------------------------------------   ---------  -----
66.249.68.47   Mozilla/5.0+(compatible;+Googlebot/2.1;         135131089  16640
194.90.190.41  omgilibot/0.3++omgili.com                       133805857  6447

Top bandwidth by hour by IP / User-Agent

SELECT TOP 30
TO_STRING(time, 'h') as Hour, 
c-ip as Client, 
SUBSTR(cs(User-Agent), 0, 70) as Agent, 
Sum(sc-bytes) AS TotalBytes, 
count(*) as Hits 
FROM {filename} 
group by c-ip, cs(User-Agent), hour 
ORDER BY sum(sc-bytes) desc
hr   client        user-agent                                  totbytes   hits
--   ------------- -----------------------------------------   --------   ----
9    194.90.190.41 omgilibot/0.3++omgili.com                   30634860   1549
10   194.90.190.41 omgilibot/0.3++omgili.com                   29070370   1503

Top hits by hour by IP / User-Agent

SELECT TOP 30
TO_STRING(time, 'h') as Hour, 
c-ip as Client, 
SUBSTR(cs(User-Agent), 0, 70) as Agent, 
count(*) as Hits, 
Sum(sc-bytes) AS TotalBytes 
FROM {filename} 
group by c-ip, cs(User-Agent), hour 
ORDER BY Hits desc
hr   client         user-agent                                  hits  totbytes
--   -------------  -----------------------------------------   ----  --------
10   194.90.190.41  omgilibot/0.3++omgili.com                   1503  29070370
12   66.249.68.47   Mozilla/5.0+(compatible;+Googlebot/2.1      1363  13186302

The {filename} of course would be a path to an IIS logfile, such as

c:\working\sologs\u_ex090708.log
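
For example, to run the "Top hits by URL" query against that file from the command line, the invocation would look something like this (assuming Log Parser 2.2 and its IISW3C input format; adjust -i if your logs are in a different format):

logparser -i:IISW3C -rtp:-1 "SELECT TOP 100 cs-uri-stem AS Url, COUNT(cs-uri-stem) AS Hits FROM c:\working\sologs\u_ex090708.log GROUP BY cs-uri-stem ORDER BY COUNT(cs-uri-stem) DESC"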

I did a lot of web searches for good IIS LogParser queries and found precious little. These 5, above, have helped us tremendously in identifying serious problem clients. But I'm wondering -- what are we missing?

What other ways are there to slice and dice the IIS logs (preferably with LogParser queries) to mine them for statistical anomalies? Do you have any good IIS LogParser queries you run on your servers?

Jeff Atwood

8 Answers


A good indicator for hacking activities or other attacks is the number of errors per hour. The following script returns the dates and hours that had more than 25 error codes returned. Adjust the value depending on the amount of traffic on the site (and the quality of your web application ;-) ).

SELECT date as Date, QUANTIZE(time, 3600) AS Hour, 
       sc-status as Status, count(*) AS ErrorCount
FROM   {filename} 
WHERE  sc-status >= 400 
GROUP BY date, hour, sc-status 
HAVING ErrorCount > 25
ORDER BY ErrorCount DESC

The result could look something like this:

Date       Hour     Status ErrorCount
---------- -------- ------ ------
2009-07-24 18:00:00 404    187
2009-07-17 13:00:00 500    99
2009-07-21 21:00:00 404    80
2009-07-03 04:00:00 404    45
...

The next query detects an unusually high number of hits on a single URL from one IP address. In this example I chose 500, but you may have to change the query for edge cases (excluding the IP address of Google London, for example ;-) ).

SELECT DISTINCT date AS Date, cs-uri-stem AS URL,
      c-ip AS IPAddress, Count(*) AS Hits
FROM  {filename}
GROUP BY date, c-ip, cs-uri-stem
HAVING Hits > 500
ORDER BY Hits Desc
Date       URL                                 IPAddress       Hits
---------- ----------------------------------- --------------- ----
2009-07-24 /Login.aspx                         111.222.111.222 1889
2009-07-12 /AccountUpdate.aspx                 11.22.33.44     973
2009-07-19 /Login.aspx                         123.231.132.123 821
2009-07-21 /Admin.aspx                         44.55.66.77     571
...
splattne
  • the second query, we already do -- note the grouping in several of the queries. By IP and User Agent, this is the best of both worlds, since if the User Agent is null, it's the same as IP, and if not, that's more info. – Jeff Atwood Jul 25 '09 at 12:50
  • Okay Jeff, adding the user-agent makes sense. Sorry, a combination of bad short-time memory and Attention Deficit Disorder. :-) – splattne Jul 25 '09 at 13:00
  • replacing the `having` with a `Limit n` might make for a good way to tune the first query – BCS Aug 06 '09 at 19:27

One thing you could consider to filter out legitimate traffic (and broaden your scope) is to enable cs(Cookie) in your IIS logs, add a bit of code that sets a small cookie using javascript, and add WHERE cs(Cookie)=''.

Because of your small bit of code, every user should have a cookie unless they manually disabled cookies (which a small percent of people might do) or unless that user is actually a bot that doesn't support Javascript (for example, wget, httpclient, etc. don't support Javascript).

I suspect that if a user has a high volume of activity, but they accept cookies and have javascript enabled, they are more likely to be a legitimate user, whereas if you find a user with a high volume of activity but no cookie/javascript support, they are more likely to be a bot.
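
As a rough sketch (untested), the bandwidth-by-IP query from the question could then be limited to cookieless clients; depending on how your logs record empty fields, you may need cs(Cookie) IS NULL instead of the empty-string comparison:

SELECT TOP 30
c-ip AS Client, 
SUBSTR(cs(User-Agent), 0, 70) AS Agent, 
SUM(sc-bytes) AS TotalBytes, 
COUNT(*) AS Hits 
FROM {filename} 
WHERE cs(Cookie) = '' 
GROUP BY c-ip, cs(User-Agent) 
ORDER BY TotalBytes DESC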

Adam Brand

Sorry, can't comment yet so I'm forced to answer.

There's a minor bug with the 'Top bandwidth usage by URL' query. While most of the time you'd be okay taking your requests for a page and multiplying by the file size, in this case, since you're not paying attention to any query parameters, you're going to run into some slightly-to-very inaccurate numbers.

For a more accurate value, just do a SUM(sc-bytes) instead of the MUL(Hits, AvgBytes) as ServedBytes.
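
In other words, replace

MUL(Hits, AvgBytes) AS ServedBytes

with

SUM(sc-bytes) AS ServedBytes

in the SELECT list. (The query as it now appears in the question already reflects this fix.)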

James Skemp

Anders Lundström has been writing a series of blog articles regarding common LogParser queries.

I've been using these:

Christopher_G_Lewis

This guy has about a dozen useful queries:

http://logparserplus.com/Examples/Queries.aspx

Portman

You may want to look for your longest requests (stems and/or query strings), and the ones with the most bytes received by the server. I'd also try one that groups by bytes received and IP, so you can see whether a particular request format is being repeated by one IP.

SELECT TOP 30
cs-uri-stem,
cs-uri-query,
c-ip as Client, 
SUBSTR(cs(User-Agent), 0, 70) as Agent, 
cs-bytes
FROM {filename} 
WHERE cs-uri-stem <> '/search'
ORDER BY STRLEN(cs-uri-query) desc

SELECT TOP 30
COUNT(*) AS Hits,
c-ip as Client, 
SUBSTR(cs(User-Agent), 0, 70) as Agent, 
cs-bytes
FROM {filename} 
GROUP BY c-ip, cs(User-Agent), cs-bytes 
ORDER BY Hits desc

I'd also count hits grouped by requesting IP and the hour and minute of the day, or by requesting IP and just the minute of the hour, to find regularly recurring visits that may be scripts. This would be a small modification of the hits-by-hour query.
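
Something along these lines might work for the minute-of-the-hour grouping (unverified, like the rest; I'm assuming TO_STRING(time, 'mm') extracts just the minute portion):

SELECT TOP 30
TO_STRING(time, 'mm') AS Minute, 
c-ip AS Client, 
SUBSTR(cs(User-Agent), 0, 70) AS Agent, 
COUNT(*) AS Hits 
FROM {filename} 
GROUP BY Minute, c-ip, cs(User-Agent) 
ORDER BY Hits DESC

A client that racks up hits in the same minute of every hour is a decent candidate for a scheduled script.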

On any non-programming site, searching your logs for SQL keywords is also a good idea: things like SELECT, UPDATE, DROP, DELETE, and other oddities like FROM sys.tables. ORing those together and counting by IP would seem handy. For most sites, including these, the words would rarely if ever appear in the query portion of the URI, though here they might legitimately appear in the URI stem and data parts.

I also like reversing the IPs of any hits just to see who's running premade scripts. I tend to see .ru, .br, .cz and .cn. I don't mean to judge, but I sort of tend to block them henceforth. In their defense, those countries are by and large heavily populated, though thus far I don't see much of, say, .in, .fr, .us or .au doing the same.

SELECT TOP 30
c-ip as Client, 
SUBSTR(cs(User-Agent), 0, 70) as Agent, 
cs-uri-stem,
TO_LOWERCASE(cs-uri-query) AS q,
count(*) as Hits,
SUM(sc-bytes) AS BytesSent,
SUM(cs-bytes) AS BytesRecv
FROM {filename} 
WHERE TO_LOWERCASE(cs-uri-query) LIKE '%select%' OR TO_LOWERCASE(cs-uri-query) LIKE '%sys.tables%' OR etc... 
GROUP BY c-ip, cs(User-Agent), cs-uri-stem, q 
ORDER BY Hits desc

P.S. I can't verify that these queries would actually run correctly. Please edit them freely if they need fixing.

dlamblin

These were all found here (which is an excellent guide for parsing your IIS logfiles, btw):

20 newest files on your website

logparser -i:FS "SELECT TOP 20 Path, CreationTime from c:\inetpub\wwwroot\*.* ORDER BY CreationTime DESC" -rtp:-1

Path                                                        CreationTime
----------------------------------------------------------- ------------------
c:\inetpub\wwwroot\Default.asp                              6/22/2003 6:00:01
c:\inetpub\wwwroot\About.asp                                6/22/2003 6:00:00
c:\inetpub\wwwroot\global.asa                               6/22/2003 6:00:00
c:\inetpub\wwwroot\Products.asp                             6/22/2003 6:00:00

20 most recently modified files

logparser -i:FS "SELECT TOP 20 Path, LastWriteTime from c:\inetpub\wwwroot\*.* ORDER BY LastWriteTime DESC" -rtp:-1

Path                                                        LastWriteTime
----------------------------------------------------------- ------------------
c:\inetpub\wwwroot\Default.asp                              6/22/2003 14:00:01
c:\inetpub\wwwroot\About.asp                                6/22/2003 14:00:00
c:\inetpub\wwwroot\global.asa                               6/22/2003 6:00:00
c:\inetpub\wwwroot\Products.asp                             6/22/2003 6:00:00

Files that have resulted in 200 status codes (in case trojans were deleted)

logparser "SELECT DISTINCT TO_LOWERCASE(cs-uri-stem) AS URL, Count() AS Hits FROM ex.log WHERE sc-status=200 GROUP BY URL ORDER BY URL" -rtp:-1

URL                                      Hits
---------------------------------------- -----
/About.asp                               122
/Default.asp                             9823
/downloads/setup.exe                     701
/files.zip                               1
/Products.asp                            8341
/robots.txt                              2830

Show any IP address that hit the same page more than 50 times in a single day

logparser "SELECT DISTINCT date, cs-uri-stem, c-ip, Count() AS Hits FROM ex.log GROUP BY date, c-ip, cs-uri-stem HAVING Hits>50 ORDER BY Hits Desc" -rtp:-1

date       cs-uri-stem                         c-ip            Hits
---------- ----------------------------------- --------------- ----
2003-05-19 /Products.asp                       203.195.18.24   281
2003-06-22 /Products.asp                       210.230.200.54  98
2003-06-05 /Products.asp                       203.195.18.24   91
2003-05-07 /Default.asp                        198.132.116.174 74
GregD
  • I've looked at those and didn't find them particularly helpful. They are mostly "find the compromise" and that's not really our goal (we haven't been compromised) – Jeff Atwood Aug 07 '09 at 08:59

I don't know how to do it with LogParser, but looking for requests for things like "phpMyAdmin" (or other common vulnerabilities) that get 404s might be a good way to identify scripted attacks.
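
Perhaps something along these lines would do it (unverified; 'phpmyadmin' here is just one example of a commonly probed path):

SELECT c-ip AS Client, cs-uri-stem AS Url, COUNT(*) AS Hits
FROM {filename}
WHERE sc-status = 404 AND TO_LOWERCASE(cs-uri-stem) LIKE '%phpmyadmin%'
GROUP BY c-ip, cs-uri-stem
ORDER BY Hits DESC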

BCS
  • the goal isn't to find scripted attacks of that type, but irresponsible / problem clients related to bandwidth and traffic. – Jeff Atwood Aug 07 '09 at 09:04
  • I'd say any client that is TRYING to hurt me is a problem client and I'd rather not have them using my bandwidth. – BCS Aug 07 '09 at 17:16