
I am planning on using Webalizer to analyze and graph our IIS logs, but because we have a server farm, Webalizer requires that all of the logs be in chronological order (otherwise it starts skipping results).

Our logs are stored gzipped, so I started by unzipping everything to separate files, and then I used LogParser 2.2 to merge those files. My LogParser command was:

LogParser.exe -i:iisw3c "select * into combinedLogFile.log from *.log order by date, time" -o:w3c 
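For reference, the unzip step can be scripted from PowerShell with .NET's GZipStream; a rough sketch (the *.log.gz naming is an assumption):

# Sketch: decompress each *.log.gz into a plain .log beside it,
# streaming through a 64 KB buffer so nothing large sits in memory.
Get-ChildItem *.log.gz | ForEach-Object {
    $in  = New-Object IO.FileStream($_.FullName, [IO.FileMode]::Open)
    $gz  = New-Object IO.Compression.GZipStream($in, [IO.Compression.CompressionMode]::Decompress)
    $out = New-Object IO.FileStream(($_.FullName -replace '\.gz$', ''), [IO.FileMode]::Create)
    $buf = New-Object byte[] 65536
    while (($n = $gz.Read($buf, 0, $buf.Length)) -gt 0) { $out.Write($buf, 0, $n) }
    $gz.Close(); $out.Close(); $in.Close()
}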

I probably don't need *, but I do need most of the fields because Webalizer will need them. This works perfectly fine on some of my logs; however, one of our server farm clusters generates a LOT of logs: we have 14 servers, and each server's logs are (at least) 2.5 GB per day (one log file per day). When I try to merge these logs, LogParser just crashes with a meaningless generic error.

I assumed it was a memory issue, so I tried a number of ways to minimize memory usage.

I am using PowerShell to call LogParser, so I started trying to feed it the input through standard PowerShell piping. (That caused an OutOfMemoryException in PowerShell itself, rather than in LogParser, and sooner than any way of using the files directly.)

What I finally ended up with was multiple named pipes fed by a batch-file call to "cat", piping that directly into LogParser... and that just put me back where I started when I was unzipping everything in advance.

We have other scripts that process these same log files, and none of them have issues (although their output is generally smaller than this one's will be).

So I just want to know if you have any ideas for a better way to merge all of these files, or for a LogParser script that will work, since the one I came up with isn't sufficient.

P.S. I know I could probably write a merging program in .NET, since all of the individual logs are already sorted and I wouldn't need to read more than a few rows at a time, but I am trying to avoid having to do that if possible.
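To give a sense of what I mean, here's a rough PowerShell sketch of that kind of merge (the server*.log name pattern is made up, and the W3C "#" header lines aren't handled specially):

# Sketch of a streaming k-way merge: every input file is already
# sorted, so only one line per file is held in memory at a time.
$readers = @(Get-ChildItem server*.log | ForEach-Object {
    New-Object IO.StreamReader($_.FullName)
})
$out   = New-Object IO.StreamWriter('combinedLogFile.log')
$heads = @($readers | ForEach-Object { $_.ReadLine() })

while ($true) {
    # pick the reader whose current line sorts first ($null means EOF)
    $min = -1
    for ($i = 0; $i -lt $heads.Count; $i++) {
        if ($heads[$i] -ne $null -and
            ($min -eq -1 -or $heads[$i] -lt $heads[$min])) { $min = $i }
    }
    if ($min -eq -1) { break }                 # every file is exhausted
    $out.WriteLine($heads[$min])
    $heads[$min] = $readers[$min].ReadLine()   # refill from that file
}
$readers | ForEach-Object { $_.Close() }
$out.Close()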

    Not an answer to your question, but I went with logging from our load balancer instead of IIS -- helps out a lot IMO. – Kyle Brandt Jul 15 '11 at 18:06
  • what load balancer do you use? Maybe I can use that method. – James J. Regan IV Jul 15 '11 at 18:08
  • We use haproxy. This also gives us the response time of each web request as seen from the load balancer, and avoids some strangeness with output caching and X-Forwarded-For with client IPs. – Kyle Brandt Jul 15 '11 at 18:12

1 Answer


Given that you are running into issues trying to sort the data for a single day, I'd look to one of two strategies.

  1. Find a better sort. See if you can get the Windows sort tool to work for you. The logs are arranged with date and time first, in an ASCII-text-sort-friendly format, for a reason. It uses a lot less memory and doesn't have to parse lines to sort. My bet is this works for you (a rough sketch follows this list).

  2. Write an interleaver that opens all 14 files and pulls the earliest line from the top of each, working its way through the 14 files simultaneously. I shudder to think of it, but it wouldn't need more than 64 KB of memory per file.
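For option 1, a rough per-day sketch (the file names are invented; note that the W3C "#" header lines will sort to the top, since '#' comes before the digits in ASCII):

# Sketch: glue one day's logs from all 14 servers together, then let
# sort.exe do an external, disk-backed sort of the combined file.
cmd /c 'copy /b server*-110715.log all-110715.txt'
# call sort.exe explicitly; plain "sort" is PowerShell's Sort-Object alias
sort.exe all-110715.txt /T D:\sorttemp /O sorted-110715.log

/T points sort's temporary working files at a drive with enough free space, and /O names the output file.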

old answer:

Divide and conquer. Write one script that reads logs and puts them in new files by date, with a known filename that has the date in it (weblog-20110101.log). Run a sort on each file that sorts by time. Cat the files you need together.

Mark
  • my log files are already in separate files by date. Sorry I didn't make that clear. There is one file per day for each of the 14 webservers. – James J. Regan IV Jul 15 '11 at 18:13
  • Well, if they are IIS logs they should have a good date in the file name. My suggestion stands with one tweak... the first script just needs to find the 14 files for each day and get them together. Sort by day to interleave the lines, then cat the days together. – Mark Jul 15 '11 at 18:17
  • "Sort by day to interleave the lines" is the problem I am trying to solve. How is having 1 file to sort going to be any different than having 14? – James J. Regan IV Jul 15 '11 at 18:23
  • Yeah, your writeup makes it sound like you are sorting months of data at once. Sorry. I'll ponder a bit and update the answer. – Mark Jul 15 '11 at 18:29
  • #2 is essentially the latter half of a "merge sort" – Chris Nava Jul 15 '11 at 19:07
  • the Windows sort seems to have worked. I haven't run it on all of my data yet (as it would take a really long time), but it got significantly further than LogParser did, and I noticed its RAM wasn't increasing. Thanks – James J. Regan IV Jul 19 '11 at 21:42