Counting duplicate lines from a stream


I'm currently parsing Apache logs with this command:

tail -f /opt/apache/logs/access/gvh-access_log.1365638400  | 
grep specific.stuff. | awk '{print $12}' | cut -d/ -f3 > ~/logs

The output is a list of domains:

www.domain1.com
www.domain1.com
www.domain2.com
www.domain3.com
www.domain1.com

In another terminal I then run this command:

watch -n 10 'cat ~/logs | sort | uniq -c | sort -n | tail -50'

The output is:

1023 www.domain2.com
2001 www.domain3.com
12393 www.domain1.com

I use this to monitor Apache stats in quasi-real time. The trouble is that the dumped log gets very big very fast, and I don't need it for anything other than feeding uniq -c.

My question is: is there any way to avoid using a temporary file? I don't want to hand-roll my own counter in my language of choice; I'd like to use some awk magic if possible.

Note that since I need to use sort, a temp file seems unavoidable: sorting a stream is meaningless (although running uniq on one isn't).
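For the record, a sketch of the kind of awk magic I have in mind: keep the counts in an awk array and print a snapshot periodically, so no temporary file is involved. (This assumes awk's fflush() is available, as in GNU awk or mawk; the 1000-line flush interval and the snapshot separator are arbitrary.)

```shell
# Count occurrences in memory; print a snapshot every 1000 input lines.
tail -f /opt/apache/logs/access/gvh-access_log.1365638400 |
grep specific.stuff. | awk '{print $12}' | cut -d/ -f3 |
awk '{ count[$0]++ }
     NR % 1000 == 0 {
         for (d in count) printf "%7d %s\n", count[d], d
         print "----"
         fflush()   # push the snapshot out despite pipe buffering
     }'
```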

cpa

Posted 2013-04-11T09:05:14.670


Answers


It might be stating the obvious, but did you try this:

tail -f /opt/apache/logs/access/gvh-access_log.1365638400 | grep specific.stuff. | awk '{print $12}' | cut -d/ -f3 | sort | uniq -c | sort -n | tail -50

I know it is a long command line, but it eliminates the creation of the intermediary file. If this does not work for you, please tell us why, so that you can get more meaningful answers.

MelBurslan


It doesn't work because it is meaningless to use sort on a stream; that's why I need a temp file in the process. – cpa – 2013-04-11T09:29:11.400

Have you tried it and seen that it doesn't work, or are you just assuming it will not work? The creation of the temporary file is the same thing as piping the output of your first command into the second command as its input. If you haven't tried, just try it. If you have tried, what problem did you encounter? – MelBurslan – 2013-04-11T09:54:05.893

There are several reasons why this doesn't work (and I have tried):

  • sort waits for EOF before writing its output. I hope it's obvious why.
  • tail -50 prints the last 50 lines, counting back from EOF.

So in the end it comes down to the fact that tail -f on an Apache log will never produce EOF, since lines are constantly being appended to the file. Dumping the results into a file is a way around that. Sure, I could just tail the log instead, but that still requires parsing the whole file every time, which is wasteful. – cpa – 2013-04-11T10:05:40.757
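One way to reconcile the EOF objection with a sort-free stream would be to keep the counts in an awk array and pipe only each periodic snapshot, which is a small, finite input, through sort. This is a sketch, not the accepted fix: it assumes awk's output-to-pipe redirection and close() (standard in GNU awk and mawk), and the 1000-line interval is illustrative.

```shell
# Keep counts in memory; every 1000 lines, run sort only on the small
# snapshot, so sort's need for EOF is satisfied each time.
tail -f /opt/apache/logs/access/gvh-access_log.1365638400 |
grep specific.stuff. | awk '{print $12}' | cut -d/ -f3 |
awk '{ count[$0]++ }
     NR % 1000 == 0 {
         cmd = "sort -n | tail -50"
         for (d in count) printf "%d %s\n", count[d], d | cmd
         close(cmd)   # flush the pipe and let sort emit this snapshot
     }'
```

close() ends the snapshot's input, so sort sees EOF and prints the top 50 on every interval while the counts keep accumulating.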