Bash: number of bytes used in a log file, grouped by token


Assume a large log file of several GBs and several million lines where each line contains a token identifying the user account that generated the line.

All tokens have the same length and can be found at the same position within each log line.

The goal is to figure out the number of bytes logged by each account.

One way of doing this is in multiple steps, like this:

awk -F "|" '{ print $5 }' trace.log | sort | uniq | xargs -l sh -c 'echo -n $0 && grep "$0" trace.log | wc -c'

where awk extracts the token (the 5th field, splitting on '|'), sort | uniq builds the list of unique tokens appearing in the file, and finally xargs greps for each token and counts the matching bytes.

Now this works, but it is terribly inefficient because the same (huge) file gets grepped once per unique token.

Is there a smarter way of achieving the same via shell commands? (where by smarter I mean faster and without consuming tons of RAM or temporary storage, like sorting the whole file in RAM or sorting it to a tmp file).

Sergio

Posted 2016-07-11T18:15:37.907

Reputation: 265

Why not write a more sophisticated script (bash/perl/python/ruby/etc)? One that does a single pass through the file, reading line by line, and maintaining a map of accumulated byte counts for each user. – jehad – 2016-07-11T18:21:36.843

@jehad: because 1) that would remove all the fun :) 2) I'm lazy and 3) I explicitly asked for a way of doing that via shell commands – Sergio – 2016-07-11T18:25:24.710

You're right, and I've +1'd the answer from john1024. My awk skills are not great, but he has shown me the light! – jehad – 2016-07-11T18:27:18.640
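For reference, the single-pass script jehad suggests can be sketched in plain bash. This is a sketch only, assuming bash 4+ (for associative arrays) and the '|'-delimited format from the question; the three-line sample trace.log from the answer below is recreated inline:

```shell
#!/usr/bin/env bash
# Recreate the sample file used in the answer below (hypothetical data).
printf '1|2|3|4|jerry|6\na|b|c|d|phil|f\n1|2|3|4|jerry|6\n' > trace.log

declare -A bytes                      # token -> accumulated byte count
while IFS= read -r line; do
    # The token is the 5th '|'-delimited field.
    IFS='|' read -r _ _ _ _ token _ <<< "$line"
    bytes[$token]=$(( ${bytes[$token]:-0} + ${#line} + 1 ))  # +1 for the \n
done < trace.log

for token in "${!bytes[@]}"; do
    printf '%s %d\n' "$token" "${bytes[$token]}"
done
```

Note that a bash read loop is far slower than awk on a multi-GB file; this only illustrates the single-pass idea.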

Answers


Try:

awk -F "|" '{ a[$5]+=1+length($0) } END{for (name in a) print name,a[name]}' trace.log

Example

Let's consider this test file:

$ cat trace.log
1|2|3|4|jerry|6
a|b|c|d|phil|f
1|2|3|4|jerry|6

The original command produces this output:

$ awk -F "|" '{ print $5 }' trace.log | sort | uniq | xargs -l sh -c 'echo -n $0 && grep "$0" trace.log | wc -c'
jerry32
phil15

The suggested command, which loops through the file just once, produces this output:

$ awk -F "|" '{ a[$5]+=1+length($0) } END{for (name in a) print name,a[name]}' trace.log
jerry 32
phil 15

How it works

  • -F "|"

    This sets the field separator for input.

  • a[$5]+=1+length($0)

    For each line, we add the length of the line to the count stored in associative array a under this line's user name.

    The quantity length($0) does not include the newline that ends the line. Consequently, we add one to this to account for the \n.

  • END{for (name in a) print name,a[name]}

    After we have read through the file once, we print out the sums.
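The newline accounting in the second bullet can be verified directly. This small demo (not from the original answer) compares awk's length($0) with wc -c on a single 5-character line:

```shell
# length($0) excludes the trailing newline, so a 5-character line measures
# 5 in awk but 6 bytes to wc -c; the "+1" in the answer bridges the gap.
awk_len=$(printf 'abcde\n' | awk '{ print length($0) }')
wc_len=$(printf 'abcde\n' | wc -c)
echo "$awk_len $wc_len"
```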

John1024

Posted 2016-07-11T18:15:37.907

Reputation: 13 893

awk for the win. Well done and in about 20 seconds, impressive – Sergio – 2016-07-11T18:32:59.697