Computing average on a list of values for each value

0

I have a huge file (50k lines) with the following format:

52370581 2116
17996781 4314
47818829 2584
61421045 2158
...
25145813 2274

The first value can be present multiple times and I want to compute the average of the 2nd value for each 1st value, using the terminal. Ideally, I'd want the median, the min and max too. I can to it on all the values quite easily with awk but I was wondering how to get that done for each value.

Julien Genestoux

Posted 2013-05-11T15:19:01.117

Reputation: 103

1Do you want the median or the average? It's not the same thing. – Nifle – 2013-05-11T15:28:08.780

@Nifle he said he wants the median as well. – Christian Stewart – 2013-05-12T00:36:52.150

Answers

3

awk '{c[$1]++;s[$1]+=$2}END{for(i in c)print i,(s[i]/c[i])}' test.in

More readably:

awk '
{ # Every line
    # Increment the key, add value to its sum
    count[$1]++
    sum[$1] += $2
}
END {
    # Go through all the numbers we saw
    for(number in count) {
        # Print the number followed by the average (sum/count) of its values
        print number, (sum[number]/count[number])
    }
}
' test.in

Note that the numbers will be output in a more-or-less random order (some hash of the keys). Getting min and max isn't so bad, just add them after count and sum in the first block. Median requires tracking every number. You can do it with a 2d array (or awk's emulation thereof), but I'll leave it as an exercise to the reader.

Kevin

Posted 2013-05-11T15:19:01.117

Reputation: 1 019

Perfect! I wish I could double vote just for the explanation which have allowed me to understand and go further! – Julien Genestoux – 2013-05-12T10:45:08.537

0

For avarage

awk '{if(NR==1){i=$1;}} {if($1~i){j+=$2;k++}} END{print "Average is " (j/k)}' <filename>

this will provide you in case you want to do it only for the value in first line. Instead of NR==0, if you pass a variable, you can do it for any line specific case. Now if you want to do it for each line, then below awk will do it in one pass

awk '{value[$1]+=$2;count[$1]++} END{for(indx in value)print "Avarage of " indx " is " (value[indx]/count[indx])}' <filename>

awk has two dimensional array, but i'm not familiar with that, so using 2 arrays to keep sum of the elements and count. Any other operation, we need to change the script based on the operation

abasu

Posted 2013-05-11T15:19:01.117

Reputation: 116

awk lines start at 1, so NR == 0 is always false. – Kevin – 2013-05-11T17:02:19.360

oops Sorry my mistake, I should have tested once before posting :) Thanks @Kevin – abasu – 2013-05-11T17:18:57.470