CSV averages on the Linux command line


I have a big file full of records like this:

1, 2, 4, 5, 6
1, 3, 5, 6, 3
1, 4, 5, 6, 6
2, 4, 5, 5, 5
2, 3, 4, 5, 2

Anyway, I need to take the average of all rows with the same first number (the key), i.e.:

1, 3, 4.66, 5.66, 5
2, 3.5, 4.5, 5, 3.5

I know this is something awk/sed would be great for; I just don't have enough experience with them to accomplish it. Thanks!

Also, what about averaging those columns together? So, after I output this to a file, I'd like to get another like:

1, 4.58
2, 4.125

The number of columns to average might not always be 4, either.
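For reference, the expected numbers above can be sanity-checked with a few lines of Python (illustrative only, not part of the question; note that exact arithmetic gives 4.583 for key 1 rather than 4.58, which was computed from rounded intermediate values):

```python
# Sanity check of the expected output, using the sample rows from the question.
rows = [
    [1, 2, 4, 5, 6],
    [1, 3, 5, 6, 3],
    [1, 4, 5, 6, 6],
    [2, 4, 5, 5, 5],
    [2, 3, 4, 5, 2],
]

groups = {}
for row in rows:
    groups.setdefault(row[0], []).append(row[1:])

for key in sorted(groups):
    # Per-column averages for this key (part one of the question).
    col_avgs = [sum(col) / len(col) for col in zip(*groups[key])]
    # Average of those averages (part two of the question).
    overall = sum(col_avgs) / len(col_avgs)
    print(key, [round(a, 2) for a in col_avgs], round(overall, 3))
```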

EDIT: This might be easier to do in gnuplot, so I mainly just need an answer to the first part.

Flamewires

Posted 2010-07-20T18:55:29.030

Reputation: 160

Are the rows sorted (field one grouped) as you show in your example? You say the number of fields in a record might be different. Are they consistent within a file? – Paused until further notice. – 2010-07-20T19:37:39.143

Well, it's just a sort -n to sort them, so that's no problem. And yes, they are consistent in the file, just not from file to file. If you have an answer that depends on me specifying the number of fields, that's fine; I'll take that too. Just trying to get this scripted before I leave :) – Flamewires – 2010-07-20T21:26:25.047

Answers


For the first option:

awk -F, '
    {
        count[$1]++
        for (i = 2; i <= NF; i++)
            sum[$1, i] += $i
    }
    END {
        for (key in count) {
            printf "%s ", key
            for (j = 2; j <= NF; j++)
                printf "%.2f ", sum[key, j] / count[key]
            print ""
        }
    }
' inputfile

For the second option:

awk -F, '
    {
        count[$1]++
        for (i = 2; i <= NF; i++)
            sum[$1, i] += $i
    }
    END {
        for (key in count) {
            total = 0
            for (j = 2; j <= NF; j++)
                total += sum[key, j] / count[key]
            printf "%s %.2f\n", key, total / (NF - 1)
        }
    }
' inputfile

but I'm not sure I understand why you'd want the average of some averages.

Paused until further notice.


Reputation: 86 075


This is surprisingly tricky and complicated using sed, so here's a Python hack to do it:

#!/usr/bin/env python

values = {}
counts = {}
with open("mycsv") as f:
    for line in f:
        row = line.strip().split(", ")
        key, nums = row[0], [float(x) for x in row[1:]]
        if key in values:
            # Running element-wise sum of the numeric columns for this key.
            values[key] = [a + b for a, b in zip(values[key], nums)]
            counts[key] += 1
        else:
            values[key] = nums
            counts[key] = 1

for k in values:
    values[k] = [x / counts[k] for x in values[k]]
    print(k, ":", values[k])

This works irrespective of the order of the rows, as long as rows with the same first element are of the same length.

Averaging the columns together will take just one more line of Python in the for loop:

print(sum(values[k]) / len(values[k]))

Given the harrowing number of list comprehensions, though, you're probably better off tackling this problem with NumPy or Matlab.

Karthik


Reputation: 95


Karthik has a good suggestion to do this in NumPy; it's just a few lines:

import numpy
# delimiter=',' is needed: loadtxt splits on whitespace by default.
data = numpy.loadtxt('filename.txt', delimiter=',')
for key in numpy.unique(data.T[0]):
    print(data[data.T[0] == key].mean(0))

Or if you wanted to average the columns together, the last line would change to

    avgs = data[data.T[0] == key].mean(0)
    print(avgs[0], avgs[1:].mean())  # avgs[0] is the key itself
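As a quick check, one can feed the sample data from the question straight into loadtxt via an in-memory buffer instead of a file on disk (the filename above is just a placeholder; delimiter=',' is required for comma-separated input):

```python
import io
import numpy

# Sample data from the question, passed to loadtxt as a file-like object.
sample = io.StringIO(
    "1, 2, 4, 5, 6\n"
    "1, 3, 5, 6, 3\n"
    "1, 4, 5, 6, 6\n"
    "2, 4, 5, 5, 5\n"
    "2, 3, 4, 5, 2\n"
)

data = numpy.loadtxt(sample, delimiter=',')
for key in numpy.unique(data.T[0]):
    avgs = data[data.T[0] == key].mean(0)
    print(avgs[0], avgs[1:].mean())  # key, then the average of the column averages
```

This prints roughly 4.583 for key 1 and 4.125 for key 2, matching the question's expected second-stage output (4.58 there came from rounded intermediates).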

David Z


Reputation: 5 688