Count elements in lines with awk

I have a tab-delimited text file containing data like

Col 1 Col 2
1     a
1     b
1     c
1     d
1     d
2     a
2     b
3     a

so on and so forth.

I would like to transform this structure into

so that, a, b, c and d become columns; 1, 2 and 3 become a single row; and the numbers represent count. E.g. 1 has one "a" and two "d"s.

How can this be accomplished using awk or similar tools?

ashwin

Posted 2012-08-09T22:13:54.913

Reputation: 21

Welcome to Super User! The formatting here can definitely be confusing if you're new. Check out this guide. I fixed it for you this time. Good luck!

– Pops – 2012-08-09T22:24:32.040

Answers

awk 'NR>1 {
    count[$1,$2]++;
    rows[$1]++;
    cols[$2]++;
}
END {
    printf("%3s", "");
    for (col in cols) {
        printf("%4s", col);
    }
    printf("\n");
    for (row in rows) {
        printf("%3d", row);
        for (col in cols) {
            printf(" %3d", count[row,col]);
        }
        printf("\n");
    }
}' data

Not necessarily efficient or elegant, but it should be fairly easy to read and gets the job done. Also, the rows are columns will not necessarily be printed in sorted order. The key is the use of count[row,col] to simulate a multidimensional array, which is not directly supported in awk. A Google search for "awk multidimensional arrays" will turn up several articles including this one.

garyjohn

Posted 2012-08-09T22:13:54.913

Reputation: 29 085

Here's a PERL solution:

  perl -e '
    my (%col1, %col2); 
    while(<>){
        chomp; 
        @a=split(/\s+/); ## split line on whitespace
        $col2{$a[1]}++; ## Collect unique values from the 2nd column
        $col1{$a[0]}{$a[1]}++;## Count values per column/line
    } 
    my @l=sort keys %col2; 
    $"="\t"; ## Array record separator, using tabs to deal with variable size input
    print "\t@l\n"; 
    foreach my $c1 (sort keys(%col1)) {## For each column1 value
        print "$c1\t"; 
        my $str;
        for (my $i=0; $i<=$#l; $i++) {
        ## Collect the values for each position or 0 if there is none
        $col1{$c1}{$l[$i]}="0" unless defined($col1{$c1}{$l[$i]});
        $str.="$col1{$c1}{$l[$i]}\t";
        }
    chop($str); ## remove extra \t 
    print "$str\n";
    }' data   >ll

terdon

Posted 2012-08-09T22:13:54.913

Reputation: 45 216