Correcting an awk script so the output is in the correct order

1

I have an awk "script" which sums column 3 and column 4 for each value in column 1, counting only the rows where column 2 > 0:

awk 'BEGIN { FS = "\t"; print "Target covered_bases percentage_covered" } $2 > 0 { n[$1]++; covered_bases[$1] += $3; percentage_covered[$1] += $4 } END { for (i in n) { print i, covered_bases[i], percentage_covered[i] } }' "$1"

My input file looks like this:

S 0 20 0.2
S 1 300 0.7
S 2 10 0.1
D 0 10 0.3
D 1 20 0.6
D 2 2  0.02
D 3 5  0.034

And so on, up to, say, Z. The output for this input would be:

Target covered_bases percentage_covered
S 310 0.8
D 27  0.654

So the sums are fine. However, the letters are output in the wrong order. I know from other questions here that awk does not always output array elements in order. My problem is that I cannot seem to correct this using the previous answers given in this forum, as my understanding of awk is not great at all and my "script" is already quite complicated to my mind.

Could you let me know how I can correct it?

Many thanks!

Agathe

Posted 2017-01-13T15:49:26.310

Reputation: 11

For recent (v4) GNU awk only, you can set PROCINFO["sorted_in"]="@ind_str_asc"; before the for(i in n). For any other awk use external sort as answered by Alex. Or consider using perl in its awk-ish -lna mode instead. – dave_thompson_085 – 2017-01-13T17:50:48.220

Thanks for this! It works in a similar way as what was suggested by Alex. However, again, column 1 is not in alphabetical or numerical order. I edited my question. – Agathe – 2017-01-13T18:01:52.050

Maybe I am missing something, but mathematically the order of the values being summed doesn't affect the result. The sorting happens after the sums are calculated. Could you please clarify how the order of the first column may affect the calculation? – Alex – 2017-01-13T18:22:42.633

Are you saying you want the letters (column 1) in the SAME ORDER AS THE INPUT, not their natural order, which is A B C D E F etc.? If so, are all lines for a letter consecutive? If not, let's say the input lines are A B A D C B A D C B A -- what is the correct order of the output and why? If they are consecutive, you don't need accumulate-array(s)-anywhere-in-file logic; you need accumulate-scalar(s)-while-group logic. – dave_thompson_085 – 2017-01-14T00:52:03.187

Sorry, my question is still not clear. To answer dave: indeed, I would like the output to be in the same order as the input (not their natural order). Lines for a letter are consecutive. I am not sure what the "accumulate-scalar(s)-while_group" logic is, but I can certainly try to find out. Thanks. – Agathe – 2017-01-18T17:25:05.720
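For readers wondering what the "accumulate-scalar(s)-while-group" idea might look like, here is a minimal sketch. It is only one possible reading of dave's comment, not code posted in this thread. It assumes all lines for a letter are consecutive (as Agathe confirms) and that fields are separated by tabs or spaces with no blanks inside a field, so awk's default field splitting is used. The function name sum_consecutive_groups and the variable names prev, cb, pc and seen are made up for illustration.

```shell
# Hypothetical helper: sum columns 3 and 4 per consecutive group of column 1,
# counting only rows where column 2 > 0, and print groups in input order.
sum_consecutive_groups() {
    awk '
    BEGIN { print "Target covered_bases percentage_covered" }

    # Column 1 changed: print the totals of the group that just ended
    # (only if it had at least one row with column 2 > 0), then reset.
    NR > 1 && $1 != prev {
        if (seen) print prev, cb, pc
        cb = 0; pc = 0; seen = 0
    }

    # Remember which group the current line belongs to.
    { prev = $1 }

    # Accumulate into plain scalars instead of arrays.
    $2 > 0 { cb += $3; pc += $4; seen = 1 }

    # Flush the last group.
    END { if (seen) print prev, cb, pc }
    ' "$@"
}
```

Called as sum_consecutive_groups infile (or with the data on standard input), this prints the header followed by "S 310 0.8" and "D 27 0.654" for the sample input in the question. Because no array is traversed with for (i in n), the output order is simply the order in which the groups appear.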

Answers

0

Just pipe the output of your awk command to sort, and prepend the header after the awk processing so that it is not sorted along with the data.

awk 'BEGIN { FS = "\t" }    # set the separator before the first line is read
$2 > 0 {
    n[$1]++;
    covered_bases[$1] += $3;
    percentage_covered[$1] += $4;
}
END {
    for (i in n) {
        print i,covered_bases[i],percentage_covered[i];
    }
}' "$1" | sort | (echo 'Target covered_bases percentage_covered' && cat)

Alex

Posted 2017-01-13T15:49:26.310

Reputation: 5 606

Thanks for your answer. However, the thing is I cannot use sort because my first column is not in alphabetical or in numerical order. I am sorry I wasn't clear enough. I edited my question accordingly. – Agathe – 2017-01-13T17:47:07.627
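If sorting is not an option, one way to keep the input order in any POSIX awk is to record each key the first time it is seen and loop over that record in the END block. The following is a minimal sketch, not something posted by Alex or Agathe. It assumes whitespace-separated fields (tabs or spaces), and the names sum_in_input_order, seen, order and count are made up for illustration. Unlike the consecutive-group approach, it should also work when the lines for a letter are scattered through the file.

```shell
# Hypothetical helper: same sums as the original script, but the keys are
# printed in the order in which they first appear with column 2 > 0.
sum_in_input_order() {
    awk '
    BEGIN { print "Target covered_bases percentage_covered" }

    $2 > 0 {
        # First qualifying row for this key: remember its position.
        if (!($1 in seen)) {
            seen[$1] = 1
            order[++count] = $1
        }
        covered_bases[$1] += $3
        percentage_covered[$1] += $4
    }

    END {
        # A numeric loop over order[] is deterministic, unlike for (i in n).
        for (k = 1; k <= count; k++) {
            key = order[k]
            print key, covered_bases[key], percentage_covered[key]
        }
    }
    ' "$@"
}
```

For the sample input in the question, sum_in_input_order infile prints the header, then "S 310 0.8", then "D 27 0.654".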