Perform a GROUP BY-like command in UNIX

You can use awk to traverse the matrix and count the 1s in each column (summing a column of 0s and 1s gives the number of 1s) using the following script:

count.awk:

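# Skip the header line (record 1); sum every column of the remaining rows.
NR != 1 {
  for (i=1; i<=NF; ++i)
    count[i] += $i;
}

# Print the number of each column whose sum is at least min, comma-separated.
END {
  ORS = ",";
  for (i=1; i<=length(count); ++i)
    if (count[i] >= min)
      print i
}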

If you execute this script using

awk -v min=2 -f count.awk matrix.txt

you will get a comma-separated list of the column numbers that have two or more 1s, in this case "1,2,4,8,9," (note: you can change min=X to any minimum threshold you want).
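To see the counting step in isolation, here is a minimal sketch that runs the same logic inline on a small made-up matrix. The file name sample.txt, its contents, and the variable n are assumptions for illustration, not the matrix from the question. The column count is remembered in n instead of using length(count), since length on an array is not guaranteed in every awk.

```shell
# Hypothetical 3-row matrix with a header line of column numbers.
cat > sample.txt <<'EOF'
1 2 3 4
1 0 1 0
1 1 0 0
0 1 0 0
EOF

# Skip the header (NR > 1), sum each column, and remember the column count.
cols=$(awk -v min=2 '
NR > 1 { n = NF; for (i = 1; i <= NF; ++i) count[i] += $i }
END    { ORS = ","; for (i = 1; i <= n; ++i) if (count[i] >= min) print i }
' sample.txt)

echo "$cols"   # prints 1,2, (columns 1 and 2 each contain two 1s)
```

Only columns 1 and 2 reach the threshold here; column 3 has a single 1 and column 4 has none.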

Now, use cut to print out only the columns that we want:

cols=$(awk -v min=2 -f count.awk matrix.txt); cut -d' ' -f${cols:0:-1} matrix.txt

This stores the awk output in a variable, because awk returns the list of columns with an extra , at the end. I "slice" that comma off when I pass cols to cut.

The cut command sets the delimiter to "space" (-d' ') and the output columns to the comma-separated list from awk, with the last comma sliced off (-f${cols:0:-1}).
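The ${cols:0:-1} slice relies on a negative length in bash substring expansion, which as far as I know needs bash 4.2 or newer. A sketch of a more portable spelling, assuming only POSIX parameter expansion, strips the trailing comma with ${cols%,} instead. The variable name fields is made up for this example, and the value of cols is the list quoted earlier in this answer.

```shell
cols="1,2,4,8,9,"      # what the awk step printed for matrix.txt
fields=${cols%,}       # remove one trailing comma, if there is one
echo "$fields"         # prints 1,2,4,8,9

# The cut call would then read:
#   cut -d' ' -f"$fields" matrix.txt
```

This should behave the same in bash, dash, and other POSIX shells, and it leaves the list untouched if it happens not to end in a comma.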

Output:

1 2 4 8 9 n
1 0 1 1 0 0
0 1 0 0 0 1
1 0 0 1 1 0
0 1 1 0 0 0
0 1 0 0 1 1
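If you would rather skip the cut step, one possible alternative (not the method used above) is to let awk read the file twice: once to count, once to print. This is only a sketch under assumptions: sample.txt and its contents are invented for illustration, and the variables line and sep are hypothetical helpers.

```shell
# Hypothetical 3-row matrix with a header line of column numbers.
cat > sample.txt <<'EOF'
1 2 3 4
1 0 1 0
1 1 0 0
0 1 0 0
EOF

out=$(awk -v min=2 '
# First pass (NR == FNR): count the 1s per column, skipping the header.
NR == FNR { if (FNR > 1) for (i = 1; i <= NF; ++i) count[i] += $i; next }
# Second pass: rebuild each line from the columns that reached the threshold.
{
  line = ""; sep = ""
  for (i = 1; i <= NF; ++i)
    if (count[i] >= min) { line = line sep $i; sep = " " }
  print line
}
' sample.txt sample.txt)

echo "$out"
```

The file is named twice on the command line so that awk sees it in two passes; NR == FNR is true only while the first copy is being read. With this sample, columns 1 and 2 are kept, header included.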

If you want to output the columns with fewer than min 1s (i.e., columns 3, 5, 6, 7), just reverse the condition of the if statement in count.awk to read if (count[i] < min).

Output:

savanto

Posted 2014-06-02T16:20:26.260


Thanks for that. Actually, the code probably has some problems. What is the NR != 1 part for? EDIT: It works with NR>0. – ddmichael – 2014-06-03T00:46:16.183

@ddmichael NR != 1 is used to skip the heading line with the column numbering. It says "if the record number (NR) is not 1, then do the following actions." On my system NR > 1 also works, but NOT NR > 0. I'm surprised it works for you -- maybe your awk starts numbering from zero?? – savanto – 2014-06-03T00:53:14.843

Ah, yes you're correct. – ddmichael – 2014-06-03T01:11:22.683
