I can't give you something as pretty as slhck's Python solution, but here's a pure bash one:
printf "\t" &&
for file in ex*; do
    printf "%-15s" "$file"
done &&
echo "" &&
while IFS= read -r fruit; do
    printf "%s\t" "$fruit"
    for file in ex*; do
        printf "%-15s" "$(grep -wc -- "$fruit" "$file")"
    done
    echo ""
done < superset.txt
If you copy/paste that horrible thing into a terminal, assuming your list of fruits is in a file called superset.txt with one fruit per line, you get:
        example1       example2       example3
apple   1              2              2
banana  1              1              2
mango   0              1              1
orange  1              1              2
pear    0              1              1
plum    0              0              1
EXPLANATION:
printf "\t" : print a TAB so that the file names line up after the fruit-name column.
for file in ex*; [...] done : print the file names (assuming they are the only files whose names start with ex).
echo "" : print a newline.
while [...]; do [...]; done < superset.txt : superset.txt must be a text file containing the superset you mentioned, i.e., all fruit, one fruit per line. The loop reads this file line by line and saves each fruit as $fruit.
The first printf inside the while loop : print the fruit name and a TAB.
for file in ex*; do [...]; done : go through each file again and use grep -wc to count the lines in that file that contain the fruit we are currently processing as a whole word. With one fruit per line, that is the number of times the fruit was found in that file.
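One detail of grep -wc is easy to miss: -c counts matching lines rather than individual occurrences, and -w only accepts whole-word matches. With one fruit per line the two amount to the same thing. Here is a small sketch on throwaway files; the directory, file contents and the count_fruit helper are all made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical sample data in a temporary directory; not from the question.
demo_dir=$(mktemp -d)
printf '%s\n' apple pineapple apple > "$demo_dir/example1"
printf '%s\n' 'apple apple' banana > "$demo_dir/example2"

# -w: whole-word matches only, so "pineapple" does not count as "apple".
# -c: print the number of matching lines instead of the lines themselves.
# grep exits with status 1 when nothing matches, even though it still
# prints 0; "|| true" hides that status so the demo survives under set -e.
count_fruit() {
    grep -wc -- "$1" "$2" || true
}

count_fruit apple "$demo_dir/example1"    # prints 2
count_fruit apple "$demo_dir/example2"    # prints 1: two words, one line
count_fruit mango "$demo_dir/example1"    # prints 0
```

The temporary directory is left in place so you can poke at the files; remove it with rm -r "$demo_dir" when done.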
You might also be able to use column, but I have never used it, so I did not try:
    The column utility formats its input into multiple columns.
    Rows are filled before columns. Input is taken from file
    operands, or, by default, from the standard input. Empty lines
    are ignored unless the -e option is used.
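A rough sketch of how column might be put to work here: emit plain tab-separated rows and let column -t -s do the aligning instead of printf's fixed widths. The function names and the "fruit" header label are my own inventions for this example; the label is there so that the header row has no empty first cell, since some versions of column merge adjacent delimiters.

```shell
#!/usr/bin/env bash
# Print one tab-separated row per fruit: the name, then its count per file.
# Usage: fruit_rows SUPERSET_FILE FILE...   (hypothetical helper)
fruit_rows() {
    local superset=$1 fruit file
    shift
    # Header row: a label for the first column, then the file names
    # with any leading directory stripped.
    printf 'fruit'
    for file in "$@"; do
        printf '\t%s' "${file##*/}"
    done
    printf '\n'
    while IFS= read -r fruit; do
        printf '%s' "$fruit"
        for file in "$@"; do
            # grep -wc prints 0 when there is no match, which is what we want.
            printf '\t%s' "$(grep -wc -- "$fruit" "$file")"
        done
        printf '\n'
    done < "$superset"
}

# Same rows, aligned: -t builds a table, -s sets the input separator (a TAB).
fruit_table() {
    fruit_rows "$@" | column -t -s "$(printf '\t')"
}
```

With that in place, fruit_table superset.txt ex* should print the same table as the loop above, with column widths chosen by column rather than hard-coded to 15.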
And here's a Perl one. Technically, this is a one-liner, albeit a LONG one:
perl -e 'foreach $file (@ARGV){open(F,"<",$file) or die "Cannot open $file: $!\n"; while(<F>){chomp; $fruits{$_}{$file}++}} print "\t"; foreach(@ARGV){printf("%-15s",$_)}; print "\n"; foreach $fruit (sort keys(%fruits)){print "$fruit\t"; do {$fruits{$fruit}{$_}||=0; printf("%-15s",$fruits{$fruit}{$_})} for @ARGV; print "\n";}' ex*
Here it is in commented script form, which might actually be intelligible:
#!/usr/bin/env perl
use strict;
use warnings;

my %fruits;
foreach my $file (@ARGV) {    ## cycle through the files
    open(my $fh, '<', $file) or die "Cannot open $file: $!\n";
    while (<$fh>) {
        chomp;    ## remove newlines
        ## Count the fruit. This is a hash of hashes
        ## where the fruit is the first key and the file
        ## the second. For each fruit then, we will end up
        ## with something like this: $fruits{apple}{example1}=1
        $fruits{$_}{$file}++;
    }
    close($fh);
}
print "\t";    ## pretty formatting
## Print each of the file names, in the same order
## as the count columns printed below
foreach (@ARGV) {
    printf("%-15s", $_);
}
print "\n";    ## pretty formatting
## Now, cycle through each of the "fruit" we
## found when reading the files and print its
## count in each file.
foreach my $fruit (sort keys(%fruits)) {
    print "$fruit\t";    ## print the fruit names
    foreach (@ARGV) {
        $fruits{$fruit}{$_} ||= 0;    ## Count should be 0 if none were found
        printf("%-15s", $fruits{$fruit}{$_});    ## print the value for each fruit
    }
    print "\n";    ## pretty formatting
}
This has the benefit of coping with arbitrary "fruit"; no superset is needed. Also, both these solutions use native *nix tools and do not require the installation of additional packages. That said, the Python solution in slhck's answer is more concise and gives prettier output.
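If you prefer the bash version but would rather not maintain superset.txt by hand, the list can be derived from the files themselves, which is roughly what the Perl script does implicitly. A minimal sketch, assuming every file holds one fruit per line (make_superset is a made-up name for this example):

```shell
#!/usr/bin/env bash
# Print every distinct line found in the given files, sorted.
# Usage: make_superset FILE...   (hypothetical helper)
make_superset() {
    # sort reads all the files; -u drops duplicate lines.
    sort -u -- "$@"
}
```

Running make_superset ex* > superset.txt before the bash loop would then stand in for the hand-written list.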
What you're looking for is a so-called "term-document matrix", usually found in information retrieval and text mining applications. This might help you search for a solution. What OS are you on? – slhck – 2012-12-01T15:08:55.857
Will a Perl solution be OK? – terdon – 2012-12-01T16:14:46.920
Perl would be lovely. Am on osx. :) – Joe – 2012-12-01T17:03:13.807