Create a term-document matrix from files

I have a set of files from example001.txt to example100.txt. Each file contains a list of keywords from a superset (the superset is available if we want it).

So example001.txt might contain

apple
banana
...
otherfruit

I'd like to be able to process these files and produce something akin to a matrix, with the list of example files along the top row, the fruits down the side, and a '1' in a cell if the fruit is in that file.

An example might be...

x           example1    example2   example3
Apple         1            1          0
Banana        0            1          0
Coconut       0            1          1

Any idea how I might build some sort of command-line magic to put this together? I'm on OS X and happy with Perl or Python...

Joe

Posted 2012-12-01T14:45:27.127

Reputation: 2 942

What you're looking for is a so-called "term-document matrix", usually found in information retrieval and text mining applications. This might help you search for a solution. What OS are you on? – slhck – 2012-12-01T15:08:55.857

Will a Perl solution be OK? – terdon – 2012-12-01T16:14:46.920

Perl would be lovely. Am on osx. :) – Joe – 2012-12-01T17:03:13.807

Answers

With Python, you can install textmining through

sudo pip install textmining

Then, create a new file – let's call it matrix.py, and add the following:

#!/usr/bin/env python
import textmining
import glob

tdm = textmining.TermDocumentMatrix()

# Collect all .txt files from the input directory
files = glob.glob("/Users/foo/files/*.txt")
print(files)
for f in files:
    with open(f) as handle:
        content = handle.read()
    # Join the one-keyword-per-line contents into a single line
    content = content.replace('\n', ' ')
    tdm.add_doc(content)
# cutoff=1 keeps every word that appears in at least one document
tdm.write_csv('matrix.csv', cutoff=1)

Save it and make it executable with chmod +x matrix.py. Now, simply run it with ./matrix.py. The script will read the files in the directory specified in glob() and write the output matrix to matrix.csv in your current directory.

The only drawback is that it doesn't output the document names. We can prepend this list though, using a couple of bash commands; we only need a list of the file names (the leading echo "" adds a blank line so the names line up with the header row of the matrix):

echo "" > files.txt; find /Users/foo/files/ -type f -iname "*.txt" >> files.txt

And then, paste this together with the matrix.csv:

paste -d , files.txt matrix.csv > matrix2.csv 

Voilà, matrix2.csv now holds our complete term-document matrix.

I can imagine there are less convoluted solutions, but I don't know Python well enough to change the code so that it outputs the complete matrix, document names included, on its own.
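
For what it's worth, a rough sketch of how such a pure-Python variant could look is below. It reads the same directory as the script above, uses only the standard library, and writes 1/0 presence flags (as asked in the question) rather than term counts:

#!/usr/bin/env python
import csv
import glob
import os

# Same assumed input location as in the script above
files = sorted(glob.glob("/Users/foo/files/*.txt"))

# Map each file to the set of keywords it contains
keywords = {}
for path in files:
    with open(path) as handle:
        keywords[path] = set(line.strip() for line in handle if line.strip())

# The superset is simply the union of all keywords seen
all_keywords = sorted(set().union(*keywords.values()))

with open('matrix.csv', 'w') as out:
    writer = csv.writer(out)
    # Header row: empty corner cell, then one column per file
    writer.writerow([''] + [os.path.basename(p) for p in files])
    # One row per keyword, with a 1 where the keyword occurs in that file
    for word in all_keywords:
        writer.writerow([word] + [1 if word in keywords[p] else 0 for p in files])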

slhck

Posted 2012-12-01T14:45:27.127

Reputation: 182 472

+1 that's cool. What about using os.system() and then calling the bash commands from inside the Python script? Wouldn't it be more handy if you just ran one Python script? – andilabs – 2013-10-31T14:45:38.437

Well, you did it already :) As I said, I wasn't that familiar with Python back then and more at home in the shell. – slhck – 2013-10-31T19:33:52.073

This is almost slhck's solution. I just added the bash commands, executed via os.system, inside the Python script, to put everything in one Python script with no need to switch between the Python and bash consoles.

#!/usr/bin/env python
import textmining
import glob
import os

tdm = textmining.TermDocumentMatrix()
files = glob.glob("/Users/andi/Desktop/python_nltk/dane/*.txt")

# Build the file-name list (the blank first line pads the header row)
os.system("""echo "" > files.txt; find /Users/andi/Desktop/python_nltk/dane -type f -iname "*.txt" >> files.txt""")

print(files)
for f in files:
    with open(f) as handle:
        content = handle.read()
    content = content.replace('\n', ' ')  # join keywords into one line
    tdm.add_doc(content)
tdm.write_csv('matrix.csv', cutoff=1)

# Prepend the file names to the matrix as a first column
os.system("""paste -d , files.txt matrix.csv > matrix2.csv""")
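
As a side note, on current Python versions the subprocess module is generally preferred over os.system, since it avoids the extra shell and can raise an error when a command fails. A minimal sketch of the same two steps, using the paths from above:

import subprocess

# Blank first line so the names line up with the matrix header,
# then one file name per line, just like the echo/find pipeline above
with open('files.txt', 'w') as out:
    out.write('\n')
    out.flush()  # make sure the blank line lands before find's output
    subprocess.run(['find', '/Users/andi/Desktop/python_nltk/dane',
                    '-type', 'f', '-iname', '*.txt'],
                   stdout=out, check=True)

# Equivalent of: paste -d , files.txt matrix.csv > matrix2.csv
with open('matrix2.csv', 'w') as out:
    subprocess.run(['paste', '-d', ',', 'files.txt', 'matrix.csv'],
                   stdout=out, check=True)

Note that glob() and find are not guaranteed to list the files in the same order, so sorting both lists is safer if the row labels must match exactly.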

andilabs

Posted 2012-12-01T14:45:27.127

Reputation: 372

I can't give you something as pretty as slhck's Python solution, but here's a pure bash one:

printf "\t" &&
for file in ex*; do
  printf "%-15s" "$file"
done &&
echo "" &&
while read fruit; do
    printf "%s\t" "$fruit"
    for file in ex*; do
      printf "%-15s" "$(grep -wc "$fruit" "$file")"
    done
    echo ""
done < superset.txt

If you copy/paste that horrible thing into a terminal, assuming your list of fruits is in a file called superset.txt with one fruit per line, you get:

        example1       example2       example3       
apple   1              2              2              
banana  1              1              2              
mango   0              1              1              
orange  1              1              2              
pear    0              1              1              
plum    0              0              1              

EXPLANATION:

  • printf "\t" : print a TAB to have the filenames aligned past the fruit names.
  • for file in ex*; do [...] done : print the file names (assuming they are the only files whose names start with ex).
  • echo "" : print a newline.
  • while read fruit; do [...]; done < superset.txt : superset.txt must be a text file containing the superset you mentioned, i.e., all the fruits, one fruit per line. This file is read in this loop and each fruit is saved as $fruit.
  • printf "%s\t" "$fruit" : print the fruit name and a TAB.
  • for file in ex*; do [...]; done : here we go through each file again and use grep -wc "$fruit" "$file" to count how many times the fruit we are currently processing was found in that file.

You might also be able to use column, but I have never used it so I did not try:

 The column utility formats its input into multiple columns.
 Rows are filled before columns. Input is taken from file
 operands, or, by default, from the standard input. Empty lines
 are ignored unless the -e option is used.

And here's a Perl one. Technically, this is a one-liner, albeit a LONG one:

perl -e 'foreach $file (@ARGV){open(F,"$file"); while(<F>){chomp; $fruits{$_}{$file}++}} print "\t";foreach(sort @ARGV){printf("%-15s",$_)}; print "\n"; foreach $fruit (sort keys(%fruits)){print "$fruit\t"; do {$fruits{$fruit}{$_}||=0; printf("%-15s",$fruits{$fruit}{$_})} for @ARGV; print "\n";}' ex*

Here it is in commented script form which might actually be intelligible:

#!/usr/bin/env perl
foreach $file (@ARGV){ ## cycle through the files
    open(F,"$file");
    while(<F>){
        chomp; ## remove newlines
        ## Count the fruit. This is a hash of hashes
        ## where the fruit is the first key and the file
        ## the second. For each fruit then, we will end up
        ## with something like this: $fruits{apple}{example1}=1
        $fruits{$_}{$file}++;
    }
}
print "\t"; ## pretty formatting

## Print each of the file names
foreach(sort @ARGV){
    printf("%-15s",$_)
}
print "\n";  ## pretty formatting

## Now, cycle through each of the "fruit" we 
## found when reading the files and print its
## count in each file.
foreach $fruit (sort keys(%fruits)){
    print "$fruit\t"; ## print the fruit names
    do {
        $fruits{$fruit}{$_}||=0; ## Count should be 0 if none were found
        printf("%-15s",$fruits{$fruit}{$_}) ## print the value for each fruit
    } for @ARGV;
    print "\n"; ## pretty formatting
} 

This has the benefit of coping with arbitrary "fruit"; no superset is needed. Also, both of these solutions use native *nix tools and do not require the installation of additional packages. That said, the Python solution in slhck's answer is more concise and gives prettier output.

terdon

Posted 2012-12-01T14:45:27.127

Reputation: 45 216

In Python, you can use sklearn.feature_extraction.text.CountVectorizer.fit_transform: it learns the vocabulary dictionary and returns a term-document matrix.

Example:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=1)

corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document? This is right.']

X = vectorizer.fit_transform(corpus).toarray()
print('X: {0}'.format(X))
print('vectorizer.vocabulary_: {0}'.format(vectorizer.vocabulary_))

outputs:

X: [[0 1 1 1 0 0 0 1 0 1]
    [0 1 0 1 0 0 2 1 0 1]
    [1 0 0 0 1 0 0 1 1 0]
    [0 1 1 2 0 1 0 1 0 2]]

vectorizer.vocabulary_: {u'and': 0, u'right': 5, u'third': 8, u'this': 9, u'is': 3,
                         u'one': 4, u'second': 6, u'the': 7, u'document': 1, u'first': 2}

Since you are working with files, you might be interested in the method sklearn.feature_extraction.text.CountVectorizer.transform() as well.
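
Applied to the files from the question, a rough sketch could look like this. The directory path is an assumption, and the matrix is transposed so the terms run down the side, as in the desired output (on older scikit-learn versions, get_feature_names_out() is called get_feature_names()):

import glob
import os
from sklearn.feature_extraction.text import CountVectorizer

# Assumed location of the example files
files = sorted(glob.glob('/Users/foo/files/*.txt'))
corpus = []
for path in files:
    with open(path) as handle:
        corpus.append(handle.read().replace('\n', ' '))

vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(corpus).toarray()

# Header row: one column per file
print('\t' + '\t'.join(os.path.basename(p) for p in files))
# One row per term, with its count in each file
for term, counts in zip(vectorizer.get_feature_names_out(), X.T):
    print(term + '\t' + '\t'.join(str(c) for c in counts))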

Franck Dernoncourt

Posted 2012-12-01T14:45:27.127

Reputation: 13 518