Document-term matrix

A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms; a term-document matrix is its transpose, with rows for terms and columns for documents. There are various schemes for determining the value that each entry in the matrix should take; one such scheme is tf-idf. Such matrices are useful in the field of natural language processing.

General Concept

When creating a database of terms that appear in a set of documents, the document-term matrix contains rows corresponding to the documents and columns corresponding to the terms. For instance, if one has the following two (short) documents:

  • D1 = "I like databases"
  • D2 = "I hate databases",

then the document-term matrix would be:

        I    like    hate    databases
  D1    1    1       0       1
  D2    1    0       1       1

which shows which documents contain which terms and how many times they appear.
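
The construction in this example can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes that terms are whitespace-separated, case-sensitive tokens, and the function name build_document_term_matrix is hypothetical. Columns are ordered by first appearance of each term, so "databases" comes before "hate" here, unlike in the example above.

```python
from collections import Counter


def build_document_term_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    # Assumption: a term is any whitespace-separated token (case-sensitive).
    tokenized = [doc.split() for doc in documents]

    # Collect the vocabulary in order of first appearance.
    vocabulary = []
    for tokens in tokenized:
        for token in tokens:
            if token not in vocabulary:
                vocabulary.append(token)

    # One row per document, one column per term.
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


documents = ["I like databases", "I hate databases"]
vocabulary, matrix = build_document_term_matrix(documents)
print(vocabulary)
for name, row in zip(["D1", "D2"], matrix):
    print(name, row)
```

For larger collections most entries are zero, so practical implementations usually store the matrix in a sparse format rather than as nested lists.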

Note that more sophisticated weights can be used; one typical example, among others, would be tf-idf.
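
As a rough sketch of such a weighting, the code below applies one simple tf-idf variant to the raw counts of the two example documents. The specific formula (raw count multiplied by the natural logarithm of N/df, with no smoothing or normalisation) is an assumption chosen for brevity; several other variants are in common use, and the function name tf_idf is hypothetical.

```python
import math


def tf_idf(matrix):
    """Weight raw counts by log(N / df), one simple tf-idf variant."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])

    # Document frequency: number of documents in which each term occurs.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]

    # Inverse document frequency; terms that never occur get weight 0.
    idf = [math.log(n_docs / df[j]) if df[j] else 0.0 for j in range(n_terms)]

    return [[row[j] * idf[j] for j in range(n_terms)] for row in matrix]


# Columns: I, like, hate, databases
counts = [
    [1, 1, 0, 1],  # D1 = "I like databases"
    [1, 0, 1, 1],  # D2 = "I hate databases"
]
weights = tf_idf(counts)
for name, row in zip(["D1", "D2"], weights):
    print(name, [round(w, 3) for w in row])
```

Under this variant, "I" and "databases" occur in every document and therefore receive weight zero, leaving "like" and "hate" as the only terms that distinguish the two documents.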

Choice of Terms

One point of view on the matrix is that each row represents a document. In the vectorial semantic model, which is normally the one used to compute a document-term matrix, the goal is to represent the topic of a document by the frequency of semantically significant terms. The terms are semantic units of the documents. It is often assumed, for Indo-European languages, that nouns, verbs and adjectives are the most significant categories, and that words from those categories should be kept as terms. Adding collocations as terms improves the quality of the vectors, especially when computing similarities between documents.
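
To illustrate what comparing document rows can look like, the sketch below computes the cosine similarity between the rows for the two example documents. Cosine similarity is an assumption here: it is one common choice of measure, and the text above does not prescribe a particular one. The function name cosine_similarity is hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # An all-zero vector has no direction; treat its similarity as 0.
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Columns: I, like, hate, databases
d1 = [1, 1, 0, 1]  # "I like databases"
d2 = [1, 0, 1, 1]  # "I hate databases"
similarity = cosine_similarity(d1, d2)
print(round(similarity, 3))
```

The two rows share two of their three terms, so the similarity is 2/3. With raw counts the shared terms "I" and "databases" dominate the score, which is one reason the choice of terms and weights matters.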

Applications

Improving search results

Latent semantic analysis (LSA, performing singular-value decomposition on the document-term matrix) can improve search results by disambiguating polysemous words and searching for synonyms of the query. However, searching in the high-dimensional continuous space is much slower than searching the standard trie data structure of search engines.
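
The decomposition step can be sketched with NumPy as follows. This is only an illustration under stated assumptions: it uses raw counts from the two-document example, a dense matrix, and a hypothetical helper named lsa_document_vectors; real systems typically work with weighted, sparse matrices and a truncated solver.

```python
import numpy as np


def lsa_document_vectors(matrix, k):
    """Project documents onto the top-k singular directions of the matrix."""
    a = np.asarray(matrix, dtype=float)
    # Thin SVD: a = u @ diag(s) @ vt, singular values sorted descending.
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    doc_vectors = u[:, :k] * s[:k]  # one k-dimensional row per document
    term_basis = vt[:k]             # k latent directions over the terms
    return doc_vectors, term_basis


# Columns: I, like, hate, databases
counts = [
    [1, 1, 0, 1],  # D1
    [1, 0, 1, 1],  # D2
]
doc_vectors, term_basis = lsa_document_vectors(counts, k=1)
print(doc_vectors.shape, term_basis.shape)
```

Multiplying doc_vectors by term_basis gives the best rank-k approximation of the original matrix; a query can be compared with documents in the reduced space by projecting its term vector onto term_basis in the same way.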

Finding topics

Multivariate analysis of the document-term matrix can reveal topics/themes of the corpus. Specifically, latent semantic analysis and data clustering can be used, and more recently probabilistic latent semantic analysis and non-negative matrix factorization have been found to perform well for this task.

See also

  • Bag of words model

Implementations

  • Gensim: Open source Python framework for Vector Space modelling. Contains memory-efficient algorithms for constructing term-document matrices from text plus common transformations (tf-idf, LSA, LDA).



This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.