w-shingling

In natural language processing, a w-shingling is the set of unique shingles (that is, n-grams) of a document, where each shingle is a contiguous subsequence of tokens from that document. The shinglings of two documents can then be compared to ascertain how similar the documents are. The symbol w denotes the number of tokens in each shingle, whether selected in advance or solved for.

The document "a rose is a rose is a rose" can be tokenized into words as follows:

(a,rose,is,a,rose,is,a,rose)

The contiguous sequences of 4 tokens (n = 4, that is, the 4-grams), listed in order of occurrence, are

(a,rose,is,a), (rose,is,a,rose), (is,a,rose,is), (a,rose,is,a), (rose,is,a,rose)

Two of these are repeats. Removing the duplicates gives the 4-shingling of the document:

{ (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is) }
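The construction above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name w_shingling is hypothetical, and splitting on whitespace is an assumed tokenization that happens to match the example.

```python
def w_shingling(tokens, w):
    """Return the set of unique contiguous w-token subsequences of tokens."""
    # Slide a window of width w over the token list. Collecting the windows
    # in a set discards repeated shingles automatically. If the document has
    # fewer than w tokens, the range is empty and so is the result.
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


# Assumption: whitespace splitting is an adequate tokenizer for this example.
tokens = "a rose is a rose is a rose".split()
shingles = w_shingling(tokens, 4)
```

Here the 8 tokens yield 5 windows, of which 3 are distinct, so shingles holds the three tuples listed above.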

Resemblance

For a given shingle size, the degree to which two documents A and B resemble each other can be expressed as the ratio of the size of the intersection of their shinglings to the size of their union, or

r(A,B)={{|S(A)\cap S(B)|} \over {|S(A)\cup S(B)|}}

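where S(A) is the shingling of document A and |X| is the size of a set X. The resemblance is a number in the range [0,1], where 1 indicates that the two documents have identical shinglings. Identical documents always score 1, but so can different documents that happen to share the same set of shingles. This definition is identical to the Jaccard coefficient, which describes the similarity and diversity of sample sets.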
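As a rough sketch, the resemblance can be computed directly with Python sets. The helper names shingling and resemblance are hypothetical, whitespace tokenization is assumed, and returning 0.0 when neither document has any shingles is a convention chosen here, since the formula is undefined for an empty union.

```python
def shingling(text, w):
    """Set of unique contiguous w-token shingles (assumes whitespace tokens)."""
    tokens = text.split()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a, b, w=4):
    """Jaccard coefficient of the w-shinglings of documents a and b."""
    sa, sb = shingling(a, w), shingling(b, w)
    union = sa | sb
    if not union:
        # Assumption: treat two documents with no shingles as resemblance 0.
        return 0.0
    return len(sa & sb) / len(union)


r = resemblance("a rose is a rose is a rose",
                "a rose is a flower which is a rose")
print(r)  # 0.125
```

With w = 4 the first document has 3 shingles and the second has 6, sharing only (a,rose,is,a), so the ratio is 1/8.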

References

(Manber 1993) Finding Similar Files in a Large File System. Does not yet use the term "shingling".
(Broder, Glassman, Manasse, and Zweig 1997) Syntactic Clustering of the Web. SRC Technical Note #1997-015.

External links

Manning, Christopher D.; Raghavan, Prabhakar; Schütze, Hinrich (7 July 2008). "w-shingling". Introduction to Information Retrieval. Cambridge University Press. ISBN 978-1-139-47210-4.

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.
