SimHash
In computer science, SimHash is a locality-sensitive hashing technique for quickly estimating how similar two sets are. Created by Moses Charikar, the algorithm is used by Google's web crawler to find near-duplicate pages.
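The scheme produces a short fingerprint such that similar inputs yield fingerprints that differ in few bit positions. Below is a minimal sketch of the classic construction, assuming word-level features with unit weights; the function names and the choice of MD5 as the per-feature hash are illustrative, not taken from any particular library.

```python
import hashlib

def _feature_hash(feature: str, bits: int = 64) -> int:
    # Hash each feature to a fixed-width integer (MD5 used here for portability).
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")

def simhash(features, bits: int = 64) -> int:
    # One signed counter per output bit position.
    counts = [0] * bits
    for feature in features:
        h = _feature_hash(feature, bits)
        for i in range(bits):
            # A set bit votes +1 for this position, a clear bit votes -1.
            counts[i] += 1 if (h >> i) & 1 else -1
    # The fingerprint has a 1 wherever the votes are positive.
    fingerprint = 0
    for i in range(bits):
        if counts[i] > 0:
            fingerprint |= 1 << i
    return fingerprint
```

Because each bit of the fingerprint aggregates votes from all features, changing a small fraction of the features flips only a small fraction of the bits.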
Evaluation and benchmarks
A large-scale evaluation was conducted by Google in 2006[1] to compare the performance of the MinHash and SimHash[2] algorithms. In 2007 Google reported using SimHash for duplicate detection in web crawling[3] and using MinHash and locality-sensitive hashing (LSH) for Google News personalization.[4]
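In the web-crawling setting, duplicate detection reduces to comparing fingerprints by Hamming distance: Manku, Jain, and Das Sarma[3] treat 64-bit fingerprints that differ in at most 3 bits as near-duplicates. A small usage sketch building on the simhash function above; the toy sentences are illustrative only, and with inputs this short the distance gap is noisy (real pages contribute many more features and separate far more cleanly).

```python
def hamming_distance(a: int, b: int) -> int:
    # Number of bit positions where the two fingerprints differ.
    return bin(a ^ b).count("1")

near_a = simhash("the quick brown fox jumps over the lazy dog".split())
near_b = simhash("the quick brown fox jumped over the lazy dog".split())
far = simhash("an entirely different sentence about web crawling".split())

# Near-duplicate pairs tend to have smaller Hamming distances than
# unrelated pairs; Manku et al. used a threshold of 3 bits for 64-bit
# fingerprints of real web pages.
print(hamming_distance(near_a, near_b), hamming_distance(near_a, far))
```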
See also
- MinHash
- w-shingling
- Count-min sketch
References
- Henzinger, Monika (2006), "Finding near-duplicate web pages: a large-scale evaluation of algorithms", Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, p. 284, doi:10.1145/1148170.1148222, ISBN 978-1595933690.
- Charikar, Moses S. (2002), "Similarity estimation techniques from rounding algorithms", Proceedings of the 34th Annual ACM Symposium on Theory of Computing, p. 380, doi:10.1145/509907.509965, ISBN 978-1581134957.
- Manku, Gurmeet Singh; Jain, Arvind; Das Sarma, Anish (2007), "Detecting near-duplicates for web crawling", Proceedings of the 16th International Conference on World Wide Web, p. 141, doi:10.1145/1242572.1242592, ISBN 9781595936547.
- Das, Abhinandan S.; Datar, Mayur; Garg, Ashutosh; Rajaram, Shyam; et al. (2007), "Google news personalization: scalable online collaborative filtering", Proceedings of the 16th International Conference on World Wide Web, p. 271, doi:10.1145/1242572.1242610, ISBN 9781595936547.