Dasgupta's objective

In the study of hierarchical clustering, Dasgupta's objective is a measure of the quality of a clustering, defined from a similarity measure on the elements to be clustered. It is named after Sanjoy Dasgupta, who formulated it in 2016.[1] Its key property is that, when the similarity comes from an ultrametric space, the optimal clustering for this quality measure follows the underlying structure of the ultrametric space. In this sense, clustering methods that produce good clusterings for this objective can be expected to approximate the ground truth underlying the given similarity measure.[2]

In Dasgupta's formulation, the similarity between elements is given by an undirected graph $G=(V,E)$ , with non-negative real weights on its edges. Large weights indicate elements that should be considered more similar to each other, while small weights or missing edges indicate pairs of elements that are not similar. A hierarchical clustering can be described by a tree (not necessarily a binary tree) whose leaves are the elements to be clustered; the clusters are then the subjects of elements descending from each tree node, and the size $|C|$ of any cluster $C$ is its number of elements. For each edge $uv$ of the input graph, let $w(uv)$ denote the weight of edge $uv$ and let $C(uv)$ denote the smallest cluster of a given clustering that contains both $u$ and $v$ . Then Dasgupta defines the cost of a clustering to be[1]

\sum _{uv\in E}w(uv)\cdot |C(uv)|.

The optimal clustering for this objective is NP-hard to find. However, it is possible to find a clustering that approximates the minimum value of the objective in polynomial time by a divisive (top-down) clustering algorithm that repeatedly subdivides the elements using an approximation algorithm for the sparsest cut problem, the problem of finding a partition that minimizes the ratio of the total weight of cut edges to the total number of cut pairs.[1] Equivalently, for purposes of approximation, one may minimize the ratio of the total weight of cut edges to the number of elements on the smaller side of the cut. Using the best known approximation for the sparsest cut problem, the approximation ratio of this approach is $O({\sqrt {\log n}})$ .[3]

References

Dasgupta, Sanjoy (2016), "A cost function for similarity-based hierarchical clustering", Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing (STOC 2016), New York, New York: ACM, pp. 118–127, arXiv:1510.05043, doi:10.1145/2897518.2897527, MR 3536559
Cohen-Addad, Vincent; Kanade, Varun; Mallmann-Trenn, Frederik; Mathieu, Claire (2018), "Hierarchical clustering: objective functions and algorithms", Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2018), Philadelphia, Pennsylvania: Society for Industrial and Applied Mathematics, pp. 378–397, arXiv:1704.02147, doi:10.1137/1.9781611975031.26, MR 3775814
Arora, Sanjeev; Rao, Satish; Vazirani, Umesh (2009), "Expander flows, geometric embeddings and graph partitioning", Journal of the ACM, 56 (2): A5:1–A5:37, doi:10.1145/1502793.1502794, MR 2535878

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.

[dasgupta-1] Dasgupta, Sanjoy (2016), "A cost function for similarity-based hierarchical clustering", Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing (STOC 2016), New York, New York: ACM, pp. 118–127, arXiv:1510.05043, doi:10.1145/2897518.2897527, MR 3536559

[ckmm-2] Cohen-Addad, Vincent; Kanade, Varun; Mallmann-Trenn, Frederik; Mathieu, Claire (2018), "Hierarchical clustering: objective functions and algorithms", Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2018), Philadelphia, Pennsylvania: Society for Industrial and Applied Mathematics, pp. 378–397, arXiv:1704.02147, doi:10.1137/1.9781611975031.26, MR 3775814

[arv-3] Arora, Sanjeev; Rao, Satish; Vazirani, Umesh (2009), "Expander flows, geometric embeddings and graph partitioning", Journal of the ACM, 56 (2): A5:1–A5:37, doi:10.1145/1502793.1502794, MR 2535878