Local Equivalences of Distances Between Clusterings

Tech Report Number



In comparing clusterings, several different distances and indices are in use. We prove that the Misclassification Error distance, the Hamming distance (equivalent to the unadjusted Rand index), and the dχ2 distance between partitions are equivalent in the neighborhood of 0. In other words, if two partitions are very similar, then one distance defines upper and lower bounds on the other and viceversa. The proof is geometric and relies on the convexity of a certain set of probability measures. To my knowledge, this is the first result of its kind. The motivation for this work is in the area of data clustering. Practically, these distances are frequently used to compare two clusterings of a set of observations. Theoretically, such distances are involved in formulating and proving properties of clusterig algorithms. Besides, our results apply to any pair of finite valued random variables, and provides simple yet tight upper and lower bounds on the χ 2 measure of (in)dependence valid when the two variables are strongly dependent.


tr553.pdf195.5 KB