Medoid
Medoids are representative objects of a data set or a cluster within a data set whose average dissimilarity to all the objects in the cluster is minimal.[1] Medoids are similar in concept to means or centroids, but medoids are always restricted to be members of the data set. Medoids are most commonly used on data when a mean or centroid cannot be defined, such as graphs. They are also used in contexts where the centroid is not representative of the dataset like in images and 3-D trajectories and gene expression [2] (where while the data is sparse the medoid need not be). These are also of interest while wanting to find a representative using some distance other than squared euclidean distance (for instance in movie-ratings).
For some data sets there may be more than one medoid, as with medians. A common application of the medoid is the k-medoids clustering algorithm, which is similar to the k-means algorithm but works when a mean or centroid is not definable. This algorithm basically works as follows. First, a set of medoids is chosen at random. Second, the distances to the other points are computed. Third, data are clustered according to the medoid they are most similar to. Fourth, the medoid set is optimized via an iterative process.
Note that a medoid is not equivalent to a median, a geometric median, or centroid. A median is only defined on 1-dimensional data, and it only minimizes dissimilarity to other points for metrics induced by a norm (such as the Manhattan distance or Euclidean distance). A geometric median is defined in any dimension, but is not necessarily a point from within the original dataset.
Algorithms to compute medoids
From the definition above, it is clear that the medoid can be computed after computing all pairwise distances between points in the ensemble. This would take distance evaluations. In the worst case, one can not compute the medoid with fewer distance evaluations.[3][4] However, there are many approaches that allow us to compute medoids either exactly or approximately in sub-quadratic time under different statistical models.
If the points lie on the real line, computing the medoid reduces to computing the median which can be done in by Quick-select algorithm of Hoare.[5] However, in higher dimensional real spaces, no linear-time algorithm is known. RAND[6] is an algorithm that estimates the average distance of each point to all the other points by sampling a random subset of other points. It takes a total of distance computations to approximate the medoid within a factor of with high probability, where is the maximum distance between two points in the ensemble. Note that RAND is an approximation algorithm, and moreover may not be known apriori.
RAND was leveraged by TOPRANK [7] which uses the estimates obtained by RAND to focus on a small subset of candidate points, evaluates the average distance of these points exactly, and picks the minimum of those. TOPRANK needs distance computations to find the exact medoid with high probability under a distributional assumption on the average distances.
trimed [3] presents an algorithm to find the medoid with distance evaluations under a distributional assumption on the points. The algorithm uses the triangle inequality to cut down the search space.
Meddit[4] leverages a connection of the medoid computation with multi-armed bandits and uses a Upper-Confidence-bound type of algorithm to get an algorithm which takes distance evaluations under statistical assumptions on the points.
Correlated Sequential Halving[8] also leverages multi-armed bandit techniques, improving upon Meddit. By exploiting the correlation structure in the problem, the algorithm is able to provably yield drastic improvement (usually around 1-2 orders of magnitude) in both number of distance computations needed and wall clock time.
Implementations
An implementation of RAND, TOPRANK, and trimed can be found here. An implementation of Meddit can be found here and here. An implementation of Correlated Sequential Halving can be found here.
See also
References
- Struyf, Anja; Hubert, Mia; Rousseeuw, Peter (1997). "Clustering in an Object-Oriented Environment". Journal of Statistical Software. 1 (4): 1–30.
- van der Laan, Mark J.; Pollard, Katherine S.; Bryan, Jennifer (2003). "A New Partitioning Around Medoids Algorithm". Journal of Statistical Computation and Simulation. Taylor & Francis Group. 73 (8): 575–584. doi:10.1080/0094965031000136012.
- Newling, James; & Fleuret, François (2016); "A sub-quadratic exact medoid algorithm", in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, PMLR 54:185-193, 2017 Available online.
- Bagaria, Vivek; Kamath, Govinda M.; Ntranos, Vasilis; Zhang, Martin J.; & Tse, David N. (2017); "Medoids in almost linear time via multi-armed bandits", arXiv preprint Available online.
- Hoare, Charles Antony Richard (1961); "Algorithm 65: find", in Communications of the ACM, 4(7), 321-322
- Eppstein, David; & Wang, Joseph (2006); "Fast approximation of centrality", in Graph Algorithms and Applications, 5, pp. 39-45
- Okamoto, Kazuya; Chen, Wei; & Li, Xiang-Yang (2008); "Ranking of closeness centrality for large-scale social networks", in Preparata, Franco P.; Wu, Xiaodong; Yin, Jianping (eds.); Frontiers in Algorithmics Workshop 2008, Lecture Notes in Computer Science, 5059, 186-195
- Baharav, Tavor Z.; & Tse, David N. (2019); "Ultra Fast Medoid Identification via Correlated Sequential Halving", in Advances in Neural Information Processing Systems, available online