Medoid
A medoid is a data point within a cluster that minimizes the average dissimilarity to all other points in that cluster.[1] This concept serves as the foundation for medoid-based clustering methods in data analysis, where representative points are selected from the actual dataset rather than computed as averages.[2] The primary formulation is the k-medoids clustering problem, most commonly solved by the Partitioning Around Medoids (PAM) algorithm, introduced by Leonard Kaufman and Peter J. Rousseeuw in 1987.[2] In this approach, the algorithm partitions a dataset into k clusters by iteratively selecting k medoids and assigning each data point to the cluster with the nearest medoid, minimizing the total sum of dissimilarities.[1] Unlike k-means clustering, which relies on centroids (arithmetic means that may not correspond to actual data points), k-medoids uses real data objects as cluster centers, enabling compatibility with arbitrary dissimilarity measures such as Manhattan or Hamming distances.[2]

K-medoids offers several advantages over centroid-based methods, including greater robustness to noise and outliers, as extreme values cannot skew the medoid selection in the way they affect means.[1] It is particularly effective for categorical or mixed data types where computing means is infeasible, such as in bioinformatics or market segmentation tasks.[1] Extensions like CLARA (Clustering Large Applications) address scalability for large datasets by sampling subsets of the data.[1] Overall, medoid-based techniques provide interpretable and reliable clustering solutions in scenarios demanding actual data representatives.[2]

Fundamentals
Definition
A medoid is a representative data point within a given set S that minimizes the average dissimilarity to all other points in the set. Formally, it is defined as the point m \in S that satisfies
m = \arg\min_{m' \in S} \frac{1}{|S|} \sum_{i \in S} d(m', i),
where d is a dissimilarity or distance function measuring the separation between points.[3] This concept originates from early work on partitioning methods that select actual observations as cluster representatives rather than computed averages.[3]

Unlike a centroid, which is typically the arithmetic mean of points in a set and may not correspond to any actual data point, a medoid must be one of the existing points in the dataset. This requirement makes medoids inherently robust to outliers, as extreme values cannot skew the representative to a non-existent position, and allows the use of arbitrary distance metrics beyond Euclidean space, such as Manhattan or custom dissimilarities suitable for non-numeric data.[4][5] For example, consider the one-dimensional dataset S = \{1, 2, 10\} with the absolute distance metric d(x, y) = |x - y|. The medoid is 2, since it yields the minimum average distance of 3, compared to 3.33 for 1 and 5.67 for 10. Medoids play a central role in medoid-based clustering algorithms, where multiple such representatives are selected to partition datasets.[3]
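The definition translates directly into code. The following minimal Python sketch (not drawn from any particular library; the function name medoid is purely illustrative) evaluates every candidate point against the others and reproduces the one-dimensional example above; because the distance function is passed in, any dissimilarity could be substituted.

```python
# Minimal sketch of the medoid definition: the element of S minimizing the
# average dissimilarity to all points in S, under a supplied distance function.

def medoid(points, dist):
    """Return the element of `points` with the smallest average dissimilarity."""
    return min(points, key=lambda m: sum(dist(m, x) for x in points) / len(points))

S = [1, 2, 10]
print(medoid(S, lambda x, y: abs(x - y)))  # -> 2 (average distance 3.0, vs. 3.33 for 1 and 5.67 for 10)
```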
Historical Development

The concept of the medoid, as a representative point within a cluster selected from the actual data objects, was formally introduced in the context of clustering by Leonard Kaufman and Peter J. Rousseeuw in their 1987 paper on clustering using medoids under the L1 norm, with the seminal Partitioning Around Medoids (PAM) algorithm detailed in their 1990 book Finding Groups in Data: An Introduction to Cluster Analysis.[3] This work positioned medoids as a robust alternative to centroids in k-means clustering, particularly for handling outliers and arbitrary dissimilarity measures, building on earlier ideas from Vinod (1969) and Rao (1971) but establishing a practical algorithmic framework.[3]

Following its introduction, the medoid concept gained traction in k-medoids clustering during the 1990s and was taken up in robust statistics for its resistance to noisy data compared to mean-based methods. From the late 1990s onward, medoids were increasingly adopted in statistical applications requiring outlier robustness, such as environmental and biomedical data analysis, where selecting actual data points as cluster representatives minimized sensitivity to extreme values. In the 2000s, the medoid framework expanded to accommodate non-metric spaces, leveraging its inherent flexibility with dissimilarity matrices beyond Euclidean distances, as explored in theoretical extensions for general metric and non-metric settings.[6] This period saw applications in diverse fields like text and graph clustering, where non-Euclidean similarities were prevalent.

Key milestones include the development of scalable variants: the CLARA algorithm in 1986 for handling larger datasets via sampling, followed by CLARANS in 1994, which improved efficiency through randomized searches and was later formalized in a journal publication in 2002.[7][8] In the 2020s, adaptations for big data and AI have accelerated, such as BanditPAM++ (2023), which optimizes k-medoids for high-dimensional embedding spaces in machine learning tasks like natural language processing, achieving faster convergence on large-scale datasets, and OneBatchPAM (2025), a frugal approximation algorithm for handling large-scale datasets with low time and memory complexity.[9][10]

Mathematical Foundations
Properties and Selection Criteria
Medoids exhibit robustness to outliers and noise compared to centroid-based approaches, as they are selected from actual data points rather than computed averages like means, which can be heavily influenced by extreme values.[4] This median-like behavior ensures that the medoid remains representative even in the presence of anomalies, making medoids particularly suitable for datasets with non-Euclidean distances or irregular distributions. The primary selection criterion for a medoid in a cluster C is the minimization of the total cost, defined as the sum of dissimilarities from the candidate point to all other points in the cluster.[3] Mathematically, the medoid m is chosen as
m = \arg\min_{x \in C} \sum_{y \in C} d(x, y),
where d(x, y) denotes the dissimilarity between points x and y. This formulation promotes representativeness by identifying the point that minimizes the average deviation within the cluster, akin to a generalized median that balances proximity to all members.[4] In practice, heuristic k-medoids algorithms typically return locally optimal medoid choices, since exhaustively evaluating every candidate point against all others scales quadratically and becomes expensive for very large clusters.
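To make the robustness claim concrete, the short sketch below (an illustrative example, not drawn from the cited sources) contrasts the arithmetic mean with the medoid under absolute distance when a single extreme value is present; NumPy is assumed only for convenience.

```python
import numpy as np

# The mean is pulled toward an extreme value, while the medoid, being an
# existing observation, stays near the bulk of the data.
cluster = np.array([1.0, 2.0, 2.5, 3.0, 100.0])  # 100.0 is an outlier

mean = cluster.mean()
# Medoid under absolute distance: the point minimizing the total |x - y| over the cluster.
costs = np.abs(cluster[:, None] - cluster[None, :]).sum(axis=1)
medoid = cluster[np.argmin(costs)]

print(mean)    # 21.7 -- dragged toward the outlier
print(medoid)  # 2.5  -- an actual, central data point
```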
Distance Metrics for Medoids

Medoids are representative data points selected to minimize the total dissimilarity to other points in a cluster, relying on a dissimilarity matrix rather than arithmetic means. This approach allows medoid-based clustering to accommodate arbitrary distance metrics, in contrast to centroid-based methods like k-means, which are typically limited to Euclidean distances due to the requirement of computing vector averages. The flexibility in distance measures enables medoids to handle non-numeric, non-Euclidean, or heterogeneous data types where meaningful centroids cannot be defined.[11] Common distance metrics for medoid selection include those tailored to specific data characteristics. For numerical features, the Euclidean distance measures straight-line separation as
d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2},
while the Manhattan (L1) distance aggregates absolute differences:
d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{p} |x_i - y_i|.
The Manhattan metric is often preferred for its robustness to outliers, as it reduces the influence of extreme values compared to the squared terms in Euclidean distance. For categorical data, the Hamming distance counts differing attributes:
d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{p} \mathbb{I}(x_i \neq y_i),
where \mathbb{I} is the indicator function, providing a simple count of mismatches suitable for nominal variables. For mixed numerical and categorical data, Gower's distance integrates scaled Manhattan and Dice distances, offering a coefficient between 0 and 1 that handles feature heterogeneity without assuming equal scales. Non-Euclidean metrics extend medoid applicability to specialized domains. Dynamic Time Warping (DTW) is particularly effective for time series, as it computes an optimal alignment path to account for temporal shifts and varying speeds, defined via dynamic programming as the minimum cumulative distance along a warping path in a cost matrix. Unlike Euclidean distance, which penalizes misalignment rigidly, DTW's allowance for elastic matching better captures sequential similarities, influencing medoid selection toward representatives that align well under distortions. In high-dimensional settings, the cosine distance focuses on angular similarity:
d(\mathbf{x}, \mathbf{y}) = 1 - \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \|\mathbf{y}\|},
mitigating magnitude variations and the curse of dimensionality by prioritizing direction over scale, which can lead to more robust medoid choices in sparse or normalized feature spaces.[1] The choice of distance metric profoundly affects medoid selection and cluster quality, as it defines the geometry of the data space and the optimization objective of minimizing intra-cluster dissimilarities. For example, L1-based metrics like Manhattan enhance outlier resistance by downweighting distant points, potentially selecting medoids that better represent core cluster structures, whereas angle-based metrics like cosine promote selections invariant to scaling, improving performance in vector-heavy data. This metric-dependent robustness underscores the need to align the dissimilarity measure with data properties for effective medoid computation.[11]
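As a rough illustration of this metric dependence, the sketch below recomputes the medoid of the same random point set under Euclidean, Manhattan, and cosine dissimilarities; it assumes SciPy's cdist for the pairwise matrices, and the chosen data and metrics are arbitrary.

```python
import numpy as np
from scipy.spatial.distance import cdist

# The metric defines the dissimilarity matrix, and the medoid is simply the
# point (row) with the smallest total dissimilarity to all others.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

for metric in ("euclidean", "cityblock", "cosine"):  # cityblock = Manhattan
    D = cdist(X, X, metric=metric)        # pairwise dissimilarity matrix
    medoid_idx = D.sum(axis=1).argmin()   # index minimizing total dissimilarity
    print(metric, "-> medoid index", medoid_idx)
```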
Medoid-Based Clustering
Overview of Clustering Process
Medoid-based clustering is a partitioning technique that organizes a dataset of n objects into k clusters, with each cluster represented by a medoid, an actual data point selected as the representative that minimizes the average dissimilarity to other points within the cluster.[12] The process aims to achieve a configuration where data points within the same cluster exhibit high similarity, while points across different clusters show low similarity.[13] The core objective is to minimize the total within-cluster dissimilarity, defined as the aggregate of dissimilarities between each medoid and the points assigned to its cluster across all k clusters.[12] This optimization favors compact and well-separated groups, accommodating various dissimilarity measures without assuming Euclidean geometry.[13]

The clustering process generally proceeds in iterative steps. It starts with the initialization of k medoids, typically selected through a greedy process that minimizes the initial total dissimilarity, to seed the partitions. Each data point is then assigned to the nearest medoid using a predefined dissimilarity metric, forming initial clusters. Next, candidate medoids are evaluated by considering swaps between current medoids and non-medoid points; swaps that reduce the total dissimilarity cost are adopted.[12] These assignment and update steps repeat until no further cost reduction is possible, indicating convergence to a local optimum. This approach provides robustness to noise and outliers, as medoids are existing data points rather than computed averages, and because it operates directly on a dissimilarity matrix it is less tied to the spherical, equal-variance cluster assumptions of centroid-based methods.[13]
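The assignment step and the objective it drives can be expressed in a few lines. The following NumPy sketch is a schematic illustration (not a library API; the helper name assign_and_cost is hypothetical): given a full dissimilarity matrix and a candidate set of medoid indices, it returns the nearest-medoid labels and the total within-cluster dissimilarity that the iterations try to reduce.

```python
import numpy as np

def assign_and_cost(D, medoid_idx):
    """Nearest-medoid assignment and total within-cluster dissimilarity."""
    d_to_medoids = D[:, medoid_idx]              # distance of each point to each medoid
    labels = d_to_medoids.argmin(axis=1)         # assignment step
    total_cost = d_to_medoids.min(axis=1).sum()  # objective to be minimized
    return labels, total_cost

# Small example with Euclidean dissimilarities and two candidate medoids.
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
print(assign_and_cost(D, [0, 5]))
```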
Comparison to Centroid-Based Methods

Medoid-based clustering differs fundamentally from centroid-based methods like k-means in the selection and representation of cluster centers. In k-means, centroids are computed as the arithmetic means of points within each cluster, which can produce synthetic points not present in the original dataset and assumes a Euclidean space for effective optimization. In contrast, medoids are actual data points selected to minimize the total dissimilarity to other points in the cluster, allowing compatibility with arbitrary distance metrics beyond Euclidean distances, such as Manhattan or custom dissimilarities suitable for categorical or non-numeric data.[14]

A primary advantage of medoid methods is their robustness to outliers and noise, as the medoid selection process avoids shifting cluster centers toward extreme values, unlike k-means, where outliers can disproportionately influence the mean. Additionally, medoids enhance interpretability since each cluster representative is a genuine data instance, facilitating easier analysis and explanation in domains like customer segmentation or bioinformatics.[2] However, this comes at the cost of higher computational complexity; for instance, the classic Partitioning Around Medoids (PAM) algorithm exhibits a time complexity of O(k(n-k)^2) per iteration, where n is the number of data points and k is the number of clusters, compared with the O(nki) total cost of k-means over i iterations, which scales linearly in n.[15][16]

Medoid clustering is particularly advantageous for discrete, non-convex, or non-Euclidean datasets where centroid computation is infeasible or misleading, such as text or graph data.[17] Conversely, centroid-based methods like k-means are preferred for large-scale continuous data in Euclidean spaces due to their scalability and simplicity.[14]

Algorithms
Partitioning Around Medoids (PAM)
Partitioning Around Medoids (PAM) is a foundational k-medoids clustering algorithm that partitions a dataset into k clusters by selecting actual data points as medoids, minimizing the total dissimilarity within clusters. Unlike centroid-based methods, PAM uses medoids to ensure robustness to outliers and applicability to non-Euclidean distance metrics. The algorithm iteratively refines the medoid selection to optimize a cost function, making it suitable for datasets where actual objects represent cluster centers.[18]

The PAM algorithm proceeds in two main phases: the build phase and the swap phase. In the build phase, initial medoids are selected greedily. The first medoid is chosen as the point that minimizes the sum of distances to all other points. Subsequent medoids are added one by one, selecting the non-medoid point that results in the smallest increase in the total cost when added as a new medoid. This phase requires O(n²k) time, where n is the number of data points and k is the number of clusters.[18]

The swap phase then optimizes the initial medoids by iteratively evaluating possible swaps. All k(n - k) possible swaps are evaluated to find the one that most reduces the total cost. If such a swap exists, it is performed, cluster assignments are updated, and the process repeats until no swap reduces the cost. Each iteration of the swap phase evaluates all possible swaps, computing the cost change for each affected point, leading to O(k(n - k)²) time complexity per iteration. The algorithm typically converges in a small number of iterations.[18]

The objective function minimized by PAM is the total cost, defined as the sum over all clusters C_i of the distances from each point x_j in C_i to its medoid m_i:
\text{Total Cost} = \sum_{i=1}^{k} \sum_{x_j \in C_i} d(x_j, m_i),
where d is the chosen dissimilarity measure. This cost measures the within-cluster dispersion and is robust since medoids are data points.[18]

The overall time complexity of PAM is dominated by the swap phase, resulting in O(k(n - k)²) per iteration, with the full algorithm requiring storage of an n × n dissimilarity matrix. A high-level pseudocode outline is as follows:

```
Algorithm PAM(D, k):
    Input:  Dissimilarity matrix D (n × n), number of clusters k
    Output: Medoids M = {m1, ..., mk}, cluster assignments

    // Build phase: initialize medoids greedily
    m1 = argmin_u sum_v D(u, v)
    For i = 2 to k:
        Add as mi the non-medoid u minimizing the total cost of M ∪ {u}

    // Assign each point to its nearest medoid
    For each point x:
        Assign x to the cluster of the closest m in M

    // Swap phase: iterate until no improvement
    Repeat:
        best_delta = 0; best_m = None; best_o = None
        For each medoid m in M:
            For each non-medoid o not in M:
                Δ = change in total cost if m and o are swapped
                If Δ < best_delta:
                    best_delta = Δ; best_m = m; best_o = o
        If best_delta >= 0:
            Break                       // no improving swap remains
        Else:
            Swap best_m and best_o and update assignments

    Return M and assignments
```

This pseudocode captures the core logic, with cost computations optimized by precomputing affected distances. As an illustrative example, consider applying PAM to the Iris dataset, which consists of 150 samples of sepal and petal measurements from three iris species, using k=3 clusters and Euclidean distance. The resulting clusters align closely with the true species, with silhouette analysis indicating strong separation (values up to approximately 0.8). This demonstrates PAM's ability to recover meaningful structure in low-dimensional biological data.[19]
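For readers who prefer runnable code, the following is a compact, unoptimized Python rendering of the build and swap phases described above. It operates on a precomputed dissimilarity matrix, recomputes the full cost for every candidate swap (so it is slower than a tuned PAM), and is intended purely as a sketch of the logic; the function name pam here is illustrative, not a library API.

```python
import numpy as np

def pam(D, k, max_iter=100):
    """Naive PAM sketch over a precomputed n x n dissimilarity matrix D."""
    n = D.shape[0]

    def total_cost(medoids):
        # Sum of each point's distance to its nearest medoid.
        return D[:, medoids].min(axis=1).sum()

    # Build phase: greedy initialization.
    medoids = [int(D.sum(axis=1).argmin())]  # point with smallest total distance
    while len(medoids) < k:
        candidates = [i for i in range(n) if i not in medoids]
        medoids.append(min(candidates, key=lambda c: total_cost(medoids + [c])))

    # Swap phase: perform the best cost-reducing swap until none exists.
    best = total_cost(medoids)
    for _ in range(max_iter):
        best_swap, best_cost = None, best
        for mi, m in enumerate(medoids):
            for o in range(n):
                if o in medoids:
                    continue
                trial = medoids[:mi] + [o] + medoids[mi + 1:]
                c = total_cost(trial)
                if c < best_cost:
                    best_swap, best_cost = (mi, o), c
        if best_swap is None:
            break  # local optimum reached
        medoids[best_swap[0]] = best_swap[1]
        best = best_cost

    labels = D[:, medoids].argmin(axis=1)
    return np.array(medoids), labels

# Example on random 2-D points with Euclidean dissimilarities.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
medoids, labels = pam(D, k=3)
print(medoids, np.bincount(labels))
```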
Scalable Algorithms (CLARA and CLARANS)
CLARA, or Clustering Large Applications, addresses the computational limitations of the PAM algorithm when applied to large datasets by employing a sampling strategy to approximate the medoid selection process. The algorithm draws multiple random subsets (samples) from the full dataset, typically 5 to 50 samples each of size around 40 plus twice the number of clusters k, applies the PAM procedure to each subset to identify medoids, and selects the clustering with the lowest total cost across all samples.[11] This sampling reduces the time complexity from the quadratic scaling of PAM to effectively linear in the dataset size n, making CLARA suitable for datasets where n exceeds several thousand points.[20]

Building on similar scalability goals, CLARANS (Clustering Large Applications based upon RANdomized Search) introduces a stochastic optimization approach that modifies the neighborhood search in PAM to handle very large datasets more efficiently. Instead of exhaustively evaluating all possible swaps in the neighborhood of current medoids, CLARANS randomly samples a fixed number of candidate swaps (controlled by the parameter maxneighbor, often 250) at each step and accepts those that improve the objective function. The process repeats for a specified number of local searches (numlocal, typically 10 to 100), retaining the best local optimum found.[8] A simplified pseudocode outline for the core neighbor evaluation in CLARANS is as follows:

```
Algorithm CLARANS(D, k, numlocal, maxneighbor):
    Input:  Dissimilarity matrix D (n × n), k, numlocal, maxneighbor
    Output: Best medoids best_S

    best_S = None; best_cost = infinity
    For i = 1 to numlocal:
        S = k randomly chosen initial medoids
        improved = True
        While improved:
            improved = False
            For j = 1 to maxneighbor:
                Randomly select a medoid m in S and a non-medoid o not in S
                new_S = (S - {m}) ∪ {o}
                If cost(new_S) < cost(S):
                    S = new_S
                    improved = True
                    Break               // move to the improved neighbor and resample
        If cost(S) < best_cost:
            best_S = S; best_cost = cost(S)
    Return best_S
```

This randomized sampling of the vast neighborhood space (which grows as O(k(n-k))) avoids full enumeration while exploring diverse solutions.[8] Both CLARA and CLARANS trade exact optimality for substantial speedups, yielding approximate clusterings that are often close to PAM's quality but with runtimes reduced by orders of magnitude on large datasets (n > 10,000). CLARA's sampling can miss global structures if samples are unrepresentative, while CLARANS may converge to suboptimal local minima due to its stochastic nature, though increasing the number of samples or searches mitigates this at additional computational cost. These methods are particularly effective for initial explorations or when exact solutions are unnecessary, enabling medoid-based clustering on datasets infeasible for PAM.[4][8]
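A CLARA-style wrapper is straightforward to sketch on top of any PAM solver. The example below is a schematic illustration that assumes scikit-learn-extra's KMedoids (discussed under Implementations) as the inner solver; the sample count and the 40 + 2k sample-size heuristic follow the description above, and the function name clara_like is purely illustrative.

```python
import numpy as np
from sklearn_extra.cluster import KMedoids  # assumed available (scikit-learn-extra)

def clara_like(X, k, n_samples=5, sample_size=None, random_state=0):
    """CLARA-style sketch: run PAM on random subsets, keep the best medoids."""
    rng = np.random.default_rng(random_state)
    n = len(X)
    sample_size = sample_size or min(n, 40 + 2 * k)  # heuristic from the text

    best_cost, best_medoids = np.inf, None
    for _ in range(n_samples):
        idx = rng.choice(n, size=sample_size, replace=False)
        km = KMedoids(n_clusters=k, method="pam", random_state=0).fit(X[idx])
        medoids = X[idx][km.medoid_indices_]
        # Evaluate on the FULL dataset: total distance to the nearest medoid.
        cost = np.linalg.norm(X[:, None, :] - medoids[None, :, :], axis=-1).min(axis=1).sum()
        if cost < best_cost:
            best_cost, best_medoids = cost, medoids
    return best_medoids, best_cost

X = np.random.default_rng(1).normal(size=(2000, 2))
medoids, cost = clara_like(X, k=3)
print(medoids.shape, round(cost, 2))
```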
Implementations
Software Libraries and Frameworks
Several software libraries and frameworks provide implementations of medoid-based clustering algorithms, enabling researchers and practitioners to apply methods like Partitioning Around Medoids (PAM) and its variants across various programming environments.[21][1]

In R, the cluster package offers core functionality for medoid clustering through the pam() function, which implements the PAM algorithm to identify k representative medoids from a dataset, minimizing the total dissimilarity within clusters. The package also includes the clara() function, an extension of PAM designed for large datasets by sampling subsets of data for efficient computation while maintaining robustness to outliers. These functions support various distance metrics, such as Euclidean and Manhattan, and are widely used for their reliability in statistical analysis workflows.
For Python, the scikit-learn-extra library extends the scikit-learn ecosystem with the KMedoids class, which performs k-medoids clustering using PAM or alternate optimization methods to select actual data points as cluster representatives, offering improved robustness over centroid-based approaches like k-means.[21] Additionally, the pyclustering library provides a dedicated kmedoids class that implements the PAM algorithm, allowing users to specify initial medoids and distance matrices for flexible clustering on diverse data types, including support for custom dissimilarity measures.[15]
MATLAB includes a built-in kmedoids function in the Statistics and Machine Learning Toolbox, which partitions data into k clusters by selecting medoids that minimize the sum of dissimilarities, with options for specifying distance metrics corresponding to those used by the pdist function for pairwise distances.[1] This integration facilitates custom implementations of PAM by combining pdist for distance calculations with the core clustering routine, particularly useful in high-dimensional or non-Euclidean spaces.[22]
In Julia, the Clustering.jl package features a kmedoids function that employs a k-means-style iterative algorithm to find medoids, emphasizing convergence efficiency while supporting arbitrary distance functions for clustering tasks.[23] For a more direct PAM implementation, the PAM.jl package offers specialized routines that align closely with the original algorithm, providing high-quality medoid selection comparable to R's cluster package, though it trades some runtime for this accuracy.[24]
Python Example Using scikit-learn-extra
The scikit-learn-extra library provides an implementation of the Partitioning Around Medoids (PAM) algorithm through its KMedoids class, which supports both Euclidean distances and precomputed distance matrices for synthetic or real datasets.[21] To demonstrate, consider generating synthetic 2D data using make_blobs from scikit-learn, applying PAM with k=3 clusters, and visualizing the results with matplotlib. The following code snippet computes clusters using the default Euclidean metric and plots the data points colored by cluster labels, highlighting the medoids.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn_extra.cluster import KMedoids

# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=0)

# Apply PAM clustering (method='pam' selects the PAM build/swap procedure)
kmedoids = KMedoids(n_clusters=3, method='pam', random_state=0)
kmedoids.fit(X)
labels = kmedoids.labels_
medoids = kmedoids.cluster_centers_

# Visualize clusters and highlight the medoids
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='viridis')
plt.scatter(medoids[:, 0], medoids[:, 1], c='red', s=200, marker='X', label='Medoids')
plt.title('PAM Clustering on Synthetic Data')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()

print("Medoid indices:", kmedoids.medoid_indices_)
print("Cluster labels shape:", labels.shape)
```

This example yields three distinct clusters, with medoid indices typically at [22, 75, 149] for the given random seed, illustrating how actual data points serve as cluster representatives.[25]
R Example Using the cluster Package
In R, the cluster package's clara() function implements the CLARA algorithm for scalable medoid-based clustering, suitable for larger datasets by sampling subsets.[20] For illustration, apply clara() to the built-in iris dataset (excluding the Species factor) with k=4 clusters, then extract medoids and compute silhouette scores using silhouette() to evaluate cluster quality. The code below performs the clustering and prints key outputs.

```r
library(cluster)

data(iris)
X <- iris[, 1:4]  # Numeric features only

# Apply CLARA clustering with k=4; rngR=TRUE uses R's RNG, so set.seed gives reproducibility
set.seed(123)
clara_result <- clara(X, k=4, sampsize=50, rngR=TRUE)

# Output medoid indices and cluster sizes
print("Medoid indices:")
print(clara_result$i.med)
print("Cluster sizes:")
print(table(clara_result$clustering))

# Compute silhouette scores
sil <- silhouette(clara_result$clustering, dist(X))
print("Average silhouette width:")
print(summary(sil)$avg.width)
plot(sil, main="Silhouette Plot for CLARA Clustering")
```

Executing this produces medoids at indices such as 1, 51, 101, and 135, with cluster sizes around 38, 37, 38, and 37, and an average silhouette width of approximately 0.58, indicating reasonable separation.[26]
Tips for Handling Custom Distances
For non-Euclidean distances, such as Manhattan or Gower for mixed data types, precompute the distance matrix and supply it to the algorithm to ensure compatibility. In Python's KMedoids, set metric='precomputed' and pass the matrix to fit(); in R, pam() accepts a dist()-computed or custom dissimilarity object directly (with diss = TRUE), whereas clara() samples from the raw observations and therefore expects the original data rather than a precomputed matrix. This approach allows flexibility for domain-specific metrics without altering the core medoid selection process.[21][27]
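As a brief illustration of this precomputed-matrix workflow in Python, the sketch below builds a Manhattan distance matrix with SciPy's cdist and passes it to the KMedoids class described earlier; with metric='precomputed', the fitted medoids are reported as row indices into the original data.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn_extra.cluster import KMedoids

# Precompute a Manhattan (cityblock) distance matrix and hand it to KMedoids.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
D = cdist(X, X, metric="cityblock")  # n x n dissimilarity matrix

km = KMedoids(n_clusters=3, metric="precomputed", method="pam", random_state=0)
labels = km.fit_predict(D)

# With a precomputed matrix, medoids are reported as row indices into X.
print("Medoid indices:", km.medoid_indices_)
print("Cluster sizes:", np.bincount(labels))
```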