Medoid
A medoid is a data point within a cluster that minimizes the average dissimilarity to all other points in that cluster.[1] This concept serves as the foundation for medoid-based clustering methods in data analysis, where representative points are selected from the actual dataset rather than computed as averages.[2] The primary formulation is the k-medoids clustering problem, most commonly solved by the Partitioning Around Medoids (PAM) algorithm, introduced by Leonard Kaufman and Peter J. Rousseeuw in 1987.[2] In this approach, the algorithm partitions a dataset into k clusters by iteratively selecting k medoids and assigning each data point to the cluster with the nearest medoid, minimizing the total sum of dissimilarities.[1] Unlike k-means clustering, which relies on centroids (arithmetic means that may not correspond to actual data points), k-medoids uses real data objects as cluster centers, enabling compatibility with arbitrary dissimilarity measures such as Manhattan or Hamming distances.[2]

K-medoids offers several advantages over centroid-based methods, including greater robustness to noise and outliers, as extreme values cannot skew the medoid selection in the way they affect means.[1] It is particularly effective for categorical or mixed data types where computing means is infeasible, such as in bioinformatics or market segmentation tasks.[1] Extensions like CLARA (Clustering Large Applications) address scalability for large datasets by sampling subsets of the data.[1] Overall, medoid-based techniques provide interpretable and reliable clustering solutions in scenarios demanding actual data representatives.[2]

Fundamentals
Definition
A medoid is a representative data point within a given set S that minimizes the average dissimilarity to all other points in the set. Formally, it is defined as the point m \in S that satisfies
m = \arg\min_{m' \in S} \frac{1}{|S|} \sum_{i \in S} d(m', i),
where d is a dissimilarity or distance function measuring the separation between points.[3] This concept originates from early work on partitioning methods that select actual observations as cluster representatives rather than computed averages.[3]

Unlike a centroid, which is typically the arithmetic mean of points in a set and may not correspond to any actual data point, a medoid must be one of the existing points in the dataset. This requirement makes medoids inherently robust to outliers, as extreme values cannot skew the representative to a non-existent position, and allows the use of arbitrary distance metrics beyond Euclidean space, such as Manhattan or custom dissimilarities suitable for non-numeric data.[4][5] For example, consider the one-dimensional dataset S = \{1, 2, 10\} with the absolute distance metric d(x, y) = |x - y|. The medoid is 2, since it yields the minimum average distance of 3, compared to 3.33 for 1 and 5.67 for 10. Medoids play a central role in medoid-based clustering algorithms, where multiple such representatives are selected to partition datasets.[3]
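The definition translates directly into code. The following minimal Python sketch (not drawn from any particular library; the function name medoid is purely illustrative) evaluates every candidate point against the others and reproduces the one-dimensional example above; because the distance function is passed in, any dissimilarity could be substituted.

```python
# Minimal sketch of the medoid definition: the element of S minimizing the
# average dissimilarity to all points in S, under a supplied distance function.

def medoid(points, dist):
    """Return the element of `points` with the smallest average dissimilarity."""
    return min(points, key=lambda m: sum(dist(m, x) for x in points) / len(points))

S = [1, 2, 10]
print(medoid(S, lambda x, y: abs(x - y)))  # -> 2 (average distance 3.0, vs. 3.33 for 1 and 5.67 for 10)
```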
Historical Development

The concept of the medoid, as a representative point within a cluster selected from the actual data objects, was formally introduced in the context of clustering by Leonard Kaufman and Peter J. Rousseeuw in their 1987 paper on clustering using medoids under the L1 norm, with the seminal Partitioning Around Medoids (PAM) algorithm detailed in their 1990 book Finding Groups in Data: An Introduction to Cluster Analysis.[3] This work positioned medoids as a robust alternative to centroids in k-means clustering, particularly for handling outliers and arbitrary dissimilarity measures, building on earlier ideas from Vinod (1969) and Rao (1971) but establishing a practical algorithmic framework.[3]

Following its introduction, the medoid concept gained traction in k-medoids clustering during the 1990s and was taken up in robust statistics for its resistance to noisy data compared to mean-based methods. From the late 1990s onward, medoids were increasingly adopted in statistical applications requiring outlier robustness, such as environmental and biomedical data analysis, where selecting actual data points as cluster representatives minimized sensitivity to extreme values. In the 2000s, the medoid framework expanded to accommodate non-metric spaces, leveraging its inherent flexibility with dissimilarity matrices beyond Euclidean distances, as explored in theoretical extensions for general metric and non-metric settings.[6] This period saw applications in diverse fields like text and graph clustering, where non-Euclidean similarities were prevalent.

Key milestones include the development of scalable variants: the CLARA algorithm in 1986 for handling larger datasets via sampling, followed by CLARANS in 1994, which improved efficiency through randomized searches and was later formalized in a journal publication in 2002.[7][8] In the 2020s, adaptations for big data and AI have accelerated, such as BanditPAM++ (2023), which optimizes k-medoids for high-dimensional embedding spaces in machine learning tasks like natural language processing, achieving faster convergence on large-scale datasets, and OneBatchPAM (2025), a frugal approximation algorithm for handling large-scale datasets with low time and memory complexity.[9][10]

Mathematical Foundations
Properties and Selection Criteria
Medoids exhibit robustness to outliers and noise compared to centroid-based approaches, as they are selected from actual data points rather than computed averages like means, which can be heavily influenced by extreme values.[4] This median-like behavior ensures that the medoid remains representative even in the presence of anomalies, making medoids particularly suitable for datasets with non-Euclidean distances or irregular distributions. The primary selection criterion for a medoid in a cluster C is the minimization of the total cost, defined as the sum of dissimilarities from the candidate point to all other points in the cluster.[3] Mathematically, the medoid m is chosen as
m = \arg\min_{x \in C} \sum_{y \in C} d(x, y),
where d(x, y) denotes the dissimilarity between points x and y. This formulation promotes representativeness by identifying the point that minimizes the average deviation within the cluster, akin to a generalized median that balances proximity to all members.[4] In practice, heuristic k-medoids algorithms typically return locally optimal medoid choices, since exhaustively evaluating every candidate point against all others scales quadratically and becomes expensive for very large clusters.
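To make the robustness claim concrete, the short sketch below (an illustrative example, not drawn from the cited sources) contrasts the arithmetic mean with the medoid under absolute distance when a single extreme value is present; NumPy is assumed only for convenience.

```python
import numpy as np

# The mean is pulled toward an extreme value, while the medoid, being an
# existing observation, stays near the bulk of the data.
cluster = np.array([1.0, 2.0, 2.5, 3.0, 100.0])  # 100.0 is an outlier

mean = cluster.mean()
# Medoid under absolute distance: the point minimizing the total |x - y| over the cluster.
costs = np.abs(cluster[:, None] - cluster[None, :]).sum(axis=1)
medoid = cluster[np.argmin(costs)]

print(mean)    # 21.7 -- dragged toward the outlier
print(medoid)  # 2.5  -- an actual, central data point
```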
Distance Metrics for Medoids

Medoids are representative data points selected to minimize the total dissimilarity to other points in a cluster, relying on a dissimilarity matrix rather than arithmetic means. This approach allows medoid-based clustering to accommodate arbitrary distance metrics, in contrast to centroid-based methods like k-means, which are typically limited to Euclidean distances due to the requirement of computing vector averages. The flexibility in distance measures enables medoids to handle non-numeric, non-Euclidean, or heterogeneous data types where meaningful centroids cannot be defined.[11] Common distance metrics for medoid selection include those tailored to specific data characteristics. For numerical features, the Euclidean distance measures straight-line separation as
d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2},
while the Manhattan (L1) distance aggregates absolute differences:
d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{p} |x_i - y_i|.
The Manhattan metric is often preferred for its robustness to outliers, as it reduces the influence of extreme values compared to the squared terms in Euclidean distance. For categorical data, the Hamming distance counts differing attributes:
d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{p} \mathbb{I}(x_i \neq y_i),
where \mathbb{I} is the indicator function, providing a simple count of mismatches suitable for nominal variables. For mixed numerical and categorical data, Gower's distance integrates scaled Manhattan and Dice distances, offering a coefficient between 0 and 1 that handles feature heterogeneity without assuming equal scales. Non-Euclidean metrics extend medoid applicability to specialized domains. Dynamic Time Warping (DTW) is particularly effective for time series, as it computes an optimal alignment path to account for temporal shifts and varying speeds, defined via dynamic programming as the minimum cumulative distance along a warping path in a cost matrix. Unlike Euclidean distance, which penalizes misalignment rigidly, DTW's allowance for elastic matching better captures sequential similarities, influencing medoid selection toward representatives that align well under distortions. In high-dimensional settings, the cosine distance focuses on angular similarity:
d(\mathbf{x}, \mathbf{y}) = 1 - \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \|\mathbf{y}\|},
mitigating magnitude variations and the curse of dimensionality by prioritizing direction over scale, which can lead to more robust medoid choices in sparse or normalized feature spaces.[1] The choice of distance metric profoundly affects medoid selection and cluster quality, as it defines the geometry of the data space and the optimization objective of minimizing intra-cluster dissimilarities. For example, L1-based metrics like Manhattan enhance outlier resistance by downweighting distant points, potentially selecting medoids that better represent core cluster structures, whereas angle-based metrics like cosine promote selections invariant to scaling, improving performance in vector-heavy data. This metric-dependent robustness underscores the need to align the dissimilarity measure with data properties for effective medoid computation.[11]
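As a rough illustration of this metric dependence, the sketch below recomputes the medoid of the same random point set under Euclidean, Manhattan, and cosine dissimilarities; it assumes SciPy's cdist for the pairwise matrices, and the chosen data and metrics are arbitrary.

```python
import numpy as np
from scipy.spatial.distance import cdist

# The metric defines the dissimilarity matrix, and the medoid is simply the
# point (row) with the smallest total dissimilarity to all others.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

for metric in ("euclidean", "cityblock", "cosine"):  # cityblock = Manhattan
    D = cdist(X, X, metric=metric)        # pairwise dissimilarity matrix
    medoid_idx = D.sum(axis=1).argmin()   # index minimizing total dissimilarity
    print(metric, "-> medoid index", medoid_idx)
```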
Medoid-Based Clustering
Overview of Clustering Process
Medoid-based clustering is a partitioning technique that organizes a dataset of n objects into k clusters, with each cluster represented by a medoid, an actual data point selected as the representative that minimizes the average dissimilarity to other points within the cluster.[12] The process aims to achieve a configuration where data points within the same cluster exhibit high similarity, while points across different clusters show low similarity.[13] The core objective is to minimize the total within-cluster dissimilarity, defined as the aggregate of dissimilarities between each medoid and the points assigned to its cluster across all k clusters.[12] This optimization favors compact and well-separated groups, accommodating various dissimilarity measures without assuming Euclidean geometry.[13]

The clustering process generally proceeds in iterative steps. It starts with the initialization of k medoids, typically selected through a greedy process that minimizes the initial total dissimilarity, to seed the partitions. Each data point is then assigned to the nearest medoid using a predefined dissimilarity metric, forming initial clusters. Next, candidate medoids are evaluated by considering swaps between current medoids and non-medoid points; swaps that reduce the total dissimilarity cost are adopted.[12] These assignment and update steps repeat until no further cost reduction is possible, indicating convergence to a local optimum. This approach provides robustness to noise and outliers, as medoids are existing data points rather than computed averages, and because it operates directly on a dissimilarity matrix it is less tied to the spherical, equal-variance cluster assumptions of centroid-based methods.[13]
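The assignment step and the objective it drives can be expressed in a few lines. The following NumPy sketch is a schematic illustration (not a library API; the helper name assign_and_cost is hypothetical): given a full dissimilarity matrix and a candidate set of medoid indices, it returns the nearest-medoid labels and the total within-cluster dissimilarity that the iterations try to reduce.

```python
import numpy as np

def assign_and_cost(D, medoid_idx):
    """Nearest-medoid assignment and total within-cluster dissimilarity."""
    d_to_medoids = D[:, medoid_idx]              # distance of each point to each medoid
    labels = d_to_medoids.argmin(axis=1)         # assignment step
    total_cost = d_to_medoids.min(axis=1).sum()  # objective to be minimized
    return labels, total_cost

# Small example with Euclidean dissimilarities and two candidate medoids.
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
print(assign_and_cost(D, [0, 5]))
```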
Comparison to Centroid-Based Methods

Medoid-based clustering differs fundamentally from centroid-based methods like k-means in the selection and representation of cluster centers. In k-means, centroids are computed as the arithmetic means of points within each cluster, which can produce synthetic points not present in the original dataset and assumes a Euclidean space for effective optimization. In contrast, medoids are actual data points selected to minimize the total dissimilarity to other points in the cluster, allowing compatibility with arbitrary distance metrics beyond Euclidean distances, such as Manhattan or custom dissimilarities suitable for categorical or non-numeric data.[14]

A primary advantage of medoid methods is their robustness to outliers and noise, as the medoid selection process avoids shifting cluster centers toward extreme values, unlike k-means, where outliers can disproportionately influence the mean. Additionally, medoids enhance interpretability since each cluster representative is a genuine data instance, facilitating easier analysis and explanation in domains like customer segmentation or bioinformatics.[2] However, this comes at the cost of higher computational complexity; for instance, the classic Partitioning Around Medoids (PAM) algorithm exhibits a time complexity of O(k(n-k)^2) per iteration, where n is the number of data points and k is the number of clusters, compared with the O(nki) total cost of k-means over i iterations, which scales linearly in n.[15][16]

Medoid clustering is particularly advantageous for discrete, non-convex, or non-Euclidean datasets where centroid computation is infeasible or misleading, such as text or graph data.[17] Conversely, centroid-based methods like k-means are preferred for large-scale continuous data in Euclidean spaces due to their scalability and simplicity.[14]

Algorithms
Partitioning Around Medoids (PAM)
Partitioning Around Medoids (PAM) is a foundational k-medoids clustering algorithm that partitions a dataset into k clusters by selecting actual data points as medoids, minimizing the total dissimilarity within clusters. Unlike centroid-based methods, PAM uses medoids to ensure robustness to outliers and applicability to non-Euclidean distance metrics. The algorithm iteratively refines the medoid selection to optimize a cost function, making it suitable for datasets where actual objects represent cluster centers.[18]

The PAM algorithm proceeds in two main phases: the build phase and the swap phase. In the build phase, initial medoids are selected greedily. The first medoid is chosen as the point that minimizes the sum of distances to all other points. Subsequent medoids are added one by one, selecting the non-medoid point that results in the smallest increase in the total cost when added as a new medoid. This phase requires O(n²k) time, where n is the number of data points and k is the number of clusters.[18]

The swap phase then optimizes the initial medoids by iteratively evaluating possible swaps. All k(n - k) possible swaps are evaluated to find the one that most reduces the total cost. If such a swap exists, it is performed, cluster assignments are updated, and the process repeats until no swap reduces the cost. Each iteration of the swap phase evaluates all possible swaps, computing the cost change for each affected point, leading to O(k(n - k)²) time complexity per iteration. The algorithm typically converges in a small number of iterations.[18]

The objective function minimized by PAM is the total cost, defined as the sum over all clusters C_i of the distances from each point x_j in C_i to its medoid m_i:
\text{Total Cost} = \sum_{i=1}^{k} \sum_{x_j \in C_i} d(x_j, m_i),
where d is the chosen dissimilarity measure. This cost measures the within-cluster dispersion and is robust since medoids are data points.[18]

The overall time complexity of PAM is dominated by the swap phase, resulting in O(k(n - k)²) per iteration, with the full algorithm requiring storage of an n × n dissimilarity matrix. A high-level pseudocode outline is as follows:

```
Algorithm PAM(D, k):
    Input:  Dissimilarity matrix D (n × n), number of clusters k
    Output: Medoids M = {m1, ..., mk}, cluster assignments

    // Build phase: initialize medoids greedily
    m1 = argmin_u sum_v D(u, v)
    For i = 2 to k:
        Add as mi the non-medoid u minimizing the total cost of M ∪ {u}

    // Assign each point to its nearest medoid
    For each point x:
        Assign x to the cluster of the closest m in M

    // Swap phase: iterate until no improvement
    Repeat:
        best_delta = 0; best_m = None; best_o = None
        For each medoid m in M:
            For each non-medoid o not in M:
                Δ = change in total cost if m and o are swapped
                If Δ < best_delta:
                    best_delta = Δ; best_m = m; best_o = o
        If best_delta >= 0:
            Break                       // no improving swap remains
        Else:
            Swap best_m and best_o and update assignments

    Return M and assignments
```

This pseudocode captures the core logic, with cost computations optimized by precomputing affected distances. As an illustrative example, consider applying PAM to the Iris dataset, which consists of 150 samples of sepal and petal measurements from three iris species, using k=3 clusters and Euclidean distance. The resulting clusters align closely with the true species, with silhouette analysis indicating strong separation (values up to approximately 0.8). This demonstrates PAM's ability to recover meaningful structure in low-dimensional biological data.[19]
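For readers who prefer runnable code, the following is a compact, unoptimized Python rendering of the build and swap phases described above. It operates on a precomputed dissimilarity matrix, recomputes the full cost for every candidate swap (so it is slower than a tuned PAM), and is intended purely as a sketch of the logic; the function name pam here is illustrative, not a library API.

```python
import numpy as np

def pam(D, k, max_iter=100):
    """Naive PAM sketch over a precomputed n x n dissimilarity matrix D."""
    n = D.shape[0]

    def total_cost(medoids):
        # Sum of each point's distance to its nearest medoid.
        return D[:, medoids].min(axis=1).sum()

    # Build phase: greedy initialization.
    medoids = [int(D.sum(axis=1).argmin())]  # point with smallest total distance
    while len(medoids) < k:
        candidates = [i for i in range(n) if i not in medoids]
        medoids.append(min(candidates, key=lambda c: total_cost(medoids + [c])))

    # Swap phase: perform the best cost-reducing swap until none exists.
    best = total_cost(medoids)
    for _ in range(max_iter):
        best_swap, best_cost = None, best
        for mi, m in enumerate(medoids):
            for o in range(n):
                if o in medoids:
                    continue
                trial = medoids[:mi] + [o] + medoids[mi + 1:]
                c = total_cost(trial)
                if c < best_cost:
                    best_swap, best_cost = (mi, o), c
        if best_swap is None:
            break  # local optimum reached
        medoids[best_swap[0]] = best_swap[1]
        best = best_cost

    labels = D[:, medoids].argmin(axis=1)
    return np.array(medoids), labels

# Example on random 2-D points with Euclidean dissimilarities.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
medoids, labels = pam(D, k=3)
print(medoids, np.bincount(labels))
```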
Scalable Algorithms (CLARA and CLARANS)
CLARA, or Clustering Large Applications, addresses the computational limitations of the PAM algorithm when applied to large datasets by employing a sampling strategy to approximate the medoid selection process. The algorithm draws multiple random subsets (samples) from the full dataset, typically 5 to 50 samples each of size around 40 plus twice the number of clusters k, applies the PAM procedure to each subset to identify medoids, and selects the clustering with the lowest total cost across all samples.[11] This sampling reduces the time complexity from the quadratic scaling of PAM to effectively linear in the dataset size n, making CLARA suitable for datasets where n exceeds several thousand points.[20]

Building on similar scalability goals, CLARANS (Clustering Large Applications based upon RANdomized Search) introduces a stochastic optimization approach that modifies the neighborhood search in PAM to handle very large datasets more efficiently. Instead of exhaustively evaluating all possible swaps in the neighborhood of current medoids, CLARANS randomly samples a fixed number of candidate swaps (controlled by the parameter maxneighbor, often 250) at each step and accepts those that improve the objective function. The process repeats for a specified number of local searches (numlocal, typically 10 to 100), retaining the best local optimum found.[8] A simplified pseudocode outline for the core neighbor evaluation in CLARANS is as follows:

```
Algorithm CLARANS(D, k, numlocal, maxneighbor):
    Input:  Dissimilarity matrix D (n × n), k, numlocal, maxneighbor
    Output: Best medoids best_S

    best_S = None; best_cost = infinity
    For i = 1 to numlocal:
        S = k randomly chosen initial medoids
        improved = True
        While improved:
            improved = False
            For j = 1 to maxneighbor:
                Randomly select a medoid m in S and a non-medoid o not in S
                new_S = (S - {m}) ∪ {o}
                If cost(new_S) < cost(S):
                    S = new_S
                    improved = True
                    Break               // move to the improved neighbor and resample
        If cost(S) < best_cost:
            best_S = S; best_cost = cost(S)
    Return best_S
```

This randomized sampling of the vast neighborhood space (which grows as O(k(n-k))) avoids full enumeration while exploring diverse solutions.[8] Both CLARA and CLARANS trade exact optimality for substantial speedups, yielding approximate clusterings that are often close to PAM's quality but with runtimes reduced by orders of magnitude on large datasets (n > 10,000). CLARA's sampling can miss global structures if samples are unrepresentative, while CLARANS may converge to suboptimal local minima due to its stochastic nature, though increasing the number of samples or searches mitigates this at additional computational cost. These methods are particularly effective for initial explorations or when exact solutions are unnecessary, enabling medoid-based clustering on datasets infeasible for PAM.[4][8]
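A CLARA-style wrapper is straightforward to sketch on top of any PAM solver. The example below is a schematic illustration that assumes scikit-learn-extra's KMedoids (discussed under Implementations) as the inner solver; the sample count and the 40 + 2k sample-size heuristic follow the description above, and the function name clara_like is purely illustrative.

```python
import numpy as np
from sklearn_extra.cluster import KMedoids  # assumed available (scikit-learn-extra)

def clara_like(X, k, n_samples=5, sample_size=None, random_state=0):
    """CLARA-style sketch: run PAM on random subsets, keep the best medoids."""
    rng = np.random.default_rng(random_state)
    n = len(X)
    sample_size = sample_size or min(n, 40 + 2 * k)  # heuristic from the text

    best_cost, best_medoids = np.inf, None
    for _ in range(n_samples):
        idx = rng.choice(n, size=sample_size, replace=False)
        km = KMedoids(n_clusters=k, method="pam", random_state=0).fit(X[idx])
        medoids = X[idx][km.medoid_indices_]
        # Evaluate on the FULL dataset: total distance to the nearest medoid.
        cost = np.linalg.norm(X[:, None, :] - medoids[None, :, :], axis=-1).min(axis=1).sum()
        if cost < best_cost:
            best_cost, best_medoids = cost, medoids
    return best_medoids, best_cost

X = np.random.default_rng(1).normal(size=(2000, 2))
medoids, cost = clara_like(X, k=3)
print(medoids.shape, round(cost, 2))
```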
Implementations
Software Libraries and Frameworks
Several software libraries and frameworks provide implementations of medoid-based clustering algorithms, enabling researchers and practitioners to apply methods like Partitioning Around Medoids (PAM) and its variants across various programming environments.[21][1]

In R, the cluster package offers core functionality for medoid clustering through the pam() function, which implements the PAM algorithm to identify k representative medoids from a dataset, minimizing the total dissimilarity within clusters. The package also includes the clara() function, an extension of PAM designed for large datasets by sampling subsets of data for efficient computation while maintaining robustness to outliers. These functions support various distance metrics, such as Euclidean and Manhattan, and are widely used for their reliability in statistical analysis workflows.
For Python, the scikit-learn-extra library extends the scikit-learn ecosystem with the KMedoids class, which performs k-medoids clustering using PAM or alternate optimization methods to select actual data points as cluster representatives, offering improved robustness over centroid-based approaches like k-means.[21] Additionally, the pyclustering library provides a dedicated kmedoids class that implements the PAM algorithm, allowing users to specify initial medoids and distance matrices for flexible clustering on diverse data types, including support for custom dissimilarity measures.[15]
MATLAB includes a built-in kmedoids function in the Statistics and Machine Learning Toolbox, which partitions data into k clusters by selecting medoids that minimize the sum of dissimilarities, with options for specifying distance metrics corresponding to those used by the pdist function for pairwise distances.[1] This integration facilitates custom implementations of PAM by combining pdist for distance calculations with the core clustering routine, particularly useful in high-dimensional or non-Euclidean spaces.[22]
In Julia, the Clustering.jl package features a kmedoids function that employs a k-means-style iterative algorithm to find medoids, emphasizing convergence efficiency while supporting arbitrary distance functions for clustering tasks.[23] For a more direct PAM implementation, the PAM.jl package offers specialized routines that align closely with the original algorithm, providing high-quality medoid selection comparable to R's cluster package, though it trades some runtime for this accuracy.[24]
Python Example Using scikit-learn-extra
The scikit-learn-extra library provides an implementation of the Partitioning Around Medoids (PAM) algorithm through its KMedoids class, which supports both Euclidean distances and precomputed distance matrices for synthetic or real datasets.[21] To demonstrate, consider generating synthetic 2D data using make_blobs from scikit-learn, applying PAM with k=3 clusters, and visualizing the results with matplotlib. The following code snippet computes clusters using the default Euclidean metric and plots the data points colored by cluster labels, highlighting the medoids.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn_extra.cluster import KMedoids

# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=0)

# Apply PAM clustering (method='pam' selects the PAM build/swap procedure)
kmedoids = KMedoids(n_clusters=3, method='pam', random_state=0)
kmedoids.fit(X)
labels = kmedoids.labels_
medoids = kmedoids.cluster_centers_

# Visualize clusters and highlight the medoids
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='viridis')
plt.scatter(medoids[:, 0], medoids[:, 1], c='red', s=200, marker='X', label='Medoids')
plt.title('PAM Clustering on Synthetic Data')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()

print("Medoid indices:", kmedoids.medoid_indices_)
print("Cluster labels shape:", labels.shape)
```

This example yields three distinct clusters, with medoid indices typically at [22, 75, 149] for the given random seed, illustrating how actual data points serve as cluster representatives.[25]
R Example Using the cluster Package
In R, the cluster package's clara() function implements the CLARA algorithm for scalable medoid-based clustering, suitable for larger datasets by sampling subsets.[20] For illustration, apply clara() to the built-in iris dataset (excluding the Species factor) with k=4 clusters, then extract medoids and compute silhouette scores using silhouette() to evaluate cluster quality. The code below performs the clustering and prints key outputs.

```r
library(cluster)

data(iris)
X <- iris[, 1:4]  # Numeric features only

# Apply CLARA clustering with k=4; rngR=TRUE uses R's RNG, so set.seed gives reproducibility
set.seed(123)
clara_result <- clara(X, k=4, sampsize=50, rngR=TRUE)

# Output medoid indices and cluster sizes
print("Medoid indices:")
print(clara_result$i.med)
print("Cluster sizes:")
print(table(clara_result$clustering))

# Compute silhouette scores
sil <- silhouette(clara_result$clustering, dist(X))
print("Average silhouette width:")
print(summary(sil)$avg.width)
plot(sil, main="Silhouette Plot for CLARA Clustering")
```

Executing this produces medoids at indices such as 1, 51, 101, and 135, with cluster sizes around 38, 37, 38, and 37, and an average silhouette width of approximately 0.58, indicating reasonable separation.[26]
Tips for Handling Custom Distances
For non-Euclidean distances, such as Manhattan or Gower for mixed data types, precompute the distance matrix and supply it to the algorithm to ensure compatibility. In Python's KMedoids, set metric='precomputed' and pass the matrix to fit(); in R, pam() accepts a dist()-computed or custom dissimilarity object directly (with diss = TRUE), whereas clara() samples from the raw observations and therefore expects the original data rather than a precomputed matrix. This approach allows flexibility for domain-specific metrics without altering the core medoid selection process.[21][27]
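As a brief illustration of this precomputed-matrix workflow in Python, the sketch below builds a Manhattan distance matrix with SciPy's cdist and passes it to the KMedoids class described earlier; with metric='precomputed', the fitted medoids are reported as row indices into the original data.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn_extra.cluster import KMedoids

# Precompute a Manhattan (cityblock) distance matrix and hand it to KMedoids.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
D = cdist(X, X, metric="cityblock")  # n x n dissimilarity matrix

km = KMedoids(n_clusters=3, metric="precomputed", method="pam", random_state=0)
labels = km.fit_predict(D)

# With a precomputed matrix, medoids are reported as row indices into X.
print("Medoid indices:", km.medoid_indices_)
print("Cluster sizes:", np.bincount(labels))
```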