
k-anonymity

k-anonymity is a formal privacy model used in data anonymization to ensure that, in a released dataset, each record is indistinguishable from at least k-1 other records based on a set of quasi-identifier attributes, thereby limiting the risk of re-identifying individuals through linkage with external information sources to a probability of at most 1/k. Introduced in the late 1990s by researchers Pierangela Samarati and Latanya Sweeney, this model addresses the challenge of sharing microdata—such as medical or census records—for research purposes while protecting individual privacy against linkage attacks that combine quasi-identifiers like ZIP code, birth date, and gender with publicly available databases. To achieve k-anonymity, datasets are typically transformed using techniques such as generalization (e.g., replacing exact ages with ranges like "30-40") or suppression (e.g., withholding specific values), ensuring that every combination of quasi-identifier values appears at least k times in the anonymized table. While effective against basic identity disclosure, k-anonymity is susceptible to attribute disclosure attacks, such as homogeneity (where all records in an equivalence class share the same sensitive value) and background knowledge exploitation, prompting extensions like l-diversity—which requires at least l distinct sensitive values within each group—and t-closeness—which ensures the distribution of sensitive attributes in a group is close to the overall dataset distribution within a threshold t. These developments have made k-anonymity a foundational concept in privacy-preserving data publishing, influencing standards in fields like healthcare and social science research despite ongoing debates about its utility-distortion trade-offs in high-dimensional data.

Introduction

Definition

k-anonymity is a model designed to protect individuals' identities in released tabular datasets by ensuring that no single record can be distinguished from at least k-1 others based on certain attributes. A release of data is said to provide k-anonymity if, for every record in the dataset, there are at least k-1 other records that are indistinguishable from it with respect to a set of quasi-identifiers. Formally, let R be a relation with attributes divided into identifiers, quasi-identifiers, and sensitive attributes; k-anonymity requires that each equivalence class in the QI-grouping of R has size at least k, where an equivalence class is a set of records sharing the same QI values. In mathematical terms, for a table T(A1, ..., An) and the set of quasi-identifiers QI_T associated with it, T satisfies k-anonymity if, for each quasi-identifier QI ∈ QI_T, every sequence of values in T[QI] appears with at least k occurrences in T[QI]. This concept was first formalized in 1998 by Pierangela Samarati and Latanya Sweeney. To illustrate, consider a simple medical dataset with quasi-identifiers age and ZIP code, and sensitive attribute disease. The following non-anonymous table violates 2-anonymity because all records are unique on the quasi-identifiers:
Age | ZIP Code | Disease
22 | 12345 | Flu
25 | 12346 | Hypertension
29 | 12345 | Flu
35 | 12347 | Cancer
In contrast, after anonymization via generalization, the table achieves 2-anonymity with one equivalence class of size 4:
Age | ZIP Code | Disease
20-40 | 1234* | Flu
20-40 | 1234* | Hypertension
20-40 | 1234* | Flu
20-40 | 1234* | Cancer
Here, the single combination of generalized age and ZIP code appears four times, preventing linkage to a specific individual.
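To make the definition concrete, the following sketch (not drawn from the original sources; it assumes a pandas DataFrame with the column names used in the tables above) checks whether a table satisfies k-anonymity by counting how often each quasi-identifier combination occurs:

```python
import pandas as pd

def satisfies_k_anonymity(table: pd.DataFrame, quasi_identifiers: list[str], k: int) -> bool:
    # Every combination of quasi-identifier values must occur at least k times.
    class_sizes = table.groupby(quasi_identifiers).size()
    return bool((class_sizes >= k).all())

# Non-anonymous table from above: every record is unique on (Age, ZIP Code).
raw = pd.DataFrame({
    "Age": [22, 25, 29, 35],
    "ZIP Code": ["12345", "12346", "12345", "12347"],
    "Disease": ["Flu", "Hypertension", "Flu", "Cancer"],
})
print(satisfies_k_anonymity(raw, ["Age", "ZIP Code"], k=2))  # False

# Generalized table: a single equivalence class of size 4.
generalized = raw.assign(**{"Age": "20-40", "ZIP Code": "1234*"})
print(satisfies_k_anonymity(generalized, ["Age", "ZIP Code"], k=2))  # True
```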

Historical Context

The underlying concept of avoiding unique records in datasets traces back to Tore Dalenius's paper on identifying anonymous census records. The concept of k-anonymity emerged in the late 1990s amid growing concerns over privacy breaches in released datasets, particularly in medical and census data, where re-identification risks were demonstrated through linkage attacks using auxiliary information. In 1997, researcher Latanya Sweeney highlighted these vulnerabilities by re-identifying Massachusetts Governor William Weld's anonymized hospital records using publicly available voter registration data, including date of birth, gender, and ZIP code, underscoring the inadequacy of simple de-identification techniques. This work built on earlier efforts and directly influenced the formal introduction of k-anonymity by Pierangela Samarati and Latanya Sweeney in their 1998 technical report, which proposed the model as a rigorous framework to ensure that individuals in released data could not be distinguished from at least k-1 others. Key milestones followed in the early 2000s, with Sweeney's seminal 2002 paper formalizing k-anonymity as a protection model, complete with deployment policies and algorithms for achieving it through techniques like generalization and suppression. The model's influence extended to regulatory frameworks, notably informing the U.S. Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule's Safe Harbor method, finalized in 2000 and effective in 2003, which requires removal of specific identifiers to mitigate re-identification risks in health data—a provision shaped by Sweeney's demonstrations and research. By 2025, k-anonymity has maintained its relevance as a foundational privacy model despite critiques regarding its limitations against certain attacks, continuing to underpin many anonymization practices and tools. Its integration into international standards, such as the ASEAN Guide on Data Anonymisation released in January 2025, reflects ongoing adoption for structured data protection across sectors, emphasizing techniques like k-anonymity to balance privacy and utility in regional data-sharing initiatives.

Core Concepts

Quasi-Identifiers

Quasi-identifiers (QIs) are attributes in a dataset that, individually, do not uniquely identify an individual but can be combined with external sources to re-identify them through linkage attacks. This notion was formalized in the context of privacy-preserving data publishing, where QIs represent non-sensitive attributes that enable probabilistic or deterministic matching to auxiliary information. A prominent demonstration of QI risks involved combining ZIP code, gender, and date of birth, which uniquely identified 87.1% of the U.S. population based on 1990 census data. In a practical re-identification study, Sweeney linked anonymized medical records from state employees to publicly available voter registration lists using these QIs, successfully identifying specific individuals, including high-profile cases like the state governor's records. QIs are distinguished from direct identifiers, such as names or Social Security numbers, which explicitly point to individuals and are typically removed before data release. In contrast, sensitive attributes—like medical diagnoses or financial status—are the protected information that could be disclosed upon re-identification but are not themselves used for linking and thus not generalized in anonymization processes. Identifying QIs requires domain expertise to assess potential linkage risks or empirical tests to evaluate uniqueness in target populations. In healthcare datasets, common QIs include age, geographic location (e.g., ZIP code), and gender, as these attributes frequently align with external records like census or voter registration data. Selection often involves adversarial modeling to simulate adversary access to auxiliary datasets.
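As an illustrative aid (not part of the original studies), a simple empirical test of how identifying a candidate set of quasi-identifiers is can be sketched as follows, assuming a pandas DataFrame and hypothetical column names; it measures the fraction of records that are unique on the chosen attributes:

```python
import pandas as pd

def uniqueness_rate(population: pd.DataFrame, candidate_qis: list[str]) -> float:
    # Fraction of records whose quasi-identifier combination is unique,
    # i.e., records that would fall into an equivalence class of size 1.
    class_size = population.groupby(candidate_qis)[candidate_qis[0]].transform("size")
    return float((class_size == 1).mean())

# Hypothetical usage on a population sample (file and columns are placeholders):
# people = pd.read_csv("population_sample.csv")
# print(uniqueness_rate(people, ["ZIP", "Gender", "BirthDate"]))  # higher risk
# print(uniqueness_rate(people, ["ZIP", "Gender"]))               # lower risk
```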

Equivalence Classes

In k-anonymity, an equivalence class is defined as a maximal set of records in a dataset that share identical values across all quasi-identifier attributes, making those records indistinguishable based on the quasi-identifiers alone. This grouping arises naturally from the quasi-identifiers, which are the attributes used to partition the data. A key property of equivalence classes in a k-anonymous table is that each class must contain at least k records, ensuring that no individual can be uniquely identified within the group and providing a level of indistinguishability from at least k-1 others. This minimum size requirement directly enforces the k-anonymity criterion, as any class with fewer than k records would violate the privacy protection by allowing potential re-identification. Equivalence classes are formed by partitioning the dataset according to the exact values of the quasi-identifiers following the anonymization process, where records with matching quasi-identifier tuples are clustered together. Smaller equivalence classes, particularly those with fewer than k records, signal elevated re-identification risks, as they reduce the ambiguity needed for protection. To illustrate, consider a simple dataset with quasi-identifiers age and ZIP code, and sensitive attribute disease, aiming for k=3 anonymity. In the original (pre-anonymization) table, every equivalence class has size 1, violating the requirement:
Age | ZIP | Disease
21 | 13053 | Flu
29 | 13068 | Heart
22 | 13055 | Flu
50 | 14853 | Cancer
47 | 14850 | Flu
45 | 14852 | Heart
After anonymization (e.g., generalizing values), the table forms classes of at least size 3:
Age | ZIP | Disease
<40 | 130** | Flu
<40 | 130** | Heart
<40 | 130** | Flu
>40 | 148** | Cancer
>40 | 148** | Flu
>40 | 148** | Heart
This example demonstrates how pre-anonymization singleton classes expose individuals, while post-anonymization grouping achieves compliance.
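The grouping described above can be computed directly; the following sketch (pandas assumed, values taken from the example tables) partitions the records by exact quasi-identifier values and reports the resulting class sizes before and after generalization:

```python
import pandas as pd

def equivalence_class_sizes(table: pd.DataFrame, quasi_identifiers: list[str]) -> pd.Series:
    # Partition the records by exact quasi-identifier values; each group is one class.
    return table.groupby(quasi_identifiers).size()

original = pd.DataFrame({
    "Age": [21, 29, 22, 50, 47, 45],
    "ZIP": ["13053", "13068", "13055", "14853", "14850", "14852"],
    "Disease": ["Flu", "Heart", "Flu", "Cancer", "Flu", "Heart"],
})
# Before generalization every class is a singleton, so k=3 is violated.
print(equivalence_class_sizes(original, ["Age", "ZIP"]).min())  # 1

generalized = original.assign(
    Age=["<40", "<40", "<40", ">40", ">40", ">40"],
    ZIP=["130**", "130**", "130**", "148**", "148**", "148**"],
)
# After generalization there are two classes of size 3 each.
print(equivalence_class_sizes(generalized, ["Age", "ZIP"]))
```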

Anonymization Techniques

Generalization and Suppression

Generalization is a core technique in k-anonymity that involves replacing specific values of quasi-identifiers with broader, less precise categories to ensure that at least k records share the same generalized values, thereby forming equivalence classes of size k or larger. For instance, an exact age of 25 might be generalized to the range 20-30, or a full ZIP code like 02138 could be generalized to 021**. This process reduces the distinguishability of individual records while preserving the overall structure of the dataset. Suppression complements generalization by removing or masking specific quasi-identifier values entirely, such as replacing them with an asterisk (*) or omitting the attribute from certain records, to eliminate outliers that prevent the formation of sufficiently large equivalence classes. In practice, suppression is applied judiciously to entire tuples or individual attribute values when generalization alone cannot achieve k-anonymity without excessive distortion. Both techniques rely on generalization hierarchies to systematically define levels of broadening for attributes. These hierarchies, often structured as trees or lattices, specify possible generalizations; for example, a date attribute might progress from a full birthdate (e.g., 09/27/1964) to year only (1964), while location attributes like ZIP codes follow a domain generalization hierarchy from full code (02139) to the first three digits (021) or even the state level. The primary trade-off of generalization and suppression is a balance between enhanced privacy through larger equivalence classes and reduced utility due to loss of precision and detail. In a dataset of 265 records, generalizing birthdates to year and ZIP codes to three digits, combined with suppressing about 10% of tuples (27 records), achieved k-anonymity while minimizing information loss; similar approaches for k=5 in healthcare simulations often require generalizing age into 5- or 10-year bands and postal codes to broader regions, though this can suppress up to 50% of records in small samples, highlighting the need for minimal application to retain analytical value.
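A minimal sketch of these two operations, assuming a single fixed hierarchy level per attribute and hypothetical column names Age and ZIP, might look as follows; records that still fall in undersized classes after generalization are suppressed:

```python
import pandas as pd

def generalize_age(age: int) -> str:
    # One level up a value hierarchy: exact age -> 10-year band.
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

def generalize_zip(zip_code: str, digits_kept: int = 3) -> str:
    # One level up a domain generalization hierarchy: mask trailing digits.
    return zip_code[:digits_kept] + "*" * (len(zip_code) - digits_kept)

def generalize_then_suppress(table: pd.DataFrame, k: int) -> pd.DataFrame:
    # Apply one fixed level of generalization, then suppress (drop) any tuple
    # that still falls in an equivalence class smaller than k.
    out = table.copy()
    out["Age"] = out["Age"].map(generalize_age)
    out["ZIP"] = out["ZIP"].map(generalize_zip)
    class_size = out.groupby(["Age", "ZIP"])["Age"].transform("size")
    return out[class_size >= k].reset_index(drop=True)
```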

Partitioning Methods

Partitioning methods for achieving k-anonymity focus on dividing a dataset into clusters, or equivalence classes, where each cluster contains at least k records that share identical quasi-identifier values after processing. This approach groups similar records together based on proximity metrics, such as Euclidean distance, before applying transformations like aggregation or generalization within each cluster to obscure individual identities. Unlike global transformations across the entire dataset, partitioning localizes changes, allowing for more tailored anonymization that balances privacy and data utility. A prominent partitioning technique is microaggregation, which partitions numerical microdata into clusters of size at least k and replaces original values with the cluster centroid (e.g., the average). Introduced by Domingo-Ferrer and Mateo-Sanz, microaggregation uses clustering algorithms, such as hierarchical methods or genetic heuristics, to form homogeneous groups that minimize within-cluster variance. This process ensures k-anonymity by making records in the same cluster indistinguishable, while the aggregation step preserves statistical properties of the data. Another key method adapts standard clustering algorithms, such as greedy k-member clustering, to enforce minimum cluster sizes of k for anonymization. Byun et al. developed a greedy clustering algorithm that iteratively assigns records to clusters by selecting the assignment that minimizes overall information loss, measured by intra-cluster distances for both numerical and categorical attributes. This adaptation treats k-anonymization as a constrained clustering problem, where the goal is to partition the data into equivalence classes without relying on predefined generalization hierarchies. These partitioning methods offer advantages in utility preservation for numerical data, as they limit modifications to local clusters rather than applying uniform generalization across all records, resulting in lower information loss compared to non-partitioned approaches. For example, microaggregation applied to age and income attributes in a dataset might sort records by age, divide them into clusters of size k=10, and replace values in each cluster with the cluster means of age and income, maintaining aggregate trends while achieving k-anonymity. Generalization can be applied within these partitions to handle categorical data if needed.
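For illustration, a minimal univariate microaggregation sketch (fixed-size consecutive groups rather than the optimized clustering of Domingo-Ferrer and Mateo-Sanz; pandas and NumPy assumed) could look like this:

```python
import numpy as np
import pandas as pd

def microaggregate(values: pd.Series, k: int) -> pd.Series:
    # Sort the values, cut them into consecutive groups of at least k records,
    # and replace every value with the mean of its group.
    order = np.argsort(values.to_numpy(), kind="stable")
    group_of_rank = np.arange(len(values)) // k
    if len(values) >= k and len(values) % k != 0:
        # Fold the short trailing group into the previous one so all groups have >= k members.
        last = group_of_rank.max()
        group_of_rank[group_of_rank == last] = last - 1
    groups = pd.Series(group_of_rank, index=values.index[order])
    return values.groupby(groups).transform("mean")

# Example: ages replaced by the means of clusters of size >= 3.
ages = pd.Series([21, 22, 29, 45, 47, 50, 52], name="Age")
print(microaggregate(ages, k=3))  # 24.0 for the first cluster, 48.5 for the second
```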

Algorithms and Implementations

Bottom-Up Approaches

Bottom-up approaches to k-anonymity begin with the original dataset, where quasi-identifiers are at their most specific levels, and iteratively broaden these values through generalization until every equivalence class contains at least k records. This method contrasts with more aggressive initial generalizations by focusing on targeted refinements, aiming to preserve data utility while satisfying the k-anonymity constraint. A key algorithm exemplifying this paradigm is Incognito, introduced by LeFevre, DeWitt, and Ramakrishnan in 2005, which leverages a lattice-based search over predefined generalization hierarchies to discover full-domain generalizations that achieve k-anonymity with minimal distortion to the data. Incognito systematically explores the space of possible generalizations by constructing a multi-dimensional generalization lattice for subsets of quasi-identifiers, where each node represents a level of generalization and edges indicate valid transitions to broader categories. The core steps involve a bottom-up breadth-first search starting from the most specific (bottom) nodes of the lattice: first, it partitions the dataset into initial equivalence classes based on exact quasi-identifier values; next, it identifies classes smaller than k and applies the minimal generalization—such as expanding an age range or truncating a ZIP code to a shorter prefix—that merges the class with adjacent ones, using the lattice structure to ensure validity; finally, it re-partitions the entire dataset after each merge and prunes invalid paths using properties like monotonicity (broader generalizations remain anonymous if narrower ones do). This process repeats across quasi-identifier subsets until a complete set of minimal k-anonymous solutions is generated. While the theoretical time complexity is exponential in the number of quasi-identifiers due to the search space size, practical implementations achieve O(n log n) performance through efficient pruning, caching of partitions, and breadth-first optimization on real-world datasets with up to thousands of records. To demonstrate, consider a small dataset of four records with quasi-identifiers age and ZIP code, requiring k=2 anonymity. The generalization hierarchy for age is exact value → decade range (e.g., 20s) → broader range (e.g., 20–40); for ZIP code, it is 5 digits → 3 digits → 1 digit.
Record | Age | ZIP Code
1 | 25 | 12345
2 | 27 | 12346
3 | 50 | 54321
4 | 52 | 54322
Initially, each record forms a singleton equivalence class of size 1, violating k=2. The algorithm identifies all classes as smallest and selects the first pair (records 1 and 2) based on proximity in the lattice. It applies minimal generalization: broaden age to [20s] (covering 25 and 27) and zip code to 123** (first three digits), merging records 1 and 2 into one class of size 2. The dataset is re-partitioned, leaving records 3 and 4 as singletons. Next, for the remaining smallest classes, generalize age to [50s] and zip code to 543** for records 3 and 4, merging them into a second class of size 2. The final anonymized table has two equivalence classes, each satisfying the requirement with limited distortion: one for [20s, 123**] and one for [50s, 543**]. This traversal ensures merges occur at the lowest lattice levels possible, preserving specificity where feasible.
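The following sketch illustrates the bottom-up idea on this example; it is a simplified greedy loop over hypothetical full-domain hierarchies rather than the full Incognito lattice search, with column names Age and Zip assumed:

```python
import pandas as pd

# Hypothetical generalization hierarchies; index 0 is the most specific level.
AGE_LEVELS = [
    lambda a: str(a),                               # exact age
    lambda a: f"{(a // 10) * 10}s",                 # decade, e.g. "20s"
    lambda a: "20-40" if a <= 40 else "41-60",      # broad range (example-specific)
]
ZIP_LEVELS = [
    lambda z: z,                                    # 5 digits
    lambda z: z[:3] + "**",                         # 3 digits
    lambda z: z[:1] + "****",                       # 1 digit
]

def bottom_up_anonymize(df: pd.DataFrame, k: int) -> pd.DataFrame:
    # Start at the most specific levels and greedily raise one attribute's
    # level at a time until every equivalence class reaches size k.
    age_level, zip_level = 0, 0
    while True:
        gen = df.assign(
            Age=df["Age"].map(AGE_LEVELS[age_level]),
            Zip=df["Zip"].map(ZIP_LEVELS[zip_level]),
        )
        if gen.groupby(["Age", "Zip"]).size().min() >= k:
            return gen
        if age_level < len(AGE_LEVELS) - 1 and age_level <= zip_level:
            age_level += 1
        elif zip_level < len(ZIP_LEVELS) - 1:
            zip_level += 1
        else:
            return gen  # no further generalization possible

records = pd.DataFrame({"Age": [25, 27, 50, 52],
                        "Zip": ["12345", "12346", "54321", "54322"]})
print(bottom_up_anonymize(records, k=2))  # ages as decades, ZIPs truncated to 123**/543**
```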

Top-Down Approaches

Top-down approaches to achieving k-anonymity begin with a dataset where all attributes are fully generalized to their highest level of abstraction or suppressed, forming a single equivalence class containing all records. From this maximally general state, the process iteratively specializes selected attribute values by narrowing their generalization—such as partitioning broad categories into subcategories—while ensuring that no equivalence class has fewer than k records, thereby maintaining anonymity. This contrasts with methods that build up from specific data, as top-down specialization descends from broad to more precise representations, often leveraging hierarchical taxonomies for attributes like education or location. Suppression serves as the initial generalization technique in these approaches, aligning with broader anonymization strategies. A seminal algorithm in this category is Top-Down Specialization (TDS), developed by Fung, Wang, and Yu in 2005, which uses heuristics to refine generalization hierarchies efficiently. Subsequent refinements enhance TDS with scoring metrics like InfoGain/AnonyLoss to prioritize specializations that balance utility and privacy, making it particularly suitable for classification tasks. These algorithms use heuristics to avoid exhaustive searches, enabling scalability on large datasets. The core steps of top-down specialization involve: (1) initializing the dataset with all QI attributes at their root generalization level, creating one large equivalence class; (2) selecting a candidate attribute for specialization based on potential utility gains; (3) evaluating possible refinements (e.g., descending one level in a taxonomy tree) and choosing the one that partitions classes without creating any group smaller than k; and (4) repeating iteratively until no further specializations are possible without breaching k-anonymity or until a utility threshold is met. For continuous attributes, intervals are split similarly. This process is efficient for large datasets, with implementations processing millions of records in seconds using compressed data structures like TIPS (Taxonomy Indexed PartitionS). Consider a census dataset with QI attributes including age, ZIP code, and education level, anonymized for k=5. Initially, all ages are generalized to "[0-100]", all ZIP codes to "*****", and education to "Any", forming a single equivalence class of all records. The algorithm might first specialize age to "[20-60]" for a subset of records, ensuring the resulting classes each have at least 5 records, then further refine to "[20-40]" and "[41-60]" only if both new classes meet the k threshold without isolating individuals. This progressive narrowing preserves analytical trends, such as correlations concentrated in certain age bands, while preventing re-identification.
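A simplified sketch of the top-down idea for a single numeric quasi-identifier (not the full TDS algorithm with InfoGain/AnonyLoss scoring; file and column names are hypothetical) is shown below; it starts from the fully generalized root interval and splits an interval only if both resulting classes retain at least k records:

```python
import pandas as pd

def _specialize(ages: pd.Series, low: int, high: int, k: int, labels: pd.Series) -> None:
    # Label all members of [low, high] with this interval, then try to split it
    # at the midpoint; keep the split only if both children hold >= k records.
    members = ages[(ages >= low) & (ages <= high)]
    labels.loc[members.index] = f"[{low}-{high}]"
    if high <= low:
        return
    mid = (low + high) // 2
    left, right = members[members <= mid], members[members > mid]
    if len(left) >= k and len(right) >= k:
        _specialize(ages, low, mid, k, labels)
        _specialize(ages, mid + 1, high, k, labels)

def top_down_specialize(ages: pd.Series, k: int, root: tuple[int, int] = (0, 100)) -> pd.Series:
    # Start from the fully generalized root interval and refine greedily.
    labels = pd.Series(index=ages.index, dtype=object)
    _specialize(ages, root[0], root[1], k, labels)
    return labels

# Hypothetical usage with k = 5 on a census-style sample:
# census = pd.read_csv("census_sample.csv")
# census["AgeBand"] = top_down_specialize(census["Age"], k=5)
```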

Limitations

Utility Trade-offs

In the context of k-anonymity, data utility refers to the extent to which the anonymized dataset retains its value for intended analytical purposes, such as statistical analysis or model training. Key measures of utility include the discernibility metric (DM), which sums the squares of equivalence class sizes (equivalently, assigning a penalty equal to the class size to each record therein), with higher values indicating greater information loss and lower utility, as larger classes reduce distinguishability. Another prominent measure is the normalized certainty penalty (NCP), which evaluates information loss from generalization and suppression across attributes, normalized to a scale of 0 (no loss) to 1 (complete loss), with emphasis on the precision of interval representations for quasi-identifiers. Achieving k-anonymity involves a fundamental trade-off: higher values of k enhance privacy by ensuring each record is indistinguishable from at least k-1 others, but this necessitates larger equivalence classes, leading to increased data distortion through broader generalizations or more suppressions, which diminishes utility. For example, on the benchmark Adult dataset from the UCI Machine Learning Repository, anonymization algorithms yield varying levels of information loss, with loss escalating as k increases due to more extensive recoding. This utility degradation poses significant challenges, particularly for irreversible impacts on advanced analytical tasks like machine learning model training, where generalized quasi-identifiers can degrade model accuracy by obscuring fine-grained patterns essential for learning. For instance, classifiers such as decision trees and random forests trained on k-anonymized datasets often achieve only 94% of the accuracy obtained on original data, highlighting the tension in privacy-sensitive applications. To address these trade-offs, modern anonymization tools employ optimization techniques that simultaneously minimize utility loss metrics like NCP or DM while enforcing k-anonymity constraints, often formulating the problem as a mathematical optimization to explore Pareto-optimal solutions.
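For illustration, the two utility measures can be sketched as follows (pandas assumed; the NCP variant shown handles a single numeric attribute given per-record interval bounds, which is a simplification of the full multi-attribute definition):

```python
import pandas as pd

def discernibility_metric(anonymized: pd.DataFrame, quasi_identifiers: list[str]) -> int:
    # Each record is penalized by the size of its equivalence class, so the
    # metric equals the sum of squared class sizes (suppressed records, if any,
    # would conventionally be penalized by the full table size instead).
    sizes = anonymized.groupby(quasi_identifiers).size()
    return int((sizes ** 2).sum())

def ncp_numeric(original: pd.Series, lower: pd.Series, upper: pd.Series) -> float:
    # Average width of each record's generalized interval, normalized by the
    # attribute's full range: 0 means no loss, 1 means every value was
    # generalized to the entire domain.
    domain = original.max() - original.min()
    widths = (upper - lower).clip(lower=0)
    return float((widths / domain).mean()) if domain else 0.0
```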

Disclosure Risks

k-Anonymity is designed to mitigate identity disclosure by ensuring that each record in a released dataset is indistinguishable from at least k-1 other records with respect to quasi-identifiers, thereby preventing an attacker from linking a specific individual to their record. However, this approach fails to address attribute disclosure, where sensitive attributes associated with an individual or group can still be revealed, even if identity is obscured. For instance, if all records in an equivalence class share the same sensitive value, an attacker can infer that value for any member of the group, compromising privacy without identifying the exact person. In practice, the effectiveness of k-anonymity heavily depends on the accurate identification of quasi-identifiers, which are attributes that, when combined with external data, could uniquely identify individuals. Errors in this identification, such as overlooking seemingly innocuous fields that become linking keys with auxiliary information, can lead to incomplete protection and heightened re-identification risks. Studies from the 2000s, including analyses of health and search data releases, demonstrated these vulnerabilities, where anonymized datasets adhering to k-anonymity principles were successfully re-identified through cross-referencing with publicly available records, underscoring the challenges in anticipating all potential linkages. Critiques of k-anonymity emphasize its susceptibility to evolving external data sources, as noted by Sweeney in her foundational work, where data holders cannot fully predict or control how future information might enable inferences. The 2025 ASEAN Guide on Data Anonymisation further highlights these risks in dynamic datasets, where ongoing updates or new external linkages necessitate periodic re-evaluation to maintain privacy protections. Unlike differential privacy, which offers rigorous probabilistic guarantees that limit the influence of any single individual's data on query outputs irrespective of auxiliary knowledge, k-anonymity provides only deterministic assurances based on group sizes, leaving it exposed to deterministic linkages and lacking mathematical bounds on disclosure probability.

Attacks

Background Knowledge Attacks

Background knowledge attacks on k-anonymity exploit an adversary's access to external information sources, such as voter registration lists or demographic databases, to link quasi-identifiers in the anonymized dataset and reduce the size of equivalence classes below the k threshold, thereby enabling re-identification of individuals. In this mechanism, the attacker combines the released k-anonymous data with prior knowledge to infer unique identities, as the anonymization process alone does not account for such supplementary data that correlates with the quasi-identifiers. A seminal example is Latanya Sweeney's 1997 demonstration, where she linked anonymized hospital discharge records from the Massachusetts Group Insurance Commission—covering 135,000 state employees and dependents—with publicly available voter registration lists purchased for $20. By matching on quasi-identifiers like ZIP code, birth date, and gender, Sweeney re-identified the medical records of then-Governor William Weld, who was the only individual matching his demographic profile within his 5-digit ZIP code among a group of six sharing his birth date. This attack highlighted how even seemingly anonymous health data could be de-anonymized through linkage to voter rolls. Such attacks require the availability of correlated external datasets that overlap significantly with the quasi-identifiers in the released data; for instance, Sweeney's analysis of U.S. Census data showed that 87% of the population (216 million out of 248 million individuals) had unique combinations of 5-digit ZIP code, gender, and date of birth, making re-identification highly feasible when these attributes are present. Success rates can approach 90% in scenarios with strong overlaps between external and internal quasi-identifiers, underscoring the vulnerability of k-anonymity to informed adversaries. Defenses against background knowledge attacks include restricting the release of quasi-identifiers to minimize linkage opportunities, though k-anonymity itself does not fully mitigate these risks since it assumes no external knowledge beyond the dataset. Enhanced models like l-diversity have been proposed to address this by ensuring diversity in sensitive attributes within equivalence classes, but basic k-anonymity relies on careful selection of released attributes to limit exposure.
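The linkage mechanism can be sketched as a join between a public, identified dataset and the released table on shared quasi-identifiers (pandas assumed; table and column names are hypothetical), keeping only unambiguous one-to-one matches:

```python
import pandas as pd

def linkage_attack(released: pd.DataFrame, public: pd.DataFrame,
                   quasi_identifiers: list[str]) -> pd.DataFrame:
    # Join an identified public dataset with the released table on shared
    # quasi-identifiers and keep only combinations that match exactly one
    # released record and one named individual (unambiguous re-identification).
    joined = public.merge(released, on=quasi_identifiers, how="inner")
    matches = joined.groupby(quasi_identifiers)[quasi_identifiers[0]].transform("size")
    return joined[matches == 1]

# Hypothetical usage: 'voters' carries names plus ZIP, birth date, and gender;
# 'discharges' is the released table sharing those quasi-identifiers.
# reidentified = linkage_attack(discharges, voters, ["ZIP", "BirthDate", "Gender"])
```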

Homogeneity Attacks

Homogeneity attacks on k-anonymity exploit situations where all records within an equivalence class share the same value for the sensitive attribute, allowing an adversary to infer that value for any individual in the class with certainty. This occurs because k-anonymity generalizes quasi-identifier attributes to ensure at least k records are indistinguishable but does not require diversity in the sensitive attribute, potentially creating homogeneous groups. For instance, in a medical dataset anonymized to k=4, if all four records in a class—corresponding to individuals aged 30–40 from ZIP codes 13***—list "heart disease" as the diagnosis, an attacker knowing a target's quasi-identifiers can confidently deduce the diagnosis. A classic example of a homogeneity attack, as described in the introduction of the l-diversity paper, involves antagonistic neighbors Alice and Bob. Bob is taken to the hospital by ambulance, and Alice, knowing his age (31), gender (male), and ZIP code (13053), consults a published 4-anonymous table of inpatient records. All four records matching Bob's quasi-identifiers indicate the sensitive attribute "cancer," allowing Alice to infer Bob's diagnosis with certainty. The inference confidence in a homogeneity attack reaches 1 if the equivalence class is fully uniform in the sensitive attribute, regardless of k; as class size grows while remaining homogeneous, the attack remains just as reliable, since every possible match yields the same sensitive value. Thus, while k-anonymity protects against identity disclosure by blending individuals into groups of size k, it fails to prevent attribute disclosure when sensitive values lack diversity within those groups.
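A data publisher can screen for this vulnerability before release; the following sketch (pandas assumed, column names are placeholders) flags equivalence classes whose sensitive attribute takes only a single value:

```python
import pandas as pd

def homogeneous_classes(anonymized: pd.DataFrame, quasi_identifiers: list[str],
                        sensitive: str) -> pd.DataFrame:
    # An equivalence class is vulnerable to a homogeneity attack, regardless of
    # k, whenever its sensitive attribute takes only a single distinct value.
    summary = anonymized.groupby(quasi_identifiers)[sensitive].agg(
        class_size="size", distinct_sensitive="nunique")
    return summary[summary["distinct_sensitive"] == 1]
```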

Enhancements

l-Diversity

l-Diversity is an extension to k-anonymity that addresses its vulnerability to attribute disclosure by ensuring diversity in the sensitive attribute values within each equivalence class. Specifically, a k-anonymous table is said to satisfy l-diversity if, for every equivalence class (or q*-block), the sensitive attribute contains at least l well-represented values. This principle requires that the values are not only present but also sufficiently distributed to prevent probabilistic inference of sensitive information. The model operationalizes the notion of "well-represented" values through variants such as entropy l-diversity and recursive (c, l)-diversity. Entropy l-diversity requires that the entropy of the sensitive attribute in each equivalence class is at least log(l), promoting a balanced distribution. Recursive (c, l)-diversity ensures that the most frequent sensitive value occurs fewer than c times as often as the combined frequencies of the values from the l-th most frequent onward, allowing controlled skew with parameter c (often set to 2 or 3). l-Diversity improves upon k-anonymity by mitigating homogeneity attacks, where an adversary infers the sensitive value due to uniformity within a class. For example, consider a medical dataset anonymized to 4-anonymity where all records in an equivalence class share the sensitive attribute "cancer" as the disease; an attacker could confidently infer this for any individual in the class. Applying l-diversity (e.g., with l=3) diversifies the class to include at least three well-represented values such as "heart disease," "viral infection," and "cancer," forcing the attacker to consider multiple possibilities and reducing inference risk. This enhancement was introduced by Machanavajjhala et al. in 2006. Despite these advances, l-diversity remains susceptible to background knowledge attacks, where external information about individuals can still enable inference even in diverse classes.
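As an illustration, the distinct and entropy variants can be checked as follows (pandas and NumPy assumed; column names are placeholders):

```python
import numpy as np
import pandas as pd

def satisfies_distinct_l_diversity(table: pd.DataFrame, quasi_identifiers: list[str],
                                   sensitive: str, l: int) -> bool:
    # Every equivalence class must contain at least l different sensitive values.
    distinct = table.groupby(quasi_identifiers)[sensitive].nunique()
    return bool((distinct >= l).all())

def satisfies_entropy_l_diversity(table: pd.DataFrame, quasi_identifiers: list[str],
                                  sensitive: str, l: int) -> bool:
    # The entropy of the sensitive attribute in every class must be >= log(l).
    def entropy(values: pd.Series) -> float:
        p = values.value_counts(normalize=True).to_numpy()
        return float(-(p * np.log(p)).sum())
    entropies = table.groupby(quasi_identifiers)[sensitive].apply(entropy)
    return bool((entropies >= np.log(l)).all())
```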

t-Closeness

t-Closeness is a privacy model that extends k-anonymity by ensuring that the distribution of sensitive attributes within each equivalence class closely approximates the global distribution of those attributes in the overall table, thereby mitigating attribute disclosure risks that arise from distributional differences. Specifically, in a k-anonymous table, t-closeness requires that for every equivalence class, the distance between the sensitive attribute's distribution in that class and the overall distribution does not exceed a predefined threshold t, using a metric such as the Earth Mover's Distance (EMD) to quantify the divergence. This approach builds on l-diversity by addressing its shortcomings in handling skewed data and background knowledge attacks, focusing on probabilistic inference rather than just value diversity. Mathematically, an equivalence class Q satisfies t-closeness if d(P_Q, P_R) \leq t, where P_Q denotes the probability distribution of the sensitive attribute in Q, P_R is the distribution across the entire table, and d is the EMD, defined as the minimum cost to transform P_Q into P_R under a ground distance metric (e.g., absolute difference for discrete values). The EMD is computed as d(P_Q, P_R) = \min \sum_{i,j} d_{ij} f_{ij}, subject to flow constraints that preserve the total mass of each distribution, where f_{ij} represents the flow from element i in P_Q to element j in P_R, and d_{ij} is the ground distance between them. A dataset achieves t-closeness if all its equivalence classes meet this criterion, with t typically set small (e.g., 0.2) to enforce tight distributional similarity. The primary advantages of t-closeness lie in its ability to counter attribute disclosure attacks that exploit skewness in sensitive attribute distributions or an attacker's background knowledge about global demographics. For instance, in a dataset anonymized by age and ZIP code with income as the sensitive attribute, l-diversity might allow a class dominated by low-income individuals (e.g., all below $50,000), enabling inference of low income in that demographic; t-closeness with t=0.2 prevents this by requiring the class's income distribution to closely match the table-wide distribution (e.g., avoiding overrepresentation of low values), thus blocking biased probabilistic inferences. This makes t-closeness particularly effective against similarity attacks, where l-diversity fails due to semantic closeness of "diverse" values (e.g., multiple negative outcomes). Despite its strengths, t-closeness introduces limitations, including increased computational overhead from calculating EMD for distributions, which scales poorly with large or high-cardinality sensitive attributes. It also does not inherently protect against identity disclosure, relying on underlying k-anonymity for that purpose, and its precise relationship to information-theoretic measures of disclosure remains unclear. t-Closeness was proposed by Ninghui Li, Tiancheng Li, and Suresh Venkatasubramanian in 2007.
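For an ordered sensitive attribute with equally spaced categories, the EMD reduces to a normalized sum of absolute cumulative differences, which makes a t-closeness check straightforward to sketch (pandas and NumPy assumed; this is a simplification covering only the ordered-distance case):

```python
import numpy as np
import pandas as pd

def emd_ordered(p: np.ndarray, q: np.ndarray) -> float:
    # EMD between two distributions over the same ordered domain of m values,
    # with adjacent values at ground distance 1/(m-1); this reduces to the
    # normalized sum of absolute cumulative differences.
    m = len(p)
    if m < 2:
        return 0.0
    return float(np.abs(np.cumsum(p - q)).sum() / (m - 1))

def satisfies_t_closeness(table: pd.DataFrame, quasi_identifiers: list[str],
                          sensitive: str, t: float) -> bool:
    # Every equivalence class's sensitive-attribute distribution must lie
    # within EMD t of the table-wide distribution.
    domain = np.sort(table[sensitive].unique())
    overall = (table[sensitive].value_counts(normalize=True)
               .reindex(domain, fill_value=0).to_numpy())
    for _, group in table.groupby(quasi_identifiers):
        local = (group[sensitive].value_counts(normalize=True)
                 .reindex(domain, fill_value=0).to_numpy())
        if emd_ordered(local, overall) > t:
            return False
    return True
```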