Cluster sampling is a probability sampling method in statistics in which the population is divided into mutually exclusive and collectively exhaustive subgroups known as clusters, typically based on geographic or administrative boundaries. A random sample of these clusters is selected for inclusion in the study, and all individuals, or a subsample, within the chosen clusters are then surveyed.[1] This approach is particularly useful when the population is large and dispersed, making it impractical to create a complete sampling frame of individual units.[1] Developed as part of early 20th-century survey sampling theory by statisticians such as Jerzy Neyman in 1934, cluster sampling balances efficiency with representativeness by leveraging natural groupings in the population.[2]
The procedure for cluster sampling generally involves two stages: first, randomly selecting clusters from a list of all possible clusters using simple random sampling or similar techniques; second, either including every unit within the selected clusters (one-stage sampling) or randomly subsampling units from those clusters (two-stage or multistage sampling).[3] For example, in educational research, a population of students might be grouped into schools as clusters, with a random selection of schools followed by surveying all fifth-grade students within them.[4] This method contrasts with simple random sampling, where every individual has an equal chance of selection, because cluster sampling selects whole groups to reduce logistical demands.[5]
Key advantages of cluster sampling include significant cost and time savings, especially for geographically spread populations where traveling to individual units would be prohibitive, and it facilitates data collection in settings such as public health surveys or national censuses without needing a full population list.[4] However, it often introduces higher sampling error due to intra-cluster homogeneity (individuals within a cluster tend to be more similar than those across clusters), potentially requiring larger overall sample sizes to achieve the desired precision compared to other methods such as stratified sampling.[3] Despite these drawbacks, cluster sampling remains a cornerstone of applied statistics in fields such as clinical trials and the social sciences, where practical constraints outweigh the need for minimal variance.[6]
Fundamentals
Definition and Principles
Cluster sampling is a probability sampling technique in which the population is divided into mutually exclusive and collectively exhaustive groups known as clusters, typically based on naturally occurring aggregations such as geographic areas, schools, or organizations. A random sample of these clusters is then selected, and either all elements within the chosen clusters or a subsample of them are included in the survey. This method is particularly useful when a complete list (sampling frame) of individual population elements is unavailable or prohibitively expensive to compile, as the frame can instead be constructed at the cluster level.[7][8]
The core principles of cluster sampling revolve around ensuring representativeness and efficiency. Clusters are ideally formed to be internally heterogeneous, capturing the full diversity of the population within each group, while being homogeneous relative to one another; this configuration minimizes the between-cluster variance and enhances the precision of estimates compared to simple random sampling at similar cost. Random selection of clusters via methods such as simple random sampling guarantees that every cluster has an equal probability of inclusion, thereby producing unbiased estimates of population parameters. The approach trades some precision for substantial reductions in logistical costs, especially in large or dispersed populations where travel or data collection expenses increase with the spread of sampled units.[7][8]
The basic steps in implementing single-stage cluster sampling are as follows: first, define the target population and partition it into clusters that are practical and relevant to the study context; second, randomly select a predetermined number of clusters from the total using a probability-based method; third, enumerate and include all elements within the selected clusters for measurement or observation. For example, in estimating average income in a metropolitan area, the population of households could be grouped into city blocks as clusters, with a random sample of 20 blocks chosen and all households in those blocks surveyed.[7][8]
Under the assumption of equal-sized clusters, the unbiased estimator for the population mean in single-stage cluster sampling is given by
\bar{y}_{\text{clus}} = \frac{1}{n} \sum_{i=1}^{n} \bar{y}_i,
where n denotes the number of selected clusters and \bar{y}_i is the sample mean within the i-th cluster. This estimator is design-unbiased, with its variance depending on the cluster-to-cluster variability.[7]
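These steps can be illustrated with a short simulation. The sketch below, written in Python with NumPy, generates a hypothetical population of equal-sized clusters (all cluster counts, sizes, and values are invented for illustration) and computes the single-stage estimator \bar{y}_{\text{clus}} described above:
```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: 50 equal-sized clusters (e.g., city blocks) of 40 households
# each, with a synthetic income-like variable; all numbers are illustrative.
N, m = 50, 40
cluster_effects = rng.normal(0, 5, size=N)              # between-cluster variation
population = rng.normal(60, 10, size=(N, m)) + cluster_effects[:, None]

# Single-stage cluster sample: randomly select n clusters, observe every element in them.
n = 10
selected = rng.choice(N, size=n, replace=False)

# Equal-size estimator: the mean of the selected clusters' sample means.
cluster_means = population[selected].mean(axis=1)
y_bar_clus = cluster_means.mean()

print(f"Estimated population mean: {y_bar_clus:.2f}")
print(f"True population mean:      {population.mean():.2f}")
```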
Comparison to Other Sampling Methods
Cluster sampling differs from simple random sampling (SRS) primarily in how it groups elements to minimize logistical costs, particularly for geographically dispersed populations. In SRS, every individual in the population has an equal probability of selection independently, which typically results in lower sampling variance but requires a complete sampling frame and can incur high travel or data collection expenses.[1] By contrast, cluster sampling divides the population into clusters, naturally occurring groups such as schools or neighborhoods, and randomly selects entire clusters for inclusion, reducing costs by concentrating data collection but potentially increasing variance due to intra-cluster correlation, whereby elements within a cluster tend to be more similar than elements across clusters.[1]
Compared to stratified sampling, cluster sampling treats clusters as the primary units of selection without ensuring proportional representation from predefined homogeneous subgroups (strata). Stratified sampling explicitly divides the population into mutually exclusive strata based on key characteristics (e.g., age or income) and draws random subsamples from each in proportion to their size, which enhances precision by reducing overall variance.[1] In cluster sampling, however, whole clusters are chosen randomly regardless of their composition, which simplifies administration but may yield less precise estimates when clusters are internally homogeneous, that is, when elements within a cluster resemble one another.[1] Similarly, unlike systematic sampling, which selects elements at fixed intervals from an ordered list (requiring a full sampling frame) to achieve efficiency, cluster sampling requires upfront cluster formation and avoids potential biases from periodicity in the list but introduces challenges in defining and sampling clusters.[1]
The efficiency of cluster sampling relative to SRS is quantified by the design effect (DEFF), the ratio of the variance under cluster sampling to that under SRS for the same sample size. The DEFF is given by
\text{DEFF} = 1 + (m - 1)\rho,
where m is the average cluster size and \rho is the intra-cluster correlation coefficient, reflecting similarity within clusters.[9] A DEFF greater than 1 indicates reduced efficiency (higher variance) compared to SRS, requiring the sample size to be inflated by a factor of DEFF to achieve equivalent precision; conversely, if \rho is low or clusters are small, DEFF approaches 1, making cluster sampling comparably efficient while still cost-effective.[9]
Cluster sampling emerged in the 1930s and 1940s as a practical solution for large-scale surveys, particularly in agricultural and population contexts where full enumeration was infeasible. Pioneering work by Jerzy Neyman in his 1934 paper on representative methods laid foundational principles for probability-based cluster designs, influencing their adoption in early survey research to balance cost and accuracy.
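As a small planning illustration, the following Python sketch applies the DEFF formula to inflate a baseline SRS sample size; the cluster size, intra-cluster correlation, and baseline sample size are assumed values rather than figures from any cited survey:
```python
def design_effect(avg_cluster_size: float, icc: float) -> float:
    """DEFF = 1 + (m - 1) * rho for single-stage cluster sampling."""
    return 1.0 + (avg_cluster_size - 1.0) * icc

# Illustrative values (assumptions, not from a cited survey): clusters of 20 elements
# with a modest intra-cluster correlation.
m, rho = 20, 0.05
deff = design_effect(m, rho)

# A cluster sample must be larger by a factor of DEFF to match the precision of SRS.
n_srs = 400                     # sample size that would suffice under SRS (assumed)
n_cluster = n_srs * deff
print(f"DEFF = {deff:.2f}; required cluster-sample size ~ {n_cluster:.0f}")
```
With m = 20 and \rho = 0.05, DEFF is about 1.95, so the cluster sample would need to be roughly twice the SRS size to reach the same precision.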
Implementation Methods
Single-Stage Cluster Sampling
Single-stage cluster sampling involves dividing the population into N naturally occurring or artificially formed clusters, randomly selecting n of these clusters using simple random sampling without replacement, and then including every element within the selected clusters in the sample. This approach is particularly useful when the population is geographically dispersed or organized into groups, allowing for complete enumeration within chosen clusters to estimate population parameters efficiently.[7]
The total population size M is given by M = N \bar{m}, where \bar{m} denotes the average cluster size across all N clusters. Each cluster i has size m_i, and the total number of elements in the selected clusters is \sum_{i=1}^n m_i. This setup ensures that every population element has an equal probability of inclusion, n/N, since entire clusters are selected or excluded together.[7][10]
For estimating the population mean \bar{Y}, the estimator in single-stage cluster sampling is the sample mean \bar{y} = \frac{\sum_{i=1}^n m_i \bar{y}_i}{\sum_{i=1}^n m_i}, where \bar{y}_i is the mean of the m_i elements in the i-th selected cluster. This is a ratio estimator that is approximately unbiased, with small bias when cluster sizes vary. When cluster sizes are equal (i.e., m_i = \bar{m} for all i), this simplifies to the average of the cluster means: \bar{y} = \frac{1}{n} \sum_{i=1}^n \bar{y}_i. This estimator is unbiased under equal cluster sizes because inclusion probabilities are constant across elements, making the overall sample mean representative of the population mean.[11][7][10]
The variance of this estimator can be approximated as v(\bar{y}) \approx \frac{1 - f}{n} s_b^2, where f = n/N is the sampling fraction and s_b^2 is the between-cluster variance, typically estimated as s_b^2 = \frac{1}{n-1} \sum_{i=1}^n (\bar{y}_i - \bar{y})^2 for equal-sized clusters or adjusted for cluster totals in unequal cases. For equal cluster sizes, the exact variance is V(\bar{y}) = \frac{N-n}{N n} S_b^2, where S_b^2 = \frac{1}{N-1} \sum_{i=1}^N (\bar{Y}_i - \bar{Y})^2 and \bar{Y}_i is the i-th population cluster mean; the estimated variance substitutes sample quantities for population ones. This variance reflects the design effect due to clustering and is often larger than under simple random sampling.[11][7]
A practical example is estimating the average test scores of students in a large school district, where schools serve as clusters. Suppose there are N = 100 schools, and n = 10 are randomly selected; all students in those 10 schools are tested. The sample mean \bar{y} from the combined test scores provides an unbiased estimate of the district-wide average under equal school sizes, with variance depending on the between-school variability in scores.[12]
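A minimal sketch of this school-district example, assuming a synthetic population of 100 equal-sized schools (all score distributions are invented), computes the estimator and the estimated variance \frac{1-f}{n} s_b^2:
```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical district of N = 100 equal-sized schools (clusters); scores vary both
# between and within schools. All figures are synthetic.
N, m = 100, 50
school_effects = rng.normal(0, 8, size=N)
scores = rng.normal(70, 12, size=(N, m)) + school_effects[:, None]

# Select n = 10 schools and test every student in them (single-stage cluster sample).
n = 10
selected = rng.choice(N, size=n, replace=False)
ybar_i = scores[selected].mean(axis=1)          # per-school sample means

y_bar = ybar_i.mean()                           # estimator for equal cluster sizes
f = n / N                                       # sampling fraction
s_b2 = ybar_i.var(ddof=1)                       # between-cluster variance estimate
v_hat = (1 - f) / n * s_b2                      # estimated variance of y_bar

print(f"Estimate {y_bar:.2f}, SE {np.sqrt(v_hat):.2f}, true mean {scores.mean():.2f}")
```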
Multi-Stage Cluster Sampling
Multi-stage cluster sampling extends the cluster sampling framework by incorporating multiple levels of sampling within selected primary units, enabling efficient data collection from vast populations where full enumeration of elements within clusters would be impractical. The procedure begins with the random selection of primary sampling units (PSUs), or clusters, from the total population of N clusters using simple random sampling without replacement. Within each selected PSU, secondary sampling units (SSUs), such as individual elements or sub-clusters, are then subsampled randomly, typically via simple random sampling. This process can continue to additional stages, such as tertiary units within SSUs, if the population structure warrants further subdivision, thereby balancing cost and precision in surveys covering large geographic areas.[3]
A standard example is two-stage cluster sampling, where n PSUs are chosen from N, and then m SSUs are selected from the M elements in each chosen PSU. This design assumes equal cluster sizes for simplicity, though adaptations exist for unequal sizes; the subsampling at the second stage reduces fieldwork demands while capturing intra-cluster variability. The method gained prominence in the 1940s through U.S. Census Bureau applications for national surveys, where it facilitated scalable data collection amid resource constraints, building on foundational theory by Hansen and Hurwitz for stratified two-stage designs.[3][13]
For estimation in two-stage cluster sampling, the unbiased estimator of the population mean is given by
\bar{y}_{st} = \frac{1}{n} \sum_{i=1}^{n} \bar{y}_i,
where \bar{y}_i is the sample mean within the i-th selected cluster. This estimator averages the subsample means across clusters, inherently adjusting for the second-stage subsampling by relying on the observed values rather than full cluster totals.[3]
The variance of \bar{y}_{st} decomposes into contributions from both sampling stages, reflecting between-cluster heterogeneity and within-cluster subsampling error. Under the assumption of equal cluster sizes and simple random sampling at each stage, the approximate variance is
v(\bar{y}_{st}) = \frac{1 - f_1}{n} s_1^2 + \frac{1 - f_2}{n m} s_2^2,
where f_1 = n/N is the first-stage sampling fraction, f_2 = m/M is the second-stage sampling fraction, s_1^2 denotes the variance among cluster means, and s_2^2 is the average variance within clusters. This formula highlights how increasing n reduces the between-cluster contribution, while larger m mitigates the within-cluster contribution, guiding sample allocation decisions.[3]
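The two-stage estimator and its variance decomposition can be reproduced in a brief simulation. The Python sketch below assumes equal cluster sizes and simple random sampling at both stages; the population and the values of N, M, n, and m are illustrative only:
```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative two-stage design with equal cluster sizes: N clusters of M elements each;
# n clusters are sampled, then m elements are subsampled from each selected cluster.
N, M = 80, 200
n, m = 8, 25
cluster_effects = rng.normal(0, 4, size=N)
pop = rng.normal(50, 10, size=(N, M)) + cluster_effects[:, None]

clusters = rng.choice(N, size=n, replace=False)
subsamples = np.array([rng.choice(pop[c], size=m, replace=False) for c in clusters])

ybar_i = subsamples.mean(axis=1)                # subsample mean in each selected cluster
ybar = ybar_i.mean()                            # two-stage estimator of the population mean

f1, f2 = n / N, m / M
s1_sq = ybar_i.var(ddof=1)                      # variability among cluster means
s2_sq = subsamples.var(axis=1, ddof=1).mean()   # average within-cluster variability
v_hat = (1 - f1) / n * s1_sq + (1 - f2) / (n * m) * s2_sq

print(f"Estimate {ybar:.2f} (SE {np.sqrt(v_hat):.2f}); true mean {pop.mean():.2f}")
```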
Handling Variations
Unequal Cluster Sizes
In cluster sampling, unequal cluster sizes pose a significant challenge because larger clusters contribute disproportionately more to population totals, yet equal-probability selection assigns the same selection chance to each cluster regardless of size. When a fixed number of elements is then taken from each selected cluster, elements in larger clusters have lower inclusion probabilities than those in smaller clusters, and unadjusted estimators such as the simple expansion can yield biased estimates of the population total.[10]
Designs with equal probability per cluster are typically non-self-weighting, meaning sampled elements do not have equal representation in the population, so post-sampling weight adjustments are required to achieve unbiased estimates. In contrast, self-weighting designs ensure equal inclusion probabilities for all elements, simplifying analysis by allowing unweighted averages to yield unbiased population means. To address unequal sizes, probability proportional to size (PPS) selection is employed, where the probability of selecting cluster i is \pi_i = n \frac{M_i}{M}, with n the number of clusters selected, M_i the size of cluster i (e.g., number of elements), and M the total population size measure; this equalizes element-level inclusion probabilities across clusters when an equal number of elements is subsampled within each selected cluster.[14]
For estimation under PPS, the Horvitz-Thompson estimator adjusts for unequal probabilities by weighting observations inversely to their inclusion probabilities: \hat{Y} = \sum_{i \in s} \frac{y_i}{\pi_i}, where s is the sample of selected clusters and y_i is the total in cluster i; this provides an unbiased estimate of the population total Y. An illustrative example is urban household surveys, where city blocks serve as clusters varying in household counts; blocks are selected via PPS based on estimated households, and all or a fixed subsample of households within selected blocks is surveyed, with weights applied during analysis to reflect the design.
Modern software facilitates implementation of PPS for unequal cluster sizes; for example, the survey package in R supports design specification for PPS sampling (with or without replacement), Horvitz-Thompson estimation, and variance computation via linearization or replication methods, enabling straightforward analysis of complex cluster designs.[15]
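For illustration, the sketch below selects clusters by systematic PPS (one of several selection procedures; see the later section on probability proportional to size sampling) from a synthetic frame of city blocks and applies the Horvitz-Thompson estimator of the total; the frame, block sizes, and block totals are all assumed values:
```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical frame of city blocks (clusters) with known household counts M_i and
# block totals y_i of the study variable; all values are synthetic.
M_i = rng.integers(20, 120, size=60)                 # block sizes
y_i = M_i * rng.normal(2.5, 0.3, size=60)            # block totals, roughly size-proportional

n = 10
pi = n * M_i / M_i.sum()                             # first-stage inclusion probabilities (< 1 here)

# Systematic PPS selection without replacement: walk a random start through the
# cumulative size scale in steps of total_size / n.
cum = np.cumsum(M_i)
step = M_i.sum() / n
points = rng.uniform(0, step) + step * np.arange(n)
sample = np.searchsorted(cum, points)                # indices of selected blocks

# Horvitz-Thompson estimator of the population total.
Y_hat = np.sum(y_i[sample] / pi[sample])
print(f"HT estimate of total: {Y_hat:.0f}; true total: {y_i.sum():.0f}")
```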
Small Number of Clusters
In cluster sampling, resource constraints often limit the number of clusters that can be selected, sometimes to fewer than 10, rendering standard variance estimators unreliable: they tend to be biased downward and to foster overconfidence in the estimates. This issue arises because the estimators assume a large number of independent clusters for asymptotic normality, which fails when intra-cluster dependence dominates and the effective sample size is reduced. For instance, conventional cluster-robust standard errors can underestimate variability by 50% or more when the number of clusters is small, leading to invalid hypothesis tests and confidence intervals.[16]
To mitigate these challenges, alternative inference approaches focus on cluster-level summaries, such as totals or ratios, which treat the selected clusters as a simple random sample and reduce variance calculation to the between-cluster variability. Bootstrap methods, such as the pairs cluster bootstrap, which resamples entire clusters with replacement, and the wild cluster bootstrap, which perturbs cluster-level residuals with random signs, offer robust variance estimation by empirically approximating the sampling distribution without relying on normality assumptions; these are particularly effective for small n, as they avoid the pitfalls of parametric corrections while maintaining type I error control.[17][18]
Recommendations emphasize selecting at least 20-30 clusters to uphold normality assumptions for standard methods, as simulation studies reveal significant bias in t-tests with fewer clusters, often inflating type I error rates beyond 10%.[19]
A practical example occurs in national surveys of countries with limited administrative divisions, such as selecting 5-10 provinces as clusters to estimate economic indicators; here, analysts apply conservative confidence intervals, widening them by 20-50% based on bootstrap results or adjusted degrees of freedom, to counteract the heightened uncertainty from small n.[20]
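A cluster-level bootstrap of this kind can be sketched in a few lines. The example below applies a pairs-style cluster bootstrap of cluster means to synthetic data from six clusters (it is not the wild cluster bootstrap, which perturbs cluster-level residuals instead); with so few clusters the percentile interval is itself rough and should be read conservatively:
```python
import numpy as np

rng = np.random.default_rng(4)

# Suppose only 6 clusters (e.g., provinces) were sampled; each entry holds the
# unit-level observations collected in one cluster. Values are synthetic placeholders.
clusters = [rng.normal(mu, 5, size=30) for mu in (48, 52, 55, 50, 47, 58)]
cluster_means = np.array([c.mean() for c in clusters])
point_estimate = cluster_means.mean()

# Cluster-level (pairs) bootstrap: resample whole clusters with replacement and
# recompute the estimate, capturing between-cluster variability empirically.
B = 2000
boot = np.empty(B)
for b in range(B):
    idx = rng.integers(0, len(clusters), size=len(clusters))
    boot[b] = cluster_means[idx].mean()

ci_low, ci_high = np.percentile(boot, [2.5, 97.5])
print(f"Estimate {point_estimate:.1f}, cluster-bootstrap 95% CI ({ci_low:.1f}, {ci_high:.1f})")
```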
Applications
Fisheries Science
In fisheries science, cluster sampling is widely applied to estimate fish populations and biomass by treating geographic units, such as lakes, river segments, coastal zones, or trawl locations, as natural clusters, from which subsamples of catches are drawn to infer broader abundance. This approach is particularly suited to aquatic environments where fish exhibit patchy distributions, allowing researchers to select clusters of fishing sites or vessels and then measure catches within them, often via trawls or nets, to scale up estimates of total stock size.[21][22]
A prevalent implementation is two-stage cluster sampling, where primary clusters represent regions or fishing grounds, and secondary units consist of individual hauls or sampling events within those clusters; the total catch Y is estimated as Y = N \bar{y}_1, with N denoting the total number of primary units in the population and \bar{y}_1 the average total catch from sampled primary units, typically adjusted by fishing effort (e.g., hours fished or gear type) to standardize for variability in capture efficiency.[22][23] This method builds on multi-stage principles by nesting subsamples hierarchically, ensuring unbiased estimates while minimizing logistical demands in expansive marine or freshwater systems.[24]
Historically, the Food and Agriculture Organization (FAO) of the United Nations supported the adoption of cluster sampling in the 1950s for cost-effective fishery monitoring in developing countries, such as India's marine catch assessment programs, which enabled reliable data collection across vast coastal areas with limited resources.[25] In these contexts, cluster sampling offered key advantages by substantially reducing travel and operational costs over large oceanic or riverine expanses, where accessing dispersed sites individually would be prohibitive, and by inherently addressing spatial autocorrelation in fish distributions, as clustered individuals within patches provide more representative local variance than dispersed simple random samples.[26][21][27]
Post-2020 advancements have integrated cluster sampling with acoustic technologies and geographic information systems (GIS) to refine cluster boundaries dynamically, using hydroacoustic data from multibeam echosounders to map fish aggregations and GIS to delineate spatially explicit sampling units in surveys of pelagic and reef species.[28][29] For instance, integrated acoustic-trawl designs now leverage GIS to optimize cluster selection based on real-time environmental layers, improving biomass precision in dynamic habitats such as offshore wind farm areas or coastal pelagic zones.
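Returning to the expansion estimator Y = N \bar{y}_1, a tiny numerical sketch (all catch figures are invented and assumed already standardized by effort) looks as follows:
```python
import numpy as np

# Illustrative numbers only: N primary units (fishing grounds) in the frame, and the
# effort-standardized total catches (kg) observed in the sampled grounds.
N = 120
sampled_catch = np.array([310.0, 450.0, 275.0, 390.0, 505.0, 280.0])

y_bar_1 = sampled_catch.mean()            # average total catch per sampled primary unit
Y_hat = N * y_bar_1                       # expansion estimator of the total catch
print(f"Estimated total catch: {Y_hat:.0f} kg")
```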
Economics
In economic surveys, cluster sampling is widely applied to estimate key indicators such as employment rates and household expenditures, particularly in national labor force surveys. For instance, the U.S. Bureau of Labor Statistics (BLS) Current Population Survey (CPS) employs a multi-stage cluster design in which primary sampling units (PSUs), often consisting of counties or groups of counties, are selected as geographic clusters. Within these PSUs, clusters of approximately four adjacent housing units are systematically sampled to form ultimate sampling units, enabling efficient data collection on labor force participation from about 60,000 households monthly to produce unemployment rate estimates.[30]
The method typically involves multi-stage sampling with probability proportional to size (PPS) to account for unequal cluster sizes, such as varying population densities across areas. In the CPS, PSUs are selected with PPS based on 2010 census population data, followed by systematic sampling of housing unit clusters within PSUs using state-specific ratios; the unemployment rate estimator is then computed as a weighted average of cluster means, adjusted for sampling weights to reflect the national population.[30] This approach ensures representativeness while minimizing fieldwork costs through geographic clustering.
Historically, cluster sampling emerged in economic contexts during the 1940s for post-World War II planning, exemplified by the UK's National Food Survey (NFS), initiated in 1940 by the Ministry of Food to monitor urban working-class household diets amid rationing and reconstruction efforts. The NFS utilized a three-tiered multi-stage design, stratifying by parliamentary constituencies as initial clusters, then selecting wards and households within them to track food consumption and expenditures, informing policy on nutritional adequacy and economic recovery.[31] By the late 1940s, this method supported broader economic surveys for assessing household budgets and resource allocation in rebuilding economies.[32]
A key challenge in economic cluster sampling is non-response, which often clusters within socioeconomic areas due to factors like urban density or income levels, potentially biasing estimates of indicators such as poverty rates. For example, lower-response clusters in disadvantaged neighborhoods can skew unemployment figures; to address this, imputation techniques, such as hot-deck methods that replace missing values with responses from similar units within the same cluster, are applied to maintain estimate integrity without introducing excessive variance.[33][34]
Post-2010, many economic surveys have incorporated online data collection modes as supplements to traditional in-person cluster-based approaches, partly reducing reliance on pure geographic clustering by enabling remote self-response from selected households. In the CPS, experimental web collection for supplements began in the mid-2010s, with full internet self-response implementation planned by 2027 to boost participation rates amid declining in-person response, though core cluster designs persist for representativeness.[35] This shift enhances efficiency in capturing economic data such as consumer spending but requires hybrid weighting to align with cluster probabilities.[36]
Public Health and Surveys
Cluster sampling plays a pivotal role in public health surveys, particularly through the World Health Organization's (WHO) Demographic and Health Surveys (DHS), which employ a two-stage cluster design to generate nationally representative estimates of fertility, mortality, and health indicators in low- and middle-income countries.[37] In this approach, primary sampling units, such as enumeration areas, villages, or urban blocks, are selected as clusters using probability proportional to size (PPS) stratification, followed by random selection of households within those clusters for data collection on demographic and health outcomes.[38] Standard DHS protocols typically involve 30-40 clusters per survey stratum, with 20-30 households sampled per cluster, enabling efficient coverage of diverse populations while minimizing logistical costs in resource-limited settings.[39]
During the 2020s, cluster sampling has been instrumental in COVID-19 seroprevalence studies to map infection rates across communities without exhaustive population screening. For instance, a cross-sectional study published in 2023 in Oran, Algeria (conducted 7-20 January 2021), utilized stratified random cluster sampling by age and sex to select households, revealing a seroprevalence of 35.6% (95% CI, 32.9 to 38.4) among 1,185 participants and highlighting geographic variations in exposure.[40] Similarly, a 2021 study in urban Fianarantsoa, Madagascar, applied two-stage cluster sampling integrating field mapping to select clusters and households, which estimated seroprevalence at 41.5% (95% CrI, 36.5 to 47.2) among 1,121 individuals across 674 households and informed targeted public health responses in settings with incomplete census data.[41] These applications demonstrate how clustering by communities facilitates rapid, cost-effective assessment of epidemic spread, especially in hard-to-reach urban and peri-urban zones.
To analyze data from such clustered designs, public health researchers account for intra-cluster correlation in disease outcomes using multilevel modeling, which partitions variance into individual and cluster levels to accurately estimate parameters such as prevalence and risk factors.[42] This approach is essential for addressing the non-independence of observations within clusters, since ignoring between-cluster heterogeneity in disease outcomes leads to understated standard errors and overconfident inferences.[43] Recent advancements since 2023 have incorporated adaptive cluster sampling for remote or disadvantaged areas, such as in immunization coverage surveys, where initial clusters are adjusted based on interim findings to oversample under-vaccinated populations, improving equity in health data collection.[44]
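The variance-partition idea behind such multilevel analyses can be sketched with a simple ANOVA-style moment estimator of the intra-cluster correlation; the code below uses synthetic seropositivity data and is a rough stand-in for a full multilevel model, not the method used in the cited studies:
```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic clustered binary outcomes (e.g., seropositivity) for k clusters of m people
# each; cluster-level prevalences vary to induce intra-cluster correlation.
k, m = 30, 25
cluster_p = np.clip(rng.normal(0.35, 0.08, size=k), 0.0, 1.0)
y = rng.binomial(1, cluster_p[:, None], size=(k, m))

# ANOVA-style moment estimate of the intra-cluster correlation: partition variability
# into between-cluster and within-cluster mean squares.
grand_mean = y.mean()
msb = m * np.sum((y.mean(axis=1) - grand_mean) ** 2) / (k - 1)           # between clusters
msw = np.sum((y - y.mean(axis=1, keepdims=True)) ** 2) / (k * (m - 1))   # within clusters
icc = (msb - msw) / (msb + (m - 1) * msw)

print(f"Estimated prevalence: {grand_mean:.3f}, estimated ICC: {icc:.3f}")
```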
Evaluation
Advantages
Cluster sampling offers substantial practical benefits, particularly in terms of cost efficiency. By concentrating data collection efforts within geographically proximate clusters, it minimizes the travel and logistical expenses associated with surveying dispersed populations. For instance, in large-scale household surveys, this approach can reduce overall costs by grouping households into predefined units, such as neighborhoods or villages, thereby lowering the need for extensive interviewer travel and field operations.[45]
This method is especially feasible for studying large or widely dispersed populations where constructing a complete sampling frame is impractical or impossible. Instead of requiring a comprehensive list of all individuals, cluster sampling leverages existing administrative divisions, such as census enumeration areas, schools, or postal codes, to form natural clusters. This eliminates the need for exhaustive population registries and simplifies the initial stages of survey design, making it viable for scenarios with limited resources or incomplete data infrastructure.[1]
Administratively, cluster sampling streamlines fieldwork by aligning with pre-existing groupings, which facilitates easier supervision, training, and execution of surveys. Interviewers can operate within compact areas, reducing coordination challenges and enabling more efficient allocation of personnel. This ease of implementation is particularly advantageous in multi-stage designs used for national or international programs, such as the United Nations Children's Fund's Multiple Indicator Cluster Surveys (MICS), which scale effectively across countries by adjusting cluster numbers and sizes to achieve reliable estimates at various administrative levels.[45][4]
From a statistical perspective, cluster sampling can achieve precision comparable or even superior to simple random sampling under certain conditions. When the intra-cluster correlation coefficient (\rho) is near zero, the design effect (DEFF) is close to 1, so little precision is lost; when \rho is negative, DEFF falls below 1 and the clustered design is more precise than simple random sampling of the same size. This efficiency, combined with logistical savings, enhances its suitability for resource-constrained applications such as the Demographic and Health Surveys (DHS).[46]
Disadvantages
Cluster sampling often results in higher sampling error than simple random sampling because of the homogeneity of elements within clusters, which reduces the effective diversity of the sample and inflates the variance of estimates whenever the intra-cluster correlation coefficient (\rho) is greater than zero.[1] This increased variability is quantified by the design effect (DEFF), which measures the efficiency loss relative to simple random sampling; when DEFF > 1, larger sample sizes are required to achieve the same precision.[47] Seminal work by Mahalanobis (1940) and Hansen and Hurwitz (1946) established that this effect depends on cluster size and intra-class correlation, and that ignoring it leads to underestimated uncertainty.[47]
The design of cluster samples introduces significant complexity, as it requires careful selection and formation of clusters to ensure representativeness; poor choices, such as forming overly homogeneous clusters, can introduce bias and compromise the validity of inferences.[48] Balancing the number of clusters, their sizes, and the sampling stages demands expertise to minimize both cost and variance, making the method more challenging to implement than simpler designs.[47] In multistage cluster sampling, this complexity is amplified, as errors at earlier stages can propagate through subsequent selections.[1]
Non-response and coverage issues pose additional challenges in cluster sampling, where refusals or unavailability within a selected cluster can affect the entire unit, leading to clustered non-response that is harder to detect and adjust for than individual-level non-response in simple random designs.[34] This propagation of non-response bias at the cluster level can systematically skew results, particularly if entire clusters are excluded due to low participation rates, exacerbating undercoverage of certain population segments.[49]
Cluster sampling is also less precise for estimating characteristics of rare events or small subgroups, as the geographic or structural clustering may result in selected clusters either entirely missing these subgroups or overrepresenting them, leading to unreliable prevalence estimates.[50] The homogeneity within clusters further limits the method's ability to capture rare variations, making it suboptimal for studies requiring fine-grained subgroup analysis.[1]
For instance, Shackman (2001) demonstrated through survey examples that ignoring the design effect leads to underestimated sample size needs, confirming the practical impact of these inefficiencies.[51]
Advanced Inference
Probability Proportional to Size Sampling
Probability proportional to size (PPS) sampling is an advanced technique in cluster sampling designed to address unequal cluster sizes by selecting clusters with probabilities proportional to a measure of their size, such as the number of elements or an auxiliary variable correlated with the study variable. This approach ensures that larger clusters, which may contribute more to population totals, have a higher chance of inclusion, thereby improving the efficiency of estimates compared to equal-probability selection. In the first stage of a two-stage cluster design, the selection probability for cluster i is p_i = M_i / \sum_{j=1}^N M_j, where M_i is the size measure for cluster i and N is the total number of clusters.[52]
The procedure for implementing PPS can vary depending on whether sampling is with or without replacement. For with-replacement sampling, clusters are selected independently n times, allowing possible duplicates, using methods such as the Hansen-Hurwitz approach, which draws from a cumulative size distribution. Without replacement, techniques such as systematic PPS sampling order clusters by cumulative size measures and select every k-th unit starting from a random point, ensuring no duplicates and approximate proportionality. Brewer's method, particularly useful for small sample sizes such as n = 2, employs a paired selection procedure to achieve exact first-order inclusion probabilities proportional to size, with generalizations available for larger n. These methods are particularly effective in cluster sampling where size measures are known in advance from a sampling frame.[53][54][55]
For estimation under with-replacement PPS, the Hansen-Hurwitz estimator for the population total is \hat{Y} = \frac{1}{n} \sum_{i=1}^n \frac{y_i}{p_i}, where y_i is the total in the selected cluster i. The population mean estimator is then \hat{\bar{y}} = \hat{Y} / M, where M = \sum M_j is the known total population size. The variance of \hat{Y} is V(\hat{Y}) = \frac{1}{n} \sum_{i=1}^N p_i \left( \frac{y_i}{p_i} - Y \right)^2, so V(\hat{\bar{y}}) = V(\hat{Y}) / M^2 = \frac{1}{n} \sum_{i=1}^N p_i \left( \frac{y_i}{p_i M} - \bar{Y} \right)^2, where \bar{Y} = Y / M. The variance can be estimated by \hat{v}(\hat{Y}) = \frac{1}{n(n-1)} \sum_{i=1}^n \left( \frac{y_i}{p_i} - \hat{Y} \right)^2, and correspondingly \hat{v}(\hat{\bar{y}}) = \hat{v}(\hat{Y}) / M^2 = \frac{1}{n(n-1)} \sum_{i=1}^n \left( \frac{y_i}{p_i M} - \hat{\bar{y}} \right)^2. For without-replacement designs, the Horvitz-Thompson estimator is often used instead, but PPS facilitates its application by stabilizing inclusion probabilities.[56][57][58]
Compared to equal-probability cluster sampling, PPS offers key advantages, including reduced variance in estimates of population totals by equalizing the inclusion probabilities of individual elements across clusters and minimizing bias when cluster sizes vary significantly. This leads to more precise inferences, especially in applications such as cluster-randomized trials, where PPS can improve the power to detect treatment effects by better representing the population structure. Additionally, it allows the use of auxiliary size information to enhance efficiency without requiring stratification.[59][60]
Implementation of PPS sampling and variance estimation is supported in statistical software packages.
In SAS, PROC SURVEYSELECT facilitates PPS sample selection using methods such as systematic or Brewer's, while PROC SURVEYMEANS computes the Hansen-Hurwitz estimator and its variance under with-replacement designs. Stata handles PPS through the svyset command with pweight options for unequal probabilities, enabling variance estimation via linearization or replication methods in svy: mean commands. These tools automate the complex probability calculations and provide design-based standard errors.[61][62]
In recent advancements during the 2020s, machine learning has been integrated with PPS sampling, for example to de-bias regression models fitted under complex designs including PPS and to develop active sampling frameworks that optimize subsample selection for finite population inference. These approaches improve efficiency in scenarios with auxiliary information or complex survey designs.[63][64]
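Outside these packages, the Hansen-Hurwitz calculation itself is straightforward to reproduce directly. The Python sketch below draws a with-replacement PPS sample from a synthetic frame (all size measures and totals are invented) and evaluates the estimator and variance formulas given above:
```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic frame of clusters with size measures M_i and (normally unobserved) totals y_i.
M_i = rng.integers(50, 500, size=40).astype(float)
y_i = M_i * rng.normal(1.2, 0.15, size=40)

p = M_i / M_i.sum()                       # selection probabilities proportional to size
n = 8
draws = rng.choice(len(M_i), size=n, replace=True, p=p)   # PPS with replacement

# Hansen-Hurwitz estimator of the population total and its estimated variance.
ratios = y_i[draws] / p[draws]
Y_hat = ratios.mean()
v_hat = np.sum((ratios - Y_hat) ** 2) / (n * (n - 1))

M = M_i.sum()
mean_hat = Y_hat / M                      # estimated population mean
se_mean = np.sqrt(v_hat) / M
print(f"Total: {Y_hat:.0f} (true {y_i.sum():.0f}); mean: {mean_hat:.3f} (SE {se_mean:.3f})")
```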
Optimal Cluster Sample Design
Optimal cluster sample design involves selecting the number of clusters and their sizes to achieve efficient estimation, typically by minimizing the variance of the population mean estimator subject to a fixed total cost or by minimizing the total cost for a specified precision level. These optimization problems are solved using techniques like Lagrange multipliers, which handle the nonlinear constraints arising from the variance-cost relationship in two-stage sampling frameworks. For instance, in two-stage cluster sampling, the objective function minimizes a weighted sum of variances across characteristics while constraining the total budget, leading to explicit formulas for the optimal number of primary sampling units (clusters) and secondary units per cluster.[65]
A key result from classical sampling theory gives the optimal number of clusters n as
n = \sqrt{\frac{N C_b}{C_w / m + C_e}},
where N is the total number of clusters in the population, C_b is the fixed cost per selected cluster (e.g., travel and setup), C_w is the variable cost per element interviewed within clusters, m is the average cluster size, and C_e is any additional enumeration or fixed overhead cost. This formula arises from balancing the between-cluster variance contribution, which decreases with more clusters, against the within-cluster variance, which increases with smaller cluster sizes, under a cost constraint; it assumes simple random sampling of clusters and of elements within them.[66]
Adaptations of Neyman allocation to cluster sampling further refine this by prioritizing heterogeneous clusters, allocating proportionally more elements or clusters to those with higher within-cluster standard deviations to minimize overall variance. In stratified cluster designs, this involves solving for stratum-specific allocations using Lagrange multipliers, where the optimal number of clusters per stratum k_j is proportional to the stratum size times the square root of its variance contribution, ensuring resources focus on the most variable subgroups.[67]
For practical implementation, especially with small numbers of available clusters, simulation-based approaches such as Monte Carlo methods evaluate design efficiency by repeatedly simulating the sampling process to estimate variance and power across candidate numbers of clusters and cluster sizes. These methods address analytical limitations in complex scenarios, such as varying cluster sizes or intracluster correlations, by providing empirical distributions of estimators. More recently, Bayesian methods from the 2010s incorporate prior information on parameters such as the intracluster correlation into the optimization, enabling prior-informed sample size determination that averages utility over posterior distributions for robust designs in large-scale or data-rich environments. For example, behavioral Bayes approaches treat design parameters as random with elicited priors, optimizing expected power or precision in cluster randomized settings.[68][69]
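The Monte Carlo approach mentioned above can be sketched as follows: candidate (n, m) designs with roughly equal total cost are compared by simulating the two-stage estimator repeatedly and recording its empirical variance. The population parameters, cost model, and candidate designs below are illustrative assumptions only:
```python
import numpy as np

rng = np.random.default_rng(7)

# Monte Carlo comparison of candidate two-stage designs under (roughly) the same budget.
N, M = 200, 100                                   # clusters in the population, elements per cluster
pop = rng.normal(0, 1, size=(N, M)) + rng.normal(0, 0.6, size=(N, 1))   # clustered population
c_cluster, c_element = 20.0, 1.0                  # cost per selected cluster, cost per element

def mc_variance(n, m, reps=2000):
    """Empirical variance of the two-stage mean estimator for a candidate (n, m) design."""
    estimates = np.empty(reps)
    for r in range(reps):
        chosen = rng.choice(N, size=n, replace=False)
        estimates[r] = np.mean([rng.choice(pop[c], size=m, replace=False).mean() for c in chosen])
    return estimates.var()

# Designs that spend the same total budget C = n * (c_cluster + m * c_element) = 600.
for n, m in [(10, 40), (15, 20), (20, 10)]:
    cost = n * (c_cluster + m * c_element)
    print(f"n={n:2d}, m={m:2d}, cost={cost:.0f}, simulated variance={mc_variance(n, m):.5f}")
```
Designs with more clusters and smaller subsamples typically show lower simulated variance at the same cost when between-cluster variability dominates, mirroring the trade-off described by the analytical formula above.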