Probability-proportional-to-size sampling
Probability proportional to size (PPS) sampling is a probability-based sampling technique used in survey methodology where the probability of selecting a population unit is directly proportional to a predetermined measure of its size, such as population count or economic value, to improve efficiency in estimating population parameters.[1] This method was first formally introduced by Morris H. Hansen and William N. Hurwitz in their 1943 paper on sampling from finite populations, where they proposed PPS with replacement to allow unbiased estimation of totals using the Hansen-Hurwitz estimator.[2] Subsequent developments, including the Horvitz-Thompson estimator for PPS without replacement in 1952, extended its applicability to single-stage and multistage designs, particularly in cluster sampling where larger clusters receive higher selection probabilities while maintaining equal probabilities for ultimate elements through fixed subsampling.[1][3] PPS sampling is especially valuable in scenarios with heterogeneous unit sizes, such as national health surveys or business establishment frames, as it reduces variance in estimates compared to equal-probability sampling by allocating more resources to larger units.[4][5] Key implementation steps involve calculating cumulative sizes, determining a sampling interval, and selecting units via systematic random starts, enabling straightforward computation of inclusion probabilities and weights for unbiased inference.[3] Advantages include enhanced precision for skewed distributions and simplified fieldwork logistics in multistage surveys, though challenges such as requiring accurate size measures and handling without-replacement complexities persist.[1][5]
Definition and Basics
Definition
Probability-proportional-to-size (PPS) sampling is a probability-based sampling technique where the probability of selecting a population unit into the sample is directly proportional to a specified measure of its size, such as the number of elements it contains, its economic value, or an auxiliary variable like revenue. This approach assigns higher inclusion probabilities to larger units, which enhances the efficiency of estimating population totals or means by focusing selection efforts on units that contribute more substantially to the aggregate quantities of interest.[1][6] Mathematically, the first-order inclusion probability \pi_i for unit i is defined as \pi_i = n \cdot \frac{\text{size}_i}{\sum_{j=1}^N \text{size}_j}, where n denotes the desired sample size, \text{size}_i is the size measure for unit i, and \sum_{j=1}^N \text{size}_j is the total size measure across all N units in the population. This formulation ensures that the expected number of units selected equals n, while larger units are oversampled relative to their proportion in an equal-probability scheme. The size measure must be accurately known or reliably estimated from the sampling frame to implement PPS effectively.[2]
The primary purpose of PPS sampling is to mitigate inefficiencies in simple random sampling when dealing with heterogeneous or skewed populations, where a small number of large units account for a disproportionate share of the total variability or aggregate value, thereby reducing the variance of estimators without increasing sample size. In contrast to equal-probability methods, PPS leverages auxiliary size information to achieve greater precision, particularly when the size measure correlates positively with the study variable.[2][1]
PPS sampling was developed in the 1940s by statisticians Morris H. Hansen and William N. Hurwitz as part of advancements in survey methodology, initially applied to improve estimation in agricultural and economic surveys involving finite populations with varying unit sizes. Their foundational work established PPS as a key tool for handling unequal unit contributions in practical sampling scenarios.[2]
Key Concepts
Probability-proportional-to-size (PPS) sampling modifies equal-probability sampling by assigning selection probabilities to population units in proportion to a chosen size measure, thereby reducing sampling error when estimating totals or means in populations with substantial variability in unit sizes.[7] This adjustment leverages auxiliary information about unit sizes to overweight larger units, which are presumed to contribute more to the population total, leading to more efficient estimators compared to simple random sampling.[8] For instance, if the size measure correlates strongly with the study variable, the variance of the estimator can approach zero when the variable is exactly proportional to the size.[7]
Size measures in PPS sampling are auxiliary variables that quantify the relative importance or scale of each unit and must be positively correlated with the variable of interest to achieve efficiency gains.[8] Common examples include population counts for clusters, such as the number of students in school classes or residents in geographic areas; revenue figures for businesses; or land area for agricultural plots.[8][7] These measures are typically obtained from pre-survey data sources, such as administrative records or censuses, to compute selection probabilities prior to sampling.[8] Size variables can be continuous, such as exact revenue amounts, or discrete, such as rounded counts of elements like employees or households, with the choice depending on data availability and the nature of the population frame.[7] Accurate auxiliary data is essential for defining these sizes, as it directly influences the proportionality of selection probabilities.[2]
As a probability-based method, PPS sampling ensures design unbiasedness for appropriate estimators when selection probabilities are correctly specified based on the size measures.[2] However, if the size variable lacks positive correlation with the study variable, efficiency diminishes through increased variance, though unbiasedness is preserved provided the probabilities reflect the true design.[8][7]
Sampling Procedures
With-Replacement Sampling
In probability-proportional-to-size (PPS) sampling with replacement, each unit in the population is assigned a selection probability equal to its size measure divided by the total size measure across all units, and selections are made independently for a fixed sample size n, permitting the possibility of duplicate selections.[2] This approach ensures that larger units have a higher chance of being selected in each draw, while the independence of draws simplifies the sampling process compared to without-replacement variants.[9] The standard procedure for implementing PPS with replacement employs the cumulative total method, which facilitates efficient selection based on precomputed size accumulations. Consider a population of N units, each with a known positive size measure x_i > 0 for i = 1, \dots, N, and let T = \sum_{i=1}^N x_i denote the total size. The steps are as follows:
- List the units in any arbitrary order and compute the cumulative size sums: S_0 = 0 and S_j = \sum_{i=1}^j x_i for j = 1, \dots, N, so that S_N = T.
- For each of the n independent draws (k = 1, \dots, n), generate a uniform random variate u_k \sim U(0, T).
- Select the unit i_k as the smallest index j such that S_j \geq u_k; the probability of selecting unit i in any single draw is then p_i = x_i / T.
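The cumulative total method described above can be sketched in Python. This is a minimal illustration, not code from the source; the function name `pps_with_replacement` and the use of the standard-library `bisect` module for the search are implementation choices.

```python
import bisect
import random

def pps_with_replacement(sizes, n, rng=random):
    """Draw n unit indices PPS with replacement via the cumulative total method."""
    # Step 1: cumulative size sums S_1, ..., S_N; the last entry equals T.
    cumulative = []
    total = 0.0
    for x in sizes:
        total += x
        cumulative.append(total)
    sample = []
    for _ in range(n):
        # Step 2: draw u_k ~ U(0, T) independently for each of the n draws.
        u = rng.uniform(0.0, total)
        # Step 3: select the smallest index j with S_j >= u_k, so that
        # unit i is chosen with probability p_i = x_i / T.
        sample.append(bisect.bisect_left(cumulative, u))
    return sample
```

With sizes [10, 30, 60], each draw returns index 2 with probability 0.6; the binary search makes each draw O(log N) after the single O(N) cumulative pass.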
Without-Replacement Sampling
In probability-proportional-to-size (PPS) sampling without replacement, distinct units are selected from a finite population such that the inclusion probability of each unit is proportional to its size measure, ensuring a fixed sample size n with no duplicates. This method is essential for applications where redundant selections would waste resources, particularly when n is a meaningful proportion of the population size N. The procedure relies on ordered selection techniques to maintain proportionality while enforcing uniqueness.[11] The general procedure involves ordering the population units by their size measures to create a cumulative total frame, which facilitates systematic draws. A random starting point u is selected uniformly from [0, X/n], where X is the total size, and subsequent points are chosen at regular intervals equal to X/n. A unit is selected at each point by finding the smallest index j such that the cumulative size S_j \geq the point value. This systematic PPS approach, often attributed to early models of unequal probability sampling, guarantees exactly n unique units while approximating the desired inclusion probabilities. For smaller samples, Brewer's method offers a straightforward extension for n > 2, pairing units and using adjusted joint probabilities to select without replacement, as detailed in foundational work on systematic unequal probability designs.[12][13] The algorithm typically proceeds in three steps:
- First, assign initial inclusion probabilities \pi_i = n \cdot (x_i / X) for each unit i, where x_i is the size measure and X = \sum x_i, ensuring \pi_i \leq 1.
- Second, construct an ordered list of units (e.g., by increasing size) and employ a pivotal or sequential draw method to select units one at a time, rejecting any already chosen and renormalizing the remaining probabilities.
- Third, apply ordering adjustments, such as stratification by size, to preserve the proportional structure across the sample.
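The systematic selection just described (random start in the first interval, fixed interval X/n, cumulative sizes) can be sketched as follows. This is an illustrative implementation, not from the source; `systematic_pps` is a hypothetical name, and the sketch assumes every size satisfies x_i \leq X/n so that \pi_i = n \cdot x_i / X \leq 1 and no unit can be hit by two selection points.

```python
import random

def systematic_pps(sizes, n, rng=random):
    """Systematic PPS without replacement: random start, fixed interval X/n."""
    total = sum(sizes)                  # X, the total size
    interval = total / n                # sampling interval X/n
    # Assumption: all x_i <= X/n; otherwise a unit with pi_i > 1 must be
    # taken with certainty before applying this scheme.
    assert max(sizes) <= interval, "requires pi_i = n * x_i / X <= 1 for all units"
    start = rng.uniform(0.0, interval)  # random start u in [0, X/n)
    points = [start + k * interval for k in range(n)]
    sample, cum, j = [], 0.0, 0
    for p in points:
        # Advance to the smallest index j whose cumulative size S_j >= p.
        while cum + sizes[j] < p:
            cum += sizes[j]
            j += 1
        sample.append(j)
    return sample
```

Because the selection points are spaced exactly X/n apart and no unit's size exceeds that interval, the n selected indices are distinct, and unit i is included with probability n \cdot x_i / X.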
Size-based ordering, as a core concept in PPS designs, underpins these steps by aligning the frame with the probability structure.[14] Key variants include systematic PPS with a random start and fixed interval X/n, which is computationally efficient for ordered frames, and rejective sampling schemes that generate multiple candidate samples proportional to size and accept only those with exactly n distinct units, as cataloged in comprehensive reviews of unequal probability methods. Such variants, including those building on Brewer's framework, keep inclusion probabilities close to their targets and support unbiased estimation without complex adjustments.[15] In finite populations, without-replacement PPS offers an advantage over with-replacement alternatives by reducing sampling variance when n is relatively large compared to N, as it maximizes the diversity of selected units and avoids inefficient multiple inclusions.[11]
Estimation and Properties
Unbiased Estimators
In probability-proportional-to-size (PPS) sampling, unbiased estimators for population parameters are constructed using inverse inclusion probabilities to correct for the unequal selection chances of units based on their sizes. The Horvitz-Thompson (HT) estimator, adapted for PPS designs without replacement, provides an unbiased estimate of the population total Y = \sum_{i=1}^N y_i by weighting each observed value y_i in the sample by the inverse of its inclusion probability \pi_i. Specifically, the estimator is given by \hat{Y} = \sum_{i \in s} \frac{y_i}{\pi_i}, where s denotes the sample. This formulation ensures unbiasedness under the sampling design, as the expected value E(\hat{Y}) = Y, since E\left( \frac{y_i}{\pi_i} \mathbf{I}_i \right) = y_i for the indicator \mathbf{I}_i of unit i's inclusion.[16] In PPS without replacement, the inclusion probabilities \pi_i are proportional to the unit sizes x_i, typically \pi_i = n \cdot \frac{x_i}{\sum x_j} under certain approximations, though exact forms depend on the selection procedure; the HT estimator directly incorporates these \pi_i as weights, maintaining unbiasedness without requiring joint inclusion probabilities \pi_{ij} for the point estimate itself (though they are used in variance estimation).[17] For the population mean \bar{Y} = Y / N, where N is the known population size, the unbiased estimator is simply \bar{\hat{Y}} = \hat{Y} / N. This follows directly from the unbiasedness of \hat{Y}, yielding E(\bar{\hat{Y}}) = \bar{Y}.[16] In PPS sampling with replacement, the Hansen-Hurwitz estimator is employed instead, drawing n independent samples where each unit i has selection probability p_i proportional to its size x_i. The estimator for the total is \hat{Y}_{HH} = \frac{1}{n} \sum_{k=1}^n \frac{y_k}{p_k}, where the sum accounts for possible duplicates by including y_k / p_k for each draw k; averaging over the draws ensures unbiasedness, with E(\hat{Y}_{HH}) = Y, as each term's expectation is \sum_i y_i. 
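Both point estimators follow directly from their formulas and can be sketched in a few lines; the function names below are illustrative, not from the source.

```python
def horvitz_thompson_total(y, pi):
    """Horvitz-Thompson estimator of the population total for a
    without-replacement sample: sum of y_i / pi_i over sampled units."""
    return sum(yi / p for yi, p in zip(y, pi))

def hansen_hurwitz_total(y, p):
    """Hansen-Hurwitz estimator of the total from n independent PPS
    with-replacement draws: the average of y_k / p_k over the draws,
    with duplicate draws included as separate terms."""
    return sum(yk / pk for yk, pk in zip(y, p)) / len(y)
```

When y_i is exactly proportional to the size measure, y_i / p_i is constant across units and both estimators recover the true total with zero sampling variance, e.g. hansen_hurwitz_total([10, 30], [10/60, 30/60]) gives 60 up to floating-point rounding.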
The corresponding mean estimator is \bar{\hat{Y}}_{HH} = \hat{Y}_{HH} / N. For without-replacement PPS, joint inclusion probabilities inform higher-order adjustments but are not part of the basic HT point estimator.[2]
Variance Estimation
In probability-proportional-to-size (PPS) sampling with replacement, the exact design-based variance of the Hansen-Hurwitz estimator \hat{Y}_{HH} = \frac{1}{n} \sum_{k=1}^n \frac{y_k}{p_k} for the population total Y = \sum_{i=1}^N y_i (where p_i = x_i / X) is \Var(\hat{Y}_{HH}) = \frac{1}{n} \left( \sum_{i=1}^N \frac{y_i^2}{p_i} - Y^2 \right). An unbiased estimator of this variance is \hat{V}(\hat{Y}_{HH}) = \frac{1}{n(n-1)} \sum_{k=1}^n \left( \frac{y_k}{p_k} - \hat{Y}_{HH} \right)^2, which can equivalently be expressed as \hat{V}(\hat{Y}_{HH}) = \frac{X^2}{n(n-1)} \sum_{k=1}^n \left( \frac{y_k}{x_k} - \hat{\beta} \right)^2, where \hat{\beta} = n^{-1} \sum_{k=1}^n y_k / x_k; this estimator is exact under the design and does not rely on approximations.[18][8] For PPS sampling without replacement, the Sen-Yates-Grundy form gives the variance of the Horvitz-Thompson estimator \hat{Y}_{HT} = \sum_{i \in s} y_i / \pi_i, where the \pi_i are the inclusion probabilities (set proportional to the sizes x_i): \Var(\hat{Y}_{HT}) = \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N (\pi_i \pi_j - \pi_{ij}) \left( \frac{y_i}{\pi_i} - \frac{y_j}{\pi_j} \right)^2, with \pi_{ij} denoting the joint inclusion probability of units i and j.[18] The corresponding estimator is \hat{V}_{SYG}(\hat{Y}_{HT}) = \frac{1}{2} \sum_{i \in s} \sum_{j \in s, j \neq i} \frac{\pi_i \pi_j - \pi_{ij}}{\pi_{ij}} \left( \frac{y_i}{\pi_i} - \frac{y_j}{\pi_j} \right)^2, which requires the joint probabilities \pi_{ij} and applies to fixed-size designs such as systematic PPS without replacement.[18][19] For complex PPS designs, approximation methods such as the bootstrap or Taylor series linearization are commonly used to estimate variances when exact joint probabilities are unavailable or computationally intensive.
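Before turning to approximations, the two exact estimators above can be sketched directly from their formulas. This is an illustrative sketch (hypothetical function names); for the Sen-Yates-Grundy form, the joint inclusion probabilities `pi_joint` must be supplied, since they depend on the specific without-replacement design.

```python
def hh_variance_estimate(y, p):
    """Unbiased variance estimator for the Hansen-Hurwitz total (PPS with
    replacement): sum of (y_k/p_k - Y_hat)^2 over draws, divided by n(n-1)."""
    n = len(y)
    z = [yk / pk for yk, pk in zip(y, p)]   # z_k = y_k / p_k
    y_hat = sum(z) / n                      # Hansen-Hurwitz point estimate
    return sum((zk - y_hat) ** 2 for zk in z) / (n * (n - 1))

def syg_variance_estimate(y, pi, pi_joint):
    """Sen-Yates-Grundy variance estimator for the Horvitz-Thompson total.
    pi_joint[i][j] is the joint inclusion probability of sample units i and j;
    looping over i < j covers each unordered pair once, which absorbs the
    1/2 factor from the double sum in the formula."""
    n = len(y)
    v = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            weight = (pi[i] * pi[j] - pi_joint[i][j]) / pi_joint[i][j]
            v += weight * (y[i] / pi[i] - y[j] / pi[j]) ** 2
    return v
```

In both sketches, a sample in which the expanded values y_i / \pi_i (or y_k / p_k) are constant yields a variance estimate of zero, mirroring the zero-variance property when the study variable is exactly proportional to size.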
Bootstrap procedures resample from the original PPS design to mimic the selection process, providing variance estimates for the Horvitz-Thompson estimator in without-replacement settings; for instance, algorithms tailored to PPS bootstrap the sample while preserving size-based probabilities.[20] Linearization approximates the variance for large samples by treating the estimator as a function of sample totals; it is implemented in software such as R's survey package, which supports PPS designs via the svydesign function with probability arguments and computes variances using linearization or replication methods.[21]
Variance estimation in PPS sampling faces challenges, including high variability when the size measure x_i correlates poorly with the study variable y_i, which reduces efficiency relative to equal-probability sampling and can inflate standard errors.[11] Additionally, stable estimation typically requires sample sizes of roughly n > 10, since denominator terms like n(n-1) in the with-replacement formulas make the variance estimator unstable for very small n.