Simple random sample
A simple random sample (SRS) is a probability sampling method in which a subset of n units is selected from a finite population of N units such that every possible sample of size n has an equal probability of being chosen.[1] This technique ensures that each unit in the population has an equal chance of inclusion, typically without replacement, making it a foundational approach in statistical inference.[2] SRS is widely used in survey research, experimental design, and population estimation to obtain representative data for generalizing findings to the broader population.[3]
To implement a simple random sample, researchers first define the target population and construct a complete sampling frame, such as a numbered list of all units.[2] They then use a random selection mechanism, such as a random number table, a lottery draw, or computer-generated pseudorandom numbers, to choose the sample, so that no unit is favored by the selection process.[1] For example, to study TV viewing habits among U.S. children aged 5-15, one might number all eligible children on a list and randomly select 100 of them using random numbers with as many digits as the largest label on the list, contacting their parents for data collection.[2] This method can be conducted with or without replacement, though without replacement is standard to avoid duplicates in finite populations.[4]
Simple random sampling yields unbiased estimators for population parameters, such as the mean \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} y_i and total \hat{\tau} = N \hat{\mu}, with the variance of the mean estimated by \widehat{\text{Var}}(\hat{\mu}) = \frac{s^2}{n} \left( \frac{N - n}{N} \right), where s^2 is the sample variance.[1] The finite population correction factor \frac{N - n}{N} accounts for the reduced variability that arises when the sample makes up a non-negligible fraction of the population.[1] Simple random sampling also serves as a benchmark for more complex designs like stratified or cluster sampling, providing a baseline for assessing efficiency and precision in statistical analyses.[3]
Among its advantages, simple random sampling is straightforward to implement with minimal prior knowledge of population structure and promotes high representativeness when the population is homogeneous.[1] However, it requires a comprehensive sampling frame, which can be costly and time-consuming to develop, especially for large or dispersed populations.[2] Additionally, it may be inefficient for heterogeneous populations, leading to higher sampling error and precision issues compared to targeted methods like stratification.[1]
Fundamentals
Definition
A simple random sample (SRS) is a subset of individuals selected from a larger population such that every possible sample of a given size has an equal probability of being chosen.[5] This method ensures that each member of the population has an equal chance of inclusion in the sample, promoting representativeness and minimizing bias in the selection process.[6]
In this framework, simple random sampling is applied to finite populations where the total number of units, denoted as N, is known. The sample size, typically denoted as n, is fixed in advance, and randomness is introduced through mechanisms like random number generation or physical randomization to achieve the equal probability condition.[7] This random selection underpins the validity of subsequent statistical analyses by allowing inferences to be generalized from the sample to the population.[8]
The concept of simple random sampling emerged in the early 20th century within the context of probability theory and experimental design, with Ronald A. Fisher playing a pivotal role in its formalization. In his 1925 book Statistical Methods for Research Workers, Fisher emphasized randomization as essential for valid statistical inference, laying the groundwork for modern sampling techniques.[9]
Simple random sampling serves as a foundational tool in statistical inference, where the primary goal is to use the sample to estimate unknown population parameters, such as the mean or proportion, and to quantify the uncertainty in those estimates.[6]
Key Properties
A simple random sample ensures unbiasedness because every unit in the population has an equal probability of being selected, resulting in estimators like the sample mean and sample proportion having expected values equal to the corresponding population parameters.[10] This equal selection probability eliminates subjective biases in the sampling process and supports reliable inference about population characteristics.[11]
The method's randomness fosters representativeness, as the sample tends to reflect the population's diversity and distributional properties on average, thereby minimizing systematic errors that could arise from non-random selection.[10] In without-replacement simple random sampling, which is commonly used, the observations are not strictly independent since each draw alters the probabilities for remaining units, though the design maintains exchangeability among selected units.[11][12]
Key advantages of simple random sampling include its theoretical simplicity, which facilitates equal treatment of all population units and straightforward statistical analysis, as well as its ability to quantify sampling error precisely.[10] Disadvantages encompass the need for a complete population listing, which may be impractical for large or dispersed groups, and reduced efficiency when data display clustering, where other designs like stratified sampling perform better.[10]
When the sample constitutes a substantial portion of the population, commonly taken as more than 5%, a finite population correction (FPC) adjusts variance estimates downward to account for the decreased sampling variability of drawing without replacement.[10] The FPC multiplier for the standard error is typically \sqrt{1 - n/N} (equivalently, 1 - n/N for the variance), where n is the sample size and N is the population size; it refines precision in such scenarios while preserving unbiasedness.[11]
Mathematical Foundations
Selection Mechanism
The selection mechanism of a simple random sample involves probabilistic procedures to ensure every population unit has an equal chance of inclusion, typically implemented through draws from a defined population of size N. Two primary models govern this process: sampling with replacement and sampling without replacement. These models differ in their treatment of previously selected units and the resulting probability structures, with the choice depending on whether duplicates are permissible in the sample.
In the with-replacement model, each draw is independent, and a unit selected in one draw is returned to the population before the next, allowing for possible duplicates in the sample. The probability of selecting any specific unit on a given draw is p = 1/N, and for a specific ordered sample of size n, the probability is (1/N)^n.[13] This model follows a multinomial distribution for the counts of each unit in the sample.[14]
In contrast, the without-replacement model prohibits duplicates by removing selected units from consideration for subsequent draws, resulting in a sample of distinct units. The probability of selecting any specific unordered sample of size n is 1 / \binom{N}{n}, where \binom{N}{n} denotes the binomial coefficient representing the total number of possible combinations of n units from N.[15] This ensures uniformity over all possible subsets, akin to a hypergeometric selection process but without regard to categories.[1]
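The equal-probability claim can be checked by brute force on a tiny population. The following is a minimal sketch, not a production routine; the function names `srs_sample_space` and `inclusion_probability` and the five-unit population are illustrative assumptions.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def srs_sample_space(population, n):
    """Return every unordered sample of size n with its SRS probability."""
    prob = Fraction(1, comb(len(population), n))  # 1 / C(N, n) for each subset
    return {sample: prob for sample in combinations(population, n)}


def inclusion_probability(space, unit):
    """Add up the probabilities of all samples that contain the given unit."""
    return sum(p for sample, p in space.items() if unit in sample)


# Hypothetical population of N = 5 labelled units, sample size n = 2.
space = srs_sample_space(["a", "b", "c", "d", "e"], 2)
print(len(space))                          # number of possible samples, C(5, 2)
print(inclusion_probability(space, "c"))   # should equal n/N
```

Exact fractions are used so that the total probability and the inclusion probability n/N can be compared without rounding error.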
To simulate these selections computationally, random numbers are generated from a uniform distribution on [0,1), which are then mapped to population units via inverse transform or direct indexing.[16] This prerequisite relies on pseudo-random number generators to approximate true randomness in practice. A complete and accessible sampling frame—a list encompassing all N population units—is essential for both models, as it defines the universe from which draws occur and ensures the probabilities are well-defined.[2]
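The direct-indexing step can be sketched as follows, here used to draw a with-replacement sample. The helper names `uniform_to_index` and `sample_with_replacement`, the seed, and the ten-unit frame are assumptions made for illustration; in practice a library routine such as `random.randrange` is usually preferable to hand-rolled scaling.

```python
import random


def uniform_to_index(u, N):
    """Map u in [0, 1) to an index in 0..N-1 by direct indexing."""
    # Each index owns an interval of width 1/N. The min() guards against
    # floating-point rounding pushing u * N up to N for very large N.
    return min(int(u * N), N - 1)


def sample_with_replacement(frame, n, rng):
    """Draw n units independently from the frame; duplicates are allowed."""
    return [frame[uniform_to_index(rng.random(), len(frame))] for _ in range(n)]


rng = random.Random(42)  # fixed seed only so the sketch is reproducible
frame = ["unit-%d" % i for i in range(1, 11)]  # hypothetical sampling frame
draws = sample_with_replacement(frame, 5, rng)
print(draws)
```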
For large populations where N is much greater than n, the without-replacement model is well approximated by the with-replacement model, which simplifies computations while keeping similar probabilistic properties; this approximation is commonly relied on in computational statistics when working with very large datasets.[16][17]
Estimators and Variance
In simple random sampling without replacement from a finite population of size N, the sample mean \bar{X} = \frac{1}{n} \sum_{i=1}^n x_i serves as the unbiased estimator of the population mean \mu = \frac{1}{N} \sum_{i=1}^N x_i, satisfying E(\bar{X}) = \mu.[14] This unbiasedness holds because each unit in the population has an equal probability of inclusion in the sample, ensuring the expected value of the estimator aligns with the true parameter.[14]
The variance of the sample mean under this sampling scheme is given by
\text{Var}(\bar{X}) = \frac{N - n}{N} \cdot \frac{S^2}{n},
where S^2 = \frac{1}{N-1} \sum_{i=1}^N (x_i - \mu)^2 is the population variance defined with the N-1 denominator, and the factor \frac{N - n}{N} is the finite population correction (FPC) that accounts for the reduced variability when sampling a substantial portion of the population without replacement.[15] Because E(s^2) = S^2 under this design, an unbiased estimator of this variance, using the sample variance s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{X})^2, is \widehat{\text{Var}}(\bar{X}) = \frac{N - n}{N} \cdot \frac{s^2}{n}.[1]
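As a small numerical illustration, the sketch below computes the sample mean, the estimated variance \frac{N - n}{N} \cdot \frac{s^2}{n} (the form that is unbiased when S^2 uses the N-1 denominator), and the standard error. The function name `srs_mean_estimate`, the five observations, and N = 50 are made-up assumptions.

```python
from math import sqrt
from statistics import mean, variance


def srs_mean_estimate(sample, N):
    """Estimate the population mean from an SRS drawn without replacement.

    Returns (sample mean, estimated variance of the mean, standard error).
    """
    n = len(sample)
    y_bar = mean(sample)
    s2 = variance(sample)      # sample variance with the n-1 denominator
    fpc = (N - n) / N          # finite population correction
    var_hat = fpc * s2 / n
    return y_bar, var_hat, sqrt(var_hat)


# Hypothetical data: n = 5 observations from a population of N = 50 units.
y_bar, var_hat, se = srs_mean_estimate([4, 7, 5, 8, 6], N=50)
print(y_bar, var_hat, se)
print(50 * y_bar)              # corresponding estimate of the population total
```

With these numbers s^2 = 2.5 and the FPC is 0.9, so the estimated variance is 0.9 × 2.5 / 5 = 0.45.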
For estimating a population proportion p in dichotomous populations—where each unit is classified as a success (1) or failure (0)—the sample proportion \hat{p} = \frac{1}{n} \sum_{i=1}^n x_i (or equivalently, the number of successes divided by n) is unbiased with E(\hat{p}) = p.[14] Its variance is
\text{Var}(\hat{p}) = \frac{p(1 - p)}{n} \left(1 - \frac{n - 1}{N - 1}\right),
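where substituting \hat{p} for p gives the plug-in estimate \hat{p}(1 - \hat{p}) \frac{N - n}{n (N - 1)}, while the unbiased estimator is \widehat{\text{Var}}(\hat{p}) = \frac{N - n}{N} \cdot \frac{\hat{p}(1 - \hat{p})}{n - 1}; both incorporate the FPC and differ little when n is large.[1] This structure mirrors the sample mean case, as the proportion is a special instance of the mean for binary data.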
The standard error of the sample mean, \text{SE}(\bar{X}) = \sqrt{\widehat{\text{Var}}(\bar{X})}, quantifies the precision of the estimate and forms the basis for constructing confidence intervals, typically \bar{X} \pm z_{\alpha/2} \cdot \text{SE}(\bar{X}) for large samples under approximate normality via the central limit theorem.[18] Similar standard errors apply to \hat{p}, enabling inference about proportions. In complex scenarios where analytical variance formulas are intractable—such as non-normal distributions or nonlinear statistics—bootstrap methods provide a resampling-based alternative for variance estimation; introduced by Efron in 1979, these involve repeatedly drawing samples with replacement from the observed data to approximate the sampling distribution empirically, proving especially useful in modern computational settings.[19]
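A large-sample interval for a proportion can be sketched along the same lines. The function `srs_proportion_ci` is a hypothetical helper, the counts are invented, and the variance estimate used is \frac{N - n}{N} \cdot \frac{\hat{p}(1 - \hat{p})}{n - 1}, which is the binary-data form of \frac{N - n}{N} \cdot \frac{s^2}{n}; the normal approximation is assumed to be adequate.

```python
from math import sqrt
from statistics import NormalDist


def srs_proportion_ci(successes, n, N, level=0.95):
    """Normal-approximation confidence interval for a proportion under SRS."""
    p_hat = successes / n
    # Unbiased variance estimate with the finite population correction.
    var_hat = (N - n) / N * p_hat * (1 - p_hat) / (n - 1)
    se = sqrt(var_hat)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # e.g. about 1.96 for 95%
    return p_hat, (p_hat - z * se, p_hat + z * se)


# Hypothetical survey: 40 successes among n = 100 units drawn from N = 1000.
p_hat, (lower, upper) = srs_proportion_ci(40, 100, 1000)
print(p_hat, lower, upper)
```

For small n or proportions near 0 or 1 the normal approximation can be poor, and an exact or resampling-based interval may be more appropriate.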
Comparisons with Other Methods
Equal Probability Sampling
Equal probability sampling (EPS), also referred to as the equal probability of selection method (EPSEM), is a sampling framework in which every unit in the population has an identical probability of inclusion in the sample, denoted as \pi_i = n/N for all units i, where n is the sample size and N is the population size.[20] This design ensures that the selection process treats all population elements uniformly, facilitating straightforward probability calculations for inference.[21]
Simple random sampling represents a core special case of EPS, characterized by direct random selection from the full population without incorporating stratification, clustering, or other structural modifications, thereby maintaining equal inclusion probabilities through mechanisms like lottery draws or random number generation.[21] Within the broader EPS framework, this approach avoids complexities introduced by multi-stage or layered designs while preserving the equal probability property.[2]
The EPS framework offers key benefits for design-based inference, as the constant inclusion probabilities streamline estimation procedures; notably, the Horvitz-Thompson estimator, which in general form weights observations by the inverse of \pi_i, simplifies to the population size multiplied by the unweighted sample mean under EPS, reducing computational demands and aligning variance calculations with those of simple random sampling.[22] Historically, EPS concepts were formalized in survey methodology during the 1950s by W. Edwards Deming, who emphasized simplifications through equal probabilities and replication to enhance practical application in large-scale surveys.[23] Post-2010 developments have integrated EPS into complex survey designs, such as the National Children's Study, where equal probability selection supports representativeness across diverse health outcomes in multi-stage frameworks.[24]
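The simplification of the Horvitz-Thompson estimator under equal inclusion probabilities can be checked numerically. This is an illustrative sketch only; `horvitz_thompson_total` and the data are assumptions, not part of any particular survey package.

```python
from math import isclose


def horvitz_thompson_total(values, inclusion_probs):
    """Estimate a population total by weighting each value by 1 / pi_i."""
    return sum(y / pi for y, pi in zip(values, inclusion_probs))


# Hypothetical sample of n = 5 values from a population of N = 50 units.
values = [4, 7, 5, 8, 6]
N, n = 50, len(values)
equal_probs = [n / N] * n                  # pi_i = n/N for every unit

ht_total = horvitz_thompson_total(values, equal_probs)
simple_total = N * sum(values) / n         # N times the unweighted sample mean
print(ht_total, simple_total, isclose(ht_total, simple_total))
```

Because every weight equals N/n, the weighted sum factors into N times the sample mean, which is why no unit-level weights are needed under EPS.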
A primary limitation of EPS is its assumption against utilizing auxiliary information for optimizing selection probabilities, which can lead to lower efficiency in heterogeneous populations compared to alternatives like probability proportional to size (PPS) sampling that leverage unit characteristics to vary inclusion chances and reduce variance.[25]
Systematic Sampling
Systematic sampling is a probability sampling method where elements are selected from a population list at regular intervals, known as the sampling interval k, which is typically calculated as k = N/n, with N being the population size and n the desired sample size. To ensure randomness, a random starting point is chosen between 1 and k, after which every k-th element is selected until the sample reaches size n.[26] This approach maintains equal inclusion probabilities of n/N for each unit. However, if the population list has periodicity that aligns with k, it can lead to higher variance due to correlated selections.[1]
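The selection rule can be sketched in a few lines. The function name `systematic_sample` is hypothetical, and the sketch assumes for simplicity that N is an exact multiple of n so that k = N/n is an integer; handling of fractional intervals is omitted.

```python
import random


def systematic_sample(frame, n, rng=random):
    """Select every k-th unit after a random start, with k = N / n."""
    N = len(frame)
    if n <= 0 or N % n != 0:
        raise ValueError("this sketch assumes N is a positive multiple of n")
    k = N // n
    start = rng.randrange(k)     # random start among the first k positions
    return frame[start::k]       # exactly n units, spaced k apart


frame = list(range(1, 101))      # hypothetical ordered list of N = 100 units
picked = systematic_sample(frame, 10, random.Random(7))
print(picked)
```

Only k distinct samples are possible under this scheme, one per starting position, in contrast with the \binom{N}{n} samples available under simple random sampling.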
In contrast to simple random sampling, which gives every possible combination of n units an equal probability of selection, systematic sampling restricts the sample to the k evenly spaced subsets determined by the ordered list and fixed interval.[27] When the list is in effectively random order, systematic sampling performs about as well as simple random sampling; when the list is sorted along a trend related to the variable of interest, spreading the sample evenly across the list can lower the variance; and when hidden periodic patterns coincide with the interval, each sample consists of similar units and the variance increases.[28]
The efficiency of systematic sampling is often assessed through its approximate variance for the sample mean, given by
V_{\text{sys}} \approx \left(1 - \frac{n}{N}\right) \frac{S^2}{n} \left[1 + (n-1)\delta\right],
where S^2 is the population variance and \delta is the average intraclass correlation coefficient among elements separated by multiples of k.[29] This formula resembles the simple random sampling variance but adjusts for ordering effects via \delta; when \delta < 0 (indicating dispersion), systematic sampling reduces variance and is preferred for cost savings in accessing large, ordered lists like directories or databases.[1] Systematic sampling is particularly advantageous in scenarios requiring simplicity and uniformity, such as quality control audits, where full randomization is logistically challenging.[26]
Simple random sampling is more robust than systematic sampling when populations contain hidden periodicity, as the unrestricted selection avoids alignment with intervals that could distort estimates; this matters for datasets with cyclical structure, such as some financial time series.[30]
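The contrast between ordered and periodic lists can be made concrete by computing both variances exactly for two toy populations. Everything here is an illustrative assumption: the helper names, the 70-unit populations, and the choice of k = 7 (so n = 10).

```python
from statistics import mean, pvariance, variance


def systematic_mean_variance(values, k):
    """Exact variance of the systematic-sample mean over the k possible starts."""
    means = [mean(values[start::k]) for start in range(k)]
    return pvariance(means)      # each start is equally likely


def srs_mean_variance(values, n):
    """Variance of the SRS mean: (N - n)/N * S^2 / n, with S^2 using N - 1."""
    N = len(values)
    return (N - n) / N * variance(values) / n


trend = list(range(70))                   # list sorted along a linear trend
periodic = [i % 7 for i in range(70)]     # cycle of length 7, same as k

for name, pop in [("trend", trend), ("periodic", periodic)]:
    print(name, systematic_mean_variance(pop, 7), srs_mean_variance(pop, 10))
```

In this toy setting the systematic design has the smaller variance on the trending list and the larger variance on the periodic list, because every systematic sample from the periodic list contains ten copies of the same value.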
Special Cases and Applications
Dichotomous Populations
In a dichotomous population of size N, there exists a proportion p = K/N of elements classified as "successes" (e.g., individuals with a binary trait such as yes/no responses or defective/non-defective items), where K denotes the total number of successes in the population. Simple random sampling without replacement from this population involves drawing a sample of size n, resulting in k observed successes within the sample. This setup models scenarios with a fixed number of units in two mutually exclusive categories in the population, enabling inference about the unknown population proportion p.[31]
The number of successes in the sample, written X when treated as a random variable with observed value k, follows a hypergeometric distribution, which accounts for the dependencies introduced by sampling without replacement from a finite population. The probability mass function is given by
P(X = k) = \frac{ \binom{K}{k} \binom{N - K}{n - k} }{ \binom{N}{n} },
for k = \max(0, n - (N - K)) to \min(n, K), where \binom{\cdot}{\cdot} denotes the binomial coefficient. This distribution arises directly from the uniform selection mechanism of simple random sampling, ensuring each subset of size n is equally likely.[31]
The unbiased estimator for the population proportion is \hat{p} = k/n, which provides a consistent estimate of p as n increases relative to N. Under the hypergeometric distribution, the exact variance of this estimator is
\text{Var}(\hat{p}) = \frac{p(1 - p)}{n} \cdot \frac{N - n}{N - 1},
reflecting the reduction in variability due to the finite population correction factor (N - n)/(N - 1). For large N relative to n, this approximates the binomial variance p(1 - p)/n, facilitating normal approximations for confidence intervals when n is sufficiently large (e.g., np \geq 5 and n(1 - p) \geq 5).[32][33]
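The variance formula can be verified directly from the hypergeometric probabilities. The sketch below is illustrative; `hypergeom_pmf`, `proportion_variance`, and the values N = 20, K = 8, n = 5 are assumptions chosen to keep the enumeration small.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """Probability of k successes in an SRS of size n (without replacement)."""
    if k < max(0, n - (N - K)) or k > min(n, K):
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def proportion_variance(N, K, n):
    """Closed form: p(1 - p)/n * (N - n)/(N - 1), with p = K/N."""
    p = K / N
    return p * (1 - p) / n * (N - n) / (N - 1)


N, K, n = 20, 8, 5                        # hypothetical population and sample
pmf = {k: hypergeom_pmf(k, N, K, n) for k in range(n + 1)}
mean_p = sum(k / n * prob for k, prob in pmf.items())
var_p = sum((k / n - mean_p) ** 2 * prob for k, prob in pmf.items())
print(mean_p, var_p, proportion_variance(N, K, n))
```

The mean of \hat{p} computed from the probabilities matches K/N, and the variance matches the closed form up to floating-point rounding.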
This framework finds application in polling, where simple random samples estimate binary voter preferences (e.g., support for a candidate), allowing construction of confidence intervals to predict election outcomes with quantified uncertainty. In quality control, it is used to assess the proportion of defective items in a production batch by sampling without replacement, aiding decisions on acceptability thresholds.[34][35]
Bayesian extensions incorporate prior knowledge via beta-binomial models, where a beta prior distribution on p (conjugate to the binomial likelihood) updates to a posterior beta distribution after observing the sample, particularly useful when approximating the hypergeometric with a binomial for large populations. This approach is increasingly applied in machine learning contexts, such as A/B testing for binary conversion rates, enabling posterior predictive checks and credible intervals that integrate uncertainty from small samples.[36]
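The conjugate update itself is a two-line calculation. The sketch below assumes the binomial approximation is acceptable (large N relative to n) and uses a uniform Beta(1, 1) prior purely for illustration; `beta_posterior` and `beta_mean` are hypothetical helper names.

```python
def beta_posterior(a, b, successes, n):
    """Update a Beta(a, b) prior after observing successes out of n trials."""
    return a + successes, b + (n - successes)


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Hypothetical data: 12 successes in a sample of n = 40, uniform prior.
post_a, post_b = beta_posterior(1, 1, successes=12, n=40)
print(post_a, post_b, beta_mean(post_a, post_b))
```

The posterior mean (a + k)/(a + b + n) lies between the prior mean and the sample proportion k/n, with the sample dominating as n grows.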
Real-World Examples
In public opinion polling, simple random sampling and related probability methods have been instrumental in avoiding selection biases that plagued earlier approaches. The 1936 Literary Digest poll, which mailed ballots to about 10 million individuals drawn from telephone directories and automobile registration lists, inaccurately predicted a landslide victory for Republican candidate Alfred Landon over incumbent Franklin D. Roosevelt because its sample over-represented wealthier households that leaned Republican.[37] In contrast, George Gallup's American Institute of Public Opinion employed a more scientific quota sampling approach informed by random principles to achieve representativeness across demographics and correctly forecast Roosevelt's victory; Roosevelt went on to win about 61% of the popular vote.[38] Although Gallup's method was not itself a simple random sample, the episode underscored the value of random selection in producing unbiased estimates of population opinions, influencing modern polling standards.[39]
In clinical trials, simple random assignment to treatment and control groups ensures unbiased estimation of intervention effects by balancing known and unknown confounders across groups on average. The 1962 Kefauver-Harris Amendments to the U.S. Federal Food, Drug, and Cosmetic Act, enforced by the Food and Drug Administration, mandated adequate and well-controlled studies for drug approval, establishing randomization as a core requirement to minimize bias and support causal inferences.[40] For instance, in evaluating new therapies for conditions like cancer or cardiovascular disease, researchers randomly allocate participants to arms, allowing valid comparisons of outcomes such as survival rates or symptom reduction.[41] This practice has become standard in Phase III trials, enabling regulators to approve treatments based on reliable evidence of efficacy and safety.[41]
Environmental monitoring frequently applies simple random sampling to assess contamination in natural populations, providing unbiased estimates for policy decisions. The U.S. Environmental Protection Agency's National Study of Chemical Residues in Lake Fish Tissue, conducted from 2000 to 2003, selected lakes and sampling sites using probability-based designs incorporating random selection to represent the nation's approximately 147,000 lakes and reservoirs.[42] Within selected lakes, fish were randomly captured and composited to measure contaminants like mercury and PCBs, with the study estimating that mercury concentrations exceeded human health screening values in 48.8% of lakes.[43] These findings informed advisories on fish consumption and guided remediation efforts, demonstrating how random sampling yields nationally generalizable contamination profiles; results have informed subsequent assessments such as the 2017 National Lakes Assessment.[44]
In the 2020s, simple random subsampling has gained prominence in big data contexts, particularly for training artificial intelligence models on massive datasets. For example, the 2020 Big Transfer (BiT) framework for visual representation learning used balanced random subsamples of the ImageNet dataset—containing over 1.2 million images across 1,000 classes—to efficiently train models while maintaining performance comparable to full-dataset training.[45] This approach reduces computational costs in resource-intensive tasks like image classification, allowing researchers to iterate quickly without sacrificing the representativeness needed for robust model generalization.[45] Such subsampling techniques have been widely adopted in machine learning pipelines to handle datasets exceeding terabytes in size.[46]
Despite its strengths, simple random sampling in practice faces challenges like non-response bias, where certain subgroups decline participation, skewing results. Mitigation strategies include post-stratification weighting to adjust for underrepresented groups and follow-up incentives to boost response rates, as implemented in large-scale surveys to restore balance.[47] For instance, in opinion polls, weighting by demographics like age and education has proven effective in correcting biases from low-response subsets.[48]
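Post-stratification weighting can be sketched as follows, assuming the population share of each group is known from an external source. The group labels, shares, responses, and helper names are all invented for illustration.

```python
def poststratification_weights(sample_counts, population_shares):
    """Weight each group by (population share) / (share among respondents)."""
    n = sum(sample_counts.values())
    return {g: population_shares[g] / (c / n) for g, c in sample_counts.items()}


def weighted_mean(responses, weights):
    """Weighted mean of (group, value) pairs using per-group weights."""
    total_weight = sum(weights[g] for g, _ in responses)
    return sum(weights[g] * y for g, y in responses) / total_weight


# Hypothetical respondents: the "young" group is under-represented.
responses = [("young", 1), ("young", 1)] + [("old", 0)] * 8
counts = {"young": 2, "old": 8}
shares = {"young": 0.4, "old": 0.6}       # assumed known population shares

weights = poststratification_weights(counts, shares)
unweighted = sum(y for _, y in responses) / len(responses)
print(weights, unweighted, weighted_mean(responses, weights))
```

Weighting corrects the group composition of the sample but cannot remove bias if respondents and non-respondents differ within a group.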
Overall, simple random sampling enables generalizability by ensuring each population unit has an equal selection chance, allowing inferences to extend reliably from samples to broader contexts across fields like polling, medicine, ecology, and AI.[49] This property underpins its enduring value in empirical research, fostering trustworthy conclusions that inform decisions at scale.[45]
Implementation
Algorithms
A simple random sample without replacement can be generated using the Fisher-Yates shuffle algorithm, which randomly permutes the entire population and selects the first n elements from the permuted list. This approach ensures each subset of size n from the population of size N is equally likely, with an expected time complexity of O(N). The modern version of the algorithm, as described by Knuth, iterates from the last index to the first, swapping each element with a randomly chosen element from the unshuffled portion of the array.
The pseudocode for the Fisher-Yates shuffle is as follows:
procedure FisherYatesShuffle(array A of size N, sample size n)
    for i from N-1 downto 1 do
        j ← random integer such that 0 ≤ j ≤ i
        exchange A[j] and A[i]
    end for
    return the first n elements of A
This method requires the population to be fully available in memory as a list.
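A runnable version of the pseudocode might look like the sketch below. The function name `fisher_yates_sample` and the optional `rng` parameter are illustrative assumptions; the list is copied first so the caller's data is left untouched.

```python
import random


def fisher_yates_sample(population, n, rng=random):
    """Shuffle a copy of the population and return its first n elements."""
    items = list(population)              # copy so the input is not modified
    for i in range(len(items) - 1, 0, -1):
        j = rng.randint(0, i)             # uniform on 0..i, both ends inclusive
        items[i], items[j] = items[j], items[i]
    return items[:n]


population = list(range(1, 21))           # hypothetical frame of N = 20 units
sample = fisher_yates_sample(population, 5, random.Random(1))
print(sample)
```

Drawing j from the full range 0..i, including i itself, is what makes every permutation equally likely; restricting j to be strictly less than i would bias the result.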
For sampling with replacement, the algorithm independently draws n elements by selecting indices uniformly at random from 1 to N, allowing duplicates.[50] Each draw is performed using a uniform random number generator, resulting in a multiset where each population element has equal probability of selection per draw.[50] This is computationally simpler, with time complexity O(n), but produces samples that may include repetitions.[51]
When the population size N is very large or the data arrives as a stream, reservoir sampling provides an efficient without-replacement method.[52] The basic scheme, known as Algorithm R, initializes a reservoir with the first n items and, for each subsequent item at position k > n, replaces a uniformly chosen reservoir element with probability n/k, so a single pass takes O(N) time. Vitter's Algorithm Z instead generates the number of items to skip before the next replacement, reducing the expected time to O(n(1 + \log(N/n))).[52] Both variants use constant extra space beyond the reservoir and are suitable for online processing.[52]
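The replace-with-probability-n/k rule can be sketched as below. This is the simple one-pass form often called Algorithm R, not the skip-based optimization; the function name `reservoir_sample` and the example stream are assumptions.

```python
import random


def reservoir_sample(stream, n, rng=random):
    """Keep a uniform without-replacement sample of size n from an iterable."""
    reservoir = []
    for k, item in enumerate(stream, start=1):
        if k <= n:
            reservoir.append(item)        # fill the reservoir with the first n
        else:
            j = rng.randrange(k)          # uniform on 0..k-1
            if j < n:                     # occurs with probability n/k
                reservoir[j] = item       # overwrite a uniformly chosen slot
    return reservoir


# The stream is consumed once and never held in memory as a whole.
stream = (x * x for x in range(10_000))
kept = reservoir_sample(stream, 5, random.Random(3))
print(kept)
```

If the stream yields fewer than n items, the function returns all of them, so the caller should check the length when an exact sample size matters.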
In software libraries, these algorithms are implemented for practical use; for example, R's sample() function supports both with- and without-replacement sampling from vectors.[50] Similarly, Python's random.sample() in the standard library generates without-replacement samples from sequences using an underlying pseudorandom number generator.[51]
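For Python, the two standard-library calls can be used as in the sketch below; the seed and the 1,000-unit frame are arbitrary illustrative choices. `random.sample` draws without replacement and `random.choices` draws with replacement.

```python
import random

rng = random.Random(2024)                 # seeded only for reproducibility
frame = list(range(1, 1001))              # hypothetical frame of unit labels

without_replacement = rng.sample(frame, k=10)   # 10 distinct units
with_replacement = rng.choices(frame, k=10)     # duplicates are possible

print(without_replacement)
print(with_replacement)
```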
For applications requiring high-security or true randomness, such as cryptographic sampling in 2025, quantum random number generators (QRNGs) can replace pseudorandom sources to drive these algorithms.[53] NIST's SP 800-90 series, updated in September 2025, endorses QRNG constructions based on quantum nonlocality for verifiable randomness in random bit generation.[54] The CURBy beacon, launched by NIST in June 2025, provides a public service for such quantum-entropy sources, ensuring unpredictability against classical adversaries.[53]
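Swapping the randomness source does not require changing the sampling code. As a hedged illustration, the sketch below uses Python's `random.SystemRandom`, which draws from the operating system's entropy source rather than a seeded pseudorandom generator; it is not a quantum source, and connecting an actual QRNG or beacon service would need an interface that is not shown here. The helper name `secure_srs` is an assumption.

```python
import random


def secure_srs(frame, n):
    """Draw an SRS without replacement using the OS entropy source."""
    rng = random.SystemRandom()   # backed by os.urandom; cannot be seeded
    return rng.sample(frame, n)


frame = list(range(1, 101))       # hypothetical frame of N = 100 unit labels
print(secure_srs(frame, 5))
```

Because this source cannot be seeded, results are not reproducible, which is a trade-off when an audit trail of the selection is required.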
Practical Considerations
A complete and accurate sampling frame, which enumerates every element in the target population, is a fundamental requirement for simple random sampling to guarantee equal selection probabilities.[7] Constructing such a frame often demands substantial resources, particularly for expansive populations like those covered by national censuses, where data compilation, verification, and maintenance can incur costs comparable to conducting a full enumeration due to the need for comprehensive listing and updates.[55]
In practice, simple random sampling frequently encounters issues such as non-response, where selected units decline to participate, and undercoverage, where segments of the population are omitted from the frame, both of which can distort representativeness.[56] These challenges can be partially addressed through weighting adjustments that rebalance the sample to align with known population distributions, thereby reducing bias without altering the initial selection process.[57]
Simple random sampling is often less suitable for populations exhibiting clustering or stratification, such as geographic groupings or demographic subgroups, where alternative methods like cluster or stratified sampling prove more efficient by lowering costs and variance while preserving accuracy.[58]
From an ethical standpoint, simple random sampling upholds fairness by assigning equal selection chances to all units, minimizing discrimination in the process; however, in contexts involving personal data, adherence to regulations like the European Union's General Data Protection Regulation (GDPR) is critical, requiring a lawful basis for processing such as consent, transparency in frame usage, and safeguards against privacy breaches during selection.[59]
Advancements since 2020 have introduced AI-assisted techniques for building sampling frames, leveraging machine learning to aggregate and validate population lists from disparate sources, which enhances coverage efficiency and mitigates manual errors in large-scale surveys.[60]
For scalability in big data scenarios, where full-frame access becomes computationally intensive, hybrid strategies integrating simple random subsampling with other probabilistic methods enable manageable analysis of massive datasets while approximating population inferences.