
Coupon collector's problem

The coupon collector's problem is a classic problem in probability theory that models the number of independent random trials required to collect all n distinct coupon types, assuming each trial yields one coupon chosen uniformly at random from the n types with replacement. The expected number of trials to complete the collection is exactly n H_n, where H_n = \sum_{k=1}^n 1/k is the nth harmonic number; asymptotically, this grows as n \ln n + \gamma n + O(1), with \gamma \approx 0.57721 denoting the Euler-Mascheroni constant. This expectation arises from decomposing the process into n geometric waiting times: after collecting k-1 distinct coupons, the expected number of additional trials to obtain a new one is n/(n-k+1) for k = 1 to n, and linearity of expectation yields the total. The problem traces its origins to Abraham de Moivre's 1711 treatise De Mensura Sortis, where it appeared in the context of measuring fair shares in games of chance, and was later analyzed by Pierre-Simon Laplace and Leonhard Euler in the 18th century. Over time, it has become a foundational example in the study of stochastic processes, illustrating concepts like geometric distributions and Markov chains, with the variance of the collection time asymptotically \sim n^2 \pi^2 / 6. Beyond probability theory, the problem finds applications in computer science, such as estimating cache miss rates in memory systems, analyzing load balancing in distributed networks, and modeling the time for packets to traverse all links in communication paths. Numerous generalizations extend the classic setup, including variants with unequal coupon probabilities, requirements for multiple copies of each type, or draws of multiple coupons per trial, which connect to broader areas like occupancy problems and urn models. Tail bounds, such as those showing the collection time concentrates around its mean with high probability, further highlight its utility in algorithm analysis and randomized computing.

Problem Statement and History

Definition

In the coupon collector's problem, there are n distinct types of coupons, where n is a finite positive integer. Each trial consists of drawing one coupon, which is selected independently and uniformly at random from the n types with equal probability \frac{1}{n} for each type and with replacement. The goal is to determine the expected number of trials T required to obtain at least one of each of the n types. The random variable T represents the total number of trials needed to complete the collection and can be decomposed as T = \sum_{i=1}^n T_i, where T_i is the number of trials required to acquire a new coupon type after exactly i-1 distinct types have already been collected. The process assumes that trials are independent and that the selection remains uniform over all n types regardless of prior draws. For illustration, consider the case n=2. The first trial T_1 always yields a new coupon, so T_1 = 1. After that, the waiting time T_2 for the second distinct type follows a geometric distribution with success probability \frac{1}{2}, yielding an expected value E[T_2] = 2. Thus, the expected total E[T] = E[T_1] + E[T_2] = 1 + 2 = 3.
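As an illustration of the definition, the n = 2 case can be checked by direct simulation. The following minimal Python sketch (the helper `trials_to_collect` is our own, not from any standard library) estimates E[T] by Monte Carlo:

```python
import random

def trials_to_collect(n, rng):
    """Draw coupons uniformly with replacement until all n types are seen;
    return the number of draws (the random variable T)."""
    seen, draws = set(), 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        draws += 1
    return draws

# Monte Carlo estimate of E[T] for n = 2; the exact value is 1 + 2 = 3.
rng = random.Random(42)
est = sum(trials_to_collect(2, rng) for _ in range(100_000)) / 100_000
print(est)  # close to 3
```

The same helper works for any n, so it can also be used to probe the n H_n growth discussed below.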

Historical Context

The coupon collector's problem has roots in early 18th-century probability theory, first appearing in Abraham de Moivre's 1711 treatise De Mensura Sortis. It was further analyzed by Pierre-Simon Laplace, who in his 1774 memoir Mémoire sur la probabilité des causes par les événements examined scenarios akin to covering all outcomes in lottery draws or repeated trials with multiple possibilities, using interpretations involving dice and equiprobable events to compute probabilities of complete coverage. Although these discussions resembled the core idea of collecting distinct items through random sampling, they were framed in terms of inverse probability and cause inference rather than the explicit "collector" setup. Modern probabilistic analysis of the problem, including the urn model, began in the mid-20th century, with Newman and Shepp deriving asymptotics in 1960. In 1961, Paul Erdős and Alfréd Rényi provided key asymptotic results for this classical problem in their paper "On a classical problem of probability theory," published in the Publications of the Mathematical Institute of the Hungarian Academy of Sciences. There, they posed it as an occupancy problem: determining the expected number of balls needed to occupy all n urns when balls are distributed randomly one at a time, marking a pivotal shift to rigorous probabilistic analysis of waiting times for complete coverage. In the 1980s and 1990s, significant advancements focused on asymptotic behavior, with Philippe Flajolet and colleagues developing unified frameworks for analyzing the problem's moments and tail probabilities in large-scale settings, often using generating functions and singularity analysis. These contributions extended the foundational results and bridged the problem to computer science, where it gained prominence in the analysis of algorithms for modeling phenomena like hash collisions, universal hashing performance, and randomized sampling.
Interest in the coupon collector's problem persists into 2025, with ongoing extensions exploring variants in algorithmic design, such as group drawings and labeled collections, and applications in statistical inference and network coverage analysis.

Basic Properties

Expected Number of Trials

The total number of trials T required to collect all n distinct coupons can be decomposed into a sum of independent geometric random variables. Specifically, let T_i denote the number of additional trials needed to obtain a new coupon after i-1 distinct coupons have already been collected, for i = 1, 2, \dots, n. Each T_i follows a geometric distribution with success probability p_i = (n - i + 1)/n, representing the probability of drawing one of the remaining n - i + 1 uncollected coupons. Thus, T = \sum_{i=1}^n T_i. By the linearity of expectation, the expected value satisfies E[T] = \sum_{i=1}^n E[T_i]. Since the expected value of a geometric random variable with success probability p_i is 1/p_i, it follows that E[T_i] = n / (n - i + 1). Therefore, E[T] = \sum_{i=1}^n \frac{n}{n - i + 1} = n \sum_{k=1}^n \frac{1}{k} = n H_n, where H_n is the nth harmonic number. The harmonic number H_n admits the asymptotic approximation H_n \approx \ln n + \gamma + \frac{1}{2n} - \frac{1}{12n^2} + \cdots, where \gamma \approx 0.57721 is the Euler-Mascheroni constant. Consequently, E[T] \approx n (\ln n + \gamma), providing a leading-order estimate for large n. This approximation highlights that the expected number of trials grows logarithmically with n, scaled by n. For small values of n, exact computations illustrate the formula. When n=1, E[T] = 1. For n=2, H_2 = 1 + 1/2 = 1.5, so E[T] = 3. For n=3, H_3 \approx 1.833, yielding E[T] = 5.5. For n=10, H_{10} \approx 2.929, so E[T] \approx 29.3. These examples demonstrate the rapid increase in expected trials as n grows modestly.
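The closed form E[T] = n H_n is easy to evaluate exactly. A short sketch using Python's fractions module (the helper `expected_trials` is ours, for illustration) reproduces the small-n values above:

```python
from fractions import Fraction

def expected_trials(n):
    """E[T] = n * H_n, computed exactly with rationals."""
    return n * sum(Fraction(1, k) for k in range(1, n + 1))

print(expected_trials(2))               # 3
print(expected_trials(3))               # 11/2, i.e. 5.5
print(float(expected_trials(10)))       # about 29.29
```

Exact rationals avoid any rounding question when checking the tabulated values.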

Variance of the Number of Trials

The total number of trials T to collect all n coupons can be decomposed as T = \sum_{i=1}^n T_i, where each T_i is the number of additional trials needed to obtain a new coupon after i-1 distinct coupons have been collected. Each T_i follows a geometric distribution with success probability p_i = (n - i + 1)/n, and the T_i are independent due to the memoryless property of the process. The variance of a geometric random variable T_i with success probability p_i is \operatorname{Var}(T_i) = (1 - p_i)/p_i^2. By independence, \operatorname{Var}(T) = \sum_{i=1}^n \operatorname{Var}(T_i) = \sum_{i=1}^n (1 - p_i)/p_i^2. Substituting p_i = (n - i + 1)/n and reindexing the sum yields \operatorname{Var}(T) = n^2 \sum_{k=1}^n \frac{1}{k^2} - n H_n, where H_n = \sum_{k=1}^n \frac{1}{k} is the nth harmonic number. Asymptotically, as n \to \infty, \sum_{k=1}^n \frac{1}{k^2} \to \frac{\pi^2}{6} and H_n \sim \ln n + \gamma (with \gamma the Euler-Mascheroni constant), so \operatorname{Var}(T) \sim n^2 \frac{\pi^2}{6} - n \ln n. The dominant term n^2 \frac{\pi^2}{6} shows that the variance grows quadratically with n: the standard deviation is of order n, so T exhibits fluctuations of order n around its mean, which is of order n \ln n.
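The exact variance formula and its quadratic leading term can be checked numerically. In this illustrative sketch, `variance_trials` is our own helper implementing the sum above:

```python
import math

def variance_trials(n):
    """Var(T) = n^2 * sum_{k<=n} 1/k^2 - n * H_n (exact formula)."""
    s2 = sum(1.0 / (k * k) for k in range(1, n + 1))
    hn = sum(1.0 / k for k in range(1, n + 1))
    return n * n * s2 - n * hn

# For n = 2: Var(T) = 4 * (1 + 1/4) - 2 * (3/2) = 2.
print(variance_trials(2))

# The leading term n^2 * pi^2 / 6 dominates as n grows.
n = 1000
ratio = variance_trials(n) / (n * n * math.pi ** 2 / 6)
print(ratio)  # approaches 1 from below
```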

Mathematical Connections

Relation to Stirling Numbers

The number of distinct coupons collected after m independent trials, where each trial uniformly selects one of n possible coupons, follows a distribution directly involving the Stirling numbers of the second kind S(m, k), which count the number of ways to partition m labeled trials into k non-empty unlabeled subsets. Specifically, the probability of collecting exactly k distinct coupons is given by P(D_m = k) = \binom{n}{k} \frac{k! \, S(m, k)}{n^m}, where \binom{n}{k} chooses the k coupon types, k! \, S(m, k) counts the surjective functions from the m trials onto those k types (with each subset assigned to a type), and n^m is the total number of possible outcomes. This expression highlights the combinatorial essence of the problem, as S(m, k) encodes the partitioning of trials into groups corresponding to distinct coupons received at least once. This connection extends to the waiting time T until all n coupons are collected. The probability mass function is P(T = m) = \frac{n! \, S(m-1, n-1)}{n^m} \quad \text{for } m \geq n, arising because the first m-1 trials must cover exactly n-1 distinct coupons (with probability \binom{n}{n-1} \frac{(n-1)! \, S(m-1, n-1)}{n^{m-1}}), while the m-th trial must yield the missing coupon (probability 1/n); these combine to simplify to the form above. The Stirling numbers of the second kind thus provide an exact combinatorial tool for computing the full distribution of T, beyond the well-known expectation E[T] = n H_n, where H_n is the nth harmonic number. Further, the cumulative probability P(T \leq m) can be expressed using a sum over Stirling numbers: P(T \leq m) = \sum_{j=n}^{m} P(T = j) = \frac{n!}{n^m} \sum_{j=n}^{m} \frac{S(j-1, n-1)}{n^{j-m}}, though in practice, inclusion-exclusion forms are often used for computation; the Stirling representation emphasizes the underlying set-partition structure of the collection process. This link has been instrumental in deriving higher moments and asymptotic behaviors, as explored in combinatorial analyses of the problem.
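The Stirling-number form of the distribution can be verified computationally. In this sketch, `stirling2` and `pmf_stirling` are our own helpers built from the standard recurrence S(m, k) = k S(m-1, k) + S(m-1, k-1):

```python
from fractions import Fraction
from functools import lru_cache
from math import factorial

@lru_cache(maxsize=None)
def stirling2(m, k):
    """Stirling number of the second kind S(m, k)."""
    if m == k:
        return 1
    if k == 0 or k > m:
        return 0
    return k * stirling2(m - 1, k) + stirling2(m - 1, k - 1)

def pmf_stirling(n, m):
    """P(T = m) = n! * S(m-1, n-1) / n^m for m >= n."""
    return Fraction(factorial(n) * stirling2(m - 1, n - 1), n ** m)

# Sanity checks for n = 3: the probabilities (nearly) sum to 1 and the
# truncated mean approaches E[T] = 3 * H_3 = 11/2; the tail beyond
# m = 200 is negligible for this n.
n = 3
total = float(sum(pmf_stirling(n, m) for m in range(n, 200)))
mean = float(sum(m * pmf_stirling(n, m) for m in range(n, 200)))
print(total, mean)
```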

Generating Functions

The waiting time T in the coupon collector's problem with n coupons can be decomposed as T = \sum_{i=1}^n T_i, where the T_i are geometric random variables, with T_i denoting the number of trials needed to obtain a new coupon type after i-1 distinct coupons have been collected. Each T_i has success probability p_i = \frac{n - i + 1}{n}, so P(T_i = k) = (1 - p_i)^{k-1} p_i for k = 1, 2, \dots. The probability generating function (PGF) of a geometric T_i is G_i(s) = \mathbb{E}[s^{T_i}] = \frac{p_i s}{1 - (1 - p_i)s} for |s| \le 1. Since the T_i are independent, the PGF of T is the product G(s) = \prod_{i=1}^n G_i(s) = \prod_{i=1}^n \frac{p_i s}{1 - (1 - p_i)s}. Substituting the uniform success probabilities yields G(s) = s^n \prod_{i=1}^n \frac{(n - i + 1)/n}{1 - ((i - 1)/n)s}. This product form provides an explicit expression for the PGF in the uniform case, facilitating analytical computations. The moments of T can be extracted from the PGF by differentiation at s = 1: the expected value is \mathbb{E}[T] = G'(1) = n \sum_{k=1}^n \frac{1}{k} = n H_n, where H_n is the n-th harmonic number, and the variance is \mathrm{Var}(T) = G''(1) + G'(1) - [G'(1)]^2 = \sum_{i=1}^n \frac{1 - p_i}{p_i^2} = n^2 \sum_{k=1}^n \frac{1}{k^2} - n H_n. These derivatives confirm the known first and second moments derived from the geometric decomposition. The exact probability mass function P(T = m) for m \ge n can be obtained from the PGF via coefficient extraction, P(T = m) = [s^m] G(s), or more practically using the relation to tail probabilities from inclusion-exclusion: P(T > m) = \sum_{k=1}^n (-1)^{k-1} \binom{n}{k} \left( \frac{n - k}{n} \right)^m, so P(T = m) = P(T > m-1) - P(T > m). This formula arises from the probability that at least one specific coupon type is missing after m trials, adjusted by inclusion-exclusion, and connects to the PGF through the identity \sum_{m=0}^\infty P(T > m) s^m = \frac{1 - G(s)}{1 - s}.
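The inclusion-exclusion tail formula gives a practical route to the exact law. In this sketch, `tail_ie` and `pmf_ie` are illustrative helper names of our own:

```python
from math import comb

def tail_ie(n, m):
    """P(T > m) = sum_k (-1)^{k-1} C(n,k) ((n-k)/n)^m."""
    return sum((-1) ** (k - 1) * comb(n, k) * ((n - k) / n) ** m
               for k in range(1, n + 1))

def pmf_ie(n, m):
    """P(T = m) = P(T > m-1) - P(T > m)."""
    return tail_ie(n, m - 1) - tail_ie(n, m)

# For n = 2 the law is explicit: P(T = m) = (1/2)^{m-1} for m >= 2.
print(pmf_ie(2, 2), pmf_ie(2, 3), pmf_ie(2, 4))  # 0.5 0.25 0.125
```

For larger n the alternating sum loses floating-point accuracy at small m, so exact rational arithmetic is preferable there.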
The PGF G(s) relates to the generating function for the unsigned Stirling numbers of the first kind c(n, k), which count the number of permutations of n elements with k cycles and satisfy \sum_{k=0}^n c(n, k) s^k = s(s+1) \cdots (s + n - 1). In combinatorial interpretations of the coupon collector process, such as mappings to cycle structures in random allocations, the PGF coefficients align with expressions involving these rising factorials, providing an alternative lens for computing higher-order statistics or asymptotic behaviors.

Concentration and Bounds

Tail Probability Estimates

The tail probabilities for the number of trials T in the coupon collector's problem have been studied extensively to quantify the likelihood of significant deviations from the expected value E[T] \approx n \ln n + \gamma n, where \gamma is the Euler-Mascheroni constant. Early estimates were provided by Erdős and Rényi in their seminal 1961 paper, where they analyzed the asymptotic distribution of T. Specifically, they showed that for large n and fixed c \in \mathbb{R}, \lim_{n \to \infty} P\left( T \leq n \ln n + c n \right) = \exp\left(-e^{-c}\right). This result indicates that T is tightly concentrated around its mean, with the normalized deviation (T - n \ln n)/n converging in distribution to a Gumbel random variable (shifted and scaled). In the Gumbel limit the lower tail decays double-exponentially while the upper tail decays exponentially in c, reflecting the problem's strong concentration properties. Since T = \sum_{k=1}^n T_k where the T_k are independent geometric random variables with success probabilities p_k = (n - k + 1)/n, Chernoff-type bounds can be derived by exploiting the moment-generating functions of these geometrics. For non-identical geometrics, explicit upper tail bounds have been established using Chernoff-Hoeffding techniques adapted to varying parameters. In particular, for \lambda > 1, P(T > \lambda E[T]) \leq \exp\left( - p^* E[T] (\lambda - 1 - \ln \lambda) \right), where p^* = \min_k p_k = 1/n. This bound, which leverages the minimum success probability, yields exponential decay in deviations scaled by E[T], confirming strong concentration even for moderate \lambda > 1. Similar Chernoff-style arguments apply to the lower tail, providing bounds of the form P(T < (1 - \delta) E[T]) \leq \exp( - \delta^2 E[T] / (2 + 2\delta) ) for \delta \in (0,1), though the asymmetry of geometric distributions makes the lower tail lighter and easier to bound.
For deviations t > 0, more refined explicit bounds can be obtained via Bernstein's inequality applied to the sum of bounded-moment random variables, yielding P(T > E[T] + t) \leq \exp\left( -\frac{t^2}{2 \operatorname{Var}(T) + (2/3) M t} \right), where M is a bound on the maximum influence of each T_k (often taken as O(n \ln n) for practical truncation). This adapts the standard Bernstein form to the non-identical geometrics by using the total variance \operatorname{Var}(T) \approx n^2 \pi^2 / 6 and accounts for the heavier tails of individual terms, providing a sub-exponential decay that improves on Chebyshev's polynomial bound P(|T - E[T]| > t) \leq \operatorname{Var}(T)/t^2. These bounds highlight the problem's suitability for applications requiring high-probability guarantees, such as algorithm analysis and randomized caching.
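The quality of the Gumbel approximation at moderate n can be examined directly by comparing it with the exact inclusion-exclusion tail. This sketch (helper names ours) evaluates both at the Erdős-Rényi scaling point m = n \ln n + cn:

```python
import math
from math import comb

def tail_exact(n, m):
    """Exact P(T > m) from inclusion-exclusion."""
    return sum((-1) ** (k - 1) * comb(n, k) * ((n - k) / n) ** m
               for k in range(1, n + 1))

# At m = n ln n + c n the exact cdf should approach exp(-e^{-c}).
n, c = 20, 0.0
m = round(n * math.log(n) + c * n)
exact_cdf = 1.0 - tail_exact(n, m)
gumbel_cdf = math.exp(-math.exp(-c))
print(exact_cdf, gumbel_cdf)  # already close at n = 20
```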

Higher Moments and Inequalities

The higher moments of the number of trials T_N required to collect all N coupons can be derived using generating functions or recursive relations based on the phase-wise decomposition of the process into geometric waiting times. For fixed r \geq 1, the asymptotics of the rising moments \mathbb{E}[T_N (T_N - 1) \cdots (T_N - r + 1)] as N \to \infty are dominated by the contributions from the later waiting times, particularly the final geometric phase with success probability 1/N, which has heavy tails relative to earlier phases. Doumas and Papanicolaou (2013) provide explicit asymptotic expansions for these moments under both uniform and non-uniform coupon probabilities, applicable to a broad class of distributions where the probabilities are bounded away from zero and have bounded ratios. The random variable T_N is a sum of N independent but non-identically distributed geometric random variables W_k \sim \mathrm{Geo}((N - k + 1)/N) for k = 1, \dots, N, each of which satisfies a sub-exponential tail condition with parameters scaling as O(N/k). This structure allows the application of Bernstein's inequality to bound deviations \mathbb{P}(|T_N - \mathbb{E}[T_N]| \geq t), yielding exponential concentration around the mean with rate depending on the variance proxy \sum_k (N/k)^2 \sim N^2 \pi^2/6. Vershynin (2018) establishes the general Bernstein inequality for sums of independent sub-exponential variables, directly applicable here since the Orlicz norm \|W_k\|_{\psi_1} = O(N/k). Martingale methods offer an alternative for deriving inequalities and bounds on overshoot, defined as the excess number of trials beyond a target threshold in the final phase. By constructing Doob martingales based on the number of distinct coupons collected and applying optional stopping theorems, one obtains bounds and tail estimates for T_N. Brown, Peköz, and Ross (2008) develop this approach to compute exact expectations and concentration via martingale corollaries, providing sharper controls on overshoot compared to direct sum decompositions.
As of 2025, recent refinements in large deviation inequalities for T_N focus on structured variants and precise constants in the rate functions, building on the double-exponential tails from the Gumbel limit. Falgas-Ravry, Larsson, and Markström (2020) derive concentration bounds with optimized constants for graph-structured coupon collections, improving deviation probabilities for covering times. The asymptotic distribution (T_N - N \log N)/N \to_d G, where G follows a Gumbel distribution with cdf \exp(-e^{-x}), implies fixed higher-order shape statistics in the limit: skewness \approx 1.1395 and kurtosis \approx 5.4, dominated by the extreme-value nature of the last phases. Erdős and Rényi (1961) originated the Gumbel limit, with subsequent works confirming the moment implications.
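The quoted shape constants follow directly from the standard Gumbel distribution, whose skewness is 12\sqrt{6}\,\zeta(3)/\pi^3 and whose excess kurtosis is 12/5. A quick numerical check (the series truncation for \zeta(3) is our choice):

```python
import math

# Skewness of the Gumbel law: 12 * sqrt(6) * zeta(3) / pi^3;
# its excess kurtosis is 12/5, so the full kurtosis is 3 + 12/5.
zeta3 = sum(1.0 / k**3 for k in range(1, 200_000))  # ~ Apery's constant
skewness = 12 * math.sqrt(6) * zeta3 / math.pi ** 3
kurtosis = 3 + 12 / 5
print(round(skewness, 4), kurtosis)  # about 1.1395 and 5.4
```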

Generalizations and Variants

Unequal Probabilities

In the generalization of the coupon collector's problem to unequal probabilities, there are n coupon types with respective probabilities p_1, \dots, p_n > 0 satisfying \sum_{i=1}^n p_i = 1. Each trial independently yields type i with probability p_i, and T denotes the number of trials required to obtain at least one of each type. The waiting times between new coupons can be viewed sequentially, but because the success probability at each stage depends on the specific set of missing types, which evolves randomly, the expectation E[T] does not simplify to a straightforward sum over individual p_i as in the uniform case. A closed-form expression for E[T] arises from inclusion-exclusion applied to the tail probabilities: E[T] = \sum_{\substack{S \subseteq \{1, \dots, n\} \\ S \neq \emptyset}} (-1)^{|S| + 1} \frac{1}{\sum_{i \in S} p_i}. This formula, equivalent to summing over the subsets of "missing" types with alternating signs, provides an exact computation, though it requires 2^n - 1 terms and is practical only for small n. An alternative representation uses Poissonization, embedding the draws in a rate-1 Poisson process so that each type arrives independently; the expected time then becomes the integral E[T] = \int_0^\infty \left(1 - \prod_{i=1}^n (1 - e^{-p_i t})\right) \, dt, which is exact for the discrete process by Wald's identity and facilitates asymptotic analysis as n \to \infty. When the probabilities are nearly uniform, with p_i \approx 1/n for all i, the expression reduces to the classical expectation n H_n \approx n \ln n + \gamma n + O(1), where H_n is the nth harmonic number and \gamma is the Euler-Mascheroni constant. In skewed distributions, where some p_i are much smaller than others, the expectation is dominated by the time to collect the rare types; for instance, if the minimal p_{\min} = \min_i p_i \ll 1/n, then E[T] \sim 1/p_{\min} as the process spends most time awaiting the scarcest coupons.
Upper and lower bounds often leverage the fact that E[T] \leq \sum_{i=1}^n 1/p_i, with the sum providing a loose but simple overestimate emphasizing the impact of low-probability events. The variance \mathrm{Var}(T) admits a similar but more involved decomposition, expressible via sums over tail probabilities P(T \geq k) weighted by (2k - 1), leading to coupled terms across subsets that do not simplify as neatly as for the uniform case. Explicit computation requires extensions of the inclusion-exclusion approach, and in the skewed case, the variance is likewise dominated by the rare coupons, scaling roughly as (1/p_{\min})^2. Applications arise in scenarios with power-law (Zipf-like) probability distributions, such as ecological models of host-parasitoid interactions where host types have heterogeneous abundances; here, the expected search time to encounter all types highlights how rare hosts prolong collection, informing sampling strategies.
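Both representations of E[T] can be evaluated and cross-checked on a small example. In this sketch, `expected_trials_ie` and `expected_trials_integral` are our own helpers, and the midpoint-rule discretization and integration cutoff are illustrative choices:

```python
import math
from itertools import combinations

def expected_trials_ie(p):
    """E[T] by inclusion-exclusion over nonempty subsets of coupon types."""
    total = 0.0
    for r in range(1, len(p) + 1):
        for subset in combinations(p, r):
            total += (-1) ** (r + 1) / sum(subset)
    return total

def expected_trials_integral(p, upper=200.0, steps=200_000):
    """E[T] = integral of 1 - prod_i (1 - e^{-p_i t}) dt, midpoint rule."""
    dt = upper / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * dt
        prod = 1.0
        for pi in p:
            prod *= 1.0 - math.exp(-pi * t)
        total += (1.0 - prod) * dt
    return total

p = [0.5, 0.3, 0.2]
print(expected_trials_ie(p), expected_trials_integral(p))  # both ~ 6.6548
```

In the uniform case p_i = 1/n, either routine reproduces n H_n, providing a further consistency check.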

Multiple Copies and Collectors

In the multiple copies variant of the coupon collector's problem, the goal is to obtain at least h_i copies of each coupon type i = 1, \dots, n, where the h_i may differ across types and coupons are drawn uniformly at random with replacement. The process can be decomposed into stages corresponding to achieving successive copy levels across all types, where the waiting time in each stage is governed by a success probability proportional to the number of types still needing an additional copy at that level. This decomposition allows computation of the expected number of trials E[T] as the sum of these stage expectations, though dependencies between stages complicate exact closed forms beyond the classical single-copy case. For the uniform case where h_i = h for all i and fixed h as n \to \infty, the asymptotic expectation is E[T] \sim n \ln n + (h-1) n \ln \ln n + O(n). This reflects the leading n \ln n term from the initial set collection, with subsequent copies adding a subleading \ln \ln n correction due to the evolving distribution of copy counts. A prominent special case is the double Dixie cup problem, introduced by Newman and Shepp, which requires at least two copies of each of n types. Here, the asymptotic expectation is E[T] = n \ln n + n \ln \ln n + \gamma n + o(n), where \gamma \approx 0.57721 is the Euler-Mascheroni constant; this extends the classical expectation by accounting for the additional effort to double the collection. The multiple collectors variant extends the problem to k independent collectors, each drawing coupons uniformly and simultaneously in discrete time steps. One formulation seeks the expected time until the union of all collections covers every type, equivalent to a grouped coupon collector where each time step yields k independent draws; for k \ll n, this expected time is approximately \frac{n \ln n}{k} + O\left(\frac{n}{k}\right).
Alternatively, the time until every collector has a complete single set is the expected maximum of k independent classical coupon collector times, yielding \Theta(n \log (k n)) samples per collector with high probability when no sharing occurs. Recent variants (2020–2025) introduce additional structure. The labeled coupon collector problem (LCCP) distinguishes multiple copies of the same type via unique labels, requiring draws of ordered subsets of size k to identify all labels; for unknown label sets and k=2, the expected draws satisfy E[Q] \approx \frac{n}{2} H_n, where H_n is the nth harmonic number. The k-LCCP generalizes this to batched draws of k coupons per trial, with expected values computed via Markov chains and separating systems, often requiring \lceil n/k \rceil H_{n/k} draws asymptotically for coverage. These variants find applications in load balancing, where multiple copies model task replication across servers to minimize maximum load, with expected balancing time scaling as n \ln n + n \ln \ln n + O(n) for h=2 redundancy. In caching systems, requiring multiple copies ensures redundancy and fault tolerance, and the expected number of trials to achieve h replicas per item informs eviction policies and hit rates under uniform access.
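The h-copy asymptotics can be probed by simulation. In this sketch, `trials_for_h_copies` is our own helper, and the comparison uses the Newman-Shepp approximation n \ln n + n \ln\ln n + \gamma n for h = 2; since the o(n) corrections are ignored, only rough agreement should be expected at moderate n:

```python
import math
import random

def trials_for_h_copies(n, h, rng):
    """Draw uniform coupons until every type has at least h copies."""
    counts = [0] * n
    remaining = n  # types still below h copies
    draws = 0
    while remaining:
        i = rng.randrange(n)
        counts[i] += 1
        if counts[i] == h:
            remaining -= 1
        draws += 1
    return draws

n, h, runs = 200, 2, 2000
rng = random.Random(7)
est = sum(trials_for_h_copies(n, h, rng) for _ in range(runs)) / runs
approx = n * math.log(n) + n * math.log(math.log(n)) + 0.5772 * n
print(est, approx)  # rough agreement for moderate n
```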