In probability theory, a random variable is a function that assigns a real number to each outcome in the sample space of a random experiment, enabling the quantification of uncertainty through numerical values rather than qualitative descriptions. This mapping allows probabilities to be defined over the possible values of the variable, facilitating analysis in fields such as statistics, finance, and engineering.[1]

Random variables are broadly classified into two types: discrete and continuous. A discrete random variable takes on a countable number of distinct values, such as the number of heads in a series of coin flips, where the possible outcomes form a finite or countably infinite set.[2] Its probability distribution is described by a probability mass function (PMF), which assigns a probability to each possible value, with the sum of these probabilities equaling 1.[2] In contrast, a continuous random variable can assume any value within a continuous range, such as the exact time until an event occurs, representing an uncountably infinite set of outcomes.[3] The distribution of a continuous random variable is characterized by a probability density function (PDF), where probabilities are computed as integrals over intervals, and the total area under the PDF equals 1.[4]

Key properties of random variables include the expected value (or mean), which represents the long-run average value of the variable over many repetitions of the experiment, and the variance, which measures the spread or dispersion of the variable's values around the mean.[2] For a discrete random variable X, the expected value is E(X) = \sum x \cdot P(X = x), while the variance is Var(X) = E[(X - E(X))^2] = E(X^2) - [E(X)]^2.[2] These properties extend to continuous cases via integrals, providing foundational tools for deriving further statistics like standard deviation and for applications in modeling real-world phenomena.[5] Random variables also form the basis for joint distributions when multiple variables are considered together, allowing analysis of dependence and covariance in multivariate settings.[6]
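The summary formulas above can be checked directly on a small example. The following minimal sketch (Python, standard library only; the three-flip experiment and fair-coin probabilities are illustrative assumptions) enumerates the sample space of three coin flips and computes E(X) and Var(X) for the number of heads.

```python
from itertools import product

# Sample space: all sequences of three fair coin flips, each with probability 1/8.
outcomes = list(product("HT", repeat=3))
prob = 1 / len(outcomes)

# Random variable X: number of heads in each outcome sequence.
X = {omega: omega.count("H") for omega in outcomes}

# E(X) = sum over outcomes of X(omega) * P(omega)
mean = sum(X[omega] * prob for omega in outcomes)

# Var(X) = E[X^2] - (E[X])^2
second_moment = sum(X[omega] ** 2 * prob for omega in outcomes)
variance = second_moment - mean ** 2

print(mean, variance)  # 1.5 and 0.75 for three fair flips
```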
Basic Concepts
Definition
In the early 20th century, the concept of a random variable emerged as a key element in the axiomatization of probability theory. The Italian mathematician Francesco Paolo Cantelli introduced the term variabile casuale (random variable) around 1913 in his work on probability limits, providing an early formal recognition of variables whose values depend on chance outcomes.[7] This idea was further developed through Andrey Kolmogorov's seminal 1933 monograph Grundbegriffe der Wahrscheinlichkeitsrechnung (Foundations of the Theory of Probability), which established the modern axiomatic framework for probability and defined random variables rigorously within it.[8] The English term "random variable" was used by J. V. Uspensky in his 1937 textbook Introduction to Mathematical Probability.

Intuitively, a random variable X assigns a numerical value to each possible outcome of a random experiment, thereby quantifying uncertain phenomena in a measurable way. For instance, in an experiment consisting of tossing a fair coin three times, the random variable X might represent the number of heads obtained, mapping each outcome sequence (e.g., HHT) to the integer 2. This abstraction allows probabilities to be associated with the values taken by X rather than directly with the underlying outcomes.

Formally, a random variable X is defined as a measurable function X: \Omega \to \mathbb{R}, where (\Omega, \mathcal{F}, P) is a probability space.[8] Here, \Omega is the sample space, the set of all possible outcomes of the random experiment; \mathcal{F} is a \sigma-algebra of subsets of \Omega, known as the event space, which specifies the collection of measurable events; and P is a probability measure on \mathcal{F} that assigns a value between 0 and 1 to each event, satisfying Kolmogorov's axioms of probability (non-negativity, normalization, and countable additivity).[8]

The measurability of X ensures compatibility with the probability structure, requiring that for every x \in \mathbb{R}, the preimage set \{\omega \in \Omega : X(\omega) \leq x\} belongs to \mathcal{F}.[8] This condition guarantees that events defined in terms of X, such as \{X \leq x\}, are well-defined and can be assigned probabilities under P. Random variables may take discrete or continuous values, but the general definition encompasses both cases.
Probability Space
A probability space provides the mathematical foundation for defining random variables and modeling uncertainty in a rigorous manner. It is formally defined as a triple (\Omega, \mathcal{F}, P), where \Omega is the sample space consisting of all possible outcomes of a random experiment, \mathcal{F} is a sigma-algebra of subsets of \Omega (known as events) that is closed under countable unions, intersections, and complements, and P is a probability measure assigning to each event in \mathcal{F} a value between 0 and 1, with the normalization condition P(\Omega) = 1.[9][10]

The axioms governing the probability measure P were established by Andrey Kolmogorov in his seminal 1933 work, providing an axiomatic basis for probability theory. These axioms are: (1) non-negativity, stating that P(A) \geq 0 for every event A \in \mathcal{F}; (2) normalization, P(\Omega) = 1; and (3) countable additivity, which asserts that if \{A_i\}_{i=1}^\infty is a countable collection of pairwise disjoint events in \mathcal{F}, then P\left( \bigcup_{i=1}^\infty A_i \right) = \sum_{i=1}^\infty P(A_i). These axioms ensure that probabilities behave consistently for complex events built from simpler ones.[9][11]

Examples of probability spaces illustrate their versatility across discrete and continuous settings. In a finite discrete case, such as a fair coin toss, the sample space is \Omega = \{H, T\} (heads or tails), the sigma-algebra \mathcal{F} is the power set of \Omega with four elements, and P assigns equal probability 1/2 to each singleton event.[12] For a continuous case, consider a uniform distribution over the unit interval, where \Omega = [0,1], \mathcal{F} is the Borel sigma-algebra generated by open intervals, and P is the Lebesgue measure restricted to [0,1], so P([a,b]) = b - a for 0 \leq a \leq b \leq 1.[10]

Every random variable is defined on a probability space (\Omega, \mathcal{F}, P), which guarantees its measurability with respect to \mathcal{F} and allows the assignment of probabilities to the variable's outcomes.[11]
Types of Random Variables
Discrete Random Variables
A discrete random variable is a random variable whose range, or set of possible values, is countable, meaning it consists of either a finite number of distinct outcomes or a countably infinite number, such as the non-negative integers.[2] Unlike more general random variables defined on a probability space, discrete random variables assign positive probabilities only to these countable points, with the total probability summing to 1 across the entire support.[13]

The probability mass function (PMF) of a discrete random variable X, denoted p_X(x) or simply p(x), provides the probability that X takes a specific value x in its range, so p(x) = P(X = x).[14] This function satisfies two key properties: p(x) \geq 0 for all x in the range, and the sum of p(x) over all possible x equals 1, i.e., \sum_{x} p(x) = 1.[14] The support of X, denoted \operatorname{supp}(X), is the smallest set containing all x such that p(x) > 0, ensuring probabilities are concentrated only on these points.[15]

Common examples of discrete random variables include the Bernoulli distribution, which takes only the values 0 or 1, and the Poisson distribution, which takes values in the non-negative integers \{0, 1, 2, \dots\}.[16][17]

To compute probabilities for intervals, the probability that X falls between integers a and b (inclusive), where a \leq b, is given by summing the PMF over those values: P(a \leq X \leq b) = \sum_{x=a}^{b} p(x).[14] This summation leverages the countable nature of the support, allowing exact calculation via the discrete probabilities.[14]
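As a concrete illustration of interval probabilities computed from a PMF, the short sketch below (Python, standard library only; the Poisson rate \lambda = 2 is an arbitrary illustrative choice) sums p(x) over the integers in an interval and checks the normalization property numerically.

```python
import math

def poisson_pmf(k, lam):
    # p(k) = e^{-lam} * lam^k / k!
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 2.0

# P(1 <= X <= 3) by summing the PMF over the integers in the interval
prob = sum(poisson_pmf(k, lam) for k in range(1, 4))

# Normalization check: the PMF sums (numerically) to 1 over a long prefix of the support
total = sum(poisson_pmf(k, lam) for k in range(100))

print(round(prob, 4), round(total, 10))
```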
Continuous Random Variables
A continuous random variable is defined as a random variable whose possible values form an uncountable set, such as the real line \mathbb{R} or a continuous interval within it, with the probability of the variable equaling any specific point being zero: P(X = x) = 0 for every x in the support.[18][3] This contrasts with discrete random variables, which assign positive probabilities to countable points.[3]

Probabilities for continuous random variables are determined over intervals rather than at individual points, reflecting the infinite and uncountable nature of their range.[19] Specifically, the probability P(a \leq X \leq b) for an interval [a, b] is obtained by integrating a non-negative density function over that interval, ensuring the total probability across the entire support equals 1.[18] This integral-based approach allows for modeling phenomena with inherently continuous outcomes, such as physical measurements.[3]

Although continuous random variables lack positive probability at single points, they can approximate discrete distributions in limiting scenarios, such as when the number of discrete categories increases indefinitely.[20] Representative examples include the uniform distribution on the interval [0,1], which assigns equal likelihood to all points within a bounded continuous range, and the exponential distribution, commonly used to model waiting times between events in continuous-time processes.[18][21]

A defining mathematical property in standard usage is that the cumulative distribution function (CDF) of an absolutely continuous random variable—which is the typical sense of "continuous" in introductory contexts—is absolutely continuous with respect to Lebesgue measure, meaning it can be expressed as the integral of a density function and possesses no jumps. Singular continuous distributions, discussed separately, represent a more advanced case without densities.[18][22]
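The interval-probability idea can be made concrete numerically. The sketch below (Python, standard library only; the exponential rate and interval endpoints are illustrative assumptions) integrates a density over [a, b] with a simple midpoint rule and compares the result with the closed-form value; the zero probability of any single point corresponds to an integral over a degenerate interval.

```python
import math

lam = 0.5  # illustrative rate for an exponential density

def pdf(x):
    return lam * math.exp(-lam * x)

def integrate(f, a, b, n=100_000):
    # simple midpoint rule for the integral of f over [a, b]
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

a, b = 1.0, 3.0
numeric = integrate(pdf, a, b)
exact = math.exp(-lam * a) - math.exp(-lam * b)  # closed-form P(a <= X <= b)
print(round(numeric, 6), round(exact, 6))
# P(X = x) = 0 for any single point: the integral over [x, x] has length zero.
```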
Singular and Mixed Random Variables
In probability theory, a singular continuous random variable is characterized by a cumulative distribution function (CDF) that is continuous everywhere but not absolutely continuous with respect to Lebesgue measure, implying the absence of a probability density function while lacking discrete jumps.[23] This means the distribution is supported on a set of Lebesgue measure zero, yet it assigns positive probability to intervals without concentrating mass at points.[24] A canonical example is the Cantor distribution, whose CDF is the Cantor function—also known as the devil's staircase—which is constant on the intervals removed in the construction of the Cantor set and increases continuously from 0 to 1 over [0,1], with support confined to the Cantor set of measure zero.[23]

Mixed random variables arise when the distribution combines discrete and continuous components, resulting in a CDF that exhibits jumps at discrete points alongside regions of continuous increase.[25] For instance, consider a random variable X with P(X=0) = 0.5 and, conditional on X > 0, X uniform on (0,1] with probability 0.5; here, the distribution places a point mass at 0 while spreading the remaining probability continuously over an interval. In general, the CDF of a mixed random variable can be expressed as

F(x) = \sum_{y \leq x} p_y + \int_{-\infty}^x f(t) \, dt + F_s(x),

where \sum p_y captures the discrete jumps, \int f(t) \, dt the absolutely continuous part, and F_s(x) the singular continuous component, though the latter is often absent in practical mixed cases.[26]

The Lebesgue decomposition theorem provides the foundational result for classifying all probability distributions on the real line, stating that any such distribution \mu can be uniquely decomposed as \mu = \mu_d + \mu_{ac} + \mu_s, where \mu_d is the discrete (atomic) part, \mu_{ac} is absolutely continuous with respect to Lebesgue measure, and \mu_s is singular continuous.[27] This theorem underscores that singular continuous distributions form a distinct class, separate from both discrete and absolutely continuous types.[27]

In applications, singular and fully mixed distributions (including singular parts) are rare, as most probabilistic models in statistics and engineering rely on purely discrete or absolutely continuous random variables for tractability; singular examples like the Cantor distribution primarily serve theoretical purposes in measure theory and fractal analysis.[23]
Distribution Functions
Cumulative Distribution Function
The cumulative distribution function (CDF) of a real-valued random variable X, denoted F_X(x), is defined as F_X(x) = P(X \leq x) for all x \in \mathbb{R}.[28][29][30] This function provides a complete description of the probability distribution of X, applicable to discrete, continuous, or mixed cases.[30]

The CDF possesses several fundamental properties: it is non-decreasing, meaning F_X(a) \leq F_X(b) whenever a < b; right-continuous, so F_X(x) = \lim_{y \to x^+} F_X(y); and it satisfies the boundary conditions \lim_{x \to -\infty} F_X(x) = 0 and \lim_{x \to \infty} F_X(x) = 1.[28][29][30] These ensure that F_X(x) maps the real line to the interval [0, 1] in a manner consistent with the probability axioms.[28]

Probabilities over intervals can be computed directly from the CDF: for any a < b, P(a < X \leq b) = F_X(b) - F_X(a).[29][30] This property allows the CDF to specify all finite-dimensional distributions, thereby uniquely determining the law (or distribution) of X.[28][30]

The form of the CDF reveals the type of random variable: discontinuities or jumps correspond to discrete components, where the jump size at a point equals the probability mass there, while continuous and differentiable portions indicate absolutely continuous parts.[30]

The quantile function, or generalized inverse of the CDF, is defined for u \in (0,1) as F_X^{-1}(u) = \inf\{x : F_X(x) \geq u\}, providing the smallest x such that the CDF reaches at least u.[28] This function is non-decreasing and left-continuous, facilitating the generation of random variables from uniform distributions via the inverse transform sampling method.[28]
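Inverse transform sampling follows directly from the quantile function. The sketch below (Python, standard library only; the exponential distribution with rate 1 is an illustrative choice) draws uniform variates, maps them through the closed-form quantile function F_X^{-1}(u) = -\ln(1-u), and checks the empirical CDF against F_X.

```python
import math
import random

def exponential_quantile(u, lam=1.0):
    # Generalized inverse of F(x) = 1 - exp(-lam * x): F^{-1}(u) = -ln(1 - u) / lam
    return -math.log(1.0 - u) / lam

random.seed(0)
samples = [exponential_quantile(random.random()) for _ in range(100_000)]

# The empirical CDF at x should approximate F(x) = 1 - exp(-x)
x = 1.0
empirical = sum(s <= x for s in samples) / len(samples)
print(round(empirical, 3), round(1 - math.exp(-x), 3))
```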
Probability Mass and Density Functions
For discrete random variables, the probability mass function (PMF), denoted p(x), assigns to each possible value x in the support the probability p(x) = P(X = x) \geq 0.[14] This function fully characterizes the distribution, as the probability of X taking any finite or countable set of values A is P(X \in A) = \sum_{x \in A} p(x).[31] The PMF relates to the cumulative distribution function (CDF) F(x) = P(X \leq x) through the jumps in the CDF, specifically p(x) = F(x) - F(x^-), where F(x^-) = \lim_{y \uparrow x} F(y) denotes the left-hand limit at x.[32] A fundamental property is normalization: \sum_{x} p(x) = 1, with the sum taken over the countable support of X.[14]

The PMF enables computation of expectations for functions of the random variable. For a measurable function g, the expectation is E[g(X)] = \sum_{x} g(x) p(x), provided the sum converges absolutely. This includes key quantities like the mean E[X] = \sum_{x} x p(x), assuming finite support or appropriate convergence.

For absolutely continuous random variables, the probability density function (PDF), denoted f(x), provides a density with respect to Lebesgue measure such that probabilities are given by integrals: P(a < X \leq b) = \int_{a}^{b} f(x) \, dx.[33] The PDF is obtained from the CDF as its derivative where differentiable: f(x) = \frac{d}{dx} F(x).[34] Conversely, the CDF is recovered via F(x) = \int_{-\infty}^{x} f(t) \, dt.[33] The PDF satisfies f(x) \geq 0 for all x and the normalization condition \int_{-\infty}^{\infty} f(x) \, dx = 1.[33]

Expectations using the PDF follow E[g(X)] = \int_{-\infty}^{\infty} g(x) f(x) \, dx, again assuming absolute integrability. For instance, the mean is E[X] = \int_{-\infty}^{\infty} x f(x) \, dx.

While the PDF uniquely determines the distribution for absolutely continuous cases, the density itself is unique only up to sets of Lebesgue measure zero: it can be modified on a null set without altering any probability computed by integration. Representations involving generalized functions like Dirac deltas are sometimes used to write point masses in density form, but such expressions are not ordinary PDFs.[35] Singular distributions, which have a continuous CDF but are not absolutely continuous with respect to Lebesgue measure, admit no ordinary PDF.[36]
Examples
Discrete Examples
A Bernoulli random variable is the simplest discrete random variable, taking only two possible values: 1 (representing success) with probability p and 0 (representing failure) with probability 1 - p, where 0 < p < 1.[37] The probability mass function (PMF) is given by P(X = x) = p^x (1 - p)^{1 - x}, \quad x \in \{0, 1\}.[37] For example, if p = 0.6, then P(X = 1) = 0.6 and P(X = 0) = 0.4.[37]

The binomial random variable generalizes the Bernoulli by representing the number of successes in n independent Bernoulli trials, each with success probability p.[38] Its support is the integers k = 0, 1, \dots, n, and the PMF is P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k},[38] where \binom{n}{k} = \frac{n!}{k!(n - k)!} is the binomial coefficient counting the number of ways to choose k successes out of n trials.[38] Thus, a binomial random variable is the sum of n independent Bernoulli random variables with the same p.[38]

A classic binomial example is the number of heads in n tosses of a fair coin, where success is heads with p = 0.5.[39] For n = 3, the probability of exactly 2 heads is P(X = 2) = \binom{3}{2} (0.5)^2 (0.5)^{1} = 3 \times 0.125 = 0.375 = \frac{3}{8}.[39] The expected value (mean) is np = 3 \times 0.5 = 1.5.[39]

Another common discrete example is the outcome of a fair six-sided die roll, which follows a discrete uniform distribution on the set \{1, 2, 3, 4, 5, 6\}, with each outcome equally likely.[40] The PMF is P(X = x) = \frac{1}{6}, \quad x = 1, 2, \dots, 6.[40] This distribution assigns equal probability 1/N to each of N possible outcomes.[40]
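These discrete examples are easy to verify numerically. The sketch below (Python, standard library only; the simulation size is arbitrary) evaluates the binomial PMF for n = 3 and p = 0.5, recovers the mean np, and cross-checks P(X = 2) = 3/8 by simulation.

```python
import math
import random

n, p = 3, 0.5

def binom_pmf(k):
    # P(X = k) = C(n, k) p^k (1 - p)^(n - k)
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

print(binom_pmf(2))                                  # 0.375, i.e. 3/8
print(sum(k * binom_pmf(k) for k in range(n + 1)))   # mean = n * p = 1.5

# Monte Carlo check: simulate three fair coin flips many times
random.seed(1)
trials = 200_000
hits = sum(sum(random.random() < p for _ in range(n)) == 2 for _ in range(trials))
print(hits / trials)  # close to 0.375
```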
Continuous Examples
The uniform distribution on the interval [a, b], where a < b, serves as a foundational example of a continuous random variable, representing equal likelihood across a finite range. Its probability density function (PDF) is defined as f(x) = \frac{1}{b - a}, \quad a \leq x \leq b, and f(x) = 0 otherwise.[41] The corresponding cumulative distribution function (CDF) is F(x) = \frac{x - a}{b - a}, \quad a \leq x \leq b, with F(x) = 0 for x < a and F(x) = 1 for x > b.[42] For the standard uniform distribution on [0, 1], the probability P(0.2 < X < 0.5) is computed by integrating the PDF over the interval, yielding 0.3.[41]

The exponential distribution, parameterized by rate \lambda > 0, exemplifies continuous random variables in modeling waiting times until the first event in a Poisson process. Its PDF is f(x) = \lambda e^{-\lambda x}, \quad x \geq 0, and f(x) = 0 otherwise.[43] The probability that the waiting time exceeds t \geq 0 is P(X > t) = e^{-\lambda t}.[44] For \lambda = 1, the expected value, or mean, is 1/\lambda = 1.[44]

The normal distribution, with mean \mu and variance \sigma^2 > 0, is a cornerstone continuous distribution characterized by its symmetric bell-shaped curve. Its PDF is f(x) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right), \quad -\infty < x < \infty.[45] Under this distribution, approximately 68% of the values fall within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations.[46]
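The three continuous examples can be checked from their closed-form CDFs. The sketch below (Python, standard library only; math.erf is used for the standard normal CDF) reproduces the uniform interval probability, the exponential tail probability, and the 68-95-99.7 percentages.

```python
import math

# Uniform(0, 1): P(0.2 < X < 0.5) = 0.5 - 0.2
print(0.5 - 0.2)

# Exponential(lam = 1): P(X > t) = exp(-t); mean = 1 / lam
t = 2.0
print(math.exp(-t))

# Normal(mu, sigma^2): fraction within k standard deviations, via the standard normal CDF
def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

for k in (1, 2, 3):
    print(k, round(std_normal_cdf(k) - std_normal_cdf(-k), 4))  # ~0.6827, 0.9545, 0.9973
```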
Measure-Theoretic Foundations
Probability Spaces and Measurable Functions
A probability space is formally defined as a triple (\Omega, \mathcal{F}, P), where \Omega is a nonempty set serving as the sample space, \mathcal{F} is a \sigma-algebra of subsets of \Omega (the event space), and P: \mathcal{F} \to [0,1] is a probability measure satisfying the Kolmogorov axioms: P(\Omega) = 1, P(A) \geq 0 for all A \in \mathcal{F}, and for any countable collection of pairwise disjoint events \{A_n\}_{n=1}^\infty \subset \mathcal{F}, P\left(\bigcup_{n=1}^\infty A_n\right) = \sum_{n=1}^\infty P(A_n).[11] In advanced treatments, the space is often taken to be complete, meaning that if N \in \mathcal{F} with P(N) = 0 and B \subset N, then B \in \mathcal{F} and P(B) = 0; this completion ensures all subsets of null sets are measurable and assigned measure zero.[47]

Measurable functions provide the bridge between the abstract probability space and numerical outcomes. A function X: \Omega \to \mathbb{R} is \mathcal{F}/\mathcal{B}(\mathbb{R})-measurable, where \mathcal{B}(\mathbb{R}) denotes the Borel \sigma-algebra on \mathbb{R}, if for every Borel set B \in \mathcal{B}(\mathbb{R}), the preimage X^{-1}(B) = \{\omega \in \Omega : X(\omega) \in B\} \in \mathcal{F}.[48] Equivalently, it suffices to check preimages of a generating class: X is measurable if and only if X^{-1}((-\infty, x]) \in \mathcal{F} for every x \in \mathbb{R}. The Borel \sigma-algebra \mathcal{B}(\mathbb{R}) is the smallest \sigma-algebra containing all open sets in \mathbb{R}, generated specifically by the collection of all open intervals (a, b) for a < b in \mathbb{R}; this generation ensures that continuity and other topological properties align with measurability.

Any nonnegative measurable function f: \Omega \to [0, \infty] can be approximated pointwise by a sequence of simple functions, which are measurable functions taking only finitely many finite values.[49] Specifically, there exists a sequence \{\phi_n\}_{n=1}^\infty of simple functions such that \phi_n(\omega) \uparrow f(\omega) for all \omega \in \Omega, facilitating integration and analysis in the measure-theoretic framework.[50]

This measure-theoretic formulation extends the basic Kolmogorov axioms to handle complex scenarios, such as infinite product spaces for sequences of independent identically distributed (i.i.d.) random variables, via the Kolmogorov extension theorem, which constructs a consistent probability measure on the infinite-dimensional product \sigma-algebra from finite-dimensional marginals.[51]
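The simple-function approximation has a standard explicit construction. The sketch below (Python, standard library only; the sample function f and evaluation point are illustrative assumptions) implements the usual dyadic truncation \phi_n = \min(\lfloor 2^n f \rfloor / 2^n, n) and shows the values increasing toward f(\omega).

```python
def simple_approximation(f, n):
    # Standard dyadic approximation: phi_n = min(floor(2^n f) / 2^n, n),
    # a simple function taking finitely many values that increases to f pointwise.
    def phi(omega):
        value = f(omega)
        return min(int(value * 2 ** n) / 2 ** n, n)
    return phi

# Illustrative nonnegative measurable function on Omega = [0, 1]
f = lambda omega: omega ** 2 + 1.0

omega = 0.7
for n in (1, 2, 4, 8, 16):
    print(n, simple_approximation(f, n)(omega))  # increases toward f(0.7) = 1.49
```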
Real-Valued Random Variables
In measure-theoretic probability, a real-valued random variable X on a probability space (\Omega, \mathcal{F}, P) is defined as a measurable function X: \Omega \to \mathbb{R}, such that for every Borel set B \in \mathcal{B}(\mathbb{R}), the preimage X^{-1}(B) = \{\omega \in \Omega : X(\omega) \in B\} belongs to \mathcal{F}.[52][53] This measurability ensures that events defined by the random variable, such as \{X \leq x\} for x \in \mathbb{R}, are measurable and can thus be assigned probabilities under P.[52][53]

The random variable X induces a probability distribution \mu_X on the measurable space (\mathbb{R}, \mathcal{B}(\mathbb{R})), called the law or distribution of X, given by \mu_X(B) = P(X^{-1}(B)) = P(\{\omega \in \Omega : X(\omega) \in B\}) for all Borel sets B \in \mathcal{B}(\mathbb{R}).[52][53] This pushforward measure \mu_X fully characterizes the probabilistic behavior of X, allowing expectations of bounded measurable functions g: \mathbb{R} \to \mathbb{R} to be computed as E[g(X)] = \int_{\mathbb{R}} g(x) \, \mu_X(dx).[53] For the expectation E[X] to exist as a real number, X must be integrable, meaning E[|X|] = \int_{\Omega} |X(\omega)| \, dP(\omega) < \infty.[52][53]

A foundational class of real-valued random variables consists of simple random variables, which take only finitely many values and can be expressed as step functions X = \sum_{i=1}^n x_i \mathbf{1}_{A_i}, where x_i \in \mathbb{R} are distinct, the A_i \subset \Omega are disjoint events in \mathcal{F} with P(A_i) > 0, and \mathbf{1}_{A_i} is the indicator function of A_i.[52][53] These simple functions form an algebra dense in the space of bounded measurable functions under pointwise convergence, facilitating approximations in integration and convergence theorems.[53]

To accommodate phenomena like unbounded growth, real-valued random variables can be extended to the extended real line \overline{\mathbb{R}} = [-\infty, \infty], where X: \Omega \to \overline{\mathbb{R}} remains measurable with respect to the Borel \sigma-field on \overline{\mathbb{R}}, provided P(|X| = \infty) = 0.[52][53] This extension preserves the induced distribution on \mathbb{R} while handling infinite values with probability zero, ensuring integrals and expectations remain well-defined when finite.[53]

The notion generalizes to random vectors, which are measurable functions X: \Omega \to \mathbb{R}^n for n \geq 2, equipped with the product Borel \sigma-field \mathcal{B}(\mathbb{R}^n), such that X^{-1}(A) \in \mathcal{F} for all A \in \mathcal{B}(\mathbb{R}^n).[52][53] The induced distribution \mu_X on (\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n)) is then \mu_X(A) = P(X^{-1}(A)) for Borel A \subset \mathbb{R}^n, capturing joint probabilistic structure.[53]
Moments and Characteristics
Expectation
In measure-theoretic probability, the expectation of a real-valued random variable X defined on a probability space (\Omega, \mathcal{F}, P) is given by the Lebesgue integral \mathbb{E}[X] = \int_{\Omega} X(\omega) \, dP(\omega), provided this integral exists in the extended real line.[54] The expectation exists if and only if X is integrable, meaning \mathbb{E}[|X|] < \infty, where absolute integrability ensures the positive and negative parts of X do not lead to infinite discrepancies.[55] Without this condition, the expectation is undefined, as seen in cases like the Cauchy distribution where the integral diverges.[56]

For practical computation, the expectation can be expressed in terms of the distribution of X. If X is discrete with probability mass function p(x), then \mathbb{E}[X] = \sum_{x} x \, p(x), where the sum is over the support of X.[57] For a continuous random variable with probability density function f(x), the expectation is \mathbb{E}[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx.[58] These formulas follow from the change of variables in the integral definition, pushing the measure forward via the distribution of X.

A key property of expectation is its linearity: for constants a, b \in \mathbb{R} and random variables X, Y, \mathbb{E}[aX + bY] = a \mathbb{E}[X] + b \mathbb{E}[Y], which holds regardless of dependence between X and Y, as long as the expectations exist.[59] This linearity simplifies computations for sums and linear combinations without requiring joint distributions.

For example, a Bernoulli random variable X with success probability p, where P(X=1)=p and P(X=0)=1-p, has expectation \mathbb{E}[X] = p.[60] Similarly, a uniform random variable on [0,1] with density f(x)=1 for x \in [0,1] has \mathbb{E}[X] = \int_0^1 x \, dx = \frac{1}{2}.[61]

For non-negative random variables X \geq 0, the expectation admits an alternative representation using the survival function: \mathbb{E}[X] = \int_0^\infty P(X > t) \, dt. This tail integral formula is particularly useful for deriving moments or bounding expectations via tail probabilities.[62]
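The tail-integral representation of the expectation can be checked numerically for a non-negative variable. The sketch below (Python, standard library only; the exponential rate and truncation point are illustrative assumptions) integrates the survival function P(X > t) and compares the result with the known mean 1/\lambda.

```python
import math

lam = 2.0  # illustrative exponential rate; the mean should be 1 / lam = 0.5

def survival(t):
    return math.exp(-lam * t)  # P(X > t) for an Exponential(lam) variable

# E[X] = integral_0^inf P(X > t) dt, truncated where the tail is negligible
n, upper = 200_000, 20.0
h = upper / n
tail_integral = sum(survival((i + 0.5) * h) for i in range(n)) * h

print(round(tail_integral, 6), 1 / lam)
```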
Variance and Covariance
The variance of a random variable X, denoted \operatorname{Var}(X), quantifies the expected squared deviation from its mean \mu = \mathbb{E}[X], serving as a measure of dispersion in the distribution. It is formally defined as \operatorname{Var}(X) = \mathbb{E}\left[(X - \mathbb{E}[X])^2\right], which can also be computed using the alternative form \operatorname{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2.[63][64] The standard deviation \sigma_X is the positive square root of the variance, \sigma_X = \sqrt{\operatorname{Var}(X)}, providing a scale in the same units as X itself.[64]

For two random variables X and Y with means \mu_X = \mathbb{E}[X] and \mu_Y = \mathbb{E}[Y], the covariance \operatorname{Cov}(X, Y) measures the joint variability around their means and is defined as \operatorname{Cov}(X, Y) = \mathbb{E}\left[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\right], equivalently expressed as \operatorname{Cov}(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y].[65] Note that \operatorname{Cov}(X, X) = \operatorname{Var}(X), linking the two concepts.[66]

Key properties include linearity in scaling: for constants a and b, \operatorname{Var}(aX + b) = a^2 \operatorname{Var}(X), reflecting that variance scales with the square of the coefficient and is invariant to shifts.[66] Covariance is bilinear: for constants a, b, c, d, \operatorname{Cov}(aX + b, cY + d) = a c \operatorname{Cov}(X, Y).

For illustration, consider a Bernoulli random variable X with success probability p, where \operatorname{Var}(X) = p(1 - p), maximized at p = 1/2.[67] Similarly, for a continuous uniform random variable on [0, 1], the variance is \operatorname{Var}(X) = 1/12.[68]
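The scaling and covariance identities are easy to confirm on simulated data. The sketch below (Python with NumPy, which is assumed to be available; sample sizes and constants are arbitrary) checks Var(X) \approx 1/12 for a Uniform(0,1) sample, Var(aX + b) = a^2 Var(X), and Cov(X, aX + b) = a Var(X).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1_000_000)  # Uniform(0, 1) sample; Var should be ~1/12

a, b = 3.0, -2.0
print(np.var(x), 1 / 12)                        # sample variance vs. theoretical value
print(np.var(a * x + b), a ** 2 * np.var(x))    # Var(aX + b) = a^2 Var(X)

# Cov(X, X) = Var(X); np.cov returns the 2x2 sample covariance matrix
y = a * x + b
cov_matrix = np.cov(x, y)
print(cov_matrix[0, 0], cov_matrix[0, 1])       # Var(X) and Cov(X, Y) = a Var(X)
```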
Higher Moments and Central Moments
Higher moments of a random variable X provide additional insights into the shape and characteristics of its distribution beyond the mean and variance. The k-th raw moment, denoted \mu_k' = \mathbb{E}[X^k], captures the expected value of X raised to the power k, serving as a foundational measure for the distribution's overall scale and location.[69] In contrast, the k-th central moment, \mu_k = \mathbb{E}[(X - \mu)^k], where \mu = \mathbb{E}[X] is the mean, shifts the focus to deviations from the mean, emphasizing spread and asymmetry.[70] These moments form a sequence that uniquely determines the distribution under certain conditions, for example when the moment-generating function exists in a neighborhood of zero.[70]

Among higher central moments, the third-order moment relates to skewness, which quantifies the asymmetry of the distribution around the mean. The skewness coefficient is defined as \gamma_1 = \frac{\mathbb{E}[(X - \mu)^3]}{\sigma^3}, where \sigma^2 = \mathrm{Var}(X) is the variance; a positive value indicates a right-tailed distribution, while a negative value suggests left-tailed asymmetry.[71] The fourth-order central moment informs the excess kurtosis, a measure of the tails' heaviness relative to a normal distribution, given by \gamma_2 = \frac{\mathbb{E}[(X - \mu)^4]}{\sigma^4} - 3; values greater than zero denote leptokurtic (heavy-tailed) distributions, and less than zero indicate platykurtic (light-tailed) ones.[71][72]

The moment-generating function (MGF) offers a compact way to encapsulate all raw moments: M(t) = \mathbb{E}[e^{tX}] = \sum_{k=0}^{\infty} \frac{\mu_k' t^k}{k!}, valid in a neighborhood of t = 0 where the series converges, allowing moments to be extracted via derivatives at t = 0.[73] For illustration, consider the standard normal distribution, which is symmetric about its mean; all odd-order central moments vanish (\mu_{2m+1} = 0 for m \geq 0), and the excess kurtosis is exactly zero, reflecting its mesokurtic nature with neither excessive peaks nor tails.[73]
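Sample versions of skewness and excess kurtosis follow directly from the central-moment definitions. The sketch below (Python with NumPy assumed; sample sizes are arbitrary) estimates \gamma_1 and \gamma_2 for a symmetric normal sample and for an exponential sample, whose theoretical values are 2 and 6 respectively.

```python
import numpy as np

def skewness(x):
    x = np.asarray(x)
    m, s = x.mean(), x.std()
    return ((x - m) ** 3).mean() / s ** 3

def excess_kurtosis(x):
    x = np.asarray(x)
    m, s = x.mean(), x.std()
    return ((x - m) ** 4).mean() / s ** 4 - 3.0

rng = np.random.default_rng(0)
normal = rng.standard_normal(1_000_000)              # skewness ~ 0, excess kurtosis ~ 0
expo = rng.exponential(scale=1.0, size=1_000_000)    # skewness ~ 2, excess kurtosis ~ 6

print(round(skewness(normal), 3), round(excess_kurtosis(normal), 3))
print(round(skewness(expo), 3), round(excess_kurtosis(expo), 3))
```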
Functions of Random Variables
Expectation of Functions
In probability theory, the expectation of a function g(X) of a random variable X can be computed directly from the distribution of X without reference to the underlying probability space, a result known as the law of the unconscious statistician (LOTUS). For a discrete random variable X taking values in a countable set with probability mass function p_X(x), LOTUS states that E[g(X)] = \sum_{x} g(x) \, p_X(x), provided the sum exists.[74] Similarly, for a continuous random variable X with probability density function f_X(x), the expectation is given by E[g(X)] = \int_{-\infty}^{\infty} g(x) \, f_X(x) \, dx, assuming the integral converges.[75] This formulation extends naturally to the general case using the cumulative distribution function F_X, where E[g(X)] = \int_{-\infty}^{\infty} g(x) \, dF_X(x), interpreted as a Stieltjes integral.

A fundamental application arises with indicator functions, where g(X) = 1_A(X) is the indicator of the event \{X \in A\}. Here, LOTUS simplifies to E[1_A(X)] = P(X \in A), directly linking the expectation to the probability measure induced by X.[76] This identity underpins many derivations in probability, such as those for tail probabilities.

Jensen's inequality provides a key property for convex functions applied to expectations. If \phi is a convex function and X is a random variable with finite expectation, then \phi(E[X]) \leq E[\phi(X)], with equality if \phi is linear or X is constant almost surely.[77] This inequality highlights the preservation of convexity under expectation and has broad implications in optimization and risk analysis.

Common examples illustrate these concepts. The second moment E[X^2] computes as \sum x^2 p_X(x) or \int x^2 f_X(x) \, dx via LOTUS, relating to variance through \operatorname{Var}(X) = E[X^2] - (E[X])^2.[78] Similarly, the L^1 norm E[|X|] measures absolute deviation and follows from applying LOTUS to g(x) = |x|.[79]
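LOTUS and Jensen's inequality can both be illustrated with elementary distributions. The sketch below (Python with NumPy assumed; the die and Uniform(0,1) examples are illustrative choices) computes E[X^2] from a PMF, checks \phi(E[X]) \leq E[\phi(X)] for \phi(x) = x^2, and estimates a continuous second moment by simulation.

```python
import numpy as np

# LOTUS for a discrete variable: a fair six-sided die, g(x) = x^2
values = np.arange(1, 7)
pmf = np.full(6, 1 / 6)
e_x = np.sum(values * pmf)
e_x2 = np.sum(values ** 2 * pmf)            # E[g(X)] = sum g(x) p(x)
print(e_x, e_x2, e_x2 - e_x ** 2)           # mean 3.5, E[X^2] ~ 15.167, variance ~ 2.917

# Jensen's inequality with the convex function phi(x) = x^2: phi(E[X]) <= E[phi(X)]
print(e_x ** 2 <= e_x2)                     # True

# LOTUS for a continuous variable: Uniform(0, 1), E[X^2] = integral_0^1 x^2 dx = 1/3
rng = np.random.default_rng(0)
u = rng.uniform(size=1_000_000)
print(np.mean(u ** 2))                      # ~ 0.3333
```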
Transformations and Examples
One common transformation involves applying a strictly monotone function g to a continuous random variable X with probability density function f_X(x), yielding Y = g(X). For g strictly increasing and differentiable, the density of Y is given by f_Y(y) = f_X(g^{-1}(y)) \cdot \left| \frac{d}{dy} g^{-1}(y) \right| = \frac{f_X(g^{-1}(y))}{|g'(g^{-1}(y))|}, where g^{-1} is the inverse function and the absolute value accounts for the Jacobian of the transformation.[80] This formula derives from the change-of-variables theorem in probability, ensuring the density integrates to 1 over the support of Y. For strictly decreasing g, the same formula applies, with the absolute value absorbing the negative sign of the derivative.[81]

A fundamental example is the sum Z = X + Y of two independent continuous random variables X and Y with densities f_X and f_Y. The density of Z is the convolution of the individual densities: f_Z(z) = \int_{-\infty}^{\infty} f_X(x) f_Y(z - x) \, dx = (f_X * f_Y)(z). This integral captures all pairs (x, z - x) contributing to the sum z, leveraging independence to factor the joint density.[82] For discrete variables, the convolution becomes a sum over possible values. The convolution operation extends to sums of more than two independent variables via iterated application.[83]

For the product W = X Y of two positive independent continuous random variables X and Y, a log transformation simplifies the analysis: let U = \log X and V = \log Y, so \log W = U + V. If U and V are independent normals with means \mu_U, \mu_V and variances \sigma_U^2, \sigma_V^2, then \log W is normal with mean \mu_U + \mu_V and variance \sigma_U^2 + \sigma_V^2, making W lognormal.[84] This property holds more generally: the product of independent lognormals is lognormal, with parameters adding in the log scale.[85] For non-lognormal cases, the density of W can be derived via integration similar to convolution, but the log transform often aids computation when positivity holds.[86]

Consider the minimum M = \min(X_1, \dots, X_n) or maximum M' = \max(X_1, \dots, X_n) of n i.i.d. continuous random variables X_i with common survival function S(x) = P(X_i > x) = 1 - F(x), where F is the CDF. The survival function of M is P(M > t) = [S(t)]^n, since all must exceed t. Differentiating yields the density f_M(t) = n f(t) [S(t)]^{n-1}, where f = -S' is the density.[87] For the maximum M', the CDF is P(M' \leq t) = [F(t)]^n, so the density is f_{M'}(t) = n f(t) [F(t)]^{n-1}. These extreme value distributions arise in reliability and order statistics.[88]

Specific distributions illustrate these transformations. The chi-squared distribution with k degrees of freedom arises as the sum of squares of k independent standard normal random variables: if Z_i \sim N(0,1) i.i.d., then \chi^2_k = \sum_{i=1}^k Z_i^2. This follows from the quadratic transformation and independence, with the density derived via repeated convolution of chi-squared(1) components, each being the square of a standard normal (which has a gamma(1/2, 1/2) distribution).[89] The chi-squared is central in statistical inference, such as variance estimation.

The beta distribution also emerges from uniforms via order statistics. For n i.i.d. Uniform(0,1) random variables U_1, \dots, U_n, the k-th order statistic U_{(k)} (the k-th smallest) follows a Beta(k, n-k+1) distribution, with density f_{U_{(k)}}(u) = \frac{n!}{(k-1)!(n-k)!} u^{k-1} (1-u)^{n-k}, \quad 0 < u < 1. This results from the multinomial probability of exactly k-1 uniforms below u and n-k above, times the density contributions.[90] Beta distributions model proportions and are foundational in Bayesian statistics.[91]
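Several of these transformation results can be checked by simulation. The sketch below (Python with NumPy assumed; the choices n = 5, k = 2, and the sample sizes are arbitrary) verifies the CDF of the maximum of uniforms, the Beta mean k/(n+1) of an order statistic, and the mean k of a sum of squared standard normals.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 5, 200_000
u = rng.uniform(size=(reps, n))

# Maximum of n i.i.d. Uniform(0,1): P(max <= t) = t^n
t = 0.8
print(np.mean(u.max(axis=1) <= t), t ** n)

# k-th order statistic of n uniforms is Beta(k, n - k + 1); its mean is k / (n + 1)
k = 2
kth = np.sort(u, axis=1)[:, k - 1]
print(kth.mean(), k / (n + 1))

# Sum of squares of 3 standard normals is chi-squared with 3 degrees of freedom (mean 3)
z = rng.standard_normal(size=(reps, 3))
print(np.mean((z ** 2).sum(axis=1)))  # ~ 3
```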
Key Properties
Linearity and Monotonicity
The linearity of expectation states that for any finite collection of random variables X_1, X_2, \dots, X_n (not necessarily independent) and real constants a_1, a_2, \dots, a_n, the expected value of their linear combination equals the linear combination of the individual expectations: \mathbb{E}\left[ \sum_{i=1}^n a_i X_i \right] = \sum_{i=1}^n a_i \mathbb{E}[X_i]. This property derives directly from the definition of expectation as an integral over the probability space and holds unconditionally, without requiring knowledge of joint distributions or dependence structures.[52]

Monotonicity of expectation follows from the non-negativity of the measure in the underlying probability space: if integrable random variables X and Y satisfy X \leq Y almost surely, then \mathbb{E}[X] \leq \mathbb{E}[Y].[52] As a consequence, if X \geq 0 almost surely, then \mathbb{E}[X] \geq 0, since the constant random variable 0 provides a lower bound.[52]

Markov's inequality leverages non-negativity to bound tail probabilities: for a non-negative random variable X and a > 0, P(X \geq a) \leq \frac{\mathbb{E}[X]}{a}. This extends to any random variable via the absolute value, yielding P(|X| \geq a) \leq \mathbb{E}[|X|]/a.[52] The proof applies monotonicity to the pointwise inequality X \geq a I_{\{X \geq a\}}, which gives \mathbb{E}[X] \geq a \mathbb{E}[I_{\{X \geq a\}}] = a P(X \geq a).[52]

A practical illustration of linearity arises in counting problems, such as the binomial distribution. Suppose X represents the number of successes in n trials, expressed as X = \sum_{i=1}^n I_i where each I_i is an indicator random variable for the i-th success (with \mathbb{E}[I_i] = p). Even if the trials are dependent, linearity gives \mathbb{E}[X] = \sum_{i=1}^n p = np.[59] This simplifies computation in scenarios like estimating overlaps in hashing or matching problems, where dependence complicates direct evaluation.
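Both linearity under dependence and Markov's inequality can be illustrated with a small sampling-without-replacement experiment. The sketch below (Python with NumPy assumed; the pool of 10 items with 4 successes and 3 draws is an illustrative setup) shows the mean matching 3 \times 0.4 = 1.2 despite dependent indicators, and checks the Markov bound.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linearity with dependent indicators: draw 3 items without replacement from a pool
# of 10 containing 4 "successes"; each draw individually has success probability 0.4,
# so E[X] = 3 * 0.4 = 1.2 even though the indicator variables are dependent.
pool = np.array([1] * 4 + [0] * 6)
reps = 200_000
counts = np.array([rng.permutation(pool)[:3].sum() for _ in range(reps)])
print(counts.mean())  # ~ 1.2

# Markov's inequality for the non-negative variable X = counts: P(X >= a) <= E[X] / a
a = 2
print(np.mean(counts >= a), counts.mean() / a)  # ~ 0.33 vs. bound 0.6
```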
Independence and Dependence
Two random variables X and Y defined on the same probability space are independent if, for every pair of measurable sets A and B, the joint probability satisfies P(X \in A, Y \in B) = P(X \in A) P(Y \in B).[92] This definition extends to collections of random variables, where pairwise independence requires the condition to hold for every pair, while mutual independence requires it for all finite subcollections. An equivalent characterization is that the joint distribution of independent random variables is the product of their marginal distributions, meaning the joint cumulative distribution function factors as F_{X,Y}(x,y) = F_X(x) F_Y(y) for all x, y.[93] For bounded measurable functions g and h, independence also implies E[g(X) h(Y)] = E[g(X)] E[h(Y)].[92]

In contrast, dependence arises when the joint behavior of random variables cannot be expressed as a product of marginals. A common measure of linear dependence is covariance, defined as \operatorname{Cov}(X,Y) = E[(X - E[X])(Y - E[Y])], but zero covariance (uncorrelatedness) does not imply independence.[94] For example, let X be uniform on [-1, 1] and Y = X^2; then \operatorname{Cov}(X,Y) = 0 because E[XY] = E[X^3] = 0 by symmetry, yet X and Y are dependent since P(|X| > 0.5, Y < 0.1) = 0 while P(|X| > 0.5) P(Y < 0.1) > 0.[95]

Conditional expectation provides a framework for quantifying dependence, where E[X \mid Y] is the best L^2-approximation of X by a function of Y, interpreted as the orthogonal projection of X onto the closed subspace of L^2 functions measurable with respect to the \sigma-algebra generated by Y.[96] If X and Y are independent, then E[X \mid Y] = E[X] almost surely, reflecting no information gain from Y.[97]

Classic examples illustrate these concepts: the outcomes of successive fair coin tosses, modeled as Bernoulli random variables, are independent since the probability of heads on the second toss does not depend on the first.[98] In financial contexts, daily returns of stock prices from the same sector, such as technology firms, exhibit dependence due to shared market influences like economic news, violating the independence condition.[99]
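The uniform-and-square example of uncorrelated but dependent variables is easy to reproduce. The sketch below (Python with NumPy assumed; the sample size is arbitrary) shows a near-zero sample covariance together with a joint event whose probability is zero even though the product of the marginal probabilities is positive.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=1_000_000)
y = x ** 2

# Sample covariance is essentially zero, yet Y is a deterministic function of X.
print(np.cov(x, y)[0, 1])  # ~ 0

# Dependence: the joint event {|X| > 0.5, Y < 0.1} is impossible,
# but the product of the marginal probabilities is not.
joint = np.mean((np.abs(x) > 0.5) & (y < 0.1))
product = np.mean(np.abs(x) > 0.5) * np.mean(y < 0.1)
print(joint, product)  # 0.0 vs. a strictly positive number
```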
Equivalence and Comparison
Almost Sure Equality
Two random variables X and Y, defined on the same probability space (\Omega, \mathcal{F}, P), are equal almost surely, written X = Y a.s., if the set where they differ has probability zero: P(\{\omega \in \Omega : X(\omega) \neq Y(\omega)\}) = 0. This definition captures equality except possibly on a null set, a subset of \Omega with measure zero under P.[100] Almost sure equality is an equivalence relation on the space of random variables, partitioning them into equivalence classes where members agree except on null sets.[101]

Almost sure equality implies that X and Y share all key probabilistic properties, including the same probability distribution, since the events \{X \in B\} and \{Y \in B\} differ by at most a null set for any Borel set B.[102] If the expectations exist, then E[X] = E[Y], and similarly for higher moments like variance when finite.[103] These invariances extend to integrability: X belongs to L^p if and only if Y does, for 1 \leq p \leq \infty.[102] In essence, X and Y are indistinguishable for almost all probabilistic computations.[104]

Random variables are frequently treated as equal modulo null sets, meaning two versions of a random variable that coincide almost surely are regarded as identical in analysis. This equivalence allows flexibility in choosing representatives within a class, as long as differences occur only on null sets. In practice, this modifies foundational definitions; for example, the conditional expectation E[X \mid \mathcal{G}] of an integrable random variable X with respect to a sub-\sigma-algebra \mathcal{G} is unique only up to almost sure equality, so distinct versions agree except on a \mathcal{G}-measurable null set.[105]

A concrete example illustrates this concept. On the probability space [0,1] equipped with the Lebesgue measure (uniform distribution), define X(\omega) = \omega for all \omega \in [0,1], which is uniformly distributed on [0,1]. Now define Y(\omega) = \omega for \omega \neq 1/2 and Y(1/2) = 3/4. The singleton \{1/2\} is a null set under Lebesgue measure, so P(X \neq Y) = 0, hence X = Y a.s. Both induce the same uniform distribution on [0,1], and all moments match where defined.[103]
Equality in Distribution
Two random variables X and Y defined on possibly different probability spaces are said to be equal in distribution, denoted X \stackrel{d}{=} Y, if they induce the same probability measure on the real line, meaning their cumulative distribution functions coincide: F_X(t) = F_Y(t) for all t \in \mathbb{R}.[106] An equivalent characterization is that \mathbb{E}[g(X)] = \mathbb{E}[g(Y)] for every bounded continuous function g: \mathbb{R} \to \mathbb{R}.[106]

This equality implies that X and Y share every property determined by their common law, such as moments when they exist. Specifically, if the k-th moments are finite, then \mathbb{E}[X^k] = \mathbb{E}[Y^k] for every nonnegative integer k.[52] Equality in distribution can also be viewed as the trivial case of convergence in distribution, in which a constant sequence already has the common law as its limit.[106]

Unlike almost sure equality, which requires the variables to coincide on a set of probability one, equality in distribution permits the variables to differ pathwise while maintaining identical marginal laws. For example, two independent and identically distributed (i.i.d.) copies of a non-degenerate random variable are equal in distribution yet differ with positive probability.[52]

In practice, equality in distribution for samples from X and Y can be assessed using the Kolmogorov-Smirnov test, a nonparametric procedure that evaluates the maximum deviation between their empirical cumulative distribution functions.[107] A concrete illustration is that any two standard normal random variables—such as Z_1 \sim \mathcal{N}(0,1) on (\Omega_1, \mathcal{F}_1, P_1) and Z_2 \sim \mathcal{N}(0,1) on a distinct space (\Omega_2, \mathcal{F}_2, P_2)—satisfy Z_1 \stackrel{d}{=} Z_2, as both have the standard normal cumulative distribution function \Phi(t) = \int_{-\infty}^t \frac{1}{\sqrt{2\pi}} e^{-u^2/2} \, du.[52]
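The Kolmogorov-Smirnov comparison mentioned above can be run with standard scientific-Python tools. The sketch below (assuming NumPy and SciPy are available; sample sizes and the 0.5 shift are arbitrary) applies the two-sample test to samples that are equal in distribution and to samples that are not.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two standard normal samples generated independently: equal in distribution,
# so the Kolmogorov-Smirnov test should not reject equality of the underlying laws.
z1 = rng.standard_normal(5_000)
z2 = rng.standard_normal(5_000)
print(stats.ks_2samp(z1, z2).pvalue)   # typically large (no evidence against equality)

# A shifted sample has a different distribution, and the test detects it.
z3 = rng.standard_normal(5_000) + 0.5
print(stats.ks_2samp(z1, z3).pvalue)   # essentially zero
```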
Convergence
Convergence in Probability
Convergence in probability is a mode of convergence for a sequence of random variables X_n defined on a probability space, where X_n converges in probability to a random variable X, denoted X_n \to^P X, if for every \epsilon > 0, \lim_{n \to \infty} P(|X_n - X| > \epsilon) = 0. This definition captures the idea that the probability of X_n deviating from X by more than any fixed positive amount \epsilon diminishes to zero as n increases.[108][109]

Convergence in probability is a weaker form of convergence than almost sure convergence: it does not require the sequence to converge pointwise almost everywhere, only that the probability of deviations exceeding any fixed threshold vanishes in the limit.[110][111] However, convergence in probability implies convergence in distribution, meaning the cumulative distribution functions of X_n converge to that of X at continuity points.[112][113]

Slutsky's theorem provides a useful way to combine modes of convergence: if X_n \to^d X and Y_n \to^P c for a constant c, then X_n + Y_n \to^d X + c and Y_n X_n \to^d cX.[114][115] In particular, when both sequences converge in probability to constants, sums and products also converge in probability to the corresponding sums and products of the limits, allowing limits to be combined while preserving the mode of convergence.[116]

A classic example is the weak law of large numbers: for independent and identically distributed random variables X_1, X_2, \dots with finite mean \mu and finite variance, the sample mean \bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i converges in probability to \mu.[117][118] This result, often proved using Chebyshev's inequality, illustrates how averages stabilize probabilistically around the true expectation.[119]

Almost sure convergence implies convergence in probability, as pointwise convergence almost everywhere ensures the probability of large deviations goes to zero.[120][111]
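The weak law of large numbers can be visualized by estimating the deviation probability itself. The sketch below (Python with NumPy assumed; \epsilon, the replicate count, and the Uniform(0,1) choice are arbitrary) estimates P(|\bar{X}_n - \mu| > \epsilon) for growing n and shows it shrinking toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, eps, reps = 0.5, 0.05, 1_000  # Uniform(0, 1) has mean 0.5

for n in (10, 100, 1_000, 10_000):
    # For each replicate, compute the sample mean of n Uniform(0, 1) draws.
    means = rng.uniform(size=(reps, n)).mean(axis=1)
    deviation_prob = np.mean(np.abs(means - mu) > eps)
    print(n, deviation_prob)  # shrinks toward 0 as n grows
```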
Almost Sure Convergence
Almost sure convergence, also known as convergence with probability one, is the strongest form of convergence for sequences of random variables. A sequence of random variables \{X_n\}_{n=1}^\infty defined on a probability space (\Omega, \mathcal{F}, P) is said to converge almost surely to a random variable X if P\left( \left\{ \omega \in \Omega : \lim_{n \to \infty} X_n(\omega) = X(\omega) \right\} \right) = 1. This means that the set where the pointwise limit fails has probability zero, so the convergence holds pathwise for almost every outcome \omega.[52]

Almost sure convergence implies both convergence in probability and convergence in distribution. Specifically, if X_n \to X almost surely, then for every \epsilon > 0, \lim_{n \to \infty} P(|X_n - X| > \epsilon) = 0, establishing convergence in probability, and the distributions of X_n converge weakly to that of X. This pathwise nature makes almost sure convergence particularly useful for establishing limits that hold "for sure" except on a negligible set.[121]

A key tool for proving almost sure convergence is the Borel–Cantelli lemmas, which concern the occurrence of events in sequences. The first Borel–Cantelli lemma states that if \{A_n\}_{n=1}^\infty is a sequence of events with \sum_{n=1}^\infty P(A_n) < \infty, then P(\limsup_{n \to \infty} A_n) = 0, meaning the probability that infinitely many A_n occur is zero. For independent events, the second lemma adds that if \sum_{n=1}^\infty P(A_n) = \infty, then P(\limsup_{n \to \infty} A_n) = 1. These lemmas facilitate almost sure convergence arguments by controlling the tails of series related to deviations |X_n - X| > \epsilon.

An important application is the strong law of large numbers (SLLN), which asserts almost sure convergence for sample means of independent and identically distributed (i.i.d.) random variables. If \{X_i\}_{i=1}^\infty are i.i.d. with finite expectation \mu = E[X_1], then the sample mean \bar{X}_n = n^{-1} \sum_{i=1}^n X_i satisfies \bar{X}_n \to \mu almost surely as n \to \infty. This result, originally proved by Kolmogorov under the finite mean condition, underpins many asymptotic arguments in statistics and relies on techniques like truncation and the Borel–Cantelli lemmas to bound large deviations.[52]

The monotone convergence theorem provides another avenue for working with almost sure limits in the context of expectations. If \{X_n\}_{n=1}^\infty is a sequence of non-negative random variables such that 0 \leq X_n \uparrow X almost surely (i.e., X_n(\omega) increases to X(\omega) for almost every \omega), then X_n \to X almost surely and E[X_n] \to E[X]. This theorem, an adaptation of Lebesgue's result for integrals to the probability measure, ensures that expectations preserve limits under monotonicity, facilitating computations in stochastic processes.[52]
Convergence in Distribution
Convergence in distribution, also known as weak convergence, describes a sequence of random variables X_n converging to a random variable X if their cumulative distribution functions F_{X_n}(x) converge pointwise to F_X(x) at all continuity points x of F_X. Equivalently, by the Portmanteau theorem, X_n \to^d X if \mathbb{E}[g(X_n)] \to \mathbb{E}[g(X)] for every bounded continuous function g: \mathbb{R} \to \mathbb{R}.[106] This theorem provides several equivalent conditions, including \limsup_{n \to \infty} P(X_n \in F) \leq P(X \in F) for every closed set F \subseteq \mathbb{R} and \liminf_{n \to \infty} P(X_n \in G) \geq P(X \in G) for every open set G \subseteq \mathbb{R}, as well as convergence P(X_n \in A) \to P(X \in A) for every Borel set A with P(\partial A) = 0.[106]

An important characterization uses characteristic functions: if the characteristic functions \phi_{X_n}(t) = \mathbb{E}[e^{itX_n}] converge pointwise to \phi_X(t) for all t \in \mathbb{R} and the limit function is continuous at t = 0, then X_n \to^d X, by Lévy's continuity theorem. This criterion is particularly useful for proving convergence when direct computation of distribution functions is intractable.

A canonical application is the central limit theorem (CLT), which states that if X_1, X_2, \dots are i.i.d. random variables with finite mean \mu and variance \sigma^2 > 0, then the standardized sample mean \sqrt{n} (\bar{X}_n - \mu)/\sigma \to^d Z, where Z \sim \mathcal{N}(0, 1). For the specific case of Bernoulli trials, the De Moivre–Laplace theorem refines the CLT: if S_n \sim \operatorname{Binomial}(n, p), then \frac{S_n - np}{\sqrt{np(1-p)}} \to^d \mathcal{N}(0, 1) as n \to \infty. Note that the unstandardized S_n / n \to^d \delta_p, a degenerate distribution at p, but the CLT requires standardization to achieve a non-degenerate limit.

Convergence in distribution is the weakest form of convergence among common modes, as it concerns only the limiting marginal laws and does not imply convergence in probability unless the limit X is almost surely constant (degenerate distribution). Note that convergence in distribution to a non-degenerate limit does not imply convergence in probability, even if the sequence is tight.[106][122]
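The De Moivre–Laplace special case of the CLT can be checked empirically. The sketch below (Python with NumPy assumed; n, p, and the replicate count are arbitrary) standardizes binomial counts and compares a few empirical probabilities with the standard normal CDF values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, reps = 1_000, 0.3, 200_000

s = rng.binomial(n, p, size=reps)
z = (s - n * p) / np.sqrt(n * p * (1 - p))  # standardized binomial counts

# Compare a few empirical probabilities against the standard normal CDF values.
for t in (-1.0, 0.0, 1.0):
    print(t, np.mean(z <= t))  # ~ 0.159, 0.5, 0.841
```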