A probability mass function (PMF), also known as a probability function or frequency function, is a mathematical function that describes the probability distribution of a discrete random variable by assigning a non-negative probability to each possible value that the variable can take.[1] For a discrete random variable X taking values in a countable set, the PMF is typically denoted as p_X(x) = P(X = x), where P(X = x) represents the probability that X equals exactly x.[2]
The PMF must satisfy two fundamental properties: first, p_X(x) \geq 0 for all x in the support of X, ensuring probabilities are non-negative; second, the sum of p_X(x) over all possible values x equals 1, \sum p_X(x) = 1, which guarantees that the variable takes some value with certainty.[1] These properties make the PMF a valid probability measure for discrete outcomes, fully characterizing the distribution and enabling the computation of expected values, variances, and other statistical moments.[3]
In contrast to the probability density function used for continuous random variables, the PMF provides the actual probability mass at discrete points rather than a density over an interval, and the cumulative distribution function derived from the PMF is a step function that jumps by p_X(x) at each point x.[1] Common examples include the PMF of the binomial distribution, which models the number of successes in a fixed number of trials, and the Poisson distribution, which describes the number of events in a fixed interval.[4] The concept is central to probability theory and finds applications in fields such as statistics, machine learning, and operations research for modeling countable phenomena.[3]
Fundamentals
Definition
In probability theory, the probability mass function (PMF) of a discrete random variable X is defined as the function p_X: S \to [0,1] that assigns to each possible outcome x in the support set S the probability p_X(x) = P(X = x), where S is the countable set consisting of all values that X can take with positive probability.[5][1]
A fundamental requirement of the PMF is that the probabilities over the entire support sum to unity: \sum_{x \in S} p_X(x) = 1.[5][6] This normalization ensures that the PMF fully describes the probability distribution of X.
The PMF applies specifically to discrete random variables, where probability is concentrated at isolated points of the support S, in contrast to continuous random variables, which require probability density functions integrated over intervals.[3] The support set S is typically finite or countably infinite and comprises exactly those points where p_X(x) > 0.[1][7]
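The definition can be made concrete with a short Python sketch; the dictionary representation of the PMF and the helper name is_valid_pmf are illustrative assumptions, not part of the cited sources. The sketch encodes a PMF as a mapping from support points to probabilities and checks the two defining conditions (non-negativity and normalization).
import math

def is_valid_pmf(pmf):
    """Check the two defining conditions: non-negativity and normalization."""
    nonnegative = all(p >= 0 for p in pmf.values())
    normalized = math.isclose(sum(pmf.values()), 1.0)
    return nonnegative and normalized

# PMF of a fair six-sided die: p_X(x) = 1/6 for x = 1, ..., 6
die_pmf = {face: 1/6 for face in range(1, 7)}
print(is_valid_pmf(die_pmf))   # True
print(die_pmf[3])              # P(X = 3) = 1/6, approximately 0.1667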
Properties
The probability mass function (PMF) of a discrete random variable X, denoted p_X(x), must satisfy non-negativity, meaning p_X(x) \geq 0 for all x in the support S, as probabilities cannot be negative by the axioms of probability theory.[1] This ensures that the assigned probabilities represent valid measures of likelihood. Additionally, each individual probability satisfies 0 \leq p_X(x) \leq 1, since no single event can have a probability exceeding the total probability of the sample space.[8]
The normalization property requires that \sum_{x \in S} p_X(x) = 1, reflecting the fact that the events \{X = x\} for x \in S form a partition of the sample space; by countable additivity of the probability measure (finite additivity suffices for finite supports), their probabilities sum to the probability of the entire space, which is 1.[5] This condition guarantees that the PMF fully accounts for all possible outcomes without overlap or omission.
For a given discrete random variable, the PMF is uniquely determined, as it is defined directly by p_X(x) = P(X = x) for each x, and the probabilities P(X = x) are uniquely specified by the underlying probability measure.[9]
The effective support of the PMF is the set \{x \in S \mid p_X(x) > 0\}, which identifies the values that the random variable can actually attain with positive probability, while p_X(x) = 0 for all other x \in S.[10] This support set encapsulates the possible realizations of X under the given distribution. The expected value of X can be computed from the PMF as E[X] = \sum_{x \in S} x \, p_X(x), provided the sum converges absolutely.[5]
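As an illustration of these properties, the short Python sketch below computes the effective support and the expected value directly from a PMF given as a dictionary; the function names support and expectation are illustrative assumptions rather than a standard library interface.
def support(pmf):
    """Values attained with positive probability: {x : p_X(x) > 0}."""
    return {x for x, p in pmf.items() if p > 0}

def expectation(pmf):
    """E[X] = sum of x * p_X(x) over the support."""
    return sum(x * p for x, p in pmf.items())

# A biased coin mapped to {0, 1}: p_X(1) = 0.7, p_X(0) = 0.3
coin = {0: 0.3, 1: 0.7}
print(support(coin))       # {0, 1}
print(expectation(coin))   # 0.7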
Relationships
Cumulative distribution function
The cumulative distribution function (CDF) of a discrete random variable X with probability mass function p_X and support S is defined as
F_X(x) = P(X \leq x) = \sum_{\substack{y \leq x \\ y \in S}} p_X(y),
where the sum accumulates the probabilities assigned by the PMF up to and including x. This function maps any real number x to the interval [0, 1], representing the total probability that X takes a value less than or equal to x.[11][12]
For discrete random variables, the CDF exhibits a step-function form, remaining constant between points in the support S and featuring discontinuous jumps precisely at those points where p_X(y) > 0. The magnitude of each jump at a point y \in S equals p_X(y), reflecting the discrete probability mass concentrated there, while the function is right-continuous at every point. This stepwise increase ensures that F_X(x) approaches 1 as x tends to infinity and equals 0 for x less than the smallest element of S.[3][13]
The PMF can be recovered from the CDF through the relation
p_X(x) = F_X(x) - F_X(x^-),
where F_X(x^-) denotes the left-hand limit of the CDF at x, capturing the size of the jump at x. This difference directly corresponds to the increments in the CDF, as each PMF value p_X(x) quantifies the vertical rise at that point, allowing the original discrete distribution to be reconstructed solely from the cumulative form.
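The relationship between the PMF and the step-function CDF can be sketched in a few lines of Python (assuming a finite support; the variable names are illustrative): the CDF is the running sum of the PMF over the ordered support, and the PMF is recovered from the jump sizes.
import itertools

pmf = {1: 0.25, 2: 0.5, 4: 0.25}              # support S = {1, 2, 4}
xs = sorted(pmf)
cdf = dict(zip(xs, itertools.accumulate(pmf[x] for x in xs)))
print(cdf)                                     # {1: 0.25, 2: 0.75, 4: 1.0}

# Jump sizes reconstruct the PMF: p_X(x) = F_X(x) - F_X(x^-)
recovered = {x: cdf[x] - (cdf[xs[i - 1]] if i > 0 else 0.0) for i, x in enumerate(xs)}
print(recovered)                               # {1: 0.25, 2: 0.5, 4: 0.25}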
Probability density function
In contrast to the probability mass function (PMF) for discrete random variables, the probability density function (PDF), denoted f_X(x), describes the probability distribution of a continuous random variable X. It is a nonnegative integrable function defined over the support of X such that the integral over the entire support equals 1, i.e., \int_{-\infty}^{\infty} f_X(x) \, dx = 1. The probability that X falls within an interval (a, b) is given by the area under the PDF curve over that interval: P(a < X < b) = \int_a^b f_X(x) \, dx.
A fundamental distinction between the PMF and PDF lies in their interpretation and normalization. The PMF p_X(x) directly assigns probabilities to discrete points, summing to 1 across all possible outcomes: \sum_x p_X(x) = 1, with each p_X(x) representing P(X = x) \leq 1. In contrast, the PDF provides a density rather than probabilities at points, integrating to 1 over the continuous domain, and its values f_X(x) can exceed 1, as they measure relative likelihood per unit interval rather than absolute probability. This density-based approach reflects the infinite divisibility of continuous spaces, where probabilities are accumulated over intervals rather than assigned to isolated values.
For continuous random variables governed by a PDF, the probability of the variable taking any exact single value is zero: P(X = x) = 0 for any specific x, because the integral over an arbitrarily small interval around x approaches zero. This property underscores the inapplicability of PMFs to continuous cases, as no positive probability can be assigned to individual points without violating the total probability measure.
In certain limiting scenarios, discrete distributions described by PMFs can be approximated by continuous ones via PDFs. For instance, as the number of trials n in a binomial distribution grows large while the success probability p is held fixed, the standardized binomial distribution converges to a normal distribution by the de Moivre–Laplace theorem (a special case of the central limit theorem), so the binomial PMF near its mean is closely approximated by the normal PDF with mean np and variance np(1-p); this enables the use of continuous approximations for large-scale discrete processes. The cumulative distribution function serves as a unifying framework that accommodates both PMFs and PDFs, defining F_X(x) = P(X \leq x) for either type of random variable.
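The binomial-to-normal approximation mentioned above can be checked numerically; the following Python sketch assumes SciPy is available and compares the binomial PMF at a point near the mean with the normal PDF having matching mean np and variance np(1-p). The chosen parameter values are illustrative only.
from scipy.stats import binom, norm

n, p = 1000, 0.4
k = 400                                                          # a value near the mean n*p
exact = binom.pmf(k, n, p)                                       # discrete probability P(X = 400)
approx = norm.pdf(k, loc=n * p, scale=(n * p * (1 - p)) ** 0.5)  # normal density evaluated at k
print(exact, approx)                                             # both are approximately 0.026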
Examples
Finite support
A probability mass function (PMF) with finite support is defined over a discrete random variable that can take only a finite number of possible values, making it straightforward to enumerate all probabilities directly. These distributions are fundamental in modeling scenarios with limited outcomes, such as coin flips or dice rolls, where the total probability sums to 1 across the support.[5]
The Bernoulli distribution is the simplest example, representing a single binary trial with outcomes 0 (failure) or 1 (success), parameterized by the success probability p \in [0,1]. Its PMF is given by
p_X(x) = p^x (1-p)^{1-x}, \quad x = 0,1.
This distribution models real-world events such as a coin landing heads (success) with probability p = 0.5, where the PMF assigns p to x = 1 and 1-p to x = 0.[14]
The binomial distribution extends the Bernoulli to n independent trials, counting the number of successes k = 0, 1, \dots, n, each trial having the same success probability p. Its PMF is
p_X(k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0,1,\dots,n,
where \binom{n}{k} is the binomial coefficient. This distribution arises as the sum of n independent Bernoulli random variables, modeling aggregates like the number of heads in n coin flips.[15]
The discrete uniform distribution applies when all outcomes in a finite set S with |S| = m elements are equally likely, parameterized by the support size. Its PMF is
p_X(x) = \frac{1}{m}, \quad x \in S.
This models fair dice or random selection from a finite list without bias, ensuring uniform probability across the support.[16]
The multinomial distribution generalizes the binomial to r categories over n trials, with probabilities p_1, \dots, p_r summing to 1, assigning counts (k_1, \dots, k_r) with \sum k_i = n. Its PMF is
p_{\mathbf{X}}(\mathbf{k}) = \frac{n!}{k_1! \cdots k_r!} p_1^{k_1} \cdots p_r^{k_r},
describing finite categorical outcomes such as distributing n items into r bins.[17]
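As a concrete illustration of a finite-support PMF, the Python sketch below evaluates the binomial formula directly, using math.comb for the binomial coefficient; the function name binomial_pmf is an illustrative assumption. It also verifies that the probabilities over the finite support sum to 1.
from math import comb

def binomial_pmf(k, n, p):
    """p_X(k) = C(n, k) * p**k * (1 - p)**(n - k) for k = 0, ..., n."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Number of heads in 10 fair coin flips.
probs = [binomial_pmf(k, 10, 0.5) for k in range(11)]
print(probs[5])    # P(5 heads) = 252/1024, approximately 0.2461
print(sum(probs))  # 1.0, since the finite support sums to one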
Infinite support
Discrete probability distributions with infinite support assign positive probabilities to a countably infinite set of outcomes, typically the non-negative integers, while ensuring the total probability sums to 1 over all possible values. This requires the infinite series \sum_{k=0}^{\infty} p_X(k) = 1, where the probabilities p_X(k) decrease sufficiently rapidly for the series to converge. Such distributions model phenomena with no upper bound on outcomes, like the number of occurrences in an unbounded time frame.
The Poisson distribution is a canonical example, with probability mass function
p_X(k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \dots,
where \lambda > 0 is the rate parameter, representing the average number of events per interval. Its expected value is \lambda, and it applies to modeling rare events or counts, such as radioactive decays or arrivals in a queue. The infinite sum converges to 1 because \sum_{k=0}^{\infty} \lambda^k / k! = e^{\lambda}, which cancels the factor e^{-\lambda}.
The geometric distribution describes the number of trials until the first success in independent Bernoulli trials with success probability p \in (0,1]. Its PMF is
p_X(k) = (1-p)^{k-1} p, \quad k = 1, 2, 3, \dots
(or shifted to start at k = 0 when counting failures before the first success), with mean 1/p (or (1-p)/p for the shifted version). It models waiting times, like the number of coin flips until heads. Convergence of the sum to 1 follows from the geometric series formula.
The negative binomial distribution generalizes the geometric to the number of trials until the r-th success, where r is a positive integer. The PMF is
p_X(k) = \binom{k-1}{r-1} p^r (1-p)^{k-r}, \quad k = r, r+1, r+2, \dots,
with mean r/p; the expected number of failures before the r-th success is r(1-p)/p. It extends waiting-time models to multiple successes, such as the number of items inspected in quality control until a fixed number of defects is found. The probabilities sum to 1 via the negative binomial series expansion.
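The convergence of an infinite-support PMF can be illustrated numerically; the Python sketch below (pure Python, with the illustrative function name poisson_pmf) evaluates the Poisson probabilities and shows a partial sum approaching 1 once enough terms are included.
from math import exp, factorial

def poisson_pmf(k, lam):
    """p_X(k) = lam**k * exp(-lam) / k! for k = 0, 1, 2, ..."""
    return lam**k * exp(-lam) / factorial(k)

lam = 3.0
print(poisson_pmf(2, lam))                          # P(X = 2), approximately 0.2240
print(sum(poisson_pmf(k, lam) for k in range(20)))  # approximately 1.0 after enough terms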
Extensions
Multivariate case
In the multivariate case, the probability mass function is extended to describe the joint distribution of multiple discrete random variables. For two discrete random variables X and Y taking values in countable sets, the joint probability mass function is defined as p_{X,Y}(x,y) = P(X = x, Y = y) for each pair (x, y) in the joint support, where p_{X,Y}(x,y) \geq 0 and the normalization condition \sum_x \sum_y p_{X,Y}(x,y) = 1 holds to ensure the probabilities sum to unity.[18] This bivariate formulation serves as the foundation for higher-dimensional cases, where the joint PMF for a vector \mathbf{X} = (X_1, \dots, X_n) is p_{\mathbf{X}}(\mathbf{x}) = P(X_1 = x_1, \dots, X_n = x_n), satisfying \sum_{\mathbf{x}} p_{\mathbf{X}}(\mathbf{x}) = 1.[19]
Marginal probability mass functions are derived from the joint PMF by summing over the unwanted variables. Specifically, the marginal PMF of X is p_X(x) = \sum_y p_{X,Y}(x,y), where the sum is over all possible values of Y, and likewise p_Y(y) = \sum_x p_{X,Y}(x,y).[20] In the general multivariate setting, the marginal PMF for any subset of variables is obtained by summing the joint PMF over the complementary variables, preserving the univariate properties as a special case.[20]
Conditional probability mass functions capture dependencies between variables. The conditional PMF of Y given X = x is given by p_{Y|X}(y|x) = \frac{p_{X,Y}(x,y)}{p_X(x)} whenever p_X(x) > 0, and it satisfies the properties of a valid PMF for each fixed x.[21] This definition extends to multivariate conditionals, such as p_{Y,Z|X}(y,z|x) = \frac{p_{X,Y,Z}(x,y,z)}{p_X(x)} for p_X(x) > 0, allowing analysis of partial dependencies in higher dimensions.[21]
Independence in the multivariate context implies that the joint PMF factors into the product of marginal PMFs. For X and Y, they are independent if and only if p_{X,Y}(x,y) = p_X(x) p_Y(y) for all x, y.[22] More generally, for \mathbf{X} = (X_1, \dots, X_n), mutual independence holds if p_{\mathbf{X}}(\mathbf{x}) = \prod_{i=1}^n p_{X_i}(x_i) for all \mathbf{x}, simplifying computations and modeling in applications like statistical inference.[22]
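These definitions can be made concrete with a small joint PMF in Python; the dictionary encoding and the variable names below are illustrative assumptions. The sketch computes marginals by summing out one variable, forms a conditional PMF, and checks the factorization criterion for independence.
from collections import defaultdict

joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}  # joint PMF p_{X,Y}(x, y)

marginal_x = defaultdict(float)
marginal_y = defaultdict(float)
for (x, y), p in joint.items():
    marginal_x[x] += p   # p_X(x) = sum over y of p_{X,Y}(x, y)
    marginal_y[y] += p   # p_Y(y) = sum over x of p_{X,Y}(x, y)

print(dict(marginal_x))  # {0: 0.5, 1: 0.5}

# Conditional PMF of Y given X = 1: p_{Y|X}(y|1) = p_{X,Y}(1, y) / p_X(1)
cond_y_given_x1 = {y: joint[(1, y)] / marginal_x[1] for y in (0, 1)}
print(cond_y_given_x1)   # {0: 0.2, 1: 0.8}

# Independence would require p_{X,Y}(x, y) = p_X(x) * p_Y(y) for every pair;
# here p_{X,Y}(0, 0) = 0.3 while p_X(0) * p_Y(0) = 0.5 * 0.4 = 0.2,
# so X and Y are dependent.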
Measure-theoretic formulation
In measure theory, the probability mass function arises as part of the rigorous treatment of discrete random variables on a probability space. A discrete probability space is defined as a triple (\Omega, \mathcal{F}, P), where \Omega is a countable sample space, \mathcal{F} is the power set of \Omega (which is a \sigma-algebra since \Omega is countable), and P: \mathcal{F} \to [0,1] is a probability measure satisfying P(\Omega) = 1 and countable additivity.[23]
A random variable X on this space is a measurable function X: \Omega \to S, where S is a countable set serving as the codomain or support. This X induces a probability measure \mu (also denoted P_X) on the measurable space (S, \mathcal{P}(S)), where \mathcal{P}(S) is the power set of S, via the pushforward construction: for any subset A \subseteq S, \mu(A) = P(X^{-1}(A)).[23]
The induced measure \mu assigns point masses to singletons in S, with the probability mass function p_X defined by p_X(x) = \mu(\{x\}) = P(X^{-1}(\{x\})) = P(X = x) for each x \in S. These point masses satisfy the normalization condition \sum_{x \in S} p_X(x) = \mu(S) = 1, ensuring \mu is a probability measure. Equivalently, p_X serves as the Radon-Nikodym derivative of \mu with respect to the counting measure on S, which assigns to each subset its cardinality (infinite for infinite subsets).[23]
This formulation extends naturally to cases with infinite support, where S is countably infinite; here, the counting measure on S is \sigma-finite (as it is a countable union of finite-mass sets), and the induced \mu remains a finite probability measure with the same point-mass structure, provided the series \sum p_X(x) converges to 1.[23]
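The pushforward construction can be mirrored on a small finite example in Python; all names and numerical values below are illustrative assumptions. The sketch accumulates the measure of each preimage X^{-1}(\{x\}) to obtain the induced point masses, which constitute the PMF of X.
from collections import defaultdict

P = {"a": 0.25, "b": 0.25, "c": 0.5}   # probability measure on Omega, given by point masses
X = {"a": 0, "b": 1, "c": 1}           # random variable X: Omega -> S = {0, 1}

pushforward = defaultdict(float)
for omega, mass in P.items():
    pushforward[X[omega]] += mass      # accumulate P over the preimage X^{-1}({x})

print(dict(pushforward))               # {0: 0.25, 1: 0.75}, the induced PMF of X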