Probability vector

A probability vector, also known as a stochastic vector, is a vector whose components are non-negative real numbers that sum to exactly one.^[1] Such a vector represents a probability distribution over a finite set of mutually exclusive and exhaustive outcomes, with each component giving the probability of one outcome.^[2] For example, the vector [0.3, 0.5, 0.2] could describe the probabilities of rain, clouds, or sunshine on a given day; non-negativity rules out negative probabilities, and summation to one guarantees that the listed outcomes are exhaustive. The set of all probability vectors of dimension n forms the standard (n-1)-simplex in \mathbb{R}^n, a geometric object that is convex and compact: any convex combination of probability vectors is itself a probability vector. The set is also preserved by multiplication by stochastic matrices, a property that is central to modeling dynamic systems.^[2] Every probability vector has unit 1-norm (\| \mathbf{p} \|_1 = 1) and lies within the unit hypercube [0,1]^n, but the simplex is a proper subset of that hypercube: points of the hypercube whose components do not sum to one are excluded.
Probability vectors find extensive applications across disciplines, most notably in Markov chains where they describe the current state distribution of a stochastic process and evolve via transition matrices to reach steady-state distributions.^[2] In probability theory, they underpin discrete random variables and enable computations like expected values or entropy measures.^[1] Further, in quantum mechanics, the squared moduli of components in a state vector yield a probability vector for measurement outcomes, bridging classical probability with quantum superpositions.^[1] Other uses include decision theory for representing belief states and optimization problems in operations research, such as resource allocation under uncertainty.

Definition and Formalism

Definition

A probability vector is an n-dimensional vector \mathbf{p} = (p_1, p_2, \dots, p_n) where each p_i \geq 0 for i = 1, \dots, n and \sum_{i=1}^n p_i = 1.^[3] This structure captures the probabilities assigned to each outcome in a discrete sample space with n elements.^[4] Unlike a general real-valued vector, which may have arbitrary components without sign or magnitude restrictions, a probability vector imposes strict non-negativity on all entries to ensure they represent valid probabilities and requires normalization to unity to preserve the total probability axiom.^[5] These constraints distinguish probability vectors from ordinary vectors in linear algebra, embedding them within a bounded geometric space known as the probability simplex. In probability theory, a probability vector encodes the discrete probability mass function (PMF) of a random variable with finite support, where each p_i denotes the probability that the random variable takes the value associated with the i-th outcome. This representation facilitates the modeling of discrete probability distributions in various stochastic contexts.^[6]
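The two defining conditions can be checked mechanically. The following is a minimal sketch, assuming the vector is given as a list of floats; the function name `is_probability_vector` and the tolerance `tol` are illustrative choices, not part of any standard library.

```python
import math

def is_probability_vector(p, tol=1e-9):
    """Return True if all entries are non-negative and sum to 1 within tol."""
    if len(p) == 0:
        return False
    if any(x < 0 for x in p):
        return False
    # math.fsum limits floating-point accumulation error in the total
    return abs(math.fsum(p) - 1.0) <= tol

weather = [0.3, 0.5, 0.2]      # valid: non-negative, sums to 1
too_much = [0.6, 0.6]          # invalid: sums to 1.2
negative = [1.2, -0.2]         # invalid: sums to 1 but has a negative entry
```

A tolerance is needed because decimal fractions such as 0.3 are not exactly representable in binary floating point, so an exact equality test on the sum would be fragile.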

Notation and Representation

A probability vector \mathbf{p} in \mathbb{R}^n is commonly denoted using boldface to indicate its vector nature, with individual components p_i for i = 1, \dots, n. These components satisfy p_i \geq 0 (componentwise non-negativity) and \sum_{i=1}^n p_i = 1, or equivalently in vector form, \mathbf{p} \in \mathbb{R}^n with \mathbf{p} \geq \mathbf{0} and \mathbf{1}^T \mathbf{p} = 1, where \mathbf{1} denotes the all-ones vector of dimension n.^[7]^[8] The representation as a row or column vector depends on the context. In linear algebra and general probability, \mathbf{p} is often a column vector, facilitating operations like expectation as \mathbb{E}[X] = \sum p_i x_i. In Markov chain theory, however, it is standard to use row vectors for state distributions, enabling updates via right-multiplication by the transition matrix: \mathbf{p}^{(t+1)} = \mathbf{p}^{(t)} P, where P is row-stochastic.^[9]^[7] Column vector notation appears in column-stochastic settings, such as certain optimization problems, where updates take the form \mathbf{p}^{(t+1)} = P \mathbf{p}^{(t)}.^[7] Probability vectors frequently appear as rows or columns within stochastic matrices. A row-stochastic matrix has each row summing to 1, making every row a probability vector that represents transition probabilities from a given state. Conversely, a column-stochastic matrix has columns as probability vectors, often used in contexts like stationary distributions solved via P \mathbf{\pi} = \mathbf{\pi}.^[9]^[7] For discrete point masses, probability vectors align with Dirac delta-like basis representations. The standard basis vectors \mathbf{e}_k \in \mathbb{R}^n, defined with a 1 in the k-th position and 0 elsewhere (i.e., \mathbf{e}_k = (\delta_{1k}, \dots, \delta_{nk})^T), function as probability vectors corresponding to certain outcomes with probability 1. These form an orthonormal basis for the space, useful in expansions like \mathbf{p} = \sum_{k=1}^n p_k \mathbf{e}_k.^[8]
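The row and column conventions describe the same update. As a small sketch, assuming matrices are stored as nested lists and using a made-up 2-state matrix, the row update \mathbf{p} P agrees with the column update P^T \mathbf{p}; the helper names are hypothetical.

```python
def row_update(p, P):
    """Row convention: return p P for a row-stochastic matrix P."""
    return [sum(p[i] * P[i][j] for i in range(len(p))) for j in range(len(P[0]))]

def col_update(Q, p):
    """Column convention: return Q p for a column-stochastic matrix Q."""
    return [sum(Q[i][j] * p[j] for j in range(len(p))) for i in range(len(Q))]

def transpose(M):
    return [list(col) for col in zip(*M)]

P = [[0.9, 0.1],
     [0.5, 0.5]]            # illustrative row-stochastic matrix
p = [0.25, 0.75]

row_result = row_update(p, P)             # p P
col_result = col_update(transpose(P), p)  # P^T p, the same numbers
```

Transposing a row-stochastic matrix gives a column-stochastic one, which is why the two notations are interchangeable as long as one convention is used consistently.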

Mathematical Properties

Algebraic Properties

Probability vectors, defined as non-negative vectors \mathbf{p} \in \mathbb{R}^n satisfying \sum_{i=1}^n p_i = 1 (or equivalently, \mathbf{1}^\top \mathbf{p} = 1), form the standard probability simplex \Delta^{n-1}, which is a convex set.^[10] Convexity means that the set is closed under convex combinations: for probability vectors \mathbf{p}^{(1)}, \dots, \mathbf{p}^{(k)} and coefficients \alpha_1, \dots, \alpha_k \geq 0 with \sum_{i=1}^k \alpha_i = 1, the vector \sum_{i=1}^k \alpha_i \mathbf{p}^{(i)} is also a probability vector, as it remains non-negative and sums to 1.^[10] In fact, the probability simplex is the convex hull of the standard basis vectors \mathbf{e}_1, \dots, \mathbf{e}_n in \mathbb{R}^n.
The set of probability vectors is not closed under vector addition: the sum \mathbf{p} + \mathbf{q} of two such vectors is non-negative, but its components sum to 2, violating the normalization condition.^[11] However, any non-negative vector with a positive total can be rescaled into a probability vector: given \mathbf{v} \in \mathbb{R}^n_{\geq 0} with \sum_{i=1}^n v_i > 0, the normalized vector \mathbf{p} = \mathbf{v} / \|\mathbf{v}\|_1, where \|\mathbf{v}\|_1 = \sum_{i=1}^n v_i, is a probability vector.^[12] This operation maps the non-negative orthant, excluding the origin, onto the probability simplex along rays through the origin.
Under the standard Euclidean inner product \langle \mathbf{p}, \mathbf{q} \rangle = \sum_{i=1}^n p_i q_i, two probability vectors are orthogonal when their inner product is zero, which, because all components are non-negative, happens exactly when they have disjoint supports.^[11] The uniform vector \mathbf{u} = (1/n, \dots, 1/n) serves as the centroid (barycenter) of the probability simplex, obtained as the average of its vertices \mathbf{e}_1, \dots, \mathbf{e}_n.^[13]
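Closure under convex combinations and the normalization map can be illustrated with a short sketch. It assumes vectors are plain lists of floats; the function names and example numbers are illustrative.

```python
import math

def convex_combination(vectors, weights):
    """Return sum_i weights[i] * vectors[i]; weights must be a probability vector."""
    if any(w < 0 for w in weights) or abs(math.fsum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    n = len(vectors[0])
    return [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(n)]

def normalize(v):
    """Rescale a non-negative vector with positive total to unit 1-norm."""
    total = math.fsum(v)
    if any(x < 0 for x in v) or total <= 0:
        raise ValueError("need a non-negative vector with positive sum")
    return [x / total for x in v]

mix = convex_combination([[1.0, 0.0, 0.0], [0.2, 0.3, 0.5]], [0.25, 0.75])
scaled = normalize([2.0, 1.0, 1.0])   # -> [0.5, 0.25, 0.25]
```

Plain addition of the two input vectors would give components summing to 2, which is why the weights in `convex_combination` are required to sum to one.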

Statistical Properties

A probability vector p = (p_1, p_2, \dots, p_n) \in \mathbb{R}^n with p_i \geq 0 and \sum_{i=1}^n p_i = 1 has components whose mean value is \mu = \frac{1}{n}, which follows directly from dividing the total sum of 1 by the number of components. This mean is the expected value of a component whose index is chosen uniformly at random, and it provides a baseline for assessing concentration or spread in the distribution encoded by p. The variance of the components, defined as \sigma^2 = \frac{1}{n} \sum_{i=1}^n (p_i - \frac{1}{n})^2, quantifies the dispersion around this mean and measures how concentrated the probability assignment is: it is small when mass is spread evenly and large when mass sits on few outcomes. To relate this to the geometry of p, note that the squared Euclidean norm satisfies \|p\|_2^2 = \sum_{i=1}^n p_i^2. Expanding the variance expression and using \sum_{i=1}^n p_i = 1 yields n \sigma^2 = \sum_{i=1}^n p_i^2 - 2 \cdot \frac{1}{n} \sum_{i=1}^n p_i + n \cdot \left( \frac{1}{n} \right)^2 = \sum_{i=1}^n p_i^2 - \frac{2}{n} + \frac{1}{n} = \sum_{i=1}^n p_i^2 - \frac{1}{n}, so \|p\|_2^2 = n \sigma^2 + \frac{1}{n} or \|p\|_2 = \sqrt{n \sigma^2 + \frac{1}{n}}. This connection shows how the norm encodes both the uniform baseline (via the 1/n term) and deviation from it (via \sigma^2).^[14]
The variance \sigma^2 is bounded by 0 \leq \sigma^2 \leq \frac{n-1}{n^2}. The lower bound of 0 is achieved when p is the uniform vector p_i = \frac{1}{n} for all i, corresponding to maximum evenness. The upper bound is attained at any delta vector (standard basis vector), where one component is 1 and the others are 0, maximizing concentration. To verify the upper bound, substitute the delta case into the variance formula: \sigma^2 = \frac{1}{n} \left[ (1 - \frac{1}{n})^2 + (n-1) \left(0 - \frac{1}{n}\right)^2 \right] = \frac{1}{n} \left[ \left(\frac{n-1}{n}\right)^2 + (n-1) \frac{1}{n^2} \right] = \frac{1}{n} \cdot \frac{(n-1)^2 + (n-1)}{n^2} = \frac{n-1}{n^2}.
In high dimensions (n \gg 1), even the maximum variance \frac{n-1}{n^2} is approximately \frac{1}{n} and tends to zero, so the component variance of every probability vector is small in absolute terms and is best read relative to this bound when gauging concentration at scale.
Another key statistical measure for probability vectors is the Shannon entropy H(p) = -\sum_{i=1}^n p_i \log p_i (typically using base-2 or natural log, with the convention 0 \log 0 = 0), which quantifies the average uncertainty or diversity inherent in the distribution. Entropy is a concave function of p because each term -p_i \log p_i is concave, and on the probability simplex it achieves its maximum value of \log n at the uniform vector, reflecting maximal unpredictability. Concavity complements the convexity of the simplex: the entropy of a mixture of probability vectors is at least the corresponding weighted average of the component entropies.
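The relation \|p\|_2^2 = n \sigma^2 + \frac{1}{n}, the variance bounds, and the entropy extremes can be checked numerically. This is a minimal sketch with illustrative function names; the entropy uses the natural logarithm by default and skips zero entries, following the convention 0 \log 0 = 0.

```python
import math

def component_variance(p):
    """Variance of the components of p around their mean 1/n."""
    n = len(p)
    return sum((x - 1.0 / n) ** 2 for x in p) / n

def squared_norm(p):
    """Squared Euclidean norm of p."""
    return sum(x * x for x in p)

def shannon_entropy(p, base=math.e):
    """Shannon entropy; zero entries contribute nothing."""
    return -sum(x * math.log(x, base) for x in p if x > 0)

p = [0.5, 0.25, 0.25]
n = len(p)
lhs = squared_norm(p)                       # 0.375
rhs = n * component_variance(p) + 1.0 / n   # should match lhs

uniform = [0.25] * 4
delta = [0.0, 1.0, 0.0, 0.0]
```

For the uniform vector the variance is 0 and the entropy is \log 4, while for the delta vector the variance reaches (n-1)/n^2 = 3/16 and the entropy is 0.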

Geometric Interpretation

The Probability Simplex

The probability simplex, denoted as \Delta^{n-1}, is the set of all probability vectors in \mathbb{R}^n, formally defined as \Delta^{n-1} = \{ p \in \mathbb{R}^n \mid p_i \geq 0 \ \forall i, \ \sum_{i=1}^n p_i = 1 \}.^[15] This forms an (n-1)-dimensional simplex embedded within the n-dimensional Euclidean space \mathbb{R}^n.^[15] The vertices of the probability simplex \Delta^{n-1} are the standard basis vectors e_i \in \mathbb{R}^n, where e_i has a 1 in the i-th position and 0 elsewhere, for i = 1, \dots, n.^[16] The faces of the simplex are the convex hulls of subsets of these vertices and correspond to the lower-dimensional parts of the simplex on which one or more components p_i are zero.^[17] The probability simplex lies in the hyperplane \sum_{i=1}^n p_i = 1, which is its affine hull, that is, the smallest affine subspace containing it.^[18] Within this structure, the components of a probability vector p directly serve as its barycentric coordinates with respect to the vertices e_i, expressing p as the convex combination \sum_{i=1}^n p_i e_i.^[19]
While the simplex inherits the Euclidean metric from \mathbb{R}^n, where the distance between two points p, q \in \Delta^{n-1} is \|p - q\|_2, a more natural metric for probability vectors is the total variation distance, defined as d_{\text{TV}}(p, q) = \frac{1}{2} \|p - q\|_1 = \max_{A \subseteq \{1, \dots, n\}} |p(A) - q(A)|, where p(A) = \sum_{i \in A} p_i. It measures the maximum discrepancy in probability that the two distributions assign to any subset of outcomes.^[20]
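The two expressions for the total variation distance can be compared directly on a small example. The sketch below assumes vectors of equal length; the subset search is exponential in n and is included only to illustrate the identity, not as a practical method.

```python
from itertools import combinations

def tv_distance(p, q):
    """Total variation distance as half the 1-norm of p - q."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def tv_by_subsets(p, q):
    """Maximum of |p(A) - q(A)| over all subsets A of the index set."""
    n = len(p)
    best = 0.0
    for r in range(n + 1):
        for A in combinations(range(n), r):
            diff = abs(sum(p[i] for i in A) - sum(q[i] for i in A))
            best = max(best, diff)
    return best

p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
half_l1 = tv_distance(p, q)       # 0.3 up to rounding
max_subset = tv_by_subsets(p, q)  # same value, attained at A = {0}
```

The maximizing subset collects the indices where p_i exceeds q_i, which is why the maximum equals half of the 1-norm of the difference.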

Visualization and Dimensionality

For n=2, the probability simplex is a line segment in the plane connecting the points (1,0) and (0,1). This segment parameterizes all probability vectors as p = (\theta, 1-\theta) for \theta \in [0,1], providing a simple geometric representation of binary probability distributions.^[21]
For n=3, the probability simplex is an equilateral triangle embedded in the plane \sum p_i = 1 with p_i \geq 0. The vertices of the triangle correspond to the Dirac distributions at each outcome, namely (1,0,0), (0,1,0), and (0,0,1), while the center point represents the uniform distribution p = (1/3, 1/3, 1/3). This picture is the basis of the ternary plot visualization.^[22]^[23]
In higher dimensions, the probability simplex exhibits the curse of dimensionality: its (n-1)-dimensional volume is \sqrt{n} / (n-1)!, which decreases factorially and makes direct geometric intuition challenging. Random points sampled uniformly from the simplex concentrate near the boundaries and faces rather than the interior, reflecting the sparse nature of high-dimensional space. The uniform distribution on the simplex corresponds to the Dirichlet distribution with all parameters equal to 1, serving as a natural reference measure. To facilitate visualization, dimensionality reduction techniques like principal component analysis (PCA), adapted for compositional data on the simplex, project high-dimensional vectors onto lower-dimensional spaces while preserving key structural properties.^[21]^[24]^[25]
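Uniform samples from the simplex, that is, draws from the Dirichlet distribution with all parameters equal to 1, can be generated by normalizing independent unit-rate exponential variables. The following sketch uses only the standard library; the function name and the seed are illustrative assumptions.

```python
import random

def sample_uniform_simplex(n, rng):
    """Draw one point uniformly from the (n-1)-simplex.

    Normalizing n independent Exp(1) variables gives a Dirichlet(1, ..., 1) draw.
    """
    e = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(e)
    return [x / total for x in e]

rng = random.Random(0)            # fixed seed so the sketch is reproducible
point = sample_uniform_simplex(5, rng)

# Average first coordinate over many draws; by symmetry it should be near 1/n.
draws = [sample_uniform_simplex(4, rng) for _ in range(2000)]
mean_first = sum(d[0] for d in draws) / len(draws)
```

Simply drawing n uniform numbers and dividing by their sum also yields a probability vector, but the result is not uniformly distributed on the simplex, which is why exponentials are used here.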

Examples

Basic Discrete Distributions

Probability vectors provide a compact representation for the probability mass functions of basic discrete distributions, encoding the likelihood of each possible outcome in a finite sample space. The Bernoulli distribution, modeling binary outcomes such as success or failure in a single trial, is represented by the probability vector \mathbf{p} = [1 - q, q], where q \in [0, 1] denotes the success probability.^[26] For a biased coin with heads probability 0.65, this becomes \mathbf{p} = [0.35, 0.65].^[26] The discrete uniform distribution assigns equal probability to each of n outcomes, yielding the probability vector \mathbf{p} = \left[ \frac{1}{n}, \frac{1}{n}, \dots, \frac{1}{n} \right].^[27] This form captures scenarios like a fair n-sided die, where every face is equally likely. A point mass, or degenerate distribution, places all probability on one specific outcome, resulting in a probability vector with a single 1 and zeros elsewhere, such as \mathbf{p} = [0, \dots, 0, 1, 0, \dots, 0] at the k-th position.^[28] Biased discrete distributions over more than two outcomes, modeled by the multinoulli (or categorical) distribution, use probability vectors with unequal non-zero components summing to 1; for example, \mathbf{p} = [0.5, 0.25, 0.25] might represent a three-outcome process like a weighted die.^[29] The multinomial distribution extends this to categorized counts, parameterized by a probability vector over k categories; an illustrative case is \mathbf{p} = [0.3, 0.5, 0.07, 0.1, 0.03] for five categories.^[30] These vectors must consist of non-negative components that sum to unity to qualify as valid probability representations.^[28]
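The basic distributions above translate directly into short constructors, and sampling from any of them only needs the vector itself. This is an illustrative sketch: the helper names are hypothetical, indices are 0-based, and sampling relies on the standard library's `random.choices` with a `weights` argument.

```python
import random

def bernoulli_vector(q):
    """PMF vector [1 - q, q] for success probability q in [0, 1]."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    return [1.0 - q, q]

def uniform_vector(n):
    """Discrete uniform distribution over n outcomes."""
    return [1.0 / n] * n

def point_mass(n, k):
    """Degenerate distribution with all mass on 0-based index k."""
    return [1.0 if i == k else 0.0 for i in range(n)]

def empirical_frequencies(p, num_draws, rng):
    """Sample outcome indices according to p and return relative frequencies."""
    draws = rng.choices(range(len(p)), weights=p, k=num_draws)
    counts = [0] * len(p)
    for d in draws:
        counts[d] += 1
    return [c / num_draws for c in counts]

rng = random.Random(42)
weighted_die = [0.5, 0.25, 0.25]
freqs = empirical_frequencies(weighted_die, 20000, rng)
```

With enough draws the empirical frequencies form a probability vector close to the one used for sampling.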

Vectors in Stochastic Processes

In stochastic processes, probability vectors serve as initial distributions for Markov chains, capturing the starting probabilities across states. The initial distribution is denoted by a row vector \pi_0, where each component \pi_0(i) represents the probability of beginning in state i, satisfying \sum_i \pi_0(i) = 1. The state distribution at time t, \pi_t, evolves as \pi_t = \pi_0 P^t, with P as the one-step transition matrix whose entries P_{ij} denote the probability of transitioning from state i to j.^[31] This formulation allows the probability vector to propagate through the process, reflecting the dynamic nature of the system's state probabilities over discrete time steps. A key feature in Markov chains is the stationary distribution \pi, a probability vector that remains unchanged under the transition matrix, satisfying \pi P = \pi and \pi \mathbf{1} = 1, where \mathbf{1} is a column vector of ones.^[32] For irreducible and aperiodic chains, the distribution \pi_t converges to this stationary \pi as t \to \infty, regardless of the initial \pi_0. Consider a two-state Markov chain modeling weather (sunny or rainy), with initial distribution \pi_0 = [0.7, 0.3] indicating a 70% chance of starting sunny. Suppose the transition matrix is P = \begin{pmatrix} 0.8 & 0.2 \\ 0.4 & 0.6 \end{pmatrix}; then \pi_1 = [0.68, 0.32], \pi_2 = [0.672, 0.328], and further iterations approach the stationary distribution [\frac{2}{3}, \frac{1}{3}], illustrating convergence to equilibrium.^[33] In absorbing Markov chains, certain states are inescapable, and the probability vector eventually concentrates mass on these absorbing states. An absorbing state j has P_{jj} = 1, so once entered, the process remains there indefinitely. Starting from a transient state, repeated multiplication by P shifts the probability vector toward the absorbing one, such as [0, 1] for a two-state chain where the second state absorbs all probability after sufficient steps. 
This behavior models scenarios like gambler's ruin, where the vector represents the evolving probability of ruin or continuation until absorption occurs with probability 1.^[34] Probability vectors also arise in discretizing continuous stochastic processes, such as the Poisson process, which counts events occurring randomly over time at rate \lambda. For a fixed interval [0, t], the number of events follows a Poisson distribution with parameter \lambda t, but discretization into n small subintervals approximates this via a binomial distribution: each subinterval has success probability p = \lambda t / n, yielding event count probabilities as a vector [ \Pr(K=0), \Pr(K=1), \dots, \Pr(K=n) ], where K \sim \text{Binomial}(n, p). As n \to \infty and p \to 0 with np = \lambda t fixed, this vector converges to the Poisson probabilities, providing a discrete vector representation for computational analysis of event counts.
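The two-state weather chain from this section can be iterated numerically to reproduce \pi_1, \pi_2, and the approach to the stationary distribution. The sketch below uses nested lists and a hypothetical helper `step`; the absorbing matrix at the end is a made-up example, not taken from a cited source.

```python
def step(pi, P):
    """One step of the chain in the row convention: return pi P."""
    return [sum(pi[i] * P[i][j] for i in range(len(pi))) for j in range(len(P[0]))]

# Weather chain from the text: states are (sunny, rainy).
P = [[0.8, 0.2],
     [0.4, 0.6]]
pi0 = [0.7, 0.3]
pi1 = step(pi0, P)          # approximately [0.68, 0.32]
pi2 = step(pi1, P)          # approximately [0.672, 0.328]

pi = pi0
for _ in range(100):        # iterate well past convergence
    pi = step(pi, P)        # approaches [2/3, 1/3]

# Illustrative absorbing chain: the second state is absorbing (P_22 = 1).
P_abs = [[0.5, 0.5],
         [0.0, 1.0]]
absorbed = [1.0, 0.0]
for _ in range(60):
    absorbed = step(absorbed, P_abs)   # mass moves to the absorbing state
```

Convergence of the weather chain is geometric: the second eigenvalue of P is 0.4, so the distance to the stationary vector shrinks by that factor at every step.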

Applications

In Probability Theory

In probability theory, a probability vector provides a compact representation for the probability mass function (PMF) of a discrete random variable defined over a finite sample space. Specifically, for a random variable X taking values in a finite set \{x_1, x_2, \dots, x_n\}, the PMF is encoded by the vector \mathbf{p} = (p_1, p_2, \dots, p_n)^\top where p_i = P(X = x_i) for each i, ensuring \sum_{i=1}^n p_i = 1 and p_i \geq 0. This vector form simplifies algebraic manipulations of discrete distributions; for example, the PMF of the sum of two independent discrete random variables is given by the discrete convolution of their PMF vectors.^[35]^[36] The expectation of X, a fundamental concept, is directly computed using the probability vector as E[X] = \sum_{i=1}^n x_i p_i, equivalent to the dot product \mathbf{x}^\top \mathbf{p} where \mathbf{x} = (x_1, x_2, \dots, x_n)^\top is the vector of outcomes. This formulation extends naturally to higher moments, such as variance, via E[X^2] = \sum_{i=1}^n x_i^2 p_i, enabling efficient calculation of distributional properties without enumerating the sample space explicitly. Such vector-based expectations underpin analyses in discrete probability, including risk assessment and decision theory.^[6] Probability vectors play a key role in Bayesian updating and mixture models. In the discrete case of Bayes' theorem, the posterior probability vector \mathbf{\pi}' is obtained by element-wise multiplication of the prior vector \mathbf{\pi} and the likelihood vector \mathbf{L}, followed by normalization: \mathbf{\pi}' = \frac{\mathbf{\pi} \odot \mathbf{L}}{\sum (\mathbf{\pi} \odot \mathbf{L})}. This update preserves the probabilistic structure and is central to inference over finite hypothesis spaces. 
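The discrete Bayesian update described above amounts to an element-wise product followed by normalization. A minimal sketch, with a hypothetical function name and made-up prior and likelihood values for three hypotheses:

```python
def bayes_update(prior, likelihood):
    """Posterior proportional to prior * likelihood, normalized to sum to 1."""
    unnormalized = [a * b for a, b in zip(prior, likelihood)]
    evidence = sum(unnormalized)   # normalizing constant (marginal likelihood)
    if evidence <= 0:
        raise ValueError("the data has zero probability under every hypothesis")
    return [u / evidence for u in unnormalized]

prior = [0.5, 0.3, 0.2]            # illustrative prior over three hypotheses
likelihood = [0.1, 0.4, 0.7]       # illustrative P(data | hypothesis)
posterior = bayes_update(prior, likelihood)   # [5/31, 12/31, 14/31]
```

The likelihood vector does not need to sum to one; only the prior and the posterior are probability vectors.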
For mixtures, a compound distribution arises as a convex combination \mathbf{p} = \sum_{k=1}^m w_k \mathbf{p}_k where w_k > 0, \sum w_k = 1, and each \mathbf{p}_k is a component PMF vector, modeling heterogeneous populations as in latent class analysis.^[37]^[38]
Let \mathbf{N}_n denote the vector of category counts obtained from n independent draws from the categorical distribution \mathbf{p}, so that \mathbf{N}_n follows a multinomial distribution. By the central limit theorem, the centered and scaled vector (\mathbf{N}_n - n \mathbf{p}) / \sqrt{n} converges in distribution to a multivariate normal distribution with mean \mathbf{0} and covariance matrix \operatorname{diag}(\mathbf{p}) - \mathbf{p} \mathbf{p}^\top. This result establishes the asymptotic normality of the probability vector estimate \mathbf{N}_n / n, with the covariance structure reflecting both the diagonal variances p_i(1 - p_i) and the off-diagonal covariances - p_i p_j that arise because the counts have a fixed total. It provides a foundation for large-sample approximations in multinomial settings, such as hypothesis testing for categorical data.^[39]
Probability generating functions further leverage the vector form for univariate discrete cases, defined as G(s) = \sum_{i=0}^\infty p_i s^i = \mathbf{p}^\top \mathbf{s} where \mathbf{s} = (1, s, s^2, \dots)^\top (truncated for finite support). Derivatives of G(s) at s=1 yield moments, such as E[X] = G'(1), and the generating function of a sum of independent variables is the product of their generating functions, which mirrors the convolution of their PMF vectors. This approach is particularly useful in branching processes and queueing theory, where the functional form encodes recursive distributional properties.^[40]
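The limiting covariance matrix \operatorname{diag}(\mathbf{p}) - \mathbf{p} \mathbf{p}^\top can be written out explicitly for a small vector. The sketch below uses nested lists and a hypothetical function name; it shows the diagonal variances, the negative off-diagonal entries, and the zero row sums that follow from \sum_j p_j = 1.

```python
def multinomial_covariance(p):
    """Return diag(p) - p p^T as a nested list."""
    n = len(p)
    return [[(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(n)]
            for i in range(n)]

p = [0.5, 0.25, 0.25]
cov = multinomial_covariance(p)
# cov[0][0] = 0.5 * (1 - 0.5) = 0.25
# cov[0][1] = -0.5 * 0.25   = -0.125
row_sums = [sum(row) for row in cov]   # each row sums to zero
```

The zero row sums mean the matrix is singular: the fluctuations of the counts are confined to directions that keep the total fixed.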

In Computing and Optimization

In stochastic gradient descent (SGD), probability vectors define sampling distributions for mini-batches to improve gradient estimates and convergence rates. Importance sampling variants of SGD select data points non-uniformly, where the probability vector assigns higher probabilities to points with larger gradient magnitudes, reducing variance in the stochastic approximation. For instance, prioritized experience replay in reinforcement learning samples mini-batches according to a probability vector whose entries increase with the magnitude of the temporal-difference error, enabling more efficient updates in deep networks.^[41]^[42]
Optimization problems over the probability simplex often involve maximizing entropy subject to linear constraints. Because the entropy objective is concave and the constraints are linear, these are convex optimization problems rather than linear programs. The maximum entropy distribution is obtained by solving \max -\sum_i p_i \log p_i subject to \sum_i p_i = 1, p_i \geq 0, and moment constraints like \sum_i p_i \mu_i = m, where p = (p_1, \dots, p_n) is the probability vector and \mu_i are feature values. This approach yields the least informative distribution consistent with observed moments and can be solved efficiently using convex optimization techniques such as interior-point methods; the same entropy term also appears as a regularizer in entropically regularized linear programs.^[43]^[44]
In machine learning, the softmax function converts raw neural network outputs into a probability vector for multi-class classification tasks. For an input vector z \in \mathbb{R}^K, the softmax produces p_k = \frac{\exp(z_k)}{\sum_{j=1}^K \exp(z_j)} for k=1,\dots,K, ensuring \sum_k p_k = 1 and p_k > 0, which represents predicted class probabilities. This output is trained by minimizing the cross-entropy loss -\sum_i p_i \log q_i between the true label distribution p and the model predictions q. When p is one-hot, its entropy is zero and the cross-entropy coincides with the Kullback-Leibler (KL) divergence D(p \| q) = \sum_i p_i \log(p_i / q_i). The KL divergence measures distributional mismatch and encourages calibrated probabilities in classifiers like logistic regression and deep neural networks.
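The softmax map and its relation to cross-entropy and KL divergence can be sketched in a few lines. The version below subtracts the maximum logit before exponentiating, a common stabilization that leaves the result unchanged; the function names and the example logits are illustrative.

```python
import math

def softmax(z):
    """Map real-valued scores to a probability vector."""
    m = max(z)                              # shift to avoid overflow in exp
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, q):
    """Cross-entropy -sum p_i log q_i, skipping terms with p_i = 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL divergence sum p_i log(p_i / q_i), skipping terms with p_i = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

logits = [2.0, 1.0, 0.1]                    # made-up network outputs
q = softmax(logits)
one_hot = [1.0, 0.0, 0.0]                   # true class is index 0
loss = cross_entropy(one_hot, q)            # equals -log q[0]
```

For a one-hot label the loss reduces to the negative log-probability assigned to the true class, and it equals `kl_divergence(one_hot, q)` because the label distribution has zero entropy.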
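Markov chain Monte Carlo (MCMC) methods rely on probability vectors as rows of the transition matrix P, where each row \mathbf{p}_i specifies the distribution over next states from state i. In Metropolis-Hastings sampling, a move from state i to a state j drawn from a proposal distribution q(j|i) is accepted with probability \min(1, \frac{\pi(j) q(i|j)}{\pi(i) q(j|i)}). The resulting transition matrix P has rows that are valid probability vectors and satisfies detailed balance with respect to the target distribution \pi, that is, \pi(i) P_{ij} = \pi(j) P_{ji}, so \pi is stationary and, for an irreducible and aperiodic chain, the state distribution converges to it. This framework enables approximate sampling from complex posteriors in Bayesian inference. In practice the chain is simulated one state at a time rather than by explicit matrix-vector multiplication, although the distribution of the state still evolves as \pi_t = \pi_0 P^t from the initial state vector.^[45]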
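For a small finite state space, the Metropolis-Hastings transition matrix can be built explicitly and checked against the properties stated above. This is an illustrative sketch: the target vector and the proposal matrix `Q`, with `Q[i][j]` playing the role of q(j|i), are made-up values, and the target is assumed to have strictly positive entries.

```python
def metropolis_hastings_matrix(target, Q):
    """Build the Metropolis-Hastings transition matrix for a finite state space."""
    n = len(target)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and Q[i][j] > 0:
                ratio = (target[j] * Q[j][i]) / (target[i] * Q[i][j])
                P[i][j] = Q[i][j] * min(1.0, ratio)   # propose j, then accept
        P[i][i] = 1.0 - sum(P[i])   # rejected proposals keep the chain at i
    return P

target = [0.2, 0.3, 0.5]
Q = [[0.0, 0.5, 0.5],     # from each state, propose one of the other two
     [0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0]]
P = metropolis_hastings_matrix(target, Q)

row_sums = [sum(row) for row in P]
next_dist = [sum(target[i] * P[i][j] for i in range(3)) for j in range(3)]
```

Each row of `P` sums to one, and `next_dist` reproduces `target`, which is the stationarity condition \pi P = \pi implied by detailed balance.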