
Probability vector

A probability vector, also known as a stochastic vector, is a vector in probability theory and statistics whose components are non-negative real numbers that sum to exactly one. This structure ensures that the vector represents a valid probability distribution over a finite set of mutually exclusive and exhaustive outcomes, where each component denotes the probability of a specific outcome occurring. For example, the vector [0.3, 0.5, 0.2] could describe the probabilities of rain, clouds, or sunshine on a given day, with the non-negativity constraint preventing negative probabilities and the summation to one guaranteeing completeness.

The set of all probability vectors of dimension n forms the standard (n-1)-simplex in \mathbb{R}^n, a geometric object that is convex and compact; in particular, any convex combination of probability vectors is itself a probability vector. Key algebraic properties include closure under multiplication by stochastic matrices, which preserve the probability vector structure and are central to modeling dynamic systems. These vectors are normalized in the 1-norm (\| \mathbf{p} \|_1 = 1) and lie within the unit hypercube [0,1]^n, although only those points of the hypercube whose coordinates sum to one belong to the simplex.

Probability vectors find extensive applications across disciplines, most notably in Markov chains, where they describe the current state distribution of a stochastic process and evolve via transition matrices toward steady-state distributions. In probability theory, they underpin discrete random variables and enable computations such as expected values and entropy measures. In quantum mechanics, the squared moduli of the components of a normalized state vector yield a probability vector for measurement outcomes, bridging classical probability with quantum superpositions. Other uses include decision theory, where they represent belief states, and optimization problems in operations research, such as resource allocation under uncertainty.

Definition and Formalism

Definition

A probability vector is an n-dimensional vector \mathbf{p} = (p_1, p_2, \dots, p_n) where each p_i \geq 0 for i = 1, \dots, n and \sum_{i=1}^n p_i = 1. This structure captures the probabilities assigned to each outcome in a sample space with n elements.

Unlike a general real-valued vector, which may have arbitrary components without sign or magnitude restrictions, a probability vector imposes non-negativity on all entries so that they represent valid probabilities, and requires normalization to unity so that the total probability mass is preserved. These constraints distinguish probability vectors from ordinary vectors in linear algebra, embedding them within a bounded geometric space known as the probability simplex.

In probability theory, a probability vector encodes the probability mass function (PMF) of a discrete random variable with finite support, where each p_i denotes the probability that the random variable takes the value associated with the i-th outcome. This representation facilitates the modeling of probability distributions in various contexts.
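The two defining constraints can be checked numerically. The following is a minimal sketch, not a canonical implementation; the function name is_probability_vector and the tolerance tol are assumptions introduced here, since floating-point sums rarely equal one exactly.

```python
import math

def is_probability_vector(p, tol=1e-9):
    """Return True if p is non-empty, non-negative, and sums to 1 within tol.

    The name and the tolerance are illustrative assumptions.
    """
    if len(p) == 0:
        return False
    if any(x < 0 for x in p):
        return False  # violates non-negativity
    # math.fsum reduces floating-point rounding error in the sum
    return math.isclose(math.fsum(p), 1.0, abs_tol=tol)

print(is_probability_vector([0.3, 0.5, 0.2]))   # satisfies both constraints
print(is_probability_vector([0.6, 0.6, -0.2]))  # has a negative entry
print(is_probability_vector([0.5, 0.4]))        # sums to 0.9
```

The tolerance is a design choice: a stricter or looser value may be appropriate depending on how the vector was computed.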

Notation and Representation

A probability vector \mathbf{p} in \mathbb{R}^n is commonly denoted using boldface to indicate its vector nature, with individual components p_i for i = 1, \dots, n. These components satisfy p_i \geq 0 (componentwise non-negativity) and \sum_{i=1}^n p_i = 1, or equivalently in vector form, \mathbf{p} \in \mathbb{R}^n with \mathbf{p} \geq \mathbf{0} and \mathbf{1}^T \mathbf{p} = 1, where \mathbf{1} denotes the all-ones vector of dimension n.

The representation as a row or column vector depends on the context. In linear algebra and general probability, \mathbf{p} is often a column vector, facilitating operations like expectation, written as \mathbb{E}[X] = \sum_{i=1}^n p_i x_i. In Markov chain theory, however, it is standard to use row vectors for state distributions, enabling updates via right-multiplication by the transition matrix: \mathbf{p}^{(t+1)} = \mathbf{p}^{(t)} P, where P is row-stochastic. Column vector notation appears in column-stochastic settings, such as certain optimization problems, where updates take the form \mathbf{p}^{(t+1)} = P \mathbf{p}^{(t)}.

Probability vectors frequently appear as rows or columns within matrices. A row-stochastic matrix has each row summing to 1, making every row a probability vector that represents the transition probabilities out of a given state. Conversely, a column-stochastic matrix has columns that are probability vectors, often used in contexts like stationary distributions solved via P \mathbf{\pi} = \mathbf{\pi}.

For discrete point masses, probability vectors align with Dirac delta-like basis representations. The standard basis vectors \mathbf{e}_k \in \mathbb{R}^n, defined with a 1 in the k-th position and 0 elsewhere (i.e., \mathbf{e}_k = (\delta_{1k}, \dots, \delta_{nk})^T), function as probability vectors corresponding to certain outcomes with probability 1. These form an orthonormal basis for \mathbb{R}^n, useful in expansions like \mathbf{p} = \sum_{k=1}^n p_k \mathbf{e}_k.
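The row and column conventions describe the same update once the matrix is transposed. The sketch below illustrates this in plain Python; the helper names and the 2x2 matrix are illustrative assumptions, not part of any library.

```python
def row_times_matrix(p, P):
    """Row convention: compute p P for a row-stochastic matrix P."""
    return [sum(p[i] * P[i][j] for i in range(len(p))) for j in range(len(P[0]))]

def matrix_times_column(Q, p):
    """Column convention: compute Q p for a column-stochastic matrix Q."""
    return [sum(Q[i][j] * p[j] for j in range(len(p))) for i in range(len(Q))]

def transpose(P):
    """Swap rows and columns of a matrix given as a list of lists."""
    return [list(col) for col in zip(*P)]

# Illustrative row-stochastic matrix: every row sums to 1.
P = [[0.8, 0.2],
     [0.4, 0.6]]
p = [0.7, 0.3]

row_update = row_times_matrix(p, P)                 # p^(t+1) = p^(t) P
col_update = matrix_times_column(transpose(P), p)   # p^(t+1) = P^T p^(t)
print(row_update, col_update)
```

Both calls produce the same updated probability vector, which is why the choice of convention is a matter of notation rather than substance.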

Mathematical Properties

Algebraic Properties

Probability vectors, defined as non-negative vectors \mathbf{p} \in \mathbb{R}^n satisfying \sum_{i=1}^n p_i = 1 (or equivalently, \mathbf{1}^\top \mathbf{p} = 1), form the standard probability simplex \Delta_{n-1}, which is a convex set. This convexity implies that the set of probability vectors is closed under convex combinations: for probability vectors \mathbf{p}^{(1)}, \dots, \mathbf{p}^{(k)} and coefficients \alpha_1, \dots, \alpha_k \geq 0 with \sum_{i=1}^k \alpha_i = 1, the vector \sum_{i=1}^k \alpha_i \mathbf{p}^{(i)} is also a probability vector, as it remains non-negative and sums to 1. In fact, the probability simplex is the convex hull of the standard basis vectors \mathbf{e}_1, \dots, \mathbf{e}_n in \mathbb{R}^n, confirming its polytope structure.

The set of probability vectors is not closed under vector addition, as the sum of two such vectors \mathbf{p} + \mathbf{q} is non-negative but sums to 2, violating the normalization condition. However, any non-zero non-negative vector can be normalized into a probability vector: given \mathbf{v} \in \mathbb{R}^n_{\geq 0} with \sum_{i=1}^n v_i > 0, the vector \mathbf{p} = \mathbf{v} / \|\mathbf{v}\|_1, where \|\mathbf{v}\|_1 = \sum_{i=1}^n v_i, is a probability vector. This operation maps the non-negative orthant, excluding the origin, onto the probability simplex.

Under the standard inner product \langle \mathbf{p}, \mathbf{q} \rangle = \sum_{i=1}^n p_i q_i, probability vectors inherit orthogonality from \mathbb{R}^n: two probability vectors are orthogonal if their inner product is zero, which, due to non-negativity, requires disjoint supports. The uniform vector \mathbf{u} = (1/n, \dots, 1/n) serves as the centroid (barycenter) of the probability simplex, obtained as the average of its vertices \mathbf{e}_1, \dots, \mathbf{e}_n.
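The closure properties described above can be illustrated with a short sketch. The function names convex_combination and normalize are hypothetical helpers introduced for this example.

```python
import math

def convex_combination(vectors, weights):
    """Return sum_i weights[i] * vectors[i]; weights must be >= 0 and sum to 1."""
    if any(w < 0 for w in weights) or not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    n = len(vectors[0])
    return [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(n)]

def normalize(v):
    """Divide a non-negative, non-zero vector by its 1-norm."""
    total = sum(v)
    if total <= 0 or any(x < 0 for x in v):
        raise ValueError("v must be non-negative with a positive sum")
    return [x / total for x in v]

p = [1.0, 0.0, 0.0]
q = [0.2, 0.3, 0.5]

mix = convex_combination([p, q], [0.25, 0.75])
print(mix, sum(mix))          # the mixture is again a probability vector

added = [a + b for a, b in zip(p, q)]
print(sum(added))             # plain addition gives a total of 2, not 1

print(normalize([2, 1, 1]))   # rescaling restores the unit sum
```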

Statistical Properties

A probability vector p = (p_1, p_2, \dots, p_n) \in \mathbb{R}^n with p_i \geq 0 and \sum_{i=1}^n p_i = 1 has components whose mean value is \mu = \frac{1}{n}, which follows directly from the normalization condition: the components total one, and dividing by the number of components gives 1/n. This mean is the same for every probability vector of dimension n, so it provides a baseline for assessing concentration or spread in the distribution encoded by p. The variance of the components, defined as \sigma^2 = \frac{1}{n} \sum_{i=1}^n (p_i - \frac{1}{n})^2, quantifies the dispersion around this mean and serves as a measure of non-uniformity in the probability assignment.

To relate this to the Euclidean norm of p, note that the squared norm satisfies \|p\|_2^2 = \sum_{i=1}^n p_i^2. Expanding the variance expression yields n \sigma^2 = \sum_{i=1}^n p_i^2 - 2 \cdot \frac{1}{n} \sum_{i=1}^n p_i + n \cdot \left( \frac{1}{n} \right)^2 = \sum_{i=1}^n p_i^2 - \frac{1}{n}, so \|p\|_2^2 = n \sigma^2 + \frac{1}{n} or \|p\|_2 = \sqrt{n \sigma^2 + \frac{1}{n}}. This connection highlights how the norm encodes both the inherent uniformity (via the 1/n term) and the deviation from it (via \sigma^2).

The variance \sigma^2 is bounded by 0 \leq \sigma^2 \leq \frac{n-1}{n^2}. The lower bound of 0 is achieved when p is the uniform vector p_i = \frac{1}{n} for all i, corresponding to maximum evenness. The upper bound is attained at any delta vector (standard basis vector), where one component is 1 and the others are 0, maximizing concentration. To verify the upper bound, substitute the delta case into the variance formula: \sigma^2 = \frac{1}{n} \left[ (1 - \frac{1}{n})^2 + (n-1) \left(0 - \frac{1}{n}\right)^2 \right] = \frac{1}{n} \left[ \left(\frac{n-1}{n}\right)^2 + (n-1) \frac{1}{n^2} \right] = \frac{1}{n} \cdot \frac{(n-1)^2 + (n-1)}{n^2} = \frac{n-1}{n^2}. In high dimensions (n \gg 1), even the maximum variance is approximately \frac{1}{n}, implying that components tend to be small unless the vector is sharply peaked, which underscores the role of variance in gauging probabilistic concentration at scale.

Another key statistical measure for probability vectors is the Shannon entropy H(p) = -\sum_{i=1}^n p_i \log p_i (typically using base-2 or natural log, with the convention 0 \log 0 = 0), which quantifies the average uncertainty or diversity inherent in the distribution. Entropy is concave because each term -p_i \log p_i is a concave function of p_i, and on the probability simplex it achieves its maximum value of \log n at the uniform vector, reflecting maximal unpredictability. This property aligns with the algebraic convexity of probability vectors: the entropy of a mixture is at least the correspondingly weighted average of the entropies of its components.
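The norm identity, the variance bounds, and the entropy maximum can be checked numerically. This is a small sketch under the definitions above; the function names are assumptions introduced for illustration.

```python
import math

def component_variance(p):
    """Variance of the components of p around their mean 1/n."""
    n = len(p)
    return sum((x - 1.0 / n) ** 2 for x in p) / n

def squared_norm(p):
    """Squared Euclidean norm of p."""
    return sum(x * x for x in p)

def shannon_entropy(p, base=math.e):
    """Shannon entropy with the convention 0 log 0 = 0."""
    return -sum(x * math.log(x) / math.log(base) for x in p if x > 0)

p = [0.5, 0.25, 0.25]
n = len(p)

# Identity: ||p||_2^2 = n * sigma^2 + 1/n
print(squared_norm(p), n * component_variance(p) + 1.0 / n)

# Variance bounds: 0 at the uniform vector, (n-1)/n^2 at a delta vector
print(component_variance([1.0 / n] * n), component_variance([1.0, 0.0, 0.0]), (n - 1) / n ** 2)

# Entropy: log n at the uniform vector, 0 at a delta vector
print(shannon_entropy([1.0 / n] * n), math.log(n), shannon_entropy([1.0, 0.0, 0.0]))
```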

Geometric Interpretation

The Probability Simplex

The probability simplex, denoted as \Delta_{n-1}, is the set of all probability vectors in \mathbb{R}^n, formally defined as \Delta_{n-1} = \{ p \in \mathbb{R}^n \mid p_i \geq 0 \ \forall i, \ \sum_{i=1}^n p_i = 1 \}. This forms an (n-1)-dimensional convex polytope embedded within the n-dimensional Euclidean space \mathbb{R}^n.

The vertices of the probability simplex \Delta_{n-1} are the standard basis vectors e_i \in \mathbb{R}^n, where e_i has a 1 in the i-th position and 0 elsewhere, for i = 1, \dots, n. The faces of the simplex are the convex hulls of subsets of these vertices and correspond to the lower-dimensional simplices on which one or more components p_i = 0.

The probability simplex lies in the hyperplane \sum_{i=1}^n p_i = 1, which is its affine hull, the smallest affine subspace containing it. Within this structure, the components of a probability vector p directly serve as its barycentric coordinates with respect to the vertices e_i, expressing p as the convex combination \sum_{i=1}^n p_i e_i.

While the simplex inherits the Euclidean metric from \mathbb{R}^n, where the distance between two points p, q \in \Delta_{n-1} is \|p - q\|_2, a more natural metric for probability vectors is the total variation distance, defined as d_{\text{TV}}(p, q) = \frac{1}{2} \|p - q\|_1 = \max_{A \subseteq \{1, \dots, n\}} |p(A) - q(A)|, where p(A) = \sum_{i \in A} p_i. It measures the maximum discrepancy in the probabilities that p and q assign to any subset of outcomes.
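The two expressions for the total variation distance can be compared directly on a small example. The sketch below uses brute-force enumeration of subsets, which is feasible only for small n; the function names and the example vectors are illustrative assumptions.

```python
from itertools import combinations

def tv_half_l1(p, q):
    """Total variation distance as half the 1-norm of p - q."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def tv_max_over_events(p, q):
    """Total variation distance as the largest |p(A) - q(A)| over all subsets A.

    Enumerates all 2^n subsets, so this is only practical for small n.
    """
    n = len(p)
    best = 0.0
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            gap = abs(sum(p[i] - q[i] for i in subset))
            best = max(best, gap)
    return best

p = [0.3, 0.5, 0.2]
q = [0.35, 0.35, 0.3]
print(tv_half_l1(p, q), tv_max_over_events(p, q))
```

In this example the maximizing subset is the single outcome where p exceeds q, which is why both expressions agree.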

Visualization and Dimensionality

When n=2, the probability simplex is a line segment in the plane connecting the points (1,0) and (0,1). This segment parameterizes all probability vectors as p = (\theta, 1-\theta) for \theta \in [0,1], providing a simple geometric representation of probability distributions over two outcomes.

For n=3, the probability simplex is an equilateral triangle embedded in the plane \sum p_i = 1 with p_i \geq 0. The vertices of the triangle correspond to the Dirac distributions at each outcome, namely (1,0,0), (0,1,0), and (0,0,1), while the center point represents the uniform distribution p = (1/3, 1/3, 1/3).

In higher dimensions, the probability simplex exhibits the curse of dimensionality: its (n-1)-dimensional volume equals \sqrt{n} / (n-1)!, which decreases factorially and makes direct geometric intuition challenging. Random points sampled uniformly from the simplex concentrate near the boundaries and faces rather than the interior, reflecting the sparse nature of high-dimensional space. The uniform distribution on the simplex corresponds to the Dirichlet distribution with all parameters equal to 1, serving as a natural reference measure. To facilitate visualization, techniques like principal component analysis (PCA), adapted for compositional data on the simplex, project high-dimensional vectors onto lower-dimensional spaces while preserving key structural properties.
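The boundary concentration described above can be observed by sampling. The sketch below relies on the standard fact that normalizing independent Exponential(1) draws yields a Dirichlet(1, ..., 1) sample, which is uniform on the simplex; the function name, seed, and sample sizes are arbitrary choices made for illustration.

```python
import random

def sample_uniform_simplex(n, rng):
    """Draw one point uniformly from the simplex (Dirichlet with all parameters 1).

    Uses the normalized-exponentials construction.
    """
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)  # fixed seed so runs are repeatable
for n in (3, 10, 100):
    samples = [sample_uniform_simplex(n, rng) for _ in range(1000)]
    avg_min = sum(min(s) for s in samples) / len(samples)
    # Compare the typical smallest coordinate with the centroid value 1/n.
    print(n, avg_min, 1.0 / n)
```

As n grows, the average smallest coordinate falls well below 1/n, meaning a typical uniform sample sits close to at least one face of the simplex.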

Examples

Basic Discrete Distributions

Probability vectors provide a compact representation for the probability mass functions of basic discrete distributions, encoding the likelihood of each possible outcome in a finite sample space.

The Bernoulli distribution, modeling binary outcomes such as success or failure in a single trial, is represented by the probability vector \mathbf{p} = [1 - q, q], where q \in [0, 1] denotes the success probability. For a biased coin with heads probability 0.65, this becomes \mathbf{p} = [0.35, 0.65].

The discrete uniform distribution assigns equal probability to each of n outcomes, yielding the probability vector \mathbf{p} = \left[ \frac{1}{n}, \frac{1}{n}, \dots, \frac{1}{n} \right]. This form captures scenarios like a fair n-sided die, where every face is equally likely.

A point mass, or degenerate distribution, places all probability on one specific outcome, resulting in a probability vector with a single 1 and zeros elsewhere, such as \mathbf{p} = [0, \dots, 0, 1, 0, \dots, 0] with the 1 at the k-th position.

Biased discrete distributions over more than two outcomes, modeled by the categorical distribution, use probability vectors with unequal non-zero components summing to 1; for example, \mathbf{p} = [0.5, 0.25, 0.25] might represent a three-outcome process like a weighted die. The multinomial distribution extends this to counts over repeated trials, parameterized by a probability vector over k categories; an illustrative case is \mathbf{p} = [0.3, 0.5, 0.07, 0.1, 0.03] for five categories.

In every case, the vector must consist of non-negative components that sum to one to qualify as a valid probability representation.
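The vectors for these basic distributions can be constructed and sampled with a few lines of Python. This is a sketch only; the helper names are hypothetical, and positions are 0-based in the code even though the text counts from 1.

```python
import random

def bernoulli_vector(q):
    """Probability vector [1 - q, q] for success probability q."""
    return [1.0 - q, q]

def uniform_vector(n):
    """Probability vector of a discrete uniform distribution on n outcomes."""
    return [1.0 / n] * n

def point_mass(k, n):
    """Vector of length n with all probability on index k (0-based)."""
    return [1.0 if i == k else 0.0 for i in range(n)]

def sample_outcome(p, rng):
    """Draw one outcome index according to the probability vector p."""
    return rng.choices(range(len(p)), weights=p, k=1)[0]

rng = random.Random(42)  # fixed seed so runs are repeatable
coin = bernoulli_vector(0.65)
die = uniform_vector(6)
weighted = [0.5, 0.25, 0.25]

print(coin, die, point_mass(2, 5))
print([sample_outcome(weighted, rng) for _ in range(10)])
```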

Vectors in Stochastic Processes

In stochastic processes, probability vectors serve as initial distributions for Markov chains, capturing the starting probabilities across states. The initial distribution is denoted by a row vector \pi_0, where each component \pi_0(i) represents the probability of beginning in state i, satisfying \sum_i \pi_0(i) = 1. The distribution at time t, \pi_t, evolves as \pi_t = \pi_0 P^t, with P as the one-step transition matrix whose entries P_{ij} denote the probability of transitioning from state i to state j. This formulation allows the probability vector to propagate through the process, reflecting the dynamic nature of the system's probabilities over discrete time steps.

A key feature in Markov chains is the stationary distribution \pi, a probability vector that remains unchanged under the transition matrix, satisfying \pi P = \pi and \pi \mathbf{1} = 1, where \mathbf{1} is a column vector of ones. For finite irreducible and aperiodic chains, the distribution \pi_t converges to this stationary \pi as t \to \infty, regardless of the initial \pi_0. Consider a two-state Markov chain modeling weather (sunny or rainy), with initial distribution \pi_0 = [0.7, 0.3] indicating a 70% chance of starting sunny. Suppose the transition matrix is P = \begin{pmatrix} 0.8 & 0.2 \\ 0.4 & 0.6 \end{pmatrix}; then \pi_1 = [0.68, 0.32], \pi_2 = [0.672, 0.328], and further iterations approach the stationary distribution [\frac{2}{3}, \frac{1}{3}], illustrating convergence to equilibrium.

In absorbing Markov chains, certain states are inescapable, and the probability vector eventually concentrates its mass on these absorbing states. An absorbing state j has P_{jj} = 1, so once entered, the process remains there indefinitely. Starting from a transient state, repeated multiplication by P shifts the probability vector toward the absorbing states; in a two-state chain whose second state is absorbing, for example, the vector tends to [0, 1] as the number of steps grows. This behavior models scenarios like the gambler's ruin, where the vector represents the evolving probability of ruin or continuation until absorption occurs with probability 1.
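The weather example can be reproduced by iterating the row-vector update. The following sketch uses the matrix and initial distribution from the text; the function names step and evolve are assumptions introduced here.

```python
def step(pi, P):
    """One update of the row-vector recursion: returns pi P."""
    return [sum(pi[i] * P[i][j] for i in range(len(pi))) for j in range(len(P[0]))]

def evolve(pi0, P, t):
    """Distribution after t steps, computed as pi0 P^t by repeated updates."""
    pi = list(pi0)
    for _ in range(t):
        pi = step(pi, P)
    return pi

# Two-state weather chain from the text: state 0 is sunny, state 1 is rainy.
P = [[0.8, 0.2],
     [0.4, 0.6]]
pi0 = [0.7, 0.3]

print(evolve(pi0, P, 1))    # approximately [0.68, 0.32]
print(evolve(pi0, P, 2))    # approximately [0.672, 0.328]
print(evolve(pi0, P, 50))   # close to the stationary distribution [2/3, 1/3]

# Stationarity check: applying P to [2/3, 1/3] leaves it unchanged up to rounding.
print(step([2 / 3, 1 / 3], P))
```

Starting from a different initial vector, such as [0.0, 1.0], leads to the same limit, consistent with the convergence statement for irreducible aperiodic chains.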
Probability vectors also arise in discretizing continuous-time stochastic processes, such as the Poisson process, which counts events occurring randomly over time at rate \lambda. For a fixed interval [0, t], the number of events follows a Poisson distribution with parameter \lambda t, but discretization into n small subintervals approximates this via a binomial distribution: each subinterval has success probability p = \lambda t / n, yielding event count probabilities as a vector [ \Pr(K=0), \Pr(K=1), \dots, \Pr(K=n) ], where K \sim \text{Binomial}(n, p). As n \to \infty and p \to 0 with np = \lambda t fixed, the entries of this vector converge to the corresponding Poisson probabilities, providing a discrete vector representation for computational analysis of event counts.
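The binomial approximation to the Poisson probabilities can be examined numerically. In this sketch the values lam_t = 2.0 and n = 500 are arbitrary illustrative choices, and the function names are assumptions.

```python
import math

def binomial_vector(n, p):
    """Probability vector [Pr(K=0), ..., Pr(K=n)] for K ~ Binomial(n, p)."""
    return [math.comb(n, k) * p ** k * (1.0 - p) ** (n - k) for k in range(n + 1)]

def poisson_prob(k, mean):
    """Pr(K = k) for K ~ Poisson(mean)."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

lam_t = 2.0          # assumed value of lambda * t
n = 500              # assumed number of subintervals
b = binomial_vector(n, lam_t / n)

print(sum(b))        # the binomial vector sums to 1 up to rounding
for k in range(6):
    print(k, b[k], poisson_prob(k, lam_t))
```

Increasing n while holding lam_t fixed brings the leading binomial entries closer to the Poisson values.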

Applications

In Probability Theory

In probability theory, a probability vector provides a compact representation for the probability mass function (PMF) of a discrete random variable defined over a finite sample space. Specifically, for a random variable X taking values in a finite set \{x_1, x_2, \dots, x_n\}, the PMF is encoded by the vector \mathbf{p} = (p_1, p_2, \dots, p_n)^\top where p_i = P(X = x_i) for each i, ensuring \sum_{i=1}^n p_i = 1 and p_i \geq 0. This vector form simplifies algebraic manipulations of distributions; for example, the PMF of the sum of two independent integer-valued random variables is given by the convolution of their PMF vectors.

The expectation of X, a fundamental concept, is directly computed using the probability vector as E[X] = \sum_{i=1}^n x_i p_i, equivalent to the dot product \mathbf{x}^\top \mathbf{p} where \mathbf{x} = (x_1, x_2, \dots, x_n)^\top is the vector of outcomes. This formulation extends naturally to higher moments: the second moment is E[X^2] = \sum_{i=1}^n x_i^2 p_i, from which the variance follows as E[X^2] - (E[X])^2. Such vector-based expectations underpin analyses in discrete probability, including risk assessment and decision theory.

Probability vectors play a key role in Bayesian updating and mixture models. In the discrete case of Bayes' theorem, the posterior vector \mathbf{\pi}' is obtained by element-wise multiplication of the prior vector \mathbf{\pi} and the likelihood vector \mathbf{L}, followed by normalization: \mathbf{\pi}' = \frac{\mathbf{\pi} \odot \mathbf{L}}{\sum (\mathbf{\pi} \odot \mathbf{L})}. This update preserves the probabilistic structure and is central to Bayesian inference over finite hypothesis spaces. For mixtures, a compound distribution arises as a convex combination \mathbf{p} = \sum_{k=1}^m w_k \mathbf{p}_k where w_k > 0, \sum w_k = 1, and each \mathbf{p}_k is a component PMF vector, modeling heterogeneous populations as in latent class analysis.

By the multivariate central limit theorem, the centered and scaled count vector (\mathbf{N}_n - n \mathbf{p}) / \sqrt{n} from n independent multinomial draws converges in distribution to a multivariate normal distribution with mean \mathbf{0} and covariance matrix \operatorname{diag}(\mathbf{p}) - \mathbf{p} \mathbf{p}^\top. This result establishes the asymptotic normality of the probability vector estimates \mathbf{N}_n / n, with the covariance structure reflecting both the diagonal variances p_i(1 - p_i) and the off-diagonal dependencies - p_i p_j that arise from the fixed total sum. It provides a foundation for large-sample approximations in multinomial settings, such as hypothesis testing for categorical data.

Probability generating functions further leverage the vector form for univariate random variables on the non-negative integers, with p_i = P(X = i). They are defined as G(s) = \sum_{i=0}^\infty p_i s^i = \mathbf{p}^\top \mathbf{s} where \mathbf{s} = (1, s, s^2, \dots)^\top (truncated for finite support). Derivatives of G(s) at s=1 yield moments, such as E[X] = G'(1), and the generating function of a sum of independent variables is the product of the individual generating functions, mirroring the convolution of their PMF vectors. This approach is particularly useful in branching processes and queueing theory, where the functional form encodes recursive distributional properties.
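Several of the operations in this section reduce to a few lines of arithmetic on vectors. The sketch below covers the expectation as a dot product, the discrete Bayesian update, and convolution for sums of independent variables; the function names and numerical inputs are illustrative assumptions.

```python
def expectation(x, p):
    """E[X] as the dot product of outcome values x and probabilities p."""
    return sum(xi * pi for xi, pi in zip(x, p))

def bayes_update(prior, likelihood):
    """Posterior: element-wise product of prior and likelihood, then normalized."""
    unnormalized = [a * b for a, b in zip(prior, likelihood)]
    total = sum(unnormalized)
    if total <= 0:
        raise ValueError("the observation has zero probability under the prior")
    return [u / total for u in unnormalized]

def convolve(p, q):
    """PMF of X + Y for independent X, Y supported on 0..len(p)-1 and 0..len(q)-1."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

faces = [1, 2, 3, 4, 5, 6]
die = [1.0 / 6] * 6
mean = expectation(faces, die)
second_moment = expectation([f * f for f in faces], die)
print(mean, second_moment - mean ** 2)   # mean and variance of a fair die

# Two hypotheses with equal prior weight; the observed data is three times
# as likely under the first hypothesis as under the second.
print(bayes_update([0.5, 0.5], [0.9, 0.3]))

# Number of heads in two fair coin flips, as a convolution of two PMFs.
print(convolve([0.5, 0.5], [0.5, 0.5]))
```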

In Computing and Optimization

In stochastic gradient descent (SGD), probability vectors define sampling distributions for mini-batches to improve gradient estimates and convergence rates. Importance sampling variants of SGD select data points non-uniformly, where the probability vector assigns higher probabilities to points with larger gradient magnitudes, reducing variance in the gradient estimator. For instance, prioritized experience replay in reinforcement learning samples mini-batches using a probability vector that grows with the magnitude of the temporal-difference error, enabling more efficient updates in deep Q-networks.

Optimization problems over the probability simplex often involve maximizing entropy subject to linear constraints, formulated as convex optimization tasks. The maximum entropy distribution is obtained by solving \max -\sum_i p_i \log p_i subject to \sum_i p_i = 1, p_i \geq 0, and moment constraints like \sum_i p_i \mu_i = m, where p = (p_1, \dots, p_n) is the probability vector and \mu_i are features. This approach yields the least informative distribution consistent with the observed moments and can be solved efficiently using interior-point methods; a related entropic regularization is used in optimal transport.

In machine learning, the softmax function converts raw neural network outputs into a probability vector for multi-class classification tasks. For an input vector z \in \mathbb{R}^K, the softmax produces p_k = \frac{\exp(z_k)}{\sum_{j=1}^K \exp(z_j)} for k=1,\dots,K, ensuring \sum_k p_k = 1 and p_k \geq 0, which represents predicted class probabilities. This output is trained by minimizing the cross-entropy loss, which for a one-hot true label distribution p coincides with the Kullback-Leibler (KL) divergence D(p \| q) = \sum_i p_i \log(p_i / q_i) between p and the model predictions q. The KL divergence measures distributional mismatch and encourages well-fitted probabilities in classifiers like logistic regression and deep neural networks.

Markov chain Monte Carlo (MCMC) methods rely on probability vectors as rows of the transition matrix P, where each row \mathbf{p}_i specifies the distribution over next states from state i. In Metropolis-Hastings sampling, a move from state i to a proposed state j drawn from q(j|i) is accepted with probability \min(1, \frac{\pi(j) q(i|j)}{\pi(i) q(j|i)}). The rows of the resulting P are valid probability vectors, and P satisfies detailed balance with respect to the target distribution \pi, so \pi is stationary and the chain converges to it under the usual irreducibility and aperiodicity conditions. This framework enables approximate sampling from complex posteriors in Bayesian inference; on a finite state space, the distribution of the chain after t steps is the initial distribution multiplied by P^t.
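Two of the constructions above, the softmax with cross-entropy loss and the Metropolis-Hastings transition matrix, can be sketched on small finite examples. The function names, the uniform proposal, and the target vector below are assumptions chosen for illustration; subtracting the maximum inside softmax is a common numerical safeguard rather than part of the definition.

```python
import math

def softmax(z):
    """Map real scores z to a probability vector (max subtracted for stability)."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, q):
    """Cross-entropy -sum_i p_i log q_i, skipping terms where p_i = 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def metropolis_hastings_matrix(target, proposal):
    """Transition matrix of a Metropolis-Hastings chain on a finite state space.

    target: probability vector with strictly positive entries.
    proposal: row-stochastic matrix, proposal[i][j] = q(j | i).
    """
    n = len(target)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and proposal[i][j] > 0:
                ratio = (target[j] * proposal[j][i]) / (target[i] * proposal[i][j])
                P[i][j] = proposal[i][j] * min(1.0, ratio)
        # Rejected proposals keep the chain in place; the diagonal is still 0
        # here, so sum(P[i]) is the total probability of accepted moves.
        P[i][i] = 1.0 - sum(P[i])
    return P

probs = softmax([2.0, 1.0, 0.1])
one_hot = [1.0, 0.0, 0.0]
print(probs, cross_entropy(one_hot, probs))

target = [0.2, 0.3, 0.5]
proposal = [[1.0 / 3] * 3 for _ in range(3)]
P = metropolis_hastings_matrix(target, proposal)
for row in P:
    print(row, sum(row))   # each row is a probability vector
```

For a one-hot label the cross-entropy reduces to the negative log of the predicted probability of the true class, which matches the KL divergence in that case.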