
Probability theory

Probability theory is a branch of mathematics concerned with the analysis of random phenomena and the quantification of uncertainty through the assignment of probabilities to possible outcomes of events. It provides a rigorous framework for modeling randomness, enabling the prediction and understanding of behaviors in systems where complete information is absent. The origins of probability theory trace back to the 17th century in Europe, spurred by problems in gambling and games of chance, with foundational contributions from mathematicians such as Blaise Pascal and Pierre de Fermat, who developed early concepts like expected value in correspondence over the "problem of points." Subsequent advancements included Abraham de Moivre's normal approximation to the binomial distribution in the early 18th century and Jacob Bernoulli's formulation of the law of large numbers in 1713, which established that empirical frequencies converge to theoretical probabilities as the number of trials increases. The field was placed on a modern axiomatic foundation by Andrey Kolmogorov in 1933, who defined probability as a measure on a sigma-algebra of events satisfying three axioms: non-negativity, normalization to 1 for the entire sample space, and countable additivity for disjoint events. Central concepts in probability theory include random variables, which map outcomes of a random experiment to numerical values; probability distributions, describing the likelihood of these values; and key theorems such as the central limit theorem, which asserts that the sum of many independent random variables approximates a normal distribution under mild conditions. Conditional probability and independence further allow the decomposition of joint events, with Bayes' theorem providing a method to update probabilities based on new evidence. These elements underpin stochastic processes, such as Markov chains, which model systems evolving over time with probabilistic transitions. Probability theory forms the mathematical backbone of statistics, enabling inference from data, and extends to diverse applications: in physics for statistical mechanics and quantum theory, in finance for risk assessment and option pricing, in computer science for randomized algorithms and machine learning, and in engineering for reliability analysis. Its development continues through measure-theoretic approaches and computational methods, addressing modern challenges in high-dimensional inference and simulation.

Historical Development

Origins in Games of Chance

The empirical roots of probability theory emerged in the 16th century amid the popular culture of gambling in Renaissance Italy, where mathematicians began systematically analyzing games of chance to gain an edge. Gerolamo Cardano, an Italian polymath known for his work in algebra and medicine, authored Liber de Ludo Aleae (Book on Games of Chance) around 1564, providing the earliest known systematic treatment of odds in games based on his personal experiences as a gambler. In this manuscript, published posthumously in 1663, Cardano enumerated the total possible outcomes for throws of two and three dice as 36 and 216, respectively, and calculated the ratio of favorable to unfavorable outcomes to determine betting odds. For example, he recognized that the probability of a specific face appearing on a single fair die is 1/6, and he extended such calculations to scenarios like the odds of throwing two even numbers with two dice. Cardano's approach was pragmatic and rooted in observation rather than abstract theory; he advised gamblers on strategies for dice and card games, such as evaluating the winning chances in a card game by considering the distribution of suits and ranks in a deck. Although his work incorporated some erroneous assumptions about luck influencing outcomes and lacked algebraic formalism, it marked a shift from superstition to quantitative reasoning in assessing chance events. In the early 17th century, Galileo Galilei advanced these ideas through his analysis of dice outcomes, commissioned around 1620 by Grand Duke Ferdinando II de' Medici of the powerful Medici family that employed him as court mathematician. In the unpublished treatise Sopra le Scoperte dei Dadi (On a Discovery Concerning Dice), Galileo investigated why certain sums occur more frequently than expected when rolling three dice, attributing the phenomenon to the varying number of combinations rather than divine intervention or bias.
He enumerated all 216 possible outcomes, assuming each was equally likely, and showed that a sum of 10 can be achieved in 27 ways, compared to 25 ways for a sum of 9, thus introducing the foundational concept of equally probable elementary events in fair games. Galileo's work also briefly addressed card games, calculating odds based on combinatorial counts, such as the probabilities in dealing hands from a standard deck without replacement. His empirical enumeration of outcomes provided a clearer method for predicting frequencies in repeated trials, influencing subsequent gamblers and scholars despite remaining unpublished during his lifetime. A landmark development occurred in 1654 through the correspondence between French mathematicians Blaise Pascal and Pierre de Fermat, who tackled the "problem of points" posed by the gambler Chevalier de Méré. This classic dilemma involved dividing stakes fairly in an interrupted game where players compete to reach a certain number of points first, such as the first to win three rounds in a dice or card game. In their exchange of letters, Pascal proposed using expected values, weighting the remaining possibilities according to their probabilities, while Fermat suggested enumerating all potential completions of the game through its future rounds to apportion the pot proportionally. For instance, if one player needs one more point and the other two in a best-of-five game, their method would allocate a share of 3/4 to the first player and 1/4 to the second, corresponding to their chances of winning the remaining rounds assuming equal skill. This collaboration resolved longstanding gambling disputes by formalizing how to handle incomplete games without resuming play, emphasizing combinatorial enumeration of paths to victory. Their solutions, applied to simple dice throws where each face has a 1/6 probability, demonstrated practical utility in card and dice contexts and catalyzed broader mathematical interest in chance. Building on this correspondence, Christiaan Huygens published De Ratiociniis in Ludo Aleae in 1657, the first book-length treatment of probability.
Huygens introduced the concept of expected value and solved various problems related to dividing stakes and fair bets in games of chance.
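Pascal's recursive reasoning for dividing the stakes can be sketched in a few lines of Python. This is a minimal illustration assuming fair 50/50 rounds; the function name is an arbitrary choice for this sketch.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def win_prob(a: int, b: int) -> float:
    """Probability that player A eventually wins, when A still needs `a`
    points and B still needs `b` points, each round being a fair 50/50."""
    if a == 0:
        return 1.0  # A has already won
    if b == 0:
        return 0.0  # B has already won
    # Condition on the next round: A wins it or loses it, each with prob 1/2.
    return 0.5 * win_prob(a - 1, b) + 0.5 * win_prob(a, b - 1)

# Pascal and Fermat's example: A needs 1 more point, B needs 2.
share_a = win_prob(1, 2)
print(share_a)  # 0.75 -> A receives 3/4 of the pot, B receives 1/4
```

The recursion makes the "enumerate all paths to victory" idea explicit: each call splits on the outcome of one more hypothetical round.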

Key Mathematical Contributions

Jacob Bernoulli's seminal work Ars Conjectandi, published posthumously in 1713, laid foundational groundwork for probability theory by introducing a precursor to the law of large numbers and exploring applications of the theory to practical problems such as mortality tables. In this text, Bernoulli demonstrated that, for repeated trials with fixed probability p, the proportion of successes converges to p as the number of trials increases, providing early justification for using empirical frequencies to estimate probabilities in areas like annuities and insurance. Building on such ideas, Abraham de Moivre advanced the approximation of probabilities using the normal curve in the second edition of his Doctrine of Chances in 1738. De Moivre derived that for a binomial random variable S_n with parameters n and p, the probability P(|S_n - np| < k \sqrt{npq}) \approx \erf(k / \sqrt{2}), where q = 1 - p and \erf denotes the error function, enabling efficient computation of probabilities for large n in gambling and combinatorial problems. This approximation marked a significant step toward understanding the ubiquity of the normal distribution in probabilistic limits. A notable contribution to inverse probability came posthumously in 1763 through Thomas Bayes's essay "An Essay towards solving a Problem in the Doctrine of Chances," which introduced what is now known as Bayes' theorem for updating probabilities based on new evidence. Pierre-Simon Laplace further synthesized and expanded these developments in his 1812 Théorie Analytique des Probabilités, where he refined generating functions for probability distributions and articulated early forms of the central limit theorem, showing that the sum of independent random variables tends toward a normal distribution under mild conditions. Laplace's work integrated combinatorial methods with analytic techniques, applying them to astronomy, physics, and error theory, thus broadening probability's scope beyond games of chance.
In 1867, Pafnuty Chebyshev provided a general inequality bounding the probability of large deviations for any distribution with finite variance, stating that for a random variable X with mean \mu and variance \sigma^2, P(|X - \mu| \geq k \sigma) \leq 1/k^2 for k > 0. This result, published in his paper "Démonstration d'une proposition générale ayant rapport à la probabilité des événements," offered a distribution-free tool for assessing tail probabilities, influencing later convergence theorems.
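Chebyshev's bound can be checked empirically by random sampling. The sketch below uses an exponential distribution as an arbitrary example of a distribution with finite variance; the sample size and seed are assumptions of the illustration.

```python
import random
import statistics

random.seed(0)
# Draw from an arbitrary distribution with finite variance (exponential, rate 1).
samples = [random.expovariate(1.0) for _ in range(100_000)]
mu = statistics.fmean(samples)       # sample mean
sigma = statistics.pstdev(samples)   # sample standard deviation

# Chebyshev: P(|X - mu| >= k*sigma) <= 1/k^2 for any k > 0.
for k in (2.0, 3.0):
    empirical = sum(abs(x - mu) >= k * sigma for x in samples) / len(samples)
    print(f"k={k}: empirical tail {empirical:.4f} vs Chebyshev bound {1 / k**2:.4f}")
```

For this distribution the true tail probabilities are far below the distribution-free bound, illustrating that Chebyshev's inequality trades tightness for complete generality.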

Formalization in the 20th Century

In the early 20th century, efforts to formalize probability theory shifted toward a rigorous mathematical framework, addressing the limitations of earlier approaches that struggled with infinite sample spaces and continuous distributions. Émile Borel played a pivotal role with his 1909 publication Éléments de la théorie des probabilités, where he applied emerging concepts from measure theory to probability, introducing ideas that prefigured the use of sigma-algebras for handling denumerable and uncountable event collections. This work laid essential groundwork for treating probabilities over infinite sets, enabling a more systematic analysis of continuous phenomena that previous treatments, reliant on finite approximations, could not adequately address. The culmination of this formalization came in 1933 with Andrey Kolmogorov's Foundations of the Theory of Probability, which established probability as a branch of measure theory. By defining probability measures on sigma-algebras over sample spaces, Kolmogorov provided a unified axiomatic basis that resolved longstanding paradoxes, such as Bertrand's paradox on random chords in a circle, by specifying uniform distributions via explicit measures rather than ambiguous geometric intuitions. This measure-theoretic approach effectively handled infinite sample spaces and continuous distributions, eliminating ambiguities in earlier ad-hoc methods and enabling precise definitions for limits and expectations. John von Neumann contributed to this rigor in the 1930s through his work on operator algebras, particularly in reformulating quantum mechanics in a framework analogous to measure-theoretic probability, as detailed in his 1932 Mathematical Foundations of Quantum Mechanics. While primarily motivated by quantum applications, this approach reinforced classical probability's foundations by interpreting probabilities as expectation values in commutative algebras, bridging deterministic dynamics with statistical interpretations and emphasizing measurable structures for rigorous computation.
Post-World War II developments further extended this formalization computationally, notably through the Monte Carlo methods introduced by Stanislaw Ulam and John von Neumann in the late 1940s. Originating from simulations of neutron diffusion at Los Alamos in 1946–1947, these methods used random sampling to approximate solutions to complex probabilistic integrals and expectations, leveraging electronic computers to apply Kolmogorov's measure-theoretic probabilities to practical, high-dimensional problems intractable analytically. This innovation marked a shift toward computational verification of theoretical predictions, influencing fields like physics and statistics by demonstrating the power of axiomatic probability in simulation-based inference.

Interpretations of Probability

Classical Interpretation

The classical interpretation of probability posits that the probability of an event is the ratio of the number of favorable outcomes to the total number of possible outcomes in a finite sample space where all outcomes are equally likely. This approach treats probability as an a priori measure derived from symmetry and combinatorial reasoning, without reliance on empirical data or subjective judgment. Formally, for a finite sample space \Omega and an event A \subseteq \Omega, the probability is given by P(A) = \frac{|A|}{|\Omega|}, where |\cdot| denotes the number of elements. This interpretation originated in the work of early probabilists but was rigorously defined by Pierre-Simon Laplace in his 1814 treatise A Philosophical Essay on Probabilities. Laplace described the theory of chance as "reducing all the events of the same kind to a certain number of cases equally possible," emphasizing the assumption of uniformity across outcomes to compute probabilities analytically. His formulation built on earlier ideas from games of chance, providing a philosophical foundation that prioritized logical equipossibility over experimental data. Illustrative examples highlight the intuitive appeal of this view. For a fair coin flip, the sample space \Omega = \{\text{heads}, \text{tails}\} yields P(\text{heads}) = 1/2, as one outcome is favorable out of two equally likely possibilities. Similarly, in rolling two fair six-sided dice, the probability that the sum is 7 is 6/36 = 1/6, corresponding to the six favorable pairs (1,6), (2,5), (3,4), (4,3), (5,2), (6,1) out of 36 total outcomes. These cases demonstrate how the classical method excels in discrete, symmetric scenarios like dice and card problems. Despite its elegance, the classical interpretation is limited to situations with a finite, enumerable set of equally likely outcomes; it fails in infinite or asymmetric cases where equipossibility cannot be assumed without additional justification.
For example, Buffon's needle problem—dropping a needle onto a floor ruled with parallel lines spaced a distance d apart, where the needle length l \leq d—involves a continuous space of positions and orientations, necessitating integration over areas rather than simple counting to find the crossing probability 2l/(\pi d). This highlights the need for extensions beyond the classical framework for geometric or continuous probabilities.
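Both cases can be illustrated numerically: classical counting reproduces the dice result exactly, while the continuous Buffon case can be approximated by simulation. This is a Python sketch; the Monte Carlo step is a modern illustration, not part of the historical treatment, and the needle parameters are arbitrary choices.

```python
import math
import random
from itertools import product

# Classical counting: probability that two fair dice sum to 7.
outcomes = list(product(range(1, 7), repeat=2))
p_seven = sum(a + b == 7 for a, b in outcomes) / len(outcomes)
print(p_seven)  # 6/36 ≈ 0.1667

# Monte Carlo estimate of Buffon's needle crossing probability 2l/(pi*d).
random.seed(1)
l, d, n = 1.0, 2.0, 200_000
hits = 0
for _ in range(n):
    y = random.uniform(0.0, d / 2)            # centre-to-nearest-line distance
    theta = random.uniform(0.0, math.pi / 2)  # acute angle with the lines
    hits += y <= (l / 2) * math.sin(theta)    # needle crosses a line
estimate = hits / n
print(estimate, 2 * l / (math.pi * d))  # estimate close to 1/pi ≈ 0.3183
```

The contrast is the point of the section: the first computation is a finite ratio of counts, while the second requires a continuous model of positions and angles.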

Frequentist Interpretation

The frequentist interpretation of probability views it as an objective property of a repeatable experiment, defined as the long-run relative frequency with which an event occurs in an unbounded sequence of identical trials. Formally, the probability P(A) of an event A is given by P(A) = \lim_{n \to \infty} \frac{n_A}{n}, where n is the number of trials and n_A is the number of trials in which A occurs. This approach was notably articulated by John Venn in his 1866 work The Logic of Chance, where he emphasized probability as the ratio of favorable outcomes in a long series of trials, rejecting subjective elements and grounding it in empirical observation. Later, Richard von Mises advanced the framework in 1919 by introducing axioms for "random sequences" or Kollektivs, which ensure the existence and stability of limiting frequencies while incorporating the principle of randomness to exclude predictable patterns. These axioms formalized the frequentist perspective, making it a rigorous basis for mathematical probability applicable to empirical sciences. In practice, the frequentist interpretation underpins key tools in classical statistics, such as confidence intervals, which provide a range of plausible values for an unknown parameter based on the proportion of intervals that would contain the true value over repeated sampling, and hypothesis testing, which evaluates claims by assessing how extreme observed data are under a null model using long-run error rates like the significance level. For instance, confidence intervals rely on the idea that the procedure yields correct coverage in the limit of many repetitions, aligning directly with the frequency definition. Unlike the classical interpretation, which treats probability as a ratio of favorable to total equally likely outcomes in finite equiprobable cases, the frequentist approach extends to non-uniform scenarios by relying on observed frequencies from actual or hypothetical repeated experiments; the classical view can be seen as a special case when empirical frequencies match assumed uniformity.
An example is estimating the probability of heads for a potentially biased coin: after 1000 flips yielding 550 heads, the frequentist estimate is 0.55, with further flips refining the approximation toward the true limiting frequency.
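This convergence of relative frequencies can be simulated directly. In the sketch below the bias `true_p = 0.55` is an assumed value known only to the simulation, which the frequentist procedure then recovers from observed flips.

```python
import random

random.seed(42)
true_p = 0.55  # hypothetical bias of the coin (unknown to the frequentist)
heads = 0
estimates = {}
for flip in range(1, 100_001):
    heads += random.random() < true_p       # simulate one flip
    if flip in (100, 1_000, 100_000):
        estimates[flip] = heads / flip       # relative frequency so far
print(estimates)  # frequencies drifting toward the limiting value 0.55
```

The law of large numbers guarantees that the recorded ratios stabilize near true_p as the number of flips grows.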

Bayesian Interpretation

The Bayesian interpretation treats probability as a measure of the strength of an individual's belief in a proposition, representing subjective degrees of partial belief rather than long-run frequencies. This view posits that rational agents assign probabilities based on their personal information and update them coherently upon receiving new evidence. Coherence is enforced through the Dutch book argument, which demonstrates that incoherent beliefs—those violating the probability axioms—allow an adversary to construct a set of bets guaranteeing a net loss for the agent regardless of outcomes. Frank Ramsey formalized this subjective approach in his 1926 essay "Truth and Probability," arguing that degrees of belief function like probabilities in betting scenarios and must obey the standard axioms to ensure rational consistency. Building on Ramsey's ideas, Bruno de Finetti advanced the theory in his 1937 paper "Foresight: Its Logical Laws, Its Subjective Sources," where he contended that all probabilities are inherently subjective and that apparent objectivity emerges only as agreement among subjective views under shared information. De Finetti's work emphasized that subjective probabilities remain valid as long as they satisfy coherence conditions, such as avoiding Dutch books. At the core of Bayesian updating is Bayes' theorem, which prescribes how to revise prior beliefs P(H) in light of evidence E to obtain posterior beliefs P(H|E): P(H|E) = \frac{P(E|H) \, P(H)}{P(E)} Here, P(H) denotes the prior probability of hypothesis H, P(E|H) is the likelihood of observing evidence E given H, and P(E) is the marginal probability of E, often computed as \sum_H P(E|H) P(H) for discrete cases. This theorem ensures that updates preserve coherence and rationality. In decision theory, Bayesian probabilities underpin expected utility maximization, where agents select actions that optimize outcomes weighted by their belief strengths, as axiomatized by Leonard Savage in his foundational framework. Frequentist frequency estimates can inform these subjective priors when prior knowledge is limited.
In machine learning, Bayesian priors regularize models and quantify uncertainty, such as in Gaussian processes for regression or Bayesian neural networks for classification, preventing overfitting by incorporating belief distributions over parameters. Since the 1990s, computational advances like Markov chain Monte Carlo (MCMC) methods have revolutionized Bayesian inference by enabling sampling from intractable posterior distributions, facilitating applications in complex hierarchical models across many scientific fields. Key developments include the Gibbs sampler and Metropolis-Hastings algorithm, which gained prominence through their integration into statistical software, allowing scalable posterior estimation where analytical solutions are infeasible.
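Bayes' theorem for a discrete hypothesis space can be illustrated directly. In this sketch the two candidate coins and the observed flip sequence are assumptions of the example; the posterior after each flip is obtained exactly as in the formula above, with P(E) computed as the sum over hypotheses.

```python
# Discrete Bayes update: is a coin fair (p=0.5) or biased toward heads (p=0.7)?
priors = {"fair": 0.5, "biased": 0.5}        # prior beliefs P(H)
likelihood = {"fair": 0.5, "biased": 0.7}    # P(heads | H)

def update(beliefs: dict, heads: bool) -> dict:
    """One application of Bayes' theorem: posterior ∝ likelihood × prior."""
    post = {h: beliefs[h] * (likelihood[h] if heads else 1 - likelihood[h])
            for h in beliefs}
    evidence = sum(post.values())            # P(E) = sum_H P(E|H) P(H)
    return {h: v / evidence for h, v in post.items()}

beliefs = priors
for outcome in [True, True, False, True, True]:  # observed flips (heads=True)
    beliefs = update(beliefs, outcome)
print(beliefs)  # posterior favors "biased" after four heads in five flips
```

Because each update renormalizes by the evidence term, the posterior remains a coherent probability distribution at every step.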

Axiomatic Foundations

Kolmogorov's Axioms

Kolmogorov's axiomatic approach, introduced in his 1933 Grundbegriffe der Wahrscheinlichkeitsrechnung, provided a rigorous mathematical foundation for probability theory by embedding it within the framework of measure theory. This formalization resolved longstanding issues in handling continuous probability distributions and paradoxes arising from earlier intuitive definitions, such as those in games of chance or geometric probabilities, by defining probability as a countably additive measure on an abstract space. The axioms unify diverse interpretations of probability—classical, frequentist, or subjective—by specifying abstract rules that any valid probability assignment must satisfy, without presupposing a particular philosophical stance. The three fundamental axioms are as follows:
  1. Non-negativity: For any event E, the probability P(E) \geq 0. This ensures that probabilities represent non-negative measures, ruling out meaningless negative likelihoods.
  2. Normalization: The probability of the entire sample space \Omega is P(\Omega) = 1. This axiom fixes the total measure at unity, reflecting the certainty that some outcome in \Omega must occur.
  3. Countable additivity: For a countable collection of pairwise disjoint events E_1, E_2, \dots, the probability of their union is the sum of their individual probabilities: P\left( \bigcup_{i=1}^\infty E_i \right) = \sum_{i=1}^\infty P(E_i). This extends finite additivity to infinite collections, enabling the theory to handle limits and continuous distributions rigorously.
From countable additivity, finite additivity follows immediately: for a finite number of disjoint events E_1, \dots, E_n, one can set E_{n+1} = E_{n+2} = \dots = \emptyset to apply the countable case, yielding P\left( \bigcup_{i=1}^n E_i \right) = \sum_{i=1}^n P(E_i). Moreover, countable additivity implies continuity of the probability measure: if a non-decreasing sequence of events E_1 \subseteq E_2 \subseteq \dots has union E, then P(E) = \lim_{n \to \infty} P(E_n); an analogous statement holds for non-increasing sequences with intersection E. These properties ensure that probabilities behave consistently under limits, crucial for deriving results in continuous settings. Key implications derive directly from the axioms. The probability of the empty set is zero: P(\emptyset) = 0, obtained by considering \Omega = \Omega \cup \emptyset with disjointness, so 1 = P(\Omega) = P(\Omega) + P(\emptyset), implying P(\emptyset) = 0. For the complement of an event E, P(E^c) = 1 - P(E), since E and E^c are disjoint and their union is \Omega, yielding P(E) + P(E^c) = P(\Omega) = 1. These derivations establish basic inclusion-exclusion principles and underpin further developments in the theory.

Sample Spaces and Events

In probability theory, the sample space, denoted by \Omega, is the set of all possible outcomes of a random experiment or process. This foundational concept captures the universe of conceivable results, which may be finite, countably infinite, or uncountably infinite, depending on the nature of the experiment. For instance, in the simple case of a single coin toss, \Omega = \{H, T\}, where H represents heads and T represents tails. Events are subsets of the sample space \Omega, representing collections of outcomes that share a particular property of interest. The collection of events must form a sigma-algebra, meaning it contains \Omega and the empty set \emptyset, and is closed under complements and countable unions (equivalently, countable intersections) relative to \Omega. This structure ensures that logical combinations of events—such as "the union of countably many events" or "the complement of an event"—remain valid events within the collection. For the coin toss example, the event "heads" is the singleton \{H\}, while the event "not tails" is also \{H\}, illustrating how complements operate. In more complex scenarios, such as modeling a random point in a continuous interval, the sample space can be \Omega = [0,1], the set of all real numbers between 0 and 1 inclusive. Here, events might include intervals like [0, 0.5] for outcomes up to halfway. For finite or countable \Omega, the full power set (all possible subsets) can serve as the sigma-algebra of events. However, for uncountable sample spaces like [0,1], the power set is too large and unwieldy for practical modeling; instead, a suitable sigma-algebra, such as the Borel sigma-algebra generated by the open intervals, is selected, consisting of subsets closed under the required operations but not encompassing every conceivable subset of \Omega. These set-theoretic structures provide the basis for assigning probabilities to events through axiomatic definitions, ensuring consistency in handling random outcomes.

Probability Measures and Sigma-Algebras

In probability theory, to rigorously handle sample spaces that may be infinite or uncountable, such as the real numbers, the concept of a sigma-algebra is introduced to specify the collection of subsets—known as events—that are measurable with respect to a probability measure. A sigma-algebra \mathcal{F} on a sample space \Omega is a family of subsets of \Omega that contains \Omega and the empty set \emptyset, and is closed under complementation and countable unions; equivalently, it is also closed under countable intersections. This structure ensures that operations on events, such as forming unions of countably many disjoint events, remain within the collection, allowing for consistent probability assignments even in complex spaces. A probability measure P is then defined as a function from the sigma-algebra \mathcal{F} to the interval [0,1] that satisfies the Kolmogorov axioms: P(\emptyset) = 0, P(\Omega) = 1, and for any countable collection of pairwise disjoint events \{A_n\}_{n=1}^\infty \subseteq \mathcal{F}, P\left(\bigcup_{n=1}^\infty A_n\right) = \sum_{n=1}^\infty P(A_n). This countably additive property extends the finite additivity of earlier probability formulations to infinite collections, providing the foundation for limit theorems and convergence concepts in probability. The triple (\Omega, \mathcal{F}, P) forms a probability space, where \mathcal{F} delineates the observable events derivable from basic outcomes in \Omega. The necessity of sigma-algebras becomes evident when considering continuous sample spaces, where not all subsets can be assigned a probability in a way that preserves desirable properties like translation invariance.
For instance, the Vitali set, constructed using the axiom of choice as a selector from the equivalence classes of real numbers in [0,1) under translation by rationals, is non-measurable with respect to the Lebesgue measure, as any assignment of measure to it would lead to contradictions in countable additivity and invariance under rational translations. Sigma-algebras, such as the Borel sigma-algebra generated by the open sets on \mathbb{R}, restrict attention to measurable sets, ensuring that probabilities can be consistently defined. For continuous probabilities, integration with respect to the Lebesgue measure underpins the theory, where the probability of an event A \in \mathcal{F} is given by P(A) = \int_A f(x) \, d\lambda(x) for a density f with respect to the Lebesgue measure \lambda. The Lebesgue integral, more general than the Riemann integral, allows for the computation of expectations and probabilities over Borel sets, accommodating discontinuities and infinite domains while maintaining countable additivity. This framework unifies discrete and continuous cases, as discrete probabilities correspond to integration against counting measure.

Random Variables

Definitions and Properties

In probability theory, a random variable is formally defined as a measurable function X: \Omega \to \mathbb{R}, where (\Omega, \mathcal{F}, P) is a probability space, \mathcal{F} is the sigma-algebra on the sample space \Omega, and measurability ensures that for every Borel set B \subseteq \mathbb{R}, the preimage X^{-1}(B) \in \mathcal{F}. This definition, introduced in Kolmogorov's axiomatic framework, allows random variables to quantify outcomes of random experiments in a mathematically precise manner. Random variables are categorized into discrete and continuous types based on the nature of their range. A discrete random variable takes values in a countable subset of \mathbb{R}, such as the integers, while a continuous random variable assumes values in an uncountable set, typically an interval of \mathbb{R}. This distinction arises from the structure of the induced probability measure on \mathbb{R}, though all random variables share the same foundational properties regardless of type. The cumulative distribution function (CDF) provides a complete description of a random variable X and is defined as F_X(x) = P(X \leq x), \quad x \in \mathbb{R}. The CDF F_X is non-decreasing, right-continuous, and satisfies \lim_{x \to -\infty} F_X(x) = 0 and \lim_{x \to \infty} F_X(x) = 1. These properties ensure that F_X uniquely determines the probability measure induced by X on the Borel sigma-algebra of \mathbb{R}, allowing probabilities of intervals and events involving X to be computed directly from the CDF. For instance, P(a < X \leq b) = F_X(b) - F_X(a) for a < b.
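The interval formula P(a < X \leq b) = F_X(b) - F_X(a) can be illustrated with a concrete CDF. This sketch uses the exponential distribution as an assumed example; the helper name `exp_cdf` and the rate value are choices of the illustration.

```python
import math

def exp_cdf(x: float, rate: float = 1.0) -> float:
    """CDF of an Exponential(rate) random variable: F(x) = 1 - exp(-rate*x), x >= 0."""
    return 0.0 if x < 0 else 1.0 - math.exp(-rate * x)

# The CDF is 0 at -inf, approaches 1 at +inf, and gives interval probabilities
# by differencing: P(a < X <= b) = F(b) - F(a).
a, b = 1.0, 2.0
p = exp_cdf(b) - exp_cdf(a)
print(p)  # e^{-1} - e^{-2} ≈ 0.2325
```

The same differencing works for any CDF, discrete or continuous, which is why the CDF alone fully determines the distribution.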

Expectation and Moments

In probability theory, the expectation of a random variable X, denoted E[X], represents its average value weighted by the probability distribution. For a discrete random variable taking values x_i with probabilities p(x_i), the expectation is given by the sum E[X] = \sum x_i p(x_i). For a continuous random variable with probability density function f(x), it is the integral E[X] = \int_{-\infty}^{\infty} x f(x) \, dx. In the general measure-theoretic framework, the expectation is defined as the Lebesgue integral E[X] = \int x \, dF(x), where F is the cumulative distribution function of X. The expectation serves as the first raw moment of the distribution. Higher-order raw moments are defined analogously as E[X^n] for positive integers n, capturing aspects of the distribution's shape beyond the mean, such as spread and asymmetry through powers of the variable. Central moments, which measure deviations from the mean, are given by E[(X - E[X])^n]; the second central moment, for instance, relates to variability, though its detailed properties are addressed elsewhere. These moments provide a sequence of quantitative descriptors for the probability distribution, with the raw moments directly extending the expectation concept. A fundamental property of expectation is its linearity, which holds unconditionally regardless of dependence between variables: for constants a and b and random variables X and Y, E[aX + bY] = a E[X] + b E[Y]. This follows from the linearity of the integral defining expectation and enables simplification of expectations for linear combinations without computing joint distributions. Jensen's inequality provides a key bound involving expectations and convex functions. For a convex function \phi and random variable X with finite expectation, \phi(E[X]) \leq E[\phi(X)], with equality if \phi is linear or X is constant almost surely. 
This inequality, originally stated for weighted averages of convex functions, extends naturally to probabilistic expectations and underpins applications in optimization and risk analysis.
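Both linearity of expectation and Jensen's inequality can be verified numerically on simulated data. This is a sketch; the Gaussian samples, coefficients, and the convex function \phi(x) = x^2 are arbitrary choices for the illustration.

```python
import random
import statistics

random.seed(7)
xs = [random.gauss(0.0, 1.0) for _ in range(50_000)]
ys = [random.gauss(2.0, 3.0) for _ in range(50_000)]

# Linearity of expectation: E[aX + bY] = a E[X] + b E[Y], no independence needed.
a, b = 3.0, -2.0
lhs = statistics.fmean(a * x + b * y for x, y in zip(xs, ys))
rhs = a * statistics.fmean(xs) + b * statistics.fmean(ys)
print(abs(lhs - rhs))  # agrees up to floating-point rounding

# Jensen's inequality with the convex function phi(x) = x**2:
# phi(E[X]) <= E[phi(X)], i.e. (sample mean)^2 <= sample mean of squares.
print(statistics.fmean(xs) ** 2 <= statistics.fmean(x * x for x in xs))
```

Note that Jensen's inequality holds exactly for the empirical distribution of any sample, since the gap E[X^2] - (E[X])^2 is the (non-negative) variance.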

Variance, Covariance, and Dependence

Variance quantifies the dispersion of a random variable X around its mean \mu = E[X], providing a measure of variability in probability distributions. It is defined as the expected value of the squared deviation from the mean: \Var(X) = E[(X - \mu)^2] = E[X^2] - (E[X])^2. This second central moment is always non-negative, with \Var(X) = 0 if and only if X is constant almost surely, and it scales with the square of the units of X. Building on the concept of expectation as the first moment, variance extends to second-order statistics to capture spread. Covariance extends this idea to pairs of random variables, measuring their joint variability and linear relationship. For random variables X and Y with means \mu_X and \mu_Y, the covariance is \Cov(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - E[X]E[Y]. A positive covariance indicates that X and Y tend to deviate in the same direction from their means, while a negative value suggests opposite directions; zero covariance implies no linear association. Covariance is bilinear and symmetric, with \Cov(X, X) = \Var(X), but its magnitude depends on the scales of X and Y. To obtain a scale-invariant measure of linear dependence, the Pearson correlation coefficient normalizes covariance by the standard deviations \sigma_X = \sqrt{\Var(X)} and \sigma_Y = \sqrt{\Var(Y)}: \rho_{X,Y} = \frac{\Cov(X,Y)}{\sigma_X \sigma_Y}. Introduced by Karl Pearson, this coefficient ranges from -1 to 1, where |\rho| = 1 signifies perfect linear dependence, \rho = 0 indicates uncorrelatedness, and values near \pm 1 denote strong linear relationships. These measures connect to probabilistic dependence, particularly independence. If X and Y are independent, then \Cov(X, Y) = 0 and \rho_{X,Y} = 0, as the expectation factors under independence. However, the converse does not hold: zero covariance or correlation does not imply independence, as nonlinear dependencies can exist without linear association. 
A classic counterexample involves X discrete uniform on \{-1, 0, 1\}, and Y = X^2. Here, E[X] = 0, E[XY] = E[X^3] = 0, so \Cov(X, Y) = 0, but X and Y are dependent since P(Y = 0 \mid X = 0) = 1 \neq P(Y = 0) = 1/3.
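This counterexample is easily verified by direct computation over the finite support (a minimal Python sketch of the exact calculation):

```python
# X uniform on {-1, 0, 1} and Y = X**2: covariance is zero, yet Y is a
# deterministic function of X, so the two variables are dependent.
support = [-1, 0, 1]
p = 1 / 3                                     # P(X = x) for each x in the support
e_x = sum(x * p for x in support)             # E[X]  = 0
e_y = sum(x**2 * p for x in support)          # E[Y]  = 2/3
e_xy = sum(x * x**2 * p for x in support)     # E[XY] = E[X^3] = 0
cov = e_xy - e_x * e_y
print(cov)  # 0.0 — uncorrelated
# Dependence nonetheless: P(Y = 0 | X = 0) = 1, while P(Y = 0) = P(X = 0) = 1/3.
```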

Probability Distributions

Discrete Distributions

A discrete probability distribution describes the probabilities associated with a countable set of possible outcomes for a random variable X, where the support of X is a countable set, such as the non-negative integers or a finite collection of values. The distribution is fully specified by its probability mass function (PMF), denoted p(x) = P(X = x), which gives the probability that X takes the exact value x. The PMF must satisfy two fundamental properties: p(x) \geq 0 for all x in the support, ensuring non-negative probabilities, and \sum_{x} p(x) = 1, guaranteeing that the total probability over the entire support equals unity. These conditions ensure that the PMF defines a valid probability measure on the discrete sample space. A key tool for analyzing discrete distributions is the probability generating function (PGF), defined as G(s) = E[s^X] = \sum_{x} p(x) s^x, where the expectation is taken over the PMF and s is a complex variable with |s| \leq 1 for convergence. The PGF encapsulates the entire distribution and facilitates computations such as finding moments (e.g., derivatives at s=1 yield factorial moments) and deriving the distribution of sums of independent discrete random variables via convolution. For instance, if X and Y are independent discrete random variables, the PGF of X + Y is the product G_X(s) G_Y(s). As a representative example, the Bernoulli distribution with success probability p \in [0,1] has PMF p(1) = p and p(0) = 1 - p, modeling a single trial with two outcomes. Tail probabilities, which quantify the likelihood of extreme values, are computed as P(X \geq k) = \sum_{x \geq k} p(x) for integer k, providing insights into the heaviness of the distribution's right tail and applications in risk assessment.
In discrete settings, the inclusion-exclusion principle extends to compute probabilities of unions of events over countable spaces: for events A_1, \dots, A_n, P(\cup_{i=1}^n A_i) = \sum_i P(A_i) - \sum_{i < j} P(A_i \cap A_j) + \cdots + (-1)^{n+1} P(\cap_{i=1}^n A_i), enabling exact calculations for overlapping outcomes without overcounting.
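A small sanity check of inclusion-exclusion on a toy sample space (one roll of a fair die, with three events chosen purely for the example):

```python
from itertools import combinations
from fractions import Fraction

# Events over a fair die: A1 = even, A2 = at most 3, A3 = multiple of 3.
events = [{2, 4, 6}, {1, 2, 3}, {3, 6}]

def P(e):
    return Fraction(len(e), 6)

direct = P(set().union(*events))   # P(A1 ∪ A2 ∪ A3) by counting
incl_excl = sum(
    (-1) ** (r + 1)
    * sum(P(set.intersection(*c)) for c in combinations(events, r))
    for r in range(1, len(events) + 1)
)
print(direct, incl_excl)  # both 5/6
```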

Continuous Distributions

Continuous probability distributions describe the probabilities associated with random variables that take values in a continuous range, such as the real numbers, where the sample space is uncountable. Unlike discrete distributions, which assign probabilities to individual outcomes via a probability mass function, continuous distributions use a probability density function (PDF) to characterize the likelihood over intervals, as the probability of any exact value is zero. The probability density function f(x) of a continuous random variable X is a nonnegative integrable function defined over the support of X, satisfying two key properties: f(x) \geq 0 for all x in the support, and \int_{-\infty}^{\infty} f(x) \, dx = 1, ensuring the total probability is 1. The probability that X falls within an interval (a, b) is given by the integral of the density over that interval: P(a < X < b) = \int_a^b f(x) \, dx. This integral represents the area under the density curve between a and b, providing a measure of likelihood for continuous outcomes. The cumulative distribution function (CDF) F(x) = P(X \leq x) is the integral of the PDF up to x, so F(x) = \int_{-\infty}^x f(t) \, dt, and the PDF is its derivative where it exists: f(x) = F'(x). Related to the CDF is the survival function S(x) = 1 - F(x) = P(X > x), which quantifies the probability that the random variable exceeds x. This function is particularly useful in contexts like reliability analysis, where it models the probability of survival beyond a certain point. Another important representation is the quantile function Q(p), defined for p \in (0,1) as the smallest (or infimum) value x such that F(x) \geq p, or Q(p) = \inf \{ x : F(x) \geq p \}. This inverse of the CDF maps probabilities to values, enabling the computation of percentiles and quantiles; for instance, the median is Q(0.5). The CDF is nondecreasing and right-continuous, and its generalized inverse, the quantile function, provides a way to generate random variables from uniform distributions via the inverse transform sampling method.
For transformations of continuous random variables, the change-of-variables formula derives the density of a transformed variable. If Y = g(X) where g is a strictly monotonic differentiable function with inverse g^{-1}, and X has density f_X, then the density of Y is f_Y(y) = f_X(g^{-1}(y)) \left| \frac{d}{dy} g^{-1}(y) \right| for y in the range of g. This accounts for how the transformation stretches or compresses the density, preserving the total probability through the absolute value of the Jacobian determinant in the univariate case. For non-monotonic g, the formula sums over branches of the inverse.
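As an illustration of the change-of-variables formula, one can take X standard normal and Y = e^X, so g^{-1}(y) = \ln y and |d g^{-1}/dy| = 1/y, giving the standard log-normal density, and confirm numerically that the transformed density integrates to approximately 1. The integration bounds and grid size below are arbitrary choices for the sketch:

```python
import math

# f_Y(y) = f_X(ln y) / y for Y = exp(X), X ~ N(0, 1) (log-normal density).
def f_X(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def f_Y(y):
    return f_X(math.log(y)) / y   # |d/dy g^{-1}(y)| = 1/y

a, b, n = 1e-6, 60.0, 200_000     # crude trapezoidal rule over (a, b)
h = (b - a) / n
total = (f_Y(a) + f_Y(b)) / 2 + sum(f_Y(a + i * h) for i in range(1, n))
total *= h
print(total)  # close to 1: the transformed density is a valid PDF
```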

Multivariate Distributions

Multivariate distributions extend the concepts of univariate probability distributions to describe the joint behavior of two or more random variables, capturing their simultaneous outcomes and interrelationships. In the case of two random variables X and Y, the joint distribution provides the probability assignments to pairs (x, y). For discrete random variables, this is specified by the joint probability mass function (PMF), defined as p_{X,Y}(x,y) = P(X = x, Y = y), where the function is nonnegative and sums to 1 over all possible pairs: \sum_{x} \sum_{y} p_{X,Y}(x,y) = 1. For continuous random variables, the joint probability density function (PDF) f_{X,Y}(x,y) satisfies \iint f_{X,Y}(x,y) \, dx \, dy = 1, and the probability over a region A is given by the double integral \iint_A f_{X,Y}(x,y) \, dx \, dy. Marginal distributions are derived from the joint distribution by integrating or summing out the other variable, effectively reducing the multivariate case back to univariate marginals. For discrete variables, the marginal PMF of X is p_X(x) = \sum_y p_{X,Y}(x,y), and similarly for Y. In the continuous case, the marginal PDF of X is f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y) \, dy. These marginals represent the univariate distributions referenced in prior sections on discrete and continuous distributions. Conditional distributions quantify the probability of one variable given the value of another, enabling analysis of dependencies. For discrete variables, the conditional PMF is p_{X|Y}(x|y) = \frac{p_{X,Y}(x,y)}{p_Y(y)} for p_Y(y) > 0. For continuous variables, the conditional PDF is f_{X|Y}(x|y) = \frac{f_{X,Y}(x,y)}{f_Y(y)}. This definition underpins conditional probability in multivariate settings. Two random variables are independent if their joint distribution factors into the product of their marginals, meaning knowledge of one does not affect the other. 
Thus, p_{X,Y}(x,y) = p_X(x) p_Y(y) for all x, y in the discrete case, or f_{X,Y}(x,y) = f_X(x) f_Y(y) in the continuous case; this equivalence holds more generally for distribution functions as well. Copulas provide a flexible framework for constructing and analyzing multivariate distributions by separating the marginal behaviors from the dependence structure. Sklar's theorem states that for any multivariate cumulative distribution function (CDF) H(x_1, \dots, x_n) with marginal CDFs F_1, \dots, F_n, there exists a copula C: [0,1]^n \to [0,1] such that H(x_1, \dots, x_n) = C(F_1(x_1), \dots, F_n(x_n)) for all x_i, and conversely, any such copula yields a valid joint CDF when combined with continuous marginals. This decomposition allows modeling dependence independently of marginals, with the copula capturing the joint structure on the uniform [0,1] scale.
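The marginalization, conditioning, and factorization rules above can be exercised on small joint PMFs; the two joints below (independent fair bits versus a perfectly correlated pair) are constructed purely for illustration:

```python
from fractions import Fraction

# Two toy joint PMFs over {0,1} x {0,1}.
indep = {(x, y): Fraction(1, 4) for x in (0, 1) for y in (0, 1)}
dep = {(0, 0): Fraction(1, 2), (0, 1): Fraction(0),
       (1, 0): Fraction(0), (1, 1): Fraction(1, 2)}

def marginals(joint):
    """Sum out the other variable: p_X(x) = sum_y p(x, y), and vice versa."""
    px, py = {}, {}
    for (x, y), pr in joint.items():
        px[x] = px.get(x, Fraction(0)) + pr
        py[y] = py.get(y, Fraction(0)) + pr
    return px, py

def is_independent(joint):
    """Check the factorization p(x, y) = p_X(x) p_Y(y) for every pair."""
    px, py = marginals(joint)
    return all(pr == px[x] * py[y] for (x, y), pr in joint.items())

assert is_independent(indep)
assert not is_independent(dep)

# Conditional PMF p_{X|Y}(x | 0) = p(x, 0) / p_Y(0) for the dependent joint:
px, py = marginals(dep)
p_X_given_Y0 = {x: dep[(x, 0)] / py[0] for x in (0, 1)}
print(p_X_given_Y0)  # {0: Fraction(1, 1), 1: Fraction(0, 1)}
```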

Common Probability Distributions

Bernoulli, Binomial, and Poisson

The Bernoulli distribution models a single trial with two possible outcomes: success with probability p (where 0 \leq p \leq 1) or failure with probability 1-p. It is the simplest discrete distribution and serves as the foundation for more complex models involving multiple independent trials. The probability mass function (PMF) of a Bernoulli random variable X is given by P(X = x) = p^x (1-p)^{1-x}, \quad x = 0, 1. The expected value is E[X] = p, and the variance is \operatorname{Var}(X) = p(1-p). These moments highlight the distribution's concentration around p, with maximum variance at p = 0.5. The binomial distribution generalizes the Bernoulli to n independent trials, each with success probability p, where n is a positive integer. It counts the number of successes k in these trials, making it suitable for scenarios like quality control or polling. The PMF is P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, \dots, n, where \binom{n}{k} is the binomial coefficient. The expected value is E[X] = np, and the variance is \operatorname{Var}(X) = np(1-p). For large n, the binomial distribution can be approximated by a normal distribution with mean np and variance np(1-p), provided np and n(1-p) are sufficiently large (typically greater than 5 or 10). This approximation aids in computing probabilities without exact enumeration. The Poisson distribution arises in modeling the number of events occurring in a fixed interval of time or space, when events are rare and independent, such as defects in manufacturing or arrivals in a queue. It is parameterized by \lambda > 0, the average rate of occurrence. Notably, the Poisson distribution emerges as a limiting case of the binomial as n \to \infty and p \to 0 while holding np = \lambda constant, justifying its use for approximating binomials with many trials and low success probability. The PMF is P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!}, \quad k = 0, 1, 2, \dots, with expected value E[X] = \lambda and variance \operatorname{Var}(X) = \lambda. The equality of mean and variance is a distinctive property, often observed in count data from rare events. These distributions find broad applications in modeling binary outcomes and counts.
The Bernoulli and binomial are used for success counts in fixed trials, such as coin flips or clinical trial outcomes, while the Poisson excels in rare event modeling, exemplified by radioactive decay where the number of decays in a time interval follows P(k) = \frac{e^{-\lambda} \lambda^k}{k!} with \lambda as the decay rate.
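The binomial-to-Poisson limit can be observed numerically by holding np = \lambda fixed while increasing n; the parameter choices below are illustrative:

```python
from math import comb, exp, factorial

# Binomial(n, lam/n) PMF versus Poisson(lam) PMF as n grows, lam fixed.
def binom_pmf(n, p, k):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(lam, k):
    return exp(-lam) * lam**k / factorial(k)

lam = 2.0
errs = []
for n in (10, 100, 1000):
    p = lam / n
    errs.append(max(abs(binom_pmf(n, p, k) - poisson_pmf(lam, k))
                    for k in range(10)))
    print(n, errs[-1])  # the maximum PMF discrepancy shrinks as n grows
```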

Uniform, Normal, and Exponential

The continuous uniform distribution is a fundamental continuous probability distribution that assigns equal probability density to all values within a specified finite interval [a, b], where a < b. Its probability density function (PDF) is given by f(x) = \frac{1}{b - a}, \quad a \leq x \leq b, and zero otherwise. The expected value is E[X] = \frac{a + b}{2}, and the variance is \operatorname{Var}(X) = \frac{(b - a)^2}{12}. This distribution serves as a baseline model for scenarios assuming no bias toward any outcome in the interval, such as generating random numbers in simulations where each value in the range is equally likely. The normal distribution, also known as the Gaussian distribution, is a symmetric bell-shaped continuous probability distribution defined by parameters \mu (mean) and \sigma^2 (variance), with \sigma > 0. Its PDF is f(x) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right), \quad -\infty < x < \infty. The expected value is E[X] = \mu, and the variance is \operatorname{Var}(X) = \sigma^2. A key empirical property is the 68-95-99.7 rule, which states that approximately 68% of the probability mass lies within one standard deviation of the mean, 95% within two, and 99.7% within three. Originally derived by Carl Friedrich Gauss in 1809 to model measurement errors in astronomical observations, the normal distribution remains central for approximating errors in scientific measurements, where deviations are assumed symmetric and rare extremes occur. The exponential distribution models the time between events in a memoryless process, parameterized by rate \lambda > 0. Its PDF is f(x) = \lambda e^{-\lambda x}, \quad x \geq 0, and zero otherwise. The expected value is E[X] = \frac{1}{\lambda}, and the variance is \operatorname{Var}(X) = \frac{1}{\lambda^2}. A defining feature is its memoryless property: P(X > s + t \mid X > s) = P(X > t) for all s, t \geq 0, implying that the process forgets prior waiting time.
This arises naturally as the interarrival time distribution in a Poisson process, commonly applied to model waiting times between independent events, such as customer arrivals at a service point or radioactive decays.
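The memoryless property follows algebraically from the survival function S(t) = e^{-\lambda t}, since S(s+t)/S(s) = S(t); a quick numeric confirmation (the rate and test points are chosen arbitrarily):

```python
import math

lam = 0.5  # illustrative rate parameter

def S(t):
    """Survival function of Exponential(lam): P(X > t) = exp(-lam * t)."""
    return math.exp(-lam * t)

for s, t in [(1.0, 2.0), (3.0, 0.5), (10.0, 4.0)]:
    cond = S(s + t) / S(s)  # P(X > s + t | X > s)
    assert abs(cond - S(t)) < 1e-12
print("memoryless check passed")
```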

Convergence and Limit Theorems

Modes of Convergence

In probability theory, sequences of random variables can converge in various senses, each providing distinct insights into the limiting behavior of the probabilities or expectations involved. These modes of convergence form the foundation for analyzing asymptotic properties and are crucial for establishing limit theorems, though they differ in strength and implications. The primary modes include convergence in probability, almost sure convergence, convergence in distribution, and convergence in L^p spaces, with well-established relationships among them that guide their applications. Convergence in probability, also known as stochastic convergence, occurs when a sequence of random variables \{X_n\} approaches a limiting random variable X such that the probability of their difference exceeding any fixed positive threshold diminishes to zero. Formally, X_n \to X in probability if, for every \epsilon > 0, \lim_{n \to \infty} P(|X_n - X| > \epsilon) = 0. This mode captures the idea that large deviations become increasingly unlikely, making it a weaker form of convergence suitable for many statistical approximations. Almost sure convergence, or convergence with probability one, is a stronger notion that requires the sequence to converge pointwise on the sample space except possibly on a set of measure zero. Specifically, X_n \to X almost surely if P\left( \left\{ \omega : \lim_{n \to \infty} X_n(\omega) = X(\omega) \right\} \right) = 1. This implies that the random variables settle to the limit for "almost all" outcomes, providing a pathwise guarantee that is more stringent than mere probabilistic control. Convergence in distribution, sometimes called weak convergence, focuses on the limiting behavior of the cumulative distribution functions without requiring pointwise agreement of the variables themselves. A sequence \{X_n\} converges in distribution to X if the distribution function F_{X_n}(x) satisfies \lim_{n \to \infty} F_{X_n}(x) = F_X(x) at all continuity points x of F_X.
This mode is particularly useful for studying the asymptotic shapes of distributions, as it preserves properties like expectations of bounded continuous functions. Convergence in L^p, or convergence in pth mean for p \geq 1, emphasizes control over the moments of the difference between the sequence and the limit. Here, X_n \to X in L^p if \lim_{n \to \infty} E[|X_n - X|^p] = 0. This form ensures that the pth power of the deviation has vanishing expectation, linking probabilistic convergence to integrability conditions in the underlying probability space. The relationships among these modes form a hierarchy of implications: almost sure convergence implies convergence in probability, which in turn implies convergence in distribution; similarly, L^p convergence implies convergence in probability for any p \geq 1. However, the converses do not hold in general; for instance, convergence in probability does not guarantee almost sure convergence, as counterexamples exist where the sequence oscillates indefinitely on sets of positive measure, though such events occur with probability approaching zero. Likewise, convergence in distribution is the weakest, allowing limits in law without convergence of moments or paths, as seen when variables concentrate around different points but share the same limiting distribution. These distinctions ensure that stronger modes provide more robust conclusions, while weaker ones suffice for distributional asymptotics.
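A standard example separating these modes can be sketched numerically: X_n equal to n with probability 1/n and 0 otherwise converges to 0 in probability but not in L^1, since E[X_n] = 1 for every n:

```python
from fractions import Fraction

# X_n = n with probability 1/n, else 0 (an illustrative construction).
# P(|X_n - 0| > eps) = 1/n -> 0, so X_n -> 0 in probability,
# yet E[|X_n - 0|] = n * (1/n) = 1 for all n, so no L^1 convergence.
for n in (10, 100, 10_000, 1_000_000):
    p_dev = Fraction(1, n)     # deviation probability, any 0 < eps < n
    mean = n * Fraction(1, n)  # first absolute moment
    assert mean == 1
print(float(p_dev))  # last deviation probability: 1e-06
```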

Law of Large Numbers

The law of large numbers (LLN) asserts that, for a sequence of independent and identically distributed random variables X_1, X_2, \dots with finite expectation \mu = \mathbb{E}[X_i], the sample mean \bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i converges to \mu in an appropriate probabilistic sense as n \to \infty. This theorem underpins the reliability of empirical averages in approximating theoretical expectations, with two primary forms: the weak law, which establishes convergence in probability, and the strong law, which establishes almost sure convergence. The weak law of large numbers (WLLN), first formulated in a special case by Jacob Bernoulli in 1713 for the binomial distribution, states that \bar{X}_n \to \mu in probability, meaning that for any \epsilon > 0, \mathbb{P}(|\bar{X}_n - \mu| \geq \epsilon) \to 0 as n \to \infty. Bernoulli demonstrated this for repeated Bernoulli trials with success probability p, showing that the proportion of successes converges in probability to p. A general proof for i.i.d. random variables with finite mean and variance, due to Pafnuty Chebyshev in 1867, relies on his inequality: \mathbb{P}(|\bar{X}_n - \mu| \geq \epsilon) \leq \frac{\mathrm{Var}(\bar{X}_n)}{\epsilon^2} = \frac{\sigma^2}{n \epsilon^2} \to 0, where \sigma^2 = \mathrm{Var}(X_i) < \infty. This bound exploits the fact that the variance of the sample mean diminishes as 1/n, ensuring the probability of significant deviation vanishes. The strong law of large numbers (SLLN) strengthens this result, asserting that \bar{X}_n \to \mu almost surely, meaning the set of outcomes where the convergence fails has probability zero. Andrey Kolmogorov proved in 1930 that for i.i.d. random variables with \mathbb{E}[|X_i|] < \infty, the SLLN holds almost surely; this condition is necessary and sufficient.
In 1933, he provided a criterion for independent (not necessarily identically distributed) random variables with finite second moments: the SLLN holds if \sum_{i=1}^\infty \frac{\mathrm{Var}(X_i)}{i^2} < \infty. For i.i.d. cases with finite variance, this 1933 condition is automatically satisfied since the variances are identical and \sum 1/i^2 < \infty, but the SLLN holds more generally under the finite first absolute moment condition from 1930. The proof typically involves truncation arguments and the Borel–Cantelli lemma to control the probabilities of large deviations occurring infinitely often. These laws justify the frequentist interpretation of probability, where the probability of an event is defined as the limiting relative frequency of its occurrence in repeated independent trials; the LLN guarantees that observed frequencies stabilize around this limit with high probability (weak form) or certainty (strong form), providing a rigorous basis for inference from data.
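The stabilization of sample means can be seen in a small simulation (Bernoulli trials with p = 0.3 and a fixed seed for reproducibility; the parameters are illustrative):

```python
import random

# Sample means of i.i.d. Bernoulli(0.3) trials for growing n.
random.seed(42)
p = 0.3
means = []
for n in (100, 10_000, 1_000_000):
    means.append(sum(random.random() < p for _ in range(n)) / n)
    print(n, means[-1])  # the running means settle near p = 0.3
```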

Central Limit Theorem

The central limit theorem (CLT) asserts that the sum of a large number of independent and identically distributed (i.i.d.) random variables, when properly standardized, converges in distribution to a standard normal random variable, regardless of the underlying distribution of the individual variables, provided they have finite mean and positive finite variance. This result explains the ubiquity of the normal distribution in statistical applications and underpins many inferential procedures. A precise statement of the CLT for i.i.d. random variables, often referred to as the Lindeberg–Lévy CLT, is as follows: Let X_1, X_2, \dots, X_n be i.i.d. random variables with \mathbb{E}[X_i] = \mu and \operatorname{Var}(X_i) = \sigma^2 > 0. Define the sample sum S_n = \sum_{i=1}^n X_i. Then, the standardized sum Z_n = \frac{S_n - n\mu}{\sigma \sqrt{n}} converges in distribution to a standard normal random variable Z \sim \mathcal{N}(0, 1), i.e., Z_n \xrightarrow{d} Z as n \to \infty. This theorem was established in its modern form by Lyapunov in 1901 using moment conditions. A historically significant special case is the de Moivre–Laplace theorem, which applies to the binomial distribution. Consider S_n \sim \operatorname{Binomial}(n, p) with mean np and variance np(1-p). The standardized version (S_n - np)/\sqrt{np(1-p)} converges in distribution to \mathcal{N}(0, 1) as n \to \infty. This result was first derived by Abraham de Moivre in 1733 for the case p = 1/2 and later generalized by Pierre-Simon Laplace in 1812, marking an early precursor to the general CLT. One standard proof of the CLT relies on characteristic functions. Let \phi(t) be the characteristic function of the centered and scaled variable (X_1 - \mu)/\sigma, so \phi(0) = 1 and \phi'(0) = 0, \phi''(0) = -1. The characteristic function of Z_n is [\phi(t / \sqrt{n})]^n. Taking the logarithm yields n \log \phi(t / \sqrt{n}). For small u = t / \sqrt{n}, the Taylor expansion gives \log \phi(u) = iu \cdot 0 + (u^2 / 2) \phi''(0) + o(u^2) = -u^2 / 2 + o(u^2), so n \log \phi(t / \sqrt{n}) = -t^2 / 2 + o(1).
Thus, [\phi(t / \sqrt{n})]^n \to e^{-t^2 / 2}, the characteristic function of \mathcal{N}(0, 1). By the continuity theorem for characteristic functions, convergence in distribution follows. This approach, leveraging Fourier analysis, was popularized by Cramér in his 1937 work on random variables and probability distributions. The Berry–Esseen theorem quantifies the rate of convergence in the CLT, providing a uniform bound on the difference between the cumulative distribution function (CDF) of Z_n and the standard normal CDF \Phi. Specifically, for i.i.d. X_i with \mathbb{E}[|X_i - \mu|^3] = \rho < \infty, \sup_{x \in \mathbb{R}} \left| \mathbb{P}(Z_n \leq x) - \Phi(x) \right| \leq C \frac{\rho}{\sigma^3 \sqrt{n}}, where C is a universal constant (originally bounded by 7.59, later improved). This bound, of order O(1/\sqrt{n}), was independently established by Berry in 1941 and Esseen in 1942, enabling practical assessments of approximation accuracy.
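The O(1/\sqrt{n}) convergence rate can be observed for the de Moivre–Laplace case by computing the Kolmogorov distance between the standardized Binomial(n, 1/2) CDF and \Phi; a sketch (the sample sizes are illustrative):

```python
from math import comb, erf, sqrt

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def kolmogorov_distance(n, p=0.5):
    """sup_x |P(Z_n <= x) - Phi(x)| for the standardized Binomial(n, p)."""
    mu, sd = n * p, sqrt(n * p * (1 - p))
    cdf, worst = 0.0, 0.0
    for k in range(n + 1):
        x = (k - mu) / sd
        worst = max(worst, abs(cdf - Phi(x)))  # just below the jump at x
        cdf += comb(n, k) * p**k * (1 - p) ** (n - k)
        worst = max(worst, abs(cdf - Phi(x)))  # just above the jump at x
    return worst

dists = [kolmogorov_distance(n) for n in (10, 100, 1000)]
for n, d in zip((10, 100, 1000), dists):
    print(n, round(d, 4))  # the distance shrinks roughly like 1/sqrt(n)
```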

References

  1. [1]
    [PDF] Introduction to Probability Theory
    The field of “probability theory” is a branch of mathematics that is concerned with describing the likelihood of different outcomes from uncertain processes ...
  2. [2]
    [PDF] Probability Theory - Wharton Faculty Platform
    Probability theory is that part of mathematics that aims to provide insight into phe- nomena that depend on chance or on uncertainty. The most prevalent use of ...
  3. [3]
    Probability History - Utah State University
    The study of probability didn't really take off in Europe until the 1600's. Its development was motivated by an interest in gambling and games of chance. But ...Missing: key | Show results with:key
  4. [4]
    [PDF] probability - a (very) brief history.pdf
    De Moivre pioneered the modern approach to the theory of probability, when he published The Doctrine of Chance: A method of calculating the probabilities of ...Missing: key | Show results with:key
  5. [5]
    [PDF] probability and statistics throughout the centuries
    Nevertheless, the mathematical development of probability started in 1713 with the publication of Jacob Bernoulli's (1654-1705) work. Among other things, he ...
  6. [6]
    [PDF] Initial Impressions and the History of Probability Theory
    Jun 29, 2021 · It was Andrey Nikolaevich Kolmogorov in the early 1930's, who established prob- ability as a branch of mathematics by building an axiomatic ...
  7. [7]
    Basic Probability - Seeing Theory
    Probability theory is the mathematical framework that allows us to analyze chance events in a logically sound manner.Compound Probability · Probability Distributions · Regression Analysis
  8. [8]
    [PDF] Basic Probability Theory - Semantic Scholar
    Sep 9, 2019 · Two events E and F are said to be independent if. P(E ∩ F) = P(E)P(F). This definition can easily be extended to multiple random variables. • It ...
  9. [9]
    [PDF] Review of Probability Theory - CS229
    Probability theory is the study of uncertainty. Through this class, we will be relying on concepts from probability theory for deriving machine learning ...
  10. [10]
    [PDF] Probability: Theory and Examples, Rick Durrett Version 5
    Jan 11, 2019 · Probability is not a spectator sport, so the book contains almost 450 exercises to challenge the reader and to deepen their understanding.” The ...
  11. [11]
    Gerolamo Cardano
    Cardano's work formed a good starting place for probability theory, but many of his explanations and assumptions were not correct. He attempted, but was unable ...
  12. [12]
    Decoding Cardano's Liber de Ludo Aleae - ScienceDirect.com
    Written in the 16th century, Cardano's Liber de Ludo Aleae was, in its time, an advanced treatment of the probability calculus.
  13. [13]
    [PDF] the reception of gerolamo cardano's liber de ludo aleae
    Cardano kept researching probability and chance. His gambling addiction led him to discover one of the fundamental laws of the theory of probability. He was ...
  14. [14]
    [PDF] Some Laws and Problems of Classical Probability and How ...
    Cardano's works on probability were published post- humously in the famous 15–page Liber de Ludo Aleae (The. Book on Games of Chance) consisting of 32 small ...
  15. [15]
    The Three-Dice Problem - Futility Closet
    Dec 27, 2024 · In 1620, the Grand Duke of Tuscany wrote to Galileo with a puzzling problem. In rolling three fair six-sided dice, it would seem that the sums 9 and 10 should ...Missing: probability Medici family
  16. [16]
    [PDF] CONCERNING AN INVESTIGATION ON DICE (SOPRA LE ...
    The fact that in a dice-game certain numbers are more advantageous than others has a very obvious reason, i.e. that some are more easily and more frequently ...
  17. [17]
    [PDF] Probability to 1750 - hom-sigmaa
    Galileo's analysis of the problem was published in Sopra le Scopetre dei Dadi (On a discovery concerning dice, 1898). He pointed out that “The fact that in a ...
  18. [18]
    Early problems in games of chance - History Of Mathematics - Fiveable
    Liber de Ludo Aleae. Cardano wrote "Liber de Ludo Aleae" (Book on Games of Chance) around 1564; First known systematic treatment of probability in gambling ...Missing: Gerolamo | Show results with:Gerolamo
  19. [19]
    [PDF] FERMAT AND PASCAL ON PROBABILITY - University of York
    The problem was proposed to Pascal and Fermat, probably in 1654, by the Chevalier de Méré, a gambler who is said to have had unusual ability “even for the ...
  20. [20]
    July 1654: Pascal's Letters to Fermat on the "Problem of Points"
    Jul 1, 2009 · In 1654, a French essayist and amateur mathematician named Antoine Gombaud, who was fond of gambling, found himself pondering what is known as “the problem of ...
  21. [21]
    The Problem of Points
    Then, in 1654, the Chevalier de Méré (1607-1684) posed the problem to Blaise Pascal who consulted with Pierre de Fermat, and both found different ways to solve ...
  22. [22]
    [PDF] The Pascal-Fermat Correspondence
    Pascal begins his letter hesitantly: I was not able to tell you my entire thoughts regarding the problem of the points by the last post, and at the same ...
  23. [23]
    [PDF] Jakob Bernoulli On the Law of Large Numbers Translated into ...
    His Ars Conjectandi (1713) (AC) was published posthumously with a Foreword by his nephew, Niklaus Bernoulli (English translation: David (1962, pp. 133 – 135); ...
  24. [24]
    [PDF] De Moivre on the Law of Normal Probability - University of York
    His own translation, with some additions, was included in the second edition (1738) of The Doctrine of Chances, pages 235–243. This paper gave the first ...
  25. [25]
    De Moivre's Normal Approximation to the Binomial, 1733, and Its ...
    Abraham de Moivre (1667–1754) was of a French Protestant family; from 1684 he studied mathematics in Paris. The persecution of the French Protestants caused ...
  26. [26]
    LII. An essay towards solving a problem in the doctrine of chances ...
    An essay towards solving a problem in the doctrine of chances. By the late Rev. Mr. Bayes, FRS communicated by Mr. Price, in a letter to John Canton, AMFR S.
  27. [27]
    [PDF] THE ANALYTIC THEORY OF PROBABILITIES Third Edition Book II ...
    Book II is essentially a consolidation of Laplace's previous work on probability. He takes up many problems already considered by earlier researchers or himself ...
  28. [28]
    Pafnuty Chebyshev (1821 - 1894) - Biography - MacTutor
    In 1867 he published a paper On mean values which used Bienaymé's ... Twenty years later Chebyshev published On two theorems concerning probability ...
  29. [29]
    The Life, Work, and Legacy of P. L. Chebyshev
    We survey briefly the life and work of PL Chebyshev and his ongoing influence. We discuss his contributions to probability, number theory, and mechanics.
  30. [30]
    Borel and the Emergence of Probability on the Mathematical Scene ...
    Dec 19, 2022 · This paper deals with the way in which, in the 1920s, probability was imposed on the French scientific scene under the decisive impulse of the mathematician ...
  31. [31]
  32. [32]
    [PDF] Defusing Bertrand's paradox - LSE Research Online
    Hilbert's call was answered only in 1933, when Kolmogorov firmly anchored probability theory within measure theory [13]. (See [6] for the history of some of ...
  33. [33]
    [PDF] Von Neumann's work on Hilbert space quantum mechanics
    Interpretation of classical probability theory + analogy between classical and quantum probability theory suggests the following physical interpretation of ...
  34. [34]
    [PDF] The Monte Carlo Method - Nicholas Metropolis; S. Ulam
    Jan 25, 2006 · The essential feature of the process is that we avoid dealing with multiple integrations or multiplications of the probability matrices, but ...Missing: 1940s | Show results with:1940s
  35. [35]
    Interpretations of Probability - Stanford Encyclopedia of Philosophy
    Oct 21, 2002 · Classical probabilities are ascertainable, assuming that the space of possibilities can be determined in principle. They bear a relationship to ...
  36. [36]
    A Philosophical Essay on Probabilities/Chapter 2 - Wikisource
    Dec 3, 2018 · The theory of chance consists in reducing all the events of the same kind to a certain number of cases equally possible, that is to say, to ...Missing: 1814 | Show results with:1814
  37. [37]
    Buffon's Needle Problem -- from Wolfram MathWorld
    Buffon's needle problem asks to find the probability that a needle of length l will land on a line, given a floor with equally spaced parallel lines a distance ...Missing: classical source
  38. [38]
    John Venn's opposition to probability as degree of belief
    John Venn is known as one of the clearest expounders of the interpretation of probability as the frequency of a particular outcome in a potentially ...
  39. [39]
    VON MISES' AXIOMATISATION OF RANDOM SEQUENCES
    The fundamental primitive in von Mises' axiomatisation is the Kollektiv, a mathematical abstraction representing an infinite series of independent trials.
  40. [40]
    (PDF) Von Mises' Frequentist Approach to Probability - ResearchGate
    Mar 31, 2016 · A strict Frequentist approach to probability. This approach is limited to observations where there are sufficient reasons to project future stability.
  41. [41]
    Statistical tests, P values, confidence intervals, and power: a guide ...
    These methods are thus called frequentist methods, and the hypothetical frequencies they predict are called “frequency probabilities.” Despite considerable ...
  42. [42]
    [PDF] Chapter 5 Confidence Intervals and Hypothesis Testing - MIT
    This chapter covers the fundamentals of Bayesian and frequentist approaches to these problems. 5.1 Bayesian confidence intervals. Recall from Section 4.4 that ...
  43. [43]
    [PDF] Von Mises' Frequentist Approach to Probability - StatLit.org
    One might summarize von Mises as objecting to any arbitrarily-postulated theory of probability whether a purely axiomatic development that gives no guidance ...
  44. [44]
    Dutch Book Arguments - Stanford Encyclopedia of Philosophy
    Jun 15, 2011 · For Ramsey and de Finetti, the point of appealing to a Dutch Book argument was to establish probabilistic constraints using precise terms ...The Basic Dutch Book... · The Dutch Book Argument and...
  45. [45]
    [PDF] "Truth and Probability" (1926)
    Note on this Electronic Edition: the following electronic edition of Frank Ramsey's famous essay. "Truth and Probability" (1926) is adapted from Chapter VII of ...
  46. [46]
    [PDF] BRUNO DE FINETTI - Foresight: Its Logical Laws, Its Subjective ...
    * Page 12 Foresight: Its Logical Laws, Its Subjective Sources 105 obliged to attribute to the event E if one wants to remain in the domain of coherence, after ...
  47. [47]
    Priors in Bayesian Deep Learning: A Review - Wiley Online Library
    May 11, 2022 · In this review, we highlight the importance of prior choices for Bayesian deep learning and present an overview of different priors that have been proposed.Priors in (Deep) Gaussian... · Priors in Bayesian Neural... · (Meta-)Learning Priors
  48. [48]
    [PDF] A Short History of Markov Chain Monte Carlo - arXiv
    Jan 9, 2012 · Abstract. We attempt to trace the history and development of Markov chain Monte Carlo (MCMC) from its early inception in the late 1940s.
  49. [49]
    [PDF] Kolmogorov and Probability Theory - CORE
    The monograph by Kolmogorov published in 1933 transformed the calculus of probability into a mathematical discipline. Some authors compare this role of ...Missing: original | Show results with:original
  50. [50]
    [PDF] FOUNDATIONS THEORY OF PROBABILITY - University of York
    THEORY OF PROBABILITY. BY. A.N. KOLMOGOROV. Second English Edition. TRANSLATION EDITED BY. NATHAN MORRISON. WITH AN ADDED BIBLIOGRPAHY BY. A.T. BHARUCHA-REID.Missing: formulation | Show results with:formulation
  51. [51]
    [PDF] FOUNDATIONS THEORY OF PROBABILITY - University of York
    FOUNDATIONS. OF THE. THEORY OF PROBABILITY. BY. A.N. KOLMOGOROV. Second English Edition. TRANSLATION EDITED BY. NATHAN MORRISON. WITH AN ADDED BIBLIOGRPAHY BY.Missing: paradoxes | Show results with:paradoxes
  52. [52]
    [PDF] Probability Theory 1 Sample spaces and events - MIT Mathematics
    Feb 10, 2015 · To treat probability rigorously, we define a sample space S whose elements are the possible outcomes of some process or experiment.
  53. [53]
    On the problem of measuring sets of points by Giuseppe Vitali - Logic
    On the problem of measuring sets of points on a straight line, by Giuseppe Vitali. This is an English translation of Giuseppe Vitali's paper of 1905 on the ...
  54. [54]
    [PDF] An Introduction to Measure Theory - Terry Tao
    there is no probability measure P on the reals R with the Lebesgue σ-algebra L[R] with the translation-invariance property P(E + x) = P(E) for every event E ...
  55. [55]
    [PDF] Introduction to Probability Theory and Its Applications
    FELLER · An Introduction to Probability Theory and Its Applications, Volume I, Third Edition. FELLER · An Introduction to Probability Theory and Its Applications.
  56. [56]
    [PDF] Foundations of the theory of probability - Internet Archive
    The theory of probability, as a mathematical discipline, can and should be developed from axioms in exactly the same way as Geometry and Algebra.
  57. [57]
    VII. Note on regression and inheritance in the case of two parents
    Note on regression and inheritance in the case of two parents. Karl Pearson. Published: 1 January 1895. https://doi.org/10.1098/rspl.1895.0041
  58. [58]
    7.2 - Probability Mass Functions | STAT 414 - STAT ONLINE
    The probability mass function, P(X = x) = f(x), of a discrete random variable X is a function that satisfies the following properties:
  59. [59]
    2.4 probability mass function - MIT
    The probability mass function is the assignment of probabilities to each possible value of the random variable. It plays an identically analogous role for ...
  60. [60]
    [PDF] Probability Generating Functions - Texas A&M University
    Definition. Let X be a discrete random variable defined on a probability space with probability measure Pr. Assume that X has non-negative integer values.
  61. [61]
    [PDF] Generating functions and transforms
    An exact probability generating function uniquely determines a distribution; an approximation to the probability generating function approximately determines ...
  62. [62]
    [PDF] Unit 4 The Bernoulli and Binomial Distributions
    The Bernoulli Distribution is an example of a discrete probability distribution. ... Following is a general formula for the number of ordered selections of size x ...
  63. [63]
    [PDF] Moments and tails
    Apr 2, 2015 · this chapter is to bound tail probabilities using moments and moment-generating functions. Tail bounds arise naturally in many contexts, as ...
  64. [64]
    [PDF] The Inclusion-Exclusion Principle
    That is, every point contained in the union of A1,A2,...An is counted exactly one time. Thus, eq. (6) is established. The corresponding result in probability ...
  65. [65]
    14.1 - Probability Density Functions | STAT 414 - STAT ONLINE
    The probability density function (pdf) of a continuous random variable with support is an integrable function satisfying the following:
  66. [66]
    [PDF] Chapter 4. Continuous Random Variables 4.1
    Every continuous random variable has a probability density function (PDF), instead of a probability mass function (PMF), that defines the relative likelihood ...
  67. [67]
    Integration and Probability - Penn Math
    Definition 14.1 (probability density). A probability density is a nonnegative function f such that ∫−∞^∞ f(x) dx = 1.
  68. [68]
    [PDF] 23.0 Survival Analysis - Stat@Duke
    The survival function is S(t) = 1 − F(t), or the probability that a person or machine or a business lasts longer than t time units. Here F(t) is the usual cumulative distribution function.
  69. [69]
    [PDF] Lecture 5: Survival Analysis 5.1 Survival Function
    Namely, S(t) is the probability that an individual will survive past time t. Here are some basic properties about S(t):. • S(0) = 1 and S(∞) = 0.
  70. [70]
    [PDF] STA 611: Introduction to Mathematical Statistics Lecture 3 - Stat@Duke
    We define the quantile function of X as F−1(p) = the smallest x such that F(x) ≥ p. F−1(p) is called the p quantile of X, or the 100 × p percentile of X.
  71. [71]
    [PDF] Introduction to Random Variables
    Aug 28, 2020 · Formally, the quantile function can be defined as Q(p) = min{x ∈ S : F(x) ≥ p}. Thus, for any input probability p ∈ [0,1], the quantile function Q(p) ...
  72. [72]
    22.2 - Change-of-Variable Technique | STAT 414 - STAT ONLINE
    We have now derived what is called the change-of-variable technique first for an increasing function and then for a decreasing function.
  73. [73]
    [PDF] 1 Change of Variables - Stat@Duke
    Let X be a real-valued random variable with pdf fX(x) and let Y = g(X) for some strictly monotonically-increasing differentiable function g(x); then.
  74. [74]
    [PDF] 1957-feller-anintroductiontoprobabilitytheoryanditsapplications-1.pdf
    INTRODUCTION: THE NATURE OF PROBABILITY THEORY: The Background; Procedure; "Statistical" Probability ...
  75. [75]
    [PDF] Probability and Measure - University of Colorado Boulder
    The book presupposes a knowledge of combinatorial and discrete probability, of rigorous calculus, in particular infinite series, and of elemen- tary set theory.
  76. [76]
    Random variables, joint distribution functions, and copulas - EuDML
    Sklar, Abe. "Random variables, joint distribution functions, and copulas." Kybernetika 9.6 (1973): 449–460. <http://eudml.org/doc/28992>.
  77. [77]
    [PDF] Bernoulli and Binomial Random Variables
    Jul 10, 2017 · A Bernoulli random variable is the simplest kind of random variable. It can take on two values, 1 and 0. It takes on a 1 if an experiment with ...
  78. [78]
    [PDF] Chapter 3. Discrete Random Variables 3.4 - Washington
    Each Xi in the Bernoulli process with parameter p is a Bernoulli/indicator random variable with parameter p. It simply represents a binary outcome, like a coin ...
  79. [79]
    28.1 - Normal Approximation to Binomial | STAT 414
    We will now focus on using the normal distribution to approximate binomial probabilities. The Central Limit Theorem is the tool that allows us to do so.
  80. [80]
    Lesson 12: The Poisson Distribution - STAT ONLINE
    It is important to keep in mind that the Poisson approximation to the binomial distribution works well only when n is large and p is small. In general, the ...
  81. [81]
    [PDF] The Poisson Distribution
    Jul 12, 2017 · A Poisson random variable approximates a Binomial where n is large, p is small, and λ = np is "moderate". Interestingly, to calculate the things ...
  82. [82]
    [PDF] Zoo of Discrete RVS part II Poisson Distribution - Washington
    Formally, the Binomial approaches the Poisson in the limit as n → ∞ (equivalently, p → 0) while holding np = λ.
  83. [83]
    [PDF] Some discrete distributions - UConn Undergraduate Probability OER
    This proposition shows that the Poisson distribution models binomials when the probability of a success is small.
  84. [84]
    Fig. 2. Poisson distribution for various values of µ.
    Two important examples of such processes are radioactive decay and particle reactions. To take a concrete example, consider a typical radioactive source ...
  85. [85]
    Uniform Distribution | Definition - Probability Course
    A continuous random variable X is said to have a Uniform distribution over the interval [a,b], shown as X ∼ Uniform(a,b), if its PDF is given by ...
  86. [86]
    1.3.6.6.2. Uniform Distribution
    One of the most important applications of the uniform distribution is in the generation of random numbers. That is, almost all random number generators ...
  87. [87]
    5.3: Normal Distribution and Its Applications - Statistics LibreTexts
    Sep 21, 2024 · Definition: The Empirical Rule (68-95-99.7% Rule) · Approximately 68% of the observations fall within one standard deviation (σ) of the mean μ.
  88. [88]
    The normal distribution - Analytical Science Journals
    The normal (or Gaussian) distribution was first described by Carl Friedrich Gauss in 1809 [1] in the context of measurement errors in astronomy. ...
  89. [89]
    4.2.1.4. The random errors follow a normal distribution.
    The normal distribution is one of the probability distributions in which extreme random errors are rare. If some other distribution actually describes the ...
  90. [90]
    Exponential Distribution | Definition | Memoryless Random Variable
    The exponential distribution is one of the widely used continuous distributions. It is often used to model the time elapsed between events.
  91. [91]
    15.1: Introduction - Statistics LibreTexts
    Apr 23, 2022 · The interarrival times have an exponential distribution with rate parameter r. · The exponential distribution is the only distribution with ...
  92. [92]
    [PDF] Notes 3 : Modes of convergence
    DEF 3.1 (Modes of convergence) Let {Xn}n be a sequence of (not necessarily independent) RVs and let X be a RV. Then we have the following definitions. ...
  93. [93]
    [PDF] CHAPTER 5. Convergence of Random Variables
    This part of probability is often called “large sample theory” or “limit theory” or “asymptotic theory.” This material is extremely important for statistical ...
  94. [94]
  95. [95]
    [PDF] Sec. 6.7. CONVERGENCE OF RANDOM SEQUENCES - MIT
    Definition 6.7-4 (Almost-sure convergence.) The random sequence X[n] converges almost surely to the random variable X if the sequence of functions X[n, ] ...
  96. [96]
    [PDF] Convergence in Distribution - Arizona Math
    Xn converges to X in distribution if, for every bounded continuous function h, lim Eh(Xn) = Eh(X), based on comparing distributions P{Xn ∈ A} and P{X ∈ A}.
  97. [97]
    [PDF] Convergence of Probability Measures - CERMICS
    Billingsley, Patrick. Convergence of Probability Measures. 2nd ed. Wiley Series in Probability and Statistics.
  98. [98]
    [PDF] Convergence Concepts - Arizona Math
    Nov 17, 2009 · 1. Xn converges to X almost surely if P{lim Xn = X} = 1. 2. We say that Xn converges to X in Lp or in p-th moment, p > 0 (Xn →Lp X) ... comparison of the random variables Xn with X but rather ...
  99. [99]
    [PDF] Convergence: a.s., i.p., and Lp - Stat@Duke
    Types of convergence. Given random variables X1, ..., Xn and X defined on the probability space (Ω, F, P), the following types of convergence can happen.
  100. [100]
    [PDF] Modes of convergence - MyWeb
    Weak convergence. Convergence of moments. Strong convergence. Convergence in mean. Interchanging limits and integrals. Convergence in mean vs convergence in ...
  101. [101]
    [PDF] Modes of convergence for random variables - Purdue Math
    Independence of σ-algebras: Let (Fj)j∈J be σ-algebras, Fj ⊂ F. Those σ-algebras are said to be independent if for all m ≥ 2: for all j1, ..., jm ∈ J, ...
  102. [102]
    [PDF] ON CHEBYSHEV'S INEQUALITY IN ELEMENTARY STATISTICS
    Known as the founding father of Russian mathematics, Pafnuty Chebyshev first mentioned and proved the inequality in a paper published in 1867.
  103. [103]
    [PDF] An Appreciation of Kolmogorov's 1933 Paper - DTIC
    Jun 15, 1992 · In the next years he published fundamental work on laws of large numbers; he regarded such laws, the study of which began with Bernoulli, as ...
  104. [104]
    [PDF] Error Statistics and the Frequentist Interpretation of Probability1
    3.3 The Strong Law of Large Numbers. A formal justification for the frequentist interpretation stems from the Strong. Law of Large Numbers (SLLN) which gives ...
  105. [105]
    Central Limit Theorem - StatLect
    Lindeberg-Lévy Central Limit Theorem: the standardized sample mean converges in distribution to a standard normal distribution.
  106. [106]
    The Central Limit Theorem Around 1935 - jstor
    BERRY, A. C. (1941). The accuracy of Gaussian approximation to the sum of independent variates. Trans. Amer. Math. Soc.
  107. [107]
    De Moivre's “Miscellanea Analytica”, and the Origin of the Normal ...
    Evidence was presented which shows that Abraham De Moivre (1667–1754) invented the normal curve and the normal probability integral about 1721.
  108. [108]
    [PDF] Characteristic Functions and the Central Limit Theorem
    Corollary 108: If the characteristic functions of two random variables X and Y agree, then X and Y have the same distribution. Proof. This follows immediately ...
  109. [109]
    The Accuracy of the Gaussian Approximation to the Sum of ... - jstor
    A. C. BERRY (1941), "The Sum of Independent Variates": eq. (22) gives a characteristic-function integral of the form ∫ e^{ixt} dF(x) ...