
Thompson sampling

Thompson sampling, also known as posterior sampling or probability matching, is a randomized Bayesian heuristic designed to solve sequential decision-making problems under uncertainty, particularly those involving the exploration-exploitation trade-off, such as the multi-armed bandit problem. In this framework, the algorithm maintains a posterior distribution over the unknown parameters of the reward model for each action; at each decision step, it samples a parameter value from this posterior, computes the expected reward for each action under the sampled parameters, and selects the action with the highest expected reward. This process naturally balances the need to exploit actions known to yield high rewards with the exploration of potentially better alternatives, as optimistic samples encourage trying less-proven options. Named after William R. Thompson, the algorithm was first proposed in 1933 as a method for allocating patients to treatments in clinical trials, where the goal was to maximize the number of successes while learning about the relative efficacy of therapies with unknown success probabilities. Thompson's original formulation involved sampling from beta distributions to compare two treatments, framing the problem in terms of likelihoods derived from binomial observations. Despite its early introduction, Thompson sampling remained relatively obscure for decades, with limited adoption until the late 1990s and early 2000s, when it was rediscovered in the context of reinforcement learning and bandit algorithms. Its popularity surged in the 2010s due to empirical successes in large-scale applications and theoretical analyses demonstrating near-optimal regret bounds, such as O(\sqrt{KT \log T}) for K-armed Bernoulli bandits over T rounds. Thompson sampling has found widespread use in modern applications requiring adaptive experimentation, including online advertising (e.g., selecting ad variants to maximize clicks), A/B testing in web services, and personalized recommendation systems at large technology companies. It extends naturally to more complex settings, such as linear bandits, contextual bandits, and Markov decision processes, where it leverages structured models for efficient computation. Notable properties include its simplicity of implementation, often requiring only posterior sampling and argmax operations, and its robustness to model misspecification, though it can be computationally intensive for high-dimensional problems without approximations. Ongoing research continues to refine its theoretical guarantees and practical variants, solidifying its role as a cornerstone of sequential decision-making under uncertainty.

Introduction

Definition

Thompson sampling is a randomized heuristic for choosing actions in sequential decision-making problems under uncertainty, most notably in the multi-armed bandit framework, where it balances exploration and exploitation by sampling parameter values from the posterior distribution of a Bayesian model and selecting the action that maximizes the expected reward under those sampled values. In this setting, the problem involves multiple "arms," each representing an action with rewards drawn from an unknown distribution; the goal is to maximize cumulative reward over time while learning about the arms' reward structures. Core components of Thompson sampling include the arms (discrete actions available to the decision-maker), the rewards (random outcomes associated with pulling each arm, typically assumed to be independent and identically distributed given unknown parameters), and prior beliefs (initial probability distributions over the reward parameters, which are updated to posteriors as data is observed). The algorithm maintains a Bayesian model of the environment, starting with these priors to represent initial uncertainty about the reward distributions before any observations. A simple example arises in a two-armed bandit problem with Bernoulli rewards, where each arm yields a success (reward 1) with unknown probability \theta_i and failure (reward 0) otherwise. Priors on \theta_1 and \theta_2 are chosen as Beta distributions, which are conjugate to the Bernoulli likelihood and thus yield Beta posteriors after observations; at each step, the algorithm samples \tilde{\theta}_i from the current posterior for each arm i, then selects the arm with the highest \tilde{\theta}_i. The method is named after William R. Thompson, who proposed it in 1933 for allocating treatments in clinical trials modeled as a two-armed bandit problem.

Intuition

Thompson sampling addresses the fundamental exploration-exploitation dilemma in sequential decision-making under uncertainty, where an agent must balance exploiting actions known to yield high rewards based on current information against exploring potentially better but uncertain alternatives to improve long-term performance. This trade-off arises in settings like multi-armed bandits, where repeatedly selecting suboptimal actions leads to regret, but ignoring uncertainty risks missing superior options. Thompson sampling resolves this by incorporating probabilistic beliefs about rewards, naturally favoring exploration when uncertainty is high through a mechanism of optimistic sampling from these beliefs. At its core, the intuitive mechanism of Thompson sampling involves maintaining a posterior over possible reward parameters for each action, informed by observed data, and then sampling a hypothesized reward value from each posterior to select the action with the highest sampled value. This process induces an automatic bias toward actions with high variance in their posterior, as such actions are more likely to produce high samples and thus be selected, promoting exploration precisely when the evidence is ambiguous. Over time, as data accumulates, the posteriors concentrate around true values, shifting the balance toward exploitation of the empirically best actions. A compelling analogy for this approach originates from its medical roots, akin to a clinician allocating patients to treatments in a clinical trial with two therapies of unknown efficacy: by simulating outcomes from the current evidence about each treatment, the method favors assigning the promising but less-proven treatment more often when the data suggests it might outperform the known one, thereby testing hypotheses efficiently while minimizing harm from inferior choices. Thompson sampling offers key advantages, including asymptotic optimality in finite-armed bandit problems, where its expected regret grows logarithmically with time, matching information-theoretic lower bounds in many cases. It is also computationally efficient when posterior sampling is tractable, such as with conjugate priors like Beta distributions for Bernoulli rewards, enabling straightforward implementation without excessive overhead.
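To make this mechanism concrete, the following minimal sketch (the Beta parameters and sample size are illustrative choices) estimates by Monte Carlo how often a high-variance posterior produces the top sample when competing against a well-explored arm:

```python
# Minimal sketch: posterior variance drives exploration. An uncertain arm with
# a lower posterior mean is still sampled as "best" a non-trivial fraction of
# the time, while a tightly concentrated arm with the same mean is not.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Arm A: well-explored, posterior Beta(60, 40) -> mean 0.60, low variance.
# Arm B: barely explored, posterior Beta(3, 3) -> mean 0.50, high variance.
# Arm C: same mean as B but concentrated, posterior Beta(30, 30).
a = rng.beta(60, 40, n)
b = rng.beta(3, 3, n)
c = rng.beta(30, 30, n)

print("P(B sampled above A):", np.mean(b > a))  # sizeable: B keeps being tried
print("P(C sampled above A):", np.mean(c > a))  # much smaller: little exploration
```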

Mathematical Formulation

Bayesian Framework

Thompson sampling operates within a Bayesian framework, where the unknown reward parameters \theta for each arm in a bandit problem are treated as random variables drawn from a prior distribution p(\theta). Observations of rewards are modeled via a likelihood p(r \mid \theta), where r denotes the observed reward. The posterior p(\theta \mid \text{data}) is then obtained through Bayes' theorem as p(\theta \mid \text{data}) \propto p(\text{data} \mid \theta) p(\theta), representing the updated belief about \theta after incorporating evidence from past interactions. For computational tractability, Thompson sampling often employs conjugate priors, which ensure that the posterior belongs to the same family as the prior, facilitating efficient updating. In the case of Bernoulli rewards, a Beta prior p(\theta_k) = \frac{\Gamma(\alpha_k + \beta_k)}{\Gamma(\alpha_k) \Gamma(\beta_k)} \theta_k^{\alpha_k - 1} (1 - \theta_k)^{\beta_k - 1} for arm k is conjugate to the Bernoulli likelihood p(r \mid \theta_k) = \theta_k^r (1 - \theta_k)^{1 - r}, resulting in a Beta posterior with updated parameters (\alpha_k + r_t, \beta_k + 1 - r_t) after observing reward r_t from arm k. Similarly, for normally distributed rewards with known variance, a Gaussian prior on \theta yields a Gaussian posterior, with updates to the mean and variance based on the observed data. These choices enable closed-form posterior updates without requiring complex numerical methods. The posterior distribution encapsulates the decision-maker's beliefs, quantifying uncertainty over the reward parameters after each observation and integrating all available evidence in a coherent probabilistic manner. Sampling from this posterior allows actions to be selected in a way that naturally balances optimism about uncertain arms (exploration) with confidence in high-reward arms (exploitation), as samples from regions of high posterior density influence choices probabilistically. This Bayesian updating process ensures that beliefs evolve dynamically with new data, maintaining a principled representation of epistemic uncertainty. The framework generalizes to any model where posterior sampling is feasible, such as linear models for contextual bandits or more complex hierarchical structures, provided the prior and likelihood permit efficient computation of the posterior. This flexibility underpins Thompson sampling's applicability beyond simple bandits, as long as the model supports sampling from the belief over parameters.
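The conjugate updates described above can be written in a few lines; the sketch below (with illustrative variable names) implements the Beta-Bernoulli update and the Gaussian update with known observation variance:

```python
# Closed-form conjugate posterior updates (a sketch; names are illustrative).

def update_beta(alpha, beta, r):
    """Beta-Bernoulli update after observing reward r in {0, 1}."""
    return alpha + r, beta + (1 - r)

def update_gaussian(mu0, tau0_sq, r, sigma_sq):
    """Gaussian update with known observation variance sigma_sq:
    prior N(mu0, tau0_sq) combined with a single observation r."""
    precision = 1.0 / tau0_sq + 1.0 / sigma_sq
    post_var = 1.0 / precision
    post_mean = post_var * (mu0 / tau0_sq + r / sigma_sq)
    return post_mean, post_var

# Beta(1, 1) prior after observing a success, then a failure.
a, b = update_beta(1, 1, 1)   # -> (2, 1)
a, b = update_beta(a, b, 0)   # -> (2, 2)
print(a, b)

# N(0, 1) prior, observation 0.8 with noise variance 0.5.
print(update_gaussian(0.0, 1.0, 0.8, 0.5))
```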

Posterior Sampling Procedure

In Thompson sampling, the posterior sampling procedure forms the decision-making core, leveraging Bayesian posteriors to balance exploration and exploitation in sequential decision problems such as multi-armed bandits. At each time step t = 1, 2, \dots, the algorithm first draws a sample \theta^{(t)} from the posterior p(\theta \mid H_{1:t-1}), where H_{1:t-1} denotes the history of actions and observed rewards up to time t-1. This sampling reflects the agent's current beliefs about the unknown parameters \theta, updated from an initial prior via Bayes' rule. The action selection then proceeds deterministically based on the sampled parameters: a_t = \arg\max_{a \in \mathcal{A}} \theta_a^{(t)}, where \mathcal{A} is the action space (e.g., the set of K arms in a multi-armed bandit) and \theta_a^{(t)} is the component of \theta^{(t)} corresponding to action a. In the multi-armed bandit setting, parameters are sampled independently for each arm, allowing the procedure to scale naturally to multiple options without requiring joint sampling over the full parameter space. If multiple actions yield the same maximum value under the sample (a tie), the tie is broken arbitrarily, such as by selecting the lowest-indexed arm or via uniform randomization, to ensure a valid action is chosen. After selecting and executing a_t, the agent observes the reward r_t and incorporates it into the history, updating the posterior to p(\theta \mid H_{1:t}) for the next iteration. A concrete example of this procedure arises in the Beta-Bernoulli bandit, where each arm k has an unknown success probability \theta_k \in [0,1] and rewards are binary (0 or 1). The posterior for arm k is maintained as a \text{Beta}(\alpha_k, \beta_k) distribution, initialized from a uniform prior such as \text{Beta}(1,1). At time t, sample \theta_k^{(t)} \sim \text{Beta}(\alpha_k, \beta_k) independently for each k = 1, \dots, K, then set a_t = \arg\max_k \theta_k^{(t)} (with arbitrary tie-breaking). Upon observing r_t \in \{0,1\} for the chosen arm k = a_t, update the parameters as \alpha_k \leftarrow \alpha_k + r_t and \beta_k \leftarrow \beta_k + (1 - r_t), which combines the Beta prior with the Bernoulli likelihood via conjugacy to yield the new posterior. This update rule ensures the posterior remains tractable, enabling efficient repeated sampling in practice.
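A single decision step of this Beta-Bernoulli procedure, including uniform tie-breaking and the conjugate update, might look as follows (a sketch; the environment probabilities in true_probs are hypothetical):

```python
# One decision step of Beta-Bernoulli Thompson sampling.
import numpy as np

rng = np.random.default_rng(1)
true_probs = np.array([0.3, 0.5, 0.7])       # unknown to the algorithm
alpha = np.ones(3)                            # Beta(1, 1) priors
beta = np.ones(3)

theta = rng.beta(alpha, beta)                 # sample theta_k^(t) for each arm
best = np.flatnonzero(theta == theta.max())   # ties broken uniformly at random
a_t = rng.choice(best)

r_t = rng.binomial(1, true_probs[a_t])        # observe a Bernoulli reward
alpha[a_t] += r_t                             # conjugate posterior update
beta[a_t] += 1 - r_t
print(a_t, r_t, alpha, beta)
```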

Algorithm and Implementation

Pseudocode

Thompson sampling can be described algorithmically in a straightforward manner for the multi-armed bandit setting, where the goal is to select actions (arm pulls) over a horizon T to maximize cumulative reward while balancing exploration and exploitation. The algorithm initializes prior beliefs over the unknown reward parameters \theta for each of the K arms and iteratively samples from the posterior to select the arm with the highest sampled value, observes the reward, and updates the posterior accordingly. Inputs to the algorithm typically include the number of arms K and the horizon T, while outputs consist of the sequence of selected actions A_t for t = 1 to T and the resulting cumulative regret, defined as the difference between the optimal cumulative reward and the algorithm's total reward. The following pseudocode outlines the general procedure:
Initialize priors for each arm k = 1 to K (e.g., based on [domain knowledge](/page/Domain_knowledge) or non-informative defaults)
For t = 1 to T:
    Sample θ_k ~ Posterior(prior_k, observed data for arm k) for each k = 1 to K
    Select A_t = argmax_k θ_k
    Observe reward R_t from [arm](/page/Arm) A_t
    Update posterior for [arm](/page/Arm) A_t using R_t
Output: Sequence A_1, ..., A_T and cumulative [regret](/page/Regret) = ∑_{t=1}^T (μ^* - μ_{A_t}), where μ^* is the mean reward of the optimal [arm](/page/Arm)
This structure ensures that actions are chosen probabilistically from the posterior, promoting optimistic exploration of uncertain arms. For the common case of Bernoulli multi-armed bandits, where rewards are binary (0 or 1) with unknown success probabilities \theta_k \in [0,1] for each arm k, conjugate Beta priors are used for tractable posterior updates. A standard non-informative prior is Beta(1,1), which is uniform over [0,1]. The posterior after observing S_k successes and F_k failures for arm k is Beta(1 + S_k, 1 + F_k). Sampling from this posterior can be implemented using random number generators, such as the inverse CDF method or libraries like NumPy's random.beta. The specific pseudocode is as follows:
Input: K (number of [arms](/page/Arms)), T (horizon)
Initialize: For each k = 1 to K, set S_k = 0 (successes), F_k = 0 (failures)
For t = 1 to T:
    For each k = 1 to K:
        Sample θ_k ~ [Beta](/page/Beta)(1 + S_k, 1 + F_k)
    Select A_t = argmax_k θ_k
    Observe reward R_t ∈ {0,1} from arm A_t
    If R_t = 1:
        S_{A_t} ← S_{A_t} + 1
    Else:
        F_{A_t} ← F_{A_t} + 1
Output: Sequence A_1, ..., A_T and cumulative [regret](/page/Regret)
This implementation is efficient for moderate K and T, with each round requiring O(K) posterior samples and constant-time updates. The algorithm is adaptable to continuous action spaces by sampling from multivariate posteriors over parameterizations of the reward function, such as Gaussian processes or ensembles, and selecting the action that maximizes the sampled expected reward.
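The Bernoulli pseudocode above translates directly into a short, runnable sketch; the simulated arm probabilities and seed below are illustrative choices rather than part of the algorithm itself:

```python
# Bernoulli Thompson sampling with Beta(1, 1) priors, following the pseudocode above.
import numpy as np

def thompson_bernoulli(true_probs, T, seed=0):
    rng = np.random.default_rng(seed)
    K = len(true_probs)
    successes = np.zeros(K)
    failures = np.zeros(K)
    actions, rewards = [], []

    for t in range(T):
        theta = rng.beta(1 + successes, 1 + failures)  # sample from each posterior
        a = int(np.argmax(theta))                      # select the arm (lowest index on ties)
        r = rng.binomial(1, true_probs[a])             # observe a binary reward
        successes[a] += r
        failures[a] += 1 - r
        actions.append(a)
        rewards.append(r)

    mu_star = max(true_probs)
    regret = sum(mu_star - true_probs[a] for a in actions)  # cumulative regret
    return actions, rewards, regret

actions, rewards, regret = thompson_bernoulli([0.2, 0.5, 0.6], T=5000)
print("cumulative regret:", regret)
```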

Computational Aspects

Implementing Thompson sampling requires addressing several computational challenges, particularly in sampling from the posterior distribution and scaling to practical settings. For models with conjugate priors, such as Beta distributions in Bernoulli multi-armed bandits or Gaussian priors in linear bandits, exact posterior sampling is straightforward and efficient, as updates maintain closed-form distributions that can be sampled directly in constant time per arm. However, in non-conjugate settings, which are common in complex environments such as contextual bandits with logistic reward models, exact inference becomes intractable, necessitating approximate methods such as Markov chain Monte Carlo (MCMC) techniques (e.g., Gibbs sampling or Metropolis-Hastings) or variational inference to generate samples from the posterior. To mitigate these tractability issues, various approximation schemes have been developed for complex posteriors. Bootstrap resampling, which treats empirical data as the population to generate multiple posterior samples, offers a simple, non-parametric alternative suitable for high-dimensional or non-conjugate models without requiring likelihood evaluations. Moment-matching methods, such as fitting a Gaussian to match the first two moments of the true posterior, provide further efficiency in high-dimensional spaces by reducing sampling to standard Gaussian draws, though they may introduce bias in non-Gaussian cases. Variational variants, like those minimizing the reverse Kullback-Leibler divergence, enable scalable approximations but can lead to under- or over-exploration if the approximation error is large, potentially causing linear regret unless mitigated by added exploration mechanisms. Scalability concerns arise primarily with large numbers of arms K or actions, where the ideal per-step time complexity is O(K) due to sampling each arm's posterior and selecting the maximum, which remains feasible for moderate K but becomes prohibitive in high-dimensional or continuous spaces. For such cases, particle filters maintain a set of weighted samples (particles) to approximate the posterior via sequential Monte Carlo, enabling online updates with complexity scaling with the number of particles rather than K, as demonstrated in complex bandit problems like queuing networks. In reinforcement learning extensions, online posterior updates using ensemble methods or sparse Gaussian processes further enhance scalability for large state-action spaces. In practice, Thompson sampling is supported by various software libraries for efficient implementation. Python's NumPy provides built-in functions for sampling from conjugate distributions like the Beta distribution, facilitating quick prototyping for simple bandits. Machine learning frameworks integrate Thompson sampling variants for bandit and contextual settings, handling approximate posterior sampling within scalable pipelines. Specialized packages, including those on PyPI, offer ready-to-use implementations for multi-armed bandits with support for MCMC-based sampling.
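As one illustration of these approximation strategies, the sketch below implements a simple bootstrap-style variant for Bernoulli rewards, in which resampling each arm's observation history (seeded with one pseudo-success and one pseudo-failure as a weak prior) stands in for an exact posterior draw; this is a simplified illustration rather than the approach used by any particular library:

```python
# Bootstrap-style approximate Thompson sampling for Bernoulli rewards (a sketch).
import numpy as np

rng = np.random.default_rng(0)
true_probs = [0.4, 0.55, 0.6]            # hypothetical environment
K, T = len(true_probs), 3000

# Pseudo-observations (one success, one failure per arm) act as a weak prior
# and keep early bootstrap resamples from collapsing to a single value.
history = [[0, 1] for _ in range(K)]

for t in range(T):
    scores = np.empty(K)
    for k in range(K):
        # One bootstrap resample of the arm's history stands in for a posterior draw.
        resample = rng.choice(history[k], size=len(history[k]), replace=True)
        scores[k] = resample.mean()
    a = int(np.argmax(scores))
    r = rng.binomial(1, true_probs[a])
    history[a].append(r)

print("pull counts (incl. 2 pseudo-observations each):", [len(h) for h in history])
```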

History

Origins

Thompson sampling was first proposed by William R. Thompson in his 1933 paper published in Biometrika, titled "On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples." In this work, Thompson introduced the method as a strategy for sequential decision-making in clinical trials aimed at comparing the efficacy of two treatments, where the goal was to allocate patients to treatments in a way that maximizes the overall benefit while learning about the unknown success probabilities. The approach was motivated by the need for efficient sequential analysis in medicine, where resources such as patient numbers were limited, and decisions had to balance immediate treatment outcomes with the accumulation of evidence to inform future allocations. At its core, Thompson's key insight involved modeling the unknown success probabilities of the treatments using Bayesian conjugate priors—specifically, beta distributions for binary outcomes—and then performing sampling from the posterior distributions to determine treatment allocation. This procedure effectively sampled a hypothesized value for each treatment's success probability from its respective posterior and selected the treatment with the higher sampled value, thereby incorporating uncertainty directly into the decision process. This Bayesian framework allowed for adaptive randomization that favored promising treatments based on accumulating data, predating the formal development of multi-armed bandit theory in the 1950s. Despite its innovative nature, Thompson's method had limited initial impact and was largely overlooked in the academic literature for decades. The primary reasons included the computational constraints of the era, which made repeated posterior sampling and updates impractical without modern computing resources, as well as a lack of theoretical analysis and empirical validation at the time. It was not until the late 1990s and 2000s, with advances in computational power and renewed interest in Bayesian methods for online decision-making, that the approach gained widespread recognition.

Subsequent Developments

Following its initial proposal in 1933, Thompson sampling experienced a revival in the early 2000s within machine learning, particularly for addressing exploration problems in online decision-making, with independent rediscoveries such as Wyatt (1997) and Strens (2000) in reinforcement learning contexts, followed by empirical successes in Scott (2010) and Chapelle and Li (2011). This rediscovery emphasized its Bayesian foundations for balancing exploration and exploitation, leading to renewed theoretical interest. Key contributions included the first rigorous analysis demonstrating logarithmic expected regret for the stochastic multi-armed bandit problem by Agrawal and Goyal in 2012. Concurrently, Kaufmann, Korda, and Munos provided the first finite-time analysis proving asymptotic optimality for Bernoulli reward distributions in 2012, matching the Lai-Robbins lower bound on regret. Subsequent theoretical advancements solidified Thompson sampling's guarantees across broader settings. Russo and Van Roy offered an information-theoretic framework in 2014 that bounded Bayesian regret in bandits through the information ratio, highlighting its efficiency in reducing uncertainty about optimal actions. Extensions followed to more complex environments, such as linear bandits via linear Gaussian models and contextual bandits where actions depend on observed features, with Agrawal and Goyal deriving near-optimal regret bounds for linear payoffs in 2013. Post-2020 developments have further diversified Thompson sampling's scope. In multi-agent settings, multi-agent Thompson sampling (MATS) was introduced in 2020, enabling coordinated exploration in sparse neighborhood structures for distributed bandit problems, as demonstrated in applications like recommender systems. Recent information-theoretic analyses, such as that by Zhang in 2025, have extended bounds to infinite action spaces, revealing tighter connections between posterior sampling and regret minimization. A comprehensive 2025 survey cataloged over 50 variants, including neural and meta-learning adaptations, underscoring ongoing innovations in scalability and adaptability. Thompson sampling's influence has permeated practical implementations, with integration into libraries like Vowpal Wabbit for contextual bandit tasks. Tech companies have adopted it for adaptive experimentation in large-scale systems, as evidenced by empirical evaluations showing superior performance over epsilon-greedy baselines in ad personalization.

Applications

Multi-Armed Bandits

The multi-armed bandit (MAB) problem involves a decision-maker facing K arms, each associated with an unknown reward distribution characterized by its mean \mu_i for arm i = 1, \dots, K. The objective is to maximize the cumulative reward over T sequential pulls by selecting one arm at each time step t = 1, \dots, T, while balancing the trade-off between exploiting known high-reward arms and exploring uncertain ones to learn their means. In the MAB setting, Thompson sampling (TS) operates by maintaining a posterior over each arm's mean \mu_i based on observed rewards and a prior (e.g., a Gaussian prior for continuous rewards). At each step, it samples a value \tilde{\mu}_i from the posterior for each arm and selects the arm with the maximum sampled value, which naturally favors arms with high expected means while incorporating enough randomness to occasionally select suboptimal arms. This posterior sampling mechanism promotes optimistic exploration of under-sampled arms, leading TS to empirically outperform purely greedy methods that exploit the current best arm without exploration. Simulations in Gaussian MAB environments, where rewards are drawn from normal distributions with known variance, demonstrate TS's effectiveness; for instance, in settings with multiple arms and means spaced to include one optimal arm, TS incurs low cumulative regret by systematically exploring suboptimal arms in early stages before converging to the best arm. TS extends to non-stationary MAB variants, where arm means \mu_i evolve over time, by incorporating forgetting factors such as exponential discounting in the posterior updates to downweight outdated observations and maintain adaptability to changes.
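The sketch below (illustrative means, variance, and discount factor) combines Gaussian Thompson sampling with known reward variance and an exponential forgetting factor of the kind described above, so that gamma = 1 recovers the stationary case:

```python
# Gaussian Thompson sampling with a forgetting factor (a sketch).
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.0, 0.5, 1.0])   # unknown Gaussian arm means
sigma = 1.0                               # known reward standard deviation
K, T, gamma = len(true_means), 5000, 0.99

counts = np.zeros(K)       # discounted pull counts
sums = np.zeros(K)         # discounted reward sums

for t in range(T):
    # Posterior over each mean: Normal(sums/counts, sigma^2/counts), with a
    # wide default for unpulled arms to force initial exploration.
    post_mean = np.where(counts > 0, sums / np.maximum(counts, 1e-12), 0.0)
    post_std = np.where(counts > 0, sigma / np.sqrt(np.maximum(counts, 1e-12)), 10.0)
    theta = rng.normal(post_mean, post_std)
    a = int(np.argmax(theta))

    r = rng.normal(true_means[a], sigma)
    counts *= gamma                       # downweight old evidence on every arm
    sums *= gamma
    counts[a] += 1
    sums[a] += r

print("discounted pull counts:", np.round(counts, 1))
```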

Broader Uses

Thompson sampling has been extended to Bayesian reinforcement learning (RL) settings, where it facilitates exploration in partially observable Markov decision processes (POMDPs) by sampling from the posterior distribution over model parameters to select actions. A prominent example is posterior sampling for reinforcement learning (PSRL), which applies Thompson sampling in tabular Markov decision processes (MDPs) to balance exploration and exploitation by drawing policies from the posterior over transition and reward dynamics. Beyond reinforcement learning, Thompson sampling finds application in A/B testing for web services, such as online ad optimization, where it dynamically allocates traffic to variants to maximize click-through rates while minimizing regret. In clinical trials, it supports adaptive designs by sequentially updating treatment allocations based on posterior beliefs about efficacy, enabling efficient patient assignment in multi-arm trials. For brain-computer interfaces (BCIs), Thompson sampling enhances P300 speller performance by adaptively selecting stimuli sequences to elicit event-related potentials, reducing the number of trials needed to identify target characters. A 2022 mini-review highlights its role in RL-driven stimulus optimization for ERP-based spellers, demonstrating improved accuracy and speed in real-time applications. In chemistry, it accelerates molecular design by efficiently searching ultralarge synthesis-on-demand libraries, prioritizing promising candidates via posterior sampling to optimize properties like binding affinity. Recent advancements include multi-agent Thompson sampling for scenarios with sparse rewards and neighborhood structures, such as coordinating distributed sensors, where it leverages Bayesian updates across agents to achieve low regret in communication-constrained environments. As of 2025, Thompson sampling has been applied in GenAI-powered adaptive interventions for mobile health, using generator-mediated bandit algorithms to optimize treatment delivery. Adaptations of Thompson sampling incorporate contextual features to handle covariate-dependent rewards, as in contextual Thompson sampling, which samples from posteriors conditioned on observed contexts to recommend personalized actions, as sketched in the example below. Hierarchical models extend it to structured data, modeling shared parameters across tasks or levels to enable transfer learning, such as in multi-task bandits where shared priors and posteriors guide sampling for efficient exploration.
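For the contextual case with linear payoffs, a minimal sketch of linear Thompson sampling is shown below; it assumes a linear-Gaussian reward model, and the dimensions, noise level, and simulated environment are illustrative:

```python
# Linear (contextual) Thompson sampling sketch: reward = x^T w + noise,
# with a Gaussian posterior over w sampled at every round.
import numpy as np

rng = np.random.default_rng(0)
d, K, T = 5, 10, 2000
w_true = rng.normal(size=d)                 # unknown parameter (simulation only)
noise, lam = 0.1, 1.0                       # noise std dev and ridge/prior precision

B = lam * np.eye(d)                         # posterior precision matrix
f = np.zeros(d)                             # accumulated X^T r

for t in range(T):
    contexts = rng.normal(size=(K, d))      # one feature vector per candidate action
    cov = np.linalg.inv(B) * noise**2       # naive inversion; fine for small d
    mean = np.linalg.inv(B) @ f
    w_sample = rng.multivariate_normal(mean, cov)   # posterior draw
    a = int(np.argmax(contexts @ w_sample))         # act greedily w.r.t. the sample

    r = contexts[a] @ w_true + rng.normal(0, noise)
    B += np.outer(contexts[a], contexts[a])         # Bayesian linear-regression update
    f += r * contexts[a]

print("||w_hat - w_true||:", np.linalg.norm(np.linalg.inv(B) @ f - w_true))
```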

Theoretical Properties

Regret Bounds

In the multi-armed bandit (MAB) framework, the cumulative regret R_T after T rounds is defined as R_T = T \mu^* - \mathbb{E}\left[ \sum_{t=1}^T r_t \right], where \mu^* denotes the expected reward of the optimal arm and r_t is the observed reward at round t. This measures the cumulative shortfall from always selecting the best arm. Efficient exploration-exploitation algorithms, including Thompson sampling, aim for sublinear regret R_T = o(T), ensuring the per-round average regret approaches zero asymptotically. For the Bernoulli MAB, where arm rewards are binary with unknown success probabilities, Thompson sampling achieves an expected regret bound of O(\sqrt{K T \log T}), with K the number of arms; this problem-independent bound matches the minimax rate up to logarithmic factors. For Gaussian rewards with known variance, Thompson sampling yields logarithmic problem-dependent bounds under suitable Gaussian priors, scaling as O\left( \sum_{i \neq i^*} \frac{\log T}{\Delta_i} \right), where \Delta_i is the suboptimality gap for suboptimal arm i and i^* is the optimal arm. These results establish Thompson sampling's near-optimality in stochastic environments. Regret analyses for Thompson sampling typically leverage posterior concentration, showing that the posterior over arm means concentrates sufficiently fast around the true values, ensuring the optimal arm is sampled with probability approaching 1 after an initial exploration phase. This concentration implies that suboptimal arms are pulled only O(\log T) times each, yielding the logarithmic terms. An alternative information-theoretic framework bounds regret via the information ratio, defined as the squared expected instantaneous regret divided by the mutual information gained about the optimal action; for Thompson sampling, this ratio remains bounded, directly implying sublinear regret across broad classes of problems. These bounds hold under specific priors matched to the reward distributions, such as Beta priors for Bernoulli rewards or Gaussian priors for Gaussian rewards; mismatched priors can inflate the constants or worsen the scaling. In adversarial settings, where rewards may be chosen non-stochastically, Thompson sampling lacks sublinear guarantees in the worst case, potentially incurring linear regret against adaptive adversaries.
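As a worked instance of the problem-dependent bound above (with illustrative numbers, not taken from any specific source), consider a three-armed Bernoulli bandit with means 0.6, 0.5, and 0.4 over a horizon T = 10^4:

```latex
% Illustrative gaps: Delta_2 = 0.6 - 0.5 = 0.1 and Delta_3 = 0.6 - 0.4 = 0.2.
\[
\sum_{i \neq i^*} \frac{\log T}{\Delta_i}
  = \frac{\ln 10^4}{0.1} + \frac{\ln 10^4}{0.2}
  \approx 92.1 + 46.1 \approx 138,
\]
% so the guarantee grows like a few hundred units of regret (up to constant
% factors), compared with regret 0.2 \times 10^4 = 2000 if the worst arm were
% played on every round.
```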

Optimality Guarantees

In stochastic multi-armed bandit problems, Thompson sampling achieves asymptotic optimality by attaining logarithmic regret that matches the Lai-Robbins lower bound under appropriate prior distributions, such as the uniform Beta(1,1) prior for Bernoulli rewards. This result establishes that, as the time horizon grows, the algorithm's cumulative regret scales no worse than the theoretical minimum required for any consistent policy, demonstrating its efficiency in long-run exploration-exploitation trade-offs. From a Bayesian perspective, Thompson sampling minimizes the Bayes regret over finite horizons in specific models, where it effectively approximates or aligns with the optimal dynamic programming solution. For instance, in infinite-horizon discounted problems with known reward structures, it relates to the Gittins index policy, which is proven to be Bayesian optimal by selecting arms according to their indices derived from expected future rewards. This connection highlights Thompson sampling's strength in Bayesian settings, where it leverages posterior sampling to achieve near-minimal regret relative to the true posterior. Thompson sampling's guarantees extend under weak assumptions, such as sub-Gaussian reward distributions, enabling logarithmic regret bounds without requiring bounded support or specific parametric forms for the rewards. In practice, the algorithm demonstrates robustness to model misspecification, maintaining strong empirical performance even when the assumed prior or reward model deviates from reality, as evidenced in sequential applications with non-stationary or misspecified environments. Recent information-theoretic analyses further underscore Thompson sampling's near-optimal exploration properties, bounding its Bayesian regret in terms of the mutual information between the decision-maker's beliefs and observed outcomes, which scales favorably with problem complexity. Extensions of this framework to infinite action spaces and generalized bandits confirm that the algorithm achieves regret proportional to the square root of this information measure, approaching optimality under broad prior assumptions.

Comparisons

Probability Matching

Probability matching is a strategy in which arms are selected with probabilities proportional to their estimated likelihood of being the optimal choice, computed as the posterior probability that each arm has the highest mean reward given the observed data. This approach draws from Bayesian posterior beliefs about arm qualities, where the selection rule allocates pulls to arm i with probability P(\mu_i = \max_j \mu_j \mid \text{data}). In practice, a common variant observed in human decision-making matches selection probabilities directly to the arms' estimated reward probabilities rather than their probabilities of being optimal, leading to persistent allocation to suboptimal options. For instance, in a two-arm bandit with reward probabilities 0.7 and 0.3, this results in pulling the suboptimal arm approximately 30% of the time even after estimates converge, yielding an expected reward of 0.7 \times 0.7 + 0.3 \times 0.3 = 0.58 instead of the optimal 0.7. This behavior was prominently studied in early experiments on human and animal choice under uncertainty, where subjects consistently matched choices to observed frequencies despite the availability of superior strategies. In the multi-armed bandit framework, this form of probability matching is suboptimal compared to asymptotically efficient algorithms, incurring linear regret O(T) because a constant fraction of pulls continues to go to suboptimal arms indefinitely. Its intuitive appeal lies in mimicking observed frequencies, but this fails to concentrate pulls on promising arms as data accumulates. Unlike Thompson sampling, which randomizes selections by directly sampling arm parameters from the posterior and choosing the apparent best, probability matching serves as a deterministic allocation rule that fixes selection frequencies based on current posterior probabilities, without the sampling-induced variability that drives effective exploration in Thompson sampling.
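The arithmetic in the example above, together with the posterior probability of optimality that probability matching is meant to track, can be checked numerically (the success counts below are hypothetical):

```python
# Numerical check of the probability-matching example.
import numpy as np

# Matching selection frequencies to the arms' reward probabilities:
p1, p2 = 0.7, 0.3
print("expected reward under reward-matching:", p1 * p1 + p2 * p2)  # 0.58
print("expected reward when always playing arm 1:", p1)             # 0.70

# Posterior probability that arm 1 is optimal, given hypothetical counts of
# 70/100 and 30/100 successes, estimated by sampling both Beta posteriors.
rng = np.random.default_rng(0)
s1 = rng.beta(1 + 70, 1 + 30, 100_000)
s2 = rng.beta(1 + 30, 1 + 70, 100_000)
print("P(arm 1 optimal | data) ~", np.mean(s1 > s2))   # close to 1
```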

Upper Confidence Bound Methods

Upper confidence bound (UCB) methods address the multi-armed bandit problem by deterministically selecting, at each time step t, the arm i that maximizes an index combining the empirical mean reward \hat{\mu}_i(t) with an additive exploration bonus, thereby encouraging the trial of under-explored arms through optimistic estimates. This approach relies on frequentist concentration inequalities to construct confidence intervals around the mean rewards, ensuring that with high probability, the true mean lies below the upper bound. The foundational UCB1 algorithm, proposed by Auer, Cesa-Bianchi, and Fischer in 2002, uses the specific index \hat{\mu}_i(t) + \sqrt{\frac{2 \log t}{n_i(t)}}, where n_i(t) denotes the number of times arm i has been pulled by time t. This bonus term grows with the logarithm of time and inversely with the arm's pull count, systematically inflating the estimate for infrequently sampled arms to promote exploration without probabilistic sampling. Subsequent variants refine this framework for broader applicability. UCB-V, introduced by Audibert, Munos, and Szepesvári in 2009, incorporates an empirical variance estimate V_i(t), yielding an index of the form \hat{\mu}_i(t) + \sqrt{\frac{2 V_i(t) \log t}{n_i(t)}} + \frac{c \log t}{n_i(t)} for a suitable constant c, which provides tighter bounds for arms with heterogeneous variances and improves regret in heavy-tailed reward distributions. For non-parametric or misspecified settings, KL-UCB, developed by Garivier and Cappé in 2011, replaces the Hoeffding-type confidence bound with one based on the Kullback-Leibler divergence between empirical and candidate reward distributions, enabling adaptation to general bounded stochastic bandits beyond Gaussian or Bernoulli assumptions. UCB methods achieve problem-independent regret of order O(\sqrt{K T \log T}), where K is the number of arms and T the horizon, matching the minimax rate up to logarithmic factors and paralleling the guarantees for Thompson sampling, though derived via union bounds over concentration events rather than posterior analysis. Problem-dependent bounds further tighten this to O\left( \sum_{i \neq i^*} \frac{\log T}{\Delta_i} \right), where \Delta_i is the suboptimality gap for arm i, ensuring near-optimal regret in stochastic settings. In contrast to Thompson sampling, which draws arm selections probabilistically from a posterior distribution to incorporate prior beliefs and naturally randomize exploration, UCB enforces exploration through fixed, deterministic bounds that do not require Bayesian modeling, resulting in simpler computation (often just arithmetic operations on summary statistics) but reduced flexibility when informative priors are available. Empirical evaluations confirm UCB's robustness in standard stochastic settings, though it may underperform Thompson sampling in scenarios with strong prior information or non-stationary environments due to its lack of probabilistic adaptation.
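A minimal sketch of UCB1 using the index above is shown below (the simulated Bernoulli environment is illustrative):

```python
# UCB1 sketch: deterministic index = empirical mean + sqrt(2 log t / n_i).
import numpy as np

rng = np.random.default_rng(0)
true_probs = np.array([0.3, 0.5, 0.7])
K, T = len(true_probs), 5000
counts = np.zeros(K)
means = np.zeros(K)

for t in range(1, T + 1):
    if t <= K:
        a = t - 1                                  # pull each arm once to initialize
    else:
        bonus = np.sqrt(2 * np.log(t) / counts)    # optimism for rarely pulled arms
        a = int(np.argmax(means + bonus))
    r = rng.binomial(1, true_probs[a])
    counts[a] += 1
    means[a] += (r - means[a]) / counts[a]         # incremental mean update

print("pull counts:", counts.astype(int))
```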

Epsilon-Greedy Strategies

The epsilon-greedy strategy is a simple exploration-exploitation algorithm used in multi-armed bandit problems, where with probability 1 - \epsilon, the algorithm selects the arm with the highest empirical mean reward (exploitation), and with probability \epsilon, it selects an arm uniformly at random (exploration). Often, \epsilon is fixed or decays over time, such as \epsilon_t = 1/t or tuned to \epsilon_t = \min\{1, t^{-1/3} K^{2/3} \log^{1/3} t\}, where K is the number of arms and t is the time step, to balance the trade-off and achieve sublinear regret. This approach offers advantages in simplicity and does not require maintaining a probabilistic model of rewards, making it computationally lightweight and easy to implement. However, its exploration is inefficient because random selection equally favors all arms, including those known to be suboptimal, leading to unnecessary pulls of poor performers and higher cumulative regret compared to more targeted methods. For the tuned decaying \epsilon_t, the epsilon-greedy algorithm achieves an expected regret of O(T^{2/3} \log^{1/3} T) after T rounds, which is sublinear but polynomial and worse than the logarithmic regret attainable by asymptotically optimal algorithms. In contrast to epsilon-greedy, Thompson sampling enables more targeted exploration by sampling from posterior distributions over rewards, naturally directing pulls toward uncertain but promising arms without needing a fixed \epsilon schedule, resulting in empirically and theoretically superior performance in terms of regret.
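For comparison with the preceding strategies, a minimal epsilon-greedy sketch with the decaying schedule \epsilon_t = 1/t follows (the simulated environment is again illustrative):

```python
# Epsilon-greedy sketch with decaying epsilon_t = 1/t.
import numpy as np

rng = np.random.default_rng(0)
true_probs = np.array([0.3, 0.5, 0.7])
K, T = len(true_probs), 5000
counts = np.zeros(K)
means = np.zeros(K)

for t in range(1, T + 1):
    if t <= K:
        a = t - 1                                  # pull each arm once to initialize
    elif rng.random() < 1.0 / t:                   # decaying exploration probability
        a = int(rng.integers(K))                   # explore uniformly at random
    else:
        a = int(np.argmax(means))                  # exploit the empirical best
    r = rng.binomial(1, true_probs[a])
    counts[a] += 1
    means[a] += (r - means[a]) / counts[a]         # incremental mean update

print("pull counts:", counts.astype(int))
```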
