
Thompson sampling

Thompson sampling, also known as posterior sampling or probability matching, is a randomized Bayesian heuristic designed to solve sequential decision-making problems under uncertainty, particularly those involving the exploration-exploitation trade-off, such as the multi-armed bandit problem. In this framework, the algorithm maintains a posterior distribution over the unknown parameters of the reward model for each action; at each decision step, it samples a parameter value from this posterior, computes the expected reward for each action under the sampled parameters, and selects the action with the highest expected reward. This process naturally balances the need to exploit actions known to yield high rewards with the exploration of potentially better alternatives, as optimistic samples encourage trying less-proven options. Named after William R. Thompson, the algorithm was first proposed in 1933 as a method for allocating patients to treatments in clinical trials, where the goal was to maximize the number of successes while learning about the relative efficacy of therapies with unknown success probabilities. Thompson's original formulation involved sampling from beta distributions to compare two treatments, framing the problem in terms of likelihoods derived from binomial observations. Despite its early introduction, Thompson sampling remained relatively obscure for decades, with limited adoption until the late 1990s and early 2000s, when it was rediscovered in the context of reinforcement learning and bandit algorithms. Its popularity surged in the 2010s due to empirical successes in large-scale applications and theoretical analyses demonstrating near-optimal regret bounds, such as O(\sqrt{KT \log T}) for K-armed Bernoulli bandits over T rounds. Thompson sampling has found widespread use in modern applications requiring adaptive experimentation, including online advertising (e.g., selecting ad variants to maximize clicks), A/B testing in web services, and personalized recommendation systems at large technology companies. It extends naturally to more complex settings, such as linear bandits, contextual bandits, and Markov decision processes, where it leverages structured models for efficient computation. Notable properties include its simplicity of implementation, often requiring only posterior sampling and argmax operations, and its robustness to model misspecification, though it can be computationally intensive for high-dimensional problems without approximations. Ongoing research continues to refine its theoretical guarantees and practical variants, solidifying its role as a cornerstone of sequential decision-making under uncertainty.

Introduction

Definition

Thompson sampling is a randomized heuristic for choosing actions in sequential decision-making problems under uncertainty, most notably in the multi-armed bandit framework, where it balances exploration and exploitation by sampling parameter values from the posterior distribution of a Bayesian model and selecting the action that maximizes the expected reward under those sampled values. In this setting, the problem involves multiple "arms," each representing an action with rewards drawn from an unknown distribution; the goal is to maximize cumulative reward over time while learning about the arms' reward structures. Core components of Thompson sampling include the arms (discrete actions available to the decision-maker), the rewards (random outcomes associated with pulling each arm, typically assumed to be independent and identically distributed given unknown parameters), and prior beliefs (initial probability distributions over the reward parameters, which are updated to posteriors as data is observed). The algorithm maintains a Bayesian model of the environment, starting with these priors to represent initial uncertainty about the reward distributions before any observations. A simple example arises in a two-armed bandit problem with Bernoulli rewards, where each arm yields a success (reward 1) with unknown probability \theta_i and failure (reward 0) otherwise. Priors on \theta_1 and \theta_2 are chosen as Beta distributions, which are conjugate to the Bernoulli likelihood and thus yield Beta posteriors after observations; at each step, the algorithm samples \tilde{\theta}_i from the current posterior for each arm i, then selects the arm with the highest \tilde{\theta}_i. The method is named after William R. Thompson, who proposed it in 1933 for allocating treatments in clinical trials modeled as a two-armed bandit problem.

Intuition

Thompson sampling addresses the fundamental exploration-exploitation dilemma in sequential decision-making under uncertainty, where an agent must balance exploiting actions known to yield high rewards based on current information against exploring potentially better but uncertain alternatives to improve long-term performance. This trade-off arises in settings like multi-armed bandits, where repeatedly selecting suboptimal actions leads to regret, but ignoring uncertainty risks missing superior options. Thompson sampling resolves this by incorporating probabilistic beliefs about rewards, naturally favoring exploration when uncertainty is high through a mechanism of optimistic sampling from these beliefs. At its core, the intuitive mechanism of Thompson sampling involves maintaining a posterior over possible reward parameters for each action, informed by observed data, and then sampling a hypothesized reward value from each posterior to select the action with the highest sampled value. This process induces an automatic bias toward actions with high variance in their posterior, as such actions are more likely to produce high samples and thus be selected, promoting exploration precisely when the evidence is ambiguous. Over time, as data accumulates, the posteriors concentrate around true values, shifting the balance toward exploitation of the empirically best actions. A compelling analogy for this approach originates from its medical roots, akin to a clinician allocating patients to treatments in a clinical trial with two therapies of unknown efficacy: by simulating outcomes from the current evidence about each treatment, the method favors assigning the promising but less-proven treatment more often when the data suggests it might outperform the known one, thereby testing hypotheses efficiently while minimizing harm from inferior choices. Thompson sampling offers key advantages, including asymptotic optimality in finite-armed bandit problems, where its expected regret grows logarithmically with time, matching information-theoretic lower bounds in many cases. It is also computationally efficient when posterior sampling is tractable, such as with conjugate priors like Beta distributions for Bernoulli rewards, enabling straightforward implementation without excessive overhead.
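To make this mechanism concrete, the following minimal sketch (the Beta parameters and sample size are illustrative choices) estimates by Monte Carlo how often a high-variance posterior produces the top sample when competing against a well-explored arm:

```python
# Minimal sketch: posterior variance drives exploration. An uncertain arm with
# a lower posterior mean is still sampled as "best" a non-trivial fraction of
# the time, while a tightly concentrated arm with the same mean is not.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Arm A: well-explored, posterior Beta(60, 40) -> mean 0.60, low variance.
# Arm B: barely explored, posterior Beta(3, 3) -> mean 0.50, high variance.
# Arm C: same mean as B but concentrated, posterior Beta(30, 30).
a = rng.beta(60, 40, n)
b = rng.beta(3, 3, n)
c = rng.beta(30, 30, n)

print("P(B sampled above A):", np.mean(b > a))  # sizeable: B keeps being tried
print("P(C sampled above A):", np.mean(c > a))  # much smaller: little exploration
```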

Mathematical Formulation

Bayesian Framework

Thompson sampling operates within a Bayesian framework, where the unknown reward parameters \theta for each arm in a bandit problem are treated as random variables drawn from a prior distribution p(\theta). Observations of rewards are modeled via a likelihood p(r \mid \theta), where r denotes the observed reward. The posterior p(\theta \mid \text{data}) is then obtained through Bayes' theorem as p(\theta \mid \text{data}) \propto p(\text{data} \mid \theta) p(\theta), representing the updated belief about \theta after incorporating evidence from past interactions. For computational tractability, Thompson sampling often employs conjugate priors, which ensure that the posterior belongs to the same family as the prior, facilitating efficient updating. In the case of Bernoulli rewards, a Beta prior p(\theta_k) = \frac{\Gamma(\alpha_k + \beta_k)}{\Gamma(\alpha_k) \Gamma(\beta_k)} \theta_k^{\alpha_k - 1} (1 - \theta_k)^{\beta_k - 1} for arm k is conjugate to the Bernoulli likelihood p(r \mid \theta_k) = \theta_k^r (1 - \theta_k)^{1 - r}, resulting in a Beta posterior with updated parameters (\alpha_k + r_t, \beta_k + 1 - r_t) after observing reward r_t from arm k. Similarly, for normally distributed rewards with known variance, a Gaussian prior on \theta yields a Gaussian posterior, with updates to the mean and variance based on the observed data. These choices enable closed-form posterior updates without requiring complex numerical methods. The posterior distribution encapsulates the decision-maker's beliefs, quantifying uncertainty over the reward parameters after each observation and integrating all available evidence in a coherent probabilistic manner. Sampling from this posterior allows actions to be selected in a way that naturally balances optimism about uncertain arms (exploration) with confidence in high-reward arms (exploitation), as samples from regions of high posterior density influence choices probabilistically. This Bayesian updating process ensures that beliefs evolve dynamically with new data, maintaining a principled representation of epistemic uncertainty. The framework generalizes to any model where posterior sampling is feasible, such as linear models for contextual bandits or more complex hierarchical structures, provided the prior and likelihood permit efficient computation of the posterior. This flexibility underpins Thompson sampling's applicability beyond simple bandits, as long as the model supports sampling from the belief over parameters.
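The conjugate updates described above can be written in a few lines; the sketch below (with illustrative variable names) implements the Beta-Bernoulli update and the Gaussian update with known observation variance:

```python
# Closed-form conjugate posterior updates (a sketch; names are illustrative).

def update_beta(alpha, beta, r):
    """Beta-Bernoulli update after observing reward r in {0, 1}."""
    return alpha + r, beta + (1 - r)

def update_gaussian(mu0, tau0_sq, r, sigma_sq):
    """Gaussian update with known observation variance sigma_sq:
    prior N(mu0, tau0_sq) combined with a single observation r."""
    precision = 1.0 / tau0_sq + 1.0 / sigma_sq
    post_var = 1.0 / precision
    post_mean = post_var * (mu0 / tau0_sq + r / sigma_sq)
    return post_mean, post_var

# Beta(1, 1) prior after observing a success, then a failure.
a, b = update_beta(1, 1, 1)   # -> (2, 1)
a, b = update_beta(a, b, 0)   # -> (2, 2)
print(a, b)

# N(0, 1) prior, observation 0.8 with noise variance 0.5.
print(update_gaussian(0.0, 1.0, 0.8, 0.5))
```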

Posterior Sampling Procedure

In Thompson sampling, the posterior sampling procedure forms the decision-making core, leveraging Bayesian posteriors to balance exploration and exploitation in sequential decision problems such as multi-armed bandits. At each time step t = 1, 2, \dots, the algorithm first draws a sample \theta^{(t)} from the posterior p(\theta \mid H_{1:t-1}), where H_{1:t-1} denotes the history of actions and observed rewards up to time t-1. This sampling reflects the agent's current beliefs about the unknown parameters \theta, updated from an initial prior via Bayes' rule. The action selection then proceeds deterministically based on the sampled parameters: a_t = \arg\max_{a \in \mathcal{A}} \theta_a^{(t)}, where \mathcal{A} is the action space (e.g., the set of K arms in a multi-armed bandit) and \theta_a^{(t)} is the component of \theta^{(t)} corresponding to action a. In the multi-armed bandit setting, parameters are sampled independently for each arm, allowing the procedure to scale naturally to multiple options without requiring joint sampling over the full parameter space. If multiple actions yield the same maximum value under the sample (a tie), the tie is broken arbitrarily, such as by selecting the lowest-indexed arm or via uniform randomization, to ensure a valid action is chosen. After selecting and executing a_t, the agent observes the reward r_t and incorporates it into the history, updating the posterior to p(\theta \mid H_{1:t}) for the next iteration. A concrete example of this procedure arises in the Beta-Bernoulli bandit, where each arm k has an unknown success probability \theta_k \in [0,1] and rewards are binary (0 or 1). The posterior for arm k is maintained as a \text{Beta}(\alpha_k, \beta_k) distribution, initialized from a uniform prior such as \text{Beta}(1,1). At time t, sample \theta_k^{(t)} \sim \text{Beta}(\alpha_k, \beta_k) independently for each k = 1, \dots, K, then set a_t = \arg\max_k \theta_k^{(t)} (with arbitrary tie-breaking). Upon observing r_t \in \{0,1\} for the chosen arm k = a_t, update the parameters as \alpha_k \leftarrow \alpha_k + r_t and \beta_k \leftarrow \beta_k + (1 - r_t), which combines the Beta prior with the Bernoulli likelihood via conjugacy to yield the new posterior. This update rule ensures the posterior remains tractable, enabling efficient repeated sampling in practice.
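A single decision step of this Beta-Bernoulli procedure, including uniform tie-breaking and the conjugate update, might look as follows (a sketch; the environment probabilities in true_probs are hypothetical):

```python
# One decision step of Beta-Bernoulli Thompson sampling.
import numpy as np

rng = np.random.default_rng(1)
true_probs = np.array([0.3, 0.5, 0.7])       # unknown to the algorithm
alpha = np.ones(3)                            # Beta(1, 1) priors
beta = np.ones(3)

theta = rng.beta(alpha, beta)                 # sample theta_k^(t) for each arm
best = np.flatnonzero(theta == theta.max())   # ties broken uniformly at random
a_t = rng.choice(best)

r_t = rng.binomial(1, true_probs[a_t])        # observe a Bernoulli reward
alpha[a_t] += r_t                             # conjugate posterior update
beta[a_t] += 1 - r_t
print(a_t, r_t, alpha, beta)
```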

Algorithm and Implementation

Pseudocode

Thompson sampling can be described algorithmically in a straightforward manner for the multi-armed bandit setting, where the goal is to select actions (arm pulls) over a horizon T to maximize cumulative reward while balancing exploration and exploitation. The algorithm initializes prior beliefs over the unknown reward parameters \theta for each of the K arms and iteratively samples from the posterior to select the arm with the highest sampled value, observes the reward, and updates the posterior accordingly. Inputs to the algorithm typically include the number of arms K and the horizon T, while outputs consist of the sequence of selected actions A_t for t = 1 to T and the resulting cumulative regret, defined as the difference between the optimal cumulative reward and the algorithm's total reward. The following pseudocode outlines the general procedure:
Initialize priors for each arm k = 1 to K (e.g., based on [domain knowledge](/page/Domain_knowledge) or non-informative defaults)
For t = 1 to T:
    Sample θ_k ~ Posterior(prior_k, observed data for arm k) for each k = 1 to K
    Select A_t = argmax_k θ_k
    Observe reward R_t from [arm](/page/Arm) A_t
    Update posterior for [arm](/page/Arm) A_t using R_t
Output: Sequence A_1, ..., A_T and cumulative [regret](/page/Regret) = ∑_{t=1}^T (μ^* - μ_{A_t}), where μ^* is the mean reward of the optimal [arm](/page/Arm)
This structure ensures that actions are chosen probabilistically from the posterior, promoting optimistic exploration of uncertain arms. For the common case of Bernoulli multi-armed bandits, where rewards are binary (0 or 1) with unknown success probabilities \theta_k \in [0,1] for each arm k, conjugate Beta priors are used for tractable posterior updates. A standard non-informative prior is Beta(1,1), which is uniform over [0,1]. The posterior after observing S_k successes and F_k failures for arm k is Beta(1 + S_k, 1 + F_k). Sampling from this posterior can be implemented using random number generators, such as the inverse CDF method or libraries like NumPy's random.beta. The specific pseudocode is as follows:
Input: K (number of [arms](/page/Arms)), T (horizon)
Initialize: For each k = 1 to K, set S_k = 0 (successes), F_k = 0 (failures)
For t = 1 to T:
    For each k = 1 to K:
        Sample θ_k ~ [Beta](/page/Beta)(1 + S_k, 1 + F_k)
    Select A_t = argmax_k θ_k
    Observe reward R_t ∈ {0,1} from arm A_t
    If R_t = 1:
        S_{A_t} ← S_{A_t} + 1
    Else:
        F_{A_t} ← F_{A_t} + 1
Output: Sequence A_1, ..., A_T and cumulative [regret](/page/Regret)
This implementation is efficient for moderate K and T, with each round requiring O(K) posterior samples and constant-time updates. The algorithm is adaptable to continuous action spaces by sampling from multivariate posteriors over parameterizations of the reward function, such as Gaussian processes or ensembles, and selecting the action that maximizes the sampled expected reward.
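The Bernoulli pseudocode above translates directly into a short, runnable sketch; the simulated arm probabilities and seed below are illustrative choices rather than part of the algorithm itself:

```python
# Bernoulli Thompson sampling with Beta(1, 1) priors, following the pseudocode above.
import numpy as np

def thompson_bernoulli(true_probs, T, seed=0):
    rng = np.random.default_rng(seed)
    K = len(true_probs)
    successes = np.zeros(K)
    failures = np.zeros(K)
    actions, rewards = [], []

    for t in range(T):
        theta = rng.beta(1 + successes, 1 + failures)  # sample from each posterior
        a = int(np.argmax(theta))                      # select the arm (lowest index on ties)
        r = rng.binomial(1, true_probs[a])             # observe a binary reward
        successes[a] += r
        failures[a] += 1 - r
        actions.append(a)
        rewards.append(r)

    mu_star = max(true_probs)
    regret = sum(mu_star - true_probs[a] for a in actions)  # cumulative regret
    return actions, rewards, regret

actions, rewards, regret = thompson_bernoulli([0.2, 0.5, 0.6], T=5000)
print("cumulative regret:", regret)
```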

Computational Aspects

Implementing Thompson sampling requires addressing several computational challenges, particularly in sampling from the posterior distribution and scaling to practical settings. For models with conjugate priors, such as Beta distributions in Bernoulli multi-armed bandits or Gaussian priors in linear bandits, exact posterior sampling is straightforward and efficient, as updates maintain closed-form distributions that can be sampled directly in constant time per arm. However, in non-conjugate settings, which are common in complex environments such as contextual bandits with logistic reward models, exact inference becomes intractable, necessitating approximate methods such as Markov chain Monte Carlo (MCMC) techniques (e.g., Gibbs sampling or Metropolis-Hastings) or variational inference to generate samples from the posterior. To mitigate these tractability issues, various approximation schemes have been developed for complex posteriors. Bootstrap resampling, which treats empirical data as the population to generate multiple posterior samples, offers a simple, non-parametric alternative suitable for high-dimensional or non-conjugate models without requiring likelihood evaluations. Moment-matching methods, such as fitting a Gaussian to match the first two moments of the true posterior, provide further efficiency in high-dimensional spaces by reducing sampling to standard Gaussian draws, though they may introduce bias in non-Gaussian cases. Variational variants, like those minimizing the reverse Kullback-Leibler divergence, enable scalable approximations but can lead to under- or over-exploration if the approximation error is large, potentially causing linear regret unless mitigated by added exploration mechanisms. Scalability concerns arise primarily with large numbers of arms K or actions, where the ideal per-step time complexity is O(K) due to sampling each arm's posterior and selecting the maximum, which remains feasible for moderate K but becomes prohibitive in high-dimensional or continuous spaces. For such cases, particle filters maintain a set of weighted samples (particles) to approximate the posterior via sequential Monte Carlo, enabling online updates with complexity scaling with the number of particles rather than K, as demonstrated in complex bandit problems like queuing networks. In reinforcement learning extensions, online posterior updates using ensemble methods or sparse Gaussian processes further enhance scalability for large state-action spaces. In practice, Thompson sampling is supported by various software libraries for efficient implementation. Python's NumPy provides built-in functions for sampling from conjugate distributions like the Beta distribution, facilitating quick prototyping for simple bandits. Machine learning frameworks integrate Thompson sampling variants for bandit and contextual settings, handling approximate posterior sampling within scalable pipelines. Specialized packages, including those on PyPI, offer ready-to-use implementations for multi-armed bandits with support for MCMC-based sampling.
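As one illustration of these approximation strategies, the sketch below implements a simple bootstrap-style variant for Bernoulli rewards, in which resampling each arm's observation history (seeded with one pseudo-success and one pseudo-failure as a weak prior) stands in for an exact posterior draw; this is a simplified illustration rather than the approach used by any particular library:

```python
# Bootstrap-style approximate Thompson sampling for Bernoulli rewards (a sketch).
import numpy as np

rng = np.random.default_rng(0)
true_probs = [0.4, 0.55, 0.6]            # hypothetical environment
K, T = len(true_probs), 3000

# Pseudo-observations (one success, one failure per arm) act as a weak prior
# and keep early bootstrap resamples from collapsing to a single value.
history = [[0, 1] for _ in range(K)]

for t in range(T):
    scores = np.empty(K)
    for k in range(K):
        # One bootstrap resample of the arm's history stands in for a posterior draw.
        resample = rng.choice(history[k], size=len(history[k]), replace=True)
        scores[k] = resample.mean()
    a = int(np.argmax(scores))
    r = rng.binomial(1, true_probs[a])
    history[a].append(r)

print("pull counts (incl. 2 pseudo-observations each):", [len(h) for h in history])
```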

History

Origins

Thompson sampling was first proposed by William R. Thompson in his 1933 paper published in Biometrika, titled "On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples." In this work, Thompson introduced the method as a strategy for sequential decision-making in clinical trials aimed at comparing the efficacy of two treatments, where the goal was to allocate patients to treatments in a way that maximizes the overall benefit while learning about the unknown success probabilities. The approach was motivated by the need for efficient sequential analysis in medicine, where resources such as patient numbers were limited, and decisions had to balance immediate treatment outcomes with the accumulation of evidence to inform future allocations. At its core, Thompson's key insight involved modeling the unknown success probabilities of the treatments using Bayesian conjugate priors—specifically, beta distributions for binary outcomes—and then performing sampling from the posterior distributions to determine treatment allocation. This procedure effectively sampled a hypothesized value for each treatment's success probability from its respective posterior and selected the treatment with the higher sampled value, thereby incorporating uncertainty directly into the decision process. This Bayesian framework allowed for adaptive randomization that favored promising treatments based on accumulating data, predating the formal development of multi-armed bandit theory in the 1950s. Despite its innovative nature, Thompson's method had limited initial impact and was largely overlooked in the academic literature for decades. The primary reasons included the computational constraints of the era, which made repeated posterior sampling and updates impractical without modern computing resources, as well as a lack of theoretical analysis and empirical validation at the time. It was not until the late 1990s and 2000s, with advances in computational power and renewed interest in Bayesian methods for online decision-making, that the approach gained widespread recognition.

Subsequent Developments

Following its initial proposal in 1933, Thompson sampling experienced a revival in the early 2000s within machine learning, particularly for addressing exploration problems in online decision-making, with independent rediscoveries such as Wyatt (1997) and Strens (2000) in reinforcement learning contexts, followed by empirical successes in Scott (2010) and Chapelle and Li (2011). This rediscovery emphasized its Bayesian foundations for balancing exploration and exploitation, leading to renewed theoretical interest. Key contributions included the first rigorous analysis demonstrating logarithmic expected regret for the stochastic multi-armed bandit problem by Agrawal and Goyal in 2012. Concurrently, Kaufmann, Korda, and Munos provided the first finite-time analysis proving asymptotic optimality for Bernoulli reward distributions in 2012, matching the Lai-Robbins lower bound on regret. Subsequent theoretical advancements solidified Thompson sampling's guarantees across broader settings. Russo and Van Roy offered an information-theoretic framework in 2014 that bounded Bayesian regret in bandits through the information ratio, highlighting its efficiency in reducing uncertainty about optimal actions. Extensions followed to more complex environments, such as linear bandits via linear Gaussian models and contextual bandits where actions depend on observed features, with Agrawal and Goyal deriving near-optimal regret bounds for linear payoffs in 2013. Post-2020 developments have further diversified Thompson sampling's scope. In multi-agent settings, multi-agent Thompson sampling (MATS) was introduced in 2020, enabling coordinated exploration in sparse neighborhood structures for distributed bandit problems, as demonstrated in applications like recommender systems. Recent information-theoretic analyses, such as that by Zhang in 2025, have extended bounds to infinite action spaces, revealing tighter connections between posterior sampling and regret minimization. A comprehensive 2025 survey cataloged over 50 variants, including neural and meta-learning adaptations, underscoring ongoing innovations in scalability and adaptability. Thompson sampling's influence has permeated practical implementations, with integration into libraries like Vowpal Wabbit for contextual bandit tasks. Tech companies have adopted it for adaptive experimentation in large-scale systems, as evidenced by empirical evaluations showing superior performance over epsilon-greedy baselines in ad personalization.

Applications

Multi-Armed Bandits

The multi-armed bandit (MAB) problem involves a decision-maker facing K arms, each associated with an unknown reward distribution characterized by its mean \mu_i for arm i = 1, \dots, K. The objective is to maximize the cumulative reward over T sequential pulls by selecting one arm at each time step t = 1, \dots, T, while balancing the trade-off between exploiting known high-reward arms and exploring uncertain ones to learn their means. In the MAB setting, Thompson sampling (TS) operates by maintaining a posterior over each arm's mean \mu_i based on observed rewards and a prior (e.g., a Gaussian prior for continuous rewards). At each step, it samples a value \tilde{\mu}_i from the posterior for each arm and selects the arm with the maximum sampled value, which naturally favors arms with high expected means while incorporating enough randomness to occasionally select suboptimal arms. This posterior sampling mechanism promotes optimistic exploration of under-sampled arms, leading TS to empirically outperform purely greedy methods that exploit the current best arm without exploration. Simulations in Gaussian MAB environments, where rewards are drawn from normal distributions with known variance, demonstrate TS's effectiveness; for instance, in settings with multiple arms and means spaced to include one optimal arm, TS incurs low cumulative regret by systematically exploring suboptimal arms in early stages before converging to the best arm. TS extends to non-stationary MAB variants, where arm means \mu_i evolve over time, by incorporating forgetting factors such as exponential discounting in the posterior updates to downweight outdated observations and maintain adaptability to changes.
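The sketch below (illustrative means, variance, and discount factor) combines Gaussian Thompson sampling with known reward variance and an exponential forgetting factor of the kind described above, so that gamma = 1 recovers the stationary case:

```python
# Gaussian Thompson sampling with a forgetting factor (a sketch).
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.0, 0.5, 1.0])   # unknown Gaussian arm means
sigma = 1.0                               # known reward standard deviation
K, T, gamma = len(true_means), 5000, 0.99

counts = np.zeros(K)       # discounted pull counts
sums = np.zeros(K)         # discounted reward sums

for t in range(T):
    # Posterior over each mean: Normal(sums/counts, sigma^2/counts), with a
    # wide default for unpulled arms to force initial exploration.
    post_mean = np.where(counts > 0, sums / np.maximum(counts, 1e-12), 0.0)
    post_std = np.where(counts > 0, sigma / np.sqrt(np.maximum(counts, 1e-12)), 10.0)
    theta = rng.normal(post_mean, post_std)
    a = int(np.argmax(theta))

    r = rng.normal(true_means[a], sigma)
    counts *= gamma                       # downweight old evidence on every arm
    sums *= gamma
    counts[a] += 1
    sums[a] += r

print("discounted pull counts:", np.round(counts, 1))
```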

Broader Uses

Thompson sampling has been extended to Bayesian reinforcement learning (RL) settings, where it facilitates exploration in partially observable Markov decision processes (POMDPs) by sampling from the posterior distribution over model parameters to select actions. A prominent example is posterior sampling for reinforcement learning (PSRL), which applies Thompson sampling in tabular Markov decision processes (MDPs) to balance exploration and exploitation by drawing policies from the posterior over transition and reward dynamics. Beyond reinforcement learning, Thompson sampling finds application in A/B testing for web services, such as online ad optimization, where it dynamically allocates traffic to variants to maximize click-through rates while minimizing regret. In clinical trials, it supports adaptive designs by sequentially updating treatment allocations based on posterior beliefs about efficacy, enabling efficient patient assignment in multi-arm trials. For brain-computer interfaces (BCIs), Thompson sampling enhances P300 speller performance by adaptively selecting stimuli sequences to elicit event-related potentials, reducing the number of trials needed to identify target characters. A 2022 mini-review highlights its role in RL-driven stimulus optimization for ERP-based spellers, demonstrating improved accuracy and speed in real-time applications. In chemistry, it accelerates molecular design by efficiently searching ultralarge synthesis-on-demand libraries, prioritizing promising candidates via posterior sampling to optimize properties like binding affinity. Recent advancements include multi-agent Thompson sampling for scenarios with sparse rewards and neighborhood structures, such as coordinating distributed sensors, where it leverages Bayesian updates across agents to achieve low regret in communication-constrained environments. As of 2025, Thompson sampling has been applied in GenAI-powered adaptive interventions for mobile health, using generator-mediated bandit algorithms to optimize treatment delivery. Adaptations of Thompson sampling incorporate contextual features to handle covariate-dependent rewards, as in contextual Thompson sampling, which samples from posteriors conditioned on observed contexts to recommend personalized actions, as sketched in the example below. Hierarchical models extend it to structured data, modeling shared parameters across tasks or levels to enable transfer learning, such as in multi-task bandits where shared priors and posteriors guide sampling for efficient exploration.
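For the contextual case with linear payoffs, a minimal sketch of linear Thompson sampling is shown below; it assumes a linear-Gaussian reward model, and the dimensions, noise level, and simulated environment are illustrative:

```python
# Linear (contextual) Thompson sampling sketch: reward = x^T w + noise,
# with a Gaussian posterior over w sampled at every round.
import numpy as np

rng = np.random.default_rng(0)
d, K, T = 5, 10, 2000
w_true = rng.normal(size=d)                 # unknown parameter (simulation only)
noise, lam = 0.1, 1.0                       # noise std dev and ridge/prior precision

B = lam * np.eye(d)                         # posterior precision matrix
f = np.zeros(d)                             # accumulated X^T r

for t in range(T):
    contexts = rng.normal(size=(K, d))      # one feature vector per candidate action
    cov = np.linalg.inv(B) * noise**2       # naive inversion; fine for small d
    mean = np.linalg.inv(B) @ f
    w_sample = rng.multivariate_normal(mean, cov)   # posterior draw
    a = int(np.argmax(contexts @ w_sample))         # act greedily w.r.t. the sample

    r = contexts[a] @ w_true + rng.normal(0, noise)
    B += np.outer(contexts[a], contexts[a])         # Bayesian linear-regression update
    f += r * contexts[a]

print("||w_hat - w_true||:", np.linalg.norm(np.linalg.inv(B) @ f - w_true))
```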

Theoretical Properties

Regret Bounds

In the multi-armed bandit (MAB) framework, the cumulative regret R_T after T rounds is defined as R_T = T \mu^* - \mathbb{E}\left[ \sum_{t=1}^T r_t \right], where \mu^* denotes the expected reward of the optimal arm and r_t is the observed reward at round t. This measures the cumulative shortfall from always selecting the best arm. Efficient exploration-exploitation algorithms, including Thompson sampling, aim for sublinear regret R_T = o(T), ensuring the per-round average regret approaches zero asymptotically. For the Bernoulli MAB, where arm rewards are binary with unknown success probabilities, Thompson sampling achieves an expected regret bound of O(\sqrt{K T \log T}), with K the number of arms; this problem-independent bound matches the minimax rate up to logarithmic factors. For Gaussian rewards with known variance, Thompson sampling yields logarithmic problem-dependent bounds under suitable Gaussian priors, scaling as O\left( \sum_{i \neq i^*} \frac{\log T}{\Delta_i} \right), where \Delta_i is the suboptimality gap for suboptimal arm i and i^* is the optimal arm. These results establish Thompson sampling's near-optimality in stochastic environments. Regret analyses for Thompson sampling typically leverage posterior concentration, showing that the posterior over arm means concentrates sufficiently fast around the true values, ensuring the optimal arm is sampled with probability approaching 1 after an initial exploration phase. This concentration implies that suboptimal arms are pulled only O(\log T) times each, yielding the logarithmic terms. An alternative information-theoretic framework bounds regret via the information ratio, defined as the squared expected instantaneous regret divided by the mutual information gained about the optimal action; for Thompson sampling, this ratio remains bounded, directly implying sublinear regret across broad classes of problems. These bounds hold under specific priors matched to the reward distributions, such as Beta priors for Bernoulli rewards or Gaussian priors for Gaussian rewards; mismatched priors can inflate the constants or worsen the scaling. In adversarial settings, where rewards may be chosen non-stochastically, Thompson sampling lacks sublinear guarantees in the worst case, potentially incurring linear regret against adaptive adversaries.
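As a worked instance of the problem-dependent bound above (with illustrative numbers, not taken from any specific source), consider a three-armed Bernoulli bandit with means 0.6, 0.5, and 0.4 over a horizon T = 10^4:

```latex
% Illustrative gaps: Delta_2 = 0.6 - 0.5 = 0.1 and Delta_3 = 0.6 - 0.4 = 0.2.
\[
\sum_{i \neq i^*} \frac{\log T}{\Delta_i}
  = \frac{\ln 10^4}{0.1} + \frac{\ln 10^4}{0.2}
  \approx 92.1 + 46.1 \approx 138,
\]
% so the guarantee grows like a few hundred units of regret (up to constant
% factors), compared with regret 0.2 \times 10^4 = 2000 if the worst arm were
% played on every round.
```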

Optimality Guarantees

In stochastic multi-armed bandit problems, Thompson sampling achieves asymptotic optimality by attaining logarithmic regret that matches the Lai-Robbins lower bound under appropriate prior distributions, such as the uniform Beta(1,1) prior for Bernoulli rewards. This result establishes that, as the time horizon grows, the algorithm's cumulative regret scales no worse than the theoretical minimum required for any consistent policy, demonstrating its efficiency in long-run exploration-exploitation trade-offs. From a Bayesian perspective, Thompson sampling minimizes the Bayes regret over finite horizons in specific models, where it effectively approximates or aligns with the optimal dynamic programming solution. For instance, in infinite-horizon discounted problems with known reward structures, it relates to the Gittins index policy, which is proven to be Bayesian optimal by selecting arms according to their indices derived from expected future rewards. This connection highlights Thompson sampling's strength in Bayesian settings, where it leverages posterior sampling to achieve near-minimal regret relative to the true posterior. Thompson sampling's guarantees extend under weak assumptions, such as sub-Gaussian reward distributions, enabling logarithmic regret bounds without requiring bounded support or specific parametric forms for the rewards. In practice, the algorithm demonstrates robustness to model misspecification, maintaining strong empirical performance even when the assumed prior or reward model deviates from reality, as evidenced in sequential applications with non-stationary or misspecified environments. Recent information-theoretic analyses further underscore Thompson sampling's near-optimal exploration properties, bounding its Bayesian regret in terms of the mutual information between the decision-maker's beliefs and observed outcomes, which scales favorably with problem complexity. Extensions of this framework to infinite action spaces and generalized bandits confirm that the algorithm achieves regret proportional to the square root of this information measure, approaching optimality under broad prior assumptions.

Comparisons

Probability Matching

Probability matching is a strategy in which arms are selected with probabilities proportional to their estimated likelihood of being the optimal choice, computed as the posterior probability that each arm has the highest mean reward given the observed data. This approach draws from Bayesian posterior beliefs about arm qualities, where the selection rule allocates pulls to arm i with probability P(\mu_i = \max_j \mu_j \mid \text{data}). In practice, a common variant observed in human decision-making matches selection probabilities directly to the arms' estimated reward probabilities rather than their probabilities of being optimal, leading to persistent allocation to suboptimal options. For instance, in a two-arm bandit with reward probabilities 0.7 and 0.3, this results in pulling the suboptimal arm approximately 30% of the time even after estimates converge, yielding an expected reward of 0.7 \times 0.7 + 0.3 \times 0.3 = 0.58 instead of the optimal 0.7. This behavior was prominently studied in early experiments on human and animal choice under uncertainty, where subjects consistently matched choices to observed frequencies despite the availability of superior strategies. In the multi-armed bandit framework, this form of probability matching is suboptimal compared to asymptotically efficient algorithms, incurring linear regret O(T) because a constant fraction of pulls continues to go to suboptimal arms indefinitely. Its intuitive appeal lies in mimicking observed frequencies, but this fails to concentrate pulls on promising arms as data accumulates. Unlike Thompson sampling, which randomizes selections by directly sampling arm parameters from the posterior and choosing the apparent best, probability matching serves as a deterministic allocation rule that fixes selection frequencies based on current posterior probabilities, without the sampling-induced variability that drives effective exploration in Thompson sampling.
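The arithmetic in the example above, together with the posterior probability of optimality that probability matching is meant to track, can be checked numerically (the success counts below are hypothetical):

```python
# Numerical check of the probability-matching example.
import numpy as np

# Matching selection frequencies to the arms' reward probabilities:
p1, p2 = 0.7, 0.3
print("expected reward under reward-matching:", p1 * p1 + p2 * p2)  # 0.58
print("expected reward when always playing arm 1:", p1)             # 0.70

# Posterior probability that arm 1 is optimal, given hypothetical counts of
# 70/100 and 30/100 successes, estimated by sampling both Beta posteriors.
rng = np.random.default_rng(0)
s1 = rng.beta(1 + 70, 1 + 30, 100_000)
s2 = rng.beta(1 + 30, 1 + 70, 100_000)
print("P(arm 1 optimal | data) ~", np.mean(s1 > s2))   # close to 1
```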

Upper Confidence Bound Methods

Upper confidence bound (UCB) methods address the multi-armed bandit problem by deterministically selecting, at each time step t, the arm i that maximizes an index combining the empirical mean reward \hat{\mu}_i(t) with an additive exploration bonus, thereby encouraging the trial of under-explored arms through optimistic estimates. This approach relies on frequentist concentration inequalities to construct confidence intervals around the mean rewards, ensuring that with high probability, the true mean lies below the upper bound. The foundational UCB1 algorithm, proposed by Auer, Cesa-Bianchi, and Fischer in 2002, uses the specific index \hat{\mu}_i(t) + \sqrt{\frac{2 \log t}{n_i(t)}}, where n_i(t) denotes the number of times arm i has been pulled by time t. This bonus term grows with the logarithm of time and inversely with the arm's pull count, systematically inflating the estimate for infrequently sampled arms to promote exploration without probabilistic sampling. Subsequent variants refine this framework for broader applicability. UCB-V, introduced by Audibert, Munos, and Szepesvári in 2009, incorporates an empirical variance estimate V_i(t), yielding an index of the form \hat{\mu}_i(t) + \sqrt{\frac{2 V_i(t) \log t}{n_i(t)}} + \frac{c \log t}{n_i(t)} for a suitable constant c, which provides tighter bounds for arms with heterogeneous variances and improves regret in heavy-tailed reward distributions. For non-parametric or misspecified settings, KL-UCB, developed by Garivier and Cappé in 2011, replaces the Hoeffding-type confidence bound with one based on the Kullback-Leibler divergence between empirical and candidate reward distributions, enabling adaptation to general bounded stochastic bandits beyond Gaussian or Bernoulli assumptions. UCB methods achieve problem-independent regret of order O(\sqrt{K T \log T}), where K is the number of arms and T the horizon, matching the minimax rate up to logarithmic factors and paralleling the guarantees for Thompson sampling, though derived via union bounds over concentration events rather than posterior analysis. Problem-dependent bounds further tighten this to O\left( \sum_{i \neq i^*} \frac{\log T}{\Delta_i} \right), where \Delta_i is the suboptimality gap for arm i, ensuring near-optimal regret in stochastic settings. In contrast to Thompson sampling, which draws arm selections probabilistically from a posterior distribution to incorporate prior beliefs and naturally randomize exploration, UCB enforces exploration through fixed, deterministic bounds that do not require Bayesian modeling, resulting in simpler computation (often just arithmetic operations on summary statistics) but reduced flexibility when informative priors are available. Empirical evaluations confirm UCB's robustness in standard stochastic settings, though it may underperform Thompson sampling in scenarios with strong prior information or non-stationary environments due to its lack of probabilistic adaptation.
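A minimal sketch of UCB1 using the index above is shown below (the simulated Bernoulli environment is illustrative):

```python
# UCB1 sketch: deterministic index = empirical mean + sqrt(2 log t / n_i).
import numpy as np

rng = np.random.default_rng(0)
true_probs = np.array([0.3, 0.5, 0.7])
K, T = len(true_probs), 5000
counts = np.zeros(K)
means = np.zeros(K)

for t in range(1, T + 1):
    if t <= K:
        a = t - 1                                  # pull each arm once to initialize
    else:
        bonus = np.sqrt(2 * np.log(t) / counts)    # optimism for rarely pulled arms
        a = int(np.argmax(means + bonus))
    r = rng.binomial(1, true_probs[a])
    counts[a] += 1
    means[a] += (r - means[a]) / counts[a]         # incremental mean update

print("pull counts:", counts.astype(int))
```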

Epsilon-Greedy Strategies

The epsilon-greedy strategy is a simple exploration-exploitation algorithm used in multi-armed bandit problems, where with probability 1 - \epsilon, the algorithm selects the arm with the highest empirical mean reward (exploitation), and with probability \epsilon, it selects an arm uniformly at random (exploration). Often, \epsilon is fixed or decays over time, such as \epsilon_t = 1/t or tuned to \epsilon_t = \min\{1, t^{-1/3} K^{2/3} \log^{1/3} t\}, where K is the number of arms and t is the time step, to balance the trade-off and achieve sublinear regret. This approach offers advantages in simplicity and does not require maintaining a probabilistic model of rewards, making it computationally lightweight and easy to implement. However, its exploration is inefficient because random selection equally favors all arms, including those known to be suboptimal, leading to unnecessary pulls of poor performers and higher cumulative regret compared to more targeted methods. For the tuned decaying \epsilon_t, the epsilon-greedy algorithm achieves an expected regret of O(T^{2/3} \log^{1/3} T) after T rounds, which is sublinear but polynomial and worse than the logarithmic regret attainable by asymptotically optimal algorithms. In contrast to epsilon-greedy, Thompson sampling enables more targeted exploration by sampling from posterior distributions over rewards, naturally directing pulls toward uncertain but promising arms without needing a fixed \epsilon schedule, resulting in empirically and theoretically superior performance in terms of regret.
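For comparison with the preceding strategies, a minimal epsilon-greedy sketch with the decaying schedule \epsilon_t = 1/t follows (the simulated environment is again illustrative):

```python
# Epsilon-greedy sketch with decaying epsilon_t = 1/t.
import numpy as np

rng = np.random.default_rng(0)
true_probs = np.array([0.3, 0.5, 0.7])
K, T = len(true_probs), 5000
counts = np.zeros(K)
means = np.zeros(K)

for t in range(1, T + 1):
    if t <= K:
        a = t - 1                                  # pull each arm once to initialize
    elif rng.random() < 1.0 / t:                   # decaying exploration probability
        a = int(rng.integers(K))                   # explore uniformly at random
    else:
        a = int(np.argmax(means))                  # exploit the empirical best
    r = rng.binomial(1, true_probs[a])
    counts[a] += 1
    means[a] += (r - means[a]) / counts[a]         # incremental mean update

print("pull counts:", counts.astype(int))
```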
