
Stochastic game

A stochastic game, also known as a Markov game, is a formal model in game theory that describes multi-player interactions in a dynamic environment where outcomes depend on both the players' actions and random transitions between states. Introduced by Lloyd Shapley in 1953, it generalizes single-agent Markov decision processes to competitive or cooperative multi-agent settings, featuring a finite set of states, action spaces for each player, probabilistic transition functions, and reward structures that accumulate over time. These games are typically analyzed in infinite-horizon discounted forms, where future rewards are geometrically discounted by a factor \gamma \in (0,1), or in finite-horizon variants with a fixed number of steps H. The core structure of a stochastic game is defined as a tuple (S, (A_i)_{i \in I}, \mathbb{P}, (r_i)_{i \in I}, \gamma, \mu), where S is the finite state space, A_i are the action sets of the players in I, \mathbb{P}(s'|s,a) governs state transitions, r_i are per-player reward functions, \gamma is the discount factor, and \mu is the initial state distribution. Players employ policies—ranging from history-dependent to stationary Markov strategies—to maximize their expected discounted rewards, with equilibria such as Nash equilibria or minimax values characterizing optimal play.

In two-player zero-sum stochastic games, Shapley's foundational result guarantees the existence of a unique value and optimal stationary strategies, solvable via value iteration or policy improvement methods. For non-zero-sum multiplayer cases, stationary equilibria exist under discounting, though undiscounted or uniform variants pose greater computational and theoretical challenges. Stochastic games have profoundly influenced fields beyond pure theory, serving as foundational models in multi-agent reinforcement learning for tasks such as coordination among autonomous systems, where agents learn equilibria amid uncertainty. In economics and resource management, they model dynamic competition, such as fisheries or arms races, capturing long-term strategic interactions under stochastic shocks. Key extensions include absorbing games (with terminal states), stopping games (player-controlled endings), and quitting games (optional exits), each yielding results on \epsilon-equilibria for undiscounted payoffs. Ongoing research addresses algorithmic solvability and applications in artificial intelligence, with uniform values ensuring strategies that perform well across varying horizons.

Fundamentals

Definition and Basic Concepts

Stochastic games, also known as Markov games, provide a framework for modeling dynamic interactions among multiple decision-makers in environments subject to uncertainty. They build upon foundational concepts from normal-form games and Markov decision processes, extending both to capture sequential strategic decision-making under probabilistic state evolution. In classical game theory, a normal-form game describes a static, one-shot interaction where a set of players simultaneously selects actions from their respective action sets, resulting in payoffs determined solely by the chosen action profile. A Markov decision process (MDP), in contrast, formalizes single-agent sequential decision-making in a stochastic setting, defined by a state space, an action space, a probabilistic transition kernel that maps the current state and action to next-state distributions, and a reward function yielding immediate payoffs based on state-action pairs. Stochastic games generalize MDPs to multi-agent scenarios by allowing actions to influence both immediate rewards and state transitions, thereby introducing interdependence and potential conflict or alignment among players' objectives.

The core feature of stochastic games is their dynamic nature: the system begins in an initial state, players jointly select actions, receive stage-wise payoffs, and the state transitions probabilistically to a new state according to the joint action chosen, repeating indefinitely or until termination. This contrasts with the fixed structure of normal-form games, which lack temporal evolution, and with MDPs, where transitions and rewards depend only on a single agent's choice rather than coordinated or adversarial inputs from multiple agents. In stochastic games, interactions can be zero-sum (adversarial) or non-zero-sum (potentially cooperative), with uncertainty arising from the probabilistic transitions rather than deterministic outcomes.

An illustrative example is the Big Match, a two-player zero-sum stochastic game with one non-absorbing state and two absorbing states. In the non-absorbing state, Player 1 (the hider) chooses the number 1 or 2, while Player 2 (the matcher) simultaneously chooses 1 or 2. If Player 2 chooses 1, the game remains in the non-absorbing state with a stage payoff of 0 to Player 2. If Player 2 chooses 2, the game transitions to an absorbing state: with payoff 1 per stage forever to Player 2 if the choice matches Player 1's number (or -1 to Player 1 under the zero-sum convention), or payoff 0 per stage forever to Player 2 if it does not (and 0 to Player 1). The overall payoff is the limit inferior of the average stage payoffs, capturing long-run performance under uncertainty about matching behavior.
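To make the example concrete, the following Python sketch (an illustration only; the simulation setup, function name, and strategy parameters are our own choices, and the strategies shown are not optimal ones) simulates the game above under simple i.i.d. stationary randomizations and estimates the matcher's long-run average payoff.

```python
import numpy as np

rng = np.random.default_rng(0)

def big_match_average(p_hide_1, p_guess_2, T=10_000):
    """One simulated play of the game above, returning the matcher's average
    payoff over T stages.  Both players use illustrative i.i.d. stationary
    randomizations: the hider picks 1 with probability p_hide_1, the matcher
    picks 2 (the absorbing guess) with probability p_guess_2."""
    for t in range(T):
        hide = 1 if rng.random() < p_hide_1 else 2
        guess = 2 if rng.random() < p_guess_2 else 1
        if guess == 2:                              # the game absorbs at this stage
            stage = 1.0 if guess == hide else 0.0   # 1 forever on a match, 0 otherwise
            return stage * (T - t) / T              # absorbed payoff is paid every remaining stage
    return 0.0                                      # never absorbed: every stage payoff was 0

runs = [big_match_average(p_hide_1=0.5, p_guess_2=0.01) for _ in range(500)]
# close to 0.5 on average: against a 50/50 hider, absorbing pays off half the time
print(np.mean(runs))
```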

Historical Development

The theory of stochastic games originated with Lloyd Shapley's seminal 1953 paper, which introduced the model for two-player zero-sum stochastic games as a dynamic extension of static game forms, proving the existence of a value under finite state and action spaces. This work built on John von Neumann's 1928 minimax theorem for zero-sum games, adapting it to sequential interactions where transitions between states occur probabilistically based on joint player actions. In the 1950s, early developments linked stochastic games to the emerging field of dynamic programming, pioneered by Richard Bellman, with Shapley's value iteration method establishing parallels to Markov decision processes for single-agent cases while highlighting multi-agent strategic tensions. By the 1960s, extensions to multiplayer settings advanced the framework; A.M. Fink's 1964 paper demonstrated the existence of stationary equilibria in discounted finite n-player stochastic games, broadening applicability beyond zero-sum duels.

Key milestones in the 1980s included Jean-François Mertens and Abraham Neyman's proof that undiscounted zero-sum stochastic games possess a value, resolving long-standing open questions about non-discounted payoffs. Their analysis also addressed discounted variants, establishing convergence of discounted values and uniform bounds across discount factors in finite games. Since the 2010s, stochastic game theory has integrated deeply with machine learning, enabling scalable solutions for complex environments; for instance, reinforcement learning algorithms have been adapted to approximate equilibria in high-dimensional stochastic games, as explored in post-2010 work on fictitious play and policy optimization. In the 2020s, advances in deep learning have further enabled computation of equilibria in high-dimensional stochastic games. This evolution has fueled applications since 2015, particularly in multi-agent systems such as autonomous coordination and adversarial training, where stochastic games model uncertainty in real-time decision-making.

Formal Framework

Model Components

A stochastic game is formally defined as a tuple \Gamma = (S, (A_i)_{i=1}^n, P, (r_i)_{i=1}^n), where S is a finite set of states representing the possible configurations of the system, A_i is a finite set of actions available to player i for i = 1, \dots, n, P: S \times \prod_{i=1}^n A_i \to \Delta(S) is the transition probability function with P(s' \mid s, a_1, \dots, a_n) denoting the probability of moving to state s' from state s under the joint action profile (a_1, \dots, a_n), and r_i: S \times \prod_{i=1}^n A_i \to \mathbb{R} is the stage reward function for player i. This structure generalizes the two-player zero-sum case introduced by Shapley to arbitrary finite numbers of players with individual rewards.

The model assumes that the state and action spaces are finite, ensuring compactness for analysis, and that transition probabilities satisfy \sum_{s' \in S} P(s' \mid s, a_1, \dots, a_n) = 1 for all s \in S and all joint actions, with probabilities being nonnegative. Absorbing states, where P(s \mid s, a_1, \dots, a_n) = 1 for all joint actions and rewards are constant, may exist to model termination conditions. Stochastic games are typically analyzed over an infinite horizon, where payoffs aggregate stage rewards via discounting or averaging, though finite-horizon variants terminate after a fixed number of stages or upon absorption. In infinite-horizon settings, stationary strategies—probability distributions over actions that depend solely on the current state—are emphasized, as they suffice for optimal play under discounting. When n = 1, the model reduces to a Markov decision process.

To illustrate, consider a two-player stochastic game (n = 2) with states S = \{1, 2\} and action sets A_1 = A_2 = \{U, D\} for each player. The transition probabilities P and rewards r_1, r_2 are specified in the following tables, with one row per joint action (player 1's action listed first, player 2's second). For state 1:
Joint action   P(1 \mid 1, \cdot)   P(2 \mid 1, \cdot)   r_1(1, \cdot)   r_2(1, \cdot)
(U, U)         0.7                  0.3                  2               1
(U, D)         0.4                  0.6                  1               3
(D, U)         0.5                  0.5                  0               2
(D, D)         0.2                  0.8                  3               0
For state 2 (an absorbing state with fixed reward 0 for both players after absorption):
Joint action   P(1 \mid 2, \cdot)   P(2 \mid 2, \cdot)   r_1(2, \cdot)   r_2(2, \cdot)
(U, U)         0                    1                    0               0
(U, D)         0                    1                    0               0
(D, U)         0                    1                    0               0
(D, D)         0                    1                    0               0
This example demonstrates how joint actions influence probabilistic state changes and immediate rewards, with absorption in state 2 halting further decisions.
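A minimal sketch of how such a game can be encoded and simulated, assuming a NumPy array layout P[s, a1, a2, s'] and R_i[s, a1, a2] of our own choosing (not part of the formal definition):

```python
import numpy as np

rng = np.random.default_rng(1)

# Array encoding of the example above (states 0 and 1 stand for states 1 and 2;
# action index 0 = U, 1 = D).
P = np.zeros((2, 2, 2, 2))
P[0] = [[[0.7, 0.3], [0.4, 0.6]],   # state 1, player 1 plays U: (U,U), (U,D)
        [[0.5, 0.5], [0.2, 0.8]]]   # state 1, player 1 plays D: (D,U), (D,D)
P[1, :, :, 1] = 1.0                 # state 2 is absorbing
R1 = np.array([[[2., 1.], [0., 3.]],        # r_1(1, .)
               [[0., 0.], [0., 0.]]])       # r_1(2, .) = 0
R2 = np.array([[[1., 3.], [2., 0.]],        # r_2(1, .)
               [[0., 0.], [0., 0.]]])       # r_2(2, .) = 0

assert np.allclose(P.sum(axis=3), 1.0)      # each transition row is a distribution

# simulate a few stages under uniformly random joint actions
s = 0
for t in range(5):
    a1, a2 = rng.integers(2), rng.integers(2)
    print(f"t={t}  state={s + 1}  actions=({'UD'[a1]},{'UD'[a2]})  "
          f"r1={R1[s, a1, a2]}  r2={R2[s, a1, a2]}")
    s = rng.choice(2, p=P[s, a1, a2])
```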

Strategies and Outcomes

In stochastic games, players employ strategies to select actions dynamically over time, accounting for the evolving state of the system. A behavioral strategy for player i is a function \sigma_i: S \times H \to \Delta(A_i), where S denotes the state space, H the set of possible histories (sequences of past states and actions), and \Delta(A_i) the set of probability distributions over player i's action set A_i; this allows actions to depend on the full history of play, enabling complex, history-dependent decision-making. In contrast, a stationary strategy simplifies this to \sigma_i: S \to \Delta(A_i), where the action distribution depends only on the current state, assuming a Markovian structure that ignores prior history beyond the present state; such strategies are computationally tractable and sufficient for optimality in many cases, particularly in discounted settings.

A joint strategy profile \sigma = (\sigma_1, \dots, \sigma_n) combines the strategies of all n players, determining the probabilistic evolution of the game. Under this profile, a play path is a sequence (s_0, a_0, s_1, a_1, \dots), where the initial state s_0 is given, actions a_t = (a_{1t}, \dots, a_{nt}) are drawn from the joint distribution induced by \sigma at s_t, and next states s_{t+1} follow the transition probabilities p(\cdot \mid s_t, a_t). These paths realize the game's dynamics, with probabilistic branching due to both strategic choices and environmental transitions.

For infinite-horizon stochastic games, outcomes are evaluated via expected payoffs, incorporating a discount factor \gamma \in (0,1) to prioritize immediate rewards over distant ones. The payoff for player i under profile \sigma is u_i(\sigma) = \mathbb{E} \left[ \sum_{t=0}^\infty \gamma^t r_i(s_t, a_t) \mid \sigma \right], where the expectation is over play paths starting from an initial state, and r_i(s_t, a_t) is the stage reward; this formulation ensures convergence by geometrically discounting future rewards. To analyze long-run behavior, particularly under stationary strategies, occupation measures quantify the expected frequency of state-action pairs. For a joint stationary profile (x, y), the occupation measure of state s and actions (i, j) is the limiting proportion \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^T \Pr(S_t = s, I_t = i, J_t = j \mid x, y), derived from the stationary distribution of the induced Markov chain; these measures facilitate assessment of average rewards and ergodic properties without simulating infinite paths.
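Under a stationary joint profile, the expected discounted payoffs can be computed exactly by solving a linear system rather than simulating paths. The sketch below (two-player case only, using the array conventions of the earlier sketch; evaluate_stationary is our own helper name) does this via the Markov chain induced by the profile.

```python
import numpy as np

def evaluate_stationary(P, R, sigma, gamma):
    """Exact expected discounted payoffs under a stationary joint profile.

    P[s, a1, a2, s']  transition kernel, R[i, s, a1, a2] stage rewards,
    sigma[i]          player i's mixed actions, an array of shape (states, actions),
    gamma             discount factor.
    Returns value[i, s] = u_i when play starts in state s.
    """
    n_states = P.shape[0]
    pi = np.einsum('sa,sb->sab', sigma[0], sigma[1])   # joint action distribution per state
    P_sigma = np.einsum('sab,sabt->st', pi, P)         # Markov chain induced by the profile
    r_sigma = np.einsum('sab,isab->is', pi, R)         # expected stage reward of each player
    # v_i solves the linear system v_i = r_{i,sigma} + gamma * P_sigma v_i
    return np.linalg.solve(np.eye(n_states) - gamma * P_sigma, r_sigma.T).T

# e.g. both players mixing uniformly in every state of the two-state example above:
# evaluate_stationary(P, np.stack([R1, R2]), [np.full((2, 2), 0.5)] * 2, gamma=0.9)
```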

Two-Player Zero-Sum Case

Equilibrium and Value Function

In two-player zero-sum stochastic games, the reward function satisfies r_1(s, a_1, a_2) = -r_2(s, a_1, a_2) for all states s and actions a_1, a_2, ensuring that one player's gain equals the other's loss. The value function v(s) represents the game's value from state s, uniquely satisfying the Bellman-like equation v(s) = \val \Bigl\{ r(s, a_1, a_2) + \gamma \sum_{s'} P(s' \mid s, a_1, a_2) \, v(s') \Bigr\}, where \val denotes the value of the induced matrix game over the players' mixed strategies, \gamma \in [0,1) is the discount factor, and the expectation is with respect to the transition probabilities P. This formulation captures the long-term discounted payoff under optimal play.

The equation arises from the Shapley operator T, defined componentwise by (Tv)(s) = \val \{ r(s, a_1, a_2) + \gamma \sum_{s'} P(s' \mid s, a_1, a_2) \, v(s') \}. The true value v^* is the unique fixed point v^* = T v^*, obtained as the limit of value iteration starting from any initial value vector, since T is a contraction in the sup-norm with modulus \gamma: \| T v - T w \|_\infty \leq \gamma \| v - w \|_\infty. This guarantees convergence regardless of the starting point, establishing both existence and uniqueness of the value in the discounted setting.

Optimal strategies for both players can be chosen as stationary Markov policies, depending only on the current state. For each s, the auxiliary game induced by v on future payoffs forms a finite zero-sum matrix game, to which Glicksberg's theorem applies: since the mixed-strategy spaces are compact and convex, and payoffs are continuous and quasi-concave in a player's own actions, mixed equilibria exist. Selecting, for each s, an optimal mixed-strategy pair in this auxiliary game yields stationary policies that are globally optimal, achieving v(s) from every starting state.

In the undiscounted case (\gamma = 1), the value v(s) still exists for finite-state and finite-action games, defined as the limit of the discounted values as \gamma \to 1, but optimal strategies may not. A counterexample, known as the Big Match, demonstrates a game in which no stationary strategy is optimal, as one player requires history-dependent play to prevent exploitation. However, for any \varepsilon > 0, there exist \varepsilon-optimal strategies that approximate the value uniformly across all states, ensuring practical solvability despite the theoretical gap.
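The \val operation inside the Shapley operator is simply the value of a finite zero-sum matrix game and can be computed by a small linear program. The following Python sketch (a minimal illustration; the helper names matrix_game_value and shapley_operator and the array layout R[s, a, b], P[s, a, b, s'] are our own assumptions, with SciPy's linprog used as a generic LP solver) implements the operator and checks the contraction inequality numerically on a small random game.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(A):
    """Value and an optimal mixed strategy of the row (maximizing) player
    for the zero-sum matrix game with payoff matrix A, via the standard LP."""
    m, n = A.shape
    c = np.zeros(m + 1); c[-1] = -1.0                 # maximize the game value v
    A_ub = np.hstack([-A.T, np.ones((n, 1))])         # v - sum_i A[i, j] x_i <= 0 for every column j
    A_eq = np.ones((1, m + 1)); A_eq[0, -1] = 0.0     # probabilities sum to 1
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n),
                  A_eq=A_eq, b_eq=[1.0], bounds=bounds)
    return res.x[-1], res.x[:m]

def shapley_operator(v, R, P, gamma):
    """(Tv)(s) = val_{a,b} [ r(s,a,b) + gamma * sum_{s'} P(s'|s,a,b) v(s') ]."""
    Tv = np.empty(R.shape[0])
    for s in range(R.shape[0]):
        aux = R[s] + gamma * P[s] @ v                 # auxiliary matrix game at state s
        Tv[s], _ = matrix_game_value(aux)
    return Tv

# numerical check of the contraction property on a random game
rng = np.random.default_rng(2)
S, A, B, gamma = 4, 2, 2, 0.9
R = rng.normal(size=(S, A, B))
P = rng.dirichlet(np.ones(S), size=(S, A, B))         # P[s, a, b, :] is a distribution
v, w = rng.normal(size=S), rng.normal(size=S)
lhs = np.max(np.abs(shapley_operator(v, R, P, gamma) - shapley_operator(w, R, P, gamma)))
print(lhs <= gamma * np.max(np.abs(v - w)))           # True: ||Tv - Tw|| <= gamma ||v - w||
```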

Solution Algorithms

Value iteration is a core method for computing the value function in discounted two-player zero-sum stochastic games. The algorithm initializes an arbitrary value vector v^0 and updates it successively via v^{k+1} = T v^k, where T is the Shapley operator defined state-wise as (Tv)(s) = \val_{a \in A, b \in B} \left[ r(s,a,b) + \gamma \sum_{s'} p(s'|s,a,b) v(s') \right], with \val denoting the value of the resulting matrix game over actions a and b. This process continues until the updates converge to a fixed point, yielding the unique value function v of the game. The operator T is a contraction with modulus \gamma < 1, ensuring geometric convergence independent of the initial v^0. The error after k iterations satisfies \|v^k - v\|_\infty \leq \frac{\gamma^k}{1 - \gamma} \|v^1 - v^0\|_\infty, allowing for controlled approximation by choosing sufficiently many iterations.

Policy iteration offers an alternative approach, often converging faster in practice for finite state-action games by exploiting policy structure. It alternates two steps: policy evaluation, which fixes policies \sigma for the maximizer and \tau for the minimizer and solves the resulting linear system v = r_\sigma^\tau + \gamma P_\sigma^\tau v for the value v under those policies; and policy improvement, in which for each state s the players solve the zero-sum matrix game with payoffs r(s, a, b) + \gamma \sum_{s'} p(s' \mid s, a, b) v(s') to obtain optimal mixed actions defining the new stationary policies \sigma' and \tau'. The process repeats with the updated policies until stationary optimal policies are reached. For finite zero-sum stochastic games, policy iteration terminates in a finite number of steps, producing exact optimal stationary policies.

For finite-horizon variants of two-player zero-sum games, a linear programming (LP) formulation provides an exact solution method. The problem is cast over occupation measures x(s,t,a,b), representing the expected frequency of state-action pairs at time t, with the objective to minimize (or maximize) the total expected reward \sum_{s,t,a,b} x(s,t,a,b) \, r(s,a,b), subject to flow-conservation constraints ensuring \sum_{a,b} x(s,t+1,a,b) = \sum_{s',a,b} p(s \mid s',a,b) \, x(s',t,a,b) for all states and times, initial-distribution constraints, and non-negativity. This LP has size polynomial in the horizon length, the number of states, and the number of actions, and is solvable via standard solvers.

Regarding computational complexity, value and policy iteration run in polynomial time for a fixed discount factor \gamma < 1, as the number of iterations needed for a given accuracy scales with O(1/(1-\gamma)) and each step is polynomial in the game size. In contrast, undiscounted (mean-payoff) two-player zero-sum stochastic games are known to lie in NP ∩ coNP, but no polynomial-time algorithm is known and it remains an open problem whether they can be solved in polynomial time. Recent advances in the 2020s have leveraged neural networks for approximate solutions, training deep architectures to parameterize value functions or policies and optimizing via gradient-based methods on the Bellman residual, enabling scalability to large state spaces beyond traditional iterative approaches.
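As a concrete illustration of the procedure described above, the sketch below runs value iteration with a stopping rule derived from the contraction bound; it reuses the matrix_game_value and shapley_operator helpers from the previous sketch (our own helper names, not a library API).

```python
import numpy as np

def value_iteration(R, P, gamma, tol=1e-8, max_iter=10_000):
    """Discounted value of a zero-sum stochastic game by iterating the Shapley
    operator (shapley_operator from the sketch above).  Stops once the
    contraction bound guarantees sup-norm accuracy tol, using
        ||v_{k+1} - v*|| <= gamma / (1 - gamma) * ||v_{k+1} - v_k||."""
    v = np.zeros(R.shape[0])
    for _ in range(max_iter):
        v_next = shapley_operator(v, R, P, gamma)
        if np.max(np.abs(v_next - v)) <= tol * (1 - gamma) / gamma:
            return v_next
        v = v_next
    return v
```

Once the iteration stops, optimal stationary strategies can be recovered by solving the auxiliary matrix game at each state with the converged values and taking the maximizing mixed strategy returned by matrix_game_value.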

Generalizations and Variants

Multiplayer Non-Zero-Sum Games

In multiplayer non-zero-sum games, the model extends the two-player framework to n \geq 2 agents, where each player i seeks to maximize their individual expected discounted payoff u_i(\sigma) = \mathbb{E}[\sum_{t=0}^\infty \beta^t r_i(s_t, a_t) \mid s_0, \sigma], with \beta \in (0,1) the common discount factor, r_i player i's stage reward function, and \sigma = (\sigma_1, \dots, \sigma_n) a profile of (possibly history-dependent) strategies. Unlike zero-sum settings, the sum of payoffs \sum_i u_i(\sigma) need not be zero or constant, allowing for cooperative or conflicting incentives across players.

A Nash equilibrium is a strategy profile \sigma^* = (\sigma_1^*, \dots, \sigma_n^*) such that no player i can unilaterally improve their payoff by deviating, i.e., u_i(\sigma_i^*, \sigma_{-i}^*) \geq u_i(\sigma_i, \sigma_{-i}^*) for all alternative strategies \sigma_i and all i \in \{1, \dots, n\}. Stationary Nash equilibria, where each \sigma_i^* depends only on the current state s_t, form a natural solution concept for these infinite-horizon games and can be found as fixed points of best-response mappings over the space of stationary strategies. Such equilibria are established using fixed-point theorems like Glicksberg's extension of Kakutani's theorem, applied to the compact convex set of stationary strategies with continuous payoff functions. For instance, in finite-state, finite-action games, stationary Nash profiles ensure subgame perfection and Markovian behavior, aligning long-term incentives with state-dependent actions.

Non-uniqueness of equilibria is common in these games, as multiple stationary profiles can satisfy the no-deviation condition, often leading to payoff vectors that vary significantly across equilibria. This multiplicity has analogs to folk theorems in repeated games, where, for sufficiently patient players (high \beta), any feasible and individually rational payoff in the convex hull of stage-game payoffs can be approximated as an equilibrium payoff in the stochastic game, provided the state transitions support punishment strategies. Examples include dynamic Cournot oligopolies, where equilibria range from collusive outcomes to competitive ones depending on the profile selected.

To address non-uniqueness and enhance robustness, refinements such as correlated equilibria extend Nash equilibria by allowing a public correlation device to recommend actions from a joint distribution \mu over action profiles, such that no player benefits from deviating after observing their recommendation. In stochastic games, uniform correlated equilibria exist under conditions like finite states and absorbing structures, enabling coordination without full communication. Trembling-hand perfect equilibria further refine these by requiring stability against small independent perturbations (trembles) in opponents' strategies, ensuring that only equilibria robust to implementation errors survive in the limit as tremble probabilities approach zero; such refinements exist in discounted games with perfect recall.

A key challenge in multiplayer non-zero-sum stochastic games is the absence of a scalar value function, as each player i optimizes their own u_i(\sigma) rather than a shared value, which complicates coordination and can lead to inefficient equilibria due to misaligned incentives. For example, players may select payoff-dominated profiles because unilateral deviations do not guarantee improvement, exacerbating issues in games with imperfect information or uncountable state spaces.
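One practical way to work with these concepts is to test whether a candidate stationary profile is an equilibrium by measuring each player's best unilateral deviation gain. The sketch below (restricted to two players for brevity, with the array conventions of the earlier sketches; the function names are our own) computes this "Nash gap", which is approximately zero at a stationary Nash equilibrium.

```python
import numpy as np

def best_response_value(P_i, R_i, opp_sigma, gamma, iters=5000):
    """Optimal value of the MDP a player faces when the other player's stationary
    mixed strategy opp_sigma (shape: states x opponent actions) is held fixed.
    Conventions: P_i[s, own_action, opp_action, s'], R_i[s, own_action, opp_action]."""
    P_avg = np.einsum('sb,sabt->sat', opp_sigma, P_i)    # average out the opponent
    r_avg = np.einsum('sb,sab->sa', opp_sigma, R_i)
    v = np.zeros(P_i.shape[0])
    for _ in range(iters):
        v = np.max(r_avg + gamma * P_avg @ v, axis=1)    # Bellman optimality update
    return v

def nash_gap(P, R, sigma, gamma):
    """Largest gain any player can obtain by a unilateral stationary deviation
    from the profile sigma (two-player case, P[s, a1, a2, s'], R[i, s, a1, a2])."""
    pi = np.einsum('sa,sb->sab', sigma[0], sigma[1])
    P_pi = np.einsum('sab,sabt->st', pi, P)
    gaps = []
    for i in range(2):
        r_pi = np.einsum('sab,sab->s', pi, R[i])
        v_pi = np.linalg.solve(np.eye(P.shape[0]) - gamma * P_pi, r_pi)  # on-profile value
        if i == 0:
            v_br = best_response_value(P, R[0], sigma[1], gamma)
        else:   # swap the action axes so player 2's own action comes first
            v_br = best_response_value(P.transpose(0, 2, 1, 3),
                                       R[1].transpose(0, 2, 1), sigma[0], gamma)
        gaps.append(float(np.max(v_br - v_pi)))
    return max(gaps)
```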

Stopping and Discounted Variants

Stopping games represent a specialized variant of stochastic games in which players decide whether to continue or terminate the game at each stage, based on the current state and associated payoffs. In these games, the payoff is realized upon stopping, and the value of the game is determined by the supremum over one player's stopping strategies and the infimum over the opponent's, often leading to the identification of regions where continuation and termination are strategically indifferent. This framework was introduced by Dynkin, who established the existence of the value and of optimal stopping times in Markov processes under suitable conditions, ensuring stationary equilibria in such settings.

Discounted variants of stochastic games extend the basic model by incorporating state-dependent discount factors \gamma(s), where the impatience rate varies with the current state, altering the effective weighting of future rewards. Analysis of these variants relies on generalized Shapley equations, which adapt the original value iteration framework to account for the varying discounts, allowing for the computation of values in zero-sum cases. For instance, in two-player zero-sum settings with Borel state and action spaces and unbounded rewards, iterative algorithms converge to the value under suitable contraction conditions imposed by the state-dependent discounts.

Undiscounted average-reward games shift the objective to the limiting average payoff, defined as \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^T r_t, where r_t denotes the stage reward, emphasizing long-run performance over discounted sums. In this criterion, Blackwell optimality extends the contracting-operator approach from Markov decision processes to games, identifying strategies that remain optimal across a range of discount factors approaching one, thereby approximating average-reward solutions. Existence of values and equilibria in these undiscounted variants holds under absorbing or ergodic assumptions, with stationary policies achieving optimality in perfect-information cases.
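For intuition, the dynamic-programming recursion of a simple zero-sum stopping game can be iterated directly. The sketch below is a minimal illustration under assumed conventions (maximizer's stop reward a(s), minimizer's stop cost b(s) with a(s) <= b(s), a simultaneous-stop payoff between them, and an exogenous continuation kernel); it is a special case chosen for clarity, not the general model discussed above.

```python
import numpy as np

def dynkin_value(P, a, b, gamma, tol=1e-12):
    """Value iteration for a discounted zero-sum stopping (Dynkin-type) game.

    Assumed conventions: the maximizer may stop and receive a(s); the minimizer
    may stop and pay b(s); if both stop at once the payoff lies between a(s) and
    b(s); if both continue, the state moves with the exogenous kernel P[s, s']
    and payoffs are discounted by gamma.  With a(s) <= b(s) the value satisfies
        v(s) = max( a(s), min( b(s), gamma * sum_{s'} P[s, s'] v(s') ) ).
    """
    v = np.zeros(len(a))
    while True:
        w = np.maximum(a, np.minimum(b, gamma * P @ v))
        if np.max(np.abs(w - v)) < tol:
            return w
        v = w

# three-state example with an exogenous random walk between the stopping bounds
P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
a = np.array([0.0, 1.0, 2.0])
b = np.array([1.5, 2.0, 3.0])
print(dynkin_value(P, a, b, gamma=0.9))
```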

Theoretical Results and Complexity

Existence Theorems

In two-player zero-sum discounted stochastic games with finite state and action spaces, Shapley's theorem establishes the existence of a unique value for each state and of stationary optimal strategies for both players. This result relies on the contraction property of the discounted Bellman operator, ensuring convergence to the value function via value iteration. For undiscounted zero-sum stochastic games, the existence of a value—defined as the limit of the discounted values as the discount factor approaches 1—was proven by Mertens and Neyman, confirming that the game possesses a well-defined value despite the absence of discounting. Furthermore, Mertens, Sorin, and Zamir demonstrated the existence of a uniform value, independent of the discount factor, using approachability theory to show that players can guarantee payoffs arbitrarily close to this value uniformly across discount factors.

In the general n-player case with finite states and actions, stationary correlated equilibria exist for discounted stochastic games, as shown by Nowak and Raghavan, where a correlation device recommends actions based solely on the current state to sustain equilibrium payoffs. However, pure-strategy equilibria are not guaranteed, even in stationary form, due to potential coordination failures in multi-player settings. Recent advancements address approximate equilibria in infinite-state stochastic games, where traditional finite-state assumptions fail. For instance, work by He and coauthors established the existence of approximate stationary equilibria and stationary weak correlated equilibria in constrained discounted stochastic games with countably generated state spaces and continuous transition densities, extending guarantees to broader settings via fixed-point arguments. These results highlight ongoing progress in handling infinite-state complexity through operator-theoretic methods such as contractions.

Computational Challenges

Computing approximate Nash equilibria in multiplayer discounted stochastic games is PPAD-complete, even for turn-based settings with a constant discount factor. This result extends the PPAD-completeness established by Daskalakis, Goldberg, and Papadimitriou in 2009 for finding approximate Nash equilibria in three-player normal-form games, adapting the reduction to the setting where states evolve according to transition probabilities. Specifically, Jin, Muthukumar, and Sidford demonstrated in 2023 that computing an ε-approximate stationary equilibrium in infinite-horizon simultaneous-move stochastic games is PPAD-hard when ε is inversely polynomial in the total number of actions across players. Concurrently, Daskalakis, Golowich, and Zhang showed PPAD-hardness for ε-approximate Markov perfect equilibria in two-player discounted turn-based games under fixed discount factors such as 1/2.

In undiscounted stochastic games, computational challenges intensify, particularly in infinite-state spaces where the existence of equilibria can be undecidable. Ummels and Wojtczak proved in 2011 that determining whether a finite-memory equilibrium exists in simple stochastic multiplayer games—modeled on finite graphs with probabilistic transitions—is undecidable, via reduction from the halting problem for two-counter machines, even for as few as 13 players using pure strategies. This undecidability arises because strategies may require infinite memory to achieve equilibrium payoffs, rendering exact solutions impossible in general. Despite this, approximations can be pursued through Q-learning variants adapted to multi-agent settings; for instance, Nash Q-learning iteratively updates action-value estimates to converge toward approximate equilibria in general-sum stochastic games, though guarantees are weaker in undiscounted cases without additional assumptions such as proper policies.

Key open issues in solving stochastic games revolve around scaling algorithms to large state-action spaces, where the curse of dimensionality hampers exact methods. Recent progress from 2023 includes tensor-based approaches for multiplayer games, where payoff structures are reformulated as symmetric tensors to enable semismooth Newton methods for computing Nash equilibria more efficiently than brute-force enumeration. For example, work from 2023 exploited k-mode symmetry in payoff tensors to reduce the search space in m-player games, achieving quadratic convergence rates under mild regularity conditions. These tensor methods hold promise for approximating equilibria in high-dimensional variants, particularly when integrated with learning algorithms, though their extension to fully general undiscounted multiplayer cases remains an active area of research.

Applications

Economic and Decision-Making Models

Stochastic games provide a framework for analyzing strategic interactions in dynamic economies where agents make decisions under uncertainty, with states evolving probabilistically based on prior actions and exogenous shocks. These models capture dynamic environments like fluctuating markets or resource stocks, enabling the study of equilibria that balance short-term gains against long-term outcomes. In economic applications, they extend classical repeated games by incorporating Markov processes for state transitions, allowing for realistic depictions of incomplete information and repeated play.

In industrial organization, stochastic games model duopolies such as stochastic Bertrand competition, where firms adjust prices in response to shocks that alter demand or cost conditions. These setups often yield Markov-stationary equilibria in which prices depend on the current state, leading to prices above marginal costs due to factors like consumer loyalty or switching frictions. For example, Rosenthal (1982) examines a Bertrand duopoly with stochastically evolving consumer loyalties, demonstrating how such dynamics sustain supra-competitive prices over time. Similarly, Beggs and Klemperer (1992) analyze long-run price competition under switching costs, showing that initial market shares confer persistent advantages in dynamic settings. Stochastic games also inform bargaining and contract design, particularly in repeated interactions such as wage negotiations amid probabilistic economic states like business cycles or productivity shocks. Here, players' bargaining powers fluctuate stochastically, introducing volatility into outcomes without relying on rigid wages.

For resource management, stochastic games tackle common-pool problems under environmental uncertainty, exemplified by fisheries whose stocks evolve randomly due to natural variability. Non-cooperative equilibria often lead to over-exploitation, as in the "Great Fish War" model by Levhari and Mirman (1980), which frames extraction as a stochastic game resulting in suboptimal depletion rates. Cooperative variants from the late 1980s incorporate trigger strategies to enforce sustainable harvesting, mitigating the tragedy of the commons through credible threats in stochastic environments. Jørgensen and Yeung (1996) further develop a stochastic game of common-property fisheries, deriving equilibria that balance extraction with stock regeneration under uncertainty.

A key application involves folk theorems for stochastic games, extended from the 1990s onward to sustainable growth models in resource economics. These theorems establish that sufficiently patient players can achieve equilibria approximating the feasible and individually rational payoff set, including cooperative sustainable paths in resource or growth contexts. Legros (1993) proves a folk theorem for stochastic games, showing that any feasible payoff above the minmax level is sustainable via subgame-perfect equilibria under discounting, with applications to long-term resource use under uncertainty. This result underpins models of sustainable management in which stochastic state changes, like climate variability, still allow near-efficient allocations without centralized control.

Reinforcement Learning and AI

Stochastic games provide a foundational framework for multi-agent reinforcement learning (MARL), extending single-agent Markov decision processes to scenarios where multiple agents interact strategically in shared environments. In this setting, agents learn policies that account for the actions and influences of others, often modeled as zero-sum or general-sum stochastic games where transitions and rewards depend on joint actions. A seminal contribution is the use of Markov games—equivalent to finite stochastic games—as the backbone for MARL algorithms, including extensions of Q-learning such as minimax-Q and Nash-Q, which compute value functions under adversarial or equilibrium assumptions.

In adversarial training, stochastic games model the dynamics between competing agents, as seen in generative adversarial networks (GANs), where the generator and discriminator engage in a two-player Nash equilibrium pursuit. This formulation captures the non-convex optimization challenges of GAN training, enabling convergence guarantees under relaxed conditions such as monotonicity of the game's pseudogradient mapping. Similarly, in game-playing AI, stochastic games underpin opponent modeling in complex environments; for instance, DeepMind's AlphaStar employs multi-agent reinforcement learning in StarCraft II, a partially observable game, where agents train against diverse opponent populations to achieve grandmaster-level performance by predicting and adapting to opponent behaviors.

Cooperative stochastic games have been applied in robotics and control systems to enable swarm coordination under environmental noise and uncertainty. In swarm robotics, agents learn joint policies in stochastic environments with shared rewards, addressing challenges like decentralized decision-making in tasks such as object transportation or formation control, where noise in sensors or actuators introduces stochastic transitions. For example, deep reinforcement learning frameworks model these as cooperative MARL problems in stochastic games, allowing swarms to develop robust coordination strategies that outperform independent agents in noisy settings.

Recent advances in the 2020s leverage transformer architectures as solvers for large-scale stochastic games in simulated multi-agent environments, enhancing scalability through attention mechanisms that capture inter-agent dependencies and long-range interactions. Transformer-based methods, such as those integrating graph structures for value decomposition in MARL, enable efficient learning in high-dimensional games with dozens of agents, demonstrating improved sample efficiency and scalability in benchmarks like multi-agent particle environments. These approaches build on attention-based opponent modeling and centralized training with decentralized execution, facilitating applications in simulated large-scale systems.
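As an illustration of how such algorithms combine Q-learning with stage-game values, the following sketch shows a single minimax-Q update in the spirit of Littman's algorithm; it reuses the matrix_game_value helper from the Solution Algorithms sketch, and the commented training loop assumes a hypothetical environment interface rather than a real API.

```python
import numpy as np

# Q = np.zeros((num_states, num_actions, num_opponent_actions))   # learned tables (assumed shapes)

def minimax_q_update(Q, s, a, b, reward, s_next, alpha, gamma):
    """One minimax-Q learning step for the protagonist in a zero-sum Markov game.

    Q[s, a, b] estimates the discounted value of playing a against opponent
    action b in state s.  V(s') is the value of the matrix game Q[s'], computed
    here with the matrix_game_value LP helper from the earlier sketch.
    """
    v_next, _ = matrix_game_value(Q[s_next])
    Q[s, a, b] += alpha * (reward + gamma * v_next - Q[s, a, b])
    return Q

# Typical training loop (env, opponent_action, and their signatures are assumptions,
# not a real library API):
#   s = env.reset()
#   for step in range(num_steps):
#       _, policy = matrix_game_value(Q[s])        # protagonist's maximin mixed policy
#       a = rng.choice(len(policy), p=policy)      # plus some exploration in practice
#       b = opponent_action(s)
#       s_next, reward = env.step(a, b)
#       Q = minimax_q_update(Q, s, a, b, reward, s_next, alpha=0.1, gamma=0.95)
#       s = s_next
```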
