
Proximal operator

In mathematical optimization, the proximal operator of a proper closed convex function f: \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\} with parameter \lambda > 0 is defined as \prox_{\lambda f}(v) = \arg\min_{x \in \mathbb{R}^n} \left[ f(x) + \frac{1}{2\lambda} \|x - v\|_2^2 \right], where \|\cdot\|_2 denotes the Euclidean norm; this operator computes the point x that balances minimizing f with staying close to a given point v.

Introduced by Jean-Jacques Moreau in the early 1960s as part of his work on generalized projections and the Moreau decomposition theorem, the proximal operator provides a foundational tool for solving nonsmooth optimization problems, particularly those expressible as minimizing sums of smooth and nonsmooth functions (e.g., \ell_1 or nuclear-norm regularization). It generalizes the orthogonal projection onto a closed convex set (recovering the projection when f is the indicator function of such a set) and exhibits key properties such as firm nonexpansiveness, which ensures that fixed-point iterations involving the operator converge to minimizers of f.

The operator's significance lies in enabling efficient proximal algorithms, including the proximal point algorithm, proximal gradient methods (like ISTA and FISTA), and the alternating direction method of multipliers (ADMM), which are widely applied in large-scale settings such as signal processing, machine learning, and statistics due to the separability of proximal operators for sum-separable functions and the availability of closed-form expressions for common regularizers (e.g., the \ell_1 norm or nuclear norm). These algorithms leverage the operator's computational tractability, often allowing parallel or distributed implementations, and have been extended beyond convexity to nonconvex settings under certain conditions. Further developments by researchers like R. Tyrrell Rockafellar in the 1970s connected proximal operators to monotone operator theory, solidifying their role in variational analysis.

Definition and Background

Formal Definition

The proximal operator of a function f, denoted \prox_{\lambda f}(x), is formally defined as \prox_{\lambda f}(x) = \arg\min_{y} \left\{ f(y) + \frac{1}{2\lambda} \|y - x\|^2 \right\}, where f: \mathcal{H} \to (-\infty, +\infty] is a proper lower semicontinuous convex function on a Hilbert space \mathcal{H}, \lambda > 0 is a step size parameter, and \|\cdot\| denotes the norm induced by the inner product on \mathcal{H}. Convexity of f ensures the objective is strictly convex, guaranteeing a unique minimizer, while properness and lower semicontinuity ensure well-posedness and existence of the argmin. The parameter \lambda balances the regularization term f(y) against the data fidelity term \frac{1}{2\lambda} \|y - x\|^2: smaller \lambda emphasizes fidelity to x, while larger \lambda allows greater deviation in order to decrease f.

The value function of this minimization problem is known as the Moreau envelope of f, defined as e_{\lambda f}(x) = \min_{y} \left\{ f(y) + \frac{1}{2\lambda} \|y - x\|^2 \right\}, which satisfies the identity e_{\lambda f}(x) = f(\prox_{\lambda f}(x)) + \frac{1}{2\lambda} \|\prox_{\lambda f}(x) - x\|^2. This envelope provides a smooth (continuously differentiable) approximation of the possibly nonsmooth convex function f, with the proximal operator supplying its unique minimizer.

In the context of monotone operator theory, the proximal operator admits a resolvent representation: \prox_{\lambda f} = (\Id + \lambda \partial f)^{-1}, where \Id is the identity operator on \mathcal{H} and \partial f denotes the subdifferential of f. This formulation exhibits the proximal operator as the inverse of a sum involving the subdifferential; since \partial f is a maximal monotone operator when f is proper, lower semicontinuous, and convex, the resolvent is single-valued and defined everywhere on \mathcal{H}.
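As a concrete illustration of the definition, the following minimal Python sketch (our own example, not part of the standard treatment; the choice f(y) = |y| and all names are assumptions) computes the proximal point by directly minimizing the defining objective, then checks it against the known closed form and the Moreau envelope identity:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def prox_numeric(f, v, lam):
    """Approximate prox_{lam f}(v) = argmin_y f(y) + (1/(2*lam))*(y - v)**2."""
    obj = lambda y: f(y) + (y - v) ** 2 / (2 * lam)
    return minimize_scalar(obj).x

def soft_threshold(v, lam):
    """Known closed-form prox of f(y) = |y|."""
    return np.sign(v) * max(abs(v) - lam, 0.0)

f = abs
v, lam = 1.7, 0.5
p = prox_numeric(f, v, lam)
print(p, soft_threshold(v, lam))       # both ~1.2

# Moreau envelope identity: e_{lam f}(v) = f(p) + (1/(2*lam))*||p - v||^2
envelope = f(p) + (p - v) ** 2 / (2 * lam)
print(envelope)                        # ~1.45 for these values
```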

Historical Development

The concept of the proximal operator traces its roots to foundational work in convex analysis by Jean-Jacques Moreau, who introduced the proximal mapping in 1962 as a tool for studying subgradients and convexity in Euclidean spaces. Moreau's formulation provided an early geometric interpretation of regularization through infimal convolution, laying the groundwork for later algorithmic developments in optimization.

The proximal operator gained prominence in optimization algorithms with the introduction of the proximal point method by Bernard Martinet in 1970, who applied it to regularize variational inequalities and nonlinear programming problems. Independently, R. Tyrrell Rockafellar extended this framework in 1976 by developing the proximal point algorithm for maximal monotone operators in Hilbert spaces, establishing its convergence properties and broad applicability to convex optimization. These contributions marked the shift from theoretical mappings to practical iterative schemes for solving nonsmooth problems.

In the 1980s, the proximal operator was popularized through operator-splitting techniques, particularly in the context of partial differential equations and variational inequalities, via works by Roland Glowinski, Jacques-Louis Lions, and collaborators. Their methods, such as alternating direction approaches, integrated proximal steps to handle complex coupled systems efficiently.

A surge in interest occurred in the 2000s, driven by applications in signal processing and machine learning, with key advancements including Amir Beck and Marc Teboulle's fast iterative shrinkage-thresholding algorithm (FISTA) in 2009, which accelerated the basic iterative shrinkage-thresholding scheme (ISTA) for sparse recovery. Concurrently, Patrick L. Combettes and Jean-Christophe Pesquet advanced proximal splitting frameworks, unifying and extending algorithms for nonexpansive operator compositions in high-dimensional settings.

Core Properties

Basic Properties

The proximal operator of a proper lower semicontinuous convex function f: \mathcal{H} \to (-\infty, +\infty], defined as \operatorname{prox}_{\lambda f}(x) = \arg\min_{u} \left\{ f(u) + \frac{1}{2\lambda} \|u - x\|^2 \right\} for \lambda > 0 and a Hilbert space \mathcal{H}, exhibits several fundamental properties arising from the monotonicity of the subdifferential \partial f.

A key property is nonexpansiveness: for all x, y \in \mathcal{H}, \|\operatorname{prox}_{\lambda f}(x) - \operatorname{prox}_{\lambda f}(y)\| \leq \|x - y\|. This follows from the stronger property of firm nonexpansiveness, which ensures contraction-like behavior in iterative methods: \|\operatorname{prox}_{\lambda f}(x) - \operatorname{prox}_{\lambda f}(y)\|^2 \leq \langle x - y, \operatorname{prox}_{\lambda f}(x) - \operatorname{prox}_{\lambda f}(y) \rangle for all x, y \in \mathcal{H}. An equivalent formulation is \|\operatorname{prox}_{\lambda f}(x) - \operatorname{prox}_{\lambda f}(y)\|^2 + \|(x - \operatorname{prox}_{\lambda f}(x)) - (y - \operatorname{prox}_{\lambda f}(y))\|^2 \leq \|x - y\|^2. These inequalities hold because \operatorname{prox}_{\lambda f} = (I + \lambda \partial f)^{-1} is the resolvent of the maximal monotone operator \lambda \partial f, and resolvents of maximal monotone operators are firmly nonexpansive.

The relation to the subdifferential provides a variational characterization: for p = \operatorname{prox}_{\lambda f}(x), x - p \in \lambda \partial f(p). This inclusion follows directly from the first-order optimality condition of the proximal minimization problem, 0 \in \partial f(p) + \frac{1}{\lambda}(p - x). Equivalently, p = (I + \lambda \partial f)^{-1}(x).

The argmin characterization also yields optimality conditions for the minimization of f. A point p \in \mathcal{H} minimizes f if and only if it is a fixed point of the proximal operator, i.e., p = \operatorname{prox}_{\lambda f}(p), which is equivalent to 0 \in \partial f(p). This fixed-point property underpins the convergence of proximal algorithms to minimizers.
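The firm nonexpansiveness inequality is easy to probe numerically. The sketch below (illustrative only; it assumes f = \|\cdot\|_1, whose proximal operator is the componentwise soft thresholding discussed later) checks the inequality on random pairs of points:

```python
import numpy as np

def prox_l1(v, lam):
    """prox of lam*||.||_1: componentwise soft thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

# Verify ||P(x) - P(y)||^2 <= <x - y, P(x) - P(y)> on random pairs.
rng = np.random.default_rng(0)
lam = 0.3
for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    px, py = prox_l1(x, lam), prox_l1(y, lam)
    lhs = np.sum((px - py) ** 2)
    rhs = np.dot(x - y, px - py)
    assert lhs <= rhs + 1e-12  # firm nonexpansiveness (up to roundoff)
print("firm nonexpansiveness held on all random pairs")
```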

Advanced Properties

The proximal operator \prox_{\lambda f} of a proper lower semicontinuous convex function f on a Hilbert space is \frac{1}{2}-averaged, meaning it admits the representation \prox_{\lambda f} = \alpha \Id + (1 - \alpha) T for some nonexpansive operator T and \alpha = \frac{1}{2}, where \Id denotes the identity operator. This property follows from the firmly nonexpansive nature of the proximal operator, which ensures that compositions and convex combinations of proximal operators remain firmly nonexpansive under appropriate conditions, facilitating the analysis of accelerated proximal methods. The \frac{1}{2}-averaged characterization underscores the contraction-like behavior of proximal iterations, enabling convergence guarantees in operator splitting frameworks without requiring strong monotonicity assumptions.

Sequences generated by proximal point iterations, defined as x^{k+1} = \prox_{\lambda f}(x^k) for solving \min f(x), exhibit Fejér monotonicity with respect to the solution set \argmin f. Specifically, for any minimizer p \in \argmin f, the distances satisfy \|x^{k+1} - p\| \leq \|x^k - p\|, so the sequence is nonincreasing in distance to the solution set and bounded, which implies weak convergence to a minimizer in Hilbert spaces. This Fejér property unifies convergence proofs across variants of the proximal point algorithm, including inexact implementations, by leveraging the maximal monotonicity of the subdifferential \partial f.

In product space formulations for parallel evaluation of multiple proximal operators, cyclicity arises through the introduction of a permutation operator that cycles components across iterations, ensuring equivalence to sequential proximal steps. For m convex functions f_1, \dots, f_m, the product space \mathcal{H}^m hosts the operator T = P \circ (\prox_{\lambda f_m} \times \cdots \times \prox_{\lambda f_1}), where P is the cyclic shift permutation; fixed points of T correspond to minimizers of \sum f_i, and the cyclic structure preserves the firmly nonexpansive properties of the individual proximal operators. This cyclicity enables parallel computation while maintaining theoretical convergence rates comparable to serial methods, as the product operator remains averaged.

The proximal operator \prox_{\lambda f} is firmly nonexpansive, satisfying \|\prox_{\lambda f}(x) - \prox_{\lambda f}(y)\|^2 + \|(\Id - \prox_{\lambda f})(x) - (\Id - \prox_{\lambda f})(y)\|^2 \leq \|x - y\|^2 for all x, y; this is equivalent to the resolvent identity \prox_{\lambda f} = (\Id + \lambda \partial f)^{-1} for the maximal monotone operator \lambda \partial f. By Minty's theorem, this characterization establishes a bijection between firmly nonexpansive mappings and resolvents of maximal monotone operators; extensions to firmly quasi-nonexpansive variants broaden its applicability to quasi-convex settings.

The relation to Yosida regularization identifies the operator \Id - \prox_{\lambda f} as \lambda times the Yosida approximation of \partial f, defined as A^\lambda(x) = \frac{1}{\lambda} \left( x - (\Id + \lambda A)^{-1}(x) \right) for the maximal monotone operator A = \partial f; the map A^\lambda is single-valued, Lipschitz continuous, and monotone, and it is defined on all of \mathcal{H}. This connection implies that proximal iterations approximate solutions via smoothed subgradients, with the Yosida operator converging to A pointwise as \lambda \to 0, providing a regularization tool for analyzing asymptotic behavior in nonsmooth optimization.
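The Yosida connection can be made concrete in one dimension. Assuming f(x) = |x| (our own illustrative choice), so that \prox_{\lambda f} is soft thresholding, the map \frac{1}{\lambda}(\Id - \prox_{\lambda f}) reproduces the clipped, Lipschitz-continuous regularization of the sign map, i.e., the derivative of the Huber smoothing of |x|:

```python
import numpy as np

def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def yosida(x, lam):
    """Yosida approximation A^lam(x) = (x - prox_{lam f}(x)) / lam for A = d|x|."""
    return (x - soft_threshold(x, lam)) / lam

x = np.linspace(-2, 2, 9)
lam = 0.5
print(yosida(x, lam))             # saturated "sign" map
print(np.clip(x / lam, -1, 1))    # identical: the Huber derivative
```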

Computation Methods

Closed-Form Solutions

Closed-form solutions for proximal operators exist for several common functions, enabling efficient evaluation within optimization algorithms. These explicit expressions are particularly valuable for functions that arise frequently in signal processing, machine learning, and statistics, such as regularization terms or constraints.

For the indicator function \iota_C of a closed convex set C, the proximal operator is the orthogonal projection onto C: \text{prox}_{\lambda \iota_C}(x) = \proj_C(x) = \arg\min_{z \in C} \|z - x\|_2^2. This holds because the indicator function enforces membership in C without scaling by \lambda, as \iota_C takes only the values 0 and +\infty. Specific projections, such as onto affine sets, half-spaces, boxes, or simplices, admit further closed forms depending on the geometry of C.

The proximal operator of the \ell_1-norm, f(x) = \|x\|_1, is the componentwise soft-thresholding operator: [\text{prox}_{\lambda \| \cdot \|_1}(x)]_i = \sign(x_i) \max(|x_i| - \lambda, 0), \quad i = 1, \dots, n. This operator promotes sparsity by shrinking small components to zero while preserving the sign of larger ones, making it central to lasso regression and compressed sensing.

For the squared \ell_2-norm, f(x) = \frac{1}{2} \|x\|_2^2, the proximal operator performs simple shrinkage: \text{prox}_{\lambda \cdot \frac{1}{2} \|\cdot\|_2^2}(x) = \frac{x}{1 + \lambda}. This scales the input uniformly, reflecting the isotropic nature of the quadratic penalty.

More generally, for functions of the form f(x) = \frac{1}{2} \|Ax - b\|_2^2, where A is a matrix and b a vector, the proximal operator has a closed-form solution involving a linear system solve: \text{prox}_{\lambda \cdot \frac{1}{2} \|A \cdot - b\|_2^2}(x) = (I + \lambda A^\top A)^{-1} (x + \lambda A^\top b). This expression requires inverting, or solving systems with, the symmetric positive definite matrix I + \lambda A^\top A, which is feasible when A has low rank or structured sparsity.

When the function is a separable sum, f(x) = \sum_{i=1}^n f_i(x_i), the proximal operator decomposes componentwise: [\text{prox}_{\lambda f}(x)]_i = \text{prox}_{\lambda f_i}(x_i), \quad i = 1, \dots, n. This separability allows independent computation of each proximal subproblem, facilitating scalability for high-dimensional or block-structured objectives.
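A minimal sketch collecting these closed forms (illustrative code; function names and test data are our own):

```python
import numpy as np

def prox_l1(x, lam):
    """prox of lam*||.||_1: componentwise soft thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_box(x, lo, hi):
    """prox of the indicator of the box [lo, hi]^n: orthogonal projection."""
    return np.clip(x, lo, hi)

def prox_least_squares(x, lam, A, b):
    """prox of lam * 0.5*||A y - b||^2: solve (I + lam A^T A) y = x + lam A^T b."""
    n = A.shape[1]
    return np.linalg.solve(np.eye(n) + lam * A.T @ A, x + lam * A.T @ b)

rng = np.random.default_rng(1)
A, b = rng.normal(size=(8, 4)), rng.normal(size=8)
x = rng.normal(size=4)
print(prox_l1(x, 0.2))
print(prox_box(x, -0.5, 0.5))
print(prox_least_squares(x, 0.7, A, b))
```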

Iterative Algorithms

When closed-form expressions for the proximal operator \prox_{\lambda f}(x) are unavailable, iterative numerical methods can approximate it by solving the underlying convex minimization problem \argmin_z f(z) + \frac{1}{2\lambda} \|z - x\|^2. These approaches exploit the structure of f, such as smoothness or an indicator-function form, to ensure convergence.

For functions f with Lipschitz continuous gradient (constant L), the proximal operator can be computed via a fixed-point iteration derived from the optimality condition. Specifically, the iteration y^{k+1} = x - \lambda \nabla f(y^k) converges to \prox_{\lambda f}(x) for step sizes \lambda < 1/L, since the mapping y \mapsto x - \lambda \nabla f(y) is then a contraction. Accelerated variants, including momentum-based schemes like Nesterov's method applied to the subproblem, improve the convergence rate.

An equivalent perspective uses gradient descent directly on the Moreau envelope e_{\lambda f}, whose gradient satisfies \nabla e_{\lambda f}(x) = \frac{x - \prox_{\lambda f}(x)}{\lambda}. Minimizing the auxiliary objective via gradient steps on \nabla f(z) + \frac{z - x}{\lambda} yields the proximal point, with convergence rate O(1/k) for the objective value under Lipschitz gradient assumptions on f. For strongly convex f, linear convergence is achievable with constant step sizes.

For indicator functions of polyhedral sets, represented as intersections of half-spaces, Dykstra's alternating projection algorithm computes the proximal operator (i.e., the projection) by iteratively projecting onto each constraint while maintaining correction vectors, which ensure convergence to the exact projection rather than merely a feasible point. Introduced for half-spaces, it extends to general closed convex sets and converges linearly under suitable regularity conditions. This method is particularly efficient for high-dimensional polyhedra in signal processing applications.

In stochastic settings where f(x) = \mathbb{E}_\xi [g(x, \xi)] for a random variable \xi, Monte Carlo methods approximate the proximal operator by sampling batches to estimate the expectation in the minimization objective. Stochastic proximal gradient iterations on the sample-averaged problem converge to the true proximal point with rates depending on batch size and variance, often O(1/\sqrt{k}) in expectation. This approach is vital for large-scale machine learning tasks with data-driven regularizers.
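The fixed-point scheme for smooth f can be checked against the least-squares closed form given earlier. A sketch, assuming f(y) = \frac{1}{2}\|Ay - b\|_2^2 with randomly generated data (all names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
A, b = rng.normal(size=(10, 4)), rng.normal(size=10)
grad_f = lambda y: A.T @ (A @ y - b)
L = np.linalg.norm(A.T @ A, 2)          # Lipschitz constant of grad_f

x = rng.normal(size=4)
lam = 0.9 / L                            # lam < 1/L makes the map a contraction

# Fixed-point iteration y <- x - lam * grad_f(y) converges to prox_{lam f}(x).
y = x.copy()
for _ in range(500):
    y = x - lam * grad_f(y)

# Compare with the closed-form prox: (I + lam A^T A)^{-1} (x + lam A^T b).
exact = np.linalg.solve(np.eye(4) + lam * A.T @ A, x + lam * A.T @ b)
print(np.linalg.norm(y - exact))         # ~0: the iteration found the prox
```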

Applications in Optimization

Proximal Gradient Methods

Proximal gradient methods address the minimization of composite objective functions \min_x f(x) + g(x), where f is convex and smooth with a Lipschitz continuous gradient, and g is convex and possibly nonsmooth but admits an efficient proximal operator. These methods extend gradient descent by incorporating a proximal step to handle the nonsmooth component g. The core iteration performs a gradient step on f followed by a proximal mapping on g, yielding the update x^{k+1} = \prox_{\lambda g}(x^k - \lambda \nabla f(x^k)), where \lambda > 0 is the step size.

Under the assumption that \nabla f is Lipschitz continuous with constant L, a fixed step size \lambda \in (0, 1/L] ensures monotonic decrease in the objective and convergence of the iterates to a minimizer. For convex f and g, the method achieves an O(1/k) convergence rate in function value, meaning f(x^k) + g(x^k) - f(x^*) - g(x^*) \leq O(1/k), where x^* is an optimal point; backtracking line search can adaptively select \lambda if L is unknown.

An accelerated variant, known as FISTA (fast iterative shrinkage-thresholding algorithm), incorporates momentum via an extrapolation step y^k = x^k + \theta_k (x^k - x^{k-1}) with \theta_k = k/(k+3), before applying the proximal gradient update at y^k. This achieves an optimal O(1/k^2) convergence rate for convex problems, matching the lower bound for first-order methods, and supports backtracking to estimate the Lipschitz constant dynamically.

For nonconvex settings, where f may lack convexity but retains a Lipschitz continuous gradient, the inertial proximal method introduces an inertial term to promote faster exploration of the search space. The update becomes x^{k+1} = \prox_{\lambda g}(x^k + \beta (x^k - x^{k-1}) - \lambda \nabla f(x^k)), with momentum parameter \beta \in (0,1); under growth conditions on the objective, this converges to a critical point at which the proximal gradient residual vanishes.

Error bounds in proximal gradient methods often rely on the fixed-point residual r(x) = x - \prox_{\lambda g}(x - \lambda \nabla f(x)), which measures deviation from stationarity: \|r(x)\| / \lambda bounds the norm of an element of the subdifferential of the objective. This residual serves as a practical stopping criterion, halting iterations when \|r(x^k)\| \leq \epsilon for a tolerance \epsilon > 0, ensuring approximate optimality in both convex and nonconvex cases.
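A compact sketch of the basic iteration (ISTA) and its FISTA acceleration on a small lasso instance (illustrative; problem data are random and the momentum rule \theta_k = k/(k+3) follows the text):

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(3)
A, b, alpha = rng.normal(size=(40, 20)), rng.normal(size=40), 0.5
lam = 1.0 / np.linalg.norm(A.T @ A, 2)   # step size 1/L

# ISTA: gradient step on f(x) = 0.5*||Ax - b||^2, prox step on alpha*||x||_1.
x = np.zeros(20)
for _ in range(300):
    x = soft_threshold(x - lam * A.T @ (A @ x - b), lam * alpha)

# FISTA: same prox-gradient update, applied at the extrapolated point y^k.
xf, x_prev = np.zeros(20), np.zeros(20)
for k in range(300):
    y = xf + (k / (k + 3)) * (xf - x_prev)
    x_prev = xf
    xf = soft_threshold(y - lam * A.T @ (A @ y - b), lam * alpha)

obj = lambda z: 0.5 * np.sum((A @ z - b) ** 2) + alpha * np.sum(np.abs(z))
print(obj(x), obj(xf))   # FISTA typically attains a lower value in fewer steps
```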

Splitting Techniques

Splitting techniques in proximal optimization decompose complex problems into simpler subproblems solved via multiple evaluations of proximal operators, enabling the handling of separable or structured objectives such as \min_x f(x) + g(Ax), where f and g are convex functions and A is a linear operator. These methods leverage the firm nonexpansiveness of proximal operators to ensure convergence through fixed-point iterations.

The Douglas-Rachford splitting method addresses inclusions of the form 0 \in A(x) + B(x), where A and B are maximal monotone operators, by iterating on their resolvents, which coincide with proximal operators when the operators are subdifferentials of convex functions. For the problem \min_x f(x) + g(Ax), the method employs the reflected resolvent R_{\gamma B} = 2J_{\gamma B} - I, where J_{\gamma B} = (I + \gamma B)^{-1} = \prox_{\gamma f} if B = \partial f, and similarly for A. The iteration is given by \begin{aligned} z^{k+1} &= \frac{1}{2} \left( I + R_{\gamma A} R_{\gamma B} \right) z^k, \\ x^{k+1} &= J_{\gamma B} (z^{k+1}), \end{aligned} or equivalently in reflected form, x^{k+1} = \prox_{\gamma f} \left( 2 \prox_{\gamma g \circ A} (z^k) - z^k \right), starting from an initial z^0. Convergence to a point satisfying the optimality conditions holds under qualification conditions, such as the sum A + B being maximal monotone.

The alternating direction method of multipliers (ADMM) extends splitting to constrained problems of the form \min_{x,z} f(x) + g(z) subject to Ax + Bz = c, using an augmented Lagrangian framework with dual updates. In its scaled form, the iterations alternate proximal-type steps: \begin{aligned} x^{k+1} &= \arg\min_x \left\{ f(x) + \frac{\rho}{2} \|Ax + Bz^k - c + u^k\|_2^2 \right\}, \\ z^{k+1} &= \arg\min_z \left\{ g(z) + \frac{\rho}{2} \|Ax^{k+1} + Bz - c + u^k\|_2^2 \right\}, \\ u^{k+1} &= u^k + Ax^{k+1} + Bz^{k+1} - c, \end{aligned} where \rho > 0 is the penalty parameter and u is the scaled dual variable. When A = I and B = -I, the updates simplify to direct proximal evaluations of f and g. ADMM is equivalent to Douglas-Rachford splitting applied to the dual problem.

Forward-backward splitting serves as a special case of these techniques when one function is smooth, reducing to a single proximal evaluation per iteration combined with a gradient step, as in proximal gradient methods for \min_x f_1(x) + f_2(x) with \nabla f_2 Lipschitz continuous. The iteration is x^{k+1} = \prox_{\gamma f_1} (x^k - \gamma \nabla f_2(x^k)) for step size \gamma \in (0, 2/L), where L is the Lipschitz constant of \nabla f_2.

Convergence theory for these splitting methods relies on the nonexpansiveness of the composed operators, yielding ergodic rates of O(1/k) for the objective value and residuals under convexity and qualification assumptions, such as the existence of a saddle point of the augmented Lagrangian in ADMM. Nonergodic linear convergence holds for strongly convex objectives, analyzed via Lyapunov functions bounding iterate distances and residuals.

Parallelizable variants enhance scalability for high-dimensional problems by decomposing objectives into separable blocks, allowing simultaneous proximal computations across coordinates or subproblems. For instance, in multi-block ADMM, updates for independent blocks of x and z can be parallelized, achieving o(1/k) convergence under relaxed conditions on the penalty parameters. Douglas-Rachford extensions similarly permit resolvent evaluations in product spaces for structured constraints.
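For the special case A = I, B = -I, c = 0, ADMM reduces to alternating closed-form proximal steps. A minimal sketch on a lasso instance in this consensus form, with f(x) = \frac{1}{2}\|Ax - b\|^2 and g = \alpha\|\cdot\|_1 (toy data and names are our own):

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(4)
A, b, alpha, rho = rng.normal(size=(40, 20)), rng.normal(size=40), 0.5, 1.0

n = A.shape[1]
x = z = u = np.zeros(n)
M = np.linalg.inv(A.T @ A + rho * np.eye(n))   # cached factor for the x-update
for _ in range(200):
    x = M @ (A.T @ b + rho * (z - u))          # prox-type step on f
    z = soft_threshold(x + u, alpha / rho)     # prox step on g
    u = u + x - z                              # scaled dual update
print(np.linalg.norm(x - z))                   # primal residual ~ 0 at consensus
```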

Generalizations

Nonconvex Extensions

In the nonconvex setting, the proximal operator of a function f at a point x with parameter \lambda > 0, denoted \prox_{\lambda f}(x), is defined as a local minimizer (or, more generally, a critical point) of the function y \mapsto f(y) + \frac{1}{2\lambda} \|y - x\|^2, since a global argmin may fail to exist or to be unique due to the lack of convexity. This local characterization relies on concepts like prox-regularity, which ensures that the proximal mapping is well-defined and single-valued near stationary points for prox-bounded functions, the proximal point p then satisfying x - p \in \lambda \partial f(p), where \partial denotes an appropriate subdifferential.

Convergence guarantees for algorithms relying on nonconvex proximal operators often invoke the Kurdyka-Łojasiewicz (KL) property, a desingularizing inequality that controls the geometry of the objective near critical points and enables sequences generated by proximal methods to converge to critical points. Specifically, for structured nonconvex functions decomposable as L(x,y) = f(x) + Q(x,y) + g(y), alternating proximal minimization algorithms, in which updates solve proximal subproblems like x^{k+1} \in \argmin_u \{ L(u, y^k) + \frac{\lambda_k}{2} \|u - x^k\|^2 \}, produce bounded sequences that converge to critical points if L satisfies the KL inequality at those points, with rates ranging from finite termination to sublinear depending on the KL exponent \theta \in [0,1).

In nonconvex and nondifferentiable cases, the convex subdifferential is replaced by the Clarke subdifferential to characterize critical points of the proximal mapping, particularly for locally Lipschitz functions, where the proximal point p satisfies 0 \in \partial_C \left( f + \frac{1}{2\lambda} \|\cdot - x\|^2 \right)(p), with \partial_C denoting the Clarke generalized subdifferential. This extension preserves some variational properties but requires additional regularity assumptions, such as lower-C^2 structure, to ensure computational tractability via algorithms that approximate proximal points using subgradient oracles.

A prominent example of nonconvex proximal operators arises in difference-of-convex (DC) programming, where f = g - h with g and h proper lower semicontinuous convex functions; the DC algorithm (DCA) computes the proximal operator iteratively by linearizing the concave part -h and solving the convex subproblem x^{k+1} = \prox_{\lambda g}(x^k + \lambda y^k) with y^k \in \partial h(x^k), often yielding stationary points, or global minima for polyhedral DC structures.

Nonconvex proximal operators exhibit stability challenges, including the existence of multiple local minima in the proximal subproblem, which can lead to different fixed points corresponding to distinct stationary solutions of the original optimization, and high sensitivity to initialization that may trap algorithms in suboptimal basins. These issues are mitigated in some learned proximal methods by over-parameterization and multiple random starts, ensuring robust discovery of diverse local optima across related problems.
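A simple nonconvex proximal operator that is nonetheless computable in closed form (an illustrative example of the multi-valuedness discussed above): for f(y) = \|y\|_0, the proximal subproblem under the definition used here separates per coordinate, and its solution is hard thresholding, which is set-valued exactly at the threshold |v_i| = \sqrt{2\lambda}:

```python
import numpy as np

def prox_l0(v, lam):
    """One selection of prox_{lam f} for f = ||.||_0 under the convention
    argmin_y ||y||_0 + (1/(2*lam))*||y - v||^2: keep |v_i| > sqrt(2*lam)."""
    out = v.copy()
    out[np.abs(v) <= np.sqrt(2 * lam)] = 0.0
    return out

v = np.array([0.2, -0.9, 1.5, -0.4])
print(prox_l0(v, 0.1))   # threshold sqrt(0.2) ~ 0.447: keeps -0.9 and 1.5
```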

Stochastic and Distributed Variants

In stochastic settings, the proximal operator is adapted to handle noisy or subsampled gradients, particularly for large-scale optimization problems where evaluating the full gradient is computationally prohibitive. The stochastic proximal gradient method approximates the deterministic update by using a mini-batch stochastic gradient \nabla f_S(x), where S is a random subset of the data, leading to the iteration x^{k+1} = \prox_{\lambda g}(x^k - \lambda \nabla f_S(x^k)). This approach reduces computational cost while introducing variance in the gradient estimate, which can be mitigated through variance reduction techniques. For instance, the SAGA method maintains a table of past gradients and uses a variance-reduced estimator that subtracts the average of previously computed gradients, achieving linear convergence for strongly convex objectives in finite-sum settings. Extensions like ProxSVRG incorporate variance reduction into the proximal step for nonsmooth nonconvex problems, improving upon plain stochastic proximal gradient by periodically computing full gradients to correct bias.

Distributed variants extend proximal operators across multiple nodes or agents, enabling parallel computation for massive datasets. In consensus-based methods, such as proximal dual consensus ADMM, each node i computes a local proximal update \prox_{\lambda g_i}(z_i - \lambda \nabla f_i(z_i)) based on its private data, followed by averaging the results over the network to enforce global consensus: z^{k+1} = \frac{1}{N} \sum_i z_i^{k+1}. This formulation leverages the separability of the objective and is particularly effective for multi-agent systems where communication is limited to local exchanges. The method achieves an ergodic rate of O(1/k) under convexity and network connectivity assumptions, making it suitable for distributed optimization tasks.

Asynchronous variants address delays in parallel environments by allowing nodes to perform proximal updates using outdated information from other nodes. In randomized dual proximal algorithms, updates are performed asynchronously with respect to a shared variable, with each node applying a proximal step on its local function and communicating sporadically, tolerating bounded delays without synchronizing all nodes at each iteration. This reduces communication overhead and wall-clock time in clusters, with guaranteed convergence under mild delay conditions.

Convergence analysis for these stochastic and distributed proximal methods typically establishes almost sure convergence to stationary points under diminishing step sizes and unbiased gradient estimates. For nonconvex stochastic settings, the expected norm of the gradient decreases at a rate of O(1/\sqrt{K}) after K iterations, matching the complexity of stochastic gradient descent while handling nonsmooth regularizers via the proximal operator. In distributed cases, rates improve to O(1/K) for convex objectives with variance reduction, provided the network graph is connected and the step sizes are appropriately tuned.

Applications of proximal operators in federated learning emphasize privacy preservation by incorporating local regularization terms that prevent excessive drift from a global model. In FedProx, each client solves a proximal subproblem with an added quadratic term \frac{\mu}{2} \|x - w^k\|^2, where w^k is the previous global iterate, making each local update a proximal step on the client objective and mitigating data heterogeneity without sharing raw data. This approach enhances convergence in non-IID settings and supports differential privacy by bounding parameter updates, as demonstrated in heterogeneous network simulations.
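A minimal stochastic proximal gradient sketch (a toy finite-sum lasso-type objective; all data and parameter choices are ours) showing the mini-batch gradient step followed by the proximal step:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(5)
m, n, alpha = 500, 20, 0.1
A, b = rng.normal(size=(m, n)), rng.normal(size=m)

# Objective: (1/m) * sum_i 0.5*(a_i^T x - b_i)^2 + alpha*||x||_1
x = np.zeros(n)
batch = 32
for k in range(1, 2001):
    S = rng.choice(m, size=batch, replace=False)       # random mini-batch
    grad = A[S].T @ (A[S] @ x - b[S]) / batch          # unbiased gradient estimate
    step = 0.1 / np.sqrt(k)                            # diminishing step size
    x = soft_threshold(x - step * grad, step * alpha)  # proximal step on the l1 term

print(np.count_nonzero(x), "nonzero coefficients")
```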
