Iteratively reweighted least squares
Iteratively reweighted least squares (IRLS) is an iterative optimization algorithm that solves a sequence of weighted least squares problems, updating the observation weights from the residuals of the previous iteration, primarily to achieve robust parameter estimation in the presence of outliers or to minimize objectives such as \ell_p norms with p < 2.[1][2] Popularized through M-estimation of location and regression parameters, IRLS limits the influence of anomalous data points by assigning lower weights to observations with larger residuals, converging to solutions that approximate maximum likelihood estimates under heavy-tailed error distributions.[3]
The algorithm begins with an initial estimate of the parameters, often from ordinary least squares, and computes residuals r_i = y_i - x_i^T \hat{\beta}. Weights w_i are then derived from a robust loss function's derivative, such as Huber's \psi-function where w(r_i) = \min(1, k / |r_i|) for a tuning constant k, forming a diagonal weight matrix W. A weighted least squares step follows: \hat{\beta}^{(k+1)} = (X^T W X)^{-1} X^T W y, with iterations continuing until the parameter changes fall below a convergence threshold, typically ensuring scale invariance via a robust scale estimate like the median absolute deviation.[1][2] Common weighting schemes include Huber's, Tukey's biweight for bounded influence, and Andrews' sine function, each tailored to control outlier sensitivity.[1]
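To make the weighting step concrete, the following is a minimal NumPy sketch of one Huber-weighted IRLS update (the function names, the tuning constant, and the MAD-based scale estimate are illustrative choices, not a reference implementation):

import numpy as np

def huber_weights(r, k=1.345):
    # Huber weights w(r) = min(1, k/|r|); k = 1.345 is a common tuning constant.
    a = np.abs(r)
    return np.where(a <= k, 1.0, k / np.maximum(a, 1e-12))

def irls_step(X, y, beta, k=1.345):
    # One reweighting step: residuals -> robust scale -> weights -> weighted least squares.
    r = y - X @ beta
    s = np.median(np.abs(r - np.median(r))) / 0.6745   # MAD-based scale (assumes nonzero MAD)
    w = huber_weights(r / s, k)
    Xw = X * w[:, None]                                # row-scale X by the weights
    return np.linalg.solve(X.T @ Xw, X.T @ (w * y))    # solves (X^T W X) beta = X^T W y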
IRLS originated from Peter J. Huber's 1964 work on robust estimation of location parameters under contamination models, where it provides asymptotically efficient estimators balancing bias and variance robustness.[3] It extends to linear regression for M-estimation, offering computational efficiency over direct minimization of non-quadratic losses, and has been analyzed for local convergence properties under conditions like positive definiteness of the Hessian.[4] Beyond robustness, IRLS implements Fisher scoring for maximum likelihood in generalized linear models (GLMs), such as logistic or Poisson regression, by reparameterizing the score equations into weighted least squares form.[5]
Applications range from robust linear models, where IRLS mitigates the effect of outliers and leverage points in economic or environmental data, to sparse signal recovery via \ell_1-norm minimization (e.g., basis pursuit) and image processing tasks such as deconvolution.[6][7] Recent extensions include globally convergent variants for high-dimensional robust regression under adversarial noise, ensuring model recovery guarantees.[8] Despite its widespread use, IRLS can suffer from non-convergence in ill-conditioned problems or require careful initialization, prompting hybrid methods with regularization.[6]
Prerequisites
Ordinary Least Squares
Ordinary least squares (OLS) is a fundamental method in linear regression that estimates the parameters of a linear model by minimizing the sum of the squared residuals between observed and predicted values. In matrix notation, for a linear system Ax = b where A is an m \times n design matrix, x is the n \times 1 parameter vector, and b is the m \times 1 observation vector, OLS solves the optimization problem \min_x \|Ax - b\|_2^2 = \sum_{i=1}^m (a_i^T x - b_i)^2, where a_i^T denotes the i-th row of A.[9] This objective function penalizes larger deviations more heavily due to the quadratic term, leading to a solution that provides the best linear unbiased estimator under certain conditions.[9]
Assuming A has full column rank (i.e., A^T A is invertible), the closed-form solution for the OLS estimator is given by the normal equations: x = (A^T A)^{-1} A^T b.[9] This explicit formula allows for direct computation without iterative procedures, making OLS computationally efficient for well-posed problems. Geometrically, the OLS solution corresponds to the orthogonal projection of the vector b onto the column space of A, where the residual vector b - Ax is perpendicular to every column of A, minimizing the Euclidean distance in the least-squares sense.[10]
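As an illustration, a short NumPy sketch comparing the normal-equations formula with a factorization-based solver (synthetic data; in practice np.linalg.lstsq is preferred because explicitly forming A^T A squares the condition number):

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))                      # design matrix, m = 100, n = 3
x_true = np.array([2.0, -1.0, 0.5])
b = A @ x_true + 0.1 * rng.normal(size=100)        # noisy observations

x_normal = np.linalg.solve(A.T @ A, A.T @ b)       # normal equations (A^T A) x = A^T b
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)    # SVD-based solver; numerically safer
# x_normal and x_lstsq agree to rounding error for this well-conditioned A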
OLS relies on several key assumptions for its validity: the underlying model must be linear in the parameters, the errors should be homoscedastic (constant variance across all levels of the independent variables), and the data should be free from significant outliers that could unduly influence the fit.[11] Violations of homoscedasticity, for instance, can be addressed through extensions like weighted least squares. Historically, the method was independently developed by Adrien-Marie Legendre in 1805 and Carl Friedrich Gauss, who claimed prior invention around 1799, primarily for fitting astronomical data to orbits.[12]
Weighted Least Squares
Weighted least squares (WLS) is a generalization of ordinary least squares that accounts for heteroscedasticity in the errors or unequal reliability of observations by incorporating a weighting matrix into the objective function.[13] This approach minimizes the weighted sum of squared residuals, thereby giving greater influence to more precise data points and reducing the impact of outliers or less reliable measurements.[13] As a special case, when all weights are equal, WLS reduces to ordinary least squares.[14]
The formulation of WLS involves solving the optimization problem
\min_x (Ax - b)^T W (Ax - b),
where A is the design matrix, b is the response vector, and W is a positive definite diagonal matrix with positive weights w_i > 0 on the diagonal.[13] These weights typically reflect the inverse of the error variances for each observation, ensuring that observations with smaller variances contribute more to the estimate.[15]
The closed-form solution to this problem is given by
x = (A^T W A)^{-1} A^T W b,
which can be computed directly assuming A^T W A is invertible.[13] This expression parallels the ordinary least squares solution but incorporates the weighting to adjust for varying precisions.[14]
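A minimal sketch of this solution in NumPy, assuming a vector w of positive weights (for example, inverse error variances):

import numpy as np

def wls(A, b, w):
    # Weighted least squares: x = (A^T W A)^{-1} A^T W b with W = diag(w).
    Aw = A * w[:, None]                    # scale each row of A by its weight
    return np.linalg.solve(A.T @ Aw, A.T @ (w * b))
    # Equivalently: ordinary least squares applied to sqrt(w)*A and sqrt(w)*b.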
In interpretation, WLS downweights unreliable observations by assigning smaller w_i to those with higher expected error variance, effectively standardizing the residuals by their standard deviations to achieve homoscedasticity in the transformed problem.[15] This makes the estimator more efficient under the assumption of known or estimated variances.[13]
The weights w_i are often determined from prior knowledge of the data collection process, such as sample sizes or known variances in experimental designs, or estimated iteratively from initial residuals to approximate the error structure.[13]
WLS corresponds to the diagonal case of generalized least squares (GLS), where the full covariance matrix of errors is diagonal, focusing on uncorrelated but heteroscedastic errors; GLS extends this to non-diagonal covariances for correlated errors.[14]
The IRLS Algorithm
The iteratively reweighted least squares (IRLS) method addresses optimization problems in the framework of linear regression, where the objective is to estimate a parameter vector x \in \mathbb{R}^p such that Ax \approx b, with A an n \times p design matrix (n \geq p) whose rows a_i^T represent feature vectors, and b \in \mathbb{R}^n the response vector.[16]
The general problem formulation seeks to solve \min_x \sum_{i=1}^n \rho(r_i), where the residuals are defined as r_i = a_i^T x - b_i and \rho: \mathbb{R} \to [0, \infty) is a convex loss function that promotes desirable statistical properties.[17] Common choices for \rho include the Huber loss function, which behaves quadratically for small |r_i| to retain efficiency under Gaussian noise and linearly for large |r_i| to reduce outlier sensitivity, or \rho(r) = |r|^p / p for 1 \leq p < 2 to enforce sparsity or other regularization effects.[3]
A specific instance of this formulation is the \ell_p-norm minimization problem \min_x \|Ax - b\|_p^p = \sum_{i=1}^n |a_i^T x - b_i|^p for 1 \leq p < 2, which generalizes ordinary least squares (corresponding to p=2) but introduces non-smoothness at zero for p < 2, complicating direct optimization. IRLS circumvents this difficulty by successively majorizing the non-smooth objective with a series of weighted \ell_2-norm subproblems, leveraging the smoothness of quadratic approximations.
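The majorization can be made explicit. Since t \mapsto t^{p/2} is concave for p \leq 2, a first-order expansion of (r^2)^{p/2} around r_0^2 gives, for any r_0 \neq 0,
|r|^p \leq |r_0|^p + \frac{p}{2} |r_0|^{p-2} (r^2 - r_0^2),
with equality at r = r_0. Summing over the residuals shows that, up to an additive constant and a factor p/2, the weighted \ell_2 objective with weights w_i = |r_i^{(k-1)}|^{p-2} lies above the \ell_p objective and touches it at the current iterate, so each exact IRLS step cannot increase \|Ax - b\|_p^p.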
This setup is closely tied to M-estimation in robust statistics, where \rho acts as a robust loss function to achieve estimators that are less sensitive to deviations from ideal model assumptions, such as heavy-tailed errors or contamination.[17] Each IRLS iteration solves a weighted least squares problem to update the estimate.[16]
Step-by-Step Procedure
The iteratively reweighted least squares (IRLS) algorithm proceeds by solving a sequence of weighted least squares subproblems to approximate the solution to the original optimization objective.[18][19] The process begins with initialization: select an initial estimate x^{(0)}, often the ordinary least squares (OLS) solution x^{(0)} = (A^T A)^{-1} A^T b for the linear model b = A x + \epsilon, where A is the design matrix with rows a_i^T and b is the response vector. Compute initial residuals r_i^{(0)} = a_i^T x^{(0)} - b_i for each observation i = 1, \dots, n.[16][18]
At each iteration k \geq 1, update the weights based on the previous residuals. For \ell_p-norm minimization with 1 \leq p < 2, set w_i^{(k)} = |r_i^{(k-1)}|^{p-2}; since p - 2 < 0 this weight is undefined (and unbounded) as r_i^{(k-1)} \to 0, so a stabilized variant caps it, w_i^{(k)} = 1 / \max(|r_i^{(k-1)}|^{2-p}, \epsilon), with \epsilon > 0 small (e.g., 10^{-6}).[19] For a general convex loss function \sum_i \rho(|r_i|) where \rho is even and increasing for r \geq 0, with \psi = \rho', set w_i^{(k)} = \psi(|r_i^{(k-1)}|) / |r_i^{(k-1)}| for r_i^{(k-1)} \neq 0, again capping the weight for small residuals.[18] Form the diagonal weight matrix W^{(k)} = \operatorname{diag}(w_1^{(k)}, \dots, w_n^{(k)}).
Next, solve the weighted least squares subproblem: compute the updated estimate x^{(k)} = (A^T W^{(k)} A)^{-1} A^T W^{(k)} b, which minimizes \| \sqrt{W^{(k)}} (A x - b) \|_2^2. This step leverages standard weighted least squares solvers, such as QR decomposition or Cholesky factorization for efficiency.[16][19]
Update the residuals r_i^{(k)} = a_i^T x^{(k)} - b_i. Check the stopping criterion: terminate if \| x^{(k)} - x^{(k-1)} \|_2 < \delta for a small tolerance \delta > 0 (e.g., 10^{-6}) or if a maximum number of iterations (e.g., 100) is reached; otherwise, proceed to the next iteration.[18][19]
The following pseudocode summarizes the algorithm for the \ell_p-norm case:
Algorithm IRLS for ℓ_p-norm minimization (1 ≤ p < 2)
Input: design matrix A ∈ ℝ^{n×p}, response b ∈ ℝ^n, exponent p ∈ [1, 2), small ε > 0, tolerance δ > 0, max_iter
Output: approximate solution x*
1. Compute x^{(0)} = (A^T A)^{-1} A^T b    // OLS solution
2. r^{(0)} = A x^{(0)} - b
3. k = 0
4. while k < max_iter and (k = 0 or ||x^{(k)} - x^{(k-1)}||_2 ≥ δ):
5.     for i = 1 to n:
6.         w_i^{(k+1)} = 1 / max(|r_i^{(k)}|^{2-p}, ε)
7.     W^{(k+1)} = diag(w_1^{(k+1)}, ..., w_n^{(k+1)})
8.     x^{(k+1)} = (A^T W^{(k+1)} A)^{-1} A^T W^{(k+1)} b
9.     r^{(k+1)} = A x^{(k+1)} - b
10.    k = k + 1
11. x* = x^{(k)}
This procedure iteratively downweights observations with large residuals, promoting robustness or sparsity depending on the choice of weights.[18][19]
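A direct NumPy transcription of this procedure might look as follows (a sketch under the same assumptions as the pseudocode; the explicit normal-equations solve should be replaced by a QR- or Cholesky-based weighted solve when A is ill-conditioned):

import numpy as np

def irls_lp(A, b, p=1.5, eps=1e-6, delta=1e-6, max_iter=100):
    # IRLS for min_x ||Ax - b||_p^p with 1 <= p < 2, following the pseudocode above.
    x = np.linalg.solve(A.T @ A, A.T @ b)                 # OLS initialization
    for _ in range(max_iter):
        r = A @ x - b
        w = 1.0 / np.maximum(np.abs(r) ** (2 - p), eps)   # regularized weights
        Aw = A * w[:, None]
        x_new = np.linalg.solve(A.T @ Aw, A.T @ (w * b))  # weighted least squares step
        if np.linalg.norm(x_new - x) < delta:             # stopping criterion
            return x_new
        x = x_new
    return x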
Mathematical Properties
Convergence Conditions
The iteratively reweighted least squares (IRLS) algorithm exhibits local convergence near the solution when viewed as a fixed-point iteration or a variant of Newton's method, particularly for strictly convex penalty functions \rho. For 1 < p < 2, local convergence is superlinear under conditions such as the restricted isometry property (RIP) of the measurement matrix A, ensuring the weights remain well-behaved and the Hessian is positive definite.[20]
Globally, for 1 < p < 2, IRLS decreases the objective function monotonically and converges to the unique minimizer provided A has full column rank and the weights remain bounded away from zero. In the sparse recovery setting, global convergence holds under the \ell_p-null space property (NSP) of order s with constant \rho_s < 1/2, yielding a linear rate from arbitrary initialization.[21]
For p = 1, standard IRLS does not always converge to the exact \ell_1 minimizer due to non-differentiability at zero, but it approximates solutions via reweighted \ell_2 minimization; convergence guarantees require smoothing (e.g., \epsilon-approximation of the norm) or proximal regularization terms to ensure stability. With such modifications, IRLS converges exponentially to a neighborhood of the stationary point under RIP conditions of order 2s, though the limit may depend on initialization without additional thresholding.[22][23]
Key convergence conditions include residuals bounded away from zero to avoid ill-conditioned weights, positive definiteness of A^T W A (where W is the diagonal weight matrix), and the penalty \rho being majorized by quadratic approximations in each iteration. For sparse recovery, the matrix A must satisfy NSP or RIP to prevent non-unique solutions.[21][20]
Recent analyses post-2020 have established linear convergence rates for IRLS under deterministic noise assumptions, such as inlier-outlier separation and spectral dominance of the data matrix, particularly with dynamic smoothing regularization for \ell_1-minimization over subspaces. These results provide global linear rates with contraction factor c < 1, requiring O(\log(1/\delta)) iterations for \delta-accuracy.[24][23]
Failure modes include oscillation or divergence for p \leq 1 without safeguards like smoothing or bounded weights, especially when residuals hit zero or the matrix lacks RIP, leading to non-convergence to the global minimizer.[22][20]
Computational Complexity
The per-iteration computational cost of the IRLS algorithm consists of two main components: solving the weighted least squares subproblem and updating the diagonal weight matrix W. The weighted least squares step typically involves forming and factorizing the normal equations (A^\top W A) \beta = A^\top W y, where A is the m \times n design matrix, y is the response vector, and \beta is the parameter vector. For dense matrices, Cholesky factorization achieves this in O(n^3) time, assuming n \ll m. For sparse or structured A, preconditioned conjugate gradient methods reduce the cost to approximately O(n^2) per inner iteration, often requiring 10-30 CG steps for convergence. The weight update, which applies a nonlinear function (e.g., w_i = 1/|r_i|^{2-p} for L_p norms) element-wise to the residuals r = y - A\beta, incurs O(m) complexity.
The overall complexity of IRLS is O(K n^3) for direct solves or O(K n^2) with iterative methods, where K is the number of outer iterations until convergence. In practice, K is small, typically ranging from 10 to 50 for many robust regression and L_p-norm problems, though it can reach 60-80 for high-precision L_p regression with p > 2. This modest iteration count stems from the algorithm's local quadratic convergence under suitable conditions, making IRLS efficient for moderate-scale problems.
Scalability challenges arise in high-dimensional settings (large n) or massive datasets (large m), where O(n^3) factorization becomes intractable. Solutions include preconditioning the conjugate gradient solver with incomplete Cholesky or diagonal approximations to accelerate convergence, or using approximate weighted least squares via randomized sketching. Weight updates are inherently parallelizable across the m observations, enabling efficient distribution on multi-core systems or GPUs.
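For illustration, a matrix-free sketch of the weighted least squares step using conjugate gradients (no preconditioner shown; assumes SciPy, positive weights, and full column rank so that A^T W A is symmetric positive definite):

import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def wls_step_cg(A, b, w, x0=None):
    # Solve (A^T W A) x = A^T W b without forming A^T W A explicitly.
    n = A.shape[1]
    op = LinearOperator((n, n), matvec=lambda v: A.T @ (w * (A @ v)))
    rhs = A.T @ (w * b)
    x, info = cg(op, rhs, x0=x0)          # info == 0 indicates convergence
    return x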
Compared to interior-point methods for L_p-norm linear regression, which incur O(n^{3.5} \log(1/\epsilon)) time due to \tilde{O}(\sqrt{n}) iterations each costing O(n^3), standard IRLS is often 10-50 times faster in practice for achieving high accuracy (\epsilon = 10^{-8}). However, for exact L_1 minimization, specialized simplex-based linear programming solvers can outperform IRLS by exploiting the linear program's structure, though IRLS provides smoother approximations for general L_p with 1 < p < 2.
Modern optimizations in the 2020s have focused on provably convergent IRLS variants for large-scale L_p regression, achieving total runtimes of \tilde{O}(n^\omega) with \omega \approx 2.37 using fast matrix inversion techniques, significantly outperforming prior methods for big data applications. Despite its straightforward implementation, IRLS remains sensitive to the conditioning of A^\top W A, which can degrade performance without regularization or damping in ill-conditioned cases.
Applications and Examples
L1 Minimization in Sparse Recovery
In sparse recovery, particularly within the framework of compressed sensing, iteratively reweighted least squares (IRLS) is applied to solve L1-norm minimization problems that promote sparsity in underdetermined linear systems. The core problem is to recover a sparse signal x \in \mathbb{R}^N from measurements b = A x + e, where A is an m \times N measurement matrix with m \ll N and e represents noise. This is formulated as the basis pursuit problem: \min_x \|x\|_1 subject to A x = b, which exactly recovers sparse signals under suitable conditions on A. In the noisy case, an unconstrained variant approximates the LASSO objective: \min_x \|A x - b\|_2^2 + \lambda \|x\|_1, where \lambda > 0 balances fidelity and sparsity.
IRLS adapts to these L1 problems by iteratively solving weighted least squares subproblems that approximate the L1 norm through a sequence of L2 minimizations. At iteration k, the weights are updated as w_j^{(k)} = \frac{1}{|x_j^{(k-1)}| + \epsilon}, where \epsilon > 0 is a small regularization parameter that avoids division by zero and promotes exact zeros. The update solves x^{(k)} = \arg\min_x \|A x - b\|_2^2 + \lambda \sum_j w_j^{(k)} x_j^2, a ridge-type problem with the closed-form solution x^{(k)} = (A^T A + \lambda \operatorname{diag}(w^{(k)}))^{-1} A^T b. This reweighting iteratively shrinks small coefficients toward zero, converging to an L1 minimizer that favors sparse solutions.[25]
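A minimal sketch of this reweighting scheme for the penalized (LASSO-like) formulation stated above; the regularization parameter lam and the initialization are illustrative choices:

import numpy as np

def irls_sparse(A, b, lam=0.1, eps=1e-3, max_iter=50, delta=1e-6):
    # Reweighted ridge subproblems approximating min_x ||Ax - b||_2^2 + lam * ||x||_1.
    AtA, Atb = A.T @ A, A.T @ b
    x = Atb.copy()                                     # simple starting point
    for _ in range(max_iter):
        w = 1.0 / (np.abs(x) + eps)                    # coordinate weights
        x_new = np.linalg.solve(AtA + lam * np.diag(w), Atb)
        if np.linalg.norm(x_new - x) < delta:
            return x_new
        x = x_new
    return x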
A canonical example arises in compressive sensing, where A is a random matrix (e.g., Gaussian or partial Fourier) satisfying the restricted isometry property (RIP), and b = A s + e with s K-sparse (at most K nonzeros). IRLS recovers s by emphasizing the sparse support, outperforming L2 minimization (which favors small diffuse coefficients) in identifying the exact nonzero locations. Under RIP of order 2K with constant \delta_{2K} < \sqrt{2} - 1, IRLS achieves exact support recovery in the noiseless case and stable reconstruction under noise.[25]
IRLS promotes sparsity more effectively than ordinary least squares by penalizing large coefficients less severely while driving small ones to zero, enabling recovery of signals with far fewer measurements than the ambient dimension. This is particularly advantageous in applications like MRI or radar, where signals are inherently sparse in some basis.[25]
For a numerical illustration, consider synthetic data with a 45-sparse signal in \mathbb{R}^{1500}, measured via a 250 \times 1500 Gaussian matrix A (satisfying RIP) and additive noise. Applying IRLS with weights w_j = 1 / (|x_j| + \epsilon) and \epsilon = 10^{-3}, the algorithm recovers the support after 10-15 iterations, with \ell_1-error decaying exponentially from an initial \|x^{(0)} - s\|_1 \approx 50 to below 1, demonstrating rapid convergence to the sparse true signal s.[25]
Despite these strengths, IRLS can converge to local minima if initialized poorly, potentially missing the global sparse solution, especially in highly underdetermined systems. To mitigate this, variants like reweighted zero-attracting IRLS adjust weights to more aggressively enforce exact sparsity, improving recovery rates (e.g., reducing required measurements by up to 25%) while maintaining global convergence guarantees under RIP.
Lp Norm Linear Regression
Lp norm linear regression seeks to estimate the coefficients x in a linear model by solving the optimization problem \min_x \|Ax - b\|_p^p = \min_x \sum_{i=1}^m |a_i^T x - b_i|^p, where A \in \mathbb{R}^{m \times n} is the design matrix, b \in \mathbb{R}^m is the response vector, and 1 < p < 2 provides a balance between the outlier sensitivity of ordinary least squares (p=2) and the computational challenges of least absolute deviations (p=1). This formulation promotes robustness by applying less penalty to large residuals compared to the quadratic penalty in L_2 regression, making it suitable for datasets contaminated by outliers or heavy-tailed errors.[21]
The iteratively reweighted least squares (IRLS) algorithm addresses this non-quadratic objective by approximating it through a sequence of weighted L_2 problems. At each iteration k, given residuals r^{(k-1)} = b - A x^{(k-1)}, the weights are set as w_i^{(k)} = |r_i^{(k-1)}|^{p-2} for i = 1, \dots, m, forming a diagonal weight matrix W^{(k)} = \operatorname{diag}(w_1^{(k)}, \dots, w_m^{(k)}). The update solves the weighted least squares subproblem whose normal equations are (A^T W^{(k)} A) x = A^T W^{(k)} b, yielding x^{(k)}. This process repeats until convergence, typically measured by a small change in the objective or residuals. For p > 1, the L_p objective is strictly convex, ensuring a unique global minimizer, and IRLS converges globally to this solution with linear rate 2 - p.[21]
In practice, the choice of p tunes the degree of robustness: values closer to 1 (e.g., p = 1.2) offer stronger outlier resistance at the cost of efficiency, while p near 2 behaves more like ordinary least squares. Because p - 2 < 0, the weight |r_i|^{p-2} grows without bound as a residual approaches zero, so a small \epsilon > 0 is often introduced, for example w_i^{(k)} = \max(|r_i^{(k-1)}|, \epsilon)^{p-2} or w_i^{(k)} = (|r_i^{(k-1)}|^2 + \epsilon)^{(p-2)/2}, which cap the weights and prevent numerical instability without significantly altering the objective.[21][6]
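The two regularized weighting schemes can be written compactly as follows (a small sketch; eps plays the role of the stabilization constant described above):

import numpy as np

def lp_weights_clipped(r, p, eps=1e-8):
    # Clip the residual magnitude at eps before exponentiation, capping the weights.
    return np.maximum(np.abs(r), eps) ** (p - 2)

def lp_weights_smoothed(r, p, eps=1e-8):
    # Smooth alternative (r^2 + eps)^((p-2)/2); well defined even at r = 0.
    return (np.asarray(r) ** 2 + eps) ** ((p - 2) / 2)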
For p < 2, the downweighting of large residuals (w_i < 1 for large |r_i|) reduces the influence of outliers, so the solution can be interpreted as a compromise between mean (p = 2) and median (the limit p \to 1) regression.[6][26]
Historical Development
Early Origins
The iteratively reweighted least squares (IRLS) method traces its origins to the field of approximation theory, particularly in solving minimax approximation problems. In 1961, Charles L. Lawson introduced the core idea of IRLS in his PhD thesis, where he proposed an iterative weighting scheme to approximate the solution to linear Chebyshev approximation problems, also known as L_\infty norm minimization.[27] This approach involved repeatedly solving weighted least squares problems, with weights adjusted based on the residuals from previous iterations, to converge toward the uniform norm solution that minimizes the maximum deviation.[28] Lawson's algorithm was specifically designed for linear minimax approximation, such as fitting polynomials to data points while equioscillating at the error extrema, a fundamental challenge in numerical analysis.[29]
Lawson's key contribution appears in his thesis titled Contributions to the Theory of Linear Least Maximum Approximation, which derived explicit weight updates for achieving uniform norm approximation.[30] Prior to widespread statistical applications, IRLS found use in numerical analysis for tasks like polynomial fitting, where traditional least squares failed to handle outlier sensitivity or non-L2 norms effectively.[31] Early implementations also emerged in signal processing for approximating continuous functions with discrete basis expansions, emphasizing robustness in bounded-error scenarios.[32]
During the 1970s, the method was extended to Lp norms for 1 \leq p < \infty, adapting the weighting mechanism to approximate solutions for intermediate norms, including L1 regression as a computationally efficient alternative to linear programming formulations.[32] These developments built directly on Lawson's framework, applying iterative reweighting to promote sparsity or median-like estimators in approximation problems.[33] The approach laid foundational groundwork for the majorization-minimization framework in non-smooth optimization, where surrogate quadratic objectives are iteratively minimized to bound non-convex or non-differentiable functions.[34]
Despite its innovations, early IRLS implementations suffered from limitations, including the absence of rigorous convergence proofs, which left theoretical guarantees underdeveloped until later analyses. Moreover, the algorithms were primarily suited for small-scale problems due to the computational demands of repeated least squares solves on limited hardware of the era.[35]
Key Contributions in Statistics and Optimization
A pivotal advancement in the application of iteratively reweighted least squares (IRLS) occurred in 1972 when John Nelder and Robert Wedderburn introduced it as a core method for fitting generalized linear models (GLMs). In their framework, IRLS iteratively updates weights derived from working residuals and the model's link function to minimize the deviance, enabling the estimation of parameters in models where the response variable follows distributions from the exponential family, such as binomial or Poisson. This approach transformed statistical modeling by extending ordinary least squares to non-normal responses, facilitating widespread use in regression analysis.
In the realm of robust statistics during the 1970s and 1980s, building on Peter J. Huber's M-estimation framework from 1964,[3] IRLS was extended to implement M-estimators for robust regression, as developed by Paul W. Holland and Roy E. Welsch in 1977.[4] These extensions employed IRLS to approximate solutions for robust regression problems, where weights are adjusted based on bounded influence functions to downweight outliers and achieve estimators with high breakdown points. For instance, Huber's M-estimation framework, combined with IRLS iterations, provided a practical computational tool for linear models resistant to contamination, influencing the development of bounded-influence diagnostics in regression.[4]
Within GLMs, IRLS specifically targets deviance minimization, serving as the backbone for estimating models like logistic regression for binary outcomes and Poisson regression for count data. In logistic regression, weights incorporate the variance from the Bernoulli distribution via the link function, iteratively refining parameter estimates until convergence; similarly, in Poisson regression, IRLS adjusts for the equality of mean and variance to fit rate parameters effectively. This deviance-based optimization via IRLS has been integral to GLM implementations, ensuring efficient maximum likelihood estimation.[36]
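The logistic case can be sketched in a few lines of NumPy (a minimal illustration of the IRLS/Fisher-scoring update; no intercept handling, regularization, or safeguards against fitted probabilities of exactly 0 or 1):

import numpy as np

def logistic_irls(X, y, max_iter=25, tol=1e-8):
    # Fit logistic regression by iteratively reweighted least squares.
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))            # fitted probabilities (logit link)
        w = mu * (1.0 - mu)                        # working weights = Bernoulli variance
        z = eta + (y - mu) / w                     # working response
        Xw = X * w[:, None]
        beta_new = np.linalg.solve(X.T @ Xw, X.T @ (w * z))
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta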
The resurgence of IRLS in optimization during the 2000s, particularly in compressed sensing, highlighted its versatility beyond traditional statistics. Rick Chartrand's 2007 work demonstrated IRLS for sparse signal recovery by reweighting to approximate \ell_1-norm minimization, improving reconstruction from underdetermined measurements. Complementing this, Ingrid Daubechies and colleagues in 2008 analyzed IRLS convergence for sparse recovery, proving faster-than-linear rates under restricted isometry conditions and establishing its efficacy in nonconvex optimization. These contributions revealed deep connections between IRLS, the expectation-maximization (EM) algorithm, and majorization-minimization frameworks, where IRLS iterations majorize nonconvex objectives to guarantee monotonic decrease.[25]
The impact of these developments is evident in the integration of IRLS into major statistical software, enabling routine GLM fitting in tools like R's glm function via iteratively weighted least squares and SAS procedures for robust and generalized modeling. This has bridged statistics and signal processing, fostering applications from epidemiological modeling to high-dimensional data analysis while promoting robust, scalable optimization techniques.[37][38]