Non-linear least squares

Non-linear least squares is a form of least squares analysis applied to fit a set of observations to a model that is nonlinear in its unknown parameters, achieved by minimizing the sum of the squared residuals between the observed data and the model's predictions. This method extends linear least squares, which assumes linearity in the parameters and allows closed-form solutions, to handle more complex functional relationships for which no such analytical solution exists. Unlike linear models, non-linear least squares requires iterative numerical optimization, often starting from initial parameter estimates, to converge to the optimal values. The technique is particularly useful for modeling processes exhibiting nonlinear behaviors, such as exponential growth or decay, asymptotic approaches to a limiting value, or saturation effects, where the model function cannot be expressed as a linear combination of the parameters. Mathematically, it solves the optimization problem \min_x \frac{1}{2} \| f(x) \|^2_2, where f: \mathbb{R}^n \to \mathbb{R}^m (with m \geq n) represents the vector of residuals, and x are the parameters to estimate. Key advantages include efficient use of data and the ability to incorporate nearly any closed-form nonlinear function, though the method is sensitive to outliers and demands reliable starting values to avoid local minima.

Prominent algorithms for solving non-linear least squares problems include the Gauss-Newton method, which approximates the Hessian using the Jacobian for iterative updates, and the Levenberg-Marquardt algorithm, a damped variant that blends Gauss-Newton steps with gradient descent for improved robustness near singular points. The Gauss-Newton approach, derived as a modification of Newton's method for sum-of-squares objectives, dates back to foundational work by Carl Friedrich Gauss in the early 19th century but was adapted for general nonlinear cases in later developments. Levenberg-Marquardt, introduced by Kenneth Levenberg in 1944 and refined by Donald Marquardt in 1963, remains widely used due to its balance of efficiency and global convergence properties.

Applications span fields such as engineering, the physical sciences, and the life sciences, including curve fitting for experimental data, parameter estimation in pharmacokinetic models, and model fitting in machine learning. For instance, it is employed to model curing times or microbial growth rates, where linear approximations fail to capture saturation effects. Despite its computational demands, non-linear least squares provides approximate confidence intervals and supports weighted variants for heteroscedastic data, making it a cornerstone of statistical modeling.

Fundamentals

Definition and Motivation

Non-linear least squares is a form of least squares analysis employed to estimate parameters in a model that is non-linear in those parameters by minimizing the sum of the squares of the residuals between observed data and model predictions. Given a dataset consisting of m observations (x_i, y_i) for i = 1, \dots, m, the objective is to determine the parameter vector \beta \in \mathbb{R}^p that minimizes the objective function S(\beta) = \sum_{i=1}^m \left[ y_i - f(x_i; \beta) \right]^2, where f(x; \beta) denotes the non-linear model function relating the predictors x to the responses y. This method is motivated by the need to fit models that capture inherently non-linear relationships in real-world data, where linear least squares—its special case with closed-form solutions via the normal equations—proves inadequate.

In fields such as physics, biochemistry, and engineering, non-linear least squares enables the estimation of parameters in models like exponential laws for decay or growth processes in physics, the Michaelis-Menten equation describing enzyme kinetics in biochemistry, and dose-response curves for material stress testing in engineering. These applications arise because empirical relationships in nature and engineered systems often exhibit curvature or saturation effects that linear models cannot represent without distortion.

The technique originated in the 19th century through Carl Friedrich Gauss's application of least squares to the non-linear orbit determination of the asteroid Ceres in 1801, where he iteratively solved Kepler's elliptical equations using sparse astronomical observations to predict its path. Gauss's pioneering work demonstrated the power of least squares for non-linear parameter estimation amid measurement errors, influencing astronomy and geodesy. Extensions to general non-linear least squares, including robust iterative algorithms, emerged in the 20th century with advances in computational capabilities, solidifying its role in scientific modeling.

Comparison to Linear Least Squares

Linear least squares applies when the model function f(\mathbf{x}; \boldsymbol{\beta}) is linear in the parameters \boldsymbol{\beta}, typically expressed as f(\mathbf{x}; \boldsymbol{\beta}) = X \boldsymbol{\beta}, where X is the design matrix. In this case, the least squares estimates are obtained analytically by solving the normal equations, \hat{\boldsymbol{\beta}} = (X^T X)^{-1} X^T \mathbf{y}, assuming X^T X is invertible. In contrast, non-linear least squares arises when f(\mathbf{x}; \boldsymbol{\beta}) is non-linear in \boldsymbol{\beta}, precluding a closed-form expression for the estimates. The objective function S(\boldsymbol{\beta}) = \sum_i r_i(\boldsymbol{\beta})^2, where r_i(\boldsymbol{\beta}) = y_i - f(x_i; \boldsymbol{\beta}) are the residuals, may exhibit multiple local minima, and its surface is generally non-convex, complicating the search for the global minimum. Poor initial guesses can lead to convergence at a local rather than the global minimum.

Computationally, linear least squares relies on direct methods such as matrix inversion or QR decomposition for exact solutions in a finite number of steps. Non-linear cases, however, require iterative optimization algorithms, such as Gauss-Newton or Levenberg-Marquardt, which approximate the Hessian using the Jacobian matrix of the residuals and update parameters until convergence. These methods demand derivatives (analytical or numerical) and are sensitive to starting values, potentially requiring multiple runs to assess solution robustness.

A representative example contrasts simple linear regression, y = a + b x + \epsilon, solvable directly via the normal equations, with the non-linear exponential model y = a e^{b x} + \epsilon. The exponential can sometimes be linearized by reparameterizing as \ln y = \ln a + b x + \epsilon', allowing linear least squares on the transformed data. However, this approach often fails because the log transformation alters the error structure, assuming multiplicative errors and introducing bias in the parameter estimates when the original data follow the common additive-noise assumption. Direct non-linear fitting avoids such distortions while properly accounting for the original residual variance.

Mathematical Formulation

Model and Residuals

In non-linear least squares, the goal is to fit a parametric model to a set of observed data points, where the model is non-linear in the parameters. The model is specified as y_i = f(\mathbf{x}_i; \boldsymbol{\beta}) + \epsilon_i for i = 1, \dots, m, with m observations and m > p typically, where \mathbf{x}_i denotes the vector of input variables for the i-th observation, f(\cdot; \boldsymbol{\beta}) is the non-linear prediction function, and \boldsymbol{\beta} = (\beta_1, \dots, \beta_p)^T is the p-dimensional vector of unknown parameters. This formulation arises in contexts such as curve fitting for exponential growth or pharmacokinetic modeling, where the relationship between inputs and outputs cannot be expressed linearly in the parameters. The residuals quantify the discrepancies between the observed responses y_i and the model predictions. They are defined as e_i(\boldsymbol{\beta}) = y_i - f(\mathbf{x}_i; \boldsymbol{\beta}), representing the deviations for each data point. In vector notation, the response vector is \mathbf{y} = (y_1, \dots, y_m)^T \in \mathbb{R}^{m \times 1}, the predicted vector is \mathbf{f}(\boldsymbol{\beta}) = (f(\mathbf{x}_1; \boldsymbol{\beta}), \dots, f(\mathbf{x}_m; \boldsymbol{\beta}))^T \in \mathbb{R}^{m \times 1}, and the residual vector is \mathbf{e}(\boldsymbol{\beta}) = \mathbf{y} - \mathbf{f}(\boldsymbol{\beta}) \in \mathbb{R}^{m \times 1}. These residuals serve as the fundamental building blocks for assessing model fit and parameter estimation in non-linear settings. The method relies on several key assumptions about the error terms \epsilon_i. These errors are assumed to be independent across observations, with constant variance (homoscedasticity) and zero mean, implying an unbiased model where E[\epsilon_i] = 0 and \text{Var}(\epsilon_i) = \sigma^2 for all i. Often, normality of the errors is further assumed for inference purposes, though the least squares criterion itself does not require it. Violations of independence, homoscedasticity, or unbiasedness can lead to inefficient estimates or biased inference, prompting the use of robust variants in such cases.
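
To make the notation concrete, the following minimal sketch (Python/NumPy is assumed here as the illustration language, and the exponential model f(x; \beta) = \beta_1 e^{\beta_2 x} with synthetic data is a hypothetical example, not taken from the source) builds a small dataset and evaluates the residual vector e(\beta) = y - f(x; \beta) at a trial parameter value.

    import numpy as np

    def model(x, beta):
        # Hypothetical non-linear model: f(x; beta) = beta[0] * exp(beta[1] * x)
        return beta[0] * np.exp(beta[1] * x)

    def residuals(beta, x, y):
        # Residual vector e(beta) = y - f(x; beta), one entry per observation
        return y - model(x, beta)

    # Synthetic data: 20 observations from the model plus additive noise
    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 2.0, 20)
    beta_true = np.array([2.0, -1.3])
    y = model(x, beta_true) + 0.05 * rng.standard_normal(x.size)

    e = residuals(np.array([1.5, -1.0]), x, y)   # residuals at a trial beta
    print(e.shape)                               # (20,)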

Objective Function

In nonlinear least squares, the objective is to minimize the sum of squared residuals, defined as S(\beta) = \sum_{i=1}^m e_i(\beta)^2 = \| y - f(\beta) \|_2^2, where y \in \mathbb{R}^m is the vector of observations, f(\beta) \in \mathbb{R}^m represents the model predictions dependent on the parameter vector \beta \in \mathbb{R}^p, and e(\beta) = y - f(\beta) denotes the residual vector. This function aggregates the individual squared errors from the model fits, providing a measure of the overall discrepancy between the data and the predictions. The objective function S(\beta) is nonnegative, satisfying S(\beta) \geq 0 for all \beta, with equality to zero achievable only if a perfect fit exists such that f(\beta) = y. Its gradient is given by \nabla S(\beta) = -2 J(\beta)^T e(\beta), where J(\beta) is the Jacobian matrix of f with respect to \beta, containing partial derivatives J_{ij} = \partial f_i / \partial \beta_j. Near a local minimum, the Hessian H(\beta) = \nabla^2 S(\beta) can be approximated by H(\beta) \approx 2 J(\beta)^T J(\beta), which is always positive semi-definite and positive definite when J(\beta) has full column rank, though this approximation ignores second-order terms involving the residuals. These problems typically involve overdetermined systems in which the number of observations m exceeds the number of parameters p (often m \gg p); even so, poorly identifiable parameters can produce ill-conditioned fits that require additional regularization to stabilize the solution. Due to the nonlinearity of f in \beta, S(\beta) is generally nonconvex, and while local minima exist, a unique global minimum is not guaranteed, potentially resulting in multiple solutions depending on the initial parameter estimates.
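
The quantities above translate directly into code. This sketch (Python/NumPy assumed; the exponential model and its analytic Jacobian are illustrative choices, not taken from the source) evaluates S(\beta), its gradient -2 J^T e, and the Gauss-Newton Hessian approximation 2 J^T J.

    import numpy as np

    def f(beta, x):
        # Hypothetical model: f(x; beta) = beta[0] * exp(beta[1] * x)
        return beta[0] * np.exp(beta[1] * x)

    def jacobian(beta, x):
        # Analytic Jacobian J_ij = d f_i / d beta_j for the model above
        J = np.empty((x.size, 2))
        J[:, 0] = np.exp(beta[1] * x)
        J[:, 1] = beta[0] * x * np.exp(beta[1] * x)
        return J

    def objective_pieces(beta, x, y):
        # Return S(beta), its gradient -2 J^T e, and the Gauss-Newton Hessian 2 J^T J
        e = y - f(beta, x)
        J = jacobian(beta, x)
        S = e @ e                      # sum of squared residuals
        grad = -2.0 * J.T @ e          # gradient of S
        H_gn = 2.0 * J.T @ J           # Gauss-Newton approximation to the Hessian
        return S, grad, H_gn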

Weighted Variants

In weighted nonlinear least squares, the standard objective function is extended to incorporate a weight matrix that accounts for differences in observation precision, leading to the weighted sum of squared residuals: S_w(\beta) = \sum_{i=1}^m w_i \, e_i(\beta)^2 = e(\beta)^T W e(\beta), where e(\beta) is the vector of residuals e_i(\beta) = y_i - f(x_i; \beta), m is the number of observations, W is a positive definite diagonal weight matrix with entries w_i > 0, and \beta denotes the parameter vector. This formulation minimizes the same nonlinear model as in the unweighted case but adjusts the influence of each residual based on the specified weights. The primary rationale for weighting is to address heteroscedasticity, where the variance of the errors varies across observations, ensuring that more reliable points exert greater influence on the estimates. A typical choice for the weights is w_i = 1 / \sigma_i^2, with \sigma_i^2 representing the variance of the i-th error term, which normalizes the contributions to reflect relative precisions. When the error covariance matrix is fully known and not necessarily diagonal, the weighted approach generalizes to a form akin to nonlinear generalized least squares, optimizing the quadratic form e(\beta)^T \Sigma^{-1} e(\beta), where \Sigma is the error covariance matrix. Assuming Gaussian errors with constant known variance \sigma^2, the minimized weighted objective S_w(\hat{\beta}) / \sigma^2 approximately follows a central chi-squared distribution with m - p degrees of freedom, where p is the number of parameters; this statistic provides a basis for goodness-of-fit testing by comparing the achieved value to the expected distribution under the model. In implementation, weights are often estimated iteratively from the data—such as via replicate measurements, preliminary unweighted fits, or assumed error models—and must remain non-negative to preserve the convexity and interpretability of the quadratic objective.
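
The weighting scheme is compact in code. The following minimal sketch (Python/NumPy assumed; the residuals, per-observation standard deviations, and parameter count are hypothetical) forms the weighted objective with w_i = 1/\sigma_i^2 and evaluates a rough reduced chi-squared check in the spirit of the goodness-of-fit idea described above.

    import numpy as np

    def weighted_objective(e, sigma):
        # Weighted sum of squares S_w = sum_i e_i^2 / sigma_i^2 = e^T W e, W = diag(1/sigma^2)
        w = 1.0 / sigma**2            # weights from per-observation standard deviations
        return np.sum(w * e**2)

    # Example: residuals from some fit and assumed per-point standard deviations
    e = np.array([0.12, -0.05, 0.30, -0.22])
    sigma = np.array([0.10, 0.10, 0.50, 0.25])   # less precise points get smaller weight
    S_w = weighted_objective(e, sigma)

    # Rough goodness-of-fit check: reduced chi-squared near 1 suggests an adequate model
    m, p = e.size, 2                  # p is the (assumed) number of fitted parameters
    print(S_w / (m - p))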

Geometric Interpretation

Non-linear least squares problems admit an elegant geometric interpretation. The model parameters \theta \in \mathbb{R}^n map to predicted values in the observation space \mathbb{R}^m (with m \geq n), defining an n-dimensional manifold embedded in \mathbb{R}^m. The observed point y lies in this space, and the residual vector is r(\theta) = y - f(\theta), where f(\theta) is the model prediction. The objective is to find the parameter values \theta^* such that f(\theta^*) is the point on the manifold closest to y in the Euclidean norm, i.e., minimizing \|r(\theta)\|_2^2. This views the optimization as a projection of the observations onto the (possibly curved) model manifold.

Computational Challenges

Initial Parameter Estimates

In nonlinear least squares estimation, selecting appropriate initial estimates for the parameter vector \beta is essential, as the objective function often exhibits multiple local minima, making the optimization highly sensitive to starting values; poor choices can result in divergence, slow convergence, or entrapment in suboptimal solutions, unlike the global uniqueness typically found in linear least squares. This sensitivity arises because iterative methods rely on local approximations, and initial estimates far from the true optimum can lead to unreliable fits, particularly in models with complex parameter interactions.

Several strategies exist for obtaining robust initial estimates. Grid search involves evaluating the sum of squared residuals over a lattice in the parameter space to identify a promising starting point, which is computationally feasible for low-dimensional problems and helps explore the landscape broadly. Moment matching provides another approach by equating statistical moments (such as the mean and variance) of the observed data to those implied by the model, yielding initial \beta values that align the model's aggregate properties with the data; this method is particularly effective for parametric models with interpretable moments, such as exponential or Gaussian models. Linearization techniques approximate the nonlinear model locally as linear in \beta—for instance, by taking a Taylor expansion around a nominal point—and solve the resulting ordinary least squares problem to obtain an initial guess, which can then seed the nonlinear solver. Incorporating domain-specific knowledge further refines initial estimates by imposing physical or experimental constraints, such as bounding parameters within biologically or physically plausible ranges derived from prior studies, thereby reducing the search space and enhancing reliability.

To assess robustness against multiple local minima, a multi-start strategy is recommended: multiple initial sets are generated (e.g., via random sampling within bounds or combinations of the above methods), each is run through the optimizer, and the solution minimizing the objective function is selected as the final estimate. This multi-start procedure, while increasing computational cost, mitigates the risk of converging to a poor local minimum and is standard practice in applications such as pharmacokinetic and nonlinear regression modeling.
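
As a concrete illustration of grid-based initialization, the following hedged sketch (Python/NumPy assumed; the exponential model, parameter ranges, and grid resolution are hypothetical) evaluates the sum of squared residuals over a coarse lattice and returns the best candidate as a starting point for an iterative solver.

    import numpy as np
    from itertools import product

    def sum_sq(beta, x, y, model):
        e = y - model(x, beta)
        return e @ e

    def grid_start(x, y, model, grids):
        # Evaluate S(beta) over a coarse lattice of candidate parameters and
        # return the best point as an initial estimate for the iterative solver
        best_beta, best_S = None, np.inf
        for candidate in product(*grids):
            beta = np.array(candidate)
            S = sum_sq(beta, x, y, model)
            if S < best_S:
                best_beta, best_S = beta, S
        return best_beta

    # Hypothetical usage with an exponential model and plausible parameter ranges
    model = lambda x, b: b[0] * np.exp(b[1] * x)
    x = np.linspace(0, 2, 20)
    y = model(x, [2.0, -1.3]) + 0.05 * np.random.default_rng(1).standard_normal(20)
    beta0 = grid_start(x, y, model,
                       grids=[np.linspace(0.5, 5, 10), np.linspace(-3, 0, 10)])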

Jacobian Calculation

The Jacobian matrix J(\beta) in non-linear least squares is an m \times p matrix whose elements are defined as J_{i,j}(\beta) = \frac{\partial f_i}{\partial \beta_j} (x_i; \beta), capturing the sensitivity of each predicted value f_i to changes in the parameters \beta_j. This matrix plays a crucial role in gradient-based optimization methods, where the gradient of the sum-of-squares objective is given by \nabla S(\beta) = -2 J(\beta)^T r(\beta), with r(\beta) the vector of residuals.

Analytical computation of the Jacobian is feasible when the model admits explicit partial derivatives. For simple models, these can be derived directly; for instance, in exponential regression where f(x; a, b) = a e^{b x}, the entry corresponding to a is \frac{\partial f}{\partial a} = e^{b x}, while for b it is \frac{\partial f}{\partial b} = a x e^{b x}. Such formulas enable exact evaluation without approximation errors, provided the derivatives are implementable in code.

When analytical derivatives are unavailable or complex, numerical approximation via finite differences is commonly employed. The forward difference formula estimates each column j of the Jacobian as J_{i,j}(\beta) \approx \frac{f_i(x_i; \beta + \delta e_j) - f_i(x_i; \beta)}{\delta}, where e_j is the j-th unit vector and \delta is typically on the order of \sqrt{\epsilon} (around 10^{-8} for double precision), with \epsilon the machine epsilon. For improved accuracy, central differences may be used: J_{i,j}(\beta) \approx \frac{f_i(x_i; \beta + \delta e_j) - f_i(x_i; \beta - \delta e_j)}{2\delta}. These methods require evaluating the full model at perturbed parameters, incurring a computational cost of roughly p additional function evaluations for forward differences or 2p for central differences per iteration.

Automatic differentiation (AD) provides an alternative for computing the Jacobian exactly and efficiently without manual derivation or approximations. AD tools, such as those in Ceres Solver or Julia's ForwardDiff package, differentiate code programmatically by propagating derivatives through the computation graph, making them well suited to complex nonlinear models in scientific computing and machine learning applications.

Key challenges in Jacobian calculation include the high computational expense, especially for large p or costly model evaluations, and potential ill-conditioning when the matrix is rank-deficient, which arises if parameters are poorly identifiable or the model is overparameterized. In such cases, the Jacobian's condition number can become large, amplifying numerical instability in downstream optimizations.
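
A finite-difference Jacobian can be sketched as follows (Python/NumPy assumed; the step-size scaling rule is one common heuristic rather than a prescribed choice). The helper accepts any model f(x, beta) that returns a vector of predictions.

    import numpy as np

    def jacobian_fd(f, beta, x, central=False):
        # Finite-difference Jacobian of a model f(x, beta) with respect to beta.
        # Forward differences cost p extra model evaluations, central differences 2p.
        beta = np.asarray(beta, dtype=float)
        eps = np.finfo(float).eps                # machine epsilon (~2.2e-16 for double)
        delta = np.sqrt(eps)                     # perturbation on the order of 1e-8
        f0 = f(x, beta)
        J = np.empty((f0.size, beta.size))
        for j in range(beta.size):
            step = np.zeros_like(beta)
            step[j] = delta * max(1.0, abs(beta[j]))   # scale step to parameter magnitude
            if central:
                J[:, j] = (f(x, beta + step) - f(x, beta - step)) / (2 * step[j])
            else:
                J[:, j] = (f(x, beta + step) - f0) / step[j]
        return J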

Convergence Criteria

Convergence criteria in nonlinear least squares optimization determine when an iterative algorithm has sufficiently minimized the objective function S(\beta) = \sum_{i=1}^m r_i(\beta)^2, where r_i(\beta) are the residuals, ensuring a balance between solution accuracy and computational efficiency. These criteria typically involve thresholds on the gradient of S(\beta), on changes in the parameters \beta, or on variations in S(\beta) itself, preventing excessive iterations while avoiding premature termination. The choice of criteria depends on the problem's scale and noise levels, with tolerances often set empirically or based on machine precision.

A common absolute or relative tolerance criterion stops the iteration when the norm of the gradient \|\nabla S(\beta)\| < \epsilon, where \nabla S(\beta) = -2 J^T(\beta) r(\beta) and J(\beta) is the Jacobian matrix; this indicates that the current \beta is near a stationary point of S(\beta). Relative variants compare the change in S(\beta) to its current value, such as |S(\beta_{k+1}) - S(\beta_k)| < \delta S(\beta_k), ensuring negligible improvement relative to the objective's magnitude. These gradient- and function-based tests are standard in algorithms like Gauss-Newton and Levenberg-Marquardt, as they directly assess first-order optimality conditions.

Parameter change criteria halt optimization when the update \|\beta_{k+1} - \beta_k\| < \tau or its relative version \|\beta_{k+1} - \beta_k\| / \|\beta_k\| < \tau, signaling that further adjustments would not significantly alter the fit. This is particularly useful in ill-conditioned problems where gradient norms may remain large due to scaling issues. In practice, \tau is often set to a small multiple of machine precision scaled by the parameter magnitudes for numerical stability.

Residual-based criteria focus on the individual or aggregate residuals, stopping if the maximum absolute residual \max_i |r_i(\beta)| < \theta for unweighted cases, ensuring each data point is fitted within a desired error bound. For weighted variants, where S(\beta) = \sum_{i=1}^m w_i r_i(\beta)^2 approximates a \chi^2 distribution under Gaussian errors, convergence may be assessed via the reduced \chi^2 = S(\beta)/(m - p) < \gamma, with p the number of parameters and \gamma near 1 indicating a good fit; deviations suggest model inadequacy but can serve as a stopping threshold in statistical contexts. These approaches prioritize data fidelity over pure minimization.

To prevent infinite loops in non-convergent cases, an iteration limit acts as a fallback, typically set to a multiple of the parameter dimension, such as 10p function evaluations. Adaptive tolerances enhance robustness by tightening thresholds progressively or scaling them to problem size, reducing sensitivity to initial choices while maintaining efficiency across diverse applications.
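
In practice these tests are combined. The following is a minimal sketch (Python/NumPy assumed; the default tolerance values are illustrative, not prescribed by any particular library) of a stopping test mixing gradient, step, and objective-change conditions.

    import numpy as np

    def converged(grad, beta_new, beta_old, S_new, S_old,
                  gtol=1e-8, xtol=1e-10, ftol=1e-10):
        # Typical stopping tests combined: gradient norm, relative parameter
        # change, and relative decrease of the objective
        small_gradient = np.linalg.norm(grad) < gtol
        small_step = (np.linalg.norm(beta_new - beta_old)
                      < xtol * (np.linalg.norm(beta_old) + xtol))
        small_decrease = abs(S_old - S_new) < ftol * max(S_old, 1.0)
        return small_gradient or small_step or small_decrease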

Parameter Uncertainty and Residuals

After obtaining the parameter estimates \hat{\beta} that minimize the sum of squared residuals in non-linear least squares, assessing the uncertainty in these estimates is essential for statistical inference. The approximate covariance matrix of the parameters is given by \operatorname{Var}(\hat{\beta}) \approx (J^T J)^{-1} \hat{\sigma}^2, where J is the Jacobian matrix evaluated at \hat{\beta}, and \hat{\sigma}^2 is the estimated residual variance, computed as \hat{\sigma}^2 = S(\hat{\beta}) / (m - p), with S(\hat{\beta}) denoting the minimized sum of squares, m the number of observations, and p the number of parameters. This approximation arises from linearizing the model locally around \hat{\beta} and assuming the residuals are independent and normally distributed with constant variance.

From this covariance matrix, approximate confidence intervals for individual parameters can be constructed using Wald intervals: \hat{\beta}_j \pm t_{m-p, 1-\alpha/2} \sqrt{\operatorname{Var}(\hat{\beta}_j)}, where t_{m-p, 1-\alpha/2} is the critical value from the Student's t-distribution with m - p degrees of freedom. These intervals provide a linear approximation to the uncertainty and are asymptotically valid under mild regularity conditions, though they may underestimate coverage in highly non-linear cases.

Residual analysis plays a key role in validating the fitted model and detecting violations of assumptions. Plots of the residuals e_i = y_i - f(x_i, \hat{\beta}) against the fitted values f(x_i, \hat{\beta}) or the predictors x_i help identify patterns such as heteroscedasticity (non-constant variance) or autocorrelation in the residuals. Standardized residuals, defined as t_i = e_i / \sqrt{\hat{\sigma}^2 (1 - h_{ii})}, where h_{ii} are the diagonal elements of the hat matrix analog (detailed below), facilitate outlier detection by highlighting points with |t_i| > 2 or |t_i| > 3 as potential anomalies.

For joint parameter uncertainty, confidence regions account for the correlations captured in the covariance matrix and for the non-linearity of the model. These regions are approximately ellipsoidal contours defined by \Delta S(\beta) = S(\beta) - S(\hat{\beta}) \leq \hat{\sigma}^2 \chi^2_p(1 - \alpha), where \chi^2_p(1 - \alpha) is the critical value from the chi-squared distribution with p degrees of freedom. This likelihood ratio-based approach provides better coverage than marginal intervals in non-linear settings, as it incorporates the curvature of the objective function surface.

To identify influential data points, leverage and influence measures extend the usual regression diagnostics to the non-linear context. The hat matrix analog is H = J (J^T J)^{-1} J^T, where the diagonal elements h_{ii} quantify the leverage of the i-th observation, indicating how far x_i is from the "center" of the design in the parameter space; high-leverage points (h_{ii} > 2p/m) can disproportionately affect \hat{\beta}. Influence can then be assessed via metrics like Cook's distance, D_i = \frac{e_i^2 h_{ii}}{p \hat{\sigma}^2 (1 - h_{ii})^2}, which combines leverage and residual size to flag points whose removal would significantly alter the fit.
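
The linearized inference formulas can be gathered into a small routine. This sketch (Python with NumPy and SciPy assumed; the function and variable names are hypothetical) computes the approximate covariance, Wald interval half-widths, leverages, and Cook's distances from the Jacobian and residuals evaluated at the fitted parameters.

    import numpy as np
    from scipy import stats

    def inference(J, e, alpha=0.05):
        # Approximate covariance, Wald half-widths, leverages and Cook's distances
        # from the Jacobian J and residual vector e at the fitted parameters
        m, p = J.shape
        sigma2 = (e @ e) / (m - p)                 # residual variance estimate
        JTJ_inv = np.linalg.inv(J.T @ J)
        cov = sigma2 * JTJ_inv                     # Var(beta_hat) ~ sigma^2 (J^T J)^{-1}
        se = np.sqrt(np.diag(cov))
        tcrit = stats.t.ppf(1 - alpha / 2, m - p)
        halfwidth = tcrit * se                     # beta_hat_j +/- halfwidth[j]

        H = J @ JTJ_inv @ J.T                      # hat-matrix analog
        h = np.diag(H)                             # leverages h_ii
        t_std = e / np.sqrt(sigma2 * (1 - h))      # standardized residuals
        cooks = t_std**2 * h / (p * (1 - h))       # Cook's distance per observation
        return cov, halfwidth, h, cooks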

Solution Methods

Linearization Approaches

Linearization approaches in non-linear least squares seek to approximate the non-linear model with a linear one, facilitating the use of established linear least squares solvers. One primary method involves the first-order Taylor series expansion of the model function around an initial parameter estimate. For a non-linear model f(\mathbf{x}, \boldsymbol{\beta}), where \boldsymbol{\beta} are the parameters, the expansion at a point \boldsymbol{\beta}^{(k)} yields f(\mathbf{x}, \boldsymbol{\beta}^{(k)} + \Delta \boldsymbol{\beta}) \approx f(\mathbf{x}, \boldsymbol{\beta}^{(k)}) + J(\mathbf{x}, \boldsymbol{\beta}^{(k)}) \Delta \boldsymbol{\beta}, with J denoting the Jacobian matrix of partial derivatives. This transforms the problem into minimizing \| \mathbf{y} - f(\mathbf{x}, \boldsymbol{\beta}^{(k)}) - J(\mathbf{x}, \boldsymbol{\beta}^{(k)}) \Delta \boldsymbol{\beta} \|^2 over \Delta \boldsymbol{\beta}, solvable via linear least squares.

Reparameterization offers another strategy by re-expressing the model in a form that is linear in the new parameters, often through functional transformations. For instance, an exponential model y = a e^{b x} can be reparameterized as \log y = \log a + b x, converting it to a linear model in the parameters \log a and b. However, such transformations introduce bias in the estimates and residuals because the least squares criterion on the transformed scale does not preserve the original error structure, potentially leading to inefficient or biased inferences unless corrections such as smearing estimates are applied.

Successive linearization extends the Taylor approach by iteratively updating the expansion point after each linear solve, refining the parameter estimates through repeated approximations. Starting from an initial guess \boldsymbol{\beta}^{(0)}, one solves the linearized problem to obtain \boldsymbol{\beta}^{(1)}, then expands around \boldsymbol{\beta}^{(1)} for the next iteration, continuing until convergence. This process forms the foundational principle for iterative solutions in non-linear least squares, leveraging the efficiency of linear solvers at each step.

Despite their utility, linearization approaches have inherent limitations, providing only local approximations valid near the expansion point, which can fail for highly non-linear models or when initial estimates are poor. Strong non-linearity may cause the linear approximation to diverge significantly from the true function, leading to inaccurate solutions or non-convergence of the successive iterations. Additionally, reparameterizations can distort variance assumptions and introduce heteroscedasticity if the transformation does not align with the model's error structure.
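
Successive linearization can be sketched compactly (Python/NumPy assumed; the fixed iteration count and the absence of damping or stopping tests keep the example minimal rather than production-ready). Here f is the model, jac its Jacobian with respect to \beta, and the convention e = y - f(x; \beta) from earlier sections is used, so each correction solves the linear problem \min_{\Delta} \| J \Delta - e \|_2.

    import numpy as np

    def successive_linearization(f, jac, x, y, beta0, n_iter=20):
        # Repeatedly linearize the model about the current estimate and solve
        # the resulting linear least squares problem for the correction
        beta = np.asarray(beta0, dtype=float)
        for _ in range(n_iter):
            e = y - f(x, beta)                    # residuals at current beta
            J = jac(x, beta)                      # Jacobian at current beta
            delta, *_ = np.linalg.lstsq(J, e, rcond=None)   # linearized subproblem
            beta = beta + delta
        return beta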

Iterative Optimization Algorithms

Iterative optimization algorithms for nonlinear least squares problems aim to minimize the objective function S(\beta) = \frac{1}{2} \|r(\beta)\|^2, where r(\beta) is the vector of residuals, by generating a sequence of updates through solving approximate subproblems. These methods typically rely on local approximations of the nonlinear model, often using the Jacobian J(\beta) of the residual vector to linearize the residuals around the current iterate \beta_k. The core iteration takes the form \beta_{k+1} = \beta_k + \Delta_k, where \Delta_k solves the subproblem \min_{\Delta} \|J_k \Delta + r_k\|^2_2, with J_k = J(\beta_k) and r_k = r(\beta_k). This approach bridges the gap between the full nonlinear problem and tractable linear approximations, enabling progress toward a local minimum.

To promote reliable convergence, especially in the presence of ill-conditioning or near-singular Jacobians, many iterative frameworks incorporate globalization strategies such as trust-region constraints or line search procedures. In trust-region methods, the step \Delta_k is computed subject to \|\Delta_k\| \leq \rho_k, where \rho_k > 0 defines the radius of the region in which the linearized model is trusted to mimic the nonlinear objective. After computing the trial step, the trust region radius is adapted: if the actual reduction in S(\beta) exceeds a predicted reduction based on the model, \rho_k is increased to allow larger steps in future iterations; otherwise, it is decreased, and the step may be rejected. This mechanism balances exploration and exploitation, ensuring a monotonic decrease in S(\beta) under mild conditions.

Line search techniques further enhance stability by scaling the candidate step \Delta_k with a factor \alpha_k \in (0,1] to guarantee descent. Backtracking, a common variant, starts with \alpha_k = 1 and iteratively reduces \alpha_k (e.g., by a fixed factor such as 0.5) until the Armijo condition S(\beta_k + \alpha_k \Delta_k) \leq S(\beta_k) + \eta \alpha_k \nabla S(\beta_k)^T \Delta_k holds for a small \eta > 0, ensuring sufficient progress. This adaptive step length adjustment mitigates the risk of overshooting minima and is particularly useful when the linearized model overestimates the achievable decrease in S(\beta).

For added robustness against rank deficiency or near-singularity of the approximate Hessian J_k^T J_k, damping is introduced into the subproblem by modifying it to \min_{\Delta} \|J_k \Delta + r_k\|^2_2 + \lambda_k \|\Delta\|^2_2, where \lambda_k \geq 0 acts as a damping parameter. A small \lambda_k yields steps akin to undamped Gauss-Newton, while larger values shift toward gradient descent-like behavior, stabilizing iterations near stationary points or when starting from a poor initial estimate \beta_0. The damping parameter is typically adjusted dynamically based on the ratio of actual to predicted reductions in S(\beta).
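
The Armijo backtracking rule described above can be written as a short helper (Python/NumPy assumed; the constant \eta, the shrink factor, and the iteration cap are illustrative). S is the objective as a callable, grad its gradient at \beta, and delta a candidate descent step such as a Gauss-Newton direction.

    import numpy as np

    def backtracking(S, beta, delta, grad, eta=1e-4, shrink=0.5, max_halvings=30):
        # Armijo backtracking: shrink the step length alpha until the objective
        # decreases by at least eta * alpha * grad^T delta (sufficient decrease)
        alpha = 1.0
        S0 = S(beta)
        slope = grad @ delta              # should be negative for a descent direction
        for _ in range(max_halvings):
            if S(beta + alpha * delta) <= S0 + eta * alpha * slope:
                return alpha
            alpha *= shrink
        return alpha                      # fall back to the last (tiny) alpha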

Decomposition Techniques

In non-linear least squares optimization, decomposition techniques are applied to solve the linearized subproblems arising from a first-order Taylor expansion of the residual vector around the current parameter estimate. One primary method is the QR decomposition of the Jacobian J, which factors J into an orthogonal matrix Q and an upper triangular matrix R, such that J = Q R. The linearized problem then minimizes \| Q R \Delta + e \|^2, where e is the residual vector, which simplifies to solving R \Delta = -Q^T e via stable back-substitution. This approach ensures numerical stability by avoiding the explicit formation of the normal equations matrix J^T J, which can amplify conditioning errors.

Another key technique is the singular value decomposition (SVD) of J, expressed as J = U \Sigma V^T, where U and V are orthogonal, and \Sigma is diagonal with non-negative singular values \sigma_i. The solution to the linearized problem is \Delta = -V \Sigma^+ U^T e, with \Sigma^+ denoting the pseudo-inverse that inverts only the non-zero singular values. This method naturally handles rank-deficient Jacobians by ignoring small \sigma_i, providing a minimum-norm solution among those minimizing the residual norm.

These decompositions offer significant advantages in numerical stability compared to solving the normal equations directly, as the orthogonal transformations in QR and SVD preserve the Euclidean norm and avoid squaring the condition number. Additionally, SVD enables built-in regularization by truncating small singular values, which is useful for ill-conditioned problems in non-linear least squares. The computational cost for both methods is O(m p^2) per iteration, where m is the number of residuals and p the number of parameters, making them suitable for problems with dense Jacobians of moderate size.
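
Both factorizations apply directly to the linearized subproblem \min_{\Delta} \| J \Delta + r \|_2 with J the Jacobian of the residual vector r, as in the preceding section. The sketch below (Python/NumPy assumed; the truncation threshold is an illustrative choice) solves the step via a reduced QR factorization and, alternatively, via an SVD with truncation of small singular values.

    import numpy as np

    def step_qr(J, r):
        # Linearized step via QR: minimize ||J @ delta + r|| with J the Jacobian of
        # the residual vector r, giving R delta = -Q^T r (no normal equations formed)
        Q, R = np.linalg.qr(J)                    # reduced QR, J = Q R
        return np.linalg.solve(R, -(Q.T @ r))

    def step_svd(J, r, rcond=1e-12):
        # Same subproblem via SVD; small singular values are truncated, which
        # handles rank-deficient Jacobians and acts as a mild regularizer
        U, s, Vt = np.linalg.svd(J, full_matrices=False)
        s_inv = np.where(s > rcond * s[0], 1.0 / s, 0.0)   # pseudo-inverse of Sigma
        return -(Vt.T @ (s_inv * (U.T @ r)))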

Algorithms

Gauss-Newton Method

The Gauss-Newton method is a classical iterative algorithm for solving nonlinear least squares problems, where the objective is to minimize the sum of squared residuals by approximating the nonlinear model with a sequence of linear least squares subproblems. At each iteration k, the residual vector e_k and its Jacobian matrix J_k are evaluated at the current parameter estimate \beta_k. The method linearizes the residuals around \beta_k, leading to a quadratic approximation of the objective that ignores second-order terms in the expansion. This approach, originally inspired by Gauss's early work on least squares estimation and formalized in modern iterative form in the mid-20th century, is particularly efficient when the residuals are small near the solution.

The core update rule of the Gauss-Newton method involves solving the normal equations J_k^T J_k \Delta_k = -J_k^T e_k for the step \Delta_k, which can be expressed explicitly as \Delta_k = -(J_k^T J_k)^{-1} J_k^T e_k. The parameters are then updated via \beta_{k+1} = \beta_k + \Delta_k. This step minimizes the linearized sum of squares, providing a descent direction under suitable conditions.

Under ideal conditions, the method achieves quadratic convergence near a local minimum where the Jacobian J_k has full column rank and the residuals are small, as the approximation J_k^T J_k then closely matches the true Hessian. However, far from the minimum, convergence is generally linear or sublinear, depending on the nonlinearity of the problem and the conditioning of J_k^T J_k.

To implement the update efficiently without forming the explicit inverse, which can be numerically unstable, the normal equations are typically solved using a Cholesky factorization of the symmetric matrix J_k^T J_k, assuming it is positive definite. This decomposition allows for stable back-substitution to obtain \Delta_k.

Despite its strengths, the basic Gauss-Newton method has notable limitations: it can diverge if the computed step \Delta_k overshoots the minimum, particularly in ill-conditioned problems or when starting far from the solution, since the pure Gauss-Newton step may not guarantee a reduction in the objective function. The standard version lacks built-in mechanisms such as line searches or damping to mitigate this, potentially requiring modifications for global reliability.
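
A bare-bones Gauss-Newton loop following the update rule above might look as follows (Python/NumPy assumed; the exponential test model, starting point, and tolerance are illustrative, and no line search or damping is included, so the sketch inherits the divergence risks just noted). Here resid(beta) returns the residual vector e = f(x; beta) - y and jac(beta) its Jacobian.

    import numpy as np

    def gauss_newton(resid, jac, beta0, n_iter=50, tol=1e-10):
        # Basic (undamped) Gauss-Newton: solve J^T J delta = -J^T e at each iterate
        # using a Cholesky factorization of J^T J (assumes full column rank of J)
        beta = np.asarray(beta0, dtype=float)
        for _ in range(n_iter):
            e = resid(beta)
            J = jac(beta)
            A = J.T @ J                           # approximate Hessian (up to a factor 2)
            g = J.T @ e
            L = np.linalg.cholesky(A)             # A = L @ L.T
            delta = np.linalg.solve(L.T, np.linalg.solve(L, -g))
            beta = beta + delta
            if np.linalg.norm(delta) < tol * (np.linalg.norm(beta) + tol):
                break
        return beta

    # Hypothetical usage with the exponential model f(x; b) = b0 * exp(b1 * x)
    x = np.linspace(0, 2, 20)
    y = 2.0 * np.exp(-1.3 * x) + 0.05 * np.random.default_rng(2).standard_normal(20)
    resid = lambda b: b[0] * np.exp(b[1] * x) - y
    jac = lambda b: np.column_stack([np.exp(b[1] * x), b[0] * x * np.exp(b[1] * x)])
    beta_hat = gauss_newton(resid, jac, beta0=[1.0, -1.0])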

Levenberg-Marquardt Method

The Levenberg-Marquardt algorithm is an iterative optimization technique for solving nonlinear least squares problems, introduced by Kenneth Levenberg in 1944 and refined by Donald Marquardt in 1963. It addresses limitations of the Gauss-Newton method by incorporating a damping parameter to ensure reliable convergence, particularly for ill-conditioned problems or poor initial parameter estimates. The method blends the rapid local convergence of Gauss-Newton steps with the global stability of gradient descent, making it widely used in applications such as curve fitting and parameter estimation in scientific modeling.

At each iteration k, the algorithm computes an update \Delta_k to the parameter vector by solving the damped normal equations (J_k^T J_k + \lambda_k I) \Delta_k = -J_k^T e_k, where J_k is the Jacobian matrix of the residual vector e_k evaluated at the current parameters, \lambda_k > 0 is the damping parameter, and I is the identity matrix. This formulation modifies the Gauss-Newton step (recovered as \lambda_k \to 0) by adding a term \lambda_k I that regularizes the approximate Hessian J_k^T J_k, preventing large steps in directions of low curvature. When \lambda_k is large, the update approximates a scaled gradient descent step, - (1/\lambda_k) J_k^T e_k, ensuring descent even far from the minimum.

The damping parameter \lambda_k is adaptively adjusted to balance reliability and efficiency. It typically starts with a relatively large value to favor conservative steps, then decreases if the sum of squared residuals satisfies S(\beta_k + \Delta_k) < S(\beta_k) (indicating improvement), or increases otherwise. Marquardt proposed a gain sequence where \lambda_{k+1} = \lambda_k / 10 for successful steps and \lambda_{k+1} = 10 \lambda_k for unsuccessful ones, though variations use factors like 2 or scale \lambda_k by the diagonal of J_k^T J_k for better conditioning. This strategy implicitly enforces a trust-region constraint on the step size, equivalent to minimizing a quadratic model within a region of radius controlled by \lambda_k, enhancing robustness without explicit trust-region subproblems.

The method's advantages include improved handling of ill-conditioned Jacobians, where the damping term stabilizes the linear system and avoids the divergence common in pure Gauss-Newton iterations. It also prevents oscillatory behavior or failure on poor initial guesses by transitioning smoothly from gradient-like steps to near-quadratic convergence close to the solution when the residuals are small and the model is well approximated. These properties have made it a standard in numerical software libraries for nonlinear optimization.
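
A compact sketch of the damped iteration with Marquardt's divide-or-multiply-by-10 gain rule is shown below (Python/NumPy assumed; the initial damping, tolerance, and iteration limit are illustrative). Production implementations, such as the MINPACK routine wrapped by scipy.optimize.least_squares with method='lm', add scaling and more careful update logic.

    import numpy as np

    def levenberg_marquardt(resid, jac, beta0, lam0=1e-2, n_iter=100, tol=1e-10):
        # Levenberg-Marquardt sketch: damped normal equations with a simple
        # gain rule (lambda/10 after a successful step, 10*lambda otherwise)
        beta = np.asarray(beta0, dtype=float)
        lam = lam0
        e = resid(beta)
        S = e @ e
        for _ in range(n_iter):
            J = jac(beta)
            A = J.T @ J
            g = J.T @ e
            delta = np.linalg.solve(A + lam * np.eye(A.shape[0]), -g)
            e_new = resid(beta + delta)
            S_new = e_new @ e_new
            if S_new < S:                 # accept: move and relax the damping
                beta, e, S = beta + delta, e_new, S_new
                lam /= 10.0
                if np.linalg.norm(delta) < tol * (np.linalg.norm(beta) + tol):
                    break
            else:                         # reject: keep beta and damp more strongly
                lam *= 10.0
        return beta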

Gradient Descent Methods

Gradient descent methods for non-linear least squares problems rely on first-order information, specifically the gradient of the objective function, to iteratively update parameter estimates when second-order derivatives are unavailable or computationally prohibitive. These methods are particularly useful in high-dimensional settings or when the Jacobian matrix is sparse, as they avoid forming or inverting approximate Hessians. The objective is to minimize the sum of squared residuals S(\beta) = \frac{1}{2} \| r(\beta) \|^2, where r(\beta) is the residual vector and \beta are the parameters, with the gradient given by \nabla S(\beta) = J(\beta)^T r(\beta), where J(\beta) is the Jacobian matrix of the residuals.

The basic steepest descent method updates the parameters as \beta_{k+1} = \beta_k - \alpha_k \nabla S(\beta_k), where \alpha_k > 0 is a step size determined by a line search to ensure sufficient decrease in S. This direction aligns with the negative gradient, providing the locally steepest descent, and is computationally inexpensive, requiring only O(m p) operations per iteration, with m the number of residuals and p the number of parameters. However, it exhibits linear convergence, often zigzagging toward the minimum and performing poorly near stationary points when the problem is poorly conditioned.

A conjugate gradient variant improves upon steepest descent by constructing a sequence of mutually conjugate search directions (orthogonal with respect to the local Hessian approximation), accelerating convergence without additional storage beyond a few vectors. The update is \beta_{k+1} = \beta_k + \alpha_k d_k, where d_k = -\nabla S(\beta_k) + \mu_k d_{k-1} and \mu_k is a scalar (e.g., the Fletcher-Reeves choice \mu_k = \frac{\| \nabla S(\beta_k) \|^2}{\| \nabla S(\beta_{k-1}) \|^2}) ensuring conjugacy; periodic restarts every few iterations prevent the accumulation of errors from non-quadratic behavior. This method achieves superlinear convergence in practice for moderately ill-conditioned problems and is widely used in large-scale non-linear least squares due to its efficiency over plain steepest descent.

Momentum or Nesterov acceleration enhances these first-order methods by incorporating inertia from previous updates, helping to escape shallow local minima and dampen oscillations in ravines of the objective surface. In the heavy-ball variant, the update becomes \beta_{k+1} = \beta_k - \alpha_k \nabla S(\beta_k) + \gamma (\beta_k - \beta_{k-1}), with \gamma \in (0,1) as the momentum parameter; Nesterov's variant refines this by evaluating the gradient at an extrapolated point \beta_k + \gamma (\beta_k - \beta_{k-1}), yielding faster asymptotic rates, such as O(1/k^2) for convex cases. These accelerations are effective in non-convex non-linear least squares landscapes, reducing the number of iterations needed compared to unaccelerated gradient descent, though careful tuning of the step size and momentum parameters is required for stable convergence.

Overall, gradient descent methods trade low per-iteration cost against slower linear or sublinear convergence rates, making them suitable for initial phases of optimization or when residual and Jacobian evaluations dominate the computation, but they are less efficient than second-order methods near the minimum. In non-linear least squares, their simplicity facilitates implementation in software libraries, though they may require many iterations for high precision due to sensitivity to the step size and to problem conditioning.
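
The heavy-ball update can be sketched as follows (Python/NumPy assumed; the fixed step size and momentum value are illustrative and would normally be tuned or replaced by a line search). Here resid returns the residual vector and jac its Jacobian, so the gradient of S(\beta) = \frac{1}{2} \| r(\beta) \|^2 is J^T r.

    import numpy as np

    def gradient_descent_momentum(resid, jac, beta0, alpha=1e-2, gamma=0.9, n_iter=5000):
        # First-order heavy-ball iteration for S(beta) = 0.5 * ||resid(beta)||^2:
        # beta_{k+1} = beta_k - alpha * J^T e + gamma * (beta_k - beta_{k-1})
        beta = np.asarray(beta0, dtype=float)
        beta_prev = beta.copy()
        for _ in range(n_iter):
            e = resid(beta)
            grad = jac(beta).T @ e                # gradient of 0.5 * ||e||^2
            beta_next = beta - alpha * grad + gamma * (beta - beta_prev)
            beta_prev, beta = beta, beta_next
        return beta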

Derivative-Free Methods

Derivative-free methods address non-linear least squares problems by optimizing the sum of squared residuals without requiring derivatives or Jacobian computations, making them suitable for scenarios where analytical gradients are unavailable or computationally prohibitive. These approaches typically involve direct searches, sampling strategies, or population-based techniques that evaluate the objective function at selected points in the parameter space to iteratively improve estimates. Unlike gradient-based methods, they trade efficiency for robustness, often requiring more evaluations but handling noisy, non-smooth, or black-box functions effectively.

The Nelder-Mead simplex method exemplifies a direct search technique, operating in the parameter space (\beta-space) by maintaining a simplex of n+1 vertices for an n-dimensional problem and updating it through a sequence of reflections, expansions, contractions, or complete shrinks based solely on evaluations of the objective S(\beta). Developed by Nelder and Mead, the algorithm starts with an initial simplex and sorts its vertices by function value to decide transformations that move toward lower S values, converging to a local minimum without derivative information. It has been applied to non-linear least squares fitting in the experimental sciences and in engineering design, where model complexity precludes gradient calculations, though it may struggle in high-dimensional problems, where the simplex can degenerate and progress stalls.

Pattern search methods, including variants like Hooke-Jeeves, extend direct search principles by systematically sampling points along predefined directions from a current base point, polling the objective function at these offsets to identify promising moves before progressing to a pattern step that extrapolates in successful directions. These algorithms require no gradients and are particularly effective for bound-constrained problems, as they adapt search patterns to ensure positive spanning sets that guarantee descent under mild conditions. Coordinate descent, a related approach, optimizes one parameter at a time by minimizing S with respect to each \beta_i while fixing the others, often using univariate searches; it is computationally simple and parallelizable, finding use in sparse or high-dimensional problems. Both approaches excel in low-to-moderate dimensions but can suffer from slow convergence in highly nonlinear landscapes.

Evolutionary algorithms provide a global search capability for non-linear least squares, treating parameters as a population of candidate solutions evolved through operations such as mutation, crossover, and selection to minimize S. Differential evolution, a seminal population-based method, generates trial vectors by differencing randomly selected population members and scaling the difference to perturb a base vector, accepting improvements based on S evaluations to drive convergence. Introduced by Storn and Price, it handles multimodal objectives robustly, making it well suited to ill-conditioned or multi-minima least squares problems in areas such as engineering and the physical sciences. Genetic algorithms follow a similar Darwinian paradigm, encoding parameters as chromosomes and applying genetic operators to breed fitter populations over generations. These methods incur higher evaluation costs but offer better exploration of the parameter space compared to local direct searches.

Such derivative-free methods find broad applicability in black-box models, including neural networks and simulation-based calibrations where internal model details are inaccessible, allowing optimization via repeated forward evaluations despite elevated computational demands; their tolerance of non-smoothness also suits real-world data with measurement errors. In problems with multiple minima, they can incorporate restarts or hybridization for enhanced global search, though careful initial simplex or population design remains essential for efficiency.
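
For a black-box objective, a derivative-free fit can be run with an off-the-shelf Nelder-Mead implementation. The sketch below (Python with NumPy and SciPy assumed; the exponential model, data, starting point, and tolerances are illustrative) minimizes the sum of squared residuals using scipy.optimize.minimize with method='Nelder-Mead'.

    import numpy as np
    from scipy.optimize import minimize

    # Minimize the sum of squared residuals without derivatives via Nelder-Mead;
    # the model and data mirror the illustrative exponential example used above
    x = np.linspace(0, 2, 20)
    y = 2.0 * np.exp(-1.3 * x) + 0.05 * np.random.default_rng(3).standard_normal(20)

    def S(beta):
        e = y - beta[0] * np.exp(beta[1] * x)
        return e @ e                              # objective evaluated as a black box

    result = minimize(S, x0=[1.0, -1.0], method="Nelder-Mead",
                      options={"xatol": 1e-8, "fatol": 1e-8, "maxiter": 2000})
    print(result.x)                               # fitted parameters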
