Gradient descent
Gradient descent is a first-order iterative optimization algorithm used to minimize a differentiable objective function by repeatedly updating parameters in the direction opposite to the gradient of the function at the current point, scaled by a learning rate that controls the step size.[1] The method was first proposed by the French mathematician Augustin-Louis Cauchy in 1847 as a technique for solving systems of equations and minimizing quadratic functions; because each iteration applies a step proportional to the negative gradient, the approach is often termed the method of steepest descent.[2] In modern applications, particularly in machine learning and deep learning, gradient descent has become the cornerstone for training models such as neural networks by minimizing loss functions over large datasets, with its stochastic variant enabling efficient computation on massive data.[1] Key variants include batch gradient descent, which computes the gradient over the entire dataset for precise but computationally expensive updates; stochastic gradient descent (SGD), which uses individual training examples for faster, noisier progress suitable for online learning; and mini-batch gradient descent, a compromise that processes small subsets of data to balance efficiency and stability, widely adopted in practice.[1]
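The practical difference between these variants lies in how much data each parameter update touches. The following sketch, a minimal illustration using NumPy with a simple squared-error loss (the dataset, loss, and all names are illustrative assumptions, not taken from any particular library), shows how the gradient estimate is formed from the full dataset, a single example, or a small mini-batch, while the update rule itself stays the same.

```python
import numpy as np

def gradient(theta, X, y):
    """Gradient of the mean squared error 0.5 * mean((X @ theta - y)^2)."""
    return X.T @ (X @ theta - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))          # toy dataset: 1000 samples, 5 features
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.1 * rng.normal(size=1000)
theta = np.zeros(5)
eta = 0.1                               # learning rate

# Batch gradient descent: one update uses every sample.
g_batch = gradient(theta, X, y)

# Stochastic gradient descent: one update uses a single random sample.
i = rng.integers(len(y))
g_sgd = gradient(theta, X[i:i+1], y[i:i+1])

# Mini-batch gradient descent: one update uses a small random subset.
idx = rng.choice(len(y), size=32, replace=False)
g_minibatch = gradient(theta, X[idx], y[idx])

# All three variants feed the same update rule, differing only in the gradient estimate.
theta_batch = theta - eta * g_batch
theta_sgd = theta - eta * g_sgd
theta_minibatch = theta - eta * g_minibatch
```

In full training loops, SGD and mini-batch descent repeat this sampling at every step, trading gradient accuracy for many more, cheaper updates per pass over the data.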
Fundamentals
Definition and Intuition
Gradient descent is an iterative first-order optimization algorithm designed to minimize a differentiable objective function by repeatedly updating parameters in the direction opposite to the gradient, which points in the direction of steepest local increase of the function. This approach relies solely on evaluations of the function and its first derivatives, making it computationally efficient for high-dimensional problems where higher-order information, such as Hessians, is impractical to compute. Intuitively, gradient descent can be likened to navigating down a foggy mountain: at each step, one assesses the local slope from the terrain immediately underfoot (the gradient) and takes a step in the steepest downhill direction, opposite to the gradient, gradually approaching the valley floor, which represents a local minimum.[3] This analogy highlights the method's reliance on purely local information to make progress toward a minimum, though it may zigzag or slow down in rugged landscapes with flat regions or narrow valleys.[3] In optimization problems, gradient descent serves to locate local minima of the objective function, a task central to fields like machine learning, where it tunes model parameters to minimize loss functions measuring prediction errors.[3] Its simplicity and scalability have made it foundational for training neural networks and other data-driven models.[3]

The technique traces its origins to Augustin-Louis Cauchy, who in 1847 introduced it as a method for solving systems of equations by iteratively reducing residuals along descent directions.[4] The approach was further developed in the 20th century, notably by Jacques Hadamard, who in 1907 applied similar iterative gradient-based steps to variational problems in analysis.
Mathematical Formulation
Gradient descent seeks to minimize an objective function f: \mathbb{R}^n \to \mathbb{R} that is differentiable, with the parameters \theta \in \mathbb{R}^n representing the variables to optimize. This formulation arises in unconstrained optimization problems where the goal is to find \theta^* = \arg\min_\theta f(\theta).[5] The gradient of the function, denoted \nabla f(\theta), is the vector of partial derivatives: \nabla f(\theta) = \begin{pmatrix} \frac{\partial f}{\partial \theta_1}(\theta) \\ \vdots \\ \frac{\partial f}{\partial \theta_n}(\theta) \end{pmatrix}. This vector points in the direction of steepest ascent, so the method proceeds by moving in the opposite direction to reduce f.

The core update rule of gradient descent is the iterative step \theta_{k+1} = \theta_k - \alpha_k \nabla f(\theta_k), where k indexes the iteration, \theta_0 is an initial parameter vector, and \alpha_k > 0 is the step size (also called the learning rate) at step k.[5] In the basic form, \alpha_k is fixed as a constant \alpha, though more advanced schemes adjust it dynamically.

Convergence of this method relies on certain assumptions about f. Specifically, f must be continuously differentiable, ensuring the gradient exists and is well-behaved everywhere in the domain.[5] Additionally, the gradient \nabla f is assumed to be Lipschitz continuous with constant L > 0, meaning \| \nabla f(\theta) - \nabla f(\theta') \| \leq L \| \theta - \theta' \| for all \theta, \theta' \in \mathbb{R}^n. Under these conditions, if f is convex and \alpha_k \leq 1/L, the iterates \theta_k converge to a minimizer at a rate of O(1/k) in function value.[5]

To illustrate the formulation, consider the quadratic objective f(\theta) = \frac{1}{2} \theta^T A \theta - b^T \theta, where A \in \mathbb{R}^{n \times n} is symmetric positive definite and b \in \mathbb{R}^n. The gradient simplifies to \nabla f(\theta) = A \theta - b, and applying the update rule yields a sequence that solves the linear system A \theta = b in the limit as k \to \infty, provided 0 < \alpha < 2 / \lambda_{\max}(A), where \lambda_{\max}(A) is the largest eigenvalue of A. This example highlights how gradient descent systematically reduces the function value toward the unique minimum \theta^* = A^{-1} b.[5]
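As a concrete illustration of this formulation, the following NumPy sketch (the specific matrix A, vector b, iteration count, and step size are illustrative choices) applies the basic update rule to the quadratic objective above with a fixed step size below 2 / \lambda_{\max}(A), and compares the result with the minimizer \theta^* = A^{-1} b.

```python
import numpy as np

# Symmetric positive definite A and vector b define
# f(theta) = 0.5 * theta^T A theta - b^T theta, with gradient A theta - b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

lam_max = np.linalg.eigvalsh(A).max()   # largest eigenvalue of A
alpha = 1.0 / lam_max                   # any fixed step in (0, 2 / lam_max) converges

theta = np.zeros(2)                     # initial iterate theta_0
for k in range(500):
    grad = A @ theta - b                # gradient of f at the current iterate
    theta = theta - alpha * grad        # basic gradient descent update

print(theta)                  # iterates approach the unique minimizer
print(np.linalg.solve(A, b))  # theta* = A^{-1} b, for comparison
```

With a step size in this range the error contracts geometrically at each iteration, which is why the loop converges well within the 500 steps used here.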
Standard Batch Gradient Descent
Algorithm Description
Batch gradient descent, also known as vanilla gradient descent, is an iterative optimization algorithm that minimizes an objective function by repeatedly computing the gradient using the entire training dataset and updating the parameters in the direction opposite to the gradient. The process begins with the selection of an initial parameter vector θ₀, which is often chosen randomly from a normal or uniform distribution to avoid poor starting points that could lead to suboptimal convergence, or set to zero for simplicity in certain convex problems.[6][7] In each iteration, the algorithm evaluates the gradient of the loss function ∇f(θ) with respect to the parameters θ across all n training samples, yielding the exact gradient of the empirical loss and hence the direction of steepest descent for the full dataset. The parameters are then updated according to the rule θ ← θ - η ∇f(θ), where η is the learning rate, a positive scalar that controls the step size and must be chosen carefully to balance convergence speed and stability. This full-batch computation repeats until a convergence criterion is met, such as the norm of the gradient falling below a small tolerance ε (e.g., ||∇f(θ)|| < ε), indicating that the parameters are near a stationary point, or after a fixed number of iterations to prevent excessive computation.[6][7]

The following pseudocode outlines the core procedure:

    Initialize parameters θ ← θ₀ (e.g., random or zero)
    Set learning rate η > 0 and tolerance ε > 0
    While ||∇f(θ)|| ≥ ε:
        Compute gradient ∇f(θ) = (1/n) Σ_{i=1}^n ∇f(θ; x_i, y_i) over the entire dataset
        Update θ ← θ - η ∇f(θ)
    Return θ

This structure highlights the algorithm's reliance on complete passes through the data in every iteration.[7][6] Computationally, each iteration requires O(nd) time, where n is the number of samples and d is the dimensionality of the parameter space, due to the summation over all data points for gradient evaluation; this makes batch gradient descent suitable for small-to-medium datasets but inefficient for large-scale problems where memory and time constraints arise from full-batch processing.[7] In practice, initialization strategies like random draws from a Gaussian distribution with mean zero and small variance help mitigate issues such as vanishing or exploding gradients in non-convex settings, while stopping criteria based on gradient norm tolerance (typically ε = 10^{-4} to 10^{-6}) or maximum iterations (e.g., 1000) ensure termination without infinite loops, with the choice depending on the desired precision and computational budget.[6][7]
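A direct translation of this pseudocode into NumPy might look like the following sketch. The choice of loss (here a least-squares linear model), the synthetic data, and the hyperparameter values are illustrative assumptions, not part of the algorithm itself.

```python
import numpy as np

def batch_gradient_descent(grad_fn, theta0, eta=0.1, eps=1e-6, max_iters=1000):
    """Minimize a differentiable objective given its full-batch gradient grad_fn.

    Stops when the gradient norm falls below eps or after max_iters updates.
    """
    theta = theta0.copy()
    for _ in range(max_iters):
        grad = grad_fn(theta)                 # gradient over the entire dataset
        if np.linalg.norm(grad) < eps:        # convergence criterion ||∇f(θ)|| < ε
            break
        theta = theta - eta * grad            # update: θ ← θ - η ∇f(θ)
    return theta

# Example: least-squares linear regression on a small synthetic dataset.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.05 * rng.normal(size=200)

# Full-batch gradient of the mean squared error loss 0.5 * mean((X @ theta - y)^2).
grad_mse = lambda theta: X.T @ (X @ theta - y) / len(y)

theta_hat = batch_gradient_descent(grad_mse, theta0=np.zeros(3), eta=0.5)
print(theta_hat)   # close to [2.0, -1.0, 0.5]
```

The same routine can be reused with any differentiable loss by swapping in a different grad_fn; only the gradient computation, not the update loop, depends on the model.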