
Chain rule

The chain rule is a fundamental theorem of calculus that provides a method for computing the derivative of a composite function, expressed as the product of the derivatives of the outer and inner functions. Specifically, if y = f(g(x)), where f and g are differentiable functions, then the derivative is given by \frac{dy}{dx} = f'(g(x)) \cdot g'(x). This rule enables the differentiation of complex expressions by breaking them down into simpler components, forming a cornerstone of differential calculus.

The chain rule was developed by the German mathematician Gottfried Wilhelm Leibniz, who first used it in unpublished notes from 1676 to calculate derivatives such as that of 2^{1/x}, though with a sign error. The first explicit published statement appeared in Guillaume de l'Hôpital's 1696 Analyse des Infiniment Petits, based on Leibniz's ideas. Leibniz's 1684 publication Nova methodus pro maximis et minimis, itemque tangentibus (A new method for maxima and minima, and also for tangents) outlined the core rules of differential calculus, including the product and quotient rules, without formal proof. Although Isaac Newton developed related concepts in his fluxion-based calculus around the same period, the explicit chain rule as known today aligns more closely with Leibniz's notation and approach, which facilitated its widespread adoption in Europe during the late 17th and 18th centuries. Earlier precursors to the idea appear in ancient mathematics, such as Ptolemy's methods in the Almagest for astronomical calculations, but the rigorous formulation emerged with the invention of calculus.

Beyond basic differentiation, the chain rule extends to multivariable calculus, where it describes how changes in one variable propagate through a chain of dependent variables, as in the formula \frac{dz}{dt} = \frac{\partial z}{\partial x} \frac{dx}{dt} + \frac{\partial z}{\partial y} \frac{dy}{dt} for z = f(x(t), y(t)).
It underpins applications in physics for analyzing rates of change in dynamical systems, in optimization problems for finding maxima and minima of composite functions, and in related-rates scenarios, such as determining how volumes or areas vary with time. The rule's versatility also supports higher-order derivatives and implicit differentiation, making it indispensable for modeling real-world phenomena in science and engineering.

Introduction

Intuitive Explanation

The chain rule provides a method for finding the derivative of a composite function, where one function is applied to the output of another, by multiplying the derivatives of the individual functions evaluated at appropriate points. Intuitively, it captures how changes propagate through successive transformations: if a small change in the input x produces a change in an intermediate variable u at the rate \frac{du}{dx}, and that change in u then produces a change in the output y at the rate \frac{dy}{du}, the overall rate of change \frac{dy}{dx} is the product \frac{dy}{du} \cdot \frac{du}{dx}. This reflects the idea that rates of change multiply when dependencies are chained together.

A classic analogy illustrates this: imagine a walking person whose speed relative to the ground is 1 unit per unit time. A bicyclist travels at 4 times that speed relative to the ground, so \frac{dy}{dx} = 4, where y is the bicyclist's position and x is the walker's. Now, a car travels at 2 times the bicyclist's speed relative to the bicycle, so \frac{dz}{dy} = 2, where z is the car's position. The car's speed relative to the walker is then 2 \times 4 = 8 units per unit time, showing how the total rate combines multiplicatively. This mirrors the chain rule's structure for nested rates.

Another way to visualize it is through "function machines": g first transforms an input x, and f then transforms the output of g. A tiny adjustment to x is first scaled by the slope of g at that point, then by the slope of f at g(x), yielding a net effect equal to the product of the two slopes. For linear functions, like g(x) = ax + b and f(u) = cu + d, the composite h(x) = f(g(x)) = c(ax + b) + d has slope ac, directly multiplying the individual slopes. For nonlinear cases, the rule evaluates the local slopes (instantaneous steepness) at the specific input to each function.

Consider the example y = (x^2 + 1)^3. Let u = x^2 + 1, so y = u^3. The rate \frac{du}{dx} = 2x shows how u changes with x, while \frac{dy}{du} = 3u^2 shows how y changes with u. Thus, \frac{dy}{dx} = 3u^2 \cdot 2x = 3(x^2 + 1)^2 \cdot 2x = 6x(x^2 + 1)^2, intuitively combining the inner growth rate with the outer cubic expansion. This approach avoids expanding the full polynomial, highlighting the rule's efficiency for compositions.
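The example above can be checked numerically; the following minimal Python sketch (function names are illustrative) compares the chain-rule result 6x(x^2 + 1)^2 against a central-difference approximation of the composite's derivative:

```python
def h(x):
    # composite function: y = (x^2 + 1)^3
    return (x**2 + 1) ** 3

def h_prime(x):
    # chain rule result: dy/du * du/dx = 3u^2 * 2x = 6x (x^2 + 1)^2
    return 6 * x * (x**2 + 1) ** 2

def central_diff(fn, x, eps=1e-6):
    # symmetric finite-difference approximation of fn'(x)
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)

# the two values should agree up to truncation error
analytic = h_prime(1.5)
numeric = central_diff(h, 1.5)
```

A mismatch here would indicate an error in the symbolic differentiation, which makes this a convenient sanity check when applying the rule by hand.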

Historical Background

The chain rule in calculus originated in the late 17th century amid the independent development of infinitesimal methods by Isaac Newton and Gottfried Wilhelm Leibniz. While Newton's fluxional calculus employed techniques equivalent to the chain rule for composite functions, it was Leibniz who first documented an early form of the rule in his unpublished 1676 manuscript titled Calculus Tangentium differentialis. In this work, Leibniz applied the rule implicitly to differentiate compositions but included a sign error in the calculation, reflecting the nascent stage of the theory. The rule appeared implicitly in several examples within Guillaume de l'Hôpital's 1696 textbook Analyse des Infiniment Petits pour l'Intelligence des Lignes Courbes, the first published calculus textbook, which drew heavily from Johann Bernoulli's lectures. L'Hôpital's presentation integrated the concept into practical differentiations without an explicit general statement, treating it as a natural extension of differential notation. This approach aligned with Leibniz's framework, where differentials facilitated handling compositions without a separate named rule. A modern, explicit formulation of the chain rule emerged in Joseph-Louis Lagrange's 1797 treatise Théorie des Fonctions Analytiques, where he stated it rigorously using prime notation for derivatives of composite functions, free from infinitesimals. Lagrange's version emphasized algebraic derivation, influencing subsequent rigorization efforts. Augustin-Louis Cauchy further refined and generalized it in his 1823 Résumé des leçons sur le calcul infinitésimal, incorporating limits and extending the rule to multivariable cases. The term "chain rule" itself did not appear in calculus literature until the late 19th or early 20th century, likely borrowed from earlier texts on unit conversions, which mirrored the multiplicative chaining of factors in the rule.

Single-Variable Chain Rule

Formal Statement

The chain rule for single-variable functions provides a formula to compute the derivative of a composite function. Suppose f and g are differentiable functions such that the domain of f contains the range of g, and let h(x) = f(g(x)) for all x in the domain of g. Then h is differentiable at every value of x for which g is differentiable at x and f is differentiable at g(x), and h'(x) = f'(g(x)) \cdot g'(x). This formulation assumes the standard conditions of differentiability: g must be differentiable at x, and f must be differentiable at g(x). These conditions ensure that the limit defining the derivative of h exists. An equivalent statement using Leibniz notation, which emphasizes the multiplicative nature of the derivatives, is as follows. Let y = f(u) where u = g(x), with f differentiable at u and g differentiable at x. Then \frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}. This notation is particularly useful for identifying the inner and outer functions in expressions involving implicit differentiation or when substituting variables.
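The Leibniz-notation decomposition can be made concrete in code; this sketch (illustrative names, chosen functions are an assumption) computes dy/du and du/dx separately and multiplies them, then compares against a finite-difference estimate of the composite:

```python
import math

def g(x):
    return x**3 + x          # inner function, u = g(x)

def g_prime(x):
    return 3 * x**2 + 1      # du/dx

def f(u):
    return math.sin(u)       # outer function, y = f(u)

def f_prime(u):
    return math.cos(u)       # dy/du

def h(x):
    return f(g(x))           # composite y = f(g(x))

def h_prime(x):
    # chain rule: dy/dx = (dy/du at u = g(x)) * (du/dx at x)
    return f_prime(g(x)) * g_prime(x)

def central_diff(fn, x, eps=1e-6):
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)
```

Note that f' is evaluated at g(x), not at x; forgetting this substitution is the most common error when applying the rule.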

Proofs of the Rule

The chain rule for single-variable functions states that if g is differentiable at x = a and f is differentiable at u = g(a), then the composite function h(x) = f(g(x)) is differentiable at x = a, with derivative h'(a) = f'(g(a)) \cdot g'(a). One standard proof relies on the limit definition of the derivative. Consider the difference quotient for h: h'(a) = \lim_{x \to a} \frac{f(g(x)) - f(g(a))}{x - a}. Let \Delta u = g(x) - g(a), so the expression becomes \frac{f(g(a) + \Delta u) - f(g(a))}{x - a} = \frac{f(g(a) + \Delta u) - f(g(a))}{\Delta u} \cdot \frac{\Delta u}{x - a}, provided \Delta u \neq 0. As x \to a, \Delta u \to 0 because differentiability of g implies continuity at a. Thus, the first factor approaches f'(g(a)) by the definition of the derivative of f at g(a), and the second factor approaches g'(a). The product of the limits equals the limit of the product since both limits exist.

To handle the case where \Delta u = 0 for some x_n \to a (possible if g'(a) = 0), define an auxiliary function \Phi(\Delta u) = \begin{cases} \frac{f(g(a) + \Delta u) - f(g(a))}{\Delta u} & \text{if } \Delta u \neq 0, \\ f'(g(a)) & \text{if } \Delta u = 0. \end{cases} This \Phi is continuous at \Delta u = 0 because \lim_{\Delta u \to 0} \Phi(\Delta u) = f'(g(a)). Substituting back, the difference quotient is \Phi(\Delta u) \cdot \frac{\Delta u}{x - a} for all x \neq a. As x \to a, \Phi(\Delta u) \to f'(g(a)) and \frac{\Delta u}{x - a} \to g'(a), so the limit is their product. This resolves the division-by-zero issue rigorously.

An epsilon-delta argument verifies the limit explicitly. Given \varepsilon > 0, since f' exists at g(a), there is \delta_1 > 0 such that if 0 < |\Delta u| < \delta_1, then \left| \frac{f(g(a) + \Delta u) - f(g(a))}{\Delta u} - f'(g(a)) \right| < \frac{\varepsilon}{1 + |g'(a)|}. Since g is continuous at a, there is \delta_2 > 0 such that if |x - a| < \delta_2, then |\Delta u| < \delta_1. For 0 < |x - a| < \min(\delta_2, \delta_1 / M), where M > |g'(a)| + 1, the difference quotient's deviation from f'(g(a)) \, g'(a) is less than \varepsilon, confirming differentiability.

Applications in Single-Variable Calculus

Extensions to Multiple Compositions

The chain rule extends naturally to compositions involving three or more functions in single-variable calculus by iteratively applying the rule, resulting in a product of the derivatives of each component function evaluated at the appropriate inner expressions. For a composition h(x) = f(g(k(x))), where f, g, and k are differentiable functions, the derivative is given by h'(x) = f'(g(k(x))) \cdot g'(k(x)) \cdot k'(x). This formula arises from applying the two-function chain rule successively: first to the outer pair f \circ (g \circ k), then to the inner composition g \circ k. To compute such derivatives, identify the successive layers of composition and differentiate from the outside inward, multiplying each derivative while substituting the inner functions. For instance, consider h(x) = \sin(\cos(x^2)). Here, let f(u) = \sin u, u = g(v) = \cos v, and v = k(x) = x^2. The derivative is h'(x) = \cos(\cos(x^2)) \cdot (-\sin(x^2)) \cdot 2x, which combines the derivatives f'(u) = \cos u, g'(v) = -\sin v, and k'(x) = 2x, evaluated at the nested arguments. This approach scales to arbitrary numbers of compositions, such as four or more functions, by continuing the iterative multiplication without altering the core principle. Such extensions are essential for differentiating complex expressions encountered in applications like physics and engineering, where functions often involve multiple nested transformations, such as in modeling oscillatory systems with nested components. For example, in the function y = \tan(\sqrt{3x^2} + \ln(5x^4)), three applications of the chain rule yield the derivative as the product of the secant-squared outer derivative and the derivatives of the square-root and logarithmic inner parts, adjusted for their compositions.

This iterative process ensures the chain rule remains a foundational tool for handling real-world composite models efficiently. The chain rule also facilitates the derivation of several fundamental rules in single-variable calculus, including the quotient rule, the rule for derivatives of inverse functions, and the generalized power rule for composite functions. These derivations build directly on the chain rule's statement that if h(x) = f(g(x)), then h'(x) = f'(g(x)) \cdot g'(x). By applying this to specific forms of composite functions, other rules emerge as corollaries, enhancing the toolkit for differentiation without relying on limit definitions anew.
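The outside-in process for the triple composition \sin(\cos(x^2)) can be sketched in Python (illustrative names), chaining the three local derivatives and checking the product against a finite difference:

```python
import math

def h(x):
    # triple composition: f(g(k(x))) with f = sin, g = cos, k = x^2
    return math.sin(math.cos(x**2))

def h_prime(x):
    # outside-in product of local derivatives:
    # f'(g(k(x))) * g'(k(x)) * k'(x)
    v = x**2             # k(x)
    u = math.cos(v)      # g(k(x))
    return math.cos(u) * (-math.sin(v)) * (2 * x)

def central_diff(fn, x, eps=1e-6):
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)
```

Each factor is evaluated at its own nested argument (v = x^2 for g', u = cos(x^2) for f'), mirroring the substitution pattern described above.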

Quotient Rule

The quotient rule, which gives the derivative of a ratio of two functions, can be derived using the chain rule together with the product rule applied to the reciprocal function. Consider the quotient q(x) = \frac{f(x)}{g(x)}, assuming g(x) \neq 0. Rewrite q(x) as the product f(x) \cdot [g(x)]^{-1}. The derivative of the reciprocal [g(x)]^{-1} follows from the chain rule: let u = g(x), so the reciprocal is u^{-1}, and the power rule combined with the chain rule yields \frac{d}{dx} [g(x)]^{-1} = -[g(x)]^{-2} \cdot g'(x). Applying the product rule to q(x) = f(x) \cdot [g(x)]^{-1}, the derivative is: q'(x) = f'(x) \cdot [g(x)]^{-1} + f(x) \cdot \left( -[g(x)]^{-2} \cdot g'(x) \right). Simplifying the expression gives: q'(x) = \frac{f'(x) g(x) - f(x) g'(x)}{[g(x)]^2}, which is the standard quotient rule. This derivation assumes the product rule and the basic power rule for exponent -1, but the chain rule provides the key step for the reciprocal's derivative. For example, to differentiate \frac{x^2 + 1}{x - 3}, apply the quotient rule directly: the derivative is \frac{(2x)(x-3) - (x^2 + 1)(1)}{(x-3)^2} = \frac{2x^2 - 6x - x^2 - 1}{(x-3)^2} = \frac{x^2 - 6x - 1}{(x-3)^2}.
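The equivalence of the two routes (quotient rule directly, versus product rule plus chain rule on the reciprocal) can be verified numerically for the worked example; the helper names below are illustrative:

```python
def q(x):
    return (x**2 + 1) / (x - 3)

def q_prime_quotient(x):
    # quotient-rule result worked out in the text
    return (x**2 - 6*x - 1) / (x - 3) ** 2

def q_prime_via_chain(x):
    # product rule on f(x) * [g(x)]^(-1), with the reciprocal's
    # derivative -g^(-2) g' supplied by the chain rule
    f, fp = x**2 + 1, 2 * x
    g, gp = x - 3, 1
    return fp * g**-1 + f * (-(g**-2) * gp)

def central_diff(fn, x, eps=1e-6):
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)
```

Both forms agree to rounding error, as the derivation requires.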

Derivatives of Inverse Functions

The chain rule also yields the derivative formula for the inverse of a differentiable function. Suppose y = f^{-1}(x), so x = f(y). Differentiating both sides with respect to x using the chain rule on the right gives 1 = f'(y) \cdot \frac{dy}{dx}. Solving for the derivative of the inverse produces: \frac{dy}{dx} = \frac{1}{f'(y)} = \frac{1}{f'(f^{-1}(x))}. This holds provided f'(f^{-1}(x)) \neq 0, ensuring the inverse is locally differentiable. The proof relies on the chain rule applied to the composition f(f^{-1}(x)) = x. For instance, the derivative of the natural logarithm, as the inverse of the exponential function, follows: since f(x) = e^x and f'(x) = e^x, the derivative of \ln x = f^{-1}(x) is \frac{1}{e^{\ln x}} = \frac{1}{x}. Similarly, for the arctangent function, \frac{d}{dx} \arctan x = \frac{1}{1 + x^2}, derived from the derivative of \tan y, namely \sec^2 y = 1 + \tan^2 y. These results extend to other inverse functions via the same chain rule application.
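The formula (f^{-1})'(x) = 1 / f'(f^{-1}(x)) lends itself to a small generic helper; this sketch (illustrative name) applies it to both examples from the text:

```python
import math

def inverse_derivative(f_prime, f_inv, x):
    # (f^{-1})'(x) = 1 / f'(f^{-1}(x)), valid when the denominator is nonzero
    return 1.0 / f_prime(f_inv(x))

# d/dx ln(x) at x = 2: ln is the inverse of exp, whose derivative is exp
dlog = inverse_derivative(math.exp, math.log, 2.0)      # should equal 1/2

# d/dx arctan(x) at x = 1: tan's derivative is sec^2 y = 1 + tan^2 y
datan = inverse_derivative(lambda y: 1 + math.tan(y)**2, math.atan, 1.0)
```

Both calls reproduce the closed forms 1/x and 1/(1 + x^2) at the chosen points.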

Generalized Power Rule

The chain rule directly generalizes the power rule to composite functions, allowing differentiation of expressions like [g(x)]^n for real n. Let h(x) = [g(x)]^n = f(g(x)), where f(u) = u^n. The power rule states f'(u) = n u^{n-1}, so the chain rule gives: h'(x) = n [g(x)]^{n-1} \cdot g'(x). This derivation assumes the basic power rule for the outer function and applies the chain rule for the inner substitution. It unifies differentiation of powers, roots, and rational exponents in composite forms. An example is \frac{d}{dx} (x^2 + 1)^{3/2} = \frac{3}{2} (x^2 + 1)^{1/2} \cdot 2x = 3x (x^2 + 1)^{1/2}, illustrating how the chain rule scales the simple power rule to handle composition. This approach also underpins logarithmic differentiation for products or quotients raised to powers, where taking the natural logarithm and applying the chain rule simplifies complex expressions.
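The worked example with the fractional exponent can be checked numerically; this sketch (illustrative names) compares the generalized power rule output against a finite difference:

```python
def h(x):
    # composite power: (x^2 + 1)^(3/2)
    return (x**2 + 1) ** 1.5

def h_prime(x):
    # generalized power rule with n = 3/2 and g(x) = x^2 + 1:
    # n [g(x)]^(n-1) g'(x) = (3/2)(x^2 + 1)^(1/2) * 2x = 3x sqrt(x^2 + 1)
    return 1.5 * (x**2 + 1) ** 0.5 * 2 * x

def central_diff(fn, x, eps=1e-6):
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)
```

The same pattern handles any real exponent, which is why this single rule covers powers, roots, and rational exponents at once.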

Higher-Order Derivatives

The chain rule for first-order derivatives extends naturally to higher-order derivatives of composite functions, providing expressions for the second, third, and subsequent derivatives of h(x) = f(g(x)) in terms of the derivatives of f and g. For the second derivative, repeated application of the product rule to the first-order chain rule yields h''(x) = f''(g(x)) [g'(x)]^2 + f'(g(x)) g''(x), which illustrates how contributions from both the outer and inner functions accumulate. This pattern generalizes, but the expressions grow combinatorially complex, necessitating a systematic formula for the nth derivative. The general higher-order chain rule is given by Faà di Bruno's formula, which expresses the nth derivative of the composition h(x) = f(g(x)) as \frac{d^n}{dx^n} f(g(x)) = \sum \frac{n!}{b_1! b_2! \cdots b_n!} f^{(k)}(g(x)) \prod_{j=1}^n \left( \frac{g^{(j)}(x)}{j!} \right)^{b_j}, where the sum is taken over all non-negative integers b_1, b_2, \dots, b_n satisfying \sum_{j=1}^n j b_j = n and k = \sum_{j=1}^n b_j. Here, f^{(k)} denotes the kth derivative of f, and g^{(j)} the jth derivative of g. This formula arises from combinatorial considerations involving set partitions of \{1, 2, \dots, n\}, where each b_j counts the number of blocks of size j in the partition. Faà di Bruno's formula unifies the higher-order extensions of the chain rule by accounting for all ways the derivatives of g can contribute to the overall differentiation process, much like the first-order case multiplies the derivatives directly. For instance, applying it to n=3 for h(x) = f(g(x)) produces h'''(x) = f'''(g(x)) [g'(x)]^3 + 3 f''(g(x)) g'(x) g''(x) + f'(g(x)) g'''(x), highlighting the coefficients that emerge from the partition structure (e.g., the factor of 3 corresponds to the three partitions with one block of size 2 and one of size 1). This result can be verified by differentiating the second-order expression twice more using the product and chain rules. 
In practice, Faà di Bruno's formula facilitates computations in areas requiring higher derivatives, such as Taylor expansions of composite functions and solving differential equations, by avoiding ad hoc repeated differentiations. Alternative formulations offer computational efficiencies for specific applications but retain the same underlying partition-based structure.
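The n = 3 instance of Faà di Bruno's formula quoted above can be verified against a hand-differentiated closed form; this sketch (illustrative names; the choice f(u) = e^u, g(x) = x^2 is an assumption made here because every derivative of f is e^u) compares the two:

```python
import math

def third_derivative_faa(x):
    # n = 3 expansion: f'''(g)[g']^3 + 3 f''(g) g' g'' + f'(g) g'''
    # with f(u) = e^u (all derivatives e^u) and g(x) = x^2
    g, gp, gpp, gppp = x**2, 2 * x, 2.0, 0.0
    e = math.exp(g)
    return e * gp**3 + 3 * e * gp * gpp + e * gppp

def third_derivative_closed(x):
    # differentiating h(x) = e^{x^2} three times by hand gives
    # h'''(x) = e^{x^2} (8x^3 + 12x)
    return math.exp(x**2) * (8 * x**3 + 12 * x)
```

The coefficient 3 on the middle term is exactly the partition count highlighted in the text.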

Multivariable Chain Rule

Scalar-Valued Functions

The multivariable chain rule extends the single-variable version to compositions of functions where the outer function is scalar-valued, meaning it maps from \mathbb{R}^k to \mathbb{R}, while the inner function maps from \mathbb{R}^m to \mathbb{R}^k. This rule allows computation of partial derivatives of the composite function h(\mathbf{x}) = f(\mathbf{g}(\mathbf{x})) in terms of the partial derivatives of f and \mathbf{g}, where \mathbf{x} \in \mathbb{R}^m and \mathbf{g}(\mathbf{x}) = \mathbf{u} \in \mathbb{R}^k. The derivative of h is given by a Jacobian matrix product, but since f is scalar-valued, it simplifies to the gradient of f multiplied by the Jacobian of \mathbf{g}. Formally, if f: \mathbb{R}^k \to \mathbb{R} and \mathbf{g}: \mathbb{R}^m \to \mathbb{R}^k are differentiable at the relevant points, then h: \mathbb{R}^m \to \mathbb{R} is differentiable at \mathbf{a} \in \mathbb{R}^m, and the partial derivative of h with respect to the i-th component of \mathbf{x} is \frac{\partial h}{\partial x_i}(\mathbf{a}) = \sum_{j=1}^k \frac{\partial f}{\partial u_j}(\mathbf{g}(\mathbf{a})) \cdot \frac{\partial g_j}{\partial x_i}(\mathbf{a}), where \mathbf{u} = (u_1, \dots, u_k) and \mathbf{g} = (g_1, \dots, g_k). In vector notation, the gradient of h is \nabla h(\mathbf{a}) = \nabla f(\mathbf{g}(\mathbf{a})) \cdot D\mathbf{g}(\mathbf{a}), where D\mathbf{g}(\mathbf{a}) is the k \times m Jacobian matrix of \mathbf{g} at \mathbf{a}. This holds under the assumption that \mathbf{g} is differentiable at \mathbf{a} and f is differentiable at \mathbf{g}(\mathbf{a}).

A common special case arises when the inner function parametrizes a curve, so m=1, \mathbf{g}: \mathbb{R} \to \mathbb{R}^k, and h(t) = f(\mathbf{g}(t)). Here, the chain rule reduces to the form: h'(t) = \nabla f(\mathbf{g}(t)) \cdot \mathbf{g}'(t). For instance, in two dimensions with f(x,y) = x^2 + y and \mathbf{g}(t) = (2t + 1, 3t - 1), the derivative is h'(t) = 2x \frac{dx}{dt} + \frac{dy}{dt}; substituting the parametrization yields h'(t) = 4(2t + 1) + 3 = 8t + 7. This illustrates how the rule computes rates of change along curves in higher dimensions. Another frequent application occurs when m=2, as in changes of variables like polar coordinates. For z = f(x,y) = x^2 y with x = r \cos \theta, y = r \sin \theta, the partials are \frac{\partial z}{\partial r} = \frac{\partial f}{\partial x} \frac{\partial x}{\partial r} + \frac{\partial f}{\partial y} \frac{\partial y}{\partial r} = 2xy \cos \theta + x^2 \sin \theta = 3r^2 \cos^2 \theta \sin \theta, and similarly for \frac{\partial z}{\partial \theta}, enabling efficient computation in non-Cartesian systems. These cases highlight the rule's utility in multivariable settings without requiring the full matrix form.
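The curve case can be checked directly; this sketch (illustrative names) evaluates \nabla f(\mathbf{g}(t)) \cdot \mathbf{g}'(t) for the example f(x,y) = x^2 + y, \mathbf{g}(t) = (2t + 1, 3t - 1) and compares it with a finite-difference derivative of the composite:

```python
def f(x, y):
    return x**2 + y

def g(t):
    return (2 * t + 1, 3 * t - 1)

def h(t):
    return f(*g(t))

def h_prime(t):
    # gradient of f at g(t) is (2x, 1); g'(t) = (2, 3)
    x, _ = g(t)
    return 2 * x * 2 + 1 * 3   # dot product, simplifies to 8t + 7

def central_diff(fn, t, eps=1e-6):
    return (fn(t + eps) - fn(t - eps)) / (2 * eps)
```

At t = 0.5 this gives 8(0.5) + 7 = 11, matching the closed form in the text.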

Vector-Valued Functions

In multivariable calculus, the chain rule extends to compositions of vector-valued functions, where both the inner and outer functions map between Euclidean spaces. Consider an inner function \mathbf{g}: \mathbb{R}^m \to \mathbb{R}^k that is differentiable at a point \mathbf{a} \in \mathbb{R}^m, and an outer function \mathbf{f}: \mathbb{R}^k \to \mathbb{R}^p that is differentiable at \mathbf{g}(\mathbf{a}). The composite \mathbf{h} = \mathbf{f} \circ \mathbf{g}: \mathbb{R}^m \to \mathbb{R}^p is then differentiable at \mathbf{a}, with its derivative given by the matrix product of the Jacobians of \mathbf{f} and \mathbf{g}. The Jacobian matrix of \mathbf{g} at \mathbf{a}, denoted D\mathbf{g}(\mathbf{a}), is the k \times m matrix whose (i,j)-th entry is the partial derivative \frac{\partial g_i}{\partial x_j}(\mathbf{a}), where \mathbf{g} = (g_1, \dots, g_k) and \mathbf{x} = (x_1, \dots, x_m). Similarly, D\mathbf{f}(\mathbf{g}(\mathbf{a})) is the p \times k Jacobian matrix of \mathbf{f}. The chain rule states: D\mathbf{h}(\mathbf{a}) = D\mathbf{f}(\mathbf{g}(\mathbf{a})) \cdot D\mathbf{g}(\mathbf{a}), where the dot denotes matrix multiplication. This formula generalizes the single-variable chain rule by chaining linear approximations through matrix composition. In component form, for the i-th component h_i of \mathbf{h}, the partial derivative with respect to x_j is \frac{\partial h_i}{\partial x_j}(\mathbf{a}) = \sum_{\ell=1}^k \frac{\partial f_i}{\partial y_\ell}(\mathbf{g}(\mathbf{a})) \cdot \frac{\partial g_\ell}{\partial x_j}(\mathbf{a}), where \mathbf{y} = (y_1, \dots, y_k) are the intermediate variables. This summation reflects the multivariable nature, accounting for all paths of dependence. The proof relies on the definition of differentiability, expressing the error in the linear approximation of \mathbf{h} as a composition of errors from \mathbf{f} and \mathbf{g}, which vanish in the limit.

A concrete example illustrates the rule. Let \mathbf{f}: \mathbb{R}^3 \to \mathbb{R}^2 be defined by \mathbf{f}(x,y,z) = (2xy, y - z), and \mathbf{g}: \mathbb{R}^2 \to \mathbb{R}^3 by \mathbf{g}(u,v) = (u - 4v, uv, u + v). The Jacobians are D\mathbf{g}(u,v) = \begin{pmatrix} 1 & -4 \\ v & u \\ 1 & 1 \end{pmatrix}, \quad D\mathbf{f}(x,y,z) = \begin{pmatrix} 2y & 2x & 0 \\ 0 & 1 & -1 \end{pmatrix}. At a point (u,v), D(\mathbf{f} \circ \mathbf{g})(u,v) = D\mathbf{f}(\mathbf{g}(u,v)) \cdot D\mathbf{g}(u,v), which yields \begin{pmatrix} 4uv - 8v^2 & 2u^2 - 16uv \\ v - 1 & u - 1 \end{pmatrix} after direct computation and verification. This confirms the matrix product form. For the special case where the outer function \mathbf{f} is scalar-valued (p=1), the chain rule simplifies to the gradient of f multiplied by the derivative of \mathbf{g}, such as \frac{dh}{dt} = \nabla f(\mathbf{g}(t)) \cdot \mathbf{g}'(t) when \mathbf{g}: \mathbb{R} \to \mathbb{R}^k. This is particularly useful in applications like trajectory analysis or optimization.
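The matrix-product form of this example can be reproduced with a small pure-Python sketch (illustrative names; `matmul` is a hand-rolled helper, not a library call), multiplying the two Jacobians and comparing against the worked-out closed form:

```python
def jacobian_g(u, v):
    # Dg: 3x2 matrix of partials of g(u,v) = (u - 4v, uv, u + v)
    return [[1, -4], [v, u], [1, 1]]

def jacobian_f(x, y, z):
    # Df: 2x3 matrix of partials of f(x,y,z) = (2xy, y - z)
    return [[2 * y, 2 * x, 0], [0, 1, -1]]

def matmul(A, B):
    # plain row-by-column matrix product
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def jacobian_composite(u, v):
    # chain rule: D(f o g)(u,v) = Df(g(u,v)) * Dg(u,v)
    x, y, z = u - 4 * v, u * v, u + v
    return matmul(jacobian_f(x, y, z), jacobian_g(u, v))

def jacobian_closed_form(u, v):
    # the worked-out product quoted in the text
    return [[4 * u * v - 8 * v**2, 2 * u**2 - 16 * u * v],
            [v - 1, u - 1]]
```

Integer inputs make the comparison exact, with no floating-point tolerance needed.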

Advanced Applications and Generalizations

Backpropagation in Neural Networks

Backpropagation is a cornerstone algorithm for training artificial neural networks, relying fundamentally on the chain rule to efficiently compute gradients of the loss function with respect to the network's parameters. In neural networks, the forward pass computes the output by propagating inputs through layers of interconnected neurons, each applying linear transformations followed by nonlinear activation functions. The chain rule enables the backward pass to decompose the total derivative of the loss with respect to early-layer parameters into products of local derivatives, allowing gradients to flow in reverse from the output layer to the input layer. This process, first systematically described for multilayer networks in the seminal 1986 paper by Rumelhart, Hinton, and Williams, revolutionized machine learning by making it feasible to train deep architectures with hundreds of layers.

The core mechanism of backpropagation treats the network as a composition of functions, where the loss C is a function of the network output a^L, which depends on intermediate activations a^l across layers l = 1 to L. To find \frac{\partial C}{\partial w^l_{jk}}, the partial derivative of the loss with respect to a weight w^l_{jk} in layer l, the chain rule is applied repeatedly: \frac{\partial C}{\partial w^l_{jk}} = \frac{\partial C}{\partial z^l_j} \cdot \frac{\partial z^l_j}{\partial w^l_{jk}}, where z^l_j is the pre-activation input to neuron j in layer l. Here, \frac{\partial C}{\partial z^l_j} (often denoted \delta^l_j) is the error term propagated backward, computed as \delta^l_j = \sum_k \frac{\partial C}{\partial z^{l+1}_k} \cdot \frac{\partial z^{l+1}_k}{\partial a^l_j} \cdot \sigma'(z^l_j), with \sigma' the derivative of the activation function. This recursive application of the multivariable chain rule ensures that gradients are calculated in a single backward sweep whose cost is proportional to that of the forward pass.

In practice, for a network with input x, weights w^l, biases b^l, and activations a^l = \sigma(z^l) where z^l = w^l a^{l-1} + b^l, the backpropagation equations formalize this in four steps: (1) output error \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j); (2) error propagation \delta^l = ( (w^{l+1})^T \delta^{l+1} ) \odot \sigma'(z^l); (3) bias gradient \frac{\partial C}{\partial b^l_j} = \delta^l_j; and (4) weight gradient \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j. These derive directly from the chain rule by considering how changes in weights affect subsequent layers through activation paths. For instance, in a simple two-layer network predicting a scalar output, the gradient with respect to the first-layer weights chains through the second-layer weights and activations, avoiding the computational cost of finite differences. This efficiency scales to modern deep networks, where the chain rule handles parameter spaces whose dimensionality can exceed billions. The chain rule's role extends to variants like convolutional and recurrent networks, where it adapts to shared parameters and temporal dependencies, respectively. In convolutional layers, gradients backpropagate through spatial dimensions using analogous multivariable extensions, while in recurrent networks, it unfolds the loop into a deep chain, though prone to vanishing or exploding gradients, issues addressed by techniques like gated units. Overall, backpropagation's reliance on the chain rule has enabled breakthroughs in image recognition, language processing, and beyond, underpinning gradient-based optimizers.
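The two-layer scalar case mentioned above can be sketched in pure Python (a toy network with one neuron per layer and no biases, an assumption made here to keep the sketch minimal; names are illustrative). The backward pass multiplies the local derivatives exactly as the chain rule prescribes, and the result can be checked against finite differences of the loss:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w1, w2, x):
    # two-layer scalar network: a1 = sigma(w1 x), a2 = sigma(w2 a1)
    z1 = w1 * x
    a1 = sigmoid(z1)
    z2 = w2 * a1
    a2 = sigmoid(z2)
    return z1, a1, z2, a2

def loss(w1, w2, x, y):
    # quadratic loss C = (1/2)(a2 - y)^2
    *_, a2 = forward(w1, w2, x)
    return 0.5 * (a2 - y) ** 2

def grad_backprop(w1, w2, x, y):
    # chain rule applied backward through the layers
    z1, a1, z2, a2 = forward(w1, w2, x)
    delta2 = (a2 - y) * a2 * (1 - a2)       # dC/dz2, using sigma' = a(1-a)
    delta1 = delta2 * w2 * a1 * (1 - a1)    # dC/dz1, propagated through w2
    return delta1 * x, delta2 * a1          # dC/dw1, dC/dw2
```

One backward sweep yields both gradients, whereas finite differences would need a separate pair of forward passes per parameter.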

Further Extensions

The chain rule extends to mappings between Banach spaces through the Fréchet derivative, which generalizes the notion of differentiability to infinite-dimensional settings in functional analysis. For normed vector spaces U, V, and W, with open sets \Omega \subset U and \tilde{\Omega} \subset V, consider functions F: \Omega \to V (satisfying F(\Omega) \subseteq \tilde{\Omega}) and G: \tilde{\Omega} \to W. If F is Fréchet differentiable at x \in \Omega and G is Fréchet differentiable at y = F(x), then the composition G \circ F is Fréchet differentiable at x, with the derivative given by \Delta_x (G \circ F) = \Delta_y G \circ \Delta_x F, where \Delta_x F: U \to V and \Delta_y G: V \to W are the bounded linear Fréchet derivatives. This formulation mirrors the finite-dimensional case but requires the stronger uniformity condition of the Fréchet derivative, defined by \lim_{\|h\|_U \to 0} \frac{\|F(x + h) - F(x) - \Delta_x F(h)\|_V}{\|h\|_U} = 0. The proof relies on little-o estimates for the remainders in the expansions of F and G, ensuring the composition's remainder satisfies the Fréchet condition.

In differential geometry, the chain rule applies to maps between differentiable manifolds, where derivatives are defined via tangent spaces. For manifolds M, N, and P, and maps f: M \to N, g: N \to P, the differential (pushforward) df_p: T_p M \to T_{f(p)} N at p \in M is the linear map satisfying (f \circ c)'(0) = df_p(c'(0)) for curves c with c(0) = p. The chain rule states that d_p (g \circ f) = dg_{f(p)} \circ df_p. This holds because the composition g \circ f \circ c differentiates via the standard chain rule in local coordinates, and chart transitions preserve differentiability. The result underpins computations in general relativity and other geometric theories, where manifolds model curved spaces.

A stochastic extension appears in stochastic calculus as Itô's lemma, adapting the chain rule for processes driven by randomness. For an Itô process X_t with drift a(X_t, t) and diffusion coefficient \sigma(X_t, t), and a twice-differentiable function u(x, t), Itô's lemma gives du(X_t, t) = \partial_t u \, dt + \partial_x u \, dX_t + \frac{1}{2} \partial_{xx} u \, \sigma^2(X_t, t) \, dt. The extra quadratic term arises from the non-zero quadratic variation of dX_t, where (dX_t)^2 = \sigma^2(X_t, t) \, dt in the mean-square sense, unlike in deterministic calculus. The proof involves Taylor expansion over partitions and convergence of the resulting sums to stochastic integrals. This lemma is foundational for pricing derivatives in mathematical finance.
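As a standard worked illustration of the extra Itô term (the choice of geometric Brownian motion, dX_t = \mu X_t \, dt + \sigma X_t \, dW_t, is an example supplied here, not taken from the text above), apply the lemma to u(x) = \ln x, for which \partial_x u = 1/x and \partial_{xx} u = -1/x^2:

```latex
d\ln X_t
  = \frac{1}{X_t}\, dX_t
    + \frac{1}{2}\left(-\frac{1}{X_t^2}\right)\sigma^2 X_t^2 \, dt
  = \left(\mu - \frac{\sigma^2}{2}\right) dt + \sigma \, dW_t .
```

The correction term -\sigma^2/2 \, dt has no counterpart in the deterministic chain rule, which would give only d\ln X_t = dX_t / X_t; it is produced entirely by the quadratic-variation term discussed above.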

References

  1. Calculus I – Chain Rule, Paul's Online Math Notes (Nov 16, 2022).
  2. The Chain Rule, Department of Mathematics, UTSA (Nov 3, 2021).
  3. The Calculus, in The Oxford Handbook of Leibniz.
  4. Early Use of the Chain Rule, Florida State University.
  5. 13.5 The Multivariable Chain Rule.
  6. Applications of Chain Rule.
  7. Derivatives by the Chain Rule, MIT OpenCourseWare.
  8. The Intuitive Notion of the Chain Rule.
  9. The Idea of the Chain Rule, Math Insight.
  11. Analyse des infiniment petits, Guillaume de l'Hôpital, marquis de.
  12. Théorie des fonctions analytiques, J. L. Lagrange, 1797.
  13. Teaching Calculus Through History's Lens, CMS Notes.
  14. 3.6: The Chain Rule, Mathematics LibreTexts (Jan 17, 2025).
  15. MIT OpenCourseWare 18.100B, Lecture 15 (Apr 10, 2025).
  16. The Chain Rule (lecture notes).
  17. Math 131, Lecture 19: The Chain Rule (Nov 9, 2011).
  20. Chain Rules for Higher Derivatives, ResearchGate.
  21. Special Cases of the Multivariable Chain Rule, Math Insight.
  22. 2.5 Chain Rule for Multiple Variables, UCSD Math.
  23. 2.3 The Chain Rule, MAT237 course notes, University of Toronto.
  24. Multivariable Chain Rule, Stanford AI Lab.
  25. Multivariable Vector-Valued Functions, Bard Faculty.
  26. Special Cases of the Multivariable Chain Rule, Math Insight.
  27. Rumelhart, D., Hinton, G. & Williams, R., "Learning representations by back-propagating errors", Nature 323, 533–536 (1986).
  28. How the Backpropagation Algorithm Works.
  29. Backpropagation, CS231n: Deep Learning for Computer Vision.
  31. Fréchet Derivatives and Gâteaux Derivatives, Jordan Bell (Apr 3, 2014).
  32. Differentiable Manifolds (lecture notes).
  33. Itô's Lemma, Lesson 4, NYU Courant.