Total derivative
In multivariable calculus, the total derivative of a function \mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m at a point \mathbf{a} is the best linear approximation to the change in \mathbf{f} near \mathbf{a}, represented as an m \times n matrix known as the Jacobian matrix.[1] This matrix consists of all first-order partial derivatives of the component functions of \mathbf{f}, with the entry in row i and column j given by \frac{\partial f_i}{\partial x_j} evaluated at \mathbf{a}.[1] Formally, \mathbf{f} is differentiable at \mathbf{a} if there exists a linear map D\mathbf{f}(\mathbf{a}) such that

\lim_{\mathbf{h} \to \mathbf{0}} \frac{\| \mathbf{f}(\mathbf{a} + \mathbf{h}) - \mathbf{f}(\mathbf{a}) - D\mathbf{f}(\mathbf{a}) \mathbf{h} \|}{\| \mathbf{h} \|} = 0,

where the limit condition ensures the approximation error vanishes faster than the perturbation size.[2]

Unlike partial derivatives, which measure change with respect to a single variable while holding the others fixed, the total derivative accounts for simultaneous variations in all input variables, providing a complete local linearization of the function.[3] For scalar-valued functions (m = 1), the total derivative reduces to the gradient vector \nabla f = \left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right), which points in the direction of steepest ascent and determines the tangent hyperplane.[3] In vector-valued cases, it generalizes this to transformations between vector spaces, which is essential for analyzing systems in physics or engineering where multiple inputs and outputs interact.[4]

The total derivative underpins key theorems of multivariable calculus, including the chain rule, which composes derivatives as matrix products: if \mathbf{g}: \mathbb{R}^p \to \mathbb{R}^n and \mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m, then D(\mathbf{f} \circ \mathbf{g})(\mathbf{b}) = D\mathbf{f}(\mathbf{g}(\mathbf{b})) \cdot D\mathbf{g}(\mathbf{b}).[5] It also facilitates computations in optimization, where the Jacobian drives gradient-based methods, and in differential geometry, where it describes tangent spaces of manifolds.[6] Properties such as linearity (e.g., D(\mathbf{f} + \mathbf{g}) = D\mathbf{f} + D\mathbf{g} and D(c\mathbf{f}) = c D\mathbf{f} for scalar c) make it a foundational tool for higher-dimensional analysis.[3]
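The "best linear approximation" property can be sanity-checked numerically. The following minimal sketch in Python with NumPy uses an arbitrarily chosen map \mathbf{f}: \mathbb{R}^2 \to \mathbb{R}^2, point, and step; none of these come from the cited sources.

```python
import numpy as np

# Illustrative map f: R^2 -> R^2 and its hand-computed Jacobian.
def f(v):
    x, y = v
    return np.array([x**2 * y, np.sin(x) + y])

def jacobian_f(v):
    x, y = v
    # Row i, column j holds the partial of component i with respect to variable j.
    return np.array([[2 * x * y, x**2],
                     [np.cos(x), 1.0]])

a = np.array([1.0, 2.0])
h = np.array([1e-4, -2e-4])

actual_change = f(a + h) - f(a)
linear_prediction = jacobian_f(a) @ h
print(actual_change, linear_prediction)  # agree to roughly 1e-8: the error is o(||h||)
```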
Definition and Basics
As a Linear Map
The total derivative of a multivariable function f: \mathbb{R}^n \to \mathbb{R}^m at a point a \in \mathbb{R}^n is defined as the unique linear map Df(a): \mathbb{R}^n \to \mathbb{R}^m that provides the best linear approximation to f near a.[5][7] Specifically, f is differentiable at a if there exists such a linear map satisfying the limit condition

\lim_{h \to 0} \frac{\|f(a + h) - f(a) - Df(a)(h)\|}{\|h\|} = 0,

where \|\cdot\| denotes a norm on the respective vector space, ensuring the approximation error is negligible compared to the perturbation size.[5][7] This definition presupposes familiarity with linear maps between finite-dimensional vector spaces and their associated norms.[8]

This construction generalizes the single-variable derivative, where for f: \mathbb{R} \to \mathbb{R} the map Df(a) is simply multiplication by the scalar f'(a), to higher dimensions by replacing the scalar with a linear operator that captures directional changes in all input variables.[7][9] In matrix form, the total derivative is represented by the Jacobian matrix J_f(a), an m \times n matrix whose entries are the partial derivatives of the component functions of f, so that Df(a)(h) = J_f(a) h for any h \in \mathbb{R}^n.[5][10] The partial derivatives thus serve as the components assembling this matrix.[7]

Geometrically, Df(a)(h) \approx f(a + h) - f(a), generalizing the tangent-line slope of the single-variable case; for scalar-valued functions (m = 1), this corresponds to the tangent hyperplane to the graph in \mathbb{R}^{n+1}.[5][11]
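The defining limit can be probed numerically: the error ratio below should tend to zero as the step shrinks. The function, point, and direction are illustrative choices, not taken from the cited sources.

```python
import numpy as np

# Illustrative f: R^2 -> R^2 with hand-computed Jacobian J_f.
def f(v):
    x, y = v
    return np.array([x * y, x + y**2])

def J_f(v):
    x, y = v
    return np.array([[y, x],
                     [1.0, 2 * y]])

a = np.array([1.0, 1.0])
u = np.array([0.6, 0.8])  # fixed unit direction

for t in [1e-1, 1e-2, 1e-3, 1e-4]:
    h = t * u
    ratio = np.linalg.norm(f(a + h) - f(a) - J_f(a) @ h) / np.linalg.norm(h)
    print(t, ratio)  # the ratio shrinks with ||h||, consistent with the limit being 0
```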
Relation to Partial Derivatives
The total derivative of a function f: \mathbb{R}^n \to \mathbb{R}^m at a point a \in \mathbb{R}^n is represented by the Jacobian matrix J_f(a), whose entries are the partial derivatives of the component functions of f; specifically, the (i,j)-th entry is (J_f(a))_{ij} = \frac{\partial f_i}{\partial x_j}(a).[5] In terms of linear maps, the total derivative can be expressed as Df(a) = \sum_{j=1}^n \frac{\partial f}{\partial x_j}(a) \otimes e_j^*, where the e_j^* are the dual basis vectors, or equivalently in coordinates, Df(a)(v) = J_f(a) v for v \in \mathbb{R}^n.[7]

For a scalar-valued function f: \mathbb{R}^n \to \mathbb{R}, the total derivative at a acts as the linear map df(a): \mathbb{R}^n \to \mathbb{R} given by df(a)(v) = \sum_{i=1}^n \frac{\partial f}{\partial x_i}(a) v_i, which in differential notation is written df(a) = \sum_{i=1}^n \frac{\partial f}{\partial x_i}(a) \, dx_i but represents nothing more than the action of the linear approximation on input vectors.[12]

Differentiability of f at a is defined by the existence of a linear map Df(a) satisfying \lim_{h \to 0} \frac{\|f(a+h) - f(a) - Df(a)h\|}{\|h\|} = 0, which implies that all partial derivatives exist at a. A sufficient condition for differentiability at a is that all partial derivatives exist in a neighborhood of a and are continuous at a; the converse fails, however: differentiability does not imply continuity of the partials.[13][14]

As a concrete illustration, consider f(x,y) = x^2 + y at the point (1,1). The partial derivatives are \frac{\partial f}{\partial x} = 2x and \frac{\partial f}{\partial y} = 1, so Df(1,1) = \begin{pmatrix} 2 & 1 \end{pmatrix}. Applying this to a vector (h,k) yields Df(1,1)(h,k) = 2h + k.[15]

Unlike a directional derivative, which measures the rate of change along a single direction and may exist even when f is not differentiable (the formula \frac{\partial f}{\partial u}(a) = \nabla f(a) \cdot u for a unit vector u requires differentiability), the total derivative demands the full linear approximation limit, ensuring it captures changes in all directions simultaneously.[12]
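The worked example can be reproduced symbolically; the sketch below uses SymPy's Matrix.jacobian as a convenience, not a method prescribed by the sources.

```python
import sympy as sp

x, y, h, k = sp.symbols('x y h k')
f = sp.Matrix([x**2 + y])

# Jacobian of the scalar function: a 1 x 2 row of partial derivatives.
J = f.jacobian([x, y])            # Matrix([[2*x, 1]])
J_at_a = J.subs({x: 1, y: 1})     # Matrix([[2, 1]]), i.e. Df(1,1)

# Action of the total derivative on the increment (h, k).
action = (J_at_a * sp.Matrix([h, k]))[0]
print(J_at_a, action)             # Matrix([[2, 1]])  2*h + k
```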
Interpretations and Properties
As a Differential Form
In differential geometry, the total derivative of a smooth function f: M \to \mathbb{R} defined on a smooth manifold M is interpreted as the differential df, which is a differential 1-form on M. This 1-form is expressed locally in coordinates (x_1, \dots, x_n) as df = \sum_{i=1}^n \frac{\partial f}{\partial x_i} \, dx_i, where the dx_i are the coordinate basis 1-forms. As a section of the cotangent bundle T^*M, df assigns to each point p \in M a linear functional df_p: T_p M \to \mathbb{R} that gives the directional derivative of f at p along tangent vectors in the tangent space T_p M.[16][17]

The differential df possesses key properties arising from the exterior derivative operator d. For smooth f, df is exact by construction, meaning df = d(f) with f viewed as a 0-form, and it is closed since the exterior derivative satisfies d^2 = 0, so d(df) = 0. Closedness is thus automatic for df; its exactness is the stronger property, distinguishing it from general closed 1-forms, which need not be exact on non-contractible manifolds.[16][18][19]

For a smooth map f: M \to N between manifolds, the pullback operation provides a geometric interpretation of the total derivative through the induced map on forms: if \omega is a 1-form on N, then f^* \omega is the 1-form on M defined by (f^* \omega)_p(v) = \omega_{f(p)}(df_p(v)) for p \in M and v \in T_p M. The pullback commutes with the exterior derivative, d(f^* \omega) = f^* (d \omega), preserving exactness and closedness. This framework extends the total derivative to compositions and transformations between manifolds.[16][18][20]

The utility of df as a 1-form is evident in integration theory. For a piecewise smooth path \gamma: [a, b] \to M from point A to point B, the line integral satisfies \int_\gamma df = f(B) - f(A), generalizing the fundamental theorem of calculus to manifolds; this is the special case of Stokes' theorem for an exact 1-form. Locally, if \gamma(t) = (x_1(t), \dots, x_n(t)), this becomes \int_a^b \sum_i \frac{\partial f}{\partial x_i} \frac{dx_i}{dt} \, dt. Under a change of coordinates from (x_i) to (y_j), the basis 1-forms transform via the pullback, dy_j = \sum_k \frac{\partial y_j}{\partial x_k} dx_k, i.e., by the Jacobian matrix of the coordinate map, the transformation law characteristic of covectors. This ensures that df remains well-defined invariantly across charts.[18][16][17][19]
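The identity \int_\gamma df = f(B) - f(A) can be checked numerically for a concrete f and path; both are arbitrary illustrative choices, and the trapezoidal quadrature below is a plain sketch rather than a method from the cited sources.

```python
import numpy as np

def f(x, y):
    return x**2 * y + np.cos(y)

def grad_f(x, y):
    # Components of df: (df/dx, df/dy).
    return 2 * x * y, x**2 - np.sin(y)

# Path gamma(t) = (t, t^2), t in [0, 1], from A = (0, 0) to B = (1, 1).
t = np.linspace(0.0, 1.0, 20001)
x, y = t, t**2
dxdt, dydt = np.ones_like(t), 2 * t

gx, gy = grad_f(x, y)
integrand = gx * dxdt + gy * dydt  # pullback of df along gamma

line_integral = np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(t))
print(line_integral, f(1.0, 1.0) - f(0.0, 0.0))  # both ~= cos(1) ~ 0.5403
```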
Higher-Order Total Derivatives
The second total derivative of a scalar-valued function f: \mathbb{R}^n \to \mathbb{R} at a point a \in \mathbb{R}^n, denoted D^2 f(a), is defined as the derivative of the first total derivative Df, resulting in a symmetric bilinear map from \mathbb{R}^n \times \mathbb{R}^n to \mathbb{R}.[21] For such functions, D^2 f(a) is represented by the Hessian matrix H_f(a), an n \times n symmetric matrix whose (i,j)-entry is the second partial derivative \frac{\partial^2 f}{\partial x_j \partial x_i}(a).[21] This matrix form allows the second derivative to be expressed as D^2 f(a)(h, k) = k^T H_f(a) h for vectors h, k \in \mathbb{R}^n.[21]

In general, the k-th order total derivative D^k f(a) of a sufficiently smooth function f at a is a symmetric multilinear map from (\mathbb{R}^n)^k to \mathbb{R}, given by D^k f(a)(h_1, \dots, h_k) = \sum_{j_1=1}^n \cdots \sum_{j_k=1}^n h_{1,j_1} \cdots h_{k,j_k} \frac{\partial^k f}{\partial x_{j_1} \cdots \partial x_{j_k}}(a), where each h_\ell = (h_{\ell,1}, \dots, h_{\ell,n}).[22] This expression arises from the identification of higher derivatives with multilinear maps on the tangent space, capturing all mixed partial derivatives up to order k.[23]

The symmetry of D^k f(a) follows from Schwarz's theorem (also known as Clairaut's theorem), which states that if the second partial derivatives of f are continuous in a neighborhood of a, then the mixed partials commute, i.e., \frac{\partial^2 f}{\partial x_i \partial x_j}(a) = \frac{\partial^2 f}{\partial x_j \partial x_i}(a); this extends to higher orders under suitable continuity assumptions on the partials, ensuring D^k f(a) is invariant under permutations of its arguments.[24][23]

Higher-order total derivatives underpin the multivariable Taylor theorem, which provides a polynomial approximation of f near a: for a C^p function f, f(a + h) = \sum_{j=0}^p \frac{1}{j!} D^j f(a)(h, \dots, h) + R_{p,a}(h), where the remainder satisfies \|R_{p,a}(h)\| = o(\|h\|^p) as h \to 0, with the j-th term involving the j-th multilinear form applied to j copies of h.[23] For the second-order case, this yields the quadratic approximation f(a + h) \approx f(a) + Df(a) \cdot h + \frac{1}{2} h^T H_f(a) h.[21]

Consider the example f(x,y) = xy on \mathbb{R}^2. The first partials are \frac{\partial f}{\partial x} = y and \frac{\partial f}{\partial y} = x, so the second total derivative at (0,0) is represented by the Hessian H_f(0,0) = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, with D^2 f(0,0)(h_1, h_2) = h_{1,y} h_{2,x} + h_{1,x} h_{2,y} for h_i = (h_{i,x}, h_{i,y}); this is symmetric by Schwarz's theorem, since the mixed partials \frac{\partial^2 f}{\partial x \partial y} = 1 = \frac{\partial^2 f}{\partial y \partial x} are continuous everywhere.[24][21]
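For the f(x, y) = xy example, the Hessian and the bilinear form can be reproduced with SymPy's hessian helper; the symbol names below are illustrative.

```python
import sympy as sp

x, y = sp.symbols('x y')
h1x, h1y, h2x, h2y = sp.symbols('h1x h1y h2x h2y')
f = x * y

H = sp.hessian(f, (x, y))                 # Matrix([[0, 1], [1, 0]])
h1 = sp.Matrix([h1x, h1y])
h2 = sp.Matrix([h2x, h2y])

# D^2 f(0,0)(h1, h2) = h2^T H h1; symmetric under swapping h1 and h2.
bilinear = sp.expand((h2.T * H * h1)[0])  # h1x*h2y + h1y*h2x
print(H, bilinear)
```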
Chain Rule and Computation
General Chain Rule Statement
The general chain rule for total derivatives in multivariable calculus states that if f: \mathbb{R}^n \to \mathbb{R}^m is differentiable at a point a \in \mathbb{R}^n and g: \mathbb{R}^m \to \mathbb{R}^p is differentiable at f(a), then the composition h = g \circ f: \mathbb{R}^n \to \mathbb{R}^p is differentiable at a, and the total derivative satisfies Dh(a) = Dg(f(a)) \circ Df(a).[25] In matrix representation, this corresponds to the Jacobian matrix equation J_h(a) = J_g(f(a)) \, J_f(a), where J_f(a) is the m \times n Jacobian of f at a and J_g(f(a)) is the p \times m Jacobian of g at f(a).[26] This formulation treats the total derivative as a linear map between vector spaces, composing via matrix multiplication.[27]

For scalar-valued compositions, where p = 1, the chain rule specializes to \nabla h(a)^\top = \nabla g(f(a))^\top J_f(a), interpreting each gradient as a row vector multiplied by the Jacobian of the inner function, generalizing the single-variable rule (u \circ v)'(x) = u'(v(x)) v'(x).[28] This contrasts with chain rules expressed solely in partial derivatives, as the total derivative encapsulates the full linear approximation, ensuring consistency across vector-valued compositions.[25]

A proof sketch relies on the definition of differentiability: a function is differentiable at a point if it admits a linear approximation whose error term vanishes faster than the input perturbation. Specifically, write f(a + v) = f(a) + Df(a) v + \epsilon_f(v), where \lim_{v \to 0} \|\epsilon_f(v)\| / \|v\| = 0, and similarly g(b + w) = g(b) + Dg(b) w + \epsilon_g(w) with \lim_{w \to 0} \|\epsilon_g(w)\| / \|w\| = 0. Substituting b = f(a) and w = Df(a) v + \epsilon_f(v) into the expression for g(f(a + v)) yields h(a + v) = h(a) + [Dg(f(a)) Df(a)] v + higher-order terms, where the remainder satisfies the limit condition required for differentiability of h at a.[27][26]
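The Jacobian product identity can be verified symbolically. The two maps below are arbitrary illustrative choices (both \mathbb{R}^2 \to \mathbb{R}^2), not drawn from the cited sources.

```python
import sympy as sp

x, y, u, v = sp.symbols('x y u v')

f = sp.Matrix([x * y, x + y])        # inner map f(x, y)
g = sp.Matrix([u**2, sp.sin(v)])     # outer map g(u, v)

Jf = f.jacobian([x, y])
Jg = g.jacobian([u, v])

# Jacobian of the composition g(f(x, y)), computed directly ...
comp = g.subs({u: f[0], v: f[1]}).jacobian([x, y])
# ... versus the product of Jacobians, with Jg evaluated at f(x, y).
prod = Jg.subs({u: f[0], v: f[1]}) * Jf

print(sp.simplify(comp - prod))      # zero matrix, as the chain rule predicts
```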
Direct Dependency Example
In cases of direct dependency, the total derivative captures the rate of change of a composite function whose intermediate variables depend explicitly on an independent parameter. Consider z = f(x, y), with x = x(t) and y = y(t), where t is the independent parameter. The chain rule for the total derivative states that \frac{dz}{dt} = \frac{\partial f}{\partial x} \frac{dx}{dt} + \frac{\partial f}{\partial y} \frac{dy}{dt}. This formula arises from the linear approximation of f along the parametric curve in the xy-plane, summing the contributions from each direction of change.[28]

To demonstrate the computation, take the specific functions f(x, y) = x^2 y, x(t) = t, and y(t) = \sin t; the goal is to find \frac{dz}{dt} at t = 0. First, compute the partial derivatives: \frac{\partial f}{\partial x} = 2xy and \frac{\partial f}{\partial y} = x^2. Next, find the derivatives of the parameterizations: \frac{dx}{dt} = 1 and \frac{dy}{dt} = \cos t. Substituting into the chain rule gives \frac{dz}{dt} = (2xy)(1) + (x^2)(\cos t) = 2xy + x^2 \cos t. Evaluating at t = 0, where x(0) = 0, y(0) = \sin 0 = 0, and \cos 0 = 1, yields \frac{dz}{dt} \big|_{t=0} = 2(0)(0) + (0)^2 (1) = 0. This step-by-step process highlights the "dot product" structure: the gradient of f at (x(t), y(t)) is dotted with the velocity vector \left( \frac{dx}{dt}, \frac{dy}{dt} \right) along the path.[29]

The result \frac{dz}{dt} = 0 at t = 0 indicates that, at this instant, the function z(t) is instantaneously stationary along the parametric path, even though z may change elsewhere; it reflects the combined effects of the direct dependencies on t. This total rate of change gives the instantaneous variation of z as t evolves, which is essential for analyzing motion or optimization in parameterized systems.[28]
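The same computation can be cross-checked in SymPy against differentiating the substituted expression z(t) = t^2 \sin t directly; this is a minimal sketch of the example above.

```python
import sympy as sp

t, x, y = sp.symbols('t x y')

f = x**2 * y
xt, yt = t, sp.sin(t)

# Chain rule: dz/dt = f_x * x'(t) + f_y * y'(t), then evaluate along the path.
dzdt_chain = sp.diff(f, x) * sp.diff(xt, t) + sp.diff(f, y) * sp.diff(yt, t)
dzdt_chain = dzdt_chain.subs({x: xt, y: yt})   # 2*t*sin(t) + t**2*cos(t)

# Direct route: substitute first, then differentiate z(t) = t**2 * sin(t).
dzdt_direct = sp.diff(f.subs({x: xt, y: yt}), t)

print(sp.simplify(dzdt_chain - dzdt_direct))   # 0: the two routes agree
print(dzdt_chain.subs(t, 0))                   # 0, matching the worked example
```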
Indirect Dependency Example
In cases where the total derivative involves indirect dependencies through intermediate variables, the multivariable chain rule accounts for multiple paths of influence. Consider a function z = f(u, v), where u = g(x, y) depends on both independent variables x and y, and v = h(x) depends only on x. The total partial derivative of z with respect to x is then given by \frac{\partial z}{\partial x} = \frac{\partial f}{\partial u} \frac{\partial u}{\partial x} + \frac{\partial f}{\partial v} \frac{\partial v}{\partial x}, while the partial derivative with respect to y simplifies to \frac{\partial z}{\partial y} = \frac{\partial f}{\partial u} \frac{\partial u}{\partial y}, since v does not depend on y.[28][29] These expressions arise from summing the contributions along each dependency path in the non-parametric multivariable chain rule.[26]

To illustrate, take the specific functions z = f(u, v) = u^2 + v, u = g(x, y) = x y, and v = h(x) = x. First, compute the necessary partial derivatives: \frac{\partial f}{\partial u} = 2u, \frac{\partial f}{\partial v} = 1, \frac{\partial u}{\partial x} = y, \frac{\partial u}{\partial y} = x, and \frac{\partial v}{\partial x} = 1 (with \frac{\partial v}{\partial y} = 0). Substituting into the chain rule formulas yields \frac{\partial z}{\partial x} = (2u)(y) + (1)(1) = 2uy + 1 and \frac{\partial z}{\partial y} = (2u)(x) + (1)(0) = 2ux. Evaluating at the point (x, y) = (1, 1), where u = 1 \cdot 1 = 1 and v = 1, gives \frac{\partial z}{\partial x} \big|_{(1,1)} = 2(1)(1) + 1 = 3 and \frac{\partial z}{\partial y} \big|_{(1,1)} = 2(1)(1) = 2. This computation traces the indirect effects: the path through u affects both derivatives, while the path through v contributes only to \frac{\partial z}{\partial x}.[28]

The dependency structure can be visualized using a tree diagram, which highlights the indirect paths (a symbolic cross-check follows the diagram):
- Root: z
- Branches to: u (labeled \frac{\partial z}{\partial u}) and v (labeled \frac{\partial z}{\partial v})
- From u: Branches to x (labeled \frac{\partial u}{\partial x}) and y (labeled \frac{\partial u}{\partial y})
- From v: Branch to x (labeled \frac{\partial v}{\partial x})
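The path-summing encoded by the tree can be cross-checked symbolically; the sketch below substitutes the intermediate variables of the example above and compares against direct differentiation.

```python
import sympy as sp

x, y, u, v = sp.symbols('x y u v')

f = u**2 + v   # z = f(u, v)
g = x * y      # u = g(x, y)
h = x          # v = h(x)

# Chain rule, summing over each dependency path of the tree.
dzdx = (sp.diff(f, u) * sp.diff(g, x) + sp.diff(f, v) * sp.diff(h, x)).subs({u: g, v: h})
dzdy = (sp.diff(f, u) * sp.diff(g, y)).subs({u: g, v: h})

# Cross-check by substituting first: z = (x*y)**2 + x.
z = f.subs({u: g, v: h})
print(sp.simplify(dzdx - sp.diff(z, x)), sp.simplify(dzdy - sp.diff(z, y)))  # 0 0
print(dzdx.subs({x: 1, y: 1}), dzdy.subs({x: 1, y: 1}))                      # 3 2
```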