
Matrix calculus

Matrix calculus is a mathematical framework that extends the principles of calculus from scalars and vectors to functions involving matrices and higher-dimensional arrays, enabling the computation of derivatives such as gradients, Jacobians, and Hessians in multivariable settings while preserving rules like the chain rule and product rule. It treats matrices as unified objects rather than collections of scalars, facilitating efficient calculations for operations like matrix factorizations, determinants, and inverses. This discipline is particularly vital in fields requiring large-scale computations, such as machine learning, where it underpins optimization algorithms like gradient descent by providing derivatives of loss functions with respect to weight matrices. In statistics and control theory, matrix calculus supports maximum likelihood estimation and Kalman filtering through derivatives of quadratic forms and covariance matrices. Key concepts include layout conventions for notation—such as the denominator layout, where the derivative of a scalar with respect to a matrix is arranged to match the input's dimensions—and the use of traces or Frobenius inner products to simplify expressions. Notable applications extend to physics and engineering, where it aids in solving differential equations involving tensor fields and in control systems for stability analysis. Resources like The Matrix Cookbook compile essential identities for these derivatives, emphasizing practical rules over abstract proofs to support computational implementations. Overall, matrix calculus bridges linear algebra and analysis, enabling scalable solutions to complex problems in modern data-driven sciences.

Scope and Fundamentals

Definition and Historical Context

Matrix calculus, also known as matrix differential calculus, is the branch of mathematics that extends classical calculus to functions whose inputs or outputs are vectors or matrices, rather than scalars alone. This field focuses on computing derivatives, gradients, and higher-order differentials in multivariable settings where variables are arranged in matrix form, enabling the analysis of complex systems in optimization, statistics, and engineering. Unlike scalar calculus, which deals with single-variable functions, matrix calculus accounts for the linear algebraic structure of vectors and matrices to handle multidimensional dependencies. The origins of matrix calculus trace back to the 19th century with Carl Gustav Jacob Jacobi's treatment of the functional determinant in 1841, a tool essential for change-of-variables in multiple integrals and early multivariable analysis. Although Cauchy explored similar ideas in 1815, Jacobi's systematic treatment in his 1841 paper "De formatione et proprietatibus determinantium" laid foundational concepts for derivatives in multiple dimensions. Significant expansions occurred in the mid-20th century, particularly in the 1940s and 1950s, as multivariate statistics advanced; for instance, M. S. Bartlett's 1947 paper on multivariate analysis employed matrix techniques to model correlations among multiple variables. In control theory, the 1960s saw further development through Rudolf E. Kálmán's state-space representations, which used matrix derivatives to describe dynamic systems and optimal filtering. A landmark consolidation came in the late 20th century with Jan R. Magnus and Heinz Neudecker's 1988 book Matrix Differential Calculus with Applications in Statistics and Econometrics, which standardized notation and the use of differentials for matrix derivatives, building on earlier works from the 1950s and 1960s. Engaging with matrix calculus requires a solid foundation in linear algebra, including vector spaces, matrix operations, and properties like transposes and inverses, alongside multivariate calculus concepts such as partial derivatives and chain rules extended to higher dimensions. These prerequisites allow one to navigate the tensor-like nature of matrix derivatives without delving into specific computations. A representative example illustrating the extension from scalar to matrix calculus is the quadratic form f(\mathbf{x}) = \mathbf{x}^T A \mathbf{x}, where \mathbf{x} is a column vector and A is a symmetric matrix; its gradient with respect to \mathbf{x} yields 2A\mathbf{x}, contrasting with the simple scalar case f(x) = a x^2 where f'(x) = 2a x. This highlights how matrix structure introduces linear transformations into differentiation.

Relation to Other Calculus Branches

Matrix calculus serves as a natural extension of scalar calculus, where operations like partial derivatives are generalized to higher-dimensional arrays. In scalar calculus, the derivative of a function f(x) with respect to a scalar x yields a scalar, but in matrix calculus, the analogous derivative of a scalar-valued function with respect to a vector input produces a gradient vector, and further extension to matrix inputs results in a Jacobian matrix that captures all partial derivatives in a structured array. This generalization allows for the handling of multivariable functions over Euclidean spaces of matrices, treating them as flattened vectors while preserving matrix-specific operations like multiplication. Building on vector calculus, matrix calculus extends concepts such as gradients and Hessians to matrix domains, though it diverges in emphasis: vector calculus often arises in physics for integrals over paths or surfaces (e.g., line integrals or surface integrals), whereas matrix calculus prioritizes optimization problems in finite-dimensional spaces, such as least squares or eigenvalue computations, without the same focus on differential forms or manifolds. For instance, the divergence or curl operators in vector calculus have matrix analogs in trace operations or adjugate matrices, but these are typically applied in computational contexts rather than continuous fields. This shift reflects matrix calculus's roots in statistics and numerical computation, where tools are adapted for discrete array manipulations. Matrix calculus can be viewed as a specialized subset of tensor calculus, particularly for second-order tensors in finite-dimensional Euclidean spaces, where matrices represent linear transformations between vectors. Tensor calculus, used extensively in differential geometry and general relativity, employs index notation and covariant derivatives to handle multi-linear maps on manifolds, whereas matrix calculus simplifies this for flat spaces using component-wise or layout-based notations, avoiding the full machinery of metric tensors and Christoffel symbols. The notational economy of matrix calculus—relying on traces, transposes, and Kronecker products—facilitates computations in optimization and statistics, contrasting with tensor calculus's broader applicability to curved geometries. A key distinction from scalar calculus lies in the non-commutativity of matrix operations, which complicates rules like the chain rule. In scalar calculus, multiplication is commutative, allowing flexible ordering in product rules (e.g., d(xg) = g \, dx + x \, dg), but matrix calculus requires careful attention to the order of multiplication, as AB \neq BA in general, leading to distinct left- and right-multiplication variants in derivatives. This non-commutativity affects higher-order derivatives and necessitates specialized identities to ensure consistency. An illustrative example of these extensions is the generalization of the scalar Taylor series to matrix functions using the Frobenius norm. For a scalar function f(t) expanded as f(t) = f(a) + f'(a)(t - a) + \frac{1}{2}f''(a)(t - a)^2 + \cdots, the matrix analog for a differentiable matrix-valued function f(X) around X_0 involves the Fréchet derivative and higher-order terms, often measured via the Frobenius norm \| \cdot \|_F to quantify perturbations: f(X_0 + E) = f(X_0) + Df(X_0)[E] + \frac{1}{2} D^2 f(X_0)[E, E] + \cdots, where Df(X_0)[E] is the first Fréchet derivative applied to E, and the remainder is bounded using the Frobenius norm of E. This expansion preserves the local approximation property of scalar Taylor series while accounting for matrix structure, with applications in condition-number estimation and perturbation analysis.
For the specific case of the Frobenius norm itself, \|X_0 + E\|_F^2 = \|X_0\|_F^2 + 2 \operatorname{Tr}(X_0^T E) + \|E\|_F^2, an exact quadratic expansion analogous to the scalar case.
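
The following NumPy sketch (an illustration, not drawn from the cited sources) verifies this Frobenius-norm expansion numerically for arbitrary random matrices.

```python
# Minimal check of ||X0 + E||_F^2 = ||X0||_F^2 + 2 tr(X0^T E) + ||E||_F^2.
import numpy as np

rng = np.random.default_rng(0)
X0 = rng.standard_normal((4, 3))
E = rng.standard_normal((4, 3))

lhs = np.linalg.norm(X0 + E, "fro") ** 2
rhs = (np.linalg.norm(X0, "fro") ** 2
       + 2 * np.trace(X0.T @ E)
       + np.linalg.norm(E, "fro") ** 2)
print(np.isclose(lhs, rhs))  # True: the expansion is exact
```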

Primary Applications

Matrix calculus finds extensive applications in machine learning, particularly in the training of neural networks through backpropagation, where matrix derivatives are essential for computing gradients that enable efficient optimization via gradient descent. The backpropagation algorithm, which propagates errors backward through the network using chain rule-based derivatives of matrix-valued functions, allows for scalable learning in deep architectures by avoiding explicit computation of high-dimensional Jacobians. In optimization, matrix calculus underpins second-order methods such as Newton's algorithm, which utilizes the Hessian matrix—the matrix of second derivatives of the objective function—to approximate the curvature and guide faster convergence toward minima compared to first-order gradient methods. This approach is particularly valuable in large-scale problems where the Hessian provides quadratic convergence rates under suitable conditions, though computational costs often necessitate approximations like quasi-Newton methods. Within statistics, matrix derivatives facilitate the maximization of likelihood functions in multivariate models, enabling the derivation of estimators for parameters in high-dimensional settings, such as the covariance matrix under Gaussian assumptions. For instance, differentiating the log-likelihood with respect to regression coefficients yields closed-form solutions akin to ordinary least squares, while extensions handle constraints or regularization for robust estimation. In control theory, matrix calculus supports the analysis and design of state-space models, where differentials of matrix equations describe system dynamics, observability, and feedback policies in linear time-invariant systems. These derivatives are crucial for tasks like analyzing the sensitivity of state trajectories to parameter perturbations or synthesizing feedback controllers via Lyapunov methods. Emerging applications post-2020 highlight matrix calculus in quantum computing simulations, where derivatives of matrix exponentials optimize variational quantum algorithms for modeling complex quantum states. Similarly, in fair machine learning, fairness gradients—computed as matrix derivatives of loss functions with respect to demographic parity constraints—enable debiasing during training to mitigate subgroup disparities without sacrificing overall performance. These developments address gaps in traditional treatments by applying matrix calculus to ethical optimization landscapes. A representative example is the computation of gradients in linear regression via least squares, where the derivative of the squared-error loss with respect to the parameter vector leads to the normal equations, providing the minimum-variance unbiased estimator in matrix form. This illustrates how matrix calculus simplifies multi-output predictions while ensuring computational efficiency.
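
A minimal NumPy sketch of this least-squares example, using synthetic data (an illustration rather than a prescribed implementation), forms the normal equations and confirms that the gradient 2X^T(Xw - y) vanishes at the solution.

```python
# Gradient of ||X w - y||^2 is 2 X^T (X w - y); setting it to zero gives X^T X w = X^T y.
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))            # design matrix
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.standard_normal(50)

w_hat = np.linalg.solve(X.T @ X, X.T @ y)   # solve the normal equations
grad_at_solution = 2 * X.T @ (X @ w_hat - y)
print(np.allclose(grad_at_solution, 0, atol=1e-8))  # gradient vanishes at the optimum
```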

Notation and Conventions

Standard Symbols and Definitions

Matrix calculus relies on a consistent set of symbols to represent operations and related concepts, facilitating precise communication in multivariable settings. The partial derivative is denoted by the symbol ∂, used to express the rate of change of a function with respect to a single variable while holding others constant, such as ∂f/∂x_{ij} for the partial derivative of a scalar f with respect to the (i,j)-th entry of a matrix X. The gradient operator ∇ denotes the vector of first-order partial derivatives of a scalar-valued function, typically arranged as a column vector, for instance, \nabla_x f = \left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right)^T where x \in \mathbb{R}^n. Central to matrix calculus are the Jacobian and Hessian matrices, which organize partial derivatives into matrix form. The Jacobian J_f of a function f: \mathbb{R}^n \to \mathbb{R}^m is the m \times n matrix whose (i,j)-th entry is the first-order partial derivative \partial f_i / \partial x_j, capturing the local linear behavior of f near a point. The Hessian H_f for a scalar-valued f: \mathbb{R}^n \to \mathbb{R} is the n \times n matrix with (i,j)-th entry \partial^2 f / \partial x_i \partial x_j, representing second-order partial derivatives. The vectorization operator, denoted vec, transforms a matrix into a column vector by stacking its columns; for an m \times n matrix A, vec(A) yields an mn \times 1 vector, which is essential for reformulating matrix derivatives in vector form. Dimension conventions specify the input and output spaces of functions to clarify derivative structures, such as f: \mathbb{R}^{m \times n} \to \mathbb{R}^p, indicating that f maps m \times n matrices to p-dimensional vectors. Auxiliary operators include the trace, denoted tr(A), which sums the diagonal elements of a square matrix A. The Frobenius inner product for compatible matrices A and B is defined as \langle A, B \rangle = \operatorname{tr}(A^T B), providing a scalar measure analogous to the dot product for vectors. As an illustrative example, for a scalar-valued f(X) where X is an m \times n matrix, notation often involves derivatives with respect to X's entries or vec(X), ensuring compatibility with the function's domain \mathbb{R}^{m \times n}. These symbols form the foundational terminology, with layout arrangements of derivatives addressed separately.
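
The following short NumPy sketch (illustrative only) demonstrates three of the symbols defined above: the column-stacking vec operator, the trace, and the Frobenius inner product \langle A, B \rangle = \operatorname{tr}(A^T B).

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])

vec_A = A.reshape(-1, order="F")     # column-major stacking: [1, 3, 2, 4]
frob_inner = np.trace(A.T @ B)       # Frobenius inner product <A, B>
print(vec_A)
print(frob_inner, np.sum(A * B))     # tr(A^T B) equals the elementwise sum of A * B
```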

Layout Conventions

In matrix calculus, layout conventions specify how the partial derivatives are arranged within the resulting derivative matrices, particularly for Jacobians, to eliminate ambiguities arising from the multi-dimensional nature of vectors and matrices. The two dominant conventions are the numerator layout (NL) and the denominator layout (DL), which differ primarily in the orientation of rows and columns relative to the input and output variables. The numerator layout arranges derivatives as though they form the numerator of a fractional representation \frac{\partial \mathbf{y}}{\partial \mathbf{x}}, with rows indexed by the components of the output \mathbf{y} and columns by those of the input \mathbf{x}; this results in a Jacobian matrix whose dimensions match the outer dimensions of \mathbf{y} by \mathbf{x}. In contrast, the denominator layout arranges derivatives according to the denominator, with rows indexed by \mathbf{x} components and columns by \mathbf{y} components, yielding a Jacobian with dimensions matching \mathbf{x} by \mathbf{y}. These conventions ensure consistent application of rules like the chain rule but require care in interpretation across fields. The numerator layout is prevalent in engineering and control contexts, where the arrangement of \partial \mathbf{y}/\partial \mathbf{x} naturally aligns rows with output variations, facilitating intuitive interpretation without additional transpositions. Conversely, the denominator layout is standard in statistics and econometrics, as seen in applications involving vectorization and Kronecker-product structures. A key advantage of the numerator layout is that it preserves the direct order in the chain rule, \partial \mathbf{y}/\partial \mathbf{z} = (\partial \mathbf{y}/\partial \mathbf{x}) (\partial \mathbf{x}/\partial \mathbf{z}), mirroring scalar composition. The denominator layout, however, integrates naturally with the vectorization operator \operatorname{vec}(\cdot) and Kronecker products, enabling compact expressions for linear transformations in statistical derivations without extraneous transposes. A disadvantage of the denominator layout is the occasional need for transpositions when interfacing with engineering-style computations, while the numerator layout may complicate vectorized statistical formulas. To convert between layouts, the derivative matrix in one convention is simply the transpose of the other, ensuring consistency in the underlying partial derivatives. This transposition rule applies universally, as the layouts reorder the same set of scalar partials. The denominator layout gained prominence through the framework established by Magnus and Neudecker in their foundational text on matrix differential calculus, which emphasized its compatibility with econometric models and vectorization techniques. As an illustrative example, consider the linear mapping \mathbf{y} = A \mathbf{x}, where \mathbf{y} \in \mathbb{R}^m is a column vector, \mathbf{x} \in \mathbb{R}^n is a column vector, and A \in \mathbb{R}^{m \times n} is a constant matrix. The i-th component is y_i = \sum_{k=1}^n A_{i k} x_k. The partial derivative \partial y_i / \partial x_j = A_{i j} for each i=1,\dots,m and j=1,\dots,n. In the numerator layout, the Jacobian \partial \mathbf{y}/\partial \mathbf{x} is the m \times n matrix with (i,j)-th entry \partial y_i / \partial x_j = A_{i j}, yielding \partial \mathbf{y}/\partial \mathbf{x} = A. This arrangement stacks the row-wise gradients of each y_i with respect to \mathbf{x}.
In the denominator layout, the derivative \partial \mathbf{y}/\partial \mathbf{x} is the n \times m matrix with (j,i)-th entry \partial y_i / \partial x_j = A_{i j}, resulting in \partial \mathbf{y}/\partial \mathbf{x} = A^\top. This column-wise arrangement reflects the input indexing first. For vectorized forms using the chain rule with Kronecker products, the Magnus–Neudecker convention writes \partial \operatorname{vec}(\mathbf{y}) / \partial \operatorname{vec}(\mathbf{x})^\top = A directly (matching the numerator-layout Jacobian), while the denominator layout stores its transpose A^\top; in generalized matrix cases like Y = A X with X \in \mathbb{R}^{p \times n}, the vectorized Jacobian is I_n \otimes A, preserving the standard Kronecker structure \operatorname{vec}(Y) = (I_n \otimes A) \operatorname{vec}(X).
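
A brief NumPy sketch (illustrative; the dimensions are arbitrary) makes the layout distinction and the Kronecker-product identity concrete: the numerically computed Jacobian of \mathbf{y} = A\mathbf{x} equals A in numerator layout (its transpose in denominator layout), and \operatorname{vec}(AX) = (I \otimes A)\operatorname{vec}(X).

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 2))    # linear map R^2 -> R^3
x = rng.standard_normal(2)

# Numerical Jacobian dy_i/dx_j via finite differences (numerator layout: rows = outputs).
eps = 1e-6
J = np.zeros((3, 2))
for j in range(2):
    dx = np.zeros(2); dx[j] = eps
    J[:, j] = (A @ (x + dx) - A @ x) / eps
print(np.allclose(J, A, atol=1e-5))   # numerator layout gives A; denominator layout is J.T = A^T

# Vectorized form for a matrix argument: Y = A X with X of shape (2, 4).
X = rng.standard_normal((2, 4))
vec = lambda M: M.reshape(-1, order="F")
print(np.allclose(vec(A @ X), np.kron(np.eye(4), A) @ vec(X)))  # vec(AX) = (I_4 kron A) vec(X)
```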

Alternative Notational Systems

Component-wise notation in matrix calculus employs explicit indices to denote partial derivatives, such as \frac{\partial f_{ij}}{\partial x_{kl}}, which facilitates direct computation of individual elements in matrix-valued functions. This approach treats matrices as collections of scalars, allowing derivatives to be calculated entry by entry, often using the chain rule in indexed form for clarity in proofs and implementations. For instance, the Jacobian matrix J of a vector-valued function \mathbf{f}(\mathbf{x}) can be expressed component-wise as J_{ij} = \frac{\partial f_i}{\partial x_j}, enabling straightforward verification of properties like rank and invertibility. Tensor notation extends this framework by incorporating the Einstein summation convention, where repeated indices imply summation, particularly useful for higher-order derivatives in multidimensional arrays beyond standard matrices. In applications like extensions to general relativity, this notation represents matrix derivatives as tensor contractions, such as the partial derivative tensor \partial_k f_{ij} summed over k for covariant expressions. It generalizes matrix operations to multilinear forms, avoiding explicit summation symbols for conciseness in complex identities involving curvature or stress-energy tensors. Software-specific notations adapt these concepts for computational environments, prioritizing automatic differentiation over manual indexing. In MATLAB, the Symbolic Math Toolbox uses diff for element-wise or matrix derivatives, such as diff(F, X) yielding a symbolic Jacobian for matrix functions F(X). PyTorch's autograd system employs tensor attributes like .grad to compute gradients implicitly, representing matrix derivatives through backward passes without explicit index notation, as in torch.autograd.grad(outputs, inputs). These tools leverage vec operators and Kronecker products internally for efficiency. Component-wise notation excels in pedagogical settings for its transparency in deriving rules like the product rule but becomes verbose for large matrices, potentially obscuring structural insights. Tensor notation offers generality for high-dimensional problems, such as in physics simulations, yet introduces complexity in tracking index positions and covariant versus contravariant components, risking errors without rigorous bookkeeping. Software notations streamline gradient computation in optimization tasks, reducing manual-derivation errors in applications, though they abstract away explicit derivative forms, hindering theoretical analysis compared to indexed methods.
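
As a hedged illustration of the software-style notation (assuming PyTorch is installed; the function chosen here is an arbitrary example), the following snippet computes a matrix gradient with torch.autograd.grad and compares it with the analytic result \partial \operatorname{tr}(AX)/\partial X = A^T.

```python
import torch

torch.manual_seed(0)
A = torch.randn(3, 3)
X = torch.randn(3, 3, requires_grad=True)

f = torch.trace(A @ X)                   # scalar function of a matrix argument
(grad_X,) = torch.autograd.grad(f, X)    # reverse-mode gradient, no explicit indices
print(torch.allclose(grad_X, A.T))       # matches the analytic gradient A^T
```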

Derivatives Involving Vectors

Vector-by-Scalar Derivatives

The derivative of a vector-valued function \mathbf{y} = f(t), where \mathbf{y} is an n \times 1 column vector and t is a scalar variable, with respect to t is defined as the n \times 1 vector whose entries are the derivatives of the components of \mathbf{y} with respect to t. This derivative, denoted \frac{d\mathbf{y}}{dt}, is computed component-wise, such that the i-th entry is \left( \frac{d\mathbf{y}}{dt} \right)_i = \frac{d y_i}{dt}. The basic rules for such vector-by-scalar derivatives follow from the linearity of differentiation: for a constant scalar a, \frac{d (a \mathbf{y})}{dt} = a \frac{d \mathbf{y}}{dt}, and for the sum of two vector functions \mathbf{y} and \mathbf{z}, \frac{d (\mathbf{y} + \mathbf{z})}{dt} = \frac{d \mathbf{y}}{dt} + \frac{d \mathbf{z}}{dt}. Additionally, the product rule applies to the product of a scalar function s(t) and a vector function \mathbf{u}(t), yielding \frac{d (s \mathbf{u})}{dt} = \frac{ds}{dt} \mathbf{u} + s \frac{d \mathbf{u}}{dt}. For example, consider \mathbf{y}(t) = t \mathbf{u} where \mathbf{u} is a constant vector; applying the product rule gives \frac{d\mathbf{y}}{dt} = \mathbf{u}. These derivatives are essential in modeling time-dependent phenomena, such as computing velocity as the time derivative of a position vector in classical dynamics.
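
The product rule above can be checked numerically; the following NumPy sketch (illustrative only, with arbitrarily chosen s and \mathbf{u}) compares a central finite difference against the analytic derivative s'\mathbf{u} + s\mathbf{u}'.

```python
import numpy as np

s = lambda t: np.sin(t)
u = lambda t: np.array([t, t**2, np.exp(t)])
y = lambda t: s(t) * u(t)

t0, eps = 0.7, 1e-6
numeric = (y(t0 + eps) - y(t0 - eps)) / (2 * eps)           # central difference
analytic = np.cos(t0) * u(t0) + s(t0) * np.array([1.0, 2 * t0, np.exp(t0)])
print(np.allclose(numeric, analytic, atol=1e-6))
```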

Scalar-by-Vector Derivatives

In matrix calculus, the scalar-by-vector derivative refers to the gradient of a scalar-valued function f: \mathbb{R}^n \to \mathbb{R} with respect to an n \times 1 column vector input \mathbf{x}. This gradient is defined as the column vector \nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}, where each component is the partial derivative of f with respect to the corresponding element of \mathbf{x}. In machine learning and optimization contexts, this column convention aligns with the denominator layout, where differentials satisfy df = \nabla f(\mathbf{x})^T d\mathbf{x}, ensuring consistency with the scalar nature of df. The operation exhibits linearity as a key property: for scalar constants a and b, and scalar functions f and g, it holds that \nabla (a f(\mathbf{x}) + b g(\mathbf{x})) = a \nabla f(\mathbf{x}) + b \nabla g(\mathbf{x}). This follows directly from the linearity of partial differentiation and extends the familiar rules from scalar calculus to vector inputs. Such properties facilitate the analysis of composite functions in optimization algorithms, where gradients guide search directions. A representative example is the quadratic form f(\mathbf{x}) = \mathbf{x}^T A \mathbf{x}, where A is an n \times n matrix independent of \mathbf{x}. To derive the gradient, expand f(\mathbf{x}) = \sum_{i=1}^n \sum_{j=1}^n x_i A_{ij} x_j. The partial derivative with respect to x_k is then \frac{\partial f}{\partial x_k} = \sum_{j=1}^n A_{kj} x_j + \sum_{i=1}^n x_i A_{ik}, which is the k-th component of (A \mathbf{x} + A^T \mathbf{x}). Thus, \nabla f(\mathbf{x}) = (A + A^T) \mathbf{x}. If A is symmetric (A = A^T), this reduces to \nabla f(\mathbf{x}) = 2 A \mathbf{x}, a form commonly encountered in least-squares and quadratic optimization problems. The directional derivative provides insight into the rate of change along specific directions and is given by the dot product \nabla f(\mathbf{x}) \cdot \mathbf{u} for a unit vector \mathbf{u} with \| \mathbf{u} \| = 1. This measures the instantaneous change in f at \mathbf{x} when moving in direction \mathbf{u}, generalizing the one-dimensional derivative to multivariable settings. In optimization, the maximum rate of increase occurs when \mathbf{u} aligns with \nabla f(\mathbf{x}), pointing toward the steepest ascent.
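
The quadratic-form gradient can be verified by finite differences, as in the following NumPy sketch (an illustration, not a prescribed method), which uses a deliberately non-symmetric A so that the general formula (A + A^T)\mathbf{x} is exercised.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))    # deliberately non-symmetric
x = rng.standard_normal(4)
f = lambda v: v @ A @ v            # quadratic form x^T A x

eps = 1e-6
grad_numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                         for e in np.eye(4)])
print(np.allclose(grad_numeric, (A + A.T) @ x, atol=1e-6))  # matches (A + A^T) x
```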

Vector-by-Vector Derivatives

In matrix calculus, the vector-by-vector derivative refers to the Jacobian matrix associated with a vector-valued function \mathbf{y} = f(\mathbf{x}), where \mathbf{y} and \mathbf{x} are column vectors of dimensions m \times 1 and n \times 1, respectively. The Jacobian matrix \mathbf{J} = \frac{\partial \mathbf{y}}{\partial \mathbf{x}} is an m \times n matrix that captures the first-order partial derivatives of the output components with respect to the input components. The entries of the Jacobian are defined as J_{ij} = \frac{\partial y_i}{\partial x_j} for i = 1, \dots, m and j = 1, \dots, n. This arrangement linearizes the function locally around a point \mathbf{x}, approximating f(\mathbf{x} + \Delta \mathbf{x}) \approx f(\mathbf{x}) + \mathbf{J} \Delta \mathbf{x}. If m = n, the Jacobian is a square matrix; otherwise, it is rectangular, reflecting the mapping from \mathbb{R}^n to \mathbb{R}^m. A key property of the Jacobian is its behavior under composition: for \mathbf{z} = g(\mathbf{y}) and \mathbf{y} = f(\mathbf{x}), the chain rule yields \mathbf{J}_{g \circ f}(\mathbf{x}) = \mathbf{J}_g(f(\mathbf{x})) \, \mathbf{J}_f(\mathbf{x}). This multiplicative structure facilitates the propagation of derivatives in multivariable settings. For a linear function \mathbf{y} = \mathbf{A} \mathbf{x}, where \mathbf{A} is an m \times n constant matrix, the Jacobian is simply \mathbf{J} = \mathbf{A}. In the nonlinear case, consider the component-wise application y_i = \sin(x_i) for i = 1, \dots, n (so m = n); the Jacobian is then a diagonal matrix with entries J_{ii} = \cos(x_i). The case where the output is scalar (m = 1) reduces to the gradient from scalar-by-vector derivatives, arranged as a 1 \times n row vector.
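
The following NumPy sketch (illustrative) checks the chain rule for Jacobians on a composite of the two examples above, a linear map applied after an elementwise sine, so that \mathbf{J}_{g \circ f} = B \,\mathrm{diag}(\cos(\mathbf{x})).

```python
import numpy as np

rng = np.random.default_rng(4)
B = rng.standard_normal((2, 3))
x = rng.standard_normal(3)
h = lambda v: B @ np.sin(v)                 # composite function R^3 -> R^2

eps = 1e-6
J_numeric = np.column_stack([(h(x + eps * e) - h(x - eps * e)) / (2 * eps)
                             for e in np.eye(3)])
J_analytic = B @ np.diag(np.cos(x))         # chain rule: J_g(f(x)) J_f(x)
print(np.allclose(J_numeric, J_analytic, atol=1e-6))
```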

Derivatives Involving Matrices

Matrix-by-Scalar Derivatives

In matrix calculus, the derivative of a matrix-valued function Y = F(t) \in \mathbb{R}^{m \times n}, where t is a scalar variable, is defined as the m \times n matrix whose entries are the derivatives of the components of Y with respect to t. This extends the concept from scalar and vector-by-scalar derivatives, with the output structure preserved in the result. Component-wise, if Y = [Y_{ij}], then the derivative is the matrix \frac{dY}{dt} = \left[ \frac{d Y_{ij}}{dt} \right], computed element by element as in univariate calculus. Basic differentiation rules apply analogously, treating the matrix as a collection of scalars. For instance, if A and B are constant matrices of compatible dimensions, the product rule yields \frac{d}{dt} (A Y B) = A \left( \frac{dY}{dt} \right) B, since the derivatives of constants vanish. Similarly, linearity holds: \frac{d}{dt} (Y + Z) = \frac{dY}{dt} + \frac{dZ}{dt} and \frac{d}{dt} (c Y) = c \frac{dY}{dt} for a scalar constant c. A simple example is Y(t) = t I_m, the scalar multiple of the m \times m identity matrix, whose derivative is \frac{dY}{dt} = I_m, as each diagonal entry differentiates to 1 and off-diagonals to 0. In vectorized form, if \mathrm{vec}(Y) stacks the columns of Y into a vector, then \frac{d}{dt} \mathrm{vec}(Y) = \mathrm{vec}\left( \frac{dY}{dt} \right), since the scalar input imposes no additional Kronecker structure beyond the identity mapping. This formulation is useful for computational implementations but simplifies to the direct matrix derivative in practice.
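
A short NumPy check (illustrative only; A, B, and Y(t) are arbitrary) of the constant-sandwich rule \frac{d}{dt}(A Y(t) B) = A \frac{dY}{dt} B:

```python
import numpy as np

A = np.array([[1.0, 2.0], [0.0, 1.0]])
B = np.array([[0.5, 0.0], [1.0, 2.0]])
Y = lambda t: np.array([[t, t**2], [np.sin(t), 1.0]])
dY = lambda t: np.array([[1.0, 2 * t], [np.cos(t), 0.0]])   # element-wise derivative

t0, eps = 1.3, 1e-6
numeric = (A @ Y(t0 + eps) @ B - A @ Y(t0 - eps) @ B) / (2 * eps)
print(np.allclose(numeric, A @ dY(t0) @ B, atol=1e-6))
```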

Scalar-by-Matrix Derivatives

In matrix calculus, the scalar-by-matrix derivative arises when differentiating a scalar-valued function f(\mathbf{X}) with respect to an m \times n matrix variable \mathbf{X}. The result is an m \times n matrix \frac{\partial f}{\partial \mathbf{X}}, where each entry is given by \left( \frac{\partial f}{\partial \mathbf{X}} \right)_{ij} = \frac{\partial f}{\partial X_{ij}}, the partial derivative of f with respect to the scalar entry X_{ij}. This structure ensures that the derivative has the same dimensions as \mathbf{X}, facilitating its use in gradient-based optimization and multivariable analysis. The properties of this derivative follow directly from the component-wise partials, making it analogous to the gradient in vector calculus but extended to matrix arguments. For instance, the derivative is linear in the sense that if f(\mathbf{X}) = g(\mathbf{X}) + h(\mathbf{X}), then \frac{\partial f}{\partial \mathbf{X}} = \frac{\partial g}{\partial \mathbf{X}} + \frac{\partial h}{\partial \mathbf{X}}, and similarly for scalar multiples. A simple example is the trace function, where f(\mathbf{X}) = \operatorname{tr}(\mathbf{X}). Here, \frac{\partial f}{\partial X_{ij}} = 1 if i = j and 0 otherwise, yielding \frac{\partial f}{\partial \mathbf{X}} = \mathbf{I}_m for a square m \times m matrix \mathbf{X} (that is, n = m). A more involved case is the bilinear form f(\mathbf{X}) = \mathbf{a}^T \mathbf{X} \mathbf{b}, where \mathbf{a} is m \times 1 and \mathbf{b} is n \times 1. To derive \frac{\partial f}{\partial \mathbf{X}}, expand f as f = \sum_{i=1}^m \sum_{j=1}^n a_i X_{ij} b_j. Differentiating with respect to X_{kl} gives \frac{\partial f}{\partial X_{kl}} = a_k b_l, since only the term with i = k and j = l contributes. Thus, the full derivative matrix has entries \left( \frac{\partial f}{\partial \mathbf{X}} \right)_{kl} = a_k b_l, which assembles into the outer product \frac{\partial f}{\partial \mathbf{X}} = \mathbf{a} \mathbf{b}^T. This result is foundational in applications like least squares problems and covariance matrix gradients. Regarding layout conventions, the scalar-by-matrix derivative can be vectorized into a column vector \operatorname{vec}\left( \frac{\partial f}{\partial \mathbf{X}} \right) for compatibility with vectorized optimization algorithms, where the vectorization follows either row-major (stacking rows) or column-major (stacking columns) order depending on the notational system employed. The Frobenius inner product provides a compact way to express differentials as df = \left\langle \frac{\partial f}{\partial \mathbf{X}}, d\mathbf{X} \right\rangle_F.
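
The bilinear-form result can be confirmed numerically; the sketch below (illustrative only) compares entrywise finite differences of \mathbf{a}^T \mathbf{X} \mathbf{b} with the outer product \mathbf{a}\mathbf{b}^T.

```python
import numpy as np

rng = np.random.default_rng(5)
a, b = rng.standard_normal(3), rng.standard_normal(2)
X = rng.standard_normal((3, 2))
f = lambda M: a @ M @ b                       # bilinear form a^T X b

eps = 1e-6
grad_numeric = np.zeros_like(X)
for i in range(3):
    for j in range(2):
        E = np.zeros_like(X); E[i, j] = eps
        grad_numeric[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
print(np.allclose(grad_numeric, np.outer(a, b), atol=1e-6))  # matches a b^T
```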

Matrix-by-Matrix Derivatives

In matrix calculus, the derivative of a matrix-valued function Y = F(X), where Y is an m \times p matrix and X is a q \times r matrix, is generally a fourth-order tensor \frac{\partial Y}{\partial X} of dimensions m \times p \times q \times r, capturing the partial derivatives \frac{\partial Y_{ij}}{\partial X_{kl}} for all indices i,j,k,l. To handle this complexity in practice, the tensor is often represented in vectorized form, where the Jacobian J = \frac{d \operatorname{vec}(Y)}{d \operatorname{vec}(X)^T} is an (mp) \times (qr) matrix, leveraging the vec operator to linearize the structure. This vectorization trick simplifies computations by transforming the matrix derivative into a standard Jacobian for vector arguments. For linear functions of the form Y = A X B, where A is m \times q and B is r \times p, the vectorized Jacobian is straightforward: \frac{\partial \operatorname{vec}(Y)}{\partial \operatorname{vec}(X)^T} = B^T \otimes A, derived from the identity \operatorname{vec}(A X B) = (B^T \otimes A) \operatorname{vec}(X). This structure preserves the linearity and facilitates efficient evaluation in applications like optimization. A common nonlinear example is Y = X^2 = X X, assuming X is n \times n. The differential is dY = dX \, X + X \, dX, and vectorizing yields \operatorname{vec}(dY) = (X^T \otimes I_n + I_n \otimes X) \operatorname{vec}(dX), so the Jacobian is \frac{\partial \operatorname{vec}(Y)}{\partial \operatorname{vec}(X)^T} = X^T \otimes I_n + I_n \otimes X. For more general nonlinear cases, such as the matrix exponential Y = \exp(X), the derivative lacks a simple closed form unless dX commutes with X; otherwise, it involves the integral representation d \exp(X) = \int_0^1 \exp((1-s)X) \, dX \, \exp(s X) \, ds, which can be vectorized using Kronecker products for the full Jacobian. The high dimensionality of the fourth-order tensor poses significant computational challenges, as storing or manipulating an O(m p q r) object becomes prohibitive for large matrices; instead, differentials dY = \frac{\partial Y}{\partial X} : dX are preferred, allowing component-wise operations without explicit tensor construction. This approach, emphasized in foundational treatments, avoids the need for full Jacobians in applications while maintaining rigor.
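
The vectorized Jacobian of Y = X^2 can be verified against finite differences of \operatorname{vec}(Y), as in the following NumPy sketch (illustrative only).

```python
import numpy as np

rng = np.random.default_rng(6)
n = 3
X = rng.standard_normal((n, n))
vec = lambda M: M.reshape(-1, order="F")     # column-stacking vec operator
F = lambda M: M @ M                          # Y = X^2

eps = 1e-6
J_numeric = np.zeros((n * n, n * n))
for k in range(n * n):
    E = np.zeros(n * n); E[k] = eps
    dX = E.reshape((n, n), order="F")
    J_numeric[:, k] = vec(F(X + dX) - F(X - dX)) / (2 * eps)

J_analytic = np.kron(X.T, np.eye(n)) + np.kron(np.eye(n), X)   # X^T kron I + I kron X
print(np.allclose(J_numeric, J_analytic, atol=1e-5))
```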

Key Identities and Rules

Vector-Specific Identities

Vector-specific identities in matrix calculus encompass differentiation rules tailored to operations on vectors, such as gradients and Jacobians of scalar- or vector-valued functions. These identities extend single-variable rules like the product and chain rules to higher dimensions, where vectors are treated as column or row arrays, and derivatives form matrices or vectors accordingly. They are essential for applications in optimization algorithms, where gradients guide iterative updates, and in deriving governing equations for physical systems involving vector fields. A fundamental identity is the product rule for the gradient of the product of two scalar functions f and g, both depending on a vector variable \mathbf{x}: \nabla (f g) = f \nabla g + g \nabla f. This holds because the partial derivative with respect to each component x_k satisfies \frac{\partial (f g)}{\partial x_k} = f \frac{\partial g}{\partial x_k} + g \frac{\partial f}{\partial x_k}, mirroring the scalar case but applied component-wise to form the gradient vector. To derive it, consider the definition of the gradient as the vector of partials; the scalar product rule then applies directly to each component. Another key identity is the chain rule for the Jacobian of a composition of vector functions. If \mathbf{y} = \mathbf{f}(\mathbf{z}) where \mathbf{z} = \mathbf{g}(\mathbf{x}), with \mathbf{f}: \mathbb{R}^m \to \mathbb{R}^p and \mathbf{g}: \mathbb{R}^n \to \mathbb{R}^m, then the Jacobian matrix of the composite function is J_{\mathbf{f}(\mathbf{g}(\mathbf{x}))} = J_{\mathbf{f}}(\mathbf{g}(\mathbf{x})) \, J_{\mathbf{g}}(\mathbf{x}), where J_{\mathbf{h}} denotes the Jacobian matrix whose (i,j)-entry is \frac{\partial h_i}{\partial x_j}. This matrix multiplication arises because the total differential d\mathbf{y} = J_{\mathbf{f}} \, d\mathbf{z} and d\mathbf{z} = J_{\mathbf{g}} \, d\mathbf{x}, composing linearly. The proof follows from applying the scalar chain rule to each entry of the Jacobian, confirming the product structure. An illustrative example is the gradient of the scalar function \mathbf{x}^T \mathbf{x}, where \mathbf{x} \in \mathbb{R}^n. This function is f(\mathbf{x}) = \sum_{i=1}^n x_i^2. The partial derivative with respect to x_k is \frac{\partial f}{\partial x_k} = 2 x_k, since only the k-th term in the sum depends on x_k, while all other terms are constant with respect to x_k. Thus, collecting all partials yields the vector \nabla f(\mathbf{x}) = 2\mathbf{x}. For a full proof, expand in coordinates: f(\mathbf{x}) = \mathbf{x}^T I \mathbf{x} where I is the identity matrix, and using the general rule \nabla (\mathbf{x}^T A \mathbf{x}) = (A + A^T) \mathbf{x}, with A = I symmetric, gives 2\mathbf{x}. This identity is pivotal in least-squares optimization, as it shows the steepest descent direction for squared errors. For operations involving the vector cross product, consider two vector fields \mathbf{a} and \mathbf{b} in \mathbb{R}^3. The curl of their cross product satisfies the identity \nabla \times (\mathbf{a} \times \mathbf{b}) = \mathbf{a} (\nabla \cdot \mathbf{b}) - \mathbf{b} (\nabla \cdot \mathbf{a}) + (\mathbf{b} \cdot \nabla) \mathbf{a} - (\mathbf{a} \cdot \nabla) \mathbf{b}, which can be read in matrix form by interpreting (\mathbf{b} \cdot \nabla) \mathbf{a} as the Jacobian of \mathbf{a} applied to \mathbf{b}. This expression expands the derivative using the product rule on components via the Levi-Civita symbol, isolating divergence and directional-derivative terms. In matrix calculus, it can be stated through the Jacobians of the two fields, making it useful in fluid dynamics and electromagnetism derivations without full tensor notation. The form derives from applying the product rule component-wise, confirming that only first-order derivative terms appear in this expansion.
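
The cross-product identity can be checked symbolically; the following SymPy sketch (illustrative, assuming the sympy.vector module and two arbitrarily chosen smooth fields) confirms that both sides agree component by component.

```python
import sympy as sp
from sympy.vector import CoordSys3D, curl, divergence, Vector

N = CoordSys3D('N')
x, y, z = N.x, N.y, N.z

# Two concrete smooth vector fields chosen only for illustration.
a = x*y*N.i + z**2*N.j + sp.sin(x)*N.k
b = y*N.i + x*z*N.j + sp.exp(y)*N.k

def directional(u, v):
    """(u . nabla) v, computed component-wise."""
    result = Vector.zero
    for e in (N.i, N.j, N.k):
        vi = v.dot(e)
        result += (u.dot(N.i) * sp.diff(vi, x)
                   + u.dot(N.j) * sp.diff(vi, y)
                   + u.dot(N.k) * sp.diff(vi, z)) * e
    return result

lhs = curl(a.cross(b))
rhs = a*divergence(b) - b*divergence(a) + directional(b, a) - directional(a, b)
diff = lhs - rhs
print(all(sp.simplify(diff.dot(e)) == 0 for e in (N.i, N.j, N.k)))  # True
```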

Matrix-Specific Identities

Matrix-specific identities in matrix calculus extend the rules for differentiation to operations that are inherently matrix-oriented, such as the trace, determinant, products, and inverses, which arise frequently in applications like optimization and statistics. These identities often leverage properties like the cyclicity of the trace and vectorization to simplify computations, providing closed-form expressions for gradients that avoid explicit summation over components. Unlike vector-specific rules, these emphasize matrix structure, including transposes and Kronecker products, to maintain efficiency in higher dimensions. A fundamental identity involves the derivative of the trace of a matrix. For a square matrix X, the partial derivative of \operatorname{tr}(X) with respect to X is the identity matrix: \frac{\partial \operatorname{tr}(X)}{\partial X} = I. This follows from the linearity of the trace, as it sums the diagonal elements, and differentiation term-by-term yields the identity. More generally, for the trace of a bilinear form \operatorname{tr}(A X B), where A and B are constant matrices of compatible dimensions, the derivative with respect to X is \frac{\partial \operatorname{tr}(A X B)}{\partial X} = A^T B^T. This result exploits the cyclic property \operatorname{tr}(A X B) = \operatorname{tr}(B A X) and the Frobenius inner product interpretation of the trace. A special case occurs when A = I and B = Y, yielding \frac{\partial \operatorname{tr}(X Y)}{\partial X} = Y^T, which is useful for differentiating bilinear forms in matrix variables. The derivative of the logarithm of the determinant is another key identity, particularly in contexts involving covariance matrices or maximum likelihood estimation. For a positive definite matrix X, \frac{\partial \log \det(X)}{\partial X} = (X^{-1})^T. This can be derived using the Jacobi formula for the differential of the determinant, d \det(X) = \det(X) \operatorname{tr}(X^{-1} dX), and applying the chain rule to the logarithm, resulting in the transpose of the inverse (which equals X^{-1} itself when X is symmetric). For matrix products, the derivative of A X B with respect to X, where A and B are constants, is expressed in vectorized form to capture the linear transformation. Vectorizing the output gives \frac{\partial \operatorname{vec}(A X B)}{\partial \operatorname{vec}(X)^T} = B^T \otimes A, where \otimes denotes the Kronecker product. This identity arises from the vectorization property \operatorname{vec}(A X B) = (B^T \otimes A) \operatorname{vec}(X), differentiating directly with respect to the vectorized input. It provides a compact representation of the Jacobian for linear matrix-valued functions. Finally, the derivative of the matrix inverse with respect to its elements accounts for the nonlinear dependence. For an invertible matrix X, the partial derivative of X^{-1} with respect to the (k, l)-th entry X_{kl} is \frac{\partial X^{-1}}{\partial X_{kl}} = -X^{-1} e_k e_l^T X^{-1}, where e_k and e_l are standard basis vectors. This is obtained by differentiating the identity X X^{-1} = I elementwise, solving for the perturbation in the inverse, and recognizing the outer product form of the rank-one update.
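
The trace identities can be checked numerically, as in the following NumPy sketch (illustrative only), which compares entrywise finite differences with A^T B^T and Y^T in the denominator layout.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((2, 3))   # A: 2x3, X: 3x4, B: 4x2, so A X B is square
X = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))
Y = rng.standard_normal((4, 3))

def num_grad(f, X, eps=1e-6):
    """Entrywise central-difference gradient of a scalar function of a matrix."""
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X); E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

print(np.allclose(num_grad(lambda M: np.trace(A @ M @ B), X), A.T @ B.T, atol=1e-5))
print(np.allclose(num_grad(lambda M: np.trace(M @ Y), X), Y.T, atol=1e-5))
```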

Chain Rule and Product Rule Applications

In matrix calculus, the chain rule extends the scalar case to composite functions involving matrices, vectors, or scalars, accounting for the non-commutative nature of matrix multiplication. For a scalar-valued f(g(X)) where g: \mathbb{R}^{m \times n} \to \mathbb{R}^{p \times q} and f: \mathbb{R}^{p \times q} \to \mathbb{R}, the derivative with respect to X is obtained via vectorization: \frac{\partial f}{\partial \mathrm{vec}(X)} = \frac{\partial f}{\partial \mathrm{vec}(g)} \frac{\partial \mathrm{vec}(g)}{\partial \mathrm{vec}(X)}, where \mathrm{vec}(\cdot) stacks the columns of a matrix into a vector. This formulation leverages the Jacobian matrices of the inner and outer functions, ensuring compatibility in dimensions. For matrix-valued composites H(X) = G(F(X)), the differential is dH(X) = dG(F(X)) \cdot dF(X), with the dot denoting composition of the corresponding linear maps (together with the appropriate identification map) to preserve structure. The product rule in matrix calculus mirrors the scalar version but requires careful attention to the order of terms due to non-commutativity. For the product of two compatible matrices Y = AB, where A and B may depend on a scalar, vector, or matrix X, the differential is dY = (dA)B + A(dB). In terms of derivatives, if A = A(X) and B is constant, then \frac{\partial \mathrm{vec}(Y)}{\partial \mathrm{vec}(X)} = (B^\top \otimes I) \frac{\partial \mathrm{vec}(A)}{\partial \mathrm{vec}(X)}, using the Kronecker product to linearize the operation. A representative example arises in least-squares optimization, where the objective function is f(X) = \|AX - B\|_F^2 = \operatorname{tr}\big((AX - B)^\top (AX - B)\big); applying the chain rule to the inner residual yields the gradient \frac{\partial f}{\partial X} = 2A^\top (AX - B), facilitating iterative updates. An adaptation of the quotient rule for matrices handles divisions in structured forms, such as pseudo-inverses or ratios of matrix functions. For Y = A(X) B(X)^{-1}, the differential is dY = (dA) B^{-1} - A B^{-1} (dB) B^{-1}, derived by applying the product rule to Y = A C with C = B^{-1} and dC = -B^{-1} (dB) B^{-1}. This is particularly useful in updating estimates in Kalman filters, where matrix inverses appear in recursive computations. Common pitfalls in applying these rules stem from matrix non-commutativity, such as incorrectly reversing the order of factors in Jacobian products, which can lead to dimension mismatches or sign errors; for instance, writing \frac{\partial f}{\partial X} = \frac{\partial f}{\partial Z} \frac{\partial Z}{\partial X} assumes a consistent layout convention (numerator or denominator) that must be verified. Additionally, overlooking the need for vectorization in composite derivatives often results in ill-defined tensor products, emphasizing the importance of consistent identification between differentials and Jacobians.
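
The least-squares example can be verified directly; the NumPy sketch below (illustrative) compares the finite-difference gradient of \|AX - B\|_F^2 with the chain-rule result 2A^\top(AX - B).

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((5, 3))
B = rng.standard_normal((5, 2))
X = rng.standard_normal((3, 2))
f = lambda M: np.linalg.norm(A @ M - B, "fro") ** 2

eps = 1e-6
grad_numeric = np.zeros_like(X)
for idx in np.ndindex(*X.shape):
    E = np.zeros_like(X); E[idx] = eps
    grad_numeric[idx] = (f(X + E) - f(X - E)) / (2 * eps)
print(np.allclose(grad_numeric, 2 * A.T @ (A @ X - B), atol=1e-5))
```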

Differential Forms and Advanced Techniques

Matrix Differentials

Matrix differentials provide an alternative framework to explicit partial derivative computations in matrix calculus, particularly beneficial for circumventing the complexity of higher-order tensors that arise in direct differentiation of matrix-valued functions. This approach leverages the concept of infinitesimal changes, akin to scalar differentials, but adapted to the matrix domain using appropriate inner products. The differential of a scalar function f with respect to a vector \mathbf{x} is defined as df = \sum_i \frac{\partial f}{\partial x_i} dx_i, representing the linear approximation to the change in f. For a matrix-valued function Y(X), where X and Y are matrices, this extends to dY = \frac{\partial Y}{\partial X} : dX, with the colon denoting the double contraction (dY)_{ij} = \sum_{k,l} \frac{\partial Y_{ij}}{\partial X_{kl}} \, dX_{kl}. This notation encapsulates the first-order variation in Y due to a perturbation dX. A key advantage of matrix differentials is that they mirror algebraic operations on matrices, which simplifies rule derivations. For instance, the product rule holds as d(XY) = dX \, Y + X \, dY, allowing straightforward handling of composite expressions without tensor unraveling. This preservation of structure facilitates proofs and computations in applications like statistics and optimization. Illustrative examples highlight the utility of this framework. The differential of the trace function is d \operatorname{tr}(X) = \operatorname{tr}(dX), reflecting the trace's linearity. For the matrix inverse, assuming X is invertible, d(X^{-1}) = -X^{-1} dX X^{-1}, derived from differentiating X X^{-1} = I and applying the product rule. Matrix differentials also integrate naturally with vectorization. Specifically, d \operatorname{vec} Y = J \, d \operatorname{vec} X, where J is the Jacobian matrix of the vectorized function, linking the differential approach to coordinate-based derivatives. Furthermore, the differential df of a scalar function can be integrated along a path in the matrix space to compute increments, analogous to line integrals: \int df = f(X_b) - f(X_a) for an exact differential along a path from X_a to X_b. This enables path-based analyses in matrix parameter spaces, such as in econometric models.
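
The inverse identity can be illustrated numerically: for a small perturbation dX, the change in X^{-1} is captured to first order by -X^{-1} dX X^{-1}, as in the following NumPy sketch (illustrative only).

```python
import numpy as np

rng = np.random.default_rng(9)
X = rng.standard_normal((4, 4)) + 4 * np.eye(4)     # well-conditioned matrix
dX = 1e-6 * rng.standard_normal((4, 4))             # small perturbation

actual_change = np.linalg.inv(X + dX) - np.linalg.inv(X)
first_order = -np.linalg.inv(X) @ dX @ np.linalg.inv(X)   # d(X^{-1}) = -X^{-1} dX X^{-1}
print(np.allclose(actual_change, first_order, atol=1e-10))
```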

Trace and Determinant Derivatives

In matrix calculus, derivatives of the trace function play a central role due to its linearity and the cyclic property, which asserts that for compatible matrices A, B, and C, \operatorname{tr}(ABC) = \operatorname{tr}(BCA) = \operatorname{tr}(CAB). This invariance under cyclic permutations facilitates the simplification of expressions in optimization and statistical applications. The trace of a matrix X, denoted \operatorname{tr}(X), is the sum of its diagonal elements, and its derivative with respect to a scalar is straightforward, but matrix arguments require careful handling of layout conventions, such as the denominator or numerator layout in the framework of Magnus and Neudecker. A general result for the derivative of the trace of a function f(X) is \frac{\partial \operatorname{tr}(f(X))}{\partial X} = \left( \frac{\partial f}{\partial X} \right)^T under the denominator layout convention, where the transpose accounts for the identification of the derivative space with the original space. Specific cases illustrate this: for \operatorname{tr}(A X^2) where X is square, the derivative is \frac{\partial}{\partial X} \operatorname{tr}(A X^2) = (A X + X A)^T. When both A and X are symmetric, the transpose may be dropped, giving A X + X A, which further reduces to 2 A X only when A and X commute. This formula arises from expanding the differential d \operatorname{tr}(A X^2) = \operatorname{tr}(A (X \, dX + dX \, X)) = \operatorname{tr}((A X + X A) dX) and identifying the gradient from the differential, ensuring compatibility with the multi-linear nature of the trace. Derivatives involving the determinant function \det(X) are equally important, particularly for invertible square matrices X. Jacobi's formula provides the differential form d \det(X) = \det(X) \operatorname{tr}(X^{-1} dX), which in component-wise gradient terms yields \frac{\partial \det(X)}{\partial X} = \det(X) (X^{-1})^T. This result stems from the multi-linear dependence of the determinant on the matrix rows or columns, where the adjugate matrix encodes the cofactor expansions. The transpose appears due to the standard layout convention. For numerical stability in computations—especially with ill-conditioned or high-dimensional matrices—direct evaluation of \det(X) can lead to overflow or underflow, so the logarithm is preferred: \frac{\partial \log \det(X)}{\partial X} = (X^{-1})^T. This avoids forming the determinant explicitly and focuses on the trace of the inverse differential. In the context of covariance matrices, which are symmetric positive definite, the derivative of \log \det \Sigma with respect to \Sigma simplifies to \Sigma^{-1}, since the inverse of a symmetric matrix is itself symmetric. This formula is pivotal in maximum likelihood estimation for multivariate Gaussians, where the log-likelihood includes -\frac{n}{2} \log \det \Sigma - \frac{1}{2} \operatorname{tr}(\Sigma^{-1} S) with sample covariance S, and the score function relies on this derivative for iterative optimization. The sensitivity of the determinant, reflecting its multi-linearity, further underscores that small perturbations \Delta X change the determinant by approximately \det(X) \operatorname{tr}(X^{-1} \Delta X), emphasizing conditioning issues in high dimensions where Cholesky decompositions or LDL factorizations are often used to compute inverses without full inversion.
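
Jacobi's formula and the log-determinant gradient can be verified numerically, as in the following NumPy sketch (illustrative; slogdet is used for numerical stability, and the matrix is an arbitrary positive definite example).

```python
import numpy as np

rng = np.random.default_rng(10)
M = rng.standard_normal((4, 4))
X = M @ M.T + 4 * np.eye(4)                   # symmetric positive definite
dX = 1e-6 * rng.standard_normal((4, 4))

# Jacobi's formula as a first-order approximation: d det(X) ~ det(X) tr(X^{-1} dX).
lhs = np.linalg.det(X + dX) - np.linalg.det(X)
rhs = np.linalg.det(X) * np.trace(np.linalg.solve(X, dX))
print(np.isclose(lhs, rhs, rtol=1e-4))

# Gradient of log det(X) via slogdet and finite differences, compared with (X^{-1})^T.
eps = 1e-6
grad_numeric = np.zeros_like(X)
for idx in np.ndindex(*X.shape):
    E = np.zeros_like(X); E[idx] = eps
    grad_numeric[idx] = (np.linalg.slogdet(X + E)[1]
                         - np.linalg.slogdet(X - E)[1]) / (2 * eps)
print(np.allclose(grad_numeric, np.linalg.inv(X).T, atol=1e-5))
```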

Applications in Optimization

Matrix calculus plays a pivotal role in optimization algorithms by providing the necessary derivatives for iterative updates in matrix-valued problems. In gradient descent, a fundamental first-order method, the matrix parameter X is updated as X_{k+1} = X_k - \alpha \frac{\partial f}{\partial X}, where \alpha > 0 is the step size and \frac{\partial f}{\partial X} is the matrix gradient of the objective function f. This approach is widely used for minimizing differentiable functions over matrix spaces, leveraging the chain rule to compute gradients efficiently. For second-order methods, Newton's method extends this by incorporating curvature information through the Hessian matrix H, updating X_{k+1} = X_k - H^{-1} \nabla f, where \nabla f is the gradient and H collects the second derivatives. This quadratic approximation accelerates convergence near minima but requires solving linear systems involving the Hessian, which matrix calculus identities facilitate. In large-scale optimization, where direct Hessian inversion is prohibitive, the conjugate gradient method solves the systems H \Delta X = -\nabla f iteratively, generating conjugate directions that ensure efficient minimization for symmetric positive definite Hessians. This Krylov subspace technique is particularly suited to sparse or structured matrices, reducing computational cost while retaining finite termination in exact arithmetic. A representative application arises in matrix least squares optimization, where the goal is to minimize \|AX - B\|_F^2 over the matrix X, with the Frobenius norm measuring the error; the gradient is 2A^T(AX - B), enabling gradient-based solvers to converge to the closed-form solution X = (A^TA)^{-1}A^TB. Post-2010 developments in Riemannian optimization adapt these techniques to matrix manifolds, such as the symmetric positive definite (SPD) matrices, which arise in covariance estimation and kernel methods; here, gradients are projected onto the tangent space using the manifold's metric, as in Riemannian gradient descent X_{k+1} = \mathrm{Retr}_{X_k}(-\alpha \, \mathrm{grad} f(X_k)), where \mathrm{Retr} is the retraction. Seminal works have extended Newton's and conjugate gradient methods to these geometries, improving scalability for non-Euclidean constraints. In machine learning, automatic differentiation (autodiff) frameworks compute gradients seamlessly during training, supporting optimization of deep networks with parameters like weight matrices; for instance, reverse-mode autodiff evaluates \frac{\partial f}{\partial X} in a single backward pass with cost proportional to that of the forward computation. This enables efficient training via stochastic gradient descent variants on high-dimensional parameter spaces.
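
The matrix least-squares example can be reproduced with a few lines of NumPy (an illustrative sketch, not a production solver): gradient descent using the matrix gradient 2A^T(AX - B) converges to the closed-form solution of the normal equations.

```python
import numpy as np

rng = np.random.default_rng(11)
A = rng.standard_normal((20, 5))
B = rng.standard_normal((20, 3))
X = np.zeros((5, 3))

alpha = 1.0 / (2 * np.linalg.norm(A.T @ A, 2))   # conservative step size from the spectral norm
for _ in range(2000):
    X -= alpha * 2 * A.T @ (A @ X - B)           # X_{k+1} = X_k - alpha * grad f(X_k)

X_closed = np.linalg.solve(A.T @ A, A.T @ B)     # closed-form least-squares solution
print(np.allclose(X, X_closed, atol=1e-6))
```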
