Matrix calculus
Matrix calculus is a mathematical framework that extends the principles of differentiation from scalars and vectors to functions involving matrices and higher-dimensional arrays, enabling the computation of derivatives such as gradients, Jacobians, and Hessians in multivariable settings while preserving rules like the chain rule and product rule.[1][2] It treats matrices as unified objects rather than collections of scalars, facilitating efficient calculations for operations like matrix factorizations, determinants, and inverses.[1] This discipline is particularly vital in fields requiring large-scale computations, such as machine learning, where it underpins optimization algorithms like gradient descent by providing derivatives of loss functions with respect to weight matrices.[1][2] In statistics and signal processing, matrix calculus supports maximum likelihood estimation and Kalman filtering through derivatives of quadratic forms and covariance matrices.[2]

Key concepts include denominator layouts for notation, where the derivative of a scalar with respect to a matrix is arranged to match the input's dimensions, and the use of traces or Frobenius inner products to simplify expressions.[2] Notable applications extend to physics and engineering, where it aids in solving differential equations involving tensor fields and in control theory for stability analysis.[1] Resources like The Matrix Cookbook compile essential identities for these derivatives, emphasizing practical rules over abstract proofs to support computational implementations.[3] Overall, matrix calculus bridges linear algebra and analysis, enabling scalable solutions to complex problems in modern data-driven sciences.[1]
Scope and Fundamentals
Definition and Historical Context
Matrix calculus, also known as matrix differential calculus, is the branch of mathematics that extends classical calculus to functions whose inputs or outputs are vectors or matrices, rather than scalars alone.[4] This field focuses on computing derivatives, gradients, and higher-order differentials in multivariable settings where variables are arranged in matrix form, enabling the analysis of complex systems in optimization, statistics, and engineering.[5] Unlike scalar calculus, which deals with single-variable functions, matrix calculus accounts for the linear algebraic structure of vectors and matrices to handle multidimensional dependencies.[4]

The origins of matrix calculus trace back to the 19th century with Carl Gustav Jacob Jacobi's introduction of the Jacobian determinant in 1841, a functional determinant essential for change-of-variables in multiple integrals and early multivariable analysis.[6] Although Augustin-Louis Cauchy explored similar ideas in 1815, Jacobi's systematic treatment in his 1841 paper "De formatione et proprietatibus determinantium" laid foundational concepts for derivatives in multiple dimensions.[6] Significant expansions occurred in the mid-20th century, particularly in the 1940s and 1950s, as multivariate statistics advanced; for instance, M. S. Bartlett's 1947 paper on multivariate analysis employed matrix techniques to model correlations among multiple variables.[7] In control theory, the 1960s saw further development through Rudolf E. Kálmán's state-space representations, which used matrix derivatives to describe dynamic systems and optimal control.[8] A landmark consolidation came in the late 20th century with Jan R. Magnus and Heinz Neudecker's 1988 book Matrix Differential Calculus with Applications in Statistics and Econometrics, which standardized notation and differentials for matrix derivatives, building on earlier works from the 1950s and 1960s.[9]

Engaging with matrix calculus requires a solid foundation in linear algebra, including vector spaces, matrix operations, and properties like transposes and inverses, alongside multivariate calculus concepts such as partial derivatives and chain rules extended to higher dimensions.[10] These prerequisites allow one to navigate the tensor-like nature of matrix derivatives without delving into specific computations.[5]

A representative example illustrating the extension from scalar to matrix calculus is the quadratic form f(\mathbf{x}) = \mathbf{x}^T A \mathbf{x}, where \mathbf{x} is a vector and A is a symmetric matrix; its derivative with respect to \mathbf{x} yields 2A\mathbf{x}, contrasting with the simple scalar case f(x) = a x^2 where f'(x) = 2a x.[10] This highlights how matrix structure introduces linear transformations into differentiation.[11]
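A minimal numerical check of this quadratic-form gradient is sketched below in Python with NumPy; it is an illustrative example rather than material from the cited sources, comparing the analytic gradient 2A\mathbf{x} against a central finite-difference approximation.

```python
import numpy as np

# Illustrative sketch: for symmetric A, the gradient of f(x) = x^T A x is 2 A x.
rng = np.random.default_rng(0)
n = 4
M = rng.standard_normal((n, n))
A = (M + M.T) / 2          # symmetrize so that grad f(x) = 2 A x
x = rng.standard_normal(n)

f = lambda v: v @ A @ v     # quadratic form f(x) = x^T A x
analytic = 2 * A @ x

# Central finite differences along each coordinate direction.
eps = 1e-6
numeric = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(n)
])

print(np.allclose(analytic, numeric, atol=1e-6))  # expected: True
```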
Relation to Other Calculus Branches
Matrix calculus serves as a natural extension of scalar calculus, where operations like partial derivatives are generalized to higher-dimensional arrays. In scalar calculus, the derivative of a function f(x) with respect to a scalar x yields a scalar, but in matrix calculus, the analogous derivative of a scalar-valued function with respect to a vector input produces a gradient vector, and further extension to matrix inputs results in a Jacobian matrix that captures all partial derivatives in a structured array.[12] This generalization allows for the handling of multivariable functions over Euclidean spaces of matrices, treating them as flattened vectors while preserving matrix-specific operations like multiplication.[12]

Building on vector calculus, matrix calculus extends concepts such as gradients and Hessians to matrix domains, though it diverges in emphasis: vector calculus often arises in physics for integrals over paths or surfaces (e.g., line integrals or curls), whereas matrix calculus prioritizes optimization problems in finite-dimensional spaces, such as least squares or eigenvalue computations, without the same focus on differential forms or manifolds.[4] For instance, the divergence or curl operators in vector calculus have matrix analogs in trace operations or adjugate matrices, but these are typically applied in computational contexts rather than continuous fields.[4] This shift reflects matrix calculus's roots in numerical analysis and machine learning, where vector calculus tools are adapted for discrete array manipulations.[13]

Matrix calculus can be viewed as a specialized subset of tensor calculus, particularly for second-order tensors in finite-dimensional Euclidean spaces, where matrices represent linear transformations between vectors. Tensor calculus, used extensively in general relativity and continuum mechanics, employs abstract index notation and covariant derivatives to handle multi-linear maps on manifolds, whereas matrix calculus simplifies this for flat spaces using component-wise or layout-based notations, avoiding the full machinery of metric tensors and connections.[13] The notational economy of matrix calculus, which relies on traces, transposes, and Kronecker products, facilitates computations in optimization and statistics, contrasting with tensor calculus's broader applicability to curved geometries.[13]

A key distinction from scalar calculus lies in the non-commutativity of matrix operations, which complicates rules like the chain rule. In scalar calculus, multiplication is commutative, allowing flexible ordering in product rules (e.g., d(xg) = g \, dx + x \, dg), but matrix calculus requires careful attention to the direction of multiplication, as AB \neq BA in general, leading to distinct left- and right-multiplication variants in derivatives.[14] This non-commutativity affects higher-order derivatives and necessitates specialized identities to ensure consistency.[14]

An illustrative example of these extensions is the generalization of the scalar Taylor series to matrix functions using the Frobenius norm.
For a scalar function f(t) expanded as f(t) = f(a) + f'(a)(t - a) + \frac{1}{2}f''(a)(t - a)^2 + \cdots, the matrix analog for a differentiable matrix-valued function f(X) around X_0 involves the Fréchet derivative and higher-order terms, often measured via the Frobenius norm \| \cdot \|_F to quantify perturbations: f(X_0 + E) = f(X_0) + Df(X_0)[E] + \frac{1}{2} D^2 f(X_0)[E, E] + \cdots, where Df(X_0)[E] is the first Fréchet derivative applied to perturbation E, and the remainder is bounded using the Frobenius norm of E.[15] This expansion preserves the local approximation property of scalar Taylor series while accounting for matrix structure, with applications in condition number estimation and numerical stability analysis.[15] For the specific case of the squared Frobenius norm itself, \|X_0 + E\|_F^2 = \|X_0\|_F^2 + 2 \operatorname{Tr}(X_0^T E) + \|E\|_F^2, an expansion that terminates exactly at the quadratic term, akin to a scalar second-order expansion.[16]
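Because the squared Frobenius norm expansion above is exact at second order, it is easy to verify numerically; the following Python sketch is an assumed illustration (not taken from the cited sources) that checks the identity for random matrices.

```python
import numpy as np

# Illustrative check of the exact quadratic expansion
#   ||X0 + E||_F^2 = ||X0||_F^2 + 2 tr(X0^T E) + ||E||_F^2.
rng = np.random.default_rng(1)
X0 = rng.standard_normal((3, 5))
E = rng.standard_normal((3, 5))

lhs = np.linalg.norm(X0 + E, "fro") ** 2
rhs = (np.linalg.norm(X0, "fro") ** 2
       + 2 * np.trace(X0.T @ E)
       + np.linalg.norm(E, "fro") ** 2)

print(np.isclose(lhs, rhs))  # expected: True
```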
Primary Applications
Matrix calculus finds extensive applications in machine learning, particularly in the training of neural networks through backpropagation, where matrix derivatives are essential for computing gradients that enable efficient optimization via gradient descent. The backpropagation algorithm, which propagates errors backward through the network using chain rule-based derivatives of matrix-valued functions, allows for scalable learning in deep architectures by avoiding explicit computation of high-dimensional Jacobians.[17]

In optimization, matrix calculus underpins second-order methods such as Newton's algorithm, which utilizes the Hessian matrix (the second derivative of the objective function) to approximate the curvature and guide faster convergence toward minima compared to first-order gradient descent. This approach is particularly valuable in large-scale problems where the Hessian provides quadratic convergence rates under suitable conditions, though computational costs often necessitate approximations like quasi-Newton methods.[18]

Within statistics, matrix derivatives facilitate the maximization of likelihood functions in multivariate regression models, enabling the derivation of estimators for parameters in high-dimensional settings, such as the covariance matrix in Gaussian assumptions. For instance, differentiating the log-likelihood with respect to regression coefficients yields closed-form solutions akin to ordinary least squares, while extensions handle constraints or regularization for robust inference.[19]

In control theory, matrix calculus supports the analysis and design of state-space models, where differentials of matrix equations describe system dynamics, stability, and optimal control policies in linear time-invariant systems. These derivatives are crucial for tasks like computing the sensitivity of state trajectories to parameter perturbations or synthesizing feedback controllers via Lyapunov methods.[20]

Emerging applications post-2020 highlight matrix calculus in quantum computing simulations, where derivatives of matrix exponentials optimize variational quantum algorithms for modeling complex quantum states.[21] Similarly, in AI ethics, fairness gradients, computed as matrix derivatives of loss functions with respect to demographic parity constraints, enable debiasing during training to mitigate subgroup disparities without sacrificing overall performance.[22] These developments address gaps in traditional treatments by applying matrix calculus to ethical optimization landscapes.

A representative example is the computation of gradients in linear regression via least squares, where the derivative of the residual sum of squares with respect to the parameter matrix leads to the normal equations, providing the minimum-variance unbiased estimator in matrix form. This illustrates how matrix calculus simplifies multi-output predictions while ensuring computational efficiency.
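A minimal sketch of this least-squares case in Python with NumPy (illustrative names and synthetic data, not drawn from the cited sources) sets the matrix gradient of the residual sum of squares to zero and confirms that solving the resulting normal equations matches a standard least-squares solver.

```python
import numpy as np

# For multi-output linear regression, RSS(W) = ||Y - X W||_F^2 has matrix
# gradient -2 X^T (Y - X W); setting it to zero gives X^T X W = X^T Y.
rng = np.random.default_rng(2)
n, p, q = 50, 4, 3                       # samples, predictors, outputs
X = rng.standard_normal((n, p))
Y = rng.standard_normal((n, q))

W_normal = np.linalg.solve(X.T @ X, X.T @ Y)   # solve the normal equations
W_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(W_normal, W_lstsq))          # expected: True

# The gradient vanishes (up to round-off) at the fitted coefficients.
grad = -2 * X.T @ (Y - X @ W_normal)
print(np.allclose(grad, 0, atol=1e-9))
```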
Notation and Conventions
Standard Symbols and Definitions
Matrix calculus relies on a consistent set of symbols to represent differentiation operations and related concepts, facilitating precise communication in multivariable settings. The partial derivative is denoted by the symbol ∂, used to express the rate of change of a function with respect to a single variable while holding others constant, such as ∂f/∂x_{ij} for the partial derivative of scalar function f with respect to the (i,j)-th entry of matrix X.[10] The gradient operator ∇ denotes the vector of first-order partial derivatives for a scalar-valued function, typically arranged as a column vector, for instance, \nabla_x f = \left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right)^T where x \in \mathbb{R}^n.[10]

Central to matrix calculus are the Jacobian and Hessian matrices, which organize partial derivatives into matrix form. The Jacobian J_f of a vector-valued function f: \mathbb{R}^n \to \mathbb{R}^m is the m \times n matrix whose (i,j)-th entry is the first partial derivative \partial f_i / \partial x_j, capturing the linear approximation of f near a point. The Hessian H_f for a scalar-valued function f: \mathbb{R}^n \to \mathbb{R} is the n \times n symmetric matrix with (i,j)-th entry \partial^2 f / \partial x_i \partial x_j, representing second-order partial derivatives.

The vectorization operator, denoted vec, transforms a matrix into a column vector by stacking its columns; for an m \times n matrix A, vec(A) yields an mn \times 1 vector, which is essential for reformulating matrix derivatives in vector form.[10] Dimension conventions specify the input and output spaces of functions to clarify derivative structures, such as f: \mathbb{R}^{m \times n} \to \mathbb{R}^p, indicating that f maps m \times n matrices to p-dimensional vectors.[10] Auxiliary operators include the trace, denoted tr(A), which sums the diagonal elements of a square matrix A.[10] The Frobenius inner product for compatible matrices A and B is defined as \langle A, B \rangle = \operatorname{tr}(A^T B), providing a scalar measure analogous to the dot product for vectors.[10]

As an illustrative example, for a function f(X) where X is an m \times n matrix, notation often involves derivatives with respect to X's entries or vec(X), ensuring compatibility with the function's domain \mathbb{R}^{m \times n}.[10] These symbols form the foundational terminology, with layout arrangements of derivatives addressed separately.[10]
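To make the link between vec, the trace, and the Frobenius inner product concrete, the short Python sketch below (an illustrative check, not from the cited sources) confirms that \operatorname{tr}(A^T B) equals the dot product of the column-stacked vectors vec(A) and vec(B).

```python
import numpy as np

# The Frobenius inner product tr(A^T B) equals vec(A) . vec(B), which is
# why vec() lets matrix derivatives be rewritten in familiar vector form.
rng = np.random.default_rng(3)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 4))

vec = lambda M: M.reshape(-1, order="F")   # stack columns (column-major)

frobenius_inner = np.trace(A.T @ B)
vec_dot = vec(A) @ vec(B)

print(np.isclose(frobenius_inner, vec_dot))  # expected: True
```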
Layout Conventions
In matrix calculus, the layout conventions specify how the partial derivatives are arranged within the resulting derivative matrices, particularly for Jacobians, to eliminate ambiguities arising from the multi-dimensional nature of vectors and matrices. The two dominant conventions are the numerator layout (NL) and the denominator layout (DL), which differ primarily in the orientation of rows and columns relative to the input and output variables.[23][24] The numerator layout arranges derivatives as though they form the numerator of a fractional representation \frac{\partial \mathbf{y}}{\partial \mathbf{x}}, with rows indexed by the components of the output \mathbf{y} and columns by those of the input \mathbf{x}; this results in a Jacobian matrix whose dimensions match the outer dimensions of \mathbf{y} by \mathbf{x}.[23] In contrast, the denominator layout arranges derivatives column-wise according to the denominator, with rows indexed by \mathbf{x} components and columns by \mathbf{y} components, yielding a Jacobian with dimensions matching \mathbf{x} by \mathbf{y}.[24] These conventions ensure consistent application of rules like the chain rule but require care in interpretation across fields.[25]

The numerator layout is prevalent in engineering and machine learning contexts, where the example \partial \mathbf{y}/\partial \mathbf{x} naturally aligns rows with output variations, facilitating intuitive backpropagation without additional transpositions.[23] Conversely, the denominator layout is standard in statistics and econometrics, as seen in applications involving covariance structures. A key advantage of the numerator layout is that it preserves the direct matrix multiplication order in the chain rule, \partial \mathbf{y}/\partial \mathbf{z} = (\partial \mathbf{y}/\partial \mathbf{x}) (\partial \mathbf{x}/\partial \mathbf{z}), mirroring scalar calculus.[23] The denominator layout, however, integrates naturally with the vectorization operator \operatorname{vec}(\cdot) and Kronecker products, enabling compact expressions for linear transformations in statistical derivations without extraneous transposes.[24] A disadvantage of the denominator layout is the occasional need for transpositions when interfacing with engineering-style computations, while the numerator layout may complicate vectorized statistical formulas.[25]

To convert between layouts, the derivative matrix in one convention is simply the transpose of the other, ensuring equivalence in the underlying partial derivatives.[25] This transposition rule applies universally, as the layouts reorder the same set of scalar partials.[24] The denominator layout gained prominence through the framework established by Magnus and Neudecker in their foundational text on matrix differential calculus, which emphasized its compatibility with econometric models and vectorization techniques.

As an illustrative example, consider the linear mapping \mathbf{y} = A \mathbf{x}, where \mathbf{y} \in \mathbb{R}^m is a column vector, \mathbf{x} \in \mathbb{R}^n is a column vector, and A \in \mathbb{R}^{m \times n} is the coefficient matrix. The i-th component is y_i = \sum_{k=1}^n A_{i k} x_k, so the partial derivative is \partial y_i / \partial x_j = A_{i j} for each i=1,\dots,m and j=1,\dots,n. In the numerator layout, the Jacobian \partial \mathbf{y}/\partial \mathbf{x} is the m \times n matrix with (i,j)-th entry \partial y_i / \partial x_j = A_{i j}, yielding \partial \mathbf{y}/\partial \mathbf{x} = A.
This arrangement stacks the row-wise gradients of each y_i with respect to \mathbf{x}. In the denominator layout, the Jacobian \partial \mathbf{y}/\partial \mathbf{x} is the n \times m matrix with (j,i)-th entry \partial y_i / \partial x_j = A_{i j}, resulting in \partial \mathbf{y}/\partial \mathbf{x} = A^\top. This column-wise arrangement reflects the input indexing first. For vectorized forms, writing the Jacobian as \partial \operatorname{vec}(\mathbf{y}) / \partial \operatorname{vec}(\mathbf{x})^\top, as in the convention of Magnus and Neudecker, gives A directly for this mapping; in generalized matrix cases such as Y = A X with X \in \mathbb{R}^{p \times n}, the vectorized Jacobian is I_n \otimes A, reflecting the standard Kronecker identity \operatorname{vec}(A X) = (I_n \otimes A) \operatorname{vec}(X).[24]
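The Kronecker identity used here can be checked directly; the Python sketch below (an illustrative verification, not drawn from the cited sources) confirms \operatorname{vec}(AX) = (I_n \otimes A)\operatorname{vec}(X) with column-major stacking.

```python
import numpy as np

# Check the vectorization identity vec(A X) = (I_n kron A) vec(X), using
# column-major (Fortran-order) stacking so vec() matches the text's convention.
rng = np.random.default_rng(4)
m, p, n = 3, 4, 5
A = rng.standard_normal((m, p))
X = rng.standard_normal((p, n))

vec = lambda M: M.reshape(-1, order="F")   # stack columns

lhs = vec(A @ X)
rhs = np.kron(np.eye(n), A) @ vec(X)

print(np.allclose(lhs, rhs))  # expected: True
```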
Alternative Notational Systems
Component-wise notation in matrix calculus employs explicit indices to denote partial derivatives, such as \frac{\partial f_{ij}}{\partial x_{kl}}, which facilitates direct computation of individual elements in matrix-valued functions. This approach treats matrices as collections of scalars, allowing derivatives to be calculated entry by entry, often using the chain rule in indexed form for clarity in proofs and implementations. For instance, the Jacobian matrix J of a vector function \mathbf{f}(\mathbf{x}) can be expressed component-wise as J_{ij} = \frac{\partial f_i}{\partial x_j}, enabling straightforward verification of properties like rank and invertibility.[12][10]

Tensor notation extends this framework by incorporating the Einstein summation convention, where repeated indices imply summation, particularly useful for higher-order derivatives in multidimensional arrays beyond standard matrices. In applications like extensions to general relativity, this notation represents matrix derivatives as tensor contractions, such as the partial derivative tensor \partial_k f_{ij} summed over k for covariant expressions. It generalizes matrix operations to multilinear forms, avoiding explicit summation symbols for conciseness in complex identities involving curvature or stress-energy tensors.[26][27]

Software-specific notations adapt these concepts for computational environments, prioritizing automatic differentiation over manual indexing. In MATLAB, the Symbolic Math Toolbox uses diff for element-wise symbolic derivatives, for instance diff(F, x) differentiating each entry of a symbolic matrix F with respect to x, while jacobian assembles the Jacobian of a vector-valued function. PyTorch's autograd system employs tensor attributes like .grad to compute gradients implicitly, representing matrix derivatives through backward passes without explicit index notation, as in torch.autograd.grad(outputs, inputs). These tools leverage vec operators and Kronecker products internally for efficiency.[28]
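As a concrete illustration of the software-level view, the PyTorch sketch below (assumed example code, not taken from the cited sources) uses autograd's backward pass to recover the gradient of the quadratic form \mathbf{x}^T A \mathbf{x}, which equals (A + A^T)\mathbf{x} for a general square A.

```python
import torch

# PyTorch's autograd computes matrix-calculus gradients implicitly via the
# backward pass: for f(x) = x^T A x, the gradient is (A + A^T) x, reducing
# to 2 A x when A is symmetric.
torch.manual_seed(0)
n = 4
A = torch.randn(n, n)
x = torch.randn(n, requires_grad=True)

f = x @ A @ x               # scalar-valued quadratic form
f.backward()                # populate x.grad via reverse-mode autodiff

expected = (A + A.T) @ x.detach()
print(torch.allclose(x.grad, expected))  # expected: True
```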
Component-wise notation excels in pedagogical settings for its transparency in deriving rules like the product rule but becomes verbose for large matrices, potentially obscuring structural insights. Tensor notation offers generality for high-dimensional problems, such as in physics simulations, yet introduces complexity in tracking index positions and covariant versus contravariant placement, risking errors without rigorous training. Software notations streamline implementation in optimization tasks, reducing errors in chain rule applications, though they abstract away explicit forms, hindering theoretical analysis compared to indexed methods.[12][29]