
Design matrix

In statistics, a design matrix, also known as a model matrix or regressor matrix and often denoted by X, is a matrix that organizes the values of explanatory variables (predictors) across multiple observations for use in linear models such as regression analysis or analysis of variance (ANOVA). It forms the core of the linear model equation Y = X\beta + \epsilon, where Y is the vector of response variables, \beta is the vector of unknown parameters (coefficients), and \epsilon represents the random error term with mean zero. The design matrix enables efficient matrix-based computations for parameter estimation, hypothesis testing, and prediction, making it fundamental to quantitative data analysis in fields like genomics, economics, and engineering.

The construction of a design matrix depends on the nature of the predictors: for continuous variables, it includes the raw values alongside a column of ones for the intercept; for categorical factors, it employs dummy (indicator) variables coded as 0s and 1s to represent group memberships, with the number of columns equal to the number of levels minus one in a reference-level parameterization to avoid perfect collinearity. For example, in a simple linear regression of weight on height for n individuals, X is an n \times 2 matrix with the first column filled with 1s and the second containing height measurements; in a one-way ANOVA comparing means across k groups, X becomes an n \times k matrix of indicators specifying treatment assignments for each observation. Key properties include its dimensions (n rows for observations and p columns for parameters) and its rank, which must typically be full (equal to p) for unique parameter estimates, as a lower rank signals linear dependence among predictors that can inflate variance and hinder inference.

Beyond regression, design matrices play a crucial role in experimental design, particularly in factorial experiments, where they specify the combinations of factor levels (often coded as -1 for low and +1 for high) across experimental runs to assess main effects and interactions orthogonally. For a two-level full factorial with three factors, the 8 \times 3 design matrix lists all 2^3 treatment combinations in standard order, facilitating balanced and efficient estimation of effects without confounding. This versatility extends to generalized linear models and beyond, underscoring the design matrix's importance in ensuring model interpretability and statistical validity across diverse applications.

Fundamentals

Definition

In linear statistical modeling, the design matrix, commonly denoted as X, is an n \times p matrix, where n denotes the number of observations and p the number of predictor variables or factors, with each row corresponding to an individual observation and each column to a specific predictor. This structure organizes the explanatory data to facilitate parameter estimation in regression analyses. The design matrix underpins the general linear model, expressed mathematically as \mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}, where \mathbf{Y} is the n \times 1 response vector containing the observed outcomes, \boldsymbol{\beta} is the p \times 1 vector of unknown parameters to be estimated, and \boldsymbol{\varepsilon} is the n \times 1 error vector. The error term \boldsymbol{\varepsilon} is assumed to consist of independent components with zero mean and constant variance across observations, embodying the principles of independence and homoscedasticity essential for valid inference in the model.

The use of matrices in statistical modeling, including design matrices, developed during the twentieth century, building on early work in experimental design such as that of Ronald Fisher in agricultural applications. Unlike a covariance matrix, which quantifies the variances and correlations among random variables in the data, or a general data matrix that simply stores raw observations, the design matrix specifically encodes the structural relationships between the predictors and the response within the linear model framework.

Dimensions and Notation

The design matrix, conventionally denoted as X, is an n \times p matrix, where n represents the number of observations or samples in the dataset, and p denotes the number of predictors or parameters to be estimated, including the intercept term when it is part of the model. Each row of X corresponds to a single observation, typically represented as the row vector \mathbf{x}_i for the i-th observation, which captures the predictor values associated with that sample. In standard formulations that include an intercept, the first column of X is a vector of ones, enabling the model to incorporate a constant term across all observations. This column of ones multiplies the intercept parameter in the linear combination, shifting the fitted hyperplane to allow for non-zero predictions when all predictors are zero. For models excluding the intercept, known as regression through the origin, the design matrix omits the column of ones, resulting in dimensions n \times (p-1), where p-1 is the number of non-intercept predictors.
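As a minimal illustration of these conventions, the following NumPy sketch (using made-up predictor values) builds the design matrix for a single predictor with and without the intercept column; the variable names are illustrative, not part of any standard API.

```python
import numpy as np

# Hypothetical data: n = 5 observations of one continuous predictor.
x = np.array([1.2, 2.4, 3.1, 4.8, 5.0])
n = x.shape[0]

# Design matrix with an intercept: a column of ones followed by x.
X_with_intercept = np.column_stack([np.ones(n), x])   # shape (5, 2)

# Regression through the origin: the column of ones is omitted.
X_through_origin = x.reshape(-1, 1)                   # shape (5, 1)

print(X_with_intercept.shape, X_through_origin.shape)
```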

Construction

For Continuous Predictors

In the construction of a design matrix for models with continuous predictors, each column beyond the intercept corresponds to one such predictor, where the entry in row i and column j is the observed value x_{ij} of the j-th predictor for the i-th observation. The design matrix X thus takes the form of an n \times (p+1) array, with n rows for observations and p+1 columns including the intercept, ensuring the model captures the linear relationship between the response and these numerical explanatory variables. For a simple case involving one continuous predictor x and an intercept, the design matrix is constructed as X = [ \mathbf{1} \mid \mathbf{x} ], where \mathbf{1} is an n \times 1 column of ones and \mathbf{x} is the n \times 1 vector of predictor values; this structure supports estimation via ordinary least squares in the general linear model framework.

To enhance numerical stability and coefficient interpretability, continuous predictors are often centered by subtracting their sample means and scaled by dividing by their standard deviations, transforming each column j to entries (x_{ij} - \bar{x}_j)/s_j, where \bar{x}_j is the mean and s_j the standard deviation of the j-th predictor. This mitigates issues from differing scales among predictors, reduces collinearity in models with interaction terms, and facilitates comparison of effect sizes across variables.

For modeling nonlinear relationships, higher-order terms are incorporated by adding columns for powers of the predictors, such as x^2 for quadratic effects or x^3 for cubic effects, effectively expanding the design matrix to include these derived features while maintaining the linear-in-parameters form. Interactions between continuous predictors, like the product x_1 x_2, are similarly added as separate columns to capture interaction effects, as in second-order models where the mean function includes terms up to degree two in multiple variables. Orthogonal polynomials may be used for these terms to minimize numerical instability from high correlations among powers of the same predictor.
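The sketch below illustrates this construction on synthetic data: two standardized continuous predictors, a quadratic term, and a two-way interaction are assembled column by column into a single design matrix. All variable names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(10.0, 2.0, n)   # synthetic continuous predictors
x2 = rng.normal(5.0, 1.0, n)

# Center and scale each predictor: (x - mean) / standard deviation.
z1 = (x1 - x1.mean()) / x1.std(ddof=1)
z2 = (x2 - x2.mean()) / x2.std(ddof=1)

# Design matrix: intercept, linear terms, a quadratic term, and an interaction.
X = np.column_stack([
    np.ones(n),   # intercept column
    z1,           # first standardized predictor
    z2,           # second standardized predictor
    z1 ** 2,      # quadratic term for the first predictor
    z1 * z2,      # two-way interaction term
])
print(X.shape)    # (50, 5): one row per observation, one column per parameter
```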

For Categorical Predictors

Categorical predictors, which represent qualitative or discrete factors, must be encoded numerically to be incorporated into the design matrix for linear models. The most common approach is dummy coding, where for a factor with k levels, k-1 binary indicator columns are created in the design matrix, each corresponding to one level excluding a chosen reference level. This encoding ensures that the columns are mutually exclusive and exhaustive, allowing the model to estimate separate effects for each non-reference level relative to the reference. In dummy coding, the entry D_{i,j} in the design matrix is 1 if the i-th observation belongs to the (j+1)-th category (for j = 1, \dots, k-1), and 0 otherwise, with the k-th category serving as the reference level (all zeros in those columns). For example, consider a factor with levels A (reference), B, and C; the design matrix includes two columns: one for B (1 if level B, 0 otherwise) and one for C (1 if level C, 0 otherwise). This setup avoids perfect multicollinearity by preventing linear dependence among the columns, as including all k indicators would make their sum equal to the intercept column of ones, rendering the design matrix singular. The reference level is often selected based on interpretability, such as the most frequent or control category.

An alternative to dummy coding is effect coding, which constructs k-1 columns such that the values across all levels sum to zero for each column, facilitating interpretations centered on deviations from the grand mean rather than from a specific reference level. In effect coding, non-reference levels are typically coded as 1 and the reference level as -1, with adjustments made for balance (e.g., using -1/(k-1) for the reference level in some implementations to ensure the sum-to-zero property). This is particularly useful in balanced designs, where the intercept estimates the overall grand mean and coefficients represent average deviations for each level from that mean, aiding in the analysis of main effects without biasing toward a reference category.

For unordered (nominal) categories, dummy or effect coding is essential to capture qualitative distinctions without implying any ordering. In contrast, ordered (ordinal) categories may sometimes be treated as continuous predictors by assigning numeric scores to levels, or encoded using orthogonal polynomials to model trends, though dummy coding remains applicable when the ordering is not central to the analysis.
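As a concrete sketch (synthetic data, illustrative names only), the following code dummy-codes a three-level factor with reference level A and assembles the resulting indicator columns into a design matrix with an intercept.

```python
import numpy as np

# Hypothetical factor with levels A (reference), B, C observed on n = 6 units.
levels = np.array(["A", "B", "B", "C", "A", "C"])
n = levels.shape[0]

# Dummy coding: k - 1 = 2 indicator columns; the reference level A is all zeros.
d_B = (levels == "B").astype(float)
d_C = (levels == "C").astype(float)

# Design matrix: intercept column plus the two indicators.
X = np.column_stack([np.ones(n), d_B, d_C])
print(X)

# Adding a third indicator for A would make the indicator columns sum to the
# intercept column, producing a singular (rank-deficient) design matrix.
```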

Properties

Full Rank Conditions

In linear models, the design matrix X of dimensions n \times p, where n is the number of observations and p the number of parameters, possesses full column rank if \operatorname{rank}(X) = p, signifying that its columns are linearly independent. This property guarantees unique ordinary least squares estimates for the model parameters \beta. Achieving full rank necessitates the absence of perfect collinearity among the predictor columns. Practitioners can verify this condition computationally by confirming that the determinant of X^T X exceeds zero, thereby establishing the positive definiteness and invertibility of X^T X, or by applying singular value decomposition to X = U \Sigma V^T, where full rank holds if all singular values in \Sigma are strictly positive with no zeros.

Rank deficiency arises when \operatorname{rank}(X) < p, rendering X^T X singular and non-invertible, which precludes a unique solution to the normal equations and yields infinitely many parameter estimates consistent with the data. Addressing this typically involves employing a generalized inverse of X^T X to compute a minimum-norm solution or simplifying the model by eliminating linearly dependent predictors. In overparameterized designs, such deficiency manifests as aliasing, where distinct parameter configurations produce indistinguishable fitted values, complicating interpretation. The normal equations underpinning least squares estimation are formulated as X^T X \beta = X^T y, which demand full column rank for a unique solution; in general, \operatorname{rank}(X) \leq \min(n, p).
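A brief numerical check of these conditions, using synthetic data and standard NumPy routines, might look as follows; the tolerance used to declare a singular value "zero" is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # synthetic design

# Full column rank iff the smallest singular value is bounded away from zero.
s = np.linalg.svd(X, compute_uv=False)
full_rank = s.min() > 1e-10 * s.max()
print(np.linalg.matrix_rank(X), full_rank)        # 3, True

# Rank deficiency: duplicating a column creates perfect collinearity.
X_deficient = np.column_stack([X, X[:, 1]])
print(np.linalg.matrix_rank(X_deficient))         # 3, less than the 4 columns
```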

Orthogonality and Efficiency

An orthogonal design matrix X in linear regression satisfies X^T X = cI, where c is a scalar and I is the identity matrix, implying that the columns of X are pairwise orthogonal and each has equal norm \sqrt{c}. This property ensures that the ordinary least squares estimator simplifies to \hat{\beta} = \frac{1}{c} X^T Y, as the inverse (X^T X)^{-1} becomes diagonal and straightforward to compute, decoupling the estimates of individual parameters. Under the assumptions of the Gauss-Markov theorem—linearity, zero-mean errors, homoscedasticity, and no serial correlation—the ordinary least squares estimator is the best linear unbiased estimator (BLUE), with covariance matrix \operatorname{Var}(\hat{\beta}) = \sigma^2 (X^T X)^{-1}. For orthogonal designs, this covariance matrix diagonalizes, yielding uncorrelated parameter estimates with minimal variances \sigma^2 / c for each component, thereby enhancing estimation efficiency by avoiding variance inflation from collinearity.

Examples of orthogonal designs include balanced 2^k factorial experiments, where factors are coded as \pm 1 and the resulting columns of X are orthogonal, allowing independent assessment of main effects and interactions. Similarly, Helmert contrast matrices in balanced one-way ANOVA produce orthogonal columns by comparing each group mean to the average of subsequent groups, partitioning the sum of squares into independent components for hypothesis testing.

In non-orthogonal designs, collinearity inflates the variances of \hat{\beta}, as measured by the condition number \kappa(X) = \sigma_{\max} / \sigma_{\min} from the singular value decomposition of X, where a large \kappa(X) (e.g., >30) signals numerical ill-conditioning and amplified estimation errors. This degradation contrasts with orthogonal cases, where \kappa(X) = 1, ensuring optimal stability and precision.
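The following sketch verifies these properties numerically for a balanced 2^3 factorial with \pm 1 coding and an added intercept column; the construction is synthetic and intended only to show that X^T X = cI and \kappa(X) = 1 for such a design.

```python
import numpy as np
from itertools import product

# All 2^3 factor-level combinations, coded -1 (low) / +1 (high).
runs = np.array(list(product([-1.0, 1.0], repeat=3)))   # 8 x 3
X = np.column_stack([np.ones(8), runs])                  # prepend intercept

# Orthogonality: X^T X equals 8 * I, so (X^T X)^{-1} is diagonal.
print(X.T @ X)

# Condition number from the singular values equals 1 for this design.
s = np.linalg.svd(X, compute_uv=False)
print(s.max() / s.min())
```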

Examples

Arithmetic Mean Estimation

The simplest application of the design matrix arises in estimating the population mean from a sample of observations. Consider a sample of n observations Y_i for i = 1, \dots, n, modeled as Y_i = \mu + \varepsilon_i, where \mu is the unknown population mean and \varepsilon_i are error terms with mean zero. In matrix form, this is expressed as \mathbf{Y} = X \boldsymbol{\beta} + \boldsymbol{\varepsilon}, where \mathbf{Y} is the n \times 1 vector of observations, \boldsymbol{\beta} = \mu is the scalar parameter, and \boldsymbol{\varepsilon} is the n \times 1 vector of errors. The design matrix X in this intercept-only model is an n \times 1 column vector of ones, denoted \mathbf{1}_n.

The ordinary least squares (OLS) estimator for \boldsymbol{\beta} is given by \hat{\boldsymbol{\beta}} = (X^T X)^{-1} X^T \mathbf{Y}. Substituting X = \mathbf{1}_n, this simplifies to \hat{\beta} = (\mathbf{1}_n^T \mathbf{1}_n)^{-1} \mathbf{1}_n^T \mathbf{Y} = n^{-1} \sum_{i=1}^n Y_i = \bar{y}, the sample arithmetic mean. Thus, the least squares estimate coincides with the familiar sample mean, providing an unbiased estimator of \mu under the model assumptions. The parameter \mu represents the grand mean of the population, serving as the constant level around which the observations fluctuate. The variance of the estimator \hat{\beta} is \sigma^2 / n, where \sigma^2 is the error variance, reflecting that precision improves with larger sample sizes.

This formulation has historical roots in Carl Friedrich Gauss's development of the least squares method in the early 19th century, where he applied it to constant models in astronomical observations, demonstrating that the arithmetic mean minimizes the sum of squared deviations under normality assumptions. Gauss's work, detailed in his 1809 publication Theoria Motus Corporum Coelestium, established the statistical foundation for such estimation.
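A short numerical confirmation of this equivalence, on synthetic data, is sketched below.

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(loc=3.0, scale=1.5, size=20)   # synthetic observations

# Intercept-only model: the design matrix is a single column of ones.
X = np.ones((y.size, 1))

# The OLS estimate (X^T X)^{-1} X^T y reduces to the sample mean.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat[0], y.mean())   # equal up to floating-point error
```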

Simple Linear Regression

In simple linear regression, the model posits a linear relationship between a response Y and a single continuous predictor x, expressed as Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i for i = 1, \dots, n, where \beta_0 is the intercept, \beta_1 is the slope, and \varepsilon_i are errors typically assumed to follow a normal distribution with mean zero and constant variance \sigma^2. The design matrix \mathbf{X} for this model is an n \times 2 matrix that facilitates the matrix formulation of the model, with the first column consisting of ones to account for the intercept term and the second column containing the observed values of the predictor x. This structure is denoted as \mathbf{X} = [\mathbf{1}_n \mid \mathbf{x}], where \mathbf{1}_n is an n \times 1 vector of ones and \mathbf{x} is the n \times 1 vector of predictor values.

The ordinary least squares (OLS) estimator for the parameters is given by \hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}, where \mathbf{Y} is the n \times 1 response vector and \boldsymbol{\beta} = [\beta_0, \beta_1]^T. This yields the explicit formulas \hat{\beta}_1 = \frac{\text{Cov}(x, y)}{\text{Var}(x)} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} and \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, where \bar{x} and \bar{y} are the sample means of x and y, respectively. Geometrically, the columns of \mathbf{X} span a two-dimensional subspace of \mathbb{R}^n, and the OLS fit projects the response \mathbf{Y} onto this subspace, minimizing the residual sum of squares to obtain the fitted values \hat{\mathbf{Y}} = \mathbf{X} \hat{\boldsymbol{\beta}}.

For the design matrix to be of full rank and the OLS estimator to be well-defined, the columns must be linearly independent, which holds as long as the predictor x is not constant across all observations (i.e., \text{Var}(x) > 0). If x is constant, the second column becomes a scalar multiple of the first, rendering \mathbf{X}^T \mathbf{X} singular and the model reducible to an intercept-only model. This setup assumes the predictor is continuous, as constructed in standard matrix form for such variables.
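The sketch below fits this model on synthetic data in two equivalent ways: by solving the normal equations with the design matrix, and by the closed-form slope and intercept formulas.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40
x = rng.uniform(0, 10, n)                     # synthetic predictor
y = 2.0 + 0.5 * x + rng.normal(0, 1, n)       # synthetic response

# Design matrix [1 | x] and OLS via the normal equations.
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The same estimates from the explicit slope and intercept formulas.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(beta_hat, (b0, b1))
```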

Multiple Linear Regression

In multiple linear regression, the model extends the simple linear case to incorporate several continuous predictor variables, allowing for the joint estimation of their effects on the response. The general form is given by Y_i = \beta_0 + \sum_{j=1}^{p-1} \beta_j x_{ij} + \epsilon_i for i = 1, \dots, n, where Y_i is the response, x_{ij} are the continuous predictors, \beta_0 is the intercept, \beta_j are the partial regression coefficients representing the change in Y per unit change in x_j holding the other predictors fixed, and \epsilon_i are independent errors with mean zero and variance \sigma^2. In matrix notation, this becomes \mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}, where \mathbf{Y} is the n \times 1 response vector, \boldsymbol{\beta} is the p \times 1 parameter vector (with p = k+1 for k predictors), and \boldsymbol{\epsilon} is the error vector. The design matrix \mathbf{X} is n \times p, constructed as \mathbf{X} = [\mathbf{1}_n \mid \mathbf{x}_1 \mid \dots \mid \mathbf{x}_{p-1}], with \mathbf{1}_n a column of ones for the intercept and each \mathbf{x}_j the n \times 1 vector of observations for the j-th predictor.

The ordinary least squares (OLS) estimator for \boldsymbol{\beta} is \hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}, which minimizes the sum of squared residuals and provides unbiased estimates under the model assumptions, provided \mathbf{X} has full column rank. The partial coefficients \hat{\beta}_j in \hat{\boldsymbol{\beta}} quantify the unique contribution of each predictor, adjusted for the others, enabling assessment of each predictor's effect in the presence of correlated covariates.

To model interactions among continuous predictors, additional columns are appended to \mathbf{X} consisting of products of the predictor vectors, such as \mathbf{x}_1 \odot \mathbf{x}_2 (element-wise product) for a two-way interaction term \beta_p (x_{i1} x_{i2}). This expands the model to Y_i = \beta_0 + \sum_{j=1}^{p-1} \beta_j x_{ij} + \sum_{m} \beta_m z_{im} + \epsilon_i, where z_{im} are the interaction terms, allowing the effect of one predictor to vary with the level of another. The OLS estimation applies unchanged to the augmented \mathbf{X}.

A key property arises when predictors are highly correlated, leading to multicollinearity: the matrix \mathbf{X}^T \mathbf{X} becomes ill-conditioned and near-singular, causing the inverse (\mathbf{X}^T \mathbf{X})^{-1} to have large elements and inflating the variances of the coefficient estimates, as \text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{X}^T \mathbf{X})^{-1}. This instability can make individual \hat{\beta}_j unreliable, even though the overall model fit may remain adequate; orthogonal predictors mitigate such issues by reducing \mathbf{X}^T \mathbf{X} to a diagonal matrix.
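The following synthetic example builds a design matrix with an intercept, two deliberately correlated predictors, and their interaction, then inspects the diagonal of (X^T X)^{-1}, whose large entries reflect the variance inflation described above.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x1 = rng.normal(size=n)                       # synthetic predictor
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)      # strongly correlated with x1
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 0.5 * x1 * x2 + rng.normal(size=n)

# Design matrix: intercept, both predictors, and their interaction.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Coefficient variances scale with the diagonal of (X^T X)^{-1};
# the correlation between x1 and x2 inflates the corresponding entries.
XtX_inv = np.linalg.inv(X.T @ X)
print(beta_hat)
print(np.diag(XtX_inv))
```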

One-Way ANOVA Models

In the one-way analysis of variance (ANOVA) model, the design matrix X facilitates the estimation of group means through the linear model Y = X\beta + \epsilon, where Y is the response vector, \beta contains the parameters of interest, and \epsilon represents the errors, assumed to be normally distributed with mean zero and constant variance \sigma^2. Two primary parameterizations are used: the cell means model and the reference group model. These approaches differ in how X is constructed and how the parameters \beta relate to the group-specific means \mu_j for k groups.

The cell means model directly parameterizes each \beta_j = \mu_j, the mean of group j. Here, X is an n \times k matrix with k columns, each serving as an indicator for membership in a specific group; the entry x_{ij} = 1 if observation i belongs to group j, and 0 otherwise, with no intercept column included. This construction ensures X has full column rank k, as the columns have disjoint supports corresponding to the groups, avoiding linear dependence. In balanced designs, where each group has an equal number of observations n_j = n/k, the columns each contain the same number of 1s, leading to orthogonal columns and simplified computations. In unbalanced designs, with unequal n_j, the columns have varying numbers of 1s, but X remains full rank, though parameter estimates and inferences adjust for the differing sample sizes.

In contrast, the reference group model uses an intercept and k-1 dummy variables to achieve a full-rank parameterization. The design matrix X is n \times k, with the first column of all 1s for the intercept and subsequent columns as indicators for groups 1 through k-1, omitting the reference group (typically group k). Here, \beta_0 = \mu_k, the mean of the reference group, and \beta_j = \mu_j - \mu_k for j = 1, \dots, k-1, representing deviations from the reference mean. As in the cell means model, balanced designs feature uniform replication across columns, while unbalanced designs result in unequal numbers of 1s per column, influencing the least squares estimates \hat{\beta} = (X'X)^{-1}X'Y.

The F-test for equality of group means in one-way ANOVA relies on the design matrix to partition the total sum of squares into between-group and within-group components via orthogonal projections. The between-group sum of squares (SSB) is computed as Y'(H - \frac{1}{n}J)Y, where H = X(X'X)^{-1}X' is the hat matrix projecting onto the column space of X, and J is the n \times n all-ones matrix; this quantifies variation explained by the group effects under either parameterization. The within-group sum of squares (SSW) is Y'(I_n - H)Y, and the test statistic is F = \frac{\text{SSB}/(k-1)}{\text{SSW}/(n-k)}, testing the null hypothesis that all \mu_j are equal, with the same result across balanced and unbalanced designs when appropriate projections are used.
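The following sketch (synthetic, unbalanced data) constructs both parameterizations and computes the F-statistic from the projection matrices; because the two design matrices span the same column space, either yields the same hat matrix and hence the same test.

```python
import numpy as np

rng = np.random.default_rng(5)
groups = np.repeat([0, 1, 2], [8, 10, 7])                     # unbalanced groups
n, k = groups.size, 3
y = np.array([0.0, 1.0, 2.5])[groups] + rng.normal(0, 1, n)   # synthetic responses

# Cell means parameterization: one indicator column per group, no intercept.
X_cell = (groups[:, None] == np.arange(k)[None, :]).astype(float)

# Reference group parameterization: intercept plus indicators for groups 1..k-1.
X_ref = np.column_stack([np.ones(n), X_cell[:, 1:]])

# F-test via projections (identical under either parameterization).
H = X_cell @ np.linalg.solve(X_cell.T @ X_cell, X_cell.T)   # hat matrix
J = np.ones((n, n)) / n
ssb = y @ (H - J) @ y
ssw = y @ (np.eye(n) - H) @ y
F = (ssb / (k - 1)) / (ssw / (n - k))
print(F)
```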

Extensions

In Generalized Linear Models

In generalized linear models (GLMs), the design matrix X retains its fundamental role as the matrix encoding the predictor variables, much as in ordinary linear regression, but the linear predictor \eta = X \beta is connected to the expected response \mu through a monotonic link function g, yielding g(\mu) = X \beta, where \beta is the parameter vector. This framework accommodates non-normal response distributions from the exponential family, such as the binomial, Poisson, or gamma, allowing GLMs to model diverse data types including binary outcomes and counts. The canonical link functions vary by distribution; for example, the logit link g(\mu) = \log\left(\frac{\mu}{1-\mu}\right) is standard for binary responses in logistic regression.

Parameter estimation in GLMs proceeds via maximum likelihood, which lacks a closed-form solution analogous to the ordinary least squares estimator \hat{\beta} = (X^T X)^{-1} X^T y of linear models, necessitating iterative numerical methods. The iteratively reweighted least squares (IRLS) algorithm is the primary approach, transforming the problem into a sequence of weighted linear regressions. Starting with an initial guess for \beta, IRLS constructs a working response vector z from a linearization of the link function around the current \eta, along with a diagonal weight matrix W derived from the variance function and link derivative, then solves \hat{\beta}^{(k+1)} = (X^T W^{(k)} X)^{-1} X^T W^{(k)} z^{(k)} until convergence. Throughout, the design matrix X remains fixed, while W updates to reflect the nonlinearity of the model.

In logistic regression, a canonical GLM example, the design matrix X is constructed identically to the linear case—for instance, with an intercept column and columns for continuous or categorical predictors—but \beta is estimated by maximizing the binomial likelihood rather than minimizing squared residuals. The logit link ensures predicted probabilities lie between 0 and 1, and IRLS iteratively refines \beta using weights W_{ii} = \mu_i (1 - \mu_i) based on the current fitted values. This adaptation highlights how the design matrix's structure supports additive predictor effects on the link scale, enabling interpretation of coefficients as log-odds changes, without altering X itself from its linear regression form.
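A compact IRLS sketch for logistic regression is shown below on synthetic data; the function name and stopping rule are illustrative, not drawn from any particular library.

```python
import numpy as np

def irls_logistic(X, y, n_iter=25, tol=1e-8):
    """Fit logistic regression by iteratively reweighted least squares.

    X : (n, p) design matrix, built exactly as in linear regression.
    y : (n,) binary responses coded 0/1.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                      # linear predictor
        mu = 1.0 / (1.0 + np.exp(-eta))     # inverse logit link
        w = mu * (1.0 - mu)                 # diagonal IRLS weights
        z = eta + (y - mu) / w              # working response
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Synthetic example: intercept plus one continuous predictor.
rng = np.random.default_rng(6)
x = rng.normal(size=200)
X = np.column_stack([np.ones(200), x])
p_true = 1.0 / (1.0 + np.exp(-(-0.5 + 1.2 * x)))
y = rng.binomial(1, p_true)
print(irls_logistic(X, y))   # estimates roughly near (-0.5, 1.2)
```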

In Experimental Design

In experimental design, the design matrix \mathbf{X} plays a central role in planning controlled experiments to optimize inference about factor effects. Optimal design criteria guide the selection of factor levels and run combinations to enhance the precision of estimates in the model \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}. One widely used criterion is D-optimality, which maximizes the determinant of the information matrix, \det(\mathbf{X}^T \mathbf{X}), thereby minimizing the generalized variance of the least-squares estimates \hat{\boldsymbol{\beta}}. This approach selects subsets of points from a candidate set of factor levels, ensuring efficient use of limited experimental resources while reducing the volume of the confidence ellipsoid for \boldsymbol{\beta}. Another criterion, A-optimality, minimizes the trace of the variance-covariance matrix, \operatorname{tr}\left((\mathbf{X}^T \mathbf{X})^{-1}\right), which lowers the average variance of the estimates and improves overall prediction accuracy in regression models. These criteria prioritize designs that yield precise inferences and are often computed algorithmically for complex design spaces.

Factorial designs, particularly full 2^k setups, construct the design matrix \mathbf{X} with rows representing all 2^k combinations of k factors at two levels each, typically coded as -1 (low) and +1 (high). Each factor column contains an equal number of -1s and +1s, facilitating orthogonal contrasts for estimating main effects and interactions. The standard order arranges columns such that the first alternates -1 and +1, while the i-th column repeats blocks of 2^{i-1} identical values before switching signs, ensuring balanced representation across levels. For resource-constrained scenarios, fractional factorial designs select a fraction of these rows (e.g., 2^{k-p} runs), preserving the estimability of main effects while aliasing higher-order interactions, with the matrix still using \pm 1 coding. These constructions often yield orthogonal columns, enhancing estimation efficiency as discussed in the properties of the design matrix.

Blocking and randomization further refine the design matrix to control extraneous variability. In randomized complete block designs, block effects are incorporated as additional columns in \mathbf{X}, typically using indicator variables for each block level, allowing the model to account for nuisance factors like batch or operator differences without confounding the treatment effects. The model becomes \mathbf{y} = \mathbf{X}_T \boldsymbol{\beta}_T + \mathbf{X}_B \boldsymbol{\beta}_B + \boldsymbol{\epsilon}, where \mathbf{X}_T and \mathbf{X}_B denote the design matrices for treatments and blocks, respectively, partitioning total variability into treatment, block, and error components via ANOVA. Randomization within blocks assigns treatment levels to experimental units at random, ensuring unbiased estimates and mitigating systematic errors, with the full design matrix reflecting these assignments for subsequent analysis.

Software tools facilitate the generation of these design matrices. In R, the AlgDesign package computes exact D-, A-, and I-optimal designs, including blocked variants, from user-specified candidate sets and models, with version 1.2.1.2 released in April 2025. Similarly, the skpr package supports optimal design creation for D-, A-, and other criteria, handling split-plot and blocked structures while evaluating power, updated to version 1.9.2 in September 2025.
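As a small, self-contained illustration of the factorial construction (independent of the R packages mentioned above), the following sketch generates a 2^3 full factorial design matrix in standard order and evaluates the D-criterion \det(\mathbf{X}^T \mathbf{X}) after adding an intercept column.

```python
import numpy as np

def full_factorial_two_level(k):
    """2^k full factorial design matrix in standard (Yates) order,
    coded -1 (low) / +1 (high); column i switches sign every 2^i runs."""
    runs = 2 ** k
    X = np.empty((runs, k))
    for i in range(k):
        block = 2 ** i
        pattern = np.repeat([-1.0, 1.0], block)
        X[:, i] = np.tile(pattern, runs // (2 * block))
    return X

X = full_factorial_two_level(3)          # 8 x 3 matrix of factor levels
print(X)

# With an intercept column, the information matrix is 8 * I, so the
# D-criterion det(X^T X) attains its maximum (8^4 = 4096) for this run size.
Xm = np.column_stack([np.ones(8), X])
print(np.linalg.det(Xm.T @ Xm))
```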
