Categorical variable
A categorical variable, also known as a qualitative variable, is a type of variable in statistics that represents data through distinct categories or labels without an inherent numerical order or meaningful arithmetic operations between the categories.[1][2] These variables are used to classify observations into groups based on shared characteristics, such as attributes or types, and are fundamental in descriptive and inferential statistics for analyzing non-numeric data patterns.[3] Unlike quantitative variables, which involve measurable numerical values with consistent intervals, categorical variables focus on grouping rather than magnitude, enabling analyses like frequency distributions and associations between groups.[2][1]

Categorical variables are broadly classified into two subtypes: nominal and ordinal. Nominal variables consist of categories with no implied order or hierarchy, where the labels serve solely for identification, such as gender (male, female) or eye color (blue, brown, green).[1][3] Ordinal variables, in contrast, maintain a clear ranking or sequence among categories, though the intervals between ranks are not necessarily equal, as seen in educational attainment (high school, bachelor's degree, master's degree) or socioeconomic status (low, medium, high).[1] This distinction is crucial because it influences the choice of statistical tests and visualizations, with nominal data often analyzed via chi-square tests and ordinal data allowing for measures like medians.[1]

Common examples of categorical variables include race, sex, favorite ice cream flavors, and age groups (e.g., under 18, 18-35, over 35), the last of which can be binned from continuous data for targeted analysis.[3][2] In practice, these variables are visualized and summarized using tools like bar graphs, pie charts, and contingency tables to display frequencies or proportions, facilitating insights into relationships, such as the distribution of eye color by hair color in a population sample.[3] For instance, in a dataset of 20 individuals, a two-way table might reveal that 50% of redheads have brown eyes, highlighting categorical associations without assuming numerical differences.[3]

The role of categorical variables extends to various fields, including social sciences, medicine, and machine learning, where they form the basis for modeling predictors like treatment types in clinical trials or user preferences in surveys.[1] Proper handling, such as one-hot encoding for nominal variables in regression models, ensures accurate inference, as misclassifying them as quantitative can lead to invalid conclusions.[1] Overall, understanding categorical variables is essential for robust data analysis, as they capture qualitative diversity that quantitative measures alone cannot address.[2]
Definition and Types
Definition
In statistics, a variable refers to any characteristic, number, or quantity that can be measured or counted and that varies across observations or units of analysis.[4] A categorical variable, also known as a qualitative variable, is a specific type of variable that assigns each observation to one of a limited, usually fixed, number of discrete categories or labels, where the categories lack inherent numerical meaning or a natural order.[5] These categories represent distinct groups based on qualitative properties rather than measurable quantities, enabling the classification of data into non-overlapping groupings.[6]

A defining feature of categorical variables is that their categories must be mutually exclusive, ensuring that each observation belongs to exactly one category without overlap, and exhaustive, meaning the set of categories encompasses all possible outcomes for the variable.[6] This structure facilitates the analysis of associations and distributions within datasets, distinguishing categorical variables from numerical ones, which support arithmetic operations and possess intrinsic ordering.[1]

The origins of categorical variables trace back to early 20th-century statistical developments, particularly Karl Pearson's foundational work on contingency tables in 1900, which introduced methods for examining relationships between such variables through chi-squared tests.[7] This innovation built on prior probabilistic ideas but formalized the treatment of categorical data as a core component of statistical inference.[8]
Nominal Variables
Nominal variables represent a fundamental subtype of categorical variables, characterized by categories that lack any intrinsic order, ranking, or numerical progression. These variables serve to classify observations into distinct groups based solely on qualitative differences, such as eye color or marital status, where one category cannot be considered inherently greater or lesser than another. Unlike other forms of categorical data, nominal variables treat all categories as equals, with no implied hierarchy or magnitude.[1]

A key characteristic of nominal variables is the equality among their categories, which precludes the application of arithmetic operations like addition or subtraction across values. This equality makes them particularly suitable for statistical tests that assess associations or independence between groups, such as the chi-square test of independence, which evaluates whether observed frequencies in a contingency table deviate significantly from expected values under a null hypothesis of no relationship. For instance, in analyzing survey data on preferred beverage types (e.g., coffee, tea, soda), a chi-square test can determine if preferences differ by demographic group without assuming any ordering.[9]

In the typology of measurement scales proposed by S.S. Stevens, nominal measurement occupies the lowest level, serving primarily as a classification or naming system without quantitative implications. Stevens defined nominal scales as those permitting only the determination of equality or inequality between entities, with permissible statistics limited to mode, chi-square measures, and contingency coefficients. This foundational framework underscores that nominal data cannot support more advanced operations, such as ranking or interval estimation, distinguishing it from higher scales like ordinal or interval.[10]

The implications for analysis of nominal variables are significant, as traditional measures of central tendency like the mean or median are inapplicable due to the absence of numerical ordering or spacing. Instead, descriptive analysis focuses on frequencies—the count of occurrences within each category—and the mode, which identifies the most frequent category. For example, in a dataset of blood types (A, B, AB, O), one would report the percentage distribution and highlight the most common type, rather than averaging the categories. This approach ensures that interpretations remain aligned with the qualitative nature of the data, avoiding misleading quantitative summaries.[11][12]
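As a minimal sketch of the kind of chi-square analysis described above, the following Python fragment tests independence between a demographic grouping and preferred beverage type. The counts, group labels, and the use of SciPy's chi2_contingency are illustrative assumptions, not part of the cited sources.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = demographic group,
# columns = preferred beverage (coffee, tea, soda).
observed = np.array([
    [30, 20, 10],   # group 1
    [20, 25, 15],   # group 2
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p_value:.3f}")
# A small p-value suggests beverage preference is associated with group membership.
```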
Ordinal Variables
Ordinal variables represent a subtype of categorical variables characterized by categories that have a natural, meaningful order, but with intervals between successive categories that are not necessarily equal or quantifiable. This ordering allows for the ranking of observations, such as classifying severity levels in medical assessments or preference degrees in surveys, without implying that the difference between adjacent categories is uniform across the scale. For instance, a pain intensity scale might order responses as "none," "mild," "moderate," "severe," and "extreme," where each step indicates increasing intensity, yet the psychological or physiological gap between "mild" and "moderate" may differ from that between "severe" and "extreme."[13]

In S. S. Stevens' foundational typology of measurement scales, ordinal variables occupy the second level, following nominal scales, emphasizing the ability to determine relative position or rank while prohibiting operations that assume equal spacing, such as calculating arithmetic means without qualification.[14] A classic example is the Likert scale, originally developed for attitude measurement, which typically features five or seven ordered response options from "strongly disagree" to "strongly agree," capturing subjective intensity without assuming equidistant intervals.[15] Unlike nominal variables, which treat categories as unordered and interchangeable, ordinal variables enable directional comparisons, such as identifying whether one response is "higher" than another.[1]

Key characteristics of ordinal variables include their suitability for ranking-based analyses, where the focus is on order rather than magnitude of differences, making them ideal for non-parametric statistical tests that avoid assumptions of normality or equal intervals. The Wilcoxon rank-sum test, for example, ranks all observations from two independent groups and compares the sum of ranks to assess differences in central tendency, providing a robust method for ordinal data in comparative studies.[16] This approach preserves the ordinal nature by treating categories as ranks, circumventing issues with unequal spacing that could invalidate parametric alternatives.[13]

For descriptive analysis, medians and modes serve as appropriate central tendency measures for ordinal variables, with the median indicating the middle value in an ordered dataset and the mode highlighting the most common category; these avoid the pitfalls of assuming interval properties. Means, however, require caution as they imply equal distances between categories, potentially leading to misleading interpretations unless specific assumptions hold, such as the presence of five or more categories with roughly symmetric response thresholds. Under such conditions, ordinal data may be approximated as interval for parametric methods, though this should be justified empirically to maintain validity.[13][17]
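The rank-based comparison described above can be sketched in Python; the pain ratings below are hypothetical, and SciPy's mannwhitneyu (equivalent to the Wilcoxon rank-sum test) is used as one common implementation.

```python
from scipy.stats import mannwhitneyu

# Hypothetical ordinal pain ratings coded 0 (none) .. 4 (extreme); the codes
# are used only for their ordering, not as equally spaced quantities.
treatment = [0, 1, 1, 2, 2, 2, 3]
control = [1, 2, 2, 3, 3, 4, 4]

stat, p_value = mannwhitneyu(treatment, control, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.3f}")
```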
Examples
Everyday Examples
Categorical variables appear frequently in daily life, where they classify observations into distinct groups using labels rather than numerical values that imply magnitude or order.[18] For instance, eye color serves as a classic example of a nominal categorical variable, categorizing individuals into groups such as blue, brown, green, or hazel without any inherent ranking or numerical computation between the categories.[19] These labels simply assign qualitative distinctions to describe characteristics, allowing for grouping and comparison based on frequencies rather than arithmetic operations.[18]

Another relatable example is education level, which represents an ordinal categorical variable by ordering categories like elementary school, high school, bachelor's degree, or master's degree, where the sequence implies progression but the differences between levels are not quantifiable numerically.[18] Here, the variable assigns hierarchical labels to reflect relative standing without enabling direct mathematical calculations, such as addition or averaging across levels.[19]

Binary categorical variables, a special case with exactly two categories, often arise in preferences or simple choices, such as yes/no responses to questions like "Do you prefer tea over coffee?" These are frequently represented as 0 and 1 for convenience in data handling, but the core function remains labeling mutually exclusive options without numerical meaning.[18] Nominal and ordinal types, as defined earlier, encompass these everyday applications by providing structured ways to categorize non-numeric attributes in observations.[19]
Domain-Specific Examples
In medicine, blood type serves as a classic nominal categorical variable, classifying individuals into mutually exclusive groups such as A, B, AB, or O based on the ABO blood group system.[20] This variable is crucial for informing transfusion decisions and investigating disease associations; for instance, contingency tables have been used to analyze links between blood types and infection risks, like higher COVID-19 susceptibility in type A individuals compared to type O.[21] With four categories, it exemplifies multi-category complexity, requiring methods that account for multiple levels to detect subtle associations without assuming order.[22]

In oncology, tumor stage represents an ordinal categorical variable, categorizing cancer progression into ordered levels such as stage I (localized), II (regional spread), III (advanced regional), and IV (metastatic).[23] This staging informs treatment planning and prognosis; contingency tables help evaluate associations between stages and outcomes, such as survival rates post-therapy, by cross-tabulating stage groups with response categories to guide clinical trial designs.[24] The multi-level nature (often four or more stages) adds complexity, as analyses must respect the inherent ordering while handling uneven category distributions across patient cohorts.[25]

Social sciences frequently employ political affiliation as a nominal categorical variable, grouping respondents into categories like Democrat, Republican, Independent, or other parties without implied hierarchy.[26] It aids in studying voter behavior and policy preferences; contingency tables reveal associations, such as between affiliation and support for legislation, enabling researchers to quantify partisan divides in surveys.[27] Multi-category setups, with three or more affiliations, highlight analytical challenges like sparse cells in tables, necessitating robust tests for independence.[28]

In marketing, product categories function as a nominal categorical variable, segmenting items into groups such as electronics, apparel, groceries, or books for inventory and targeting purposes.[29] These inform sales strategies and customer segmentation; contingency tables cross-tabulate categories with purchase behaviors to identify patterns, like higher electronics sales among certain demographics, supporting targeted campaigns.[30] With numerous categories (often exceeding five in retail datasets), this variable underscores the intricacies of multi-category analysis, where high dimensionality can complicate association detection without aggregation.[31]
Notation and Properties
Standard Notation
In statistical literature, categorical variables serving as predictors are commonly denoted by an uppercase letter such as X, with categories distinguished by subscripts to indicate specific levels, for instance X_j for the j-th category among K possible values.[32] For binary cases, this simplifies to X = 0 or X = 1, or equivalently X_1 and X_2.[33] To represent membership in a particular category, the indicator function I(X = k) is frequently used, where it equals 1 if the variable X takes the value corresponding to category k and 0 otherwise; this notation facilitates modeling and computation in analyses involving multiple categories.[34]

In software environments for data analysis, categorical variables employ specialized notations for efficient storage and manipulation. In the R programming language, they are implemented as factors, which internally map category labels to integer codes while preserving the categorical structure.[35] Similarly, in Python's pandas library, the 'category' dtype designates such variables, optimizing memory usage for datasets with repeated category labels.[36]
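A brief sketch of the pandas notation mentioned above, using made-up eye-color and education data; the labels and values are illustrative only.

```python
import pandas as pd

# Eye color stored with pandas' 'category' dtype: the distinct labels are kept
# once in `categories`, and each observation is backed by a small integer code.
eye_color = pd.Series(["blue", "brown", "brown", "green", "blue", "brown"],
                      dtype="category")
print(eye_color.cat.categories)      # Index(['blue', 'brown', 'green'], dtype='object')
print(eye_color.cat.codes.tolist())  # e.g. [0, 1, 1, 2, 0, 1]

# An ordered categorical mirrors an ordinal variable.
education = pd.Categorical(["high school", "bachelor", "master", "bachelor"],
                           categories=["high school", "bachelor", "master"],
                           ordered=True)
print(education.min(), "<", education.max())   # high school < master
```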
Number of Possible Values
A categorical variable consists of a fixed, finite set of categories, conventionally denoted by k levels where k \geq 2.[37] This structure distinguishes it from continuous variables, as the possible values are discrete and exhaustive within the defined set, enabling straightforward enumeration in data analysis.[12]

The binary case, where k=2, represents the simplest form of a categorical variable, often termed dichotomous, with outcomes such as yes/no or success/failure.[38] This configuration minimizes analytical demands, as it aligns directly with binary logistic models or simple proportions without requiring additional partitioning.[1] For multicategory variables, where k > 2, the analysis grows in complexity due to the need to account for multiple distinctions among levels, often necessitating techniques like contingency tables or multinomial models to capture inter-category relationships.[37]

A key implication arises in hypothesis testing and regression, where the degrees of freedom for the variable equal k-1, reflecting the redundancy in representing all levels independently.[39] This adjustment ensures unbiased estimation while preventing overparameterization in models.[40]
Finiteness and Exhaustiveness
Categorical variables are defined by a finite set of discrete categories, in contrast to continuous variables that allow for an infinite range of values within intervals. This finiteness ensures that the possible outcomes are limited and countable, facilitating discrete probability modeling and avoiding the complexities associated with uncountable spaces. For instance, a variable representing eye color might include only a handful of options such as blue, brown, green, and hazel, rather than any conceivable shade along a spectrum.[5][41]

A key structural requirement for categorical variables is exhaustiveness, where the categories are mutually exclusive—each observation belongs to exactly one category—and collectively complete, encompassing all possible values that the variable can take in the population or sample. This property prevents overlap and omission, ensuring that the variable fully partitions the outcome space. In statistical analyses, such as contingency tables, this completeness allows marginal probabilities to sum to unity across categories.[41][42]

Violations of finiteness or exhaustiveness can occur when categories are incomplete, such as in surveys where respondents provide responses outside predefined options, leading to unclassified data. To address this, practitioners often introduce an "other" category to capture residual cases and restore exhaustiveness without discarding information. Alternatively, for missing or uncategorized entries, imputation strategies like multiple imputation by chained equations (MICE) can estimate values based on observed patterns, preserving the variable's discrete nature while minimizing bias.[43][44]

Theoretically, finiteness and exhaustiveness underpin the validity of probability distributions for categorical variables, particularly the multinomial distribution, which models counts across a fixed number of categories with probabilities summing to one. This framework supports inference in models like logistic regression for multicategory outcomes, ensuring parameters are identifiable and estimates are consistent. Without these properties, the assumption of a closed outcome space would fail, complicating likelihood-based analyses.[41][45]
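To make the multinomial connection concrete, the short sketch below draws category counts from a multinomial distribution over a fixed, exhaustive set of four blood types. The probabilities are hypothetical round numbers chosen only so that they sum to one, not estimates from the cited sources.

```python
import numpy as np

rng = np.random.default_rng(0)

# A finite, exhaustive set of categories with probabilities summing to 1.
categories = ["A", "B", "AB", "O"]
probs = [0.34, 0.09, 0.03, 0.54]
assert abs(sum(probs) - 1.0) < 1e-9

# Counts per category in a sample of 200 observations follow a multinomial
# distribution over the k = 4 categories.
counts = rng.multinomial(200, probs)
print(dict(zip(categories, counts)))
```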
Descriptive Analysis
Visualization Techniques
Visualization techniques for categorical variables enable the graphical representation of data distributions, proportions, and relationships, facilitating exploratory analysis and effective communication without relying on numerical computations. Bar charts are a primary method for displaying the frequencies or counts of categories, where each bar's height corresponds to the number of observations in a given category, making them suitable for both nominal and ordinal variables. For instance, in a dataset of preferred fruits, a bar chart can clearly show the count for each fruit type, allowing quick identification of the most common preferences.[46]

Pie charts represent proportions of categories as slices of a circle, where the angle of each slice reflects the relative frequency, offering an intuitive view for simple datasets with few categories. However, pie charts can distort perceptions of differences between slices, especially when categories have similar proportions or when more than a handful of categories are present, leading experts to recommend them only for emphasizing parts of a whole in limited cases.[47]

For exploring associations between two or more categorical variables, mosaic plots extend the concept of stacked bar charts by dividing a rectangle into tiles whose areas represent joint frequencies or proportions, visually highlighting deviations from independence. This technique is particularly useful for contingency tables, as the tile widths and heights proportionally encode marginal distributions while shading can indicate residuals for statistical inference.[48]

Best practices in these visualizations include clearly labeling categories and axes to ensure interpretability, using distinct colors for differentiation while not relying on color alone, so that viewers with visual impairments are not excluded, and avoiding three-dimensional effects that can introduce perspective distortions and mislead viewers. Software tools like ggplot2 in R support these methods through functions such as geom_bar() for bar charts and geom_mosaic() via extensions for mosaic plots, while Matplotlib in Python offers similar capabilities with plt.bar() for categorical bars and extensions like statsmodels for mosaic displays.[49][50][51] These graphical approaches reveal underlying patterns, such as imbalances in category distributions or unexpected associations, in a non-numerical manner that enhances accessibility for diverse audiences and supports initial data exploration.[48]
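The following Python sketch shows a bar chart with Matplotlib and a mosaic plot via statsmodels, the two tools named above. The fruit counts and hair/eye-color counts are invented for illustration.

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic

# Bar chart of category frequencies (hypothetical fruit preferences).
fruits = ["apple", "banana", "cherry"]
counts = [12, 7, 5]
plt.bar(fruits, counts)
plt.xlabel("Preferred fruit")
plt.ylabel("Count")
plt.title("Frequencies of a nominal variable")
plt.show()

# Mosaic plot of the joint distribution of two categorical variables
# (hypothetical hair color x eye color counts).
joint_counts = {
    ("red", "brown"): 10, ("red", "blue"): 10,
    ("blond", "brown"): 5, ("blond", "blue"): 25,
}
mosaic(joint_counts, title="Hair color by eye color")
plt.show()
```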
Summary Measures
Summary measures for categorical variables provide numerical summaries of their distributions and associations without relying on graphical representations. For central tendency in nominal categorical data, the mode is the appropriate measure, defined as the category with the highest frequency.[52] This captures the most common value, as arithmetic means are inapplicable due to the lack of numerical ordering.[52]

To describe the overall distribution, frequency counts indicate the absolute number of occurrences for each category, while percentages express these as proportions of the total sample size.[53] These measures are often presented in contingency tables, offering a tabular overview of category prevalences.[53] For ordinal categorical variables, which possess a natural ordering, the median serves as a central tendency measure by identifying the category at the 50th percentile when data are ranked.[54]

Associations between two categorical variables are commonly assessed using Pearson's chi-squared test of independence, which evaluates whether observed frequencies differ significantly from expected frequencies under the null hypothesis of no association.[55] The test statistic is calculated as \chi^2 = \sum \frac{(O - E)^2}{E}, where O denotes observed frequencies and E expected frequencies across all cells of the contingency table.[55] Introduced by Karl Pearson in 1900,[56] this statistic follows a chi-squared distribution under the null hypothesis, enabling p-value computation for significance testing.[9]

A key limitation of summary measures for nominal categorical variables is the absence of a standard variance metric, as categories lack quantifiable distances or intervals for dispersion calculation.[57] Such measures are thus restricted to counts, proportions, and modes, complementing visualization techniques for a fuller descriptive analysis.[53]
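A hedged pandas/SciPy sketch of the summary measures above, computed on a tiny made-up sample: frequencies, percentages, the mode, and a chi-squared test on a contingency table.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical survey data.
df = pd.DataFrame({
    "eye":  ["brown", "blue", "brown", "green", "brown", "blue"],
    "hair": ["dark",  "fair", "dark",  "fair",  "fair",  "fair"],
})

# Frequencies, percentages, and mode for a single nominal variable.
print(df["eye"].value_counts())                      # counts per category
print(df["eye"].value_counts(normalize=True) * 100)  # percentages
print(df["eye"].mode()[0])                           # most frequent category

# Contingency table and Pearson's chi-squared test of independence.
table = pd.crosstab(df["eye"], df["hair"])
chi2, p, dof, expected = chi2_contingency(table)
print(table)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.3f}")
```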
Encoding Techniques
Dummy Coding
Dummy coding is a fundamental technique for encoding categorical variables into numerical form suitable for statistical modeling, particularly in regression analysis. It involves creating binary indicator variables, each taking values of 0 or 1, to represent the presence or absence of specific categories. For a categorical variable with k levels, exactly k-1 dummy variables are generated, omitting one category as the reference or baseline to avoid redundancy.[58][59]

The construction of dummy variables follows a straightforward rule: for each non-reference category j (where j = 1, 2, \dots, k-1), the dummy variable D_j is set to 1 if the observation falls into category j, and 0 otherwise. The reference category is implicitly represented when all dummy variables are 0. This omission is crucial to prevent the dummy variable trap, a form of perfect multicollinearity that would arise if all k dummies were included alongside a model intercept, as the dummies would sum to a constant.[59][60]

A primary advantage of dummy coding lies in its interpretability, especially in linear regression models. The coefficient for each dummy variable quantifies the average difference in the outcome variable between that category and the reference category, controlling for other predictors. This direct comparison facilitates clear insights into category-specific effects.[58][60]

As an illustration, consider a binary gender variable with categories "male" and "female." One dummy variable D_{\text{male}} can be defined such that D_{\text{male}} = 1 for males and 0 for females, treating female as the reference. In a regression model, the coefficient on D_{\text{male}} would estimate the additional effect on the response associated with being male compared to female.[60]
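A minimal pandas sketch of the k-1 construction, assuming a hypothetical three-level color column; pd.get_dummies with drop_first=True omits the first level as the implicit reference.

```python
import pandas as pd

df = pd.DataFrame({"color": ["blue", "brown", "green", "brown", "blue"]})

# For k = 3 levels, drop_first=True yields k - 1 = 2 indicator columns;
# 'blue' (the first level) becomes the reference category, represented
# implicitly by all dummies equal to 0.
dummies = pd.get_dummies(df["color"], prefix="color", drop_first=True, dtype=int)
print(dummies)
#    color_brown  color_green
# 0            0            0
# 1            1            0
# 2            0            1
# 3            1            0
# 4            0            0
```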
Effects Coding
Effects coding is a scheme for encoding categorical predictors in statistical models, such as linear regression, by assigning values that allow coefficients to represent deviations from the grand mean of the response variable across all categories.[61] For a categorical variable with k levels, this method employs k-1 coded variables, each corresponding to one non-reference level.[62] The coding assigns +1 to observations in the corresponding level, -1 to observations in the reference level, and 0 otherwise, ensuring that the design matrix columns sum to zero in balanced designs.[61]

In the regression model, the intercept \beta_0 estimates the overall mean \bar{y} of the dependent variable, while each coefficient \beta_j for the j-th effects-coded variable estimates the deviation of the mean for level j from the grand mean, given by \beta_j = \bar{y}_j - \bar{y}.[61] This interpretation holds under ordinary least squares estimation with balanced data, where the sample sizes per category are equal.[62]

For illustration, consider a categorical variable with four levels (A, B, C, D), treating D as the reference; the coding for the three variables is:

| Level | Variable 1 | Variable 2 | Variable 3 |
|---|---|---|---|
| A | 1 | 0 | 0 |
| B | 0 | 1 | 0 |
| C | 0 | 0 | 1 |
| D | -1 | -1 | -1 |
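To mirror the table above in code, the hypothetical pandas sketch below builds the three effects-coded columns by hand; statistical packages offer the same scheme directly, typically under names such as "sum" or "deviation" contrasts.

```python
import pandas as pd

obs = pd.Series(["A", "D", "B", "C", "D", "A"])  # hypothetical observations

# Effects coding with D as the reference level:
# +1 for the matching level, -1 for the reference level D, 0 otherwise.
coded = pd.DataFrame(0, index=obs.index, columns=["eff_A", "eff_B", "eff_C"])
for name, level in zip(coded.columns, ["A", "B", "C"]):
    coded.loc[obs == level, name] = 1
coded.loc[obs == "D", :] = -1

print(pd.concat([obs.rename("level"), coded], axis=1))
```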
Contrast Coding
Contrast coding assigns specific numerical weights to the levels of a categorical variable in regression or ANOVA models to test targeted hypotheses about differences between group means, rather than estimating all parameters separately. These weights are chosen such that they sum to zero across levels, ensuring orthogonality and allowing the model's intercept to represent the grand mean of the response variable. This approach is particularly useful for planned comparisons, where researchers specify contrasts in advance to increase statistical power and focus on theoretically relevant differences.[64][65]

One common type is the treatment versus control contrast, which compares treatment levels to a designated control or reference level, often using weights adjusted for hypothesis testing. For instance, in a design with one control and multiple treatments, a single overall contrast can test the average treatment effect against the control by assigning -1 to the control and +1/n to each treatment level (where n is the number of treatment levels), enabling a test of whether the mean of the treatments differs from the control. Individual comparisons of each treatment to the control use separate contrast variables. Another type, the Helmert contrast, compares the mean of each level to the mean of all subsequent levels, facilitating sequential hypothesis tests such as whether the first level differs from the average of the rest. This is defined for k levels with weights that partition the comparisons orthogonally, such as for three levels: first contrast (1, -0.5, -0.5), second (0, 1, -1). Polynomial contrasts, suitable for ordinal categorical variables, model trends like linear or quadratic effects across ordered levels by assigning weights derived from orthogonal polynomials, such as for a linear trend in four levels: (-3/√20, -1/√20, 1/√20, 3/√20), normalized to unit length.[64][65]

A straightforward example for a two-group categorical variable (e.g., control and treatment) uses weights of -0.5 and +0.5, respectively. In a linear model Y = \beta_0 + \beta_1 X + \epsilon, where X is the contrast-coded predictor, the intercept \beta_0 estimates the unweighted average of the two group means (the grand mean when the groups are balanced), and \beta_1 estimates the signed difference between the group means. This setup directly tests the null hypothesis H_0: \mu_1 - \mu_2 = 0 via the t-statistic on \beta_1. Effects coding, which compares each level to the grand mean using weights that sum to zero (e.g., +1 and -1 for two groups, scaled), serves as a special case of contrast coding for omnibus mean comparisons.[64][65]

The primary advantages of contrast coding include its efficiency in parameter estimation, as it uses k-1 orthogonal predictors for k levels, reducing multicollinearity and degrees of freedom compared to unadjusted dummy coding while enabling precise hypothesis tests. It enhances power for a priori contrasts by concentrating variance on specific comparisons, minimizing Type II errors in experimental designs. Additionally, for ordinal data, polynomial contrasts reveal underlying trends without assuming arbitrary group differences, supporting interpretable inferences in fields like psychology and social sciences.[65]
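The two-group ±0.5 contrast can be checked numerically with a small simulation; the data below are synthetic, and plain least squares via NumPy stands in for whichever modeling package is actually used.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two balanced groups (control = -0.5, treatment = +0.5) with different means.
x = np.repeat([-0.5, 0.5], 50)
y = np.where(x > 0, 12.0, 10.0) + rng.normal(0, 1, size=x.size)

# Design matrix: intercept column plus the contrast-coded predictor.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"intercept ~ grand mean:      {beta[0]:.2f}")   # about 11
print(f"slope     ~ mean difference: {beta[1]:.2f}")   # about 2
```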
Advanced Representations
Nonsense Coding
Nonsense coding refers to a method of representing categorical variables in statistical models by assigning arbitrary or randomly selected numerical values to each category, without any intent to impose meaningful structure or order. This approach contrasts with structured schemes like dummy or effects coding, as the chosen values bear no relation to the categories' substantive differences. According to O'Grady and Medoff (1988), nonsense coding uses any non-redundant set of coefficients to indicate category membership, but its parameters are only interpretable under limited conditions, often leading to misleading conclusions about category effects.

The purpose of nonsense coding is primarily pedagogical: it demonstrates that the overall fit of a regression model, such as the multiple correlation coefficient R or the accuracy of predicted values, remains invariant across different coding schemes for categorical predictors, including arbitrary ones. Gardner (n.d.) illustrates this in the context of a 2x2 factorial design, where nonsense coding yields the same R^2 value (e.g., 0.346) and cell mean estimates as standard codings, but alters the numerical values and significance tests of the regression coefficients. This highlights how overparameterized models can achieve good predictive performance even with non-informative representations, underscoring the distinction between statistical fit and substantive insight.[66]

A concrete example involves a three-level categorical variable, such as treatment groups A, B, and C, coded arbitrarily as 3 for A, 7 for B, and 1 for C. In a multiple regression analysis, the resulting coefficients for these codes would reflect linear combinations of category effects but lack any direct, meaningful interpretation—unlike dummy coding, where coefficients represent deviations from a reference category. The key lesson is that the choice of coding profoundly influences the ability to draw valid inferences about categorical effects, even if predictive utility is preserved.
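The fit-invariance point can be illustrated numerically. The sketch below is not taken from the cited sources; it assumes, as in O'Grady and Medoff's description, that the arbitrary codes form a non-redundant set of k-1 coding vectors for a three-level factor, and shows that the R^2 matches dummy coding even though the nonsense-coded coefficients themselves are not directly interpretable.

```python
import numpy as np

rng = np.random.default_rng(2)

groups = np.repeat(["A", "B", "C"], 30)
means = {"A": 5.0, "B": 8.0, "C": 6.5}               # hypothetical group means
y = np.array([means[g] for g in groups]) + rng.normal(0, 1, size=groups.size)

def r_squared(X, y):
    """R^2 from an OLS fit of y on X (X already contains an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

ones = np.ones(groups.size)

# Dummy coding: indicators for A and B, with C as the reference.
X_dummy = np.column_stack([ones, groups == "A", groups == "B"]).astype(float)

# "Nonsense" coding: two arbitrary, non-redundant numeric codes per category.
code1 = {"A": 3.0, "B": 7.0, "C": 1.0}
code2 = {"A": 2.0, "B": 2.0, "C": 9.0}
X_nonsense = np.column_stack([ones,
                              [code1[g] for g in groups],
                              [code2[g] for g in groups]])

print(r_squared(X_dummy, y))     # identical overall fit ...
print(r_squared(X_nonsense, y))  # ... but coefficients with no direct meaning
```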
Embeddings
Embeddings represent categorical variables as low-dimensional dense vectors learned directly from data, enabling machine learning models to infer latent relationships and similarities among categories without relying on predefined structures.[67] This approach treats categories as entities to be mapped into a continuous Euclidean space, where proximity reflects functional or semantic similarity, as demonstrated in entity embedding techniques for function approximation problems.[67] For instance, in text processing, word embeddings like those from Word2Vec model words as categorical tokens, capturing contextual analogies such as "king" - "man" + "woman" ≈ "queen" through vector arithmetic.[68]

These embeddings are typically learned end-to-end within neural network architectures, starting with categorical inputs converted to integer indices or one-hot encodings, which are then projected via a trainable embedding layer into a fixed-size vector space of lower dimensionality than the number of categories.[67] The learning process optimizes the vectors based on the overall model objective, such as minimizing prediction error in supervised tasks, allowing the embeddings to adaptively encode category interactions with other features.[67] This contrasts with sparse traditional codings by producing compact, dense representations that generalize better across datasets.[67]

A key advantage of embeddings is their ability to quantify category similarities using metrics like cosine distance, where vectors for related categories (e.g., "dog" and "puppy" in an animal classification task) cluster closely, facilitating downstream tasks like clustering or nearest-neighbor search.[67] They are especially valuable for high-cardinality variables, where the explosion of unique categories would render one-hot encodings computationally prohibitive, reducing parameter count while preserving expressive power.[67]

In applications, embeddings have transformed natural language processing by enabling efficient handling of vocabulary as categorical variables, powering tasks from sentiment analysis to machine translation since the introduction of efficient training methods in the 2010s.[68] In recommendation systems, they represent user preferences or item attributes as categories, improving personalization by learning latent factors that capture user-item affinities, as extended from entity embedding principles to large-scale collaborative filtering.[67] This development, building on foundational neural language models, has become a standard in deep learning pipelines for categorical data since Mikolov et al.'s 2013 work.[68]
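As a hedged sketch of how such an embedding layer is typically set up, the following PyTorch fragment maps a hypothetical high-cardinality product-ID feature to dense vectors; the dimensions and indices are arbitrary, and the vectors only become meaningful after training within a larger model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical high-cardinality categorical feature: 10,000 product IDs mapped
# to dense 16-dimensional vectors that are learned with the rest of the model.
num_categories, embedding_dim = 10_000, 16
embedding = nn.Embedding(num_categories, embedding_dim)

# A batch of integer-encoded category indices.
product_ids = torch.tensor([3, 3, 512, 9_999])
vectors = embedding(product_ids)          # shape: (4, 16)

# Cosine similarity between two category vectors; after training, related
# categories tend to sit close together in the embedding space.
sim = F.cosine_similarity(vectors[0], vectors[2], dim=0)
print(vectors.shape, float(sim))
```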
Regression Applications
Incorporating Categorical Predictors
To incorporate categorical predictors into a linear regression model, the categories are first encoded into a set of binary indicator variables (dummies) or contrast variables, with one category typically omitted as the reference to avoid perfect multicollinearity. These encoded variables then replace the original categorical predictor in the model specification, allowing the regression to estimate category-specific effects alongside other predictors. The resulting model takes the form Y = \beta_0 + \sum_{j=1}^{k-1} \beta_j D_j + \epsilon, where Y is the response variable, \beta_0 is the intercept (representing the mean of Y for the reference category when all other predictors are zero), D_j are the indicator variables for the k-1 non-reference categories (each D_j = 1 if the observation belongs to category j, and 0 otherwise), \beta_j are the coefficients for those categories, and \epsilon is the error term. This approach, originally formalized for handling qualitative factors in econometric models, enables the linear regression framework to accommodate non-numeric predictors without altering the core estimation procedure.

The coefficients \beta_j in this model are interpreted as the adjusted difference in the expected value of Y between category j and the reference category, holding all other predictors constant; for example, a positive \beta_j indicates that category j is associated with a higher mean response than the reference. This interpretation depends on the chosen encoding scheme, such as dummy coding where \beta_j directly measures the deviation from the reference, but remains consistent across valid encodings like contrasts as long as the reference is clearly defined. In practice, the intercept \beta_0 provides the baseline prediction for the reference group, while the \beta_j terms quantify incremental effects.[69][70]

Key assumptions for this incorporation include linearity in the parameters (the effects of the categorical predictors enter the model additively through the linear predictor) and no multicollinearity among the encoded variables, which is ensured by excluding one category as the reference to prevent linear dependence. Violation of the no-multicollinearity assumption would lead to unstable coefficient estimates, but the reference category omission resolves this for categorical predictors alone; interactions or correlated covariates may introduce additional issues requiring separate diagnostics. These assumptions align with the standard linear regression framework, ensuring unbiased and efficient estimation under ordinary least squares.[69][70]

In software implementations, categorical predictors are integrated seamlessly into linear models via functions like R's lm(), which automatically applies treatment contrasts (dummy coding) to factor variables upon model fitting, or generalized linear model (GLM) frameworks that extend this to non-normal responses while maintaining the same encoding process. For instance, specifying a factor variable in lm(Y ~ categorical_factor + other_predictors, data = dataset) generates the necessary dummies internally, with coefficients output relative to the first level as reference unless contrasts are customized. This built-in handling simplifies analysis in tools like R or SAS, reducing manual preprocessing while supporting extensions to GLMs for broader applicability.[71][69]
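For readers working in Python rather than R, a comparable hedged sketch uses the statsmodels formula interface, which applies treatment (dummy) coding to a categorical term; the data frame and variable names below are invented for illustration.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: a numeric response and a three-level categorical predictor.
df = pd.DataFrame({
    "y":     [4.1, 5.0, 6.2, 5.8, 7.1, 6.9, 4.4, 6.0],
    "group": ["a", "a", "b", "b", "c", "c", "a", "b"],
})

# C(group) requests treatment (dummy) coding; the first level, 'a', is the
# reference, so each coefficient is that group's mean difference from 'a'.
model = smf.ols("y ~ C(group)", data=df).fit()
print(model.params)
# Intercept      ~ mean of group 'a'
# C(group)[T.b]  ~ mean(b) - mean(a)
# C(group)[T.c]  ~ mean(c) - mean(a)
```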