Classical test theory

Classical test theory (CTT), also known as true score theory, is a foundational psychometric framework that posits an observed test score as the sum of a true score and random measurement error, providing essential methods for evaluating the reliability and validity of educational and psychological assessments. Developed in the early 20th century, CTT originated with Charles Spearman's work on correcting for attenuation in correlations due to measurement error, laying the groundwork for modern test construction and scoring. At its core, the theory assumes that true scores represent the hypothetical average performance over infinite test administrations under identical conditions, while errors are random, normally distributed with a mean of zero, and uncorrelated with true scores or other errors.

Key to CTT is the concept of reliability, defined as the proportion of observed score variance attributable to true score variance, expressed as R = \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2}, where \sigma_T^2, \sigma_X^2, and \sigma_E^2 are the variances of true, observed, and error scores, respectively. Reliability estimates range from 0 (no consistency) to 1 (perfect consistency) and can be computed using methods such as test-retest correlations, parallel forms, split-half techniques, or Cronbach's coefficient alpha, which assesses internal consistency for multi-item tests. For instance, coefficient alpha is calculated as \alpha = \frac{k}{k-1} \left(1 - \frac{\sum \sigma_i^2}{\sigma_X^2}\right), where k is the number of items and \sigma_i^2 are item variances; values above 0.70 are often deemed acceptable for research purposes, though higher thresholds like 0.90 are preferred for high-stakes applications. CTT also addresses validity indirectly through its focus on minimizing random error to ensure scores reflect intended constructs, though it primarily emphasizes reliability over systematic bias. The standard error of measurement (SEM), derived from reliability as SEM = \sigma_X \sqrt{1 - R}, quantifies score precision and is assumed constant across ability levels, allowing confidence intervals around observed scores (e.g., ±1.96 SEM for 95% intervals). At the item level, CTT evaluates difficulty (proportion correct) and discrimination (correlation with total score), guiding test construction but requiring representative samples since item properties are sample-dependent.

Historically, Spearman's 1904 formulation of reliability as the correlation between parallel tests marked CTT's inception, with subsequent refinements by figures like Lee Cronbach (1951) for coefficient alpha and Frederic Lord and Melvin Novick (1968) for axiomatic treatments. While influential for over a century in fields like education and psychology, CTT's assumptions, such as interval-level true scores and error independence, have limitations, particularly in handling systematic errors or invariant item calibration, leading to the rise of complementary approaches like item response theory.
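As a worked illustration of the SEM formula above (all input values are hypothetical, chosen for the example), the following plain-Python sketch computes the SEM and a 95% confidence interval around an observed score:

```python
# Illustrative sketch: SEM = sigma_X * sqrt(1 - R) and a 95% confidence
# interval around an observed score. All values below are hypothetical.
import math

sd_x = 15.0         # hypothetical observed-score standard deviation
reliability = 0.91  # hypothetical reliability estimate (e.g., coefficient alpha)
observed = 108.0    # hypothetical observed score

sem = sd_x * math.sqrt(1.0 - reliability)                    # standard error of measurement
lower, upper = observed - 1.96 * sem, observed + 1.96 * sem  # 95% interval

print(f"SEM = {sem:.2f}; 95% CI for the true score: [{lower:.1f}, {upper:.1f}]")
```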

History

Origins and Early Concepts

The foundations of classical test theory emerged in the late 19th century through the pioneering work of Francis Galton, who emphasized the measurement of individual differences in human abilities and introduced the concept of correlation to quantify relationships among traits. Galton's studies in the 1880s, including his analysis of anthropometric data and psychophysical experiments, shifted focus from average group performance to variability among individuals, laying the groundwork for psychometric assessment by highlighting the need to account for natural variation in mental characteristics. Karl Pearson later formalized these ideas with his development of the product-moment correlation coefficient in the 1890s, which provided a statistical tool for analyzing associations while accounting for measurement imperfections.

Building on these ideas, Charles Spearman advanced the field in 1904 with his development of the correction for attenuation formula, which accounted for the underestimation of true correlations due to measurement errors, thereby introducing the notion of separating observed scores into true components and random errors central to CTT. This statistical approach, derived from analyses of test score correlations, underscored the impact of error on psychometric measurements and influenced subsequent models by positing that observed variations could be parsed into systematic true scores and unsystematic errors.

The practical application of these concepts gained momentum in 1905 when Alfred Binet and Théodore Simon developed the first standardized intelligence test to identify children needing educational support in French schools, marking a shift toward empirical testing of mental levels through age-graded tasks. Their Binet-Simon scale prioritized qualitative judgment over precise quantification, focusing on practical screening rather than absolute measurement. In the United States, Lewis Terman expanded this framework in 1916 by revising the Binet-Simon scale into the Stanford-Binet test, which introduced an intelligence quotient (IQ) for ranking individuals based on aggregated scores relative to age norms. Terman's adaptation emphasized efficient administration and normative comparisons, reflecting the era's initial psychometric emphasis on ordinal ranking for educational placement over interval-scale precision. These developments set the stage for more formalized theoretical structures in subsequent decades.

Formalization and Key Contributors

The formalization of classical test theory (CTT) began to take shape in the early 20th century through statistical advancements in test construction and interpretation. Truman L. Kelley's 1927 book, Interpretation of Educational Measurements, provided a foundational framework by outlining statistical methods for evaluating test reliability and validity, including procedures for calculating reliability coefficients, standard deviations, and correlations between test scores. Kelley's work emphasized the practical application of these methods to educational tests, such as the Stanford Achievement Test and the Thorndike-McCall Reading Scale, establishing reliability thresholds (e.g., ≥0.94 for individual assessments) that influenced subsequent psychometric standards. This text built on earlier correlational approaches, consolidating them into systematic tools for test development.

Practical demands from large-scale testing programs further propelled theoretical advancements. During World War I, the U.S. Army developed the Army Alpha and Army Beta tests to assess the cognitive abilities of over 1.5 million recruits, with Alpha targeting verbal and numerical skills for literate individuals and Beta using non-verbal formats for others. These efforts demonstrated the feasibility of group-administered standardized tests, driving the refinement of CTT by highlighting the need for reliable, scalable measurement in high-stakes contexts like personnel selection. The military's success in applying these tests influenced post-war educational and industrial testing, shifting focus toward error minimization and score consistency in diverse populations.

By the mid-20th century, Harold Gulliksen's Theory of Mental Tests (1950) emerged as the seminal text formalizing the true score model central to CTT, synthesizing prior statistical insights into a comprehensive handbook for test theory and application. Gulliksen, who contributed to naval assessment research during World War II, integrated concepts like reliability estimation and error analysis, making CTT accessible for educational and psychological measurement. The 1940s and 1950s saw further evolution through organizations like the American Psychological Association (APA) and the Psychometric Society (founded 1935), the latter of which launched Psychometrika in 1936; this period produced standardized reliability metrics such as the Kuder-Richardson formulas (1937) and Guttman's lower bounds (1945). These developments solidified CTT's framework amid growing applications in education and industry.

The Classical Model

True Score, Observed Score, and Error

In classical test theory, the observed score, denoted as X, represents the actual score obtained by an examinee on a given test administration. This score is conceptualized as comprising two distinct components: the true score and the error score. The true score, denoted as T, is defined as the expected value of the observed score across an infinite number of test administrations under identical conditions, formally expressed as T = E(X). This hypothetical average reflects the examinee's underlying ability or trait without distortion from measurement inconsistencies. The error score, denoted as E, captures random deviations in the observed score due to factors such as test-taking conditions, momentary fluctuations in performance, or measurement imprecision, with its expected value being zero, E(E) = 0. These errors are assumed to be uncorrelated with the true score. The foundational equation of the model integrates these elements as X = T + E, positing that every observed score is the sum of the true score and an error component. This decomposition underscores that observed scores serve as imperfect indicators of true ability, with variability arising solely from random errors under the theory's framework. For score interpretation, the model implies that the true score provides an unbiased estimate of the examinee's standing when errors average to zero over replications, but systematic deviations in measurement, such as consistent biases from test design flaws, would introduce non-random distortions not accounted for in the classical formulation.
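A minimal simulation makes the expectation-based definition concrete; the true score and error standard deviation below are hypothetical values, and numpy is assumed. Averaging observed scores over many simulated administrations recovers the true score, since the simulated errors have mean zero:

```python
# Illustrative sketch: for one fixed examinee, the mean of observed scores
# X = T + E over many hypothetical administrations converges to the true
# score T, because errors E have mean zero. Values are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
true_score = 50.0            # hypothetical true score T
error_sd = 5.0               # hypothetical error standard deviation
n_administrations = 100_000

errors = rng.normal(0.0, error_sd, n_administrations)  # E, with E(E) = 0
observed = true_score + errors                         # X = T + E

print(f"Mean observed score: {observed.mean():.3f} (true score: {true_score})")
print(f"Mean error: {errors.mean():.4f} (assumed E(E) = 0)")
```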

Core Assumptions

Classical test theory (CTT) is grounded in the foundational model where an observed score X equals the true score T plus an error component E, expressed as X = T + E. This model relies on several core assumptions to ensure its mathematical coherence and practical applicability in psychological measurement. The assumption of linearity posits that errors are strictly additive to the true score and uncorrelated with it, meaning the covariance between T and E is zero (\operatorname{Cov}(T, E) = 0). This ensures that the observed score variance decomposes cleanly into true score variance and error variance without interaction effects, preserving the model's simplicity. Unbiasedness requires that the expected value of the error score is zero across repeated administrations of the test (E(E) = 0). This condition implies that, on average, measurement errors do not systematically over- or underestimate the true score, allowing for impartial estimation in large samples. CTT further assumes constant true scores for a given individual across parallel test forms, such that the true score remains stable and unaffected by test-specific factors. Complementing this, errors from different administrations are independent and uncorrelated (\operatorname{Cov}(E_i, E_j) = 0 for i \neq j). These assumptions collectively enable the concept of parallel tests, which are alternate forms measuring the same construct with equivalent expected true scores, identical error variances, and uncorrelated errors, thereby yielding the same population means and variances. This parallelism is essential for comparing scores across administrations and assessing measurement consistency.
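These assumptions can be checked by simulation. The following illustrative sketch (hypothetical variance components, numpy assumed) generates two parallel forms that share true scores but have independent, equal-variance errors, and compares their empirical correlation with the theoretical reliability \operatorname{Var}(T)/(\operatorname{Var}(T) + \operatorname{Var}(E)):

```python
# Illustrative sketch: two parallel forms share each person's true score and
# have independent errors of equal variance; their correlation then
# approximates Var(T) / Var(X). Variance components are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
n_persons = 200_000
true_sd, error_sd = 10.0, 5.0                   # hypothetical components

T = rng.normal(100.0, true_sd, n_persons)       # constant true scores
X1 = T + rng.normal(0.0, error_sd, n_persons)   # form 1: independent errors
X2 = T + rng.normal(0.0, error_sd, n_persons)   # form 2: independent errors

theoretical = true_sd**2 / (true_sd**2 + error_sd**2)  # Var(T) / Var(X)
empirical = np.corrcoef(X1, X2)[0, 1]
print(f"theoretical reliability = {theoretical:.3f}, parallel-forms r = {empirical:.3f}")
```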

Reliability Assessment

Conceptualization of Reliability

In classical test theory, reliability refers to the consistency with which a test measures an underlying construct across repeated administrations under equivalent conditions, distinguishing stable true differences among examinees from random measurement errors. Under the true score model, where the observed score X decomposes into the true score T and error score E such that X = T + E, reliability is quantified by the reliability coefficient \rho_{XX'}, defined as the ratio of true score variance to total observed score variance: \rho_{XX'} = \frac{\operatorname{Var}(T)}{\operatorname{Var}(X)} = 1 - \frac{\operatorname{Var}(E)}{\operatorname{Var}(X)}. This proportion indicates the extent to which systematic individual differences account for score variation, with the remainder attributed to unsystematic error. The reliability coefficient \rho_{XX'} equals the expected correlation between observed scores from two parallel test forms, which yield identical true scores but independent errors. It ranges from 0, signifying complete inconsistency where all score variance stems from error, to 1, denoting perfect consistency with negligible error influence. As error variance increases relative to true variance, due to factors like imprecise items or unstable testing conditions, reliability decreases, amplifying score fluctuations unrelated to the trait.

A primary factor affecting reliability is test length, as extending a test reduces the relative impact of error by averaging across more items. The Spearman-Brown prophecy formula predicts this effect, estimating the reliability \rho_{kX} of a test k times longer than an original with reliability \rho_{XX}: \rho_{kX} = \frac{k \rho_{XX}}{1 + (k-1) \rho_{XX}}. This formula arises from the properties of composite scores: for k equivalent subtests sharing the same true score, the variance of the total true score is k^2 \operatorname{Var}(T), while the total observed score variance is k^2 \operatorname{Var}(T) + k \operatorname{Var}(E), where \operatorname{Var}(T) and \operatorname{Var}(E) are for a single subtest. Thus, \rho_{kX} = \frac{k^2 \operatorname{Var}(T)}{k^2 \operatorname{Var}(T) + k \operatorname{Var}(E)} = \frac{k \rho_{XX}}{1 + (k-1) \rho_{XX}} after substituting \rho_{XX} = \frac{\operatorname{Var}(T)}{\operatorname{Var}(T) + \operatorname{Var}(E)}. Reliability approaches 1 as k grows, assuming item homogeneity.
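The Spearman-Brown formula is easily tabulated. The sketch below (plain Python, with a hypothetical starting reliability of 0.70) shows predicted reliability for several length factors k, including k < 1 for a shortened test:

```python
# Illustrative sketch of the Spearman-Brown prophecy formula; the starting
# reliability of 0.70 is a hypothetical example value.
def spearman_brown(rho: float, k: float) -> float:
    """Predicted reliability of a test k times as long as one with reliability rho."""
    return k * rho / (1.0 + (k - 1.0) * rho)

rho = 0.70  # hypothetical reliability of the original test
for k in (0.5, 1, 2, 3, 5):
    print(f"k = {k}: predicted reliability = {spearman_brown(rho, k):.3f}")
```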

Estimation Methods

The test-retest method estimates reliability by administering the same test to the same sample on two occasions separated by a short interval and computing the correlation between the scores from the two administrations. This approach is most appropriate for measuring stable traits or attributes where temporal changes are minimal, as it primarily captures consistency over time while minimizing practice effects through brief retesting intervals.

The parallel forms or alternate forms method involves constructing two equivalent versions of the test that sample the same content domain and administering both to the same participants, typically in counterbalanced order, to obtain the correlation between scores. It specifically targets content sampling error by demonstrating that different but parallel item sets yield consistent results, though developing truly equivalent forms can be resource-intensive.

Internal consistency methods assess reliability within a single test administration by examining the homogeneity of items. The split-half technique divides the test items into two comparable halves (e.g., odd-even), computes the correlation between half scores, and applies the Spearman-Brown prophecy formula to estimate the reliability for the full test length: r_{\text{full}} = \frac{2 r_{\text{half}}}{1 + r_{\text{half}}}. This correction accounts for the increased length of the full test, assuming tau-equivalence. For tests with dichotomous (0/1) items, the Kuder-Richardson Formula 20 (KR-20) provides an internal consistency estimate equivalent to the average of all possible split-half reliabilities: \text{KR-20} = \frac{k}{k-1} \left(1 - \frac{\sum_{i=1}^{k} p_i q_i}{\sigma_X^2}\right), where k is the number of items, p_i is the proportion of correct responses for item i, q_i = 1 - p_i, and \sigma_X^2 is the total test score variance. For polytomous items with more than two response categories, Cronbach's alpha generalizes this approach: \alpha = \frac{k}{k-1} \left(1 - \frac{\sum_{i=1}^{k} \sigma_{Y_i}^2}{\sigma_X^2}\right), where \sigma_{Y_i}^2 is the variance of scores on item i. KR-20 is the special case of alpha for dichotomous items, and alpha is widely used due to its applicability across item formats.

For tests involving subjective scoring, such as essay responses or behavioral observations, inter-rater reliability evaluates consistency across multiple raters and can be estimated using Cohen's kappa coefficient, which adjusts observed agreement for chance: \kappa = \frac{p_o - p_e}{1 - p_e}, where p_o is the observed agreement proportion and p_e is the expected agreement by chance. This serves as an extension of classical methods for rater-induced error.

Selecting an estimation method depends on the test's purpose, trait stability, and practical constraints: test-retest suits stable, non-cognitive traits with available retesting time; parallel forms is ideal for content-heavy assessments but requires form development resources; internal consistency methods like KR-20 or alpha are efficient for single administrations and homogeneous item sets; inter-rater approaches apply to subjective evaluations. Higher-resource methods like parallel forms often yield more comprehensive estimates but are less feasible for rapid assessments.
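The internal consistency formulas above translate directly into code. The following illustrative sketch (simulated dichotomous data, numpy assumed) implements both coefficient alpha and KR-20 using population (ddof = 0) variances, under which the two coincide exactly for 0/1 items:

```python
# Illustrative implementations of coefficient alpha and KR-20 on a small
# simulated persons-by-items matrix of 0/1 responses. Data are hypothetical.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha; items is a persons x items score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0)        # population (ddof=0) item variances
    total_var = items.sum(axis=1).var()  # variance of total scores
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

def kr20(items: np.ndarray) -> float:
    """KR-20 for dichotomous items; p_i * q_i equals the ddof=0 item variance."""
    k = items.shape[1]
    p = items.mean(axis=0)
    total_var = items.sum(axis=1).var()
    return (k / (k - 1)) * (1.0 - (p * (1.0 - p)).sum() / total_var)

rng = np.random.default_rng(2)
ability = rng.normal(size=500)
# hypothetical data: 10 dichotomous items driven by a common ability factor
responses = (ability[:, None] + rng.normal(size=(500, 10)) > 0.0).astype(float)
print(f"alpha = {cronbach_alpha(responses):.3f}")  # equal to KR-20 for 0/1 data
print(f"KR-20 = {kr20(responses):.3f}")
```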

Item Analysis

Item Difficulty

In classical test theory, item difficulty, denoted as p, quantifies the relative ease or challenge of a test item and is calculated as the proportion of examinees who respond correctly to it. Specifically, p = \frac{R}{N}, where R is the number of examinees answering the item correctly and N is the total sample size. This statistic, also known as the item facility index, ranges from 0 (no one correct) to 1 (everyone correct) and is inherently sample-dependent, reflecting the item's performance within a particular group of examinees.

The interpretation of p emphasizes its role in assessing an item's informativeness. A p value near 0.5 is optimal because it maximizes the item's variance in responses, thereby offering the greatest potential for distinguishing between examinees of varying ability levels. In contrast, extreme values close to 0 (very difficult) or 1 (very easy) signal problematic items that provide minimal differentiation, as nearly all or no examinees succeed, reducing the item's contribution to meaningful measurement.

During test construction and revision, item difficulty guides the selection and assembly of items to create a balanced instrument. Practitioners aim to include items with p values typically between 0.3 and 0.7, ensuring a distribution of challenges that aligns with the intended ability range of the target population and promotes adequate score spread. This targeted range helps avoid floor or ceiling effects that could compress scores and limit the test's discriminating power.

For multiple-choice items, random guessing can inflate the observed p, necessitating an adjustment to better estimate the item's inherent challenge. The corrected difficulty index is given by p' = \frac{p - \frac{1}{c}}{1 - \frac{1}{c}}, where c is the number of response options; this rescales the proportion correct by subtracting the chance-level guessing rate (1/c) and normalizing to the effective response range above chance. Such adjustments refine item evaluation, particularly for formats with four or five options where guessing impacts are notable.
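A short sketch (hypothetical responses, numpy assumed) shows both the raw difficulty p and the guessing-corrected index p' for a four-option item:

```python
# Illustrative sketch: classical item difficulty p and the guessing-corrected
# index p' for a multiple-choice item with c options. Data are hypothetical.
import numpy as np

def difficulty(correct: np.ndarray) -> float:
    """p = proportion of examinees answering the item correctly."""
    return float(np.mean(correct))

def corrected_difficulty(p: float, c: int) -> float:
    """Rescale p above the chance rate 1/c: p' = (p - 1/c) / (1 - 1/c)."""
    chance = 1.0 / c
    return (p - chance) / (1.0 - chance)

responses = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])  # hypothetical 0/1 scores
p = difficulty(responses)
print(f"p = {p:.2f}, corrected for guessing (4 options): {corrected_difficulty(p, 4):.2f}")
```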

Item Discrimination

Item discrimination in classical test theory refers to an item's capacity to differentiate between high-performing and low-performing examinees, serving as a key metric in item analysis to evaluate test quality. This is rooted in the idea that effective items should correlate positively with overall test performance, thereby contributing to the test's ability to measure the intended construct. Typically assessed after determining item difficulty, which sets a baseline for how much discrimination an item can exhibit, item discrimination helps identify items that align well with the total score while flagging those that do not.

The primary quantitative measure of item discrimination is the item-total correlation, often computed in corrected (point-biserial) form as r_{it} = \frac{\operatorname{Cov}(X_i, X_{-i})}{\sigma_{X_i} \sigma_{X_{-i}}}, where X_i represents the score on the individual item (typically dichotomous, 0 or 1), X_{-i} is the total score excluding that item, \operatorname{Cov} denotes covariance, and \sigma indicates standard deviation. This quantifies the extent to which performance on the item predicts performance on the rest of the test, with values ranging from -1 to +1. A value greater than 0.3 generally indicates good discrimination, suggesting the item effectively distinguishes ability levels; values between 0.2 and 0.3 are acceptable but may warrant review, while those below 0.2 signal poor discrimination. Negative values are problematic, often indicating flawed items such as those requiring reverse scoring or containing ambiguities that confuse high performers.

An alternative approach, the upper-lower group method, provides a simpler, non-parametric estimate of discrimination by dividing examinees into upper and lower groups based on total scores (commonly the top and bottom 27% to maximize group differences and stability). The discrimination index D is then calculated as D = \frac{P_U - P_L}{N_g}, where P_U is the number correct in the upper group, P_L is the number correct in the lower group, and N_g is the size of the larger group; this yields the difference in proportions correct between groups. Positive D values above 0.3 denote effective items, with higher values (e.g., 0.4 or more) ideal for strong differentiation; negative or very low values (below 0.2) highlight items that fail to separate ability levels.

In test revision, item discrimination indices guide decisions to retain, modify, or discard items, as low-discrimination items can undermine overall test reliability by adding noise rather than signal to the total score. For instance, removing items with r_{it} < 0.2 or D < 0.2 often enhances the test's internal consistency without substantially altering its length or coverage. This process ensures that the final test comprises items that robustly contribute to measuring the underlying construct.
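Both indices are easy to compute from a score matrix. The sketch below is illustrative (simulated data, numpy assumed): it implements the corrected item-total correlation and an upper-lower D using equal-sized 27% groups, a common variant of the formula above.

```python
# Illustrative sketch: corrected item-total correlation r_it and the
# upper-lower discrimination index D for one item. Data are simulated.
import numpy as np

def item_total_correlation(scores: np.ndarray, item: int) -> float:
    """Correlate item scores with the total score excluding that item."""
    rest = np.delete(scores, item, axis=1).sum(axis=1)
    return float(np.corrcoef(scores[:, item], rest)[0, 1])

def discrimination_index(scores: np.ndarray, item: int, frac: float = 0.27) -> float:
    """D = proportion correct in the top group minus that in the bottom group."""
    totals = scores.sum(axis=1)
    order = np.argsort(totals)
    n = max(1, int(frac * len(totals)))
    lower, upper = order[:n], order[-n:]
    return float(scores[upper, item].mean() - scores[lower, item].mean())

rng = np.random.default_rng(3)
ability = rng.normal(size=400)
# hypothetical data: 8 dichotomous items driven by a common ability factor
scores = (ability[:, None] + rng.normal(size=(400, 8)) > 0.0).astype(float)
print(f"r_it = {item_total_correlation(scores, 0):.3f}, "
      f"D = {discrimination_index(scores, 0):.3f}")
```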

Limitations and Criticisms

Violations of Assumptions

Classical test theory (CTT) relies on several core assumptions, including the independence of errors, constancy of true scores across parallel measures, unbiased errors with zero mean, and homoscedastic error variance, to ensure reliable and valid inferences from observed scores. Violations of these assumptions can distort reliability estimates, validity coefficients, and score interpretations, compromising the theory's applicability in real-world testing scenarios.

Correlated errors occur when measurement errors across items or test forms are not independent, often due to shared test conditions, item interactions, or administration proximity, such as items drawing on the same reading passage. This violates CTT's assumption of uncorrelated errors, leading to inflated or deflated reliability estimates; for instance, positive error correlations can cause observed correlations between tests to exceed true correlations, undermining the Spearman-Brown prophecy formula and correction-for-attenuation procedures. In adaptive testing, where item selection depends on prior responses, such dependencies further introduce correlated errors, reducing the accuracy of reliability estimates.

Non-constant true scores arise when the underlying trait or ability is unstable over time or contexts, invalidating the assumption of invariant true scores across parallel forms. Trait instability, such as fluctuations in ability or skill due to intervening experiences, means that observed score differences may reflect true changes rather than errors alone, distorting parallel-forms reliability and longitudinal comparisons. This violation is particularly problematic in repeated testing, where CTT treats score variances as solely due to errors, potentially misattributing variability to measurement imprecision.

Systematic bias in errors manifests when the expected error value is not zero, introducing consistent over- or underestimation tied to factors like cultural differences or instructional ambiguities. For example, cultural biases in item wording can lead to systematically lower scores for certain groups, violating the unbiased-error assumption and biasing true score estimates across populations. Instructional biases, such as unclear directions favoring familiarized test-takers, similarly produce non-random errors, eroding the separation of true scores from systematic distortions in CTT models.

Heteroscedasticity refers to non-constant error variance across ability levels, breaching CTT's homoscedasticity assumption and resulting in unequal measurement precision. Low- and high-ability examinees often exhibit higher error variance, forming a U-shaped pattern, leading to less reliable scores at ability extremes compared to mid-range performers. This violation implies that CTT-derived standard errors of measurement overstate precision for extreme scores, affecting decision-making in educational or clinical assessments.

Empirical detection of these violations often involves residual analysis, where differences between observed and estimated true scores are examined for patterns indicating non-randomness, such as correlations or varying spreads. For non-parallelism in forms, violation indices assess discrepancies in means (via paired t-tests), variances (via Morgan-Pitman tests), and intercorrelations (via Wilks' lambda), with significant deviations signaling assumption breaches. These methods, such as Bradley-Blackwood tests for the combined equality of means and variances, provide robust power to identify non-parallel forms, guiding adjustments in test construction.
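One of the named checks, the Morgan-Pitman test, is straightforward to sketch: for paired scores it tests equality of variances via the correlation between person-level sums and differences. The implementation below is an illustrative version under normality assumptions, applied to simulated (hypothetical) forms; numpy and scipy are assumed.

```python
# Illustrative sketch of a Morgan-Pitman test for equal variances of two
# paired (supposedly parallel) forms: Cov(x + y, x - y) = Var(x) - Var(y),
# so a nonzero sum-difference correlation signals unequal variances.
import numpy as np
from scipy import stats

def morgan_pitman(x: np.ndarray, y: np.ndarray) -> tuple[float, float]:
    """t-test of H0: Var(x) = Var(y) for paired scores (normality assumed)."""
    s, d = x + y, x - y
    r = np.corrcoef(s, d)[0, 1]
    n = len(x)
    t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
    p = 2 * stats.t.sf(abs(t), df=n - 2)  # two-sided p-value
    return float(t), float(p)

rng = np.random.default_rng(4)
true = rng.normal(0, 10, 500)
form_a = true + rng.normal(0, 3, 500)  # hypothetical form, error SD 3
form_b = true + rng.normal(0, 5, 500)  # hypothetical non-parallel form, error SD 5
t, p = morgan_pitman(form_a, form_b)
print(f"Morgan-Pitman t = {t:.2f}, p = {p:.4f} (small p flags non-parallel variances)")
```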

Practical Shortcomings

One key practical shortcoming of classical test theory (CTT) is its heavy reliance on sample-specific data, which limits the generalizability of test statistics across diverse populations. In CTT, item difficulty (denoted as p, the proportion of examinees answering correctly) and item-total correlations (r_{it}, measuring discrimination) are calculated from the particular sample tested, meaning these parameters can vary significantly if the test is administered to a different group with varying ability levels or demographics. This sample dependency complicates the reuse of item banks in new contexts, as recalibration is often required, increasing the workload in test development and potentially leading to biased interpretations when applied beyond the original validation sample. A simulation of this sample dependence appears after this section's remaining points.

CTT also faces challenges related to test length, particularly with shorter assessments, where reliability estimates tend to be unstable and to understate a test's true consistency. Short tests amplify the impact of random error on observed scores, resulting in lower reliability coefficients that fluctuate across administrations due to insufficient items to average out variability. To address this, practitioners often apply the Spearman-Brown prophecy formula to predict reliability for longer versions, which adjusts the reliability estimate as r_{kk} = \frac{k r_{xx}}{1 + (k-1) r_{xx}}, where k is the ratio of desired to current test length and r_{xx} is the original reliability. However, this extrapolation assumes the additional items are parallel in quality to the originals (a condition rarely met in practice) and can overestimate reliability if the test lacks homogeneous content, leading to misguided decisions on test design.

At the individual level, CTT offers limited precision for inferring personal ability, focusing instead on aggregate norms and total scores that mask person-specific patterns. Observed scores in CTT represent a simple sum or average, providing group-level comparisons but not tailored estimates of an individual's latent trait, as the model does not account for how responses vary by item characteristics. This aggregate approach is sufficient for norm-referenced decisions but falls short in contexts requiring diagnostic feedback, such as personalized learning or clinical assessments, where finer-grained ability profiling is needed without additional modeling.

Furthermore, CTT is inefficient for adaptive testing formats, as it presupposes fixed test forms administered uniformly to all examinees, disregarding adjustments based on response patterns. The sample-dependent nature of CTT parameters hinders dynamic item selection, in which ideal adaptive systems choose items to maximize information at an individual's estimated ability level; instead, CTT requires pre-calibrated parallel forms, which are resource-intensive to develop and validate. This rigidity reduces testing efficiency, particularly in computerized environments where shorter, tailored administrations could minimize respondent burden without compromising measurement quality.
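The sample dependence noted above can be demonstrated with a brief simulation; this is illustrative only, with a hypothetical latent difficulty and hypothetical ability distributions (numpy assumed). The same item appears much easier in a high-ability sample than in a low-ability one:

```python
# Illustrative sketch: the same item yields different classical difficulty
# values p in samples that differ in mean ability, showing the sample
# dependence of CTT item statistics. All parameters are hypothetical.
import numpy as np

rng = np.random.default_rng(5)
item_difficulty = 0.5  # hypothetical latent difficulty on the ability scale

def proportion_correct(mean_ability: float, n: int = 10_000) -> float:
    """Simulate p for a sample whose abilities center on mean_ability."""
    ability = rng.normal(mean_ability, 1.0, n)
    return float((ability + rng.normal(0.0, 1.0, n) > item_difficulty).mean())

print(f"p in a low-ability sample:  {proportion_correct(-1.0):.2f}")
print(f"p in a high-ability sample: {proportion_correct(+1.0):.2f}")
```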

Alternatives

Item Response Theory

Item response theory (IRT) represents a probabilistic framework in psychometrics that models the probability of a correct response to an item as a function of an individual's latent trait, denoted as θ, and item-specific parameters, such as difficulty (b) and discrimination (a). Unlike classical test theory, which relies on aggregate test scores, IRT posits that responses depend on the interaction between person and item characteristics, enabling more precise estimation of latent traits. This approach was laid out in seminal works, including those by Rasch (1960) and Lord and Novick (1968).

A prominent example is the one-parameter logistic model, also known as the Rasch model, which simplifies the framework by assuming equal discrimination across items (a = 1). The probability of a correct response, P(θ), is given by: P(\theta) = \frac{1}{1 + e^{-(\theta - b)}}. In this model, θ represents the person's ability on a logit scale, and b denotes the item's difficulty, the point where the probability of success is 50%. The Rasch model emphasizes specific objectivity, ensuring that item difficulties are independent of the sample tested.

Key advantages of IRT over classical test theory include item parameter invariance, where item characteristics remain stable across different samples, and enhanced ability estimation that accounts for individual response patterns rather than total scores. These properties mitigate the sample dependency inherent in classical methods, allowing for more generalizable inferences. IRT extends classical test theory by incorporating non-linear response functions, which better capture the varying difficulty of items for different ability levels.

In applications, IRT underpins computerized adaptive testing (CAT), where items are dynamically selected based on interim ability estimates to maximize measurement efficiency and precision. CAT reduces test length while maintaining reliability, as seen in large-scale assessments like the Graduate Record Examination, by administering only the items most informative for the examinee's θ. This efficiency stems from IRT's ability to pre-calibrate item banks and adapt in real time.
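The Rasch response function is simple enough to compute directly. The sketch below (hypothetical difficulty value, plain Python) evaluates P(θ) at several ability levels and confirms that the probability is 0.5 when θ = b:

```python
# Illustrative sketch of the one-parameter (Rasch) item response function:
# P(theta) = 1 / (1 + exp(-(theta - b))). The difficulty b is hypothetical.
import math

def rasch_probability(theta: float, b: float) -> float:
    """Probability of a correct response given ability theta and difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

b = 0.5  # hypothetical item difficulty, on the logit scale
for theta in (-2.0, 0.0, 0.5, 2.0):
    print(f"theta = {theta:+.1f}: P(correct) = {rasch_probability(theta, b):.3f}")
# At theta == b the probability is exactly 0.5, as stated in the text.
```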

Generalizability Theory

Generalizability theory (G-theory) extends the true score model of classical test theory by incorporating multiple sources of error, or facets, such as items, raters, occasions, and tasks, to provide a more comprehensive analysis of measurement dependability. Developed by Lee J. Cronbach and colleagues, G-theory employs an analysis of variance (ANOVA) framework to decompose the total observed score variance into components attributable to the person (p), the facets, and their interactions, thereby identifying and quantifying systematic and unsystematic sources of variation in complex assessments. This approach addresses the limitation of classical test theory's single error term by explicitly modeling interactions among facets, which are often ignored in simpler reliability estimates.

The core of G-theory begins with a generalizability study (G-study), which uses ANOVA to estimate variance components from empirical data collected under a specific measurement design. These components form the basis for computing the generalizability coefficient, denoted as \phi, which parallels the reliability coefficient in classical test theory but is tailored to the facets of interest. The formula is given by: \phi = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_\delta}, where \sigma^2_p represents the variance due to persons (the signal), and \sigma^2_\delta denotes the relevant error variance, which may include relative or absolute error depending on the decision context. Higher values of \phi indicate greater dependability across the universe of admissible observations defined by the facets.

Building on G-study results, decision studies (D-studies) apply the estimated variance components to forecast and optimize measurement designs for practical use. By simulating adjustments to facet levels, such as increasing the number of items or raters, D-studies predict how these changes would influence \phi, enabling cost-effective decisions that balance reliability and resources. For instance, in essay scoring, a D-study might determine the optimal number of prompts and raters to achieve a desired \phi while accounting for interactions between tasks and judges.

A primary advantage of G-theory lies in its capacity to manage multifaceted reliability in scenarios where multiple error sources interact, such as performance assessments involving subjective raters and varied prompts, where it quantifies the contribution of each facet to improve overall dependability. This multifaceted approach enhances classical test theory's framework by providing actionable insights into error reduction, making it particularly valuable for educational and psychological measurements requiring high-stakes decisions.
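As an illustrative D-study calculation (not from the source), the sketch below assumes a one-facet persons-by-items design with hypothetical variance components and uses the relative-error form of the coefficient, showing how dependability rises as items are added:

```python
# Illustrative D-study sketch for a one-facet (persons x items) design with
# hypothetical variance components: the generalizability coefficient grows
# as the residual error variance is averaged over more items.
def g_coefficient(var_p: float, var_residual: float, n_i: int) -> float:
    """Relative-error coefficient: person variance over person variance plus
    the residual error variance divided by the number of items n_i."""
    return var_p / (var_p + var_residual / n_i)

var_person, var_resid = 4.0, 6.0  # hypothetical G-study estimates
for n_items in (5, 10, 20, 40):
    print(f"n_i = {n_items}: phi = {g_coefficient(var_person, var_resid, n_items):.3f}")
```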
