
Achievement test

An achievement test is a standardized assessment designed to measure an individual's acquired knowledge, skills, or competencies in a specific subject or domain, such as mathematics, reading, or vocational training, reflecting what has been learned through instruction or experience rather than innate potential. In contrast to aptitude tests, which gauge potential for learning new material, achievement tests evaluate demonstrated mastery of previously taught content, often using norm-referenced formats to compare performance against a representative sample. Achievement tests have been integral to formal education since the early 20th century, coinciding with the rise of psychometrics and efforts to quantify learning outcomes amid expanding schooling. Pioneering multiple-choice formats, such as the 1915 Kansas Silent Reading Exam, enabled efficient, objective scoring across large populations, facilitating accountability in schools and influencing policies like No Child Left Behind. Today, they serve multiple purposes, including diagnosing instructional gaps, certifying competencies for graduation or licensure, and predicting future academic success, with formats ranging from criterion-referenced state exams to subject-specific batteries like the Iowa Tests of Basic Skills. Empirical evidence supports the reliability and validity of well-constructed achievement tests, which correlate strongly with real-world outcomes like grade point averages and job performance in trained domains, though their validity requires alignment with instructional content to avoid misalignment artifacts. Controversies persist over potential cultural or socioeconomic influences on scores, prompting debates on fairness, yet longitudinal studies demonstrate that gains in test performance track causally with improved teaching and resources, underscoring their role as objective metrics amid subjective alternatives. Critics' concerns about "teaching to the test" are countered by data showing that targeted preparation enhances substantive learning without diminishing broader skills.

Definition and Distinctions

Core Definition

An achievement test is a standardized assessment instrument designed to measure an individual's current level of knowledge, skills, or competencies attained through prior instruction, training, or experience in a specific subject area, such as mathematics, language arts, or vocational trades. These tests evaluate mastery of defined content domains, reflecting the outcomes of educational processes rather than inherent potential or general cognitive capacity. For instance, instruments like the Stanford Achievement Test or state-mandated exams assess proficiency against established curricula, providing data on what examinees have learned up to the point of testing. In contrast to aptitude tests, which predict future learning ability by gauging innate or developed capacities for acquiring new skills, achievement tests focus exclusively on demonstrated accomplishments from past exposure to material. This emphasis on acquired attainment enables achievement tests to serve as indicators of instructional efficacy and individual progress, though their validity depends on alignment between test content and the curriculum covered. Achievement tests may be norm-referenced, comparing performance to a norm group, or criterion-referenced, measuring against fixed performance standards, but both prioritize direct measurement of learning over predictive inference.

Differences from Aptitude and Intelligence Tests

Achievement tests evaluate knowledge and skills explicitly taught and mastered through formal or informal instruction, focusing on current proficiency in defined content areas such as mathematics or reading. In contrast, aptitude tests predict an individual's potential to acquire new skills or succeed in specific future tasks, often emphasizing innate or broadly developed abilities rather than prior instruction. Intelligence tests, which frequently serve as a form of general aptitude assessment, measure underlying cognitive capacities like reasoning, verbal comprehension, and perceptual speed, aiming to capture a broad factor of mental ability (g) independent of specific learning experiences. The conceptual boundary between these test types is not always rigid, as both achievement and aptitude measures assess developed abilities shaped by experience, with the primary distinction lying in the specificity of the knowledge domain: achievement tests reference precise curricular antecedents, whereas aptitude tests draw from vaguer, more generalized prior exposures. For instance, early achievement tests can function as aptitude indicators in the early grades, where limited prior learning makes them reflective of potential, but this utility diminishes in later schooling as domain-specific knowledge accumulates. Psychometric analyses reveal substantial overlap, with correlations between intelligence test scores and achievement outcomes typically ranging from 0.50 to 0.80, indicating that general cognitive ability strongly forecasts learned proficiency but does not equate to it. Causally, evidence from longitudinal studies supports intelligence as a driver of achievement rather than the reverse; psychometric g exerts a directional influence on subsequent academic gains, as individuals with higher g process and retain instructed material more efficiently, though targeted instruction can modulate outcomes within cognitive limits. Achievement tests are thus more amenable to improvement via direct preparation, such as curriculum-aligned study, whereas aptitude and intelligence scores show greater stability and resistance to short-term coaching, reflecting their emphasis on predictive capacity over accumulated facts. This distinction informs applications: achievement tests certify mastery for credentialing in educational systems, while aptitude and intelligence tests guide selection for roles demanding rapid skill acquisition, like military or vocational training.

Historical Development

Origins and Early Adoption (1900-1940s)

The development of achievement tests in the early 20th century emerged from efforts to apply scientific measurement to educational outcomes, particularly through the work of psychologist Edward L. Thorndike at Columbia University's Teachers College. Thorndike, influenced by his empirical approach to learning as measurable behavioral connections, advocated for quantitative assessment of specific knowledge and skills acquired through instruction, distinguishing these from innate abilities. His 1904 publication, An Introduction to the Theory of Mental and Social Measurements, laid foundational principles for scaling educational performance, emphasizing objective norms over subjective teacher judgments. In 1908, Clifford Stone, a doctoral student of Thorndike, created the first standardized achievement test focused on arithmetic reasoning, marking a shift toward uniform instruments that could compare performance across contexts. Thorndike himself developed a handwriting scale in 1910, providing graded exemplars for evaluation, while in the years through 1916, he and his graduate students established normative data for subjects including arithmetic, reading, and spelling, enabling reliable comparisons of learned proficiency. These early tests prioritized content validity by aligning items directly with curricular objectives, reflecting Thorndike's view that educational progress should be verifiable through aggregated trial-and-error learning metrics. By 1918, over 100 standardized achievement tests had proliferated, targeting core elementary and secondary subjects such as reading, arithmetic, and language arts, driven by Progressive-era reforms seeking efficiency in mass schooling amid rising enrollment. Adoption accelerated in U.S. public schools during the 1920s, where tests facilitated student grouping by ability, teacher evaluation, and curriculum adjustment, with instruments like the 1914-1915 Kansas Silent Reading Test introducing multiple-choice formats for scalable administration. This era's tests, often norm-referenced, quantified relative standing rather than absolute mastery, supporting administrative decisions in expanding urban systems. Into the 1930s and 1940s, mechanical scoring innovations, such as IBM's Markograph machines acquired from inventor Reynold Johnson, enabled rapid processing of large-scale tests, boosting adoption despite economic constraints of the Great Depression. Programs like the Iowa testing program, begun in 1929 initially for student selection, expanded nationally, with the Iowa Tests of Basic Skills covering grades 1-9 in reading, language, and arithmetic by the late 1930s. During World War II, while military tests dominated, civilian testing persisted in schools to maintain educational standards amid wartime disruptions, underscoring their role in tracking instructional efficacy.

Expansion and Standardization (1950s-1980s)

The period following World War II saw significant expansion of standardized achievement testing in U.S. schools, driven by rapid enrollment growth from the baby boom and a need for efficient assessment in larger systems. The Soviet Union's Sputnik launch in 1957 intensified fears of educational lag in science and mathematics, prompting the National Defense Education Act of 1958, which allocated federal funds for testing to identify and nurture talent in strategic subjects. This legislation accelerated the adoption of norm-referenced achievement batteries, such as revisions to the Stanford Achievement Test (originally developed in 1923) and the Iowa Tests of Basic Skills (ITBS, formalized in 1935), which by the 1950s provided national norms for comparing student performance across grades and districts. Test vendor revenues reflected this growth, rising from approximately $35 million in 1960 to higher levels by the late 1970s as states and localities integrated testing into routine evaluation. The Elementary and Secondary Education Act (ESEA) of 1965 marked a pivotal federal push for standardization, requiring objective achievement measures to assess Title I programs aiding low-income students, thereby embedding standardized tests in accountability frameworks at the local level. Complementing this, the National Assessment of Educational Progress (NAEP), initiated in 1969, established a national sampling approach using matrix sampling to gauge trends in reading, mathematics, and other subjects without testing every student, yielding the first comprehensive data on achievement disparities by 1971. Psychometric advancements, guided by the American Psychological Association's Standards for Educational and Psychological Testing (first published in 1954 and revised periodically), emphasized reliability (often with coefficients above 0.95 for major tests) and validity tied to curriculum content, enabling broader interstate comparisons. In the 1970s and early 1980s, concerns over declining basic skills fueled the minimum competency testing movement, with states like Oregon implementing graduation-linked assessments in 1973 to verify basic skills in reading, writing, and computation, a model adopted by over half of states by 1980. This extended testing to elementary grades, as evidenced by widespread use of criterion-referenced elements in tests like the California Achievement Test, alongside norm-referenced staples. While civil rights-era litigation, such as Larry P. v. Riles (1979), scrutinized aptitude tests for racial bias—prompting moratoriums on certain IQ-based placements—achievement tests faced less restriction due to their alignment with taught material, though academic sources often amplified fairness critiques amid broader institutional skepticism toward objective metrics. By the mid-1980s, annual testing affected millions, solidifying standardized achievement assessments as tools for evaluation despite ongoing debates over instructional narrowing.

Reforms and High-Stakes Era (1990s-Present)

In the 1990s, U.S. education policy shifted toward standards-based reform, emphasizing accountability through achievement tests aligned with specific learning goals. This era built on earlier critiques like the 1983 A Nation at Risk report, prompting states to implement rigorous standards and high-stakes assessments to measure student progress and school performance. For instance, Massachusetts' 1993 Education Reform Act combined elevated standards with increased funding, resulting in substantial gains in student achievement on national metrics. Globally, similar systemic reforms emerged, driven by aims to enhance competitiveness and equity via standardized evaluations. The passage of the No Child Left Behind Act (NCLB), signed into law on January 8, 2002, marked a pivotal escalation in test-based accountability. NCLB mandated annual standardized achievement tests in reading and mathematics for grades 3–8 and once in high school, alongside science testing in specified grades, with results used to calculate Adequate Yearly Progress (AYP) for schools and districts. Failure to meet AYP thresholds triggered interventions, including potential staff replacement or state takeover, while tying federal funding to compliance. Proponents credited NCLB with raising test scores and narrowing racial achievement gaps, as national data showed improvements in fourth- and eighth-grade reading and math proficiency from 2003 to 2007. However, empirical analyses revealed mixed causal impacts: while state test scores in math rose under accountability pressure, independent audit tests indicated declines in actual math and reading proficiency, particularly for Black students in high-minority schools, suggesting inflated scores from teaching to the test rather than deeper learning. The 2010 adoption of the Common Core State Standards by 45 states further reformed achievement testing by establishing uniform benchmarks in English language arts and mathematics, prompting the development of computer-adaptive assessments like the PARCC and Smarter Balanced systems. These tests shifted focus toward higher-order skills, such as critical thinking and evidence-based reasoning, moving beyond rote memorization. By 2015, however, adoption waned amid political backlash, with several states revising or abandoning Common Core-aligned tests due to concerns over federal overreach and implementation costs. Evaluations indicated that while the standards aimed to boost college readiness, achievement gaps persisted, with Common Core's emphasis on equity not fully closing disparities in outcomes. The Every Student Succeeds Act (ESSA), signed into law on December 10, 2015, supplanted NCLB by preserving annual testing requirements but granting states greater flexibility in designing accountability systems and reducing the emphasis on single-test outcomes for high-stakes decisions like school closures. ESSA prohibited federal mandates for teacher evaluations based solely on test scores and encouraged multiple measures, including student growth metrics and non-test factors like chronic absenteeism. Implementation data through 2020 showed varied state responses, with some reducing test volume—capping federally required assessments at 2% of instructional time—while maintaining focus on underserved subgroups. Critics among accountability advocates argued this diluted incentives for improvement, yet ESSA's framework persisted amid ongoing debates over testing's role, evidenced by rising opt-out movements peaking at over 600,000 students in 2015.
Ongoing reforms continue to grapple with balancing measurement precision against instructional distortion, with recent studies underscoring that high-stakes systems yield modest gains in targeted subjects but risk curriculum narrowing.

Types and Formats

Norm-Referenced Achievement Tests

Norm-referenced achievement tests (NRTs) assess students' acquired knowledge and skills by comparing their performance to that of a representative norm group, typically a large sample of peers who have taken the same test under standardized conditions. Scores are reported in relative terms, such as percentiles, stanines, or grade-equivalent scores, indicating how a test-taker ranks against the norm group rather than absolute proficiency. This approach originated in early 20th-century psychometrics to enable efficient ranking for educational placement and selection, with norming samples stratified by factors like age, grade, and demographics to ensure representativeness. The design of NRTs emphasizes item difficulty calibrated to produce a spread of scores across the group, often following a normal distribution where approximately 50% score at or below the 50th percentile. Achievement-focused NRTs cover specific curricular domains, such as reading, mathematics, and language, measuring factual knowledge, problem-solving, and application skills developed through instruction. Unlike criterion-referenced tests, which gauge mastery against fixed standards, NRTs prioritize differentiation among test-takers, making them suitable for identifying relative strengths and weaknesses in large populations. Prominent examples include the Iowa Tests of Basic Skills (ITBS), initially released in 1935 and periodically renormed (e.g., 2011 edition based on a sample of over 1 million students), which evaluates core subjects from kindergarten through grade 12; the Stanford Achievement Test series, normed on national samples exceeding 100,000 students per cycle; and the California Achievement Tests (CAT), with forms like CAT E/F normed in the 1980s on diverse U.S. populations. These tests are administered under timed, proctored conditions to maintain comparability, with reliability coefficients often exceeding 0.90 for subtests via internal consistency and test-retest methods. In educational applications, NRTs facilitate comparative analysis for admissions screening, gifted identification, and remedial placement, as higher ranks (e.g., above the 90th percentile) signal outperformance relative to national norms. They support causal inferences about instructional effectiveness when aggregated across groups, though individual scores require cautious interpretation due to factors like measurement error or cultural biases in norming samples. Advantages include efficiency for large-scale screening and provision of benchmarks for policy decisions, such as allocating resources to underperforming districts. Limitations arise from their focus on relative standing, which does not directly indicate whether students have met predefined learning objectives, potentially masking widespread deficiencies if the group performs poorly overall. Critics argue NRTs incentivize teaching to the test's item types over deeper mastery, with validity evidence showing stronger correlations to future academic outcomes in competitive contexts but weaker alignment to specific curricula compared to criterion-referenced alternatives. Empirical studies confirm high reliability for ranking purposes, yet underscore the need for supplementary diagnostics to address absolute skill gaps.
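
The percentile-rank and stanine conversions described above can be illustrated with a short Python sketch. All data here are invented for illustration (a hypothetical norm group of 10,000 examinees on a 60-item test); the stanine cut points are the conventional cumulative percentages:

```python
import numpy as np

def percentile_rank(raw_score, norm_scores):
    """Percent of the norm group scoring below the given raw score."""
    norm_scores = np.asarray(norm_scores)
    return 100.0 * np.mean(norm_scores < raw_score)

def stanine(raw_score, norm_scores):
    """Map a raw score to a 1-9 stanine band via its percentile rank."""
    cuts = [4, 11, 23, 40, 60, 77, 89, 96]  # conventional stanine boundaries
    pr = percentile_rank(raw_score, norm_scores)
    return 1 + sum(pr >= c for c in cuts)

# Hypothetical norm group: 10,000 examinees on a 60-item test
rng = np.random.default_rng(0)
norm = np.clip(rng.normal(38, 8, 10_000).round(), 0, 60)
print(percentile_rank(47, norm))  # roughly the 87th percentile
print(stanine(47, norm))          # stanine 7 (above average)
```

In operational norming, the percentile table would be computed once from the stratified norming sample and published with the test, rather than recomputed from raw norm data as in this sketch.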

Criterion-Referenced Achievement Tests

Criterion-referenced achievement tests evaluate an individual's performance against a predefined set of standards or learning objectives, determining the degree to which specific knowledge, skills, or competencies have been mastered, irrespective of group norms. Unlike norm-referenced tests, which rank test-takers relative to peers, these assessments yield absolute measures, such as pass/fail classifications or proficiency levels (e.g., below basic, basic, proficient, advanced), often based on cut scores like 70-80% correct on domain-relevant items. This approach aligns with instructional goals by focusing on whether criteria—derived from curricular standards or behavioral objectives—are met, enabling targeted feedback for remediation or advancement. The concept emerged in the 1960s, rooted in programmed instruction and mastery learning paradigms, with Robert Glaser's work emphasizing measurement tied to instructional outcomes rather than comparative ranking. Early development addressed limitations of norm-referenced testing in evaluating individualized progress, particularly in competency-based education systems. Test construction involves delineating a content domain, generating items that sample it representatively, and establishing cut scores through methods like the Angoff procedure, where experts estimate the probability of minimally competent performance on each item. Examples in education include Advanced Placement examinations, which certify college-level mastery in subjects like calculus or biology, and state assessments like those under No Child Left Behind frameworks, where proficiency thresholds determine school accountability. Professional licensing tests, such as bar exams or medical board certifications, similarly apply criterion-referenced scoring to ensure minimum competence for practice. Psychometric evaluation prioritizes decision consistency for reliability—measuring agreement across administrations or forms on categorical outcomes like mastery/non-mastery—using indices such as the kappa coefficient, rather than traditional correlations suited to continuous norm-referenced scores. Validity focuses on content representativeness, ensuring items align with criteria via systematic domain specification, and consequential validity, assessing impacts like improved instruction from diagnostic results. Empirical studies indicate higher reliability in criterion-referenced formats for performance assessments, though challenges persist in standard-setting subjectivity and potential overemphasis on narrow skills if criteria lack empirical grounding in real-world demands. Proponents highlight utility for equitable evaluation in diverse populations, as scores reflect actual attainment without cohort variability inflating or deflating results, while detractors note risks of arbitrary thresholds leading to inconsistent proficiency inferences across contexts.
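
The Angoff standard-setting step and the kappa-based decision-consistency check mentioned above can be sketched in a few lines of Python; the judge ratings below are hypothetical, and this is a simplified illustration rather than an operational standard-setting procedure:

```python
import numpy as np

def angoff_cut_score(ratings):
    """ratings[j][i] = judge j's estimated probability that a minimally
    competent examinee answers item i correctly; the cut score is the
    expected number of items such an examinee would get right."""
    return np.asarray(ratings).mean(axis=0).sum()

def decision_consistency_kappa(form_a, form_b, cut):
    """Cohen's kappa for mastery/non-mastery classifications across two
    parallel forms or administrations scored against the same cut."""
    a = np.asarray(form_a) >= cut
    b = np.asarray(form_b) >= cut
    p_obs = np.mean(a == b)  # observed classification agreement
    p_chance = a.mean() * b.mean() + (1 - a.mean()) * (1 - b.mean())
    return (p_obs - p_chance) / (1 - p_chance)

# Three hypothetical judges rating five items
ratings = [[0.7, 0.6, 0.8, 0.5, 0.9],
           [0.6, 0.7, 0.7, 0.6, 0.8],
           [0.8, 0.6, 0.9, 0.5, 0.9]]
print(angoff_cut_score(ratings))  # about 3.5 of 5 items
```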

Standardized vs. Classroom-Based Tests

Standardized achievement tests are assessments developed by experts, administered under uniform conditions to large populations, and scored objectively using predetermined criteria or norms to enable comparisons across individuals, schools, or districts. These tests typically feature fixed question sets drawn from a validated item bank, ensuring consistency in difficulty and content coverage aligned with broad educational standards. In contrast, classroom-based achievement tests are constructed by individual teachers to evaluate specific learning objectives within a particular course or unit, often incorporating formats like essays, projects, or quizzes tailored to recent instruction. These tests prioritize alignment with immediate curricular goals over broad comparability, allowing for adaptations based on class demographics or pacing. A primary distinction lies in administration and scoring protocols. Standardized tests require controlled environments, such as timed sessions with proctors to minimize cheating and external variables, facilitating reliable aggregation of data for policy decisions like school funding or student placement. Scoring is automated or rubric-based with quality-control checks, yielding high test-retest consistency often exceeding 0.90 in well-designed instruments. Classroom-based tests, however, permit flexible timing and settings within the classroom, with teachers handling both administration and evaluation, which can incorporate subjective elements like partial credit for reasoning processes. This approach supports formative feedback loops, where results inform real-time instructional adjustments, but introduces variability; empirical studies indicate teacher-made tests can achieve reliability coefficients around 0.70-0.85 when properly constructed, though lower without statistical validation. Reliability and validity profiles differ due to design rigor. Standardized tests excel in psychometric stability, with content validity established through expert reviews and empirical norming on representative samples, enabling predictions of future academic performance with correlations up to 0.50-0.70 against criteria like GPA. Their criterion-referenced variants measure mastery against fixed benchmarks, while norm-referenced forms rank against peers, both minimizing teacher bias. Classroom-based tests, while potentially higher in ecological validity for domain-specific skills—such as applying concepts in novel contexts—often suffer from construct-irrelevant variance, like grading leniency, unless supplemented by item analysis. Longitudinal data from national assessment systems show teacher assessments matching standardized tests in predictive accuracy (approximately 60%) and stability across grades, suggesting comparable causal signals for underlying achievement when aggregated over multiple evaluations.
| Aspect | Standardized Tests | Classroom-Based Tests |
|---|---|---|
| Comparability | High; enables cross-group comparison | Low; context-specific |
| Objectivity | Strong; minimal scorer bias | Variable; prone to subjective judgment |
| Cost and Scalability | Efficient for large-scale use | Inexpensive but labor-intensive to develop |
| Depth of Assessment | Often multiple-choice; tests breadth | Flexible formats; can probe deeper reasoning |
| Feedback Timeliness | Delayed; summative focus | Immediate; supports formative use |
Empirical evidence underscores trade-offs in educational impact. Standardized tests provide benchmarks for systemic accountability, correlating with long-term outcomes like college completion in datasets from the 1990s onward, though critics note they may incentivize a narrow instructional focus. Classroom-based assessments foster formative feedback, with studies indicating they better capture growth in diverse classrooms, yet their inconsistent application across teachers can exacerbate inequities in evaluation standards. Hybrid approaches, combining both, have shown additive validity in intervention evaluations, where standardized metrics validate teacher judgments against external criteria.

Design and Psychometrics

Principles of Test Construction

The construction of achievement tests follows established psychometric and edumetric principles to ensure they measure acquired knowledge and skills in specific content domains rather than innate potential. Unlike aptitude tests, which prioritize statistical properties like item discrimination across broad ability levels, achievement tests emphasize content fidelity through edumetric approaches, such as direct sampling of instructional objectives to avoid conflating mastery with general cognitive traits. The process begins with a clear statement of test purpose, defining the construct (e.g., mathematics proficiency at grade 8 level) and intended inferences, such as evaluating instructional effectiveness or student readiness. Central to design is the development of test specifications, often termed a table of specifications or blueprint, which outlines the content categories, cognitive demands (e.g., recall, application, analysis per Bloom's taxonomy), and proportional item allocation to reflect instructional emphasis. For instance, a science achievement test might allocate 40% of items to life sciences, 30% to physical sciences, and 30% to earth sciences, weighted by learning objectives and ensuring representation across difficulty levels. This blueprint guides item writing, where developers create clear, unambiguous questions—typically multiple-choice for efficiency in large-scale testing—avoiding construct-irrelevant elements like cultural biases or excessive reading load. Items undergo expert review for alignment and potential subgroup bias, followed by empirical piloting on representative samples to compute statistics such as p-values (item difficulty, ideally 0.3-0.7) and discrimination indices (point-biserial correlation with total score, targeting >0.3). Reliability is established through internal consistency coefficients (e.g., >0.8 for high-stakes tests) and test-retest methods, with precision reported via standard errors of measurement tailored to score uses. Validity, particularly content validity, requires evidence that items adequately sample the domain, often via expert judgments or alignment studies; for achievement tests, this supersedes heavy reliance on predictive correlations to maintain focus on taught material. Fairness principles mandate differential item functioning (DIF) analyses to detect items performing differently across groups (e.g., gender, ethnicity) after controlling for overall ability, using methods like Mantel-Haenszel statistics, and incorporating universal design to minimize barriers without altering constructs. Final assembly involves selecting items to meet blueprint proportions, equating forms if multiple versions exist, and norming or criterion-setting on diverse populations to enable comparable scoring. Documentation of all steps, including revisions from pilot data, ensures transparency, as required by professional standards updated in 2014 by the American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. These principles, when rigorously applied, support defensible interpretations but demand ongoing validation against real-world outcomes, as incomplete content coverage can inflate scores unrelated to actual learning.
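
As a concrete illustration of the piloting statistics named above, the following self-contained Python sketch computes item difficulty (p-values) and corrected point-biserial discrimination from a simulated 0/1 response matrix; the response model and all parameters are invented for the example:

```python
import numpy as np

def item_analysis(responses):
    """responses: examinees x items matrix of 0/1 scores. Returns per-item
    difficulty (proportion correct) and corrected point-biserial
    discrimination (item score vs. total score excluding that item)."""
    X = np.asarray(responses, dtype=float)
    p_values = X.mean(axis=0)
    totals = X.sum(axis=1)
    disc = [np.corrcoef(X[:, i], totals - X[:, i])[0, 1]
            for i in range(X.shape[1])]
    return p_values, np.array(disc)

# Simulated pilot: 200 examinees, 4 items of increasing difficulty
rng = np.random.default_rng(1)
ability = rng.normal(size=(200, 1))
difficulty = np.array([-1.0, 0.0, 0.5, 1.5])
probs = 1 / (1 + np.exp(-(ability - difficulty)))  # simple logistic model
data = (rng.random((200, 4)) < probs).astype(int)

p, d = item_analysis(data)
print(p)  # flag items outside the ~0.3-0.7 difficulty band
print(d)  # flag discriminations below ~0.3
```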

Ensuring Reliability and Validity

Reliability in achievement tests refers to the consistency and stability of scores across repeated administrations or equivalent forms, minimizing measurement error to ensure scores reflect true proficiency rather than random variation. For standardized achievement tests, high reliability is essential due to their one-time use for high-stakes decisions, with coefficients typically exceeding 0.90 for professional tests. Internal consistency, assessed via Cronbach's alpha, measures how well items correlate within the test; values above 0.80 indicate good reliability, while those near 1.00 reflect high consistency in professionally developed instruments. To ensure reliability, test developers employ methods such as parallel-forms reliability, comparing scores on alternate test versions to detect inconsistencies from item differences, and test-retest reliability, correlating scores from the same test given at intervals to assess temporal stability. For achievement tests with subjective elements like essays, inter-rater reliability is evaluated through agreement coefficients, such as Cohen's kappa, to confirm consistent scoring across evaluators. Standardized administration procedures, including controlled conditions and trained proctors, further bolster reliability by reducing external variances like fatigue or distractions. Validity ensures that an achievement test accurately measures the intended knowledge or skills, with interpretations grounded in evidence rather than assumptions. Content validity, critical for achievement tests aligned to curricula, is established by expert panels verifying that items comprehensively sample the domain without extraneous material; for instance, mathematics achievement tests must cover specified standards like algebra and geometry proportionally. Criterion-related validity correlates test scores with external criteria, such as concurrent correlations with teacher grades or predictive correlations with future academic performance, with correlations above 0.50 often deemed substantial in educational contexts. Achieving validity involves iterative processes like pilot testing for item analysis, where difficulty and discrimination indices identify flawed questions, and ongoing validation studies accumulating evidence across populations. Construct validity, encompassing content and criterion aspects, requires multifaceted evidence, including factor analysis to confirm underlying structures, though academic sources note potential overreliance on statistical methods without causal scrutiny of instructional alignment. In the United States, bodies like the American Educational Research Association set standards mandating documented validity evidence for test use, emphasizing that no single metric suffices and biases in item selection can undermine claims if not empirically tested.
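
For reference, Cronbach's alpha follows directly from item and total-score variances; a minimal Python sketch follows (the five-examinee response matrix is a toy example, far too small for a real reliability study):

```python
import numpy as np

def cronbach_alpha(item_scores):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of totals),
    for an examinees x items score matrix."""
    X = np.asarray(item_scores, dtype=float)
    k = X.shape[1]
    item_vars = X.var(axis=0, ddof=1).sum()
    total_var = X.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Toy data: five examinees, four dichotomous items
X = [[1, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 1, 1],
     [0, 0, 1, 0],
     [1, 1, 0, 0]]
print(round(cronbach_alpha(X), 2))  # unstable at this tiny sample size
```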

Administration and Scoring

Test Administration Procedures

Test administration for achievement tests requires adherence to standardized protocols to ensure scores reflect acquired knowledge rather than extraneous factors, as outlined in professional guidelines. Procedures emphasize uniformity in instructions, timing, and environmental conditions to support reliability and comparability of results across administrations. Test developers specify these conditions in manuals, with variations permitted only under justified circumstances such as accommodations for disabilities. Administrators and proctors must undergo training on test-specific procedures, security measures, and handling of irregularities to minimize construct-irrelevant variance. Preparation includes scheduling sessions, assigning trained personnel, securing materials in locked storage, and preparing environments with adequate spacing (e.g., seats at least 3 feet apart, facing the same direction), lighting, and minimal distractions like noise. Instructions are delivered verbatim from official scripts, with test takers informed of rules prohibiting unauthorized aids such as notes or electronic devices. Timing is strictly enforced using synchronized clocks or software, with breaks scheduled as specified (e.g., 10-minute breaks between sections where the test manual requires them). Security protocols involve continuous monitoring to detect cheating or disruptions, secure distribution and collection of materials (e.g., answer sheets or digital submissions), and immediate reporting of violations through designated tools. For online or computer-based formats, proctors verify device compatibility, provide access codes, and ensure no unauthorized software interferes. Post-administration, materials are checked in, irregularities documented, and data forensics may analyze patterns for potential invalidations. Accommodations, such as extended time or alternative formats, are provided to eligible test takers based on documented needs (e.g., via IEPs), with procedures ensuring they address barriers without altering the construct measured; their use must be reported to maintain score validity. Test users evaluate and document the impact of any nonstandard conditions on interpretations.

Scoring Methods and Interpretation Frameworks

Achievement tests typically begin with the computation of raw scores, which represent the number of items answered correctly out of the total possible, providing a straightforward count of demonstrated knowledge without adjustment for difficulty variations across test forms. These raw scores serve as the foundation for deriving more interpretable metrics, such as scaled scores, which transform raw totals into a standardized scale—often ranging from 0 to 500 or similar intervals—to equate performance across different test versions, grades, or administrations while accounting for item difficulty through methods like item response theory (IRT) or equating procedures. Derived scores facilitate comparative interpretation via norm-referenced frameworks, where an individual's performance is benchmarked against a representative group, yielding metrics like percentile ranks (indicating the percentage of the norm group scoring below the test-taker, e.g., the 75th percentile signifies outperforming 75% of peers) or stanines (a 1-9 scale grouping scores into nine bands, with 5-6 denoting average performance). Standard scores, such as z-scores (mean of 0, standard deviation of 1) or T-scores (mean of 50, standard deviation of 10), further enable statistical comparisons by expressing deviation from the norm group's mean, supporting analyses of relative standing in achievement domains like mathematics or reading. In contrast, criterion-referenced interpretation evaluates mastery against predefined standards or cut scores, classifying performance into categories like proficient or basic via pass/fail thresholds derived from content experts or empirical methods (e.g., Angoff or bookmarking procedures), independent of peer comparisons. These frameworks guide educational decisions, such as identifying instructional gaps or eligibility for interventions, but require caution against overinterpretation; for instance, grade-equivalent scores (e.g., a 4th-grader scoring at a "6.2" level) imply mid-6th-grade performance yet can mislead by assuming linear progression and ignoring variability within grades. Reliability estimates, like internal consistency coefficients often exceeding 0.80 for subtests, underpin score stability, while validity evidence—such as correlations with future academic outcomes—validates interpretive inferences, emphasizing empirical alignment over unsubstantiated assumptions. Multiple scores are ideally reported together for a multifaceted view, as no single metric captures full nuance.
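
The derived-score transformations above reduce to simple arithmetic once the norm group's mean and standard deviation are known; in the Python sketch below the scale constants (250 + 50z) are illustrative assumptions, not the formula of any particular published test:

```python
import numpy as np
from scipy import stats

def derived_scores(raw, norm_mean, norm_sd):
    """Convert a raw score to common norm-referenced metrics, assuming an
    approximately normal norm-group distribution."""
    z = (raw - norm_mean) / norm_sd   # z-score: mean 0, SD 1
    t = 50 + 10 * z                   # T-score: mean 50, SD 10
    scaled = 250 + 50 * z             # illustrative 0-500-style scale
    pct = 100 * stats.norm.cdf(z)     # percent of norm group scoring below
    return {"z": z, "T": t, "scaled": scaled, "percentile": pct}

print(derived_scores(raw=47, norm_mean=38, norm_sd=8))
# {'z': 1.125, 'T': 61.25, 'scaled': 306.25, 'percentile': ~87}
```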

Applications

Role in K-12 Education

Achievement tests serve as standardized measures of student knowledge and skills in core academic subjects, enabling educators and policymakers to assess mastery of curriculum standards in K-12 settings. In the United States, they are administered annually in public schools under federal requirements, such as those outlined in the Every Student Succeeds Act (ESSA) of 2015, which mandates assessments in reading and mathematics for grades 3-8 and once in high school, alongside science in specified grades. These tests, often criterion-referenced to gauge proficiency against state benchmarks, support school-level accountability by generating data for performance ratings, identifying low-achieving schools eligible for interventions, and linking outcomes to funding decisions. At the district and state levels, achievement test results inform educational policy, resource distribution, and systemic reforms by highlighting disparities in student outcomes across demographics and regions. For instance, programs like California's Standardized Testing and Reporting (STAR), implemented from 1998 to 2014, required tests in grades 2-11 to track progress toward state content standards, influencing subsequent systems like the Smarter Balanced assessments under the Common Core. Nationally, the National Assessment of Educational Progress (NAEP), often called the "Nation's Report Card," provides periodic benchmarks since 1969 to compare state and national trends without high-stakes consequences for individual schools. For individual students, these tests facilitate diagnostic evaluation, such as identifying learning disabilities through discrepancy models comparing achievement to ability, and guide instructional decisions like grade promotion or remedial support. They also contribute to teacher evaluations in many states, where student growth on tests factors into performance metrics, though implementation varies. Empirical analyses indicate that consistent use of such tests correlates with long-term academic outcomes, as meta-analyses show moderate rank-order stability in scores from elementary through secondary grades. Overall, achievement tests underpin merit-based progression in K-12 by providing quantifiable evidence of acquired competencies, distinct from subjective assessments.

Uses in Higher Education and Employment

Achievement tests are employed in higher education primarily for admissions, placement, and credit evaluation to gauge students' mastery of secondary-level content and predict postsecondary performance. Standardized exams such as the SAT and ACT, which incorporate achievement components assessing learned skills in reading, writing, and mathematics, serve as benchmarks for college readiness, with meta-analyses showing they predict first-year college GPA with correlations around 0.3 to 0.5, often improving when combined with high school grades. Historically, early college entrance exams like the College Boards focused explicitly on subject mastery, a principle that persists in tests like Advanced Placement (AP) exams, where scores of 3 or higher on a 1-5 scale can earn course credit at over 500 U.S. institutions as of 2023, enabling advanced standing based on demonstrated proficiency. Placement tests in subjects such as mathematics and foreign languages further utilize achievement measures to assign students to appropriate courses, reducing remediation needs; for instance, a 2022 study across multiple universities found such tests correlated with course success rates exceeding 70% when aligned with content standards. In employment contexts, achievement tests evaluate candidates' acquired job-specific knowledge and skills, distinguishing them from aptitude measures by focusing on prior learning rather than innate potential. Job knowledge tests, as outlined by the U.S. Office of Personnel Management, assess expertise in areas like accounting principles or contract law through targeted questions, with validity coefficients for job performance often ranging from 0.2 to 0.4 in professional roles. These are commonly used in hiring for technical positions, where employers administer simulations or written exams to verify competencies; for example, software firms may test coding proficiency via practical problems drawn from real workflows. Certification exams, such as the Certified Public Accountant (CPA) examination administered by the AICPA since 1896 and updated annually, require passing rates around 50% and serve as gatekeepers for licensure, ensuring practitioners meet standardized thresholds backed by empirical validation against on-the-job outcomes. The EEOC guidelines permit such tests provided they demonstrate job-relatedness and avoid disparate impact without business necessity, with longitudinal data indicating certified professionals exhibit 15-20% higher productivity in regulated fields.

International Contexts and Comparisons

Major international assessments of student achievement, including the OECD's Programme for International Student Assessment (PISA), the International Association for the Evaluation of Educational Achievement's (IEA) Trends in International Mathematics and Science Study (TIMSS), and Progress in International Reading Literacy Study (PIRLS), provide standardized measures of knowledge and skills in reading, mathematics, and science across participating countries every few years. These tests focus on curriculum-aligned competencies, with PISA emphasizing real-world application among 15-year-olds, TIMSS targeting 4th- and 8th-grade content mastery in math and science, and PIRLS assessing 4th-grade reading proficiency. Results from these assessments inform national education policies, highlighting systemic strengths and weaknesses; for example, high-performing nations like Singapore integrate frequent achievement testing into their curricula to enforce accountability and skill development. In the 2022 PISA cycle, which tested over 690,000 students from 81 countries and economies in mathematics, reading, and science, East Asian systems dominated top rankings, reflecting rigorous instructional focus on tested domains. Singapore achieved the highest score of 559 points (versus the OECD average of 472), followed by Macao (China) at 535, Chinese Taipei at 533, Japan at 533, and South Korea at 523. The United States scored 465 in mathematics, below the OECD average and ranking 34th overall, with different patterns in science (499 vs. 485) and reading (504 vs. 476). These outcomes correlate with system designs: top performers employ centralized curricula, extended instructional time, and high-stakes national exams that prioritize factual recall and problem-solving, as opposed to systems in lower-ranked Western nations where decentralized approaches and reduced emphasis on rote mastery prevail. TIMSS 2023, assessing 4th- and 8th-grade students from 64 countries in mathematics and science, reinforced these patterns, with Singapore again leading: 607 in 4th-grade mathematics (up from 595 in 2019) and topping scales internationally. Other high achievers were concentrated in East Asia, while U.S. 8th-grade scores declined to 539 from 2019 levels, placing it mid-tier among participants and evidencing persistent challenges in foundational skills post-pandemic. PIRLS 2021, conducted amid disruptions, showed Singapore at 587 in reading (international average 500), with stark declines in the lowest-performing countries (scores near 300), underscoring how test-oriented systems mitigate learning losses through structured recovery.
| Assessment | Top Performers (Example Scores) | OECD/U.S. Comparison |
|---|---|---|
| PISA 2022 Math | Singapore (559), Macao (535), Chinese Taipei/Japan (533) | OECD avg. 472; U.S. 465 |
| TIMSS 2023 4th-Grade Math | Singapore (607), other East Asian systems (~590) | International avg. ~500; U.S. mid-tier |
| PIRLS 2021 Reading | Singapore (587), Hong Kong (573) | International avg. 500; U.S. 556 (pre-COVID trend) |
Cross-national data indicate that achievement test scores predict long-term economic productivity, with score variations explaining up to two-thirds of differences in national GDP growth; East Asian success stems from policies embedding tests in teacher training and student progression, contrasting with more selective implementation in Western systems. Other countries outperform peers through balanced national assessments informing reforms, while broader adoption in developing nations reveals gaps tied to implementation capacity rather than test design. These comparisons, drawn from verified participant samples exceeding 500,000 students per cycle, underscore achievement tests' role in evidencing causal links between instructional rigor and outcomes, though critics in lower-performing systems question validity amid socioeconomic confounders—claims rebutted by controls in IEA/OECD analyses showing policy levers like instructional volume and teacher content knowledge as key drivers.

Empirical Evidence

Predictive Validity for Academic and Life Outcomes

Achievement tests, which assess acquired knowledge and skills in specific domains such as mathematics, reading, and science, demonstrate moderate to strong predictive validity for subsequent academic performance. For instance, scores on the SAT, a widely used achievement test for college admissions, correlate with first-year grade-point average (GPA) at approximately r = 0.35 to 0.48, with the correlation strengthening when combined with high school GPA to explain up to 25% of variance in outcomes. Similarly, meta-analyses of graduate-level achievement tests like the Graduate Record Examination (GRE) show correlations of r = 0.31 with graduate GPA and r = 0.34 with degree completion, outperforming undergraduate GPA alone in some contexts. At selective institutions, such as Ivy-Plus colleges, standardized test scores predict first-year GPA more reliably than high school GPA, with a 400-point SAT difference (e.g., 1600 vs. 1200) associated with a 0.43-point higher GPA on a 4.0 scale. Longitudinal data further affirm this validity for broader academic milestones. Middle-school standardized achievement test scores in math and reading predict high school graduation rates, college enrollment, and bachelor's degree attainment, with higher scores linked to a 10-20% increased likelihood of postsecondary enrollment. State-mandated achievement tests, such as those aligned with No Child Left Behind standards, forecast college readiness, where a one-standard-deviation increase in 8th-grade scores correlates with higher enrollment and completion rates. These associations hold across diverse samples, though predictive strength varies by test specificity and student demographics, with math achievement often showing stronger links to STEM-related academic trajectories. Beyond academia, achievement test scores exhibit predictive validity for life outcomes, particularly socioeconomic attainment and occupational success, largely through their overlap with general cognitive ability. A one-standard-deviation gain in 8th-grade math achievement is associated with an 8% increase in adult earnings, alongside reductions in reliance on public assistance. Longitudinal analyses from cohorts like the National Longitudinal Survey reveal that high school achievement test performance predicts earnings at age 33 (14% premium per SD) and age 50 (18% premium), independent of family background. For employment, scores on achievement-oriented assessments correlate with job performance at r ≈ 0.5, comparable to general cognitive ability measures, as they capture crystallized knowledge relevant to task mastery and adaptability. Studies of credentialing programs like the GED, which uses achievement testing, confirm that test passers initially show labor market gains akin to high school graduates, though sustained success depends on underlying skills reflected in the scores. These predictive patterns underscore the causal role of domain-specific knowledge and cognitive proficiency in driving outcomes, with achievement tests serving as proxies for skills that compound over time. Evidence from international panels, such as PISA-linked studies, extends this to global contexts, where early achievement predicts adult income and employment with similar effect sizes. However, validity coefficients attenuate over longer intervals or when non-cognitive factors like motivation intervene, emphasizing the tests' strength in near-term forecasts.

Effects on Educational Accountability and Student Performance

Achievement tests have been integrated into educational accountability systems, such as the U.S. No Child Left Behind Act of 2001, which mandated annual standardized testing in reading and mathematics for grades 3-8 and linked school performance to federal funding, sanctions, and public reporting. This framework aimed to enforce consequences for underperformance, including corrective actions like staff replacement or state takeover for persistently failing schools. Empirical analyses indicate that such accountability pressures generated targeted improvements in student test scores, particularly in mathematics for elementary students from disadvantaged backgrounds, with state-level analyses showing statistically significant gains post-implementation. Multiple studies attribute these gains to mechanisms like heightened teacher focus on tested content and data-driven feedback loops, where testing frequency and stakes amplify learning effects by an effect size of approximately 0.2-0.4 standard deviations in achievement metrics. For instance, in 9 of 13 states with comparable pre- and post-NCLB data, average annual test score improvements accelerated after 2002, outpacing prior trends by 0.01-0.03 standard deviations per year in reading and math. Cross-state comparisons further reveal that accountability systems correlate with higher overall achievement growth, though effects diminish over time and vary by subgroup, with persistent racial achievement gaps. However, causal evidence also highlights trade-offs, including curriculum narrowing, where over 80% of reviewed studies document reduced instructional time and depth in non-tested subjects like science, social studies, and the arts, alongside increased teacher-centered drill-and-practice methods. High-stakes environments incentivize strategic responses, such as score manipulation or exclusion of low performers, which distort performance indicators and exacerbate inequality between high- and low-achieving students on tested measures. Meta-analytic reviews of policy interventions confirm modest net positive effects on tested outcomes but warn of unintended declines in broader skill development and motivation, particularly for younger or lower-performing students. Overall, while accountability via achievement tests has demonstrably elevated average performance in core subjects, the magnitude remains small (typically under 0.1 standard deviations annually), with evidence suggesting sustainability requires balancing stakes with instructional flexibility to mitigate narrowing effects.

Benefits

Objective Assessment of Acquired Knowledge

Achievement tests serve as a standardized means to evaluate the extent to which individuals have mastered specific knowledge and skills outlined in educational curricula, employing formats such as multiple-choice items that permit unambiguous scoring based on predetermined correct answers. This structure inherently reduces variability introduced by evaluator judgment, yielding results that reflect acquired competencies rather than interpretive differences. In contrast to subjective evaluations like essays or oral exams, where scoring can fluctuate due to personal biases or inconsistent criteria, achievement tests prioritize verifiable factual recall and application, ensuring that scores directly correspond to demonstrated proficiency. Empirical measures of reliability underscore this objectivity; for instance, internal consistency coefficients, such as Kuder-Richardson 20, for well-constructed achievement tests typically range from 0.76 to 0.90, indicating stable measurement across items and administrations. Test-retest correlations for standardized achievement assessments often exceed 0.80, demonstrating consistency over short intervals and supporting their use as dependable indicators of knowledge retention. These metrics derive from psychometric validation processes that align test content with learning objectives, minimizing construct-irrelevant variance and providing educators with actionable data on instructional effectiveness without the confounding effects of subjective grading disparities observed in non-standardized formats. By focusing on observable outcomes—such as solving mathematical problems or identifying historical facts—achievement tests facilitate cross-student and cross-context comparisons, enabling identification of learning gaps tied causally to prior instruction rather than external factors like teacher favoritism. This approach aligns with causal realism in educational measurement, where scores serve as proxies for actual learning, backed by evidence from expert reviews and alignment studies ensuring items probe intended domains without cultural or interpretive overlays that plague less structured methods. Consequently, they promote accountability in knowledge transmission, as low scores signal deficiencies in delivery rather than ambiguous evaluator opinions.
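
The Kuder-Richardson 20 coefficient cited above is the special case of coefficient alpha for dichotomously scored items, where each item's variance reduces to p(1 - p); a minimal sketch, assuming a 0/1 examinees-by-items score matrix:

```python
import numpy as np

def kr20(item_scores):
    """KR-20 for 0/1 items: k/(k-1) * (1 - sum(p*q) / variance of totals)."""
    X = np.asarray(item_scores, dtype=float)
    k = X.shape[1]
    p = X.mean(axis=0)                     # per-item proportion correct
    total_var = X.sum(axis=1).var(ddof=1)  # variance of examinee totals
    return (k / (k - 1)) * (1 - (p * (1 - p)).sum() / total_var)
```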

Support for Meritocracy and Individual Accountability

Achievement tests underpin meritocracy by quantifying individual mastery of specific knowledge and skills, enabling decisions on advancement—such as placements, scholarships, or admissions—to prioritize demonstrated competence over extraneous factors like family connections or recommendations. This approach aligns with causal mechanisms where preparation and effort, rather than systemic privileges, determine outcomes, as tests standardize evaluation across diverse backgrounds and reduce evaluator bias. For instance, evidence from institutions like MIT and Dartmouth indicates that standardized achievement metrics identify high-potential students from underrepresented socioeconomic groups, facilitating merit-based access to selective programs that might otherwise favor legacy status or subjective holistic reviews. By holding individuals directly responsible for their results, achievement tests foster individual accountability, incentivizing sustained effort and self-directed learning as scores reflect personal investment rather than collective circumstances or external excuses. Empirical data from assessment studies show that graded evaluations, including high-stakes formats, prompt students to allocate greater study time and produce higher-quality work, with formative questions eliciting up to 20-30% more effort when tied to performance feedback. In policy contexts, the No Child Left Behind Act of 2001, which mandated annual achievement testing for public schools, correlated with national gains in math proficiency—such as 12-point increases for 8th-grade Black students on NAEP assessments from 2003 to 2007—attributable to heightened individual accountability and instructional focus on measurable outcomes. This framework counters narratives of inherent inequity by emphasizing malleable factors like preparation, where longitudinal evidence from 30 years of U.S. reforms demonstrates that test-linked accountability elevates overall student achievement, particularly when disaggregated to individual levels rather than aggregated group metrics. Proponents argue that such systems dismantle patronage-based selection, as seen in meritocratic hiring via knowledge-based exams, yielding more competent outcomes than unverified proxies. Critics in academia often downplay these benefits due to institutional preferences for equity-focused alternatives, yet causal analyses affirm that ignoring individual test-derived merit perpetuates underperformance by obscuring effort's role.

Criticisms and Controversies

Allegations of Bias and Inequity

Critics have alleged that achievement tests exhibit racial and ethnic bias, pointing to persistent score gaps between white students and black or Hispanic students as evidence of cultural or linguistic favoritism toward majority groups. For instance, disparities in National Assessment of Educational Progress (NAEP) scores, where black students scored 27 points lower in 8th-grade reading in 2022 compared to white students, have been attributed by some to test content assuming familiarity with middle-class norms. Similar claims target college admissions tests like the SAT, with lawsuits arguing that items disadvantage non-native English speakers or those from non-Western backgrounds. However, empirical analyses indicate that such gaps largely reflect differences in prior knowledge and preparation rather than inherent test flaws, as modern achievement tests undergo rigorous item bias reviews using statistical methods like differential item functioning analysis to ensure equivalence across groups. Socioeconomic status (SES) is frequently cited as a source of inequity, with data showing children from the top income quintile scoring 100-150 points higher on SATs than those from the bottom quintile in 2023. Allegations posit that tests proxy privilege through access to test prep, which wealthier families can afford, exacerbating unequal outcomes. Yet, studies controlling for SES factors like parental education and income explain only 50-70% of racial achievement gaps in subjects like math and reading, with residual differences persisting and correlating with long-term outcomes such as college completion across income levels. This suggests environmental influences on acquired skills, not test construction bias, as primary drivers; moreover, high-SES variability in scores within groups undermines claims of systemic discrimination. Gender-based allegations are less prominent but include assertions that test formats disadvantage girls in math due to stereotype threat or timing pressures, contributing to boys outperforming girls by 30 points on SAT math sections in 2023 data. Conversely, girls typically score higher in reading and earn better grades overall, with gaps varying by item type—constructed-response formats narrowing male advantages in math by up to one-third grade level. These domain-specific differences align with behavioral patterns, such as boys' lower engagement in schoolwork explaining grade gaps, rather than indicating format bias. Many claims originate from advocacy groups and teachers' unions, which may prioritize narratives over evidence, as seen in calls to abolish tests amid score disparities. Independent research, however, affirms that achievement tests maintain comparable validity coefficients (0.4-0.6) for future performance across demographic groups, outperforming alternatives like high school GPA, which suffers from grade inflation and subjective grading. Persistent gaps thus highlight causal factors like family structure and school quality, not psychometric inequities, underscoring tests' role in revealing rather than creating disparities.

High-Stakes Drawbacks and Teaching to the Test

High-stakes achievement testing refers to assessments where results directly influence significant consequences, such as school funding, teacher evaluations, student promotion or graduation, or institutional accreditation. Under policies like the U.S. No Child Left Behind Act of 2001, states faced penalties for failing to meet proficiency thresholds, intensifying pressures on educators and students. Empirical analyses indicate that such stakes correlate with elevated student stress and declines in well-being; for instance, failing a high-stakes exit exam has been linked to increased internalizing problems like anxiety and depression in adolescents, persisting up to a year post-exam. A core drawback is the distortion of educational processes, as articulated in Campbell's law, which posits that the greater the reliance on quantitative indicators for decision-making, the more prone they become to corruption, including score inflation without corresponding skill gains. In practice, this manifests as "teaching to the test," where instruction prioritizes rote memorization of testable items over deeper conceptual understanding. Studies of NCLB-era reforms reveal that state accountability tests emphasized a narrow subset of standards—often excluding up to 40% of broader content—leading to detectable patterns of inflated scores on aligned materials but stagnation on unaligned assessments like the National Assessment of Educational Progress (NAEP). This practice contributes to curriculum narrowing, with teachers reallocating instructional time toward tested subjects like math and reading at the expense of others. Survey research on NCLB found elementary schools increased math time by 27% and reading by 18%, while reducing other subjects by 18% and 14%, effects persisting through 2009. Over 80% of reviewed studies confirm such shifts, including diminished coverage of science, social studies, and the arts, as educators rationally prioritize measurable outcomes to avoid sanctions. Test-specific drills and practice yield minimal or negative impacts on general achievement or curriculum-wide learning, per experimental evidence, as they foster superficial familiarity rather than transferable knowledge. Longer-term consequences include eroded teacher morale and systemic gaming, such as excluding low performers from tests or focusing resources on "bubble" students near proficiency cutoffs, which undermines broader equity without enhancing overall learning. While proponents argue high stakes incentivize focus, causal analyses attribute persistent gaps in non-tested domains to these distortions, with no net gains in complex problem-solving or civic knowledge. These patterns hold across contexts, including higher education, where high-stakes finals correlate with reduced performance due to test anxiety.

Overemphasis on Testing vs. Broader Educational Goals

High-stakes achievement testing has been associated with curriculum narrowing, where educators prioritize content aligned with tested subjects at the expense of other areas. A qualitative metasynthesis of 49 studies found that 83.7% reported contraction of curriculum content to focus primarily on tested materials, such as reading and mathematics, while reducing coverage of non-tested topics. In the United States following the 2002 implementation of No Child Left Behind, 62% of school districts increased instructional time for English language arts, with 75% of districts serving underperforming schools making similar shifts toward tested subjects, often reallocating time from science, social studies, art, music, and physical education. This rational adaptation by teachers and administrators aims to boost accountability metrics but fragments knowledge into test-specific elements, as evidenced in 49% of the reviewed studies.

Such narrowing displaces broader educational pursuits, including exposure to liberal arts subjects that foster critical thinking and creativity. Elementary school data indicate significant reallocations, with first-grade English/language arts time rising by 96 minutes per week from 1987-1988 to 2003-2004, accompanied by declines of 12 minutes per week in both science and social studies across grades 1-5. Participation in arts classes among nine-year-olds fell from 78% in 1992 to 71% in 2004, reflecting broader cuts to "specials" as their time was folded into core tested areas. These shifts, observed in over 80% of reviewed studies on high-stakes contexts, promote teacher-centered pedagogies in 65.3% of cases, diminishing opportunities for student-centered inquiry or subject integration that support holistic development.

Critics argue this overemphasis undermines goals like creativity and civic development, though direct causal evidence remains limited and often correlational. Instructional practices under high-stakes pressure tend toward controlling strategies that prioritize compliance over student autonomy, potentially hindering higher-order skills. However, empirical studies do not consistently demonstrate that reducing time in non-tested subjects like arts or physical education improves test scores; some analyses show no significant negative association, and even positive trends where allocations are maintained and led by specialist teachers. While narrowing may yield short-term gains in tested performance, it risks long-term deficits in diverse competencies, as pre-accountability eras showed stable time distributions without equivalent score erosion.

Recent Developments

Post-COVID Score Trends and Recovery Efforts (2020-2025)

The COVID-19 pandemic, with school closures starting in March 2020, caused substantial disruptions to instruction, leading to declines in achievement test scores that persisted through 2025. In the National Assessment of Educational Progress (NAEP) Long-Term Trend assessment for 9-year-olds, average reading scores dropped 5 points from 220 in 2020 to 215 in 2022, the largest decline since 1990, while mathematics scores fell 7 points from 241 to 234, marking the first-ever drop in that series.

Main NAEP assessments confirmed ongoing stagnation or further erosion. Fourth- and eighth-grade reading scores declined by 2 points each in 2024 compared to 2022, remaining 3 points below 2019 pre-pandemic levels, with no state showing gains over 2022. In mathematics, fourth-grade scores rose 2 points from 2022 but stayed below 2019, while eighth-grade scores were flat versus 2022 after an 8-point drop from 2019; only one state exceeded its 2019 level in fourth-grade math. College admissions tests reflected similar patterns: average SAT and ACT scores through 2025 failed to rebound to pre-2020 levels, with ACT composites hitting a 30-year low and participation rates remaining suppressed. International assessments underscored the U.S. trends: the 2023 Trends in International Mathematics and Science Study (TIMSS) showed U.S. eighth-grade math scores dropping 13 points from 2019, with fewer students at intermediate proficiency levels and widened gender gaps.

Recovery efforts, funded by approximately $190 billion in federal Elementary and Secondary School Emergency Relief (ESSER) allocations through 2024, emphasized high-dosage tutoring, extended learning time, and targeted interventions for underserved students. The Education Recovery Scorecard, tracking over 8,700 districts, indicated modest math recovery (about 33% of losses regained by spring 2023) but only 25% in reading, with national progress halting by late 2024 amid expired funding and rising chronic absenteeism. Despite some state-level gains, such as 13 states improving fourth-grade math proficiency since 2022, full recovery eluded all states in both subjects by 2025, although more than 100 individual districts unevenly surpassed pre-pandemic benchmarks. Losses were most acute among low-income and minority students, widening gaps as high achievers advanced faster; Brookings analysis linked reading stagnation to persistent instructional deficits rather than to the targeted reforms that aided math. Projections suggest full math recovery could extend beyond seven years at current paces, highlighting the limits of post-ESSER interventions. The table below summarizes the major indicators.
| Assessment | Grade/Level | Subject | 2022 vs. Pre-Pandemic Change | 2024/2025 vs. 2022 Change | Status Relative to Pre-Pandemic |
| --- | --- | --- | --- | --- | --- |
| NAEP Main | 4th | Reading | -3 pts (vs. 2019) | -2 pts | Below pre-pandemic |
| NAEP Main | 8th | Reading | -3 pts (vs. 2019) | -2 pts | Below pre-pandemic |
| NAEP Main | 4th | Math | -5 pts (vs. 2019) | +2 pts | Below pre-pandemic |
| NAEP Main | 8th | Math | -8 pts (vs. 2019) | Flat | Below pre-pandemic |
| SAT | High school | Overall | Declines from 2019 | No rebound | Below pre-pandemic |
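The recovery percentages reported by trackers like the Education Recovery Scorecard are, at bottom, the share of the pre-pandemic-to-trough score decline regained by a later measurement. A minimal sketch of that arithmetic, using illustrative numbers rather than actual district data:

```python
def recovery_share(pre: float, trough: float, current: float) -> float:
    """Fraction of the pandemic-era score decline that has been regained.

    pre     : pre-pandemic baseline (e.g., spring 2019 mean scale score)
    trough  : post-disruption low point (e.g., spring 2022)
    current : most recent score (e.g., spring 2023)
    """
    loss = pre - trough
    if loss <= 0:
        raise ValueError("no measured loss to recover from")
    return (current - trough) / loss

# Illustrative numbers only: a 6-point drop followed by a 2-point rebound
# corresponds to the ~33% math recovery figure reported for spring 2023.
print(f"{recovery_share(240.0, 234.0, 236.0):.0%}")  # -> 33%
```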

Technological Advances in Testing Delivery

The transition from paper-based to digital delivery of achievement tests began accelerating in the early 2010s with initiatives like the Smarter Balanced and PARCC assessments in the United States, which adopted computer-based platforms to enable adaptive item selection and immediate scoring. By 2024, major standardized tests such as the SAT had fully shifted to digital formats, reducing test length from three hours to two while incorporating adaptive modules that adjust question difficulty based on initial performance, thereby improving precision in ability estimation with fewer items. This change, implemented nationwide on March 9, 2024, also streamlined administration by allowing flexible scheduling and faster score reporting within days rather than weeks.

Computerized adaptive testing (CAT), a core technological advance, dynamically selects test items from large item banks to match examinee ability in real time, minimizing exposure to items of irrelevant difficulty and enhancing measurement efficiency. In educational contexts, CAT has been integrated into interim and summative achievement assessments, such as NWEA's MAP Growth tests, where item difficulty adjusts after each response, typically reducing test duration by 30-50% compared to fixed-form tests while maintaining or improving reliability. Empirical studies confirm CAT's validity in K-12 settings, with adaptive models yielding scores that correlate strongly (r > 0.90) with traditional metrics, though adoption remains uneven due to infrastructure demands.

The COVID-19 pandemic catalyzed remote testing delivery, with platforms enabling proctored online administration to sustain continuity amid school closures. AI-driven proctoring technologies emerged as a key enabler, using facial recognition, eye tracking, and behavioral analytics to monitor examinees remotely, flagging anomalies like unauthorized materials or gaze aversion with over 95% accuracy in controlled trials. Systems like those from Proctorio and similar vendors integrate with learning management software, reducing reliance on human proctors by up to 40% and enabling scalable delivery for high-stakes achievement tests, though they require robust broadband infrastructure to mitigate access gaps in rural or low-income areas. By 2025, these tools have become standard for virtual school assessments, where at-home testing correlates with higher performance than supervised alternatives in some datasets.

Further innovations include integrated learning analytics for personalized feedback and security enhancements for item integrity, though empirical findings on long-term impacts remain emerging as of 2025. Overall, these advances prioritize efficiency and adaptability, supported by evidence of maintained psychometric rigor in digital environments.
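As an illustration of the adaptive loop described above, the sketch below implements a bare-bones CAT under a Rasch (one-parameter logistic) model: after each response the ability estimate is updated toward its maximum-likelihood value, and the next item is the one with maximum Fisher information, which for the Rasch model is simply the unused item whose difficulty lies closest to the current estimate. The 100-item bank, stopping rule, and simulated examinee are hypothetical; operational systems such as MAP Growth add exposure control, content balancing, and more robust estimators:

```python
import math
import random

def prob_correct(theta: float, b: float) -> float:
    """Rasch model: probability of a correct response at ability theta, difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def update_theta(theta, responses):
    """Newton-Raphson steps toward the maximum-likelihood ability estimate."""
    for _ in range(20):
        grad = sum(u - prob_correct(theta, b) for b, u in responses)
        info = sum(prob_correct(theta, b) * (1 - prob_correct(theta, b))
                   for b, _ in responses)
        if info < 1e-9:
            break
        step = grad / info
        theta += step
        if abs(step) < 1e-4:
            break
    return max(-4.0, min(4.0, theta))  # keep the estimate in a sane range

def run_cat(true_theta, bank, max_items=30, se_target=0.5):
    theta, responses, used = 0.0, [], set()
    for _ in range(max_items):
        # Maximum-information selection: for Rasch, the unused item whose
        # difficulty is nearest the current ability estimate.
        item = min((i for i in range(len(bank)) if i not in used),
                   key=lambda i: abs(bank[i] - theta))
        used.add(item)
        answered = random.random() < prob_correct(true_theta, bank[item])
        responses.append((bank[item], 1.0 if answered else 0.0))
        theta = update_theta(theta, responses)
        info = sum(prob_correct(theta, b) * (1 - prob_correct(theta, b))
                   for b, _ in responses)
        if info > 0 and 1 / math.sqrt(info) < se_target:
            break  # stop early once the standard error is small enough
    return theta, len(responses)

random.seed(7)
bank = [-3 + 6 * i / 99 for i in range(100)]  # 100 items, difficulties -3..3
estimate, n_items = run_cat(true_theta=0.8, bank=bank)
print(f"estimated ability {estimate:.2f} after {n_items} items")
```

Because each item is targeted near the examinee's provisional ability, the standard-error stopping rule is typically satisfied with far fewer items than a fixed form would need, which illustrates the efficiency gains the prose above attributes to CAT.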

Policy Reforms and Emerging Alternatives

In response to criticisms of high-stakes achievement testing, numerous U.S. states have reformed graduation requirements by eliminating or scaling back mandatory standardized exit exams. As of October 2025, only six states still require such tests for the high school class of 2026, down from a peak of 27 states that had or planned them. Recent examples include Massachusetts, where voters rejected the MCAS graduation requirement in 2024 via ballot Question 2, prompting a shift toward alternative pathways, and New York, which made Regents exams optional in 2023 while introducing multiple diploma options such as career and technical endorsements. These changes reflect a broader trend under the Every Student Succeeds Act (ESSA) of 2015, which grants states flexibility to prioritize locally developed measures over federal mandates for proficiency demonstrations.

States are increasingly pursuing waivers to innovate beyond traditional end-of-year achievement tests, focusing on through-year assessments that provide more timely data. In 2025, one state submitted an ESSA waiver request to replace annual standardized tests with district-chosen alternatives, emphasizing real-time feedback over summative snapshots, though critics argue this could undermine data comparability and accountability. Similarly, another advanced legislation (House Bill 8, 2025 session) for three-times-per-year testing, allowing districts to select the initial assessments while seeking federal approval to bypass the annual summative requirement for the final one. The incoming administration, via the U.S. Secretary of Education, holds authority to approve such waivers, potentially accelerating deviations from ESSA's annual testing requirement in grades 3-8 and high school, as several states have already piloted multi-test models. These reforms aim to reduce testing burden while maintaining oversight, but implementation risks include inconsistent student coverage and challenges in scaling reliable alternatives.

Emerging alternatives emphasize performance-based and competency-focused evaluations to capture skills beyond multiple-choice formats. Portfolio assessments compile student work samples, such as projects and reflections, to demonstrate sustained mastery, as piloted in districts adopting proficiency-based progression where advancement hinges on verified competencies rather than seat time. Culminating projects and exhibitions require students to apply knowledge in real-world contexts, with rubrics evaluating criteria like critical thinking and collaboration, gaining traction in states like New Hampshire under ESSA's Innovative Assessment and Accountability Demonstration Authority. Student-led conferences and integrated assessments embed evaluation into ongoing instruction, providing formative feedback without isolated test events. Proposed blueprints advocate sampling methods or biennial testing to shrink testing's footprint while enhancing content coverage, reserving diagnostics for local tools and prioritizing snapshots of low-performing subgroups for accountability. Empirical validation of these alternatives remains limited compared to the established reliability of standardized tests, with concerns over subjectivity and inter-rater variability persisting in peer-reviewed evaluations.

    Sep 8, 2025 · If the Trump administration gave its blessing, it could radically shift how state testing looks for students. Some school accountability experts ...Missing: 2020-2025 | Show results with:2020-2025
  204. [204]
    Alternatives to Standardized Tests - NewSchools Venture Fund
    Alternatives to Standardized Tests · Other obstacles exist. · Portfolio-Based Assessment · Performance Exams · Proficiency Exit Standards · Exhibitions · Parent ...
  205. [205]
    5 Alternatives to Standardized Testing - KaiPod Learning
    1. Portfolio Assessments · 2. Formative and Summative Rubrics · 3. Student-Led Conferences · 4. Culminating Projects · 5. Integrated or “Stealth” Assessment.
  206. [206]
    A New Blueprint for State Standardized Testing - FutureEd
    Mar 11, 2025 · This two-part strategy would weaken the case of testing abolitionists by improving the quality and shrinking the scale of state testing while ...Missing: 2020-2025 | Show results with:2020-2025<|separator|>