
Writing assessment

Writing assessment is the systematic evaluation of written texts to gauge proficiency in composing coherent, purposeful prose, typically conducted in educational contexts for purposes including supporting learning, summative grading, placement, proficiency certification, and program evaluation. It draws on research-informed principles emphasizing contextual fairness, multiple measures of performance, and alignment with diverse writing processes, recognizing that writing proficiency involves not only linguistic accuracy but also rhetorical adaptation to audience and purpose. Prominent methods include holistic scoring, which yields an overall impression of quality for efficiency in large-scale evaluations; analytic scoring, which dissects texts into traits such as organization, development, and mechanics for targeted diagnosis; and portfolio assessment, which compiles multiple artifacts to demonstrate growth over time. These approaches originated in early 20th-century standardized testing, evolving from Harvard's entrance exams to mid-century holistic innovations that revived direct evaluation amid concerns over multiple-choice proxies' limited validity for writing. A core challenge lies in achieving reliable and valid scores, as rater judgments introduce variability; empirical studies show interrater agreement often falls below 70% without training and multiple evaluators, necessitating at least three to four raters per response for reliabilities approaching 0.90 in high-stakes contexts. Validity debates persist, particularly regarding whether assessments capture authentic writing processes or merely surface features, with analytic methods offering diagnostic depth at the cost of holistic efficiency. Recent integration of automated scoring systems promises scalability but faces criticism for underestimating nuanced traits like argumentation, while amplifying biases from training data that disadvantage non-native speakers or underrepresented dialects.

Definition and Scope

Core Purposes and Objectives

The core purposes of writing assessment in educational contexts center on enhancing teaching and learning by tracking student progress, diagnosing strengths and weaknesses, informing instructional planning, providing feedback, and motivating engagement with writing tasks. These assessments evaluate students' ability to produce coherent, evidence-supported prose that communicates ideas effectively, aligning with broader goals of developing communication skills essential for academic and professional success. In practice, writing assessments serve both formative roles—offering ongoing diagnostic insights to refine instruction and student performance—and summative roles, such as assigning grades or certifying competence for advancement, placement, or certification. Formative writing assessments prioritize real-time feedback to identify gaps in skills like organization, argumentation, and revision, enabling iterative improvements during the learning process rather than final judgment. Summative assessments, by contrast, measure achievement against predefined standards at the conclusion of a unit or course, often using rubrics to quantify proficiency in elements such as idea development, conventions, and audience awareness, thereby supporting decisions on promotion or qualification. This dual approach ensures assessments are purpose-driven, with reliability and validity tailored to whether the goal is instructional adjustment or accountability. Key objectives include aligning evaluation criteria with explicit learning outcomes, such as articulating complex ideas, supporting claims with evidence, and sustaining coherent discourse, to promote measurable growth in written expression. Assessments also aim to foster self-regulation by encouraging students to monitor their own progress against success criteria, while minimizing subjective biases through standardized scoring protocols. In higher-stakes applications, like large-scale testing, objectives extend to benchmarking national or institutional performance in writing proficiency, informing policy on curriculum efficacy. Overall, these objectives prioritize causal links between assessment practices and tangible improvements in students' ability to write persuasively and accurately.

Distinctions from Oral and Reading Assessments

Writing assessment primarily evaluates productive language skills through the generation of original text, allowing test-takers time for planning, drafting, and revision, which emphasizes accuracy in grammar, lexical range, cohesion, and rhetorical structure. In contrast, oral assessments measure spontaneous spoken output, focusing on fluency, pronunciation, intonation, and interactive competence, where responses occur in real time without opportunity for editing and often involve direct examiner probing or peer interaction. This distinction arises because writing produces a permanent artifact amenable to detailed analytic scoring via rubrics, whereas oral performance captures ephemeral elements like stress patterns and hesitation management, which are harder to standardize but reveal communicative adaptability under pressure. Unlike reading assessments, which test receptive skills by measuring comprehension, inference, and knowledge integration from provided texts—often via objective formats like multiple-choice or cloze tasks—writing assessments demand active construction of meaning, evaluating originality, logical progression, and syntactic complexity in self-generated content. Reading tasks typically yield higher inter-rater reliability due to verifiable answers against keys, with scores reflecting decoding efficiency and background knowledge activation, while writing scoring relies on holistic or analytic judgments prone to subjectivity, though mitigated by trained raters and multiple evaluations. Empirical studies confirm that proficiency imbalances often occur, with individuals exhibiting stronger receptive abilities (e.g., in reading) than productive ones (e.g., in writing), underscoring the need for distinct assessment methods to avoid conflating input processing with output generation.

Historical Development

Early Educational and Rhetorical Traditions

In ancient Greece, rhetorical education emerged in the 5th century BCE amid democratic assemblies and legal disputes, where sophists instructed students in composing persuasive speeches through imitation of model texts and practice in argumentation. Students drafted written orations focusing on ethos, pathos, and logos, as outlined by Aristotle in his Rhetoric (circa 350 BCE), with evaluation conducted via teacher critiques emphasizing logical coherence, stylistic elegance, and persuasive efficacy during rehearsals or declamations. This formative assessment prioritized individualized feedback over standardized measures, reflecting the apprenticeship model where instructors assessed progress in invention and arrangement of ideas. Roman educators adapted these methods, formalizing writing within the progymnasmata, a sequence of graded exercises originating in the Hellenistic era and refined by later rhetoricians, progressing from simple fables and narratives to complex theses and legal arguments. These tasks built compositional skills through imitation, expansion, and refutation, with students submitting written pieces for teacher correction on clarity, structure, and rhetorical force. Quintilian, in his Institutio Oratoria (completed circa 95 CE), advocated systematic evaluation of such exercises, urging rhetors to provide detailed emendations on style and content while integrating writing with oral declamation to assess overall oratorical potential. Assessment in these traditions remained inherently subjective, reliant on the instructor's expertise in the five canons of invention, arrangement, style, memory, and delivery—without empirical scoring rubrics or standardized protocols. Teachers like Quintilian emphasized moral and intellectual virtue in critiques, correcting drafts iteratively to foster improvement, though biases toward elite cultural norms could influence judgments. This approach contrasted with later standardized testing by embedding evaluation in ongoing pedagogical practice, aiming to cultivate eloquent citizens rather than rank performers uniformly.

20th-Century Standardization and Testing Movements

The standardization of writing assessment in the early 20th century emerged amid broader educational reforms emphasizing efficiency and objectivity, influenced by the Progressive Era's push for measurable outcomes in mass schooling. The College Entrance Examination Board, founded in 1900, introduced standardized written examinations in 1901 that included essay components to evaluate subject-area knowledge and composition skills, aiming to create uniform admission criteria for colleges amid rising enrollment. These direct assessments, however, faced immediate challenges: essay scoring proved time-intensive, prone to inter-rater variability, and difficult to scale for large populations, as evidenced by early psychometric studies highlighting low reliability coefficients, often below 0.50, for subjective evaluations. By the 1920s and 1930s, dissatisfaction with essay reliability—stemming from inconsistent scoring due to raters' differing emphases on mechanics, content, or style—drove a pivotal shift toward indirect, objective measures. The Scholastic Aptitude Test (SAT), launched in 1926, initially incorporated essays but transitioned to multiple-choice formats by the mid-1930s, prioritizing cost-effectiveness, rapid scoring via machine-readable answers, and higher test-retest reliability (often exceeding 0.80), as indirect tests correlated moderately with direct essay scores while avoiding subjective rater variance. This movement aligned with the rise of educational measurement experts like E.L. Thorndike, who advocated quantifiable proxies for writing skills, such as grammar and usage tests, reflecting a broader faith in objective measurement over holistic judgment amid expanding public education systems. Critics within measurement traditions noted, however, that these proxies captured mechanical aspects, while their validity for actual writing proficiency remained contested, with correlations to directly produced essays typically ranging from 0.40 to 0.60. Mid-century developments reinforced standardization through institutionalization, as the Educational Testing Service (ETS), established in 1947, advanced objective testing infrastructures for writing-related skills, influencing high-stakes uses like military and civil service exams. Post-World War II accountability demands, amplified by the 1957 Sputnik launch and the subsequent National Defense Education Act of 1958, spurred federal interest in standardized achievement metrics, though writing lagged behind reading and math due to persistent scoring hurdles. The National Assessment of Educational Progress (NAEP), initiated in 1969, marked a partial reversal by reintegrating direct writing tasks with trained rater protocols, achieving improved reliability through multiple independent scores averaged for final metrics. The late 20th century saw the maturation of holistic scoring methods to reconcile direct assessment's authenticity with psychometric reliability needs, pioneered in projects such as 1970s statewide testing programs, where raters evaluated overall essay quality on anchored scales (e.g., 1-6 bands) after calibration training to minimize variance. This approach, yielding inter-rater agreements of 70-80% within one point, addressed earlier subjectivity critiques while enabling large-scale administration, though empirical studies cautioned that holistic judgments often overweight superficial traits like length over depth. By the 1980s, movements like standards-based reform following A Nation at Risk (1983) embedded standardized writing tests in state accountability systems, blending direct and indirect elements, yet psychometric analyses consistently showed direct methods' superior construct validity for compositional skills despite higher costs—up to 10 times that of multiple-choice.
These efforts reflected causal pressures from demographic shifts and equity concerns, prioritizing scalable, defensible metrics over unstandardized teacher grading, even as source biases in the academic literature sometimes overstated tests' universality.

Post-2000 Technological Advancements

Following the initial deployment of early automated essay scoring (AES) systems in the late 1990s, post-2000 developments emphasized enhancements in natural language processing (NLP) and machine learning to improve scoring accuracy and provide formative feedback. The Project Essay Grader (PEG), acquired by Measurement Inc. in 2002, was expanded to incorporate over 500 linguistic features such as fluency and grammar error rates, achieving correlations of 0.87 with human raters on standardized prompts. Similarly, ETS's e-rater engine, operational since 1999, underwent annual upgrades, including version 2.0 around 2005, which utilized 11 core features like discourse structure and vocabulary usage to yield correlations ranging from 0.87 to 0.94 against human scores in high-stakes tests such as the GRE and TOEFL. These systems shifted from purely statistical regression models to hybrid approaches integrating syntactic and semantic analysis, enabling scalability for large-scale assessments while maintaining reliability comparable to multiple human scorers. In the mid-2000s, web-based automated writing evaluation (AWE) platforms emerged to support classroom use beyond summative scoring, offering real-time feedback on traits like organization and mechanics. ETS's Criterion service, launched post-2000, leveraged e-rater for instant diagnostics, allowing students to revise essays iteratively and correlating strongly with expert evaluations in pilot studies. Vantage Learning's IntelliMetric, refined after 1998 for multilingual support, powered tools like MY Access!, which by the late 2000s provided trait-specific scores and achieved 0.83 average agreement with humans across prompts. Bibliometric analyses indicate a publication surge in AWE research post-2010, with annual growth exceeding 18% since 2018, reflecting broader adoption of these tools in higher education for reducing scorer subjectivity and enabling frequent practice. The 2010s marked a pivot to deep learning paradigms, automating feature extraction for nuanced evaluation of coherence and argumentation. Recurrent neural networks (RNNs) with long short-term memory (LSTM) units, as in Taghipour and Ng's 2016 model, outperformed earlier systems by 5.6% in quadratic weighted kappa (QWK) scores on the Automated Student Assessment Prize dataset, reaching 0.76 through end-to-end learning of semantic patterns. Convolutional neural networks (CNNs), applied in Dong and Zhang's 2016 two-layer architecture, captured syntactic-semantic interplay with a QWK of 0.73, while hybrid CNN-LSTM models by Dasgupta et al. in 2018 attained Pearson correlations up to 0.94 by emphasizing qualitative enhancements like topical relevance. These advancements, validated on diverse corpora, improved generalization across genres and reduced reliance on hand-engineered proxies, though empirical studies note persistent challenges in assessing higher-order qualities such as argumentation, where human-AI agreement dips below 0.80. Post-2020, transformer-based models and large language models (LLMs) have further elevated AES precision, integrating contextual understanding for holistic scoring. Tools such as Turnitin's Revision Assistant, evolving from earlier AWE frameworks, now employ machine learning for predictive feedback on clarity and engagement, with meta-analyses showing medium-to-strong effects on writing quantity and quality in elementary and EFL contexts. Emerging integrations of GPT-style models and their variants enable dynamic rubric alignment, as evidenced by correlations exceeding 0.90 in recent benchmarks, facilitating personalized assessment in learning management systems.
Despite these gains, adoption in formal evaluations remains hybrid, combining automated scoring with human oversight to mitigate biases in non-standard prose, as confirmed by longitudinal validity studies.
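
The quadratic weighted kappa (QWK) values cited above are a standard agreement statistic between machine and human scores. The sketch below is a minimal, generic illustration of that calculation, assuming integer holistic scores on a 1-6 scale and hypothetical data; it is not any vendor's implementation.

```python
import numpy as np

def quadratic_weighted_kappa(human, machine, min_score=1, max_score=6):
    """Quadratic weighted kappa between two integer score vectors.

    Minimal illustration of the agreement statistic cited for AES systems;
    the 1-6 score range is an assumption, not tied to any specific test.
    """
    human = np.asarray(human)
    machine = np.asarray(machine)
    n_cats = max_score - min_score + 1

    # Observed confusion matrix of human vs. machine scores
    observed = np.zeros((n_cats, n_cats))
    for h, m in zip(human, machine):
        observed[h - min_score, m - min_score] += 1

    # Expected matrix from the marginal score distributions
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / len(human)

    # Quadratic disagreement weights: 0 on the diagonal, growing with distance
    idx = np.arange(n_cats)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_cats - 1) ** 2

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

# Hypothetical scores for ten essays
print(quadratic_weighted_kappa([3, 4, 5, 2, 6, 3, 4, 5, 2, 4],
                               [3, 4, 4, 2, 5, 3, 4, 5, 3, 4]))
```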

Core Principles

Validity and Its Measurement Challenges

Validity in writing assessment refers to the degree to which scores reflect the intended construct of writing ability, encompassing skills such as argumentation, organization, and linguistic accuracy, rather than extraneous factors like test-taking savvy or prompt familiarity. Construct validity, in particular, demands empirical evidence that scores align with theoretical models of writing, including convergent correlations with other writing tasks and distinctions from unrelated abilities such as general verbal reasoning. However, measuring this validity is complicated by writing's multifaceted nature, where no single prompt or rubric fully captures domain-general proficiency, leading to construct underrepresentation—scores often emphasize surface features over deeper rhetorical competence. Empirical studies reveal modest validity coefficients, with inter-prompt correlations typically ranging from 0.40 to 0.70, indicating limited generalizability across writing tasks and contexts. For instance, predictive validity for college performance shows writing test scores correlating at r ≈ 0.30-0.50 with first-year GPA, weaker than for quantitative measures, partly due to the absence of a universal criterion for "real-world" writing success. Criterion-related validity is further challenged by rater biases, where holistic scoring introduces construct-irrelevant variance from subjective interpretations, despite training; factor analyses often explain only 30-40% of score variance through intended traits. Efforts to quantify validity include multitrait-multimethod matrices, which test whether writing scores converge more strongly with other writing metrics (e.g., portfolio assessments) than with non-writing proxies like multiple-choice tests, yet results frequently show cross-method discrepancies of 0.20-0.30 in correlations. In computer-based formats, mode effects undermine comparability, with keyboarding proficiency artifactually inflating scores (correlations up to 0.25 with typing speed) and interfaces potentially constraining revision processes, as evidenced by NAEP studies finding no overall mean differences but subgroup variations. These hurdles persist because writing lacks operationally defined benchmarks, relying instead on proxy validations that academic sources, potentially influenced by institutional incentives to affirm assessment utility, may overinterpret as robust despite the empirical modesty.

Reliability Across Scorers and Contexts

Inter-rater reliability in writing assessment refers to the consistency of scores assigned by different human evaluators to the same written response, often measured using intraclass correlation coefficients or generalizability coefficients derived from generalizability theory (G-theory). Empirical studies indicate moderate to high inter-rater reliability when standardized rubrics and rater training are employed, with coefficients typically ranging from 0.70 to 0.85 in controlled settings. For instance, in assessments of elementary students' narrative and expository writing using holistic scoring, rater variance contributed negligibly (0%) to total score variance after training, yielding generalizability coefficients of 0.81 for expository tasks and 0.82 for narrative tasks with two raters and three tasks. However, without such controls, inter-rater agreement can be lower; in one study of EFL university essays evaluated by 10 expert raters, significant differences emerged across scoring tools (p < 0.001), with analytic checklists and scales outperforming general impression marking but still showing variability due to subjective judgments. Factors influencing inter-rater reliability include rater expertise, training protocols, and scoring method. G-theory analyses reveal that rater effects can account for up to 26% of score variance in multi-faceted designs involving tasks and methods, though this diminishes with calibration training using anchor papers. Peer and analytical scoring methods, combined with multiple raters, enhance consistency, requiring at least four raters to achieve a generalizability coefficient of 0.80 in professional contexts. Despite these efforts, persistent challenges arise from raters' differing interpretations of traits like coherence or creativity, underscoring the subjective nature of writing evaluation compared to objective formats. Reliability across contexts encompasses score stability over varying tasks, prompts, occasions, and conditions, often assessed via G-theory analyses to partition variance sources such as tasks (19-30% of total variance) and interactions between persons and tasks. Single-task assessments exhibit low reliability, particularly for L2 writers, where topic-specific demands introduce substantial error; fluency measures fare better than complexity or accuracy, but overall coefficients remain insufficient for robust inferences without replication. To attain acceptable generalizability (e.g., 0.80-0.90), multiple tasks are essential—typically three to seven prompts alongside one or two raters—as task variance dominates in elementary and secondary evaluations, reflecting how prompts differentially elicit skills like organization or vocabulary. Occasion effects, such as time between writings, contribute minimally when intervals are short (e.g., 1-21 days), suggesting trait stability but highlighting measurement error from contextual factors like prompt familiarity. In practice, these reliability constraints necessitate design trade-offs in large-scale testing, where single-sitting, single-task formats prioritize efficiency over precision, potentially underestimating true writing ability variance (52-54% attributable to individuals). G-theory underscores that integrated tasks and analytical methods yield higher cross-context dependability than holistic or isolated prompts, informing standardization in educational and certification contexts.
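
The generalizability coefficients discussed above can be estimated from a fully crossed persons-by-raters score matrix by partitioning variance into person and rater components. The sketch below is a minimal single-facet illustration with hypothetical ratings, not a full multi-facet G-study of the kind reported in the literature.

```python
import numpy as np

def g_coefficient(scores, n_raters_decision):
    """Relative generalizability coefficient for a persons x raters design.

    `scores` is a 2-D array (rows = persons, columns = raters), one score per
    cell. This is a minimal G-theory sketch with a single rater facet.
    """
    scores = np.asarray(scores, dtype=float)
    n_p, n_r = scores.shape
    grand = scores.mean()

    ss_p = n_r * ((scores.mean(axis=1) - grand) ** 2).sum()
    ss_r = n_p * ((scores.mean(axis=0) - grand) ** 2).sum()
    ss_res = ((scores - grand) ** 2).sum() - ss_p - ss_r

    ms_p = ss_p / (n_p - 1)
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))

    var_p = max((ms_p - ms_res) / n_r, 0.0)  # person (true-score) variance
    var_pr = ms_res                          # person x rater interaction + error

    return var_p / (var_p + var_pr / n_raters_decision)

# Hypothetical holistic scores: 5 essays scored by 3 raters on a 1-6 scale
ratings = [[4, 4, 5],
           [2, 3, 2],
           [5, 5, 6],
           [3, 3, 3],
           [4, 5, 4]]
print(round(g_coefficient(ratings, n_raters_decision=2), 2))
```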

Objectivity, Bias, and Standardization Efforts

Objectivity in writing assessment is challenged by the subjective interpretation of qualitative elements such as argumentation quality and stylistic nuance, which can lead to variability in scores across evaluators. Rater bias, defined as systematic patterns of overly severe or lenient scoring, often arises from individual differences in background, experience, or perceived essay characteristics, with empirical evidence showing biases linked to rater language proficiency and prompt familiarity. For instance, studies on L2 writing evaluations have identified halo effects, where a strong impression in one trait influences ratings of others, exacerbating inconsistencies. Inter-rater reliability coefficients typically range from 0.50 to 0.70 in holistic essay scoring without controls, reflecting moderate agreement but highlighting bias risks from scorer drift or contextual factors like handwriting legibility or demographic cues. Intra-rater reliability, measuring consistency within the same evaluator over time, fares similarly, with discrepancies attributed to fatigue or shifting standards, as evidenced in analyses of EFL composition scoring where agreement dropped below 0.60 absent standardization. These patterns underscore that unmitigated human judgment introduces errors equivalent to 10-20% of score variance in uncontrolled settings. Standardization efforts primarily involve analytic rubrics, which decompose writing into discrete criteria like content organization and mechanics, providing explicit descriptors and scales to minimize subjective latitude. Rater training protocols, including benchmark exemplars and calibration sessions, have demonstrated efficacy in elevating inter-rater agreement by 15-25%, as raters practice aligning judgments against shared anchors to curb leniency or severity biases. Many-facet Rasch models further adjust for rater effects statistically, equating scores across panels and reducing bias impacts in large-scale assessments like standardized tests. Despite these measures, complete objectivity remains elusive, as rubrics cannot fully capture contextual or creative subtleties, and training effects may decay over time without reinforcement, with some studies reporting persistent 5-10% unexplained variance in scores. Ongoing refinements, such as integrating multiple raters or hybrid human-AI checks, aim to bolster reliability, though they require validation against independent performance predictors to ensure causal fidelity to writing proficiency.
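
The statistical adjustment for rater severity that many-facet Rasch models perform in a latent-trait framework can be illustrated in drastically simplified form by centering each rater's scores on the panel mean. The sketch below is a crude stand-in for such models, not an implementation of them; the ratings and the mean-centering rule are illustrative assumptions.

```python
import numpy as np

def severity_adjusted_scores(ratings):
    """Subtract each rater's mean leniency/severity offset from their scores.

    `ratings` has rows = essays, columns = raters. Mean-centering removes only
    the rater main effect; a many-facet Rasch calibration would jointly model
    raters, tasks, and examinees on a latent scale.
    """
    ratings = np.asarray(ratings, dtype=float)
    rater_offsets = ratings.mean(axis=0) - ratings.mean()  # + lenient, - severe
    adjusted = ratings - rater_offsets                      # remove rater main effect
    return adjusted.mean(axis=1), rater_offsets

# Hypothetical panel: the third rater is consistently more severe
essay_means, offsets = severity_adjusted_scores([[4, 4, 3],
                                                 [5, 5, 4],
                                                 [3, 4, 2]])
print("rater offsets:", offsets.round(2))
print("adjusted essay scores:", essay_means.round(2))
```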

Assessment Methods

Direct Writing Evaluations

Direct writing evaluations involve prompting examinees to produce original texts, such as essays or reports, which are subsequently scored by trained human raters to gauge skills in composition, argumentation, and expression. These methods prioritize authentic performance over proxy indicators, enabling assessment of integrated abilities like idea development and rhetorical effectiveness. Prompts are typically task-specific, specifying genre, audience, and purpose—for instance, persuasive essays limited to 30-45 minutes—to simulate real-world constraints while controlling for extraneous variables. Scoring relies on rubrics that outline performance levels across defined criteria. Holistic scoring assigns a single ordinal score, often on a 1-6 scale, reflecting the overall impression of quality, which facilitates efficiency in high-volume testing but risks overlooking trait-specific weaknesses. In contrast, analytic scoring decomposes evaluation into discrete traits—such as content (30-40% weight), organization, style, and conventions—yielding subscale scores for targeted feedback; research indicates analytic approaches yield higher interrater agreement, with exact matches up to 75% in controlled studies versus 60% for holistic. Primary trait scoring, a variant, emphasizes the prompt's core demand, like thesis clarity in argumentative tasks, minimizing halo effects from unrelated strengths. Implementation includes rater calibration through norming sessions, where evaluators score anchor papers to achieve consensus on benchmarks, followed by double-scoring of operational responses with resolution of discrepancies via third readers. Reliability coefficients for such systems range from 0.70 to 0.85 across raters and tasks, bolstered by requiring 2-3 raters per response to attain generalizability near 0.90, though variability persists due to rater fatigue or prompt ambiguity. Validity evidence derives from predictive correlations with subsequent writing outcomes (r ≈ 0.50-0.65) and expert judgments of construct alignment, yet challenges arise from low task generalizability, as scores reflect prompt-specific strategies rather than domain-wide proficiency.
| Scoring method | Key features | Strengths | Limitations |
| --- | --- | --- | --- |
| Holistic | Single overall score (e.g., 1-6 scale) based on global judgment | Rapid scoring; captures holistic quality | Reduced diagnostic detail; prone to subjectivity |
| Analytic | Multiple subscale scores (e.g., content, mechanics) | Provides trait feedback; higher reliability | Time-intensive; potential for subscale inconsistencies |
| Primary trait | Focus on task-central feature (e.g., evidence use) | Aligns closely with prompt goals | Narrow scope; ignores ancillary skills |
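
The trait weights mentioned for analytic scoring (e.g., content at 30-40%) imply a weighted composite of subscale scores. The sketch below shows one hypothetical way to encode an analytic rubric and aggregate trait ratings; the traits, weights, and 1-6 scale are illustrative assumptions, not a standard published rubric.

```python
# Hypothetical analytic rubric: trait weights sum to 1.0, each trait scored 1-6.
RUBRIC_WEIGHTS = {
    "content": 0.35,
    "organization": 0.25,
    "style": 0.20,
    "conventions": 0.20,
}

def composite_score(trait_scores, weights=RUBRIC_WEIGHTS):
    """Weighted analytic composite on the same 1-6 scale as the subscales."""
    missing = set(weights) - set(trait_scores)
    if missing:
        raise ValueError(f"missing trait scores: {missing}")
    return sum(weights[t] * trait_scores[t] for t in weights)

# One essay's subscale ratings from a single rater
print(composite_score({"content": 5, "organization": 4, "style": 4, "conventions": 3}))
# -> 4.15, reported alongside the subscale profile for diagnostic feedback
```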

Indirect Proxy Measures

Indirect proxy measures of writing ability evaluate discrete components of writing skills, such as grammar, syntax, punctuation, vocabulary usage, and recognition of stylistic errors, typically through multiple-choice or objective formats that do not require test-takers to produce original extended text. These assessments infer overall writing proficiency from performance on isolated elements, assuming mastery of mechanics correlates with effective composition. Common examples include the multiple-choice sections of standardized tests like the pre-2016 SAT writing section, which featured 49 questions on sentence correction, error identification, and paragraph improvement, or similar components in the TOEFL iBT's structure and written expression tasks. Such measures offer high reliability, with inter-scorer agreement approaching 100% due to automated or rule-based scoring, contrasting with the subjective variability in direct essay evaluations where reliability coefficients often range from 0.50 to 0.80. They enable efficient large-scale administration, as seen in the SAT's indirect writing component, which processed millions of responses annually with minimal cost and bias from human raters. Empirical studies confirm their internal consistency, with Cronbach's alpha values exceeding 0.85 in many implementations. Validity evidence shows moderate predictive power for writing outcomes. Correlations between indirect scores and direct essay ratings typically fall between 0.40 and 0.60; for instance, one analysis found objective tests correlating 0.41 with college English grades, nearly identical to essay correlations of 0.40. Educational Testing Service research on SAT data indicated indirect writing scores predicted first-year college writing course grades (r ≈ 0.45), though adding direct essay scores provided incremental validity of about 0.05 to 0.10 in regression models. These associations hold across contexts, including high school-to-college transitions, but weaken for advanced rhetorical skills. Limitations persist, as indirect measures may prioritize mechanical accuracy over higher-order competencies like idea development, coherence, or audience adaptation, potentially underestimating true writing aptitude in integrative tasks. Critics argue low-to-moderate correlations with holistic writing samples question their sufficiency as standalone proxies, with some studies showing indirect tests explaining only 16-36% of variance in direct performance. Despite this, they remain prevalent in high-stakes testing for their scalability, often supplemented by direct methods in comprehensive evaluations.
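
The variance-explained figures above follow directly from squaring the validity correlations; the one-line check below uses the correlation values from the ranges reported in this section.

```python
# Shared variance between indirect scores and direct essay ratings: r squared.
for r in (0.40, 0.45, 0.60):
    print(f"r = {r:.2f} -> variance explained = {r**2:.0%}")
# r = 0.40 -> 16%, r = 0.45 -> 20%, r = 0.60 -> 36%
```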

Automated and AI-Enhanced Scoring

Automated essay scoring (AES) systems originated with Project Essay Grade (PEG), developed by Ellis Batten Page in the 1960s, which predicted scores by correlating measurable text features like sentence length and word diversity with human-assigned grades from training corpora. By the 1990s, Educational Testing Service (ETS) advanced the field with e-rater, first deployed commercially in February 1999 for Graduate Management Admission Test (GMAT) essays, using natural language processing to analyze linguistic and rhetorical features such as grammar accuracy, vocabulary sophistication, and organizational structure. These early systems relied on regression models or machine learning classifiers trained on thousands of human-scored essays to generate scores typically on a 1-6 holistic scale, enabling rapid processing of large volumes unattainable by human raters alone. Traditional AES engines extract dozens of predefined features—ranging from syntactic complexity and error rates to discourse coherence—and map them to scores via statistical or neural network algorithms, achieving quadratic weighted kappa agreements with human raters of 0.50 to 0.75 in peer-reviewed validations. For instance, e-rater's deployment in high-stakes assessments like the TOEFL iBT and GRE has demonstrated reliability exceeding that of single human scorers, with exact agreement rates around 70% when multiple human benchmarks are averaged. Such consistency stems from immunity to scorer fatigue or subjectivity, allowing scalability for formative tools like ETS's Criterion platform, which provides instant feedback on over 200,000 essays annually in educational settings. Post-2020 advancements integrate deep learning and large language models (LLMs) for AI-enhanced scoring, shifting from rule-based features to generative evaluation of content relevance, argumentation strength, and stylistic nuance. Systems like those leveraging such models analyze semantic embeddings and rhetorical patterns, with studies reporting correlations to expert human scores of 0.60-0.80 for argumentative essays in controlled trials. However, LLM-based scorers exhibit inconsistencies, such as stricter grading than humans (e.g., underrating valid but unconventional arguments) and failure to detect sarcasm or contextual irony, reducing validity for higher-order skills like critical thinking. Empirical evidence highlights persistent limitations in AI-enhanced systems, including vulnerability to gaming via keyword stuffing or AI-generated text, which inflates scores without reflecting authentic proficiency; for example, e-rater has shown degraded performance on synthetic essays, misaligning with human judgments by up to 20% in directionality. Bias analyses from 2023-2025 reveal racial disparities, where algorithms trained on majority-group essays underrate minority students' work due to stylistic mismatches in training data, perpetuating inequities akin to those in human scoring but amplified by opaque model decisions. Multiple studies corroborate that while AES excels in mechanical traits (e.g., 85% agreement on grammar), it underperforms on holistic validity (correlations below 0.50 for creativity), necessitating hybrid human-AI adjudication for defensible assessments. Ongoing psychometric guidelines emphasize cross-validation against diverse prompts and populations to mitigate these gaps, though full replacement of trained human oversight remains unsubstantiated.
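
Early feature-based engines mapped hand-crafted text features to human scores with regression. The sketch below is a toy version of that general pipeline, using a few illustrative surface features and an ordinary least-squares fit on hypothetical essays and scores; it is not PEG or e-rater and omits the hundreds of features those systems use.

```python
import numpy as np

def surface_features(essay: str) -> np.ndarray:
    """A few illustrative surface features of the kind early AES engines used."""
    words = essay.split()
    n_words = len(words)
    avg_word_len = sum(len(w) for w in words) / max(n_words, 1)
    type_token_ratio = len(set(w.lower() for w in words)) / max(n_words, 1)
    n_sentences = max(essay.count(".") + essay.count("!") + essay.count("?"), 1)
    return np.array([n_words, avg_word_len, type_token_ratio, n_words / n_sentences])

# Hypothetical training set: essays with human holistic scores (1-6)
train_essays = [
    "Short and vague response.",
    "A longer response that develops an argument with several supporting details and varied vocabulary.",
    "An extended, well organized essay presenting claims, evidence, and a reasoned conclusion across sentences.",
]
train_scores = np.array([2.0, 4.0, 5.0])

X = np.vstack([surface_features(e) for e in train_essays])
X = np.hstack([np.ones((X.shape[0], 1)), X])              # intercept term
coef, *_ = np.linalg.lstsq(X, train_scores, rcond=None)   # least-squares fit

new = surface_features("A moderately developed answer with some supporting detail.")
predicted = float(np.hstack([1.0, new]) @ coef)
print(f"predicted holistic score: {min(max(predicted, 1.0), 6.0):.1f}")
```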

Fairness and Group Differences

Empirical Patterns in Performance Disparities

In the National Assessment of Educational Progress (NAEP) writing assessment administered in 2011—the most recent comprehensive national evaluation of writing performance—eighth-grade students identified as White scored an average of 158 on the 0-300 scale, compared to 132 for Black students (a 26-point gap) and 141 for Hispanic students (a 17-point gap), while Asian students averaged 164. Twelfth-grade patterns were similar, with White students averaging 159, Black students 130 (29-point gap), Hispanic students 142 (17-point gap), and Asian students 162. These disparities align with evidence from the SAT's Evidence-Based Reading and Writing (ERW) section, which evaluates reading comprehension, analysis, and writing skills; in the 2023 cohort, average ERW scores were 529 for White test-takers, 464 for Black (65-point gap), 474 for Hispanic (55-point gap), and 592 for Asian. Such gaps, equivalent to roughly 0.8-1.0 standard deviations in standardized writing metrics, have persisted across decades of assessments despite policy interventions, with NAEP data showing only modest narrowing in some racial comparisons since the 1990s. Gender differences in writing performance consistently favor females. In the 2011 NAEP writing assessment, female eighth-graders outperformed males by 11 points overall (156 vs. 145), a pattern holding across racial/ethnic groups, including a 12-point advantage for White females and 10 points for Black females. Similarly, medium-sized female advantages (Cohen's d ≈ 0.5) appear in writing tasks within large-scale evaluations, stable over time and larger than in reading. On the SAT ERW section in 2023, females averaged 521 compared to males' 518, though males show greater variance at the upper tail. Socioeconomic status (SES) correlates strongly with writing scores, amplifying other disparities. Higher-SES students, proxied by parental education or income, outperform lower-SES peers by 50-100 points on SAT ERW, with children from the top income quintile averaging over 100 points higher than those from the bottom. In NAEP data, students eligible for free or reduced-price lunch (indicative of lower SES) score 20-30 points below non-eligible peers in writing, a gap that intersects with racial differences as lower-SES groups are disproportionately represented among Black and Hispanic students. These patterns hold in peer-reviewed analyses of standardized writing proxies, where family SES accounts for 20-40% of variance in scores but leaves substantial residuals unexplained by environmental controls alone.
| Group | NAEP Grade 8 Writing, 2011 (avg. score, 0-300 scale) | SAT ERW, 2023 (avg. score, 200-800 scale) |
| --- | --- | --- |
| White | 158 | 529 |
| Black | 132 | 464 |
| Hispanic | 141 | 474 |
| Asian | 164 | 592 |
| Female (overall) | 156 | 521 |
| Male (overall) | 145 | 518 |
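
The statement that these gaps correspond to roughly 0.8-1.0 standard deviations refers to a standardized mean difference (Cohen's d). The short calculation below shows the conversion using the NAEP means from the table; the pooled standard deviation of about 30 scale points is an assumed, illustrative value rather than a published NAEP parameter.

```python
def cohens_d(mean_a, mean_b, pooled_sd):
    """Standardized mean difference between two group means."""
    return (mean_a - mean_b) / pooled_sd

# NAEP grade 8 writing means from the table above; the ~30-point pooled SD
# is an assumption for illustration, not an official statistic.
print(round(cohens_d(158, 132, pooled_sd=30), 2))  # White-Black gap of 26 points, roughly 0.87 SD
```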

Environmental Explanations and Interventions

Socioeconomic status (SES) exhibits a robust positive correlation with writing achievement, with meta-analytic evidence indicating effect sizes ranging from moderate to large across diverse populations. Lower SES environments often correlate with reduced access to literacy-rich home settings, including fewer books and less parental reading interaction, which longitudinally predict deficits in writing composition and fluency by age 8-10. School-level SES further amplifies these effects through variations in teacher quality and instructional resources, where higher-SES schools demonstrate 0.2-0.4 standard deviation advantages in standardized writing scores after controlling for individual factors. Quality of educational input, including explicit writing instruction, accounts for substantial variance in performance disparities. Longitudinal studies reveal that students in under-resourced schools receive 40-50% less dedicated writing time weekly, correlating with persistent gaps in syntactic complexity and idea organization. Language exposure disparities, particularly in bilingual or low-literacy households, hinder proficiency; children with limited print exposure before age 5 show 15-20% lower writing output and accuracy in elementary assessments, as measured by vocabulary integration and narrative coherence. Classroom environmental factors, such as noise levels above 55 dB or suboptimal lighting, directly impair sustained writing tasks, reducing productivity by up to 10% in controlled experiments. Interventions targeting these environmental levers demonstrate efficacy in elevating scores, though gains are often modest and context-dependent. Self-Regulated Strategy Development (SRSD), involving explicit teaching of planning, drafting, and revision heuristics, yields effect sizes of 0.5-1.0 standard deviations in randomized controlled trials with grades 4-8 students, particularly benefiting lower performers through improved genre-specific structures. Process-oriented prewriting programs, including graphic organizers and peer feedback, enhance compositional quality by 20-30% in pilot RCTs for early elementary learners, with sustained effects over 6-12 months when embedded in daily routines. Meta-analyses of K-5 interventions confirm that multi-component approaches—combining increased writing volume (e.g., 30 minutes daily) with teacher modeling—outperform single-method strategies, closing SES-related gaps by 0.3 standard deviations on average, though fade-out occurs without ongoing support. Broader systemic interventions, such as professional development for evidence-based practices, show promise in scaling improvements; cluster-randomized studies report 15-25% gains in writing elements like coherence when teachers receive training, but implementation fidelity varies, with only 60-70% adherence in low-SES districts due to resource constraints. Increasing home-school literacy partnerships, via targeted reading programs, mitigates language exposure deficits, boosting writing fluency by 10-15% in longitudinal trials, yet these require sustained funding to prevent regression. Empirical data underscore that while environmental interventions address malleable factors, they explain less than 20-30% of total variance in writing skills, per twin studies partitioning shared environment effects, highlighting limits against entrenched disparities.

Biological and Heritability Factors

Twin studies have demonstrated substantial heritability for writing skills, with genetic factors accounting for a significant portion of variance in performance measures such as writing samples and handwriting fluency. For instance, in a study of adolescent twins, heritability estimates for writing samples reached approximately 60-70%, indicating that genetic influences explain a majority of individual differences beyond shared environmental factors. Similarly, genetic influences on writing development show strong covariation with related cognitive skills, including language processing and reading comprehension, where heritability for these components often exceeds 50%. These findings align with broader research on educational achievement, where twin and adoption studies attribute 50-75% of variance in writing and literacy performance to additive genetic effects, with minimal contributions from shared family environments after accounting for genetics. Biological sex differences also contribute to variations in writing ability, with females consistently outperforming males in writing assessments by medium effect sizes (d ≈ 0.5-0.7), stable across age groups and cultures. This disparity manifests in higher female scores on essay composition, fluency, and editing tasks, potentially linked to neurobiological factors such as differences in brain lateralization and verbal processing efficiency. Evidence from longitudinal data suggests these differences emerge early in development, with girls showing advanced fine-motor control and orthographic skills relevant to handwriting and text production, though males may exhibit greater variability in performance. Genetic underpinnings are implicated, as polygenic scores associated with literacy predict writing outcomes and correlate with sex-specific cognitive profiles. Heritability estimates increase with age for writing-related traits, mirroring patterns in general cognitive ability, where genetic influences rise from around 40% in childhood to over 70% in adulthood due to gene-environment correlations amplifying innate potentials. This developmental trajectory implies that biological factors, including polygenic architectures shared with verbal intelligence, play a causal role in sustained performance disparities observed in writing assessments, independent of environmental interventions. While direct genome-wide association studies on writing are emerging, overlaps with literacy genetics highlight pleiotropic effects from alleles influencing neural connectivity and phonological awareness. These heritable components underscore the limitations of purely environmental explanations for group differences in writing proficiency, as genetic variance persists across diverse populations and contexts.
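
Twin-study heritability estimates of the kind cited here are commonly derived, in the simplest classical model, from monozygotic and dizygotic twin correlations (Falconer's formulas). The sketch below shows that decomposition with hypothetical correlation values; it is a simplification of the structural-equation twin models used in modern studies and the numbers are not taken from any specific dataset.

```python
def falconer_decomposition(r_mz, r_dz):
    """Classical ACE decomposition from twin correlations (Falconer's formulas).

    h2: additive genetic variance, c2: shared environment, e2: nonshared
    environment plus error. A simplification of SEM-based twin modeling.
    """
    h2 = 2 * (r_mz - r_dz)
    c2 = 2 * r_dz - r_mz
    e2 = 1 - r_mz
    return h2, c2, e2

# Hypothetical twin correlations for a writing measure
h2, c2, e2 = falconer_decomposition(r_mz=0.70, r_dz=0.40)
print(f"h2 = {h2:.2f}, c2 = {c2:.2f}, e2 = {e2:.2f}")  # 0.60, 0.10, 0.30
```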

Applications and Societal Impacts

Role in Educational Systems

Writing assessments serve formative and summative functions within K-12 educational systems, enabling educators to monitor student progress, identify skill gaps, and refine instructional strategies. Formative assessments, such as ongoing rubrics, checklists, and peer reviews, provide immediate feedback to support writing development during the learning process, allowing teachers to address misconceptions and scaffold skills like organization and coherence. These tools are integrated into daily classroom practices to foster iterative improvement, with research indicating that structured writing tasks positively influence performance on knowledge and comprehension outcomes. Summative assessments, including state-mandated tests and national benchmarks like the National Assessment of Educational Progress (NAEP), evaluate overall proficiency and inform accountability measures, such as school funding and teacher evaluations. For instance, the NAEP writing assessment, administered digitally to grades 4, 8, and 12, gauges students' ability to produce persuasive, explanatory, and narrative texts, revealing persistent challenges: in 2011, only 27% of 12th graders achieved proficiency or above. These evaluations contribute to curriculum alignment and policy decisions, with empirical evidence showing that classroom-based summative practices can enhance grade-level writing competence when tied to targeted interventions. In broader educational frameworks, writing assessments facilitate student placement, progression, and program evaluation, such as determining eligibility for advanced courses or remedial support. Comprehensive systems, as outlined in state guidelines like Oregon's K-12 framework, emphasize reliable measures for multiple purposes, including self-assessment to build student metacognition. Studies further demonstrate that integrating assessments with writing instruction correlates with improved task management and planning skills, underscoring their role in systemic efforts to elevate literacy outcomes amid documented national deficiencies.

Use in Employment and Professional Screening

Writing assessments are utilized in employment screening for roles requiring effective written communication, such as executive positions, consulting, legal, and technical writing jobs, where candidates may complete tasks like drafting reports, essays, or press releases under timed conditions. These exercises evaluate clarity, grammar, logical structure, and domain-specific knowledge, serving as direct proxies for job demands in producing professional documents. For instance, personnel-selection guidance endorses writing samples as either job-typical tasks or prompted essays to gauge applicants' proficiency in conveying complex information. Such assessments demonstrate predictive validity for job performance when aligned with role requirements, as written communication skills correlate with success in knowledge-based occupations that involve reporting, analysis, and persuasion. Research indicates that proficiency in critical thinking and writing predicts post-educational outcomes, including workplace effectiveness, with work sample tests yielding validities around 0.30-0.50 for relevant criteria like supervisory ratings in administrative roles. Employers like consulting firms incorporate these into case interviews to assess candidates' ability to synthesize data into coherent arguments, reducing reliance on self-reported resumes that may inflate credentials. In professional screening, writing tests must comply with legal standards under the Uniform Guidelines on Employee Selection Procedures, requiring demonstration of job-relatedness and absence of adverse impact unless justified by business necessity. Group differences in scores, often observed between demographic categories on cognitive and writing measures, reflect underlying skill variances rather than inherent test flaws, as validated assessments maintain utility despite disparate selection rates. For example, federal hiring for administrative series positions includes writing components that predict performance but show persistent gaps, prompting validation studies to confirm criterion-related validity over claims of cultural bias. Automated scoring tools are increasingly integrated into screening for scalability, analyzing metrics like coherence and vocabulary in applicant submissions, though human review persists for nuanced evaluation in high-stakes hiring. This approach enhances objectivity but necessitates ongoing validation to ensure scores forecast actual output, as seen in content marketing roles where prompts test editing and persuasive writing under constraints mimicking client deadlines. Overall, these methods prioritize empirical fit over subjective interviews, with meta-analytic evidence supporting their role in identifying candidates who sustain productivity in writing-dependent professions.

International Variations and Comparisons

Writing assessment practices exhibit substantial variation across nations, shaped by educational philosophies, cultural norms, and systemic priorities such as high-stakes testing versus formative feedback. In Western systems like the United States, evaluation often emphasizes process-oriented approaches, including portfolios, peer review, and holistic scoring that prioritizes personal voice, creativity, and revision over rigid structure; for instance, postsecondary admissions historically incorporated optional essay components in tests like the SAT (discontinued in 2021), with scoring focusing on reasoning and communication rather than factual recall. In contrast, many European and Asian systems favor summative, exam-driven models with analytic rubrics assessing content accuracy, logical organization, and linguistic precision. Cross-national studies highlight how these differences influence student outputs: American writers tend toward individualistic expression and inductive argumentation, while French and German counterparts employ deductive structures like the dissertation format (thesis-antithesis-synthesis) in baccalauréat or Abitur exams, which demand extended, discipline-specific essays graded on clarity and evidentiary support during multi-hour sittings. East Asian assessments, exemplified by China's gaokao, integrate writing as a high-stakes component of the National College Entrance Examination, where the Chinese language section allocates up to 60 points to a major essay (typically 800 characters) evaluated on thematic relevance, structural coherence, factual grounding in historical or moral knowledge, and rhetorical eloquence; scoring rubrics prioritize conformity to classical models influenced by Confucian traditions, with top scores (e.g., 45+ out of 60) rewarding balanced argumentation over originality. In Singapore's Primary School Leaving Examination (PSLE), writing tasks require two extended responses (e.g., narrative or situational), externally marked on an A*-E scale using criteria for content development, language use, and organization, reflecting a blend of British colonial legacy and meritocratic rigor that correlates with high international literacy outcomes. These systems contrast with more flexible Western practices by enforcing formulaic genres, potentially fostering mechanical proficiency but limiting divergent thinking; empirical analyses of argumentative essays reveal East Asian students producing more direct claims with modal verbs of obligation, while North American styles favor indirect hedging and reader engagement.
| Country/Region | Key assessment example | Primary criteria | Stakes and format |
| --- | --- | --- | --- |
| United States | SAT/ACT essays (pre-2021); portfolios in K-12 | Reasoning, personal voice, revision | Low-stakes for admissions; process-focused, multiple drafts |
| France | Baccalauréat French exam | Content, clarity, thesis-antithesis structure | High-stakes; one-draft dissertation, written/oral |
| China | Gaokao Chinese essay | Factual accuracy, moral/historical depth, eloquence | High-stakes; 800-character timed essay, analytic rubric |
| Singapore | PSLE English writing | Organization, language accuracy, genre adherence | High-stakes; two extended tasks, external grading |
Such divergences extend to primary levels, where England's Key Stage 2 uses moderated teacher portfolios for narrative and non-fiction genres, yielding outcomes of working at or exceeding expectations, whereas Australia's NAPLAN employs externally scored extended responses on a 10-band scale emphasizing persuasive and imaginative modes. Cross-cultural rhetorical research underscores causal links to instructional norms: high-stakes exam environments cultivate reader-responsible texts with explicit markers, correlating with stronger performance in structure but potential deficits in innovation compared to low-stakes U.S. systems, where empirical gaps in advanced argumentation persist despite emphasis on critical thinking. These patterns reflect deeper causal realities, including early specialization in Europe and Asia versus late tracking in the U.S., influencing the development of writing skills through intensive practice versus broader exposure.

Criticisms and Future Directions

Subjectivity and Cultural Biases in Human Scoring

Human raters in writing assessments, even when trained and using standardized rubrics, demonstrate notable subjectivity, as evidenced by inter-rater reliability coefficients typically ranging from 0.50 to 0.80 across studies, indicating moderate agreement but persistent variability in holistic judgments of traits like argumentation and style. For instance, in EFL composition scoring, exact agreement between raters often falls below 70%, with discrepancies arising from differing interpretations of vague criteria such as "development" or "creativity," which allow personal biases to influence scores. Intra-rater reliability, measuring consistency within the same rater over time, similarly reveals inconsistencies, with generalizability theory analyses showing that rater effects can account for up to 20-30% of score variance in essay evaluations. These findings underscore that, despite calibration sessions, human scoring remains susceptible to subjective factors like fatigue, mood, or halo effects, where an initial impression of one trait skews overall ratings. Cultural biases further compound subjectivity, as raters' backgrounds shape preferences for rhetorical structures, vocabulary, and topics aligned with their own cultural norms, often disadvantaging essays from non-Western or minority perspectives. A 2019 study on writing assessment bias found that raters' language backgrounds and experience levels introduced systematic differences, with native English-speaking raters penalizing non-native rhetorical patterns, such as indirect argumentation common in Asian educational traditions, leading to score disparities of 0.5-1.0 points on 6-point scales. Similarly, research on teacher grading reveals biases favoring students perceived to possess "highbrow cultural capital," such as familiarity with canonical references, which correlates with higher essay scores for those from privileged socioeconomic or ethnic groups, independent of content quality. Ethnic minorities, including African-American and immigrant students, receive lower marks in subjective components like writing due to implicit stereotypes, with experimental designs showing graders assigning 5-10% lower scores to identical essays attributed to minority authors. Analytic rubrics mitigate some bias by breaking down scores into quantifiable traits but fail to eliminate cultural favoritism in interpretive areas like "voice" or "insight," where dominant cultural standards prevail. Efforts to address these issues, such as rater training and multiple independent ratings, improve reliability—yielding adjacent agreement rates above 90% in large-scale assessments—but do not fully eradicate biases rooted in raters' unexamined assumptions. For example, blinding graders to student demographics reduces but does not eliminate foreign-name penalties in essay scoring, with residual positive biases toward "native" styles persisting. Academic sources on these topics, often from education journals, merit caution due to potential institutional incentives favoring narratives of systemic inequity over measurement error, yet empirical data from controlled experiments consistently affirm the presence of both subjectivity and cultural skew in human scoring.

Limitations and Risks of Automation

Automated writing evaluation (AWE) systems, including those powered by large language models, often exhibit limitations in validity despite achieving high correlations with human raters on standard metrics like Quadratic Weighted Kappa (QWK), as these agreements fail to ensure reliability across diverse scenarios such as off-topic responses or nonsensical text. For instance, models trained on datasets like the Automated Student Assessment Prize (ASAP) initially detected 0% of off-topic essays correctly, assigning non-zero scores to irrelevant content, and similarly scored gibberish inputs like repeated random characters with undeserved points. Retraining with targeted adversarial examples improved detection to 55.4% for off-topic cases and 94.4% for gibberish, but baseline models remain vulnerable without such interventions, highlighting over-reliance on surface features like vocabulary and syntax rather than substantive content evaluation. A core risk involves biased scoring patterns influenced by demographic factors, where systems propagate disparities from training data lacking balanced representation across race, gender, and socioeconomic status. Peer-reviewed analyses reveal that shallow and deep learning-based AWE algorithms systematically over- or under-score based on student subgroups, with some models exhibiting positive bias toward certain racial groups while disadvantaging others, such as lower scores for essays implying non-White authors in ChatGPT evaluations. GPT-4o variants show elevated scores for Asian/Pacific Islander-associated essays compared to others, independent of content quality, underscoring how unmitigated training data imbalances exacerbate inequities in high-stakes assessments. Multiple studies confirm these biases persist across vendors, necessitating subgroup-specific audits to avoid perpetuating historical grading disparities. Automation further risks undermining academic integrity and skill development by facilitating undetectable AI-generated submissions and reducing cognitive engagement in writing processes. Systems like e-rater assign inflated scores to AI-produced essays (mean 5.32 versus 4.67 by humans), as they prioritize fluency over original reasoning, enabling test-takers to bypass authentic effort in remote settings. Over-reliance on AWE feedback correlates with diminished student recall of composed content and lower neural activation during writing tasks, potentially stunting critical thinking and revision skills essential for long-term proficiency. While AWE excels at surface-level corrections like grammar, it falters on deep errors in organization and relevance, such as "Chinglish" constructs in EFL contexts or off-subject deviations, fostering superficial improvements without addressing holistic competence. These issues amplify in scalable deployments, where unverified feedback may mislead learners and erode trust in automated metrics for educational decisions.
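
Subgroup audits of the kind these studies call for can be run by comparing machine-human score discrepancies across groups. The sketch below computes the mean signed discrepancy per subgroup on hypothetical data; it is a minimal fairness check, not a full differential-functioning or psychometric bias analysis, and the group labels and scores are illustrative.

```python
from collections import defaultdict

def subgroup_score_audit(records):
    """Mean machine-minus-human score difference per subgroup.

    Each record is (subgroup_label, human_score, machine_score); a positive
    value means the machine scores that group higher than human raters on
    average. A minimal audit, not a differential item functioning analysis.
    """
    diffs = defaultdict(list)
    for group, human, machine in records:
        diffs[group].append(machine - human)
    return {g: sum(d) / len(d) for g, d in diffs.items()}

# Hypothetical scored essays: (subgroup, human score, machine score)
sample = [("A", 4, 4), ("A", 5, 5), ("A", 3, 4),
          ("B", 4, 3), ("B", 5, 4), ("B", 3, 3)]
print(subgroup_score_audit(sample))  # e.g., group A overscored, group B underscored
```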

Prospects for Hybrid Approaches and Empirical Validation

Hybrid approaches to writing assessment combine automated systems, such as machine learning models or large language models, with human evaluation to exploit the strengths of both: the speed and consistency of AI in processing surface-level features like grammar and vocabulary, and human expertise in judging deeper qualities such as argumentative coherence and originality. Typical designs have the AI pre-score essays and route outliers or borderline cases to human reviewers, or blend AI predictions with handcrafted linguistic features in an ensemble model. A 2024 study, for example, combined RoBERTa-generated embeddings with features such as syntactic complexity and achieved higher predictive accuracy than pure deep learning models on benchmark datasets, with quadratic weighted kappa scores above 0.75 against human raters.

Hybrids promise scalability for high-stakes testing, where purely human scoring is resource-intensive: a progressive hybrid model applied to summative assessments reduced human workload by 70-80% while maintaining score reliability comparable to full human grading, as validated in large-scale deployments. They also address AI weaknesses on nuanced traits such as cultural context and persuasive intent by adding human oversight, potentially mitigating biases observed in standalone AI systems, including overemphasis on lexical diversity at the expense of content validity. Emerging research suggests hybrids could improve fairness across demographics; one 2025 analysis of teacher-AI feedback reported better student writing outcomes than AI-only feedback (effect size d = 0.45), attributing the gain to hybrid calibration that narrowed demographic score gaps by 15-20%. Implementation nevertheless requires careful feature selection to avoid amplifying AI errors, such as hallucinated coherence when evaluating generated text.

Empirical validation of hybrids emphasizes metrics such as inter-rater reliability (e.g., Cohen's kappa > 0.80), predictive validity against learning outcomes, and adverse impact ratios across demographic subgroups. A 2025 study of writing feedback found that hybrid feedback (AI-generated drafts reviewed by instructors) produced revision improvements matching human-only feedback (r = 0.68 with pre-post scores) and exceeding AI-alone feedback (r = 0.52), validated through randomized controlled trials on more than 200 essays. Ensemble hybrids combining neural networks with rule-based scoring likewise demonstrated 10-15% gains in holistic trait prediction on TOEFL-like datasets, confirmed through cross-validation against expert human scores from 2018-2023 corpora. Longitudinal studies are still needed to assess sustained validity, since short-term trials may miss drift in AI performance over time; current evidence from peer-reviewed benchmarks supports hybrids for mid- to high-volume assessments but cautions against over-reliance without domain-specific tuning. Ongoing research on LLM-augmented hybrids reports stable scoring of identical essays (consistency above 95%) yet underscores the need for human validation in creative and argumentative writing, where AI validity plateaus below 0.70.
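
The pre-score-then-review workflow described above can be reduced to a simple routing rule: accept the engine's score when it is confident and away from decision boundaries, and queue everything else for a human rater. The sketch below is a hypothetical illustration; the confidence field, thresholds, and cut-score range are assumptions, not features of any particular system.

```python
# Minimal sketch (hypothetical thresholds): hybrid scoring workflow in which an
# automated engine pre-scores every essay and borderline or low-confidence
# cases are routed to human raters.
from dataclasses import dataclass

@dataclass
class AutoScore:
    essay_id: str
    score: float        # engine's predicted score on a 1-6 scale
    confidence: float   # engine's self-reported confidence in [0, 1]

CONFIDENCE_FLOOR = 0.80     # below this, always obtain a human rating
BORDERLINE = (3.5, 4.5)     # scores near a placement cut score get reviewed

def route(pred: AutoScore) -> str:
    """Return 'auto' to accept the machine score, 'human' to queue for review."""
    if pred.confidence < CONFIDENCE_FLOOR:
        return "human"
    if BORDERLINE[0] <= pred.score <= BORDERLINE[1]:
        return "human"
    return "auto"

batch = [
    AutoScore("e1", 5.5, 0.93),   # confident, clear of the cut score -> auto
    AutoScore("e2", 4.0, 0.91),   # borderline despite high confidence -> human
    AutoScore("e3", 2.5, 0.64),   # low confidence -> human
]
for pred in batch:
    print(pred.essay_id, route(pred))
```

Routing only the flagged subset to humans is what makes the workload reductions reported for progressive hybrid models possible, while keeping human judgment on exactly the cases where automated validity is weakest.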

    This study investigates whether ChatGPT can provide stable and fair essay scoring—specifically, whether identical student responses receive consistent ...<|control11|><|separator|>