
Winograd schema challenge

The Winograd Schema Challenge (WSC) is a test in artificial intelligence designed to evaluate machine intelligence through pronoun disambiguation tasks. It consists of pairs of sentences that differ in only one or two words, each containing a referential ambiguity—typically involving a pronoun—that is resolved in opposite ways depending on the altered word, requiring the use of world knowledge rather than statistical patterns or shallow heuristics. Proposed in 2012 by Hector J. Levesque, Ernest Davis, and Leora Morgenstern, the challenge draws its name from computer scientist Terry Winograd and serves as an alternative to the Turing test by providing clear, binary disambiguation questions that demand human-like inference without deception or open-ended conversation. The core purpose of the WSC is to probe AI systems' ability to perform default reasoning and apply everyday knowledge, exposing limitations in statistical natural language processing methods that rely on patterns such as selectional restrictions or corpus frequencies. For instance, in the schema "The trophy doesn’t fit in the brown suitcase because it’s too [big/small]," the pronoun "it" refers to the trophy when "big" is used but to the suitcase when "small" is substituted, a resolution that hinges on physical commonsense rather than syntactic clues. Similarly, "If the hammer was thrown at the vase and it broke, what broke?" contrasts with "If the vase was thrown at the hammer and it broke," where "it" shifts reference based on plausibility in each scenario. These examples illustrate how the schemas avoid ambiguity resolution through mere word co-occurrence statistics, instead necessitating deeper understanding of real-world dynamics. Originally introduced at the Thirteenth International Conference on Principles of Knowledge Representation and Reasoning, the WSC gained prominence as a vivid test of AI progress, with Levesque highlighting its potential to drive advances in knowledge representation and reasoning. Ernest Davis compiled a collection of 150 schemas, licensed under Creative Commons Attribution 4.0, which became a foundational dataset for research. 
A related set of over 60 pronoun disambiguation problems was developed by Leora Morgenstern, further expanding the benchmark. The challenge spurred formal competitions, including one at the International Joint Conference on Artificial Intelligence (IJCAI) in 2016—where the top system achieved only 58% accuracy on 60 problems, below human performance—and another offered at the Association for the Advancement of Artificial Intelligence (AAAI) conference in 2018, with a $25,000 prize that went unclaimed. Despite its influence, the WSC has evolved amid ongoing AI developments, with datasets translated into languages such as French, Japanese, and Mandarin Chinese to broaden applicability. Research continued at universities and industrial labs, though commercial sponsorship of the prize ended, shifting focus to broader benchmarks. However, by the early 2020s, large language models achieved near-human or superhuman performance on the WSC, such as over 96% accuracy reported in 2022, leading to debates on whether this reflects genuine commonsense understanding or advanced pattern matching. The challenge remains influential in assessing whether language models can transcend pattern-matching to achieve genuine understanding, influencing fields from knowledge graphs to neural architectures.

Introduction

Definition and Purpose

The Winograd Schema Challenge (WSC) consists of pairs of sentences that differ in only one or two words and contain a pronoun whose referent is ambiguous, with the ambiguity resolved in opposite ways in each sentence of the pair, requiring world knowledge and commonsense reasoning rather than syntactic analysis or statistical patterns from training data. This design ensures that the correct resolution depends on subtle inferences about real-world scenarios, making it difficult for systems trained on large corpora to succeed without genuine understanding. The primary purpose of the WSC is to test AI systems' capacity for human-like reasoning in natural language understanding, serving as a benchmark that avoids the pitfalls of data-driven statistical methods prevalent in contemporary natural language processing, which often exploit superficial correlations rather than deep comprehension. By presenting binary-choice questions about pronoun resolution, the challenge evaluates whether machines can perform reliable, context-sensitive inference without relying on vast amounts of training data or probabilistic shortcuts, thereby highlighting limitations in current AI approaches to language comprehension. As an alternative to broader conversational tests like the Turing test, it emphasizes objective, narrow-scope assessment of inferential abilities. Named after computer scientist Terry Winograd, the challenge draws from schema examples he introduced in 1972 to illustrate the need for background knowledge in language understanding systems. It was proposed by Hector Levesque in 2011 and formalized in a 2012 paper co-authored with Ernest Davis and Leora Morgenstern. A representative example is the pair: "The trophy doesn’t fit in the brown suitcase because it’s too big" versus "The trophy doesn’t fit in the brown suitcase because it’s too small," where the question asks what "it" refers to. In the first sentence, "it" refers to the trophy, as a large trophy would prevent fitting into the suitcase; in the second, "it" refers to the suitcase, as a small suitcase would similarly cause the issue. 
This resolution hinges on commonsense knowledge about object sizes and spatial relations, not grammatical rules or word frequencies.
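The pair structure described above can be made concrete with a minimal data representation. The field names and layout below are illustrative assumptions for this sketch, not an official dataset format:

```python
# A minimal sketch of a Winograd schema pair as a data structure.
# Field names here are illustrative assumptions, not an official format.
trophy_schema = {
    "template": "The trophy doesn't fit in the brown suitcase because it's too {word}.",
    "pronoun": "it",
    "candidates": ("the trophy", "the suitcase"),
    # Swapping the single key word flips the correct referent.
    "answers": {"big": "the trophy", "small": "the suitcase"},
}

def questions(schema):
    """Yield (filled-in sentence, correct referent) for each variant word."""
    for word, referent in schema["answers"].items():
        yield schema["template"].format(word=word), referent

for sentence, referent in questions(trophy_schema):
    print(f"{sentence}  ->  'it' = {referent}")
```

Running this prints both variants with their flipped referents, which is exactly the property that makes the pair a schema rather than two unrelated sentences.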

Relation to Artificial Intelligence Testing

The Winograd Schema Challenge (WSC) represents a targeted benchmark for assessing AI systems' capacity for commonsense reasoning, offering a structured alternative to broader evaluation paradigms that often prioritize conversational fluency over deeper cognitive processes. By presenting binary disambiguation tasks rooted in pronoun resolution, the WSC evaluates whether machines can integrate implicit world knowledge to interpret linguistic ambiguities correctly, thereby serving as a litmus test for progress toward general intelligence. This focus distinguishes it from deception-based assessments, providing quantifiable metrics that track advancements in natural language understanding without the confounding variables of interactive dialogue. A key advantage of the WSC lies in its divergence from benchmarks such as GLUE and SuperGLUE, which aggregate diverse tasks susceptible to exploitation through large-scale corpus correlations and statistical heuristics rather than authentic comprehension. While SuperGLUE incorporates the WSC as a coreference resolution component to probe everyday knowledge and reasoning—evidenced by a substantial performance gap as of 2019, when humans achieved near-perfect scores but top models lagged at around 64% accuracy—the challenge inherently resists shallow pattern matching by design, ensuring that success demands genuine inference over memorized associations. However, as of 2025, leading models have achieved up to 100% accuracy on the SuperGLUE version of the WSC, matching human performance. In essence, the WSC underscores limitations in data-driven approaches, as high scores on GLUE-style tasks often reflect superficial cues rather than the robust understanding required for schema resolution. Theoretically, the WSC is grounded in the notion that systems must actively apply background knowledge—spatial, temporal, and social commonsense—to resolve ambiguities, a capability that transcends narrow algorithmic metrics and aligns with observable human-like thinking. 
Proponents frame it as a "hardcore" test of machine intelligence, one that necessitates non-algorithmic insight and resists solutions via probabilistic correlations, in contrast to "softcore" evaluations like Recognizing Textual Entailment, which permit progress through surface-level inferences. This framing positions the WSC as a critical tool for identifying whether advancements represent true reasoning or mere emulation, thereby guiding research toward more holistic benchmarks.

Historical Development

Origins in Natural Language Understanding

The origins of what would become known as Winograd schemas trace back to foundational research in natural language understanding within artificial intelligence during the early 1970s. Terry Winograd's 1972 book Understanding Natural Language presented the SHRDLU system, an innovative program capable of processing English commands, answering questions, and performing actions in a simulated blocks-world environment. This work emphasized procedural models for parsing and interpreting language, laying the groundwork for context-dependent comprehension in AI systems. Central to SHRDLU was the use of schemas—structured representational frames that integrated syntactic analysis with world knowledge to resolve linguistic ambiguities. These schemas functioned as dynamic templates, drawing on situational context to infer meanings that static rules alone could not capture, such as pronoun reference resolution in discourse. Winograd's approach demonstrated how such frames enabled the system to handle complex interactions, like manipulating virtual blocks based on verbal instructions, by applying layered knowledge. Influenced by emerging ideas in cognitive science, Winograd's schemas modeled human-like inference processes within the constrained blocks-world domain, where the program relied on predefined physical and logical constraints to make sensible interpretations. This reflected a broader shift toward procedural semantics that prioritized meaning over purely formal grammars, illustrating how contextual schemas could bridge language, action, and reasoning in computational models. A classic illustration of pronoun ambiguity in Winograd's framework appears in his discussion of referential challenges: "The city councilmen refused the demonstrators a permit because they feared violence," where "they" unambiguously points to the councilmen due to contextual coherence, versus the variant "they advocated violence," shifting reference to the demonstrators. Such paired examples highlighted the necessity of inferential schemas for true understanding, influencing subsequent research on commonsense reasoning.

Formulation as a Challenge

In 2011, Hector Levesque presented the Winograd Schema Challenge at the AAAI Spring Symposium on Logical Formalizations of Commonsense Reasoning, proposing it as a robust alternative to the Turing test for evaluating machine intelligence. The presentation, titled "The Winograd Schema Challenge," outlined a structured test based on Winograd's earlier ideas, emphasizing tasks that require commonsense reasoning rather than conversational deception or mimicry. Levesque collaborated with Ernest Davis and Leora Morgenstern to refine and formalize the challenge, motivated by the limitations of prevailing statistical methods in natural language processing, which often succeeded through data correlations rather than true understanding. Their work highlighted how large-scale text corpora could inadvertently enable machines to "cheat" on comprehension tasks by exploiting statistical associations, prompting the need for a test resistant to such approaches. The formalization process established strict criteria for creating Winograd schemas to ensure fairness and difficulty for machines while maintaining simplicity for humans. Schemas consist of sentence pairs differing by only one or two words, creating referential ambiguities resolvable only through world knowledge, with no leakage of training-data patterns that could allow statistical models to perform well. The challenge requires systems to select the correct disambiguation for each sentence in a pair, aiming for human-level accuracy where native English speakers achieve nearly 100% success, underscoring the benchmark's design as "easy for humans, hard for machines."

Schema Structure and Mechanics

Formal Description

A Winograd schema consists of a pair of sentences that are identical except for one or two content words, each containing a pronoun whose referential ambiguity is resolved in opposite ways depending on the variant, with resolution requiring commonsense knowledge rather than syntactic or lexical cues. The structure involves two noun phrases denoting distinct entities (the "parties"), a pronoun or possessive that could plausibly refer to either, and a binary disambiguation question where the correct antecedent for the pronoun is the first party in one sentence and the second in the other. This swap is triggered by substituting a "special word" with an "alternate word" that alters the commonsense interpretation, ensuring no overlapping words bias the resolution through superficial patterns like word frequency or selectional restrictions. Key requirements include that the schema must be trivially resolvable by humans, achieving near-perfect accuracy close to 100%, while machine systems without genuine understanding perform at chance levels around 50%. Additionally, the ambiguity cannot be resolved via searchable factual knowledge (i.e., it must be "Google-proof"), and the sentences must avoid any syntactic or semantic clues that could allow statistical or rule-based methods to succeed without deeper reasoning. Logically, a Winograd schema can be represented as a base template S containing a pronoun P and two candidate antecedents N1 and N2, paired with variants V1 (using key word K1, where P refers to N1) and V2 (using key word K2, where P refers to N2). For instance, if K1 is "large," it implies N1 (e.g., a trophy) as the referent due to commonsense fit; if K2 is "small," it implies N2 (e.g., a suitcase). The resolution task, in pseudocode:
Input: schema pair (V1, V2) with pronoun P and candidates N1, N2.
For V1: output the referent of P (correct answer: N1).
For V2: output the referent of P (correct answer: N2).
This tests the system's ability to invert the reference based solely on the variant's implication.
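The structural requirements can also be checked mechanically: two variants sharing a template must differ in one key word, and their correct referents must flip between the two candidate antecedents. The following is a minimal sketch of that check, with illustrative names:

```python
# A sketch of the formal constraints: two variants sharing a template,
# differing in one key word (K1 vs. K2), whose correct referents must
# flip between the candidate antecedents N1 and N2.
def is_valid_schema(n1, n2, k1, k2, answer_k1, answer_k2):
    """Return True if the pair meets the formal flip requirement."""
    distinct_words = k1 != k2                          # special vs. alternate word
    covers_both = {answer_k1, answer_k2} == {n1, n2}   # both parties are used
    flips = answer_k1 != answer_k2                     # the reference inverts
    return distinct_words and covers_both and flips

# The large/small trophy-suitcase schema satisfies the constraints:
print(is_valid_schema("trophy", "suitcase", "large", "small", "trophy", "suitcase"))  # True
# A pair whose answers do not flip is rejected:
print(is_valid_schema("trophy", "suitcase", "large", "small", "trophy", "trophy"))    # False
```

Note that this only verifies the structure; the harder requirements (human triviality, "Google-proofness") cannot be checked programmatically.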

Core Components and Examples

The core components of a Winograd schema consist of a pronoun or possessive adjective that creates referential ambiguity, two plausible alternatives for its referent, and the necessity for specific world knowledge to resolve the ambiguity correctly. The pronoun, such as "he," "she," or "it," refers ambiguously to one of the alternatives, which are typically entities like people or objects introduced earlier in the sentence. Resolution demands non-local reasoning, drawing on background knowledge about physical properties, causality, or social norms, rather than syntactic or local contextual cues. This structure ensures the schema tests commonsense reasoning without relying on shallow statistical patterns. A representative example illustrates these components through causality and physical knowledge. Consider the schema pair:
  • "The man couldn't lift his son because he was so weak. Who was weak?" (Answer: the man)
  • "The man couldn't lift his son because he was so heavy. Who was heavy?" (Answer: the son)
Here, the pronoun "he" ambiguously refers to either the man or the son, the alternatives. The first sentence requires knowledge that lifting failure stems from the lifter's physical weakness, not the object's weight, resolving "he" to the man via causal inference about human capabilities. Replacing "weak" with "heavy" flips the resolution, as heaviness pertains to the lifted object, demanding understanding of physical properties and impossibility conditions. This non-local reasoning integrates event causality beyond the sentence's surface structure. Another schema highlights social norms and emotional responses. The pair is:
  • "Frank felt vindicated when his longtime rival Bill revealed that he was the winner of the competition. Who was the winner of the competition?" (Answer: Frank)
  • "Frank felt crushed when his longtime rival Bill revealed that he was the winner of the competition. Who was the winner of the competition?" (Answer: Bill)
The pronoun "he" ambiguously points to Frank or Bill as alternatives. In the first case, "vindicated" implies relief from doubt, resolved by social knowledge that a rival's revelation confirms Frank's victory, evoking positive emotion. The alternate "crushed" shifts to negative emotion, indicating that the rival (Bill) won, requiring inference about rivalry dynamics and typical emotional outcomes in competitions. This dissection underscores how schemas enforce holistic understanding, as local context alone cannot disambiguate without normative expectations. A third example dissects institutional and social causality:
  • "The town councillors refused to give the angry demonstrators a permit because they feared violence. Who feared violence?" (Answer: the town councillors)
  • "The town councillors refused to give the angry demonstrators a permit because they advocated violence. Who advocated violence?" (Answer: the angry demonstrators)
"They" serves as the ambiguous pronoun, with the alternatives being the councillors or the demonstrators. Resolution in the first sentence relies on knowledge of institutional roles and motivations: officials deny permits to prevent unrest, so "they" (the councillors) fear it. The "advocated" alternate flips the reference to the demonstrators as instigators, drawing on norms of protest dynamics. Each pair demands integrated reasoning across entities and events, avoiding resolution via word co-occurrence or syntax alone.
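The twin structure of these examples also explains why shallow heuristics stall at chance level: any resolver that ignores the variant word gives the same answer to both members of a pair and can get at most one of the two right. A small sketch with the example pairs above, encoded in an assumed (word, candidates, answer) layout:

```python
# The example pairs above, encoded as (variant word, candidates, answer).
# Because each pair's answers flip, a resolver that ignores the variant
# word can get at most one question right per pair, capping it at 50%.
examples = [
    ("weak",      ("the man", "the son"),           "the man"),
    ("heavy",     ("the man", "the son"),           "the son"),
    ("feared",    ("councillors", "demonstrators"), "councillors"),
    ("advocated", ("councillors", "demonstrators"), "demonstrators"),
]

def first_mention_heuristic(candidates):
    """A shallow baseline: always pick the first-mentioned candidate."""
    return candidates[0]

correct = sum(first_mention_heuristic(c) == answer for _, c, answer in examples)
print(f"shallow baseline: {correct}/{len(examples)} correct")  # 2/4, i.e. chance level
```

Any variant-blind heuristic (last mention, most frequent noun, etc.) hits the same 50% ceiling, which is the point of the paired design.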

Strengths and Limitations

Advantages Over Traditional Tests

The Winograd Schema Challenge (WSC) provides distinct advantages over traditional AI benchmarks, such as the Turing test or recognizing textual entailment tasks, by emphasizing robust, human-like reasoning without the pitfalls of subjective evaluation or superficial pattern matching. Introduced as an alternative to the Turing test, the WSC employs pairs of sentences that differ by only one or two words, creating pronoun ambiguity that resolves through world knowledge rather than linguistic heuristics. This design ensures that high performance demands genuine understanding, as opposed to mimicry of conversational fluency in broader Turing-style interrogations. A key benefit is its resistance to data contamination, as the schemas are intentionally "Google-proof"—meaning they cannot be resolved via simple searches or statistical correlations from training corpora, thus preventing AI systems from succeeding through memorization of common patterns. In contrast to datasets prone to leakage from internet-sourced training data, the WSC's novel formulations require integrating disparate knowledge domains, such as physical constraints or social norms, to disambiguate references correctly—for instance, distinguishing whether "they" refers to demonstrators or councillors based on the contextual implications of fearing versus advocating violence. This tests true commonsense, going beyond syntax-focused evaluations that overlook deeper semantic integration. The challenge's efficiency further sets it apart, featuring a compact original dataset of 273 schemas. While historically challenging for AI systems, recent large language models have achieved approximately 94% accuracy as of 2023, approaching human performance, though debates continue on whether this reflects true commonsense understanding. Traditional tests often involve expansive corpora or prolonged interactions, but the WSC's minimal size—combined with its design—allows for focused evaluation of core reasoning capabilities. 
Additionally, its binary resolution format, where systems select between two explicit options for pronoun reference, yields clear yes/no or choice-based scoring that is objective and automatable, eliminating the ambiguity and inter-judge variability common in open-ended tasks.
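Because every question has exactly two explicit candidates, scoring reduces to exact-match accuracy with no human judges involved. A minimal sketch of such an automatable scorer:

```python
def wsc_accuracy(predictions, gold):
    """Binary-choice scoring: the fraction of pronouns resolved to the
    correct candidate. No judges or rubrics needed, unlike open-ended tests."""
    if len(predictions) != len(gold):
        raise ValueError("prediction/gold length mismatch")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

print(wsc_accuracy(["trophy", "suitcase", "man", "son"],
                   ["trophy", "suitcase", "man", "man"]))  # 0.75
```

This objectivity is what makes WSC results directly comparable across systems and years, in contrast to Turing-style judgments.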

Common Pitfalls and Criticisms

One significant pitfall in the Winograd Schema Challenge (WSC) lies in the creation of schemas, which is labor-intensive and prone to inconsistencies. Constructing valid schemas requires careful design to avoid superficial cues or statistical biases that AI systems could exploit without true commonsense understanding, yet ensuring this is challenging; for instance, early efforts by the challenge's proponents yielded only about three schemas per hour, resulting in a limited collection of around 300 examples. Human annotators also exhibit disagreements on resolutions, with inter-annotator agreement rates as low as 90-95% in some evaluations, highlighting subjectivity in determining "correct" answers based on implicit knowledge. Criticisms of the WSC often center on its narrow scope and potential for cultural biases. The challenge primarily tests pronoun disambiguation in contrived twin sentences, which does not encompass the full breadth of commonsense reasoning, such as temporal reasoning or multi-step planning, leading some researchers to argue it measures only a sliver of cognitive abilities rather than general intelligence. Additionally, schemas can embed cultural assumptions, like references to college costs in American contexts, which may not generalize across diverse populations and introduce unintended biases. Gender biases have been particularly noted in derived datasets inspired by the WSC, such as the Winogender schemas, where coreference resolution favors stereotypical gender pairings, potentially affecting model fairness in related tasks. Further critiques highlight the risk of "artifactual" success through statistical shortcuts or data leakage rather than genuine understanding. Approximately 13.6% of schemas in the WSC-273 dataset can be resolved via simple syntactic patterns, violating the challenge's own criteria for requiring world knowledge. Criticisms date back to 2011, when early commentators warned that the task's constraints might encourage clever tricks rather than true comprehension. 
In 2016, the IJCAI competition saw top systems achieve only 58% accuracy, below human levels, prompting discussions of the benchmark's validity. Following the 2020 release of GPT-3, which achieved around 89% accuracy and revealed exploitable shortcuts, critics including Ernest Davis questioned the WSC's ability to distinguish genuine understanding from sophisticated pattern matching. Elazar et al. emphasized that lax validation in schema creation allows biases to persist, potentially overestimating model capabilities. These issues have prompted extensions like WinoGrande (2019, with 44k adversarial examples on which contemporary models performed closer to chance) and WSC+ (2024, where GPT-4 scores 68.7% versus a human baseline of 95.1%), underscoring the original WSC's limitations in robustness and motivating ongoing research into more challenging benchmarks.

Datasets and Benchmarks

Original Winograd Schema Collection

The original Winograd Schema Collection comprises 273 hand-crafted disambiguation problems, organized as 136 schema pairs, developed by Hector Levesque, Ernest Davis, and Leora Morgenstern. Released in 2012 as part of their seminal work introducing the challenge, this collection serves as the foundational benchmark for evaluating commonsense reasoning in AI systems. The schemas were meticulously sourced from diverse materials, including literary works, news articles, and original compositions, to promote variety across domains such as physical scenarios involving objects and spatial relations, as well as social contexts requiring understanding of motivations and interactions. This hand-crafting process ensured that each pair exhibited the core property of the Winograd schema: a minimal textual alteration that reverses pronoun resolution, demanding intuitive world knowledge rather than statistical patterns. To validate the dataset's suitability, extensive human testing was conducted, revealing mean accuracies of 92.1% on hard schemas and 98.6% on easy ones, confirming the problems' resolvability through everyday commonsense. In contrast, contemporary machine baselines, including early systems relying on syntactic or shallow semantic analysis, hovered around 50% accuracy—indistinguishable from random selection between the two candidate referents. The collection has been publicly available since its release, hosted on academic websites such as Ernest Davis's NYU page and later integrated into platforms like Hugging Face Datasets, facilitating widespread use in research and evaluation.

Extensions and Modern Adaptations

Since its introduction, the Winograd Schema Challenge (WSC) has seen several key extensions that expand its scope and scale while preserving its core focus on commonsense reasoning through pronoun disambiguation. One early variant is the Definite Pronoun Resolution (DPR) dataset, which relaxes some of WSC's strict "Google-proof" constraints to include 1,322 training and 564 test examples, totaling over 1,800 schemas, enabling broader training for coreference systems. More recent developments include WinoGrande, a 2020 crowdsourced dataset of 44,000 adversarial problems designed to mitigate biases and increase difficulty beyond the original 273 schemas by using automated filtering techniques like AFLite to ensure non-associative examples. Other variants, such as WinoBias (3,160 examples from 2018) and WinoGender (720 examples from 2018), adapt the schema structure to probe for gender biases in language models through pro- and anti-stereotypical pairs. Multilingual adaptations have broadened the WSC's applicability to non-English languages, facilitating cross-lingual commonsense evaluation. In 2017, a French collection of 144 Google-proof schemas was created by translating and adapting English examples while maintaining their ambiguity-resolution challenges. Similarly, the 2020 Mandarinograd dataset provides a Chinese corpus of Winograd schemas, emphasizing anaphora resolution in Mandarin through manually constructed pairs that require cultural and linguistic commonsense. Additional translations exist for languages such as Portuguese and Japanese, often involving adjustments for translation-induced ambiguities to preserve the original intent. Integrations into larger benchmarks have embedded WSC variants into comprehensive AI evaluation suites. The Winograd NLI (WNLI) task, derived from WSC examples and reformatted as a natural language inference problem with 634 training instances, was incorporated into the GLUE benchmark in 2018 and, in a recast form, into SuperGLUE in 2019 to assess commonsense reasoning in broader NLP pipelines. 
In 2022, the WinoGrande dataset was integrated as a task in BIG-bench, a collaborative benchmark with over 200 tasks, to test scaling behavior in reasoning on adversarial examples. Elements of WSC-inspired pronoun disambiguation also influence datasets like CommonsenseQA, which extends multiple-choice commonsense questioning to include schema-like ambiguities. Efforts toward automated generation of WSC-like schemas have emerged to address the scalability limitations of manual creation. While early extensions relied on crowdsourcing, 2023 research explored procedural methods for generating diverse problems, such as template-based generation combined with automated filtering to produce non-memorizable examples. These approaches aim to create larger, varied datasets without extensive human effort, though they often incorporate human validation to ensure commonsense necessity. In 2024, updates specifically tailored for large language model (LLM) testing introduced adversarial schemas to counteract the memorization and shortcut learning observed in high-performing models. The EvoGrad framework, for instance, employs a human-in-the-loop evolutionary process to dynamically generate altered WSC instances, yielding datasets where even advanced LLMs like GPT-3.5 achieve only around 65% accuracy due to targeted perturbations that demand robust reasoning. Similarly, the Concept-Reversed WSC (CR-WSC) reverses conceptual mappings in schemas to probe deeper understanding, highlighting vulnerabilities in LLM commonsense application. Through these extensions, the WSC ecosystem has grown dramatically from the foundational 273 schemas to over 44,000 in major crowdsourced efforts like WinoGrande by 2020, with multilingual and adversarial variants continuing to expand the pool through 2025, enabling more rigorous evaluation of AI systems' world knowledge integration.
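The WNLI recast mentioned above turns each schema question into an entailment pair by substituting a candidate noun phrase for the pronoun in the relevant clause. The sketch below illustrates that transformation; the clause-splitting heuristic and field names are assumptions for illustration, not the exact GLUE preprocessing:

```python
def to_wnli(sentence, pronoun, candidate, is_correct):
    """Build a premise/hypothesis pair in the style of the WNLI recast:
    the hypothesis restates the pronoun's clause with a candidate filled in.
    The 'because'-clause heuristic is an illustrative assumption."""
    clause = sentence.split("because")[-1].strip().rstrip(".")
    words = clause.split()
    filled = [candidate if w == pronoun else w for w in words]
    hypothesis = " ".join(filled).capitalize() + "."
    label = "entailment" if is_correct else "not_entailment"
    return {"premise": sentence, "hypothesis": hypothesis, "label": label}

pair = to_wnli("The trophy doesn't fit in the suitcase because it is too big.",
               "it", "the trophy", True)
print(pair["hypothesis"], "/", pair["label"])  # The trophy is too big. / entailment
```

Substituting the wrong candidate ("the suitcase" for the "big" variant) yields the same hypothesis shape with the label `not_entailment`, which is how a binary coreference question becomes a binary inference question.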

Research and Applications

Performance in AI Systems

Early attempts to solve the Winograd Schema Challenge (WSC) relied on rule-based systems and statistical models, which typically achieved accuracies of 50-60% on the original dataset, falling well short of human performance above 90%. For instance, in the inaugural WSC competition held in 2016, the highest-scoring entry reached only 58%, highlighting the limitations of early approaches that struggled with the commonsense reasoning required for disambiguation. The advent of transformer-based models marked significant progress, though initial results still depended on heuristics rather than deep reasoning. In 2019, BERT fine-tuned on related pronoun resolution tasks yielded accuracies around 70-75% on the WSC, demonstrating improvements through transfer learning but underscoring persistent gaps in handling novel ambiguities. A key study from that year explored knowledge injection by combining neural language models with external knowledge hunting, achieving up to 71.1% accuracy and emphasizing the role of retrieved commonsense facts in boosting performance without full reasoning capabilities. The large language model (LLM) era brought scores closer to human levels on the original WSC, with GPT-3 attaining approximately 90% accuracy in zero-shot settings in 2020, though subsequent analyses raised concerns over data leakage from training corpora that included WSC-like examples, leading to inflated results on contaminated benchmarks. By 2023-2025, fine-tuned LLMs reached 94-95% on the original schemas, reflecting scaled architectures' ability to memorize patterns effectively. However, evaluations on novel schemas, such as those introduced in a 2024 study reversing key concepts to disrupt superficial associations, showed sharp drops to below 80% for models like GPT-4 and Llama 3, indicating reliance on statistical shortcuts rather than robust commonsense understanding. 
Despite these advances, no AI system has demonstrated fully robust performance across WSC variants: 2025 benchmarks that stress-test generalization to unseen variations show even top LLMs failing on 10-20% of cases requiring true world-knowledge integration.

Ongoing Developments and Activity

Recent research on the Winograd Schema Challenge (WSC) has increasingly incorporated multimodal elements, extending the traditional text-based task to evaluate vision-language models. In 2024, the WinoVis dataset was introduced as a novel benchmark for probing text-to-image models on pronoun disambiguation in visual contexts, where ambiguous pronouns must be resolved using both textual descriptions and generated images. This approach builds on earlier efforts like Winoground (2022), which tested visio-linguistic compositionality, but WinoVis specifically adapts the WSC format to highlight failures in multimodal grounding. Hybrid neuro-symbolic methods have emerged as a promising direction for addressing limitations in purely neural approaches to the WSC and related commonsense tasks. A 2025 systematic review of neuro-symbolic AI highlights integrations of symbolic reasoning with neural networks to enhance interpretability and robustness in commonsense inference. These hybrid pipelines combine transformer-based encoding with rule-based coreference resolution, offering a pathway to more reliable pronoun disambiguation without relying solely on pattern matching. The WSC continues to influence the design of broader commonsense benchmarks, such as HellaSwag, which scales up adversarial inference tasks inspired by the schema's emphasis on grounded reasoning. Additionally, the challenge is applied in evaluating explainable AI systems, particularly in analyzing how large language models generate and utilize rationales for schema resolutions. A 2025 study examines qualitative differences between human and model explanations on WSC tasks, underscoring its role in probing transparency in AI decision-making. Ongoing activity includes active dataset expansions and methodological innovations presented at major AI conferences. For instance, the 2025 ACL Anthology features papers on concept-reversed WSC variants to test robustness beyond superficial patterns, and enhancements using tree-of-experts architectures for improved schema resolution. 
These developments reflect a shift toward more challenging, dynamic evaluations amid rapid advancements in language modeling.

References

  1. [1]
    [PDF] The Winograd Schema Challenge
    What we propose in this paper is a variant of the RTE that we call the Winograd Schema (or WS) challenge. It requires subjects to answer binary questions ...
  2. [2]
    The Winograd Schema Challenge - NYU Computer Science
    A Winograd schema is a pair of sentences that differ in only one or two words and that contain an ambiguity that is resolved in opposite ways.
  3. [3]
    Commonsense Reasoning ~ Winograd Schema Challenge
    A competition to encourage efforts to develop programs that can solve the Winograd Schema Challenge, an alternative to the Turing Test.
  4. [4]
  5. [5]
    Understanding natural language : Winograd, Terry - Internet Archive
    Nov 13, 2023 · Understanding natural language. vii, 191 pages ; 24 cm. Appeared in the journal, Cognitive psychology 3, no. 1, 1972.
  6. [6]
    [PDF] Understanding natural language | Semantic Scholar
    This article considers the approach to interpreting natural language phrases based on the “Meaning–Text” theory and proposes an intentional dialogue context ...Missing: online | Show results with:online
  7. [7]
    A Collection of Winograd Schemas - NYU Computer Science
    Sep 8, 2011 · A Winograd schema is a pair of sentences that differ in only one or two words and that contain an ambiguity that is resolved in opposite ways in the two ...
  8. [8]
    The Winograd Schema Challenge - AAAI
    Mar 29, 2023 · Papers from the 2011 AAAI Spring Symposium > No. 6: Logical Formalizations of Commonsense Reasoning The Winograd Schema Challenge
  9. [9]
    [PDF] The Winograd Schema Challenge - Commonsense Reasoning
    What we propose in this paper is a variant of the RTE that we call the Winograd Schema (or WS) challenge. It requires subjects to answer binary questions, but ...
  10. [10]
    [PDF] The Winograd Schema Challenge - Semantic Scholar
    This paper presents an alternative to the Turing Test that has some conceptual and practical advantages, and English-speaking adults will have no difficulty ...
  11. [11]
    [PDF] Establishing a Human Baseline for the Winograd Schema Challenge
    In this case, overall accuracy would rise to 94.4%, and if we omit ques- tions with lower than 90% accuracy, overall accuracy rises still further to 96.2%.
  12. [12]
    [PDF] The Defeat of the Winograd Schema Challenge - arXiv
    Jan 24, 2023 · Abstract. The Winograd Schema Challenge—a set of twin sentences involving pro- noun reference disambiguation that seem to require the use of ...
  14. [14]
    Implement the WSC273 Winograd Schemas Challenge evaluation #12
    Sep 16, 2020 · The Winograd Schemas Challenge [LDM12] is a classical task in NLP that involves determining which word a pronoun refers to, when the pronoun ...
  15. [15]
    A Review of Winograd Schema Challenge Datasets and Approaches
    Apr 23, 2020 · This paper reviews existing Winograd Schema Challenge benchmark datasets and approaches that have been published since its introduction.
  16. [16]
    WinoGrande: An Adversarial Winograd Schema Challenge at Scale
    Jul 24, 2019 · We introduce WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the scale and the hardness of ...
  19. [19]
    Semantic noise in the Winograd Schema Challenge of pronoun ...
    Apr 11, 2023 · “The city councilmen refused the demonstrators a permit because they advocated violence.” This foundational WS is presented in Winograd's ...
  21. [21]
    [2410.12040] Concept-Reversed Winograd Schema Challenge - arXiv
    Oct 15, 2024 · We propose a new evaluation dataset, the Concept-Reversed Winograd Schema Challenge (CR-WSC), based on the famous Winograd Schema Challenge (WSC) dataset.
  22. [22]
    Winograd Schema Challenge Results: AI Common Sense Still a ...
    Jul 28, 2016 · A Turing test alternative, the Winograd Schema Challenge aims to determine how well AI handles commonsense reasoning.
  23. [23]
    A Surprisingly Robust Trick for Winograd Schema Challenge - arXiv
    May 15, 2019 · In this paper, we show that the performance of three language models on WSC273 strongly improves when fine-tuned on a similar pronoun ...
  24. [24]
    Combining Knowledge Hunting and Neural Language Models to ...
    Winograd Schema Challenge (WSC) is a pronoun resolution task which seems to require reasoning with commonsense knowledge. The needed knowledge is not present in ...
  25. [25]
    A Visual Twist on the Winograd Schema Challenge - arXiv
    May 25, 2024 · We introduce WinoVis, a novel dataset specifically designed to probe text-to-image models on pronoun disambiguation within multimodal contexts.
  26. [26]
    [PDF] Winoground: Probing Vision and Language Models for Visio ...
    We present a novel task and dataset for evaluating the ability of vision and language models to conduct visio-linguistic compositional reasoning, ...
  27. [27]
    Neuro-Symbolic AI in 2024: A Systematic Review - arXiv
    This paper provides a systematic literature review of Neuro-Symbolic AI projects within the 2020-24 AI landscape, highlighting key developments, methodologies, ...
  28. [28]
    A Hybrid Neuro-Symbolic Pipeline for Coreference Resolution and ...
    This study presents a hybrid neuro-symbolic pipeline that combines transformer-based contextual encoding with symbolic coreference resolution and Abstract ...
  29. [29]
    [PDF] Advanced NLP - Anoop Sarkar
    Winograd Schema Challenge (from GLUE). Page 25. Swag and HellaSwag. Page 26. Swag. Large-scale adversarial dataset for grounded commonsense inference. • Given a ...
  30. [30]
    The Winograd Schema Challenge: A Study of Language Models ...
    Jun 10, 2025 · This thesis aims to examine how LLMs generate and use explanations in the context of the WSC task. It showcases qualitative differences between human and ...
  31. [31]
    [PDF] Concept-Reversed Winograd Schema Challenge - ACL Anthology
    Apr 29, 2025
  32. [32]
    Enhancing The Winograd Schema Challenge Using Tree-of-Experts
    Jun 28, 2025 · The Winograd Schema (WS) challenge has been proposed as an alternative to the Turing Test as a test for machine intelligence. In this short ...