
Commonsense reasoning

Commonsense reasoning is the capacity of an intelligent system to infer and apply implicit knowledge about everyday physical, social, and psychological phenomena to understand contexts, make decisions, and perform tasks intuitively, much like humans do without explicit instruction. In artificial intelligence (AI), it represents a core challenge, enabling applications in natural language processing, computer vision, robotics, and planning by allowing systems to handle ambiguity, uncertainty, and novel situations. As of 2025, despite further advances in deep learning, models often remain brittle and narrow, frequently failing at intuitive reasoning due to a lack of robust background knowledge about the world.

Recognized as a fundamental problem in AI since the 1950s, commonsense reasoning has seen slow progress compared to other subfields, with early efforts highlighting its complexity in formalizing everyday inferences. Key challenges include representing vast, uncertain knowledge across domains like physics and psychology, handling plausible but defeasible conclusions, and scaling to long-tail rare events without exhaustive data. For instance, systems struggle with tasks like resolving pronoun ambiguities in sentences such as "The city councilmen refused the demonstrators a permit because they feared violence," where context determines the referent.

Approaches to commonsense reasoning span symbolic and neural methods, including large-scale knowledge bases like Cyc, which encodes millions of concepts and over 25 million axioms, and ConceptNet, built via crowdsourcing for relational commonsense. More recent neural techniques, such as generative models like COMET, produce commonsense inferences from text by learning patterns from datasets like ATOMIC, containing 1.3 million if-then event facts. Hybrid efforts emphasize learning from minimal priors to mimic human-like flexibility in novel environments. These developments aim to bridge AI's gaps in contextual understanding, with projects like Delphi achieving 80-92% accuracy in ethical reasoning scenarios.
Ongoing research as of 2025 continues to integrate commonsense reasoning into large language models, though full human-like capabilities remain elusive.

Definitions and Fundamentals

Definitions and Characterizations

Commonsense reasoning refers to the capacity to make inferences based on implicit, everyday knowledge about the world that individuals acquire through natural experience and observation, rather than through formal instruction or specialized training. This form of reasoning enables humans to navigate ambiguous situations by drawing on probabilistic assumptions about physical objects, interactions, human intentions, and temporal sequences, allowing for practical decision-making in unstructured environments. For instance, understanding that most birds can fly but penguins typically cannot involves integrating basic biological and physical knowledge without explicit rules.

Unlike deductive logic, which guarantees conclusions from premises through sound, rule-based inference but often fails to address real-world ambiguities and incomplete information, commonsense reasoning is inherently defeasible and context-sensitive, relying on assumptions that can be revised with new evidence. Deductive systems excel in formal domains like mathematics but struggle with the ambiguity of natural language or physical scenes, where multiple interpretations are possible. A classic example is the sentence "The trophy doesn't fit in the suitcase because it is too big," which requires inferring that "it" refers to the trophy based on spatial commonsense, rather than the suitcase, to resolve the ambiguity — a task that highlights the need for world knowledge beyond syntactic analysis.

The philosophical underpinnings of commonsense reasoning trace back to Aristotle's concept of the practical syllogism, which describes how agents derive actions from general principles and particular circumstances through deliberative reasoning grounded in everyday prudence, bridging theoretical knowledge with practical application. In modern cognitive science, this evolves into views of "naive physics" and "naive psychology," where humans intuitively model physical causality and mental states using qualitative, non-expert representations of the world, such as expecting objects to fall due to gravity or inferring emotions from facial expressions.
These characterizations emphasize commonsense as an innate, modular cognitive faculty shaped by evolutionary and developmental processes. Commonsense reasoning differs fundamentally from domain-specific expertise, which relies on narrow, rule-based systems tailored to particular fields like medicine or chess, often encoded in formal ontologies or algorithms for precise, deterministic outcomes. In contrast, commonsense operates across broad, interdisciplinary scenarios with probabilistic judgments, such as deciding not to pour wine into a broken glass based on intuitive physical and social norms, rather than specialized protocols. This breadth makes it essential for general intelligence but challenging to formalize, as it defies exhaustive enumeration in expert-like databases.

Historical Development

The pursuit of commonsense reasoning in artificial intelligence originated in the late 1950s, when researchers recognized the need for machines to handle everyday knowledge beyond formal logic. In 1959, John McCarthy proposed the "Advice Taker," a hypothetical program designed to manipulate sentences in formal languages to solve problems, incorporating commonsense inferences by accepting advice in natural language and using it to guide deductions. This work laid foundational ideas for knowledge representation, emphasizing how programs could learn and apply implicit rules akin to human intuition. By the late 1960s, challenges in formalizing dynamic environments became evident; McCarthy and Patrick Hayes articulated the "frame problem" in 1969, highlighting the difficulty of specifying what remains unchanged after an action without enumerating irrelevant details, a core obstacle in modeling real-world reasoning.

The 1980s and 1990s saw concerted efforts to build extensive knowledge bases for commonsense, shifting toward symbolic approaches. Douglas Lenat launched the Cyc project in 1984 at the Microelectronics and Computer Technology Corporation (MCC), aiming to hand-code millions of axioms capturing human consensus knowledge to enable inference in diverse domains. By the mid-1980s, robotics research introduced alternative paradigms; Rodney Brooks developed the subsumption architecture in 1986, a layered control system for mobile robots that prioritized reactive behaviors over centralized planning, promoting "embodied intelligence" in which intelligence emerges from direct interaction with the environment rather than abstract representations. These developments underscored debates between knowledge-intensive symbolic methods and behavior-based systems, influencing how commonsense could be operationalized in practical robotics. The 2000s marked a transition to statistical and crowdsourced methods, democratizing knowledge acquisition.
The Open Mind Common Sense (OMCS) project, initiated in 2000 by Push Singh at MIT, collected natural language statements from volunteers to build a broad commonsense knowledge base, gathering over 400,000 entries by 2002 through web-based interfaces. This approach contrasted with manual encoding, leveraging public input for scalability. Benchmarks emerged in the 2010s to evaluate progress; the Winograd Schema Challenge, proposed by Hector Levesque in 2011, tested pronoun resolution requiring world knowledge, using paired sentences to assess non-trivial inference without relying on statistical patterns.

From the late 2010s onward, research integrated generative techniques for commonsense, though debates persisted on their depth. The COMET model, introduced in 2019, used transformer-based architectures pretrained on knowledge graphs like ConceptNet and ATOMIC to generate relational inferences, enabling automatic expansion of commonsense graphs for applications in natural language processing. Critics, including Gary Marcus and Ernest Davis, have argued that even advanced large language models often mimic patterns superficially rather than truly reasoning, a view reinforced in analyses showing persistent failures in robust commonsense tasks despite scaling.

Core Problems in Commonsense Reasoning

The Reasoning Problem

The reasoning problem in commonsense reasoning refers to the computational challenges in developing inference mechanisms that replicate human-like deductions, which often involve uncertainty, defaults, and revisions based on incomplete information. Traditional formal logics, such as first-order logic, are monotonic, meaning that once a conclusion is derived from a set of premises, adding new information cannot invalidate it; however, commonsense reasoning is inherently non-monotonic, as new facts can retract prior assumptions, leading to belief revision. For instance, the frame problem illustrates this difficulty: in modeling dynamic scenarios, it is computationally infeasible to explicitly specify everything that remains unchanged after an action without enumerating irrelevant details, as formal representations would require listing all non-effects for each event.

Key challenges arise from handling vagueness in everyday concepts, reliance on default assumptions for plausible inferences, and abductive reasoning for generating explanatory hypotheses from observations. A classic illustration is the "Tweety bird" example, where one assumes that a bird flies by default unless contrary information (e.g., its being a penguin) indicates otherwise, demonstrating how commonsense defaults permit provisional conclusions that may be overridden. These issues extend to the qualification problem, which questions how to determine the preconditions for an action's success in the face of unforeseen interferences, as exhaustively listing all qualifiers is impractical in real-world contexts.

Theoretical foundations for addressing these challenges include non-monotonic logics, such as Reiter's default logic, which formalizes defaults as rules that apply only if consistent with the overall knowledge base, enabling extensions of theories that support defeasible conclusions. In AI systems, this reasoning problem manifests as the brittleness of rule-based approaches in open-world scenarios, where assumptions about completeness fail, causing systems to draw incorrect inferences or overlook plausible outcomes without exhaustive rules.
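The Tweety example above can be sketched as a minimal default-reasoning rule in the spirit of Reiter's default logic: the default "birds fly" fires only when no known exception blocks it, and the conclusion is retracted as new facts arrive. All names here (`flies`, the fact sets) are illustrative, not from any existing library.

```python
# A minimal sketch of defeasible (default) reasoning: defaults apply only
# when consistent with known facts, so conclusions are non-monotonic.

def flies(animal, facts):
    """Apply the default 'birds fly' unless the facts list an exception."""
    props = facts.get(animal, set())
    if "penguin" in props:
        return False          # known exception overrides the default
    if "bird" in props:
        return True           # default conclusion, held provisionally
    return None               # unknown: no basis for applying the default

facts = {"tweety": {"bird"}}
assert flies("tweety", facts) is True      # provisional default conclusion

facts["tweety"].add("penguin")             # new information arrives...
assert flies("tweety", facts) is False     # ...and the conclusion is retracted
```

The retraction in the last line is exactly what a monotonic logic cannot do: adding a premise changed (rather than merely extended) the set of conclusions.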

The Knowledge Problem

The knowledge problem in commonsense reasoning centers on the immense scope and diversity of implicit knowledge that humans intuitively possess and apply in everyday situations. This encompasses multiple categories, including physical facts such as objects falling due to gravity under normal conditions, social norms like the expectation of reciprocity in greetings, and temporal patterns such as the alternation of day and night cycles. Commonsense also extends to spatial relations, psychological motivations, and causal interactions, forming a vast web of interconnected facts that enable intuitive understanding of the world. Estimates suggest that an adult human holds millions of such pieces of knowledge, far exceeding the scale of any current artificial system, with symbolic resources like Cyc containing over 1.3 million rules yet remaining insufficient for comprehensive coverage.

Acquiring this knowledge poses significant barriers due to its largely implicit and tacit nature, which is rarely explicitly articulated in text or speech, making systematic acquisition challenging. Efforts through expert elicitation often result in incompleteness, as even domain specialists struggle to enumerate the full breadth of everyday assumptions, while crowdsourcing approaches introduce biases such as cultural skews or reporting inconsistencies, where contributors overemphasize salient events and underrepresent mundane ones. For instance, automated extraction from text corpora can yield millions of relations but suffers from sparsity and noise, as text infrequently encodes all relevant commonsense details. These barriers have historically limited the scale of acquired knowledge, with early systems capturing only a fraction of human-level breadth. Representing commonsense knowledge further complicates the problem, requiring formats that balance structure, flexibility, and usability for reasoning.
Ontologies like WordNet provide hierarchical lexical relations—such as hyponymy (e.g., "dog" as a type of "animal")—to model semantic interconnections, enabling inference over word meanings and basic concepts, while more comprehensive systems like Cyc employ formal axioms to capture logical rules. In contrast, flat databases store facts as simple triples (e.g., subject-predicate-object) but falter in handling exceptions and context-dependency, where the same fact may vary by situation, such as "water is wet" applying differently in frozen or gaseous states. Effective representation thus demands hybrid approaches to accommodate variability, ensuring knowledge remains applicable across diverse scenarios without rigid overgeneralization.

Early attempts to address the knowledge problem highlighted the tension between manual curation and automation. The Cyc project, initiated in 1984, relied on hand-crafting tens of thousands of axioms by knowledge engineers to build a foundational knowledge base, aiming to encode millions of human-like rules over decades, though this labor-intensive process proved slow and prone to gaps. Conversely, automated methods emerged to extract knowledge from large text corpora, such as deriving event relations from parsed sentences, offering scalability but at the cost of precision and coverage of non-linguistic intuitions. These pioneering efforts underscored the need for integrated strategies to amass and organize the diverse, context-sensitive knowledge essential for robust commonsense reasoning.
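The contrast between flat triples and context-sensitive facts can be sketched as a toy hybrid store: subject-predicate-object defaults with per-context overrides, illustrating why a context-free fact table overgeneralizes (the "water is wet" case above). The class name and data are illustrative only.

```python
# A toy hybrid representation: flat (subject, predicate) -> object defaults,
# augmented with per-context overrides for exceptions.

class ContextKB:
    def __init__(self):
        self.defaults = {}    # (subj, pred) -> obj, context-free facts
        self.overrides = {}   # (subj, pred, context) -> obj, exceptions

    def add(self, subj, pred, obj, context=None):
        if context is None:
            self.defaults[(subj, pred)] = obj
        else:
            self.overrides[(subj, pred, context)] = obj

    def query(self, subj, pred, context=None):
        # A matching contextual exception takes priority over the default.
        if context is not None and (subj, pred, context) in self.overrides:
            return self.overrides[(subj, pred, context)]
        return self.defaults.get((subj, pred))

kb = ContextKB()
kb.add("water", "state", "wet")                      # generic default
kb.add("water", "state", "solid", context="frozen")  # contextual exception

assert kb.query("water", "state") == "wet"
assert kb.query("water", "state", context="frozen") == "solid"
```

A purely flat triple store would have to pick one value for `("water", "state")` and be wrong in some contexts; the override layer is a minimal version of the "hybrid approaches" the text calls for.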

Applications in Intelligent Systems

Natural Language Processing

Commonsense reasoning plays a crucial role in natural language processing (NLP) by enabling systems to interpret ambiguous linguistic structures that rely on implicit world knowledge. A prominent example is resolving pronoun ambiguity, as demonstrated in Winograd schemas, where subtle contextual cues determine reference. For instance, in the sentence "The city councilmen refused the demonstrators a permit because they feared violence," "they" refers to the councilmen, whereas in "The city councilmen refused the demonstrators a permit because they advocated violence," "they" refers to the demonstrators. These schemas test a model's ability to perform commonsense inference without relying on superficial patterns, highlighting the limitations of statistical correlations alone.

Key applications of commonsense reasoning in NLP include question answering, dialogue systems, and story generation. In question answering, datasets like CommonsenseQA require models to draw on everyday knowledge to select correct answers from multiple choices, such as inferring that perishable food should be stored in a "refrigerator" rather than a cupboard or glove box. This goes beyond factual recall, demanding inference about plausible scenarios. Dialogue systems leverage commonsense to infer user intent and maintain coherent conversations, for example, understanding that a request for "directions to the nearest café" implies a need for walkable routes if the user is on foot. Similarly, in story generation, commonsense ensures the plausibility of event sequences, such as a character seeking shelter during rain rather than continuing an outdoor activity unaffected.

Commonsense reasoning has been integrated into transformer-based models to enhance performance. BERT, introduced in 2018, can be fine-tuned on natural language inference (NLI) tasks that incorporate commonsense elements, improving its ability to recognize entailment relations grounded in real-world assumptions.
Generative models in the GPT series, scaling up to billions of parameters by 2020 and beyond, simulate commonsense through learned statistical patterns but frequently produce hallucinations—plausible yet factually incorrect outputs—due to gaps in explicit knowledge encoding. These integrations yield significant benefits, particularly in improving coherence and reducing errors in tasks like machine translation and summarization. In machine translation, commonsense helps resolve ambiguities arising from cultural or contextual gaps, such as translating idioms that imply physical actions without literal equivalents. For summarization, it fills implicit gaps in source texts, ensuring generated summaries maintain logical flow and avoid implausible inferences. Overall, incorporating commonsense reasoning elevates NLP systems from surface pattern matching to more human-like understanding.
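The Winograd example above can be caricatured in a few lines: resolution hinges on a plausibility judgment about who is likely to "fear" versus "advocate" violence, not on syntax. The hand-coded score table below is an illustrative stand-in for the world knowledge that real systems must learn at scale; the schema and all scores are assumptions for the sketch.

```python
# Toy Winograd-style pronoun resolution: score each candidate referent
# against a tiny plausibility table pairing referents with verbs.

PLAUSIBILITY = {
    ("councilmen", "feared"): 0.9,        # authorities plausibly fear unrest
    ("demonstrators", "feared"): 0.2,
    ("councilmen", "advocated"): 0.1,
    ("demonstrators", "advocated"): 0.8,  # protesters plausibly advocate
}

def resolve_pronoun(candidates, verb):
    """Pick the referent whose pairing with the verb is most plausible."""
    return max(candidates, key=lambda c: PLAUSIBILITY.get((c, verb), 0.0))

cands = ["councilmen", "demonstrators"]
assert resolve_pronoun(cands, "feared") == "councilmen"
assert resolve_pronoun(cands, "advocated") == "demonstrators"
```

The two sentences are syntactically identical; only the entries in the plausibility table flip the answer, which is precisely why Winograd schemas resist purely statistical surface cues.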

Computer Vision

Commonsense reasoning in computer vision extends beyond mere object detection and classification to enable models to infer implicit properties and interactions in visual scenes, such as recognizing a partially occluded mug as graspable based on its affordances like a handle and typical use. This capability draws from human-like understanding of object functionalities, where visual cues trigger expectations about potential actions, even when parts of the object are hidden. For instance, relational affordances allow systems to predict how one object might support or block another, facilitating search tasks in cluttered environments. Similarly, intuitive physics reasoning empowers models to anticipate physical events, such as predicting the curved trajectory of a rolling ball under gravity, by simulating basic laws like gravity and collision without explicit training on every scenario.

In applications like visual question answering (VQA), commonsense reasoning is crucial for tasks that demand world knowledge, such as answering "Why is the stove on?" by inferring ongoing cooking activity from visual cues like utensils and steam, rather than relying solely on literal image content. Scene graph generation further leverages relational commonsense to construct structured representations of images, capturing not just objects but their plausible interactions—e.g., a person "holding" a "cup" implies containment and support relations informed by everyday physics and social norms. These applications highlight how commonsense bridges low-level visual features with high-level semantic understanding, improving robustness in dynamic scenes.

Key techniques integrate visual backbones like convolutional neural networks (CNNs) with external knowledge structures; for example, the Visual Commonsense Reasoning (VCR) benchmark from 2019 combines CNNs for feature extraction with knowledge graphs to evaluate multiple-choice questions about movie scenes, requiring reasoning over visual evidence and rationales.
In the 2020s, multimodal large language models (LLMs) have advanced this by incorporating commonsense prompts into vision-language models like CLIP, where textual queries augmented with world knowledge enhance zero-shot performance on tasks like attribute prediction. These methods disentangle perception from reasoning, using cascaded architectures to inject commonsense priors during inference. By leveraging world models that encode intuitive physics and object affordances, commonsense reasoning addresses challenges in zero-shot recognition, reducing errors in novel scenarios—such as identifying unseen actions—through generalized knowledge transfer, as seen in frameworks that fuse visual embeddings with commonsense graphs. This approach mitigates brittleness in traditional systems, enabling more reliable predictions in unstructured environments without extensive retraining.
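The cascaded "perception then commonsense" pattern described above can be sketched as a post-processing step: a detector emits labels, and a separate affordance table (a hand-made stand-in for a commonsense graph such as ConceptNet) maps each label to plausible interactions. All table entries and names are illustrative assumptions.

```python
# Sketch: injecting commonsense affordance priors after perception.
# The detector is stubbed out as a plain list of labels.

AFFORDANCES = {
    "mug": {"graspable", "container"},
    "knife": {"graspable", "cutting"},
    "table": {"support"},
}

def infer_interactions(detections):
    """Map detected labels to plausible interactions via the prior table."""
    return {label: AFFORDANCES.get(label, set()) for label in detections}

out = infer_interactions(["mug", "table"])
assert "graspable" in out["mug"]
assert "support" in out["table"]
```

Keeping the prior table outside the detector is the point of the cascade: the visual model can stay frozen while the commonsense layer is extended or corrected independently.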

Robotics and Physical Manipulation

In robotics, commonsense reasoning plays a critical role in enabling physical manipulation by allowing robots to interpret object affordances and anticipate action outcomes in dynamic environments. Object affordances refer to the potential uses or interactions an object supports based on its properties, such as recognizing that a handle affords pulling or twisting to open a door. This understanding draws from foundational concepts in ecological psychology and has been formalized in robotic systems through knowledge bases that link visual attributes to functional possibilities. For instance, early work extracted affordance knowledge from images to reason about object interactions, enabling robots to select appropriate grasping strategies without exhaustive trial-and-error. Similarly, predicting action consequences involves naive physics reasoning, where robots infer effects like the instability of stacking fragile items to avoid collapses, grounded in qualitative simulations of physical laws.

Applications of commonsense reasoning in household robots often focus on manipulation tasks in cluttered, unstructured spaces, where traditional motion planning fails due to unforeseen obstacles. By integrating naive physics, robots can plan grasps that account for object stability and environmental constraints, such as scooping items without scattering them. A notable example is a system using large language models to self-correct during task execution, allowing a robot to recover from perturbations like dropped objects by reasoning about subtasks and physical feasibility. In collaborative robotics, commonsense enables inferring human intentions from gestures, facilitating safe handovers or joint assembly; for example, robots interpret pointing or hesitating motions as cues for assistance, adapting their actions to align with human goals in shared workspaces. Seminal systems like the PR2 robot from the early 2010s incorporated commonsense for task planning, using qualitative reasoning to envision manipulation effects, such as projecting how pushing an object might cause secondary displacements.
This approach improved success rates in everyday scenarios by simulating outcomes before execution. By 2025, advances in embodied AI, such as Figure AI's humanoid robots equipped with vision-language-action models, leverage simulation-based reasoning for general-purpose manipulation, enabling intuitive interactions like navigating household clutter with commonsense predictions of physical interactions. These integrations enhance adaptability in unstructured settings, significantly reducing reliance on repetitive trials and improving efficiency in real-world deployment.
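The fragile-stacking example above can be sketched as a qualitative precondition check run before execution: the planner refuses any ordering that places load on a fragile item. The property set and rule are illustrative assumptions, not a real robotics API.

```python
# Naive-physics sketch: check a qualitative stability rule before stacking.

FRAGILE = {"glass", "egg"}

def safe_to_stack(bottom, top):
    """Qualitative precondition: never place load on a fragile item."""
    return bottom not in FRAGILE

def plan_stack(items):
    """Order items so fragile objects end up on top, bearing no load."""
    return sorted(items, key=lambda x: x in FRAGILE)  # fragile sorts last

order = plan_stack(["glass", "book", "box"])
assert order[-1] == "glass"                                   # glass on top
assert all(safe_to_stack(b, t) for b, t in zip(order, order[1:]))
```

Simulating the rule symbolically before moving anything is the "envision outcomes before execution" pattern the text attributes to systems like the PR2, reduced to its smallest form.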

Achievements in Automated Commonsense

Taxonomic and Ontological Reasoning

Taxonomic and ontological reasoning in commonsense involves structuring knowledge into hierarchies of categories and relations, enabling automated inference about class memberships and properties. This approach leverages upper ontologies and lexical databases to represent is-a relations (hyponymy and hypernymy), allowing systems to derive commonsense facts through inheritance, such as inferring that all instances of a subclass share properties of its superclass.

A seminal achievement is the development of the Suggested Upper Merged Ontology (SUMO) in the late 1990s, which provides a formal upper-level framework for merging diverse knowledge sources into a cohesive hierarchy. SUMO supports inheritance-based inference; for instance, since "Dog" is a subclass of "Animal," properties like "breathes" applicable to animals are automatically inherited by dogs. This structure facilitates commonsense classification by defining approximately 25,000 terms and 80,000 axioms across SUMO, the Mid-Level Ontology (MILO), and domain extensions as of 2025, covering areas such as finance and engineering.

The Cyc knowledge base exemplifies large-scale taxonomic reasoning, featuring a taxonomy of approximately 1.5 million concepts by the late 2010s, enabling complex queries on hyponymy and hypernymy relations. Cyc's ontology allows disambiguation of entities by traversing its hierarchy—for example, distinguishing "bank" as a financial institution versus a river edge based on contextual superclass links—and supports inference over millions of assertions for commonsense queries. Similarly, WordNet, introduced in 1995, offers a lexical database with over 117,000 synsets connected via hypernym/hyponym pointers, capturing everyday commonsense relations like "vehicle" as a hypernym of "car." These systems have demonstrated high accuracy in entity recognition and disambiguation tasks, with WordNet-based semantic similarity measures achieving correlations of up to 0.84 with human judgments in lexical relatedness evaluations, enhancing performance in word sense disambiguation.
In semantic search applications, ontologies like SUMO and Cyc improve precision by 10-20% over baseline methods through hierarchical matching, as seen in knowledge extraction pipelines that resolve ambiguities via subclass alignments. Early scalability challenges in ontology construction were addressed through automated merging techniques in SUMO, which integrate disparate sources like the IEEE Standard Upper Ontology effort into a unified structure without manual intervention for core alignments, enabling handling of millions of entities.
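The inheritance pattern described above ("Dog" is a subclass of "Animal," so dogs breathe) reduces to walking a hypernym chain and accumulating properties. The tiny taxonomy below is an illustrative stand-in, not drawn from any actual SUMO or WordNet release.

```python
# Sketch of inheritance-based inference over an is-a hierarchy:
# a property asserted on a superclass holds for every subclass.

ISA = {"dog": "mammal", "mammal": "animal", "oak": "tree", "tree": "plant"}
PROPERTIES = {"animal": {"breathes"}, "plant": {"photosynthesizes"}}

def inherited_properties(concept):
    """Collect properties along the hypernym chain up to the root."""
    props = set()
    while concept is not None:
        props |= PROPERTIES.get(concept, set())
        concept = ISA.get(concept)   # step up to the superclass, if any
    return props

assert "breathes" in inherited_properties("dog")      # inherited via 'animal'
assert "breathes" not in inherited_properties("oak")  # wrong branch
```

Real ontologies add multiple inheritance and axioms on top of this, but the upward walk with property union is the core mechanism that makes a single assertion at "Animal" cover every animal subclass.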

Action, Change, and Causal Reasoning

Action, change, and causal reasoning in commonsense involves modeling how actions trigger state changes and causal effects in everyday scenarios, enabling systems to predict outcomes like preconditions for events or consequences of interventions. Early foundational work includes the STRIPS (Stanford Research Institute Problem Solver) framework, developed in the early 1970s, which represents actions through preconditions, add lists for new facts, and delete lists for removed facts, allowing automated planning in domains like blocks worlds by simulating sequential state transitions. This approach laid the groundwork for extending symbolic planning to commonsense actions, such as inferring that moving a block requires it to be clear and on a surface.

Building on such systems, the event calculus, formalized in the 1980s, provides a logic-based formalism for reasoning about actions and their effects over time, addressing the frame problem by circumscriptively defining what holds true between events unless specified otherwise. For instance, it models preconditions and effects, such as "flipping a switch causes the light to turn on" if the switch is connected and powered, enabling deduction of state changes from event occurrences. In robotic manipulation tasks, event calculus-inspired methods have achieved high accuracy, often exceeding 90% in simulating physical manipulations like grasping objects without collisions, by integrating causal rules with sensor data.

Modern advances leverage knowledge graphs and neural models for broader commonsense coverage. The ATOMIC 2020 dataset contains over 1.33 million if-then relations across social dimensions like "x wants" or "x causes," facilitating training on everyday causal inferences such as "breaking a glass causes a mess" for simulation in virtual environments. CausalBERT, a 2021 model, injects causal knowledge into pretrained language models via minimal supervision on synthetic triples derived from corpora, outperforming baselines on causal benchmarks by capturing event-effect links in text.
In the 2020s, large language models (LLMs) have integrated these elements for zero-shot causal prediction, conditioning on causal graphs to simulate world models without task-specific training. For example, methods inducing causal structures in LLMs enable zero-shot reasoning on physical tasks, achieving up to 15% gains over vanilla models on benchmarks like PIQA for predicting action outcomes in novel scenarios. These developments support applications across AI, from narrative generation to safe robotic manipulation, by bridging symbolic precision with learned probabilistic inference.
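The STRIPS action representation described above (preconditions, add list, delete list) can be sketched directly with Python sets; the blocks-world `pickup` action below is illustrative.

```python
# Minimal STRIPS-style action application: an action transforms a state
# (a set of ground facts) if and only if its preconditions hold.

def apply_action(state, pre, add, delete):
    """Return the successor state, or None if preconditions fail."""
    if not pre <= state:          # every precondition must hold
        return None
    return (state - delete) | add # remove delete list, then add new facts

state = {"clear(A)", "on(A,table)", "handempty"}
# pickup(A): requires A clear, on the table, and a free hand.
succ = apply_action(
    state,
    pre={"clear(A)", "on(A,table)", "handempty"},
    add={"holding(A)"},
    delete={"clear(A)", "on(A,table)", "handempty"},
)
assert succ == {"holding(A)"}
# A second pickup fails: the hand is no longer empty.
assert apply_action(succ, {"handempty"}, set(), set()) is None
```

Chaining `apply_action` over a sequence of actions is exactly the "sequential state transitions" a STRIPS planner searches through; the frame problem shows up as everything the delete list does not mention silently persisting.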

Temporal and Qualitative Reasoning

Temporal reasoning in commonsense systems involves representing and inferring relations between events or states over time without relying on precise numerical timestamps, enabling qualitative descriptions such as "before," "during," or "overlaps." A foundational achievement is James F. Allen's interval algebra, introduced in 1983, which defines 13 mutually exclusive and exhaustive binary relations between time intervals, such as "meets" (one interval ends exactly when the other begins) or "overlaps" (intervals partially coincide). This algebra supports efficient constraint propagation for consistency checking and path finding in temporal networks, forming the basis for many planning and scheduling applications.

Qualitative spatial reasoning (QSR) extends similar principles to spatial domains, focusing on approximate relations like topology, direction, and distance without metric details. The Region Connection Calculus (RCC-8), developed by Randell, Cui, and Cohn in 1992, provides eight basic topological relations between regions, including "disconnected" (no contact), "externally connected" (touching at a boundary), and "inside" (one fully contained in the other). RCC-8 enables tractable inference through its composition table, allowing deduction of implied relations, such as inferring containment from partial overlaps in geographic information systems.

Key systems integrate these formalisms for dynamic simulations. Kenneth D. Forbus's qualitative process theory (QPT), proposed in 1984, models continuous physical changes using qualitative variables (e.g., increasing, steady, decreasing) and processes like heating or fluid flow, predicting behaviors such as a ball accelerating downhill without exact equations. QPT supports envisionment—generating possible state trajectories—for commonsense physics reasoning in domains like engineering design. The Simple Hierarchical Ordered Planner (SHOP), introduced by Nau et al.
in 1999, incorporates temporal constraints from interval algebra into hierarchical task planning, achieving efficient ordering of actions with temporal dependencies in practical planning domains. Applications span narrative analysis and physical simulation. In narrative analysis, Allen's relations facilitate timeline construction from stories, ordering events like "the character arrives before the meeting starts" to infer plot coherence. QPT and RCC-8 enable approximate physics predictions, such as simulating object trajectories in virtual environments where a ball rolls downhill due to inferred gravitational pull. Recent large language models have advanced temporal question answering, with state-of-the-art systems like GPT-4o achieving approximately 79% exact match accuracy on the EvolveBench benchmark for temporal reasoning tasks as of 2025, demonstrating improved integration of qualitative temporal reasoning.
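A few of Allen's 13 relations can be computed directly from endpoint comparisons, as the sketch below shows for intervals given as `(start, end)` pairs with `start < end`; only a subset of the algebra is covered, and the function name is illustrative.

```python
# Sketch of a subset of Allen's interval relations via endpoint tests.

def allen_relation(a, b):
    """Classify some of Allen's relations between intervals a and b."""
    if a[1] < b[0]:
        return "before"            # a ends strictly before b starts
    if a[1] == b[0]:
        return "meets"             # a ends exactly when b starts
    if a[0] < b[0] < a[1] < b[1]:
        return "overlaps"          # partial coincidence
    if b[0] < a[0] and a[1] < b[1]:
        return "during"            # a strictly inside b
    if a == b:
        return "equal"
    return "other"                 # remaining relations omitted in this sketch

assert allen_relation((1, 2), (3, 4)) == "before"
assert allen_relation((1, 3), (3, 5)) == "meets"
assert allen_relation((1, 4), (3, 6)) == "overlaps"
assert allen_relation((4, 5), (3, 7)) == "during"
```

The full algebra additionally needs the inverse relations plus "starts" and "finishes," and its real power comes from the composition table, which propagates constraints between intervals whose endpoints are never observed numerically.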

Persistent Challenges

Representation and Acquisition Difficulties

Representing commonsense knowledge encounters significant hurdles due to the inherent ambiguity in multi-modal information, where visual, textual, and contextual cues often conflict or require abductive inference to resolve. For instance, in multimodal scenarios, models struggle with atypical image-text pairs, such as a common image leading to an uncommon outcome, frequently defaulting to high-probability stereotypes rather than nuanced reasoning. This ambiguity extends to cultural variations in social norms, where what constitutes "polite behavior" or "appropriate interaction" differs across societies; for example, AI systems trained on predominantly Western data may misinterpret gestures like the Indian head wobble as uncertainty rather than affirmation, leading to culturally insensitive outputs. Additionally, the scalability of knowledge graphs exacerbates incompleteness, as these structures, operating under an open-world assumption, rarely encode negative or exceptional knowledge comprehensively; resources like ConceptNet and ATOMIC, with average entity degrees far below those of factual graphs, omit vast implicit relations, such as common activities for entities like dogs (e.g., petting) or negated capabilities (e.g., "a dog cannot fly").

Acquisition of commonsense knowledge faces biases inherent in crowdsourced datasets, which often overrepresent Western, Educated, Industrialized, Rich, and Democratic (WEIRD) populations, skewing representations toward majority cultural perspectives and underrepresenting global diversity. Studies of knowledge bases like ConceptNet reveal social biases, such as gender stereotypes in relational inferences (e.g., associating "woman" more with domestic roles), originating from both source corpora and contributor demographics. Separately, analysis of GenericsKB shows up to 38.6% of facts embedding such biases.
Privacy concerns further complicate extraction from user interactions, as conversational systems aggregate and repurpose personal disclosures for training without transparent consent, risking harms like secondary use for model refinement or behavioral profiling; users report unease over the linkability of shared information to real identities, potentially enabling identification or surveillance. Specific challenges arise in handling exceptions to general rules, where commonsense must reconcile generics like "birds can fly" with counterexamples such as penguins or newborn birds that cannot. Knowledge bases inadequately enumerate these exceptions, leading to overgeneralization in reasoning; a system generating instantiations for generics demonstrates improved accuracy (12.8 points over GPT-3 baselines) by explicitly modeling when rules hold or fail, underscoring the need for linguistically informed representations. Critiques as of 2025 highlight how large language models (LLMs) amplify errors when extracting knowledge, prioritizing "helpfulness" over accuracy and generating false inferences that propagate biases and hallucinations through downstream applications, including in commonsense tasks. These difficulties result in inconsistent reasoning across domains, where incomplete or biased representations yield unreliable outputs in culturally diverse or exceptional scenarios, hindering robust AI deployment in real-world settings.

Scalability and Computational Limitations

Commonsense reasoning systems encounter significant computational costs when operating over large knowledge bases, as inference often involves exponential search through vast axiom sets. Automated theorem provers, for example, applied to ontologies like Adimen-SUMO with approximately 8,000 axioms, suffer from high runtime demands due to combinatorial explosion in proof exploration, rendering full-scale reasoning impractical without heuristic reductions. These costs are particularly acute in probabilistic models, where the curse of dimensionality generates high-dimensional state spaces; in robotic navigation tasks, even modeling 30 locations with binary occupancy and lighting variables can yield approximately $2^{65}$ states, necessitating approximation techniques to maintain feasibility. In dynamic environments such as robotics, scalability issues intensify due to real-time requirements, where action predictions must occur within milliseconds to ensure safe interaction. The frame problem, which involves efficiently delineating what remains unchanged after an action, is amplified in these settings, as rapid environmental shifts demand constant re-evaluation of irrelevant fluents, leading to prohibitive computational overhead without specialized approximations. For instance, low-level robotic control loops, including torque adjustments for commonsense-informed actions, operate on timescales of a few milliseconds, constraining the depth of reasoning possible during execution. Large language models (LLMs), increasingly used for commonsense tasks, face inherent limits from their attention mechanisms, which scale quadratically (O(n^2)) with sequence length, restricting the depth of multi-step reasoning in complex scenarios. Studies from 2025 highlight that this quadratic complexity contributes to performance collapse beyond medium complexity levels, where reasoning effort peaks and then declines despite available compute, undermining reliable inference over extended contexts.
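A back-of-envelope check of the state-space figure above, under the assumed reading that each of the 30 locations carries two binary variables (occupancy and lighting) and the robot itself occupies one of the 30 locations:

```python
import math

# Rough state-space count for the robotic navigation example:
# 30 locations x 2 binary variables each = 60 bits of environment state,
# multiplied by 30 possible robot positions.
locations = 30
binary_vars_per_location = 2                  # occupied? lit?
joint_configs = 2 ** (locations * binary_vars_per_location)  # 2^60
states = locations * joint_configs            # include robot position

print(f"~2^{math.log2(states):.1f} states")   # ~2^64.9, i.e. roughly 2^65
```

Since 30 ≈ 2^4.9, the total is about 2^64.9, matching the "approximately 2^65" figure in the text.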
Additionally, handling vast knowledge stores and updating them without full retraining poses challenges, as lifelong editing methods struggle with reliability and generalization at scale, often requiring selective forgetting or external augmentation that introduces further latency. These limitations have broader impacts on deployment, particularly restricting commonsense reasoning on resource-constrained edge devices like mobile robots, where power and memory budgets preclude large-scale models, forcing trade-offs of accuracy for low-latency operation amid real-world variability.

Approaches and Techniques

Symbolic and Knowledge-Based Methods

Symbolic and knowledge-based methods in commonsense reasoning rely on explicit representations of knowledge through formal logics and structured ontologies, enabling rule-based inference without reliance on statistical patterns. These approaches emphasize hand-crafted rules and hierarchies to model everyday phenomena, such as object properties, causal relations, and taxonomic structures. Core techniques include logic programming, exemplified by Prolog, which facilitates unification and resolution for deriving commonsense inferences from declarative rules. In Prolog, predicates define facts and rules, allowing inference in scenarios like planning actions based on preconditions, such as concluding that a door can be opened if it is unlocked. Description logics (DLs) form another foundational technique, providing a decidable fragment of first-order logic for defining concepts, roles, and axioms in ontologies suitable for commonsense querying. DLs support operations like subsumption (checking whether one concept is a subclass of another) and instance checking, enabling precise queries over knowledge bases, such as determining that "a dog is an animal" through axiomatic definitions. Systems using DLs, such as those based on ALC (Attributive Language with Complements), ensure decidable reasoning for taxonomic hierarchies, though complexity increases with expressive power. Prominent systems illustrate these techniques in action. The Cyc project, initiated in 1984, employs a massive knowledge base of over a million axioms organized into microtheories, context-specific subsets that localize commonsense rules to avoid global inconsistencies during inference. Cyc's inference engine applies forward and backward chaining across these microtheories to perform deductions, such as deriving that "water wets surfaces" from physical properties and causal links. Similarly, ConceptNet, developed in the late 2000s, is a multilingual knowledge graph connecting words and phrases via weighted edges representing commonsense associations like "is-a" or "used-for."
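The precondition example can be illustrated with a minimal forward-chaining loop in Python, standing in for Prolog-style resolution; the facts, the single rule, and all names are invented for the sketch.

```python
# Toy forward chaining over ground facts and Horn-style rules:
# repeatedly fire rules whose premises are all derived, until fixpoint.

facts = {("unlocked", "door1"), ("closed", "door1")}
# Each rule: (set of premises, conclusion). Ground atoms only, for brevity.
rules = [
    ({("unlocked", "door1"), ("closed", "door1")}, ("can_open", "door1")),
]

def forward_chain(facts, rules):
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived

print(("can_open", "door1") in forward_chain(facts, rules))  # True
```

A real logic-programming engine adds variables and unification on top of this loop; the fixpoint iteration itself is the core of rule-based derivation.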
Reasoning in ConceptNet involves graph traversals, such as path-finding algorithms that infer multi-hop relations by linking two concepts through chains of "causes" and related edges. These methods excel in interpretability, since rules and derivations are human-readable, allowing inspection of reasoning steps, and in correctness within closed domains where knowledge is exhaustively encoded. For instance, deductive closure in taxonomic queries ensures all implied subclass relations are computed, supporting reliable inheritance of properties like "mammals nurse their young." From 2020 to 2025, symbolic approaches have evolved toward tighter integration with neural components, a progression that emphasizes scalable knowledge management without compromising logical rigor.
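The multi-hop traversal idea can be sketched as breadth-first search over a ConceptNet-style edge list; the concepts and edges below are invented toy data, not actual ConceptNet contents.

```python
from collections import deque

# Toy ConceptNet-style graph: (source, relation) -> list of targets.
edges = {
    ("rain", "causes"): ["wet ground"],
    ("wet ground", "causes"): ["slipping"],
    ("umbrella", "used_for"): ["staying dry"],
}

def find_path(start, goal):
    """BFS returning a list of (src, relation, dst) hops from start to goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for (src, rel), dsts in edges.items():
            if src != node:
                continue
            for dst in dsts:
                if dst not in seen:
                    seen.add(dst)
                    queue.append((dst, path + [(src, rel, dst)]))
    return None

print(find_path("rain", "slipping"))
# [('rain', 'causes', 'wet ground'), ('wet ground', 'causes', 'slipping')]
```

Each returned hop is human-readable, which is exactly the interpretability property the text attributes to symbolic traversal.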

Learning-Based and Statistical Methods

Learning-based and statistical methods in commonsense reasoning leverage large-scale data to infer implicit knowledge through probabilistic patterns, contrasting with rigid symbolic rules by emphasizing empirical learning from corpora. These approaches typically involve supervised techniques, where models are fine-tuned on annotated datasets to recognize commonsense relations, such as entailment between everyday scenarios. For instance, fine-tuning BERT on datasets like CommonsenseQA enables the model to predict plausible answers to questions requiring implicit world knowledge, achieving accuracies around 56% on multiple-choice tasks by learning contextual associations. Unsupervised methods, meanwhile, extract relational similarities from unlabeled text using embeddings; GloVe vectors, trained on global word co-occurrence statistics, capture analogies like "king - man + woman ≈ queen," which reflect basic commonsense relations such as gender and royalty hierarchies. Key models exemplify these techniques' evolution toward scalable inference. COMET, introduced in 2019, adapts transformer-based language models fine-tuned on structured knowledge graphs like ATOMIC to generate event predictions, producing tuples such as "PersonX buys a car → wants to travel," thereby extending commonsense knowledge bases with millions of inferred relations via autoregressive generation. By 2025, large language models like GPT-5 variants, pretrained on web-scale corpora exceeding trillions of tokens, demonstrate enhanced implicit reasoning, outperforming predecessors on benchmarks involving causal and social inference by integrating patterns from diverse textual sources without explicit supervision. A primary advantage of these methods is their scalability, allowing extraction and utilization of millions of probabilistic facts from vast datasets, far beyond what manual knowledge engineering can achieve.
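The embedding-analogy pattern can be demonstrated with hand-crafted 3-dimensional vectors; these are illustrative stand-ins, not trained GloVe embeddings, and the cosine-nearest-neighbor lookup is the only mechanism borrowed from real practice.

```python
import math

# Toy word vectors: dimensions loosely encode (royalty, male, female).
vocab = {
    "king":  (0.9, 0.8, 0.1),
    "queen": (0.9, 0.1, 0.8),
    "man":   (0.1, 0.9, 0.1),
    "woman": (0.1, 0.1, 0.9),
    "apple": (0.5, 0.5, 0.5),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c):
    """Return the vocabulary word nearest to vec(a) - vec(b) + vec(c)."""
    target = tuple(x - y + z for x, y, z in zip(vocab[a], vocab[b], vocab[c]))
    candidates = (w for w in vocab if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(target, vocab[w]))

print(analogy("king", "man", "woman"))  # queen
```

With real GloVe vectors the same arithmetic recovers many such relational regularities directly from co-occurrence statistics, without any symbolic rules.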
For example, Bayesian networks capture commonsense defaults through conditional probabilities, modeling scenarios like "a bird typically flies unless evidence suggests otherwise," enabling efficient inference over uncertain everyday assumptions with parameters learned from data. Despite these strengths, drawbacks persist in achieving deep comprehension, as models often rely on matching of surface patterns rather than genuine causal understanding. Critiques in 2025 highlighted that even advanced LLMs like GPT-5 exhibit superficial commonsense, failing on novel scenarios that require true abstraction, such as maintaining temporal or physical consistency, due to over-reliance on statistical correlations from training data. This limitation is underscored by evaluations on physical commonsense benchmarks, where statistical methods prioritize corpus-derived probabilities over the rigid guarantees of symbolic rules.
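The bird-flight default can be made concrete with a two-node probabilistic model; all probabilities below are invented for the sketch.

```python
# Illustrative probabilistic commonsense default: birds typically fly,
# but evidence of an exceptional subclass (penguin) overrides the default.

p_species = {"typical_bird": 0.98, "penguin": 0.02}       # prior over kinds
p_flies_given = {"typical_bird": 0.95, "penguin": 0.01}   # conditional

# Marginal belief that an arbitrary bird flies, absent other evidence:
p_flies = sum(p_flies_given[s] * p_species[s] for s in p_species)
print(round(p_flies, 3))         # 0.931 -- the "default" belief

# Observing "it is a penguin" conditions on the species and
# collapses the default:
print(p_flies_given["penguin"])  # 0.01
```

This is the appeal the text notes: known conditional frequencies set the rule strengths directly, and new evidence revises the conclusion without any explicit exception list.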

Hybrid and Neuro-Symbolic Methods

Hybrid and neuro-symbolic methods in commonsense reasoning integrate symbolic logic with neural learning to overcome the limitations of purely symbolic rigidity and neural black-box opacity. These approaches embed logical structures, such as rules and theorems, directly into neural architectures, enabling differentiable reasoning over knowledge representations. A seminal example is the Neural Theorem Prover (NTP), introduced in the 2010s, which constructs neural networks from symbolic rules to perform end-to-end differentiable proving over knowledge bases, facilitating automated inference in relational domains like commonsense facts. Building on this, Logic Tensor Networks (LTNs), proposed in 2017, extend neuro-symbolic paradigms by representing logical formulas as neural computations with tensor-based, fuzzy semantics, allowing probabilistic reasoning over uncertain knowledge. LTNs support tasks involving semantic interpretation and relational reasoning, where symbolic constraints guide neural optimization to align with commonsense priors. Recent frameworks exemplify the evolution toward practical commonsense applications. The Scallop language, released in 2023, enables neurosymbolic programming by combining Datalog-like rules with neural modules, supporting recursion and aggregation for tasks like visual question answering in commonsense scenarios. Scallop achieves high data efficiency, training models with 50 episodes to reach 99.4% accuracy on benchmarks, compared to 84.9% for baselines requiring 50,000 episodes. In 2025, further advances blend neuro-symbolic reasoning with multi-modal observations, converting sensory data into symbolic representations for modular commonsense reasoning in robotics and embodied agents, improving generalizability across domains. These methods address key challenges by enhancing causal and temporal reasoning in commonsense tasks; for instance, neuro-symbolic theorem provers in conversational settings outperform neural-only models at inferring implicit goals on benchmarks.
The integration mitigates symbolic brittleness through neural adaptability while providing interpretable proof paths via symbolic traces, reducing opacity in decision-making. As of 2025, neuro-symbolic approaches see growing adoption for trustworthy AI, fulfilling desiderata like explainability and robustness as outlined in analyses of trustworthiness criteria.
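The differentiable-semantics idea behind LTN-style systems can be sketched with scalar fuzzy truth values; the product t-norm and Reichenbach implication below are one common choice of operators, and the degrees are invented.

```python
# Sketch of differentiable logical semantics: truth values live in [0, 1],
# so logical operators become smooth functions amenable to gradient descent.

def t_and(a, b):
    """Fuzzy conjunction (product t-norm)."""
    return a * b

def t_implies(a, b):
    """Reichenbach implication: 1 - a + a*b."""
    return 1 - a + a * b

bird_tweety = 0.9   # degree to which "Tweety is a bird" holds
flies_tweety = 0.2  # degree to which "Tweety flies" holds

# Satisfaction of the rule bird(x) -> flies(x) for Tweety:
print(round(t_implies(bird_tweety, flies_tweety), 2))  # 0.28
```

In a full LTN, the truth degrees are outputs of neural predicates, and training minimizes a loss that pushes the satisfaction of each symbolic constraint toward 1, which is how logical knowledge guides neural optimization.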

Evaluation and Benchmarks

Key Datasets and Resources

Foundational datasets have played a pivotal role in advancing commonsense reasoning research by providing structured knowledge graphs and question-answering benchmarks that capture everyday human understanding. One seminal resource is ConceptNet, originally developed in 2004 and expanded in subsequent versions, which forms a multilingual knowledge graph connecting words and phrases with labeled relationships derived from crowdsourced and expert-curated sources; the widely used ConceptNet 5.5 (2017) encompasses 83 languages (with at least 10,000 nodes each) and over 21 million edges representing general relational knowledge such as "is used for" or "has subevent." Another key resource is ATOMIC (2019), a knowledge graph of 877,000 if-then relational tuples focused on everyday events and their social implications, organized into nine relation types like "xIntent" (goals) and "xEffect" (outcomes) to support inferential reasoning about sequences of actions. Complementing these, CommonsenseQA (2018) offers a multiple-choice question-answering dataset with around 12,000 questions generated from ConceptNet, designed to test models' ability to draw on diverse commonsense knowledge for disambiguating answers in contexts like object properties or event causes. Domain-specific datasets extend these foundations by targeting particular aspects of commonsense, such as physical or social interactions, enabling more focused evaluation and training. PIQA (2020), for instance, introduces a benchmark for physical commonsense reasoning with over 21,000 multiple-choice questions (including approximately 16,000 for training) about everyday physical interactions, like how to unclog a sink, emphasizing plausible actions in real-world scenarios without requiring expert physics knowledge. Similarly, SocialIQA (2019) provides 38,000 multiple-choice questions probing social norms and emotional reactions in interpersonal situations, such as inferring what a person will do next in a described scenario, to assess understanding of human motivations and consequences.
Recent additions in 2025 incorporate commonsense annotations for dialog flows in task-oriented dialogue datasets, enabling reasoning over contextual implications in multi-turn conversations across various domains. Beyond datasets, essential resources include knowledge bases and annotation platforms that underpin data creation and representation. FrameNet, initiated in the late 1990s, serves as a lexical database of over 1,200 semantic frames describing event structures and participant roles in English, with annotated sentence examples that capture qualitative commonsense about situations like "commerce" or "motion." Crowdsourcing platforms like Amazon Mechanical Turk have been instrumental for scalable annotation, facilitating the collection of diverse commonsense judgments from human workers on tasks ranging from relation extraction to plausibility ratings in datasets like ATOMIC and SocialIQA. The evolution of these resources reflects a shift toward multimodal integration and dynamic generation, addressing limitations of purely textual data. For example, the Visual Commonsense Reasoning (VCR) dataset (2019) extends commonsense to vision-language tasks with 290,000 multiple-choice questions derived from 110,000 movie scenes, requiring inferences about visual scenes, motivations, and rationales like "Why might the person react this way?" More recently, large language models (LLMs) have enabled dynamic updates to commonsense resources, such as generating or augmenting knowledge graphs with inferred relations from prompts, as seen in retrieval-augmented approaches that combine retrieved knowledge with generative reasoning over evolving datasets.
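For concreteness, the two record shapes that recur above, if-then knowledge tuples and multiple-choice QA items, can be mocked up as plain Python structures; the field names are simplified stand-ins, not the datasets' official schemas.

```python
# Simplified illustrations of the two dominant dataset formats.

atomic_tuple = {
    "event": "PersonX buys a car",
    "relation": "xWant",           # ATOMIC-style if-then relation type
    "inference": "to travel",
}

piqa_item = {
    "goal": "keep a houseplant alive",
    "choices": ["water it regularly", "paint its leaves"],
    "label": 0,                    # index of the physically plausible choice
}

def grade(item, predicted_index):
    """Score one multiple-choice prediction against the gold label."""
    return predicted_index == item["label"]

print(grade(piqa_item, 0))  # True
```

Knowledge-graph resources like ATOMIC are consumed as collections of such tuples for generation models, while benchmarks like PIQA and CommonsenseQA reduce evaluation to grading predicted choice indices.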

Metrics and Assessment Frameworks

Commonsense reasoning systems are evaluated with a variety of metrics tailored to the nature of the task, whether it involves classification, generation, or subjective plausibility assessment. For discriminative tasks, which often test selection of the correct inference from multiple options, accuracy and F1-score are fundamental, measuring the proportion of correct predictions and the balance between precision and recall, respectively. These metrics are particularly prevalent in benchmarks like the Winograd Schema Challenge, where binary classification of pronoun resolution requires near-perfect semantic understanding. In generative tasks, such as producing explanations or event continuations, automated scores like BLEU and ROUGE quantify n-gram overlap between model outputs and reference texts, providing an objective proxy for semantic alignment, though they often fail to capture nuanced reasoning. Human judgments remain essential for assessing plausibility and coherence in outputs that defy automated scoring, typically employing Likert scales (e.g., 1-5 ratings for naturalness or plausibility) to gauge how well a response aligns with intuitive human expectations. For instance, in evaluating commonsense narratives, annotators rate the logical flow and factual grounding, revealing gaps in model coherence that metrics like accuracy might overlook. Such judgments correlate strongly with probabilistic margins in multiple-choice setups, where the gap between predicted probabilities indicates confidence in reasoning over rote recall. Prominent frameworks include the Winograd Schema Challenge (WSC), introduced in 2012, which uses binary accuracy on hand-crafted, hard cases to probe coreference resolution requiring world knowledge; humans score around 96%, while early models fell below 50%. The BIG-bench suite (2022) incorporates commonsense subsets evaluated via accuracy, normalized scores across tasks, and human-LLM agreement, emphasizing diverse subtasks like physical reasoning and social norms to test generalization.
More recent 2025 frameworks build on desiderata from Lenat and Marcus, incorporating trustworthiness metrics such as robustness to adversarial perturbations (e.g., input noise testing stability) and verifiability scores for output traceability, extending their 2023 outline of 16 criteria for reliable AI. Advanced metrics address deeper aspects of reasoning quality. Causal faithfulness scores, for example, evaluate whether a model's chain-of-thought explanations causally influence its predictions, using interventions like counterfactual perturbations to measure alignment between rationale and outcome, often revealing unfaithful reasoning in up to 40% of cases. Knowledge coverage metrics, such as the Knowledge Coverage Ratio, quantify the proportion of relevant commonsense dimensions (e.g., temporal, spatial, social) spanned by model inferences, highlighting incompleteness in knowledge graphs or embeddings. Ablation studies in 2025 neural analyses distinguish memorization from reasoning by selectively disabling pathways in models, showing that memorized patterns dominate superficial tasks, with performance drops of 20-30% indicating reliance on mimicry. A notable trend is the shift toward holistic evaluation frameworks that integrate multiple metrics across task suites to assess generalizability, as seen in surveys of over 139 commonsense benchmarks, prioritizing composite scores over isolated task performance to better capture real-world robustness. This approach, exemplified by BIG-bench's multi-task aggregation, underscores the limitations of siloed metrics and promotes benchmarks that simulate persistent challenges like reasoning in dynamic environments.
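The basic discriminative metrics and the probability-margin signal discussed above can be implemented in a few lines; the labels and choice probabilities below are toy data.

```python
# Minimal implementations of accuracy, binary F1, and the margin
# between the top two choice probabilities in multiple-choice evaluation.

def accuracy(golds, preds):
    return sum(g == p for g, p in zip(golds, preds)) / len(golds)

def f1(golds, preds, positive=1):
    tp = sum(g == p == positive for g, p in zip(golds, preds))
    fp = sum(p == positive and g != positive for g, p in zip(golds, preds))
    fn = sum(g == positive and p != positive for g, p in zip(golds, preds))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def margin(choice_probs):
    """Gap between the top two predicted choice probabilities."""
    top, second = sorted(choice_probs, reverse=True)[:2]
    return top - second

golds, preds = [1, 0, 1, 1], [1, 0, 0, 1]
print(accuracy(golds, preds))                  # 0.75
print(round(f1(golds, preds), 3))              # 0.8
print(round(margin([0.7, 0.2, 0.1]), 2))       # 0.5
```

A large margin suggests confident discrimination between choices, which is why margin-based analyses are used to separate genuine reasoning from near-tie guessing.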

References

  1. [1]
    [PDF] Commonsense Reasoning and Commonsense Knowledge in Artificial
    Commonsense reasoning is a central challenge in AI, needed for tasks like understanding texts, computer vision, and planning, but progress has been slow.
  2. [2]
    The Curious Case of Commonsense Intelligence - MIT Press Direct
    May 1, 2022 · One of the fundamental limitations of AI can be characterized as its lack of commonsense intelligence: the ability to reason intuitively about ...
  3. [3]
    Common-sense reasoning - Oxford Reference
    Common-sense reasoning is concerned with the understanding and manipulation of information about the everyday world of objects and their interactions.
  4. [4]
    [PDF] A Simple Method for Commonsense Reasoning - arXiv
    Sep 26, 2019 · Table 1: Example of full and partial scoring for the test "The trophy doesn't fit in the suitcase because it is too big." with two reference ...
  5. [5]
    Aristotle's Logic - Stanford Encyclopedia of Philosophy
    Mar 18, 2000 · Aristotle's logic, especially his theory of the syllogism, has had an unparalleled influence on the history of Western thought.
  6. [6]
    Naive Physics: An Essay in Ontology
    The notion of providing an adequate theory of the common-sense world has been taken seriously of late above all by those, such as Patrick Hayes or Kenneth ...Missing: roots | Show results with:roots
  7. [7]
    [PDF] PROGRAMS WITH COMMON SENSE - Formal Reasoning Group
    The advice taker is a proposed program for solving problems by manip- ulating sentences in formal languages. The main difference between it and. 1. Page 2 ...
  8. [8]
    CYC: a large-scale investment in knowledge infrastructure
    Since 1984, a person-century of effort has gone into building CYC, a universal schema of roughly 105 general concepts spanning human reality.
  9. [9]
    [PDF] A Robust Layered Control System for a Mobile Robot
    We call this architecture a subsumption architecture. In such a scheme we have a working control system for the robot very early in the piece as soon as we ...
  10. [10]
    [PDF] Open Mind Common Sense: Knowledge Acquisition from the ...
    OMCS-1 has been running on the web since September. 2000. As of January 2002 we have gathered 400,000 pieces of commonsense knowledge from over 8000 people.Missing: crowdsourced | Show results with:crowdsourced
  11. [11]
    COMET: Commonsense Transformers for Automatic Knowledge ...
    Jun 12, 2019 · We present the first comprehensive study on automatic knowledge base construction for two prevalent commonsense knowledge graphs.
  12. [12]
    Commonsense reasoning and commonsense knowledge in artificial ...
    Abstract. AI has seen great advances of many kinds recently, but there is one critical area where progress has been extremely slow: ordinary commonsense.Missing: critique | Show results with:critique
  13. [13]
  14. [14]
    [PDF] SOME PHILOSOPHICAL PROBLEMS FROM THE STANDPOINT OF ...
    The first is to introduce the notion of frame, like the state vector in McCarthy (1962). A number of fluents are declared as attached to the frame and the ...
  15. [15]
    [PDF] A Logic for Default Reasoning - John Horty
    In this paper we propose a logic for default reasoning. We ... A few results relating the two will be contained in a forthcoming paper (Reiter (1980)).
  16. [16]
    [PDF] Epistemological Problems of Artificial Intelligence - IJCAI
    In (McCarthy and Hayes 1969), we proposed dividing the artificial intelligence problem into two parts - an epistemological part and a heuristic part. This ...
  17. [17]
  18. [18]
    [PDF] The Winograd Schema Challenge
    In this paper, we present an alternative to the Turing Test that has some conceptual and practical advantages. A Wino- grad schema is a pair of sentences ...
  19. [19]
    CommonsenseQA: A Question Answering Challenge Targeting ...
    Nov 2, 2018 · To investigate question answering with prior knowledge, we present CommonsenseQA: a challenging new dataset for commonsense question answering.
  20. [20]
    Guiding Automated Story Generation with Commonsense Reasoning
    May 4, 2021 · We introduce Commonsense-inference Augmented neural StoryTelling (CAST), a framework for introducing commonsense reasoning into the generation process.
  21. [21]
    [1810.04805] BERT: Pre-training of Deep Bidirectional Transformers ...
    Oct 11, 2018 · BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.
  22. [22]
    [2005.14165] Language Models are Few-Shot Learners - arXiv
    May 28, 2020 · Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of- ...
  23. [23]
    [2311.05232] A Survey on Hallucination in Large Language Models
    Nov 9, 2023 · In this survey, we begin with an innovative taxonomy of hallucination in the era of LLM and then delve into the factors contributing to hallucinations.
  24. [24]
    Revisiting Commonsense Reasoning in Machine Translation
    This paper studies commonsense reasoning (CR) in machine translation (NMT), exploring training, evaluation, and challenges in moving beyond pattern recognition.
  25. [25]
    Towards a standard upper ontology - ACM Digital Library
    In this paper we outline the strategy used to create the current version of the SUMO, discuss some of the challenges that we faced in constructing the ontology, ...
  26. [26]
    [PDF] A Large Ontology for the Semantic Web and its Applications
    In this paper we discuss the development and application of a large formal ontology to the semantic web. The. Suggested Upper Merged Ontology (SUMO) (Niles &.<|separator|>
  27. [27]
    The Suggested Upper Merged Ontology (SUMO) - Ontology Portal
    Aug 10, 2025 · The Suggested Upper Merged Ontology (SUMO) and its domain ontologies form the largest formal public ontology in existence today.
  28. [28]
    [PDF] Trusted, Transparent, Actually Intelligent Technology Overview | Cyc
    Jan 29, 2019 · The Knowledge Base comprises: • An ontology of about 1.5 million general concepts (e.g., taxonomically. “placing” terms like eyes, sleep ...Missing: hyponymy hypernymy
  29. [29]
    WordNet: a lexical database for English - ACM Digital Library
    WordNet is an online lexical database designed for use under program control. English nouns, verbs, adjectives, and adverbs are organized into sets of synonyms.
  30. [30]
    [PDF] Evaluating WordNet-based Measures of Lexical Semantic ...
    The quantification of lexical semantic relatedness has many applications in NLP, and many different measures have been proposed. We evaluate five of these ...
  31. [31]
    [PDF] Disambiguation for Semi-Supervised Extraction of Complex ... - Cyc
    In this work, we propose two methods: (i) We discuss how contents of the Cyc knowledge base could be used to design a similarity-based disambiguation scheme to ...Missing: applications | Show results with:applications
  32. [32]
    [PDF] Integrating YAGO into the Suggested Upper Merged Ontology
    This paper discusses how the two worlds can be brought together by combining the high-level axiomatizations from the Standard Upper Merged Ontology. (SUMO) with ...<|control11|><|separator|>
  33. [33]
    [PDF] STRIPS: A New Approach to the Application of .Theorem Proving to ...
    ABSTRACT. We describe a new problem solver called STRIPS that attempts to find a sequence of operators in a spcce of world models to transform a given ...
  34. [34]
    [PDF] The Event Calculus Explained - Department of Computing
    Abstract. This article presents the event calculus, a logic-based formalism for representing actions and their effects. A circumscriptive solution to the ...
  35. [35]
    ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning
    We present ATOMIC, an atlas of everyday commonsense reasoning, organized through 877k textual descriptions of inferential knowledge.Missing: dataset | Show results with:dataset
  36. [36]
    CausalBERT: Injecting Causal Knowledge Into Pre-trained Models ...
    Jul 21, 2021 · CausalBERT captures rich causal knowledge and outperforms all pre-trained models-based state-of-the-art methods, achieving a new causal inference benchmark.
  37. [37]
    [PDF] Maintaining knowledge about temporal intervals
    One other formal approach, currently under development, that is compatible with an interval-based temporal representa-. November 1983 Volume 26 Number 11.
  38. [38]
    [PDF] A Spatial Logic based on Regions and Connection
    In Randell, Cui and Cohn (1992) we allowed atomic regions or atoms to be introduced into the ontology. Atoms were defined as regions with no proper parts, and ...Missing: RCC | Show results with:RCC
  39. [39]
    Qualitative process theory - ScienceDirect
    This paper describes the basic concepts of qualitative process theory, several different kinds of reasoning that can be performed with them, and discusses its ...
  40. [40]
    [PDF] SHOP: Simple Hierarchical Ordered Planner - IJCAI
    SHOP (Simple Hierarchical Ordered Planner) is a domain-independent HTN planning system with the following characteristics. • SHOP plans for tasks in the ...
  41. [41]
    [PDF] Shortcomings of Modern Large-Scale Common Sense Knowledge ...
    incomplete information in these graphs does not allow using CWA. Negative knowledge is arguably more important than positive knowledge in commonsense.Missing: issues incompleteness<|separator|>
  42. [42]
    'That's Just Common Sense'. USC researchers find bias in up to 38.6 ...
    'That's Just Common Sense'. USC researchers find bias in up to 38.6% of 'facts' used by AI · More than a third of those “facts” are biased “ ...
  43. [43]
    [PDF] Where Does Bias in Common Sense Knowledge Models Come From?
    Common sense knowledge bases and models have been shown to embed bias. In this article, we investigate the source of such bias in a knowledge model called.
  44. [44]
    [2205.11658] Penguins Don't Fly: Reasoning about Generics ... - arXiv
    May 23, 2022 · However, they are not universally true -- while sparrows and penguins are both birds, only sparrows can fly and penguins cannot. Commonsense ...
  45. [45]
    When helpfulness backfires: LLMs and the risk of false medical ...
    Oct 17, 2025 · We found that LLMs prioritize learned helpfulness over inherent logical reasoning in our datasets, leading them to generate false information ...
  46. [46]
    [PDF] Beyond LLM-Guided Common-Sense Reasoning for Natural ...
    Sep 16, 2025 · Applying automated theorem provers to large-scale knowledge bases quickly reveals a major chal- lenge: The sheer size of a knowledge base – in ...
  47. [47]
    [PDF] Integrated Commonsense Reasoning and Probabilistic Planning
    Two planning paradigms have been developed for robots that work on such complex tasks: task planning and probabilis- tic planning. Task planning algorithms ...Missing: PR2 | Show results with:PR2
  48. [48]
    [PDF] The Problem with Solutions to the Frame Problem
    McCarthy and Hayes (1969), however, immediately identified the frame problem as the problem of predicting within the situation calculus and without using frame ...
  49. [49]
    Embodied AI Agents: Modeling the World - arXiv
    Jun 27, 2025 · On the one hand we have low-level dynamics—joint torques that change every few milliseconds for robotic actions, while on the other hand we have ...
  50. [50]
    None
    Summary of each segment:
  51. [51]
    Understanding the Limits of Lifelong Knowledge Editing in LLMs
    Mar 7, 2025 · In this work, we aim to bridge research into lifelong knowledge editing to real-world edits at practically relevant scale.Missing: commonsense | Show results with:commonsense
  52. [52]
    Agentic AI Reasoning for Mobile Edge General Intelligence - arXiv
    Sep 27, 2025 · Edge devices operate under strict resource constraints, limiting their computational power and memory for deploying large-scale LLMs with ...
  53. [53]
    Commonsense Reasoning in Prolog
    Generally speaking, Prolog uses the second approach but also has some features of the first approach. The closed world assumption about some pred- icates can ...
  54. [54]
    Logic Programming - MIT Press
    Logic programming is also fundamental to work in artificial intelligence, where it has been used for nonmonotonic and commonsense reasoning, expert systems ...<|separator|>
  55. [55]
    Description Logics as Ontology Languages for the Semantic Web
    In this paper, we describe what description logics are and what they can do for the Semantic Web. Descriptions logics are very useful for defining, integrating ...
  56. [56]
    [PDF] Reasoning and Query Answering in Description Logics - CEUR-WS
    ALC concepts are defined inductively: • Every concept name A ∈ NC is a concept. • > and ⊥ are concepts. • If C is a concept, then ¬C is a concept.
  57. [57]
    [PDF] Cyc - AAAI Publications
    Cyc is a project attempting to build a large common-sense knowledge base, describing its evolution and current state.
  58. [58]
    [PDF] Common Sense Reasoning – From Cyc to Intelligent Assistant
    Mar 2, 1986 · Default assertions can be overridden by new knowledge, whether it comes from a person using Cyc or is derived by Cyc's own inference engine.
  59. [59]
    [PDF] ConceptNet — a practical commonsense reasoning tool-kit
    Commonsense knowledge, thus defined, spans a huge portion of human experience, encompassing knowledge about the spatial, physical, social, temporal, and.
  60. [60]
    [PDF] Grounded Conversation Generation as Guided Traverses in ...
    The traverses in the concept graph are guided by graph attention mechanisms, which derives from graph neural networks to attend on more appro- priate concepts.
  61. [61]
    [PDF] Generating Commonsense Ontologies with Answer Set Programming
    Dec 3, 2020 · This paper presents a non-monotonic method using Answer Set Programming (ASP) to automatically generate commonsense ontologies, supporting ...
  62. [62]
    [PDF] Commonsense reasoning in AI systems
    Mar 25, 2025 · The objective of this research is to determine how commonsense reasoning is relevant to AI and suggest certain.
  63. [63]
    Logical Rule-Based Knowledge Graph Reasoning - MDPI
    In addition to ensuring accurate reasoning, logical rule-based methods also exhibit strong interpretability, facilitating intuitive comprehension of the ...<|control11|><|separator|>
  64. [64]
    Introducing GPT-5 - OpenAI
    Aug 7, 2025 · GPT‑5 is a unified system with a smart, efficient model that answers most questions, a deeper reasoning model (GPT‑5 thinking) for harder ...
  65. [65]
    [PDF] Simple Rules for Probabilistic Commonsense Reasoning
    Bayesian networks (BNs) have the important advantage that known conditional frequencies relating rule antecedents to consequents can be used directly to set ...
  66. [66]
    AI still lacks “common” sense, 70 years later - Marcus on AI
    Jan 5, 2025 · We, Ernie and Gary, have spent many years trying to explain how important commonsense is for AI, and what makes it challenging.
  67. [67]
    The paradox of GPT-5 - by Azeem Azhar and Nathan Warren
    Aug 14, 2025 · LLMs are extraordinary pattern machines, but that does not guarantee they can sustain memory, reason across time or adapt to new environments.
  68. [68]
    Logic Tensor Networks for Semantic Image Interpretation - arXiv
    In this paper, we develop and apply LTNs to two of the main tasks of SII, namely, the classification of an image's bounding boxes and the detection of the ...
  69. [69]
    [2304.04812] Scallop: A Language for Neurosymbolic Programming
    Apr 10, 2023 · Scallop enables users to write a wide range of neurosymbolic applications and train them in a data- and compute-efficient manner.
  70. [70]
    [PDF] Conversational Neuro-Symbolic Commonsense Reasoning
    Because AI commonsense systems lack full coverage, we also present an interactive conversational framework built on our neuro-symbolic system that ...
  71. [71]
  72. [72]
    ConceptNet 5.5: An Open Multilingual Graph of General Knowledge
    Dec 12, 2016 · ConceptNet is a knowledge graph that connects words and phrases of natural language with labeled edges. Its knowledge is collected from many sources.
  73. [73]
    PIQA: Reasoning about Physical Commonsense in Natural Language
    Nov 26, 2019 · In this paper, we introduce the task of physical commonsense reasoning and a corresponding benchmark dataset Physical Interaction: Question Answering or PIQA.
  74. [74]
    SocialIQA: Commonsense Reasoning about Social Interactions - arXiv
    Apr 22, 2019 · We introduce Social IQa, the first largescale benchmark for commonsense reasoning about social situations. Social IQa contains 38,000 multiple ...
  75. [75]
    Context matters in common sense-enhanced task-based dialogue ...
    Jun 15, 2025 · Our approach is different from previous task-based datasets such as Multi-Domain Wizard-of-Oz (MultiWOZ) (Budzianowski et al., 2018), Taskmaster ...
  76. [76]
    [PDF] The Berkeley FrameNet Project - ACL Anthology
    The Berkeley FrameNet project is producing frame-semantic descriptions of several thousand English lexical items and backing up these descriptions with ...
  77. [77]
    From Recognition to Cognition: Visual Commonsense Reasoning
    Nov 27, 2018 · Next, we introduce a new dataset, VCR, consisting of 290k multiple choice QA problems derived from 110k movie scenes. The key recipe for ...
  78. [78]
    [PDF] Multi-mOdal REtrieval Augmented Generative Commonsense ...
    Aug 11, 2024 · It is designed for generative commonsense reasoning tasks involving the composition of discrete concepts into sentences depicting everyday ...
  79. [79]
    ACCENT: An Automatic Event Commonsense Evaluation Metric for ...
    ACCENT is an efficient metric for event commonsense evaluation, which achieves higher correlations with human judgments than existing baselines.
  80. [80]
    [2502.18848] A Causal Lens for Evaluating Faithfulness Metrics - arXiv
    Feb 26, 2025 · Here, we present Causal Diagnosticity, a framework that serves as a common testbed to evaluate faithfulness metrics for natural language ...
  81. [81]
    Beyond Surface Simplicity: Revealing Hidden Reasoning Attributes ...
    Additionally, ReComSBench proposes three new metrics for decoupled evaluation: Knowledge Balanced Accuracy, Marginal Sampling Gain, and Knowledge Coverage Ratio ...
  82. [82]
    [PDF] The Reasoning-Memorization Interplay in Language Models Is ...
    Jul 27, 2025 · In this paper, we adopt memorization as poor reasoning generalizability and propose a novel mechanistic interpretation of the reasoning- ...
  83. [83]
    Benchmarks for Automated Commonsense Reasoning: A Survey
    Language in problems should be natural. Commonsense benchmarks often contain language that is unnatural, stilted or weird; this introduces a confounding factor.<|separator|>