
Question answering

Question answering (QA) is a core task in natural language processing (NLP) that involves developing computational systems to comprehend human-posed questions in natural language and provide accurate, contextually relevant responses, often drawing from structured or unstructured sources. These systems aim to bridge the gap between human inquiry and machine intelligence, enabling applications such as virtual assistants, search engines, and knowledge retrieval tools. The history of QA traces back to the 1960s with early rule-based systems like BASEBALL, which answered queries about baseball statistics using predefined grammatical rules, and LUNAR, designed for lunar rock composition questions. Progress accelerated in the late 1990s through initiatives like the Text Retrieval Conference (TREC), particularly TREC-8 in 1999, which standardized open-domain QA evaluation and spurred research in information retrieval-based approaches. Subsequent shifts incorporated statistical and machine learning methods in the early 2000s and, from the 2010s onward, deep learning, leveraging neural networks to handle linguistic nuances more effectively. QA systems are broadly categorized into three main paradigms: information retrieval-based QA (IRQA), which retrieves and extracts answers from large text corpora; knowledge base QA (KBQA), which queries structured databases like ontologies or graphs; and generative QA (GQA), which produces novel answers using language models without direct extraction. Key benchmarks have driven advancements, including the Stanford Question Answering Dataset (SQuAD) introduced in 2016 for extractive reading comprehension tasks, and TREC datasets for factoid and complex question evaluation. Challenges persist in areas such as ambiguity, multi-hop reasoning, and hallucination, requiring robust handling of diverse question types like factual, opinion-based, or definitional queries. Recent developments, particularly since 2020, have been propelled by transformer architectures and large language models (LLMs) such as BERT (2018) for contextual understanding and GPT-series models (e.g., GPT-3 in 2020, GPT-4 in 2023, and GPT-4o in 2024) for generative capabilities, achieving state-of-the-art performance on benchmarks through techniques like in-context learning and reinforcement learning from human feedback (RLHF). Evaluations have evolved with over 50 new metrics since 2014, including exact match (EM), F1-score for extraction tasks, and learning-based scores like BERTScore for semantic alignment, though human-centric assessments remain essential due to issues like hallucinations in LLMs. Ongoing research as of 2025 emphasizes multilingual QA, multimodal integration (e.g., visual question answering), agentic prompting approaches, and ethical considerations to mitigate biases in responses.

Fundamentals

Definition and Scope

Question answering (QA) is a subfield of natural language processing (NLP) focused on the task of automatically generating answers to questions expressed in natural language, utilizing a knowledge base, document collection, or other information sources to provide relevant and accurate responses. QA systems process the input question to understand its intent, retrieve pertinent information, and formulate an output that directly addresses the query, often in a concise textual form. This capability enables more intuitive human-machine interactions compared to traditional search mechanisms. Originating in artificial intelligence research, QA aims to replicate human-like comprehension of language and knowledge retrieval. The scope of QA includes diverse question formats, such as factoid questions seeking discrete facts (e.g., names, dates, or locations), list questions requiring enumerations of items, and complex questions demanding explanatory or inferential reasoning (e.g., causal or hypothetical scenarios). For instance, a QA system might answer "Who is the president of the United States?" with a specific individual's name, while a complex QA approach could tackle "Why did the stock market crash?" by integrating economic and historical factors into a synthesized explanation. QA differs fundamentally from information retrieval (IR), which returns ranked lists of documents or passages for user review rather than pinpointing exact answers, and from dialogue systems, which sustain multi-turn conversations involving context maintenance and clarification rather than isolated query resolution. These distinctions highlight QA's emphasis on precise answer extraction and synthesis over mere document sourcing or extended interaction.

Key Components

Question answering (QA) systems rely on several core components to process queries and retrieve accurate responses. The primary stages include question analysis, knowledge source access, candidate answer generation, and answer ranking or selection. These components work modularly to transform a user's query into a structured search and refine potential answers for accuracy and relevance. Question analysis begins with parsing the intent, identifying key entities, and discerning relations within the query to determine its focus. This involves breaking down the question into semantic elements, such as the head noun (e.g., "river" in "What is the longest river?") and modifiers, using techniques like part-of-speech tagging or syntactic parsing. Natural language understanding (NLU) plays a crucial role here by interpreting the semantics to pinpoint the question's focus and anticipate the expected answer type, such as a name, date, or explanation. Knowledge source access follows, where the system retrieves relevant information from structured knowledge bases, unstructured text corpora, or the web to form a basis for answers. This step often reformulates the parsed question into a search query to fetch documents or passages with high recall, prioritizing sources that align with the query's semantic needs over exhaustive coverage. For instance, in open-domain QA, web-scale corpora provide broad access, while closed-domain systems limit retrieval to specialized knowledge bases. Candidate answer generation identifies potential responses from the retrieved sources by extracting phrases or entities that match the question's requirements. This process leverages named entity recognition (NER) to tag elements like persons or locations and semantic parsing to convert text into logical forms that align with the query's structure. Prerequisites for effective candidate generation include robust semantic parsing, which maps natural language to formal representations for precise matching, and entity recognition, which ensures key facts are not overlooked in retrieved text. Finally, answer ranking and selection evaluate candidates using heuristics like keyword proximity, semantic similarity, or redundancy checks across sources to select the most confident response. This stage validates answers against lexical resources or external corroboration to minimize errors. QA systems handle diverse question types, broadly categorized as factoid, definitional, and opinion-based, each demanding tailored processing. Factoid questions seek specific facts, such as "Who was the first U.S. president?" or "When did World War II end?", typically yielding short answers like names, dates, or quantities. Definitional questions request descriptions or explanations, for example, "What is photosynthesis?", requiring concise summaries or passages. Opinion-based questions involve subjective views, like "Why is genetic engineering controversial?", often drawing from explanatory or argumentative texts.
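The question analysis stage described above is often implemented with lightweight rules before any retrieval takes place. The following Python sketch is illustrative only: the pattern list, stopword set, and function names are hypothetical and stand in for the richer classifiers and parsers used in real systems.

```python
import re

# Hypothetical mapping from question prefixes to expected answer types.
ANSWER_TYPE_RULES = [
    (re.compile(r"^(who|whom)\b", re.I), "PERSON"),
    (re.compile(r"^(when|what year|what date)\b", re.I), "DATE"),
    (re.compile(r"^(where)\b", re.I), "LOCATION"),
    (re.compile(r"^(how many|how much)\b", re.I), "QUANTITY"),
    (re.compile(r"^(why|how)\b", re.I), "EXPLANATION"),
    (re.compile(r"^(what is|what are)\b", re.I), "DEFINITION"),
]

STOPWORDS = {"what", "is", "the", "a", "an", "of", "who", "when", "where", "was"}

def analyze_question(question: str) -> dict:
    """Return the expected answer type and a rough focus word for a question."""
    answer_type = "OTHER"
    for pattern, label in ANSWER_TYPE_RULES:
        if pattern.search(question):
            answer_type = label
            break
    # Naive head-noun heuristic: take the last content word of the question.
    tokens = re.findall(r"[a-zA-Z']+", question.lower())
    content = [t for t in tokens if t not in STOPWORDS]
    focus = content[-1] if content else None
    return {"question": question, "answer_type": answer_type, "focus": focus}

if __name__ == "__main__":
    print(analyze_question("Who was the first U.S. president?"))
    # {'question': ..., 'answer_type': 'PERSON', 'focus': 'president'}
```

The extracted answer type and focus can then constrain which entities the later ranking stage is allowed to return, mirroring the modular flow described above.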

Types

Closed-Domain Question Answering

Closed-domain question answering (QA) systems are designed to respond to queries restricted to a specific, predefined domain, such as medicine, sports, or legal affairs, leveraging structured knowledge bases like ontologies, relational databases, and domain-curated corpora to ensure focused and relevant answers. These systems operate within bounded information sources, where questions are expected to have answers derivable from the domain's explicit knowledge, enabling precise mapping between input questions and available facts. By limiting the search space, closed-domain QA avoids the ambiguity and scale issues prevalent in broader contexts, prioritizing depth over generality. The primary advantage of closed-domain QA lies in its elevated accuracy and reliability, stemming from the constrained scope that minimizes exposure to extraneous or conflicting information. For example, FAQ systems in such domains match user questions to a finite set of pre-authored responses, achieving high precision by exploiting repetitive query patterns typical in specialized interactions. IBM's Watson, which was originally developed for the open-domain Jeopardy! trivia challenge, has been extended to biomedical variants, such as Watson for Oncology, where it draws on structured medical ontologies and evidence-based literature to suggest cancer treatments, demonstrating how closed-domain QA supports expert-level decision support in high-stakes fields. These examples illustrate the technique's efficacy in delivering verifiable, contextually rich answers that outperform generalist approaches in targeted applications. Key techniques in closed-domain QA emphasize domain-tailored processing, including template matching and semantic parsing. Template matching identifies syntactic and semantic patterns in incoming questions to align them with predefined answer templates, which is particularly suited to domains with predictable question types, such as procedural queries in customer support. Semantic parsing translates questions into executable representations, like logical forms or database queries, customized to the domain's schema, for instance, generating SQL statements for querying medical patient records or SPARQL queries for ontology-based retrieval. These approaches integrate domain-specific lexicons and rules to handle jargon and relations unique to the field, facilitating accurate extraction from structured sources. A prominent case study is the Text REtrieval Conference (TREC) Genomics Track, organized by the National Institute of Standards and Technology (NIST) from 2003 to 2007, which assessed QA systems in the biomedical genomics domain using full-text articles from sources like the Journal of Biological Chemistry. The track featured entity-centric tasks, requiring systems to answer questions such as "List all proteins that interact with gene X," with evaluations based on passage-level relevance and entity accuracy metrics. Participating systems employed techniques like named entity recognition and passage retrieval, with leading performers attaining aspect MAP scores of approximately 0.26 on complex queries, revealing the demands of integrating heterogeneous biomedical data while advancing domain-specific QA methodologies.
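As an illustration of template matching combined with semantic parsing, the short Python sketch below maps one constrained family of questions onto SQL over a hypothetical patients table. The schema, the regular-expression template, and the generated query are all invented for illustration; real closed-domain systems rely on far richer grammars and domain lexicons.

```python
import re
import sqlite3

# Hypothetical closed-domain schema, for illustration only.
SCHEMA = "CREATE TABLE patients (name TEXT, diagnosis TEXT, age INTEGER)"

# One question template: "How many patients have <diagnosis>?"
COUNT_TEMPLATE = re.compile(r"how many patients have (?P<diagnosis>[\w\s]+)\?", re.I)

def question_to_sql(question: str):
    """Translate a question into (sql, parameters) if a known template matches."""
    match = COUNT_TEMPLATE.match(question.strip())
    if match:
        diagnosis = match.group("diagnosis").strip().lower()
        return "SELECT COUNT(*) FROM patients WHERE diagnosis = ?", (diagnosis,)
    return None

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO patients VALUES (?, ?, ?)",
                     [("A", "diabetes", 54), ("B", "diabetes", 61), ("C", "asthma", 33)])
    parsed = question_to_sql("How many patients have diabetes?")
    if parsed:
        sql, params = parsed
        print(conn.execute(sql, params).fetchone()[0])  # -> 2
```

The bounded schema is what makes this translation tractable: every question the template accepts is guaranteed to be answerable from the domain's explicit knowledge.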

Open-Domain Question Answering

Open-domain question answering (OpenQA) refers to the task of providing accurate answers to questions drawn from a broad range of topics, using large-scale, unstructured knowledge sources such as the web or encyclopedias like Wikipedia, without reliance on pre-provided context or domain-specific restrictions. Unlike closed-domain systems, OpenQA requires an initial retrieval step to identify relevant documents from vast corpora, followed by answer extraction or generation from those passages. This approach enables handling diverse, real-world queries but demands scalable mechanisms to manage the scale and heterogeneity of general knowledge sources. Key challenges in OpenQA include resolving query ambiguity, where questions may have multiple interpretations requiring contextual disambiguation; mitigating noise from irrelevant or low-quality retrieved documents; and incorporating broad world knowledge to handle factual inaccuracies or gaps in the underlying corpus. Retrieval inefficiencies arise from term mismatches between queries and documents, often necessitating advanced dense retrieval methods over traditional sparse techniques like BM25. Additionally, scaling to massive corpora introduces computational demands, while ensuring robustness to adversarial or unanswerable questions remains critical for reliable performance. Early milestones in OpenQA include the FAQFinder system, which in 1997 pioneered retrieval-based answering by matching user questions to existing FAQ pairs across diverse online sources, demonstrating the feasibility of open retrieval without domain limits. The TREC Question Answering track, starting in 1999, formalized OpenQA evaluation by challenging systems to extract precise answers from large news collections, spurring advancements in factoid QA. A significant leap came with Google's Knowledge Graph integration in 2012, which enhanced search-based QA by leveraging structured entity knowledge to provide direct answers for billions of queries annually. More recently, the DrQA framework in 2017 established the influential retriever-reader paradigm, combining TF-IDF retrieval with neural reading comprehension over Wikipedia to achieve state-of-the-art results on open benchmarks. Evaluation of OpenQA systems typically employs metrics such as Exact Match (EM), which measures whether the predicted answer precisely matches the ground truth, and F1 score, which accounts for partial overlaps in token precision and recall. These are applied to benchmarks like Natural Questions (NQ), where systems retrieve and answer from web documents, reporting F1 scores around 50-60% for top models as of 2019. Other datasets, such as TriviaQA and MS MARCO, emphasize diverse question types and real-world search scenarios to assess retrieval accuracy and answer faithfulness.
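The retriever half of the retriever-reader paradigm can be prototyped in a few lines. The sketch below is a minimal illustration assuming the scikit-learn library: it ranks a toy in-memory passage list by TF-IDF cosine similarity to the question, whereas production OpenQA systems index millions of passages and pair the retriever with a neural reader.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy passage collection standing in for a web-scale corpus.
passages = [
    "The Amazon is the largest river by discharge volume of water in the world.",
    "Mount Everest is Earth's highest mountain above sea level.",
    "The Nile is often regarded as the longest river in the world.",
]

def retrieve(question: str, k: int = 2):
    """Return the top-k passages ranked by TF-IDF cosine similarity."""
    vectorizer = TfidfVectorizer(stop_words="english")
    passage_matrix = vectorizer.fit_transform(passages)
    question_vector = vectorizer.transform([question])
    scores = cosine_similarity(question_vector, passage_matrix).ravel()
    ranked = scores.argsort()[::-1][:k]
    return [(passages[i], float(scores[i])) for i in ranked]

if __name__ == "__main__":
    for passage, score in retrieve("What is the longest river in the world?"):
        print(f"{score:.2f}  {passage}")
```

A reader model would then scan the returned passages for an answer span, which is exactly the division of labor DrQA popularized.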

Specialized Question Answering

Specialized question answering encompasses variants of question answering (QA) systems that extend beyond textual inputs, incorporating domain-specific expertise or non-text modalities such as mathematical notation, visuals, or combinations thereof. These systems address queries requiring precise computation, visual interpretation, or integrated reasoning across multiple data types, often integrating specialized tools like symbolic solvers or vision models to achieve accuracy in constrained domains. Unlike general text-based QA, specialized approaches must handle unique representational challenges, such as formal notations in math or spatial relationships in images. Mathematical QA focuses on solving problems involving equations, proofs, or word problems that demand numerical or algebraic reasoning. Datasets like MathQA provide large-scale collections of math word problems, comprising over 37,000 examples annotated with step-by-step operation programs to facilitate interpretable solving. Techniques in this area often integrate symbolic reasoning, where neural models generate executable programs that invoke solvers for verification, enabling systems to decompose complex problems into verifiable steps. A prominent example is Wolfram Alpha, a computational engine launched in 2009 that uses symbolic computation to answer mathematical queries by evaluating expressions and providing step-by-step derivations. Unique challenges include ensuring step-by-step reasoning accuracy, as errors in intermediate calculations can propagate, and handling diverse problem formats from arithmetic word problems to formal proofs. Visual QA (VQA) involves answering questions about images, requiring models to jointly process visual content and textual queries. Seminal datasets such as VQA v1.0, introduced in 2015, contain approximately 250,000 images paired with 760,000 open-ended questions and 10 million answers, emphasizing the need for vision-language alignment. Techniques typically employ visual encoder components, like convolutional neural networks or vision transformers, to extract image features, which are then fused with question embeddings for prediction. Recent advances leverage transformer-based architectures, such as LXMERT, which uses cross-modality encoders pretrained on image-text datasets to improve performance on tasks like visual entailment and question answering. Challenges in VQA include resolving visual ambiguity, where similar images may yield different answers based on subtle contextual cues, and mitigating language biases that ignore image details. Multimodal QA extends VQA to incorporate additional modalities, such as combining text with images or videos for more comprehensive querying. This variant addresses questions that span static visuals and dynamic sequences, using datasets like those derived from MSVD for video QA, which include thousands of clips with temporal questions. Techniques build on transformers to fuse representations from text encoders and visual processors, enabling reasoning over spatiotemporal elements in videos. For instance, systems processing video inputs apply attention mechanisms to track object trajectories across frames while aligning with textual queries. Key challenges involve managing temporal ambiguity in videos, where actions unfold over time, and scaling integration across modalities without losing fidelity in non-textual reasoning.
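To make the idea of operation programs concrete, the sketch below executes a tiny, MathQA-style sequence of arithmetic operations against the numbers extracted from a word problem. The operation vocabulary and the example program are invented for illustration; in real systems the program is predicted by a neural model and may call a full symbolic solver for verification.

```python
# Minimal interpreter for MathQA-style operation programs (illustrative only).
OPERATIONS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}

def run_program(program, numbers):
    """Execute a list of (op, arg1, arg2) steps.

    Arguments may reference problem numbers ("n0", "n1", ...) or the results
    of earlier steps ("#0", "#1", ...).
    """
    results = []

    def resolve(token):
        if token.startswith("n"):
            return numbers[int(token[1:])]
        if token.startswith("#"):
            return results[int(token[1:])]
        return float(token)

    for op, a, b in program:
        results.append(OPERATIONS[op](resolve(a), resolve(b)))
    return results[-1]

if __name__ == "__main__":
    # "A shirt costs 20 dollars and is discounted by 25 percent. What is the price?"
    numbers = [20.0, 25.0]
    program = [("divide", "n1", "100"), ("multiply", "n0", "#0"), ("subtract", "n0", "#1")]
    print(run_program(program, numbers))  # -> 15.0
```

Because each intermediate result is explicit, errors can be localized to a single step, which is the interpretability benefit such annotations are meant to provide.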

History

Early Developments (Pre-1990s)

The origins of question answering (QA) systems trace back to the early 1960s, when researchers in artificial intelligence began developing programs capable of interpreting natural language queries against structured data. One of the pioneering efforts was the BASEBALL system, created by Bert F. Green Jr. and colleagues in 1961. This program answered questions in English about baseball statistics from a single season, stored on punched cards, by employing syntactic pattern matching and a dictionary-based content analysis to map queries to data retrieval operations. For instance, it could respond to questions like "How many games did the Dodgers win in 1958?" by parsing the input for key elements such as teams and dates, then querying the database accordingly. The system's success in handling a limited domain demonstrated the feasibility of rule-based natural language processing for QA, though it was constrained to predefined patterns and required exact matches for reliable performance. In the late 1960s and early 1970s, influences from computational linguistics advanced QA toward more sophisticated natural language understanding. Terry Winograd's SHRDLU, developed at MIT between 1968 and 1970, represented a significant leap by enabling interactive QA within a simulated "block world" environment. The system processed commands and questions like "Can the table pick up blocks?" by integrating procedural semantics, where representations of the world (e.g., blocks, tables) were manipulated through a parser that understood context, reference, and inference. SHRDLU's ability to maintain dialogue state and resolve ambiguities, such as pronoun references, highlighted the importance of world knowledge in QA, influencing subsequent research in knowledge representation. Its implementation in Micro Planner, a Lisp-based language, underscored the role of symbolic reasoning in achieving coherent responses. The rule-based era of QA expanded in the early 1970s with expert systems tailored to specific domains, exemplified by the LUNAR system developed by William A. Woods in 1971. LUNAR allowed geologists to query a database of lunar rock chemical analyses using natural English, such as "How much iron is in the high-titanium basalts?" The system featured a robust semantic grammar and parser that converted questions into procedural representations for database interrogation, achieving over 90% accuracy on test queries at a lunar conference demonstration. By incorporating domain-specific rules for quantification and aggregation, LUNAR illustrated how QA could support scientific inquiry, paving the way for more complex inference in restricted environments. Central to these early systems were foundational concepts like question templates and semantic grammars, which provided structured ways to interpret natural language without relying on broad statistical models. Question templates, as used in BASEBALL, predefined syntactic patterns to classify and route queries, enabling efficient matching against data schemas. Semantic grammars, prominent in LUNAR and SHRDLU, augmented syntactic parsing with meaning-driven rules to handle variations in phrasing while preserving logical structure, such as distinguishing between "what" and "how many" interrogatives. These approaches emphasized hand-crafted rules and domain expertise, establishing knowledge engineering as a cornerstone of symbolic AI before the shift toward data-driven paradigms.

Rise of Statistical and Machine Learning Methods (1990s–2010s)

The 1990s marked a pivotal shift in QA research from rule-based symbolic approaches to data-driven statistical methods, driven by advances in information retrieval and the growing availability of large text corpora. This era emphasized probabilistic models for passage retrieval and answer extraction, leveraging techniques like term frequency-inverse document frequency (TF-IDF) to identify relevant snippets containing answers. TF-IDF, which weights terms based on their frequency in a document relative to the corpus, became a cornerstone for ranking candidate passages in early QA systems, enabling more scalable processing of unstructured text without deep linguistic parsing. A landmark event was the introduction of the Question Answering track at the Text REtrieval Conference (TREC-8) in 1999, organized by the National Institute of Standards and Technology (NIST), which established standardized evaluations for open-domain systems. The track focused on factoid questions requiring short, precise answers (e.g., 50-byte snippets) from a fixed document collection, promoting the development of systems that retrieved exact answers rather than full documents. Evaluation metrics, such as mean reciprocal rank (MRR), the average of the reciprocal ranks of the first correct answer per question, provided a rigorous benchmark, with MRR scores highlighting the limitations of early statistical methods (typically below 0.3 in initial runs). Participating systems often combined document retrieval engines for initial retrieval with simple statistical scoring for answer selection, setting the stage for broader adoption of probabilistic techniques. Entering the 2000s, machine learning (ML) techniques enhanced statistical QA by improving answer ranking and validation, particularly through supervised classifiers trained on annotated data from evaluations like TREC. For instance, the AskMSR system, developed by Microsoft Research and evaluated at TREC 2002, utilized decision trees, a form of ML, for reranking candidate answers extracted from web search results, achieving an MRR of 0.507 by prioritizing n-grams based on features like word overlap and question type compatibility. This approach exploited web-scale redundancy, where frequent answer occurrences signaled reliability, marking a departure from hand-crafted rules toward learning-based refinement. Broader QA@NIST evaluations, continuing through TREC from 2000 to 2010, refined tasks to include complex questions and "NIL" responses for unanswerable queries, while maintaining MRR as the primary metric alongside strict/lenient scoring variants to assess answer support. These annual benchmarks spurred innovations, with top systems reaching MRR above 0.5 by mid-decade, underscoring the efficacy of statistical pipelines. Knowledge bases like WordNet, a lexical database of English synsets developed in the early 1990s, were integrated into statistical QA to expand query terms and resolve semantic ambiguities during retrieval and answer validation. In systems such as IBM's statistical QA entry at TREC-10 (2001), WordNet facilitated focus expansion by linking query words to synonyms and hypernyms, boosting recall in IR stages without relying on full ontologies. This hybrid use of lexical resources with probabilistic models improved handling of paraphrases, contributing to more robust answer selection in open-domain settings. By the late 2000s, multi-stream QA architectures emerged as a key advancement, combining outputs from multiple independent pipelines to enhance accuracy through redundancy and voting mechanisms.
The MultiStream approach, explored in evaluations like the 2007 Answer Validation Exercise (AVE), aggregated answers from diverse systems, each using statistical IR or ML components, and applied learning-based selection to identify the most supported response, achieving improvements of up to 20% in F1 scores over single-stream baselines. Such methods exemplified the era's emphasis on ensemble techniques, leveraging statistical diversity to mitigate individual system weaknesses. The success of IBM's Deep Blue in defeating chess champion Garry Kasparov in 1997 further inspired computational AI pursuits, indirectly fueling investments in text QA capabilities that culminated in later systems like Watson.
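Since mean reciprocal rank (MRR) is the central metric of this era, a short sketch of how it is computed may help; the ranked candidate lists below are made up purely for illustration.

```python
def mean_reciprocal_rank(ranked_answers, gold_answers):
    """Compute MRR over a set of questions.

    ranked_answers: list of ranked candidate lists, one per question.
    gold_answers: list of sets of acceptable answers, one per question.
    The reciprocal rank of a question is 1 / (position of the first correct
    candidate), or 0 if no candidate is correct.
    """
    total = 0.0
    for candidates, gold in zip(ranked_answers, gold_answers):
        for rank, candidate in enumerate(candidates, start=1):
            if candidate in gold:
                total += 1.0 / rank
                break
    return total / len(ranked_answers)

if __name__ == "__main__":
    ranked = [["Paris", "Lyon"], ["1945", "1939", "1944"], ["Mars"]]
    gold = [{"Paris"}, {"1944"}, {"Venus"}]
    print(mean_reciprocal_rank(ranked, gold))  # (1 + 1/3 + 0) / 3 ≈ 0.44
```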

Deep Learning and Transformer Era (2010s–Present)

The deep learning era in question answering (QA) began in the early 2010s with the adoption of recurrent neural networks (RNNs) and long short-term memory (LSTM) units, which enabled more sophisticated modeling of sequential dependencies in text for tasks like machine reading comprehension. These architectures addressed limitations of earlier statistical methods by learning distributed representations of words and contexts, allowing systems to infer answers from passages without rigid rule-based pipelines. A pivotal advancement was the introduction of Memory Networks in 2014, which incorporated an external memory component to store and retrieve relevant facts, facilitating end-to-end training for QA on synthetic and real-world datasets. This approach demonstrated improved performance on simple factoid questions by dynamically attending to memory slots during inference. The release of the Stanford Question Answering Dataset (SQuAD) in 2016 further catalyzed progress, providing over 100,000 crowd-sourced question-answer pairs from Wikipedia articles and establishing a benchmark for extractive QA that spurred the development of neural models surpassing human baselines. The advent of the Transformer architecture in 2017 revolutionized QA by enabling parallelizable processing and capturing long-range dependencies through self-attention mechanisms. In 2018, Bidirectional Encoder Representations from Transformers (BERT) marked a breakthrough, pre-training a bidirectional Transformer on masked language modeling and next-sentence prediction tasks before fine-tuning on QA datasets like SQuAD. BERT-Large achieved state-of-the-art results of 85.1% exact match and 91.8% F1 on the SQuAD 1.1 test set (single model with TriviaQA fine-tuning) by leveraging contextual embeddings that better captured question-passage alignments compared to unidirectional RNNs. Subsequent variants, including RoBERTa and ELECTRA, refined this paradigm through optimized pre-training objectives and data scaling, solidifying pre-trained Transformers as a standard for closed-domain QA. Entering the 2020s, the scaling of large language models (LLMs) transformed QA into a generative task, where models produce free-form answers rather than extracting spans. GPT-3, released in 2020 with 175 billion parameters, showcased few-shot capabilities for open-domain QA, achieving competitive results on benchmarks like Natural Questions without task-specific fine-tuning by prompting the model with examples. Similarly, the Text-to-Text Transfer Transformer (T5) in 2020 unified QA as a text generation problem within a sequence-to-sequence framework, attaining 90.6% F1 on SQuAD through supervised fine-tuning on diverse tasks. To mitigate hallucinations in generative QA, Retrieval-Augmented Generation (RAG) emerged in 2020, combining parametric LLMs with non-parametric retrieval from external corpora like Wikipedia, yielding up to 44% improvement on knowledge-intensive tasks such as open-domain trivia QA. By 2023, multimodal extensions like GPT-4V integrated vision capabilities, enabling QA over images and text, such as describing visual content in medical or diagram-based queries with accuracies exceeding 80% on specialized benchmarks. As of 2025, QA has increasingly integrated into agentic systems, where autonomous agents leverage QA modules for multi-step reasoning, tool use, and planning in dynamic environments, as seen in frameworks like Agentic-R1 that distill dual reasoning strategies for efficient problem-solving. Efficiency improvements via knowledge distillation have also gained prominence, compressing large models like BERT into smaller variants that retain 95% of QA performance while reducing inference costs by factors of 10, facilitating deployment in resource-constrained settings.

Architectures

Traditional Pipeline Architectures

Traditional pipeline architectures in question answering represent a modular, sequential approach that dominated early systems, particularly in large-scale evaluations like the Text REtrieval Conference (TREC) QA track from 1999 to 2007. These systems break down the QA process into distinct, interpretable stages to handle queries over unstructured text corpora, emphasizing precision in factoid and list question types. The design allows for targeted optimization of each component, drawing on established information retrieval (IR) and natural language processing techniques prevalent before the widespread adoption of deep learning methods. The core structure typically follows a step-by-step pipeline: question processing, document retrieval, passage selection, answer extraction, and verification. In question processing, the input is parsed to identify its type (e.g., who, what, where), expected answer format, and key terms, often using rule-based classifiers or keyword extraction to reformulate it for retrieval. Document retrieval then employs IR engines, such as InQuery in early TREC systems or Lucene in later implementations, to rank and fetch a set of relevant documents from a large collection based on query-document similarity metrics like TF-IDF. Passage selection refines this by identifying candidate text spans within the documents using proximity heuristics or density scoring for answer-bearing content. Answer extraction applies named entity recognition, pattern matching, or shallow parsing to pinpoint potential answers, generating a list of candidates. Finally, verification ranks these candidates by confidence scores derived from evidence strength, redundancy across sources, or semantic coherence, selecting the top candidate for output. Exemplary systems from the TREC QA track illustrate this pipeline in action; for instance, one participating system at TREC 2002 featured a component-based architecture with dedicated modules for question analysis, retrieval via IR tools, and answer validation, enabling systematic performance analysis at each stage. Similarly, another system at TREC 2005 followed a standard sequence of question analysis, document search with engines like Lucene, passage ranking, and answer justification to ensure factual accuracy. These pipelines were highly effective for closed-domain or factoid QA, achieving up to 65% accuracy on TREC-9 questions through precise retrieval and extraction. A primary strength of these architectures lies in their interpretability and modularity, which facilitate debugging, component swapping, and evaluation of individual stages, such as isolating retrieval errors without retraining the entire system, making them suitable for research and development in resource-constrained environments. However, they suffer from error propagation, where inaccuracies in upstream steps (e.g., irrelevant documents from poor retrieval) amplify downstream, leading to brittle performance on complex or ambiguous queries. Prior to 2010, these pipelines were the prevailing paradigm in QA, powering most competitive systems in benchmarks like TREC and enabling scalable handling of web-scale corpora.
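The staged structure described above can be captured in a compact skeleton. The Python sketch below wires five stub stages together in sequence; every stage body is a deliberately naive placeholder standing in for the much richer components (IR engines, NER, validation heuristics) that real TREC-era systems used.

```python
# Skeleton of a traditional QA pipeline; each stage is a deliberately naive stub.
def process_question(question):
    # Pick simple keywords (stands in for question classification and reformulation).
    keywords = [w.strip("?").lower() for w in question.split() if len(w) > 3]
    return {"question": question, "keywords": keywords}

def retrieve_documents(query, corpus):
    # Rank documents by keyword overlap (stands in for an IR engine).
    score = lambda doc: sum(k in doc.lower() for k in query["keywords"])
    return sorted(corpus, key=score, reverse=True)[:3]

def select_passages(documents):
    # Split documents into sentences as candidate passages.
    return [s.strip() for d in documents for s in d.split(".") if s.strip()]

def extract_candidates(passages, query):
    # Keep passages mentioning any keyword (stands in for NER or pattern matching).
    return [p for p in passages if any(k in p.lower() for k in query["keywords"])]

def verify(candidates):
    # Prefer the shortest supported candidate as the answer (toy heuristic).
    return min(candidates, key=len) if candidates else None

def answer(question, corpus):
    query = process_question(question)
    docs = retrieve_documents(query, corpus)
    passages = select_passages(docs)
    candidates = extract_candidates(passages, query)
    return verify(candidates)

corpus = ["The Nile is the longest river. It flows through Egypt.",
          "Mount Everest is the highest mountain."]
print(answer("What is the longest river?", corpus))  # -> "The Nile is the longest river"
```

The modularity is visible in the code: any single stage can be swapped out and evaluated in isolation, but a failure in an early stage (for example, retrieving the wrong documents) propagates to everything downstream, which is exactly the brittleness noted above.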

End-to-End Neural Architectures

End-to-end neural architectures in question answering integrate the entire process, from question encoding and understanding to answer generation or span extraction, within a single, jointly trainable model, typically leveraging encoder-decoder frameworks to process inputs holistically without modular handoffs. These models emerged prominently post-2015, enabling direct optimization of all components via backpropagation on large-scale datasets, which facilitates capturing complex interactions between questions and contexts. A seminal example for extractive question answering is the Bi-Directional Attention Flow (BiDAF) model, introduced in 2017, which employs a multi-stage hierarchical process to represent context at varying granularities and applies bi-directional attention mechanisms, flowing from context to query and vice versa, to identify answer spans within passages. BiDAF achieved state-of-the-art results on the SQuAD dataset at the time, with an F1 score exceeding 80%, demonstrating its effectiveness in handling nuanced semantic alignments without relying on separate retrieval or post-processing steps. For generative question answering, where answers are produced as free-form text rather than extracted spans, models like BART (2019) and T5 (2020) represent key advancements by framing QA as a sequence-to-sequence task. BART, a denoising autoencoder pre-trained on corrupted text, excels in abstractive QA by reconstructing answers from noisy question-context pairs, outperforming prior extractive methods on benchmarks like Natural Questions. Similarly, T5 unifies QA under a text-to-text paradigm, fine-tuning a Transformer to generate answers directly from prefixed inputs like "question: [Q] context: [C]," yielding superior performance on diverse datasets such as TriviaQA with exact match scores around 70%. These architectures offer advantages in handling contextual nuances and multi-hop reasoning, as end-to-end training allows the model to learn implicit alignments and dependencies across the input, often surpassing modular systems in accuracy when scaled on massive corpora such as web crawls. By jointly optimizing encoding and decoding, they reduce error propagation and adapt better to varied question types, though they require substantial computational resources for training. As of 2025, hybrid neural-symbolic architectures have gained traction to enhance robustness in end-to-end QA, integrating neural components for language understanding with symbolic reasoning for logical inference and interpretability, as explored in recent surveys of complex QA systems.
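In practice, a fine-tuned end-to-end extractive model can be queried with very little code. The sketch below assumes the Hugging Face Transformers library and a publicly available SQuAD-style checkpoint (the model name shown is one common choice, used here only as an example); it returns the predicted answer span and a confidence score.

```python
from transformers import pipeline

# Load an extractive QA model fine-tuned on SQuAD-style data.
qa_model = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

context = (
    "The Stanford Question Answering Dataset (SQuAD) was released in 2016 and "
    "contains over 100,000 question-answer pairs drawn from Wikipedia articles."
)
result = qa_model(question="When was SQuAD released?", context=context)

# The pipeline returns the best answer span along with its score and offsets.
print(result["answer"], round(result["score"], 3))
```

Everything between tokenization and span scoring happens inside the single model, which is the defining contrast with the modular pipelines of the previous subsection.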

Methods

Rule-Based and Knowledge-Driven Methods

Rule-based and knowledge-driven methods in question answering represent early deterministic approaches that rely on predefined rules, patterns, and structured knowledge representations to interpret queries and generate responses, without depending on statistical learning or large corpora. These systems typically involve handcrafted rules for translating natural language questions into formal queries that can be executed against a knowledge base, emphasizing logical inference over probabilistic matching. Pattern matching forms a core technique in these methods, where syntactic or semantic templates are used to identify question types and map them to database operations or inference steps. A seminal example of pattern matching in rule-based QA is the LUNAR system, developed in the early 1970s to answer questions about lunar rock samples from the Apollo missions by employing an augmented transition network (ATN) parser to match question patterns against a procedural semantic grammar, enabling precise retrieval from a structured database. Similarly, the BASEBALL system from the 1960s demonstrated early pattern-based QA by processing queries about baseball statistics through rule-driven transformations into relational algebra expressions. Knowledge-driven methods extend this paradigm by leveraging ontologies and knowledge graphs for inference, often using RDF triples (subject-predicate-object statements) to represent domain knowledge and derive answers via logical rules. For instance, template-based systems translate natural language questions into SPARQL queries over RDF data by applying ontology-aligned patterns, allowing inference across related entities in the graph. The Cyc project, initiated in the 1980s and ongoing, exemplifies a large-scale knowledge-driven approach for commonsense QA, encoding millions of assertions in a formal representation language (CycL) to support logical inference and answer complex queries without external training data. Rule engines in closed-domain systems, such as those for technical support or medical diagnostics, further apply these inference mechanisms to predefined knowledge bases for reliable, domain-specific responses. These methods offer key strengths, including high explainability, where decision paths are fully traceable due to explicit rules, and the absence of need for annotated training data, making them suitable for resource-constrained or highly controlled environments. However, they suffer from weaknesses such as poor scalability to open domains, where crafting exhaustive rules becomes infeasible, and coverage gaps arising from incomplete knowledge representations that fail to handle linguistic variations or novel queries. In contrast to data-driven methods, rule-based and knowledge-driven approaches prioritize interpretability over adaptability to unstructured text.
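The sketch below illustrates the flavor of knowledge-driven QA over a handful of invented subject-predicate-object triples: a regular-expression template identifies the question type and its subject, and the answer is looked up, or inferred by a simple transitivity rule, from the triple store. Both the triples and the template are hypothetical.

```python
import re

# Tiny in-memory triple store (subject, predicate, object); contents are invented.
TRIPLES = {
    ("Nile", "flows_through", "Egypt"),
    ("Egypt", "located_in", "Africa"),
    ("Amazon", "flows_through", "Brazil"),
}

WHERE_TEMPLATE = re.compile(r"where is (?P<subject>\w+) located\?", re.I)

def located_in(subject):
    """Direct lookup plus one transitive inference step."""
    for s, p, o in TRIPLES:
        if s == subject and p == "located_in":
            return o
    # Inference rule: X flows_through Y and Y located_in Z => X located_in Z.
    for s, p, o in TRIPLES:
        if s == subject and p == "flows_through":
            return located_in(o)
    return None

def answer(question):
    match = WHERE_TEMPLATE.match(question.strip())
    if match:
        return located_in(match.group("subject").capitalize())
    return None

print(answer("Where is Nile located?"))  # -> "Africa", via flows_through then located_in
```

The decision path is fully traceable (which rule fired, which triples were used), illustrating the explainability advantage noted above, while the brittleness is equally visible: any phrasing the template does not anticipate yields no answer at all.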

Retrieval-Augmented Methods

Retrieval-augmented methods in question answering integrate information retrieval techniques with neural readers or generators to enable scalable, open-domain systems that ground answers in external knowledge sources. These approaches typically involve two main stages: first, retrieving relevant passages or documents from a large corpus based on the query, and second, processing those retrieved items to extract or generate the final answer. This paradigm addresses the limitations of purely parametric models by leveraging non-parametric memory, such as vast text corpora like Wikipedia, to improve factual accuracy and handle knowledge-intensive queries. A foundational technique in retrieval is sparse retrieval, exemplified by BM25, which ranks documents using term frequency and inverse document frequency to match query keywords with corpus content. BM25, developed in the 1990s, remains a baseline for its efficiency in lexical matching without requiring deep semantic understanding. In contrast, dense retrieval methods represent queries and passages as low-dimensional embeddings, enabling similarity-based retrieval that captures semantic relationships. For instance, Dense Passage Retrieval (DPR) uses dual encoders to produce dense vectors for questions and passages, outperforming sparse methods by 9-19% in top-20 passage retrieval accuracy on benchmarks like Natural Questions. Early retrieval-augmented systems focused on extractive QA, such as DrQA, which retrieves candidate paragraphs using TF-IDF or BM25 and then applies a neural reader to identify answers within them, demonstrating strong results on open-domain datasets without end-to-end training. Building on this, Retrieval-Augmented Generation (RAG) extends the framework to generative QA by fusing retrieved documents with a sequence-to-sequence model, allowing the system to produce free-form answers informed by external evidence; RAG set state-of-the-art results on tasks like open-domain trivia QA in 2020 by combining parametric generation with non-parametric retrieval from a dense index of Wikipedia articles. Recent advances as of 2025 emphasize iterative retrieval to handle complex, multi-hop questions that require chaining multiple pieces of information. Methods like KiRAG employ knowledge-driven iteration, where an initial retrieval is refined through subsequent queries generated by a language model, improving accuracy on multi-hop benchmarks by progressively incorporating deeper semantic features. Similarly, ReSP introduces a dual-function summarizer in an iterative retrieval-augmented generation loop to compress and plan retrievals for multi-hop QA, outperforming single-pass baselines on datasets requiring reasoning over extended contexts. These developments enhance scalability for real-world applications while mitigating issues like retrieval noise in intricate queries.
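A minimal dense-retrieval sketch is shown below, assuming the sentence-transformers package and one of its small publicly available encoder checkpoints (the model name is an assumption, not a recommendation). It embeds a question and a toy passage list into the same vector space and ranks passages by dot-product similarity, which is the core idea behind dual-encoder retrievers such as DPR, although DPR itself trains separate question and passage encoders.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Small general-purpose encoder used here purely for illustration.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

passages = [
    "BM25 ranks documents with term frequency and inverse document frequency.",
    "Dense Passage Retrieval encodes questions and passages with dual encoders.",
    "The Transformer architecture was introduced in 2017.",
]

def dense_retrieve(question, k=2):
    """Rank passages by dot-product similarity between embeddings."""
    passage_vectors = encoder.encode(passages)       # shape: (n_passages, dim)
    question_vector = encoder.encode([question])[0]  # shape: (dim,)
    scores = passage_vectors @ question_vector
    top = np.argsort(scores)[::-1][:k]
    return [(passages[i], float(scores[i])) for i in top]

for passage, score in dense_retrieve("How does dense retrieval represent passages?"):
    print(f"{score:.2f}  {passage}")
```

In a full retrieval-augmented system the top-ranked passages would then be concatenated into the prompt or input of a reader or generator, which produces the final answer grounded in that evidence.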

Generative Methods

Generative methods in question answering involve producing free-form text responses directly from input questions, typically leveraging encoder-decoder architectures to map question encodings to answer sequences. Early implementations relied on sequence-to-sequence models, such as those using recurrent neural networks (RNNs) with attention mechanisms, to generate answers autoregressively. A notable extension is the pointer-generator network, which combines generation with copying mechanisms from the input context to improve factual accuracy and handle out-of-vocabulary terms, originally developed for summarization but adapted for QA tasks like knowledge graph-based answering. Key advancements have centered on transformer-based decoders, particularly in large pretrained models like the GPT series, which enable zero-shot or few-shot question answering through in-context learning. In this paradigm, models generate answers by conditioning on prompts that include question-answer demonstrations without parameter updates, as demonstrated in GPT-3's performance on diverse benchmarks. Fine-tuning these models on QA pairs further enhances specificity, allowing adaptation to domain-specific tasks while preserving generative flexibility. The UnifiedQA framework exemplifies this by unifying multiple QA formats, such as extractive, abstractive, and multiple-choice, under a single T5-based model, achieving state-of-the-art results across 20 datasets by reformatting all tasks as text generation. These methods excel at handling non-factoid questions, such as those requiring explanations or reasoning, by producing coherent, free-form outputs rather than fixed spans. However, a primary challenge is hallucination, where models generate plausible but factually incorrect information due to over-reliance on parametric knowledge. Mitigation strategies include advanced prompting techniques, like chain-of-thought reasoning to encourage step-by-step verification, and post-generation checks against external sources, though integration with retrieval can further ground outputs in verified contexts.
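The in-context learning setup described above amounts to assembling a prompt from a handful of question-answer demonstrations followed by the new question. The sketch below only builds such a prompt as a string; the demonstration pairs are invented, and the call to an actual language model is left as a stub so that no particular API is assumed.

```python
# Build a few-shot QA prompt for in-context learning; demonstrations are invented.
DEMONSTRATIONS = [
    ("What is the capital of France?", "Paris"),
    ("Who wrote Hamlet?", "William Shakespeare"),
]

def build_few_shot_prompt(question: str) -> str:
    """Concatenate demonstrations and the target question into one prompt."""
    lines = []
    for demo_question, demo_answer in DEMONSTRATIONS:
        lines.append(f"Q: {demo_question}")
        lines.append(f"A: {demo_answer}")
    lines.append(f"Q: {question}")
    lines.append("A:")  # The model is expected to continue from here.
    return "\n".join(lines)

def answer_with_llm(question: str) -> str:
    prompt = build_few_shot_prompt(question)
    # Stub: send `prompt` to a language model of your choice and return its completion.
    raise NotImplementedError("plug in a language model call")

print(build_few_shot_prompt("When did World War II end?"))
```

No parameters are updated in this setup; the demonstrations alone steer the model toward short, factual completions, which is what distinguishes in-context learning from fine-tuning.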

Applications

Conversational Agents and Virtual Assistants

Conversational agents and virtual assistants rely on question answering (QA) as a foundational component to process and respond to user queries in natural, interactive dialogues. Apple's Siri, introduced in 2011 with the iPhone 4S, pioneered this integration by enabling voice-based QA for tasks such as providing real-time information on weather, sports, and directions through integrations with web services. Amazon's Alexa, launched in 2014, extended QA capabilities via its Echo devices, using cloud-based language understanding to answer factual questions, perform conversions, and handle multi-turn interactions through skills like custom Q&A blueprints. OpenAI's ChatGPT, released in November 2022, advanced conversational QA by leveraging large language models to engage in extended dialogues, admitting errors, and addressing follow-up questions in a human-like manner. A key feature of QA in these systems is context maintenance across multiple turns, allowing agents to reference prior exchanges for coherent responses; for instance, Siri and Alexa support follow-up queries without repetition, while ChatGPT's dialogue format enables it to challenge premises or refine answers based on ongoing conversation history. Personalization further enhances QA by incorporating user history, preferences, and profiles; Alexa uses voice recognition for tailored responses, and ChatGPT adapts to individual interaction styles over time. Google Assistant exemplifies QA for factual retrieval, integrating with Google's Knowledge Graph to deliver quick, accurate answers on topics like weather, local information, and contextual rephrasings for follow-ups, such as clarifying ambiguous queries in real time. In enterprise settings, chatbots built with platforms such as Rasa employ QA to automate customer support, resolving common inquiries on product details or policies through intent detection and retrieval-augmented generation, thereby reducing response times and agent workload. By 2025, trends in conversational AI emphasize emotional intelligence, where agents detect user sentiment via voice tone or text cues to deliver empathetic responses, enhancing support in scenarios like customer service chats or mental health support; this is driven by advances in sentiment-aware models, projected to grow the emotional AI market to $13.4 billion. Recent developments include agentic AI systems that autonomously handle multi-step tasks and multimodal inputs, as seen in updates to models like GPT-4o, enabling more dynamic and context-aware interactions.
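Context maintenance across turns can be illustrated with a simple history buffer that is prepended to every new query before it reaches the answering backend. The sketch below is schematic; answer_backend is a placeholder for whatever QA model or service an assistant actually uses.

```python
class ConversationalQA:
    """Keep a bounded dialogue history so follow-up questions stay in context."""

    def __init__(self, answer_backend, max_turns: int = 5):
        self.answer_backend = answer_backend  # callable: (context_str, question) -> answer
        self.max_turns = max_turns
        self.history = []  # list of (question, answer) pairs

    def ask(self, question: str) -> str:
        context = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in self.history)
        answer = self.answer_backend(context, question)
        self.history.append((question, answer))
        self.history = self.history[-self.max_turns:]  # drop the oldest turns
        return answer

# Example with a trivial echo backend standing in for a real model.
agent = ConversationalQA(lambda ctx, q: f"(answer to '{q}' given {len(ctx)} chars of history)")
print(agent.ask("Who directed Jaws?"))
print(agent.ask("When was it released?"))  # a real model could resolve "it" from the history
```

Bounding the buffer keeps prompts short while still letting the backend resolve pronouns and elliptical follow-ups, which is the behavior the assistants described above expose to users.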

Search Engines and Information Retrieval

Question answering (QA) has significantly enhanced traditional search engines by shifting from mere link provision to delivering direct, synthesized answers, thereby improving the user experience in web search. Early search systems relied on keyword matching, which often required users to sift through multiple links to find relevant information. This evolved with the introduction of the Google Knowledge Graph in 2012, which enabled knowledge panels, structured information boxes displaying key facts about entities such as people, places, or topics directly in search results. These panels draw from a vast database to provide concise answers, reducing the need for users to navigate external sites. Building on this, Google launched featured snippets in January 2014, extracting and reformatting content from top-ranking pages to answer common queries succinctly at the top of search results. Techniques underlying QA in search engines involve processing queries over large-scale web indexes to identify and retrieve precise answers. Search systems employ natural language processing to parse user questions, often using transformer-based models to understand intent and context. A key approach is hybrid QA, which combines retrieval from web corpora with entity linking, mapping query mentions to specific entities in knowledge bases like Wikidata or proprietary graphs, to ground answers in verifiable facts. For instance, Google's systems use entity linking to connect ambiguous terms to entities, enabling accurate extraction of attributes or relations from indexed content. This method improves precision by disambiguating queries and integrating structured data with unstructured text. Prominent examples illustrate QA's integration into major search platforms. Microsoft's Bing incorporates QA into its Visual Search feature, allowing users to upload images and receive textual answers about identified objects, landmarks, or concepts, leveraging multimodal models for interpretation since its expansion in 2023. Similarly, Baidu integrated its ERNIE Bot, a large language model-based QA system, into its search engine in late 2023, enabling generative responses to complex queries by augmenting retrieval with real-time web data. In 2024, Google introduced AI Overviews, a generative QA feature that provides synthesized summaries for complex queries across more than 100 countries by October 2024, drawing from web sources to deliver comprehensive answers. Emerging platforms like Perplexity AI, launched in 2022 and prominent by 2025, specialize in conversational QA with cited sources, offering real-time, accurate responses to factual and research-oriented questions. These implementations demonstrate how QA extends beyond text to visual and conversational elements in search interfaces. The impact of QA features in search engines includes substantial reductions in user effort and enhancements in query accuracy, particularly for factual information needs. By providing direct answers, features like featured snippets and knowledge panels minimize the clicks required to resolve queries, with Google's data indicating that such elements help users find information faster without always needing to visit source sites. Studies confirm increased user satisfaction due to quicker access, though challenges persist in ensuring factual accuracy for dynamic topics like breaking news. Overall, these advancements have made search more efficient, handling billions of daily queries with higher relevance.

Education and Tutoring Systems

Question answering (QA) technologies have been integrated into intelligent tutoring systems (ITS) to provide adaptive, interactive support in educational settings, enabling students to receive immediate, personalized responses to queries during learning activities. Seminal systems like AutoTutor, developed in the early 2000s, use natural language dialogue for conversational QA to simulate human tutoring dialogues, prompting students with questions and scaffolding explanations based on their responses. These systems analyze student inputs against expected answers to detect understanding gaps and deliver targeted feedback, enhancing engagement in subjects like computer literacy and physics. In language learning, platforms such as Duolingo employ QA mechanisms through features like "Explain My Answer" in Duolingo Max, powered by large language models, to clarify grammar rules and vocabulary usage in response to user queries or errors during exercises. For mathematics education, Carnegie Learning's MATHia serves as an AI-driven ITS that incorporates QA to offer step-by-step guidance on problem-solving, adapting question difficulty and providing hints based on real-time performance data from over 500,000 students annually. Similarly, Khan Academy's Khanmigo AI tutor resolves doubts by answering student questions in math, science, and the humanities through guided Socratic-style dialogues, fostering deeper comprehension without direct solutions. QA also supports auto-grading of essays by evaluating responses against rubrics, extracting key arguments via semantic analysis to assign scores and suggest improvements efficiently. The primary benefits of QA in tutoring systems include personalized feedback that adjusts to individual learning paces and the scaffolding of complex explanations through iterative questioning, which research shows improves retention and problem-solving skills in K-12 settings. By 2025, advancements in QA have enabled AI tutors to handle multimodal queries involving diagrams and visuals, such as explaining geometric proofs from uploaded images, as demonstrated in benchmarks like MMTutorBench for multimodal tutoring. These developments, including multi-agent systems for adaptive interactions and AI-enhanced high-dose tutoring with real-time feedback, allow for richer educational experiences across diverse domains.

Evaluation and Progress

Benchmarks and Datasets

Evaluation of question answering (QA) systems relies on standardized metrics that assess the accuracy and quality of predicted answers against gold-standard references. For extractive QA tasks, where the answer is a text span from a given context, the Exact Match (EM) metric measures whether the predicted answer exactly matches the ground truth, providing a strict evaluation. The F1 score, which balances precision and recall at the token level, is commonly used alongside EM to account for partial overlaps in answers. In generative QA, where models produce free-form responses, metrics like BLEU and ROUGE evaluate n-gram overlap and longest common subsequences between generated and reference answers, respectively, though they are less ideal for semantic fidelity. For more complex, open-ended QA involving reasoning or dialogue, human judgments often serve as the gold standard, supplemented by automated proxies due to scalability needs. Seminal datasets have shaped QA research, beginning with reading comprehension benchmarks like SQuAD, introduced in 2016, which consists of over 100,000 question-answer pairs derived from Wikipedia articles, focusing on extractive answers within provided passages. TriviaQA, released in 2017, extends this to open-domain QA with 95,000 trivia questions paired with evidence from web documents and Wikipedia, emphasizing distant supervision and multi-sentence reasoning. Natural Questions (NQ), from 2019, shifts toward real-world queries by using anonymized search logs, resulting in 307,000 questions with answers extracted from Wikipedia, promoting evaluation in web-scale contexts. The evolution of QA datasets reflects a progression from closed-domain, English-centric resources to open-domain, multilingual, and multimodal ones. Early reading comprehension setups, like those in SQuAD, tested models on fixed contexts, but open-domain datasets such as TriviaQA and NQ introduced retrieval challenges, requiring systems to fetch relevant evidence from large corpora. Multilingual extensions, exemplified by XQuAD in 2020, adapt SQuAD to 11 languages with 1,190 question-paragraph-answer triples per language, enabling cross-lingual transfer evaluation without language-specific training data. Broader benchmarks like GLUE (2018) and SuperGLUE (2019) incorporate QA subsets, such as QNLI and BoolQ, to assess QA capabilities within composite language understanding tasks, while leaderboards like the Open LLM Leaderboard rank models on QA-specific benchmarks including MMLU and TruthfulQA. Recent advancements emphasize reasoning and multimodality, with BIG-bench (2022) introducing over 200 diverse tasks, including QA variants for logical and commonsense reasoning, to probe scaling behaviors in large models. By 2025, multimodal datasets have proliferated, such as SPIQA for scientific paper image QA and ProMQA for procedural video understanding, extending traditional text-based QA to integrate visual and audio cues, often building on foundations like VQA v2 with bias-mitigated variants. This shift underscores the need for benchmarks that capture real-world complexity across modalities and languages.
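The EM and token-level F1 metrics above are simple enough to compute directly. The following sketch mirrors the usual SQuAD-style procedure (lowercasing, stripping punctuation and articles before comparison); it is a minimal illustration rather than the official evaluation script.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, remove punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> int:
    return int(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the Nile River", "Nile river"))                     # -> 1
print(round(token_f1("Nile river in Africa", "the Nile river"), 2))    # -> 0.67
```

Dataset-level scores are simply the average of these per-question values over all examples, with the maximum taken over multiple reference answers when a benchmark provides them.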

Challenges and Future Directions

One of the primary challenges in generative question answering (QA) systems is the phenomenon of hallucination, where models produce plausible but factually incorrect or unsubstantiated responses due to over-reliance on parametric knowledge or gaps in retrieved evidence. This issue persists even in advanced large language models (LLMs), with studies showing hallucination rates exceeding 20% in open-domain QA tasks without external verification mechanisms. To mitigate this, researchers emphasize retrieval-augmented generation (RAG) techniques, though integration remains imperfect for complex queries. Bias in training data represents another significant hurdle, as QA models often inherit and amplify societal prejudices embedded in corpora like Wikipedia or Common Crawl, leading to skewed answers that disadvantage underrepresented groups in areas such as gender, race, or nationality. For instance, analyses of models trained on English-centric corpora reveal up to 30% higher error rates for queries involving non-Western cultural contexts. Addressing this requires diverse data curation and debiasing algorithms, yet progress is slow due to the scale of data involved. Robustness to adversarial questions further limits QA reliability, as systems are vulnerable to perturbations like paraphrasing or adding irrelevant details that cause sharp performance drops, sometimes by over 50% on benchmarks designed for such attacks. This vulnerability arises from superficial pattern matching rather than semantic understanding. Multilingual and low-resource support compounds these issues, with models performing poorly on non-English queries due to insufficient training data; for example, zero-shot transfer to low-resource languages can yield accuracy below 10% compared to 70% for English. Ethical concerns in QA, particularly conversational variants, include privacy risks from retaining user interaction data for personalization, potentially exposing sensitive information in violation of regulations like GDPR. Additionally, these systems can propagate misinformation by confidently outputting unverified claims, exacerbating societal harms in high-stakes domains such as healthcare or news summarization. Efforts to incorporate fact-checking layers are underway, but they often trade off response speed and naturalness. Looking to future directions, neurosymbolic QA approaches aim to enhance reasoning by hybridizing neural networks with symbolic logic, enabling more interpretable and accurate handling of multi-hop questions that current LLMs struggle with. Continual learning models, which allow incremental adaptation without catastrophic forgetting, promise sustained performance in dynamic environments by continuously incorporating new knowledge streams. Integration of QA with robotics for embodied QA is an emerging frontier, where systems must ground answers in physical interactions, such as querying object affordances in real-world tasks, bridging the gap between textual and sensory data. As of 2025, advances in efficient inference, including quantized LLMs that reduce model size by up to 4x while maintaining near-full accuracy, are enabling deployment on edge devices for real-time QA applications. Similarly, zero-shot QA capabilities have improved through instruction-tuning paradigms, achieving competitive results on unseen domains without task-specific fine-tuning, though generalization to novel reasoning patterns remains a key research goal.

  26. [26]
    [PDF] A Comprehensive Approach with the MathQA Dataset - HAL
    Aug 5, 2024 · The MathQA dataset has 37,259 math word problems across six categories, with 80% for training, 12% for dev, and 8% for test.
  27. [27]
    VQA: Visual Question Answering
    VQA is a dataset with open-ended questions about images, requiring vision, language, and commonsense knowledge to answer.Challenge · Download · VQA v1 · Code
  28. [28]
    Learning Cross-Modality Encoder Representations from Transformers
    Aug 20, 2019 · LXMERT is a framework to learn vision-and-language connections using a Transformer model with three encoders: object relationship, language, ...
  29. [29]
    Video Question Answering: Datasets, Algorithms and Challenges
    This survey covers VideoQA datasets (normal, multi-modal, knowledge-based, factoid, inference), techniques, and research trends beyond factoid VideoQA.
  30. [30]
    Baseball: an automatic question-answerer - ACM Digital Library
    Baseball is a computer program that answers questions in English about stored data, using a dictionary and content analysis to extract information.
  31. [31]
    Procedures as a Representation for Data in a Computer Program for ...
    This paper describes a system for the computer understanding of English. The system answers questions, executes commands, and accepts information in normal ...Missing: original | Show results with:original
  32. [32]
    [PDF] Bridging the Lexical Chasm: Statistical Approaches to Answer-Finding
    (idf) weighting is a popular IR method for weighting terms by their “information content,” taken to be related to the frequency with which documents contain ...
  33. [33]
    Question answering in TREC | Proceedings of the tenth international ...
    A question answering track was introduced in TREC-8 1999. The track has generated wide-spread interest in the QA problem [2, 3, 4], and has documented ...
  34. [34]
    [PDF] An Analysis of the AskMSR Question-Answering System
    We built a decision tree to predict whether a correct answer appears in the top 5 answers, based on all of the question-derived features de- scribed earlier, ...Missing: SVM | Show results with:SVM
  35. [35]
    Text REtrieval Conference (TREC) QA Data
    Apr 23, 2002 · The QA task runs were evaluated using mean reciprocal rank (MRR). The score for an individual question was the reciprocal of the rank at which ...
  36. [36]
    [PDF] IBM's Statistical Question Answering System - TREC-10
    Focus expansion using WordNet (Miller, 1990). . Dependency relationships using syntatic pars- ing. . A maximum entropy formulation for answer se- lection ...
  37. [37]
    Learning to select the correct answer in multi-stream question ...
    This paper focuses on this problem, namely, the selection of the correct answer from a given set of responses corresponding to different QA systems. In ...
  38. [38]
    [PDF] Evaluating Answer Validation in multi-stream Question Answering
    Dec 16, 2008 · We follow the opinion that Question Answering. (QA) performance can be improved by combining different systems.
  39. [39]
    Deep Blue - IBM
    Deep Blue has had an impact on computing in many industries. It gave developers insights into ways they could design computers to analyze a vast number of ...Missing: question answering
  40. [40]
    [PDF] Overview of the TREC-9 Question Answering Track
    The TREC question answering track is an effort to bring the benefits of large-scale evaluation to bear on the question answering problem.
  41. [41]
    The TREC question answering track | Natural Language Engineering
    Feb 14, 2002 · The Text REtrieval Conference (TREC) question answering track is an effort to bring the benefits of large-scale evaluation to bear on a ...<|control11|><|separator|>
  42. [42]
    [PDF] A Multi-Strategy and Multi-Source Approach to Question Answering
    Traditional question answering systems typically employ a single pipeline architecture, consisting roughly of three components: question analysis, search, and ...
  43. [43]
    [PDF] A Data Driven Approach to Query Expansion in Question Answering
    Information re- trieval (IR) performance, provided by en- gines such as Lucene, places a bound on overall system performance. For example, no answer bearing ...<|control11|><|separator|>
  44. [44]
    [PDF] The JAVELIN Question-Answering System at TREC 2002
    The architecture is designed to support component-level evaluation, so that competing strategies and operators can be compared in terms of various performance ...
  45. [45]
    [PDF] Question Answering with QED at TREC-2005
    With respect to its architecture, QED is a fairly tradi- tional QA system, which is composed of a standard se- quence of modules: Question Analysis, Document ...
  46. [46]
    Integrating Modular Pipelines with End-to-End Learning: A Hybrid ...
    The advantage of this architecture lies in its interpretability, enabling a comprehensive evaluation of each component. However, the extensive coupling of ...
  47. [47]
    When it's all piling up: Investigating error propagation in an NLP ...
    Dec 14, 2016 · However, the cascading structure is prone to error propagation, where early-stage inaccuracies can amplify throughout the pipeline and lead to ...
  48. [48]
    A review on persian question answering systems: from traditional to ...
    Feb 13, 2025 · The current study provides a brief explanation of these systems' evolution from traditional architectures to LLM-based approaches, their classification, the ...
  49. [49]
    [1703.04816] Making Neural QA as Simple as Possible but not Simpler
    Mar 14, 2017 · In this work, we propose a simple heuristic that guides the development of neural baseline systems for the extractive QA task.
  50. [50]
    End-to-End Models for Complex AI Tasks | Capital One
    May 12, 2021 · Advantages of end-to-end models · Better metrics: · Simplicity: · Reduced effort: · Applicability to new tasks: · Ability to leverage naturally- ...
  51. [51]
    Bidirectional Attention Flow for Machine Comprehension - arXiv
    Nov 5, 2016 · In this paper we introduce the Bi-Directional Attention Flow (BIDAF) ... Question Answering Dataset (SQuAD) and CNN/DailyMail cloze test.
  52. [52]
    BART: Denoising Sequence-to-Sequence Pre-training for Natural ...
    Oct 29, 2019 · We present BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by (1) corrupting text with an arbitrary noising function.
  53. [53]
    Exploring the Limits of Transfer Learning with a Unified Text-to-Text ...
    Oct 23, 2019 · In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language ...
  54. [54]
    A Hybrid Neuro-Symbolic Pipeline for Coreference Resolution and ...
    This study presents a hybrid neuro-symbolic pipeline that combines transformer-based contextual encoding with symbolic coreference resolution and Abstract ...
  55. [55]
  56. [56]
  57. [57]
    Template-based question answering over RDF data
    We present a novel approach that relies on a parse of the question to produce a SPARQL template that directly mirrors the internal structure of the question.Missing: early | Show results with:early
  58. [58]
    Cyc: toward programs with common sense - ACM Digital Library
    Cyc is a bold attempt to assemble a massive knowledge base (on the order of 108 axioms) spanning human consensus knowledge. This article examines the need ...
  59. [59]
    Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
    May 22, 2020 · We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) -- models which combine pre-trained parametric and non-parametric ...
  60. [60]
    The Probabilistic Relevance Framework: BM25 and Beyond
    This paper presents our novel relevance feedback (RF) algorithm that uses the probabilistic document-context based retrieval model with limited relevance ...
  61. [61]
    Dense Passage Retrieval for Open-Domain Question Answering
    Apr 10, 2020 · This paper introduces a dense passage retrieval method for open-domain QA, using dense representations and a dual-encoder framework, ...
  62. [62]
    [1704.00051] Reading Wikipedia to Answer Open-Domain Questions
    Mar 31, 2017 · This paper proposes to tackle open- domain question answering using Wikipedia as the unique knowledge source: the answer to any factoid question ...
  63. [63]
    Advancing Multi-hop Question Answering with an Iterative Approach
    Jul 18, 2024 · In this paper, we propose a novel iterative RAG method called ReSP, equipped with a dual-function summarizer. This summarizer compresses information from ...
  64. [64]
    Research on Automatic Question Answering of Generative ... - MDPI
    In the answer generation part, one combination of a vocabulary constructed by the knowledge graph and a pointer generator network(PGN) is proposed to point to ...Abstract · Share and Cite · Article Metrics
  65. [65]
    UnifiedQA: Crossing Format Boundaries With a Single QA System
    May 2, 2020 · Abstract:Question answering (QA) tasks have been posed using a variety of formats, such as extractive span selection, multiple choice, etc.
  66. [66]
    7 Practical Techniques to Reduce LLM Hallucinations
    Sep 30, 2025 · Must know approaches to mitigate hallucinations in LLMs · 1. Prompting · 2. Reasoning · 3. Retrieval Augmented Generation (RAG) · 4. ReAct (Reason + ...
  67. [67]
    Apple's Siri voice assistant based on extensive research - CNN
    Oct 5, 2011 · The program lets people bark commands or ask questions to the phone, and it will provide an answer or ask follow-up questions in order to ...
  68. [68]
    Siri – Knowledge and References - Taylor & Francis
    In 2011, Apple released its Siri technology for the iPhone (Apple, 2015). Siri is a virtual assistant able to understand natural-language voice commands and ...
  69. [69]
    Alexa can now help brands answer customer questions
    Sep 14, 2022 · All answers go through Alexa's content moderation and quality checks before Alexa selects the most relevant answer to share with customers.
  70. [70]
    How Amazon Alexa Works Using NLP: A Complete Guide
    Aug 6, 2025 · Amazon Alexa uses NLP to comprehend, decipher, and react to voice commands. The foundation of Alexa's capabilities is NLP.
  71. [71]
    Introducing ChatGPT - OpenAI
    Nov 30, 2022 · The dialogue format makes it possible for ChatGPT to answer followup questions, admit its mistakes, challenge incorrect premises, and reject ...Introducing ChatGPT search · Introducing ChatGPT Pro · OpenAI announces new...
  72. [72]
    About Alexa Conversations | Alexa Skills Kit - Amazon Developers
    Nov 27, 2023 · Alexa Conversations is a deep learning–based approach to dialog management that enables you to create natural, human-like voice experiences on Alexa.
  73. [73]
    Amazon Alexa – Learn what Alexa can do
    From microphone and camera controls to the ability to view and delete your voice recordings, you have transparency and control over your Alexa experience. Learn ...Alexa Information · Alexa Profiles · Alexa Entertainment · Alexa Productivity
  74. [74]
    AI Chatbot to Discover, Learn & Create - ChatGPT
    Type, talk, and use it your way. With ChatGPT, you can type or start a real-time voice conversation by tapping the soundwave icon in the mobile app.Download · Business · Education · EnterpriseMissing: 2022 | Show results with:2022
  75. [75]
    What you can ask Google Assistant
    Get to know your Assistant: “Do you dream?” “What's your favorite color?” Games: “Let's play a game.” “Give me a trivia question.” Entertainment: “Tell me a ...
  76. [76]
    Contextual Rephrasing in Google Assistant
    May 17, 2022 · We demonstrate how Assistant is now able to rephrase follow-up queries, adding contextual information before providing an answer.
  77. [77]
    Enterprise chatbots: Why and how to use them for support - Zendesk
    Jul 15, 2025 · Start with the chatbot's flow—it's your answer tree for customer questions. The bot flow allows you to helpfully direct the conversation to ...
  78. [78]
    A Complete Guide to Enterprise Customer Service Chatbot Platforms
    Dec 20, 2024 · Customer care: Chatbots provide instant answers, resolve issues, and deliver personalized support. Enterprise operations and IT helpdesk: ...
  79. [79]
    Emotionally Intelligent AI Voice Agents - SuperAGI
    Jun 27, 2025 · According to a report by IDC, the market for emotional AI is expected to grow to $13.4 billion by 2025, with emotional computing being a key ...Personalization At Scale · The Human-Ai Collaboration... · Emotion Detection And...
  80. [80]
    Conversational AI Trends for 2025: What You Need to Know - DXwand
    6. Emotionally Intelligent AI Chatbots. A major focus in AI development is creating chatbots with emotional intelligence, setting them apart from traditional ...1. Generative Ai's Big... · 3. Voice Assistants Go... · 5. Ai Chatbots: Expanding...
  81. [81]
    8 Conversational AI Trends in 2025 - Daffodil Software
    Jan 13, 2025 · Voice Emotion Recognition: AI systems can also analyze the pitch, tone, and speed of a user's voice to infer emotional states. This lets AI ...3) Emotionally Intelligent... · 5) Voice Search Optimization · 6) Integration With Iot And...
  82. [82]
    Introducing the Knowledge Graph: things, not strings - The Keyword
    May 16, 2012 · The Knowledge Graph enables you to search for things, people or places that Google knows about—landmarks, celebrities, cities, sports teams, ...
  83. [83]
    A reintroduction to Google's featured snippets - The Keyword
    Jan 30, 2018 · When we introduced featured snippets in January 2014, there were some concerns that they might cause publishers to lose traffic. What if ...
  84. [84]
    How Google May Use Entity References to Answer Questions
    Oct 12, 2014 · Google describes how it may answer questions from facts on Web pages by looking for entity references from structured and unstructured data.
  85. [85]
    Visual Search now live in Bing Chat
    Jul 18, 2023 · Bing Chat now supports visual search, which means you can now upload a photo or take a picture and have Bing Chat respond with answers around those visuals.
  86. [86]
    Baidu to integrate ERNIE 4.0, which 'rivals' GPT-4, into Search
    Oct 17, 2023 · The Chinese tech giant is planning to incorporate Ernie 4.0 into its search engine, which will change its SERPs.
  87. [87]
    Investigating the Influence of Featured Snippets on User Attitudes
    Mar 20, 2023 · This paper examines the effect of featured snippets in more nuanced and complicated search scenarios concerning debated topics that have no ...Missing: reduce effort
  88. [88]
    [PDF] Intelligent Tutoring Systems with Conversational Dialogue
    The tutoring systems present challenging prob- lems and questions to the learner, the learner types in answers in English, and there is a lengthy mul- titurn ...Missing: seminal | Show results with:seminal
  89. [89]
    [PDF] Intelligent Tutoring Systems: New Challenges and Directions - IJCAI
    ITS research has successfully delivered techniques and systems that provide adaptive support for student problem solving or question-answering activities in a ...
  90. [90]
    Introducing Duolingo Max, a learning experience powered by GPT-4
    Mar 14, 2023 · Duolingo Max is a new subscription tier above Super Duolingo that gives learners access to two brand-new features and exercises: Explain My Answer and Roleplay.How the Duolingo English Test... · Practice Hub · Talking to real learners
  91. [91]
    MATHia by Carnegie Learning | AI-Powered Math Supplement for ...
    MATHia, our award-winning, intelligent math software, is designed to provide individual student support and insightful data. Request a Demo ...Missing: answering | Show results with:answering
  92. [92]
    Meet Khanmigo: Khan Academy's AI-powered teaching assistant ...
    Type in a homework question and get instant help. Like a good tutor, Khanmigo gently guides your child to discover the answers themselves. Get Khanmigo.Free, AI-powered teacher... · Learners · Parents · Writing Coach
  93. [93]
    CoGrader | AI Essay Grader | Spend Less Time Grading, More Time ...
    CoGrader is an AI essay grader that helps teachers provide quality feedback, saving 80% of grading time, and provides timely, specific feedback.AI Grading Tool · How to Grade Essays Using... · AI Essay Grader · AI In CoGrader
  94. [94]
    A systematic review of AI-driven intelligent tutoring systems (ITS) in ...
    May 14, 2025 · This systematic review aims to identify the effects of ITSs on K-12 students' learning and performance and which experimental designs are currently used to ...
  95. [95]
    MMTutorBench: The First Multimodal Benchmark for AI Math Tutoring
    Oct 27, 2025 · We present MMTutorBench, the first multimodal benchmark for AI math tutoring. It evaluates MLLMs across diverse mathematical domains and ...
  96. [96]
    Adaptive Multi-Agent Tutoring AI for Multimodal Mathematics ...
    Oct 26, 2025 · This paper introduces a conversational AI tutoring system grounded in Large Language Models (LLMs) and multi-agent design, with each agent ...
  97. [97]
    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for ...
    May 9, 2017 · TriviaQA is a reading comprehension dataset with over 650K question-answer-evidence triples, including 95K question-answer pairs and six ...
  98. [98]
    SuperGLUE: A Stickier Benchmark for General-Purpose Language ...
    May 2, 2019 · In this paper we present SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, a software toolkit, and a ...Missing: QA | Show results with:QA
  99. [99]
    Open LLM Leaderboard Archived - Hugging Face
    Compare the performance of open-source Large Language Models using multiple benchmarks like IFEval, BBH, MATH, GPQA, MUSR, and MMLU-PRO.
  100. [100]
    Datasets Benchmarks 2024 - NeurIPS 2025
    We introduce SPIQA (Scientific Paper Image Question Answering), the first large-scale QA dataset specifically designed to interpret complex figures and tables ...
  101. [101]
    ProMQA: Question Answering Dataset for Multimodal Procedural ...
    ProMQA consists of 401 multimodal procedural QA pairs on user recording of procedural activities, ie, cooking, coupled with their corresponding instruction.