
Question answering

Question answering (QA) is a core task in natural language processing (NLP) that involves developing computational systems to comprehend human-posed questions in natural language and provide accurate, contextually relevant responses, often drawing from structured or unstructured sources. These systems aim to bridge the gap between human inquiry and machine intelligence, enabling applications such as virtual assistants, search engines, and knowledge retrieval tools. The history of QA traces back to the 1960s with early rule-based systems like BASEBALL, which answered queries about baseball statistics using predefined grammatical rules, and LUNAR, designed for lunar rock composition questions. Progress accelerated in the late 1990s through initiatives like the Text Retrieval Conference (TREC), particularly TREC-8 in 1999, which standardized open-domain QA evaluation and spurred research in information retrieval-based approaches. Subsequent shifts incorporated statistical and machine learning methods in the early 2000s and, from the 2010s onward, deep learning, leveraging neural networks to handle linguistic nuances more effectively. QA systems are broadly categorized into three main paradigms: information retrieval-based QA (IRQA), which retrieves and extracts answers from large text corpora; knowledge base QA (KBQA), which queries structured databases like ontologies or graphs; and generative QA (GQA), which produces novel answers using language models without direct extraction. Key benchmarks have driven advancements, including the Stanford Question Answering Dataset (SQuAD) introduced in 2016 for extractive reading comprehension tasks, and TREC datasets for factoid and complex question evaluation. Challenges persist in areas such as ambiguity, multi-hop reasoning, and hallucination, requiring robust handling of diverse question types like factual, opinion-based, or definitional queries. Recent developments, particularly since 2020, have been propelled by transformer architectures and large language models (LLMs) such as BERT (2018) for contextual understanding and GPT-series models (e.g., GPT-3 in 2020, GPT-4 in 2023, and GPT-4o in 2024) for generative capabilities, achieving state-of-the-art performance on benchmarks through techniques like in-context learning and reinforcement learning from human feedback (RLHF). Evaluations have evolved with over 50 new metrics since 2014, including exact match (EM), F1-score for extraction tasks, and learning-based scores like BERTScore for semantic alignment, though human-centric assessments remain essential due to issues like hallucinations in LLMs. Ongoing research as of 2025 emphasizes multilingual QA, multimodal integration (e.g., visual question answering), agentic prompting approaches, and ethical considerations to mitigate biases in responses.

Fundamentals

Definition and Scope

Question answering (QA) is a subfield of natural language processing (NLP) focused on the task of automatically generating answers to questions expressed in natural language, utilizing a knowledge base, document collection, or other information sources to provide relevant and accurate responses. QA systems process the input question to understand its intent, retrieve pertinent information, and formulate an output that directly addresses the query, often in a concise textual form. This capability enables more intuitive human-machine interactions compared to traditional search mechanisms. Originating in artificial intelligence research, QA aims to replicate human-like comprehension of language and knowledge retrieval. The scope of QA includes diverse question formats, such as factoid questions seeking discrete facts (e.g., names, dates, or locations), list questions requiring enumerations of items, and complex questions demanding explanatory or inferential reasoning (e.g., causal or hypothetical scenarios). For instance, a QA system might answer "Who is the president of the United States?" with a specific individual's name, while a complex QA approach could tackle "Why did the stock market crash?" by integrating economic and historical factors into a synthesized explanation. QA differs fundamentally from information retrieval (IR), which returns ranked lists of documents or passages for user review rather than pinpointing exact answers, and from dialogue systems, which sustain multi-turn conversations involving context maintenance and clarification rather than isolated query resolution. These distinctions highlight QA's emphasis on precise answer extraction and synthesis over mere document sourcing or extended interaction.

Key Components

Question answering (QA) systems rely on several core components to process queries and retrieve accurate responses. The primary stages include question analysis, knowledge source access, candidate answer generation, and answer ranking or selection. These components work modularly to transform a user's query into a structured search and refine potential answers for accuracy and relevance. Question analysis begins with parsing the intent, identifying key entities, and discerning relations within the query to determine its focus. This involves breaking down the question into semantic elements, such as the head noun (e.g., "river" in "What is the longest river?") and modifiers, using techniques like part-of-speech tagging or syntactic parsing. Natural language understanding (NLU) plays a crucial role here by interpreting the semantics to pinpoint the question's focus and anticipate the expected answer type, such as a name, date, or explanation. Knowledge source access follows, where the system retrieves relevant information from structured knowledge bases, unstructured text corpora, or the web to form a basis for answers. This step often reformulates the parsed question into a search query to fetch documents or passages with high recall, prioritizing sources that align with the query's semantic needs over exhaustive coverage. For instance, in open-domain QA, web-scale corpora provide broad access, while closed-domain systems limit retrieval to specialized knowledge bases. Candidate answer generation identifies potential responses from the retrieved sources by extracting phrases or entities that match the question's requirements. This process leverages named entity recognition (NER) to tag elements like persons or locations and semantic parsing to convert text into logical forms that align with the query's structure. Prerequisites for effective candidate generation include robust semantic parsing, which maps natural language to formal representations for precise matching, and entity recognition, which ensures key facts are not overlooked in retrieved text. Finally, answer ranking and selection evaluate candidates using heuristics like keyword proximity, semantic similarity, or redundancy checks across sources to select the most confident response. This stage validates answers against lexical resources or external corroboration to minimize errors. QA systems handle diverse question types, broadly categorized as factoid, definitional, and opinion-based, each demanding tailored processing. Factoid questions seek specific facts, such as "Who was the first U.S. president?" or "When did World War II end?", typically yielding short answers like names, dates, or quantities. Definitional questions request descriptions or explanations, for example, "What is photosynthesis?", requiring concise summaries or passages. Opinion-based questions involve subjective views, like "Why is genetic engineering controversial?", often drawing from explanatory or argumentative texts.
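The question analysis stage described above is often implemented with lightweight rules before any retrieval takes place. The following Python sketch is illustrative only: the pattern list, stopword set, and function names are hypothetical and stand in for the richer classifiers and parsers used in real systems.

```python
import re

# Hypothetical mapping from question prefixes to expected answer types.
ANSWER_TYPE_RULES = [
    (re.compile(r"^(who|whom)\b", re.I), "PERSON"),
    (re.compile(r"^(when|what year|what date)\b", re.I), "DATE"),
    (re.compile(r"^(where)\b", re.I), "LOCATION"),
    (re.compile(r"^(how many|how much)\b", re.I), "QUANTITY"),
    (re.compile(r"^(why|how)\b", re.I), "EXPLANATION"),
    (re.compile(r"^(what is|what are)\b", re.I), "DEFINITION"),
]

STOPWORDS = {"what", "is", "the", "a", "an", "of", "who", "when", "where", "was"}

def analyze_question(question: str) -> dict:
    """Return the expected answer type and a rough focus word for a question."""
    answer_type = "OTHER"
    for pattern, label in ANSWER_TYPE_RULES:
        if pattern.search(question):
            answer_type = label
            break
    # Naive head-noun heuristic: take the last content word of the question.
    tokens = re.findall(r"[a-zA-Z']+", question.lower())
    content = [t for t in tokens if t not in STOPWORDS]
    focus = content[-1] if content else None
    return {"question": question, "answer_type": answer_type, "focus": focus}

if __name__ == "__main__":
    print(analyze_question("Who was the first U.S. president?"))
    # {'question': ..., 'answer_type': 'PERSON', 'focus': 'president'}
```

The extracted answer type and focus can then constrain which entities the later ranking stage is allowed to return, mirroring the modular flow described above.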

Types

Closed-Domain Question Answering

Closed-domain question answering (QA) systems are designed to respond to queries restricted to a specific, predefined domain, such as medicine, sports, or legal affairs, leveraging structured knowledge bases like ontologies, relational databases, and domain-curated corpora to ensure focused and relevant answers. These systems operate within bounded information sources, where questions are expected to have answers derivable from the domain's explicit knowledge, enabling precise mapping between input questions and available facts. By limiting the search space, closed-domain QA avoids the ambiguity and scale issues prevalent in broader contexts, prioritizing depth over generality. The primary advantage of closed-domain QA lies in its elevated accuracy and reliability, stemming from the constrained scope that minimizes exposure to extraneous or conflicting information. For example, FAQ systems in such domains match user questions to a finite set of pre-authored responses, achieving high precision by exploiting repetitive query patterns typical in specialized interactions. IBM's Watson, which was originally developed for the open-domain Jeopardy! trivia challenge, has been extended to biomedical variants, such as Watson for Oncology, where it draws on structured medical ontologies and evidence-based literature to suggest cancer treatments, demonstrating how closed-domain QA supports expert-level decision support in high-stakes fields. These examples illustrate the technique's efficacy in delivering verifiable, contextually rich answers that outperform generalist approaches in targeted applications. Key techniques in closed-domain QA emphasize domain-tailored processing, including template matching and semantic parsing. Template matching identifies syntactic and semantic patterns in incoming questions to align them with predefined answer templates, which is particularly suited to domains with predictable question types, such as procedural queries in customer support. Semantic parsing translates questions into executable representations, like logical forms or database queries, customized to the domain's schema, for instance, generating SQL statements for querying medical patient records or SPARQL queries for ontology-based retrieval. These approaches integrate domain-specific lexicons and rules to handle jargon and relations unique to the field, facilitating accurate extraction from structured sources. A prominent case study is the Text REtrieval Conference (TREC) Genomics Track, organized by the National Institute of Standards and Technology (NIST) from 2003 to 2007, which assessed QA systems in the biomedical genomics domain using full-text articles from sources like the Journal of Biological Chemistry. The track featured entity-centric tasks, requiring systems to answer questions such as "List all proteins that interact with gene X," with evaluations based on passage-level relevance and entity accuracy metrics. Participating systems employed techniques like named entity recognition and passage retrieval, with leading performers attaining aspect MAP scores of approximately 0.26 on complex queries, revealing the demands of integrating heterogeneous biomedical data while advancing domain-specific QA methodologies.
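As an illustration of template matching combined with semantic parsing, the short Python sketch below maps one constrained family of questions onto SQL over a hypothetical patients table. The schema, the regular-expression template, and the generated query are all invented for illustration; real closed-domain systems rely on far richer grammars and domain lexicons.

```python
import re
import sqlite3

# Hypothetical closed-domain schema, for illustration only.
SCHEMA = "CREATE TABLE patients (name TEXT, diagnosis TEXT, age INTEGER)"

# One question template: "How many patients have <diagnosis>?"
COUNT_TEMPLATE = re.compile(r"how many patients have (?P<diagnosis>[\w\s]+)\?", re.I)

def question_to_sql(question: str):
    """Translate a question into (sql, parameters) if a known template matches."""
    match = COUNT_TEMPLATE.match(question.strip())
    if match:
        diagnosis = match.group("diagnosis").strip().lower()
        return "SELECT COUNT(*) FROM patients WHERE diagnosis = ?", (diagnosis,)
    return None

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO patients VALUES (?, ?, ?)",
                     [("A", "diabetes", 54), ("B", "diabetes", 61), ("C", "asthma", 33)])
    parsed = question_to_sql("How many patients have diabetes?")
    if parsed:
        sql, params = parsed
        print(conn.execute(sql, params).fetchone()[0])  # -> 2
```

The bounded schema is what makes this translation tractable: every question the template accepts is guaranteed to be answerable from the domain's explicit knowledge.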

Open-Domain Question Answering

Open-domain question answering (OpenQA) refers to the task of providing accurate answers to questions drawn from a broad range of topics, using large-scale, unstructured knowledge sources such as the web or encyclopedias like Wikipedia, without reliance on pre-provided context or domain-specific restrictions. Unlike closed-domain systems, OpenQA requires an initial retrieval step to identify relevant documents from vast corpora, followed by answer extraction or generation from those passages. This approach enables handling diverse, real-world queries but demands scalable mechanisms to manage the scale and heterogeneity of general knowledge sources. Key challenges in OpenQA include resolving query ambiguity, where questions may have multiple interpretations requiring contextual disambiguation; mitigating noise from irrelevant or low-quality retrieved documents; and incorporating broad world knowledge to handle factual inaccuracies or gaps in the underlying corpus. Retrieval inefficiencies arise from term mismatches between queries and documents, often necessitating advanced dense retrieval methods over traditional sparse techniques like BM25. Additionally, scaling to massive corpora introduces computational demands, while ensuring robustness to adversarial or unanswerable questions remains critical for reliable performance. Early milestones in OpenQA include the FAQFinder system, which in 1997 pioneered retrieval-based answering by matching user questions to existing FAQ pairs across diverse online sources, demonstrating the feasibility of open retrieval without domain limits. The TREC Question Answering track, starting in 1999, formalized OpenQA evaluation by challenging systems to extract precise answers from large news collections, spurring advancements in factoid QA. A significant leap came with Google's Knowledge Graph integration in 2012, which enhanced search-based QA by leveraging structured entity knowledge to provide direct answers for billions of queries annually. More recently, the DrQA framework in 2017 established the influential retriever-reader paradigm, combining TF-IDF retrieval with neural reading comprehension over Wikipedia to achieve state-of-the-art results on open benchmarks. Evaluation of OpenQA systems typically employs metrics such as Exact Match (EM), which measures whether the predicted answer precisely matches the ground truth, and F1 score, which accounts for partial overlaps in token precision and recall. These are applied to benchmarks like Natural Questions (NQ), where systems retrieve and answer from web documents, reporting F1 scores around 50-60% for top models as of 2019. Other datasets, such as TriviaQA and MS MARCO, emphasize diverse question types and real-world search scenarios to assess retrieval accuracy and answer faithfulness.
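The retriever half of the retriever-reader paradigm can be prototyped in a few lines. The sketch below is a minimal illustration assuming the scikit-learn library: it ranks a toy in-memory passage list by TF-IDF cosine similarity to the question, whereas production OpenQA systems index millions of passages and pair the retriever with a neural reader.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy passage collection standing in for a web-scale corpus.
passages = [
    "The Amazon is the largest river by discharge volume of water in the world.",
    "Mount Everest is Earth's highest mountain above sea level.",
    "The Nile is often regarded as the longest river in the world.",
]

def retrieve(question: str, k: int = 2):
    """Return the top-k passages ranked by TF-IDF cosine similarity."""
    vectorizer = TfidfVectorizer(stop_words="english")
    passage_matrix = vectorizer.fit_transform(passages)
    question_vector = vectorizer.transform([question])
    scores = cosine_similarity(question_vector, passage_matrix).ravel()
    ranked = scores.argsort()[::-1][:k]
    return [(passages[i], float(scores[i])) for i in ranked]

if __name__ == "__main__":
    for passage, score in retrieve("What is the longest river in the world?"):
        print(f"{score:.2f}  {passage}")
```

A reader model would then scan the returned passages for an answer span, which is exactly the division of labor DrQA popularized.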

Specialized Question Answering

Specialized question answering encompasses variants of question answering (QA) systems that extend beyond textual inputs, incorporating domain-specific expertise or non-text modalities such as mathematical notation, visuals, or combinations thereof. These systems address queries requiring precise computation, visual interpretation, or integrated reasoning across multiple data types, often integrating specialized tools like symbolic solvers or vision models to achieve accuracy in constrained domains. Unlike general text-based QA, specialized approaches must handle unique representational challenges, such as formal notations in math or spatial relationships in images. Mathematical QA focuses on solving problems involving equations, proofs, or word problems that demand numerical or algebraic reasoning. Datasets like MathQA provide large-scale collections of math word problems, comprising over 37,000 examples annotated with step-by-step operation programs to facilitate interpretable solving. Techniques in this area often integrate symbolic reasoning, where neural models generate executable programs that invoke solvers for verification, enabling systems to decompose complex problems into verifiable steps. A prominent example is Wolfram Alpha, a computational engine launched in 2009 that uses symbolic computation to answer mathematical queries by evaluating expressions and providing step-by-step derivations. Unique challenges include ensuring step-by-step reasoning accuracy, as errors in intermediate calculations can propagate, and handling diverse problem formats from arithmetic word problems to formal proofs. Visual QA (VQA) involves answering questions about images, requiring models to jointly process visual content and textual queries. Seminal datasets such as VQA v1.0, introduced in 2015, contain approximately 250,000 images paired with 760,000 open-ended questions and 10 million answers, emphasizing the need for vision-language alignment. Techniques typically employ visual encoder components, like convolutional neural networks or vision transformers, to extract image features, which are then fused with question embeddings for prediction. Recent advances leverage transformer-based architectures, such as LXMERT, which uses cross-modality encoders pretrained on image-text datasets to improve performance on tasks like visual entailment and question answering. Challenges in VQA include resolving visual ambiguity, where similar images may yield different answers based on subtle contextual cues, and mitigating language biases that ignore image details. Multimodal QA extends VQA to incorporate additional modalities, such as combining text with images or videos for more comprehensive querying. This variant addresses questions that span static visuals and dynamic sequences, using datasets like those derived from MSVD for video QA, which include thousands of clips with temporal questions. Techniques build on transformers to fuse representations from text encoders and visual processors, enabling reasoning over spatiotemporal elements in videos. For instance, systems processing video inputs apply attention mechanisms to track object trajectories across frames while aligning with textual queries. Key challenges involve managing temporal ambiguity in videos, where actions unfold over time, and scaling integration across modalities without losing fidelity in non-textual reasoning.
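To make the idea of operation programs concrete, the sketch below executes a tiny, MathQA-style sequence of arithmetic operations against the numbers extracted from a word problem. The operation vocabulary and the example program are invented for illustration; in real systems the program is predicted by a neural model and may call a full symbolic solver for verification.

```python
# Minimal interpreter for MathQA-style operation programs (illustrative only).
OPERATIONS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}

def run_program(program, numbers):
    """Execute a list of (op, arg1, arg2) steps.

    Arguments may reference problem numbers ("n0", "n1", ...) or the results
    of earlier steps ("#0", "#1", ...).
    """
    results = []

    def resolve(token):
        if token.startswith("n"):
            return numbers[int(token[1:])]
        if token.startswith("#"):
            return results[int(token[1:])]
        return float(token)

    for op, a, b in program:
        results.append(OPERATIONS[op](resolve(a), resolve(b)))
    return results[-1]

if __name__ == "__main__":
    # "A shirt costs 20 dollars and is discounted by 25 percent. What is the price?"
    numbers = [20.0, 25.0]
    program = [("divide", "n1", "100"), ("multiply", "n0", "#0"), ("subtract", "n0", "#1")]
    print(run_program(program, numbers))  # -> 15.0
```

Because each intermediate result is explicit, errors can be localized to a single step, which is the interpretability benefit such annotations are meant to provide.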

History

Early Developments (Pre-1990s)

The origins of question answering (QA) systems trace back to the early 1960s, when researchers in artificial intelligence began developing programs capable of interpreting natural language queries against structured data. One of the pioneering efforts was the BASEBALL system, created by Bert F. Green Jr. and colleagues in 1961. This program answered questions in English about baseball statistics from a single season, stored on punched cards, by employing syntactic pattern matching and a dictionary-based content analysis to map queries to data retrieval operations. For instance, it could respond to questions like "How many games did the Dodgers win in 1958?" by parsing the input for key elements such as teams and dates, then querying the database accordingly. The system's success in handling a limited domain demonstrated the feasibility of rule-based natural language processing for QA, though it was constrained to predefined patterns and required exact matches for reliable performance. In the late 1960s and early 1970s, influences from computational linguistics advanced QA toward more sophisticated natural language understanding. Terry Winograd's SHRDLU, developed at MIT between 1968 and 1970, represented a significant leap by enabling interactive QA within a simulated "block world" environment. The system processed commands and questions like "Can the table pick up blocks?" by integrating procedural semantics, where representations of the world (e.g., blocks, tables) were manipulated through a parser that understood context, reference, and inference. SHRDLU's ability to maintain dialogue state and resolve ambiguities, such as pronoun references, highlighted the importance of world knowledge in QA, influencing subsequent research in knowledge representation. Its implementation in Micro Planner, a Lisp-based language, underscored the role of symbolic reasoning in achieving coherent responses. The rule-based era of QA expanded in the early 1970s with expert systems tailored to specific domains, exemplified by the LUNAR system developed by William A. Woods in 1971. LUNAR allowed geologists to query a database of lunar rock chemical analyses using natural English, such as "How much iron is in the high-titanium basalts?" The system featured a robust semantic grammar and parser that converted questions into procedural representations for database interrogation, achieving over 90% accuracy on test queries at a lunar conference demonstration. By incorporating domain-specific rules for quantification and aggregation, LUNAR illustrated how QA could support scientific inquiry, paving the way for more complex inference in restricted environments. Central to these early systems were foundational concepts like question templates and semantic grammars, which provided structured ways to interpret natural language without relying on broad statistical models. Question templates, as used in BASEBALL, predefined syntactic patterns to classify and route queries, enabling efficient matching against data schemas. Semantic grammars, prominent in LUNAR and SHRDLU, augmented syntactic parsing with meaning-driven rules to handle variations in phrasing while preserving logical structure, such as distinguishing between "what" and "how many" interrogatives. These approaches emphasized hand-crafted rules and domain expertise, establishing knowledge engineering as a cornerstone of symbolic AI before the shift toward data-driven paradigms.

Rise of Statistical and Machine Learning Methods (1990s–2010s)

The 1990s marked a pivotal shift in QA research from rule-based symbolic approaches to data-driven statistical methods, driven by advances in information retrieval and the growing availability of large text corpora. This era emphasized probabilistic models for passage retrieval and answer extraction, leveraging techniques like term frequency-inverse document frequency (TF-IDF) to identify relevant snippets containing answers. TF-IDF, which weights terms based on their frequency in a document relative to the corpus, became a cornerstone for ranking candidate passages in early QA systems, enabling more scalable processing of unstructured text without deep linguistic parsing. A landmark event was the introduction of the Question Answering track at the Text REtrieval Conference (TREC-8) in 1999, organized by the National Institute of Standards and Technology (NIST), which established standardized evaluations for open-domain systems. The track focused on factoid questions requiring short, precise answers (e.g., 50-byte snippets) from a fixed document collection, promoting the development of systems that retrieved exact answers rather than full documents. Evaluation metrics, such as mean reciprocal rank (MRR), the average of the reciprocal ranks of the first correct answer per question, provided a rigorous benchmark, with MRR scores highlighting the limitations of early statistical methods (typically below 0.3 in initial runs). Participating systems often combined document retrieval engines for initial retrieval with simple statistical scoring for answer selection, setting the stage for broader adoption of probabilistic techniques. Entering the 2000s, machine learning (ML) techniques enhanced statistical QA by improving answer ranking and validation, particularly through supervised classifiers trained on annotated data from evaluations like TREC. For instance, the AskMSR system, developed by Microsoft Research and evaluated at TREC 2002, utilized decision trees, a form of ML, for reranking candidate answers extracted from web search results, achieving an MRR of 0.507 by prioritizing n-grams based on features like word overlap and question type compatibility. This approach exploited web-scale redundancy, where frequent answer occurrences signaled reliability, marking a departure from hand-crafted rules toward learning-based refinement. Broader QA@NIST evaluations, continuing through TREC from 2000 to 2010, refined tasks to include complex questions and "NIL" responses for unanswerable queries, while maintaining MRR as the primary metric alongside strict/lenient scoring variants to assess answer support. These annual benchmarks spurred innovations, with top systems reaching MRR above 0.5 by mid-decade, underscoring the efficacy of statistical pipelines. Knowledge bases like WordNet, a lexical database of English synsets developed in the early 1990s, were integrated into statistical QA to expand query terms and resolve semantic ambiguities during retrieval and answer validation. In systems such as IBM's statistical QA entry at TREC-10 (2001), WordNet facilitated focus expansion by linking query words to synonyms and hypernyms, boosting recall in IR stages without relying on full ontologies. This hybrid use of lexical resources with probabilistic models improved handling of paraphrases, contributing to more robust answer selection in open-domain settings. By the late 2000s, multi-stream QA architectures emerged as a key advancement, combining outputs from multiple independent pipelines to enhance accuracy through redundancy and voting mechanisms.
The MultiStream approach, explored in evaluations like the 2007 Answer Validation Exercise (AVE), aggregated answers from diverse systems, each using statistical IR or ML components, and applied learning-based selection to identify the most supported response, achieving improvements of up to 20% in F1 scores over single-stream baselines. Such methods exemplified the era's emphasis on ensemble techniques, leveraging statistical diversity to mitigate individual system weaknesses. The success of IBM's Deep Blue in defeating chess champion Garry Kasparov in 1997 further inspired computational AI pursuits, indirectly fueling investments in text QA capabilities that culminated in later systems like Watson.
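Since mean reciprocal rank (MRR) is the central metric of this era, a short sketch of how it is computed may help; the ranked candidate lists below are made up purely for illustration.

```python
def mean_reciprocal_rank(ranked_answers, gold_answers):
    """Compute MRR over a set of questions.

    ranked_answers: list of ranked candidate lists, one per question.
    gold_answers: list of sets of acceptable answers, one per question.
    The reciprocal rank of a question is 1 / (position of the first correct
    candidate), or 0 if no candidate is correct.
    """
    total = 0.0
    for candidates, gold in zip(ranked_answers, gold_answers):
        for rank, candidate in enumerate(candidates, start=1):
            if candidate in gold:
                total += 1.0 / rank
                break
    return total / len(ranked_answers)

if __name__ == "__main__":
    ranked = [["Paris", "Lyon"], ["1945", "1939", "1944"], ["Mars"]]
    gold = [{"Paris"}, {"1944"}, {"Venus"}]
    print(mean_reciprocal_rank(ranked, gold))  # (1 + 1/3 + 0) / 3 ≈ 0.44
```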

Deep Learning and Transformer Era (2010s–Present)

The deep learning era in question answering (QA) began in the early 2010s with the adoption of recurrent neural networks (RNNs) and long short-term memory (LSTM) units, which enabled more sophisticated modeling of sequential dependencies in text for tasks like machine reading comprehension. These architectures addressed limitations of earlier statistical methods by learning distributed representations of words and contexts, allowing systems to infer answers from passages without rigid rule-based pipelines. A pivotal advancement was the introduction of Memory Networks in 2014, which incorporated an external memory component to store and retrieve relevant facts, facilitating end-to-end training for QA on synthetic and real-world datasets. This approach demonstrated improved performance on simple factoid questions by dynamically attending to memory slots during inference. The release of the Stanford Question Answering Dataset (SQuAD) in 2016 further catalyzed progress, providing over 100,000 crowd-sourced question-answer pairs from Wikipedia articles and establishing a benchmark for extractive QA that spurred the development of neural models surpassing human baselines. The advent of the Transformer architecture in 2017 revolutionized QA by enabling parallelizable processing and capturing long-range dependencies through self-attention mechanisms. In 2018, Bidirectional Encoder Representations from Transformers (BERT) marked a breakthrough, pre-training a bidirectional Transformer on masked language modeling and next-sentence prediction tasks before fine-tuning on QA datasets like SQuAD. BERT-Large achieved state-of-the-art results of 85.1% exact match and 91.8% F1 on the SQuAD 1.1 test set (single model with TriviaQA fine-tuning) by leveraging contextual embeddings that better captured question-passage alignments compared to unidirectional RNNs. Subsequent variants, including RoBERTa and ELECTRA, refined this paradigm through optimized pre-training objectives and data scaling, solidifying pre-trained Transformers as a standard for closed-domain QA. Entering the 2020s, the scaling of large language models (LLMs) transformed QA into a generative task, where models produce free-form answers rather than extracting spans. GPT-3, released in 2020 with 175 billion parameters, showcased few-shot capabilities for open-domain QA, achieving competitive results on benchmarks like Natural Questions without task-specific fine-tuning by prompting the model with examples. Similarly, the Text-to-Text Transfer Transformer (T5) in 2020 unified QA as a text generation problem within a sequence-to-sequence framework, attaining 90.6% F1 on SQuAD through supervised fine-tuning on diverse tasks. To mitigate hallucinations in generative QA, Retrieval-Augmented Generation (RAG) emerged in 2020, combining parametric LLMs with non-parametric retrieval from external corpora like Wikipedia, yielding up to 44% improvement on knowledge-intensive tasks such as open-domain trivia QA. By 2023, multimodal extensions like GPT-4V integrated vision capabilities, enabling QA over images and text, such as describing visual content in medical or diagram-based queries with accuracies exceeding 80% on specialized benchmarks. As of 2025, QA has increasingly integrated into agentic systems, where autonomous agents leverage QA modules for multi-step reasoning, tool use, and planning in dynamic environments, as seen in frameworks like Agentic-R1 that distill dual reasoning strategies for efficient problem-solving. Efficiency improvements via knowledge distillation have also gained prominence, compressing large models like BERT into smaller variants that retain 95% of QA performance while reducing inference costs by factors of 10, facilitating deployment in resource-constrained settings.

Architectures

Traditional Pipeline Architectures

Traditional pipeline architectures in question answering represent a modular, sequential approach that dominated early systems, particularly in large-scale evaluations like the Text REtrieval Conference (TREC) QA track from 1999 to 2007. These systems break down the QA process into distinct, interpretable stages to handle queries over unstructured text corpora, emphasizing precision in factoid and list question types. The design allows for targeted optimization of each component, drawing on established information retrieval (IR) and natural language processing techniques prevalent before the widespread adoption of deep learning methods. The core structure typically follows a step-by-step pipeline: question processing, document retrieval, passage selection, answer extraction, and verification. In question processing, the input is parsed to identify its type (e.g., who, what, where), expected answer format, and key terms, often using rule-based classifiers or keyword extraction to reformulate it for retrieval. Document retrieval then employs IR engines, such as InQuery in early TREC systems or Lucene in later implementations, to rank and fetch a set of relevant documents from a large collection based on query-document similarity metrics like TF-IDF. Passage selection refines this by identifying candidate text spans within the documents using proximity heuristics or density scoring for answer-bearing content. Answer extraction applies named entity recognition, pattern matching, or shallow parsing to pinpoint potential answers, generating a list of candidates. Finally, verification ranks these candidates by confidence scores derived from evidence strength, redundancy across sources, or semantic coherence, selecting the top candidate for output. Exemplary systems from the TREC QA track illustrate this pipeline in action; for instance, one participating system at TREC 2002 featured a component-based architecture with dedicated modules for question analysis, retrieval via IR tools, and answer validation, enabling systematic performance analysis at each stage. Similarly, another system at TREC 2005 followed a standard sequence of question analysis, document search with engines like Lucene, passage ranking, and answer justification to ensure factual accuracy. These pipelines were highly effective for closed-domain or factoid QA, achieving up to 65% accuracy on TREC-9 questions through precise retrieval and extraction. A primary strength of these architectures lies in their interpretability and modularity, which facilitate debugging, component swapping, and evaluation of individual stages, such as isolating retrieval errors without retraining the entire system, making them suitable for research and development in resource-constrained environments. However, they suffer from error propagation, where inaccuracies in upstream steps (e.g., irrelevant documents from poor retrieval) amplify downstream, leading to brittle performance on complex or ambiguous queries. Prior to 2010, these pipelines were the prevailing paradigm in QA, powering most competitive systems in benchmarks like TREC and enabling scalable handling of web-scale corpora.
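The staged structure described above can be captured in a compact skeleton. The Python sketch below wires five stub stages together in sequence; every stage body is a deliberately naive placeholder standing in for the much richer components (IR engines, NER, validation heuristics) that real TREC-era systems used.

```python
# Skeleton of a traditional QA pipeline; each stage is a deliberately naive stub.
def process_question(question):
    # Pick simple keywords (stands in for question classification and reformulation).
    keywords = [w.strip("?").lower() for w in question.split() if len(w) > 3]
    return {"question": question, "keywords": keywords}

def retrieve_documents(query, corpus):
    # Rank documents by keyword overlap (stands in for an IR engine).
    score = lambda doc: sum(k in doc.lower() for k in query["keywords"])
    return sorted(corpus, key=score, reverse=True)[:3]

def select_passages(documents):
    # Split documents into sentences as candidate passages.
    return [s.strip() for d in documents for s in d.split(".") if s.strip()]

def extract_candidates(passages, query):
    # Keep passages mentioning any keyword (stands in for NER or pattern matching).
    return [p for p in passages if any(k in p.lower() for k in query["keywords"])]

def verify(candidates):
    # Prefer the shortest supported candidate as the answer (toy heuristic).
    return min(candidates, key=len) if candidates else None

def answer(question, corpus):
    query = process_question(question)
    docs = retrieve_documents(query, corpus)
    passages = select_passages(docs)
    candidates = extract_candidates(passages, query)
    return verify(candidates)

corpus = ["The Nile is the longest river. It flows through Egypt.",
          "Mount Everest is the highest mountain."]
print(answer("What is the longest river?", corpus))  # -> "The Nile is the longest river"
```

The modularity is visible in the code: any single stage can be swapped out and evaluated in isolation, but a failure in an early stage (for example, retrieving the wrong documents) propagates to everything downstream, which is exactly the brittleness noted above.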

End-to-End Neural Architectures

End-to-end neural architectures in question answering integrate the entire process, from question encoding and understanding to answer generation or span extraction, within a single, jointly trainable model, typically leveraging encoder-decoder frameworks to process inputs holistically without modular handoffs. These models emerged prominently post-2015, enabling direct optimization of all components via backpropagation on large-scale datasets, which facilitates capturing complex interactions between questions and contexts. A seminal example for extractive question answering is the Bi-Directional Attention Flow (BiDAF) model, introduced in 2017, which employs a multi-stage hierarchical process to represent context at varying granularities and applies bi-directional attention mechanisms, flowing from context to query and vice versa, to identify answer spans within passages. BiDAF achieved state-of-the-art results on the SQuAD dataset at the time, with an F1 score exceeding 80%, demonstrating its effectiveness in handling nuanced semantic alignments without relying on separate retrieval or post-processing steps. For generative question answering, where answers are produced as free-form text rather than extracted spans, models like BART (2019) and T5 (2020) represent key advancements by framing QA as a sequence-to-sequence task. BART, a denoising autoencoder pre-trained on corrupted text, excels in abstractive QA by reconstructing answers from noisy question-context pairs, outperforming prior extractive methods on benchmarks like Natural Questions. Similarly, T5 unifies QA under a text-to-text paradigm, fine-tuning a Transformer to generate answers directly from prefixed inputs like "question: [Q] context: [C]," yielding superior performance on diverse datasets such as TriviaQA with exact match scores around 70%. These architectures offer advantages in handling contextual nuances and multi-hop reasoning, as end-to-end training allows the model to learn implicit alignments and dependencies across the input, often surpassing modular systems in accuracy when scaled on massive corpora such as web crawls. By jointly optimizing encoding and decoding, they reduce error propagation and adapt better to varied question types, though they require substantial computational resources for training. As of 2025, hybrid neural-symbolic architectures have gained traction to enhance robustness in end-to-end QA, integrating neural components for language understanding with symbolic reasoning for logical inference and interpretability, as explored in recent surveys of complex QA systems.
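In practice, a fine-tuned end-to-end extractive model can be queried with very little code. The sketch below assumes the Hugging Face Transformers library and a publicly available SQuAD-style checkpoint (the model name shown is one common choice, used here only as an example); it returns the predicted answer span and a confidence score.

```python
from transformers import pipeline

# Load an extractive QA model fine-tuned on SQuAD-style data.
qa_model = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

context = (
    "The Stanford Question Answering Dataset (SQuAD) was released in 2016 and "
    "contains over 100,000 question-answer pairs drawn from Wikipedia articles."
)
result = qa_model(question="When was SQuAD released?", context=context)

# The pipeline returns the best answer span along with its score and offsets.
print(result["answer"], round(result["score"], 3))
```

Everything between tokenization and span scoring happens inside the single model, which is the defining contrast with the modular pipelines of the previous subsection.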

Methods

Rule-Based and Knowledge-Driven Methods

Rule-based and knowledge-driven methods in question answering represent early deterministic approaches that rely on predefined rules, patterns, and structured knowledge representations to interpret queries and generate responses, without depending on statistical learning or large corpora. These systems typically involve handcrafted rules for translating natural language questions into formal queries that can be executed against a knowledge base, emphasizing logical inference over probabilistic matching. Pattern matching forms a core technique in these methods, where syntactic or semantic templates are used to identify question types and map them to database operations or inference steps. A seminal example of pattern matching in rule-based QA is the LUNAR system, developed in the early 1970s to answer questions about lunar rock samples from the Apollo missions by employing an augmented transition network (ATN) parser to match question patterns against a procedural semantic grammar, enabling precise retrieval from a structured database. Similarly, the BASEBALL system from the 1960s demonstrated early pattern-based QA by processing queries about baseball statistics through rule-driven transformations into relational algebra expressions. Knowledge-driven methods extend this paradigm by leveraging ontologies and knowledge graphs for inference, often using RDF triples (subject-predicate-object statements) to represent domain knowledge and derive answers via logical rules. For instance, template-based systems translate natural language questions into SPARQL queries over RDF data by applying ontology-aligned patterns, allowing inference across related entities in the graph. The Cyc project, initiated in the 1980s and ongoing, exemplifies a large-scale knowledge-driven approach for commonsense QA, encoding millions of assertions in a formal representation language (CycL) to support logical inference and answer complex queries without external training data. Rule engines in closed-domain systems, such as those for technical support or medical diagnostics, further apply these inference mechanisms to predefined knowledge bases for reliable, domain-specific responses. These methods offer key strengths, including high explainability, where decision paths are fully traceable due to explicit rules, and the absence of need for annotated training data, making them suitable for resource-constrained or highly controlled environments. However, they suffer from weaknesses such as poor scalability to open domains, where crafting exhaustive rules becomes infeasible, and coverage gaps arising from incomplete knowledge representations that fail to handle linguistic variations or novel queries. In contrast to data-driven methods, rule-based and knowledge-driven approaches prioritize interpretability over adaptability to unstructured text.
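The sketch below illustrates the flavor of knowledge-driven QA over a handful of invented subject-predicate-object triples: a regular-expression template identifies the question type and its subject, and the answer is looked up, or inferred by a simple transitivity rule, from the triple store. Both the triples and the template are hypothetical.

```python
import re

# Tiny in-memory triple store (subject, predicate, object); contents are invented.
TRIPLES = {
    ("Nile", "flows_through", "Egypt"),
    ("Egypt", "located_in", "Africa"),
    ("Amazon", "flows_through", "Brazil"),
}

WHERE_TEMPLATE = re.compile(r"where is (?P<subject>\w+) located\?", re.I)

def located_in(subject):
    """Direct lookup plus one transitive inference step."""
    for s, p, o in TRIPLES:
        if s == subject and p == "located_in":
            return o
    # Inference rule: X flows_through Y and Y located_in Z => X located_in Z.
    for s, p, o in TRIPLES:
        if s == subject and p == "flows_through":
            return located_in(o)
    return None

def answer(question):
    match = WHERE_TEMPLATE.match(question.strip())
    if match:
        return located_in(match.group("subject").capitalize())
    return None

print(answer("Where is Nile located?"))  # -> "Africa", via flows_through then located_in
```

The decision path is fully traceable (which rule fired, which triples were used), illustrating the explainability advantage noted above, while the brittleness is equally visible: any phrasing the template does not anticipate yields no answer at all.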

Retrieval-Augmented Methods

Retrieval-augmented methods in question answering integrate information retrieval techniques with neural readers or generators to enable scalable, open-domain systems that ground answers in external knowledge sources. These approaches typically involve two main stages: first, retrieving relevant passages or documents from a large corpus based on the query, and second, processing those retrieved items to extract or generate the final answer. This paradigm addresses the limitations of purely parametric models by leveraging non-parametric memory, such as vast text corpora like Wikipedia, to improve factual accuracy and handle knowledge-intensive queries. A foundational technique in retrieval is sparse retrieval, exemplified by BM25, which ranks documents using term frequency and inverse document frequency to match query keywords with corpus content. BM25, developed in the 1990s, remains a baseline for its efficiency in lexical matching without requiring deep semantic understanding. In contrast, dense retrieval methods represent queries and passages as low-dimensional embeddings, enabling similarity-based retrieval that captures semantic relationships. For instance, Dense Passage Retrieval (DPR) uses dual encoders to produce dense vectors for questions and passages, outperforming sparse methods by 9-19% in top-20 passage retrieval accuracy on benchmarks like Natural Questions. Early retrieval-augmented systems focused on extractive QA, such as DrQA, which retrieves candidate paragraphs using TF-IDF or BM25 and then applies a neural reader to identify answers within them, demonstrating strong results on open-domain datasets without end-to-end training. Building on this, Retrieval-Augmented Generation (RAG) extends the framework to generative QA by fusing retrieved documents with a sequence-to-sequence model, allowing the system to produce free-form answers informed by external evidence; RAG set state-of-the-art results on tasks like open-domain trivia QA in 2020 by combining parametric generation with non-parametric retrieval from a dense index of Wikipedia articles. Recent advances as of 2025 emphasize iterative retrieval to handle complex, multi-hop questions that require chaining multiple pieces of information. Methods like KiRAG employ knowledge-driven iteration, where an initial retrieval is refined through subsequent queries generated by a language model, improving accuracy on multi-hop benchmarks by progressively incorporating deeper semantic features. Similarly, ReSP introduces a dual-function summarizer in an iterative retrieval-augmented generation loop to compress and plan retrievals for multi-hop QA, outperforming single-pass baselines on datasets requiring reasoning over extended contexts. These developments enhance scalability for real-world applications while mitigating issues like retrieval noise in intricate queries.
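A minimal dense-retrieval sketch is shown below, assuming the sentence-transformers package and one of its small publicly available encoder checkpoints (the model name is an assumption, not a recommendation). It embeds a question and a toy passage list into the same vector space and ranks passages by dot-product similarity, which is the core idea behind dual-encoder retrievers such as DPR, although DPR itself trains separate question and passage encoders.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Small general-purpose encoder used here purely for illustration.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

passages = [
    "BM25 ranks documents with term frequency and inverse document frequency.",
    "Dense Passage Retrieval encodes questions and passages with dual encoders.",
    "The Transformer architecture was introduced in 2017.",
]

def dense_retrieve(question, k=2):
    """Rank passages by dot-product similarity between embeddings."""
    passage_vectors = encoder.encode(passages)       # shape: (n_passages, dim)
    question_vector = encoder.encode([question])[0]  # shape: (dim,)
    scores = passage_vectors @ question_vector
    top = np.argsort(scores)[::-1][:k]
    return [(passages[i], float(scores[i])) for i in top]

for passage, score in dense_retrieve("How does dense retrieval represent passages?"):
    print(f"{score:.2f}  {passage}")
```

In a full retrieval-augmented system the top-ranked passages would then be concatenated into the prompt or input of a reader or generator, which produces the final answer grounded in that evidence.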

Generative Methods

Generative methods in question answering involve producing free-form text responses directly from input questions, typically leveraging encoder-decoder architectures to map question encodings to answer sequences. Early implementations relied on sequence-to-sequence models, such as those using recurrent neural networks (RNNs) with attention mechanisms, to generate answers autoregressively. A notable extension is the pointer-generator network, which combines generation with copying mechanisms from the input context to improve factual accuracy and handle out-of-vocabulary terms, originally developed for summarization but adapted for QA tasks like knowledge graph-based answering. Key advancements have centered on transformer-based decoders, particularly in large pretrained models like the GPT series, which enable zero-shot or few-shot question answering through in-context learning. In this paradigm, models generate answers by conditioning on prompts that include question-answer demonstrations without parameter updates, as demonstrated in GPT-3's performance on diverse benchmarks. Fine-tuning these models on QA pairs further enhances specificity, allowing adaptation to domain-specific tasks while preserving generative flexibility. The UnifiedQA framework exemplifies this by unifying multiple QA formats, such as extractive, abstractive, and multiple-choice, under a single T5-based model, achieving state-of-the-art results across 20 datasets by reformatting all tasks as text generation. These methods excel at handling non-factoid questions, such as those requiring explanations or reasoning, by producing coherent, free-form outputs rather than fixed spans. However, a primary challenge is hallucination, where models generate plausible but factually incorrect information due to over-reliance on parametric knowledge. Mitigation strategies include advanced prompting techniques, like chain-of-thought reasoning to encourage step-by-step verification, and post-generation checks against external sources, though integration with retrieval can further ground outputs in verified contexts.
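The in-context learning setup described above amounts to assembling a prompt from a handful of question-answer demonstrations followed by the new question. The sketch below only builds such a prompt as a string; the demonstration pairs are invented, and the call to an actual language model is left as a stub so that no particular API is assumed.

```python
# Build a few-shot QA prompt for in-context learning; demonstrations are invented.
DEMONSTRATIONS = [
    ("What is the capital of France?", "Paris"),
    ("Who wrote Hamlet?", "William Shakespeare"),
]

def build_few_shot_prompt(question: str) -> str:
    """Concatenate demonstrations and the target question into one prompt."""
    lines = []
    for demo_question, demo_answer in DEMONSTRATIONS:
        lines.append(f"Q: {demo_question}")
        lines.append(f"A: {demo_answer}")
    lines.append(f"Q: {question}")
    lines.append("A:")  # The model is expected to continue from here.
    return "\n".join(lines)

def answer_with_llm(question: str) -> str:
    prompt = build_few_shot_prompt(question)
    # Stub: send `prompt` to a language model of your choice and return its completion.
    raise NotImplementedError("plug in a language model call")

print(build_few_shot_prompt("When did World War II end?"))
```

No parameters are updated in this setup; the demonstrations alone steer the model toward short, factual completions, which is what distinguishes in-context learning from fine-tuning.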

Applications

Conversational Agents and Virtual Assistants

Conversational agents and virtual assistants rely on question answering (QA) as a foundational component to process and respond to user queries in natural, interactive dialogues. Apple's Siri, introduced in 2011 with the iPhone 4S, pioneered this integration by enabling voice-based QA for tasks such as providing real-time information on weather, sports, and directions through integrations with web services. Amazon's Alexa, launched in 2014, extended QA capabilities via its Echo devices, using cloud-based language understanding to answer factual questions, perform conversions, and handle multi-turn interactions through skills like custom Q&A blueprints. OpenAI's ChatGPT, released in November 2022, advanced conversational QA by leveraging large language models to engage in extended dialogues, admitting errors, and addressing follow-up questions in a human-like manner. A key feature of QA in these systems is context maintenance across multiple turns, allowing agents to reference prior exchanges for coherent responses; for instance, Siri and Alexa support follow-up queries without repetition, while ChatGPT's dialogue format enables it to challenge premises or refine answers based on ongoing conversation history. Personalization further enhances QA by incorporating user history, preferences, and profiles; Alexa uses voice recognition for tailored responses, and ChatGPT adapts to individual interaction styles over time. Google Assistant exemplifies QA for factual retrieval, integrating with Google's Knowledge Graph to deliver quick, accurate answers on topics like weather, local information, and contextual rephrasings for follow-ups, such as clarifying ambiguous queries in real time. In enterprise settings, chatbots built with platforms such as Rasa employ QA to automate customer support, resolving common inquiries on product details or policies through intent detection and retrieval-augmented generation, thereby reducing response times and agent workload. By 2025, trends in conversational AI emphasize emotional intelligence, where agents detect user sentiment via voice tone or text cues to deliver empathetic responses, enhancing support in scenarios like customer service chats or mental health support; this is driven by advances in sentiment-aware models, projected to grow the emotional AI market to $13.4 billion. Recent developments include agentic AI systems that autonomously handle multi-step tasks and multimodal inputs, as seen in updates to models like GPT-4o, enabling more dynamic and context-aware interactions.
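Context maintenance across turns can be illustrated with a simple history buffer that is prepended to every new query before it reaches the answering backend. The sketch below is schematic; answer_backend is a placeholder for whatever QA model or service an assistant actually uses.

```python
class ConversationalQA:
    """Keep a bounded dialogue history so follow-up questions stay in context."""

    def __init__(self, answer_backend, max_turns: int = 5):
        self.answer_backend = answer_backend  # callable: (context_str, question) -> answer
        self.max_turns = max_turns
        self.history = []  # list of (question, answer) pairs

    def ask(self, question: str) -> str:
        context = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in self.history)
        answer = self.answer_backend(context, question)
        self.history.append((question, answer))
        self.history = self.history[-self.max_turns:]  # drop the oldest turns
        return answer

# Example with a trivial echo backend standing in for a real model.
agent = ConversationalQA(lambda ctx, q: f"(answer to '{q}' given {len(ctx)} chars of history)")
print(agent.ask("Who directed Jaws?"))
print(agent.ask("When was it released?"))  # a real model could resolve "it" from the history
```

Bounding the buffer keeps prompts short while still letting the backend resolve pronouns and elliptical follow-ups, which is the behavior the assistants described above expose to users.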

Search Engines and Information Retrieval

Question answering (QA) has significantly enhanced traditional search engines by shifting from mere link provision to delivering direct, synthesized answers, thereby improving the user experience in web search. Early search systems relied on keyword matching, which often required users to sift through multiple links to find relevant information. This evolved with the introduction of the Google Knowledge Graph in 2012, which enabled knowledge panels, structured information boxes displaying key facts about entities such as people, places, or topics directly in search results. These panels draw from a vast database to provide concise answers, reducing the need for users to navigate external sites. Building on this, Google launched featured snippets in January 2014, extracting and reformatting content from top-ranking pages to answer common queries succinctly at the top of search results. Techniques underlying QA in search engines involve processing queries over large-scale web indexes to identify and retrieve precise answers. Search systems employ natural language processing to parse user questions, often using transformer-based models to understand intent and context. A key approach is hybrid QA, which combines retrieval from web corpora with entity linking, mapping query mentions to specific entities in knowledge bases like Wikidata or proprietary graphs, to ground answers in verifiable facts. For instance, Google's systems use entity linking to connect ambiguous terms to entities, enabling accurate extraction of attributes or relations from indexed content. This method improves precision by disambiguating queries and integrating structured data with unstructured text. Prominent examples illustrate QA's integration into major search platforms. Microsoft's Bing incorporates QA into its Visual Search feature, allowing users to upload images and receive textual answers about identified objects, landmarks, or concepts, leveraging multimodal models for interpretation since its expansion in 2023. Similarly, Baidu integrated its ERNIE Bot, a large language model-based QA system, into its search engine in late 2023, enabling generative responses to complex queries by augmenting retrieval with real-time web data. In 2024, Google introduced AI Overviews, a generative QA feature that provides synthesized summaries for complex queries across more than 100 countries by October 2024, drawing from web sources to deliver comprehensive answers. Emerging platforms like Perplexity AI, launched in 2022 and prominent by 2025, specialize in conversational QA with cited sources, offering real-time, accurate responses to factual and research-oriented questions. These implementations demonstrate how QA extends beyond text to visual and conversational elements in search interfaces. The impact of QA features in search engines includes substantial reductions in user effort and enhancements in query accuracy, particularly for factual information needs. By providing direct answers, features like featured snippets and knowledge panels minimize the clicks required to resolve queries, with Google's data indicating that such elements help users find information faster without always needing to visit source sites. Studies confirm increased user satisfaction due to quicker access, though challenges persist in ensuring factual accuracy for dynamic topics like breaking news. Overall, these advancements have made search more efficient, handling billions of daily queries with higher relevance.

Education and Tutoring Systems

Question answering (QA) technologies have been integrated into intelligent tutoring systems (ITS) to provide adaptive, interactive support in educational settings, enabling students to receive immediate, personalized responses to queries during learning activities. Seminal systems like AutoTutor, developed in the early 2000s, use natural language dialogue for conversational QA to simulate human tutoring dialogues, prompting students with questions and scaffolding explanations based on their responses. These systems analyze student inputs against expected answers to detect understanding gaps and deliver targeted feedback, enhancing engagement in subjects like computer literacy and physics. In language learning, platforms such as Duolingo employ QA mechanisms through features like "Explain My Answer" in Duolingo Max, powered by large language models, to clarify grammar rules and vocabulary usage in response to user queries or errors during exercises. For mathematics education, Carnegie Learning's MATHia serves as an AI-driven ITS that incorporates QA to offer step-by-step guidance on problem-solving, adapting question difficulty and providing hints based on real-time performance data from over 500,000 students annually. Similarly, Khan Academy's Khanmigo AI tutor resolves doubts by answering student questions in math, science, and the humanities through guided Socratic-style dialogues, fostering deeper comprehension without direct solutions. QA also supports auto-grading of essays by evaluating responses against rubrics, extracting key arguments via semantic analysis to assign scores and suggest improvements efficiently. The primary benefits of QA in tutoring systems include personalized feedback that adjusts to individual learning paces and the scaffolding of complex explanations through iterative questioning, which research shows improves retention and problem-solving skills in K-12 settings. By 2025, advancements in QA have enabled AI tutors to handle multimodal queries involving diagrams and visuals, such as explaining geometric proofs from uploaded images, as demonstrated in benchmarks like MMTutorBench for multimodal tutoring. These developments, including multi-agent systems for adaptive interactions and AI-enhanced high-dose tutoring with real-time feedback, allow for richer educational experiences across diverse domains.

Evaluation and Progress

Benchmarks and Datasets

Evaluation of question answering (QA) systems relies on standardized metrics that assess the accuracy and quality of predicted answers against gold-standard references. For extractive QA tasks, where the answer is a text span from a given context, the Exact Match (EM) metric measures whether the predicted answer exactly matches the ground truth, providing a strict evaluation. The F1 score, which balances precision and recall at the token level, is commonly used alongside EM to account for partial overlaps in answers. In generative QA, where models produce free-form responses, metrics like BLEU and ROUGE evaluate n-gram overlap and longest common subsequences between generated and reference answers, respectively, though they are less ideal for semantic fidelity. For more complex, open-ended QA involving reasoning or dialogue, human judgments often serve as the gold standard, supplemented by automated proxies due to scalability needs. Seminal datasets have shaped QA research, beginning with reading comprehension benchmarks like SQuAD, introduced in 2016, which consists of over 100,000 question-answer pairs derived from Wikipedia articles, focusing on extractive answers within provided passages. TriviaQA, released in 2017, extends this to open-domain QA with 95,000 trivia questions paired with evidence from web documents and Wikipedia, emphasizing distant supervision and multi-sentence reasoning. Natural Questions (NQ), from 2019, shifts toward real-world queries by using anonymized search logs, resulting in 307,000 questions with answers extracted from Wikipedia, promoting evaluation in web-scale contexts. The evolution of QA datasets reflects a progression from closed-domain, English-centric resources to open-domain, multilingual, and multimodal ones. Early reading comprehension setups, like those in SQuAD, tested models on fixed contexts, but open-domain datasets such as TriviaQA and NQ introduced retrieval challenges, requiring systems to fetch relevant evidence from large corpora. Multilingual extensions, exemplified by XQuAD in 2020, adapt SQuAD to 11 languages with 1,190 question-paragraph-answer triples per language, enabling cross-lingual transfer evaluation without language-specific training data. Broader benchmarks like GLUE (2018) and SuperGLUE (2019) incorporate QA subsets, such as QNLI and BoolQ, to assess QA capabilities within composite language understanding tasks, while leaderboards like the Open LLM Leaderboard rank models on QA-specific benchmarks including MMLU and TruthfulQA. Recent advancements emphasize reasoning and multimodality, with BIG-bench (2022) introducing over 200 diverse tasks, including QA variants for logical and commonsense reasoning, to probe scaling behaviors in large models. By 2025, multimodal datasets have proliferated, such as SPIQA for scientific paper image QA and ProMQA for procedural video understanding, extending traditional text-based QA to integrate visual and audio cues, often building on foundations like VQA v2 with bias-mitigated variants. This shift underscores the need for benchmarks that capture real-world complexity across modalities and languages.
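The EM and token-level F1 metrics above are simple enough to compute directly. The following sketch mirrors the usual SQuAD-style procedure (lowercasing, stripping punctuation and articles before comparison); it is a minimal illustration rather than the official evaluation script.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, remove punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> int:
    return int(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the Nile River", "Nile river"))                     # -> 1
print(round(token_f1("Nile river in Africa", "the Nile river"), 2))    # -> 0.67
```

Dataset-level scores are simply the average of these per-question values over all examples, with the maximum taken over multiple reference answers when a benchmark provides them.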

Challenges and Future Directions

One of the primary challenges in generative question answering (QA) systems is the phenomenon of hallucination, where models produce plausible but factually incorrect or unsubstantiated responses due to over-reliance on parametric knowledge or gaps in retrieved evidence. This issue persists even in advanced large language models (LLMs), with studies showing hallucination rates exceeding 20% in open-domain QA tasks without external verification mechanisms. To mitigate this, researchers emphasize retrieval-augmented generation (RAG) techniques, though integration remains imperfect for complex queries. Bias in training data represents another significant hurdle, as QA models often inherit and amplify societal prejudices embedded in corpora like Wikipedia or Common Crawl, leading to skewed answers that disadvantage underrepresented groups in areas such as gender, race, or nationality. For instance, analyses of models trained on English-centric corpora reveal up to 30% higher error rates for queries involving non-Western cultural contexts. Addressing this requires diverse data curation and debiasing algorithms, yet progress is slow due to the scale of data involved. Robustness to adversarial questions further limits QA reliability, as systems are vulnerable to perturbations like paraphrasing or adding irrelevant details that cause sharp performance drops, sometimes by over 50% on benchmarks designed for such attacks. This vulnerability arises from superficial pattern matching rather than semantic understanding. Multilingual and low-resource support compounds these issues, with models performing poorly on non-English queries due to insufficient training data; for example, zero-shot transfer to low-resource languages can yield accuracy below 10% compared to 70% for English. Ethical concerns in QA, particularly conversational variants, include privacy risks from retaining user interaction data for personalization, potentially exposing sensitive information in violation of regulations like GDPR. Additionally, these systems can propagate misinformation by confidently outputting unverified claims, exacerbating societal harms in high-stakes domains such as healthcare or news summarization. Efforts to incorporate fact-checking layers are underway, but they often trade off response speed and naturalness. Looking to future directions, neurosymbolic QA approaches aim to enhance reasoning by hybridizing neural networks with symbolic logic, enabling more interpretable and accurate handling of multi-hop questions that current LLMs struggle with. Continual learning models, which allow incremental adaptation without catastrophic forgetting, promise sustained performance in dynamic environments by continuously incorporating new knowledge streams. Integration of QA with robotics for embodied QA is an emerging frontier, where systems must ground answers in physical interactions, such as querying object affordances in real-world tasks, bridging the gap between textual and sensory data. As of 2025, advances in efficient inference, including quantized LLMs that reduce model size by up to 4x while maintaining near-full accuracy, are enabling deployment on edge devices for real-time QA applications. Similarly, zero-shot QA capabilities have improved through instruction-tuning paradigms, achieving competitive results on unseen domains without task-specific fine-tuning, though generalization to novel reasoning patterns remains a key research goal.

  26. [26]
    [PDF] A Comprehensive Approach with the MathQA Dataset - HAL
    Aug 5, 2024 · The MathQA dataset has 37,259 math word problems across six categories, with 80% for training, 12% for dev, and 8% for test.
  27. [27]
    VQA: Visual Question Answering
    VQA is a dataset with open-ended questions about images, requiring vision, language, and commonsense knowledge to answer.Challenge · Download · VQA v1 · Code
  28. [28]
    Learning Cross-Modality Encoder Representations from Transformers
    Aug 20, 2019 · LXMERT is a framework to learn vision-and-language connections using a Transformer model with three encoders: object relationship, language, ...
  29. [29]
    Video Question Answering: Datasets, Algorithms and Challenges
    This survey covers VideoQA datasets (normal, multi-modal, knowledge-based, factoid, inference), techniques, and research trends beyond factoid VideoQA.
  30. [30]
    Baseball: an automatic question-answerer - ACM Digital Library
    Baseball is a computer program that answers questions in English about stored data, using a dictionary and content analysis to extract information.
  31. [31]
    Procedures as a Representation for Data in a Computer Program for ...
    This paper describes a system for the computer understanding of English. The system answers questions, executes commands, and accepts information in normal ...Missing: original | Show results with:original
  32. [32]
    [PDF] Bridging the Lexical Chasm: Statistical Approaches to Answer-Finding
    (idf) weighting is a popular IR method for weighting terms by their “information content,” taken to be related to the frequency with which documents contain ...
  33. [33]
    Question answering in TREC | Proceedings of the tenth international ...
    A question answering track was introduced in TREC-8 1999. The track has generated wide-spread interest in the QA problem [2, 3, 4], and has documented ...
  34. [34]
    [PDF] An Analysis of the AskMSR Question-Answering System
    We built a decision tree to predict whether a correct answer appears in the top 5 answers, based on all of the question-derived features de- scribed earlier, ...Missing: SVM | Show results with:SVM
  35. [35]
    Text REtrieval Conference (TREC) QA Data
    Apr 23, 2002 · The QA task runs were evaluated using mean reciprocal rank (MRR). The score for an individual question was the reciprocal of the rank at which ...
  36. [36]
    [PDF] IBM's Statistical Question Answering System - TREC-10
    Focus expansion using WordNet (Miller, 1990). . Dependency relationships using syntatic pars- ing. . A maximum entropy formulation for answer se- lection ...
  37. [37]
    Learning to select the correct answer in multi-stream question ...
    This paper focuses on this problem, namely, the selection of the correct answer from a given set of responses corresponding to different QA systems. In ...
  38. [38]
    [PDF] Evaluating Answer Validation in multi-stream Question Answering
    Dec 16, 2008 · We follow the opinion that Question Answering. (QA) performance can be improved by combining different systems.
  39. [39]
    Deep Blue - IBM
    Deep Blue has had an impact on computing in many industries. It gave developers insights into ways they could design computers to analyze a vast number of ...Missing: question answering
  40. [40]
    [PDF] Overview of the TREC-9 Question Answering Track
    The TREC question answering track is an effort to bring the benefits of large-scale evaluation to bear on the question answering problem.
  41. [41]
    The TREC question answering track | Natural Language Engineering
    Feb 14, 2002 · The Text REtrieval Conference (TREC) question answering track is an effort to bring the benefits of large-scale evaluation to bear on a ...<|control11|><|separator|>
  42. [42]
    [PDF] A Multi-Strategy and Multi-Source Approach to Question Answering
    Traditional question answering systems typically employ a single pipeline architecture, consisting roughly of three components: question analysis, search, and ...
  43. [43]
    [PDF] A Data Driven Approach to Query Expansion in Question Answering
    Information re- trieval (IR) performance, provided by en- gines such as Lucene, places a bound on overall system performance. For example, no answer bearing ...<|control11|><|separator|>
  44. [44]
    [PDF] The JAVELIN Question-Answering System at TREC 2002
    The architecture is designed to support component-level evaluation, so that competing strategies and operators can be compared in terms of various performance ...
  45. [45]
    [PDF] Question Answering with QED at TREC-2005
    With respect to its architecture, QED is a fairly tradi- tional QA system, which is composed of a standard se- quence of modules: Question Analysis, Document ...
  46. [46]
    Integrating Modular Pipelines with End-to-End Learning: A Hybrid ...
    The advantage of this architecture lies in its interpretability, enabling a comprehensive evaluation of each component. However, the extensive coupling of ...
  47. [47]
    When it's all piling up: Investigating error propagation in an NLP ...
    Dec 14, 2016 · However, the cascading structure is prone to error propagation, where early-stage inaccuracies can amplify throughout the pipeline and lead to ...
  48. [48]
    A review on persian question answering systems: from traditional to ...
    Feb 13, 2025 · The current study provides a brief explanation of these systems' evolution from traditional architectures to LLM-based approaches, their classification, the ...
  49. [49]
    [1703.04816] Making Neural QA as Simple as Possible but not Simpler
    Mar 14, 2017 · In this work, we propose a simple heuristic that guides the development of neural baseline systems for the extractive QA task.
  50. [50]
    End-to-End Models for Complex AI Tasks | Capital One
    May 12, 2021 · Advantages of end-to-end models · Better metrics: · Simplicity: · Reduced effort: · Applicability to new tasks: · Ability to leverage naturally- ...
  51. [51]
    Bidirectional Attention Flow for Machine Comprehension - arXiv
    Nov 5, 2016 · In this paper we introduce the Bi-Directional Attention Flow (BIDAF) ... Question Answering Dataset (SQuAD) and CNN/DailyMail cloze test.
  52. [52]
    BART: Denoising Sequence-to-Sequence Pre-training for Natural ...
    Oct 29, 2019 · We present BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by (1) corrupting text with an arbitrary noising function.
  53. [53]
    Exploring the Limits of Transfer Learning with a Unified Text-to-Text ...
    Oct 23, 2019 · In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language ...
  54. [54]
    A Hybrid Neuro-Symbolic Pipeline for Coreference Resolution and ...
    This study presents a hybrid neuro-symbolic pipeline that combines transformer-based contextual encoding with symbolic coreference resolution and Abstract ...
  55. [55]
  56. [56]
  57. [57]
    Template-based question answering over RDF data
    We present a novel approach that relies on a parse of the question to produce a SPARQL template that directly mirrors the internal structure of the question.Missing: early | Show results with:early
  58. [58]
    Cyc: toward programs with common sense - ACM Digital Library
    Cyc is a bold attempt to assemble a massive knowledge base (on the order of 108 axioms) spanning human consensus knowledge. This article examines the need ...
  59. [59]
    Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
    May 22, 2020 · We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) -- models which combine pre-trained parametric and non-parametric ...
  60. [60]
    The Probabilistic Relevance Framework: BM25 and Beyond
    This paper presents our novel relevance feedback (RF) algorithm that uses the probabilistic document-context based retrieval model with limited relevance ...
  61. [61]
    Dense Passage Retrieval for Open-Domain Question Answering
    Apr 10, 2020 · This paper introduces a dense passage retrieval method for open-domain QA, using dense representations and a dual-encoder framework, ...
  62. [62]
    [1704.00051] Reading Wikipedia to Answer Open-Domain Questions
    Mar 31, 2017 · This paper proposes to tackle open- domain question answering using Wikipedia as the unique knowledge source: the answer to any factoid question ...
  63. [63]
    Advancing Multi-hop Question Answering with an Iterative Approach
    Jul 18, 2024 · In this paper, we propose a novel iterative RAG method called ReSP, equipped with a dual-function summarizer. This summarizer compresses information from ...
  64. [64]
    Research on Automatic Question Answering of Generative ... - MDPI
    In the answer generation part, one combination of a vocabulary constructed by the knowledge graph and a pointer generator network(PGN) is proposed to point to ...Abstract · Share and Cite · Article Metrics
  65. [65]
    UnifiedQA: Crossing Format Boundaries With a Single QA System
    May 2, 2020 · Abstract:Question answering (QA) tasks have been posed using a variety of formats, such as extractive span selection, multiple choice, etc.
  66. [66]
    7 Practical Techniques to Reduce LLM Hallucinations
    Sep 30, 2025 · Must know approaches to mitigate hallucinations in LLMs · 1. Prompting · 2. Reasoning · 3. Retrieval Augmented Generation (RAG) · 4. ReAct (Reason + ...
  67. [67]
    Apple's Siri voice assistant based on extensive research - CNN
    Oct 5, 2011 · The program lets people bark commands or ask questions to the phone, and it will provide an answer or ask follow-up questions in order to ...
  68. [68]
    Siri – Knowledge and References - Taylor & Francis
    In 2011, Apple released its Siri technology for the iPhone (Apple, 2015). Siri is a virtual assistant able to understand natural-language voice commands and ...
  69. [69]
    Alexa can now help brands answer customer questions
    Sep 14, 2022 · All answers go through Alexa's content moderation and quality checks before Alexa selects the most relevant answer to share with customers.
  70. [70]
    How Amazon Alexa Works Using NLP: A Complete Guide
    Aug 6, 2025 · Amazon Alexa uses NLP to comprehend, decipher, and react to voice commands. The foundation of Alexa's capabilities is NLP.
  71. [71]
    Introducing ChatGPT - OpenAI
    Nov 30, 2022 · The dialogue format makes it possible for ChatGPT to answer followup questions, admit its mistakes, challenge incorrect premises, and reject ...Introducing ChatGPT search · Introducing ChatGPT Pro · OpenAI announces new...
  72. [72]
    About Alexa Conversations | Alexa Skills Kit - Amazon Developers
    Nov 27, 2023 · Alexa Conversations is a deep learning–based approach to dialog management that enables you to create natural, human-like voice experiences on Alexa.
  73. [73]
    Amazon Alexa – Learn what Alexa can do
    From microphone and camera controls to the ability to view and delete your voice recordings, you have transparency and control over your Alexa experience. Learn ...Alexa Information · Alexa Profiles · Alexa Entertainment · Alexa Productivity
  74. [74]
    AI Chatbot to Discover, Learn & Create - ChatGPT
    Type, talk, and use it your way. With ChatGPT, you can type or start a real-time voice conversation by tapping the soundwave icon in the mobile app.Download · Business · Education · EnterpriseMissing: 2022 | Show results with:2022
  75. [75]
    What you can ask Google Assistant
    Get to know your Assistant: “Do you dream?” “What's your favorite color?” Games: “Let's play a game.” “Give me a trivia question.” Entertainment: “Tell me a ...
  76. [76]
    Contextual Rephrasing in Google Assistant
    May 17, 2022 · We demonstrate how Assistant is now able to rephrase follow-up queries, adding contextual information before providing an answer.
  77. [77]
    Enterprise chatbots: Why and how to use them for support - Zendesk
    Jul 15, 2025 · Start with the chatbot's flow—it's your answer tree for customer questions. The bot flow allows you to helpfully direct the conversation to ...
  78. [78]
    A Complete Guide to Enterprise Customer Service Chatbot Platforms
    Dec 20, 2024 · Customer care: Chatbots provide instant answers, resolve issues, and deliver personalized support. Enterprise operations and IT helpdesk: ...
  79. [79]
    Emotionally Intelligent AI Voice Agents - SuperAGI
    Jun 27, 2025 · According to a report by IDC, the market for emotional AI is expected to grow to $13.4 billion by 2025, with emotional computing being a key ...Personalization At Scale · The Human-Ai Collaboration... · Emotion Detection And...
  80. [80]
    Conversational AI Trends for 2025: What You Need to Know - DXwand
    6. Emotionally Intelligent AI Chatbots. A major focus in AI development is creating chatbots with emotional intelligence, setting them apart from traditional ...1. Generative Ai's Big... · 3. Voice Assistants Go... · 5. Ai Chatbots: Expanding...
  81. [81]
    8 Conversational AI Trends in 2025 - Daffodil Software
    Jan 13, 2025 · Voice Emotion Recognition: AI systems can also analyze the pitch, tone, and speed of a user's voice to infer emotional states. This lets AI ...3) Emotionally Intelligent... · 5) Voice Search Optimization · 6) Integration With Iot And...
  82. [82]
    Introducing the Knowledge Graph: things, not strings - The Keyword
    May 16, 2012 · The Knowledge Graph enables you to search for things, people or places that Google knows about—landmarks, celebrities, cities, sports teams, ...
  83. [83]
    A reintroduction to Google's featured snippets - The Keyword
    Jan 30, 2018 · When we introduced featured snippets in January 2014, there were some concerns that they might cause publishers to lose traffic. What if ...
  84. [84]
    How Google May Use Entity References to Answer Questions
    Oct 12, 2014 · Google describes how it may answer questions from facts on Web pages by looking for entity references from structured and unstructured data.
  85. [85]
    Visual Search now live in Bing Chat
    Jul 18, 2023 · Bing Chat now supports visual search, which means you can now upload a photo or take a picture and have Bing Chat respond with answers around those visuals.
  86. [86]
    Baidu to integrate ERNIE 4.0, which 'rivals' GPT-4, into Search
    Oct 17, 2023 · The Chinese tech giant is planning to incorporate Ernie 4.0 into its search engine, which will change its SERPs.
  87. [87]
    Investigating the Influence of Featured Snippets on User Attitudes
    Mar 20, 2023 · This paper examines the effect of featured snippets in more nuanced and complicated search scenarios concerning debated topics that have no ...Missing: reduce effort
  88. [88]
    [PDF] Intelligent Tutoring Systems with Conversational Dialogue
    The tutoring systems present challenging prob- lems and questions to the learner, the learner types in answers in English, and there is a lengthy mul- titurn ...Missing: seminal | Show results with:seminal
  89. [89]
    [PDF] Intelligent Tutoring Systems: New Challenges and Directions - IJCAI
    ITS research has successfully delivered techniques and systems that provide adaptive support for student problem solving or question-answering activities in a ...
  90. [90]
    Introducing Duolingo Max, a learning experience powered by GPT-4
    Mar 14, 2023 · Duolingo Max is a new subscription tier above Super Duolingo that gives learners access to two brand-new features and exercises: Explain My Answer and Roleplay.How the Duolingo English Test... · Practice Hub · Talking to real learners
  91. [91]
    MATHia by Carnegie Learning | AI-Powered Math Supplement for ...
    MATHia, our award-winning, intelligent math software, is designed to provide individual student support and insightful data. Request a Demo ...Missing: answering | Show results with:answering
  92. [92]
    Meet Khanmigo: Khan Academy's AI-powered teaching assistant ...
    Type in a homework question and get instant help. Like a good tutor, Khanmigo gently guides your child to discover the answers themselves. Get Khanmigo.Free, AI-powered teacher... · Learners · Parents · Writing Coach
  93. [93]
    CoGrader | AI Essay Grader | Spend Less Time Grading, More Time ...
    CoGrader is an AI essay grader that helps teachers provide quality feedback, saving 80% of grading time, and provides timely, specific feedback.AI Grading Tool · How to Grade Essays Using... · AI Essay Grader · AI In CoGrader
  94. [94]
    A systematic review of AI-driven intelligent tutoring systems (ITS) in ...
    May 14, 2025 · This systematic review aims to identify the effects of ITSs on K-12 students' learning and performance and which experimental designs are currently used to ...
  95. [95]
    MMTutorBench: The First Multimodal Benchmark for AI Math Tutoring
    Oct 27, 2025 · We present MMTutorBench, the first multimodal benchmark for AI math tutoring. It evaluates MLLMs across diverse mathematical domains and ...
  96. [96]
    Adaptive Multi-Agent Tutoring AI for Multimodal Mathematics ...
    Oct 26, 2025 · This paper introduces a conversational AI tutoring system grounded in Large Language Models (LLMs) and multi-agent design, with each agent ...
  97. [97]
    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for ...
    May 9, 2017 · TriviaQA is a reading comprehension dataset with over 650K question-answer-evidence triples, including 95K question-answer pairs and six ...
  98. [98]
    SuperGLUE: A Stickier Benchmark for General-Purpose Language ...
    May 2, 2019 · In this paper we present SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, a software toolkit, and a ...Missing: QA | Show results with:QA
  99. [99]
    Open LLM Leaderboard Archived - Hugging Face
    Compare the performance of open-source Large Language Models using multiple benchmarks like IFEval, BBH, MATH, GPQA, MUSR, and MMLU-PRO.
  100. [100]
    Datasets Benchmarks 2024 - NeurIPS 2025
    We introduce SPIQA (Scientific Paper Image Question Answering), the first large-scale QA dataset specifically designed to interpret complex figures and tables ...
  101. [101]
    ProMQA: Question Answering Dataset for Multimodal Procedural ...
    ProMQA consists of 401 multimodal procedural QA pairs on user recording of procedural activities, ie, cooking, coupled with their corresponding instruction.