
Information extraction

Information extraction (IE) is a subfield of natural language processing that focuses on automatically identifying and extracting structured information—such as entities, relationships, and events—from unstructured or semi-structured text sources, transforming raw textual data into a format suitable for querying, analysis, and knowledge base population. This process addresses the challenge of handling vast amounts of natural language data by converting it into relational tuples or other structured representations, enabling applications in search engines, question answering systems, and knowledge graph construction.

The subtasks of information extraction typically include named entity recognition (NER), which identifies and classifies entities like persons, organizations, or locations in text; relation extraction, which detects semantic relationships between entities (e.g., "employs" between an organization and a person); and event extraction, which uncovers structured descriptions of events, including participants, triggers, and arguments. Additional components may involve coreference resolution to link pronouns or expressions to their referents, and template filling to populate predefined schemas with extracted facts. These tasks often operate in a pipeline or joint manner, with challenges arising from ambiguity, context dependence, and the need for domain-specific adaptation.

Historically, information extraction took shape as a distinct field in the 1990s with rule-based systems like FASTUS, which employed finite-state transducers to process text for tasks such as terrorist event detection. The field advanced significantly through the Message Understanding Conferences (MUCs), starting with MUC-3 in 1991, which standardized evaluation metrics and promoted supervised machine learning approaches for entity and template extraction. Subsequent programs like Automatic Content Extraction (ACE) in the early 2000s expanded focus to relations and events, shifting paradigms from hand-crafted rules to statistical models and, later, deep neural networks. In recent years, the advent of large language models (LLMs) has revolutionized IE through generative paradigms that directly output structured information without extensive task-specific training data or predefined schemas. Techniques such as prompt-based extraction and fine-tuning of models like BERT or GPT variants have improved performance in low-resource and open-domain settings, though challenges persist in scaling to long documents. Open information extraction (OpenIE), a flexible paradigm unrestricted by predefined schemas or relation types, has similarly evolved from early extractors like TextRunner to LLM-driven methods, facilitating broader applications in web-scale knowledge acquisition.

Fundamentals

Definition and Scope

Information extraction (IE) is the automatic process of identifying and deriving specific, structured knowledge—such as entities, relations, and events—from unstructured or semi-structured machine-readable text. This involves transforming natural language content into formalized representations that can be queried, analyzed, or integrated into databases, distinguishing IE as a core technique in natural language processing (NLP) for handling textual data at scale. The scope of IE includes extracting predefined facts in closed systems to populate structured formats like relational databases, knowledge graphs, or triples (e.g., subject-predicate-object), as well as open extraction for broader discovery tasks without fixed schemas or domains. It differs from text mining, which encompasses pattern recognition, clustering, and classification across large text corpora to uncover trends or associations, often without requiring rigid structured outputs. In contrast to knowledge extraction, which involves building comprehensive ontologies or deriving insights from both structured and unstructured sources, IE emphasizes targeted fact retrieval to support downstream applications like question answering or semantic search. Core tasks within IE, such as named entity recognition, contribute to this structured output but are explored in greater detail elsewhere. Originating within the NLP community, IE has expanded beyond traditional text to encompass multimedia and visually rich documents, enabling extraction from diverse formats like images or layouts embedded in text. A primary goal is to minimize human annotation efforts in knowledge base population by automating the identification of novel entities and relations from vast, unstructured sources. For instance, IE systems can process news articles to extract person names (e.g., "Elon Musk") and locations (e.g., "Austin, Texas"), structuring them into database entries for real-time analysis.
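The structured output that IE targets can be illustrated with a minimal Python sketch; the sentence, entity types, character offsets, and the simple dataclass layout below are illustrative assumptions rather than a standard interchange format:
python
from dataclasses import dataclass

# Minimal illustration of turning an unstructured sentence into structured records.
# The spans and the relation are hard-coded here; an IE system would produce them.

@dataclass
class Entity:
    text: str
    label: str   # e.g., PERSON, ORG, LOC
    start: int   # character offset in the source sentence
    end: int

@dataclass
class Relation:
    subject: Entity
    predicate: str
    obj: Entity

sentence = "Elon Musk moved Tesla's headquarters to Austin, Texas."
person = Entity("Elon Musk", "PERSON", 0, 9)
place = Entity("Austin, Texas", "LOC", 40, 53)
fact = Relation(person, "moved_to", place)

# A subject-predicate-object triple suitable for a database or knowledge graph.
print((fact.subject.text, fact.predicate, fact.obj.text))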

Core Components

Information extraction systems typically operate through a modular pipeline that processes unstructured text to produce structured outputs, consisting of preprocessing, core extraction modules, and post-processing stages. Preprocessing begins with tokenization, which segments raw text into words, numerals, and punctuation marks using delimiters such as spaces or periods, followed by part-of-speech (POS) tagging to assign grammatical categories like nouns or verbs to each token, often employing tag sets such as the Penn Treebank's 45 labels for syntactic analysis. These initial steps prepare the text for downstream extraction by normalizing its structure and highlighting linguistic patterns. Extraction modules then apply task-specific logic to identify entities, relations, or events, while post-processing handles refinements such as coreference resolution—linking pronouns or repeated mentions to their antecedents—and merging overlapping or redundant extractions to ensure consistency. Key components within this pipeline include feature extractors, which generate inputs for extraction models by deriving linguistic features like word shapes, dictionary matches for common entities (e.g., person names), and syntactic features such as dependency parses or POS sequences to capture contextual relationships. Confidence scoring mechanisms assign probabilistic scores to extracted elements, estimating their reliability based on model outputs (e.g., via conditional random fields), allowing systems to filter low-confidence results or propagate uncertainty in downstream applications. Output normalization standardizes the final results into formats like RDF for knowledge graph integration or JSON for database ingestion, ensuring compatibility with structured querying systems. Component interdependencies drive the pipeline's efficiency, as preprocessing outputs directly feed extraction modules—for instance, POS tags inform entity boundary detection by prioritizing noun phrases—and post-processing depends on extraction results for resolution tasks, creating a sequential flow where errors in early stages propagate unless mitigated by iterative feedback. Early systems like FASTUS from the 1990s exemplified this modular design through cascaded finite-state transducers, with stages for tokenization, phrase recognition (using POS-informed patterns), pattern matching for relations, and merging of incident descriptions, achieving high throughput on news texts.
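A minimal sketch of such a modular pipeline is given below; the whitespace-level tokenizer, the toy gazetteer, and the fixed confidence value are placeholders standing in for real preprocessing, extraction, and scoring components:
python
import json
import re

# Toy pipeline: preprocessing -> extraction -> post-processing -> normalization.
GAZETTEER = {"Acme Corp": "ORG", "Jane Doe": "PERSON"}  # assumed toy entity list

def preprocess(text):
    # Naive tokenization; real systems also add POS tags, sentence splits, etc.
    return re.findall(r"\w+|[^\w\s]", text)

def extract_entities(text):
    # Dictionary-based extraction with a made-up confidence score per match.
    results = []
    for name, label in GAZETTEER.items():
        for m in re.finditer(re.escape(name), text):
            results.append({"text": name, "label": label,
                            "start": m.start(), "end": m.end(),
                            "confidence": 0.9})
    return results

def postprocess(entities, threshold=0.5):
    # Filter low-confidence extractions; merging/coreference would also go here.
    return [e for e in entities if e["confidence"] >= threshold]

def normalize(entities):
    # Output normalization to JSON for downstream ingestion.
    return json.dumps({"entities": entities}, indent=2)

text = "Jane Doe joined Acme Corp in 2021."
print(preprocess(text))
print(normalize(postprocess(extract_entities(text))))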

Historical Development

Early Foundations

The origins of information extraction trace back to the 1960s within the field of computational linguistics, where early efforts focused on pattern matching techniques to process and generate structured reports from natural language texts. Influenced by the need to automate linguistic analysis, researchers developed systems that parsed sentences using predefined grammatical patterns and semantic constraints to extract key elements into formal representations. A seminal example was Naomi Sager's Linguistic String Project (LSP) at New York University, initiated in the mid-1960s, which targeted medical discharge summaries and radiology reports. The LSP employed hand-crafted parsing rules based on a restricted grammar to convert unstructured text into database-compatible formats, such as CODASYL records, enabling automated indexing and retrieval of clinical data.

In the 1970s, the Advanced Research Projects Agency (ARPA, now DARPA) funded pioneering projects in speech and natural language understanding, laying the groundwork for more targeted information extraction from messages and narratives. These initiatives, including the Speech Understanding Research program (1971–1976), emphasized robust processing of spoken and written language for practical applications, motivating the development of template-based extraction to handle variability in input texts. A notable outcome was the FRUMP system, developed at Yale in Roger Schank's group and introduced around 1977, which used "sketchy scripts"—hand-crafted structures representing stereotypical event sequences—to extract and summarize facts from UPI news wires, covering about 60 common news scenarios with keyword-driven matching.

The 1980s marked the introduction of more sophisticated rule-based systems, driven by ARPA's Message Understanding Conferences (starting with MUCK-1 in 1987), which standardized evaluations for extracting structured information from naval messages and similar documents. The CIRCUS system, developed at the University of Massachusetts in the late 1980s, exemplified this era by integrating symbolic and connectionist methods to produce semantic case frames from medical and general texts, relying on domain-specific hand-crafted rules for syntactic parsing and conceptual dependency analysis. Similarly, the SCISOR system, created by Paul Jacobs at General Electric Research in the mid-1980s, processed naval intelligence messages and corporate reports using a hybrid architecture of rule-based templates and scripts to identify entities, relations, and events, such as mergers or military actions, outputting them as filled templates for database population. These systems highlighted the reliance on expert-encoded knowledge to achieve precision in narrow domains.

Throughout these decades, the primary motivations for information extraction stemmed from military and intelligence requirements to automate the analysis of vast document volumes, reducing manual labor in processing reports for strategic decision-making. ARPA's investments addressed the limitations of traditional information retrieval by enabling the creation of structured knowledge bases from unstructured sources, such as intercepted messages or field reports, to support rapid intelligence assessment and operational planning.

Modern Evolution

The Message Understanding Conferences (MUC-3 to MUC-7), held from 1991 to 1998 and funded by DARPA, played a pivotal role in standardizing information extraction tasks by introducing shared evaluation metrics and datasets for named entity recognition, coreference resolution, and template filling, which shifted the field toward more systematic benchmarking and spurred the development of practical systems. During the 1990s, statistical models began to emerge as alternatives to rule-based approaches, with Hidden Markov Models (HMMs) gaining prominence for named entity recognition by modeling sequential dependencies in text to achieve higher accuracy on unstructured prose. Early efforts related to open information extraction, such as the WHIRL system, also began exploring relation discovery from web text. In the 2000s, the explosion of web-scale data drove innovations in scalable information extraction, exemplified by systems like KnowItAll, which automated fact extraction from billions of web pages using unsupervised methods and probabilistic inference to handle diverse, noisy sources. Concurrently, the Automatic Content Extraction (ACE) program, running from 1999 to 2008, advanced relation and event extraction by defining annotation standards and evaluation protocols for multilingual texts, influencing subsequent benchmarks like TAC-KBP and enabling more robust handling of complex scenarios such as temporal and spatial relations. The 2010s marked a deep learning boom in information extraction, with transformer-based architectures like BERT revolutionizing tasks through contextual embeddings that improved performance on NER and RE, yielding gains of several percentage points on benchmarks such as CoNLL-2003 (NER) and of up to 10-15% in some RE settings like ACE 2005. In the 2020s, large language models (LLMs) such as GPT variants enabled zero-shot and few-shot extraction via prompting and instruction tuning, allowing adaptation to new domains without extensive labeled data and achieving state-of-the-art results on event extraction datasets like ACE 2005. Recent trends as of 2025 have emphasized multimodal information extraction, integrating text and images through benchmarks like MatViX, which facilitate evaluation of models extracting structured data from visually rich documents, enhancing applications in scientific literature analysis. The transformer era has further shifted paradigms toward end-to-end learning, reducing reliance on pipelines while raising ethical considerations in large-scale extraction, including privacy risks from automated data harvesting and biases in model training on web corpora.

Core Tasks

Named Entity Recognition

Named Entity Recognition (NER) is a fundamental subtask of information extraction that involves locating and classifying named entities in unstructured text into predefined categories, such as persons, organizations, locations, and dates. This process transforms raw textual data into structured representations, enabling downstream applications like knowledge base population and question answering. The task originated in the mid-1990s as part of efforts to extract structured information from news articles, with early definitions focusing on identifying noun phrases that refer to real-world entities. Traditional techniques for NER include dictionary-based matching, where predefined gazetteers or lists of known entities are used to identify matches in text, offering high precision for well-covered domains but struggling with unseen entities. Rule-based patterns, involving handcrafted linguistic rules such as regular expressions for capitalization or contextual cues, were prominent in early systems and remain useful for domain-specific scenarios. Supervised learning approaches, which dominate modern NER, train models on annotated corpora using features like word shapes (e.g., patterns of capitalization and punctuation), part-of-speech tags, and surrounding context to predict entity boundaries and types. These methods, often implemented with sequential models like Conditional Random Fields (CRFs), allow for probabilistic labeling of token sequences. NER faces significant challenges, including ambiguity, where a term like "Apple" could refer to a company or a fruit depending on context, requiring disambiguation through surrounding words or domain knowledge. Nested entities add further complexity, as one entity can embed another (e.g., "New York City" containing "New York" as a location within a larger location), violating assumptions of non-overlapping spans in flat NER models and necessitating hierarchical recognition strategies. These issues are exacerbated in low-resource languages or domains with sparse annotations, leading to reduced generalization. Evaluation of NER systems typically employs precision (the proportion of predicted entities that are correct), recall (the proportion of true entities that are identified), and the harmonic mean F1-score, computed at the entity level to account for boundary and type accuracy. The CoNLL-2003 benchmark, derived from Reuters news articles, serves as a standard dataset with annotations for persons (PER), organizations (ORG), locations (LOC), and miscellaneous (MISC) entities, where early systems achieved F1-scores around 80-85%, while modern models exceed 92%. For instance, BiLSTM-CRF models on this dataset report an F1-score of 91.00, highlighting the benchmark's role in tracking progress. The evolution of NER traces back to the mid-1990s, when the named entity task was formalized at the Sixth Message Understanding Conference (MUC-6) and the BIO (Beginning-Inside-Outside) tagging scheme, originally proposed for text chunking, came into use for marking multi-token entities. By the early 2000s, supervised machine learning with feature engineering, as in the CoNLL-2003 shared task, shifted focus to statistical models like CRFs for better handling of context. In contemporary developments, contextual embeddings from pre-trained language models, such as BERT (introduced in 2018), have revolutionized NER by capturing bidirectional dependencies and semantic nuances, achieving state-of-the-art F1-scores above 93% on benchmarks like CoNLL-2003.
This progression from rigid tagging to embedding-based representations underscores NER's integration into broader neural architectures.
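The BIO scheme and entity-level spans described above can be made concrete with a short sketch; the token sequence, tags, and the simplified decoder below are hand-written illustrations rather than a reference implementation:
python
# Decode a BIO-tagged token sequence into entity spans (simplified illustration).
tokens = ["Apple", "hired", "Tim", "Cook", "in", "Cupertino", "."]
tags   = ["B-ORG", "O", "B-PER", "I-PER", "O", "B-LOC", "O"]

def bio_to_entities(tokens, tags):
    entities, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append(current)
            current = {"type": tag[2:], "tokens": [token]}
        elif tag.startswith("I-") and current and current["type"] == tag[2:]:
            current["tokens"].append(token)
        else:
            # "O" tags (and stray "I-" tags) close any open entity.
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(" ".join(e["tokens"]), e["type"]) for e in entities]

print(bio_to_entities(tokens, tags))
# [('Apple', 'ORG'), ('Tim Cook', 'PER'), ('Cupertino', 'LOC')]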

Relation Extraction

Relation extraction is a core task in information extraction that involves identifying and classifying semantic relationships between entities mentioned in text, typically represented as binary or n-ary tuples such as (entity1, relation, entity2). For example, in the sentence "Apple employs Tim Cook," the system detects the "employs" relation between the organization "Apple" and the person "Tim Cook." This process assumes entities have been previously identified through named entity recognition and focuses on linking them via predefined or open relation types to build structured knowledge representations. Early methods for relation extraction relied on pattern matching, where handcrafted linguistic templates capture relational expressions, such as Hearst patterns for detecting hypernymy (e.g., "such as" or "including" to infer "is-a" relations like "fruit such as apple"). More advanced techniques incorporate dependency parsing to identify argument roles and syntactic structures, enabling the extraction of relations based on grammatical dependencies between entity mentions and relational verbs or phrases. These approaches allow for scalable processing of unstructured text while handling variations in expression. Relation extraction encompasses supervised and unsupervised subtasks. In supervised settings, models are trained on labeled corpora like TACRED, a dataset of over 106,000 examples from newswire and web text, annotated with 42 relation types plus a "no relation" class, to classify relations between entity pairs. Unsupervised methods, such as bootstrapping, start with seed examples (e.g., known entity pairs and patterns) and iteratively expand by applying patterns to unlabeled text, discovering new instances without extensive annotation. A notable example is the REVERB system, introduced in 2011, which performs open-domain relation extraction from web text using lexical and syntactic constraints to identify verb-based relations with high precision (over 80% for 30% of extractions), outperforming prior open information extraction tools. Key challenges in relation extraction include handling long-distance dependencies, where entities are separated by intervening clauses or phrases, complicating syntactic analysis and relation inference. Implicit relations, which lack explicit markers and rely on contextual inference (e.g., co-reference or world knowledge), further reduce accuracy, as models must bridge gaps without direct lexical cues. These issues persist across domains, necessitating robust feature engineering or hybrid methods to maintain performance on diverse texts.
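A minimal sketch of Hearst-style pattern matching for hypernymy is shown below; the two regular expressions and the example sentence are simplified illustrations, not the full pattern inventory from Hearst's work, and real systems would operate over noun phrases rather than single words:
python
import re

# Simplified Hearst-style patterns: "X such as Y" and "X including Y"
# yield (Y, is-a, X) tuples. Single words stand in for noun phrases here.
PATTERNS = [
    re.compile(r"(\w+)\s+such as\s+(\w+)"),
    re.compile(r"(\w+)\s*,?\s+including\s+(\w+)"),
]

def extract_is_a(sentence):
    triples = []
    for pattern in PATTERNS:
        for hypernym, hyponym in pattern.findall(sentence):
            triples.append((hyponym, "is-a", hypernym))
    return triples

print(extract_is_a("She studied fruit such as apples and pears."))
# [('apples', 'is-a', 'fruit')]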

Event Extraction and Templating

Event extraction is a key subtask in information extraction that focuses on identifying and structuring information about events described in text, including the event trigger (a word or phrase indicating the occurrence, such as "attack" in a sentence about a conflict), the event type (e.g., attack or transport), and the associated arguments that fill roles like victim, perpetrator, time, and location. This process extends relation extraction by capturing dynamic, multi-argument scenarios rather than static pairwise links, often resulting in filled templates that answer "who-did-what-when-where-how" to represent complete event records. The task was formalized in the Automatic Content Extraction (ACE) 2005 evaluation, which defined events as specific occurrences with triggers and arguments drawn from predefined schemas. The primary subtasks of event extraction include trigger detection, which locates and classifies potential event indicators in text; argument identification and classification, which assigns semantic roles to entities or phrases relative to the trigger (e.g., labeling "militants" as perpetrator in an attack event); and event coreference resolution, which links multiple mentions of the same event across a document to avoid redundancy and build cohesive narratives. Trigger detection often relies on pattern matching or supervised models to spot verbs or nouns signaling events, while argument classification uses role inventories to map participants, such as distinguishing between agent and patient roles. Event coreference further integrates these by resolving ambiguities, like connecting a pronoun reference back to an earlier event mention. Key techniques for event extraction draw from frame semantics, as in the FrameNet resource, which defines event frames as semantic structures linking predicates to their arguments, enabling the parsing of nuanced roles beyond simple relations. Complementing this, PropBank provides predicate-argument structures for verbs, annotating sentences with numbered roles (e.g., Arg0 for agent, Arg1 for patient) to support event templating in domains like biomedical texts. Temporal reasoning enhances these by modeling event durations and sequences using Allen's interval algebra, a qualitative framework with 13 basic relations (e.g., "before," "overlaps," "during") to infer timelines between events without precise timestamps. This algebra is particularly useful in event relation extraction for ordering complex narratives, such as determining if one event precedes or meets another. Challenges in event extraction include handling multi-event overlap, where multiple events share arguments or triggers within the same span, complicating disambiguation (e.g., a single phrase triggering both a transport and attack event), as addressed in models like CasEE. Another issue is processing hypothetical or conditional events, such as those in reported speech or counterfactuals, which lack real-world grounding and require context-aware filtering to avoid extracting non-actual occurrences. Recent advancements, like the RESIN system for cross-document event extraction, introduce rich event representations that link events across texts using unified schemas, improving scalability for knowledge base construction. In the 2020s, multimodal event extraction has emerged to handle video and image-text pairs, as in the GAIA framework, which fuses visual cues with textual triggers for more robust detection in multimedia sources.
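The filled-template view of an event, together with a couple of Allen-style interval relations, can be sketched as follows; the schema fields, role names, and integer time intervals are illustrative choices rather than the ACE or Allen-algebra reference encodings:
python
from dataclasses import dataclass, field

# A filled event template: trigger, type, and role-labeled arguments.
@dataclass
class Event:
    trigger: str
    event_type: str
    arguments: dict = field(default_factory=dict)  # role -> filler
    interval: tuple = (0, 0)                       # assumed (start, end) times

attack = Event("attacked", "Conflict.Attack",
               {"Attacker": "militants", "Target": "convoy", "Place": "Kabul"},
               interval=(10, 12))
rescue = Event("evacuated", "Movement.Transport",
               {"Agent": "medics", "Artifact": "casualties"},
               interval=(12, 15))

# Two of Allen's thirteen interval relations, written over (start, end) pairs.
def before(a, b):
    return a.interval[1] < b.interval[0]

def meets(a, b):
    return a.interval[1] == b.interval[0]

print(before(attack, rescue), meets(attack, rescue))  # False True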

Approaches and Techniques

Rule-Based Methods

Rule-based methods in information extraction involve hand-engineered linguistic rules and patterns designed to deterministically identify and extract structured information from unstructured text. These approaches typically utilize cascading rules, where initial simple patterns annotate basic linguistic elements such as tokens or phrases, and subsequent rules operate on those annotations to detect more complex structures like entities or relations. Finite-state transducers (FSTs) form a core mechanism, enabling efficient, regular-expression-based matching over sequences of text or annotations to transform input into output representations. Gazetteers—precompiled lists of domain-specific terms or entities—are frequently incorporated to boost precision in recognizing predefined categories, such as organization names or locations, by matching against known vocabularies. A prominent example is the Java Annotation Patterns Engine (JAPE), a finite-state transduction tool that applies regular expressions over annotations to perform pattern matching and semantic tagging in a rule-driven pipeline. JAPE rules specify left-hand-side patterns for matching and right-hand-side actions for generating new annotations, facilitating modular extraction processes. Early applications of such rule-based techniques appeared in the Message Understanding Conferences (MUC) during the 1990s, where systems like FASTUS employed cascaded FSTs to extract event templates from news articles, achieving competitive performance through iterative rule application. These methods have been applied to core tasks like named entity recognition by defining explicit patterns for entity boundaries and types. The primary advantages of rule-based methods include high interpretability, as the explicit rules allow domain experts to understand, debug, and modify the extraction logic directly, and the absence of training data requirements, enabling rapid deployment in specialized domains without annotated corpora. However, these methods are limited by their brittleness to linguistic variations, such as synonyms, syntactic ambiguities, or informal phrasing, which can result in failures on out-of-pattern texts and require extensive manual rule engineering for maintenance. Rule-based information extraction techniques originated in the 1980s and gained prominence in the 1990s through initiatives like the MUC series, which standardized evaluation and spurred development of robust systems for real-world texts. Despite the rise of data-driven alternatives, these methods persist in hybrid configurations for high-precision applications, particularly in legal texts where deterministic control ensures compliance and accuracy in extracting clauses, entities, or obligations from contracts and statutes.
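The cascading-rule idea can be sketched in Python rather than JAPE syntax; the two-stage rules, the gazetteer contents, and the annotation tuples below are illustrative assumptions, not an excerpt from any deployed rule set:
python
import re

# Stage 1: annotate basic elements using a gazetteer and a title pattern.
GAZETTEER = {"locations": ["London", "Paris"], "titles": ["Dr.", "Prof."]}

def stage1(text):
    annotations = []
    for loc in GAZETTEER["locations"]:
        for m in re.finditer(re.escape(loc), text):
            annotations.append(("LOC", m.start(), m.end()))
    for title in GAZETTEER["titles"]:
        for m in re.finditer(re.escape(title) + r"\s+[A-Z]\w+", text):
            annotations.append(("PERSON", m.start(), m.end()))
    return annotations

# Stage 2: rules operating on stage-1 annotations, e.g. PERSON "visited" LOC.
def stage2(text, annotations):
    relations = []
    persons = [a for a in annotations if a[0] == "PERSON"]
    locations = [a for a in annotations if a[0] == "LOC"]
    for p in persons:
        for l in locations:
            if p[2] < l[1] and "visited" in text[p[2]:l[1]]:
                relations.append((text[p[1]:p[2]], "visited", text[l[1]:l[2]]))
    return relations

text = "Dr. Smith visited London last week."
ann = stage1(text)
print(ann, stage2(text, ann))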

Machine Learning and Statistical Approaches

Machine learning and statistical approaches to information extraction represent a shift from rigid rule-based systems by leveraging data-driven models to learn patterns probabilistically, enabling adaptation to varied text structures through training on examples. These methods gained prominence in the early 2000s, particularly with the adoption of graphical models that model dependencies in sequences or graphs, peaking during that decade as annotated corpora became more available for supervised learning. Unlike rule-based baselines, which require manual crafting of patterns, statistical models estimate parameters from data to maximize likelihood, improving generalization across domains. Supervised techniques form the core of these approaches, relying on annotated corpora to train models for tasks like sequence labeling and classification. Conditional Random Fields (CRFs), introduced as probabilistic graphical models for segmenting and labeling sequences, have been widely used in named entity recognition and similar extraction subtasks. In CRFs, the probability of a label sequence y given an input sequence x is given by P(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^T \exp \left( \sum_k \lambda_k f_k(y_t, y_{t-1}, x, t) \right), where Z(x) is the normalization factor, \lambda_k are learned weights, and f_k are feature functions capturing local dependencies, such as n-gram word contexts or part-of-speech tags. Training involves maximizing the log-likelihood on labeled data using techniques like L-BFGS optimization. For relation classification, Support Vector Machines (SVMs) with kernel functions, such as subsequence or dependency tree kernels, classify pairs of entities by learning hyperplanes in high-dimensional feature spaces derived from lexical and syntactic cues. These supervised methods achieve robust performance when sufficient labeled data is available, as demonstrated in benchmarks on corpora like CoNLL-2003 for entity recognition. Unsupervised and semi-supervised variants address the scarcity of annotations by discovering patterns without or with minimal labeling. Unsupervised methods employ clustering algorithms to group similar text spans or the Expectation-Maximization (EM) algorithm to iteratively refine probabilistic models for pattern induction, such as aligning phrases around seed entities to uncover relations. Semi-supervised bootstrapping, exemplified by the Snowball system, starts with a small set of seed tuples and iteratively generates extraction patterns from unlabeled text, then applies those patterns to harvest new tuples while using generality scores to filter noise and prevent drift. This approach scales to large corpora, extracting thousands of relations from news articles with precision around 80% in early evaluations. The 2000s marked a high point for such graphical model-based techniques, with CRFs and Hidden Markov Models dominating due to their ability to handle sequential dependencies. Post-2015 developments in semi-supervised methods include hybrid frameworks combining bootstrapping with active learning or distant supervision to incorporate weak labels from knowledge bases, enhancing efficiency in domain-specific extraction like biomedical texts.
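The CRF formulation above can be made concrete with a small numerical sketch; the three feature functions, their weights, and the tiny label set are invented for illustration and do not come from any trained model:
python
import math
from itertools import product

# Tiny linear-chain CRF over the tokens of "Paris is nice" with two labels.
LABELS = ["LOC", "O"]
tokens = ["Paris", "is", "nice"]

# Hand-picked feature functions f_k(y_t, y_prev, x, t) with weights lambda_k.
def f_cap_loc(y_t, y_prev, x, t):
    return 1.0 if x[t][0].isupper() and y_t == "LOC" else 0.0

def f_lower_o(y_t, y_prev, x, t):
    return 1.0 if x[t][0].islower() and y_t == "O" else 0.0

def f_o_after_loc(y_t, y_prev, x, t):
    return 1.0 if y_prev == "LOC" and y_t == "O" else 0.0

FEATURES = [(1.5, f_cap_loc), (0.5, f_lower_o), (0.8, f_o_after_loc)]

def unnormalized_score(y, x):
    # exp of the sum over positions and features of lambda_k * f_k
    total = 0.0
    for t in range(len(x)):
        y_prev = y[t - 1] if t > 0 else None
        total += sum(w * f(y[t], y_prev, x, t) for w, f in FEATURES)
    return math.exp(total)

# Z(x) sums the unnormalized score over every possible label sequence.
Z = sum(unnormalized_score(y, tokens) for y in product(LABELS, repeat=len(tokens)))
best = max(product(LABELS, repeat=len(tokens)),
           key=lambda y: unnormalized_score(y, tokens))
print(best, round(unnormalized_score(best, tokens) / Z, 3))  # ('LOC', 'O', 'O') ...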

Neural and Deep Learning Methods

Neural and deep learning methods have transformed information extraction by leveraging distributed representations and end-to-end architectures, building on earlier statistical machine learning approaches as precursors for handling sequential dependencies. Early neural models, such as recurrent neural networks (RNNs) and long short-term memory (LSTM) units, excelled in sequence labeling tasks like named entity recognition (NER) by capturing contextual dependencies in text. For instance, bidirectional LSTMs combined with conditional random fields (BiLSTM-CRF) achieved state-of-the-art performance on NER benchmarks by modeling both local and global sequence information. The introduction of contextualized embeddings marked a pivotal shift; ELMo in 2018 provided deep contextual representations from LSTMs, revolutionizing IE tasks by improving accuracy in entity and relation extraction through layered, task-agnostic features. Similarly, BERT in 2019, with its bidirectional transformer architecture, further advanced IE by enabling fine-tuning for joint extraction of entities and relations, surpassing prior methods on datasets like CoNLL-2003 for NER. Transformer-based models have since dominated IE, offering scalable architectures that process entire sequences in parallel via self-attention mechanisms. The self-attention operation computes relevance scores between input elements as \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V, where Q, K, and V are query, key, and value matrices, and d_k is the dimension of the keys, allowing models to weigh contextual relationships dynamically. Fine-tuning BERT for joint IE extracts entities and relations simultaneously by predicting structured outputs from contextual embeddings, reducing error propagation compared to pipeline approaches. Graph neural networks (GNNs) extend this by modeling relation paths as graphs, where nodes represent entities and edges capture syntactic or semantic dependencies; for example, the 2019 GP-GNN generates graph parameters from text to propagate information for relation extraction, achieving superior performance on SemEval-2010 Task 8. Large language models (LLMs) have introduced prompt-based extraction techniques, adapting pre-trained models like GPT-4 for zero- or few-shot IE without extensive fine-tuning. In 2023 adaptations, prompts guide LLMs to parse text into structured triples for relation extraction, as demonstrated in frameworks like ChatIE, which leverages in-context learning to extract events from ACE 2005. These methods excel in handling ambiguity and long-range context by generating free-form outputs, with generative IE models like InstructUIE using instruction tuning on LLMs to unify NER, relation extraction (RE), and event extraction (EE). The advantages of neural and deep learning methods lie in their ability to capture nuanced context and resolve ambiguities inherent in natural language, such as coreference or polysemy, through distributed embeddings and attention. Transformers and LLMs scale effectively with data and compute, enabling few-shot adaptations that outperform traditional supervised models in low-resource domains, as seen in prompt-based EE with UIE (BART-based) achieving 73.36 F1 on ACE 2005 event trigger detection. Recent trends emphasize generative paradigms, where models like T5 linearize extraction outputs into sequences, facilitating end-to-end training across IE tasks and improving generalization.
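The self-attention operation above can be reproduced directly in a few lines of NumPy; the matrix sizes and random values are arbitrary, and the sketch omits the multi-head projections and masking used in real transformer-based IE models:
python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8               # e.g., 4 tokens with 8-dimensional keys
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.sum(axis=-1))  # (4, 8) and attention rows summing to 1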

Applications

Web and Digital Content Processing

Information extraction (IE) plays a pivotal role in processing web-scale unstructured data, such as HTML documents from websites, posts on social media platforms, and dynamic search results, enabling the transformation of raw online content into structured knowledge. This involves adapting IE techniques to handle the vast, heterogeneous nature of digital content, where data is often embedded in semi-structured formats like tables, lists, or text blocks amid varying layouts and languages. Key applications include extracting product details from e-commerce sites and identifying entities from structured elements like Wikipedia infoboxes or real-time news feeds, which support tasks such as price comparison, knowledge base population, and trend monitoring. A prominent use case is wrapper induction, which automates the creation of extraction rules for e-commerce sites to pull structured data like product names, prices, and descriptions from repetitive page layouts. Developed in the late 1990s and refined in subsequent works, this method learns patterns from labeled examples to generate wrappers that reliably parse HTML without manual coding for each site, achieving high accuracy on sites with consistent templates. Similarly, entity extraction from Wikipedia infoboxes leverages the semi-structured attribute-value pairs in these tables to populate knowledge bases, with approaches using infobox matching and text alignment to infer missing facts from article content, improving recall for long-tail entities. For news feeds, IE techniques identify named entities like persons, organizations, and locations in streaming articles, facilitating real-time summarization and event tracking. Techniques for web and digital content processing often integrate DOM (Document Object Model) parsing with core IE methods to navigate HTML structures and isolate relevant sections. By traversing the DOM tree, systems can target specific nodes for entity recognition or relation extraction while discarding irrelevant elements, such as navigation menus or footers, enhancing precision on complex pages. Handling noise—elements like advertisements, boilerplate text, or informal user-generated content on social platforms—is crucial, with methods employing template detection to segment and remove repetitive non-informative blocks based on similarity metrics and statistical analysis of tag frequencies. For user-generated content, which introduces variability like slang and abbreviations, noise reduction draws on domain adaptation to filter low-quality segments before applying extraction models. The impact of these IE applications is evident in powering search engines, where extracted facts from web sources fuel knowledge graphs like Google's, which integrates billions of entities and relations derived from HTML parsing and entity linking to deliver contextual search results. In the 2000s, projects such as YAGO demonstrated the scale of web IE by combining facts extracted from Wikipedia with WordNet, producing knowledge bases with millions of facts that grew to billions of triples in later versions and laying the foundation for large-scale knowledge bases. More recently, from 2022 to 2025, efforts have shifted toward real-time extraction from social platforms, enabling focused user profiling and event detection in streaming data to support applications like crisis monitoring and personalized recommendations.
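A minimal sketch of DOM-based extraction with a wrapper-style rule follows; the HTML snippet, CSS class names, and boilerplate filter are invented for illustration, and the example assumes the third-party BeautifulSoup (bs4) package:
python
from bs4 import BeautifulSoup  # assumes the bs4 package is installed

# Invented product page with a navigation menu acting as boilerplate.
html = """
<html><body>
  <nav class="menu">Home | Deals | Login</nav>
  <div class="product"><span class="name">Wireless Mouse</span>
      <span class="price">$24.99</span></div>
  <div class="product"><span class="name">USB-C Cable</span>
      <span class="price">$9.99</span></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Discard boilerplate nodes before extraction (here, the nav menu).
for nav in soup.select("nav"):
    nav.decompose()

# Wrapper-style rule: each div.product yields a (name, price) record.
records = [
    {"name": div.select_one(".name").get_text(strip=True),
     "price": div.select_one(".price").get_text(strip=True)}
    for div in soup.select("div.product")
]
print(records)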

Domain-Specific Uses

In the biomedical domain, information extraction techniques are widely applied to mine protein-protein interactions from scientific literature, such as PubMed abstracts, enabling researchers to map complex biological networks. The BioNLP shared tasks, organized since 2009, have been pivotal in advancing these methods by providing annotated corpora for event extraction tasks focused on biomolecular interactions, with participating systems achieving F1 scores up to 0.52 for protein interaction detection in early iterations. These efforts have facilitated downstream applications like knowledge graph construction for drug target identification. In finance, IE supports the analysis of earnings call transcripts by extracting sentiment signals and key events, such as management guidance on revenue or risks, which inform market predictions. For instance, self-supervised models have been developed to identify insightful phrases from these transcripts, correlating extracted sentiments with stock performance metrics like abnormal returns. Such extractions help quantitative analysts detect temporal shifts in executive tone, with studies showing correlations between sentiment and financial outcomes. In the legal domain, IE is employed for contract clause identification, automating the detection of provisions like termination rights or confidentiality obligations from lengthy agreements. The Contract Understanding Atticus Dataset (CUAD), comprising over 500 annotated commercial contracts with 41 clause types, serves as a benchmark for these tasks, where transformer-based models achieve micro-F1 scores around 0.89 for clause extraction. This approach reduces manual review time, which traditionally consumes 60-80% of legal workflows, by prioritizing high-risk clauses for expert scrutiny. Domain-specific customizations enhance IE accuracy through the integration of specialized ontologies, such as the Unified Medical Language System (UMLS) in healthcare, which maps synonymous terms across vocabularies like SNOMED CT and MeSH to resolve ambiguities in clinical texts. UMLS, comprising over 3 million concepts from 171 sources, supports entity normalization in biomedical extraction pipelines. However, challenges persist due to domain jargon, including acronyms, abbreviations, and polysemous terms that vary by subfield, often reducing baseline IE precision to below 0.70 without ontology guidance. The 2010s marked substantial growth in clinical IE applications for electronic health records (EHRs), driven by the U.S. HITECH Act's incentives for EHR adoption, which significantly increased data availability in major healthcare systems. This expansion enabled IE to extract phenotypes and adverse events from free-text notes, with literature showing a substantial increase in publications during the decade, focusing on tasks like de-identification and cohort identification. As of 2025, IE advancements have bolstered AI-assisted drug discovery by automating extraction from pharmaceutical patents, identifying novel compounds and therapeutic claims to accelerate repurposing and novelty assessment. AI agents process patent corpora to retrieve chemical structures and efficacy data, aiding in the analysis of patent information.
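Ontology-guided normalization of the kind UMLS enables can be sketched with a toy synonym table; the concept identifiers and synonym lists below are invented placeholders, far simpler than the real Metathesaurus:
python
# Toy concept table mapping surface forms to a canonical concept identifier.
CONCEPTS = {
    "C0001": {"canonical": "hypertension",
              "synonyms": {"hypertension", "high blood pressure", "htn"}},
    "C0002": {"canonical": "diabetes mellitus",
              "synonyms": {"diabetes mellitus", "diabetes", "dm"}},
}

# Invert the table into a lookup from lowercase mention to concept ID.
LOOKUP = {syn: cid for cid, entry in CONCEPTS.items() for syn in entry["synonyms"]}

def normalize(mention):
    cid = LOOKUP.get(mention.lower())
    return (cid, CONCEPTS[cid]["canonical"]) if cid else (None, mention)

for mention in ["HTN", "high blood pressure", "asthma"]:
    print(mention, "->", normalize(mention))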

Integration with Other AI Systems

Information extraction (IE) plays a pivotal role in populating knowledge graphs (KGs) by transforming unstructured text into structured triples that represent entities, relations, and attributes, thereby enabling scalable knowledge representation. For instance, DBpedia, one of the largest open-domain KGs, relies on IE techniques applied to Wikipedia infoboxes and abstracts to extract millions of facts, facilitating semantic web applications and interoperability across datasets. Similarly, distant supervision methods leverage existing KGs to automatically label training data from web corpora, extending graphs like Wikidata with high-coverage relational facts while mitigating manual annotation costs. In question-answering (QA) systems, IE enhances accuracy by supplying extracted facts as grounding for retrieval-augmented generation (RAG), where relevant entities and relations from documents are retrieved to inform large language model (LLM) responses, reducing hallucinations and improving factual precision. This integration is particularly evident in RAG pipelines, which use IE to preprocess corpora into queryable knowledge bases, enabling end-to-end systems that combine extraction with generative reasoning for complex queries. IE integrates seamlessly with chatbots through entity grounding, where extracted entities from user dialogues are linked to external KGs to resolve ambiguities and maintain conversational context, as seen in neuro-symbolic frameworks that combine rule-based extraction with LLM disambiguation for robust entity resolution. Furthermore, fusion with computer vision in multimodal IE allows extraction of relational facts from videos, such as event-object interactions, by aligning visual detections with textual captions to build dynamic KGs from surveillance or instructional footage. The primary benefits of these integrations include enabling symbolic reasoning over extracted data, where structured outputs from IE serve as inputs to inference engines for tasks like path querying in KGs or causal analysis in QA, thus bridging statistical pattern recognition with logical deduction. However, a key challenge is error propagation across pipeline stages, where inaccuracies in entity recognition cascade to downstream relation extraction or KG population, amplifying overall system unreliability without joint modeling or uncertainty estimation. In the 2020s, neuro-symbolic IE approaches have gained prominence for producing verifiable outputs by hybridizing neural encoders with symbolic rules, ensuring explainable extractions that mitigate black-box issues in pure deep learning models. Post-2023, LLM-IE hybrids have advanced this trend, leveraging fine-tuned LLMs for zero-shot extraction in domain-specific pipelines, such as radiology reports or historical documents, while incorporating symbolic constraints to enhance precision and adaptability.
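A minimal sketch of how extracted triples can populate a small knowledge graph and ground a multi-hop answer is shown below; the triples, the adjacency-style storage, and the query helper are illustrative assumptions rather than any particular KG or RAG framework:
python
from collections import defaultdict

# Extracted (subject, relation, object) triples, e.g. from an IE pipeline.
triples = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Marie Curie", "field", "physics"),
    ("Warsaw", "capital_of", "Poland"),
]

# Store the graph as subject -> relation -> set of objects.
graph = defaultdict(lambda: defaultdict(set))
for s, r, o in triples:
    graph[s][r].add(o)

def answer(subject, relation):
    # Grounded lookup that a QA or RAG system could cite as supporting facts.
    return sorted(graph[subject][relation])

print(answer("Marie Curie", "born_in"))       # ['Warsaw']
# Two-hop query: country of the birthplace.
birthplaces = answer("Marie Curie", "born_in")
print([c for b in birthplaces for c in answer(b, "capital_of")])  # ['Poland']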

Tools and Implementations

Open-Source Frameworks

Open-source frameworks have played a pivotal role in making information extraction (IE) accessible to researchers and developers, providing modular tools for tasks like named entity recognition (NER), relation extraction, and event extraction. These frameworks often include pre-trained models, extensible architectures, and integration with programming languages such as Python, enabling rapid prototyping and customization for various domains. One of the most widely adopted frameworks is spaCy, an industrial-strength NLP library developed by Explosion AI, which offers robust modules for NER and relation extraction through its dependency parsing and entity linking capabilities. SpaCy's features include pre-trained models trained on large corpora like OntoNotes for multilingual support, and its extensibility allows users to fine-tune models using custom training data via the spacy train command in Python pipelines. As of 2025, spaCy version 3.8 supports Python 3.13 and continues active development. For instance, a typical pipeline might load a pre-trained English model, add a custom NER component, and process text for extracting organizations and locations:
python
import spacy

# Load the small pre-trained English pipeline (install via `python -m spacy download en_core_web_sm`).
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is buying a UK startup for $1 billion.")
for ent in doc.ents:
    # Each entity span carries a predicted label such as ORG, GPE, or MONEY.
    print(ent.text, ent.label_)
This modularity has fostered extensive community contributions on GitHub, enhancing its IE components. Stanford CoreNLP, maintained by Stanford NLP Group, provides a comprehensive Java-based pipeline for IE tasks, including coreference resolution, dependency parsing, and open information extraction (OpenIE), which extracts relational tuples from text without predefined schemas. Its pre-trained models, such as those based on neural dependency parsers, support English and multiple languages, and it can be integrated into Python via the stanfordcorenlp wrapper for seamless use in scripts. An example usage involves annotating text to extract triples like (subject, relation, object):
python
from stanfordcorenlp import StanfordCoreNLP

# Connect to a CoreNLP server already running locally on port 9000.
nlp = StanfordCoreNLP('http://localhost', port=9000)
text = 'Barack Obama was born in Hawaii.'
# Request the OpenIE annotator; the server returns JSON containing relation triples.
result = nlp.annotate(text, properties={'annotators': 'openie', 'outputFormat': 'json'})
print(result)
The framework's community engagement is evident in its contributions to shared tasks like SemEval, where Stanford tools have been baselines for relation extraction challenges. Other notable frameworks include Flair, a PyTorch-based library specializing in state-of-the-art sequence labeling for NER and other IE subtasks with contextual string embeddings, and NLTK (Natural Language Toolkit), which provides foundational tools for tokenization, POS tagging, and basic entity extraction suitable for educational and prototyping purposes. Hugging Face's Transformers library, released in 2018, has democratized neural IE by providing access to thousands of pre-trained models like BERT and RoBERTa fine-tuned for NER and relation extraction tasks. Its Python API facilitates easy loading and inference, such as using a pipeline for zero-shot entity extraction:
python
from transformers import pipeline

# Token-classification pipeline with a BERT model fine-tuned on CoNLL-2003 NER.
extractor = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english")
result = extractor("Hugging Face is based in New York.")
print(result)  # list of tokens with entity labels and confidence scores
Transformers has integrated with LangChain, enabling LLM-based extraction chains that combine prompting with traditional IE for hybrid systems, further enhancing its utility in production pipelines. This library's impact is underscored by its role in community-driven benchmarks and promoting collaborative advancements in IE.

Commercial and Proprietary Solutions

Commercial and proprietary solutions for information extraction primarily consist of cloud-based platforms and APIs offered by major technology companies, providing scalable, managed services for enterprise use. These tools emerged prominently in the mid-2010s as part of the broader adoption of software-as-a-service (SaaS) models in natural language processing, allowing organizations to extract entities, relations, and other structured data from unstructured text without building custom infrastructure.

Google Cloud Natural Language API, launched in 2016, enables entity recognition across categories such as persons, organizations, locations, and events, along with sentiment and syntax analysis for comprehensive text processing. It supports relation extraction through entity linking and offers scalability for processing vast datasets via serverless architecture, with custom model training available through AutoML for domain-specific adaptations. Pricing follows a pay-as-you-go model based on units analyzed, starting at low costs for small volumes. This API facilitates business intelligence applications, such as deriving actionable insights from customer reviews and reports, and compliance monitoring by identifying regulated content in documents.

Amazon Comprehend, introduced in 2017, provides entity recognition for types including personally identifiable information (PII), keyphrase extraction, and custom classifiers to detect specific patterns or relations in text. It scales automatically to handle high-volume inputs like emails and social media feeds, with options for training bespoke models on proprietary datasets without requiring machine learning expertise. The service uses a character-based pricing structure, charging per 100 characters processed. Common use cases include business intelligence for sentiment analysis in customer support tickets and compliance efforts, such as automatically redacting sensitive data in legal or financial documents.

IBM Watson Discovery, available since 2017, integrates information extraction with search capabilities, supporting entity extraction for custom domains, relation detection via natural language understanding, and smart document processing for elements like tables and images. It offers scalability through cloud deployment and allows non-experts to train models using active learning on industry-specific data. Pricing includes a free lite plan, with paid tiers based on query volume and storage. The platform is employed in business intelligence to accelerate text analysis in sectors like insurance, reducing processing time by up to 90%, and in compliance for auditing regulatory documents in energy industries.

Snowflake's Document AI, which reached general availability in October 2024, leverages large language models to extract structured information from unstructured documents, including text, tables, handwritten notes, and checkboxes, supporting zero-shot and fine-tuned extraction for document types like invoices. It scales via SQL-based pipelines within Snowflake's data cloud for continuous processing of large document sets, with custom fine-tuning ensuring privacy for user data. Pricing aligns with Snowflake's consumption-based credits. In August 2025, table extraction capabilities reached general availability. These solutions support business intelligence by transforming raw documents into queryable data for analytics and compliance by automating detection of key facts in regulatory filings.
Unlike open-source frameworks, these proprietary offerings emphasize managed scalability, vendor support, and seamless integration with enterprise cloud ecosystems for production-grade deployments.
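As a hedged illustration of how such managed APIs are typically invoked, the sketch below calls Amazon Comprehend's detect_entities operation through the boto3 SDK; it assumes boto3 is installed and AWS credentials are configured, and the example text is invented:
python
import boto3  # assumes AWS credentials are available in the environment

# Minimal sketch of entity extraction with Amazon Comprehend's detect_entities.
comprehend = boto3.client("comprehend", region_name="us-east-1")

text = "Acme Corp signed a contract with Jane Doe in Berlin on 3 March 2024."
response = comprehend.detect_entities(Text=text, LanguageCode="en")

for entity in response["Entities"]:
    # Each entity carries a type (PERSON, ORGANIZATION, LOCATION, DATE, ...)
    # and a confidence score.
    print(entity["Type"], entity["Text"], round(entity["Score"], 2))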

Evaluation Metrics and Benchmarks

Evaluation of information extraction (IE) systems relies on standard metrics that quantify the accuracy and completeness of extracted entities, relations, and events from text. The primary metrics are precision, recall, and the F1-score, which are derived from true positives (TP), false positives (FP), and false negatives (FN). Precision is the ratio of correctly extracted items to all extracted items, \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}. Recall is the ratio of correctly extracted items to all actual items, \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}. The F1-score harmonizes these as their harmonic mean, \text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}. These metrics are applied at the entity level, where performance is assessed per individual entity or relation instance, or at the document level, which evaluates the overall structure and coherence of extractions across an entire text, often requiring alignment of multiple spans or arguments. Key benchmarks provide standardized datasets for comparing IE models. The CoNLL-2003 shared task dataset, comprising annotated news articles in English and German, serves as a foundational benchmark for named entity recognition (NER), focusing on entities like persons, locations, organizations, and miscellaneous. For relation and event extraction, the Text Analysis Conference (TAC) Knowledge Base Population (KBP) track offers datasets with annotations for entity linking, relations (e.g., per:country_of_birth), and events (e.g., attack triggers with arguments), emphasizing real-world knowledge integration from diverse sources like news wires. More recent benchmarks include the REBEL dataset (introduced in 2021), a large-scale collection of over 200 relation types derived from Wikipedia and Wikidata with a later multilingual extension, enabling end-to-end relation extraction evaluation through sequence generation. Evaluation in IE faces challenges related to matching criteria and linguistic diversity. Strict matching requires exact boundary and type alignment between predicted and gold-standard extractions, which can penalize minor variations like tokenization differences, whereas lenient matching allows partial overlaps or type relaxations to better reflect practical utility. Cross-lingual evaluation adds complexity, as models must handle varying annotation schemas, entity normalization across scripts, and domain shifts between languages, often leading to degraded performance without multilingual pretraining. Benchmarks for IE have evolved with natural language understanding frameworks, including extensions of GLUE (2018) and SuperGLUE (2019) that incorporate IE-inspired tasks like entailment for relation validation and coreference for entity resolution.
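The entity-level metrics above can be computed with a short function; the gold and predicted spans are a hand-made example using strict matching on (type, start, end) tuples:
python
def precision_recall_f1(gold, predicted):
    # Strict matching: a prediction counts only if type and boundaries agree.
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Entities encoded as (type, start_token, end_token) spans.
gold = [("PER", 0, 2), ("LOC", 5, 6), ("ORG", 8, 10)]
pred = [("PER", 0, 2), ("LOC", 5, 7)]          # one correct, one boundary error
print(precision_recall_f1(gold, pred))          # (0.5, 0.333..., 0.4)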

Challenges and Future Directions

Current Limitations

Information extraction (IE) systems continue to face significant technical challenges in accurately processing complex linguistic phenomena. Handling negation and sarcasm remains particularly difficult, as models often fail to correctly identify the scope and impact of negating words or ironic expressions, leading to erroneous extractions such as inferring events that did not occur (e.g., extracting an "attack" event from "Protesters did not attack the police"). This issue is exacerbated in event extraction tasks, where negation errors introduce false positives that propagate to downstream applications like knowledge graph construction. Similarly, sarcasm detection, which requires discerning implicit contradictions between literal and intended meanings, poses computational hurdles due to its reliance on multimodal cues and contextual inference, often resulting in low precision for IE in social media or conversational texts. Performance degrades further in low-resource languages and domains, where limited annotated data restricts model training and generalization. Surveys highlight that IE tasks like named entity recognition and relation extraction suffer from data scarcity, leading to suboptimal handling of unseen entity classes and cultural nuances, with no single approach (e.g., transfer learning or prompting with large language models) universally outperforming others across tasks. Pipeline-based IE architectures compound these issues through error cascading, where inaccuracies in early stages (e.g., entity recognition) amplify in subsequent steps (e.g., relation classification), reducing overall F1 scores by 1-2 points compared to joint models on benchmarks like ACE 2005 and SciERC. Recent 2020s studies on open-domain event extraction report substantial error rates in challenging settings, underscoring persistent inaccuracies despite advances in neural methods. Scalability poses another barrier, particularly for processing web-scale volumes, where computational demands for preprocessing and extraction can require weeks of processing or machine-years of compute on terabyte-scale corpora, often necessitating trade-offs in accuracy or coverage. Extracting information from personal texts introduces privacy risks, as traditional methods are sensitive to attacks such as membership inference, complicating compliance with data-protection regulations while maintaining extraction utility in domains like healthcare. These constraints limit IE deployment on large, privacy-sensitive datasets without specialized safeguards. Ethical concerns arise from bias amplification inherent in training data, where inconsistent task definitions and underrepresented groups lead models to favor certain interpretations, perpetuating disparities in entity or relation extraction across domains. Such biases, drawn from skewed corpora, can exacerbate societal inequities when IE informs decision-making systems. Additionally, the potential misuse of IE in surveillance applications raises alarms, as automated extraction from communications enables mass monitoring without consent, infringing on privacy rights and enabling discriminatory profiling.

Future Directions

One prominent trend in information extraction (IE) involves leveraging large language models (LLMs) for zero-shot and few-shot extraction, enabling models to identify entities and relations from unstructured text without extensive task-specific training data.
This approach has demonstrated superior performance in domains like biomedical text, where LLMs such as GPT-4 achieve high accuracy in extracting clinical concepts by following natural language instructions, outperforming traditional supervised methods in low-resource scenarios. Similarly, in cancer-related clinical notes, zero-shot IE using LLMs like mCODEGPT extracts structured information with precision comparable to fine-tuned models, reducing the need for labeled datasets. Multimodal IE, which integrates text with audio, video, or visual elements, is advancing to handle diverse data sources beyond pure text. For instance, video-level multimodal relation extraction frameworks extract relational facts from video content by aligning textual descriptions with visual and audio cues, covering up to 32 relation types in datasets like Vid-MRE. In visually rich documents, such as articles with images and complex layouts, multimodal benchmarks like MatViX evaluate models that combine textual and visual processing to extract structured outputs. Federated learning addresses privacy concerns in IE by enabling collaborative model training across distributed datasets without sharing raw text, particularly in named entity recognition (NER) tasks. Federated incremental NER, for example, handles non-IID entity distributions across clients, boosting F1 scores by up to 3% in medical applications while preserving data locality. Innovations in causal IE enhance reasoning capabilities by identifying cause-effect relations in text, supporting downstream inference in NLP pipelines. LLMs facilitate causality extraction from medical texts by prompting for explicit cause-effect pairs, yielding F1 scores above 0.75 on benchmark datasets without domain adaptation. Integration with blockchain ensures verifiable facts in extracted knowledge graphs, where blockchain timestamps provide immutable provenance for IE outputs in secure data ecosystems, as seen in verifiable authorization schemes for personal data extraction. Ongoing research emphasizes explainable AI (XAI) in IE to make extraction decisions transparent, such as through attention mechanisms that highlight influential text spans in NER models. In NLP contexts, XAI techniques like evidence extraction reduce bias in tasks such as sentiment analysis. Cross-domain transfer learning further enables IE models to adapt from resource-rich to low-resource domains, with divide-and-transfer paradigms in NER transferring knowledge via shared representations, achieving high F1 scores (around 80%) in target domains. Recent 2024-2025 papers explore diffusion models for generative IE, where denoising processes generate structured outputs like entity-relation triples from noisy text inputs, offering robustness to input variations, as discussed in surveys of LLM-based generative approaches. Additionally, quantum-assisted pattern matching holds potential for accelerating IE in large-scale text processing, with quantum algorithms enabling faster approximate matching in genomics and cybersecurity texts, potentially offering quadratic speedups over classical methods.

References

  1. [1]
    Twenty-five years of information extraction | Natural Language ...
    Sep 20, 2019 · Information extraction is the process of converting unstructured text into a structured data base containing selected information from the ...
  2. [2]
    Large Language Models for Generative Information Extraction - arXiv
    Dec 29, 2023 · To conduct a comprehensive systematic review and exploration of LLM efforts for IE tasks, in this study, we survey the most recent advancements in this field.
  3. [3]
    A Survey on Open Information Extraction from Rule-based Model to ...
    This survey paper provides an overview of OpenIE technologies spanning from 2007 to 2024, emphasizing a chronological perspective absent in prior surveys.
  4. [4]
    [PDF] Information Extraction: Beyond Document Retrieval - ACL Anthology
    Information extraction (IE) extracts pre-specified information from texts, populating structured sources, unlike information retrieval (IR) which retrieves ...
  5. [5]
    [PDF] Information Extraction: Relations, Events, and Time
    Oct 26, 2006 · Relation extraction has close links to populating a relational database, and knowledge graphs, datasets of structured relational knowledge.
  6. [6]
    [PDF] Open Information Extraction Using Wikipedia - ACL Anthology
    Abstract. Information-extraction (IE) systems seek to distill semantic relations from natural-language text, but most systems use super- ...
  7. [7]
    [PDF] A Brief Survey of Text Mining: Classification, Clustering and ... - arXiv
    Jul 28, 2017 · Information Extraction from text (IE): Information Extraction is the task of automatically extracting information or facts from unstructured ...
  8. [8]
    [2402.06964] NLP for Knowledge Discovery and Information ... - arXiv
    Feb 10, 2024 · The NLP method enables machine understanding of textual data, offering an automated route to knowledge discovery and information extraction from energetics ...
  9. [9]
    [PDF] Information Extraction - Now Publishers
    Abstract. The automatic extraction of information from unstructured sources has opened up new avenues for querying, organizing, and analyzing data.
  10. [10]
    [PDF] Knowledge Base Population: Successful Approaches and Challenges
    The overall goal of KBP is to automatically identify salient and novel entities, link them to corresponding Knowledge Base (KB) entries (if the linkage ...
  11. [11]
    [PDF] Information Extraction from Visually Rich Documents using LLM ...
    Information extraction (IE) from Visually Rich Documents (VRDs) containing layout features along with text is a critical and well-studied ...
  12. [12]
    [PDF] Information Extraction - CIn UFPE
    We survey techniques for optimizing the various steps in an information extraction pipeline, adapting to dynamic data, integrating with existing entities and ...
  13. [13]
    Information extraction pipelines for knowledge graphs - PMC - NIH
    Plumber integrates 40 reusable components released by the research community for the subtasks entity linking, relation linking, text triple extraction (subject, ...
  14. [14]
    [PDF] FASTUS: A System for Extracting Information from Text
    One crucial innovation in the FASTUS system has been separating that process into the two steps of recognizing phrases and recognizing patterns. Phrases can be ...
  15. [15]
    Information extraction - ACM Digital Library
    An even earlier project—before 1970—for extracting useful information from text was directed by Naomi Sager of the Linguistic String Project group at New ...
  16. [16]
    [PDF] DESCRIPTION OF THE CIRCUS SYSTEM AS USED FOR MUC-3
    ... CIRCUS was originally designed to investigate the integration of connectionist and symbolic techniques for natural language processing. The original ...
  17. [17]
    Information Extraction - ACM Queue
    Dec 16, 2005 · HMMs became widely used in the 1990s for extraction from English prose. ... IJCAI Workshop on Learning Statistical Models from Relational Data.
  18. [18]
    Web-scale information extraction in knowitall - ACM Digital Library
    KnowItAll automates extracting large collections of facts from the web, associating a probability with each fact. It extracted 54,753 facts in four days.
  19. [19]
    A Comparison of the Events and Relations Across ACE, ERE, TAC ...
    The ACE or Automatic Content Extraction program succeeded MUC in 1999, and event extraction was introduced in ACE in 2004 (Aguilar et al, 2014) . Starting ...
  20. [20]
    A Survey of Information Extraction Based on Deep Learning - MDPI
    In this paper, we explain the basic concepts of IE and DL, primarily expounding on the research progress and achievements of DL technologies in the field of IE.
  21. [21]
    [PDF] MATVIX: Multimodal Information Extraction from Visually Rich Articles
    Apr 29, 2025 · Abstract. Multimodal information extraction (MIE) is crucial for scientific literature, where valuable data is often spread across text, ...
  22. [22]
    Clinical concept extraction using transformers - PMC - PubMed Central
    Oct 29, 2020 · We systematically explored 4 widely used transformer-based architectures, including BERT, RoBERTa, ALBERT, and ELECTRA, for extracting various types of ...
  23. [23]
    Ethical Considerations in Big Data Analytics | OxJournal
    Sep 16, 2024 · This paper aims to discuss the issues present in each of these areas, propose how ethical standards can be maintained, and highlight specific case studies and ...
  24. [24]
    [PDF] comprehensive overview of named entity recognition - arXiv
    Sep 25, 2023 · Its pivotal role lies in its capacity to disentangle structured information from unstructured text, thereby enhancing data retrieval, analysis, ...
  25. [25]
    [PDF] A survey of named entity recognition and classification - NYU
    In defining the task, people noticed that it is essential to recognize information units like names, including person, organization and location names, and ...
  26. [26]
    Nested Named Entity Recognition: A Survey - ACM Digital Library
    Normally, named entities have a complex nested structure; that is, a named entity can contain or embed other entities. Recognizing named entities with the ...
  27. [27]
    [PDF] Do CoNLL-2003 Named Entity Taggers Still Work Well in 2023?
    Jul 9, 2023 · The CoNLL-2003 English named entity recognition (NER) dataset has been widely used to train and evaluate NER models for almost 20 years.
  28. [28]
    Evolution of NER from BIO Tagging in 1990s to Modern Contextual Embeddings
  29. [29]
  30. [30]
    [PDF] Learning syntactic patterns for automatic hypernym discovery
    In this paper, we build an automatic classifier for the hypernym/hyponym relation. A noun X is a hyponym of a noun Y if X is a subtype or instance of Y ...
  31. [31]
    A survey on Relation Extraction - ScienceDirect.com
    The Entity Extraction task identifies entities from the text, and the Relation Extraction (RE) task can identify relationships between those entities. Many NLP ...
  32. [32]
    TAC Relation Extraction Dataset - Stanford NLP Group
    TACRED is a large-scale relation extraction dataset with 106,264 examples, including subject/object spans, mention types, and relations, or no_relation.
  33. [33]
    [PDF] Identifying Relations for Open Information Extraction - ACL Anthology
    ReVerb, in particular, relies on its explicit lexical and syntactic constraints, which have no correlate in SRL systems. For a more detailed comparison of SRL ...
  34. [34]
    [PDF] Expanding the Recall of Relation Extraction by Bootstrapping
    Generic patterns employed in KnowItAll achieve unsupervised, high-precision extraction, but often result in low recall. This paper compares two boot- ...
  35. [35]
    [PDF] A Shortest Path Dependency Kernel for Relation Extraction
    [Non-local Dependencies] Long-distance dependencies arise due to various linguistic constructions such as coordination, extraction, raising and control. In ...
  36. [36]
    [PDF] an overview of event extraction and its applications - arXiv
    Nov 5, 2021 · Event Extraction (EE) automatically extracts events from human language, aiming to discover event triggers with specific types and their ...
  37. [37]
    Extracting Events and Their Relations from Texts: A Survey on ...
    Event extraction involves identifying events and their arguments, while event relation extraction identifies relationships between events, such as ...
  38. [38]
    [PDF] Generative Approaches to Event Extraction: Survey and Outlook
    Nov 15, 2024 · The subtasks are: Trigger extraction & classification (ED), Argument extraction, and Joint trigger and argument extraction. A model can be used ...
  39. [39]
    Streamlining event extraction with a simplified annotation framework
    Apr 28, 2024 · In this study, we propose an efficient open-domain event annotation framework tailored for subsequent information extraction, with a specific focus on its ...
  40. [40]
    [PDF] Document-Level Event Argument Extraction by Conditional Generation
    Apr 15, 2021 · We present the first document-level event extraction benchmark dataset with complete event and coreference annotation. We also introduce the ...
  41. [41]
    Frame-Semantic Parsing | Computational Linguistics | MIT Press
    In this article, we present a computational and statistical model for frame-semantic parsing, the problem of extracting from text semantic predicate-argument ...
  42. [42]
    [PDF] The Proposition Bank: An Annotated Corpus of Semantic Roles
    The Proposition Bank project takes a practical approach to semantic representation, adding a layer of predicate-argument information, or semantic role ...
  43. [43]
    PASBio: predicate-argument structures for event extraction in ...
    Oct 19, 2004 · In PropBank a verb may get more than one PAS frame if the verb sense and its argument set differ, reflecting the fundamental assumption that ...
  44. [44]
    [PDF] Complex Event Recognition with Allen Relations - KR Proceedings
    Complex event recognition (CER) systems process symbolic events online to report complex event patterns, using Allen's interval algebra for more accurate ...
  45. [45]
    (PDF) FASTUS: A Cascaded Finite-State Transducer for Extracting ...
    FASTUS is a system for extracting information from natural language text for entry into a database and for other applications.
  46. [46]
    FASTUS: A Cascaded Finite-State Transducer for Extracting ... - arXiv
    May 20, 1997 · Abstract: FASTUS is a system for extracting information from natural language text for entry into a database and for other applications.
  47. [47]
    Information extraction (rule-based information retrieval) - NCBI - NIH
    The use of gazetteers allows us to improve the ability of the system to specifically recognise the topics of interest for a given survey and tweak what is being ...
  48. [48]
    (PDF) JAPE: a Java Annotation Patterns Engine - ResearchGate
    In this context, we present two approaches to information extraction based on Natural Language Processing (NLP). The first method uses JAPE rules and ...
  49. [49]
    [PDF] Introduction to Information Extraction Technology - Courses
    There is another approach to the development of rule-based IE systems. We call it the atomic approach; the method here is to start with high recall and ...
  50. [50]
    Rule-based Information Extraction: Advantages, Limitations, and ...
    This paper summarizes the advantages and limitations of rule-based IE and discusses its role for legal information retrieval. In addition, several different ...
  51. [51]
    [PDF] Rule-Based Information Extraction is Dead! Long ... - ACL Anthology
    Oct 18, 2013 · Even in their current form, with ad-hoc solutions built on techniques from the early 1980's, rule-based systems serve the industry needs ...
  52. [52]
    [PDF] Probabilistic Models for Segmenting and Labeling Sequence Data
    This paper introduces conditional random fields (CRFs), a sequence modeling framework that has all the advantages of MEMMs but also solves the label bias ...
  53. [53]
    [PDF] Snowball: Extracting Relations from Large Plain-Text Collections
    Snowball extracts structured data from plain text using training examples to generate patterns and extract tuples, with minimal human participation.
  54. [54]
    The BigGrams: the semi-supervised information extraction system ...
    Aug 20, 2017 · The aim of this study is to propose an information extraction system, called BigGrams, which is able to retrieve relevant and structural ...
  55. [55]
    [2205.11725] A Survey on Neural Open Information Extraction - arXiv
    May 24, 2022 · In this survey, we provide an extensive overview of the state-of-the-art neural OpenIE models, their key design decisions, strengths and weaknesses.
  56. [56]
    Graph Neural Networks with Generated Parameters for Relation ...
    In this paper, we propose a novel graph neural network with generated parameters (GP-GNNs). The parameters in the propagation module, ie the transition ...
  57. [57]
    Structured information extraction from scientific text with large ...
    Feb 15, 2024 · Abstract. Extracting structured knowledge from scientific text remains a challenging task for machine learning models.
  58. [58]
  59. [59]
    Wrapper induction for information extraction - ACM Digital Library
  60. [60]
    [PDF] Extracting Structured Information from Wikipedia Articles to Populate ...
    Abstract. Roughly every third Wikipedia article contains an infobox – a table that displays important facts about the subject in attribute-value ...
  61. [61]
    Wrapper induction: Efficiency and expressiveness - ScienceDirect.com
    Wrapper induction is a technique for automatically constructing wrappers, as an alternative to writing them manually.
  62. [62]
    [PDF] Information Extraction from Wikipedia: Moving Down the Long Tail
    Aug 27, 2008 · In this paper we describe three novel approaches for improving the recall of extraction of Wikipedia infobox attribute values. • By applying ...
  63. [63]
    [PDF] DOM-based Content Extraction of HTML Documents - Columbia CS
    By parsing a webpage's HTML into a DOM tree, we can not only extract information from large logical units similar to Buyukkokten's “Semantic Textual Units ...
  64. [64]
    Web content information extraction based on DOM tree and ...
    Their method is based on the DOM structure to divide one web page into several blocks and extract content blocks with statistical information instead of machine ...
  65. [65]
    [PDF] Eliminating Noisy Information in Web Pages for Data Mining
    In this paper, we propose a noise elimination technique based on the following observation: In a given Web site, noisy blocks usually share some common contents ...
  66. [66]
    Effectiveness of template detection on noise reduction and websites ...
    In this paper, we introduce Noise Detector (ND) as an effective approach for detecting and removing templates from Web pages. ND segments Web pages into ...
  67. [67]
    [PDF] ACL-IJCNLP 2015 ACL 2015 Workshop on Noisy User-generated Text
    Jul 31, 2015 · The WNUT 2015 workshop focuses on a core set of natural language processing tasks on top of noisy user-generated text, such as that found on ...
  68. [68]
    How Google's Knowledge Graph works
    Google's search results sometimes show information that comes from our Knowledge Graph, our database of billions of facts about people, places, and things.
  69. [69]
    Real-Time Focused Extraction of Social Media Users - ResearchGate
    In this paper, we explore a real-time automation challenge: the problem of focused extraction of Social Media users. This challenge can be seen as a special ...
  70. [70]
    University of Turku in the BioNLP'11 Shared Task - PubMed
    Jun 26, 2012 · We present a system for extracting biomedical events (detailed descriptions of biomolecular interactions) from research articles, developed for the BioNLP'11 ...
  71. [71]
    BioNLP Shared Task - The Bacteria Track - PMC - PubMed Central
    The Bacteria Gene Interaction is a gene/protein interaction extraction task from individual sentences. The interactions have been categorized into ten different ...
  72. [72]
    Extracting key insights from earnings call transcript via information ...
    In this work, we introduce ECT-SKIE, an information-theoretic, self-supervised approach for extracting key insights from earnings call transcripts.
  73. [73]
    (PDF) Temporal Evolution of Sentiment in Earnings Calls and Its ...
    This study investigates the temporal evolution patterns of sentiment in earnings call transcripts and their relationship with subsequent financial performance.
  74. [74]
    CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review
    Mar 10, 2021 · The task is to highlight salient portions of a contract that are important for a human to review. We find that Transformer models have nascent ...
  75. [75]
    [PDF] Exploring Implicit Relations in Contracts for Contract Clause Extraction
    We study automatic Contract Clause Extraction (CCE) by modeling implicit relations in legal contracts. Existing CCE methods mostly treat contracts as plain ...
  76. [76]
    The Unified Medical Language System (UMLS) - PubMed Central
    In this paper, we present a different approach to information integration through terminology integration: the Unified Medical Language System® (UMLS ...
  77. [77]
    Exploiting the UMLS Metathesaurus for extracting and categorizing ...
    Sep 8, 2015 · A method to exploit the UMLS Metathesaurus for extracting and categorizing concepts found in clinical text representing signs and symptoms to anatomically ...
  78. [78]
    [PDF] Information Extraction In Medical Domain - CFILT
    Jun 12, 2015 · Specific domains such as legal, technology, medical etc are harder due to abundance of domain specific jargon. We focus on the challenge of ...
  79. [79]
    Twenty-Five Years of Evolution and Hurdles in Electronic Health ...
    The 2010s witnessed widespread adoption of EHRs, driven by government incentives and the recognition of their potential to improve health care quality, safety, ...
  80. [80]
    Clinical Information Extraction Applications: A Literature Review - PMC
    A rapid growth of data can be observed since 1995, and the amount of data utilized in those studies reached a peak in 2009. A large quantity of EHR data became ...
  81. [81]
  82. [82]
    Improving Knowledge Base Construction from Robust Infobox ...
    One important approach to constructing a comprehensive knowledge base is to extract information from Wikipedia infobox tables to populate an existing KB.
  83. [83]
    Populating Web-Scale Knowledge Graphs Using Distantly ... - MDPI
    In this paper, we propose a fully automated system to extend knowledge graphs using external information from web-scale corpora.
  84. [84]
    [2506.18027] PDF Retrieval Augmented Question Answering - arXiv
    Jun 22, 2025 · This paper presents an advancement in Question-Answering (QA) systems using a Retrieval Augmented Generation (RAG) framework to enhance information extraction ...
  85. [85]
    Retrieval augmented generation for large language models in ... - NIH
    Briefly, the RAG process involves retrieving relevant information from the knowledge source, and then using the relevant information to generate a response to ...
  86. [86]
    [PDF] ChatEL: Entity Linking with Chatbots - ACL Anthology
    May 20, 2024 · In the present work, we bring the power of LLMs to the information extraction task, especially entity disambiguation. The proposed ChatEL ...
  87. [87]
    [PDF] Challenges and Advances in Information Extraction From Scientific ...
    In this paper, we have examined the major factors obstructing practical applications of computer-aided information extraction on scientific corpora, including ...
  88. [88]
    [PDF] An Empirical Study of Pipeline vs. Joint approaches to Entity and ...
    Nov 20, 2022 · It is generally believed that the pipeline approach suffers from the problem of error propagation, while the joint approach could leverage ...
  89. [89]
    [PDF] investigating error propagation in an NLP pipeline - CEUR-WS
    These tasks yet depend on other low-level analysis. This paper shows where errors come from and whether they are propagated through the different layers. We.
  90. [90]
    a neuro-symbolic AI system for enhancing accuracy of named entity ...
    Nov 1, 2024 · A hybrid AI framework that integrates neurosymbolic methods with named entity recognition (NER) and entity linking (EL) to transform unstructured clinical ...
  91. [91]
    A scoping review of large language model based approaches ... - NIH
    Aug 24, 2024 · This paper provides a summary of the current state of research on Large Language Model (LLM) based approaches for information extraction (IE) from radiology ...
  92. [92]
    Information extraction from historical well records using a large ...
    Dec 30, 2024 · In this paper, we present an information extraction workflow based on open-source Llama 2 models and test it on a dataset of 160 well documents.
  93. [93]
    Google Launches Cloud Natural Language API - InfoQ
    Aug 29, 2016 · Google released their beta Cloud Natural Language API on July 20, joining the movement to make advances in natural language processing (NLP).
  94. [94]
    Cloud Natural Language
  95. [95]
    Amazon Comprehend
  96. [96]
    IBM Watson Discovery
  97. [97]
    Document AI | Snowflake Documentation
    Document AI is a Snowflake AI feature that uses Arctic-TILT, a proprietary large language model (LLM), to extract data from documents.
  98. [98]
    October 21, 2024 — Document AI — General Availability
    Oct 21, 2024 · Document AI enables setting up intelligent document processing (IDP) workflows within Snowflake by extracting information from documents.
  99. [99]
    [PDF] A Comprehensive Survey on Document-Level Information Extraction
    Nov 15, 2024 · This paper embarks on a comprehensive review and discussion of contemporary literature related to doc-IE. In addition, we conduct a thorough ...
  100. [100]
    KIEval: Evaluation Metric for Document Key Information Extraction
    Entity-level F1 score is one of the most commonly used metrics for Document KIE model evaluation. Upon extracting entity-wise key-value pairs, they are matched ...
  101. [101]
    [PDF] Introduction to the CoNLL-2003 Shared Task
    The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on four types of named entities: persons, locations, ...
  102. [102]
    [PDF] TAC KBP 2014 Event Argument Extraction Assessment
    May 1, 2014 · TAC KBP 2014 Event Argument Extraction ... The distinction between a stative relation and the event that relation is a consequence.
  103. [103]
    TAC Relation Extraction Dataset - Linguistic Data Consortium
    Dec 15, 2018 · TAC Relation Extraction Dataset (TACRED) was developed by The Stanford NLP Group and is a large-scale relation extraction dataset with ...
  104. [104]
    REBEL: Relation Extraction By End-to-end Language generation
    We present REBEL, a seq2seq model based on BART that performs end-to-end relation extraction for more than 200 different relation types.
  105. [105]
    [PDF] Evaluating Data Augmentation for Medication Identification in ...
    For strict matching, the offsets of a span were required to match exactly. For lenient matching, it was sufficient for spans to overlap. Results for our ...
  106. [106]
    [PDF] Cross-lingual Information Extraction for the Assessment and ...
    Mar 20, 2024 · This thesis focuses on cross- and multi-lingual detection of adverse drug reactions in biomedical texts written by non-experts, using a tri- ...
  107. [107]
    [PDF] SuperGLUE - arXiv
    Feb 13, 2020 · GLUE is a collection of nine language understanding tasks built on existing public datasets, together with private test data, an evaluation ...
  108. [108]
    Evaluation of Large Language Model Performance in Assessing ...
    Oct 24, 2025 · This human-in-the-loop design reduces the need for perfect LLM performance and emphasizes usability and efficiency over automation alone ...
  109. [109]
    [PDF] Modality and Negation in Event Extraction - ACL Anthology
    Aug 5, 2021 · NLP systems struggle with these semantic phenomena, often incorrectly extracting events which did not happen, which can lead to issues in ...
  110. [110]
    [2202.08063] Information Extraction in Low-Resource Scenarios
    Feb 16, 2022 · This survey aims to foster understanding of this field, inspire new ideas, and encourage widespread applications in both academia and industry.
  111. [111]
    Scaling Information Extraction to Large Document Collections
    We review key approaches for scaling up information extraction, including using general-purpose search engines as well as indexing techniques specialized for ...
  112. [112]
    Privacy-Preserving Information Extraction for Ethical Case Studies in ...
    Traditional information extraction methods often expose private data to risks such as membership inference and reconstruction attacks, compromising ...
  113. [113]
  114. [114]
    Surveillance Ethics | Internet Encyclopedia of Philosophy
    The bulk of this article focuses on considering the ethical challenges posed by surveillance. These include why surveillance is undertaken and by whom, as well ...
  115. [115]
    Harnessing large language models' zero-shot and few-shot learning ...
    Aug 23, 2024 · We demonstrated that some of these models can use few-shot or even zero-shot learning to achieve superior performance over our previous model, ...
  116. [116]
    Introducing mCODEGPT as a zero-shot information extraction from ...
    Oct 15, 2025 · This study introduces a tool based on the Large Language Model (LLM) for zero-shot information extraction from cancer-related clinical notes ...
  117. [117]
    Video-Level Multimodal Relation Extraction with Event-Entity ...
    Oct 27, 2025 · To advance this research, we present Vid-MRE, a new dataset containing 32 relation types and 12,402 multimodal relational facts, annotated ...
  118. [118]
    [PDF] Federated Incremental Named Entity Recognition - ACL Anthology
    Jan 19, 2025 · In the FINER setting, entity types are non-independent and identically distributed (Non-. IID) across different clients, and training data of.
  119. [119]
    Causality Extraction from Medical Text Using Large Language ...
    This study explores the potential of natural language models, including large language models, to extract causal relations from medical texts.
  120. [120]
    A Secure and Verifiable Blockchain-Based Framework for Personal ...
    Sep 23, 2024 · We propose a Verifiable Authorization Information Management Scheme (VAIMS). During the authorization process, the authorization information and personal data ...
  121. [121]
    Editorial: Explainable AI in Natural Language Processing - Frontiers
    Aug 15, 2024 · These methods help reduce bias and improve the completeness of evidence extraction, making the models more trustworthy and their decisions more ...
  122. [122]
    Cross-Domain NER under a Divide-and-Transfer Paradigm
    May 13, 2024 · Cross-domain Named Entity Recognition (NER) transfers knowledge learned from a rich-resource source domain to improve the learning in a low-resource target ...
  123. [123]
    [PDF] A Survey of Generative Information Extraction - ACL Anthology
    Jan 19, 2025 · In this survey, we first review generative information extraction (IE) methods based on pre-trained language models (PLMs) and large ...
  124. [124]
    Quantum Algorithms for Faster Pattern Matching in Genomics and ...
    Oct 22, 2024 · Researchers present quantum algorithms to improve the efficiency of approximate pattern matching for fields like genomics and cybersecurity.Missing: assisted | Show results with:assisted