
Knowledge extraction

Knowledge extraction is the process of deriving structured knowledge, such as entities, concepts, and relations, from unstructured or semi-structured sources like text, documents, and web content, often by linking the extracted information to knowledge bases using ontologies and formats like RDF or OWL. This technique bridges the gap between raw data and machine-readable representations, enabling the automatic population of semantic knowledge graphs and supporting downstream applications such as question answering and semantic search.

At its core, knowledge extraction encompasses several key tasks, including named entity recognition (NER) to identify entities like persons or organizations, entity linking to associate them with knowledge base entries, relation extraction to uncover connections between them, and concept extraction to derive topics or hierarchies from text. These tasks are typically performed on unstructured sources, such as news articles or web pages, to transform implicit information into explicit triples (e.g., subject-predicate-object) that can be queried and reasoned over. Methods range from rule-based approaches using hand-crafted patterns and linguistic rules to machine learning techniques like conditional random fields (CRFs) and neural networks, with recent advances incorporating transformer-based language models for joint entity and relation extraction. Hybrid methods combining supervised, unsupervised, and distant supervision further enhance accuracy by leveraging existing knowledge bases like DBpedia for training.

The importance of knowledge extraction lies in its role in automating the creation of large-scale knowledge bases, facilitating semantic search, data integration, and decision support across domains like healthcare, finance, and e-commerce. By addressing challenges such as data heterogeneity and ambiguity, it supports the Semantic Web's vision of interconnected, machine-readable data, with tools like DBpedia Spotlight and Stanford CoreNLP enabling practical implementations. Ongoing research focuses on open information extraction to handle domain-independent relations and on improving robustness against noisy web-scale data.

Introduction

Definition and Scope

Knowledge extraction is the process of identifying, retrieving, and structuring implicit or explicit knowledge from diverse data sources to produce usable, machine-readable representations such as knowledge graphs or ontologies. This involves transforming raw content into semantically meaningful forms that capture entities, relationships, and facts, facilitating advanced reasoning and application integration. The key objectives of knowledge extraction include automating the acquisition of domain-specific knowledge from vast datasets, thereby reducing manual effort; enabling interoperability by standardizing representations across heterogeneous systems; and supporting informed decision-making in intelligent systems through enhanced contextual understanding and inference capabilities. These goals address the challenges of scaling knowledge representation in data-intensive environments, such as enabling AI models to leverage structured insights for tasks like search and recommendation. While data mining focuses on discovering patterns and associations in data, knowledge extraction often emphasizes the creation of structured, semantically rich representations suitable for logical inference and reasoning. It also differs from information retrieval, which focuses on identifying and ranking relevant documents or data snippets in response to user queries based on similarity measures, typically returning unstructured or semi-structured results. Knowledge extraction, however, actively parses and organizes content to generate structured outputs like entity-relation triples, moving beyond mere retrieval to knowledge construction. The scope of knowledge extraction spans structured sources like databases, semi-structured formats such as XML or JSON, and unstructured sources including text corpora and web pages, aiming to bridge the gap between raw data and actionable knowledge. It excludes foundational data preprocessing steps like cleaning or normalization, as well as passive storage mechanisms, concentrating instead on the interpretive and representational treatment of content.

Historical Development

The roots of knowledge extraction trace back to the 1970s and 1980s, when artificial intelligence research emphasized expert systems that required manual knowledge acquisition from domain experts to encode rules and facts into computable forms. A seminal example is the MYCIN system, developed at Stanford University in 1976, which used backward-chaining inference to diagnose bacterial infections and recommend antibiotics based on approximately 450 production rules derived from medical expertise. This era highlighted the "knowledge acquisition bottleneck," where acquiring and structuring human expertise proved labor-intensive, laying foundational concepts for later automated extraction techniques from diverse data sources.

The 1990s marked a pivotal shift toward automated information extraction from text, driven by the need to process unstructured data at scale. The Message Understanding Conferences (MUC), initiated in the late 1980s under DARPA sponsorship, standardized evaluation benchmarks for extracting entities, relations, and events from news articles, focusing initially on terrorist incidents in Latin America. MUC-3 in 1991 introduced template-filling tasks with metrics like precision and recall, fostering rule-based and early statistical approaches that achieved modest performance, such as 50-60% F1 scores on coreference resolution. These conferences evolved through MUC-7 in 1998, influencing the broader field by emphasizing scalable extraction pipelines.

In the 2000s, the semantic web paradigm propelled knowledge extraction toward structured, interoperable representations, with the World Wide Web Consortium (W3C) standardizing RDF in 1999 and OWL in 2004 to enable ontology-based knowledge modeling and inference. The Semantic Web Challenge, launched in 2003 alongside the International Semantic Web Conference, encouraged innovative applications integrating extracted knowledge, such as querying distributed RDF data for tourism recommendations. A landmark milestone was the DBpedia project in 2007, which automatically extracted RDF triples describing roughly two million entities from Wikipedia infoboxes, creating the first large-scale, multilingual knowledge base accessible via SPARQL queries and serving as a hub for linked open data.

The 2010s saw knowledge extraction integrate with big data ecosystems and advanced machine learning, culminating in the widespread adoption of knowledge graphs for search and recommendation systems. Google's Knowledge Graph, announced in 2012, incorporated billions of facts from sources like Freebase and Wikipedia to disambiguate queries and provide entity-based answers, improving search relevance by connecting over 500 million objects and 3.5 billion facts. This era emphasized hybrid extraction methods combining rule-based parsing with statistical models, scaling to web-scale data.

Post-2020, the generative AI boom, particularly large language models (LLMs), has revolutionized extraction by enabling zero-shot entity and relation identification from unstructured text, with surveys highlighting LLM-empowered knowledge graph construction that reduces manual annotation needs and enhances factual accuracy in domains like biomedicine. For instance, in biomedical question answering, a retrieval-augmented method using LLMs improved F1 score by 20% and answer generation accuracy by 25% over baselines, bridging symbolic foundations with generative models for dynamic knowledge updates.

Extraction from Structured Sources

Relational Databases to Semantic Models

Knowledge extraction from relational databases to semantic models involves transforming structured tabular data into RDF triples or knowledge graphs, enabling semantic querying and integration with ontologies. This process typically employs direct mapping techniques that convert database schemas and instances into RDF representations without extensive restructuring. In a basic 1:1 transformation, each row in a relational table is mapped to an RDF instance (subject), while columns define properties (predicates) linked to cell values as objects. Direct mapping approaches, such as those defined in the W3C's RDB Direct Mapping specification, automate this conversion by treating tables as classes and attributes as predicates, generating RDF from the database schema and content on-the-fly. For instance, a table named "Customers" with columns "ID", "Name", and "Email" would produce triples where each customer row becomes a subject URI like <http://example.com/customer/{ID}>, with predicates such as ex:name and ex:email pointing to the respective values. These mappings preserve the relational structure while exposing it semantically, facilitating integration with ontologies.

Schema alignment addresses relationships across tables, particularly foreign keys, which are interpreted as RDF links between instances. Tools like D2RQ enable virtual mappings by defining correspondences between relational schemas and RDF vocabularies, rewriting SPARQL queries to SQL without data replication. Similarly, the R2RML standard supports customized triples maps with referencing object maps to join tables via foreign keys, using conditions like rr:joinCondition to link child and parent columns. This allows, for example, an "Orders" table foreign key to "Customers.ID" to generate triples connecting order instances to customer subjects.

Challenges in this conversion include handling normalization, where denormalized views may be needed to avoid fragmented RDF graphs from vertically partitioned relations, and datatype mismatches, such as converting SQL DATE to RDF xsd:date or xsd:dateTime via explicit mappings. Solutions involve declarative rules in R2RML to override defaults, ensuring literals match the intended datatypes, and tools like D2RQ's generate-mapping utility to produce initial alignments that can be refined manually. Normalization issues are mitigated by creating R2RML views that denormalize data through SQL joins before RDF generation.

A representative example is mapping a customer-order database. Consider two tables: "Customers" (columns: CustID [INTEGER PRIMARY KEY], Name [VARCHAR], Email [VARCHAR]) and "Orders" (columns: OrderID [INTEGER PRIMARY KEY], CustID [INTEGER FOREIGN KEY], Product [VARCHAR], Amount [DECIMAL]). Step-by-step mapping rules using R2RML:
  1. Triples Map for Customers: Define a logical table as rr:tableName "Customers". Set subject map: rr:template "http://example.com/customer/{CustID}", rr:class ex:Customer. Add predicate-object maps: rr:predicate ex:name, rr:objectMap [ rr:column "Name" ]; and rr:predicate ex:email, rr:objectMap [ rr:column "Email"; rr:datatype xsd:string ]. This generates triples like <http://example.com/customer/101> rdf:type ex:Customer . <http://example.com/customer/101> ex:name "Alice Smith" . <http://example.com/customer/101> ex:email "alice@example.com" .
  2. Triples Map for Orders: Logical table: rr:tableName "Orders". Subject map: rr:template "http://example.com/order/{OrderID}", rr:class ex:Order. Predicate-object maps: rr:predicate ex:product, rr:objectMap [ rr:column "Product" ]; rr:predicate ex:amount, rr:objectMap [ rr:column "Amount"; rr:datatype xsd:decimal ]. Include a referencing object map for the foreign key: rr:predicate ex:customer, rr:objectMap [ rr:parentTriplesMap <#CustomerMap>; rr:joinCondition [ rr:child "CustID"; rr:parent "CustID" ] ]. For a row with OrderID=201, CustID=101, Product="Laptop", Amount=999.99, this yields <http://example.com/order/201> rdf:type ex:Order . <http://example.com/order/201> ex:product "Laptop" . <http://example.com/order/201> ex:amount "999.99"^^xsd:decimal . <http://example.com/order/201> ex:customer <http://example.com/customer/101> .
Using D2RQ, an initial mapping file could be generated from the database schema with the generate-mapping utility, then customized in N3 syntax to align with the same RDF vocabulary, allowing SPARQL access to the virtual graph. This approach ensures the semantic model captures relational integrity while enabling semantic queries over the extracted knowledge.
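
For readers who prefer to see the mapping operationally, the following Python sketch (assuming the rdflib library; the example.com namespace and sample rows mirror the tables above) materializes equivalent triples directly from in-memory rows rather than through an R2RML processor:

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.com/")  # vocabulary namespace from the example above

# Sample rows standing in for the Customers and Orders tables.
customers = [{"CustID": 101, "Name": "Alice Smith", "Email": "alice@example.com"}]
orders = [{"OrderID": 201, "CustID": 101, "Product": "Laptop", "Amount": "999.99"}]

g = Graph()
g.bind("ex", EX)

# Direct mapping: one subject IRI per row, one predicate per column.
for row in customers:
    subj = URIRef(f"http://example.com/customer/{row['CustID']}")
    g.add((subj, RDF.type, EX.Customer))
    g.add((subj, EX.name, Literal(row["Name"])))
    g.add((subj, EX.email, Literal(row["Email"], datatype=XSD.string)))

for row in orders:
    subj = URIRef(f"http://example.com/order/{row['OrderID']}")
    g.add((subj, RDF.type, EX.Order))
    g.add((subj, EX.product, Literal(row["Product"])))
    g.add((subj, EX.amount, Literal(row["Amount"], datatype=XSD.decimal)))
    # The foreign key becomes an object property linking order to customer.
    g.add((subj, EX.customer, URIRef(f"http://example.com/customer/{row['CustID']}")))

print(g.serialize(format="turtle"))
```

A full R2RML or D2RQ deployment would instead evaluate the declarative mapping against the live database, but the triples produced have the same shape as those listed in the steps above.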

XML and Other Markup Languages

Knowledge extraction from XML documents leverages the hierarchical, tag-based structure of markup languages to identify and transform data into structured representations, such as semantic models like RDF. XML, designed for encoding documents with explicit tags denoting content meaning, facilitates precise querying and mapping of elements to knowledge entities, enabling the conversion of raw markup into ontologies or triple stores. This process is particularly effective for sources like configuration files, data exchanges, and publishing formats where explicit markup guides interpretation.

XML parsing techniques form the foundation of extraction, utilizing standards like XPath for navigating document trees, XQuery for declarative querying, and XSLT for stylesheet-based transformations. XPath allows selection of nodes via path expressions, such as /product/category[name='electronics']/item, to isolate relevant elements for knowledge representation. XQuery extends this by supporting functional queries that aggregate and filter data, often outputting results in formats amenable to semantic processing. For instance, XQuery can join multiple XML documents and project attributes into new result structures, streamlining the extraction of relationships like product hierarchies. XSLT, in turn, applies rules to transform XML into RDF, using templates to map tags to predicates and attributes to objects; a seminal approach embeds such mappings within XSLT stylesheets to generate triples dynamically, as demonstrated in streaming transformations for large-scale data. These tools ensure efficient, schema-aware parsing without full document loading, crucial for knowledge extraction pipelines.

Schema-driven extraction enhances accuracy by inferring ontologies from XML Schema Definition (XSD) files, which define types, constraints, and hierarchies. XSD complex types can be mapped to ontology classes, with attributes becoming properties and nesting indicating subclass relations; for example, an XSD <product> with sub-elements like <price> and <description> infers a Product class with price and description predicates. Automated tools mine these schemas to generate ontologies, preserving cardinality and data types while resolving ambiguities through pattern recognition. This method has been formalized in approaches that construct deep semantics, such as identifying subclass relations via extension/restriction in XSD, yielding reusable knowledge bases from schema repositories. By grounding extraction in XSD, the process minimizes manual annotation and supports validation during transformation.

XML builds on predecessors like SGML, the Standard Generalized Markup Language, which introduced generalized tagging for document interchange in the 1980s, influencing XML's design for portability and extensibility. Modern publishing formats, such as DocBook—an XML vocabulary for technical documentation—extend this legacy by embedding semantic markup that aids extraction; for instance, DocBook's <book> and <chapter> elements can be transformed via XSLT to RDF, capturing structural knowledge like authorship and sections for knowledge graphs. These evolutions emphasize markup's role in facilitating machine-readable knowledge exchange.

A representative use case involves extracting product catalogs from XML feeds, common in platforms like Amazon's data feeds, into knowledge bases. In one implementation, XPath queries target elements such as <item><name> and <price>, while XSLT maps them to RDF triples (e.g., ex:product rdf:type ex:Item; ex:hasName "Laptop"; ex:hasPrice 999), integrating with SPARQL endpoints for querying.
This approach, tested on feeds with thousands of entries, achieves high precision in entity resolution and relation extraction, enabling applications like recommendation systems; GRDDL profiles further standardize such transformations by associating XSLT transformation scripts with XML documents via namespace or profile declarations, as used in such catalog-extraction scenarios.
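
As an illustration of the XPath-plus-mapping pattern described above, the following Python sketch parses a small assumed product feed with the standard library's ElementTree module and emits RDF triples with rdflib; the ex: vocabulary and element names are illustrative rather than any platform's actual feed format:

```python
import xml.etree.ElementTree as ET

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/catalog#")  # assumed vocabulary

xml_feed = """
<catalog>
  <item id="laptop-01"><name>Laptop</name><price>999</price></item>
  <item id="mouse-02"><name>Mouse</name><price>25</price></item>
</catalog>
"""

root = ET.fromstring(xml_feed)
g = Graph()
g.bind("ex", EX)

# XPath-style selection of <item> elements, then mapping tags to predicates.
for item in root.findall("./item"):
    subj = URIRef(EX + item.get("id"))
    g.add((subj, RDF.type, EX.Item))
    g.add((subj, EX.hasName, Literal(item.findtext("name"))))
    g.add((subj, EX.hasPrice, Literal(item.findtext("price"), datatype=XSD.decimal)))

print(g.serialize(format="turtle"))
```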

Tools and Direct Mapping Techniques

One prominent tool for direct mapping relational databases to RDF is the D2RQ Platform, developed since 2004 at Freie Universität Berlin. D2RQ enables access to relational databases as virtual, read-only RDF graphs by using a declarative mapping language that relates database schemas to RDF vocabularies or ontologies. This approach allows for on-the-fly translation of SPARQL queries into SQL without materializing the RDF data, facilitating integration of legacy databases into Semantic Web applications. Building on such efforts, the W3C standardized R2RML (RDB to RDF Mapping Language) in September 2012 as a recommendation for expressing customized mappings from relational databases to RDF datasets. R2RML defines mappings through triples maps, which associate logical tables (such as SQL queries or base tables) with RDF triples, enabling tailored views of the data while preserving relational integrity. Unlike earlier tools, R2RML's standardization promotes interoperability across processors, with implementations supporting both virtual and materialized RDF views.

At the core of these tools are rule-based mappers that generate RDF terms deterministically from database rows. For instance, subject maps and predicate-object maps in R2RML use template maps to construct IRIs for entities, such as http://example.com/Person/{id} where {id} is a placeholder for a column value like a primary key. Similarly, D2RQ employs property bridges and class maps to define IRI patterns based on column values, ensuring that entities and relations are linked without custom scripting. These rules are compiled into SQL views at runtime, translating SPARQL graph patterns into efficient relational queries.

Performance in these systems often revolves around query federation through SPARQL endpoints, as provided by the D2R Server component of D2RQ. Simple queries can achieve performance comparable to hand-optimized SQL, but overhead increases with joins or filters, potentially leading to inefficient SQL generation due to the mapping's declarative nature. R2RML processors similarly expose SPARQL endpoints for federated queries, though optimization relies on database indexes and primary keys to mitigate translation overhead.

Direct mapping techniques, however, have limitations when applied to non-ideal schemas, such as denormalized tables where redundancy violates normalization principles. In these cases, automated IRI generation may produce duplicate entities or incorrect relations, as the mapping assumes one-to-one correspondences that do not hold in denormalized schemas. For example, a denormalized table repeating customer details across orders could yield multiple identical RDF subjects, necessitating advanced mapping customization or preprocessing to maintain semantic accuracy. Such shortcomings often require shifting to more sophisticated methods for complex schemas.
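
The query-rewriting idea can be illustrated with a deliberately simplified sketch: a single triple pattern over the virtual graph is translated into SQL using a hand-written mapping table. This is a toy approximation under assumed table and predicate names, not the actual D2RQ or R2RML engine logic:

```python
# Toy illustration of mapping-driven query rewriting: a triple pattern over the
# virtual RDF graph is translated into SQL against the underlying table, in the
# spirit of (but much simpler than) D2RQ/R2RML processors.

MAPPING = {
    # predicate IRI -> (table, subject key column, value column)
    "http://example.com/name": ("Customers", "CustID", "Name"),
    "http://example.com/email": ("Customers", "CustID", "Email"),
}

def rewrite_triple_pattern(predicate_iri: str) -> str:
    """Return SQL producing (subject key, object value) pairs for the pattern ?s <p> ?o."""
    table, key_col, val_col = MAPPING[predicate_iri]
    return f"SELECT {key_col}, {val_col} FROM {table}"

print(rewrite_triple_pattern("http://example.com/name"))
# -> SELECT CustID, Name FROM Customers; a real processor would then template the
#    subject IRI, e.g. http://example.com/customer/{CustID}, for each result row.
```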

Extraction from Semi-Structured Sources

JSON and NoSQL Databases

Knowledge extraction from JSON documents leverages the format's hierarchical and flexible structure to identify entities, properties, and relationships that can be mapped to semantic representations such as RDF triples. JSONPath serves as a query language analogous to XPath for XML, enabling precise navigation and extraction of data from JSON structures without requiring custom scripting. For instance, JSONPath expressions like $.store.book[0].title allow traversal of nested objects and arrays to retrieve specific values, facilitating the isolation of potential knowledge elements like entities or attributes.

Transformation of extracted data into RDF is standardized through JSON-LD, a W3C Recommendation from 2014 that embeds contextual mappings within JSON to serialize linked data. JSON-LD uses a @context to map JSON keys to IRIs from ontologies, enabling automatic conversion of documents into RDF graphs where nested structures represent classes and properties; for example, a JSON object { "name": "Alice", "friend": { "name": "Bob" } } with an appropriate context can yield triples like <Alice> foaf:knows <Bob>. This approach supports schema flexibility in document data, allowing knowledge extraction without rigid predefined schemas.

NoSQL databases amplify these techniques due to their schema-less nature, which mirrors JSON's variability but scales to distributed environments. In document-oriented stores like MongoDB, extraction involves querying collections of JSON-like BSON documents and mapping them to RDF via formal definitions of document structure; one method parses nested fields into subject-predicate-object triples, constructing knowledge graphs by inferring relations from embedded arrays and objects. Graph databases such as Neo4j, queried with Cypher, handle inherently relational data; the Neosemantics plugin exports query results directly to RDF formats like Turtle or JSON-LD, preserving graph traversals as semantic edges without loss of connectivity.

Schema inference automates the discovery of implicit structures in JSON and NoSQL data, treating nested objects as potential classes and their keys as properties to generate ontologies dynamically. Algorithms process datasets in parallel, inferring types for values (e.g., strings, numbers, arrays) and fusing them across documents to mark optional fields or unions, as in approaches using MapReduce-like steps on distributed processing frameworks; this detects hierarchies where, for example, repeated nested objects indicate class instances with inherited properties.

A representative example is extracting knowledge from social media feeds stored in JSON format, such as Twitter data. Processing tweet objects—containing fields like user, text, and timestamps—applies named entity recognition and relation extraction to generate RDF triples; for instance, from a post like "Norway bans petrol cars," tools identify entities (Norway as Location, petrol as Fuel) and relations (ban), yielding triples such as <Norway> <bans> <petrol>, enriched with Schema.org vocabulary, forming a queryable knowledge graph via SPARQL for insights like pollution policies.
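
A minimal, hand-rolled illustration of JSON-LD-style context mapping is sketched below in Python; a real deployment would use a conforming JSON-LD processor, and the FOAF terms are just one possible vocabulary choice:

```python
from rdflib import BNode, Graph, Literal, Namespace

FOAF = Namespace("http://xmlns.com/foaf/0.1/")

# A JSON-LD-like context mapping JSON keys to vocabulary IRIs (hand-rolled here;
# a full JSON-LD processor would handle this generically from a @context block).
context = {"name": FOAF.name, "friend": FOAF.knows}

doc = {"name": "Alice", "friend": {"name": "Bob"}}

g = Graph()

def to_rdf(obj: dict) -> BNode:
    """Recursively convert a nested JSON object into RDF triples."""
    node = BNode()
    for key, value in obj.items():
        pred = context[key]
        if isinstance(value, dict):
            g.add((node, pred, to_rdf(value)))   # nested object -> linked resource
        else:
            g.add((node, pred, Literal(value)))  # scalar value -> literal
    return node

to_rdf(doc)
print(g.serialize(format="turtle"))
```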

Web Data and APIs

Knowledge extraction from web data and APIs involves retrieving and structuring semi-structured information from online sources, such as RESTful endpoints and HTML-embedded markup, to populate knowledge graphs or semantic models. REST APIs typically return data in JSON or XML formats, which can be parsed to identify entities, attributes, and relationships. For instance, JSON responses from APIs are processed using schema inference tools to generate RDF triples, enabling integration with ontologies like schema.org, a collaborative vocabulary for marking up web content with structured data. Schema.org provides extensible schemas that map API outputs to semantic concepts, such as products or events, facilitating automated extraction without custom parsers in many cases.

Web scraping techniques target semi-structured elements embedded in HTML, including Microdata and RDFa, which encode structured data directly within page content. Microdata uses attributes like itemscope and itemprop to denote structured items, while RDFa extends HTML attributes with RDF syntax for richer semantics. Tools like the Any23 library parse these formats to extract RDF quads from web corpora, as demonstrated by the Web Data Commons project, which has processed billions of pages from the Common Crawl to yield datasets of over 70 billion triples. This approach allows extraction of schema.org-compliant data, such as organization details or reviews, directly from webpages, converting them into knowledge graph nodes and edges.

Ethical and legal considerations are paramount in web data extraction to ensure compliance and sustainability. Practitioners must respect robots.txt files, a standard protocol that instructs crawlers on permissible site access, preventing overload or unauthorized scraping. Additionally, under the EU's General Data Protection Regulation (GDPR), extracting personal data—such as user identifiers from API responses—requires a lawful basis and consent, with non-compliance risking fines of up to 4% of global turnover. Rate limiting, typically implemented via delays between requests, mitigates server strain and aligns with terms of service, promoting responsible data acquisition.

A representative case is the extraction of product data via Amazon's product APIs, which provide endpoints for item attributes like price, description, and reviews. Amazon has leveraged such data in constructing commonsense knowledge graphs to enhance product recommendations, encoding relationships between items (e.g., "compatible with") using graph databases. This process involves parsing responses with schema.org vocabularies to infer entities and relations, yielding graphs that support real-time querying for over a billion products.
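
The compliance practices mentioned above can be wired into a harvesting script directly; the sketch below uses Python's standard urllib.robotparser plus the third-party requests library against a hypothetical endpoint, with a fixed delay as a crude form of rate limiting:

```python
import time
import urllib.robotparser

import requests  # third-party; install with: pip install requests

BASE = "https://example.com"          # hypothetical site
ENDPOINT = BASE + "/api/products"     # hypothetical JSON endpoint

# Respect robots.txt before crawling or calling scrape-like endpoints.
robots = urllib.robotparser.RobotFileParser()
robots.set_url(BASE + "/robots.txt")
robots.read()

def polite_get(url: str, delay_seconds: float = 1.0):
    """Fetch a URL only if robots.txt allows it, with a rate-limiting delay."""
    if not robots.can_fetch("knowledge-extractor-bot", url):
        raise PermissionError(f"robots.txt disallows fetching {url}")
    time.sleep(delay_seconds)  # simple rate limiting between requests
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.json()     # REST APIs commonly return JSON for downstream parsing

products = polite_get(ENDPOINT)
# The JSON payload can then be mapped to schema.org terms (e.g., Product, Offer)
# and converted to RDF as in the JSON example of the previous subsection.
```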

Parsing and Schema Inference Methods

Parsing and schema inference methods address the challenge of deriving structured representations from semi-structured data, such as JSON or XML, where explicit schemas are absent or inconsistent. These methods involve analyzing the data's internal structure, identifying recurring patterns in fields, types, and relationships, and generating a schema that captures the underlying organization without requiring predefined mappings. Unlike direct mapping techniques from structured sources, which rely on rigid predefined schemas, inference approaches handle variability by clustering similar elements and resolving ambiguities algorithmically.

Inference techniques often employ record linkage to identify and group similar fields across records, treating field names and values as entities to be matched despite variations in naming or format. For instance, edit distance metrics, such as Levenshtein distance, measure the similarity between field names by calculating the minimum number of single-character edits needed to transform one string into another, enabling the merging of semantically equivalent fields like "user_name" and "username." This process facilitates schema normalization by linking disparate representations into unified attributes, improving data integration in semi-structured datasets.

Tools like OpenRefine support schema inference through data cleaning and transformation workflows, allowing users to cluster similar values, facet data by types, and export reconciled structures to formats such as JSON or RDF. OpenRefine processes semi-structured inputs by iteratively refining clusters based on user-guided or automated similarity thresholds, enabling the detection of field types and hierarchies without manual schema design. Additionally, specialized inference libraries, such as those implementing algorithms from the EDBT'17 framework, automate the generation of schemas from sample JSON instances by analyzing type distributions and nesting patterns across records.

Probabilistic models enhance inference by estimating field types under uncertainty, particularly in datasets with mixed or evolving formats. Basic Bayesian approaches compute the posterior probability of a type given observed values, using Bayes' theorem as P(\text{type} \mid \text{value}) = \frac{P(\text{value} \mid \text{type}) \cdot P(\text{type})}{P(\text{value})}, where priors reflect common data patterns (e.g., strings for names) and likelihoods are derived from value characteristics like length or format. This enables robust type prediction for fields exhibiting variability, such as numeric identifiers that may appear as strings.

A typical workflow for schema inference begins with raw parsing to extract key-value pairs and nested objects, followed by applying record linkage and probabilistic techniques to cluster fields and infer types. The resulting schema is then mapped to an ontology by translating structures into classes and properties, often using rule-based transformations to align with standards like OWL. Validation steps involve sampling additional records against the inferred schema to measure coverage and accuracy, iterating refinements if discrepancies exceed thresholds, ensuring the ontology supports downstream knowledge extraction tasks.
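
The following Python sketch illustrates the two ingredients discussed above—edit-distance-based field clustering and a simple frequency-based approximation of probabilistic type inference—using hand-written helper functions rather than any particular inference library; thresholds and field names are illustrative:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def same_field(name_a: str, name_b: str, threshold: int = 2) -> bool:
    """Cluster field names whose normalized forms are within a small edit distance."""
    norm = lambda s: s.lower().replace("_", "").replace("-", "")
    return levenshtein(norm(name_a), norm(name_b)) <= threshold

def infer_type(values: list) -> str:
    """Naive type inference: pick the type that explains most observed values."""
    votes = {"integer": 0, "number": 0, "string": 0}
    for v in values:
        s = str(v)
        if s.lstrip("-").isdigit():
            votes["integer"] += 1
        elif s.replace(".", "", 1).lstrip("-").isdigit():
            votes["number"] += 1
        else:
            votes["string"] += 1
    return max(votes, key=votes.get)

print(same_field("user_name", "username"))  # True -> merge into one attribute
print(infer_type(["101", "102", "abc"]))    # "integer" wins despite one noisy string
```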

Extraction from Unstructured Sources

Natural Language Processing Foundations

Natural language processing (NLP) forms the bedrock for extracting knowledge from unstructured textual sources by enabling the systematic analysis of linguistic structures. At its core, the NLP pipeline begins with tokenization, which breaks down raw text into smaller units such as words, subwords, or characters, facilitating subsequent processing steps. This initial phase addresses challenges like handling punctuation, contractions, and language-specific orthographic rules, ensuring that text is segmented into meaningful units for further analysis. For instance, in English, tokenization typically splits sentences on whitespace while resolving ambiguities like "don't" into "do" and "n't".

Following tokenization, part-of-speech (POS) tagging assigns grammatical categories—such as nouns, verbs, and adjectives—to each token based on its syntactic role and context. This step relies on probabilistic models trained on annotated corpora to disambiguate words with multiple possible tags, like "run" as a noun or verb. A seminal advancement in POS tagging came in the late 1980s and 1990s with the adoption of statistical models, particularly Hidden Markov Models (HMMs), which model sequences of tags as hidden states emitting observed words, achieving accuracies exceeding 95% on standard benchmarks.

Dependency parsing extends this by constructing a tree-structured representation of syntactic relationships between words, identifying heads (governors) and dependents to reveal phrase structures and grammatical functions. Tools like the Stanford Parser employ unlexicalized probabilistic context-free grammars to produce dependency trees with high precision, often around 90% unlabeled attachment score on Wall Street Journal data. These parses are crucial for understanding sentence semantics, such as subject-verb-object relations, without relying on full constituency trees.

Linguistic resources underpin these techniques by providing annotated data and lexical knowledge. The Penn Treebank, a large corpus of over 4.5 million words from diverse sources like news articles, offers bracketed syntactic parses and POS tags, serving as a primary training dataset for statistical parsers since its release. Complementing this, WordNet (1995) organizes English words into synsets—groups of synonyms linked by semantic relations like hypernymy—enabling inference of word meanings and relations for tasks like word sense disambiguation.

As a key preprocessing step in the pipeline, named entity recognition (NER) identifies and classifies entities such as persons, organizations, and locations within text, typically using rule-based patterns or statistical classifiers trained on annotated examples. Early NER efforts, formalized during the Sixth Message Understanding Conference (MUC-6) in 1995, focused on extracting entities from news texts with F1 scores around 90% for core types, laying the groundwork for scalable entity detection without domain-specific tuning.

The evolution of these NLP foundations traces from rule-based systems in the 1960s, exemplified by ELIZA, which used hand-crafted pattern-matching scripts to simulate conversation, to statistical paradigms in the 1990s that leveraged probabilistic models like HMMs for robust handling of ambiguity and variability in natural language. This shift enabled more data-driven approaches, improving accuracy and scalability for knowledge extraction pipelines.
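
A typical pipeline of these steps can be run with an off-the-shelf library; the sketch below assumes spaCy with its small English model (en_core_web_sm) installed, and prints tokens with POS tags, dependency relations, and named entities:

```python
import spacy

# Requires the small English pipeline: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Barack Obama visited Paris in 2015.")

# Tokenization, POS tagging, and dependency parsing happen in one pipeline pass.
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

# Named entity recognition over the same document.
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g., ("Barack Obama", "PERSON"), ("Paris", "GPE")
```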

Traditional Information Extraction

Traditional information extraction encompasses rule-based methods that employ predefined patterns and heuristics to identify and extract entities, such as names, dates, and organizations, as well as relations between them from unstructured text. These approaches originated in the 1990s through initiatives like the Message Understanding Conferences (MUC), where systems competed to process documents into structured templates using cascading rule sets. Unlike later data-driven techniques, traditional methods prioritize explicit linguistic rules derived from domain expertise, often applied after basic preprocessing like tokenization and part-of-speech tagging to segment and annotate text.

A core technique in traditional information extraction is pattern matching, frequently implemented via regular expressions to capture syntactic structures indicative of target information. For instance, a regular expression such as \b[A-Z][a-z]+ [A-Z][a-z]+\b can match person names by targeting capitalized word sequences, while patterns like \d{1,2}/\d{1,2}/\d{4} extract dates in MM/DD/YYYY format. More sophisticated systems extend this to relational patterns, such as proximity-based rules that link entities (e.g., "CEO of [Organization]") to infer roles without deep semantic analysis. The GATE framework, released in 1996, exemplifies this by enabling developers to build modular pipelines of processing resources, including finite-state transducers and cascades for sequential entity recognition followed by relation extraction and co-reference resolution. In GATE, rules are often specified in JAPE (Java Annotation Pattern Engine), allowing patterns like {Token.kind == uppercase, Token.string == "Inc."} to tag corporate entities, which then feed into higher-level relation cascades.

Evaluation of traditional information extraction systems traditionally employs precision, recall, and the F1-score, metrics standardized in the MUC evaluations to measure extracted items against gold-standard annotations. Precision (P) is the ratio of correctly extracted items to total extracted items, recall (R) is the ratio of correctly extracted items to total relevant items in the text, and the F1-score balances them as F1 = \frac{2 \times (P \times R)}{P + R}. For example, in MUC-6 tasks, top rule-based systems achieved F1-scores around 80-90% for entity extraction in controlled domains like management succession reports, demonstrating high accuracy on well-defined patterns but variability across diverse texts.

Despite their interpretability and precision on narrow tasks, traditional methods suffer from scalability limitations due to the reliance on hand-crafted rules, which require extensive manual effort to cover linguistic variations, ambiguities, and domain shifts. As text corpora grow in size and complexity, maintaining and extending thousands of rules becomes labor-intensive and error-prone, often resulting in brittle systems that fail on unseen patterns or dialects. This hand-engineering has historically constrained their application to broad-scale text processing without significant manual intervention.
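
A minimal illustration of rule-based extraction and MUC-style scoring is sketched below in Python; the text, patterns, and gold annotations are invented for the example:

```python
import re

text = "John Smith, CEO of Acme Inc., joined on 03/15/2021. Jane Doe left on 7/4/2020."

# Hand-crafted patterns, in the spirit of early rule-based extractors.
name_pattern = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")
date_pattern = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")

# The name pattern over-matches "Acme Inc", illustrating a precision error.
names = name_pattern.findall(text)
dates = date_pattern.findall(text)

def f1(extracted: set, gold: set) -> float:
    """Precision/recall/F1 against a gold-standard annotation set."""
    true_pos = len(extracted & gold)
    precision = true_pos / len(extracted) if extracted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold_names = {"John Smith", "Jane Doe"}
print(names, dates)                          # extracted spans
print(round(f1(set(names), gold_names), 2))  # 0.8: recall is perfect, precision is 2/3
```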

Ontology-Based and Semantic Extraction

Ontology-based information extraction (OBIE) leverages predefined ontologies to guide the identification and structuring of entities, relations, and events from text, ensuring extracted knowledge aligns with a formal semantic model. Unlike traditional information extraction, which relies on general patterns, OBIE maps text spans—such as named entities or phrases—to specific ontology classes and properties, often using rule-based systems or machine learning models trained on ontology schemas. This process typically involves three stages: recognizing relevant text elements, classifying them according to ontology concepts, and populating the ontology with instances and relations.

In the OBIE workflow, rules or classifiers disambiguate and categorize extracted elements by referencing the ontology's hierarchical structure and constraints. For instance, a rule-based approach might use lexical patterns combined with ontology axioms to link a mention like "Paris" to the class City rather than a person's name, while machine learning methods employ supervised classifiers fine-tuned on annotated corpora aligned with the ontology. Tools such as PoolParty facilitate this by integrating ontology management with extraction pipelines; for example, PoolParty can import the DBpedia ontology to automatically tag entities in text, extracting instances of classes like Person or Organization and linking them to DBpedia URIs for semantic enrichment.

Semantic annotation standards further support OBIE by enabling the markup of text with RDF triples that conform to the ontology. The Evaluation and Report Language (EARL) 1.0, a W3C Working Draft schema from the early to mid-2000s, provides a framework for representing annotations as RDF statements, allowing tools to assert properties like dc:subject or foaf:depicts directly on text fragments. This RDF-based markup ensures interoperability, as annotations can be queried and integrated into larger knowledge bases using SPARQL.

A key advantage of ontology-based methods is their ability to enforce consistency and resolve ambiguities in extracted knowledge. For example, in processing the term "Apple," contextual analysis guided by an ontology like DBpedia's can distinguish between the Fruit class (e.g., in a culinary text) and the Company class (e.g., in a financial report), preventing erroneous linkages and improving downstream applications such as semantic search. This structured guidance reduces errors compared to pattern-only approaches.
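
The disambiguation behavior described above can be approximated with a toy rule-based classifier; the mini-ontology lexicon and context cue lists below are invented for illustration and stand in for a real ontology such as DBpedia's:

```python
# Minimal ontology-guided disambiguation: class assignment is constrained by a
# small hand-built ontology lexicon plus contextual cue words (purely illustrative).

ONTOLOGY = {
    "City": {"paris", "berlin"},
    "Company": {"apple", "google"},
    "Fruit": {"apple", "banana"},
}

CONTEXT_CUES = {
    "Company": {"shares", "ceo", "earnings", "stock"},
    "Fruit": {"pie", "eat", "juice", "orchard"},
}

def classify(mention: str, sentence: str) -> str:
    """Pick the ontology class whose context cues best match the sentence."""
    candidates = [cls for cls, lexicon in ONTOLOGY.items() if mention.lower() in lexicon]
    if not candidates:
        return "Unknown"
    if len(candidates) == 1:
        return candidates[0]
    words = {w.strip(".,") for w in sentence.lower().split()}
    # Ambiguous mention: score candidate classes by overlapping cue words.
    scores = {cls: len(words & CONTEXT_CUES.get(cls, set())) for cls in candidates}
    return max(scores, key=scores.get)

print(classify("Apple", "Apple shares rose after the CEO spoke."))  # Company
print(classify("Apple", "She baked an apple pie."))                 # Fruit
```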

Advanced Techniques

Machine Learning and AI-Driven Extraction

Machine learning approaches to knowledge extraction have evolved from traditional supervised techniques to advanced deep learning and large language models, enabling automated identification and structuring of entities, relations, and concepts from diverse data sources. Supervised methods, particularly Conditional Random Fields (CRFs), have been foundational for tasks like named entity recognition (NER), where models are trained to assign labels to sequences of tokens representing entities such as persons, organizations, and locations. Introduced as probabilistic models for segmenting and labeling sequence data, CRFs address limitations of earlier approaches like Hidden Markov Models by directly modeling conditional probabilities and avoiding label bias issues. These models are typically trained on annotated corpora, with the CoNLL-2003 dataset serving as a benchmark for English NER, containing over 200,000 tokens from Reuters news articles labeled for four entity types. Early applications demonstrated CRFs achieving F1 scores around 88% on this dataset, establishing their efficacy for structured extraction in knowledge bases.

Deep learning has advanced these capabilities through transformer architectures, which leverage self-attention mechanisms to capture long-range dependencies in text far more effectively than recurrent models. The transformer model, introduced in 2017, forms the backbone of modern systems for both NER and relation extraction by processing entire sequences in parallel. BERT (Bidirectional Encoder Representations from Transformers), released in 2018, exemplifies this shift; its pre-trained encoder is fine-tuned on task-specific data to excel in relation extraction, where it identifies semantic links between entities, such as "located_in" or "works_for," by treating the task as a classification over sentence spans. Fine-tuned BERT models have set state-of-the-art benchmarks, achieving F1 scores exceeding 90% on datasets like SemEval-2010 Task 8 for relation classification, outperforming prior methods by integrating contextual embeddings. This fine-tuning process adapts the model's bidirectional understanding of context, making it particularly suited for extracting relational knowledge from unstructured text.

Unsupervised methods complement supervised ones by discovering patterns without labeled data, often through clustering techniques that group similar textual elements to infer entities or topics. Latent Dirichlet Allocation (LDA), a generative probabilistic model introduced in 2003, enables topic-based extraction by representing documents as mixtures of latent topics, where each topic is a distribution over words; this uncovers thematic structures that can reveal implicit entities or relations in corpora. For instance, LDA has been applied to cluster news articles into latent topics, facilitating entity discovery without annotations, as demonstrated in aspect extraction from reviews where it identifies opinion targets with coherence scores above 0.5 on benchmark sets. These approaches are valuable for scaling extraction to large, unlabeled datasets, though they require post-processing to map topics to structured knowledge.

Recent advances in large language models (LLMs) have introduced zero-shot extraction, allowing models to perform knowledge extraction without task-specific training by leveraging emergent capabilities from vast pre-training.
GPT-4, released in 2023, supports zero-shot and few-shot extraction through prompting, achieving competitive F1 scores ranging from 67% to 98% on radiological reports for extracting clinical findings, rivaling supervised models in low-resource settings. This extends to multimodal data, where vision-language models process text-image pairs for integrated extraction; for example, systems using GPT-3.5 in zero-shot mode extract tags from images and captions, outperforming human annotations on benchmark datasets. As of 2025, subsequent models like GPT-4o have further improved zero-shot performance in such tasks. These developments shift knowledge extraction toward more flexible, generalizable AI systems, though challenges like hallucination persist.
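
As a concrete example of transformer-based extraction, the sketch below uses the Hugging Face transformers pipeline with a publicly available BERT NER checkpoint (dslim/bert-base-NER is one such model; any comparable token-classification model could be substituted):

```python
from transformers import pipeline

# A BERT model fine-tuned for NER; downloaded from the Hugging Face Hub on first use.
ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple",  # merge word pieces into whole entity spans
)

text = "Angela Merkel visited the Volkswagen plant in Wolfsburg."
for entity in ner(text):
    # Each result carries the span, predicted type (e.g., PER/ORG/LOC), and a confidence score.
    print(entity["word"], entity["entity_group"], round(float(entity["score"]), 3))
```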

Knowledge Graph Construction

Knowledge graph construction involves assembling extracted entities and relations from various sources into a structured representation, typically through a series of interconnected steps that ensure coherence and usability. The process begins with entity resolution, where identified entities from text or data are mapped to existing nodes in a knowledge graph, or new nodes are created if no matches exist. This step is crucial for avoiding duplicates and maintaining graph integrity, often employing similarity metrics such as the Jaccard coefficient, which measures the overlap between sets of attributes or neighbors of candidate entities to determine matches. For instance, in embedding-assisted approaches like EAGER, Jaccard similarity is combined with graph embeddings to resolve entities across knowledge graphs by comparing neighborhood structures.

Following entity resolution, relation inference identifies and extracts connections between entities, generating statements in subject-predicate-object form. These triples form the fundamental units of RDF graphs, as defined by the W3C RDF standard, where subjects are IRIs or blank nodes, predicates are IRIs denoting relationships, and objects may be IRIs, blank nodes, or literals. Models like REBEL, a sequence-to-sequence model based on BART, facilitate end-to-end extraction by linearizing triples into text sequences, enabling the population of graphs with over 200 relation types from unstructured input. Graph population then integrates these triples into a cohesive structure, often adhering to vocabularies such as schema.org, which provides extensible schemas for entities like people, places, and products to enhance interoperability in knowledge graphs.

A key challenge in knowledge graph construction is scalability, particularly when handling billions of triples across massive datasets. For example, Wikidata grew to over 100 million entities by 2023, necessitating efficient algorithms for relation inference and entity resolution to manage exponential growth without compromising query performance. Recent advancements, including large language models for joint entity-relation extraction, help automate these steps while addressing noise in extracted data.
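
The Jaccard-based matching step can be sketched in a few lines of Python; the attribute sets and the 0.5 merge threshold below are illustrative choices rather than values prescribed by any particular system:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient: overlap of two attribute (or neighbor) sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Candidate nodes described by their attribute/neighbor sets (invented values).
mention = {"barack obama", "politician", "honolulu"}
kg_node_a = {"barack obama", "politician", "united states", "honolulu"}
kg_node_b = {"michelle obama", "lawyer", "chicago"}

threshold = 0.5
for name, node in [("A", kg_node_a), ("B", kg_node_b)]:
    score = jaccard(mention, node)
    decision = "merge with existing node" if score >= threshold else "create new node"
    print(name, round(score, 2), decision)  # A: 0.75 merge; B: 0.0 create new
```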

Integration and Fusion Methods

Integration and fusion methods in knowledge extraction involve combining facts and entities derived from multiple heterogeneous sources to create a coherent, unified knowledge base. These methods address challenges such as schema mismatches, redundant information, and inconsistencies by aligning structures, merging similar entities, and resolving discrepancies. The process ensures that the resulting knowledge base maintains high accuracy and completeness, often leveraging probabilistic models or rule-based approaches to weigh evidence from different extractors.

Fusion techniques commonly include ontology alignment, which matches concepts and relations across ontologies to enable interoperability. For instance, alignment tools compute similarities between OWL entities based on linguistic and structural features to generate mappings. Probabilistic merging extends this by treating knowledge as uncertain triples and fusing them using statistical models, such as supervised classifiers, to estimate the probability of truth for each fact across sources. These approaches prioritize high-confidence alignments, reducing errors in cross-ontology integration.

Conflict resolution during fusion relies on mechanisms like majority voting and confidence scoring to reconcile differing extractions. Majority voting aggregates predictions from multiple extractors, selecting the most frequent assertion for a given fact, while weighted voting incorporates confidence scores—probabilities output by extraction models—to favor more reliable sources. For example, in knowledge base construction, facts with conflicting attributes are resolved by thresholding low-confidence scores or applying source-specific weights derived from historical accuracy.

Standards such as the Linked Data principles, outlined by Tim Berners-Lee in 2006, guide fusion by emphasizing the use of URIs for entity identification, dereferenceable HTTP access, and RDF-based descriptions to facilitate linking across datasets. The Silk framework implements these principles through a declarative link discovery language, enabling scalable matching of entities based on similarity metrics like string distance and data type comparisons.

A prominent example is Google's Knowledge Vault project from 2014, which fused probabilistic extractions from web content with prior knowledge from structured bases like Freebase to construct a web-scale knowledge repository containing 1.6 billion facts, of which 271 million were rated as confident. This system applied probabilistic fusion to propagate confidence across sources, achieving a 30% improvement in precision over single-source baselines by resolving conflicts through probabilistic inference.
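
The two conflict-resolution mechanisms described above can be contrasted with a small Python sketch; the extractor outputs, confidence values, and threshold are invented for illustration:

```python
from collections import defaultdict

# Conflicting extractions of the same fact slot from three extractors,
# each with a confidence score (values are illustrative).
extractions = [
    {"value": "Paris", "confidence": 0.9, "source": "extractor_a"},
    {"value": "Lyon",  "confidence": 0.4, "source": "extractor_b"},
    {"value": "Paris", "confidence": 0.7, "source": "extractor_c"},
]

def majority_vote(items):
    """Select the most frequently asserted value, ignoring confidence."""
    counts = defaultdict(int)
    for item in items:
        counts[item["value"]] += 1
    return max(counts, key=counts.get)

def weighted_vote(items, min_confidence=0.5):
    """Sum confidence scores per value, discarding low-confidence assertions."""
    weights = defaultdict(float)
    for item in items:
        if item["confidence"] >= min_confidence:
            weights[item["value"]] += item["confidence"]
    return max(weights, key=weights.get)

print(majority_vote(extractions))  # Paris: 2 of 3 extractors agree
print(weighted_vote(extractions))  # Paris: 1.6 total confidence; Lyon falls below threshold
```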

Applications and Examples

Entity Linking and Resolution

Entity linking and resolution is a critical step in knowledge extraction that connects entity mentions identified in text—such as person names, locations, or organizations—to their corresponding entries in a structured knowledge base, like DBpedia or YAGO, while resolving ambiguities arising from multiple possible referents for the same mention. This process typically follows named entity recognition (NER) from traditional methods and enhances the semantic understanding of unstructured text by grounding it in a verifiable knowledge source.

The process begins with candidate generation, where potential entities are retrieved for each mention using techniques such as surface form matching against knowledge base titles, redirects, and anchor texts to create a shortlist of plausible candidates, often limited to the top-k most relevant ones to manage computational efficiency. Disambiguation then resolves the correct entity by comparing the local context around the mention—such as surrounding words or keyphrases—with entity descriptions, commonly via vector representations like bag-of-words or embeddings, and incorporating global coherence across all mentions in the document to ensure consistency, for instance, by modeling entity relatedness through shared links in the underlying knowledge base.

Key algorithms include AIDA, introduced in 2011, which employs a graph-based approach for news articles by constructing a mention-entity bipartite graph weighted by popularity priors, contextual similarity (using keyphrase overlap), and collective coherence (via in-link overlap in Wikipedia), then applying a greedy dense-subgraph extraction for joint disambiguation to achieve global consistency. Collective classification methods, such as those in AIDA, extend local decisions by propagating information across mentions, outperforming independent ranking in ambiguous contexts through techniques like probabilistic graphical models or iterative optimization.

Evaluation metrics for entity linking emphasize linking accuracy, with micro-F1 scores commonly reported on benchmarks like the AIDA-YAGO dataset, where AIDA achieves approximately 82% micro-accuracy at rank 1, reflecting strong performance in disambiguating mentions from CoNLL-2003 news texts linked to YAGO entities. These metrics account for both correct links and handling of unlinkable mentions (NIL), providing a balanced measure of linking quality in real-world scenarios. In applications, entity linking enhances search engines by enabling semantic retrieval, where disambiguated entities improve query understanding and result relevance, as demonstrated in systems that integrate linking with entity retrieval to support entity-oriented search over large document collections.
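
A toy version of the candidate-generation-plus-disambiguation pipeline is sketched below; the candidate inventory, descriptions, and bag-of-words scoring are simplifications of what systems like AIDA actually use, and all names and values are invented:

```python
# Toy linker: candidate generation from surface forms, then disambiguation by
# bag-of-words overlap between the mention context and candidate descriptions.

CANDIDATES = {
    "paris": {
        "Paris_(city)": "capital of france seine louvre eiffel",
        "Paris_Hilton": "american media personality socialite hotel heiress",
    }
}

def tokenize(text: str) -> set:
    return set(text.lower().replace(",", " ").replace(".", " ").split())

def link(mention: str, context: str) -> str:
    candidates = CANDIDATES.get(mention.lower(), {})
    scores = {
        entity: len(tokenize(context) & tokenize(description))
        for entity, description in candidates.items()
    }
    return max(scores, key=scores.get) if scores else "NIL"  # NIL for unlinkable mentions

print(link("Paris", "The Louvre in Paris sits on the Seine."))     # Paris_(city)
print(link("Paris", "The socialite launched a new hotel brand."))  # Paris_Hilton
```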

Domain-Specific Use Cases

In healthcare, knowledge extraction plays a pivotal role in processing electronic health records (EHRs) to identify and standardize medical entities, enabling better clinical decision-making and research. The Unified Medical Language System (UMLS) is widely employed to map unstructured clinical text from EHRs to standardized concepts, facilitating the integration of diverse data sources into relational databases for analysis. For instance, UMLS-based methods extract and categorize clinical findings from clinical narratives, linking them to anatomical locations to support diagnostic applications. During the 2020s, AI-driven extraction techniques were extensively applied to biomedical literature, where models annotated mechanisms and relations from scientific papers to build knowledge bases that accelerated drug development and treatment insights. These efforts, often leveraging entity linking to connect extracted terms to established biomedical ontologies, have demonstrated substantial efficiency gains, such as reducing manual annotation workloads by approximately 80% in collaborative human-LLM frameworks for screening biomedical texts.

In the financial sector, knowledge extraction from regulatory reports like 10-K filings involves linguistic analysis to detect indicators of deception, such as overly positive or evasive language, which aids in identifying potential fraud. Relation extraction further enhances this by constructing graphs that model connections between financial entities, such as supplier-customer relationships or anomalous transaction patterns, to flag fraudulent activities in financial networks. For example, contextual language models applied to textual disclosures in annual reports have achieved high accuracy in fraud detection by quantifying sentiment shifts and relational inconsistencies, improving regulatory oversight and risk management. Such applications yield significant ROI, as automated analysis reduces the time and cost associated with manual audits, enabling proactive fraud prevention in large-scale financial datasets.

E-commerce platforms utilize knowledge extraction to derive product insights from customer reviews, constructing knowledge graphs that capture attributes, sentiments, and relations for enhanced recommendations. Amazon's approaches in the early 2020s, for instance, embed review texts into knowledge graphs using techniques like entity extraction and graph embedding, allowing the system to infer commonsense relationships between products and user preferences. By 2023, review-enhanced knowledge graphs integrated multimodal data from e-commerce datasets, improving recommendation accuracy by incorporating fine-grained features like aspect-based sentiments from user feedback. This results in more personalized suggestions, boosting customer engagement and sales conversion rates through scalable, automated knowledge fusion from unstructured review corpora.

Evaluation Metrics and Challenges

Evaluation of knowledge extraction systems relies on a combination of intrinsic and extrinsic metrics to assess both the quality of extracted elements and their utility in broader applications. Intrinsic metrics focus on the direct performance of extraction components, such as precision, recall, and F1-score, which measure the accuracy of identifying entities, relations, and events against ground-truth annotations. These metrics evaluate correctness and coverage, for instance, by calculating precision as the ratio of true positives to the sum of true and false positives, recall as true positives over true positives plus false negatives, and F1 as their harmonic mean. In knowledge graph construction, additional intrinsic measures like mean reciprocal rank (MRR) and root mean square error (RMSE) assess ranking quality and accuracy for predicted links or numerical attributes.

Extrinsic metrics, in contrast, gauge the effectiveness of extracted knowledge in downstream tasks, such as question answering or recommendation systems, where success is tied to overall task performance rather than isolated extraction fidelity. For entity linking and link prediction, common extrinsic metrics include Hits@K, which computes the fraction of correct entities ranked in the top K positions, and mean reciprocal rank (MRR), the average of the reciprocal ranks of the true entities. Hits@K is particularly useful for evaluating retrieval-based linking, as it prioritizes top-ranked results while ignoring lower ranks, with values ranging from 0 to 1 where higher indicates better performance. These metrics highlight how well extracted entities integrate into knowledge bases for practical use, such as improving search relevance.

Despite advances in metrics, knowledge extraction faces significant challenges, including data privacy concerns amplified by regulations like the General Data Protection Regulation (GDPR), which took effect in 2018. GDPR's principles of purpose limitation and data minimization require that personal data used in extraction processes align with initial collection purposes and be pseudonymized to reduce re-identification risks, particularly when AI infers sensitive attributes from unstructured text. For instance, automated profiling in extraction can trigger Article 22 safeguards, mandating human oversight and transparency to protect data subjects' rights, though ambiguities in explaining AI logic persist.

Hallucinations in large language models (LLMs) pose another critical challenge, where models generate fabricated facts during relation or entity extraction, undermining reliability. Studies highlight that LLMs exhibit factual inconsistencies when constructing knowledge graphs from text, often due to overgeneralization or incomplete world knowledge. For example, benchmarks like HaluEval reveal response-level hallucinations in extraction tasks, prompting the use of knowledge graphs for grounding via retrieval-augmented generation to verify outputs.

Bias issues further complicate extraction, stemming from underrepresentation in training datasets that skew results toward dominant demographics. In relation extraction datasets like NYT and CrossRE, women and Global South entities are underrepresented (11.8-20.0% for women), leading to allocative biases where certain relations are disproportionately assigned to overrepresented groups. Representational biases manifest as stereotypical associations, such as disproportionately linking women to particular relation types. Mitigation strategies include curating diverse corpora for pre-training, which can reduce gender bias by 3-5% but may inadvertently amplify geographic biases if mitigation is not multi-axial.
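
To make the ranking metrics above concrete, the following Python sketch computes Hits@K and MRR over two invented queries:

```python
def hits_at_k(ranked_lists, gold_entities, k=10):
    """Fraction of queries whose correct entity appears in the top-k ranking."""
    hits = sum(1 for ranking, gold in zip(ranked_lists, gold_entities)
               if gold in ranking[:k])
    return hits / len(gold_entities)

def mean_reciprocal_rank(ranked_lists, gold_entities):
    """Average of 1/rank of the correct entity (0 if it is absent from the list)."""
    total = 0.0
    for ranking, gold in zip(ranked_lists, gold_entities):
        if gold in ranking:
            total += 1.0 / (ranking.index(gold) + 1)
    return total / len(gold_entities)

# Two toy queries: the gold entity is ranked 1st for the first, 3rd for the second.
rankings = [["Q42", "Q5", "Q7"], ["Q1", "Q9", "Q64"]]
gold = ["Q42", "Q64"]

print(hits_at_k(rankings, gold, k=1))        # 0.5
print(mean_reciprocal_rank(rankings, gold))  # (1/1 + 1/3) / 2 ≈ 0.67
```
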
Looking ahead, scalability remains a key challenge for knowledge extraction, especially in resource-constrained environments, with ongoing developments in edge computing integration as of 2025 enabling low-latency processing. Edge computing supports low-latency processing by deploying lightweight models on distributed devices, addressing limitations in applications like autonomous systems where extraction and inference must occur in milliseconds. Advances in dynamic resource provisioning and hybrid scaling will support scalable, privacy-preserving extraction at the edge, though challenges in hardware heterogeneity and model optimization persist.

Modern Tools and Developments

Survey of Established Tools

Established tools for knowledge extraction encompass a range of mature software suites that handle extraction from unstructured text, semantic representations, and structured data sources. These tools, developed primarily before 2023, provide robust pipelines for tasks like entity identification, relation extraction, and ontology management, forming the backbone of many knowledge extraction workflows.

In the domain of natural language processing (NLP) and information extraction (IE), the Stanford NLP suite stands as a foundational toolkit originating in the early 2000s, with its core parser released in 2002 and a unified CoreNLP package in 2010. This Java-based suite includes annotators for part-of-speech tagging, named entity recognition (NER), dependency parsing, and open information extraction, enabling the derivation of structured knowledge from raw text through modular pipelines. Widely adopted in academia and industry, it supports multilingual processing and integrates with broader software ecosystems for scalable extraction. Complementing this, spaCy, an open-source Python library first released in 2015, emphasizes efficiency and production-ready pipelines for knowledge extraction. It offers pre-trained models for tokenization, NER, dependency parsing, and lemmatization, with customizable components for rule-based and statistical extraction methods. spaCy's architecture allows rapid processing of large corpora, making it ideal for extracting entities and relations from documents in real-world applications.

For semantic knowledge extraction, Protégé serves as a prominent ontology editor, with its modern versions building on earlier prototypes from the 1980s and 1990s. This free tool supports the development and editing of ontologies in OWL and RDF formats, facilitating the formalization of extracted knowledge into reusable schemas and taxonomies. Protégé includes plugins for reasoning, visualization, and integration with IE outputs, aiding in the construction of domain-specific knowledge bases. Apache Jena, an open-source Java framework first released in 2000, specializes in handling RDF data for semantic extraction and storage. It provides APIs for reading, writing, and querying RDF graphs using SPARQL, along with inference engines for deriving implicit knowledge from explicit extractions. Jena's modular design supports triple stores and linked data applications, enabling the fusion of extracted triples into coherent knowledge graphs.

Addressing structured data extraction, Talend Open Studio for Data Integration, launched in 2006, functions as an ETL platform with graphical job designers for mapping and transforming data. It connects to databases, files, and APIs to extract relational data, applying transformations that can populate schemas or ontologies. The tool's component-based approach supports schema inference and data quality checks, essential for integrating structured sources into broader knowledge extraction pipelines; however, the open-source version was discontinued in 2024. These tools draw on established extraction methods, such as rule-based and probabilistic models, to process diverse inputs. To compare their capabilities, the following table summarizes key aspects:
| Tool | Key Features | Supported Sources | Open-Source Status |
| --- | --- | --- | --- |
| Stanford NLP Suite | POS tagging, NER, dependency parsing, open IE pipelines | Unstructured text (multilingual) | Yes (GPL) |
| spaCy | Tokenization, NER, dependency parsing, customizable statistical pipelines | Unstructured text (English-focused, extensible) | Yes (MIT) |
| Protégé | Ontology editing, OWL/RDF support, reasoning plugins | Ontology files, semantic schemas | Yes (BSD) |
| Apache Jena | RDF manipulation, SPARQL querying, inference engines | RDF graphs, triple stores | Yes (Apache 2.0) |
| Talend Open Studio | ETL jobs, data mapping, schema inference, quality profiling | Databases, files, APIs (structured) | Yes (GPL), discontinued 2024 |
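As a minimal sketch of how such NLP pipelines are typically driven, the following Python snippet runs spaCy's pretrained English pipeline (assuming the `en_core_web_sm` model has been downloaded) and prints named entities plus a naive dependency-based subject-verb-object pass; the example sentence and the SVO heuristic are illustrative additions, not part of spaCy's API.

```python
# Minimal spaCy extraction sketch.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")  # pretrained pipeline: tagger, parser, NER
doc = nlp("Google acquired DeepMind in 2014.")

# Named entities identified by the statistical NER component.
for ent in doc.ents:
    print(ent.text, ent.label_)

# Naive subject-verb-object pass over the dependency parse (illustrative only;
# real relation extraction would use trained components or rule patterns).
for token in doc:
    if token.dep_ == "ROOT":
        subjects = [w.text for w in token.lefts if w.dep_ in ("nsubj", "nsubjpass")]
        objects = [w.text for w in token.rights if w.dep_ in ("dobj", "pobj", "attr")]
        if subjects and objects:
            print(subjects, token.lemma_, objects)
```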
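Apache Jena exposes RDF graphs and SPARQL through Java APIs; purely as a sketch of the same triple-and-query pattern, the snippet below uses the Python rdflib library (not Jena itself) with a hypothetical `ex:` namespace to assert extracted triples and query them.

```python
# RDF triple assertion and SPARQL querying, sketched with rdflib in Python.
# Apache Jena provides the equivalent Model and Query APIs in Java.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/")  # hypothetical namespace for extracted facts

g = Graph()
g.bind("ex", EX)

# Triples as they might come out of an extraction pipeline.
g.add((EX.Barack_Obama, RDF.type, EX.Person))
g.add((EX.Barack_Obama, EX.bornIn, EX.Honolulu))
g.add((EX.Barack_Obama, RDFS.label, Literal("Barack Obama")))

# SPARQL query over the assembled graph.
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?person ?place WHERE { ?person ex:bornIn ?place . }
""")
for person, place in results:
    print(person, place)
```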
In 2025, Agentic Document Extraction (ADE) emerged as a pioneering tool for processing complex documents, leveraging computer vision and agentic workflows to surpass traditional OCR by enabling visual grounding and semantic reasoning for layout understanding. Developed by LandingAI, ADE automates the extraction of structured data from forms, reports, and tables without predefined templates, achieving higher accuracy on irregular layouts through iterative agent-based refinement. The tool integrates into enterprise pipelines, as demonstrated by its Snowflake native app deployment, which transforms unstructured PDFs into governed, AI-ready data for downstream analytics.

Advancements in retrieval-augmented generation (RAG) have extended to specialized variants enhancing knowledge extraction, particularly for handling structured queries in large corpora. While core RAG frameworks optimize large language model outputs by retrieving from external knowledge bases, recent iterations incorporate graph-based and related enhancements to improve factual accuracy and context relevance in extraction tasks; minimal illustrative sketches of the retrieval-augmented, multimodal, and federated patterns discussed here follow below.

A key trend in 2024-2025 is multimodal extraction, building on CLIP's 2021 contrastive learning foundation through extensions that fuse text and image modalities for richer semantic representations. Innovations like Synergy-CLIP integrate cross-modal encoders to extract generalizable knowledge from unlabeled data, enabling applications in video summarization and emotion recognition from mixed-media sources. Similarly, the MM-LG and GET frameworks unlock CLIP's potential for hierarchical feature extraction, improving performance on tasks involving visual-textual associations by up to 15% on benchmarks like Visual Genome. These developments prioritize conceptual fusion over siloed processing, facilitating extraction from diverse formats such as scientific diagrams and scripts.

Federated learning has gained traction for privacy-preserving knowledge extraction, allowing collaborative model training across distributed datasets without centralizing sensitive information. In 2024, selective knowledge sharing mechanisms in federated setups mitigated inference attacks while enabling heterogeneous model personalization, preserving up to 95% of local data utility in tasks such as clinical representation learning. This approach addresses regulatory demands in domains such as healthcare, where extraction from multi-institutional sources requires privacy guarantees.

Specific to 2025, tools for PDF and scientific literature extraction have advanced through platforms like Opscidia, which employ generative AI to query and distill content from research PDFs directly. Opscidia's system outperforms manual methods by automating semantic searches within documents, extracting insights on methodologies and outcomes with approximately 50% faster processing times compared to traditional reviews. This facilitates scientific intelligence by consolidating knowledge from vast literature bases into actionable summaries.

Integration of knowledge extraction with large language models (LLMs) like Grok-2 has accelerated in 2025, enhancing entity recognition and pattern extraction from unstructured text via fine-tuned prompting. Grok-2's capabilities support hybrid pipelines that combine retrieval with generative refinement, achieving superior performance in diagnostic assessments and reasoning tasks over benchmarks involving over 200 million scientific papers.
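The retrieval-augmented pattern can be sketched, under simplifying assumptions, as retrieval over a small corpus followed by prompt assembly; here retrieval uses TF-IDF cosine similarity from scikit-learn, and `call_llm` is a hypothetical placeholder for whatever generative model a pipeline would actually invoke.

```python
# Simplified retrieval-augmented generation (RAG) sketch.
# Retrieval: TF-IDF + cosine similarity over a small in-memory corpus.
# Generation: `call_llm` is a hypothetical stand-in for an actual model call.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "DBpedia extracts structured content from Wikipedia infoboxes.",
    "SPARQL is the standard query language for RDF graphs.",
    "spaCy provides pretrained pipelines for named entity recognition.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k corpus passages most similar to the query."""
    q_vec = vectorizer.transform([query])
    scores = cosine_similarity(q_vec, doc_vectors)[0]
    top = scores.argsort()[::-1][:k]
    return [corpus[i] for i in top]

def build_prompt(query: str) -> str:
    """Ground the generation step in the retrieved passages."""
    context = "\n".join(retrieve(query))
    return f"Answer using only the context below.\nContext:\n{context}\nQuestion: {query}"

# call_llm(build_prompt("Which query language is used for RDF?"))  # hypothetical model call
print(build_prompt("Which query language is used for RDF?"))
```

In production variants, the TF-IDF retriever would typically be replaced by dense or graph-based retrieval over a knowledge base, but the retrieve-then-generate shape stays the same.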
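Multimodal extraction in the CLIP lineage rests on joint text-image embeddings; as a hedged illustration using the original openai/clip-vit-base-patch32 checkpoint via Hugging Face transformers (not the MM-LG, GET, or Synergy-CLIP extensions), the sketch below scores candidate textual labels against a local image file, which is assumed to exist.

```python
# CLIP text-image matching sketch using Hugging Face transformers.
# Assumes a local image file `diagram.png`; the candidate labels are illustrative.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("diagram.png")
labels = ["a scientific diagram", "a photograph of a person", "a table of numbers"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image: similarity of the image to each candidate caption.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```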
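Federated extraction setups ultimately aggregate locally computed model updates rather than raw records; the NumPy sketch below shows generic federated averaging (FedAvg) under toy assumptions, not the selective-sharing or personalization schemes cited above.

```python
# Federated averaging (FedAvg) sketch with NumPy: each client trains locally
# and only parameter vectors are shared, never the underlying records.
import numpy as np

rng = np.random.default_rng(0)

def local_update(weights: np.ndarray, local_data: np.ndarray, lr: float = 0.1) -> np.ndarray:
    """One illustrative local step: nudge weights toward the local data mean."""
    gradient = weights - local_data.mean(axis=0)
    return weights - lr * gradient

# Three clients with private datasets that are never centralized.
clients = [rng.normal(loc=i, size=(100, 4)) for i in range(3)]
global_weights = np.zeros(4)

for round_ in range(10):
    updates = [local_update(global_weights, data) for data in clients]
    # The server aggregates only the model parameters (weighted equally here).
    global_weights = np.mean(updates, axis=0)

print("aggregated weights:", np.round(global_weights, 3))
```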
Looking to 2026, projections indicate a surge in autonomous AI agents for end-to-end knowledge extraction pipelines, with Gartner forecasting that 40% of enterprise applications will incorporate task-specific agents for automated data ingestion, fusion, and validation. These agents will enable self-orchestrating workflows while adapting to dynamic sources like real-time data streams.
