Knowledge graph
A knowledge graph is a graph-based data structure designed to represent and integrate knowledge about the real world, where nodes denote entities such as people, places, or concepts, and directed edges capture relationships between them, often enriched with semantic meaning through ontologies or schemas.[1] This structure supports accumulating, querying, and reasoning over large-scale, heterogeneous information, distinguishing knowledge graphs from traditional databases by their emphasis on interconnected facts and provenance.[2]

The concept traces its roots to early artificial intelligence efforts of the 1970s, evolving from semantic networks and frame-based systems that modeled knowledge as interconnected nodes and relations.[3] It developed further alongside the Semantic Web in the early 2000s, and the term gained prominence in industry following Google's 2012 announcement of its Knowledge Graph, which shifted search engines toward entity-based understanding rather than mere keyword matching.[1] Since then, knowledge graphs have proliferated in both academic and commercial contexts, with open projects such as DBpedia and Wikidata emerging as collaborative efforts to extract and curate structured knowledge from sources such as Wikipedia.[3]

At its core, a knowledge graph conforms to a data model such as RDF (Resource Description Framework) or the property graph model, in which entities are uniquely identified (e.g., via URIs), relations are labeled with a constrained vocabulary, and additional attributes such as timestamps or confidence scores provide context and provenance.[2] Ontologies such as OWL (Web Ontology Language) formalize the semantics to support deductive reasoning, allowing inferences like "if A is a subclass of B and B relates to C, then A relates to C."[1] Query languages such as SPARQL for RDF graphs and Cypher for property graphs facilitate complex traversals, while validation mechanisms such as shapes graphs help ensure data integrity.[1]

Knowledge graphs underpin diverse applications, from enhancing search engines with entity linking and disambiguation to powering recommendation systems in e-commerce and personalized assistants in healthcare.[3] In enterprise settings they integrate siloed data for analytics, such as building 360-degree customer views, while in research they support natural language processing, commonsense reasoning, and scientific discovery through inductive techniques such as graph embeddings.[1] Notable implementations include Google's Knowledge Graph, which processes billions of facts for web search, and Wikidata, a multilingual repository with over 119 million items and 1.65 billion statements as of 2025.[4][3]
Fundamentals
Definition and Core Concepts
A knowledge graph is a structured representation of real-world entities—such as objects, events, or concepts—and the relationships between them, typically organized as a graph in which entities serve as nodes and relationships as edges, enriched with semantic metadata to facilitate machine understanding and inference.[5] This semantic enrichment distinguishes knowledge graphs by embedding explicit meaning, often through ontologies, enabling not just data storage but also reasoning and knowledge derivation.[6] The concept emphasizes interoperability across diverse data sources, allowing systems to integrate and query information in a human- and machine-readable format.[7]

At its core, a knowledge graph comprises entities represented as nodes, which can include people, places, organizations, or abstract concepts like "democracy." Relationships between these entities are modeled as typed edges: directed connections such as "located in," "employs," or "subclass of" that convey precise semantic roles.[8] Attributes, or properties, attach additional descriptive data to nodes or edges, such as a person's birth date (e.g., "date of birth: 1980-01-01") or an edge's confidence score, further enhancing the graph's expressiveness and utility for applications like recommendation systems or question answering.[5] These elements collectively form a flexible schema that supports scalable knowledge representation without rigid tabular constraints.[9]

Knowledge graphs differ from traditional relational databases, which organize data into fixed tables with predefined schemas optimized for transactional queries, by prioritizing flexible, relationship-centric structures that capture complex interconnections without costly joins.[10] Unlike plain graph databases, which model connectivity but lack inherent semantics, knowledge graphs incorporate typed relations and ontological constraints that enable reasoning, such as inferring transitive properties (e.g., if A is part of B and B is part of C, then A is part of C).[11] This semantic layer promotes data federation and discovery across heterogeneous sources, addressing the scalability limits relational systems face with highly linked data.[12]

A prominent example is Google's Knowledge Graph, launched in 2012, which connects an entity like the Eiffel Tower (a node) to attributes such as its height (330 meters) and relations such as "located in" Paris, drawing from sources like Freebase to provide contextual search results beyond keyword matching.[13][14]
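A minimal sketch in plain Python may make these elements concrete; all names here (nodes, edges, infer_transitive) are illustrative, not a standard API, and the transitive rule is applied to a single "located in" relation as described above.

```python
# Entities as nodes with attributes, relationships as typed directed edges.
nodes = {
    "Eiffel Tower": {"type": "Landmark", "height_m": 330},
    "Paris": {"type": "City"},
    "France": {"type": "Country"},
}

# Typed edges as (subject, relation, object) tuples.
edges = [
    ("Eiffel Tower", "located in", "Paris"),
    ("Paris", "located in", "France"),
]

def infer_transitive(edges, relation):
    """Fixpoint computation of one relation's transitive closure:
    located in(A, B) and located in(B, C) imply located in(A, C)."""
    known = {(s, o) for s, r, o in edges if r == relation}
    changed = True
    while changed:
        changed = False
        for (a, b) in list(known):
            for (c, d) in list(known):
                if b == c and (a, d) not in known:
                    known.add((a, d))
                    changed = True
    return known

print(infer_transitive(edges, "located in"))
# contains ('Eiffel Tower', 'France'), inferred rather than stated
```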
Components and Structure
In RDF-based knowledge graphs, the structure is fundamentally composed of triples, which serve as the atomic units of information. Each triple consists of three elements: a subject, a predicate, and an object, where the subject denotes an entity, the predicate specifies a relationship, and the object indicates another entity or a literal value.[15] This structure, rooted in the Resource Description Framework (RDF), encodes factual statements in a machine-readable format.[16] For example, the triple (Paris, capitalOf, France) asserts that Paris is the capital of France.[15]

Schemas and ontologies provide the structural framework for organizing triples within a knowledge graph, defining classes of entities, properties of relationships, and constraints on data usage. The Web Ontology Language (OWL), built on RDF, facilitates this by allowing the specification of hierarchical classes, domain and range restrictions for properties, and logical axioms for inference. For instance, an ontology might define "City" as a subclass of "Place" and constrain the "capitalOf" property to link only to instances of "Country," ensuring semantic consistency across the graph.

Knowledge graphs are multi-relational, supporting diverse edge types that capture varied relationships between entities, such as "locatedIn," "foundedBy," or "employs." To handle complex relations beyond simple binary links—such as statements with additional qualifiers like time or source—reification treats an entire triple as a node, enabling further assertions about it.[17] In RDF 1.2, reification uses triple terms, allowing a triple to be quoted as an object in another triple, with the rdf:reifies predicate used to make statements about it, such as attaching confidence scores or temporal context without altering the core triple.[17] This approach accommodates n-ary relations while preserving the graph's flexibility.[18]

Heterogeneity in knowledge graphs arises from the integration of data from diverse sources, including structured databases and unstructured text, into a unified representation. For example, DBpedia extracts structured knowledge from Wikipedia's infoboxes and categories, merging it with external linked data to create a multifaceted graph encompassing entities like people, places, and events. This integration allows the graph to combine rigid schemas from relational sources with flexible, emergent relations extracted by natural language processing.

Visually, a knowledge graph is represented as a directed labeled graph, where nodes correspond to entities, directed edges denote relationships with arrows indicating directionality, and edge labels specify the predicate type. In a diagram, a node labeled "Paris" might connect via a directed edge labeled "capitalOf" to a node labeled "France," with additional edges like "locatedIn" pointing to "Europe" to illustrate connectivity.[8] Such visualizations aid in exploring the graph's structure, highlighting clusters of related entities and the semantics encoded in labels.[19]
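As a concrete sketch, the triple and reification ideas above can be expressed with the Python rdflib library (the example.org namespace is a placeholder). Because RDF 1.2 triple terms are not yet widely implemented in tooling, this example uses the classic rdf:Statement reification vocabulary to attach a confidence score to a triple.

```python
from rdflib import Graph, Namespace, Literal, BNode, RDF

EX = Namespace("http://example.org/")
g = Graph()

# The core triple: (Paris, capitalOf, France).
g.add((EX.Paris, EX.capitalOf, EX.France))

# Classic RDF reification: a node standing for the triple itself,
# so further statements (here, a confidence score) can be made about it.
stmt = BNode()
g.add((stmt, RDF.type, RDF.Statement))
g.add((stmt, RDF.subject, EX.Paris))
g.add((stmt, RDF.predicate, EX.capitalOf))
g.add((stmt, RDF.object, EX.France))
g.add((stmt, EX.confidence, Literal(0.98)))

print(g.serialize(format="turtle"))
```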
Historical Development
Origins in Semantic Networks
The origins of knowledge graphs trace back to the development of semantic networks in the 1960s and 1970s, which represented knowledge as interconnected nodes and edges to model associative memory in artificial intelligence systems.[20] M. Ross Quillian introduced this concept in his 1968 work on semantic memory, proposing a graph-based structure in which nodes denote concepts and links represent relationships, enabling efficient retrieval through spreading-activation mechanisms.[20] This approach drew on psychological models of human cognition, aiming to simulate how associations between ideas facilitate understanding and inference.[21]

Subsequent refinements expanded the applicability of semantic networks in AI research. Quillian elaborated on these structures in collaborative work, while Nicholas V. Findler compiled key advances in the 1979 edited volume Associative Networks: The Representation and Use of Knowledge by Computers, which integrated Quillian's ideas with practical implementations for knowledge representation.[22] These developments emphasized hierarchical and associative linkages to handle complex conceptual dependencies more robustly.[23]

Influential extensions in AI further shaped these foundational ideas. Marvin Minsky's 1974 framework of "frames" built upon semantic networks by introducing structured templates for stereotypical situations, allowing knowledge slots to adapt dynamically to new contexts. Similarly, Roger Schank's scripts, detailed in his 1977 collaboration with Robert Abelson, modeled sequences of events as narrative patterns, prioritizing goal-directed associations over static hierarchies to better capture human understanding.[24]

Early applications of semantic networks appeared in natural language processing and expert systems. A prominent example is Terry Winograd's SHRDLU program (1972), which used procedural semantics embedded in network-like representations to let a computer comprehend and manipulate commands in a simulated blocks world, demonstrating interactive dialogue capabilities.[25] These systems highlighted the potential of graph structures for reasoning tasks but also revealed inherent challenges.

A key limitation of early semantic networks was their lack of formal semantics: the absence of standardized inference rules often left node-link configurations open to ambiguous interpretation.[26] This ambiguity hindered precise logical deduction and paved the way for the formalizations of the Semantic Web era.
Evolution in the Semantic Web Era
The evolution of knowledge graphs in the Semantic Web era began in the late 1990s with foundational standards for machine-readable data on the web. In 1999, the World Wide Web Consortium (W3C) published the Resource Description Framework (RDF) as a recommendation, providing a standardized model for representing resources as subject-predicate-object triples to facilitate interoperability across distributed data sources. This framework laid the groundwork for structured data exchange without assuming specific application domains. Building on it, Tim Berners-Lee articulated the vision of the Semantic Web in a 2001 Scientific American article, proposing an extension of the web in which information carries well-defined meaning, allowing computers to perform more intelligent tasks such as automated reasoning and data integration.

Key milestones in the 2000s further advanced this vision through richer formalisms and practical implementations. The W3C released the Web Ontology Language (OWL) in 2004, extending RDF with constructs for defining complex ontologies, including classes, properties, and restrictions, to support richer knowledge representation and inference. This enabled the creation of domain-specific schemas that could be shared across the web. In 2007, the DBpedia project emerged as a pioneering effort to extract structured knowledge from Wikipedia infoboxes and other semi-structured content, generating a multilingual knowledge base with millions of RDF triples that served as a nucleus of the Linked Open Data cloud.

Commercial adoption accelerated in the 2010s, driving widespread integration of knowledge graphs into search and recommendation systems. Google launched its Knowledge Graph in 2012, incorporating billions of facts about entities like people, places, and things to deliver context-aware search results beyond simple keyword matching. Microsoft introduced Satori in 2013 as the underlying engine for Bing's enhanced entity understanding, powering features like knowledge panels for people and locations. Similarly, Facebook rolled out Graph Search in 2013, leveraging its social graph to answer natural language queries over user connections, interests, and content.

By the 2020s, knowledge graphs had expanded significantly in scale and utility, particularly through synergies with emerging AI technologies. Wikidata, launched in 2012 as a central hub for structured data, grew to over 100 million items by October 2022 and to over 119 million items as of August 2025, fostering collaborative editing and integration across Wikimedia projects and beyond.[4] Recent developments focus on combining knowledge graphs with large language models (LLMs) to mitigate hallucinations and enhance factual reasoning; for instance, retrieval-augmented generation techniques use graph-based knowledge retrieval to ground LLM outputs in verified structures, an active research direction as of 2025.
Formal Models and Representations
Graph-Based Formalisms
Knowledge graphs can be formally modeled using different graph structures, primarily RDF-based directed edge-labeled graphs or property graphs. In the RDF model, a knowledge graph is a directed, labeled multigraph G = (V, E), where V is the set of nodes representing entities (including IRIs, literals, and blank nodes) and E \subseteq V \times R \times V is the set of directed edges labeled by relations from a finite relation vocabulary R. This representation captures the interconnected nature of knowledge, allowing entities to be linked through typed relationships that denote semantic connections such as "is-a" or "part-of." The multigraph structure accommodates multiple edges between the same pair of nodes, reflecting diverse or contextual relations, and the directed edges enforce asymmetry in relationships, such as distinguishing "parent-of" from "child-of."[27]

Property graphs extend this with additional structure: G = (V, E, L_V, L_E, P), where nodes in V carry unique IDs and labels from L_V, edges in E carry IDs and labels L_E \subseteq R, and both nodes and edges may bear properties P as key-value pairs mapping to literal values. This allows attributes to be attached directly to entities and relations, enhancing flexibility for applications that do not require formal semantics.[27]

The explicit knowledge in RDF-based knowledge graphs is typically encoded as a collection of triples K = \{(h, r, t) \mid h, t \in V, r \in R\}, where each triple (h, r, t) indicates that head entity h stands in relation r to tail entity t; properties in property graphs are similarly representable as triples with literal objects but are stored directly as maps. This format provides a compact, machine-readable structure for knowledge representation, facilitating operations like querying and extension. To support latent inference and completion tasks, knowledge graph embedding techniques map entities and relations to continuous vector spaces; a seminal approach is TransE, which optimizes embeddings such that translating the head vector by the relation vector approximates the tail vector, quantified by the scoring function f_r(h, t) = \|\mathbf{h} + \mathbf{r} - \mathbf{t}\|, where lower values indicate higher plausibility.

Path-based reasoning exploits the graph's connectivity for inference by computing transitive closures—sets of all reachable paths between nodes—or identifying subgraph patterns that reveal indirect associations. For example, if entity A relates to B and B to C, the transitive closure infers a path from A to C, enabling multi-hop predictions without explicit triples. The Path Ranking Algorithm formalizes this by generating relational paths through random walks constrained by entity types, ranking them to score potential links and complete missing knowledge.

Despite these foundations, theoretical challenges arise in graph operations over knowledge graphs; notably, subgraph isomorphism—determining whether a query subgraph embeds exactly into the knowledge graph—is NP-hard, with complexity escalating in dense graphs because the space of candidate mappings grows exponentially. This hardness underscores the need for approximate methods in large-scale reasoning, as exact solutions become intractable beyond small patterns.[28]
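A minimal NumPy sketch of the TransE scoring function defined above; the vectors here are random placeholders rather than trained embeddings, so only the scoring step is illustrated.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50

# Toy embeddings for a head entity, a relation, and a tail entity.
h = rng.normal(size=dim)   # e.g. "Paris"
r = rng.normal(size=dim)   # e.g. "capitalOf"
t = rng.normal(size=dim)   # e.g. "France"

def transe_score(h, r, t):
    """f_r(h, t) = ||h + r - t||; lower scores mean more plausible triples."""
    return np.linalg.norm(h + r - t)

print(transe_score(h, r, t))
```

Path-based reachability can likewise be sketched as a breadth-first traversal over a hypothetical adjacency structure, computing the nodes reachable from a start entity, i.e., the transitive closure used for multi-hop inference.

```python
from collections import deque

def reachable(adj, start):
    """Breadth-first search: all nodes reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

adj = {"A": ["B"], "B": ["C"], "C": []}
print(reachable(adj, "A"))  # {'B', 'C'}: A reaches C via B
```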
Ontologies and Schema Languages
Ontologies and schema languages provide the semantic layer for knowledge graphs, particularly those based on RDF, enabling the definition of concepts, relationships, and constraints that add meaning to raw graph structures. These formalisms ensure that data is not only interconnected but also interpretable across systems, supporting reasoning and interoperability. In RDF-based knowledge graphs, which build on triples, ontologies extend basic assertions by specifying hierarchies, restrictions, and inference rules, allowing for more expressive representations of domain knowledge. Property graphs typically use less formal schema definitions, such as label constraints and property type declarations in query languages like Cypher, to enforce structure without full ontological reasoning.[27]

RDF Schema (RDFS) serves as a foundational vocabulary for describing classes, properties, and basic constraints in RDF-based knowledge graphs. It introduces rdfs:Class to define categories of resources and rdfs:subClassOf for establishing subclass relationships, which are transitive and enable inheritance of properties across hierarchies. Additionally, RDFS provides domain and range constraints through rdfs:domain and rdfs:range, which specify the expected classes for the subjects and objects of a property, respectively, thereby enforcing semantic consistency without full logical entailment.[29]
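A short rdflib sketch of these RDFS constructs; the third-party owlrl package (an assumption here, not part of rdflib itself) is used to materialize the RDFS entailments, such as class membership inherited via rdfs:domain and rdfs:subClassOf.

```python
from rdflib import Graph, Namespace, RDF, RDFS
import owlrl  # third-party RDFS/OWL-RL reasoner, assumed installed

EX = Namespace("http://example.org/")
g = Graph()

# Class hierarchy: City is a subclass of Place.
g.add((EX.City, RDF.type, RDFS.Class))
g.add((EX.Place, RDF.type, RDFS.Class))
g.add((EX.City, RDFS.subClassOf, EX.Place))

# Domain and range constraints on the capitalOf property.
g.add((EX.capitalOf, RDFS.domain, EX.City))
g.add((EX.capitalOf, RDFS.range, EX.Country))

# A plain assertion with no explicit type for Paris.
g.add((EX.Paris, EX.capitalOf, EX.France))

# Materialize RDFS entailments: the domain axiom makes Paris a City,
# and the subclass axiom then makes it a Place.
owlrl.DeductiveClosure(owlrl.RDFS_Semantics).expand(g)

print((EX.Paris, RDF.type, EX.Place) in g)  # True
```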
The Web Ontology Language (OWL), building on RDFS, offers richer semantics grounded in description logics, specifically the SROIQ(D) fragment for OWL 2, to support advanced reasoning in knowledge graphs. OWL defines classes using owl:Class, which extends rdfs:Class to allow complex expressions like intersections or unions, and properties via owl:ObjectProperty for relations between individuals or owl:DatatypeProperty for data values. Key axioms include disjointness (owl:disjointWith, or owl:AllDisjointClasses for sets of mutually exclusive classes) and cardinality restrictions (e.g., owl:cardinality, owl:minCardinality, or owl:maxCardinality to limit the number of property values for an individual). These features enable automated inference, such as deducing class memberships or property implications, crucial for knowledge completion in graphs.[30][31]
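A sketch of these OWL axioms built with rdflib (illustrative vocabulary under a placeholder namespace): a disjointness axiom between City and Country, and a max-cardinality restriction stating that a country has at most one capital.

```python
from rdflib import Graph, Namespace, BNode, Literal, RDF, RDFS, OWL, XSD

EX = Namespace("http://example.org/")
g = Graph()

# Property and class declarations.
g.add((EX.hasCapital, RDF.type, OWL.ObjectProperty))
g.add((EX.Country, RDF.type, OWL.Class))
g.add((EX.City, RDF.type, OWL.Class))

# Disjointness: nothing can be both a City and a Country.
g.add((EX.City, OWL.disjointWith, EX.Country))

# Cardinality restriction: a Country has at most one hasCapital value.
restriction = BNode()
g.add((restriction, RDF.type, OWL.Restriction))
g.add((restriction, OWL.onProperty, EX.hasCapital))
g.add((restriction, OWL.maxCardinality,
       Literal(1, datatype=XSD.nonNegativeInteger)))
g.add((EX.Country, RDFS.subClassOf, restriction))

print(g.serialize(format="turtle"))
```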
Extensions like the Shapes Constraint Language (SHACL), standardized in 2017, complement OWL and RDFS by focusing on data validation rather than inference, defining shapes to enforce structural constraints on knowledge graph instances. SHACL uses RDF to specify node shapes (describing focus nodes) and property shapes (constraining property values, e.g., via sh:minCount or sh:class), allowing validation reports that flag violations such as missing values or type mismatches. This declarative approach supports quality assurance in large-scale graphs, integrating seamlessly with existing ontology languages for comprehensive schema enforcement.[32]
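A validation sketch using the third-party pyshacl package (an assumption; any SHACL processor would work the same way): the shape requires every Person to have at least one name, so the second instance below produces a violation.

```python
from rdflib import Graph
from pyshacl import validate  # third-party SHACL processor, assumed installed

shapes_ttl = """
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://example.org/> .

ex:PersonShape a sh:NodeShape ;
    sh:targetClass ex:Person ;
    sh:property [
        sh:path ex:name ;
        sh:minCount 1 ;
    ] .
"""

data_ttl = """
@prefix ex: <http://example.org/> .

ex:alice a ex:Person ; ex:name "Alice" .
ex:bob a ex:Person .   # no ex:name, violates the shape
"""

shapes = Graph().parse(data=shapes_ttl, format="turtle")
data = Graph().parse(data=data_ttl, format="turtle")

conforms, _, report_text = validate(data, shacl_graph=shapes)
print(conforms)      # False: ex:bob is missing ex:name
print(report_text)   # human-readable validation report
```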
These schema languages play a pivotal role in interoperability by aligning vocabularies across diverse knowledge graphs, facilitating data exchange and federation. For instance, schema.org provides a collaborative, extensible vocabulary of types and properties (e.g., schema:Person with schema:name and schema:jobTitle) for web markup, enabling structured data from heterogeneous sources to be unified into cohesive graphs while maintaining semantic consistency. It supports multiple serializations, including RDFa, Microdata, and JSON-LD, making it applicable to both RDF and property graph contexts.[33][34]
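A sketch of how schema.org markup feeds a knowledge graph: a JSON-LD snippet is parsed into RDF triples with rdflib (assuming rdflib 6+, which bundles JSON-LD support). The context is inlined here to keep the example self-contained and avoid fetching the full schema.org context over the network.

```python
from rdflib import Graph

jsonld = """
{
  "@context": {
    "Person": "https://schema.org/Person",
    "name": "https://schema.org/name",
    "jobTitle": "https://schema.org/jobTitle"
  },
  "@type": "Person",
  "name": "Ada Lovelace",
  "jobTitle": "Mathematician"
}
"""

g = Graph()
g.parse(data=jsonld, format="json-ld")

# Each JSON-LD key/value pair becomes an RDF triple using schema.org terms.
for s, p, o in g:
    print(s, p, o)
```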