Linked data
Linked Data is a set of best practices for publishing and interlinking structured data on the Web, transforming it from a space of documents into a global network of machine-readable data that can be discovered, shared, and reused across sources.[1] Coined by Tim Berners-Lee in his 2006 design note, the approach emphasizes using web standards to create meaningful connections between data, enabling applications to navigate and integrate information seamlessly.[1] At its core, Linked Data follows four principles: (1) use URIs as names for things; (2) use HTTP URIs so that these names can be looked up; (3) when someone looks up a URI, provide useful information using standards like RDF; and (4) include links to other URIs, so that more things can be discovered.[1]

As a key component of the broader Semantic Web initiative, Linked Data leverages technologies such as the Resource Description Framework (RDF) for representing data as triples (subject-predicate-object), RDF Schema (RDFS) and the Web Ontology Language (OWL) for defining vocabularies and relationships, and SPARQL for querying distributed datasets.[2] This stack allows data to be expressed in a way that machines can interpret and link across silos, addressing limitations of traditional web content by focusing on data interoperability rather than just hyperlinks between pages.[2] The principles promote dereferenceable identifiers, that is, HTTP URIs that resolve to human- and machine-readable descriptions, ensuring data is not only accessible but also contextually enriched.[3]

The development of Linked Data accelerated through efforts like the W3C's Linking Open Data (LOD) community project, launched in 2007 to encourage the publication of open datasets in RDF format.[4] By April 2008, the emerging Web of Data included over 2 billion RDF triples connected by approximately 3 million links, with contributions from institutions like universities and organizations such as the BBC.[4] This growth has continued, with the LOD cloud diagram now visualizing interlinked datasets across domains; as of November 2025, it encompasses 1,678 datasets, each containing at least 1,000 RDF triples and 50 outbound links to qualify.[5]

Linked Data has enabled diverse applications, from generic tools like data browsers (e.g., Tabulator) and search engines (e.g., Sindice) that aggregate information from multiple sources, to domain-specific uses in life sciences for drug discovery, government for open data transparency, and cultural heritage for enriched metadata.[2] In libraries and digital collections, it facilitates entity resolution and improved discoverability, as seen in projects integrating bibliographic data with external knowledge bases.[3] Foundational datasets like DBpedia (extracted from Wikipedia) and GeoNames (geospatial information) serve as hubs, powering mashups and analytics that demonstrate the value of interlinked data for real-world innovation.[2]

Foundations
Principles
The foundational principles of Linked Data were articulated by Tim Berners-Lee in a 2006 design note published as part of the World Wide Web Consortium (W3C) Design Issues series, providing a blueprint for publishing structured data on the web in a way that facilitates interoperability and discovery.[1] These principles build on the broader vision of the Semantic Web, emphasizing decentralized data sharing without reliance on centralized authorities or proprietary formats.[1] The four principles, illustrated by the example following the list, are as follows:

- Use URIs as names for things. This ensures that entities, such as people, places, or concepts, are identified using Uniform Resource Identifiers (URIs), which provide a global, unambiguous naming scheme compatible with web technologies.[1]
- Use HTTP URIs so that people can look up those names. By leveraging HTTP URIs, these identifiers become dereferenceable, allowing users and machines to access information about the named entity directly via standard web protocols, rather than opaque or non-web identifiers like LSIDs or DOIs.[1]
- When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL). Upon dereferencing a URI, servers should return relevant data in standardized formats like Resource Description Framework (RDF) for representation and SPARQL for querying, enabling consistent and machine-processable responses.[1]
- Include links to other URIs, so that they can discover more things. Data descriptions must incorporate RDF statements that reference additional URIs, creating hyperlinks between datasets and allowing navigation to related information across the web, much like traditional hypertext links.[1]
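A minimal sketch in Turtle, using hypothetical example.org URIs together with the widely used FOAF vocabulary, shows the four principles in combination: an HTTP URI names a person (principles 1 and 2), the RDF statements describe that person (principle 3), and further URIs, including one pointing into the external GeoNames dataset, link outward (principle 4).

    @prefix foaf: <http://xmlns.com/foaf/0.1/> .

    # Hypothetical HTTP URI naming a person (principles 1 and 2).
    <http://example.org/people/alice>
        a foaf:Person ;                                        # RDF description (principle 3)
        foaf:name "Alice" ;
        foaf:knows <http://example.org/people/bob> ;           # link to another resource (principle 4)
        foaf:based_near <http://sws.geonames.org/2643743/> .  # link into the GeoNames dataset

Under these assumptions, dereferencing http://example.org/people/alice would return the description above, and a client could follow the foaf:knows and foaf:based_near links to discover more data.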
Relationship to Semantic Web
The Semantic Web was defined by Tim Berners-Lee, James Hendler, and Ora Lassila in 2001 as an extension of the current Web in which information is given well-defined meaning, thereby enabling computers and people to work in greater cooperation.[6] This vision aimed to create a Web of data that machines could interpret and process intelligently, moving beyond simple hypertext links to structured, meaningful content.[6]

Linked Data represents a practical subset of Semantic Web technologies, focusing on the decentralized publishing and interlinking of structured data on the Web rather than relying on centralized ontologies or complex reasoning systems.[1] Coined by Tim Berners-Lee in a 2006 design note, Linked Data provides operational guidelines, such as the use of URIs, HTTP dereferencing, and RDF for descriptions, to make data accessible and linkable across the Web, aligning with but simplifying the broader Semantic Web goals.[1] This approach emphasizes interoperability through simple linking mechanisms, serving as a foundational layer for realizing the Semantic Web's potential without requiring full-scale inference at every step.[7]

The Semantic Web architecture is often depicted as a layered stack, starting with foundational elements like Uniform Resource Identifiers (URIs) for unique naming, Unicode for character encoding, and XML for syntax, followed by the Resource Description Framework (RDF) for data representation, RDF Schema (RDFS) for basic vocabulary definitions, and the Web Ontology Language (OWL) for more expressive ontologies.[8] Linked Data primarily leverages the lower layers of this stack, particularly URIs and RDF, to ensure data interoperability and discoverability, allowing resources to be identified, described, and linked in a machine-readable format without delving into higher-level constructs like OWL.[7] By focusing on these core components, Linked Data promotes a Web-scale distribution of data that builds toward the Semantic Web's aspirational layers.[8]

A key distinction lies in their scopes: while the Semantic Web encompasses advanced reasoning and inference capabilities, such as those enabled by OWL for deriving new knowledge from explicit statements, Linked Data prioritizes direct linking and retrieval of data, often deferring heavy inference to applications or users as needed.[6] This makes Linked Data more immediately deployable for publishing diverse datasets, fostering a "Web of data" that incrementally contributes to the Semantic Web's machine-understandable ecosystem without mandating comprehensive ontological commitments.[7]
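As a brief hedged illustration of this division of labor, the following Turtle snippet (with a hypothetical ex: vocabulary) contains one vocabulary statement from the RDFS layer and one instance statement published as Linked Data; only a consumer that applies RDFS or OWL inference derives the implicit triple noted in the comment.

    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix ex:   <http://example.org/vocab#> .

    # Vocabulary layer (RDFS): every Employee is also a Person.
    ex:Employee rdfs:subClassOf ex:Person .

    # Instance data published as Linked Data.
    <http://example.org/people/alice> a ex:Employee .

    # A reasoner can infer:  <http://example.org/people/alice> a ex:Person .
    # A Linked Data client without inference sees only the two explicit triples.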
Technologies and Standards

Core Components
Linked Data relies on standardized identifiers to uniquely name entities across the web. While the Resource Description Framework (RDF) uses Internationalized Resource Identifiers (IRIs), which generalize Uniform Resource Identifiers (URIs) to support Unicode characters, the Linked Data principles specifically recommend HTTP URIs (a subset of IRIs) as global identifiers for resources such as people, places, or concepts.[9][1] Every such HTTP URI used in Linked Data should be dereferenceable, meaning that accessing the URI returns a description of the resource in a machine-readable format, typically RDF.[1] This dereferencing enables clients to retrieve and link data seamlessly, fostering interoperability.[3]

The foundational data model for Linked Data is RDF, which represents information as directed graphs composed of subject-predicate-object triples. In an RDF triple, the subject is an IRI or blank node identifying the resource, the predicate is an IRI denoting the relationship, and the object is an IRI, blank node, or literal providing the value.[9] A collection of such triples forms an RDF graph, allowing complex descriptions where resources link to one another.[9] RDF graphs can be serialized in various formats to facilitate exchange and integration; common ones include RDF/XML for XML-based exchange, Turtle for compact textual representation using prefixes and abbreviations, and JSON-LD for embedding RDF in JSON structures suitable for web APIs.[10]

To retrieve and manipulate Linked Data, SPARQL (SPARQL Protocol and RDF Query Language) serves as the standard query language, enabling pattern matching over RDF graphs similar to SQL for relational databases.[11] SPARQL supports operations like SELECT for retrieving results, CONSTRUCT for generating new RDF graphs, and ASK for boolean queries, with results often returned in formats like CSV, XML, or JSON.[12] For instance, a basic SELECT query to find all people and their names in a graph might be expressed as:

    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?person ?name
    WHERE { ?person foaf:name ?name . }

This query matches triples where the predicate is foaf:name and binds the subject to ?person and the object to ?name.[11]
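To make the serialization formats named above concrete, the following sketch shows the same single triple, with a hypothetical example.org subject, first in Turtle and then in JSON-LD; both encode identical RDF content.

    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    <http://example.org/people/alice> foaf:name "Alice" .

The equivalent JSON-LD document:

    {
      "@context": { "foaf": "http://xmlns.com/foaf/0.1/" },
      "@id": "http://example.org/people/alice",
      "foaf:name": "Alice"
    }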
Serving Linked Data over HTTP involves content negotiation, where servers respond to client requests by delivering RDF in an appropriate serialization based on the Accept header. For example, a client requesting text/turtle receives Turtle-formatted RDF, while one asking for application/ld+json gets JSON-LD.[3] This mechanism ensures flexibility, allowing the same IRI to provide human-readable HTML or machine-readable RDF depending on the context, while adhering to HTTP standards for caching and redirection.[3]
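A hedged sketch of such an exchange, again using a hypothetical example.org URI, might look like the following HTTP request and abbreviated response; a request sent with Accept: text/html could instead be answered with an HTML page (or a redirect to one) describing the same resource.

    GET /people/alice HTTP/1.1
    Host: example.org
    Accept: text/turtle

    HTTP/1.1 200 OK
    Content-Type: text/turtle

    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    <http://example.org/people/alice> foaf:name "Alice" .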