
Uniform Resource Identifier

A Uniform Resource Identifier (URI) is a compact sequence of characters that identifies an abstract or physical resource. URIs provide a simple and extensible means of identifying such resources, enabling uniform interpretation across different contexts and schemes on the Internet. The generic syntax of a URI consists of a scheme (e.g., "http" or "urn"), followed by a hierarchical part that may include an authority (such as a host and port) and a path, an optional query for non-hierarchical data, and an optional fragment identifier for a secondary resource. This structure allows URIs to function as locators, names, or both, without implying that the resource is accessible or retrievable. Common schemes include "http" for web resources, "ftp" for file transfers, and "mailto" for email addresses. URIs encompass two main subsets: Uniform Resource Locators (URLs), which identify resources while providing a primary access mechanism (e.g., network location), and Uniform Resource Names (URNs), which offer globally unique and persistent names under the "urn" scheme, even if the resource becomes unavailable. Originating from the World Wide Web project in 1990, URIs evolved through standards like RFC 1630 and were formalized in RFC 3986 to support global resource identification independent of specific protocols.

Fundamentals

Definition and Purpose

A Uniform Resource Identifier (URI) is a compact sequence of characters that identifies an abstract or physical resource. This standardized string enables the unique referencing of entities such as documents, services, or concepts within networked systems, without necessarily implying direct access or location. The primary purpose of a URI is to facilitate interoperability across diverse information systems by providing a simple, universal mechanism for naming and referencing resources unambiguously. It supports a federated naming approach, allowing different protocols and schemes to coexist while ensuring consistent identification. Key characteristics include its compactness, which aids transcription; extensibility, permitting scheme-specific extensions without disrupting the overall syntax; and scheme-based identification, where a leading scheme (e.g., "http") dictates the syntax and semantics for the remainder of the identifier. URIs originated in the early 1990s to address naming inconsistencies arising from the proliferation of protocols and systems for document retrieval on the nascent World Wide Web. For instance, a URI beginning with "http://" identifies a specific web resource, distinguishing it from other identifiers by its scheme and path components. While URIs form the basis for subtypes like Uniform Resource Locators (URLs) and Uniform Resource Names (URNs), they provide a general framework for resource identification.

Components and Syntax Overview

A Uniform Resource Identifier (URI) follows a generic syntax that structures its components to enable uniform identification of resources across different schemes. The overall form is scheme : hier-part [ ? query ] [ # fragment ], where the scheme specifies the protocol or naming system, the hierarchical part often includes an authority and a path, and the optional query and fragment components provide additional data or references. This syntax ensures interoperability by defining how components are delimited and how data is encoded. The scheme component identifies the URI's naming system or protocol, such as http or mailto, and consists of a sequence starting with an alphabetic character followed by alphanumeric characters, plus, period, or hyphen. It is followed by a colon (:) and determines how the rest of the URI is interpreted. The authority component, when present, begins with two slashes (//) and represents a hierarchical addressing authority; it includes an optional userinfo (credentials like username and password, in the form userinfo@), a required host (domain name, IPv4 address, or bracketed IP literal), and an optional port (a number for service identification). For example, in example.com:8080, example.com is the host and 8080 the port. The path follows the authority (or the scheme if there is no authority) and denotes the resource's hierarchical location, composed of segments separated by slashes (/), such as /documents/file.txt. The optional query component, introduced by a question mark (?), carries non-hierarchical data, often as key-value pairs like key=value&other=param. Finally, the fragment identifier, starting with a number sign (#), points to a secondary resource or internal section within the primary resource, such as #summary. The generic syntax is formally defined using Augmented Backus-Naur Form (ABNF) in RFC 3986. A simplified excerpt of the ABNF grammar for a URI is as follows:
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
hier-part = "//" authority path-abempty / path-absolute / path-rootless / path-empty
authority = [ userinfo "@" ] host [ ":" port ]
scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
query = *( pchar / "/" / "?" )
fragment = *( pchar / "/" / "?" )
Here, pchar represents path characters, including unreserved, percent-encoded, sub-delims, :, and @. This notation specifies the allowable structure and characters for each part. URIs distinguish between reserved and unreserved characters to separate delimiters from data. Reserved characters include generic delimiters like : / ? # [ ] @ and sub-delimiters like ! $ & ' ( ) * + , ; =, which may have special meanings in certain components and must be percent-encoded when used as data where they would otherwise act as delimiters. Unreserved characters, namely letters, digits, hyphen (-), period (.), underscore (_), and tilde (~), can appear without encoding. Percent-encoding represents characters outside this set (or reserved ones used as data) as a percent sign (%) followed by two hexadecimal digits, e.g., space as %20 or non-ASCII characters via UTF-8 octet sequences. This ensures safe transmission across systems; a percent-encoded unreserved character is equivalent to its decoded form. To illustrate, consider the URI https://user:pass@example.com:8080/path?key=value#section:
  • Scheme: https (specifies secure HTTP protocol).
  • Authority: user:pass@example.com:8080 (userinfo user:pass, host example.com, port 8080).
  • Path: /path (hierarchical resource location).
  • Query: key=value (parameters for the request).
  • Fragment: section (internal reference within the resource).
If the URI contained a space in the path, it would be encoded as %20 to comply with syntax rules.
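The breakdown above can be reproduced with a standard-library parser. The following is a minimal sketch, assuming Python's urllib.parse module is an acceptable stand-in for a generic RFC 3986 parser (it is lenient and not a strict validator); the `components` dictionary is a hypothetical name used only for illustration.

```python
from urllib.parse import urlsplit

# The example URI from the component list above.
uri = "https://user:pass@example.com:8080/path?key=value#section"
parts = urlsplit(uri)

# Collect the generic components; userinfo, host and port are
# sub-parts of the authority (called "netloc" by urllib.parse).
components = {
    "scheme": parts.scheme,
    "authority": parts.netloc,
    "userinfo": f"{parts.username}:{parts.password}",
    "host": parts.hostname,
    "port": parts.port,
    "path": parts.path,
    "query": parts.query,
    "fragment": parts.fragment,
}

for name, value in components.items():
    print(f"{name:9} = {value}")
```

Note that the delimiters themselves (":", "//", "?", "#") are not part of any component value; the parser strips them while splitting.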

History

Conception

The foundational concepts for Uniform Resource Identifiers (URIs), including the addressing system now known as URLs, were developed by Tim Berners-Lee in late 1990 as part of his implementation of the first World Wide Web prototype at CERN. This work proposed a unified naming system to reference resources across the Internet, influenced by hierarchical naming conventions in earlier systems such as directory services and the Domain Name System (DNS). The public conception of the URI syntax emerged in early 1992 through Berners-Lee's Universal Document Identifier (UDI) proposal, which outlined a generic structure to address the growing need for consistent resource referencing. The primary motivation was the fragmentation in Internet addressing schemes during the early 1990s, which hindered seamless hypertext linking in the World Wide Web project. Protocols like FTP, Gopher, WAIS, and news groups employed incompatible formats (such as FTP's host-relative paths versus Gopher's menu-based selectors), creating barriers to a cohesive "information universe." The UDI addressed these by introducing a uniform, scheme-based syntax that abstracted protocol-specific details, exemplified by file://info.cern.ch/pub/www/doc/udi1.ps. This enabled dynamic linking regardless of retrieval mechanisms, fostering interoperability. Key early documents include the February 1992 UDI draft, which solicited feedback and highlighted integrations with WAIS and Gopher, and the November 1992 HTTP draft, which embedded URI-like addressing for hypertext retrieval. By March 1992, at an IETF BOF, these ideas had evolved into foundational web proposals, with UDI serving as the basis for unified naming across protocols.

Standardization and Evolution

The standardization of Uniform Resource Identifiers (URIs) began with RFC 1630, published in June 1994 by Tim Berners-Lee, which provided an informal definition of URI syntax and its role in enabling a global information infrastructure. This document outlined the basic structure of URIs, including schemes, hierarchical components, and the use of escape encoding for characters that cannot appear literally, laying the groundwork for uniform naming and addressing on the Internet without enforcing strict parsing rules. The IETF URI Working Group, active in the first half of the 1990s, coordinated input from the broader community on these early documents to ensure interoperability. A significant refinement came with RFC 2396 in August 1998, authored by Tim Berners-Lee, Roy Fielding, and Larry Masinter, which introduced a more precise syntax specification and formalized the handling of relative URI references. This update addressed ambiguities in the original syntax, defined equivalence rules for URI comparison, and emphasized the separation of scheme-specific processing, making URIs more robust for Internet protocols. The current standard, RFC 3986 from January 2005, by the same three authors, obsoleted RFC 2396 and provided a comprehensive, ABNF-based syntax definition with enhanced clarity on aspects such as reserved characters and fragment identifiers. This revision incorporated lessons from widespread URI deployment, including normalization procedures to reduce variant representations. URI evolution has continued through integrations with related protocols, such as RFC 7230 (June 2014), which defines HTTP/1.1 message syntax and routing and specifies how URIs are processed in HTTP messages, ensuring consistency in web transfers. Additionally, RFC 6874 (February 2013) extended URI handling to include IPv6 literal addresses with zone identifiers within the host component, using bracketed notation to accommodate modern networking needs.
Support for internationalization advanced with RFC 3987 (January 2005), which introduced Internationalized Resource Identifiers (IRIs) as a superset of URIs, allowing Unicode characters in international contexts while maintaining compatibility through UTF-8 encoding and mapping rules. In the 2020s, discussions within IETF and W3C working groups have explored URI adaptations for decentralized systems, such as Decentralized Identifiers (DIDs), published as a W3C Recommendation in July 2022, which leverage URI syntax for self-sovereign identity without a central authority. Key contributions to URI standardization stem from Tim Berners-Lee's foundational vision, Roy Fielding's architectural refinements in his dissertation and RFCs, and collaborative efforts by IETF working groups like URI and APPSAWG, which have sustained updates amid evolving web technologies.

URI Structure

General Syntax

The general syntax of a Uniform Resource Identifier (URI) is formally defined in RFC 3986 as URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ], where the scheme identifies the URI's namespace and syntax rules, the hierarchical part provides the location or name, the query adds parameters, and the fragment identifies a secondary resource within the primary one. This syntax is specified using Augmented Backus-Naur Form (ABNF) grammar, which outlines the production rules for each component. The complete relevant ABNF for URI production is as follows:
URI           = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

scheme        = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

hier-part     = "//" authority path-abempty
              / path-absolute
              / path-rootless
              / path-empty

authority     = [ userinfo "@" ] host [ ":" port ]

path-abempty  = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-rootless = segment-nz *( "/" segment )
path-empty    = 0<pchar>

segment       = *pchar
segment-nz    = 1*pchar
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )

pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

query         = *( pchar / "/" / "?" )

fragment      = *( pchar / "/" / "?" )

pct-encoded   = "%" HEXDIG HEXDIG

unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
reserved      = gen-delims / sub-delims
gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims    = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
These rules ensure URIs are structured and unambiguous, with HEXDIG representing hexadecimal digits (0-9, A-F, a-f) and ALPHA and DIGIT as standard alphabetic and numeric characters. URIs are classified as absolute or relative based on the presence of a scheme. An absolute URI begins with a scheme followed by a colon and a hierarchical part, and carries no fragment: absolute-URI = scheme ":" hier-part [ "?" query ]. It provides a complete reference independent of context. In contrast, a relative reference lacks a scheme and is resolved against a base URI; its form is relative-ref = relative-part [ "?" query ] [ "#" fragment ], where the relative part can be "//" authority path-abempty (a network-path reference), path-absolute (starting with a single "/"), path-noscheme (a rootless path whose first segment contains no colon, so that it cannot be mistaken for a scheme), or path-empty. Path-absolute forms denote an absolute path from the root, while rootless forms, like "resource" without a leading slash, indicate a relative path starting from the current level. Percent-encoding is used to represent characters outside the unreserved set (ALPHA, DIGIT, "-", ".", "_", "~"), as well as reserved characters (gen-delims: ":", "/", "?", "#", "[", "]", "@"; sub-delims: "!", "$", "&", "'", "(", ")", "*", "+", ",", ";", "=") when they appear as data rather than delimiters. Non-ASCII characters are first encoded into UTF-8 octet sequences, then each octet is percent-encoded as "%" followed by two hexadecimal digits (uppercase is recommended); for example, the space character (U+0020) becomes "%20", the forward slash "/" (when used as data) becomes "%2F", and the question mark "?" becomes "%3F". Reserved characters must be encoded if their literal interpretation would alter parsing, such as encoding "/" in a path segment to prevent it from being treated as a segment separator. Invalid URIs violate these syntax rules and may lead to parsing failures or security issues.
Common errors include unencoded spaces, which are not allowed in any component and must be percent-encoded as "%20"; mismatched brackets, such as an unclosed "[" or "]" in the host (e.g., in IPv6 addresses), rendering the URI syntactically invalid; or malformed percent-encoding, such as a "%" not followed by two hexadecimal digits (lowercase hexadecimal digits are valid, although normalization prefers uppercase). Implementations should reject or normalize such cases to ensure interoperability, as older systems might mishandle sequences like "/../" in queries as path traversals.
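The percent-encoding rules above can be illustrated with a short sketch. It assumes Python's urllib.parse.quote and unquote as the encoder and decoder; `encode_segment` is a hypothetical helper name, and passing safe="" is a choice made here so that reserved characters such as "/" and "?" are treated as data.

```python
from urllib.parse import quote, unquote

def encode_segment(text: str) -> str:
    """Percent-encode text for use as data inside a single path segment.

    safe="" means no reserved character is left literal; unreserved
    characters (letters, digits, "-", ".", "_", "~") are never encoded.
    """
    return quote(text, safe="")

encoded_space = encode_segment("my file")    # space as data
encoded_delims = encode_segment("a/b?c")     # "/" and "?" as data
encoded_utf8 = encode_segment("é")           # non-ASCII via UTF-8 octets
unreserved = encode_segment("A-z_0.9~")      # passes through unchanged

print(encoded_space, encoded_delims, encoded_utf8, unreserved)

# Decoding accepts both uppercase and lowercase hexadecimal digits.
print(unquote("%C3%A9") == unquote("%c3%a9"))
```

The encoder emits uppercase hexadecimal digits, which matches the normalized form recommended for comparison.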

Scheme-Specific Elements

The scheme component of a URI serves as the initial identifier that specifies the protocol, namespace, or access method for the resource, enabling a federated and extensible naming system across different applications and environments. Each URI scheme defines its own syntax and semantics, which may impose restrictions or extensions on the generic URI structure defined in RFC 3986, while adhering to the overall absolute-URI grammar. For instance, schemes like "http" indicate the Hypertext Transfer Protocol, while "urn" denotes a namespace for persistent identifiers. Common schemes exhibit distinct syntactic requirements. The "http" scheme mandates an authority component with a host and optional port, where the host identifies the target server and the port defaults to 80 if omitted; for example, http://example.com is equivalent to http://example.com:80. In contrast, the "file" scheme primarily uses a path-only structure for local file access, such as file:/etc/hosts, which is platform-dependent (paths on Unix-like systems start with a slash, while Windows supports drive letters like file:c:/path/file.txt) and treats an empty or "localhost" authority as referring to the local host. The "data" scheme embeds inline data directly, following the syntax data:[<mediatype>][;base64],<data>, where the media type (defaulting to text/plain;charset=US-ASCII) specifies the content format, and the data is either percent-encoded or base64-encoded; an example is data:text/plain;base64,SGVsbG8gd29ybGQ=. The "mailto" scheme, used for email addresses, consists of an email address optionally followed by header fields in the query-like portion, such as mailto:user@example.com?subject=Hello. Authority components vary by scheme, reflecting security and usability considerations. In the "http" scheme, the userinfo subcomponent (e.g., username:password@host) is deprecated due to risks of exposing credentials in logs or referrals, and implementations should treat its presence as an error.
Port defaults are scheme-specific: 80 for "http", 443 for "https", and none for schemes like "file" that do not use network authorities. Schemes without an authority, such as "data" or "mailto", omit the double slash (//) and proceed directly to the path or data. Query and fragment handling also adapts to scheme semantics. For "http", the query component (?key=value) carries non-hierarchical parameters for resource selection, such as http://example.com/search?q=uri, while the fragment (#anchor) identifies a secondary resource or location within the primary one, like a document section, and is processed client-side without being sent to the server. In the "file" scheme, queries are not used, and fragments may reference byte ranges or other file-specific anchors if supported by the implementation. The "data" scheme treats any post-comma content as opaque data without separate query or fragment support, though fragments can be appended for media-type-specific dereferencing. For "mailto", the query-like part holds email header fields (e.g., subject=Hello), but true fragments are not defined. URI schemes are registered with the Internet Assigned Numbers Authority (IANA) to ensure uniqueness and interoperability, following procedures outlined in BCP 35 (RFC 7595) for expert review or first-come-first-served allocation. The registry includes permanent, provisional, and historical entries, with 349 schemes documented as of November 2025. Common registered schemes encompass ftp (File Transfer Protocol), ldap (Lightweight Directory Access Protocol), tel (Telephone), coap (Constrained Application Protocol), sip (Session Initiation Protocol), and the examples noted above, each referencing a defining RFC for precise syntax.
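Two of the scheme-specific rules above, default ports and the data scheme's base64 form, lend themselves to a small sketch. The helper names (`effective_port`, `make_data_uri`, `read_data_uri`) and the `DEFAULT_PORTS` table are assumptions introduced for illustration; the reader below handles only the simple data:<mediatype>[;base64],<data> shape and does not apply the default media type.

```python
import base64
from urllib.parse import urlsplit, unquote_to_bytes

# Illustrative table covering only the two defaults named in the text.
DEFAULT_PORTS = {"http": 80, "https": 443}

def effective_port(uri: str):
    """Return the explicit port, else the scheme default, else None."""
    parts = urlsplit(uri)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS.get(parts.scheme)

def make_data_uri(payload: bytes, mediatype: str = "text/plain") -> str:
    """Build a base64 data URI for the given payload."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mediatype};base64,{encoded}"

def read_data_uri(uri: str):
    """Split a data URI into (mediatype, payload bytes)."""
    header, _, data = uri.partition(",")
    mediatype = header[len("data:"):]
    if mediatype.endswith(";base64"):
        return mediatype[: -len(";base64")], base64.b64decode(data)
    # Without ";base64" the data part is percent-encoded.
    return mediatype, unquote_to_bytes(data)

print(effective_port("http://example.com"), effective_port("http://example.com:80"))
print(make_data_uri(b"Hello world"))
print(read_data_uri("data:text/plain;base64,SGVsbG8gd29ybGQ="))
```

Comparing effective ports rather than raw strings is one way to treat http://example.com and http://example.com:80 as the same target.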

URI Variants

Uniform Resource Locators (URLs)

A Uniform Resource Locator (URL) is a URI that not only identifies a resource but also provides a specific mechanism for locating and accessing it, typically over a network such as the Internet. Unlike more general URIs, URLs incorporate scheme-specific details that enable retrieval, such as network protocols like HTTP or FTP. This focus on location makes URLs essential for web addressing and resource fetching in distributed systems. The URL was specified in RFC 1738, published in December 1994, which formalized the syntax and semantics for locating resources available via the Internet as part of the World Wide Web initiative. This specification built on earlier concepts from RFC 1630 and established URLs as compact string representations for Internet-accessible resources. Over time, URLs have become synonymous with web addresses, evolving alongside web technologies while maintaining their core role in resource location. In terms of structure, URLs for network schemes, such as those using HTTP or FTP, require an authority component, which includes the host (e.g., a domain name or IP address) and optionally a port and user information, prefixed by "//". This is followed by a path that specifies the resource within the host, along with optional query parameters for additional data and a fragment identifier for intra-document navigation. The general form adheres to the URI syntax but emphasizes locatability through the scheme's access method. For example, consider the URL https://www.example.com/page?query=1#fragment:
  • Scheme: https indicates a secure HTTP connection.
  • Authority: www.example.com specifies the host.
  • Path: /page identifies the resource on that host.
  • Query: ?query=1 passes parameters to the server.
  • Fragment: #fragment targets a section within the resource.
This breakdown illustrates how URLs encode both location and access details hierarchically.
URLs have evolved to address practical challenges in global and constrained environments. URL shortening services, first publicly released with TinyURL in 2002, create compact aliases that redirect to the original long URL, aiding sharing on platforms with character limits like early Twitter. Additionally, support for Internationalized Domain Names (IDNs) was introduced via RFC 3490 in 2003, using Punycode to encode non-ASCII characters in domain names (e.g., converting "café.com" to "xn--caf-dma.com"), enabling multilingual URLs while preserving ASCII compatibility in DNS. These developments enhance usability without altering the foundational location-based syntax.
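The IDN conversion mentioned above can be sketched with Python's built-in "idna" codec, which implements the RFC 3490 (IDNA 2003) rules; newer IDNA revisions are not covered by this codec. The function names `idn_to_ascii` and `ascii_to_idn` are hypothetical.

```python
def idn_to_ascii(hostname: str) -> str:
    """Convert a Unicode hostname to its ASCII (Punycode) form."""
    return hostname.encode("idna").decode("ascii")

def ascii_to_idn(hostname: str) -> str:
    """Convert an ASCII-compatible hostname back to Unicode."""
    return hostname.encode("ascii").decode("idna")

ascii_form = idn_to_ascii("café.com")
unicode_form = ascii_to_idn(ascii_form)
print(ascii_form, unicode_form)
```

Each dot-separated label is converted independently, so purely ASCII labels such as "com" pass through unchanged.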

Uniform Resource Names (URNs)

A Uniform Resource Name (URN) is a Uniform Resource Identifier (URI) that uses the "urn" scheme to provide a persistent, location-independent name for a resource. Originally specified in 1997, URNs serve as abstract identifiers that remain stable over time, enabling the naming of entities such as documents, books, or individuals without reference to their current location. Unlike locators, URNs focus on identification rather than retrieval, supporting long-term reference in systems where resources may migrate or change access points. The syntax of a URN follows the form urn:<NID>:<NSS>, where <NID> is the Namespace Identifier, a registered string of alphanumeric characters and hyphens that defines the naming authority, and <NSS> is the Namespace-Specific String, which carries the identifier within that namespace. The <NID> is case-insensitive and limited to 32 characters, while the <NSS> may include percent-encoded characters to handle reserved or non-ASCII data. This structure ensures global uniqueness and compatibility with URI parsing rules. For instance, urn:isbn:0-306-40615-2 identifies a specific book using the ISBN namespace. Namespace Identifiers (NIDs) are formally registered with the Internet Assigned Numbers Authority (IANA) to prevent collisions and maintain interoperability; examples include "isbn" for International Standard Book Numbers and "oid" for Object Identifiers used in standards like ASN.1. Registration follows an expert review process outlined in RFC 8141, ensuring each namespace has a defined assignment and resolution policy. URNs can be resolved through dedicated resolvers that map the identifier to metadata, alternative representations, or locators as per the namespace's rules. Common examples illustrate URN applications: urn:ietf:rfc:2141 names the original URN syntax document itself, providing a stable reference for IETF standards, while namespaces like "mpeg" enable URNs for multimedia objects, such as urn:mpeg:url:abc123 for an MPEG-encoded resource.
These demonstrate how URNs support diverse, enduring naming needs across digital ecosystems.
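The urn:<NID>:<NSS> form can be split with a short sketch. The pattern below is a simplification chosen for illustration (it checks only the NID character set and length, not the full NSS grammar of RFC 8141), and `parse_urn` is a hypothetical helper name.

```python
import re

# Simplified pattern: "urn" prefix (case-insensitive), an NID of up to
# 32 alphanumeric/hyphen characters, then a non-empty NSS.
URN_PATTERN = re.compile(
    r"^urn:([A-Za-z0-9][A-Za-z0-9-]{0,31}):(.+)$", re.IGNORECASE
)

def parse_urn(urn: str):
    """Return (nid, nss); the NID is lowercased because it is case-insensitive."""
    match = URN_PATTERN.match(urn)
    if match is None:
        raise ValueError(f"not a URN: {urn!r}")
    nid, nss = match.groups()
    return nid.lower(), nss

print(parse_urn("urn:isbn:0-306-40615-2"))
print(parse_urn("URN:IETF:rfc:2141"))
```

The NSS is returned untouched, since its interpretation (and any case sensitivity) is defined by the individual namespace.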

References and Resolution

URI References

A URI reference is a string that represents either a full URI or a relative reference, serving as a compact means to identify resources relative to a base URI. This form allows for flexible referencing in documents and protocols without requiring full absolute paths. Relative references follow the syntax relative-ref = relative-part [ "?" query ] [ "#" fragment ], where the relative-part can be a network-path (starting with "//"), an absolute-path (starting with "/"), a rootless path (starting with a segment but no "/"), or an empty path. Absolute paths begin with a slash and denote a path from the root, rootless paths start directly with a non-empty segment and are interpreted relative to the base path, and empty paths indicate the base itself without modification. The optional query and fragment components append parameters or internal anchors as in absolute URIs. To resolve a relative reference into an absolute URI, the process merges it with a base URI through a defined algorithm. First, the base URI is parsed into its components: scheme, authority, path, query, and fragment. If the relative reference includes a scheme, it is treated as absolute; otherwise the base scheme is retained. If the reference starts with "//", its authority, path, and query replace those of the base; otherwise the base authority is retained as well. Paths are then merged by appending the relative path to the base path after removing the base's last segment, and dot-segments are resolved: "." represents the current directory and is removed, while ".." ascends to the parent directory; RFC 3986 describes this removal iteratively using an input buffer and an output buffer. The query comes from the reference, except that a reference with an empty path and no query keeps the base query; the fragment always comes from the reference. For example, given a base URI of http://a/b/c/d;p?q, the relative reference g resolves to http://a/b/c/g by replacing the last segment of the base path; ../g resolves to http://a/b/g by removing the last two segments before appending; and /g resolves to http://a/g by replacing the entire path.
Another common case is ./image.jpg relative to http://example.com/dir/, which resolves to http://example.com/dir/image.jpg after removing the "." segment. URI references are widely used in markup languages for hyperlinks and resource inclusion, such as in HTML's <a href=""> and <img src=""> attributes, where they resolve against the document's URI or the base set by the <base> element. In XML, the xml:base attribute establishes a base URI for resolving relative references within elements, processing instructions, or entity content. This enables modular document structures, like linking to local images or stylesheets without absolute paths.
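The resolution examples above can be checked with a brief sketch. It assumes Python's urllib.parse.urljoin as the resolver; `BASE` and `resolved` are illustrative names, and the extra references ("//g", "?y", "#s") are added here to show the network-path, query-only, and fragment-only cases.

```python
from urllib.parse import urljoin

# Base URI used in the worked examples.
BASE = "http://a/b/c/d;p?q"

# Map each relative reference to its resolved target URI.
resolved = {
    ref: urljoin(BASE, ref)
    for ref in ("g", "../g", "/g", "//g", "?y", "#s")
}

for ref, target in resolved.items():
    print(f"{ref:5} -> {target}")

# The "./" dot-segment is removed during resolution.
print(urljoin("http://example.com/dir/", "./image.jpg"))
```

A query-only reference replaces the base query but keeps the base path, while a fragment-only reference keeps both the base path and the base query.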

Resolution Mechanisms

Resolution of a Uniform Resource Identifier (URI) refers to the process of mapping the identifier to the corresponding resource through dereferencing, which involves determining the access mechanism and parameters based on the URI's scheme and components. This mechanism enables applications to locate and interact with resources without requiring prior knowledge of their exact representation or location. The resolution process begins with parsing the URI into its components: scheme, authority (including host and port), path, query, and fragment, as defined by the generic syntax. The scheme dictates the protocol or handler to use, such as a TCP/IP-based protocol for network schemes. Next, the authority component is contacted: for hostnames, this typically involves Domain Name System (DNS) resolution to obtain an IP address, followed by establishing a connection to the specified port (defaulting to scheme-specific values, like 80 for HTTP). The path and query components then guide the request to the specific resource within the authority's namespace. Delegation in URI resolution allows hierarchical administration of the namespace, where the authority component enables a central registry to assign sub-namespaces to other entities. For instance, in schemes using registered names, DNS provides a distributed model, resolving hostnames through a hierarchy of authoritative servers. This structure supports scalable resource location without a single point of control. In the HTTP scheme, resolution occurs over TCP/IP: after DNS resolves the host to an IP address, a client connects to the port, sends a GET request with the path and query, and receives a representation of the resource or an error response code. For the urn scheme, resolution is namespace-specific, often involving dedicated resolvers that map the URN to locators via protocols like NAPTR DNS records or HTTP-based services. Unlike location-based schemes, URN resolution emphasizes persistence and may not yield direct access but rather equivalent URIs. Error handling during resolution is scheme-dependent; for example, in HTTP, if the resource is unavailable, the server returns a 404 Not Found status code.
Redirects are managed through 3xx status codes, instructing the client to follow an alternative URI for the resource. Invalid URIs or unreachable authorities may result in connection failures or protocol-specific errors, prompting applications to flag or retry as appropriate.
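The HTTP resolution steps described above (parse, pick the port, form the request) can be sketched without touching the network. `plan_http_request` and `DEFAULT_PORTS` are hypothetical names; the DNS lookup and TCP connection are deliberately left out, so this shows only what a client would prepare before connecting.

```python
from urllib.parse import urlsplit

# Scheme defaults named in the text; other schemes are out of scope here.
DEFAULT_PORTS = {"http": 80, "https": 443}

def plan_http_request(uri: str):
    """Return (host, port, request_text) for a GET on an http(s) URI."""
    parts = urlsplit(uri)
    if parts.scheme not in DEFAULT_PORTS:
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")

    host = parts.hostname
    port = parts.port if parts.port is not None else DEFAULT_PORTS[parts.scheme]

    # The request target is the path plus query; the fragment stays
    # on the client and is never sent to the server.
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query

    host_header = host if parts.port is None else f"{host}:{parts.port}"
    request = f"GET {target} HTTP/1.1\r\nHost: {host_header}\r\n\r\n"
    return host, port, request

host, port, request = plan_http_request("http://example.com/search?q=uri#top")
print(host, port)
print(request)
```

In a real client the returned host would next be resolved through DNS and a connection opened to the returned port before the request text is sent.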

Applications and Extensions

Use in Web Technologies

URIs play a foundational role in the Hypertext Transfer Protocol (HTTP), where they form the request-target that identifies the primary resource upon which an HTTP method is applied, such as GET for retrieval or POST for submission in RESTful architectures. This usage enables precise addressing of resources on the Web, supporting stateless interactions where the URI alone suffices to locate and operate on the target without additional session state. In API design, URIs delineate endpoints that embody REST principles, allowing clients to manipulate resources through standardized methods while facilitating interoperability across distributed systems. In markup languages like HTML and XML, URIs integrate seamlessly to enable linking and resource embedding. The href attribute in HTML's <a> element specifies a URI reference for hyperlinks, directing users or agents to connected documents or sections, while the src attribute in elements like <img> or <script> denotes a URI for loading external media or code. Similarly, XML's XLink specification employs the href attribute to embed URI-based locators within elements, supporting bidirectional, multi-ended, and out-of-line links that extend beyond simple anchors to complex traversals in XML documents. Within the Semantic Web, URIs function as unique, global identifiers for abstract resources in RDF and OWL ontologies, where HTTP URIs are preferred for their dereferenceability, allowing retrieval of machine-readable descriptions (e.g., RDF documents) via standard HTTP GET requests when accessed. This design promotes Linked Data principles, enabling automated discovery and integration of knowledge across the web by resolving identifiers to informative representations. Contemporary web technologies extend URI applications to interactive and service-oriented protocols. WebSockets leverage the ws:// and wss:// URI schemes to initiate bidirectional communication channels over HTTP, with the URI specifying the host, port, and path for the upgrade handshake.
Service workers register via a script URL and define an associated scope URL, intercepting fetch requests within that scope to enable offline functionality and caching. GraphQL APIs typically expose a single HTTP endpoint URI (e.g., /graphql) for POST requests containing queries, allowing flexible data retrieval without multiple resource-specific URIs. URIs also underpin state management, authorization, and content selection in web ecosystems. In HTTP cookies, the Domain and Path attributes are matched against the request URI to scope cookie applicability, ensuring state is tied to specific origins. OAuth 2.0 employs URIs for critical parameters like redirect_uri, which specifies the client endpoint for returning authorization codes or tokens, alongside client_id, a unique identifier for the client application. Additionally, content negotiation in HTTP uses the request URI in conjunction with Accept headers to select resource variants, such as different media types or languages, based on client preferences.
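Because redirect_uri is itself a URI carried inside another URI's query, it must be percent-encoded so that its own delimiters are not misread. The sketch below assumes standard OAuth 2.0 authorization-request parameter names; the endpoint, client identifier, and callback values are placeholders invented for illustration, as is the helper name `build_authorization_url`.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_authorization_url(endpoint: str, client_id: str,
                            redirect_uri: str, state: str) -> str:
    """Append OAuth 2.0 authorization parameters as an encoded query."""
    query = urlencode({
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,  # ":" and "/" get percent-encoded
        "state": state,
    })
    return f"{endpoint}?{query}"

# All values below are hypothetical placeholders.
url = build_authorization_url(
    "https://auth.example/authorize",
    "demo-client",
    "https://client.example/callback",
    "xyz",
)
print(url)

# Decoding the query recovers the original redirect URI.
print(parse_qs(urlsplit(url).query)["redirect_uri"][0])
```

Encoding the nested URI keeps the outer query unambiguous: only the outer "?" and "&" act as delimiters.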

Internationalization and IRIs

Internationalized Resource Identifiers (IRIs) extend the URI framework to support characters from the Universal Character Set (UCS), also known as Unicode or ISO 10646, enabling the use of non-ASCII scripts in resource identifiers. Defined in RFC 3987, published in January 2005, an IRI is a sequence of characters that allows internationalized text while maintaining compatibility with existing URI infrastructure. This extension addresses the limitations of URIs, which are restricted to ASCII characters, by permitting native representation of non-Latin scripts directly in the identifier. The syntax of an IRI closely mirrors that of a URI, as outlined in RFC 3986, but replaces the unreserved character set with an expanded set that includes UCS characters (denoted as UCSCHAR in the Augmented Backus-Naur Form or ABNF grammar). Specifically, IRI components like the authority, path, query, and fragment follow the same hierarchical structure, but non-ASCII characters are allowed in positions where URIs permit unreserved characters, with reserved characters (such as /, ?, and #) retaining their role as delimiters. For instance, the authority component can include internationalized domain names via Internationalizing Domain Names in Applications (IDNA), while path and query segments support UCS characters without immediate encoding. To ensure interoperability with URI-based systems, IRIs are mapped to URIs through a process involving UTF-8 encoding followed by percent-encoding of non-ASCII octets. The conversion algorithm first transforms the IRI's UCS characters (excluding those in the authority's ireg-name) into UTF-8 byte sequences, then applies percent-encoding to any bytes outside the US-ASCII range, producing a valid URI. For the domain name portion (ireg-name), the ToASCII operation from RFC 3490, which relies on Punycode, is applied to convert internationalized labels to ASCII Compatible Encoding (ACE) form, prefixed with "xn--".
Conversely, converting a URI back to an IRI reverses this process, decoding percent-encoded sequences that form valid UTF-8 and applying the ToUnicode operation to interpret ACE domain labels as Unicode where supported. These mappings ensure that IRIs can be processed in legacy URI environments without loss of information. IRIs have seen widespread adoption in web standards and implementations, particularly for global accessibility. The WHATWG URL Living Standard defines parsing and serialization that accept non-ASCII input, accommodating internationalized content in attributes like href. Modern web browsers handle IRIs by converting internationalized domain names to Punycode for DNS resolution while displaying the native script to users, as per IDNA guidelines, and they apply these conversions transparently in the address bar and in link processing. Protocols such as HTTP/1.1 further accommodate IRIs, allowing non-ASCII characters in headers and document references when encoded appropriately. Despite these advancements, IRIs present challenges related to text rendering and equivalence. Bidirectional text in scripts like Arabic or Hebrew requires logical storage order and application of the Unicode Bidirectional Algorithm, with restrictions prohibiting mixed-direction components within a single IRI to avoid visual confusion or security risks. Normalization is another key issue; IRIs should be represented in Unicode Normalization Form C (NFC) to mitigate variations from different normalization forms, ensuring consistent comparison across systems; simple string matching or syntax-based normalization can then determine equivalence. For example, the IRI http://例.com/ページ maps to the URI http://xn--fsq.com/%E3%83%9A%E3%83%BC%E3%82%B8, where the host is Punycode-encoded and the path is percent-encoded. Similarly, http://résumé.example.org as an IRI becomes http://xn--rsum-bpad.example.org in URI form, demonstrating that Punycode is applied only to the non-ASCII label while labels that are already ASCII are left unchanged.
These conversions highlight how IRIs facilitate multilingual web navigation while preserving URI compatibility.
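The NFC recommendation can be illustrated with a short comparison. In this sketch the helper name nfc_equal and both sample IRIs are made up for the example; it shows only the Unicode normalization step, not the full set of IRI equivalence rules.

```python
import unicodedata


def nfc_equal(iri_a, iri_b):
    """Compare two IRIs after converting both to Normalization Form C."""
    return unicodedata.normalize("NFC", iri_a) == unicodedata.normalize("NFC", iri_b)


# The same visible text, written with precomposed and decomposed accents.
precomposed = "http://example.org/r\u00e9sum\u00e9"   # é as a single code point
decomposed = "http://example.org/re\u0301sume\u0301"  # e followed by a combining acute

print(precomposed == decomposed)           # False: the code points differ
print(nfc_equal(precomposed, decomposed))  # True: identical once both are in NFC
```

Without the normalization step, two IRIs that look identical to a reader would be treated as different identifiers by a plain string comparison.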

Considerations

Normalization and Munging

Normalization standardizes URI representations to enable accurate comparison and determination of equivalence without accessing the referenced resource. The process, outlined in RFC 3986, involves syntax-based adjustments that eliminate variations which do not affect the identified resource. These adjustments ensure that equivalent URIs, such as those differing only in case or encoding, are transformed into identical forms for syntactic equivalence. Case normalization converts the scheme and host components to lowercase, as they are case-insensitive. For example, the URI "HTTP://www.EXAMPLE.com/" normalizes to "http://www.example.com/". Hexadecimal digits within percent-encoded octets are also normalized to uppercase for consistency, treating "%3a" and "%3A" as equivalent. Percent-encoding normalization decodes any percent-encoded octets that represent unreserved characters (A-Z, a-z, 0-9, hyphen, period, underscore, and tilde), removing unnecessary encodings like "%7E" for a tilde; octets such as "%20" for a space stay encoded because a space is not an unreserved character. Path segment normalization applies the remove_dot_segments algorithm to eliminate "." and ".." segments, simplifying paths like "/docs/./../docs" to "/docs". After these transformations, syntactic equivalence is assessed by character-by-character comparison of the normalized strings; identical results indicate that the URIs reference the same resource. Semantic equivalence builds on this by incorporating scheme-specific rules, such as treating an empty path in HTTP URIs as equivalent to a path of "/" and an explicit default port as equivalent to an omitted one. For instance, "http://example.com", "http://example.com/", and "http://example.com:80/" are semantically equivalent under HTTP rules. URL munging involves unauthorized or ad-hoc modifications to URIs that can alter their meaning or cause resolution failures. Common practices include prepending "www." to the host component, such as changing "example.com" to "www.example.com", which may lead to errors if the server is not configured to treat both hosts equivalently.
Another frequent alteration is appending or removing trailing slashes from paths, potentially creating duplicate content or triggering unintended redirects; for example, "http://example.com/page" and "http://example.com/page/" might resolve differently depending on server configuration. Such changes disrupt canonical forms and can result in broken links or inconsistent access. Best practices for handling normalization include using established canonicalization routines built on programming libraries. Python's urllib.parse module, for instance, provides functions like urlsplit and urlunsplit that split and reassemble the components; urlsplit lowercases the scheme and exposes a lowercased hostname attribute, but it does not decode percent-encodings or remove dot segments, so those steps must be applied separately to reach a representation consistent with RFC 3986. Implementations should apply full syntax-based normalization before comparison to avoid false non-equivalences, prioritizing these steps over scheme-specific adjustments unless required for the application context.
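The normalization steps described in this section can be sketched on top of urllib.parse as follows. This is an illustrative sketch rather than a complete canonicalizer: the names normalize_uri, normalize_percent_encoding, and DEFAULT_PORTS are assumptions made for the example, userinfo is dropped, and IPv6 literals are not handled. The remove_dot_segments function follows the step-by-step procedure of RFC 3986, section 5.2.4.

```python
import re
import string
from urllib.parse import urlsplit, urlunsplit

UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
DEFAULT_PORTS = {"http": 80, "https": 443}  # scheme-specific knowledge


def normalize_percent_encoding(text):
    """Decode octets for unreserved characters; uppercase all other hex digits."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else match.group(0).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", fix, text)


def remove_dot_segments(path):
    """Remove "." and ".." segments, following RFC 3986 section 5.2.4."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]
            if output:
                output.pop()
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/") to the output.
            end = path.find("/", 1 if path.startswith("/") else 0)
            end = len(path) if end == -1 else end
            output.append(path[:end])
            path = path[end:]
    return "".join(output)


def normalize_uri(uri):
    """Apply case, percent-encoding, path, and HTTP-specific normalization."""
    parts = urlsplit(uri)                # urlsplit lowercases the scheme
    scheme = parts.scheme
    netloc = parts.hostname or ""        # .hostname is already lowercased
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc += f":{parts.port}"
    path = remove_dot_segments(normalize_percent_encoding(parts.path))
    if scheme in DEFAULT_PORTS and netloc and not path:
        path = "/"                       # empty HTTP path is equivalent to "/"
    query = normalize_percent_encoding(parts.query)
    fragment = normalize_percent_encoding(parts.fragment)
    return urlunsplit((scheme, netloc, path, query, fragment))


print(normalize_uri("HTTP://www.EXAMPLE.com/"))
print(normalize_uri("http://example.com/docs/./../docs"))
print(normalize_uri("http://example.com:80/%7euser") == normalize_uri("http://example.com/~user"))
```

Two URIs can then be compared by normalizing both and checking the resulting strings for equality. Keeping the default-port and empty-path rules in one place makes it easy to disable them when only syntax-based comparison is wanted.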

Security Implications

Uniform Resource Identifiers (URIs) introduce several security risks due to their role in directing resource access, particularly when parsed or resolved without proper safeguards. Open redirects occur when applications accept untrusted URI inputs for redirection without validation, allowing attackers to manipulate users into visiting malicious sites, often as a precursor to phishing or credential harvesting. Similarly, injection attacks exploit query parameters or fragments in URIs; for instance, unescaped inputs in query strings can lead to cross-site scripting (XSS) if reflected into web pages, while fragments may trigger client-side script execution in vulnerable pages. Scheme-specific threats amplify these vulnerabilities. The javascript: URI scheme enables direct execution of JavaScript code in the context of the current page, facilitating XSS attacks by injecting malicious scripts when users click or navigate to such links, a behavior that browsers have historically permitted. The data: URI scheme, which embeds data directly into the URI, poses phishing risks by allowing attackers to craft self-contained pages mimicking legitimate sites, bypassing external hosting and evading some URL filters. Historical incidents highlight the real-world impact of URI-related exploits. In the 2010s, URL shortening services like bit.ly were abused in campaigns such as the Koobface worm, which used shortened URIs to redirect users to malware downloads, spreading via social networking sites and infecting thousands of systems. These exploits often combined open redirects with obfuscated malicious payloads, demonstrating how URI opacity can facilitate large-scale attacks. Mitigations focus on defensive handling of URIs during parsing and resolution. URI validation involves checking schemes, hosts, and parameters against allowlists to block untrusted inputs, while sandboxing isolates URI processing to prevent code execution from malicious schemes.
Content-Security-Policy (CSP) headers provide an additional layer by restricting executable scripts and navigations, effectively blocking javascript: and certain data: executions in modern browsers. Best practices emphasize proactive design to minimize exposure. Developers should avoid the deprecated userinfo component (e.g., username:password@host) in URIs, as it exposes credentials in logs and browser histories; instead, use secure alternatives such as credentials carried in HTTP Authorization headers. Always validate allowed schemes (e.g., restricting to https:) and enforce HTTPS so that paths and query parameters are encrypted in transit, preventing interception of sensitive parameters during resolution.
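The validation advice above can be sketched as a small allowlist check for redirect targets. This is a minimal sketch under stated assumptions, not a complete defense: the function name is_safe_redirect, the ALLOWED_SCHEMES constant, and the sample hosts are all hypothetical, and a real application would typically layer further checks on top.

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"https"}  # assumption: only HTTPS targets are acceptable


def is_safe_redirect(target, allowed_hosts):
    """Return True only for https URIs on an allowlisted host without userinfo."""
    try:
        parts = urlsplit(target)
        host = parts.hostname
    except ValueError:
        return False  # malformed input is rejected outright
    if parts.scheme not in ALLOWED_SCHEMES:
        return False  # blocks javascript:, data:, and scheme-relative targets
    if parts.username is not None or parts.password is not None:
        return False  # rejects the deprecated userinfo component
    return host in allowed_hosts


hosts = {"example.com"}  # hypothetical allowlist
print(is_safe_redirect("https://example.com/next", hosts))       # True
print(is_safe_redirect("javascript:alert(1)", hosts))            # False
print(is_safe_redirect("https://user:pw@example.com/", hosts))   # False
```

Matching on the parsed hostname rather than on a substring of the raw string avoids accepting targets that merely contain an allowed name somewhere in the URI.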
