Universally unique identifier
A universally unique identifier (UUID) is a 128-bit label designed to uniquely identify objects or entities in computer systems without requiring a central registration authority, ensuring uniqueness across space and time.[1] Standardized by the Internet Engineering Task Force (IETF) in RFC 9562 (obsoleting RFC 4122 in 2024) as a Uniform Resource Name (URN) namespace, a UUID—also known as a globally unique identifier (GUID) in some contexts—facilitates distributed systems by providing collision-resistant identifiers for resources such as files, transactions, or database records.[1] Its structure consists of 16 octets (128 bits) arranged in a specific layout: a 32-bit time-low field, a 16-bit time-mid field, a 16-bit time-high-and-version field, an 8-bit clock-sequence-high-and-reserved field, an 8-bit clock-sequence-low field, and a 48-bit node field, typically represented in a canonical string format of five hyphen-separated groups of 8, 4, 4, 4, and 12 hexadecimal digits (e.g., f81d4fae-7dec-11d0-a765-00a0c91e6bf6).[1] The variant field in octet 8 distinguishes the UUID layout, with the defined variant using the bit pattern 10xx for compatibility.[1]
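As a concrete illustration of this layout, Python's standard uuid module (assumed here only as a convenient tool; it exposes the RFC field names directly) can decompose the example string above into its fields:

```python
import uuid

# The canonical example UUID from the text, a version-1, RFC-variant UUID.
u = uuid.UUID("f81d4fae-7dec-11d0-a765-00a0c91e6bf6")

print(hex(u.time_low))  # 0xf81d4fae — first hex group (32 bits)
print(hex(u.time_mid))  # 0x7dec     — second hex group (16 bits)
print(hex(u.node))      # 0xa0c91e6bf6 — last hex group (48 bits)
print(u.version)        # 1 — high nibble of the time-high-and-version field
print(u.variant)        # 'specified in RFC 4122' — the 10xx variant
```

The accessors simply slice the 128-bit big-endian value at the fixed boundaries described above.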
UUIDs are defined in eight versions, each with a distinct algorithm to guarantee uniqueness (detailed in the dedicated section below).
Uniqueness is achieved through these methods: time-based versions leverage monotonic clocks and unique node IDs (or random alternatives), while name-based, random, and custom versions use cryptographic hashing, sufficient entropy, or defined procedures to minimize collision probabilities, making UUIDs essential for scalable, decentralized applications like cloud computing and databases.[1]
History
Origins in OSF DCE
The universally unique identifier (UUID) was initially developed as a component of the Open Software Foundation's (OSF) Distributed Computing Environment (DCE), a middleware system aimed at facilitating distributed computing in client-server architectures during the early 1990s.[2] The primary goal was to enable the generation of unique identifiers for objects, such as files, processes, or resources, across networked systems without relying on a central coordinating authority, thereby supporting scalability in heterogeneous environments.[2] This approach addressed the challenges of ensuring global uniqueness in distributed setups where multiple nodes might independently create identifiers, avoiding conflicts through decentralized mechanisms.[3] The concept traces its roots to the Apollo Network Computing System (NCS), developed by Apollo Computer, before being formalized and expanded within OSF DCE.[2] Key figures in its specification included Paul J. Leach, then at Microsoft, and Rich Salz, associated with Certco, who co-authored early drafts building directly on the OSF DCE framework.[3] The first specification appeared in OSF DCE version 1.0, released around 1990–1991, defining the UUID as a fixed 128-bit value to balance compactness with sufficient entropy for uniqueness.[4] This structure incorporated a 60-bit timestamp (representing 100-nanosecond intervals since October 15, 1582), a 14-bit clock sequence to handle potential timestamp collisions, and a 48-bit node identifier typically derived from the machine's IEEE 802 MAC address, ensuring both temporal and spatial uniqueness.[5] Early adoption of UUIDs occurred within OSF DCE-based systems for tasks like naming interfaces and binding endpoints in remote procedure calls.[5] Microsoft integrated the OSF DCE remote procedure call (RPC) mechanism, including UUIDs, into its Component Object Model (COM) framework, where they served as globally unique identifiers (GUIDs) for components, interfaces, and classes in 
distributed applications.[2] This integration marked one of the earliest widespread uses beyond pure DCE environments, influencing subsequent implementations in Windows platforms and other enterprise systems.[2]
Standardization and Evolution
The standardization of universally unique identifiers (UUIDs) originated in the Open Software Foundation's Distributed Computing Environment (OSF DCE) in the early 1990s, providing a foundational specification for generating unique identifiers across distributed systems. This initial framework was subsequently formalized as an international standard through ISO/IEC 11578:1996, which defines UUIDs within the context of Open Systems Interconnection—Remote Procedure Call (RPC), ensuring interoperability in information technology environments. In July 2005, the Internet Engineering Task Force (IETF) published RFC 4122, titled "A Universally Unique IDentifier (UUID) URN Namespace," which codified the DCE-based UUID format as a Uniform Resource Name (URN) namespace while introducing three new versions: version 3 for name-based UUIDs using MD5 hashing, version 4 for random or pseudo-random generation, and version 5 for name-based UUIDs using SHA-1 hashing.[2] This RFC addressed gaps in the earlier DCE and ISO specifications by providing a broader set of generation methods suitable for diverse applications, including those not reliant on distributed time synchronization. 
Over the years, RFC 4122 underwent refinements through published errata, resolving ambiguities in areas such as variant bit interpretation and byte order to enhance implementation consistency.[6] Advancements continued with the IETF's uuidrev working group, which in 2023 released draft-ietf-uuidrev-rfc4122bis-08, proposing version 7 UUIDs that incorporate a 48-bit Unix timestamp for improved chronological sortability without depending on synchronized clocks or hardware identifiers.[7] This draft evolved into RFC 9562, published in May 2024 as "Universally Unique IDentifiers (UUIDs)," which obsoletes RFC 4122 and introduces version 6 (a reordered time-based variant of version 1), version 7, and version 8 for custom or experimental applications where implementers define their own layouts within the UUID framework.[1] RFC 9562 also aligns UUID specifications more closely with ITU-T Recommendation X.667 | ISO/IEC 9834-8, emphasizing best practices for generation and usage. A key aspect of UUID evolution has been the shift away from hardware-dependent generation methods, particularly those using MAC addresses in versions 1 and 2, toward software-only approaches. This transition addresses privacy concerns, as MAC addresses can reveal device identities and enable tracking, compounded by modern operating systems' implementation of MAC address randomization to protect user anonymity in network environments.[1] RFC 9562 explicitly recommends using fixed or random node identifiers instead of actual MAC addresses to maintain uniqueness while mitigating these risks, reflecting broader industry adoption of privacy-preserving identifier strategies.[1]
Standards
Core Specifications
The core specifications for universally unique identifiers (UUIDs) establish a standardized 128-bit format designed to ensure uniqueness across both spatial and temporal dimensions without relying on a central registration authority. This fixed length allows for an immense address space of approximately 3.4 × 10^38 possible values, minimizing collision risks in distributed systems.[2][1] The foundational international standard, ISO/IEC 11578:1996, defines the representation and generation algorithms for UUIDs within the Open Systems Interconnection (OSI) framework, specifically as part of the Distributed Computing Environment (DCE) remote procedure call (RPC) bindings. It specifies the canonical string format—32 hexadecimal digits grouped as 8-4-4-4-12 with hyphens—and outlines procedures for creating time-based and name-based identifiers to maintain global uniqueness. Compliance with this standard ensures interoperability in OSI-conformant environments by mandating the use of precise bit layouts for timestamp, clock sequence, and node fields.[8][9] Building on DCE principles, RFC 4122 (published in 2005 by the Internet Engineering Task Force) formalizes UUIDs as a Uniform Resource Name (URN) namespace, defining versions 1 through 5 with distinct generation methods while emphasizing collision avoidance through randomized elements like clock sequences and node identifiers. It introduces the variant field, encoded in the high-order bits of the eighth octet as the binary pattern 10xx (where xx are variable), to distinguish UUIDs generated under this specification from other variants and reserve space for future extensions. For interoperability, all compliant UUIDs must adhere to this variant encoding and include a version nibble (a 4-bit value in the high-order bits of the seventh octet) that identifies the specific generation algorithm used. 
Guidelines for collision avoidance include using unique node identifiers, such as IEEE 802 MAC addresses, and incrementing a clock sequence on system reboots or clock adjustments to prevent duplicates.[2] RFC 9562 (published in 2024) obsoletes and expands RFC 4122, incorporating versions 6 through 8 to address modern needs like sortable timestamps and custom subtypes, while clarifying and slightly revising the variant rules for enhanced robustness. It reaffirms the 128-bit length and the core requirement for decentralized uniqueness, with no central coordination needed for generation or validation. Updated compliance criteria mandate the variant bits (positions 64-65 in the UUID octet stream) be set to 10 for all defined versions, alongside the version nibble (bits 48-51) matching the UUID type (e.g., 0110 for version 6), ensuring seamless integration across systems and networks. These specifications collectively prioritize global interoperability by standardizing the layout and metadata fields that signal generation provenance.[1]
Related and Variant Standards
In Microsoft Windows environments, the UUID is implemented as a Globally Unique Identifier (GUID), a 16-byte binary structure used extensively in Component Object Model (COM) and Distributed COM (DCOM) to uniquely identify interfaces, classes, and other objects.[10] GUIDs are stored in binary form within the Windows registry to index configuration information for applications and system components, enabling seamless object resolution across distributed systems.[11] This adaptation maintains compatibility with the core UUID structure while integrating with Windows-specific protocols for remote procedure calls.[12] The Bluetooth specification adopts 128-bit UUIDs for identifying services and profiles, often deriving them from shorter 16-bit or 32-bit forms assigned by the Bluetooth Special Interest Group (SIG) to optimize transmission efficiency in low-power devices.[13] These short UUIDs are expanded into full 128-bit equivalents by inserting the assigned value into a fixed base UUID (00000000-0000-1000-8000-00805F9B34FB), ensuring global uniqueness while minimizing on-air bytes in protocols like GATT.[14] Only SIG-assigned short forms are permitted for interoperability, with custom 128-bit UUIDs reserved for vendor-specific extensions.[14] In web and data interchange standards, UUIDs are represented in their canonical string lexical form—a sequence of 32 hexadecimal digits grouped as 8-4-4-4-12 and enclosed in braces for some contexts—to ensure consistent parsing within JSON documents, as JSON natively supports strings without requiring special handling.[15] This format promotes interoperability in JSON-based APIs and serialization, where UUIDs serve as keys or identifiers without altering the underlying 128-bit binary value.[15] Emerging standards and drafts extend UUIDs for cloud environments and enhanced privacy. 
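The Bluetooth short-to-full expansion described above amounts to placing the SIG-assigned value in bits 96-127 of the fixed base UUID. A minimal sketch in Python (0x180D, the SIG-assigned identifier for the GATT Heart Rate service, is used as the example):

```python
import uuid

# The fixed Bluetooth base UUID into which short forms are inserted.
BT_BASE_UUID = uuid.UUID("00000000-0000-1000-8000-00805F9B34FB")

def expand_short_uuid(short: int) -> uuid.UUID:
    """Expand a 16- or 32-bit SIG-assigned UUID to its 128-bit form:
    the short value replaces the most significant 32 bits of the base,
    so 0xXXXX becomes 0000XXXX-0000-1000-8000-00805F9B34FB."""
    return uuid.UUID(int=BT_BASE_UUID.int | (short << 96))

# 0x180D is the SIG-assigned Heart Rate service identifier.
print(expand_short_uuid(0x180D))  # 0000180d-0000-1000-8000-00805f9b34fb
```

Transmitting only the 16-bit form and expanding it on receipt is what saves on-air bytes in low-power protocols.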
For instance, Amazon Web Services (AWS) incorporates UUIDs within Amazon Resource Names (ARNs) for certain resource identifiers, such as in AWS Glue transformations that generate unique IDs for data rows, facilitating scalable, distributed resource management.[16] Privacy-enhanced variants, addressed in recent IETF updates, introduce new versions like UUIDv7 (time-ordered with high-entropy random bits) and UUIDv8 (custom subtypes), which avoid exposing hardware identifiers like MAC addresses to mitigate tracking risks in distributed systems.[1] These evolutions, formalized in RFC 9562, prioritize randomness and temporal sorting while preserving collision resistance.[17]
Structure
Overall Layout
A Universally Unique Identifier (UUID) is a 128-bit integer value designed to be unique across space and time without centralized coordination.[17] It is typically stored and transmitted in network byte order (big-endian), ensuring consistent interpretation across different systems regardless of local endianness.[17] This fixed 128-bit size provides ample space to avoid collisions while remaining compact for storage and comparison purposes.[17] The UUID is divided into several fixed fields that collectively form its structure: a variant field of up to 3 bits (indicating the encoding variant), a 4-bit version field (specifying the generation algorithm), a 48-bit node ID (often derived from hardware like a MAC address), a 60-bit timestamp or equivalent value (for time-based uniqueness), and a 14-bit clock sequence (to handle clock adjustments and prevent duplicates).[17] These fields are not padded with leading zeros in their binary representation; the fixed overall bit length ensures no overflow or alignment issues during generation or parsing.[17] Visually, the 128-bit UUID can be broken down as follows:

UUID = time_low (32 bits) | time_mid (16 bits) | time_hi_and_version (16 bits) | clock_seq_and_variant (16 bits) | node (48 bits)

This layout places time_low in the most significant bits of the big-endian value, followed by time_mid, then time_hi_and_version (which embeds the version in its high 4 bits), clock_seq_and_variant (which embeds the variant in its high bits), and finally the node in the least significant bits.[17] The purpose of these fields is to ensure global uniqueness by integrating temporal information (timestamp), spatial identifiers (node ID), and elements of randomness or sequencing (clock sequence), allowing UUIDs to be generated independently on different systems without coordination.[17] For instance, in time-based versions, the timestamp provides ordering, while the node distinguishes devices; in random versions, the fields incorporate pseudorandom values to achieve the same goal.[17] This field-based design supports multiple UUID versions while maintaining a uniform overall layout.[17]
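Because the fields sit at fixed bit positions, they can be recovered from the 128-bit value with shifts and masks. A sketch, cross-checked against Python's uuid module accessors:

```python
import uuid

def split_fields(u: uuid.UUID) -> dict:
    """Carve the big-endian 128-bit value into the five layout fields;
    time_low occupies the most significant 32 bits, node the least."""
    v = u.int
    return {
        "time_low":              (v >> 96) & 0xFFFFFFFF,
        "time_mid":              (v >> 80) & 0xFFFF,
        "time_hi_and_version":   (v >> 64) & 0xFFFF,
        "clock_seq_and_variant": (v >> 48) & 0xFFFF,
        "node":                   v        & 0xFFFFFFFFFFFF,
    }

u = uuid.UUID("f81d4fae-7dec-11d0-a765-00a0c91e6bf6")
f = split_fields(u)
print(f"{f['time_low']:08x}")  # f81d4fae
assert f["time_mid"] == u.time_mid and f["node"] == u.node
```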
Variant and Version Fields
The variant field in a UUID occupies the three most significant bits (bits 0 through 2) of the clock_seq_hi_and_reserved octet (octet 8 in the 128-bit layout), determining the overall layout and interpretation of the remaining bits for interoperability across systems.[18] This field uses specific bit patterns to distinguish between different UUID encoding schemes: 0xx for Network Computing System (NCS) backward compatibility, 10x for the standard defined in RFC 9562 (providing compatibility with earlier RFC 4122 UUIDs), 110 for Microsoft GUID backward compatibility, and 111 reserved for future use.[18] The 10x pattern, where the two most significant bits are 10, is the most commonly used in modern implementations to ensure consistent parsing.[18]
The version field, a 4-bit nibble, is located in the most significant bits (bits 0 through 3) of the time_hi_and_version octet (octet 6), specifying the UUID generation algorithm and thus how the other fields should be interpreted.[19] Versions 1 through 5 represent the original methods: version 1 for time-based UUIDs using Gregorian timestamps, version 2 for DCE security UUIDs (largely reserved), version 3 for name-based UUIDs using MD5 hashing, version 4 for random or pseudorandom UUIDs, and version 5 for name-based UUIDs using SHA-1 hashing.[19] Updated versions 6 through 8 extend this scheme: version 6 for reordered time-based UUIDs (rearranging version 1 fields for better sorting), version 7 for Unix timestamp-based UUIDs with random components for improved monotonicity, and version 8 for custom or application-specific UUIDs.[19]
To detect the UUID type, systems parse these fields during decoding: the variant bits first classify the layout, followed by the version bits to identify the exact generation method, enabling validation of compliance with RFC 9562.[20] This dual classification is crucial for preventing misinterpretation; for instance, a variant mismatch (e.g., treating a Microsoft GUID as an RFC 9562 UUID) can lead to incorrect extraction of timestamps or node identifiers, causing errors in distributed systems.[18] By standardizing these bits, UUIDs maintain uniqueness and portability across diverse environments without requiring additional metadata.[21]
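The two-step detection described above (variant bits first, then version nibble) can be sketched as:

```python
import uuid

def classify(s: str):
    """Detect a UUID's type: first classify the variant from the high bits
    of octet 8, then, for the RFC variant only, read the version nibble
    from the high bits of octet 6."""
    u = uuid.UUID(s)
    octet8 = u.bytes[8]
    if octet8 >> 7 == 0b0:
        variant = "NCS (backward compatibility)"
    elif octet8 >> 6 == 0b10:
        variant = "RFC 4122/9562"
    elif octet8 >> 5 == 0b110:
        variant = "Microsoft GUID"
    else:
        variant = "reserved (future)"
    version = u.bytes[6] >> 4 if variant == "RFC 4122/9562" else None
    return variant, version

print(classify("f81d4fae-7dec-11d0-a765-00a0c91e6bf6"))  # ('RFC 4122/9562', 1)
```

Note that the version nibble is only meaningful once the variant check has confirmed the RFC layout; reading it from a Microsoft or NCS identifier would yield garbage, which is exactly the misinterpretation risk described above.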
Time and Node Components
In time-based UUIDs, such as versions 1 and 6, the timestamp component provides a measure of the generation time, consisting of a 60-bit value representing the number of 100-nanosecond intervals since the Gregorian calendar epoch of October 15, 1582, 00:00:00.[22] This starting point deliberately follows the adoption of the Gregorian calendar to avoid complications from the earlier Julian calendar transition, ensuring a consistent reference across systems.[22] The timestamp is not expected to roll over until around A.D. 3400, depending on the algorithm used, providing ample longevity for practical use.[22] The timestamp is subdivided and positioned within the 128-bit UUID structure as follows: the least significant 32 bits occupy the time_low field (octets 0-3), the next 16 bits form the time_mid field (octets 4-5), and the most significant 12 bits of the timestamp appear in the time_hi portion of the time_hi_and_version field (octets 6-7, with the remaining 4 bits reserved for the version number).[22] In version 1 UUIDs, this layout preserves the original DCE ordering, while version 6 reorders the fields to place the most significant timestamp bits first for improved chronological sorting in databases.[23] These components collectively ensure that the timestamp contributes to the UUID's uniqueness by embedding precise temporal information.
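The 60-bit timestamp can be recovered from a version-1 UUID, and the version-6 reordering can be applied by shuffling the same bits. A sketch in Python (the constant 0x01B21DD213814000 is the count of 100-ns intervals between 1582-10-15 and the Unix epoch):

```python
import uuid
from datetime import datetime, timedelta, timezone

# 100-ns intervals between the Gregorian epoch (1582-10-15) and the
# Unix epoch (1970-01-01).
GREGORIAN_TO_UNIX_100NS = 0x01B21DD213814000

def uuid1_datetime(u: uuid.UUID) -> datetime:
    """Recover the generation time encoded in a version-1 UUID's 60-bit
    timestamp of 100-ns intervals since 1582-10-15."""
    unix_100ns = u.time - GREGORIAN_TO_UNIX_100NS
    return datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(
        microseconds=unix_100ns // 10)

def uuid1_to_uuid6(u1: uuid.UUID) -> uuid.UUID:
    """Rearrange a version-1 UUID into the version-6 layout: the high 48
    timestamp bits first, then the version nibble (6), then the low 12
    timestamp bits; variant, clock sequence, and node stay in place."""
    ts = u1.time
    v = ((ts >> 12) << 80) | (0x6 << 76) | ((ts & 0xFFF) << 64) \
        | (u1.int & 0xFFFFFFFFFFFFFFFF)
    return uuid.UUID(int=v)

u1 = uuid.uuid1()
print(uuid1_datetime(u1))          # approximately the current UTC time
print(uuid1_to_uuid6(u1).version)  # 6
```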
To mitigate risks of duplicate UUIDs arising from clock adjustments, such as regressions or resets on the generating system, a 14-bit clock sequence field is included.[22] This sequence is typically initialized to a random value between 0 and 16,383 and is incremented (modulo 16,384) whenever the local clock is found to have regressed relative to the last UUID generation time, or upon node ID changes that could otherwise cause collisions.[22] The clock sequence occupies bits 66-79 in the binary representation: following the 2-bit variant (bits 64-65), it consists of the low 6 bits of octet 8 (bits 66-71) and all 8 bits of octet 9 (bits 72-79).[22] This mechanism guarantees temporal uniqueness even in environments with imperfect clocks, without requiring synchronized time across nodes. The node ID component, a 48-bit field, identifies the hardware or network interface generating the UUID and occupies the final octets (10-15) in the binary layout.[22] It is conventionally set to the IEEE 802 MAC address of the local node, which uniquely identifies network interfaces worldwide.[22] When a true MAC address is unavailable or to preserve privacy, a randomly generated 48-bit value is used instead, with the multicast bit (the least significant bit of octet 10) set to 1 to distinguish it from unicast MAC addresses.[24] This node ID, combined with the timestamp and clock sequence, ensures global uniqueness by tying the UUID to a specific generating entity.[22] The variant and version fields are integrated adjacent to these components to classify the UUID type without altering their core roles.[25]
UUID Versions
Time-based with MAC Address (Versions 1 and 6)
UUID version 1, also known as the time-based UUID, generates identifiers using a 60-bit timestamp representing the number of 100-nanosecond intervals since 00:00:00.00 UTC on October 15, 1582 (the Gregorian calendar epoch), combined with a 14-bit clock sequence and a 48-bit node identifier.[26] The timestamp is divided into three fields: time_low (32 bits), time_mid (16 bits), and time_hi_and_version (16 bits, with the 4-bit version set to 0001 and the remaining 12 bits for time_hi).[27] The clock sequence prevents duplicates if the system clock is reset or adjusted backward, initialized to a random value between 0 and 16383, while the node field typically holds the IEEE 802 MAC address of the generating machine's network interface; if no MAC is available, a random 48-bit value is used with the multicast bit (least significant bit of the first octet) set to 1.[28][29] The generation process for version 1 UUIDs follows a stateful algorithm to ensure monotonicity: obtain an exclusive lock to access the UUID generation state, retrieve the current UTC timestamp and node ID, compare the timestamp to the previous one—if it has not advanced, increment the clock sequence (or generate a new one if it overflows) and retry the timestamp acquisition up to a system-defined limit, then format the fields into the 128-bit structure and release the lock.[30] This design supports high generation rates, up to approximately 10 million UUIDs per second per node, as the 100-nanosecond granularity allows 10^7 intervals per second.[31] Uniqueness is guaranteed globally without central coordination: the timestamp and node combination ensures no collisions across distinct machines (due to unique MAC addresses), while the clock sequence handles duplicates within the same node and time slot.[32] Version 6 UUIDs, introduced as an update in RFC 9562, maintain the core elements of version 1—60-bit timestamp, 14-bit clock sequence, and 48-bit node—but reorder the timestamp fields for improved 
lexical sorting when stored as binary or text representations, enhancing database index locality and query performance in distributed systems.[33] Specifically, the structure places the most significant 48 bits of the timestamp first (time_high across octets 0-5, split as 32 bits in octets 0-3 and 16 bits in 4-5), followed by the version (0110 in bits 48-51 of octet 6), the least significant 12 bits of the timestamp (time_low in bits 52-63 of octet 6-7), the variant (10 in bits 64-65 of octet 8), the clock sequence (14 bits across octets 8-9), and the node (48 bits in octets 10-15).[33] Generation mirrors version 1, including the stateful timestamp and clock sequence logic, but with the timestamp bytes rearranged post-capture to prioritize higher-order bits for sortability, using the same epoch and node derivation rules.[33] Both versions 1 and 6 provide strong uniqueness guarantees identical to those of the original DCE specification, with one UUID per 100-nanosecond interval per node, enabling collision-free operation across space and time in uncoordinated environments.[32][33] They are particularly suited for use cases in distributed computing systems, such as the Open Software Foundation's Distributed Computing Environment (DCE), where temporal ordering is beneficial for logging, transaction tracking, or replication without requiring synchronized clocks beyond the node level.[34] Version 6's sorting advantage makes it preferable in modern databases for range queries or partitioning by time.[33]
DCE Security with MAC Address (Version 2)
Version 2 UUIDs, known as DCE security UUIDs, represent a specialized variant of time-based identifiers designed for distributed computing environments requiring embedded security contexts. They extend the core structure of version 1 UUIDs by incorporating local identifiers such as POSIX user IDs (UIDs) or group IDs (GIDs) to associate UUIDs with specific principals for access control and auditing purposes. This variant was specified in the DCE 1.1 Authentication and Security Services standard to support privilege management within DCE cells. The layout of a version 2 UUID mirrors that of version 1 in its overall 128-bit composition, including a 60-bit timestamp split across time_low (32 bits), time_mid (16 bits), and time_hi_and_version (16 bits, with the 4 most significant bits set to 0010 binary to indicate version 2), a 14-bit clock sequence across clock_seq_hi_and_reserved (8 bits, with variant bits 10 in the 2 most significant bits) and clock_seq_low (8 bits), and a 48-bit node field containing the MAC address. However, the time_low field replaces the least significant 32 bits of the timestamp with the 32-bit local identifier (UID or GID), reducing timestamp precision but embedding security information. The clock sequence is effectively shortened to 6 bits in clock_seq_hi_and_reserved (bits 8-13 of the original sequence), while the 8-bit clock_seq_low field holds the domain value, which differentiates the type of local identifier used.[5] The domain value in clock_seq_low specifies the security context and supports three defined values: 0 for the person domain (using a user ID), 1 for the group domain (using a group ID), and 2 for the organization domain (using an organizational unit ID). These domains enable DCE systems to map UUIDs to specific access control entries, such as in privilege attribute certificates, ensuring that identifiers reflect the creating entity's security role within a local cell. 
Although the field is 8 bits (allowing values up to 255), only these three are standardized, with others left for potential future or implementation-specific use. Generation of a version 2 UUID involves capturing the current UTC timestamp in 100-nanosecond intervals since October 15, 1582, incrementing a 6-bit clock sequence (modulo 64) if the timestamp has not advanced, selecting the appropriate domain and retrieving the corresponding local ID (e.g., via POSIX getuid() or getgid()), and combining these with the system's 48-bit node ID (MAC address). The local ID and domain are embedded at creation time to record the security principal responsible, aiding in auditing and authorization without requiring centralized coordination. Unlike version 1, no standard DCE API directly generates version 2 UUIDs; implementations must customize the uuid_create() routine accordingly.[5] Due to their dependency on POSIX-specific identifiers and limited adoption beyond DCE ecosystems, version 2 UUIDs are largely obsolete today and omitted from many modern libraries and standards. RFC 9562 reserves the version for DCE security but provides no further details, deferring to the original specification, and notes their rarity in contemporary systems except for legacy DCE or certain Microsoft environments.[17]
Namespace Name-based (Versions 3 and 5)
Namespace name-based UUIDs, designated as versions 3 and 5, are generated by applying a cryptographic hash function to a combination of a predefined namespace UUID and a unique name string, ensuring deterministic uniqueness within that namespace.[2] These versions provide a mechanism to create identifiers from human-readable names that are guaranteed to be unique as long as the name is unique within its specified namespace, making them suitable for applications requiring reproducible UUIDs, such as federated naming systems.[2] Unlike random or time-based UUIDs, the output is always the same for identical inputs, facilitating consistent identification across distributed systems without coordination.[2] The generation process begins with selecting a namespace UUID, which acts as a context for the name, followed by concatenating the namespace UUID—in network byte order—with the name encoded as a sequence of octets (using UTF-8 for strings).[2] For version 3, an MD5 hash is computed over this concatenation, yielding a 128-bit digest from which the UUID fields are derived: the first 32 bits form time_low, the next 16 bits time_mid, the following 16 bits populate time_hi_and_version (with the version bits set to 0011 binary, or 3 decimal), the subsequent 8 bits fill clock_seq_hi_and_reserved (with variant bits set to 10 binary), the next 8 bits clock_seq_low, and the final 48 bits the node field.[2] Version 5 follows an identical structure but uses a SHA-1 hash instead of MD5, with the version bits in time_hi_and_version set to 0101 binary (or 5 decimal); this substitution is recommended due to MD5's vulnerabilities, though neither version is intended for security-sensitive applications like credentials.[2] The resulting UUID adheres to the standard variant (the two most significant bits of octet 8 set to 10) and is converted to the appropriate byte order for representation.[2] RFC 4122 defines several predefined namespace UUIDs to standardize common use cases, including the DNS namespace
(6ba7b810-9dad-11d1-80b4-00c04fd430c8) for domain names, the URL namespace (6ba7b811-9dad-11d1-80b4-00c04fd430c8) for uniform resource locators, and the OID namespace (6ba7b812-9dad-11d1-80b4-00c04fd430c8) for object identifiers.[2] These namespaces enable interoperability, allowing different systems to independently generate the same UUID for the same name, thus supporting scenarios like naming resources in distributed directories or registries.[2]
Randomly Generated (Version 4)
Version 4 UUIDs are generated using random or pseudo-random numbers, providing a method for creating unique identifiers without reliance on timestamps or hardware addresses. This approach ensures uniqueness through high-entropy random bits, making it suitable for environments where deterministic generation is undesirable or impractical.[2] The structure of a Version 4 UUID follows the standard 128-bit layout, with specific fixed bits to indicate the version and variant. The version field, consisting of 4 bits (the high-order positions 12-15 of the 16-bit time_hi_and_version field), is set to the binary value 0100 to denote Version 4. The variant field, using 2 bits (the high-order positions 6-7 of the clock_seq_hi_and_reserved octet), is set to 10 to conform to the RFC 4122 variant specification. The remaining 122 bits are filled with random values, yielding 2^122 (approximately 5.3 × 10^36) possible UUIDs and effectively eliminating collision risks in practical applications.[2] Generation of Version 4 UUIDs requires a source of random numbers, preferably of cryptographic quality to maximize entropy and prevent predictability from poor seeding or algorithmic weaknesses. The process involves setting the fixed version and variant bits, then populating the other fields with random data: the 32-bit time_low field entirely random; the 16-bit time_mid field entirely random; the 12 least significant bits (0-11) of the time_hi_and_version field random; the 14-bit clock sequence (6 bits from clock_seq_hi_and_reserved positions 0-5 plus all 8 bits of clock_seq_low) random; and the 48-bit node field entirely random. This random placement across fields maintains compatibility with the UUID format while distributing entropy evenly.[2] A key advantage of Version 4 UUIDs is their independence from system clocks, avoiding synchronization issues common in time-based variants and enabling generation in offline or distributed systems without coordination.
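The bit-setting procedure just described can be sketched by drawing 16 cryptographically random octets and overwriting the fixed bits (os.urandom is assumed here as the entropy source):

```python
import os
import uuid

def random_uuid4() -> uuid.UUID:
    """Build a version-4 UUID by hand: 16 random octets, then overwrite
    the version nibble (0100) and the variant bits (10)."""
    b = bytearray(os.urandom(16))  # 128 cryptographically random bits
    b[6] = (b[6] & 0x0F) | 0x40    # version 4 in the high nibble of octet 6
    b[8] = (b[8] & 0x3F) | 0x80    # variant 10 in the high bits of octet 8
    return uuid.UUID(bytes=bytes(b))

u = random_uuid4()
assert u.version == 4 and u.variant == uuid.RFC_4122
```

In practice the standard library's uuid.uuid4() performs exactly this kind of construction; the sketch only makes the fixed-bit bookkeeping visible.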
Additionally, by eschewing timestamps and MAC addresses, they enhance privacy by not leaking temporal or hardware-specific information about the generating system. The RFC 4122 standard explicitly recommends this random method for scenarios prioritizing simplicity and security over reproducibility.[2]
Unix Timestamp with Random (Version 7)
UUID Version 7 (UUIDv7) is a time-ordered variant of the Universally Unique Identifier (UUID) standard, defined in RFC 9562, which incorporates a 48-bit Unix timestamp representing milliseconds since the Unix epoch (January 1, 1970, 00:00:00 UTC, excluding leap seconds) into its structure, alongside 4 bits for the version number (set to 0111 binary), a 12-bit field for randomness or a counter, and 62 bits of additional randomness, with the 2-bit variant field set to 10 binary to indicate RFC compliance.[17] The layout of UUIDv7 places the 48-bit timestamp in the most significant bits—occupying the positions of the legacy 32-bit time_low and 16-bit time_mid fields—to ensure that UUIDs generated in temporal sequence exhibit lexical sortability when represented as strings or binary values.[17] The version bits occupy the high-order 4 bits of the time_hi_and_version field, whose low-order 12 bits carry the random or counter data (rand_a), while the variant bits are placed in the high-order 2 bits of the clock_seq_hi_and_reserved octet; the remaining 62 random bits (rand_b) replace the traditional clock sequence and node identifier fields used in earlier time-based UUIDs.[17] This configuration, illustrated in the following bit-level breakdown, prioritizes temporal ordering in the initial 48 bits followed by randomized bits for uniqueness:
| Field | Bits | Description |
|---|---|---|
| unix_ts_ms | 0-47 | 48-bit Unix timestamp (ms) |
| ver | 48-51 | Version (7) |
| rand_a (or counter/sub-ms) | 52-63 | 12 bits random or monotonic counter |
| var | 64-65 | Variant (10) |
| rand_b | 66-127 | 62 bits random |
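The layout in the table above can be sketched directly, assuming a millisecond system clock and a random source for the rand_a and rand_b bits. The helper name uuid7_bytes is illustrative; a monotonic counter in rand_a, which this sketch omits, would additionally order UUIDs created within the same millisecond.

```python
import os
import time

def uuid7_bytes() -> bytes:
    """Illustrative UUIDv7: 48-bit Unix ms timestamp, then random bits."""
    ms = (time.time_ns() // 1_000_000) & ((1 << 48) - 1)
    b = bytearray(os.urandom(16))
    b[0:6] = ms.to_bytes(6, "big")  # unix_ts_ms in bits 0-47, big-endian
    b[6] = (b[6] & 0x0F) | 0x70     # version 7 in bits 48-51
    b[8] = (b[8] & 0x3F) | 0x80     # variant 10 in bits 64-65
    return bytes(b)
```

Because the timestamp leads and is big-endian, byte-wise and string comparisons of such UUIDs follow creation order across milliseconds.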
Custom (Version 8)
Version 8 UUIDs provide a flexible framework for custom identifier generation tailored to specific applications or vendors, where the standard layouts of other versions do not suffice. Defined in RFC 9562, this version reserves 122 bits for implementation-specific use while enforcing the version field to 8 (binary 1000 in bits 48-51) and the variant field to 10xx (bits 64-65 set to 10), ensuring basic compatibility with UUID parsing systems.[17] This approach allows embedding domain-specific data, such as sequence numbers, application metadata, or custom hashes, without conflicting with predefined structures in versions 1 through 7.[35] Implementations of version 8 UUIDs must fully document their custom layout to enable understanding and potential interoperability, as the RFC does not prescribe any particular algorithm beyond the fixed fields. The 128-bit structure allocates bits 0-47 (custom_a, 48 bits), bits 52-63 (custom_b, 12 bits), and bits 66-127 (custom_c, 62 bits) for user-defined content, leaving the version and variant bits to signal the custom nature.[36] Uniqueness is the responsibility of the implementer, who must ensure that the method used—whether time-based, random, or otherwise—avoids collisions within the intended scope, and the layout should not mimic patterns from other UUID versions to prevent misinterpretation.[35] For example, a custom version 8 UUID might incorporate a Unix timestamp in the initial bits followed by application-specific counters, as illustrated in RFC 9562 with the identifier 2489E9AD-2EE2-8E00-8EC9-32D5F69181C0, or use a SHA-256 hash of namespace and name data for deterministic generation, such as 5c146b14-3c52-8afd-938a-375d0df1fbf6.[37] These examples are illustrative only and not recommended for production without modification to suit the domain's needs. 
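As an illustration of such a custom layout, the sketch below packs a millisecond timestamp into custom_a and a 12-bit application tag into custom_b. Both the layout and the helper name uuid8_bytes are invented for this example and are not prescribed by RFC 9562.

```python
import os
import time

def uuid8_bytes(app_tag: int) -> bytes:
    """Hypothetical v8 layout: 48-bit ms timestamp | ver | 12-bit tag | var | random."""
    ts = (time.time_ns() // 1_000_000) & ((1 << 48) - 1)
    b = bytearray(16)
    b[0:6] = ts.to_bytes(6, "big")           # custom_a: millisecond timestamp
    b[6] = 0x80 | ((app_tag >> 8) & 0x0F)    # version 8 + top 4 bits of the tag
    b[7] = app_tag & 0xFF                    # low 8 bits of the tag (custom_b)
    rand = os.urandom(8)
    b[8] = 0x80 | (rand[0] & 0x3F)           # variant 10 + 6 random bits
    b[9:16] = rand[1:8]                      # custom_c: remaining random bits
    return bytes(b)
```

Per the RFC's guidance, a real deployment would have to document this layout publicly for any consumer expected to interpret the embedded fields.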
The RFC emphasizes that custom formats should prioritize uniqueness guarantees and rigorous testing, avoiding reliance on security properties like those in version 2.[38] A primary risk of version 8 UUIDs is diminished interoperability, as undocumented or proprietary layouts may render identifiers unusable across systems or lead to unintended collisions if uniqueness is not properly managed.[39] To mitigate this, the RFC recommends public documentation of algorithms and advises against using version 8 for scenarios requiring broad standardization, reserving it for controlled, application-specific environments.[35]
Encoding
Binary Representation
A universally unique identifier (UUID) is represented in binary form as a fixed-size 16-byte (128-bit) array, providing a compact and efficient means for storage and transmission across systems without introducing variable-length overhead. This binary format ensures interoperability in low-level operations, such as memory allocation or direct byte manipulation in programming languages.[1]

The byte order for this 16-byte array follows big-endian (most significant byte first, also known as network byte order) as specified in RFC 9562, particularly for timestamp-related fields like time_low, time_mid, and time_hi_and_version, where multi-byte values are serialized with the most significant octet first. The node identifier field is likewise transmitted in the order it appears on the network wire, maintaining consistency for cross-platform compatibility. However, implementations in Microsoft Windows APIs, such as the GUID structure, store multi-byte fields (Data1, Data2, and Data3) in little-endian order on little-endian architectures like x86, requiring conversion to big-endian when interfacing with network protocols or standards-compliant systems.[1][10]

In database systems, UUIDs are commonly stored using a BINARY(16) data type, preserving the exact 16 bytes without additional formatting or padding, which allows for efficient indexing and querying. In C programming environments, a typical representation is a structure like typedef unsigned char uuid_t[16];, treating the UUID as an opaque byte array to avoid endianness assumptions during local operations.[1]
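The Windows conversion amounts to reversing the bytes of the three little-endian fields while leaving the final eight bytes untouched. A sketch under the standard field widths (Data1: 4 bytes, Data2 and Data3: 2 bytes each; the function name is illustrative):

```python
def guid_le_to_rfc(b: bytes) -> bytes:
    """Convert a 16-byte Windows GUID (mixed-endian) to RFC big-endian order."""
    assert len(b) == 16
    # Reverse Data1 (bytes 0-3), Data2 (4-5), Data3 (6-7); bytes 8-15 are
    # already stored in wire order.
    return b[3::-1] + b[5:3:-1] + b[7:5:-1] + b[8:]
```

Because each step reverses a fixed-width field, applying the function twice returns the original bytes, so the same routine serves for both directions of the conversion.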
For transmission in network protocols, UUIDs are sent as raw 16-byte sequences in big-endian order without byte swapping or transformation, ensuring direct usability in headers or payloads; examples include Server Message Block (SMB) for file sharing and custom HTTP headers in distributed systems.[1]
To parse the binary representation and extract components like the version field, bit manipulation operations are applied directly to the byte array assuming the standard big-endian layout. For instance, the UUID version is obtained from the high nibble of the seventh byte (octet 6, zero-based indexing), corresponding to bits 12-15 of the time_hi_and_version field:
version = (bytes[6] >> 4) & 0x0F;
This approach enables quick validation and field access in performance-critical code, such as in cryptographic libraries or identifier generators.[1]
Textual Representation
The canonical textual representation of a UUID, as defined in RFC 9562, consists of 32 hexadecimal digits (using lowercase letters a–f) arranged in five groups separated by hyphens in the format 8-4-4-4-12: the first group contains 8 digits for the time-low field, followed by 4 digits each for time-mid, time-high-and-version, clock-seq-and-reserved plus clock-seq-low, and 12 digits for the node field.[1] For example, a typical UUID appears as 123e4567-e89b-12d3-a456-426614174000.[1] This format ensures human readability and interoperability across systems.[1]
A compact variant omits the hyphens, resulting in a continuous 32-hex-digit string, which is commonly used for storage efficiency or in contexts where brevity is prioritized, though it is not the canonical form specified by the RFC.[1] Uppercase hexadecimal letters are permitted on input for parsing but are not preferred for output, which should use lowercase; the RFC treats hexadecimal values as case-insensitive during processing.[1] When used as a Uniform Resource Name (URN), a UUID is prefixed with urn:uuid:, yielding forms like urn:uuid:123e4567-e89b-12d3-a456-426614174000.[1]
Validation of a UUID string typically involves verifying its length (36 characters with hyphens or 32 without), ensuring all characters are valid hexadecimal digits, and checking the variant and version identifiers embedded in specific positions.[1] The version nibble, located as the first hexadecimal digit of the third group (position 15 in the hyphenated string), must be 1, 3, 4, 5, 6, 7, or 8 to indicate one of the defined UUID versions.[1] Similarly, the variant bits, starting with the first digit of the fourth group (position 19), should match the RFC 9562 variant (binary 10xx, corresponding to hexadecimal 8, 9, a, or b) for compatibility.[1] Beyond these format checks, the RFC provides no mechanism to confirm a UUID's semantic validity, such as whether it has actually been generated and assigned to an object.[1]
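These checks on the hyphenated form can be collapsed into a single pattern. The regular expression below is one possible formulation, accepting version digits 1 through 8 and variant digits 8, 9, a, or b; it intentionally rejects the compact form and the nil and max special values, whose version nibbles fall outside that range.

```python
import re

UUID_PATTERN = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}"
    r"-[1-8][0-9a-fA-F]{3}"      # version nibble at string position 15
    r"-[89abAB][0-9a-fA-F]{3}"   # variant digit at string position 19
    r"-[0-9a-fA-F]{12}$"
)

def looks_like_uuid(s: str) -> bool:
    """Format-level check only; says nothing about whether the UUID is in use."""
    return UUID_PATTERN.fullmatch(s) is not None
```

A validator that must also accept the nil and max special forms would need to whitelist them explicitly alongside this pattern.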
Many programming libraries support parsing UUID strings into binary form, often accommodating both hyphenated and compact representations. For instance, the uuid_parse() function in the libuuid library (part of the util-linux package) converts a standard hyphenated string to a 128-bit binary UUID, expecting the exact 36-character format including hyphens and null terminator.[40]
Special Values
Nil UUID
The nil UUID is a special form of universally unique identifier defined in the standards for UUIDs, consisting of 128 bits all set to zero.[41] It serves as a reserved value to represent the absence of a UUID, analogous to a null or uninitialized state in data structures.[41] In textual representation, the nil UUID is expressed as 00000000-0000-0000-0000-000000000000.[41] According to RFC 9562, which obsoletes the earlier RFC 4122, this value is explicitly designated as the "nil UUID" and is not produced by any standard UUID generation algorithm.[41] Its variant field evaluates to 0 (following the NCS backward compatibility scheme due to the all-zero bits), and its version field is also 0, distinguishing it from versioned UUIDs.[42][43]
This nil UUID is commonly used in databases to indicate unassigned or optional identifiers, such as in PostgreSQL where the UUID type treats the all-zero value as a flag for an unknown or unset UUID, often inserted via functions like uuid_nil().[44] In programming environments, it represents uninitialized objects; for instance, Python's uuid module provides uuid.NIL as this constant for scenarios requiring a null UUID placeholder.[45] In APIs and data serialization formats like JSON, it denotes optional fields without a valid UUID, avoiding the need for separate null types while maintaining type consistency.[46] Such usage ensures clear signaling of absence without risking collision with generated UUIDs, as the nil value is explicitly reserved for implementation-specific null-like purposes.[41]
Maximum UUID
The maximum UUID, also known as the Max UUID, is a special value consisting of 128 bits all set to 1, represented in hexadecimal as ffffffff-ffff-ffff-ffff-ffffffffffff.[17] This value serves as the theoretical upper bound within the UUID namespace, contrasting with the nil UUID by representing a "full" state rather than an "empty" one.[17]
Defined in RFC 9562, the Max UUID adheres to the overall UUID format but carries a version number of 15 in its version bits (the four most significant bits of the seventh octet all set to 1), which is invalid for standard UUID versions 1 through 8 and reserved for future extensions.[17] Although not explicitly outlined in the earlier RFC 4122, it remains a valid UUID per the structural rules, as the specification does not prohibit all-ones configurations beyond defined variants.[2] In binary form, it is a continuous sequence of 128 ones, making it the largest possible 128-bit value expressible as a UUID.[17]
In practice, the Max UUID is rarely generated or encountered, as it is primarily reserved for specific system-level purposes rather than routine identification.[17] It functions as a sentinel value in scenarios requiring a 128-bit UUID placeholder where no valid identifier applies, such as denoting an invalid or uninitialized state in protocols or data structures.[17] Common contexts include overflow protection in UUID-based counters, where it signals the exhaustion of the identifier space, or as a reserved marker in technical specifications to avoid conflicts with assignable values.[17] For instance, database systems like Percona Server for MySQL provide functions to generate this value explicitly as the counterpart to the nil UUID for such sentinel roles.[47]
Implementations in programming languages further highlight its specialized role; the Python standard library's uuid module exposes it as uuid.MAX for programmatic use in boundary checks or defaults, while similar constants appear in Rust's uuid crate and Node.js's uuid package to represent the all-ones boundary.[45][48][49] Overall, its adoption emphasizes conceptual completeness in UUID ecosystems without implying routine generation, ensuring it does not collide with probabilistically unique identifiers.[17]
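Both special values can be constructed portably from their integer forms; a brief sketch (the constant names here are illustrative, though some library versions expose equivalents):

```python
import uuid

NIL_UUID = uuid.UUID(int=0)               # all 128 bits zero
MAX_UUID = uuid.UUID(int=(1 << 128) - 1)  # all 128 bits one
```

Explicit construction of this kind works on any Python version and avoids depending on whether the running interpreter ships the named constants.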
Collisions
Probability Calculations
The probability of collisions in UUIDs is analyzed using the birthday paradox approximation, which estimates the likelihood of at least one duplicate among n generated identifiers in a space of size N:

P(\text{collision}) \approx 1 - e^{-n^2 / (2N)}
where N = 2^m and m is the number of effective random bits. This formula provides a practical bound for collision risk across UUID versions, assuming uniform distribution and independence.[17]

For version 1 and version 6 UUIDs, collisions occur only if two identifiers share the exact 60-bit timestamp, 14-bit clock sequence, and 48-bit node identifier. Within a single timestamp slot, the effective space is the 62 bits of clock sequence plus node identifier (m = 62, N = 2^62), so duplicates are extremely unlikely unless more than about 2^31 UUIDs are generated in the same slot on nodes with identical node IDs, which is practically impossible. Globally, uniqueness is ensured by distinct timestamps and unique node identifiers, such as IEEE 802 MAC addresses, reducing the practical collision risk across distributed systems.[17]

Version 2 UUIDs, a legacy DCE security variant, follow a similar time-based structure to version 1 but replace the 48-bit node field with 32-bit POSIX UID and 32-bit GID fields, resulting in an effective space of 60-bit timestamp + 14-bit clock sequence + 64-bit UID/GID (m ≈ 78) within each slot, scoped to specific users or groups on a system. Uniqueness relies on distinct inputs, with collision risks comparable to version 1 but limited to the same system/user context; due to rarity of use, detailed probability analyses are uncommon.[50]

Version 4 UUIDs utilize 122 random bits (m = 122, N = 2^122), excluding the fixed version and variant fields. Under the birthday approximation, generating approximately 2^61 (about 2.3 quintillion) UUIDs yields a collision probability near 40% (the 50% point lies just above 2^61), while a much smaller set of 2^36 UUIDs carries a negligible probability on the order of 10^-16. This vast space makes collisions extremely unlikely in most applications.[17]

Version 3 UUIDs, based on MD5 hashing of a namespace and name, inherit MD5's known collision vulnerabilities, where practical attacks can produce distinct inputs with identical 128-bit outputs, though namespace scoping (e.g., DNS or URL) confines risks to specific domains and limits global impact. Version 5 UUIDs use SHA-1 hashing, which also suffers from demonstrated collisions (e.g., chosen-prefix attacks requiring feasible computation), but the same namespace constraints mitigate widespread uniqueness failures; both versions assume input uniqueness to avoid hash-based duplicates. NIST has deprecated MD5 due to these weaknesses and announced that SHA-1 should be phased out by December 31, 2030, for cryptographic uses.[17][51][52]

Version 7 UUIDs combine a 48-bit Unix timestamp (milliseconds since 1970) with 74 random bits (m = 74, N = 2^74), providing slightly reduced effective randomness compared to version 4 due to the fixed time component. Under the birthday approximation, generating approximately 2^37 version 7 UUIDs within the same millisecond would yield about a 40% chance of collision, though such volume in one millisecond is practically impossible. For realistic rates (e.g., millions per second), the risk remains negligible, given the time-ordered nature of the identifiers.[17]

Version 8 UUIDs are custom, with structure defined by the implementer, typically allocating 122 bits for implementation-defined data including random components (e.g., at least 74 random bits recommended). Collision probability depends on the effective random bits used (m up to 122); following RFC 9562 guidelines for sufficient entropy ensures risks similar to version 4, but poor implementations could increase vulnerabilities.[17]
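The approximation is easy to evaluate numerically; the helper below (an illustrative name, using expm1 for accuracy at tiny probabilities) reproduces the figures quoted above for versions 4 and 7.

```python
import math

def collision_probability(n: float, random_bits: int) -> float:
    """Birthday bound: P ~ 1 - exp(-n^2 / (2N)) with N = 2**random_bits."""
    return -math.expm1(-(n * n) / (2.0 * 2.0 ** random_bits))
```

For example, collision_probability(2**61, 122) and collision_probability(2**37, 74) both evaluate to about 0.39, since in each case n equals the square root of the space size.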
Mitigation Strategies
To minimize the risk of collisions in UUID generation, implementations must adhere to the specifications outlined in relevant standards, particularly for random and time-based variants. For Version 4 and Version 7 UUIDs, which rely heavily on randomness, using a cryptographically secure pseudorandom number generator (CSPRNG) is essential to ensure sufficient entropy and unguessability; weak pseudorandom number generators, such as the C standard library's rand() function, should be avoided, as they can lead to predictable sequences and increased collision probabilities. For Version 8, implementers must ensure adequate random bits and CSPRNG usage to match version 4 security levels.[1][53]
For Version 1 and Version 6 UUIDs, which incorporate timestamps and node identifiers, maintaining stable, monotonically increasing clocks is critical to prevent duplicates from clock rollbacks or low-resolution timing; if the clock regresses, the clock sequence must be incremented or randomized to maintain uniqueness. Unique node IDs, such as IEEE 802 MAC addresses, further reduce collision risks, but in their absence, a fallback to a randomly generated 48-bit node ID with the multicast bit set to 1 provides a viable alternative while preserving global uniqueness properties. Version 2 follows similar clock and sequence rules but scopes uniqueness via UID/GID.[1]
Version 3 and Version 5 UUIDs, being name-based, mitigate cross-domain collisions by scoping generations to predefined namespaces, such as the DNS namespace (UUID: 6ba7b810-9dad-11d1-80b4-00c04fd430c8), ensuring that identical names in different namespaces produce distinct UUIDs through hashing (MD5 for Version 3, SHA-1 for Version 5). Version 5 is preferred over Version 3 due to MD5's known vulnerabilities, though both rely on the uniqueness of the input name within its namespace to avoid collisions.[1]
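Python's standard uuid module implements both name-based versions together with the predefined namespaces, which makes the scoping behavior easy to observe:

```python
import uuid

# The same name hashed in the same namespace is fully deterministic...
a = uuid.uuid5(uuid.NAMESPACE_DNS, "python.org")
b = uuid.uuid5(uuid.NAMESPACE_DNS, "python.org")
# ...while the same name in a different namespace yields a distinct UUID.
c = uuid.uuid5(uuid.NAMESPACE_URL, "python.org")
```

This determinism is the point of the name-based versions: any party that knows the namespace and name can independently derive the identical UUID.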
Beyond version-specific measures, general best practices include generating UUIDs on demand during runtime rather than pre-allocating batches, which can introduce errors if not properly synchronized across distributed systems; in high-volume environments, such as databases handling millions of insertions, application-level monitoring for duplicates—via hashing indexes or periodic scans—is recommended to detect and handle any rare collisions promptly. Standard-compliant libraries facilitate these practices: the OSSP UUID library implements Versions 1, 3, 4, and 5 per RFC 4122 (updated in RFC 9562), using system-appropriate entropy sources for randomness, while Java's java.util.UUID class employs SecureRandom for randomUUID() to generate Version 4 UUIDs with cryptographic strength.[1][54][55]
Uses
Filesystems and Storage
In filesystems, universally unique identifiers (UUIDs) serve as persistent, hardware-independent labels for volumes and partitions, enabling reliable identification and mounting without reliance on volatile device paths. This approach facilitates seamless operation across diverse hardware configurations and prevents conflicts arising from device enumeration changes. UUIDs are typically generated randomly during filesystem creation, often adhering to version 4 of the UUID standard for high entropy and collision resistance.[56]

The GUID Partition Table (GPT), standardized in the UEFI specification, employs a 128-bit Disk GUID to uniquely identify the entire disk, including its header and associated storage. This GUID is generated randomly upon GPT initialization and stored in the GPT header at byte offset 56, serving as a disk signature that distinguishes it from other storage devices even if cloned. Partition entries in GPT also use UUIDs for type identification and unique partitioning, ensuring unambiguous recognition in bootloaders and operating systems.[56]

Linux filesystems like ext4 and XFS integrate UUIDs directly into their superblocks for volume identification. For ext4, the UUID is automatically generated as a random 128-bit value during filesystem creation with the mkfs.ext4 command, unless explicitly set via the -U option; this identifier is then referenced in /etc/fstab for stable mounting, decoupling the process from device names like /dev/sda1 that may shift due to hardware additions. Similarly, XFS generates a random UUID by default when formatted with mkfs.xfs, storable in the superblock and customizable with the -m uuid=value option, allowing consistent administration and mounting via tools like mount and xfs_admin. These UUIDs enable automated detection and configuration in environments with dynamic storage topologies.[57][58]
Microsoft's NTFS filesystem utilizes 128-bit GUIDs as Object IDs for volumes and files, assigned to metadata structures like the master file table (MFT) records and the volume root. These GUIDs, supported exclusively on NTFS volumes, facilitate secure identification and access, particularly in security descriptors where they link ownership and permissions without depending on file paths. The volume's Object ID acts as a persistent GUID for the entire filesystem, complementing the 64-bit volume serial number and enabling features like volume mount points via the mountvol command, which references volumes as \\?\volume\{GUID}\.[59][60]
Apple File System (APFS) organizes storage into containers, each identified by a unique 128-bit UUID that encapsulates multiple volumes sharing the same physical space. This container UUID plays a critical role in encryption, where keybags are encrypted using the UUID to enable rapid, secure erasure of contents by invalidating keys tied to the identifier. For snapshots, APFS leverages the container structure to manage point-in-time copies across volumes, with the UUID ensuring integrity and isolation during operations like cloning or rollback, as volumes within the container inherit contextual metadata from it.[61]
The adoption of UUIDs in these filesystems yields key advantages, including portability across hardware platforms, as identifiers remain constant regardless of port changes or system reconfiguration, thus simplifying migration and virtualization. They also mitigate naming conflicts in multi-disk setups by providing globally unique labels, reducing errors in mounting and data access while enhancing resilience in distributed or cloud environments.[62]
Databases and Identification
In relational databases, UUIDs serve as surrogate primary keys, providing globally unique identifiers without relying on sequential values generated by the database. For instance, PostgreSQL includes a native uuid data type that stores 128-bit UUIDs efficiently as binary values, making it suitable for primary keys in distributed environments where uniqueness across systems is essential.[63] This approach avoids the predictability of auto-incrementing integer IDs, which can expose sensitive information through enumeration attacks or reveal database growth patterns.[64]
Certain UUID versions enhance database operations involving time. Versions 1 and 6 incorporate timestamps, enabling temporal queries by extracting creation times directly from the identifier for filtering or ordering records based on when data was inserted.[1] In contrast, version 7 prioritizes sortability by placing a Unix timestamp in the most significant bits, improving index performance in B-tree structures for time-ordered data retrieval.[1]
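Because the leading 48 bits of a version 7 UUID are a millisecond count, the creation time can be read back without any database support. A small illustrative helper:

```python
import datetime

def uuid7_timestamp(u: str) -> datetime.datetime:
    """Decode the 48-bit unix_ts_ms field of a version 7 UUID string."""
    ms = int(u.replace("-", "")[:12], 16)  # first 12 hex digits = 48 bits
    return datetime.datetime.fromtimestamp(ms / 1000, tz=datetime.timezone.utc)
```

For the example value 017F22E2-79B0-7CC3-98C4-DC0C0C07398F given in RFC 9562, this decodes to 2022-02-22 19:22:22 UTC.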
In NoSQL databases, UUIDs offer alternatives to native identifier schemes. MongoDB defaults to ObjectIds, which are 12-byte values embedding timestamps for efficient indexing and sorting, but UUIDs provide stronger cross-system uniqueness at the cost of larger storage when encoded as binary (16 bytes versus ObjectId's compact form). Cassandra employs version 4 UUIDs—randomly generated for even data distribution—as partition keys, ensuring balanced load across nodes in clustered setups without hotspots from sequential patterns.[65]
For indexing, UUIDs are stored in binary format (16 bytes) in systems like PostgreSQL, which is more space-efficient than text representations but doubles the size of 8-byte integers like BIGINT, potentially increasing index bloat in high-volume tables.[63] Despite this, binary storage supports fast comparisons and hashing. In distributed transactions, UUIDs are generated client-side at insert time, ensuring consistency across replicas without central coordination and preventing ID conflicts during merges.[64]
Networking and Distributed Systems
In distributed networking and systems, universally unique identifiers (UUIDs) play a critical role in ensuring unambiguous identification of objects, sessions, and messages across heterogeneous environments, preventing conflicts in transient references without relying on centralized coordination. This is particularly valuable in protocols where objects or data must be referenced remotely, as UUIDs provide a 128-bit space that minimizes collision risks even in high-scale, decentralized scenarios.

In the Common Object Request Broker Architecture (CORBA) using the Internet Inter-ORB Protocol (IIOP), UUIDs form part of the object key within Interoperable Object References (IORs), enabling unique identification of distributed objects across ORBs. The DCE UUID format is specified for this purpose in IIOP profiles, allowing clients to invoke methods on remote objects without name resolution dependencies.[66] Similarly, Microsoft's Distributed Component Object Model (DCOM) employs GUIDs—equivalent to UUIDs—for interface marshaling, where the Interface Identifier (IID) uniquely specifies the COM interface being accessed, and the Causality Identifier (CID) tracks related call chains during remote activation and invocation.[67]

For modern web-based protocols, RESTful HTTP APIs frequently incorporate UUIDs in resource URLs to denote specific entities, such as /api/resources/{uuid}, which obscures sequential patterns and supports distributed generation without database coordination.[68] ETags for caching can also leverage UUIDs as opaque validators, ensuring efficient conditional requests by comparing resource versions across distributed caches. In gRPC with Protocol Buffers, UUIDs are typically encoded as fixed-length strings (e.g., 36 characters in hyphenated form) or 16-byte fields within message definitions to identify requests, responses, or session objects, facilitating reliable routing in microservices architectures.[69]
Custom UUID variants may be defined in application-specific protocols to incorporate network metadata, such as timestamps or node IDs, enhancing traceability in remoting scenarios.