
Double-precision floating-point format

The double-precision floating-point format, officially known as binary64 in the IEEE 754 standard, is a 64-bit binary interchange format for representing real numbers in computer systems, consisting of a 1-bit sign, an 11-bit biased exponent, and a 52-bit fraction with an implicit leading 1 for normalized values. This enables the approximation of a wide range of real numbers using scientific notation in binary, where the value is calculated as $(-1)^{\text{sign}} \times 2^{\text{exponent} - 1023} \times (1 + \text{fraction}/2^{52})$ for normalized numbers. Key characteristics include a precision of 53 bits for the significand, equivalent to approximately 16 decimal digits, allowing for high-fidelity representations in numerical computations. The exponent bias of 1023 supports a range from the smallest normalized positive value of about 2.225 × 10^-308 to the largest of approximately 1.798 × 10^308, with subnormal numbers extending the underflow range down to around 4.940 × 10^-324. The format also accommodates special values such as signed zeros, infinities, and Not-a-Number (NaN) payloads for handling exceptional conditions in arithmetic operations. Developed as part of the IEEE 754 standard, which was drafted starting in 1978 at the University of California, Berkeley under the leadership of William Kahan, the double-precision format has become the default representation for floating-point arithmetic in most modern processors and programming languages due to its balance of precision and range for scientific, engineering, and general-purpose computing. Subsequent revisions, including IEEE 754-2008 and IEEE 754-2019, have refined the standard while maintaining backward compatibility for binary64, ensuring its widespread adoption in hardware such as x86, ARM, and GPU architectures.

Overview and Fundamentals

Definition and Purpose

Double-precision floating-point format, also known as binary64 in the IEEE 754 standard, is a 64-bit computational representation designed to approximate real numbers using a sign bit, an 11-bit exponent, and a 52-bit fraction (significand). This structure allows for the encoding of a wide variety of numerical values in binary scientific notation, where the number is expressed as $\pm(1 + \text{fraction}) \times 2^{\text{exponent} - \text{bias}}$, providing a normalized form for most representable values. The format was standardized in IEEE 754-1985 to ensure consistent representation and arithmetic operations across different computer systems, promoting portability in software and hardware implementations.

The primary purpose of the double-precision format is to achieve a balance between numerical range and precision suitable for scientific simulations, statistical analyses, and general-purpose calculations that require higher accuracy than single-precision alternatives. It supports a dynamic range of approximately 10^-308 to 10^308, enabling the representation of extremely large or small magnitudes without excessive loss of detail, and offers about 15 decimal digits of precision due to the 53-bit effective significand (including the implicit leading 1). This is sufficient for most applications where relative accuracy matters more than absolute exactness, such as physics modeling or financial computations.

Compared to fixed-point or integer formats, double-precision floating point significantly reduces the risks of overflow and underflow by dynamically adjusting the binary point through the exponent, allowing seamless handling of scales from subatomic to astronomical without manual rescaling. This makes it indispensable for iterative algorithms and data processing where input values vary widely, ensuring computational stability and efficiency in diverse fields.

Historical Development and Standardization

The development of double-precision floating-point formats emerged in the 1950s and 1960s amid the growing need for mainframe computers to handle scientific and engineering computations requiring greater numerical range and accuracy than single-precision or integer formats could provide. The IBM 704, introduced in 1954 as the first mass-produced computer with dedicated floating-point hardware, supported single-precision (36-bit) and double-precision (72-bit) binary formats, enabling more reliable processing of complex mathematical operations in fields like physics and engineering. Subsequent systems, such as the IBM 7094 in 1962, expanded these capabilities with hardware support for double-precision operations and additional index registers, further solidifying floating-point arithmetic as essential for scientific computing.

Pre-IEEE implementations varied significantly, contributing to interoperability challenges, but several influenced the eventual standard. The DEC PDP-11 series, launched in 1970, featured a 64-bit double-precision floating-point format with an 8-bit exponent, 55-bit significand, and hidden-bit normalization, which bore close resemblance to the later IEEE design despite differences in exponent bias (129 versus 1023) and handling of subnormals; this format's structure informed discussions on binary representation during standardization efforts.

In the 1970s, divergent floating-point implementations across architectures led to severe portability issues, such as inconsistent rounding behaviors and unreliable results (e.g., distinct nonzero values yielding zero under subtraction), inflating software costs and limiting numerical reliability. To address this "floating-point anarchy," the IEEE formed the Floating-Point Working Group in 1977 under the Microprocessor Standards Subcommittee, with initial meetings in November of that year; William Kahan, a key consultant to Intel and co-author of influential drafts, chaired efforts that balanced precision, range, and implementation needs from industry stakeholders. The resulting IEEE 754-1985 standard formalized the 64-bit binary double-precision format (binary64), specifying a 1-bit sign, 11-bit biased exponent, and 52-bit fraction (with implicit leading 1), thereby establishing a portable foundation for floating-point arithmetic in binary systems.

The core binary64 format has remained stable through subsequent revisions, which focused on enhancements rather than fundamental changes. IEEE 754-2008 introduced the fused multiply-add (FMA) operation, which computes $x \times y + z$ with a single rounding step to minimize error accumulation, along with decimal formats and clarifications to the original specification, while preserving the unchanged structure of binary64 for backward compatibility. The 2019 revision (IEEE 754-2019) delivered minor bug fixes, clarified operations like augmented addition, and ensured upward compatibility, without altering the binary64 specification, maintaining ecosystem stability.

Adoption accelerated in the late 1980s, driven by hardware implementations that embedded the standard into mainstream processors. The Intel 80387 coprocessor, released in 1987 as a numeric processor extension for the 80386, provided full support for IEEE 754 double-precision arithmetic, including 64-bit operations with 53-bit precision, enabling widespread use in x86-based personal computers and scientific applications by the 1990s.

IEEE 754 Binary64 Specification

Bit Layout and Components

The double-precision floating-point format, designated as binary64 in the IEEE 754 standard, employs a fixed 64-bit structure to encode real numbers, balancing range and precision for computational applications. This layout divides the 64 bits into three primary fields: a sign bit, an exponent field, and a fraction field. The sign bit occupies the most significant position (bit 63), with a value of 0 denoting a positive number and 1 indicating a negative number. The exponent field spans the next 11 bits (bits 62 through 52), serving as an unsigned biased integer that scales the overall magnitude. The fraction field, also known as the trailing significand or mantissa, comprises the least significant 52 bits (bits 51 through 0), capturing the binary digits that define the number's precision.

For normalized numbers, the represented value is given by the formula

$(-1)^s \times \left(1 + \frac{f}{2^{52}}\right) \times 2^{e - 1023}$

where $s$ is the sign bit (0 or 1), $f$ is the fraction field interpreted as a 52-bit unsigned integer, and $e$ is the exponent field interpreted as an 11-bit unsigned integer. The sign bit straightforwardly controls the number's polarity, the exponent adjusts the binary scale factor, and the fraction encodes the digits after the radix point. In normalized form, the significand assumes an implicit leading 1 (the "hidden bit") before the explicit 52 bits, yielding a total of 53 bits of precision. This hidden-bit convention ensures efficient use of storage by omitting the redundant leading 1 in normalized representations. The 64-bit structure, with its 53-bit effective significand precision, supports approximately 15.95 decimal digits of accuracy, calculated as $\log_{10}(2^{53})$. The field layout is summarized in the following table.
| Field | Bit Positions | Width (bits) | Purpose |
|---|---|---|---|
| Sign | 63 | 1 | Polarity (0: positive, 1: negative) |
| Exponent | 62–52 | 11 | Scaling factor (biased) |
| Significand | 51–0 | 52 | Precision digits (with hidden bit) |
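As a concrete illustration, the following C sketch extracts the three fields from a double at runtime. It assumes the platform's double is the IEEE 754 binary64 type, which holds on essentially all modern systems; the shifts and masks follow the table above.

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    double x = -6.25;                /* -1.1001b * 2^2 */
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);  /* reinterpret the bytes portably */

    uint64_t sign     = bits >> 63;                 /* bit 63 */
    uint64_t exponent = (bits >> 52) & 0x7FF;       /* bits 62-52 */
    uint64_t fraction = bits & 0xFFFFFFFFFFFFFULL;  /* bits 51-0 */

    printf("sign = %llu\n", (unsigned long long)sign);
    printf("biased exponent = %llu (true exponent %lld)\n",
           (unsigned long long)exponent, (long long)exponent - 1023);
    printf("fraction = 0x%013llX\n", (unsigned long long)fraction);
    return 0;
}
```

For -6.25 this prints sign 1, biased exponent 1025 (true exponent 2), and fraction 0x9000000000000, matching -1.1001₂ × 2².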

Exponent Encoding and Bias

In the IEEE 754 binary64 format, the exponent is represented by an 11-bit field that stores a biased value to accommodate both positive and negative exponents using an unsigned encoding. This bias mechanism adds a fixed offset to the true exponent, allowing the field to range from 0 to 2047 while mapping to effective exponents that span negative and positive values. The bias value for binary64 is 1023, calculated as $2^{11-1} - 1$. The true exponent $e$ is obtained by subtracting the bias from the encoded exponent $E$:

$e = E - 1023$

This formula enables the representation of normalized numbers with exponents ranging from -1022 to +1023. For normalized values, $E$ ranges from 1 to 2046, ensuring an implicit leading bit of 1 and full precision. When the exponent field is all zeros ($E = 0$), it denotes subnormal (denormalized) numbers, providing gradual underflow toward zero rather than abrupt flushing. In this case, the effective exponent is fixed at -1022, and the value is computed as $(-1)^s \times (0 + f/2^{52}) \times 2^{-1022}$, where $s$ is the sign bit and $f$ is the 52-bit fraction field, so that $f/2^{52}$ is a value less than 1. The all-ones exponent field ($E = 2047$) is reserved for special values (infinities and NaNs).

The use of a bias allows the exponent to be stored as an unsigned integer, supporting signed effective exponents without the complexities of two's-complement representation, such as asymmetric ranges or additional hardware for sign handling. This design choice simplifies comparisons of floating-point magnitudes by treating the biased exponents as unsigned values and promotes efficient implementation across diverse hardware architectures.

Significand Representation and Normalization

In the IEEE 754 binary64 format, the significand, also known as the mantissa, consists of 52 explicitly stored bits in the trailing significand field, augmented by an implicit leading bit of 1 for normalized numbers, resulting in an effective precision of 53 bits. This design allows the significand to represent values in the range [1, 2) in binary, where the explicit bits capture the fractional part following the implicit integer bit. The choice of 53 bits provides a relative precision of approximately $2^{-53} \approx 1.11 \times 10^{-16}$ for numbers near 1, enabling the representation of about 15 to 17 significant decimal digits.

Normalization ensures that the significand is adjusted to have its leading bit equal to 1, maximizing the use of available bits for precision. During the process, the significand of a number is shifted left or right until the most significant bit is 1, with corresponding adjustments to the exponent to maintain the overall value; the implicit leading 1 is not stored, freeing up space for additional fraction bits. For a normalized binary64 number, the value is thus given by $1.f \times 2^{e}$, where $f$ is the 52-bit fraction and $e$ is the unbiased exponent (obtained from the biased encoding described above). This applies to all finite nonzero numbers except subnormals, ensuring consistent precision across the representable range.

Denormalized (or subnormal) numbers represent values smaller than the smallest normalized number, extending the range toward zero without abrupt underflow to zero. In this case, the exponent field is set to zero, and there is no implicit leading 1; instead, the value is interpreted as $0.f \times 2^{-1022}$, where $f$ is the 52-bit fraction, resulting in reduced precision that gradually decreases as more leading zeros appear in the significand. This mechanism fills the gap between zero and the minimum normalized value of $2^{-1022}$, with the smallest positive subnormal being $2^{-1074}$.
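The normalized and subnormal formulas can be checked mechanically. The sketch below, assuming IEEE 754 binary64 doubles and the C math library, rebuilds a double from its raw fields using $1.f \times 2^{E-1023}$ for normalized inputs and $0.f \times 2^{-1022}$ for subnormals (infinities and NaNs are omitted for brevity):

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

/* Rebuild a finite double from its sign, exponent, and fraction fields. */
static double rebuild(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    int sign = (int)(bits >> 63);
    int E    = (int)((bits >> 52) & 0x7FF);
    /* f / 2^52, computed exactly: the numerator fits in 53 bits and the
       divisor is a power of two. */
    double f = (double)(bits & 0xFFFFFFFFFFFFFULL) / 4503599627370496.0;

    double mag = (E == 0)
        ? ldexp(f, -1022)            /* subnormal: 0.f * 2^-1022 */
        : ldexp(1.0 + f, E - 1023);  /* normalized: 1.f * 2^(E-1023) */
    return sign ? -mag : mag;
}

int main(void) {
    double tests[] = { 3.5, 0.1, 5e-324, -2.5 };
    for (int i = 0; i < 4; i++)
        printf("%.17g rebuilt exactly: %d\n",
               tests[i], rebuild(tests[i]) == tests[i]);
    return 0;
}
```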

Special Values and Exceptions

In the IEEE 754 binary64 format, zero is encoded with an exponent field of all zeros and a significand of all zeros, where the sign bit determines positive zero (+0) or negative zero (-0). Signed zeros are preserved in arithmetic operations and are significant in contexts like division by zero, where they affect the sign of the resulting infinity (e.g., $1/{+0} = +\infty$ and $1/{-0} = -\infty$). This distinction ensures consistent handling of directional limits and branch cuts in complex arithmetic.

Positive and negative infinity are represented by setting the exponent field to all ones (biased value 2047) and the significand to all zeros, with the sign bit specifying the direction. These values arise from operations like overflow, where a result exceeds the largest representable finite number (approximately 1.8 × 10^308), or division of a nonzero number by zero, producing $+\infty$ or $-\infty$ based on the operand signs. Infinity propagates through most arithmetic operations, such as $\infty + 5 = \infty$, maintaining the expected mathematical behavior while signaling potential issues.

Not-a-Number (NaN) values indicate indeterminate or invalid results and are encoded with the exponent field set to 2047 and a non-zero significand. NaNs are categorized as quiet NaNs (qNaNs), which have the most significant bit of the significand set to 1 and silently propagate through operations without raising exceptions, or signaling NaNs (sNaNs), which have that bit set to 0 and trigger the invalid operation exception upon use. The remaining 51 bits of the significand serve as a payload, allowing implementations to embed diagnostic information, such as the operation that generated the NaN, in line with IEEE 754 requirements for NaN propagation in arithmetic (e.g., $\text{NaN} + 3 = \text{NaN}$).

IEEE 754 defines five floating-point exceptions to handle edge cases, with default results that maintain computational continuity: the invalid operation exception, triggered by operations like $\sqrt{-1}$ or $0/0$, yields a NaN; division by zero produces infinity; overflow rounds to infinity; underflow delivers a denormalized number or zero when the result is subnormal; and inexact indicates rounding occurred, though it does not alter the result. Implementations may enable traps for these exceptions, but the standard mandates non-trapping default behavior to ensure portability.
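The behavior of signed zeros, infinities, and NaNs can be observed directly in C, as in this sketch (assuming IEEE 754 semantics, which conforming implementations advertise via the __STDC_IEC_559__ macro):

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    double pz = 0.0, nz = -0.0;
    printf("1/+0 = %g, 1/-0 = %g\n", 1.0 / pz, 1.0 / nz); /* inf, -inf */
    printf("+0 == -0: %d\n", pz == nz);                    /* 1: equal */

    double inf = INFINITY;
    printf("inf + 5 = %g\n", inf + 5.0);                   /* inf propagates */

    double nan = 0.0 / 0.0;   /* invalid operation: yields a quiet NaN */
    printf("nan == nan: %d, isnan(nan): %d\n", nan == nan, isnan(nan));
    return 0;
}
```

Note that the divisions raise the division-by-zero and invalid flags, but under the standard's non-trapping default behavior the program simply continues with the special values shown.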

Representation and Conversion

Converting Decimal to Binary64

Converting a decimal number to the binary64 format involves decomposing the number into its sign, exponent, and significand components according to the IEEE 754 standard. For positive decimal numbers, the process begins by separating the integer and fractional parts. The integer part is converted to binary by repeated division by 2, recording the remainders as bits from least to most significant. The fractional part is converted by repeated multiplication by 2, recording the integer parts (0 or 1) until the fraction terminates or a sufficient number of bits (up to 53 for binary64) is obtained. The combined binary representation is then normalized to the form $1.m \times 2^e$, where $m$ is the fractional significand and $e$ is the unbiased exponent, by shifting the binary point left or right as needed. The significand field is taken as the first 52 bits after the leading 1 (implicit in normalized form), with rounding applied if more bits are generated. The exponent $e$ is biased by adding 1023 to produce the 11-bit encoded exponent field. For negative numbers, the process follows the same steps on the absolute value, after which the sign bit is set to 1.

Consider the decimal number 3.5 as an illustrative outline. The integer part 3 is 11 in binary, and the fraction 0.5 is 0.1 in binary, yielding 11.1. Normalizing gives $1.11 \times 2^1$, so the significand field is 11 followed by 50 zeros, the unbiased exponent is 1, and the biased exponent is $1 + 1023 = 1024$, encoded in binary as 10000000000.

The reverse conversion from binary64 to decimal extracts the sign bit to determine positivity or negativity. The biased exponent is unbiased by subtracting 1023, and the significand is reconstructed as $1.f$, where $f$ is the 52-bit fraction. The value is then $(-1)^s \times 1.f \times 2^{e-1023}$, which can be converted to decimal by scaling and summing powers of 2, though software typically handles the final decimal output. Not all decimal numbers are exactly representable in binary; for instance, 0.1 requires an infinitely repeating fraction (0.0001100110011...), which is rounded to the nearest representable value, introducing a small approximation error.
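The worked example can be verified by inspecting the stored bit pattern, as in this short sketch (assuming binary64 doubles):

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    double x = 3.5;   /* 1.11b * 2^1: biased exponent 1024, fraction 11000... */
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    printf("3.5 -> 0x%016llX\n", (unsigned long long)bits); /* 0x400C000000000000 */
    printf("exponent field = %llu\n",
           (unsigned long long)((bits >> 52) & 0x7FF));     /* 1024 */

    /* 0.1 has no finite binary expansion, so the stored value is rounded: */
    printf("0.1 stored as %.20f\n", 0.1); /* 0.10000000000000000555... */
    return 0;
}
```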

Endianness in Memory Storage

The IEEE 754 standard for floating-point arithmetic, including the double-precision (binary64) format, defines the logical bit layout (1 sign bit, 11 exponent bits, and 52 fraction bits) but does not specify the byte order in which these 64 bits are stored in memory. This omission allows implementations to vary based on the underlying hardware architecture, leading to differences in how binary64 values are serialized into the 8 bytes of memory they occupy.

In big-endian storage, common in network protocols and certain processors, the most significant byte is placed at the lowest memory address, followed by successively less significant bytes. Conversely, little-endian storage, prevalent in x86 architectures, places the least significant byte first. For example, the double-precision representation of the value 1.0 has the 64-bit pattern 0x3FF0000000000000, where the sign bit is 0, the biased exponent is 0x3FF (1023 in decimal), and the fraction is 0 (with an implied leading 1). In big-endian memory, this appears as the byte sequence 3F F0 00 00 00 00 00 00; in little-endian, it is reversed to 00 00 00 00 00 00 F0 3F.

These variations create portability challenges, particularly when transmitting binary64 data across systems or networks, where big-endian is the conventional order (as established in protocols like TCP/IP). Programmers must use byte-swapping functions, analogous to htonl/ntohl for integers but extended to 64-bit floats (e.g., via unions or explicit swaps), to ensure correct interpretation on heterogeneous platforms. Failure to account for endianness can result in corrupted values, as a little-endian system's output read on a big-endian system would reinterpret the bytes in reverse order.

Platform-specific behaviors further highlight these differences: x86 processors (from both Intel and AMD) universally employ little-endian storage for binary64 values, while PowerPC architectures default to big-endian but support a little-endian mode configurable at runtime. ARM processors, used in many embedded and mobile systems, also support both modes, with the double-precision format split into two 32-bit words: in big-endian mode, the most significant word (containing the sign, exponent, and high fraction bits) precedes the least significant word; the reverse holds in little-endian mode. Some processor families offer bi-endian flexibility, allowing software to select the byte order for compatibility in diverse environments.
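A short C sketch (assuming binary64 doubles) makes the host byte order visible by examining the bytes of 1.0, whose bit pattern is 0x3FF0000000000000:

```c
#include <stdio.h>
#include <string.h>

int main(void) {
    double x = 1.0;
    unsigned char b[sizeof x];
    memcpy(b, &x, sizeof x);

    printf("bytes of 1.0 in memory:");
    for (size_t i = 0; i < sizeof x; i++)
        printf(" %02X", (unsigned)b[i]);
    printf("\n");
    /* little-endian hosts: 00 00 00 00 00 00 F0 3F
       big-endian hosts:    3F F0 00 00 00 00 00 00 */

    printf("host stores doubles %s-endian\n",
           b[0] == 0x3F ? "big" : "little");
    return 0;
}
```

Serializing in one fixed order (e.g., most significant byte first) and swapping on mismatched hosts avoids the corruption described above.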

Double-Precision Numerical Examples

The double-precision floating-point format, as defined by the IEEE 754 standard, encodes real numbers using a 1-bit sign, 11-bit biased exponent, and 52-bit significand, allowing exact representations of many values but only approximations of others due to the constraints of binary encoding. To illustrate this encoding, consider the representation of the decimal value 1.0, a normalized value with an unbiased exponent of 0 and an implicit leading bit of 1 followed by all zeros. For 1.0, the sign bit is 0 (positive), the biased exponent is 1023 (binary 01111111111, as the bias is 1023 for double precision), and the significand bits are all 0 (representing 1.0 exactly). This yields the 64-bit pattern:
0 01111111111 0000000000000000000000000000000000000000000000000000
In hexadecimal, this is 0x3FF0000000000000. A bit-by-bit breakdown confirms the structure: the first bit (sign) is 0; the next 11 bits (exponent) are 01111111111 (decimal 1023); and the remaining 52 bits (significand) are all 0.

Another example is the decimal value 0.1, which cannot be represented exactly in binary because it has a repeating expansion (0.0001100110011...), leading to rounding in the significand. Its double-precision encoding uses sign bit 0, biased exponent 01111111011 (decimal 1019, unbiased -4), and the rounded significand 1001100110011001100110011001100110011001100110011010 (the implicit leading 1 makes the stored value approximately 0.1). The hexadecimal value is 0x3FB999999999999A, demonstrating the inexactness imposed by binary representation.

The approximation of π (3.141592653589793) provides a normalized example with a non-zero fraction after rounding to 53 bits of precision. Here, the sign bit is 0, the biased exponent is 1024 (binary 10000000000, unbiased 1), and the significand is rounded to 1001001000011111101101010100010001000010110100011000 (with implicit leading 1). This results in the hexadecimal value 0x400921FB54442D18.

Finally, the smallest positive denormalized number, 2^-1074 (approximately 4.940 × 10^-324), uses exponent bits of all 0 (interpreted as an effective exponent of -1022 for denormals) and a significand with only the least significant bit set to 1, representing the tiniest non-zero value before underflow to zero. Its binary representation is all zeros except the final bit:
0 00000000000 0000000000000000000000000000000000000000000000000001
In hexadecimal, this is 0x0000000000000001.
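Each of these bit patterns can be reproduced with a few lines of C (assuming binary64 doubles):

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

static void show(const char *label, double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    printf("%-8s -> 0x%016llX\n", label, (unsigned long long)bits);
}

int main(void) {
    show("1.0", 1.0);                /* 0x3FF0000000000000 */
    show("0.1", 0.1);                /* 0x3FB999999999999A */
    show("pi", 3.141592653589793);   /* 0x400921FB54442D18 */
    show("min sub", 5e-324);         /* 0x0000000000000001 (2^-1074) */
    return 0;
}
```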

Precision and Limitations

Exact Representation of Integers

In the IEEE 754 binary64 format, also known as double-precision floating point, all integers with absolute values up to and including $2^{53}$ (exactly 9,007,199,254,740,992) can be represented exactly without any error. This range arises because the 53-bit significand, including the implicit leading 1, provides sufficient precision to encode every integer in this interval as a normalized floating-point number with no rounding. For any integer $x$ satisfying $|x| \leq 2^{53}$, the representation is unique and precise, ensuring that conversion to and from this format preserves the exact value.

Beyond $2^{53}$, not all integers are exactly representable due to the fixed 53-bit significand length, which limits the density of representable values at higher magnitudes. However, certain larger integers remain exact if their binary representation fits within the 53-bit significand after normalization; for instance, powers of 2 such as $2^{60}$ are exactly representable because they require only a single significand bit (the implicit 1) and an appropriate exponent. In contrast, $2^{60} + 1$ cannot be exactly represented, as it would require an additional bit beyond the significand's capacity, leading to rounding to the nearest representable value, which is $2^{60}$.

For magnitudes greater than $2^{53}$, the unit in the last place (ulp) increases with the exponent, meaning representable integers are spaced by multiples of $2^{e-52}$, where $e$ is the unbiased exponent corresponding to the number's scale. Thus, in the range from $2^{53}$ to $2^{54}$, only even integers are representable, and the spacing doubles with each subsequent power-of-2 interval, skipping more values as the magnitude grows. This behavior ensures that while some sparse integers remain exact within the overall exponent range (up to approximately 1.8 × 10^308), dense integer sequences cannot be preserved without loss.

In programming contexts, this limit implies that integer computations in double precision are safe and exact for values up to $2^{53}$, beyond which developers must resort to integer types, big-integer libraries, or explicit range checks to avoid unintended rounding during arithmetic operations or storage. For example, adding 1 to $2^{53}$ in double precision yields no change, highlighting the need for awareness in applications involving large counters or identifiers.
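The boundary is easy to demonstrate in C (assuming binary64 doubles):

```c
#include <stdio.h>

int main(void) {
    double big = 9007199254740992.0;        /* 2^53, exactly representable */
    printf("2^53     = %.1f\n", big);
    printf("2^53 + 1 = %.1f\n", big + 1.0); /* rounds back to 2^53 */
    printf("2^53 + 2 = %.1f\n", big + 2.0); /* representable: even integer */
    printf("big + 1.0 == big: %d\n", big + 1.0 == big); /* prints 1 */
    return 0;
}
```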

Rounding Errors and Precision Loss

In double-precision floating-point arithmetic, as defined by the IEEE 754 standard, computations may introduce inaccuracies due to the finite precision of the 53-bit significand (including the implicit leading 1). These inaccuracies arise primarily from rounding during operations like addition, subtraction, multiplication, and division, where the exact result cannot be represented in the binary64 format. The standard specifies four rounding modes to control how inexact results are handled: round to nearest (the default, which rounds to the closest representable value, with ties to even), round toward zero (truncation), round toward positive infinity, and round toward negative infinity.

To enable accurate rounding in hardware implementations, floating-point units typically employ extra bits beyond the 52 stored significand bits: a guard bit (capturing the first bit discarded during normalization or alignment), a round bit (the next bit), and a sticky bit (an OR of all remaining lower-order bits). These guard, round, and sticky bits (collectively GRS) allow the rounding decision to consider the full precision of intermediate results, limiting the rounding error to at most half a unit in the last place (ulp). For instance, without these bits, alignment shifts in addition could produce larger errors, but their use ensures compliance with IEEE 754's requirement for correctly rounded operations in the default mode.

Representation errors occur when a value cannot be exactly encoded in binary64, such as the fraction 1/3, which is 0.333... in decimal but requires an infinite binary expansion (0.010101...₂), resulting in the closest representable value of about 0.33333333333333331. Subtraction can exacerbate errors through catastrophic cancellation, where two nearly equal numbers are subtracted, yielding a result with significantly reduced significant digits; for example, subtracting two close approximations can amplify relative errors by orders of magnitude due to the loss of leading digits.

The unit roundoff, denoted ε and equal to 2^-53 ≈ 1.11 × 10^-16, represents the maximum relative rounding error for a single operation in round-to-nearest mode, bounding the error by ε/2. In multi-step computations, such as summing terms in a series approximation for π (e.g., the Leibniz formula π/4 = 1 - 1/3 + 1/5 - ...), errors accumulate: each addition introduces a new rounding error of up to ε/2 times the current partial sum, leading to a total error that grows roughly linearly with the number of terms, potentially reaching O(nε) for n additions in naive summation. Techniques like compensated summation can mitigate this growth (see the sketch below), but unoptimized implementations may lose several digits of precision over many operations.
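As a sketch of the compensated summation mentioned above, the classic Kahan algorithm carries a running correction term; the Leibniz series for π/4 is used here purely as an illustration of a long alternating sum:

```c
#include <stdio.h>

/* Kahan (compensated) summation of the first n Leibniz terms. */
static double leibniz_pi(int n) {
    double sum = 0.0, c = 0.0;      /* c accumulates lost low-order bits */
    for (int k = 0; k < n; k++) {
        double term = (k % 2 ? -1.0 : 1.0) / (2.0 * k + 1.0);
        double y = term - c;        /* apply the stored correction */
        double t = sum + y;         /* low-order bits of y are lost here */
        c = (t - sum) - y;          /* recover exactly what was lost */
        sum = t;
    }
    return 4.0 * sum;
}

int main(void) {
    /* The remaining error is dominated by series truncation, not rounding. */
    printf("pi approx, 1e6 terms: %.15f\n", leibniz_pi(1000000));
    return 0;
}
```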

Comparison to Single-Precision Format

The double-precision floating-point format, also known as binary64 in the IEEE 754 standard, utilizes 64 bits to represent a number: 1 sign bit, 11 exponent bits, and 52 significand bits, providing an effective significand of 53 bits (including the implicit leading 1 for normalized numbers). In contrast, the single-precision format, or binary32, employs 32 bits with 1 sign bit, 8 exponent bits, and 23 significand bits, yielding 24 effective significand bits. This structural expansion gives double precision far greater representational capacity than single precision.

Regarding range, double precision accommodates normalized numbers from approximately $2^{-1022}$ (about 2.23 × 10^-308) to $(2 - 2^{-52}) \times 2^{1023}$ (about 1.80 × 10^308), far exceeding the single-precision range of approximately $2^{-126}$ (about 1.18 × 10^-38) to $(2 - 2^{-23}) \times 2^{127}$ (about 3.40 × 10^38). The broader exponent field in double precision (11 bits versus 8) enables handling of much larger and smaller magnitudes without overflow or underflow in applications requiring extensive scales.

In terms of precision, double precision supports roughly 15 to 16 significant decimal digits, compared to 7 to 8 digits for single precision, because the additional significand bits minimize relative errors. This enhanced precision is particularly beneficial in iterative algorithms, where accumulated errors in single precision can lead to significant divergence, whereas double precision maintains accuracy over more iterations by reducing the impact of each step.

Double precision serves as the default for most scientific and engineering computations demanding accuracy, such as simulations and numerical analysis, while single precision is favored in graphics rendering and machine-learning inference to conserve memory and boost computational throughput. A key property of conversions between the formats is that widening a single-precision value to double precision preserves the value exactly, since the significand is padded with zeros, whereas the reverse conversion may introduce rounding error if the value exceeds single precision's capabilities, as illustrated below the table.
| Aspect | Single Precision (binary32) | Double Precision (binary64) |
|---|---|---|
| Total bits | 32 | 64 |
| Significand bits | 23 (24 effective) | 52 (53 effective) |
| Exponent bits | 8 | 11 |
| Approx. decimal digits | 7–8 | 15–16 |
| Normalized range | $\approx 10^{-38}$ to $10^{38}$ | $\approx 10^{-308}$ to $10^{308}$ |
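The asymmetry of the conversions noted above can be seen directly in this sketch (assuming IEEE 754 float and double):

```c
#include <stdio.h>

int main(void) {
    float  f = 0.1f;            /* nearest binary32 to 0.1 */
    double d = (double)f;       /* exact widening: zero-padded significand */
    printf("0.1f widened:  %.17g\n", d);          /* 0.10000000149011612 */

    double pi = 3.141592653589793;
    float  pf = (float)pi;      /* narrowing rounds to 24 significand bits */
    printf("pi narrowed:   %.17g\n", (double)pf); /* 3.1415927410125732 */
    return 0;
}
```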

Performance and Hardware

Execution Speed in Arithmetic Operations

The execution speed of double-precision floating-point arithmetic operations varies significantly with the operation type and hardware platform. On modern CPUs such as Intel's Skylake and later microarchitectures, basic operations like addition (ADDPD) and multiplication (MULPD) exhibit latencies of approximately 4 cycles and reciprocal throughputs of 0.5 cycles per instruction, enabling high throughput in pipelined execution. In contrast, division (DIVPD) and square root (SQRTSD) are notably slower, with latencies of roughly 14 to 23 cycles for division and 17 to 21 cycles for square root, alongside reciprocal throughputs of 6 to 11 cycles and 8 to 12 cycles, respectively. These differences arise because addition and multiplication use dedicated, high-speed functional units, while division and square root rely on iterative algorithms such as SRT or Goldschmidt methods, which demand more computational steps.

A key optimization for improving both speed and accuracy in double-precision computations is the fused multiply-add (FMA) instruction, which computes $a \times b + c$ in a single operation without intermediate rounding, unlike separate multiply and add instructions that introduce an extra rounding step. On CPUs supporting FMA3 (e.g., Haswell and later), the VFMADDPD instruction has a latency of 4 cycles and a reciprocal throughput of 0.5 cycles, matching that of an individual multiply or add but effectively doubling performance for the combined computation in terms of instruction efficiency and reduced error accumulation. This can yield up to a 2x speedup over separate operations when multiplies and adds are chained dependently, as it avoids serialization and extra overhead.

Performance bottlenecks in double-precision arithmetic often stem from the need for exponent alignment during addition and subtraction, where the significands of operands with differing exponents must be shifted into agreement before the operation, potentially introducing variable latency based on the exponent difference. In deep pipelines, such alignments can cause stalls if dependent operations must wait for normalization or carry propagation to resolve.

On graphics processing units (GPUs), double-precision operations face additional hardware trade-offs; for instance, on NVIDIA's Volta architecture (e.g., the Tesla V100), peak double-precision (FP64) performance reaches 7.8 TFLOPS, compared to 15.7 TFLOPS for single precision (FP32), making double precision roughly half the speed because the hardware prioritizes parallel single-precision and tensor workloads for graphics and machine learning.

Recent trends in CPU design emphasize single instruction, multiple data (SIMD) extensions to boost double-precision throughput. Intel's AVX-512 instructions enable vector operations across 512-bit registers, processing 8 double-precision elements simultaneously per instruction, which can deliver up to 8x the throughput of scalar double-precision operations on compatible processors like Skylake-X or Ice Lake, particularly in vectorizable workloads such as matrix multiplications or simulations. This improvement depends on auto-vectorization or explicit intrinsics, and it scales throughput while the per-element latency of the underlying operations is unchanged.
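The single-rounding property of FMA can be demonstrated with the C99 fma function from <math.h>; the operand values below are chosen (as an illustrative sketch) so that the intermediate rounding of a separate multiply-then-add becomes visible:

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    double a = 1.0 + 0x1p-27;   /* exact real product a*b = 1 - 2^-54 */
    double b = 1.0 - 0x1p-27;
    double c = -1.0;

    double separate = a * b + c;     /* a*b rounds to 1.0, so result is 0 */
    double fused    = fma(a, b, c);  /* one rounding: exactly -2^-54 */

    printf("separate: %.17g\n", separate);  /* 0 */
    printf("fused:    %.17g\n", fused);     /* -5.5511151231257827e-17 */
    return 0;
}
```

Compile with the math library (e.g., cc fma_demo.c -lm); on hardware without an FMA unit, the library computes the correctly rounded result in software.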

Hardware Support and Implementations

The Intel 8087, introduced in 1980 as a numeric coprocessor for the 8086 microprocessor, provided early hardware support for double-precision floating-point operations, handling 64-bit formats conforming to the draft IEEE standard alongside single-precision and extended-precision modes. Today, double-precision support is ubiquitous in modern processors, though some architectures, such as RISC-V, treat it as an optional extension: the D extension adds 64-bit floating-point registers and operations for full compliance.

In x86 architectures, the Intel 80486 microprocessor, released in 1989, was the first to integrate a floating-point unit (FPU) on-chip, enabling native double-precision arithmetic without a separate coprocessor. The Streaming SIMD Extensions 2 (SSE2), introduced with the Pentium 4 in 2000, further mandated comprehensive support for packed double-precision floating-point operations across 128-bit XMM registers, enhancing vectorized performance. ARM architectures incorporate double-precision support through the Advanced SIMD extension, particularly in AArch64 mode, where it enables vectorized double-precision operations fully compliant with IEEE 754.

In graphics processing units (GPUs), NVIDIA's CUDA-capable hardware provides double-precision support starting from compute capability 1.3, but architectures such as the Maxwell family (compute capability 5.x) deliver only about 1/32 the throughput of single-precision operations due to limited dedicated double-precision hardware units. Double-precision operations also generally consume roughly twice the energy of single-precision ones, owing to the wider 64-bit registers and data paths that increase switching activity and capacitance.

Software Implementations

In the C programming language, the double type represents double-precision floating-point numbers, which occupy 64 bits and, on virtually all modern implementations, conform to the binary64 format defined by IEEE 754. The float type, in contrast, represents single-precision values using 32 bits in the binary32 format. These types provide the foundation for floating-point arithmetic in C and related languages like C++. The C99 standard (ISO/IEC 9899:1999) includes optional support for IEC 60559, the international equivalent of IEEE 754, through its Annex F, and most modern implementations conform to these requirements for portability and predictability. This support ensures that floating-point types and operations adhere to IEEE 754 semantics, including normalized representations, subnormal numbers, infinities, and NaNs. The header <float.h> provides macros for querying implementation limits, such as DBL_MAX (the maximum representable finite value, approximately 1.7976931348623157 × 10^308) and DBL_MIN (the minimum normalized positive value, approximately 2.2250738585072014 × 10^-308), enabling portable bounds checking.

C provides standard library functions in <math.h> for manipulating double-precision values according to IEEE 754 rules. For example, the frexp function decomposes a double-precision number into a normalized fraction (with magnitude in [0.5, 1.0)) and an integral power of 2, facilitating exponent extraction for custom arithmetic or analysis. The C11 standard (ISO/IEC 9899:2011) carries this forward in Annex F, which mandates fuller IEC 60559 (IEEE 754-2008 equivalent) compliance when the __STDC_IEC_559__ macro is defined, including precise exception handling via <fenv.h> and support for rounding modes like FE_TONEAREST.

Developers must be aware of common pitfalls in handling double-precision values. Signed zeros (+0.0 and -0.0) are distinct representations that can affect operations like division (e.g., 1.0 / -0.0 yields negative infinity), and comparisons involving NaNs always fail equality checks (NaN != NaN), per IEEE 754 semantics. To safely test for special values, use functions like isnan (from <math.h>) for NaNs and isfinite for finite numbers, avoiding direct == comparisons, which can produce unexpected results. Compilers such as GCC and Clang offer optimization flags that can affect IEEE 754 strictness: the -ffast-math flag enables aggressive floating-point optimizations by assuming no NaNs or infinities, reordering operations, and relaxing rounding, which can improve performance but risks non-compliant results, such as incorrect handling of signed zeros or exceptions; it is not recommended for code requiring exact IEEE conformance.

In related languages, Fortran's DOUBLE PRECISION type, standardized in FORTRAN 77 and retained in modern standards such as Fortran 2003 and 2018, typically maps to the C double type, providing 64-bit IEEE 754 binary64 precision when using the ISO_C_BINDING module for interoperability (e.g., REAL(KIND=C_DOUBLE)).
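A brief sketch ties these facilities together, using the <float.h> limits, frexp, and the special-value predicates described above:

```c
#include <stdio.h>
#include <float.h>
#include <math.h>

int main(void) {
    printf("DBL_MAX = %.17g\n", DBL_MAX);
    printf("DBL_MIN = %.17g\n", DBL_MIN);

    int exp;
    double frac = frexp(48.0, &exp);  /* 48 = 0.75 * 2^6 */
    printf("frexp(48.0): %.2f * 2^%d\n", frac, exp);

    double nan = NAN;
    printf("nan == nan: %d, isnan(nan): %d\n", nan == nan, isnan(nan));
    printf("isfinite(DBL_MAX * 2): %d\n", isfinite(DBL_MAX * 2.0)); /* 0 */
    return 0;
}
```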

In Java and JavaScript

In Java, the double primitive type implements the IEEE 754 double-precision 64-bit floating-point format, providing approximately 15 decimal digits of precision for representing real numbers. The Double class serves as an object wrapper for double values, enabling their use in object-oriented contexts such as collections and providing methods for conversion, parsing, and string representation. To ensure predictable results across different hardware platforms, Java provides the StrictMath class, which implements mathematical functions with strict adherence to IEEE 754 semantics, avoiding platform-specific optimizations that could alter results. Java Virtual Machines (JVMs) have been required to support IEEE 754 floating-point operations, including subnormal numbers and gradual underflow, since the initial release of JDK 1.0 in January 1996.

In JavaScript, the Number type exclusively uses the double-precision 64-bit IEEE 754 format for all numeric values, with no distinct single-precision type available; this simplifies numeric handling but limits exact integer representation to the safe range of -(2^53 - 1) to 2^53 - 1. The format was standardized in the first edition of ECMA-262 in June 1997, which defines Number as the set of double-precision IEEE 754 values. For integers exceeding this safe range, ECMAScript 2020 introduced the BigInt primitive, which supports arbitrary-precision integers without floating-point approximation. Major engines such as Google's V8 and Mozilla's SpiderMonkey fully implement IEEE 754 for Number, ensuring consistent behavior for operations such as addition and multiplication and for special values like NaN and Infinity. In accordance with IEEE 754 semantics, both the == and === operators report NaN as unequal to itself; the Object.is() method, by contrast, treats NaN as equal to itself, making it suitable for NaN-aware comparisons. Both Java and JavaScript assume host-platform conformance to IEEE 754 for portability, but embedded JavaScript engines, such as those in resource-constrained devices, may introduce variations in precision or special-value handling to optimize for limited hardware, potentially deviating from full standard compliance.

In Other Languages and Standards

In Fortran, double-precision floating-point numbers are declared using the DOUBLE PRECISION type or the non-standard REAL*8 specifier, both of which allocate 8 bytes and provide approximately 15 decimal digits of precision. These types conform to the binary64 format when compiled with modern Fortran processors supporting the IEEE 754 standard. Fortran also provides intrinsic inquiry functions for querying limits, such as HUGE(), which returns the largest representable positive value for a given double-precision kind, approximately 1.7976931348623157 × 10^308.

Common Lisp defines the DOUBLE-FLOAT type as a subtype of FLOAT, representing IEEE 754 double-precision values with 53 bits of significand precision in typical implementations. The type supports IEEE special values, including infinities, NaNs, and subnormals, enabling robust handling of exceptional conditions in numerical code across implementations such as SBCL and LispWorks.

In Rust, the f64 primitive type implements the binary64 double-precision format, offering 64 bits of storage with operations that adhere to the IEEE 754 standard's requirements for accuracy and exceptions. For interoperability with C code, f64 can be used with the #[repr(C)] attribute on structs to ensure layout compatibility, while the std::os::raw::c_double alias explicitly maps to f64 for foreign function interfaces. Rust's f64 arithmetic uses round-to-nearest, ties-to-even as the default rounding mode, consistent with IEEE 754 semantics. Zig's double-precision support mirrors C's double type through its f64 type, an IEEE 754 binary64 float, facilitating seamless C ABI compatibility in low-level systems programming.

In JSON, numeric values are commonly processed assuming IEEE 754 double-precision representation, as noted in RFC 8259, but the format provides no explicit guarantee of precision preservation during transmission or parsing, which can surface artifacts like 0.1 + 0.2 yielding 0.30000000000000004 in JavaScript environments. The specification cautions that numbers exceeding double-precision limits, such as 1E400, pose interoperability problems and may not survive interchange with full fidelity.
