Machine epsilon
Machine epsilon, denoted ε_mach or simply ε, is the smallest positive floating-point number such that 1 + ε is distinguishable from 1 in the floating-point arithmetic of a given computing system, serving as a measure of the relative precision inherent in floating-point representations.[1] It quantifies the maximum relative rounding error introduced when representing real numbers as floating-point approximations, typically arising from the finite number of bits allocated to the significand in formats like IEEE 754.[2] This value is fundamental in numerical analysis, as it bounds the error in arithmetic operations and informs the design of stable algorithms.[3]
In the IEEE 754 standard, which defines the most widely used binary floating-point formats, machine epsilon for single precision (32 bits, 24-bit significand) is 2^{-23} ≈ 1.19 × 10^{-7}, while for double precision (64 bits, 53-bit significand) it is 2^{-52} ≈ 2.22 × 10^{-16}.[1] These values correspond to the distance from 1 to the next representable floating-point number greater than 1, also known as the unit in the last place (ulp) at 1. The unit roundoff u = ε/2 represents the maximum relative error under rounding to nearest, so machine epsilon is twice the unit roundoff in this mode.[3] Machine epsilon can be computed programmatically by iteratively halving a value δ starting from 1 until 1 + δ equals 1 in floating-point arithmetic, highlighting its dependence on the system's arithmetic implementation.[2]
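These constants are exposed directly by most language runtimes; as a quick check, Python's standard library reports the double-precision value:

```python
import sys

# Machine epsilon for Python's binary64 floats: the gap from 1.0 to the
# next representable number, i.e. 2**-52.
eps = sys.float_info.epsilon
assert eps == 2**-52
print(eps)  # 2.220446049250313e-16
```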
The concept is crucial for assessing the accuracy and stability of numerical computations, particularly in error propagation analysis, where floating-point operations introduce perturbations bounded by multiples of ε. For instance, the computed result fl(x ⊕ y) of addition satisfies |fl(x ⊕ y) - (x + y)| ≤ u |x + y|, ensuring predictable error growth in algorithms.[1] While machine epsilon provides an upper bound on representation errors near 1, the relative precision varies across the number line due to normalization, being ulp(x)/|x| for numbers away from powers of the radix.[3] Its role extends to validating software implementations of floating-point standards and guiding choices in tolerance settings for iterative solvers and convergence tests in scientific computing.[2]
Floating-Point Arithmetic Basics
Binary Floating-Point Representation
Binary floating-point numbers represent real numbers in computer systems using a fixed number of bits divided into three components: a sign bit, an exponent field, and a significand (also known as the mantissa).[4] The sign bit, which is a single bit, indicates whether the number is positive (0) or negative (1).[4] The exponent field consists of several bits that encode the scale of the number as a power of 2, while the significand field stores the fractional part of the number in binary form.[4]
In normalized binary floating-point representation, as defined by the IEEE 754 standard, the value x of a finite non-zero number is given by the formula
x = (-1)^s \times m \times 2^{e - \text{bias}},
where s is the sign bit (0 or 1), m is the significand interpreted as a value between 1 (inclusive) and 2 (exclusive) with an implied leading 1 followed by the stored fraction bits, e is the stored biased exponent, and bias is a constant specific to the format (e.g., 127 for single precision).[4][5] This normalization ensures that the significand always starts with a leading 1 in binary, maximizing precision by avoiding leading zeros.[4]
The unit in the last place (ulp) refers to the spacing or gap between two consecutive representable floating-point numbers in the same binade (range between powers of 2), equivalent to the value of the least significant bit in the significand for a given exponent.[6] It provides a measure of the precision limits inherent in the representation, as the ulp varies with the magnitude of the number: larger exponents result in larger ulps, reflecting coarser spacing for bigger values.[6]
For example, the number 1.0 in single-precision binary floating-point (32 bits total) is represented with sign bit 0, biased exponent 127 (actual exponent 0), and significand fraction bits all 0 (implied leading 1, so m = 1.0).[4] In binary, this is 0 01111111 00000000000000000000000, or in hexadecimal 0x3F800000.[4] The ulp at 1.0 is 2^{-23} \approx 1.192 \times 10^{-7}, the distance to the next representable number 1 + ulp.[6]
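The bit pattern above can be verified with a short sketch using Python's struct module, which packs 1.0 into binary32 and then steps the raw bit pattern to the next representable value:

```python
import struct

# 1.0 in IEEE 754 binary32: sign 0, biased exponent 127, fraction bits 0.
bits = struct.pack(">f", 1.0)
assert bits.hex() == "3f800000"

# Incrementing the integer bit pattern gives the next float above 1.0,
# which differs from 1.0 by exactly 2**-23 (the ulp at 1.0).
next_up = struct.unpack(">f", (int.from_bytes(bits, "big") + 1).to_bytes(4, "big"))[0]
assert next_up - 1.0 == 2**-23
```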
Rounding in Arithmetic Operations
In floating-point arithmetic, basic operations such as addition, subtraction, multiplication, and division are performed by first computing an exact intermediate result, which is then rounded to the nearest representable value in the floating-point format to fit within the available precision.[7] This rounding step is necessary because most real numbers and operation outcomes cannot be exactly represented due to the finite number of bits allocated for the significand.[8] The IEEE 754 standard mandates correct rounding for these operations, ensuring the computed result is as close as possible to the mathematically exact value under the selected mode.[7]
The standard specifies five rounding modes to control how inexact results are approximated, with round-to-nearest (ties to even) as the default for most implementations.[5] In round-to-nearest, the exact result is rounded to the closest representable floating-point number; if it lies exactly midway between two candidates, the tie is broken by rounding to the one with an even least significant bit in the significand.[8] Alternative directed modes include round toward zero (truncation, which discards excess magnitude), round toward positive infinity (rounding up for positive excesses), and round toward negative infinity (rounding down for negative excesses); a fifth mode, round to nearest with ties away from zero, was added in the 2019 revision for specific use cases like statistical computations.[5] These modes allow programmers to influence error behavior, such as in interval arithmetic or when preserving sign in financial calculations.[7]
Rounding introduces an error when the exact result falls between two consecutive representable numbers, known as the unit in the last place (ulp), which is the spacing between those numbers at the given magnitude.[7] In round-to-nearest mode, this rounding error is bounded by 0.5 ulp of the result, meaning the computed value deviates from the exact value by at most half the local spacing between representable points.[7] The ulp itself varies with the exponent, being larger for bigger numbers since the significand's fixed bit length scales the precision dynamically across magnitudes.[8]
A clear illustration of rounding error occurs in addition, where a small value added to a larger one may be effectively ignored if it falls below the resolution threshold. For instance, adding a tiny positive number ε to 1.0 yields exactly 1.0 + ε, but if ε is smaller than 0.5 ulp at 1.0, the sum rounds back to 1.0 in round-to-nearest mode, losing the small contribution entirely.[7] This effect arises because the binary floating-point representation around 1.0 has a fixed ulp determined by the significand's precision, as covered in the structure of binary floating-point numbers.[7] Such losses highlight the limitations of finite precision and can propagate in chained operations, emphasizing the need for awareness in numerical algorithms.
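This absorption is easy to observe in Python's binary64 arithmetic, where round-to-nearest with ties to even is the hardware default:

```python
# 2**-53 is exactly half an ulp at 1.0: the tie rounds to the even
# candidate, which is 1.0 itself, so the addend vanishes.
assert 1.0 + 2**-53 == 1.0

# A full ulp (2**-52) survives the addition.
assert 1.0 + 2**-52 > 1.0
```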
The unit roundoff quantifies the maximum relative rounding error possible in these operations, providing a bound on how much the computed result can deviate from the exact one relative to the result's scale.[7] In the context of round-to-nearest, this relative error is at most 0.5 ulp normalized by the result's magnitude, serving as a fundamental measure of the arithmetic's faithfulness to real-number mathematics.[7]
Core Definitions
Rounding Machine Epsilon
The rounding machine epsilon, also known as the unit roundoff and denoted u, represents the maximum relative rounding error introduced when representing a real number in a floating-point system or performing arithmetic operations that round to the nearest representable value.[9] In binary floating-point arithmetic, it is formally defined as u = 2^{-p}, where p is the precision, corresponding to the number of bits in the significand (including the implicit leading bit). This is half the machine epsilon \epsilon = 2^{1-p}.[9]
This definition arises within the standard model of floating-point arithmetic, which assumes that every operation rounds the exact result to the nearest representable number, with ties broken in a specified manner (such as round-to-nearest-even). Under this model, the computed result \mathrm{fl}(x \oplus y) of an operation \oplus satisfies \mathrm{fl}(x \oplus y) = (x \oplus y)(1 + \delta), where |\delta| \leq u, bounding the relative error.[9]
The value of u can be derived directly from the unit in the last place (ulp) at 1, which is \epsilon = 2^{1-p}; thus, u = \epsilon / 2. In binary floating-point representation, the representable numbers near 1 have a spacing of 2^{1-p}, so the maximum rounding error is half this spacing relative to the number.[9]
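Python 3.9+ exposes the ulp directly via math.ulp, which makes the relationship u = \epsilon / 2 easy to confirm for binary64:

```python
import math

eps = math.ulp(1.0)   # spacing from 1.0 to the next float: 2**(1-p)
assert eps == 2**-52  # binary64 has p = 53 significand bits

u = eps / 2           # unit roundoff: max relative error, round-to-nearest
assert u == 2**-53
```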
The concept of rounding machine epsilon (unit roundoff) originated in early numerical analysis literature for bounding errors in computational processes, notably in Wilkinson's seminal work on rounding error analysis.[10] Note that terminology varies; while some sources use "rounding machine epsilon" for u, the more common term is "unit roundoff," with "machine epsilon" referring to the full spacing \epsilon.
Interval Machine Epsilon
The interval machine epsilon, denoted as \epsilon_{\text{mach}}, is defined as the smallest positive floating-point number \epsilon > 0 such that the floating-point representation \mathrm{fl}(1 + \epsilon) > 1, where \mathrm{fl} denotes the floating-point mapping function.[11] This value captures the smallest distinguishable addition to 1 in the floating-point system, independent of the specific rounding mode, and is also known as the standard machine epsilon.[12]
In binary floating-point formats with a p-bit mantissa (including the implicit leading 1), the interval machine epsilon equals the unit in the last place (ulp) at 1.0, expressed as \epsilon = 2^{1-p}.[13] This spacing arises because representable numbers near 1 are separated by increments of 2^{-(p-1)}, reflecting the granularity of the mantissa in the normalized binary representation.[14]
It differs from the rounding machine epsilon (unit roundoff u), which bounds the maximum relative error in rounding operations and equals half the interval machine epsilon in round-to-nearest mode (u = \epsilon / 2).[12] The interval variant emphasizes the representable gap for detection purposes, while the rounding variant focuses on error propagation in computations.[15] Note that "machine epsilon" typically refers to this interval value in most literature and programming languages (e.g., DBL_EPSILON in C).
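The distinction between the two constants can be observed directly in binary64 arithmetic; the following sketch assumes round-to-nearest with ties to even:

```python
import math

u = 2**-53     # unit roundoff (rounding machine epsilon)
eps = 2**-52   # interval machine epsilon (e.g. DBL_EPSILON in C)

assert 1.0 + u == 1.0                         # exact tie: rounds back to 1.0
assert 1.0 + math.nextafter(u, 1.0) > 1.0     # just past the tie: rounds up
assert 1.0 + eps == math.nextafter(1.0, 2.0)  # 1 + eps is the next float after 1.0
```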
One practical method to approximate the interval machine epsilon empirically involves iteratively halving an initial value until the addition to 1 becomes indistinguishable:
eps = 1.0
while (1.0 + eps != 1.0):
    eps = eps / 2.0
eps = eps * 2.0  # adjust back to the last distinguishable value
This loop exploits the floating-point comparison to identify the threshold where 1 + \epsilon = 1 in representation.[16]
Computation Techniques
Theoretical Derivation
In binary floating-point arithmetic, a normalized number is represented as x = \pm (1.f) \times 2^e, where f is the fractional part consisting of p-1 bits, e is the exponent, and the leading bit is implicitly 1, making the significand have p bits in total.[7][5]
For numbers in the interval [1, 2), the exponent e = 0, and the significand ranges from 1.000\ldots_2 (which is 1) to 1.111\ldots_2 = 2 - 2^{-(p-1)}. The representable values are thus 1 + k \times 2^{-(p-1)} for integer k = 0, 1, \dots, 2^{p-1} - 1. The spacing between consecutive representable numbers in this binade, known as the unit in the last place (ulp), is therefore 2^{-(p-1)}.[7]
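The binade-dependent spacing can be checked with math.nextafter (Python 3.9+):

```python
import math

# For binary64 (p = 53), the gap between consecutive floats doubles at
# each power of two.
assert math.nextafter(1.0, 2.0) - 1.0 == 2**-52   # ulp in [1, 2)
assert math.nextafter(2.0, 4.0) - 2.0 == 2**-51   # ulp in [2, 4)
assert math.nextafter(0.5, 1.0) - 0.5 == 2**-53   # ulp in [0.5, 1)
```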
The interval machine epsilon, denoted \epsilon, is defined as this ulp value at 1, representing the smallest positive floating-point number such that 1 + \epsilon > 1 in the arithmetic. Thus, \epsilon = 2^{-(p-1)}.[7][13]
This derivation assumes an IEEE 754-like model with binary base (\beta = 2) and normalized representations (subnormal numbers, which occur only at magnitudes far below 1, do not affect the spacing near 1), with no overflow or underflow near 1.0.[5][7]
The formula holds specifically for binary floating-point but generalizes to \epsilon = \beta^{1-p} for base-\beta systems with p-digit significands, such as decimal formats where the spacing differs accordingly.[7]
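The base-\beta generalization can be illustrated with Python's decimal module, which implements base-10 arithmetic at a configurable precision; here p = 7 digits mimics decimal32's significand, so the interval epsilon is 10^{1-7} = 10^{-6}:

```python
from decimal import Decimal, getcontext

getcontext().prec = 7  # p = 7 significant decimal digits

one = Decimal(1)
assert one + Decimal("1e-7") == one  # below the ulp at 1: absorbed
assert one + Decimal("1e-6") > one   # 10**(1-p) = 1e-6 is representable
```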
Practical Approximation
A practical method for approximating machine epsilon in software involves an iterative algorithm that empirically determines the smallest positive floating-point number \epsilon such that 1 + \epsilon > 1 in the given arithmetic environment. This approach begins with an initial value of \epsilon = 1.0 and repeatedly halves it until the condition 1 + \epsilon = 1 holds true in floating-point evaluation, at which point \epsilon is doubled to recover the boundary value.[7]
The algorithm can be expressed in pseudocode as follows:
eps = 1.0
while (1.0 + eps > 1.0):
    eps = eps / 2.0
eps = eps * 2.0
This yields the machine epsilon for the prevailing floating-point format.[7]
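The pseudocode translates directly into runnable Python; on a typical binary64 platform the loop recovers 2^{-52}:

```python
def interval_machine_epsilon():
    """Halve eps until 1.0 + eps rounds back to 1.0, then undo the last halving."""
    eps = 1.0
    while 1.0 + eps > 1.0:
        eps /= 2.0
    return eps * 2.0

assert interval_machine_epsilon() == 2**-52
```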
The method avoids underflow by starting from a value well above the smallest representable number and converging toward machine epsilon through division, ensuring the loop terminates long before reaching denormalized or zero values. It is portable and effective across common programming languages, including C, Python, and MATLAB, where it can be implemented directly without reliance on predefined constants.[7][17]
However, results may vary slightly due to compiler optimizations that rearrange expressions or apply algebraic simplifications, potentially treating 1 + \epsilon > 1 as \epsilon > 0 and yielding an incorrect (too small) value. Additionally, on architectures with extended precision registers, such as x86, intermediate computations might use higher precision, leading to a larger apparent epsilon unless strict floating-point modes or volatile qualifiers are employed to enforce the target precision.[7]
IEEE 754 Single and Double Precision
The IEEE 754 standard for binary floating-point arithmetic, initially published in 1985 and subsequently revised in 2008 and 2019, specifies formats that dominate modern computing hardware, including x86 and ARM processor architectures.[18][19][5][20] These formats ensure consistent representation and operations across systems, with single and double precision being the most prevalent for general-purpose computations.
In IEEE 754 single precision (binary32), the 32-bit format allocates 1 bit for the sign, 8 bits for the biased exponent, and 23 bits for the significand, yielding an effective precision of 24 bits due to the implicit leading 1 in normalized numbers. The machine epsilon, defined as the smallest positive value \epsilon such that 1 + \epsilon > 1 in floating-point representation, is \epsilon = 2^{-23} \approx 1.192 \times 10^{-7}.[21] This value represents the unit in the last place (ulp) at 1.0, bounding the spacing between representable numbers in the vicinity of 1.
Double precision (binary64) employs 64 bits: 1 sign bit, 11 exponent bits, and 52 significand bits, for an effective precision of 53 bits including the implicit 1. Here, the machine epsilon is \epsilon = 2^{-52} \approx 2.220 \times 10^{-16}, again corresponding to the ulp at 1.0.[21] The unit roundoff u, which quantifies the maximum relative rounding error in round-to-nearest mode, is half the machine epsilon, or u = \epsilon / 2.
The following table summarizes these key parameters for comparison:
| Format | Total Bits | Precision p | Machine Epsilon \epsilon | Unit Roundoff u | ULP at 1.0 |
|---|---|---|---|---|---|
| Single (binary32) | 32 | 24 | 2^{-23} \approx 1.192 \times 10^{-7} | 2^{-24} \approx 5.960 \times 10^{-8} | 2^{-23} |
| Double (binary64) | 64 | 53 | 2^{-52} \approx 2.220 \times 10^{-16} | 2^{-53} \approx 1.110 \times 10^{-16} | 2^{-52} |
These values are derived directly from the significand structure in the IEEE 754 standard and are fundamental to assessing rounding errors in numerical algorithms.[22][21]
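Single-precision values can be checked from a double-precision host by rounding through a 4-byte pack/unpack round trip; this sketch repeats the halving experiment in simulated binary32:

```python
import struct

def to_float32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

eps = 1.0
while to_float32(1.0 + eps) > 1.0:
    eps /= 2.0
eps *= 2.0
assert eps == 2**-23  # matches the binary32 machine epsilon
```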
Half-precision floating-point, also known as binary16 in the IEEE 754-2008 standard, uses an 11-bit significand (including the implicit leading 1), giving a machine epsilon of 2^{-10} \approx 9.766 \times 10^{-4}.[23] This format is commonly employed in graphics processing and machine learning applications where reduced storage is beneficial, though it offers far lower precision than the single or double formats.[23]
Quadruple-precision floating-point, or binary128, extends the significand to 113 bits under IEEE 754-2008, yielding a machine epsilon of 2^{-112} \approx 1.926 \times 10^{-34}.[23] It provides exceptionally high accuracy for scientific simulations requiring minimal rounding errors, such as in computational physics.[23]
Decimal floating-point formats, introduced in IEEE 754-2008, operate in base 10 rather than base 2, altering the nature of spacing and rounding. For decimal32, which supports 7 decimal digits of precision, the machine epsilon is 10^{-6} = 10^{1-7}, the unit in the last place at 1.0.[23] This base-10 structure avoids some binary-to-decimal conversion issues in financial computations, though it generally results in larger epsilons than equivalent binary formats due to the radix difference.[23]
Historical IBM hexadecimal floating-point formats, used in System/360 and later mainframes, employ base 16 with a significand of 6 hexadecimal digits (24 bits) for single precision. The ulp at 1.0 is 16^{-5} = 2^{-20} \approx 9.537 \times 10^{-7}, larger than the binary32 value despite the same significand width, because normalization only guarantees a non-zero leading hexadecimal digit, so the effective binary precision varies between 21 and 24 bits.[7] This variable precision across binades complicates error analysis compared to binary standards, though the format was optimized for the era's hardware efficiency.[7]
Error Analysis Connections
Link to Relative Rounding Error
The relative rounding error in floating-point arithmetic quantifies the deviation between the exact result of an operation and its computed approximation, providing a measure of precision loss due to finite representation. For a generic arithmetic operation \oplus, the computed result satisfies \left| \mathrm{fl}(x \oplus y) - (x \oplus y) \right| / \left| x \oplus y \right| \leq \epsilon / 2, where \mathrm{fl}(\cdot) denotes the floating-point evaluation and \epsilon is the machine epsilon.[7] This bound arises from the IEEE 754 standard's requirement for correctly rounded operations, ensuring the result is as if the exact value were computed and then rounded to the nearest representable number.[24]
Machine epsilon thus serves as an upper limit on the worst-case relative accuracy of basic arithmetic operations, with \epsilon / 2 acting as the unit roundoff that scales the potential error. In practice, this connection implies that each operation introduces a small, controlled perturbation proportional to the input magnitudes, allowing analysts to predict and bound error propagation in more complex computations.[7]
An illustrative example occurs in the summation of n positive numbers, where the naive recursive algorithm accumulates rounding errors that grow proportionally to n \epsilon, potentially magnifying inaccuracies for large n.[7] This highlights how machine epsilon informs error growth in sequential operations, as each addition contributes an error bounded by \epsilon / 2 times the current partial sum.
Understanding this link guides the selection of numerical algorithms, favoring those—like compensated summation or pairwise summation—that mitigate error accumulation and keep total relative error close to \epsilon rather than allowing it to scale with problem size.[24]
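As an illustration of compensated summation, the classic Kahan algorithm carries a running correction term that recovers the low-order bits each addition discards; a minimal Python sketch:

```python
def kahan_sum(values):
    """Compensated (Kahan) summation: recover low-order bits lost in each add."""
    total = 0.0
    c = 0.0                  # running compensation
    for v in values:
        y = v - c            # reintroduce previously lost low-order bits
        t = total + y        # low-order bits of y may be lost here...
        c = (t - total) - y  # ...and are captured for the next iteration
        total = t
    return total

# Summing 0.1 a million times: the naive loop accumulates error growing
# with n, while the compensated sum stays within a few ulps of 100000.
data = [0.1] * 10**6
naive = sum(data)
compensated = kahan_sum(data)
assert abs(compensated - 100000.0) < abs(naive - 100000.0)
```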
Mathematical Proof
To rigorously establish the connection between machine epsilon and the maximum relative rounding error in floating-point arithmetic, consider the standard model defined by the IEEE 754 standard, which uses binary representation with base \beta = 2, a mantissa of p bits (including the implicit leading 1 for normalized numbers), and rounding to nearest (ties to even).[7]
Theorem. For any real number x in the normalized range of the floating-point system (i.e., x_{\min} \leq x \leq x_{\max}, excluding subnormals for simplicity), the relative error incurred by rounding x to the nearest floating-point number \mathrm{fl}(x) satisfies
\frac{|\mathrm{fl}(x) - x|}{|x|} \leq u = 2^{-p},
where u is the unit roundoff. The machine epsilon \epsilon is defined as \epsilon = 2u = 2^{1-p}, which thus bounds the maximum relative rounding error by \epsilon/2.[7]
Proof. Assume rounding to nearest, so the absolute error satisfies |\delta| = |\mathrm{fl}(x) - x| \leq \frac{1}{2} \cdot \mathrm{ulp}(x), where \mathrm{ulp}(x) is the unit in the last place of x, representing the spacing between consecutive floating-point numbers at the magnitude of x. For a normalized x with exponent e (such that 2^e \leq |x| < 2^{e+1}), \mathrm{ulp}(x) = 2^{e - (p-1)}. Thus, |\delta| \leq 2^{e - p}.[7]
The relative error is then \frac{|\delta|}{|x|} \leq \frac{2^{e - p}}{2^e} = 2^{-p} = u, since |x| \geq 2^e. This bound holds uniformly because the relative spacing \mathrm{ulp}(x)/|x| is maximized when |x| is just above a power of two (i.e., at the lower end of a binade), where it equals 2^{1-p}, but the rounding error is at most half that spacing, yielding the factor of 1/2.[13]
To derive \epsilon, note that near 1 (where e = 0), \mathrm{ulp}(1) = 2^{-(p-1)} = 2^{1-p}, so the smallest \epsilon > 0 such that \mathrm{fl}(1 + \epsilon) > 1 is this ulp value, confirming \epsilon = 2^{1-p} = 2u. The maximum relative error occurs at points midway between representable numbers, achieving equality with u in the limit. This completes the proof, as the bound is tight and independent of the specific x magnitude within the normalized range.[7]
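The bound can be spot-checked with exact rational arithmetic: Python's fractions module represents both the real input and the rounded double exactly, so the relative error is computable without further rounding. The sample inputs below are arbitrary illustrative rationals.

```python
from fractions import Fraction

u = Fraction(1, 2**53)  # unit roundoff for binary64 (p = 53)

# For a few exact rationals, rounding to the nearest double obeys
# |fl(x) - x| / |x| <= u.
for x in (Fraction(1, 3), Fraction(2, 7), Fraction(10, 9), Fraction(355, 113)):
    fl_x = Fraction(float(x))  # float() rounds to nearest; Fraction() is exact
    rel_err = abs(fl_x - x) / x
    assert rel_err <= u
```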