Fixed-point arithmetic
Fixed-point arithmetic is a numerical representation system for rational numbers in which the decimal (or binary) point is located at a fixed position within the representation, allowing computations to be performed using integer arithmetic scaled by a power of the radix, typically base 2 in binary systems.[1] This approach contrasts with floating-point arithmetic by maintaining a constant scaling factor across all numbers, which simplifies hardware implementation but limits the dynamic range and precision compared to formats that adjust the point position dynamically.[2] Commonly used in embedded systems, digital signal processing (DSP), and real-time applications, fixed-point arithmetic enables efficient operations on resource-constrained devices such as microcontrollers.[3]

In binary fixed-point representations, numbers are stored as integers with an implied binary point at a predetermined position, often denoted in Qm.n format, where m specifies the number of integer bits (including the sign bit) and n the number of fractional bits.[2] For example, in a signed Q1.15 format using 16 bits, the value is interpreted as the two's complement integer divided by 2^{15}, providing a range from -1 to nearly 1 with a resolution of 2^{-15} \approx 3.05 \times 10^{-5}.[3] Arithmetic operations such as addition and subtraction require alignment of the binary points, which is straightforward since the position is fixed, but multiplication typically doubles the bit width of the result, necessitating a right shift to fit the output format and potentially introducing quantization error.[2] Scaling during conversion from floating-point values, performed by multiplying by 2^n, must account for overflow risks and precision loss.[1]

The primary advantages of fixed-point arithmetic are significantly faster execution and lower power consumption relative to floating-point, since it avoids the complexity of exponent handling and normalization; on microcontrollers such as the PIC32, for instance, a fixed-point multiplication may take 21 cycles versus 55 for floating-point.[3] It is particularly suited to applications requiring predictable performance, such as audio processing, control systems, and graphics rendering, where the fixed precision suffices and overflow can be managed through saturation or wrapping.[2] Its drawbacks include a restricted range (for a 32-bit signed format with 16 fractional bits, the maximum value is approximately 32767.9999) and susceptibility to the accumulation of rounding errors in iterative computations, making it less suitable for high-precision scientific calculations.[1] Modern compilers and libraries, such as those supporting the _Accum type in C, provide built-in fixed-point data types to facilitate its use in software.[3]
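As an illustration of the Q1.15 interpretation and of conversion by scaling, the following C sketch (the helper names are illustrative and not taken from any standard library) multiplies by 2^{15} with rounding and clamps to the format's limits on the way in, and divides by 2^{15} on the way out.

    #include <stdint.h>
    #include <math.h>

    /* Convert a real value to signed Q1.15: scale by 2^15, round, and clamp
       to the representable range [-1, 1 - 2^-15]. */
    static int16_t double_to_q15(double x) {
        double scaled = round(x * 32768.0);          /* 2^15 = 32768 */
        if (scaled > 32767.0)  scaled = 32767.0;     /* saturate rather than overflow */
        if (scaled < -32768.0) scaled = -32768.0;
        return (int16_t)scaled;
    }

    /* Interpret a Q1.15 bit pattern as a real value: divide by 2^15. */
    static double q15_to_double(int16_t q) {
        return (double)q / 32768.0;
    }

For example, double_to_q15(0.5) yields the stored integer 16384, and q15_to_double(16384) recovers 0.5 exactly, while a value such as 0.1 comes back only to within the 2^{-15} resolution noted above.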
Fundamentals
Definition and Basic Representation
Fixed-point arithmetic is a method for representing rational numbers in computing by storing them as integers scaled by a fixed radix point position, which eliminates the need for explicit decimal or binary points during storage.[4] This approach treats the number as an integer multiple of a power of the radix (typically base 2 for binary systems), where the scaling factor corresponds to the position of the implied radix point.[5] In essence, a fixed-point number x can be expressed as x = I + F, where I is the integer part and F is the fractional part, but both are combined into a single integer value interpreted relative to the fixed point.[6]

In basic storage, fixed-point numbers are represented using standard integer formats, typically in binary with a fixed word length, and signed values often employ two's complement to handle negative numbers efficiently.[7] The entire value is stored as a contiguous bit string without any explicit indicator of the radix point; instead, the position is predetermined by the format, allowing arithmetic operations to proceed as if on integers while the interpretation yields fractional results.[8] For example, in an 8-bit signed representation, the bits might allocate space for both integer and fractional components, with the sign bit using two's complement to extend the range symmetrically around zero.[9]

The placement of the radix point defines the division between integer and fractional parts, commonly denoted in Qm.n format, where m represents the number of bits for the integer portion (excluding the sign bit) and n the number of bits for the fractional portion.[6] In this notation, the total bit width is typically m + n + 1 (including the sign bit for signed formats), and the value is interpreted as the integer bit pattern divided by 2^n.[10] For instance, in Q3.4 format with 8 bits, the first 3 bits (after the sign) hold the integer part, and the remaining 4 bits the fraction, enabling representation of values from -8 to nearly 8 with a resolution of 1/16.[11]

This representation offers advantages over pure integer arithmetic by allowing the encoding of fractional values through implicit scaling, thus supporting applications that require sub-unit precision without the overhead of dedicated floating-point units.[5] By leveraging integer hardware, fixed-point formats facilitate efficient computation of rational numbers in resource-constrained environments, such as embedded systems.[7]
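A minimal C sketch of the Q3.4 layout just described, assuming an 8-bit two's complement container (the type and macro names are illustrative): the stored byte is an ordinary signed integer, and only its interpretation divides by 2^4.

    #include <stdint.h>

    /* Q3.4 in an 8-bit two's complement integer: 1 sign bit, 3 integer bits,
       4 fractional bits.  The represented value is stored_integer / 2^4. */
    typedef int8_t q3_4_t;

    #define Q3_4_FRAC_BITS 4

    static double q3_4_value(q3_4_t raw) {
        return (double)raw / (1 << Q3_4_FRAC_BITS);   /* divide by 16 */
    }

    /* Interpretation examples:
       raw = 24  (binary 0001.1000) -> 24 / 16  =  1.5
       raw = -8  (binary 1111.1000) -> -8 / 16  = -0.5
       Range: -128/16 = -8.0 up to 127/16 = 7.9375, in steps of 1/16. */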
Scaling Factors and Precision Choices
In fixed-point arithmetic, the scaling factor determines the position of the radix point relative to the stored bits, effectively defining the representation as a scaled integer value. For binary fixed-point systems, the scaling is typically a power of 2, denoted 2^{-f} where f is the number of fractional bits, so that a stored integer i is interpreted as the value v = i \times 2^{-f}. Selection of this scaling involves analyzing the application's required dynamic range and resolution; for instance, allocating more bits to the integer part expands the representable range (e.g., to [-2^{w-1}, 2^{w-1}) in stored-integer terms for w total bits in signed two's complement) at the expense of fractional precision, while prioritizing fractional bits enhances accuracy for small values but risks overflow for larger ones.[12][13]

The trade-offs in scaling choices are critical for optimizing performance in resource-constrained environments such as digital signal processing (DSP). Increasing the total word length (e.g., from 16 to 24 bits) improves both range and precision but raises hardware cost and power consumption; conversely, shorter words favor efficiency but necessitate frequent rescaling to prevent saturation. In practice, application-specific profiling guides this balance, such as using 16-bit formats for audio processing where moderate precision suffices, versus 32-bit formats for scientific computations demanding higher fidelity.[12][14]

Certain real numbers can be represented exactly in binary fixed-point arithmetic if they are dyadic rationals, i.e., fractions whose denominators are powers of 2. For example, 0.5 equals 1/2 and is exactly representable with one fractional bit, as the stored integer 1 interpreted as 1 \times 2^{-1} = 0.5, whereas 0.1 has an infinite binary expansion and cannot be exact, leading to approximation. Exactness holds only for values whose fractional part terminates in binary, limiting precise representation to sums of distinct powers of 1/2.[15]

Precision loss in fixed-point arises primarily from quantization, where non-representable values are approximated via truncation or rounding, introducing error. The maximum quantization error for rounding to nearest with n fractional bits is \epsilon = \frac{1}{2^{n+1}}, since the error is bounded by half the value of the least significant bit. Truncation yields a maximum error of \frac{1}{2^n}, often modeled as uniform noise with variance \sigma^2 = \frac{2^{-2n}}{12}. These errors accumulate across operations, necessitating scaling adjustments to maintain overall accuracy.[13][16]

The choice of radix significantly influences fixed-point efficiency, with binary (radix 2) predominant due to its alignment with hardware logic gates and shift-based scaling, which makes multiplication by a power of 2 a simple bit shift. Decimal fixed-point (radix 10), while intuitive for human-readable outputs such as financial applications, requires more complex circuitry for multiplication and division, increasing area and latency by factors of 3-5 compared to binary implementations. Binary thus offers superior hardware efficiency for most computational tasks, though decimal persists where exact decimal fractions (e.g., 0.1) are essential.[12]
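The rounding and truncation bounds above can be checked numerically. The short C program below quantizes 0.1, which is not a dyadic rational, to n fractional bits by both methods and prints the resulting error next to the theoretical bounds 2^{-(n+1)} and 2^{-n}.

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        const double x = 0.1;                 /* no finite binary expansion */
        for (int n = 4; n <= 16; n += 4) {
            double scale     = ldexp(1.0, n);                 /* 2^n */
            double rounded   = round(x * scale) / scale;      /* round to nearest */
            double truncated = floor(x * scale) / scale;      /* truncate (x > 0) */
            printf("n=%2d  round err=%.3e (bound %.3e)  trunc err=%.3e (bound %.3e)\n",
                   n, fabs(rounded - x), ldexp(1.0, -(n + 1)),
                   fabs(truncated - x), ldexp(1.0, -n));
        }
        return 0;
    }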
Comparison with Floating-Point Arithmetic
Fixed-point arithmetic represents numbers using a fixed binary point position within a word, without an explicit exponent, resulting in a uniform scaling factor applied across the entire range of values. This contrasts with floating-point arithmetic, which employs a mantissa (significand) and an exponent to normalize the number, allowing the binary point to "float" and adapt to the magnitude, as defined in standards such as IEEE 754. In fixed-point, all numbers share the same precision, determined by the position of the binary point relative to the integer bits, enabling straightforward integer-like operations but restricting adaptability to varying scales.[17][18]

Regarding precision and dynamic range, fixed-point formats provide constant absolute precision throughout their representable range, so the relative error grows as magnitudes shrink. In contrast, floating-point arithmetic provides approximately constant relative precision across its range due to the fixed number of mantissa bits (e.g., 23 bits for single-precision IEEE 754, giving about 7 decimal digits), though its absolute precision varies, being finer for smaller exponents. However, fixed-point's range is inherently limited by the word length and scaling choice, often requiring careful selection to balance integer and fractional parts without overflow, whereas floating-point supports a vastly wider range, spanning from approximately 10^{-38} to 10^{38} in single precision, at the cost of precision degradation near the extremes. This makes floating-point suitable for scientific computing with disparate magnitudes, while fixed-point excels in applications needing uniform absolute accuracy within a bounded domain.[17][19][18][20]

In terms of performance, fixed-point arithmetic is generally faster and more resource-efficient, as it leverages simple integer operations without the need for exponent alignment, normalization, or complex rounding hardware, reducing latency and power consumption in embedded systems. For instance, on processors lacking dedicated floating-point units, fixed-point can approach the speed of native integer arithmetic, whereas floating-point emulation introduces significant overhead due to these additional steps. This efficiency advantage is particularly pronounced in low-power devices such as digital signal processors, where fixed-point implementations can achieve several times the throughput of floating-point equivalents for the same bit width.[19][18]

Error accumulation in fixed-point arithmetic tends to be more predictable and additive, stemming primarily from quantization and potential overflows that can be mitigated through scaling, leading to a uniform noise distribution across operations. In contrast, floating-point errors arise from rounding during normalization and exponent handling, which can propagate non-uniformly and introduce inconsistencies, such as catastrophic cancellation when subtracting nearly equal values, though standards such as IEEE 754 bound the relative error to less than one unit in the last place (ulp) per operation. Fixed-point's deterministic error behavior facilitates easier analysis in control systems, while floating-point's variability requires compensatory techniques such as extended-precision guards.[17][18]
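The contrast in precision behavior can be seen directly. The sketch below assumes a 32-bit Q16.16 format for the fixed-point side: its step size is 2^{-16} everywhere in its range, whereas the gap between adjacent single-precision floats, obtained here with nextafterf, grows with the magnitude of the value.

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        /* Fixed point: Q16.16 has the same absolute step of 2^-16 at every magnitude. */
        double q16_16_step = ldexp(1.0, -16);

        /* Floating point: the distance to the next representable float scales
           with the exponent, so absolute precision degrades for large values. */
        float a = 1.0f, b = 1000.0f;
        printf("Q16.16 step anywhere   : %.10g\n", q16_16_step);
        printf("float step near 1.0    : %.10g\n", (double)(nextafterf(a, 2.0f) - a));
        printf("float step near 1000.0 : %.10g\n", (double)(nextafterf(b, 2000.0f) - b));
        return 0;
    }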
Operations
Addition and Subtraction
In fixed-point arithmetic, addition and subtraction require that the operands share the same scaling factor so that their radix points align, preventing errors in the fractional representation. If the scaling factors differ, one operand must be adjusted by shifting its binary representation, typically by multiplying or dividing by a power of 2 (e.g., left-shifting by k bits to increase the number of fractional bits from m to m+k), to match the other's scale before performing the operation. This alignment treats the fixed-point numbers as scaled integers, where the scaling factor s = 2^f reflects the position of the radix point after f fractional bits.[21][22]

Once aligned, addition proceeds by performing standard integer addition on the scaled values, with the common scaling factor then applied to interpret the result. For two fixed-point numbers a and b with the same scale s, the addition is computed as c = \frac{(a \cdot s) + (b \cdot s)}{s}, where a \cdot s and b \cdot s are the integer representations. Subtraction follows analogously: c = \frac{(a \cdot s) - (b \cdot s)}{s}. For signed fixed-point numbers, these operations employ two's complement representation, in which subtraction is implemented by adding the two's complement of the subtrahend to the minuend using the same adder circuit, ensuring consistent handling of negative values.[23][24]

Overflow in fixed-point addition and subtraction occurs when the result exceeds the representable range defined by the word length and scaling, such as surpassing the maximum positive value (e.g., 2^{i} - 2^{-f} for an i-bit integer part and f-bit fractional part) or the minimum negative value of the format. Detection typically involves checking the sign bits of the operands and result: for signed addition, overflow is flagged if both operands have the same sign but the result has the opposite sign; analogous checks apply after subtraction. To mitigate overflow, implementations may use saturation arithmetic, clamping the result to the nearest representable extreme rather than wrapping around.[24][23]
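A C sketch of this alignment and of saturation on overflow, under the assumption of 32-bit signed operands and illustrative function names: the first operand carries 8 fractional bits and is brought to the second operand's 16 fractional bits before the add, and the result is clamped instead of wrapping.

    #include <stdint.h>

    /* Add a value with 8 fractional bits to one with 16 fractional bits.
       The first operand is scaled by 2^8 (a left shift) so both share the
       2^-16 scale; the aligned value is assumed to still fit in 32 bits. */
    static int32_t add_aligned_saturating(int32_t a_q8, int32_t b_q16) {
        int32_t a_q16 = a_q8 * 256;                /* align radix points: 8 -> 16 fractional bits */
        int64_t wide  = (int64_t)a_q16 + b_q16;    /* widened sum cannot overflow */

        if (wide > INT32_MAX) return INT32_MAX;    /* saturate instead of wrapping */
        if (wide < INT32_MIN) return INT32_MIN;
        return (int32_t)wide;
    }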
Multiplication
In fixed-point arithmetic, multiplication of two numbers represented in formats Q_{m.n} and Q_{p.q} begins by treating them as integers and performing standard integer multiplication, yielding a product in Q_{m+p.n+q} format, where the integer part spans m+p bits and the fractional part spans n+q bits.[7] The fractional widths add because the scaling factors multiply: 2^{-n} \cdot 2^{-q} = 2^{-(n+q)}.[25] To restore a desired scaling, the product is right-shifted by the number of surplus fractional bits; for example, a right shift by q bits, effectively dividing by 2^{q}, yields a result with n fractional bits, matching the first operand's precision.[7] This adjustment may involve truncation of the least significant bits or rounding to mitigate precision loss, depending on the implementation.[25] The operation can be expressed as \text{result} = (a \times b) \gg q, where \gg denotes an arithmetic right shift to handle sign extension in signed representations.[7]

For signed fixed-point numbers in two's complement format, multiplication preserves the sign through sign extension of partial products during accumulation, ensuring the result remains in two's complement.[7] Hardware implementations often employ Booth's algorithm to enhance efficiency: it recodes the multiplier by examining bit pairs, replacing strings of ones with fewer additions and subtractions and thereby reducing the number of operations, especially for numbers with long runs of identical bits.[26] This technique, originally proposed by Andrew D. Booth in 1951, is particularly suited to signed binary multiplication in fixed-point systems.[26]
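For the common Q15 case, this procedure can be sketched in C as follows (the function name is illustrative): the 16-bit operands produce a 32-bit product with 30 fractional bits, a half-LSB offset implements rounding, and an arithmetic right shift by 15 returns the result to Q15; Booth recoding is a hardware optimization and does not appear at this level.

    #include <stdint.h>

    /* Multiply two Q15 values.  The raw product has 30 fractional bits; adding
       half an output LSB and shifting right by 15 rounds back to Q15.  An
       arithmetic right shift on signed values is assumed, as on most compilers. */
    static int16_t q15_mul(int16_t a, int16_t b) {
        int32_t product = (int32_t)a * (int32_t)b;   /* Q30 intermediate */
        product += 1 << 14;                          /* rounding offset: half of 2^15 */
        int32_t result = product >> 15;              /* back to 15 fractional bits */

        /* The only overflow case is (-1.0) * (-1.0) = +1.0, which Q15 cannot represent. */
        if (result > INT16_MAX) result = INT16_MAX;
        return (int16_t)result;
    }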
Division and Scaling Conversions
In fixed-point arithmetic, division of two numbers with potentially different scaling factors requires careful adjustment to maintain precision and avoid overflow. Consider a dividend represented as an integer D scaled by factor s_d (where the real value is D \cdot s_d) and a divisor as an integer V scaled by s_v (real value V \cdot s_v). The quotient in real terms is (D \cdot s_d) / (V \cdot s_v), which simplifies to (D / V) \cdot (s_d / s_v). To compute this using integer operations, one common procedure pre-scales the dividend by the reciprocal of the divisor's scaling factor and then performs integer division: with s_v = 2^{-f_v}, the integer result is (D \cdot 2^{f_v}) / V, which represents the quotient at the dividend's scale s_d.[27][28] This approach leverages hardware shifters or multipliers for the scaling step, treating the operation as integer arithmetic while accounting for the scales; however, it demands sufficient bit width in intermediate results to prevent overflow, often using extended precision (e.g., 64 bits for 32-bit operands).[29]

For efficiency, especially in software implementations on DSP processors, division often employs reciprocal approximation followed by multiplication. A prominent method uses the Newton-Raphson iteration to approximate the reciprocal 1/V: starting from an initial guess x_0 (e.g., taken from a lookup table), iterate x_{n+1} = x_n \cdot (2 - V \cdot x_n) until convergence, typically in 2-3 steps for fixed-point precision. The real quotient is then approximately D \cdot x_n \cdot s_d / s_v, with the final multiplication and a scale adjustment handling the factor s_d / s_v. The quadratic convergence makes this faster than long division for repeated operations, though it requires an initial approximation accurate to a few bits.[29]

Scaling conversions between different fixed-point formats involve multiplying the integer representation by the ratio of the source scale to the target scale. To convert an integer I from scale s_1 to scale s_2, compute the new integer as (I \cdot s_1) / s_2, which preserves the represented value I \cdot s_1 under the new scale. If the scales are powers of two (common for binary efficiency, e.g., s = 2^{-f}), this reduces to a bit shift: a left shift by k bits when s_1 / s_2 = 2^k, i.e., when the target format has k more fractional bits (increasing fractional precision), or a right shift in the opposite case, potentially with rounding to mitigate precision loss. Virtual shifts reinterpret the bit positions without altering the stored integer, adjusting only the implied scale (e.g., for a format X(a, b) with a integer and b fractional bits, X(a, b) \gg n = X(a - n, b + n)), avoiding hardware shifts but requiring software tracking of the scale. Physical shifts modify the bits directly but risk overflow or truncation if the word length is fixed.[28][30]

In hardware implementations, fixed-point division frequently uses digit-recurrence algorithms such as restoring or non-restoring division for efficiency on resource-constrained devices like FPGAs. The restoring algorithm proceeds iteratively: for each quotient bit, subtract a shifted divisor from the partial remainder; if the result is negative, restore it by adding the divisor back and set the quotient bit to 0, otherwise set the bit to 1 and proceed without restoration. This requires an extra addition step per iteration. The non-restoring variant optimizes by alternating subtraction and addition based on the remainder's sign, eliminating the restore step: if the partial remainder is positive, subtract and set the quotient bit to 1; if negative, add and set it to 0, with a final correction if needed.
Both generate one quotient bit per cycle and are suited to fixed-point operands once the binary points are aligned, offering low hardware cost but a latency of n cycles for an n-bit quotient, higher than that of multiplication-based methods.[26][31]
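In software, the pre-scaling approach and a power-of-two rescale can be sketched as follows, assuming both division operands share f fractional bits and that a 64-bit intermediate is available (the function names are illustrative; the digit-recurrence loops above are hardware-oriented and are not reproduced here).

    #include <stdint.h>

    /* Divide two fixed-point values that share f fractional bits.  Widening the
       dividend and pre-scaling it by 2^f keeps the quotient in the same format
       and avoids overflow in the intermediate.  The divisor must be non-zero. */
    static int32_t fixed_div(int32_t dividend, int32_t divisor, int f) {
        int64_t widened = (int64_t)dividend * ((int64_t)1 << f);  /* pre-scale by 2^f */
        return (int32_t)(widened / divisor);                      /* truncating division */
    }

    /* Re-express a value stored with f1 fractional bits using f2 fractional bits.
       Arithmetic shifts on signed values are assumed; rounding is omitted. */
    static int32_t rescale(int32_t value, int f1, int f2) {
        if (f2 >= f1) return value * (1 << (f2 - f1));   /* gain fractional bits */
        return value >> (f1 - f2);                       /* lose fractional bits (truncates) */
    }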
Implementation
Hardware Support and Overflow Handling
Hardware implementations of fixed-point arithmetic are prevalent in digital signal processors (DSPs) and microcontrollers, featuring dedicated arithmetic logic units (ALUs) optimized for integer and fractional operations. For instance, the ARM Cortex-M4 and Cortex-M7 processors include DSP extensions to the Thumb instruction set, providing a 32-bit ALU capable of single-cycle execution for most fixed-point arithmetic instructions, such as saturating additions and multiplications on 16-bit or 8-bit data via SIMD processing.[32] Similarly, Texas Instruments' TMS320C55x DSP family incorporates a 40-bit ALU, a 16-bit ALU, and dual multiply-accumulate (MAC) units, enabling two fixed-point MAC operations per cycle with 40-bit accumulators that include 8 guard bits for extended precision.[33] These units support common Q formats, such as Q15 (16-bit with 15 fractional bits) and Q31 (32-bit with 31 fractional bits), facilitating efficient handling of fractional data in signal processing tasks. Barrel shifters are integral to these designs for scaling; the C55x employs a 40-bit barrel shifter to perform rapid left or right shifts, essential for adjusting fixed-point representations during operations like multiplication, where the result may double in bit width.[33]

Overflow handling in fixed-point hardware typically involves detection mechanisms and response strategies to prevent or manage loss of accuracy. Detection often relies on carry flags or sign bit comparisons; in the ARM Cortex-M processors, overflow is identified through dedicated flags in saturating instructions, while the C55x uses accumulator overflow flags (e.g., ACOV0) and a carry flag (CARRY) that monitors borrow or overflow at the 31st or 39th bit position, depending on the M40 mode bit.[32][33] Common responses include wraparound, where the result wraps modulo the representable range, as in non-saturating additions like ARM's SADD16; saturation, which clamps the output to the maximum (e.g., 0x7FFF for 16-bit signed) or minimum (e.g., 0x8000) value; and sticky flags that preserve the overflow status for software intervention. The C55x implements saturation via the SATD and SATA mode bits, clipping results to 7FFFh or 8000h when enabled, with intrinsics like _sadd enforcing this behavior. ARM's QADD and SSAT instructions similarly saturate on overflow, capping values to avoid wraparound artifacts in audio or image processing.[32][33]

Renormalization follows operations prone to scaling shifts, such as multiplication, to restore the fixed-point format and avert overflow. In the C55x, the FRCT bit automates a left shift by one bit after multiplication to eliminate the extra sign bit in fractional results, while intrinsics like _norm and _lnorm perform normalization by shifting 16-bit or 32-bit values left until the most significant bit is set. Post-multiplication renormalization often involves right-shifting the wide product to fit the target Q format, using the barrel shifter for efficiency; for example, in Q15 multiplication, the 32-bit result is shifted right by 15 bits before accumulation. In field-programmable gate arrays (FPGAs), Intel's high-level synthesis (HLS) flow supports renormalization through ac_fixed types, where arbitrary Q formats (defined by total bits N and integer bits I) allow custom overflow modes such as saturation during scaling operations.[33][34]

FPGAs extend fixed-point support via configurable logic blocks implementing Q formats, often with dedicated DSP slices for ALUs and multipliers.
Intel Agilex FPGAs, for instance, feature variable-precision DSP blocks optimized for fixed-point multiplication and accumulation, supporting saturation and wraparound modes configurable in HLS via ac_fixed's overflow parameter (O), which handles excess bits by clamping or by modular reduction. These implementations allow renormalization through explicit shifts in the design, ensuring scalability for custom Q formats in applications such as filters or transforms.[34][35]
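The clamping that saturating instructions such as ARM's QADD or the C55x SATD mode perform in hardware can be mimicked in portable C. The sketch below (the function name is illustrative) uses the sign-comparison rule described earlier: if both operands share a sign that the wrapped sum loses, the true result is out of range and is clamped to the 32-bit extremes.

    #include <stdint.h>

    /* Saturating 32-bit signed addition.  The sum is formed with unsigned
       arithmetic (which wraps harmlessly); if the operands agree in sign and
       the wrapped result disagrees, signed overflow occurred, so clamp. */
    static int32_t sat_add32(int32_t a, int32_t b) {
        uint32_t wrapped = (uint32_t)a + (uint32_t)b;
        int32_t  sum     = (int32_t)wrapped;

        if (((a ^ sum) & (b ^ sum)) < 0) {           /* result sign differs from both inputs */
            return (a < 0) ? INT32_MIN : INT32_MAX;
        }
        return sum;
    }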
Software and Language Support
Several programming languages provide built-in support for fixed-point arithmetic to ensure precise control over decimal precision in applications requiring exact fractional representations. In Ada, fixed-point types are a core language feature, categorized as either ordinary fixed-point types or decimal fixed-point types, where the error bound is defined by a delta parameter specifying the smallest representable difference between values. Similarly, COBOL natively employs fixed-point decimal arithmetic for all numeric computations, treating operands as packed decimal formats unless explicitly designated otherwise, which guarantees consistent handling of financial and business data without floating-point approximations.[36]

In languages lacking native fixed-point types, such as C and C++, developers typically emulate fixed-point arithmetic using integer types combined with scaling factors and macros to manage fractional parts. For instance, the libfixmath library implements Q16.16 fixed-point operations in C, providing functions for addition, multiplication, and trigonometric computations while avoiding floating-point dependencies for embedded systems.[37] Emulation techniques often involve defining user-defined structs or classes that encapsulate an integer value and a fixed scaling constant, such as representing a value v as \text{int} \times 2^{-s} where s is the scaling exponent, enabling straightforward arithmetic by shifting and scaling results.[38] In C++, operator overloading further enhances usability by allowing custom fixed-point classes to mimic built-in arithmetic operators, such as redefining + and * to perform scaled integer operations internally.[39]
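A struct-plus-scaling emulation of the kind described above might look like the following C sketch; the type and function names are illustrative and are not those of libfixmath. The raw field is an ordinary 32-bit integer, and the Q16.16 fractional position is fixed at compile time rather than stored with the data.

    #include <stdint.h>

    /* A Q16.16 value emulated with a plain 32-bit integer; the scaling exponent
       is a compile-time constant, so the data itself is just an int32_t. */
    #define FIX_FRAC_BITS 16

    typedef struct { int32_t raw; } fix16_t;

    static fix16_t fix_from_int(int32_t i)       { return (fix16_t){ i * (1 << FIX_FRAC_BITS) }; }
    static fix16_t fix_add(fix16_t a, fix16_t b) { return (fix16_t){ a.raw + b.raw }; }

    /* Widen to 64 bits, multiply, then shift back to the shared Q16.16 scale
       (an arithmetic right shift on signed values is assumed). */
    static fix16_t fix_mul(fix16_t a, fix16_t b) {
        return (fix16_t){ (int32_t)(((int64_t)a.raw * b.raw) >> FIX_FRAC_BITS) };
    }

    static double fix_to_double(fix16_t a)       { return a.raw / 65536.0; }

In C++, the same layout is typically wrapped in a class whose overloaded + and * call these scaled integer operations, so that fixed-point values can be used with ordinary arithmetic syntax.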
Specialized libraries extend fixed-point support across various ecosystems. Python's decimal module offers arbitrary-precision decimal arithmetic that can emulate fixed-point behavior through context settings for precision and rounding, making it suitable for applications demanding exact decimal fractions like currency calculations.[40] In Java, the BigDecimal class provides immutable, arbitrary-precision signed decimal numbers with a configurable scale, effectively supporting fixed-point decimal operations for high-accuracy computations in enterprise software.[41] For digital signal processing on ARM-based microcontrollers, the CMSIS-DSP library includes optimized fixed-point functions in formats like Q15 and Q31, leveraging integer instructions for efficient filtering and transforms on Cortex-M processors.[42]
Portability remains a key consideration in fixed-point implementations, as consistent scaling and precision must be maintained across diverse platforms without relying on floating-point hardware variations. Emulations using pure integer operations ensure bit-exact results regardless of endianness or compiler differences, though careful definition of scaling factors is essential to avoid platform-specific overflow behaviors.[38]