
Fixed-point arithmetic

Fixed-point arithmetic is a method of numerical representation for rational numbers in which the radix (or binary) point is located at a fixed position within the representation, allowing computations to be performed using integer arithmetic scaled by a power of the radix, typically 2 in binary systems. This approach contrasts with floating-point arithmetic by maintaining a constant scaling factor across all numbers, which simplifies hardware but limits the dynamic range and precision compared to formats that adjust the point position dynamically. Commonly used in embedded systems, digital signal processing (DSP), and real-time applications, fixed-point arithmetic enables efficient operations on resource-constrained devices like microcontrollers.

In fixed-point representations, numbers are stored as integers with an implied radix point at a predetermined position, often denoted in Qm.n notation, where m specifies the number of integer bits (including the sign bit) and n the number of fractional bits. For example, in a signed Q1.15 format using 16 bits, the value is interpreted as the stored integer divided by $2^{15}$, providing a range from -1 to nearly 1 with a resolution of $2^{-15} \approx 3.05 \times 10^{-5}$. Operations such as addition and subtraction require alignment of the binary points, which is straightforward since the position is fixed, but multiplication typically doubles the bit width of the result, necessitating right-shifting to fit the output format and potentially introducing quantization errors. Scaling during conversion from floating-point values, performed by multiplying by $2^n$, must account for overflow risks and precision loss.

The primary advantages of fixed-point arithmetic include significantly faster execution times and lower power consumption relative to floating-point, as it avoids the complexity of exponent handling and normalization; for instance, on microcontrollers like the PIC32, a fixed-point multiplication may take 21 cycles versus 55 for floating-point. It is particularly suited for applications requiring predictable performance, such as audio processing, control systems, and graphics rendering, where the fixed precision suffices and overflow can be managed through saturation or wrapping. However, its drawbacks include a restricted range (for a 32-bit signed format with 16 fractional bits, the maximum value is approximately 32767.9999) and susceptibility to accumulation of rounding errors in iterative computations, making it less suitable for high-precision scientific calculations. Modern compilers and libraries, such as those supporting the _Accum type in Embedded C, provide built-in fixed-point data types to facilitate its use in software.
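
As a concrete illustration of the Q1.15 convention described above, the following minimal C sketch (helper names are hypothetical) converts to and from the format and performs one multiplication with the shift-back step:

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative Q1.15 helpers: value = stored_int / 2^15 (hypothetical names). */
typedef int16_t q15_t;

static q15_t q15_from_double(double x) { return (q15_t)(x * 32768.0); } /* scale by 2^15 */
static double q15_to_double(q15_t q)   { return q / 32768.0; }          /* divide by 2^15 */

/* Multiply two Q15 values: the 32-bit product has 30 fractional bits, shift back to 15. */
static q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t prod = (int32_t)a * (int32_t)b;   /* exact Q2.30 intermediate */
    return (q15_t)(prod >> 15);               /* truncate back to Q1.15 */
}

int main(void)
{
    q15_t a = q15_from_double(0.75);
    q15_t b = q15_from_double(-0.5);
    printf("0.75 * -0.5 ~= %f\n", q15_to_double(q15_mul(a, b)));  /* about -0.375 */
    return 0;
}
```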

Fundamentals

Definition and Basic Representation

Fixed-point arithmetic is a method for representing rational numbers in computing by storing them as integers scaled according to a fixed radix point position, which eliminates the need for an explicit exponent or radix point in storage. This approach treats the number as an integer multiple of a power of the radix (typically base 2 for binary systems), where the scaling factor corresponds to the position of the implied radix point. In essence, a fixed-point number $x$ can be expressed as $x = I + F$, where $I$ is the integer part and $F$ is the fractional part, but both are combined into a single stored value interpreted relative to the fixed radix point.

In basic storage, fixed-point numbers are represented using standard integer formats, typically in binary with a fixed word length, and signed values often employ two's complement to handle negative numbers efficiently. The entire value is stored as a contiguous bit string without any explicit indicator of the radix point; instead, the position is predetermined by the format, allowing arithmetic operations to proceed as if on plain integers while the interpretation yields fractional results. For example, in an 8-bit signed representation, the bits might allocate space for both integer and fractional components, with negative values encoded in two's complement to extend the range nearly symmetrically around zero.

The placement of the radix point defines the division between integer and fractional parts, commonly denoted in Qm.n format, where m represents the number of bits for the integer portion (excluding the sign bit) and n the number of bits for the fractional portion. In this notation, the total bit width is typically $m + n + 1$ (including the sign bit for signed formats), and the value is interpreted as the two's-complement integer encoded by the bit pattern divided by $2^n$. For instance, in Q3.4 format with 8 bits, the first 3 bits (after the sign bit) hold the integer part and the remaining 4 bits the fraction, enabling representation of values from -8 to nearly 8 with a resolution of $1/16$.

This representation offers advantages over pure integer arithmetic by allowing the encoding of fractional values through the implicit scaling, thus supporting applications requiring sub-unit resolution without the overhead of dedicated floating-point units. By leveraging integer hardware, fixed-point formats facilitate efficient computation with rational numbers in resource-constrained environments, such as embedded systems.
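
The Q3.4 example above can be decoded with ordinary integer hardware; the short C sketch below (hypothetical helper name) simply divides the raw two's-complement byte by the implied scaling factor $2^4$:

```c
#include <stdint.h>
#include <stdio.h>

/* Interpret an 8-bit two's-complement pattern as Q3.4 (sign + 3 integer + 4 fractional bits). */
static double q3_4_value(int8_t raw)
{
    return raw / 16.0;   /* divide by 2^4 = 16, the implied scaling factor */
}

int main(void)
{
    int8_t raw = (int8_t)0x5A;   /* 0101.1010 binary = 5 + 0.625 */
    printf("0x%02X -> %.4f\n", (uint8_t)raw, q3_4_value(raw));   /* prints 5.6250 */

    int8_t neg = (int8_t)0x80;   /* most negative pattern */
    printf("0x%02X -> %.4f\n", (uint8_t)neg, q3_4_value(neg));   /* prints -8.0000 */
    return 0;
}
```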

Scaling Factors and Precision Choices

In fixed-point arithmetic, the scaling factor determines the position of the radix point relative to the stored bits, effectively defining the representation as a scaled integer. For binary fixed-point systems, the scaling is typically a power of 2, denoted as $2^{-f}$ where $f$ is the number of fractional bits, allowing the value to be interpreted as $v = i \times 2^{-f}$ for a stored integer $i$. Selection of this scaling involves analyzing the application's required range and resolution; for instance, allocating more bits to the integer part expands the representable range (the stored integer spans $[-2^{w-1}, 2^{w-1})$ for $w$ total bits in signed two's complement) at the expense of fractional precision, while prioritizing fractional bits enhances accuracy for small values but risks overflow for larger ones.

The trade-offs in scaling choices are critical for optimizing performance in resource-constrained environments like digital signal processors (DSPs). Increasing the total word length (e.g., from 16 to 24 bits) improves both range and precision but raises hardware cost and power consumption; conversely, shorter words favor efficiency but necessitate frequent rescaling to prevent overflow. In practice, application-specific analysis guides this balance, such as using 16-bit formats for audio processing where moderate precision suffices, versus 32-bit formats for computations demanding higher fidelity.

Certain real numbers can be represented exactly in fixed-point arithmetic if they are dyadic rationals, i.e., fractions with denominators that are powers of 2. For example, 0.5 equals $1/2$ and is exactly representable with one fractional bit as $0.1_2$, while 0.1 requires an infinite binary expansion and cannot be exact, leading to quantization error. Exactness holds only for values whose binary expansion terminates, limiting precise representation to finite sums of distinct powers of $1/2$ within the available fractional bits.

Precision loss in fixed-point arises primarily from quantization, where non-representable values are approximated via rounding or truncation, introducing error. The maximum quantization error for rounding to nearest with $n$ fractional bits is $\epsilon = \frac{1}{2^{n+1}}$, as the error is bounded by half the value of the least significant bit. Truncation yields a maximum error of $\frac{1}{2^n}$, often modeled as a uniformly distributed error with variance $\sigma^2 = \frac{(2^{-n})^2}{12}$. These errors accumulate over successive operations, necessitating scaling adjustments to maintain overall accuracy.

The choice of radix significantly influences fixed-point efficiency, with binary (radix 2) predominant due to its alignment with digital logic and shift-based scaling, enabling simple multiplication and division by powers of 2 via bit shifts. Decimal fixed-point (radix 10), while intuitive for human-readable outputs like financial applications, requires more complex circuitry for arithmetic and scaling, increasing area and delay by factors of roughly 3-5 compared to binary implementations. Binary thus offers superior hardware efficiency for most computational tasks, though decimal persists where exact decimal fractions (e.g., 0.1) are essential.
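
The quantization bounds above can be checked numerically; this small C sketch (assuming 15 fractional bits) quantizes 0.1 by truncation and by round-to-nearest and prints the errors against the $2^{-n}$ and $2^{-(n+1)}$ bounds:

```c
#include <math.h>
#include <stdio.h>

/* Quantize x to n fractional bits by truncation and by round-to-nearest (illustrative sketch). */
int main(void)
{
    const double x = 0.1;              /* not a dyadic rational: cannot be represented exactly */
    const int    n = 15;               /* fractional bits */
    const double lsb = ldexp(1.0, -n); /* 2^-n, the value of the least significant bit */

    double truncated = floor(x / lsb) * lsb;
    double rounded   = round(x / lsb) * lsb;

    printf("truncation error: %.3e (bound 2^-%d = %.3e)\n", x - truncated, n, lsb);
    printf("rounding error:   %.3e (bound 2^-%d = %.3e)\n", x - rounded, n + 1, lsb / 2);
    return 0;
}
```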

Comparison with Floating-Point Arithmetic

Fixed-point arithmetic represents numbers using a fixed binary point position within a word, without an explicit exponent, resulting in a uniform scaling factor applied across the entire range of values. This contrasts with floating-point arithmetic, which employs a significand (mantissa) and an exponent to normalize the number, allowing the binary point to "float" and adapt to the magnitude, as defined in standards like IEEE 754. In fixed-point, all numbers share the same absolute precision, determined by the position of the binary point within the word, enabling straightforward integer-like operations but restricting adaptability to varying scales.

Regarding precision and dynamic range, fixed-point formats provide constant absolute precision throughout their representable range, with relative precision that degrades for small magnitudes. In contrast, floating-point arithmetic provides approximately constant relative precision across its range due to the fixed number of mantissa bits (e.g., 23 bits for single-precision IEEE 754, leading to about 7 decimal digits), though absolute precision varies, being finer for smaller exponents. However, fixed-point's range is inherently limited by the word length and scaling choice, often requiring careful selection to balance integer and fractional parts without overflow, whereas floating-point supports a vastly wider range, spanning from approximately $10^{-38}$ to $10^{38}$ in single precision, though at the cost of precision degradation near the extremes. This makes floating-point suitable for scientific computing with disparate magnitudes, while fixed-point excels in applications needing uniform accuracy within a bounded domain.

In terms of performance, fixed-point arithmetic is generally faster and more resource-efficient, as it leverages simple integer operations without the need for exponent alignment, normalization, or complex rounding hardware, reducing latency and power consumption in embedded systems. For instance, on processors lacking dedicated floating-point units, fixed-point can approach the speed of native integer arithmetic, whereas emulated floating-point introduces significant overhead due to these additional steps. This efficiency advantage is particularly pronounced in low-power devices like DSP processors, where fixed-point implementations can achieve up to several times the throughput of floating-point equivalents for the same bit width.

Error accumulation in fixed-point arithmetic tends to be more predictable and additive, stemming primarily from quantization and potential overflows that can be mitigated through saturation and careful scaling, with quantization error commonly modeled as uniformly distributed across operations. In contrast, floating-point errors arise from rounding during normalization and exponent handling, which can propagate non-uniformly and introduce inconsistencies, such as cancellation in subtractions of close values, though standards like IEEE 754 bound the relative error to less than one unit in the last place (ulp) per operation. Fixed-point's deterministic error behavior facilitates easier analysis in control systems, while floating-point's variability requires compensatory techniques like guard digits.
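
The difference between constant absolute precision and constant relative precision can be made concrete; the following C sketch compares the fixed Q16.16 step of $2^{-16}$ with the spacing between adjacent single-precision floats at several magnitudes:

```c
#include <math.h>
#include <stdio.h>

/* Compare step sizes: Q16.16 has a constant step of 2^-16 everywhere, while
   single-precision float spacing grows with magnitude (illustrative sketch). */
int main(void)
{
    const double q16_step = ldexp(1.0, -16);   /* 2^-16, independent of the value */
    for (float v = 1.0f; v <= 16384.0f; v *= 128.0f) {
        double float_step = nextafterf(v, INFINITY) - v;  /* gap to the next representable float */
        printf("near %8.1f: float step = %.3e, Q16.16 step = %.3e\n",
               v, float_step, q16_step);
    }
    return 0;
}
```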

Operations

Addition and Subtraction

In fixed-point arithmetic, addition and subtraction require that the operands share the same scaling factor to ensure their radix points align, preventing errors in the fractional representation. If the scaling factors differ, one operand must be adjusted by shifting its binary representation, typically by multiplying or dividing by a power of 2 (e.g., left-shifting by $k$ bits to increase the number of fractional bits from $m$ to $m+k$), to match the other's scale before performing the operation. This alignment treats the fixed-point numbers as scaled integers, where the scaling factor $s = 2^f$ reflects the position of the radix point after $f$ fractional bits.

Once aligned, addition proceeds by performing standard integer addition on the scaled values, with the common scaling factor then applied to interpret the result. For two fixed-point numbers $a$ and $b$ with the same scale $s$, the sum is computed as $c = \frac{(a \cdot s) + (b \cdot s)}{s}$, where $a \cdot s$ and $b \cdot s$ are the stored integer representations. Subtraction follows analogously: $c = \frac{(a \cdot s) - (b \cdot s)}{s}$. For signed fixed-point numbers, these operations employ two's complement representation, where subtraction is implemented by adding the two's complement of the subtrahend to the minuend using the same adder circuit, ensuring consistent handling of negative values.

Overflow in fixed-point addition and subtraction occurs if the result exceeds the representable range defined by the word length and scaling, such as surpassing the maximum positive value (e.g., $2^{i} - 2^{-f}$ for an $i$-bit integer part and $f$-bit fraction) or the minimum negative value. Detection typically involves checking the sign bits of the operands and result: for signed addition, overflow is flagged if both operands have the same sign but the result has the opposite sign; similar checks apply after subtraction. To mitigate overflow, implementations may use saturation arithmetic, clamping the result to the nearest representable extreme rather than wrapping around.
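
A minimal C sketch of these steps, assuming signed Q15 operands (hypothetical helper names): one operand is re-aligned by a left shift, the sum is formed in a wider integer, and overflow is handled by saturation:

```c
#include <stdint.h>

/* Add two signed Q15 values with saturation on overflow (illustrative sketch). */
static int16_t q15_add_sat(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;   /* widen so the true sum is exact */
    if (sum >  32767) return  32767;         /* clamp to the maximum Q15 value */
    if (sum < -32768) return -32768;         /* clamp to the minimum Q15 value */
    return (int16_t)sum;
}

/* Align a Q12 value (12 fractional bits) to Q15 before adding it to a Q15 value,
   assuming the shifted value still fits in 16 bits. */
static int16_t add_q12_to_q15(int16_t a_q15, int16_t b_q12)
{
    return q15_add_sat(a_q15, (int16_t)(b_q12 << 3));  /* shift left by 15 - 12 = 3 bits */
}
```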

Multiplication

In fixed-point arithmetic, multiplication of two numbers represented in formats $Q_{m.n}$ and $Q_{p.q}$ begins with treating them as integers and performing standard integer multiplication, yielding a product in $Q_{m+p.n+q}$ format, where the integer part spans $m+p$ bits and the fractional part spans $n+q$ bits. This growth in fractional bits arises because the scaling factors multiply: a product of operands scaled by $2^{-n}$ and $2^{-q}$ carries the combined scale $2^{-(n+q)}$. To restore the desired scaling, such as returning the result to the first operand's $n$ fractional bits, the product undergoes a right shift by $q$ bits, effectively dividing by $2^{q}$ to reposition the binary point; in the common case where both operands share the same format (e.g., Q15), the shift amount equals the number of fractional bits. This adjustment may involve truncation of the least significant bits or rounding to mitigate precision loss, depending on the implementation. The operation can be expressed as $\text{result} = (a \times b) \gg q$, where $a$ and $b$ are the stored integers and $\gg$ denotes an arithmetic right shift to handle sign extension in signed representations.

For signed fixed-point numbers in two's complement format, multiplication preserves the sign through sign extension of partial products during accumulation, ensuring the result remains in two's complement. Hardware implementations often employ Booth's algorithm to enhance efficiency, which recodes the multiplier by examining bit pairs to replace strings of ones with fewer additions and subtractions, reducing the number of operations, especially for numbers with long runs of identical bits. This technique, originally proposed by Andrew D. Booth in 1951, is particularly suited for signed binary multiplication in fixed-point systems.
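
A C sketch of the shift-back step for two signed Q15 operands, with half an LSB added before the shift so the result is rounded to nearest rather than truncated (hypothetical helper name; the single corner case $-1.0 \times -1.0$ would additionally require saturation):

```c
#include <stdint.h>

/* Multiply two signed Q15 values; the 32-bit product has 30 fractional bits,
   so shift right by 15 to return to Q15, adding half of the final LSB first
   to round to nearest (illustrative sketch). */
static int16_t q15_mul_round(int16_t a, int16_t b)
{
    int32_t prod = (int32_t)a * (int32_t)b;   /* exact Q2.30 intermediate */
    prod += 1 << 14;                          /* add 0.5 of the final LSB for rounding */
    return (int16_t)(prod >> 15);             /* arithmetic shift back to Q15 */
}
```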

Division and Scaling Conversions

In fixed-point arithmetic, division of two numbers with potentially different scaling factors requires careful adjustment to maintain precision and avoid overflow. Consider a dividend represented as an integer $D$ scaled by factor $s_d$ (where the real value is $D \cdot s_d$) and a divisor represented as $V$ scaled by $s_v$ (real value $V \cdot s_v$). The quotient in real terms is $(D \cdot s_d) / (V \cdot s_v)$, which simplifies to $(D / V) \cdot (s_d / s_v)$. A plain integer division $D / V$ therefore produces a result scaled by $s_r = s_d / s_v$, which for equal input scales leaves no fractional precision; one common procedure instead pre-scales the dividend by the reciprocal of the divisor's scaling factor (a left shift by the divisor's fractional bits for binary scales) before dividing, so the integer quotient $(D / s_v) / V$ represents the quotient at the dividend's scale $s_d$. This approach leverages shifts or multipliers for the scaling step, treating the operation as integer arithmetic while accounting for the scales; however, it demands sufficient bit width in intermediate results to prevent overflow, often using double-width intermediates (e.g., 64 bits for 32-bit operands).

For efficiency, especially in software implementations on DSP processors, division often employs reciprocal approximation followed by multiplication. A prominent method uses the Newton-Raphson iteration to approximate the reciprocal $1/V$: starting with an initial guess $x_0$ (e.g., from a lookup table), iterate $x_{n+1} = x_n \cdot (2 - V \cdot x_n)$ until convergence, typically in 2-3 steps for fixed-point precision. The quotient is then obtained as $D \cdot x_n$, with a final shift handling the adjustment between $s_d$ and $s_v$. This makes division faster than digit-by-digit methods for repeated operations, though it requires an initial approximation accurate to a few bits.

Scaling conversions between different fixed-point formats involve rescaling the integer representation by the ratio of the source to the target scaling factor. To convert a source integer $I$ with scale $s_1$ to scale $s_2$, compute the new integer as $(I \cdot s_1) / s_2$; the result now represents the same value scaled by $s_2$. If scales are powers of two (common for efficiency, e.g., $s = 2^{-f}$), this reduces to a bit shift: a left shift by $k$ bits when adding $k$ fractional bits, or a right shift when removing them, potentially with rounding to mitigate precision loss. Virtual shifts reinterpret the bit positions without altering the stored value, adjusting only the implied radix point (e.g., $X(a, b) \gg n = X(a - n, b + n)$ in integer-bit/fraction-bit notation), avoiding hardware shifts but requiring software tracking of the format. Physical shifts modify the bits directly but risk overflow or precision loss if the word length is fixed.

In hardware implementations, fixed-point division frequently uses digit-recurrence algorithms like restoring or non-restoring division, for example on resource-constrained devices such as FPGAs. The restoring algorithm proceeds iteratively: for each quotient bit, subtract the shifted divisor from the partial remainder; if the result is negative, restore it by adding back the divisor and set the quotient bit to 0, otherwise set it to 1 and proceed without restoration. The restore adds an extra addition step per iteration. The non-restoring variant optimizes this by alternating subtraction and addition based on the remainder's sign, eliminating the restore step: if the partial remainder is positive, subtract and set the bit to 1; if negative, add and set it to 0, with a final correction if needed. Both generate one quotient bit per cycle and are suitable for fixed-point operands by aligning the binary point, offering modest hardware cost at a latency of $n$ cycles for an $n$-bit quotient, in contrast to faster but larger multiplication-based methods.
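
A C sketch of the pre-scaled division and of power-of-two format conversions, assuming Q16.16 operands and a 64-bit intermediate (hypothetical helper names; no rounding or divide-by-zero handling):

```c
#include <stdint.h>

/* Divide two signed Q16.16 values: pre-scale the dividend by 2^16 in a 64-bit
   intermediate so the integer quotient comes out in Q16.16 again
   (illustrative sketch; no rounding or divide-by-zero handling). */
static int32_t q16_div(int32_t dividend, int32_t divisor)
{
    int64_t num = (int64_t)dividend << 16;   /* widen and pre-scale by 2^16 */
    return (int32_t)(num / divisor);         /* integer division, result in Q16.16 */
}

/* Rescale between formats, assuming the values fit the target width:
   Q16.16 -> Q8.24 is a left shift by 8; Q16.16 -> Q24.8 is a right shift by 8
   (truncating the lost fractional bits). */
static int32_t q16_16_to_q8_24(int32_t x) { return x << 8; }
static int32_t q16_16_to_q24_8(int32_t x) { return x >> 8; }
```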

Implementation

Hardware Support and Overflow Handling

Hardware implementations of fixed-point arithmetic are prevalent in digital signal processors (DSPs) and microcontrollers, featuring dedicated arithmetic logic units (ALUs) optimized for integer and fractional operations. For instance, the ARM Cortex-M4 and Cortex-M7 processors include DSP extensions to the Thumb instruction set, providing a 32-bit ALU capable of single-cycle execution for most fixed-point arithmetic instructions, such as saturating additions and multiplications on 16-bit or 8-bit data via SIMD processing. Similarly, Texas Instruments' TMS320C55x family incorporates a 40-bit ALU, a 16-bit ALU, and dual multiply-accumulate (MAC) units, enabling two fixed-point MAC operations per cycle with 40-bit accumulators that include 8 guard bits for intermediate overflow headroom. These units support common Q formats, such as Q15 (16-bit with 15 fractional bits) and Q31 (32-bit with 31 fractional bits), facilitating efficient handling of fractional data in signal processing tasks. Barrel shifters are integral to these designs for scaling; the C55x employs a 40-bit barrel shifter to perform rapid left or right shifts, essential for adjusting fixed-point representations during operations like multiplication, where the result may double in bit width.

Overflow handling in fixed-point hardware typically involves detection and response strategies to prevent or manage loss of accuracy. Detection often relies on status flags or comparisons; in the ARM processors, overflow is identified through dedicated flags set by saturating instructions, while the C55x uses accumulator overflow flags (e.g., ACOV0) and a carry flag (CARRY) that monitors borrow or overflow at the 31st or 39th bit position, depending on the M40 mode bit. Common responses include wraparound, where the result wraps modulo the representable range, as in non-saturating additions like ARM's SADD16; saturation, which clamps the output to the maximum (e.g., 0x7FFF for 16-bit signed) or minimum (e.g., 0x8000) value; and sticky flags that persist overflow status for software intervention. The C55x implements saturation via the SATD and SATA mode bits, clipping results to 7FFFh or 8000h when enabled, with intrinsics like _sadd enforcing this behavior. ARM's QADD and SSAT instructions similarly saturate on overflow, capping values to avoid wraparound artifacts in audio or image processing.

Renormalization follows operations prone to scaling shifts, such as multiplication, to restore the fixed-point format and avert overflow. In the C55x, the FRCT bit automates a left shift by one bit after multiplication to eliminate the redundant sign bit in fractional results, while intrinsics like _norm and _lnorm support normalization by determining how far 16-bit or 32-bit values can be shifted left until the most significant bit is set. Post-multiplication renormalization often involves right-shifting the double-width product to fit the target Q format, using the barrel shifter for efficiency; for example, in Q15 multiplication, the result is shifted right by 15 bits before accumulation.

Field-programmable gate arrays (FPGAs) extend fixed-point support via configurable logic and dedicated DSP blocks implementing arbitrary Q formats. Intel's high-level synthesis (HLS) flow supports fixed-point arithmetic through ac_fixed types, where arbitrary formats (defined by total bits N and integer bits I) allow custom overflow handling during scaling operations. Intel Agilex FPGAs, for instance, feature variable-precision DSP blocks optimized for fixed-point multiplication and accumulation, supporting saturation and wraparound modes configurable in HLS via ac_fixed's overflow parameter (O), which handles excess bits by clamping or modular reduction. These implementations allow renormalization through explicit shifts in the design, ensuring scalability for custom Q formats in applications like digital filters or transforms.
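
The guard-bit-and-saturate pattern these accumulators implement can be mimicked in portable C; the sketch below is illustrative only and is not processor-specific code:

```c
#include <stdint.h>

/* Sketch of a guarded accumulate-and-saturate pattern, mimicking in software
   what DSP accumulators with guard bits do in hardware (not processor-specific). */
static int64_t acc = 0;                 /* 64-bit accumulator: ample guard bits above Q30 products */

static void mac_q15(int16_t a, int16_t b)
{
    acc += (int32_t)a * (int32_t)b;     /* accumulate exact Q2.30 products without overflow */
}

static int16_t store_q15_saturated(void)
{
    int64_t v = acc >> 15;              /* renormalize from 30 back to 15 fractional bits */
    if (v >  32767) return  32767;      /* saturate, analogous to SATD/SSAT-style clipping */
    if (v < -32768) return -32768;
    return (int16_t)v;
}
```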

Software and Language Support

Several programming languages provide built-in support for fixed-point arithmetic to ensure precise control over decimal precision in applications requiring exact fractional representations. In Ada, fixed-point types are a core language feature, categorized as either ordinary fixed-point types or decimal fixed-point types, where the error bound is defined by a delta parameter specifying the smallest representable difference between values. Similarly, COBOL natively employs fixed-point decimal arithmetic for numeric computations, treating operands as packed decimal formats unless explicitly designated otherwise, which guarantees consistent handling of financial and business data without floating-point approximations.

In languages lacking native fixed-point types, such as C and C++, developers typically emulate fixed-point arithmetic using integer types combined with scaling factors and macros to manage fractional parts. For instance, the libfixmath library implements Q16.16 fixed-point operations in C, providing functions for addition, multiplication, and trigonometric computations while avoiding floating-point dependencies for embedded systems. Emulation techniques often involve defining user structs or classes to encapsulate an integer value and a fixed scaling constant, such as representing a value $v$ as $\text{int} \times 2^{-s}$ where $s$ is the scaling exponent, enabling straightforward arithmetic by shifting and scaling results. In C++, operator overloading further enhances usability by allowing custom fixed-point classes to mimic built-in arithmetic operators, such as redefining + and * to perform scaled integer operations internally.

Specialized libraries extend fixed-point support across various ecosystems. Python's decimal module offers arbitrary-precision decimal arithmetic that can emulate fixed-point behavior through context settings for precision and rounding, making it suitable for applications demanding exact decimal fractions such as monetary calculations. In Java, the BigDecimal class provides immutable, arbitrary-precision signed decimal numbers with a configurable scale, effectively supporting fixed-point decimal operations for high-accuracy computations. For digital signal processing on ARM-based microcontrollers, the CMSIS-DSP library includes optimized fixed-point functions in formats like Q15 and Q31, leveraging integer instructions for efficient filtering and transforms on Cortex-M processors.

Portability remains a key consideration in fixed-point implementations, as consistent scaling and precision must be maintained across diverse platforms without relying on floating-point behavior. Emulations using pure integer operations ensure bit-exact results regardless of compiler or hardware differences, though careful definition of scaling factors is essential to avoid platform-specific behaviors.
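
A minimal hand-rolled emulation of the kind described above might look like the following C sketch; the fix16 type and helpers are hypothetical and are not the libfixmath API:

```c
#include <stdint.h>

/* Minimal hand-rolled Q16.16 emulation (hypothetical names, not the libfixmath API). */
typedef int32_t fix16;

#define FIX16_ONE  (1 << 16)                      /* 1.0 in Q16.16 */
#define FIX16(x)   ((fix16)((x) * 65536.0))       /* conversion from a numeric literal */

static inline fix16 fix16_add(fix16 a, fix16 b) { return a + b; }          /* same scale: plain add */
static inline fix16 fix16_mul(fix16 a, fix16 b)
{
    return (fix16)(((int64_t)a * b) >> 16);       /* widen, multiply, shift back to 16 fractional bits */
}
static inline double fix16_to_double(fix16 a)   { return a / 65536.0; }
```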

Conversion Between Fixed- and Floating-Point

Converting a fixed-point number to floating-point involves interpreting the fixed-point integer representation by dividing it by the scaling factor, which typically equals $2^Q$ where $Q$ is the number of fractional bits in the fixed-point format. This process scales the value to its true magnitude while preserving the sign for signed formats. For example, in hardware implementations like ARM's VCVT instruction, the fixed-point value is loaded into a register, the fractional bits are accounted for via a specified fbits parameter (up to 32), and the result is rounded to nearest for floating-point output. Normalization may be required if the fixed-point format lacks an implied leading 1, involving a left shift to align the most significant bit and adjustment of the exponent accordingly. Overflow occurs if the scaled value exceeds the floating-point format's range (e.g., the IEEE 754 single-precision maximum of approximately $3.4 \times 10^{38}$), in which case the result is set to infinity or clamped, while underflow to subnormal values may produce denormalized floating-point numbers if the magnitude falls below the minimum normalized value.

The formula for fixed-point to floating-point conversion is $fp = \frac{\text{fixed\_int}}{2^Q}$, where $\text{fixed\_int}$ is the signed or unsigned integer stored in the fixed-point representation and $fp$ is the resulting floating-point value. For signed formats using two's complement, the sign is extended appropriately before the division. In cases where the fixed-point value is exactly representable in floating-point (e.g., powers of 2 within the exponent range), the conversion is exact; otherwise, rounding to nearest ensures minimal error, though precision loss can occur if the fixed-point value carries more significant bits than the floating-point significand (e.g., 23 explicit bits for single precision). Denormals in the output floating-point arise when the fixed-point value's scaled magnitude is very small, below $2^{-126}$ for single precision, allowing gradual underflow without abrupt zeroing. This conversion is common in mixed-precision systems, such as DSP processors where fixed-point accelerates certain operations but floating-point is needed for broader dynamic range in intermediate results.

The reverse conversion, from floating-point to fixed-point, multiplies the floating-point value by the scaling factor $2^Q$ and rounds the result to the nearest integer to fit the fixed-point representation. Rounding modes vary by implementation; ARM's VCVT uses round towards zero for this direction, while general software algorithms often employ round to nearest for better accuracy. The formula is $\text{fixed\_int} = \operatorname{round}\left( fp \times 2^Q \right)$, where $\operatorname{round}$ denotes the chosen rounding function. Precision loss is inevitable if the floating-point value has more significant digits than the fixed-point format can represent, potentially leading to quantization errors; for instance, values not aligned with the fixed-point grid after scaling are truncated or rounded, with error bounded by $2^{-(Q+1)}$ for round-to-nearest. Overflow and underflow are handled by clamping to the fixed-point range (e.g., from $-2^{I-1}$ to $2^{I-1} - 2^{-Q}$ for signed formats with $I$ integer bits) or raising exceptions, as in IEEE 754-compliant systems. In mixed-precision environments, such conversions enable seamless integration, for example converting floating-point inputs to fixed-point for efficient DSP processing while monitoring for inexact results that could propagate errors.
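
A software sketch of both directions in C, assuming $Q = 16$ fractional bits, round-to-nearest into fixed point, and clamping on overflow (illustrative only, not the ARM VCVT behavior):

```c
#include <math.h>
#include <stdint.h>

/* Convert between float and a signed fixed-point integer with Q_BITS fractional bits
   (illustrative sketch; clamps on overflow, rounds to nearest on the way in). */
#define Q_BITS 16

static float fixed_to_float(int32_t fixed_int)
{
    return (float)fixed_int / (float)(1 << Q_BITS);      /* divide by 2^Q */
}

static int32_t float_to_fixed(float fp)
{
    double scaled = (double)fp * (double)(1 << Q_BITS);  /* multiply by 2^Q */
    if (scaled >  2147483647.0) return INT32_MAX;        /* clamp instead of wrapping */
    if (scaled < -2147483648.0) return INT32_MIN;
    return (int32_t)lround(scaled);                      /* round to nearest integer */
}
```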

Applications and Examples

Practical Applications

Fixed-point arithmetic is widely employed in embedded systems, particularly in resource-constrained environments like microcontrollers, where its predictability and low computational overhead enable efficient real-time operations. In robotics, for instance, proportional-integral-derivative (PID) controllers often utilize fixed-point representations for real-time control tasks, such as stabilizing quad-rotor flying robots, by performing closed-loop calculations with 16-bit fixed-point fractions to balance precision and speed without the variability of floating-point units. This approach is especially valuable in vision-based applications on embedded platforms, where fixed-point scaling avoids the precision issues of floating-point while supporting compact, power-efficient hardware.

In digital signal processing (DSP), fixed-point arithmetic underpins numerous algorithms in audio and video processing, offering superior efficiency over floating-point in scenarios demanding high throughput on limited hardware. Fixed-point implementations are common in finite impulse response (FIR) filters and fast Fourier transforms (FFTs) used in audio codecs, where they reduce power consumption and cost compared to floating-point alternatives, making them ideal for portable devices and real-time streaming. For video codecs, fixed-point DSPs handle tasks like motion compensation and transform coding with deterministic behavior, ensuring consistent performance in high-volume, battery-operated systems without the dynamic range overhead of floating-point processors.

In recent years, fixed-point arithmetic has gained prominence in tiny machine learning (TinyML) applications for edge devices. Quantized fixed-point representations enable efficient inference of neural networks on microcontrollers and FPGAs, minimizing memory usage and power consumption while maintaining acceptable accuracy for common inference tasks. As of 2024, frameworks supporting fixed-point quantization have accelerated the deployment of models on edge devices.

Financial software frequently adopts fixed-point arithmetic to represent monetary values precisely, mitigating the rounding errors inherent in binary floating-point that could lead to discrepancies in transactions or billing. By scaling amounts to integers (e.g., treating dollars as multiples of cents), banking systems ensure exact decimal handling, avoiding cumulative errors that might otherwise accumulate to significant financial losses, such as millions in annual billing inaccuracies. This method aligns with human-readable decimal norms, providing reliability in applications like billing calculations and account balancing, where even minor losses are unacceptable.

In video games and physics simulations, fixed-point arithmetic facilitates consistent, low-latency computations on low-end hardware, such as older consoles or mobile devices, by replacing floating-point operations with integer-based scaling for predictable results. For physics engines simulating gravity, collisions, and fluid dynamics, fixed-point methods maintain uniform behavior across platforms, reducing resource demands and enabling real-time rendering without floating-point's variability. This is particularly evident in resource-optimized simulations, where fixed-point conversions between rendering and physics modules preserve accuracy while minimizing hardware overhead on embedded GPUs.

Detailed Numerical Examples

To illustrate binary fixed-point multiplication using an unsigned format, consider the numbers 1.5 and 0.75 represented in UQ1.3 format (1 integer bit and 3 fractional bits, total 4 bits, scaling factor $2^{-3} = 0.125$). The representation of 1.5 is 1100 (decimal 12, value $12 \times 0.125 = 1.5$), and 0.75 is 0110 (decimal 6, value $6 \times 0.125 = 0.75$). The multiplication proceeds by treating the representations as unsigned integers: $12 \times 6 = 72$ (binary 1001000). In UQ1.3 format, the product of two such numbers yields a UQ2.6 intermediate result (scaling $2^{-6} = 0.015625$), so the value is $72 \times 0.015625 = 1.125$. To fit back into UQ1.3, shift the integer product right by 3 bits (equivalent to dividing by 8): $72 \div 8 = 9$ (binary 1001). The result 1001 in UQ1.3 represents $9 \times 0.125 = 1.125$, matching the exact product $1.5 \times 0.75 = 1.125$. In binary, the 7-bit product 1001000 shifted right by 3 becomes 1001 (padded as 00001001 in 8 bits for clarity before truncation to 4 bits).

For a decimal fixed-point multiplication example, take 12.34 and 5.67 with a scaling factor of 100 (two decimal places, so the representations are the integers 1234 and 567). Multiply the scaled integers: $1234 \times 567 = 699678$. The product scaling is $100 \times 100 = 10000$, so divide by 10000: $699678 \div 10000 = 69.9678$. Rounding to two decimal places (common for this scale) gives 69.97, approximating the exact $12.34 \times 5.67 = 69.9678$. This method avoids floating-point operations while preserving precision up to the chosen scale.

An example of fixed-point addition in binary uses unsigned UQ2.2 format (2 integer bits and 2 fractional bits, total 4 bits, scaling $2^{-2} = 0.25$) for 3.25 and 1.75. The representation of 3.25 is 1101 (decimal 13, value $13 \times 0.25 = 3.25$), and 1.75 is 0111 (decimal 7, value $7 \times 0.25 = 1.75$). Add the integers directly: $13 + 7 = 20$ (binary 10100). The sum 10100 in an extended UQ3.2 format (to handle the carry) represents $20 \times 0.25 = 5.00$, matching $3.25 + 1.75 = 5.00$. Note the overflow beyond 2 integer bits (maximum 3.75 in UQ2.2), requiring a wider format for the result.

For fixed-point division using a scaled reciprocal, consider 10 divided by 3 in unsigned UQ4.4 format (4 integer bits and 4 fractional bits, total 8 bits, scaling $2^{-4} = 0.0625$). Represent 10 as 10100000 (decimal 160, value $160 \times 0.0625 = 10$) and 3 as 00110000 (decimal 48, value $48 \times 0.0625 = 3$). To divide, approximate the reciprocal of 3 with scaling $2^8 = 256$: $256 \div 3 \approx 85.333$, so take 85 (binary 01010101 in UQ0.8, value $85 \times 2^{-8} \approx 0.3320$). Multiplying the dividend by this reciprocal gives $10 \times 0.3320 = 3.320$. In integer terms, scale 10 by $2^4 = 16$ to 160, multiply by 85: $160 \times 85 = 13600$, then adjust the scaling by dividing by $2^{4+8} = 4096$: $13600 \div 4096 \approx 3.3203$. To fit the result into UQ4.4, scale 3.3203 by 16 (about 53.1248) and truncate to 53 (binary 00110101, value $53 \times 0.0625 = 3.3125$), providing a result close to $10 \div 3 \approx 3.3333$, with error due to reciprocal quantization and truncation; a higher reciprocal scaling (e.g., $2^{12}$) improves accuracy.
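
These worked examples can be reproduced directly with integer operations; the following short C program mirrors the UQ1.3 multiplication, the scale-100 decimal multiplication, and the UQ4.4 reciprocal division above:

```c
#include <stdio.h>

/* Reproduce the worked examples above with plain integer arithmetic. */
int main(void)
{
    /* UQ1.3 multiplication: 1.5 * 0.75 */
    unsigned a = 12, b = 6;             /* 1.5 and 0.75 scaled by 2^3 */
    unsigned prod = (a * b) >> 3;       /* 72 >> 3 = 9, i.e. 9/8 = 1.125 */
    printf("UQ1.3:   %u -> %.3f\n", prod, prod / 8.0);

    /* Decimal scale 100: 12.34 * 5.67 */
    long d = 1234L * 567L;              /* 699678, scaled by 100 * 100 */
    printf("decimal: %ld -> %.4f\n", d, d / 10000.0);

    /* UQ4.4 division via scaled reciprocal: 10 / 3 */
    unsigned recip = 256 / 3;           /* ~1/3 in UQ0.8 (85) */
    unsigned q = (160 * recip) >> 8;    /* 10 in UQ4.4 times reciprocal, shifted back to UQ4.4 */
    printf("UQ4.4:   %u -> %.4f\n", q, q / 16.0);
    return 0;
}
```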

Notations and Representations

Fixed-point arithmetic employs various notations to specify the format of numbers, particularly the position of the binary or decimal point and the allocation of bits for integer and fractional parts. The most widely used notation for fixed-point representations is the Q format, originally developed by Texas Instruments for describing signed fixed-point numbers in DSP applications. In Q notation, a format denoted as $Q_{m.n}$ indicates a signed fixed-point number with $m$ bits for the integer portion (excluding the sign bit), $n$ bits for the fractional portion, and an implicit binary point between them, resulting in a total word length of $m + n + 1$ bits including the sign bit. For instance, the Q15.16 format, common in 32-bit audio processing, allocates 15 bits to the integer part, 16 bits to the fraction, and 1 sign bit, allowing representation of values from $-2^{15}$ to $2^{15} - 2^{-16}$ with a resolution of $2^{-16}$.

Variations of Q notation distinguish between signed and unsigned formats. Signed formats use the standard $Q_{m.n}$ to denote two's complement representation, where the most significant bit serves as the sign. Unsigned formats are prefixed with U, as in $UQ_{m.n}$, which omits the sign bit and uses $m + n$ bits total, representing non-negative values from 0 to $2^{m} - 2^{-n}$. Some contexts employ SQ for explicitly signed Q formats and UQ for unsigned, particularly in documentation for fixed-point libraries and hardware descriptions, to clarify the interpretation without altering the core m.n structure.

In decimal fixed-point contexts, notations often resemble programming language type specifications, such as fixed(p,q), where p denotes the total number of digits and q the number of digits after the point. This style appears in languages like C++ through arithmetic libraries or in hardware description languages for simulating fixed-point operations, emphasizing precision in financial or legacy systems. Block floating-point serves as a related representation, where a block of numbers shares a common scaling exponent while maintaining fixed-point mantissas within the block, enabling efficient emulation of floating-point behavior on fixed-point hardware without individual exponents. Historically, early computers like the IBM System/360 used fixed-point notations based on packed decimal formats, storing two digits per byte with an implied decimal point position tracked by the program and a sign in the low-order four bits of the rightmost byte, supporting commercial applications with numbers of up to 31 digits.
