Quantization
Quantization is the process of mapping continuous or otherwise unbounded values from a large set to a relatively small set of discrete values, enabling the representation of analog phenomena in digital or quantized forms.[1]
Quantization in Physics
In physics, quantization refers to the phenomenon where measurable quantities, such as energy, angular momentum, or electric charge, can only assume discrete values rather than any value from a continuous range, a cornerstone of quantum mechanics.[2] This discreteness arises from the wave-particle duality of matter and the boundary conditions imposed on quantum systems, leading to standing wave-like behaviors that restrict possible states.[3] A seminal example is the Bohr model of the hydrogen atom, proposed in 1913, where electrons occupy fixed orbits with quantized energy levels given by E_n = -\frac{13.6}{n^2} eV, with n as a positive integer quantum number; transitions between these levels emit or absorb photons of specific frequencies, explaining atomic spectra.[4] Quantization extends to broader quantum field theories, where classical fields are promoted to operators obeying commutation relations, though full quantization remains challenging for systems like gravity.[5]
Quantization in Signal Processing
In signal processing and digital communications, quantization is a key step in analog-to-digital conversion, where continuous amplitude values of a sampled signal are approximated by a finite set of discrete levels, typically represented by binary codes.[6] This process introduces quantization error—the difference between the original and approximated value—which is modeled as additive noise and minimized by increasing the number of bits per sample; for instance, an 8-bit quantizer provides 256 levels, yielding a signal-to-quantization-noise ratio (SQNR) of approximately 6.02 \times 8 = 48.16 dB for uniform quantization.[7] Common techniques include uniform quantization for signals within a known range and non-uniform (e.g., logarithmic) quantization for those with wide dynamic ranges, such as in audio or telephony, ensuring efficient storage and transmission while balancing fidelity and bandwidth.[8]
Quantization in Machine Learning
In machine learning, particularly for neural networks, quantization reduces the precision of model parameters (weights and activations) from high-bit floating-point formats (e.g., 32-bit) to low-bit integers (e.g., 8-bit or 4-bit), compressing models and accelerating inference on resource-constrained hardware like mobile devices or edge AI systems.[9] Post-training quantization applies this approximation after model training, often using techniques like linear scaling or calibration to minimize accuracy loss, while quantization-aware training incorporates quantization during optimization to further preserve performance.[10] This approach can reduce model size by up to 4x and inference latency by 2-3x without retraining, making large language models deployable at scale, though aggressive low-bit formats risk numerical issues such as overflow and accuracy degradation.[11]
General Principles
Definition and Scope
Quantization is the process of converting continuous values into discrete values by mapping a large set of input values, often from a continuous domain like the real numbers, to a smaller, finite set of output values, such as integers or predefined levels. This approximation reduces the precision of the representation while enabling practical computation, storage, or analysis in digital systems. In essence, it transforms infinite or uncountably infinite possibilities into countable, manageable alternatives, preserving essential information at the cost of some fidelity. The term "quantization" originates from the Latin quantum, meaning "how much" or a discrete portion, and entered the scientific lexicon through physics. In 1900, Max Planck introduced the foundational concept by positing that electromagnetic energy is emitted or absorbed in discrete units, or quanta, to resolve discrepancies in black-body radiation spectra—a breakthrough that marked the birth of quantum theory.[12] Although Planck initially viewed this as a mathematical expedient rather than a physical reality, it established quantization as a core principle for discretizing continuous physical quantities.[13] Quantization's scope spans multiple fields, providing a unifying framework for discretizing phenomena. In signal processing, it approximates continuous waveforms by assigning values to discrete amplitude levels, facilitating digital encoding. In physics, it underpins the idea that certain observables, like atomic energy states, assume only specific discrete values rather than any point on a continuum. This interdisciplinary application assumes a basic grasp of continuous versus discrete mathematics, where the former allows uncountable values (e.g., all real numbers) and the latter restricts to countable sets (e.g., integers). Such mapping inherently produces quantization error, the deviation between original and discrete representations, though detailed analysis lies beyond this overview.[13]
Mathematical Foundations
Quantization fundamentally involves partitioning the continuous domain of input values, typically the real line \mathbb{R}, into a finite number of disjoint intervals, known as quantization cells or bins, and mapping each interval to a discrete representative value, often the centroid of the cell for minimizing mean squared error.[14] This partitioning is achieved through an encoder function \alpha: \mathbb{R} \to \{1, 2, \dots, N\} that assigns each input x to an index i based on membership in cell S_i = \{ x : \alpha(x) = i \}, followed by a decoder \beta: \{1, 2, \dots, N\} \to \mathbb{R} that outputs the reconstruction level q_i = \beta(i).[14] The nearest-neighbor mapping, where inputs are assigned to the closest reconstruction level under a distance metric like Euclidean distance, is a common strategy for defining these cells.[14] Simple quantizers can be realized using standard rounding functions, which partition \mathbb{R} into unit intervals and select integer representatives. The floor function \lfloor x \rfloor maps x to the greatest integer less than or equal to x, the ceiling function \lceil x \rceil to the smallest integer greater than or equal to x, and round-to-nearest (typically \round(x) = \lfloor x + 0.5 \rfloor) to the closest integer, with ties resolved by rounding away from zero or to even.[14] These functions serve as uniform scalar quantizers with step size 1 and reconstruction levels at integers, illustrating the basic principle of approximating continuous values with discrete ones.[14] In general, a scalar quantizer Q: \mathbb{R} \to \{q_1, q_2, \dots, q_N\} operates by selecting the reconstruction level closest to the input, formalized as Q(x) = \arg\min_{i} |x - q_i|, where q_i are the fixed reconstruction levels, and ties are resolved arbitrarily.[14] Equivalently, Q(x) = \beta(\alpha(x)), with \alpha(x) = \arg\min_i |x - q_i|.[14] This nearest-neighbor rule defines the decision boundaries as midpoints between adjacent levels, ensuring non-overlapping cells that cover \mathbb{R}.[15] The performance of a quantizer is quantified by the distortion, commonly the mean squared error (MSE) for a random input X with probability distribution P_X, given by D = \mathbb{E}[(X - Q(X))^2] = \int_{-\infty}^{\infty} (x - Q(x))^2 \, dP_X(x).[14] This expectation measures the average squared deviation between the input and its quantized version, derived directly from the definition of expected value under the squared-error distortion metric d(x, y) = (x - y)^2.[14] For a fixed number of levels N, the goal is to choose the partition \{S_i\} and levels \{q_i\} to minimize D.[16] Optimal quantizer design under MSE is addressed by the Lloyd-Max algorithm, an iterative procedure that alternates between updating the encoder boundaries to midpoints between adjacent reconstruction levels and setting each level q_i to the conditional expectation (centroid) \mathbb{E}[X \mid X \in S_i] of the input given the current cell.[16][15] Starting from an initial guess of levels, the algorithm converges to a local minimum of the distortion, satisfying necessary conditions for optimality: boundaries at (q_i + q_{i+1})/2 and levels as centroids.[16] This method, independently developed by Lloyd and Max, provides a practical way to compute non-uniform quantizers tailored to the input distribution.[16][15]
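The alternating encoder/decoder updates above translate directly into a short iterative routine. The following is a minimal sketch in Python (NumPy) for an empirical one-dimensional sample under squared-error distortion; the function name lloyd_max and its parameters are illustrative, not drawn from any particular library.
```python
# A minimal sketch of Lloyd-Max quantizer design for an empirical sample,
# assuming 1-D inputs and squared-error distortion; names are illustrative.
import numpy as np

def lloyd_max(samples, n_levels, iters=100, tol=1e-9):
    """Iteratively refine reconstruction levels q_i and midpoint boundaries."""
    # Initialize levels at evenly spaced quantiles of the data.
    levels = np.quantile(samples, (np.arange(n_levels) + 0.5) / n_levels)
    for _ in range(iters):
        # Encoder update: decision boundaries at midpoints of adjacent levels.
        bounds = (levels[:-1] + levels[1:]) / 2
        # Assign each sample to its nearest level's cell.
        idx = np.searchsorted(bounds, samples)
        new_levels = levels.copy()
        for i in range(n_levels):
            cell = samples[idx == i]
            if cell.size:                 # decoder update: centroid of the cell
                new_levels[i] = cell.mean()
        if np.max(np.abs(new_levels - levels)) < tol:
            levels = new_levels
            break
        levels = new_levels
    return levels

# Example: design a 4-level quantizer for Gaussian data and report the MSE.
rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
q = lloyd_max(x, 4)
xq = q[np.argmin(np.abs(x[:, None] - q[None, :]), axis=1)]
print("levels:", np.round(q, 3), " MSE:", np.round(np.mean((x - xq) ** 2), 4))
```
As with the analytical formulation, the iteration only guarantees a local minimum of the distortion, so the result can depend on the initialization.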
Signal Processing and Data Compression
Scalar Quantization
Scalar quantization is a fundamental technique in signal processing where each sample of a continuous-amplitude signal is independently mapped to one of a finite set of discrete amplitude levels, typically as part of analog-to-digital conversion. This process reduces the precision of the signal to represent it using fewer bits, enabling efficient storage and transmission while introducing a controlled amount of quantization error. Unlike more complex methods, scalar quantization operates on individual samples without considering correlations between them, making it computationally simple and widely used in applications like audio and image digitization.[17] In uniform scalar quantization, the input range [x_{\min}, x_{\max}] is divided into L equally spaced levels, with a fixed step size \Delta = \frac{x_{\max} - x_{\min}}{L}. The quantized value q(x) for an input sample x is then given by q(x) = \Delta \cdot \round\left(\frac{x}{\Delta}\right), where \round denotes the nearest integer rounding operation, often with midpoint rounding for symmetry. This approach assumes a uniform probability distribution across the input range and results in equal quantization intervals, which is optimal for signals with flat amplitude histograms but can lead to granular noise in regions of low amplitude.[17] Non-uniform scalar quantization addresses the limitations of uniform methods by using variable step sizes, allocating finer resolution to more probable signal amplitudes and coarser steps to less frequent ones, thereby minimizing overall distortion for non-uniform distributions common in natural signals like speech. A prominent example is \mu-law companding, widely adopted in North American telephony, which compresses the dynamic range before uniform quantization and expands it afterward. The compression function is defined as F(x) = \sgn(x) \frac{\ln(1 + \mu |x|)}{\ln(1 + \mu)} for |x| \leq 1 and \mu \geq 1, where \sgn(x) is the sign function and \mu (typically 255) controls the degree of compression, providing logarithmic-like spacing that emphasizes low-level signals.[18] From a rate-distortion perspective, scalar quantization achieves a bit rate R = \log_2 L bits per sample, where L is the number of quantization levels, trading off against mean squared error distortion D, which decreases as R increases according to the rate-distortion function D(R). In the high-rate limit, this relationship approximates Shannon's bound for scalar sources, highlighting the fundamental efficiency limit where additional bits yield diminishing returns in distortion reduction. For instance, uniform scalar quantization at high rates can approach the optimal D(R) \approx \frac{\sigma^2}{2^{2R}} for a Gaussian source with variance \sigma^2.[19] A practical example is 8-bit pulse code modulation (PCM) for audio, which uses L = 256 uniform levels to quantize speech signals in the range [-V, V], providing a theoretical dynamic range of approximately 48 dB (6 dB per bit). This setup supports sampling rates like 8 kHz for telephony, but overload distortion occurs when input amplitudes exceed V, causing clipping and harmonic artifacts that degrade perceived quality, particularly in loud passages. 
To mitigate overload, input signals are often scaled to fit within the range, though this risks granular noise in quiet sections.[20] The concept of scalar quantization originated in the 1940s with the development of PCM for telephony by British engineer Alec Reeves, who proposed digitizing voice signals to combat noise in long-distance transmission, laying the groundwork for modern digital communications.[21]
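As an illustration of the non-uniform quantization discussed above, the following sketch wraps a uniform mid-rise quantizer in \mu-law compression and expansion, assuming inputs already normalized to [-1, 1] and \mu = 255; the helper names are hypothetical, not from a codec library.
```python
# A minimal sketch of mu-law companding around uniform quantization,
# assuming inputs in [-1, 1]; mu = 255 as in North American telephony.
import numpy as np

MU = 255.0

def mu_law_compress(x):
    """F(x) = sgn(x) * ln(1 + mu|x|) / ln(1 + mu)."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mu_law_expand(y):
    """Inverse of the compression characteristic."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

def quantize_uniform(x, n_bits):
    """Uniform mid-rise quantizer on [-1, 1] with 2**n_bits levels."""
    levels = 2 ** n_bits
    step = 2.0 / levels
    return np.clip(step * (np.floor(x / step) + 0.5),
                   -1 + step / 2, 1 - step / 2)

# Compand, quantize at 8 bits, then expand back to the linear domain.
x = np.linspace(-1, 1, 9)
y = mu_law_expand(quantize_uniform(mu_law_compress(x), 8))
print(np.round(y, 4))
```
Because the compression is logarithmic-like, low-amplitude inputs receive effectively finer steps after expansion, which is the point of companding for speech.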
Vector Quantization
Vector quantization (VQ) is a technique in signal processing that jointly quantizes multiple samples of a signal into a multi-dimensional vector, mapping it to the nearest representative vector, or codeword, from a predefined finite set known as a codebook. A vector quantizer takes an input vector \mathbf{x} \in \mathbb{R}^k and assigns it to the codeword \mathbf{c}_i that minimizes the distortion, typically measured by the Euclidean distance \|\mathbf{x} - \mathbf{c}_i\|, where the codebook C = \{\mathbf{c}_1, \dots, \mathbf{c}_N\} contains N codewords.[22][23] The codebook design is crucial for effective VQ and is often achieved through iterative algorithms that minimize average distortion. The Linde-Buzo-Gray (LBG) algorithm, a seminal method, initializes a codebook, then iteratively partitions the input data into clusters and updates each codeword to the centroid of its cluster, akin to K-means clustering, to reduce mean squared error.[22][23] This process defines Voronoi regions for each codeword, where V_i = \{ \mathbf{x} : \|\mathbf{x} - \mathbf{c}_i\| < \|\mathbf{x} - \mathbf{c}_j\| \ \forall j \neq i \}, representing the set of input vectors closest to \mathbf{c}_i under Euclidean distance; all vectors in V_i are quantized to \mathbf{c}_i.[22][23] Introduced by Y. Linde, A. Buzo, and R. M. Gray in 1980, the LBG algorithm provides a practical framework for generating locally optimal codebooks from training data.[22] VQ finds prominent applications in data compression, such as image coding, where blocks of pixels are mapped to codewords for efficient storage and transmission, exploiting correlations between neighboring pixels that scalar methods ignore.[24][23] In speech coding, VQ compresses spectral parameters or linear prediction coefficients, achieving low bit rates while preserving perceptual quality, as demonstrated in systems based on vector quantization of LPC parameters.[25][23] Compared to scalar quantization, which treats each sample independently and serves as a special case for one-dimensional vectors, VQ offers better rate-distortion performance by exploiting statistical dependencies and correlations within the vector components, allowing lower bit rates for equivalent distortion levels in multidimensional signals.[23][26]
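A compact way to see the LBG iteration is as nearest-neighbor partitioning followed by centroid updates, exactly as in K-means. The sketch below, with illustrative names and a toy two-dimensional training set, fits a 16-codeword codebook; it is a didactic rendering of the procedure, not the original 1980 implementation.
```python
# A minimal sketch of LBG codebook design (K-means-style) for vector
# quantization under Euclidean distortion; names are illustrative.
import numpy as np

def lbg(train, n_codewords, iters=50, seed=0):
    """Return a codebook of shape (n_codewords, dim) fit to training vectors."""
    rng = np.random.default_rng(seed)
    # Initialize codewords from randomly chosen training vectors.
    codebook = train[rng.choice(len(train), n_codewords, replace=False)].copy()
    for _ in range(iters):
        # Nearest-neighbor partition: assign each vector to its Voronoi cell.
        d2 = ((train[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        # Centroid update: move each codeword to the mean of its cell.
        for i in range(n_codewords):
            cell = train[nearest == i]
            if cell.size:
                codebook[i] = cell.mean(axis=0)
    return codebook

# Example: quantize 2-D vectors (e.g., pairs of adjacent samples).
rng = np.random.default_rng(1)
train = rng.normal(size=(5000, 2))
cb = lbg(train, 16)
print("codebook shape:", cb.shape)  # rate = log2(16)/2 = 2 bits per sample
```
The final comment shows the rate accounting: 16 codewords over 2-D vectors costs log2(16) = 4 bits per vector, i.e., 2 bits per sample.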
Quantization Error Analysis
Quantization error in signal processing arises from the mapping of continuous amplitude values to discrete levels, resulting in distortion that can be analyzed as noise added to the original signal. For uniform scalar quantizers, this error is commonly modeled as additive white noise uniformly distributed over one quantization interval [- \Delta/2, \Delta/2], where \Delta is the step size, assuming the input signal spans many quantization levels and the error is uncorrelated with the signal. The variance of this quantization noise is given by \sigma_q^2 = \Delta^2 / 12, derived from the second moment of the uniform distribution.[27] This approximation holds under high-resolution conditions where the signal probability density function (PDF) varies slowly compared to \Delta.[27] A foundational analysis of quantization noise spectra was provided by Bennett in 1948, who established conditions under which the noise can be treated as stationary and white, particularly for signals with Gaussian statistics and rounding quantizers. Bennett's work derived the power spectral density of the noise, showing it to be approximately flat within the signal bandwidth when the quantizer overload is negligible and the input amplitude distribution satisfies certain smoothness criteria. This laid the groundwork for modern noise modeling in digital signal processing.[28] The signal-to-quantization-noise ratio (SQNR) serves as a key performance metric, defined as \mathrm{SQNR} = 10 \log_{10} (P_s / \sigma_q^2), where P_s is the signal power. For an n-bit uniform quantizer processing a full-scale sinusoidal input, this simplifies to approximately \mathrm{SQNR} \approx 6.02n + 1.76 dB, reflecting the 6 dB per bit gain from doubling the number of levels and the additional factor from the sine wave's power relative to the noise. This formula assumes no overload and uniform noise distribution, providing a benchmark for quantizer effectiveness in applications like audio and data compression.[27] Quantization distortion can be decomposed into granular distortion, which occurs when the input lies within the quantizer's dynamic range and results in small, bounded errors, and overload distortion, which arises when the input exceeds this range, leading to clipping and large errors. Granular distortion's mean squared error is computed by integrating the squared difference between input and output over the input PDF within each bin, yielding \sigma_q^2 = \Delta^2 / 12 for uniform PDFs but varying with the input distribution (e.g., higher for peaked PDFs like Gaussian). Overload distortion is the expected value of the clipping error weighted by the tail probabilities of the input PDF beyond the quantizer range, often requiring loading factors (e.g., 4\sigma for Gaussian inputs) to balance the two components and minimize total distortion.[29] To mitigate nonlinearities and signal-dependent errors, dithering techniques introduce controlled noise prior to quantization. Non-subtractive dither adds uncorrelated noise that randomizes the error, making it independent of the input and approximating uniform distribution for first-order accuracy. Subtractive dither, applied before and subtracted after quantization, enables precise PDF shaping of the error; for instance, triangular PDF dither with variance twice that of the quantization noise ensures the total error PDF is uniform, fully linearizing the quantizer response at the cost of increased overall noise power.
These methods, analyzed under statistical independence assumptions, are essential for high-fidelity applications where distortion artifacts must be minimized.
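The \Delta^2/12 noise model and the 6.02n + 1.76 dB rule can be checked numerically. The following sketch quantizes a full-scale sine with an n-bit uniform mid-tread quantizer; small deviations from theory are expected because a handful of peak samples are clipped, and the names are illustrative.
```python
# A minimal numerical check of the SQNR rule for a full-scale sine and the
# Delta^2/12 noise-variance model; a didactic sketch, not a library API.
import numpy as np

def sqnr_db(n_bits, n_samples=1_000_000):
    """Quantize a full-scale sine with an n-bit uniform quantizer."""
    t = np.linspace(0, 1, n_samples, endpoint=False)
    x = np.sin(2 * np.pi * 50 * t)           # full-scale input in [-1, 1]
    delta = 2.0 / (2 ** n_bits)              # step size over the [-1, 1] range
    xq = np.clip(delta * np.round(x / delta), -1, 1 - delta)
    err = x - xq
    return 10 * np.log10(np.mean(x ** 2) / np.mean(err ** 2)), np.var(err), delta

for n in (8, 12, 16):
    measured, var_e, delta = sqnr_db(n)
    print(f"{n:2d} bits: measured {measured:6.2f} dB, "
          f"theory {6.02 * n + 1.76:6.2f} dB, "
          f"noise var / (Delta^2/12) = {var_e / (delta ** 2 / 12):.3f}")
```
The last column should sit near 1.0, confirming that the uniform-noise model is a good description when the signal exercises many levels.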
Physics
Energy Quantization in Quantum Mechanics
The development of quantum mechanics was precipitated by inconsistencies in classical physics, particularly in explaining the spectrum of blackbody radiation. In classical theory, the Rayleigh-Jeans law predicted that the energy density of radiation diverges to infinity at short wavelengths, known as the ultraviolet catastrophe, which contradicted experimental observations of finite energy emission from hot bodies.[30] This failure highlighted the limitations of classical wave descriptions for electromagnetic radiation and atomic processes. Max Planck resolved the blackbody radiation problem in 1900 by proposing that the energy of oscillators in the blackbody is quantized, introducing the hypothesis that energy is emitted or absorbed in discrete packets, or quanta, given by E = n h f, where n is a positive integer, h is Planck's constant (6.626 \times 10^{-34} J s), and f is the frequency.[31] This quantization led to Planck's law for the spectral energy density, u(f, T) = \frac{8\pi h f^3}{c^3} \frac{1}{e^{h f / k T} - 1}, where k is Boltzmann's constant, T is temperature, and c is the speed of light, which accurately matched experimental data across all wavelengths.[31] The quantization of energy manifested in the discrete atomic spectra observed in the late 19th century, where atoms emit or absorb light only at specific wavelengths corresponding to transitions between quantized energy levels, with photon energy \Delta E = h f.[32] For hydrogen, Johann Balmer in 1885 empirically described the visible spectral lines using the formula \frac{1}{\lambda} = R \left( \frac{1}{2^2} - \frac{1}{n^2} \right) for integers n > 2, where R is the Rydberg constant, later generalized by Johannes Rydberg in 1888 to \frac{1}{\lambda} = R \left( \frac{1}{n_1^2} - \frac{1}{n_2^2} \right) for various series.[32] These line spectra, such as the Balmer series in emission from excited hydrogen atoms, provided evidence for discrete energy levels rather than continuous ones predicted by classical mechanics. Niels Bohr incorporated energy quantization into his 1913 model of the hydrogen atom, postulating stationary orbits where electrons maintain constant energy E_n = -\frac{13.6 \, \text{eV}}{n^2} for principal quantum number n, and angular momentum is quantized as L = n \hbar, with \hbar = h / 2\pi.[33] Transitions between these levels produce the observed spectral lines, with the frequency given by f = \frac{|E_{n_2} - E_{n_1}|}{h}, successfully explaining the hydrogen spectrum and laying the foundation for quantum theory.[33] The particle-like nature of quantized energy was confirmed by Arthur Compton in 1923 through scattering experiments on X-rays by electrons, where the wavelength shift \Delta \lambda = \frac{h}{m_e c} (1 - \cos \theta) indicated momentum transfer p = h / \lambda from photon to electron, with m_e the electron mass and \theta the scattering angle.[34] This Compton effect demonstrated the wave-particle duality of light, supporting Planck's quanta as particles with both energy and momentum.[34]
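The Bohr energy levels and the Balmer series can be combined into a short worked computation. The sketch below evaluates E_n = -13.6/n^2 eV and f = |\Delta E| / h to recover the visible hydrogen lines; the constants are rounded values and the function name is illustrative.
```python
# A minimal sketch computing Bohr-model transition wavelengths from
# E_n = -13.6 eV / n^2; constants are rounded CODATA-style values.
H_EV = 4.135667e-15      # Planck's constant in eV*s
C = 2.99792458e8         # speed of light in m/s

def transition_wavelength_nm(n_upper, n_lower):
    """Photon wavelength for an n_upper -> n_lower transition in hydrogen."""
    e_upper = -13.6 / n_upper ** 2       # energy levels in eV
    e_lower = -13.6 / n_lower ** 2
    delta_e = e_upper - e_lower          # emitted photon energy in eV
    f = delta_e / H_EV                   # f = |dE| / h
    return C / f * 1e9                   # wavelength in nm

# Balmer series (n -> 2): 3 -> 2 is the red H-alpha line near 656 nm.
for n in (3, 4, 5):
    print(f"n={n} -> 2: {transition_wavelength_nm(n, 2):.1f} nm")
```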
Canonical Quantization
Canonical quantization is a foundational method in quantum mechanics for transforming classical mechanical systems into their quantum counterparts by promoting dynamical variables to operators while preserving the structure of the classical Poisson brackets through corresponding commutators. Developed primarily by Paul Dirac in 1925, this procedure provides a systematic way to quantize systems described by generalized coordinates and momenta, ensuring that the quantum theory reproduces classical results in the appropriate limit.[35] In Dirac's approach, classical position q and momentum p are replaced by operators \hat{q} and \hat{p}, where \hat{q} acts as a multiplication operator on wave functions in position representation, and \hat{p} = -i \hbar \frac{\partial}{\partial q}. This correspondence stems from the need to satisfy the canonical commutation relations derived from the classical Poisson bracket \{q, p\} = 1, which becomes the quantum commutator [\hat{q}, \hat{p}] = i \hbar. For a general classical Hamiltonian H(q, p), the quantum Hamiltonian \hat{H}(\hat{q}, \hat{p}) is formed by substituting the operators, often requiring careful ordering to ensure hermiticity and consistency, with the eigenvalues of \hat{H} yielding the discrete energy spectrum of the quantum system.[35] A paradigmatic example is the quantum harmonic oscillator, where the classical Hamiltonian H = \frac{p^2}{2m} + \frac{1}{2} m \omega^2 q^2 is quantized to \hat{H} = \frac{\hat{p}^2}{2m} + \frac{1}{2} m \omega^2 \hat{q}^2. Introducing ladder operators \hat{a}^\dagger = \sqrt{\frac{m \omega}{2 \hbar}} \left( \hat{q} - \frac{i}{m \omega} \hat{p} \right) and \hat{a} = \sqrt{\frac{m \omega}{2 \hbar}} \left( \hat{q} + \frac{i}{m \omega} \hat{p} \right), which satisfy [\hat{a}, \hat{a}^\dagger] = 1, the Hamiltonian simplifies to \hat{H} = \hbar \omega \left( \hat{a}^\dagger \hat{a} + \frac{1}{2} \right), with energy eigenvalues E_n = \hbar \omega \left( n + \frac{1}{2} \right) for n = 0, 1, 2, \dots, illustrating the emergence of quantized states and zero-point energy.[36] Hermann Weyl's 1927 work developed a group-theoretic approach to quantization, providing a framework that connects canonical methods to unitary representations and wave mechanics.[37]
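The ladder-operator algebra can be verified numerically in a truncated Fock basis. The sketch below builds \hat{a} and \hat{a}^\dagger as finite matrices in units \hbar = \omega = m = 1 and checks the commutator and the spectrum E_n = n + 1/2; the truncation makes the last diagonal entry of the commutator inexact, a standard artifact of the finite basis.
```python
# A minimal numerical check of [a, a^dagger] = 1 and E_n = n + 1/2 in a
# truncated Fock basis (units hbar = omega = m = 1); didactic, not a library.
import numpy as np

N = 12                                   # truncated Fock-space dimension
n = np.arange(1, N)
a = np.diag(np.sqrt(n), k=1)             # annihilation: a|n> = sqrt(n)|n-1>
adag = a.T.conj()                        # creation operator

comm = a @ adag - adag @ a               # identity, except the last entry
H = adag @ a + 0.5 * np.eye(N)           # H = a^dagger a + 1/2 in these units

print("commutator diagonal:", np.diag(comm)[:5])          # -> 1, 1, 1, 1, 1
print("lowest eigenvalues: ", np.linalg.eigvalsh(H)[:5])  # -> 0.5, 1.5, 2.5, ...
```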
Field Quantization
Field quantization, also known as second quantization, provides a framework for describing quantum many-body systems and relativistic fields by representing particles as excitations of underlying fields in an infinite-dimensional Hilbert space called Fock space.[38] In this formalism, the state of the system is specified by occupation numbers for each possible single-particle mode, allowing for a variable number of particles, including the vacuum state with zero particles.[38] Creation and annihilation operators, denoted \hat{a}^\dagger_k and \hat{a}_k for mode k, act on Fock states to add or remove particles in that mode; for bosons, they satisfy the commutation relation [\hat{a}_k, \hat{a}^\dagger_{k'}] = \delta_{kk'}.[38] A canonical example is the quantization of the Klein-Gordon field, which describes spin-0 particles like pions. The scalar field operator is expanded as \phi(x) = \int dk \, (\hat{a}_k u_k + \hat{a}^\dagger_k u^*_k), where u_k are plane-wave solutions to the classical field equation normalized to ensure canonical commutation relations.[38] The Hamiltonian then takes the form \hat{H} = \int dk \, \omega_k \hat{a}^\dagger_k \hat{a}_k + E_0, revealing the particle interpretation: \hat{a}^\dagger_k creates a particle with energy \omega_k = \sqrt{\mathbf{k}^2 + m^2}, and the number operator \hat{n}_k = \hat{a}^\dagger_k \hat{a}_k counts particles in mode k.[38] For the electromagnetic field, quantization proceeds similarly by expanding the four-vector potential A^\mu in transverse modes, excluding longitudinal components to preserve gauge invariance.[38] This yields photon creation operators \hat{a}^\dagger_{\mathbf{k},\lambda} for wavevector \mathbf{k} and polarization \lambda = 1,2, with the electric and magnetic fields expressed in terms of these operators, leading to the Hamiltonian \hat{H} = \sum_{\mathbf{k},\lambda} \omega_k \hat{a}^\dagger_{\mathbf{k},\lambda} \hat{a}_{\mathbf{k},\lambda} (up to vacuum energy), where photons are massless bosons with \omega_k = |\mathbf{k}|.[38] The transverse nature ensures two degrees of freedom per momentum state, corresponding to the physical polarizations.[38] For fermionic fields, the bosonic commutation relations are replaced by anticommutation \{\hat{a}_k, \hat{a}^\dagger_{k'}\} = \delta_{kk'}, enforcing the Pauli exclusion principle with occupation numbers 0 or 1.[38] In one dimension, the Jordan-Wigner transformation maps spin-1/2 operators to fermionic creation and annihilation operators via a string of phase factors, enabling exact solutions for models like the XY chain. An illustrative application is the Bose-Einstein condensate (BEC), where second quantization describes a macroscopic occupation of the ground-state mode in a dilute bosonic gas below the critical temperature.[39] The many-body wavefunction is represented in Fock space, with the condensate fraction given by the expectation value of the ground-state number operator \langle \hat{n}_0 \rangle / N \approx 1 for large particle number N, capturing phenomena like superfluidity through off-diagonal long-range order in the one-body density matrix.[39]
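The fermionic anticommutation relations are simple enough to check with 2x2 matrices for a single mode. The sketch below confirms \{a, a^\dagger\} = 1 and (a^\dagger)^2 = 0, the algebraic statement of the Pauli exclusion principle; it assumes the basis ordering |0>, |1> and is purely illustrative.
```python
# A minimal sketch of one fermionic mode as 2x2 matrices, checking the
# anticommutator {a, a^dagger} = 1 and Pauli exclusion (a^dagger)^2 = 0.
import numpy as np

a = np.array([[0.0, 1.0],                 # annihilation: a|1> = |0>, a|0> = 0
              [0.0, 0.0]])
adag = a.T                                # creation operator

anticomm = a @ adag + adag @ a
print("{a, a+} =\n", anticomm)            # -> identity matrix
print("(a+)^2 =\n", adag @ adag)          # -> zero matrix: occupation is 0 or 1
print("number operator diag:", np.diag(adag @ a))  # -> [0, 1]
```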
Machine Learning and Artificial Intelligence
Neural Network Quantization
Neural network quantization in machine learning involves reducing the precision of model parameters, such as weights and activations, from high-precision formats like 32-bit floating-point (FP32) to lower-bit representations, such as 8-bit integers (INT8), to achieve computational efficiency without excessive loss in performance.[40] This technique is particularly motivated by the need to deploy large-scale models, including large language models (LLMs) with billions of parameters, on resource-constrained edge devices where memory and power are limited.[40] For instance, quantizing from FP32 to INT8 can reduce memory usage by up to 4x, enabling faster inference and lower latency in real-world applications like mobile AI and embedded systems. A common approach is fixed-point quantization, which maps floating-point values x to quantized integers q using a scale factor s and zero-point z, typically as q = \round\left( \frac{x}{s} + z \right), followed by clipping to the integer range (e.g., 0 to 255 for unsigned INT8).[40] The scale s normalizes the range of x, while the zero-point z shifts the representation to handle asymmetric distributions around zero, allowing for more accurate approximation of real-valued tensors.[40] Quantization can employ dynamic ranges, computed on-the-fly during inference for activations based on their varying statistics, or static ranges, pre-determined during calibration for both weights and activations to simplify deployment. Furthermore, scaling can be applied per-tensor, using a single s and z for the entire tensor, or per-channel, where each output channel of a weight matrix has its own parameters to better preserve accuracy in convolutional or transformer layers.[40] The primary impact of neural network quantization is a trade-off between model accuracy and inference speed, with lower-bit representations accelerating matrix multiplications on hardware like GPUs and TPUs while risking quantization error that degrades performance.[40] For example, the GPTQ method, introduced in 2022, enables accurate post-training quantization of LLMs to 3-4 bits per weight, achieving near-lossless compression for large models like BLOOM-176B with perplexity increases of about 0.1 on WikiText2 for 4-bit quantization, all while reducing memory by over 4x in under four GPU hours.[41] By 2025, quantization has become widespread in production deployments, as evidenced by Meta's release of quantized Llama 3.2 models (1B and 3B parameters) using 4-bit groupwise quantization for weights and 8-bit for activations, reducing model size by 56% and memory usage by 41% on average compared to BF16 baselines, with 2-4x inference speedups on mobile devices as of October 2024.[42]
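The scale-and-zero-point mapping described above can be written out directly. The following is a minimal sketch of asymmetric per-tensor quantization to unsigned 8-bit integers, with illustrative helper names rather than any framework's API; a per-channel variant would compute s and z along one axis instead of over the whole tensor.
```python
# A minimal sketch of asymmetric per-tensor affine quantization to UINT8,
# q = round(x/s + z); a generic illustration, not a specific framework's API.
import numpy as np

def affine_qparams(x, qmin=0, qmax=255):
    """Derive scale s and zero-point z from the tensor's min/max range."""
    x_min, x_max = min(x.min(), 0.0), max(x.max(), 0.0)  # range must include 0
    s = (x_max - x_min) / (qmax - qmin)
    z = int(round(qmin - x_min / s))
    return s, z

def quantize(x, s, z, qmin=0, qmax=255):
    return np.clip(np.round(x / s + z), qmin, qmax).astype(np.uint8)

def dequantize(q, s, z):
    return s * (q.astype(np.float32) - z)

# Round-trip a small weight tensor and inspect the reconstruction error.
w = np.random.default_rng(0).normal(0, 0.1, size=(4, 4)).astype(np.float32)
s, z = affine_qparams(w)
w_hat = dequantize(quantize(w, s, z), s, z)
print("scale:", s, "zero-point:", z)
print("max abs error:", np.abs(w - w_hat).max(), " vs s/2 =", s / 2)
```
Forcing the range to include zero makes the real value 0.0 exactly representable at integer z, which matters for padding and ReLU outputs.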
Post-Training Quantization Techniques
Post-training quantization (PTQ) refers to the process of converting a pre-trained neural network model from high-precision floating-point representations, typically 32-bit (FP32), to lower-precision formats such as 8-bit integers (INT8) without requiring any retraining or fine-tuning of the model parameters. This technique is particularly valuable for deploying models on resource-constrained devices like mobile phones and edge hardware, where it can reduce model size by up to 4x and inference latency by 2-3x while maintaining accuracy close to the original model, often within 1-2% degradation on benchmarks like ImageNet. PTQ operates by analyzing the model's weights and activations post-training to determine optimal quantization parameters, enabling efficient integer-only arithmetic during inference.[43][44] A core step in PTQ is calibration, which involves running a small representative dataset—typically 100-1000 samples from the original training distribution—through the model to collect statistics on activations and weights. These statistics, such as minimum and maximum values or percentiles (e.g., 0.1% and 99.9% to mitigate outliers), are used to compute scaling factors that map the dynamic range of floating-point values to the fixed range of quantized integers, ensuring minimal quantization error. For instance, in full integer quantization, calibration determines per-tensor or per-channel ranges, allowing the converter to quantize both weights (statically) and activations (dynamically during inference). This data-driven approach avoids the need for full retraining, making PTQ fast and practical for production deployment.[45][44] INT8 quantization, a widely adopted PTQ method, represents values using 8-bit signed integers ranging from -128 to 127, with an asymmetric mapping to handle non-symmetric distributions common in neural network activations. The quantization function is given by q = \clamp\left( \round\left( \frac{x}{s} + z \right), -128, 127 \right),
where q is the quantized integer, x is the original floating-point value, s is the scale factor (derived from the range), and z is the zero-point (the integer to which the real value zero maps, accommodating asymmetric ranges). During inference, the dequantized value is recovered as \hat{x} = s (q - z), enabling hardware-optimized integer operations while approximating the original computation. To further minimize error, Hessian-aware approximations can adjust quantization parameters by estimating the sensitivity of the loss to weight perturbations using the Hessian matrix's trace or eigenvalues, prioritizing layers with higher curvature for finer scaling. This second-order analysis helps achieve near-baseline accuracy, such as 75.5% top-1 on ImageNet for quantized ResNet-50 models.[43][46] Mixed-precision PTQ extends uniform quantization by assigning different bit widths to layers or channels based on their impact on overall accuracy, such as using FP16 for outlier-sensitive activations and INT4 for robust weights, reducing average precision while preserving performance. Hessian-aware methods like HAWQ compute a sensitivity metric S_i = \frac{\lambda_i}{n_i}, where \lambda_i is the dominant Hessian eigenvalue for layer i and n_i is the parameter count, to automatically select bit allocations that minimize expected loss increase. This approach has demonstrated up to 8x compression on models like ResNet-20 with less than 1% accuracy drop on CIFAR-10.[46][44] PTQ techniques were pioneered by Google researchers around 2017, with early applications to MobileNets demonstrating efficient integer inference on mobile CPUs, achieving latencies as low as 33ms on Snapdragon processors with minimal accuracy loss. By 2025, frameworks like ONNX Runtime have integrated comprehensive PTQ support, including static and dynamic quantization APIs for converting ONNX models to INT8, facilitating broad adoption in production pipelines.[43][47]
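Putting the calibration and INT8 mapping together, the sketch below derives a scale and zero-point from percentile-clipped activation statistics and quantizes to the signed range [-128, 127]; the percentile choices and helper names are illustrative assumptions, not a specific toolkit's defaults.
```python
# A minimal sketch of percentile-based calibration for static INT8 PTQ,
# assuming activations collected from a small calibration set; didactic only.
import numpy as np

def calibrate_int8_params(activations, lo_pct=0.1, hi_pct=99.9):
    """Compute scale and zero-point from calibration statistics,
    with percentile clipping to suppress outliers."""
    x_lo = np.percentile(activations, lo_pct)
    x_hi = np.percentile(activations, hi_pct)
    x_lo, x_hi = min(x_lo, 0.0), max(x_hi, 0.0)   # keep 0 representable
    s = (x_hi - x_lo) / 255.0                     # 256 signed levels, -128..127
    z = int(round(-128 - x_lo / s))
    return s, z

def quantize_int8(x, s, z):
    return np.clip(np.round(x / s + z), -128, 127).astype(np.int8)

# Calibration pass: gather activations from ~100-1000 representative inputs.
rng = np.random.default_rng(0)
calib_acts = np.concatenate([rng.exponential(0.5, 10_000), [50.0]])  # one outlier
s, z = calibrate_int8_params(calib_acts)
q = quantize_int8(calib_acts, s, z)
print(f"scale={s:.5f}, zero_point={z}, outlier clipped to {q.max()}")
```
Clipping the 99.9th percentile sacrifices the lone outlier but keeps the step size small for the bulk of the distribution, which is the usual trade made during PTQ calibration.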