Data compression
Data compression is the process of encoding data using fewer bits than the original representation, reducing its size for more efficient storage and transmission while preserving the essential information.[1] It works by eliminating redundancy in the data, transforming it into a more compact form that can be decoded back to an exact or approximate copy of the original, depending on the method used.[2] By increasing effective data density, compression plays a critical role in applications ranging from file archiving and telecommunications to multimedia processing.[3]

There are two primary categories of data compression: lossless and lossy. Lossless compression allows the original data to be perfectly reconstructed without any loss of information, making it suitable for text files, executables, and other scenarios where fidelity is essential; typical lossless compression ratios range from 2:1 to 4:1.[4] In contrast, lossy compression discards less perceptible detail to achieve much higher compression ratios and is commonly applied to images, audio, and video, as in formats such as JPEG and MP3.[5][6]

The foundations of modern data compression trace back to the information theory developed by Claude Shannon in the mid-20th century, which established fundamental limits such as entropy, the theoretical minimum for lossless encoding. Practical advances accelerated in the 1960s with applications in space missions, where both lossless and lossy methods were used to manage telemetry data.[7] Key milestones include David Huffman's 1952 algorithm for optimal prefix codes and the 1977-1978 LZ77 and LZ78 algorithms of Jacob Ziv and Abraham Lempel, which underpin the DEFLATE method used in ZIP and the later LZW variant used in GIF.[8] These innovations have made data compression indispensable in computing, enabling everything from efficient web browsing to high-definition streaming.[9]

Common lossless techniques include run-length encoding for repetitive data, Huffman coding for assigning variable-length codes to symbols based on their frequency, and dictionary-based methods such as LZ77 and LZW for substituting repeated sequences.[8] Lossy approaches often rely on perceptual models, such as the discrete cosine transform in JPEG for images and the modified discrete cosine transform in MP3 for audio, prioritizing human perception over exact replication.[10] Compression ratio is measured as the ratio of the original size to the compressed size, with higher ratios indicating greater size reduction (e.g., 2:1 means the compressed file is half the original size).[11] Ongoing research continues to push these boundaries, particularly through hardware-accelerated and AI-enhanced methods for data-intensive fields such as IoT and big data.[12]
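As a concrete illustration of the simplest lossless technique named above, the following Python sketch implements a minimal run-length encoder and decoder. The function names and the (character, run length) output format are illustrative choices, not part of any standard.

```python
from itertools import groupby

def rle_encode(text):
    """Collapse runs of identical characters into (character, run length) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]

def rle_decode(pairs):
    """Rebuild the original string from (character, run length) pairs."""
    return "".join(ch * count for ch, count in pairs)

data = "AAAABBBCCDAA"
encoded = rle_encode(data)          # [('A', 4), ('B', 3), ('C', 2), ('D', 1), ('A', 2)]
assert rle_decode(encoded) == data  # lossless: the original is recovered exactly
```

Run-length encoding only pays off on data with long runs of repeated symbols; on data without such runs the encoded form can be larger than the input, which is why practical tools combine it with other methods.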
Fundamentals

Definition and Principles
Data compression is the process of encoding information using fewer bits than the original representation in order to reduce data size while preserving the essential content. Its primary purposes are to enable efficient storage on limited-capacity devices, accelerate file transfers over networks, and conserve bandwidth in communication systems.

At its core, data compression exploits statistical redundancy inherent in data, such as repeated patterns or predictable sequences, to represent information more compactly without altering its meaning. A fundamental principle is entropy, introduced by Claude Shannon as a measure of the average uncertainty or information content of a source; it establishes the theoretical lower bound on the average encoding length achievable by lossless compression. For a discrete source whose symbols occur with probabilities p_i, the entropy is H = -\sum_i p_i \log_2 p_i, where the sum runs over all possible symbols and -\log_2 p_i is the number of bits needed to encode a symbol that occurs with probability p_i.

Key trade-offs include the compression ratio, defined as the original data size divided by the compressed size (higher values indicate greater size reduction), balanced against the computational cost of encoding and decoding, which affects processing time and resource usage. For instance, a 10 KB text file containing repetitive phrases might compress to 4-5 KB, a ratio of about 2:1 to 2.5:1, while a file consisting of a single repeated pattern, such as all zeros, can shrink to a few bytes encoding only the pattern and its length. Compression methods fall into lossless and lossy categories: the former guarantees exact recovery of the data, while the latter accepts some loss in exchange for greater size reduction.
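The two quantities defined above, entropy and compression ratio, can be computed directly. The following is a minimal sketch; the helper names (entropy, compression_ratio), the toy string, and the example file sizes are illustrative assumptions rather than standard definitions from any library.

```python
import math
from collections import Counter

def entropy(text):
    """Shannon entropy H = -sum(p_i * log2(p_i)) in bits per symbol."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def compression_ratio(original_size, compressed_size):
    """Ratio of original size to compressed size; higher means more reduction."""
    return original_size / compressed_size

text = "ABABABABAACD"                      # a small, highly redundant source over 4 symbols
print(entropy(text))                       # about 1.63 bits/symbol, below log2(4) = 2
print(compression_ratio(10_000, 4_500))    # about 2.2, i.e. roughly 2.2:1
```

The entropy value below 2 bits per symbol reflects the redundancy of the toy source: a good lossless coder could, on average, spend fewer than the 2 bits per symbol that a fixed-length code over four symbols would require.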
Types of Compression

Data compression is broadly categorized into two primary types, lossless and lossy, each designed to reduce data size while addressing different requirements for fidelity and efficiency.[13] Lossless compression ensures the original data can be exactly reconstructed without any loss of information, making it essential where data integrity is paramount.[14] Lossy compression, by contrast, permits some irreversible data loss to achieve significantly higher compression ratios, prioritizing perceptual quality over exact reproduction.[15]

Lossless compression algorithms exploit statistical redundancy in the data to encode it more compactly while guaranteeing bit-for-bit recovery of the source upon decompression. This type is particularly suited to text files, executable programs, and other structured data where even minor alterations could render the content unusable or introduce errors; compressing source code or database records, for example, requires lossless methods to preserve functionality and accuracy. Typical lossless compression ratios range from 2:1 to 4:1, depending on the data's entropy and redundancy patterns.[4]

Lossy compression discards less perceptually significant information, such as fine high-frequency detail in images or inaudible frequencies in audio, to achieve greater size reduction while maintaining acceptable quality for human observers. It is well suited to multimedia content such as photographs, video, and music streams, where exact replication is unnecessary and bandwidth is constrained. Lossy compression ratios often exceed 10:1 for images and can reach 100:1 or more for video, enabling efficient storage and transmission in resource-limited environments.[16][17]

The choice between lossless and lossy compression involves trade-offs in fidelity, efficiency, and applicability, summarized in the comparison below; a minimal lossless round-trip example follows the table.

| Aspect | Lossless Compression | Lossy Compression |
|---|---|---|
| Fidelity | Exact reconstruction; no data loss | Approximate reconstruction; some data discarded |
| Compression Ratio | Typically 2:1-4:1 for general data | Often 10:1-100:1 or more for media content |
| Pros | Preserves all information; suitable for critical data | Higher efficiency; better for perceptual media |
| Cons | Lower ratios; less effective on random data | Irreversible loss; potential quality degradation |
| Use Cases | Archival storage, software distribution | Streaming services, mobile devices |
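To make the exact-reconstruction property of lossless compression concrete, the sketch below uses Python's standard-library zlib module (a DEFLATE-based lossless codec) to compress a redundant byte string and recover it bit for bit. The sample data and the printed ratio are illustrative only; real-world ratios depend entirely on the redundancy of the input.

```python
import zlib

original = b"the quick brown fox jumps over the lazy dog. " * 200  # highly redundant sample
compressed = zlib.compress(original, level=9)                      # lossless DEFLATE compression

# Lossless methods guarantee exact, bit-for-bit reconstruction.
assert zlib.decompress(compressed) == original

ratio = len(original) / len(compressed)
print(f"{len(original)} -> {len(compressed)} bytes, ratio about {ratio:.1f}:1")
```

A lossy codec offers no such guarantee: decoding yields an approximation judged acceptable by a perceptual model, which is why the assertion above has no counterpart for formats such as JPEG or MP3.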