Unary coding
Unary coding is a simple entropy encoding method in computer science that represents a positive integer n using a binary string of exactly n ones followed by a single zero, making it a prefix-free code suitable for variable-length encoding of small natural numbers.[1] For example, the number 1 is encoded as "10", 2 as "110", and 3 as "1110", with the length of the codeword being n + 1 bits.[2]
This encoding scheme, also known as thermometer code in some contexts, is uniquely decodable and instantaneous, allowing the decoder to identify codeword boundaries without ambiguity by searching for the terminating zero.[3] Unary coding is optimal for sources following a geometric probability distribution with success probability p = 1/2 (its n-bit variant attains the entropy bound in that case), but it becomes highly inefficient for larger values of n due to the linear growth of its code length.[4] It serves as a building block for more advanced universal codes, such as Elias gamma and delta codes, where the unary part encodes the length of a subsequent binary representation.[1]
In practice, unary coding finds applications in lossless data compression, particularly for encoding frequencies or positions of small integers in inverted indexes for information retrieval systems, where many values are low and follow a skewed distribution.[2] It is also employed in modern video coding standards like HEVC through Context-Adaptive Binary Arithmetic Coding (CABAC), where unary and truncated-unary binarizations produce the bin strings for parameters such as transform coefficient levels and motion vector differences, balancing simplicity with adaptability to probability models.[5] Additionally, variants appear in hardware designs for sorting networks and neural network implementations due to their straightforward mapping to physical representations like pulse widths.[6]
Fundamentals
Definition and Basic Concept
Unary coding, also known as the unary numeral system, is a fundamental method for representing positive integers in a binary framework, where the integer n is encoded as a sequence of n identical symbols—typically 1's—followed by a terminating delimiter, usually a 0, resulting in a codeword of length n+1 bits.[7] This approach treats the number's value directly as the count of repeated symbols, making it one of the simplest forms of entropy encoding for natural numbers.[7]
Alternative representations exist within unary coding, such as encoding n with n zeros followed by a single 1, or strictly positive variants that use n-1 ones followed by a 0, giving length n bits and forgoing a representation of zero.[7] Unary coding has also been termed "thermometer code" because the run of 1s grows with n, resembling the rising mercury column of a thermometer; the term highlights its intuitive, analog-like scaling in digital contexts.
The primary motivation for unary coding lies in its extreme simplicity, which is advantageous for encoding small integers or in data compression scenarios where the source follows a geometric probability distribution favoring low values, as unary codes achieve optimality under such conditions.[8] For instance, under a geometric distribution, unary encoding minimizes average code length by assigning shorter codes to more probable smaller numbers.[8]
A basic example illustrates this: the integer 1 is encoded as "10", 2 as "110", and 3 as "1110", where each additional 1 corresponds to an increment in value before the terminating 0.[7]
Encoding and Decoding Procedures
The encoding procedure for standard unary coding in a binary alphabet represents a positive integer n as a sequence of exactly n '1' bits followed by a single '0' bit as the delimiter. This ensures the codeword is prefix-free, allowing unambiguous parsing when concatenated. The algorithm is straightforward and can be implemented iteratively:
function encode_unary(n):
    if n < 1:
        raise ValueError("n must be a positive integer")
    codeword = ""
    for i in range(1, n + 1):
        codeword += "1"
    codeword += "0"
    return codeword
This pseudocode generates the codeword directly, with a time complexity of O(n) proportional to the output length.
Decoding reverses this process by scanning the bitstream sequentially, counting consecutive '1's until a '0' is encountered, and interpreting the count as n. The algorithm assumes a well-formed input stream and handles edge cases such as an empty string (invalid, returns error) or a stream ending with all '1's without a delimiter (invalid, potentially indicating truncation). A basic implementation in pseudocode is:
function decode_unary(bitstream):
    if bitstream is empty:
        raise ValueError("Empty bitstream")
    count = 0
    i = 0
    while i < len(bitstream) and bitstream[i] == '1':
        count += 1
        i += 1
    if i == len(bitstream) or bitstream[i] != '0':
        raise ValueError("Invalid unary code: missing delimiter")
    if count < 1:
        raise ValueError("Invalid unary code: no leading '1's")
    return count
This decoder advances through the stream one codeword at a time, with O(n) time per codeword.
A common variation inverts the symbols, representing n as n '0' bits followed by a single '1' delimiter, where decoding counts the '0's before the '1'. This form is equivalent in structure and decodability but may suit specific hardware or channel constraints, such as those favoring majority-zero representations.
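A minimal sketch of this inverted variant in Python (function names are illustrative, not from any standard library):

def encode_unary_inverted(n):
    # n '0's followed by a single '1' delimiter.
    if n < 1:
        raise ValueError("n must be a positive integer")
    return "0" * n + "1"

def decode_unary_inverted(bits, i=0):
    # Count '0's from offset i until the '1' delimiter;
    # return the decoded value and the next read position.
    count = 0
    while i < len(bits) and bits[i] == "0":
        count += 1
        i += 1
    if i == len(bits):
        raise ValueError("missing '1' delimiter")
    if count < 1:
        raise ValueError("invalid codeword: empty run of '0's")
    return count, i + 1

For example, encode_unary_inverted(3) yields "0001", and decode_unary_inverted("0001") returns (3, 4).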
Unary coding exhibits high sensitivity to bit-flip errors due to its reliance on run-length counting and delimiters for synchronization. A single bit error, such as flipping a '0' delimiter to '1', can merge adjacent codewords, causing the decoder to misinterpret subsequent values by shifting the run boundaries and potentially propagating errors across the entire stream.
For illustration, encoding the integer 4 yields the codeword "11110": four '1's followed by '0'. To decode "11110", the algorithm counts four consecutive '1's before the terminating '0', recovering n = 4. If a bit flip changes it to "11111", decoding would count five '1's without a delimiter, resulting in an invalid parse and loss of synchronization for following codewords.
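The desynchronization described above can be demonstrated with a small stream decoder (a hypothetical helper, not part of the procedures above):

def decode_unary_stream(bits):
    # Parse a concatenation of standard unary codewords (n '1's + '0').
    values, count = [], 0
    for b in bits:
        if b == "1":
            count += 1
        else:
            values.append(count)
            count = 0
    if count > 0:
        raise ValueError("truncated final codeword")
    return values

decode_unary_stream("1011010")  # returns [1, 2, 1]
# Flipping the first delimiter to '1' merges the first two codewords:
decode_unary_stream("1111010")  # returns [4, 1]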
Mathematical Properties
In unary coding, the codeword for a positive integer n has length l(n) = n + 1 bits, typically consisting of n ones followed by a zero (or vice versa). The average codeword length under a probability distribution P(n) over positive integers is then given by \sum_{n=1}^{\infty} P(n) \cdot (n + 1).[9]
The standard unary code is a prefix code, meaning no codeword is a prefix of another, which ensures instantaneous decodability without ambiguity during sequential decoding. This property implies that the code satisfies the Kraft inequality, a necessary and sufficient condition for the existence of a prefix code over a binary alphabet: \sum_{n=1}^{\infty} 2^{-l(n)} \leq 1. For unary coding, the sum evaluates to \sum_{n=1}^{\infty} 2^{-(n+1)} = \frac{1}{2} \sum_{n=1}^{\infty} 2^{-n} = \frac{1}{2} < 1, confirming compliance and allowing for potential extensions to more efficient codes if needed.[10]
Unary coding achieves optimality as a prefix code for certain source distributions, particularly geometric ones whose probabilities decay rapidly. For the geometric distribution P(n) = 2^{-n}, n \geq 1, the entropy is H = 2 bits: the standard unary code has an average length of 3 bits, leaving 1 bit of redundancy, whereas the strictly positive variant of length n is the Huffman code for this distribution and achieves the entropy bound with zero redundancy, since its codeword lengths coincide exactly with the ideal -\log_2 P(n) = n. More generally, unary variants remain optimal (in the Huffman sense) for geometric distributions P(n) = (k-1) k^{-n} with base k \geq 2, although the exact match with -\log_2 P(n) holds only for k = 2. For slower-decaying distributions (e.g., uniform over a finite range or geometric with k < 2), unary's linear length growth leads to inefficiency, with average length exceeding the entropy by an amount that grows with the support size.[10][11][9]
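These figures can be checked numerically by truncating the infinite sums; a quick Python sketch:

import math

# Geometric source P(n) = 2**-n for n >= 1, truncated at N terms.
N = 60
probs = [(n, 2.0 ** -n) for n in range(1, N + 1)]

entropy = -sum(p * math.log2(p) for _, p in probs)
avg_standard = sum(p * (n + 1) for n, p in probs)  # n ones + one zero
avg_variant = sum(p * n for n, p in probs)         # strictly positive variant

print(entropy, avg_standard, avg_variant)  # approx. 2.0, 3.0, 2.0 bits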
Variants of unary coding can be uniquely decodable without being prefix codes, requiring look-ahead during decoding to resolve ambiguities where one codeword prefixes another. For instance, a code where shorter symbols end with 1 and longer ones start with 0 may necessitate examining subsequent bits to distinguish boundaries, as the decoder cannot decide instantaneously upon encountering a potential prefix. Such constructions satisfy the McMillan extension of the Kraft inequality for uniquely decodable codes but introduce decoding delay proportional to the maximum look-ahead needed.[12]
Basic Unary Codes
Standard Run-Length Unary Codes
The standard run-length unary code represents a positive integer n as a binary string consisting of exactly n consecutive '1's followed by a single '0', forming a run of identical symbols whose length encodes the value directly.[13] This method, often termed "run-length" due to the contiguous sequence of '1's terminated by a delimiter, serves as a basic variable-length encoding scheme for positive integers.[14]
The encodings for small values of n are shown in the table below:
| n | Unary Code |
|---|---|
| 1 | 10 |
| 2 | 110 |
| 3 | 1110 |
| 4 | 11110 |
| 5 | 111110 |
This scheme offers simplicity in implementation, as encoding requires only repeating '1' n times and appending '0', while decoding involves counting '1's until the first '0' is encountered.[14] It is inherently prefix-free, ensuring no codeword is a prefix of another, which allows for instantaneous, unambiguous decoding without buffering or lookahead.[15]
A primary disadvantage arises from the code length of n + 1 bits, which grows linearly with n and becomes highly inefficient for large values, particularly when the data distribution lacks a strong bias toward small integers.[16] The unary codeword is exponentially longer than the \lfloor \log_2 n \rfloor + 1 bits of the equivalent binary representation (for example, n = 1000 requires 1001 bits in unary but only 10 bits in binary), which limits its utility for general-purpose compression of unbounded integers.
Standard run-length unary coding relates to broader run-length encoding (RLE) in data compression by using the run of '1's to directly signify the count, with the '0' acting as a terminator, akin to how RLE stores run lengths for repeated symbols in sequences.[17]
Uniquely Decodable Non-Prefix Unary Codes
Uniquely decodable non-prefix unary codes represent positive integers using variable-length binary strings based on runs of identical symbols, where at least one codeword serves as a prefix of another, violating the prefix condition. Despite this, the code as a whole ensures unique decodability, meaning any concatenation of codewords can be parsed into the original sequence without ambiguity. The Sardinas–Patterson algorithm provides a polynomial-time method to verify this property by iteratively computing sets of possible "dangling suffixes" and checking for intersections with the code itself; if no intersection occurs, the code is uniquely decodable.[18][19]
A representative construction encodes the integer n as a leading '1' followed by exactly n-1 '0's: for example, n=1 maps to "1", n=2 to "10", n=3 to "100", and n=4 to "1000".[20] Here, the codeword "1" is a prefix of all longer codewords like "10" and "100", requiring the decoder to read ahead after a '1' to check if the next bit is '0' (indicating continuation of the current codeword) or '1' (indicating the end of the current codeword and start of the next). For instance, the encoded sequence "110100" decodes uniquely as 1 ("1"), followed by 2 ("10"), followed by 3 ("100"), parsed by counting consecutive '0's after each leading '1' until the next '1' appears.
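A sketch of this look-ahead parsing in Python (the function name is illustrative):

def decode_leading_one_stream(bits):
    # Decode a concatenation of codewords of the form '1' + (n-1) '0's.
    # Each '1' opens a new codeword; the decoder must look ahead to the
    # next '1' (or the end of the stream) to find where the current
    # codeword stops, so decoding is not instantaneous.
    if bits and bits[0] != "1":
        raise ValueError("stream must begin with '1'")
    values, i = [], 0
    while i < len(bits):
        j = i + 1
        while j < len(bits) and bits[j] == "0":
            j += 1
        values.append(j - i)  # codeword length equals the encoded n
        i = j
    return values

decode_leading_one_stream("110100")  # returns [1, 2, 3]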
This construction assigns n bits to the integer n, one bit shorter than the standard (n + 1)-bit unary code, because the leading '1' of the following codeword serves as the implicit terminator of the current one; the lengths still satisfy the Kraft–McMillan inequality for unique decodability.[18] (By McMillan's theorem, unique decodability permits no codeword lengths beyond those achievable by some prefix code, so the same lengths are also realized by the prefix-free variant of n - 1 zeros followed by a '1'.) Construction typically involves appending run-length indicators ('0's) after a fixed delimiter ('1'), with the delimiter of the subsequent codeword acting as the implicit terminator, which avoids parsing conflicts and ensures a unique decomposition of the entire bitstream.
However, the need for look-ahead introduces decoding complexity, as the parser must buffer bits to resolve potential continuations, rendering these codes non-instantaneous and unsuitable for applications demanding real-time, streaming decoding without delay.[18]
Specialized Unary Codes
Unary codes using leading zeros followed by a one
Unary codes using leading zeros followed by a one are a variant of unary coding in which the positive integer n is encoded as n - 1 leading zeros followed by a single 1, so that each increment of n adds one zero to the prefix.[11] The resulting codewords have a predictable structure suited to enumeration and to integration into hybrid coding schemes, and they preserve an ordering useful for applications requiring sorted representations.[11] This construction adheres to unary principles and is prefix-free.[11]
The encoding process is straightforward: for n=1, the code is "1"; for n=2, "01"; for n=3, "001"; and so on, with the code length exactly equal to n.[11] These codewords are prefix-free, guaranteeing unique decodability upon parsing, as the terminating 1 unambiguously delineates each symbol without overlap.[21] Unary codes using leading zeros followed by a one are frequently employed as the length-indicator component in composite schemes, such as Elias gamma coding, where the unary segment precedes a binary offset to encode the bit length of the offset.[21]
Key advantages include the predictability of codeword lengths, which equal the integer value, and their inherent sortability: lexicographic comparison of codewords resolves the numerical order of the integers they represent (the lexicographic order is exactly the reverse of the numerical order, since larger integers carry longer zero prefixes).[11] This reduces ambiguity in mixed coding environments by providing a reliable baseline for synchronization and ordering.[21] For decoding, the number of leading zeros k is counted and the value computed as k + 1; for instance, "001" has two leading zeros, indicating n = 3.[21]
| Integer n | Unary Code | Length |
|---|---|---|
| 1 | 1 | 1 |
| 2 | 01 | 2 |
| 3 | 001 | 3 |
| 4 | 0001 | 4 |
This table illustrates representative encodings, highlighting the step-wise zero prefix addition.[11]
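As noted above, this leading-zeros unary code serves as the length indicator in Elias gamma coding; a minimal Python sketch (function names are illustrative):

def elias_gamma_encode(n):
    # The unary prefix ((L-1) zeros then a '1') encodes the bit length L
    # of n; the '1' doubles as the leading bit of n's binary form.
    if n < 1:
        raise ValueError("n must be a positive integer")
    binary = bin(n)[2:]               # e.g. n = 9 -> '1001'
    return "0" * (len(binary) - 1) + binary

def elias_gamma_decode(bits, i=0):
    # Count leading zeros to learn the length, then read that many
    # further bits; return the value and the next read position.
    zeros = 0
    while bits[i] == "0":
        zeros += 1
        i += 1
    end = i + zeros + 1
    return int(bits[i:end], 2), end

elias_gamma_encode(9)           # '0001001'
elias_gamma_decode("0001001")   # (9, 7)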
Generalized Unary Coding
Generalized unary coding extends the standard unary representation to a fixed-length block structure, allowing the encoding of a predefined range of non-negative integers from 0 to m-1 within a bounded number of bits, where m is determined by the block length n and the fixed number of 1s k used in the codeword. Unlike traditional unary coding, which uses variable-length runs of 1s terminated by a 0 and scales linearly with the value, this generalization permits the block of k 1s to be distributed or "spread" across the n-bit block, enabling a quadratic increase in representational capacity to approximately (n - k)^2 values. This approach is particularly suited for scenarios where the maximum value is known in advance, as the parameters n and k are chosen such that m \approx (n - k)^2.[22]
The construction combines unary-like runs of 1s with offsets or delimiters to position the k 1s uniquely within the fixed n-bit block, ensuring instantaneous decodability. For a given value v in the range 0 to m-1, the codeword begins with all 0s for v = 0, and for higher values, it places the k consecutive or spread 1s according to a systematic rule, such as starting with a contiguous block and then introducing increasing separations (e.g., one additional 0, then two, etc.) in subsequent "cycles" to cover the expanded range. A general formulation for the codeword can be viewed as a prefix segment of fixed length related to \log_2 m bits to indicate the cycle or offset position, followed by a unary suffix representing the remainder within that cycle using the k 1s. This hybrid structure maintains some unary properties while bounding the total length to n bits. Decoding involves scanning the block for the positions of the k 1s and mapping their configuration back to the value v.[22][23]
For example, consider encoding values from 0 to 15 (m = 16) using a 7-bit block with k = 3 (n - k = 4, 4^2 = 16). The code for 0 is 0000000 (no 1s, treated as a special case). For 1, it is 0000111 (the 3 ones placed contiguously at the end). Higher values spread the 1s further, such as for 6: 0010111 (1s positioned with specific offsets). Similarly, for a smaller range m = 8 (0 to 7), a configuration with n = 8 and appropriate k (e.g., k = 4) might use a 4-bit prefix of leading 0s or offset indicators combined with a 4-bit unary run of 1s for the remainder; for v = 1, this could yield a total 8-bit codeword like a 4-bit fixed prefix followed by a short unary run (e.g., effectively 0000 + 0111, adjusted for the scheme), ensuring all codewords are exactly 8 bits long. These examples illustrate how the fixed block enforces uniform length while embedding unary elements.[22][23]
The primary advantages of generalized unary coding lie in its bounded codeword length for applications with known maximum values, preventing the exponential growth in bits seen in unbounded unary schemes, and its utility in hybrid compression or table-based lookups where fixed-size entries simplify processing and storage. By achieving roughly quadratic capacity in n bits—far surpassing the linear capacity of standard unary—it offers improved efficiency for moderate ranges without resorting to full binary encoding, and the spread structure can enhance error resilience due to higher minimum Hamming distances in some configurations. This makes it relevant in neural network representations or biological-inspired coding, where spatial distribution of signals (like in birdsong timing) mirrors the spread 1s.[22][23]
However, this approach sacrifices the simplicity of pure unary coding, as decoding requires more complex position analysis rather than just counting runs, potentially increasing computational overhead. Additionally, it necessitates prior knowledge of the range m to select optimal n and k, limiting adaptability to variable or unknown bounds without redesigning the code set. Optimization techniques, such as adjusting the spread pattern for minimal average length within the fixed block, have been proposed to mitigate redundancy while preserving capacity.[23]
Applications
Uses in Computing
Unary coding plays a key role in data compression algorithms designed for sources with geometric distributions, where it efficiently encodes the quotient component of non-negative integers. In Golomb-Rice codes, the quotient \lfloor x / 2^k \rfloor is represented using a unary sequence of that many zeros followed by a one, paired with a binary-encoded remainder, making it optimal for exponentially decaying probabilities common in image residuals or run lengths.[24] Similarly, Elias gamma codes prepend a unary representation of the bit length \lfloor \log_2 n \rfloor + 1 (as zeros followed by a one) to the binary form of n excluding its leading one, providing prefix-free encoding for positive integers in inverted indexes and sparse data.[25]
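A minimal Rice-coding sketch (a Golomb code with divisor 2^k), following the zeros-then-one unary convention described above; the function name is illustrative:

def rice_encode(x, k):
    # Split x into quotient and remainder modulo 2**k; the quotient is
    # sent in unary (q zeros then a '1'), the remainder in k fixed bits.
    q, r = x >> k, x & ((1 << k) - 1)
    return "0" * q + "1" + (format(r, f"0{k}b") if k else "")

rice_encode(9, 2)  # quotient 2, remainder 1 -> '00' + '1' + '01' = '00101'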
In character encoding standards, UTF-8 employs a unary-like mechanism in its lead bytes: the number of leading 1 bits (terminated by a 0) equals the total number of bytes in a multi-byte sequence, while each continuation byte begins with the fixed pattern "10" followed by six data bits, allowing decoders to resynchronize without ambiguity even after bit errors. This design, specified in RFC 3629, ensures backward compatibility with ASCII while supporting variable-length encoding of up to four bytes per Unicode code point.[26]
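The bit patterns can be inspected directly; a quick Python illustration:

# U+00E9 ('é') encodes as two bytes in UTF-8.
for byte in "é".encode("utf-8"):
    print(f"{byte:08b}")
# 11000011  <- lead byte: two leading 1s signal a 2-byte sequence
# 10101001  <- continuation byte: '10' prefix, then six data bits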
Hardware implementations leverage unary coding for its linear scaling and monotonicity, particularly in thermometer codes used within analog-to-digital converters (ADCs) and digital-to-analog converters (DACs). In flash ADCs, a comparator array produces a thermometer code—a unary string of consecutive ones indicating the input voltage level—which is then encoded to binary, enabling high-speed conversion with reduced glitch sensitivity in current-steering DACs that activate unary-weighted elements sequentially.[27] Unary representations also appear in priority encoders for hardware arbitration, where thermometer-like outputs from comparators facilitate one-hot to binary conversion in systems like interrupt controllers, minimizing decoding latency.[28] In neuromorphic computing, unary coding models spiking neural network activations by representing activation strength as the number of pulses or ones in a temporal or spatial sequence, aiding energy-efficient inference on memristive hardware.[7]
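In the ideal (bubble-free) case, the thermometer-to-binary conversion mentioned above reduces to counting the ones; a simplified software model (real converters add bubble-error correction):

def encode_thermometer(level, width):
    # The lowest `level` comparators fire, filling ones from the bottom.
    return ("1" * level).rjust(width, "0")

def thermometer_to_level(code):
    # Ideal flash-ADC output: a contiguous run of ones; the input level
    # equals the number of comparators that fired.
    return code.count("1")

encode_thermometer(4, 7)           # '0001111'
thermometer_to_level("0001111")    # 4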
Modern applications extend unary coding to sparse data structures in databases and search engines, where it supports run-length encoding of bit vectors for compressed inverted indexes. For instance, in bitmap indexes, unary prefixes encode short runs of ones or zeros in posting lists, reducing space for low-cardinality attributes while enabling fast intersection queries via succinct operations.[29] Overall, unary coding offers low decoding overhead for small integers (codewords of roughly 1 to 10 bits), making it suitable for metadata in file systems or database indexes, where frequent access to small values favors decoding speed over bit density.
Unary Coding in Biology
In biological systems, unary coding appears in neural processes as a sparse representation strategy, particularly in the premotor nucleus HVC of songbirds like the zebra finch (Taeniopygia guttata), where it supports the precise timing of learned vocal sequences. During song production, individual HVC projection neurons (HVC_RA) fire a single, brief burst of 2–7 spikes per rendition of the song motif—a structured sequence of 3–10 syllables lasting approximately 200–600 ms—resulting in an ultra-sparse or unary code that dedicates one neural event to encoding a specific temporal position in the motif. This firing pattern aligns bursts with individual syllables or transitions, enabling the population of HVC neurons to collectively generate a motor program for song syntax without overlap or redundancy in a single motif cycle.[30]
Research on zebra finch brains has demonstrated this unary representation in premotor circuits, highlighting its role in sequence timing. In a seminal study, Kozhevnikov and Fee (2007) recorded from identified HVC neurons during singing, revealing that HVC_RA cells produce precisely timed bursts (mean duration 5.1 ms, 4.3 spikes per burst) exactly once per motif, with no activity during intervening periods or motifs, confirming the unary sparsity. This pattern contrasts with HVC interneurons and HVC_X neurons, which fire multiple bursts per motif but lack the same precision. The unary code in HVC_RA neurons thus serves as a combinatorial basis for motor control, where the population activity reconstructs the full song sequence through distributed, non-overlapping contributions.[30]
Broader biological analogs to unary coding occur in sensory neurons, where stimulus intensity is encoded via increasing spike trains in a rate-code manner, akin to a thermometer-like progression. For instance, in thermoreceptors, the firing rate escalates with temperature intensity, effectively representing discrete levels through the number of spikes in a fixed time window, much like unary run-length encoding for sparse, graduated signals. This mechanism allows efficient transmission of analog environmental data in neural pathways, prioritizing reliability for low-frequency but critical stimuli.