
HyperLogLog

HyperLogLog is a probabilistic algorithm for estimating the cardinality (number of distinct elements) of a multiset, providing near-optimal accuracy with minimal memory usage, typically requiring only a few kilobytes to approximate counts exceeding billions of elements. Developed by Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier in 2007, it builds upon earlier probabilistic counting techniques like LogLog by employing stochastic averaging with harmonic means to refine estimates. The algorithm operates by hashing input elements into binary strings and tracking the positions of leading zeros in these hashes across multiple registers, which serve as estimators of the set's size. For m registers, HyperLogLog achieves a relative standard error of approximately 1.04 / √m, enabling, for example, a 2% error rate with m = 2048 registers (about 1.5 kB of memory) for cardinalities larger than 10^9. This represents a 36% memory reduction compared to the LogLog algorithm while maintaining equivalent accuracy, making it particularly suitable for streaming and resource-constrained environments. In practice, HyperLogLog has been widely adopted in big data systems for tasks such as estimating unique user counts, query distinctness, and network traffic volumes. Implementations like HyperLogLog++ at Google incorporate enhancements such as bias correction for small cardinalities, 64-bit hashing for larger scales, and sparse representations to optimize memory for low-cardinality sets, reducing errors by up to fourfold in certain ranges. Its single-pass, parallelizable nature supports efficient processing in distributed frameworks, powering daily analyses of massive datasets in production environments.

Introduction

Definition and Purpose

HyperLogLog (HLL) is a probabilistic algorithm designed to approximate the cardinality, or number of distinct elements, in a multiset or data stream, while maintaining constant memory usage irrespective of the stream's size. Unlike exact counting methods that require storing all unique elements, HLL employs stochastic averaging techniques to produce estimates with controlled error bounds, making it suitable for processing massive datasets where precision must be balanced against resource constraints. The primary purpose of HyperLogLog is to address the count-distinct problem in big data environments, such as network traffic monitoring, web analytics, and database query optimization, where traditional approaches like hash sets become infeasible due to prohibitive memory and computation demands. By processing data in a single pass and relying on probabilistic sketching rather than deterministic tracking, HLL enables efficient estimation in real-time streaming scenarios without the need to store the full set of elements. Key benefits of HyperLogLog include its exceptional space efficiency, achieving relative standard errors of approximately 2% for cardinalities exceeding 10^9 elements while using only about 1.5 KB of memory. This trade-off of tunable accuracy for drastic reductions in storage—compared to exact methods that scale linearly with cardinality—positions HLL as a foundational tool in scalable data processing systems.

Historical Development

The HyperLogLog algorithm was introduced in 2007 by Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier in their seminal paper, "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm," presented at the Conference on Analysis of Algorithms (AofA). This work built directly on foundational probabilistic counting methods, including the 1985 Flajolet-Martin algorithm, which pioneered the use of hash-based sketches with averaging to approximate distinct counts in databases. It further refined the LogLog algorithm from 2003, co-authored by Marianne Durand and Philippe Flajolet, which advanced space efficiency by emphasizing the count of leading zeros in uniform hash values to estimate cardinality with log-log memory usage. Key to HyperLogLog's improvements was the use of harmonic means for averaging and refined bias correction via the constant α_m, enabling a relative standard error of approximately 1.04/√m for m registers while minimizing variance and bias compared to its predecessors. Adoption accelerated in the following decade as HyperLogLog proved valuable for big data environments. In 2013, Google engineers Stefan Heule, Marc Nunkesser, and Alexander Hall published "HyperLogLog in Practice: Algorithmic Engineering of a State of the Art Cardinality Estimation Algorithm," introducing the HLL++ variant with enhancements like 64-bit hashing, sparse representation for low cardinalities, and refined estimators to boost accuracy and reduce memory by up to 30% in production settings. Redis incorporated HyperLogLog as a native data structure in version 2.8 (released in late 2013), allowing efficient estimation of set cardinalities with a standard error of 0.81% using 12 KB per key. Google BigQuery followed in 2017 by integrating HLL++ functions, enabling faster and cheaper approximate distinct counts on petabyte-scale datasets without full scans. By the mid-2010s, HyperLogLog had permeated big data frameworks, with Apache Spark adding early support through custom implementations around 2015 for interactive and batch analytics on clusters. As of 2025, the core algorithm—stable since 2007—continues to underpin refinements in streaming systems, such as integrations with Apache Kafka through libraries like Kafka Streams for real-time distinct counting in event-driven pipelines. In 2025, a remote code execution vulnerability (CVE-2025-32023) was identified in Redis's HyperLogLog implementation, allowing authenticated attackers to trigger out-of-bounds writes; it was fixed in versions 8.0.3, 7.4.5, 7.2.10, and 6.2.19.

Fundamental Concepts

Key Terminology

In the context of the HyperLogLog algorithm, cardinality denotes the number of distinct elements within a multiset, representing a fundamental measure of uniqueness in large streams such as network flows or query logs. A sketch serves as a compact, probabilistic summary that enables efficient estimation of cardinality using sublinear space, processing input streams in a single pass without retaining the original elements themselves. The register array in HyperLogLog is a fixed-size array of counters, typically comprising m = 2^p entries for p-bit precision, where each entry records the maximum number of leading zeros observed in hashed values from assigned substreams to facilitate aggregated estimation. A multiset is a collection that permits duplicate elements, in contrast to a set, and forms the input domain for HyperLogLog, where the goal is to approximate the count of unique items amid potential repetitions. The standard error quantifies the relative accuracy of the cardinality estimate, typically expressed as 1.04 / \sqrt{2^p} for precision p, providing a measure of typical deviation in a mean quadratic sense. Bias correction involves targeted adjustments to the raw estimate, particularly for small cardinalities where underestimation occurs, such as applying empirical offsets or switching to alternative estimators like linear counting to enhance precision without increasing space requirements.
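
To make the precision terminology concrete, the following Python sketch tabulates the standard error 1.04 / \sqrt{2^p} and the dense memory footprint (assuming 6 bits per register) for several precision values; the helper name and defaults are illustrative, not taken from any particular library.

import math

def hll_precision_table(p_values=(4, 8, 12, 14, 16)):
    # For each precision p: m = 2^p registers, standard error ~ 1.04 / sqrt(m),
    # and dense memory assuming 6 bits per register.
    for p in p_values:
        m = 1 << p
        err = 1.04 / math.sqrt(m)
        mem_kib = m * 6 / 8 / 1024
        print(f"p={p:2d}  m={m:6d}  std_err={err:6.2%}  memory={mem_kib:7.2f} KiB")

hll_precision_table()  # p=14 prints std_err ~ 0.81% and memory 12.00 KiB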

Hashing and Leading Zeros

In the HyperLogLog algorithm, hashing serves as the foundational mechanism for transforming input elements into uniformly distributed random values, enabling probabilistic estimation. A hash function maps each distinct element from the input domain to a binary string, typically using 32-bit or 64-bit outputs to handle cardinalities ranging from millions to trillions. For instance, practical implementations employ functions like MurmurHash3, which generates pseudo-random integers that approximate a uniform distribution over the range [0, 2^{32}-1] or [0, 2^{64}-1], ensuring that even non-random inputs behave as if drawn from a uniform distribution. This uniformity assumption is essential, as it allows the algorithm to treat hashed values as independent random variables, with collisions between elements handled probabilistically rather than explicitly tracked. The leading zeros heuristic builds directly on this hashed output by examining the binary representation to quantify rarity. For a given hashed value interpreted as a binary fraction in [0,1), the rank \rho is defined as the number of leading zero bits before the first '1' bit (e.g., for the binary string 0.000001..., \rho = 5). Under the uniformity assumption, the probability of observing at least \rho leading zeros is 2^{-\rho}, meaning such a pattern occurs roughly once every 2^\rho trials, signaling the scale of the underlying set's cardinality. This probabilistic indicator leverages the geometric distribution inherent in uniform random bits, where higher \rho values become increasingly rare and thus informative about large cardinalities. The core insight posits that the maximum \rho observed across multiple hashed samples approximates \log_2 of the cardinality, with aggregation via the harmonic mean to stabilize estimates from sparse, high-rank observations that reveal the presence of large sets. For example, if the hash of the element "apple" yields a binary value starting with five leading zeros (e.g., 0.000001...), this suggests the element's position in a set of size approximately 2^5 = 32, as such a pattern would be expected once in a sample of that scale. In practice, registers may store these maximum \rho values per bucket to capture this information efficiently. The uniformity of hashing is critical for the heuristic's validity, as deviations can skew \rho counts and degrade accuracy, though robust hash functions mitigate this in real-world deployments.
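
As a minimal illustration of the rank computation, the Python sketch below derives a 64-bit value from an element using the standard library's SHA-1 (chosen purely so the example runs without dependencies; production systems would use a fast non-cryptographic hash like MurmurHash3) and counts its leading zeros:

import hashlib

def hash64(element: str) -> int:
    # Derive a 64-bit integer from the element (SHA-1 used only for reproducibility).
    digest = hashlib.sha1(element.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def leading_zeros(w: int, width: int = 64) -> int:
    # Number of leading zero bits before the first 1-bit; width if w == 0.
    # (The core-algorithm sections use the 1-based rank rho = leading_zeros + 1.)
    return width - w.bit_length() if w else width

h = hash64("apple")
print(f"hash          = {h:064b}")
print(f"leading zeros = {leading_zeros(h)}")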

Core Algorithm

Register Structure

The HyperLogLog algorithm employs a fixed-size array consisting of m registers, where m = 2^p and p is a small integer parameter that determines the number of buckets. Typically, p = 14, yielding m = 16{,}384 registers to balance estimation accuracy and memory efficiency. Each register functions as a hash bucket and stores a small integer representing the maximum \rho observed, where \rho is the position of the leftmost 1-bit (equivalent to 1 plus the number of leading zeros from the MSB) in the binary representations of hash values mapped to that bucket. Upon initialization, all registers are set to 0, which provides a practical starting point for small cardinalities and avoids underestimation issues that could arise from the negative infinity used in theoretical models. The value of p dictates the indexing: the first p bits (MSB) of a 64-bit hash value select the specific register index j (ranging from 0 to m-1), while the remaining bits are used to compute the leading zeros for \rho and potential updates to that register. This bucket selection ensures an even spread of hashed elements across the registers, leveraging the uniformity properties of the hash function. The overall memory footprint is O(m), with each register requiring a constant amount of space—typically 6 bits to accommodate \rho values up to 63 for 64-bit hashes, resulting in approximately 12 KB for the standard p = 14 configuration. For illustration, consider p = 4, so m = 16; a hash value beginning with the bits 0010 (decimal 2) followed by arbitrary bits (e.g., 0010xxxx) would direct updates to register 2.
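
This indexing scheme can be checked directly; for p = 4, a 64-bit hash whose top four bits are 0010 routes to register 2, with the remaining 60 bits reserved for the rank computation (a small sketch with illustrative names):

P = 4  # precision; m = 2^P = 16 registers

def split_hash(h: int, p: int = P, width: int = 64):
    # Top p bits select the register index; the rest form the suffix used for rho.
    j = h >> (width - p)
    w = h & ((1 << (width - p)) - 1)
    return j, w

h = 0b0010 << 60  # a hash starting with bits 0010, remaining bits zero
j, w = split_hash(h)
print("register index:", j)  # -> 2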

Add Operation

The add operation in HyperLogLog inserts an element into the sketch by updating the appropriate register in the register array, leveraging a hash function to map the input to a probabilistic representation of its contribution to the cardinality estimate. To perform the addition, the input element v is first hashed using a hash function H that produces a fixed-width binary string, typically 32 or 64 bits, assuming uniform distribution across the bit space. The first p bits (MSB) of the hash determine the register index j, where p = \log_2 m and m is the number of registers in the array (e.g., p = 14 for m = 16384). The remaining bits of the hash form a suffix from which the value \rho is computed as the position of the leftmost 1-bit (starting from 1 at the MSB), equivalent to the number of leading zeros plus one in this suffix. This \rho represents an estimate of the negative logarithm of a uniform random value in (0,1], capturing the rarity of the hashed value. The register at index j is then updated to the maximum of its current value and this \rho, ensuring the sketch retains the highest observed rank for that bucket. The operation is idempotent: adding the same element multiple times computes the identical hash and \rho, so the max update leaves the register unchanged after the first insertion, preventing redundant modifications to the sketch. This property holds under the assumption of a deterministic hash function, making the algorithm suitable for multisets where duplicates should not inflate the estimate. HyperLogLog's add operation supports streaming processing, as it requires only a single pass over the input and direct access to the register array for updates. Each addition involves constant-time operations—hashing, bit extraction, and a max update—enabling efficient incorporation of elements without revisiting prior data. The following pseudocode illustrates the core add procedure for a 64-bit hash and p-bit precision:
function add(element v):
    hash = H(v)  // 64-bit hash
    j = (hash >> (64 - p)) & ((1 << p) - 1)  // Extract highest p bits for index
    w = hash & ((1 << (64 - p)) - 1)  // Remaining lower bits as suffix
    ρ = position_of_leftmost_1_bit(w)  // Position starting at 1 from MSB of w
    registers[j] = max(registers[j], ρ)
Here, position_of_leftmost_1_bit returns the bit position of the most significant 1-bit (1-based from MSB). To prevent overflow in fixed-size registers, \rho values are capped at a maximum threshold (at most 32 - p + 1 for a 32-bit hash suffix, or 64 - p + 1 for 64-bit implementations), beyond which further increases provide negligible refinement to the estimate. This cap ensures numerical stability while maintaining the algorithm's probabilistic guarantees.

Count Operation

The count operation in HyperLogLog estimates the cardinality n of the set of distinct elements by aggregating the values stored in the m registers, where each register j holds R_j, the maximum \rho observed among the hash suffixes assigned to that bucket. To compute the estimate, first calculate Z = \frac{1}{m} \sum_{j=1}^m 2^{-R_j}, the arithmetic mean of the quantities 2^{-R_j} across all registers. The raw estimator is then given by E = \alpha_m \cdot \frac{m}{Z}, where \alpha_m is a normalizing constant that corrects for bias, approximated as \alpha_m \approx 0.7213 / \left(1 + \frac{1.079}{m}\right) for m \geq 128, with fixed values 0.673 for m = 16, 0.697 for m = 32, and 0.709 for m = 64.

This formula derives from the Flajolet-Martin probabilistic counting framework: for n uniformly hashed values, the maximum rank observed in a register grows like \log_2 of that register's substream cardinality, so each 2^{R_j} is a rough, biased indicator of n/m. In the averaged setting of LogLog and HyperLogLog, stochastic averaging over m independent buckets reduces variance; the HyperLogLog refinement replaces the geometric mean of the 2^{R_j} (equivalent to 2^{\overline{R}}, as in LogLog) with their harmonic mean, embodied in the 1/Z form, which is less sensitive to outlier registers and achieves lower variance. The constant \alpha_m is obtained via the integral \alpha_m = \left( m \int_0^\infty \left( \log_2 \frac{2+u}{1+u} \right)^m du \right)^{-1}.

For small cardinalities, where the main estimator is unreliable (i.e., if E \leq \frac{5m}{2}), linear counting is used instead: let V be the number of empty registers (those with R_j = 0); then E = m \ln \left( \frac{m}{V} \right). For large cardinalities with 32-bit hashes, where hash collisions cause the estimator to saturate as n approaches 2^{32}, apply the correction E^* = -2^{32} \ln \left(1 - \frac{E}{2^{32}} \right) if E > \frac{1}{30} \cdot 2^{32}.

The operation outputs this approximate distinct count E (or E^*), providing a probabilistic guarantee of relative error around 1.04 / \sqrt{m} with high probability. For example, with m = 16 and every register at \rho = 3 (implying Z = 2^{-3} = 0.125), E \approx 0.673 \times 16 / 0.125 \approx 86.
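
Putting the add and count operations together, the following self-contained Python sketch implements the estimator described above for 64-bit hashes, including the \alpha_m constant and the linear-counting fallback; it is a didactic reference under stated assumptions (SHA-1 standing in for a production hash function), not a tuned library implementation.

import hashlib
import math

class HyperLogLog:
    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # valid for m >= 128

    def _hash64(self, element):
        digest = hashlib.sha1(str(element).encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, element):
        h = self._hash64(element)
        j = h >> (64 - self.p)              # top p bits: register index
        w = h & ((1 << (64 - self.p)) - 1)  # remaining bits: suffix
        rho = (64 - self.p) - w.bit_length() + 1 if w else (64 - self.p) + 1
        self.registers[j] = max(self.registers[j], rho)

    def count(self):
        z = sum(2.0 ** -r for r in self.registers) / self.m
        e = self.alpha * self.m / z         # raw estimate E = alpha_m * m / Z
        if e <= 2.5 * self.m:               # small-range correction
            v = self.registers.count(0)
            if v:
                return self.m * math.log(self.m / v)
        return e

hll = HyperLogLog(p=14)
for i in range(100_000):
    hll.add(f"user-{i}")
print(round(hll.count()))  # close to 100000; typically within ~1%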

Advanced Operations

Merge Operation

The merge operation in HyperLogLog enables the combination of multiple sketches to approximate the cardinality of their union, which is particularly useful in distributed environments where data is processed across multiple nodes. When merging two sketches with the same number of registers m, a new register array is created by taking the element-wise maximum: for each j from 0 to m-1, the merged register is \max(R_1[j], R_2[j]), where R_1 and R_2 are the register arrays of the input sketches. The cardinality estimate is then computed from this merged array using the standard estimation formula, as if it were a single sketch built from the union of the underlying sets. Merging is defined only for sketches with the same number of registers. For differing register counts, direct merging is not standard; some systems downgrade the result to the lower precision to enable combination while accepting reduced accuracy. The merge operation is both commutative and associative, meaning the order of merging multiple sketches does not affect the result, and it supports efficient tree-based aggregation in parallel or distributed systems, such as MapReduce-style frameworks, where partial sketches from subsets of data can be recursively combined. A primary use case is estimating the size of the union of two sets A and B, where |A \cup B| \approx the estimate of the merged sketch of A and B, enabling scalable distinct counting over partitioned data without materializing the full sets. However, the operation is designed specifically for unions and does not directly support intersections or other set operations; approximating intersections typically involves inclusion-exclusion principles with multiple merges, though such methods introduce additional error and are beyond the core merge functionality.
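
In code, a merge under equal precision is just an element-wise maximum over the register arrays; the sketch below assumes two arrays of the same length (a minimal illustration, not a library API):

def merge(regs_a, regs_b):
    # Element-wise maximum yields a sketch equivalent to one built
    # directly from the union of the two underlying multisets.
    if len(regs_a) != len(regs_b):
        raise ValueError("merge requires sketches with equal register counts")
    return [max(a, b) for a, b in zip(regs_a, regs_b)]

The union cardinality |A ∪ B| is then obtained by running the standard count operation on merge(A, B).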

Practical Implementation Notes

In practical deployments of HyperLogLog (HLL), the parameter p (where the number of registers m = 2^p) is commonly selected between 4 and 18 to balance memory usage and estimation accuracy, with higher p values providing lower error at the cost of increased storage. For instance, p = 14 (yielding m = 16384 registers) typically requires approximately 10 KB of memory using 5 bits per register and achieves a standard error of about 0.81%. Implementers should evaluate p based on expected cardinality ranges and available resources, opting for sparse representations in early stages to further minimize memory for low-cardinality sets.

Hash function selection is critical for uniform distribution without introducing bias; 64-bit non-cryptographic hashes are recommended to support large cardinalities while maintaining performance. Algorithms like xxHash or MurmurHash3 are preferred for their speed and quality, as cryptographic options such as SHA-256 impose unnecessary computational overhead without improving uniformity in this context. The first p bits of the hash determine the register index, with the remaining bits used to compute the rank \rho (leading zeros), ensuring even bucket assignment.

For storage and transmission, HLL sketches are serialized in compact formats to optimize storage and I/O. Dense representations store registers as a fixed-size byte array (e.g., 6 bits per register packed into bytes), while sparse formats use variable-length encodings for only non-zero registers, switching to dense once the sparse form grows past a size threshold (for example, when the fraction of occupied registers exceeds roughly 1/32). This approach enables efficient persistence in databases or merging across distributed systems; a minimal packing sketch appears at the end of these notes.

Edge cases require careful handling to ensure robustness. An empty sketch, with no elements added, yields a cardinality estimate of 0, often via initialization to a cleared state or a linear counting fallback. For 64-bit hashes, the rank \rho (number of leading zeros) can reach up to roughly 64, but registers typically use 5-6 bits (capping at 31-63); values exceeding the register width are handled by setting a maximum indicator (e.g., all bits set) to avoid underestimation in high-cardinality scenarios.

In multi-threaded environments, concurrent add operations demand synchronization to prevent race conditions on register updates. Simple implementations employ mutex locks around the add and estimate functions, ensuring consistency at the cost of potential contention; more advanced lock-free variants use atomic compare-and-swap operations on registers for higher throughput. Recent advancements as of 2025 include SIMD-accelerated implementations in open-source libraries, particularly Rust crates like hyperlog-simd, which leverage vector instructions on x86_64 and ARM for faster computations and bias corrections during estimation. These optimizations can yield 2-5x speedups on modern hardware without altering accuracy, making them suitable for high-velocity streaming applications.
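
As an example of the dense serialization mentioned above, the following sketch packs a register array into 6 bits per entry and unpacks it again (big-endian bit order; the layout is an illustrative assumption, not any particular wire format):

def pack6(registers):
    # Pack each register value (0..63) into 6 bits.
    acc, nbits, out = 0, 0, bytearray()
    for r in registers:
        acc = (acc << 6) | (r & 0x3F)
        nbits += 6
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
    if nbits:  # flush remaining bits, left-aligned in the final byte
        out.append((acc << (8 - nbits)) & 0xFF)
    return bytes(out)

def unpack6(data, m):
    # Inverse of pack6 for m registers.
    acc, nbits, regs = 0, 0, []
    for byte in data:
        acc = (acc << 8) | byte
        nbits += 8
        while nbits >= 6 and len(regs) < m:
            nbits -= 6
            regs.append((acc >> nbits) & 0x3F)
    return regs

assert unpack6(pack6([1, 2, 3, 63]), 4) == [1, 2, 3, 63]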

Analysis

Complexity

The space complexity of the HyperLogLog (HLL) algorithm is O(m), where m = 2^p is the number of registers and p is the precision parameter, resulting in a fixed amount of memory independent of the cardinality n of the input multiset. In the original implementation, each register requires at most 5 bits; modern standard implementations using 64-bit hashing typically require 6 bits per register, yielding approximately 12 KB for p = 14 (m = 16384). The time complexity of the add operation is O(1) amortized per element, consisting primarily of a constant-time hash computation followed by an array access and update to the relevant register. In contrast, the count (estimation) operation requires O(m) time to sum the terms 2^{-R_j} over all registers and apply the bias correction. Similarly, the merge operation between two HLL sketches takes O(m) time by computing the componentwise maximum across corresponding registers. Overall, while queries and merges scale linearly with the fixed sketch size m, the add operations enable constant-time insertions, making HLL suitable for streaming data. This design allows HLL to handle cardinalities up to 2^{64} without increasing space usage, unlike exact methods such as hash sets that require O(n) space and become prohibitive for large n.

Accuracy and Error Bounds

The HyperLogLog (HLL) algorithm provides probabilistic guarantees on the accuracy of its estimates through statistical analysis rooted in the central limit theorem applied to the distribution of register values. For large cardinalities n, the relative standard error is approximately \sigma \approx 1.04 / \sqrt{m}, where m is the number of registers, ensuring that estimates deviate from the true value n by about \sigma with 65% probability, 2\sigma with 95% probability, and 3\sigma with 99% probability. This bound arises from the asymptotic normality of the estimator, treating the average of the exponentially transformed register maxima as a sum of weakly dependent random variables. The variance of the estimate E is given by \mathrm{Var}(E) \approx (1.04 / \sqrt{m})^2 E^2, reflecting the relative nature of the error, which scales with the square of the true cardinality. Confidence intervals can be constructed using Chebyshev's inequality for conservative bounds or empirical distributions for tighter ones, though the standard error formula serves as the primary metric for practical tuning of m.

For small cardinalities n < 10m, the standard HLL estimator exhibits underestimation bias due to nonlinear distortions in the register updates when many registers remain empty or low-valued; this is typically corrected by switching to linear counting, which uses the proportion of empty registers for accurate estimation up to O(m), or via empirical bias adjustments in improved implementations such as HyperLogLog++, which apply k-nearest neighbor corrections below thresholds around 5m. HLL operates in distinct estimation regimes based on cardinality scale: a linear regime for n \ll m, where accuracy approaches exact counting via empty-register tracking; a LogLog regime for m < n < 2^{64}, relying on the probabilistic distribution of leading zeros in 64-bit hashes; and a stable regime for larger n, where the estimate converges to the theoretical variance without further degradation (with collision adjustments near 2^{64}).

In distributed settings, such as federated queries over clinical data repositories, empirical studies from the 2020s confirm typical relative errors of 0.5-2% for m \geq 2^{12}, aligning with theoretical bounds while accounting for merge operations across sketches. Variance can be further reduced by maintaining multiple independent HLL sketches and averaging their estimates, dividing the variance by the number of sketches as for any average of independent estimators, at the cost of additional space. For example, with m = 2^{14} = 16384 registers, the standard error is \sigma \approx 0.81\%, yielding relative errors of \pm 1.6\% at 95% confidence for large sets, a configuration common in production systems for balancing space and precision.
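
These bounds translate directly into usable confidence intervals: given a precision p and an estimate, compute \sigma = 1.04 / \sqrt{m} and scale by k standard deviations (a convenience sketch applying the 2\sigma and 3\sigma rules quoted above):

import math

def confidence_interval(estimate, p, k=2):
    # sigma = 1.04 / sqrt(m); k=2 gives the ~95% band, k=3 the ~99% band.
    sigma = 1.04 / math.sqrt(1 << p)
    return estimate * (1 - k * sigma), estimate * (1 + k * sigma)

low, high = confidence_interval(1_000_000, p=14)
print(f"[{low:,.0f}, {high:,.0f}]")  # roughly +/-1.6% around the estimate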

Extensions and Variants

HLL++

HLL++ is a refined variant of the HyperLogLog algorithm, developed in 2013 by Stefan Heule, Marc Nunkesser, and Alexander Hall while at Google. This improvement addresses limitations in the original algorithm, particularly for small and medium cardinalities, through targeted algorithmic enhancements while maintaining compatibility with standard HLL implementations. Key enhancements include the adoption of 64-bit hashing, which extends the algorithm's capability to estimate cardinalities exceeding 10^9 without the large-range corrections and collision issues present in 32-bit variants, at the cost of slightly increased memory of 6 \times 2^p bits. For small cardinalities (n < 5 \times 2^p), HLL++ employs precomputed bias correction tables derived from extensive simulations, using k-nearest neighbor interpolation (k=6) to mitigate systematic underestimation errors; these tables cover up to approximately 2^{16} elements. Additionally, a sparse representation is introduced for tiny sets (n \ll 2^p), dynamically allocating registers only as needed and storing index-value pairs in a compressed map, which transitions to dense mode once exceeding a threshold of about 6 \times 2^p bits to optimize space.

These modifications yield substantial accuracy gains, reducing the relative error by a factor of 4 for cardinalities up to roughly 12,000 (with p = 14), effectively achieving errors below 0.5% for n = 1000 in practical setups by eliminating bias spikes at transition points. Regarding efficiency, the sparse representation and advanced encoding (averaging 8.12 bits per element in the sparse regime) enable up to 1.5 times less memory usage than the original HLL for equivalent accuracy across varying cardinalities, particularly beneficial for low-to-medium n. The precomputed corrections are based on empirical data from billions of simulated hashes, ensuring robustness.

HLL++ has seen widespread adoption in production systems and remains backward-compatible with standard HLL sketches, allowing seamless merging of registers from both variants. It is implemented in Redis for efficient estimation in caching and real-time analytics. As of 2025, enhanced HLL sketches based on similar principles power approximate distinct counts in Google BigQuery for large-scale data processing.
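
The sparse-to-dense transition can be sketched as follows: register updates live in a small map of (index, \rho) pairs until the map's footprint would exceed the dense array's, at which point the sketch densifies. The threshold and per-entry cost below are illustrative assumptions, not the exact HLL++ encoding.

class SparseToDense:
    ENTRY_BITS = 32  # assumed in-memory cost per sparse (index, rho) pair

    def __init__(self, p=14):
        self.p, self.m = p, 1 << p
        self.sparse = {}   # register index -> max rho
        self.dense = None

    def update(self, j, rho):
        if self.dense is not None:
            self.dense[j] = max(self.dense[j], rho)
            return
        if rho > self.sparse.get(j, 0):
            self.sparse[j] = rho
        # Densify once the sparse map would outgrow 6-bits-per-register storage.
        if len(self.sparse) * self.ENTRY_BITS > self.m * 6:
            self.dense = [0] * self.m
            for idx, r in self.sparse.items():
                self.dense[idx] = r
            self.sparse = None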

Streaming HyperLogLog

Streaming HyperLogLog extends the core HyperLogLog algorithm to handle continuous data streams in a one-pass manner, supporting both windowed and infinite streams while maintaining bounded memory usage. This adaptation is particularly suited for environments where data arrives incrementally, such as network monitoring or event analytics, by processing elements as they arrive without requiring full dataset storage. The fixed number of registers m ensures constant memory usage, typically on the order of kilobytes, regardless of stream length.

Key features include support for deletions through aging mechanisms or hybrid structures. In aging approaches, older elements are evicted based on timestamps to simulate window boundaries, preventing unbounded growth. For more precise deletions, hybrids with Count-Min Sketches can track frequencies, allowing subtraction of removed items from the estimate, though this increases memory slightly. These features enable online updates without recomputing the entire sketch.

Variants address specific streaming scenarios. The sliding window HyperLogLog maintains estimates over a moving time window by replacing each register with a List of Future Possible Maxima (LFPM), storing potential leading zero counts from recent elements and evicting those outside the window or dominated by newer values. This allows querying for any subwindow duration without full recomputation (a simplified sketch appears at the end of this section). An improved variant uses a fixed-size Bit Distance Recorder (BDR) per register instead of a variable-length LFPM, reducing memory overhead while sharing counters across windows via the BDR for distributed settings. For infinite streams, adaptations incorporate exponential histograms to weight recent elements more heavily, decaying the influence of older data over time. A recent extension, Sliding Window Virtual HyperLogLog (SWVHLL), optimizes for network flow spread estimation by virtually managing window slides, achieving up to 60% error reduction compared to prior methods in high-speed traces.

Core operations involve incremental additions via hashing into registers, similar to standard HyperLogLog, and merges by taking register-wise maxima, enabling parallel aggregation. Periodic counts are computed on the fly from the current register values, avoiding full recomputes even as the stream evolves. These operations support low-latency queries in streaming frameworks.

Challenges in streaming include handling duplicates, which are mitigated by uniform hashing but can amplify in bursty streams, and error accumulation over long durations due to unpruned stale data. Aging and pruning strategies help bound errors, though they introduce trade-offs in precision for very large windows. In 2024-2025 real-time analytics, Streaming HyperLogLog integrates with Apache Flink via built-in HyperLogLog++ functions for SQL-based distinct counts in windows, and with Apache Storm for optimized cardinality estimation in distributed stream computing, enhancing performance in pipelines. Emerging research explores quantum variants for streaming cardinality estimation, leveraging quantum sketches for potential space advantages in adversarial streams.
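
To make the LFPM idea concrete, the sketch below maintains, for a single register, a list of (timestamp, \rho) pairs in which newer entries evict older ones with smaller or equal \rho; querying any window then returns the first in-window entry. This is a simplified reading of the sliding-window scheme with hypothetical helper names, not the published implementation.

def lfpm_insert(lfpm, t, rho):
    # Evict entries the new arrival dominates: they are older with rho <= new rho,
    # so they can never again be the maximum of any future window.
    while lfpm and lfpm[-1][1] <= rho:
        lfpm.pop()
    lfpm.append((t, rho))

def lfpm_max(lfpm, t_now, window):
    # Entries are ordered by ascending time and strictly descending rho,
    # so the first entry inside the window carries the maximum rho.
    for t, rho in lfpm:
        if t >= t_now - window:
            return rho
    return 0

lfpm = []
for t, rho in [(1, 3), (2, 1), (3, 5), (4, 2)]:
    lfpm_insert(lfpm, t, rho)
print(lfpm_max(lfpm, t_now=4, window=2))  # -> 5 (max rho among events with t >= 2)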

Applications

Real-World Use Cases

HyperLogLog has found widespread adoption in web analytics for estimating unique visitors without storing exhaustive data. For instance, Google Analytics 4 employs the HyperLogLog++ variant to approximate cardinality for key metrics such as active users and sessions, enabling efficient processing of massive traffic volumes while maintaining acceptable accuracy levels. This approach allows platforms to scale analytics for billions of daily interactions, reducing computational overhead compared to exact counting methods.

In database systems, HyperLogLog serves as a native probabilistic data structure for cardinality estimation in queries. Redis integrated HyperLogLog in version 2.8 to approximate the number of distinct elements in multisets or streams with fixed memory usage, making it ideal for real-time applications like leaderboards or session tracking. Similarly, the postgresql-hll extension for PostgreSQL provides a fixed-size sketch for distinct value counting, supporting tunable precision in scenarios such as rollup tables for aggregated metrics, where exact counts would demand prohibitive storage.

Big data frameworks leverage HyperLogLog for approximate distinct counting in distributed environments. Apache Spark's approx_count_distinct function implements the dense HyperLogLog++ algorithm to estimate unique values across large datasets, facilitating faster query execution in tools like Spark SQL for tasks involving petabyte-scale data. In fraud detection within streaming pipelines, HyperLogLog estimates unique transactions or devices to identify anomalies, as seen in Redis-powered systems at financial institutions, where it helps detect unusual patterns like rapid unique IP accesses without retaining full logs.

Network monitoring applications utilize HyperLogLog to gauge the uniqueness of addresses in traffic flows. Facebook's implementation in Presto, detailed in their engineering analysis, applies it to estimating distinct elements in network data streams, supporting monitoring tasks with minimal memory footprint.

In machine learning workflows, HyperLogLog aids feature engineering by approximating distinct categories in high-dimensional datasets. Databricks integrates it into scalable feature stores to compute cardinality for categorical variables, enabling efficient preprocessing of sparse data for models trained on billions of records without excessive resource demands.

Emerging applications in 2025 highlight HyperLogLog's role in resource-constrained settings. In edge computing for IoT, it estimates unique device counts in real-time streams, optimizing performance in distributed networks by favoring probabilistic structures over exact methods for large-scale environments. For privacy-preserving analytics in federated queries, HyperLogLog enables approximate distinct counts across clinical data networks without centralizing sensitive patient data, as demonstrated in methods combining it with privacy safeguards to balance utility and compliance in multi-site collaborations.

Comparisons with Other Algorithms

HyperLogLog (HLL) represents a significant advancement over the original Flajolet-Martin probabilistic counting algorithm by employing multiple registers and stochastic averaging to reduce variance in cardinality estimates. While the Flajolet-Martin PCSA sketch achieves a relative standard error of approximately 0.78/√m—where m is the number of registers—its single-register variant suffers from high variability, and even the averaged version requires more computational effort for comparable accuracy. In contrast, HLL maintains a relative standard error of 1.04/√m but does so with enhanced stability through HyperLogLog-specific refinements, making it more practical for large-scale implementations.

Compared to the LogLog algorithm, which also uses multiple registers but averages the maximum leading-zero ranks arithmetically, HLL incorporates the harmonic mean to improve accuracy for smaller cardinalities (n < 10^5) and exhibits lower variance overall. Both algorithms utilize similar space—roughly m bytes with 5-bit registers—but HLL delivers superior accuracy, achieving the same error as LogLog while consuming only 64% of the memory for equivalent precision. This efficiency stems from HLL's refined bias correction and estimator design, reducing systematic error in low-cardinality regimes without increasing space demands.

Exact methods, such as hash sets, provide precise counts but at the cost of linear space O(n), where n is the number of distinct elements, rendering them infeasible for massive datasets. HLL, with its constant space usage of O(1) (typically 1.5 kB for 2% error on cardinalities exceeding 10^9), excels in streaming environments where n surpasses 10^6, offering sublinear approximations that trade negligible accuracy for vast scalability.

HLL is tailored for cardinality estimation of distinct elements, whereas MinHash prioritizes set similarity metrics like the Jaccard index, using minimum hash signatures to enable fast comparisons between sets. Although MinHash can approximate cardinality with a relative standard deviation of 1/√m, it demands more space (32- or 64-bit signatures) and slower update times O(m) compared to HLL's O(1) operations and 5-6 bit registers, making HLL markedly more accurate and efficient for pure count tasks. For similarity applications, however, MinHash's design provides better support, as HLL lacks inherent locality sensitivity.

The Count-Min Sketch (CMS) focuses on frequency estimation for individual elements in multisets, using a 2D array of counters to bound overestimates with space O(1/ε · log(1/δ)) for error ε and failure probability δ. In contrast, HLL targets distinct counts and achieves lower space requirements—often an order of magnitude less—for equivalent accuracy in cardinality tasks, as CMS would require adaptations like tracking all potential elements to estimate distinctness, inflating its footprint. Thus, HLL is preferable for distinct element counting in streams, while CMS suits heavy-hitter detection or point queries.

Overall, HLL strikes an optimal balance for sublinear-space cardinality estimation in data streams, outperforming predecessors in accuracy-memory trade-offs and enabling single-pass processing for billions of elements. Its primary weakness lies in very small sets (n < 100), where corrections like small-set tracking are needed to mitigate bias, though variants address this without altering core efficiency.
