Count-distinct problem

The count-distinct problem, also known as the distinct elements problem or F₀ estimation, is the task of approximating the number of unique elements in a large multiset or data stream, where exact computation is impractical due to constraints on memory, processing time, or the volume of data arriving sequentially. This problem arises in streaming models, where algorithms must process elements in a single pass using sublinear space relative to the input size, typically providing a (1 ± ε)-multiplicative approximation with high probability.

The problem was first formalized in the context of probabilistic counting by Philippe Flajolet and G. Nigel Martin in 1985, who introduced algorithms based on stochastic averaging of hash-function outputs to estimate cardinality with minimal storage, such as less than 100 binary words for reasonable accuracy on datasets exceeding 100 million elements. Their approach, known as the Flajolet-Martin (FM) algorithm, leverages the position of the lowest set bit in hashed values to infer the number of distinct items, forming the basis for subsequent refinements like the LogLog algorithm, which improved variance through averaged maximums over multiple registers. A major advancement came with the HyperLogLog algorithm in 2007, an optimized variant that uses a fixed-size array of registers to track leading-zero patterns in hashed representations, achieving a relative standard error of approximately 1.04 / √(2^p) for m = 2^p registers of a few bits each, making it suitable for massive-scale applications. Further engineering refinements, such as those implemented at Google in the HyperLogLog++ variant, reduced memory usage for high-cardinality estimates (e.g., over 1 billion distinct elements) while preserving single-pass parallelizability and empirical accuracy. Theoretical progress culminated in 2010 with an optimal streaming algorithm by Daniel Kane, Jelani Nelson, and David P. Woodruff, which computes a (1 ± ε)-approximation using O(ε⁻² + log n) bits of space and O(1) update and query times, matching established lower bounds and closing decades of research on space-time tradeoffs. This optimality holds in the insertion-only streaming model; streams in which elements can also be deleted (the turnstile model) require different, typically linear-sketch-based techniques.

The count-distinct problem has broad practical significance in database systems for query optimization (e.g., estimating join cardinalities), network monitoring (e.g., tracking unique IP addresses to detect scans or denial-of-service attacks), and analytics (e.g., unique user counts in web logs). These applications demand robust, low-overhead sketches that balance accuracy, space efficiency, and computational speed, influencing tools like Google's PowerDrill and BigQuery.

Fundamentals

Formal Definition

The count-distinct problem, also referred to as the distinct elements problem, seeks to determine the number of unique items in a multiset of data elements, typically encountered in streaming or large-scale data processing contexts. Formally, given a multiset S consisting of n elements drawn from a universe U, the objective is to compute the cardinality of the support, denoted as d = |\{x \in U : f_x > 0\}|, where f_x represents the frequency of element x in S. This can be expressed mathematically as d = \sum_{x \in U} \mathbf{1}_{\{f_x > 0\}}, where \mathbf{1}_{\{f_x > 0\}} is the indicator function that equals 1 if f_x > 0 and 0 otherwise. Standard notation in the literature includes d for the number of distinct elements and n for the total length of the input stream or multiset. In approximation settings, parameters such as \epsilon > 0 (relative error) and \delta > 0 (failure probability) are commonly used to quantify the quality of estimates, aiming for an output \hat{d} satisfying (1 - \epsilon)d \leq \hat{d} \leq (1 + \epsilon)d with probability at least 1 - \delta. These parameters arise naturally in probabilistic analyses of the problem. The universe U may have a known size |U| or an unknown (potentially very large) size, which influences the feasibility of solutions. When |U| is known and small relative to n (e.g., |U| \leq n), exact computation can leverage direct indexing, but for large or unknown |U|, the problem demands space-efficient methods that do not enumerate all possible elements. This distinction is critical, as unbounded or enormous universes (e.g., elements identified by long bit strings or hashed to a large bit width) preclude exhaustive storage.

Naive Exact Solution

The naive exact solution to the count-distinct problem involves maintaining a data structure to track unique elements encountered in the input stream. A hash set (or hash table) is typically used, where each new element from the stream is checked for existence; if absent, it is inserted, and the count of distinct elements d is incremented. This approach ensures precise computation of d, the cardinality of the set of unique elements, without any approximation error. The algorithm can be outlined in pseudocode as follows:
Initialize an empty hash set S
Initialize counter d = 0
For each element x in the input stream:
    if x not in S:
        insert x into S
        d = d + 1
Return d
This pseudocode assumes average-case O(1) operations for insertions and lookups via hashing, though worst-case performance may degrade due to collisions (mitigated by standard hash table resizing). In terms of resource requirements, the space complexity is O(d) in the worst case, as the hash set must store all distinct elements, potentially up to the size of the universe of possible elements. The time complexity is O(n) overall for a stream of length n, with amortized O(1) processing per element due to constant-time hash operations. For example, consider a stream of elements [a, b, a, c]: the initially empty hash set grows to contain {a} after the first element (d=1), then {a, b} after the second (d=2), remains unchanged for the third, duplicate a, and finally becomes {a, b, c} after the fourth (d=3). This is unsuitable for massive streams where d greatly exceeds available memory, as it requires storing the entire set of distinct elements explicitly.
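As a concrete illustration, the following short Python version (a minimal sketch; the function name is arbitrary) implements the same hash-set procedure:

def count_distinct_exact(stream):
    # Exact distinct count using a hash set: O(d) space, O(1) amortized time per element.
    seen = set()
    d = 0
    for x in stream:
        if x not in seen:      # average-case O(1) membership test
            seen.add(x)
            d += 1
    return d

# The example stream [a, b, a, c] from the text yields 3.
print(count_distinct_exact(["a", "b", "a", "c"]))  # 3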

Approximation Algorithms

Probabilistic Counting

The probabilistic counting algorithm of Flajolet and Martin provides an early approximation method for estimating the number of distinct elements in a data stream under severe memory limitations, where exact methods like hash tables or bit vectors become infeasible due to their linear space requirements. Introduced by Philippe Flajolet and G. Nigel Martin in 1985, this approach uses hash functions to map elements to random binary strings and tracks the position of the first 1-bit to infer cardinality with logarithmic space. It represents the first practical solution for such constrained environments, enabling estimates of large cardinalities with fixed-size structures, typically bitmaps or registers of 32 bits or less, while keeping relative errors bounded independently of the true cardinality.

In the basic algorithm, a hash function h maps each element to a uniform random binary string. For each hashed value y, compute ρ(y), the position of the least significant 1-bit (starting from 0 for the LSB). A register tracks the maximum ρ observed, or, more efficiently, a bitmap records each observation by setting the bit at position ρ. The estimator uses the position R of the lowest-index 0-bit in the bitmap (or max ρ + 1), yielding \hat{d} \approx \frac{2^R}{\phi}, where \phi \approx 0.77351 is a correction factor. This requires O(\log n) bits for the bitmap but has high variance. To reduce variance, the Probabilistic Counting with Stochastic Averaging (PCSA) variant uses m independent bitmaps, one per substream, and combines their statistics R_1, \dots, R_m as \hat{d} = \frac{m}{\phi} \cdot 2^{\frac{1}{m}\sum_{i=1}^m R_i}, with relative standard error approximately 0.78 / \sqrt{m}. For m = 64 bitmaps of 32 bits each, the sketch occupies only a few hundred bytes and achieves roughly 10% relative error. The expected value aligns with the true cardinality, as the probability that max ρ < k is (1 - 2^{-k})^d \approx e^{-d / 2^k}, so R \approx \log_2 d. Bias corrections are applied for small d, and the design compresses the representation exponentially while mimicking uniform sampling of ranks. These bounds hold with high probability under the sequential processing model, highlighting the trade-off between space efficiency and accuracy.
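The following Python sketch illustrates the PCSA idea under simplifying assumptions: a single SHA-1-derived hash supplies both the substream index and the ρ value, and the bitmaps are plain integers. It is an illustrative sketch, not the original implementation; a production version would use a faster, well-mixed hash.

import hashlib

def pcsa_estimate(stream, m=64, bits=32):
    # PCSA-style sketch: m bitmaps updated via stochastic averaging.
    phi = 0.77351
    bitmaps = [0] * m
    for x in stream:
        h = int.from_bytes(hashlib.sha1(str(x).encode()).digest()[:8], "big")
        j = h % m                              # substream index
        y = h // m                             # remaining bits
        rho = 0                                # position of least-significant 1-bit (0-based)
        while rho < bits - 1 and not (y >> rho) & 1:
            rho += 1
        bitmaps[j] |= 1 << rho
    # R_i = position of the lowest 0-bit in bitmap i
    R = []
    for b in bitmaps:
        r = 0
        while (b >> r) & 1:
            r += 1
        R.append(r)
    return (m / phi) * 2 ** (sum(R) / m)       # (m / phi) * 2^(mean R)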

HyperLogLog Algorithm

The HyperLogLog algorithm extends the LogLog approach by utilizing an array of m registers to independently estimate substream cardinalities, aggregating them via the harmonic mean to yield a more stable and accurate overall estimate of the number of distinct elements. This design leverages stochastic averaging to mitigate variance, achieving a relative standard error of approximately 1.04 / √m with m typically set to values like 2^14 for practical precision. Originally proposed in 2007, it represents a near-optimal solution for the count-distinct problem in terms of the memory-accuracy trade-off.

To process an element x, the algorithm applies a hash function h mapping x to a uniform random bit string. The first b bits of h(x) (with m = 2^b, often b = 14) select the register index j, while the remaining bits w determine ρ(w), the position of the leftmost 1-bit counted from the most significant bit (equivalently, the number of leading zeros plus one). The j-th register M_j is then updated to the maximum of its current value and ρ(w). After processing the stream, the estimator computes the harmonic-mean indicator Z = (∑_{j=1}^m 2^{-M_j})^{-1}, yielding the raw cardinality estimate \hat{d} = \alpha_m \cdot m^2 \cdot Z, where \alpha_m is a precomputed bias-correction constant (e.g., \alpha_m \approx 0.7213/(1 + 1.079/m) for finite m). Typically, 5 to 6 bits suffice per register to store values up to 32 or 64 for hash sizes handling universe sizes up to 10^9 or beyond.

Bias in the estimator is addressed through targeted corrections: for small cardinalities d < 5m/2, a linear counting fallback based on the number of empty registers is used, supplemented in later variants by empirical adjustments derived from k-nearest-neighbor interpolation on simulated datasets to counteract estimation bias. For large d approaching the hash range limit, extended hashing (e.g., 64-bit) and overflow corrections prevent saturation. These refinements, building on the original, reduce error by up to a factor of 4 in low-cardinality regimes. The overall space complexity is O(\epsilon^{-2} \log \log |U|) bits, enabling efficient sketches for massive datasets. HyperLogLog has seen widespread adoption, including in Redis as a core probabilistic data structure for set cardinality approximation and in Google Analytics through the enhanced HyperLogLog++ variant for metrics like active users. Key practical improvements, such as sparse representations for small sets and optimized bias handling, were detailed in 2013 work that halved memory needs in some scenarios while boosting accuracy.
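A minimal Python rendition of the core HyperLogLog loop and raw estimator, assuming a 64-bit SHA-1-derived hash and omitting the small- and large-range corrections described above, might look as follows (names and parameters are illustrative):

import hashlib

def hll_estimate(stream, b=14):
    # Minimal HyperLogLog sketch with m = 2^b registers; raw estimator only.
    m = 1 << b
    M = [0] * m
    for x in stream:
        h = int.from_bytes(hashlib.sha1(str(x).encode()).digest()[:8], "big")  # 64-bit hash
        j = h >> (64 - b)                        # first b bits pick the register index
        w = h & ((1 << (64 - b)) - 1)            # remaining 64 - b bits
        rho = (64 - b) - w.bit_length() + 1      # leading zeros + 1 = position of leftmost 1-bit
        M[j] = max(M[j], rho)
    alpha = 0.7213 / (1 + 1.079 / m)             # bias-correction constant alpha_m
    Z = 1.0 / sum(2.0 ** -r for r in M)          # harmonic-mean indicator
    return alpha * m * m * Z                     # raw estimate alpha_m * m^2 * Z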

Streaming Sketches

Flajolet-Martin and LogLog Sketches

The Flajolet-Martin (FM) sketch, introduced in 1985, provides an early probabilistic method for approximating the number of distinct elements in a multiset by leveraging hash functions to observe patterns in binary representations. Each element is hashed to a uniformly random binary string, and for each such hash value, the parameter ρ is computed as the position of the lowest set bit (starting from 1 for the least significant bit). Over all elements, the algorithm tracks the maximum observed ρ, denoted max ρ, which serves as the core statistic for estimation. This approach exploits the fact that for a multiset of cardinality d, the expected value of max ρ is approximately log₂(d) plus a small additive constant arising from the geometric distribution of ρ. The estimator for the distinct count in the basic FM sketch is given by \hat{d} \approx 2^{\max \rho} / \phi, where \phi \approx 0.77351 is a bias-correction constant derived from the asymptotic analysis of the maximum-ρ distribution; φ corrects for the expected-value offset of the geometric distribution of trailing zeros in the hash values. However, the single-register FM sketch exhibits high variance due to the concentration of the maximum statistic, often leading to relative errors exceeding 50% in practice. To mitigate this, the method incorporates stochastic averaging by maintaining multiple independent copies (m registers) of the sketch, each using a separate hash function, and averaging the resulting statistics; this reduces the standard error to approximately 0.78 / \sqrt{m}.

The LogLog algorithm, proposed in 2003, builds directly on the FM framework to achieve better space efficiency while preserving low relative error. It uses the leading bits of a single hash value to partition the input into k buckets (stochastic averaging), where the register for bucket i stores the maximum ρ observed among elements routed to that bucket. The estimator is then computed from the geometric mean of the individual values 2^{\max \rho_i} across the k registers, adjusted by a bias-correction constant \alpha_k (approximately 0.39701 for large k), yielding \hat{d} \approx \alpha_k \cdot k \cdot \left( \prod_{i=1}^k 2^{\max \rho_i} \right)^{1/k} = \alpha_k \cdot k \cdot 2^{\frac{1}{k}\sum_{i=1}^k \max \rho_i}. This geometric averaging smooths out variations more effectively than arithmetic averaging of the FM estimates, as it aligns with the logarithmic nature of the ρ statistics. The relative standard deviation of the LogLog estimator is approximately 1.3 / \sqrt{k}, enabling typical relative errors of around 4% with k around 1,000 registers. In terms of space complexity, both methods require O(k \log \log |U|) bits, where |U| is the universe size, since each ρ value needs O(\log \log |U|) bits to store (as ρ rarely exceeds \log_2 |U|). The primary limitation of the original FM sketch remains its high variance without multiple registers, which LogLog addresses more elegantly through bucketing, though both suffer from sensitivity to hash collisions in finite-precision implementations. LogLog laid the groundwork for subsequent refinements like HyperLogLog.
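For comparison with the HyperLogLog sketch above, a compact Python illustration of the LogLog estimator (same register updates, but a geometric-mean combination with the asymptotic constant α_k ≈ 0.39701) could read as follows; it is a sketch under the same hashing assumptions, not a reference implementation:

import hashlib

def loglog_estimate(stream, b=10):
    # LogLog sketch: k = 2^b buckets, each storing the maximum rho routed to it.
    k = 1 << b
    M = [0] * k
    for x in stream:
        h = int.from_bytes(hashlib.sha1(str(x).encode()).digest()[:8], "big")
        j = h >> (64 - b)                        # leading bits select the bucket
        w = h & ((1 << (64 - b)) - 1)
        rho = (64 - b) - w.bit_length() + 1      # position of leftmost 1-bit in remaining bits
        M[j] = max(M[j], rho)
    alpha = 0.39701                              # asymptotic bias-correction constant
    return alpha * k * 2 ** (sum(M) / k)         # alpha_k * k * geometric mean of 2^{M_i}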

Bottom-k Sketches

Bottom-k sketches provide an approximation for the count-distinct problem by maintaining a compact summary of the smallest hash values encountered in a data stream. For each element x, a hash function h(x) maps it to a uniform random value in [0, 1), and the sketch preserves the k smallest distinct hash values, typically using a bounded priority queue (a max-heap of size k whose root, the largest retained hash, is evicted when a smaller hash arrives). Duplicates are ignored to ensure only unique elements contribute, assuming a collision-resistant hash family. The cardinality estimate \hat{d} is derived from the preserved hashes, most simply from the k-th smallest hash value h_{(k)} as \hat{d} \approx k / h_{(k)}, which leverages the expected position of order statistics in a uniform sample. A refined estimator, \hat{d} = (k-1) / h_{(k)}, is unbiased with reduced error and is the form used in KMV (k minimum values) implementations. The space complexity is O(k), as the sketch stores only the k hash values and their associated elements, though heap operations incur an additional O(\log k) factor in update time. Error analysis relies on concentration inequalities for uniform order statistics, providing a relative error of O(1/\sqrt{k}) with high probability (e.g., 1 - 1/k). Variants address potential hash collisions by employing pairwise-independent or multiple hash projections, ensuring the probability of collision among distinct elements remains negligible for large universes. These sketches are employed in database systems for efficient approximate query processing on large datasets. Key advantages include their simplicity of implementation and applicability to streams with unknown or unbounded universe sizes, avoiding the need for domain-specific parameters required by some probabilistic counting methods.
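A small Python sketch of a bottom-k (KMV) summary is shown below; it uses SHA-1 as a stand-in for a uniform hash and a negated min-heap to emulate the bounded max-heap, and applies the (k-1)/h_{(k)} estimator described above. Names and parameters are illustrative.

import hashlib, heapq

def kmv_estimate(stream, k=256):
    heap = []            # negated hashes: heap[0] == -(largest retained hash)
    retained = set()     # hash values currently in the sketch, to skip duplicates
    for x in stream:
        h = int.from_bytes(hashlib.sha1(str(x).encode()).digest()[:8], "big") / 2**64
        if h in retained:
            continue                               # duplicate, ignore
        if len(heap) < k:
            heapq.heappush(heap, -h)
            retained.add(h)
        elif h < -heap[0]:                         # smaller than the largest retained hash
            evicted = -heapq.heappushpop(heap, -h) # evict the current maximum
            retained.discard(evicted)
            retained.add(h)
    if len(heap) < k:
        return float(len(heap))                    # fewer than k distinct hashes: count is exact
    return (k - 1) / (-heap[0])                    # unbiased estimate from the k-th smallest hash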

Min/Max Sketches

Min/max sketches constitute a family of streaming algorithms for approximating the number of distinct elements in a data stream by exploiting extreme order statistics from hashed values to infer cardinality via their expected distributions. The core approach uses m independent uniform hash functions h_j(x) \in [0,1] for j = 1 to m. For a min-sketch variant, the sketch maintains, for each j, the minimum value \min_j = \min \{ h_j(y) \mid y \text{ is a distinct element seen so far} \}, updating only if a smaller value is encountered. An analogous max-sketch maintains the maximum \max_j = \max \{ h_j(y) \mid y \text{ seen} \}. Duplicates do not affect the extrema since they hash identically. The estimator \hat{d} is computed by averaging individual estimates across the m hashes. For the min-sketch, each \hat{d}_j \approx 1 / \min_j, leveraging that the expected minimum over d uniform samples is 1/(d+1) \approx 1/d. For the max-sketch, \hat{d}_j \approx \max_j / (1 - \max_j), since the expected maximum is d/(d+1) \approx 1 - 1/d. The overall \hat{d} is the (optionally bias-corrected) average of the \hat{d}_j, with multiple instances or min/max combinations used to further reduce variance via maximum likelihood adjustments. These sketches achieve space complexity O(1/\epsilon^2), independent of the total number of distinct elements d or the stream length, as m scales with the desired accuracy (typically m = O(1/\epsilon^2 \log(1/\delta))). They provide an (\epsilon, \delta)-approximation guarantee, ensuring that with probability at least 1 - \delta, the output \hat{d} satisfies (1 - \epsilon)d \leq \hat{d} \leq (1 + \epsilon)d, with constant factors arising from the repetition of independent sketches to bound variance. Min/max sketches trace their origins to refinements of the seminal probabilistic counting framework introduced by Flajolet and Martin in 1985 for approximating the F_0 frequency moment, with detailed statistical analysis in later works. In practice, min/max sketches find application in streaming queries within sensor networks, where constrained memory requires efficient estimation of unique sensor readings or events over high-volume, continuous data flows.
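The following Python fragment sketches the min-sketch variant under the assumption that salting the input with the slot index approximates m independent hash functions; a practical implementation would derive the m values from a single hash to avoid the O(m) per-element cost. Function and parameter names are illustrative.

import hashlib

def min_sketch_estimate(stream, m=512):
    mins = [1.0] * m                           # running minimum hash per slot
    for x in stream:
        for j in range(m):                     # salting with j stands in for m independent hashes
            h = int.from_bytes(hashlib.sha1(f"{j}:{x}".encode()).digest()[:8], "big") / 2**64
            if h < mins[j]:
                mins[j] = h
    # E[min of d uniforms] = 1/(d+1), so 1/min_j - 1 estimates d for each slot
    per_slot = [1.0 / mn - 1.0 for mn in mins]
    return sum(per_slot) / m                   # simple average across the m slots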

Recent Developments

CVM Algorithm

The CVM algorithm, introduced by Sourav Chakraborty, N. V. Vinodchandran, and Kuldeep S. Meel in 2023, provides a straightforward sampling-based approach to estimate the number of distinct elements, denoted as d or F_0, in a data stream while using space independent of d. It maintains a set X of sampled elements and a dynamic sampling probability p to balance memory usage and estimation accuracy, making it particularly suitable for educational purposes and practical implementations due to its reliance on basic probability concepts rather than advanced hashing techniques.

The algorithm begins with initialization: set p \leftarrow 1, X \leftarrow \emptyset, and a threshold \mathrm{thresh} \leftarrow \lceil 12 / \epsilon^2 \cdot \log(8m / \delta) \rceil, where m is the stream length, \epsilon > 0 is the desired relative error, and 0 < \delta < 1 is the failure probability. For each incoming element a_i in the stream, the update proceeds as follows: first, remove a_i from X if present; then, with probability p, add a_i to X. If |X| reaches \mathrm{thresh} after the addition, perform a subsampling step by independently retaining each element in X with probability 1/2 and update p \leftarrow p/2; if |X| still equals \mathrm{thresh} after this subsampling, the run fails (output \perp, indicating a need to retry with fresh randomness); otherwise, processing continues. Upon query or stream end, the estimator is \hat{d} = |X| / p.

This design ensures space complexity of O\left(\frac{1}{\epsilon^2} \log\frac{m}{\delta} \log n\right) bits, where n is the universe size (accounting for \log n bits per element), as the threshold is independent of d and the set X never exceeds \mathrm{thresh} elements beyond transient additions. The estimator achieves a relative error guarantee: with probability at least 1 - \delta, (1 - \epsilon) d \leq \hat{d} \leq (1 + \epsilon) d. Donald Knuth's 2023 analysis confirmed the estimator's unbiasedness and derived Chernoff-style error bounds using the buffer size s (analogous to \mathrm{thresh}), showing that for s = 100, estimates are often within 1% of the true d for tested streams. Compared to streaming sketches like HyperLogLog, the CVM algorithm offers simplicity by storing actual elements for exact duplicate detection, avoiding hash functions, leading-zero counts, or complex statistics, which results in lower constant factors and easier implementation. Its probabilistic nature allows high-probability accuracy without intricate derivations, and formal verification using probabilistic programming frameworks in 2025 established its correctness under the specified model, further solidifying its reliability.
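A direct Python transcription of the procedure described above (illustrative; variable names are arbitrary, and the failure event ⊥ is signaled by returning None) is:

import math, random

def cvm_estimate(stream, m, eps=0.1, delta=0.01):
    # m is the stream length, used only to size the buffer threshold.
    thresh = math.ceil(12 / eps**2 * math.log(8 * m / delta))
    p = 1.0
    X = set()
    for a in stream:
        X.discard(a)                               # remove a if present
        if random.random() < p:
            X.add(a)                               # re-insert a with probability p
        if len(X) == thresh:                       # buffer full: subsample
            X = {y for y in X if random.random() < 0.5}
            p /= 2
            if len(X) == thresh:                   # still full after subsampling
                return None                        # failure event; retry with fresh randomness
    return len(X) / p                              # estimator |X| / p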

Complexity and Lower Bounds

The count-distinct problem, also known as estimating the number of distinct elements or the zeroth frequency moment F_0, exhibits significant theoretical limitations in distributed and streaming settings. In the multiparty communication model, where input data is randomly partitioned among k parties and the goal is to compute an \varepsilon-approximation to F_0 with constant success probability, any randomized protocol requires \Omega(k / \varepsilon^2) bits of communication in the message-passing model. This lower bound arises from reductions to set disjointness problems and information-theoretic arguments, demonstrating that even basic coordination among parties incurs substantial costs for accurate estimation.

In the turnstile streaming model, where updates can increment or decrement frequencies of elements from a universe of size n, space lower bounds further underscore the challenge of approximation. Any one-pass algorithm achieving a (1+\varepsilon)-approximation to F_0 with constant success probability demands \Omega(1/\varepsilon^2 + \log n) bits of space; even when \varepsilon is a constant, the \Omega(\log n) term persists. This bound, which combines a dependence on precision with a logarithmic term reflecting the universe size, is derived via communication-complexity reductions and direct constructions showing that smaller space cannot suffice for reliable estimation across adversarial inputs.

Regarding time-space tradeoffs, exact computation of F_0 in the turnstile model necessitates \Theta(n) space to track all non-zero frequencies, with total processing time of O(n \,\mathrm{polylog}\, n) using dynamic hash tables to handle insertions and deletions efficiently. In contrast, approximation algorithms enable O(1) amortized time per update while using sublinear space, highlighting the practical necessity of randomized sketching techniques to bypass exact computation's overhead. Recent work, such as the 2023 CVM (Chakraborty-Vinodchandran-Meel) algorithm, shows that a simple sampling procedure based on coin flips across rounds estimates F_0 with relative error \varepsilon using space O\left(\frac{1}{\varepsilon^2} \log\frac{m}{\delta} \log n\right), within logarithmic factors of the optimal bound while greatly simplifying analysis and implementation. These bounds imply that no sublinear-space algorithm can solve the count-distinct problem exactly in general, even allowing randomization and unrestricted processing time, rendering sketching and approximation indispensable for large-scale data processing where memory is constrained.

Weighted Variant

Problem Definition

The weighted count-distinct problem generalizes the unweighted count-distinct problem by associating positive weights with elements in a data stream and seeking to estimate the total weight contributed by the distinct elements that appear at least once. Given a stream of elements x_i \in U (where U is the universe of possible elements) arriving over time, each paired with a weight w_i > 0, the goal is to compute an estimate of \sum_{x: f_x > 0} w(x), where f_x denotes the frequency of element x in the stream (i.e., the number of times x appears) and w(x) is a weight function defined on x. The weight function w(x) can vary depending on the specific formulation, leading to different variants of the problem. In the element-weighted variant, each element x has a fixed, predefined weight w(x), often assumed to be bounded in [0,1] or arbitrary positive values. In the occurrence-weighted variant, the stream consists of pairs (x_i, w_i), and w(x) is computed based on the weights of occurrences of x, such as the maximum w_i among occurrences of x or the weight of the first occurrence. Formally, let d_w = \sum_{\text{distinct } x} w(x). The objective is to produce an (\varepsilon, \delta)-approximation \hat{d_w}, meaning \Pr[|\hat{d_w} - d_w| \leq \varepsilon d_w] \geq 1 - \delta, typically for small constants \varepsilon, \delta > 0. This can be expressed as d_w = \sum_{x \in U} w(x) \cdot \mathbf{1}_{\{f_x > 0\}}, where \mathbf{1}_{\{f_x > 0\}} is the indicator function that is 1 if f_x > 0 and 0 otherwise. The unweighted count-distinct problem corresponds to the special case where w(x) = 1 for all x \in U.
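For a small worked example, the exact weighted count d_w in the occurrence-weighted variant (here taking w(x) as the maximum weight over the occurrences of x) can be computed directly; this Python snippet is illustrative only, and the `combine` parameter is an assumption used to switch between variants:

def exact_weighted_count(pairs, combine=max):
    # Exact d_w for a stream of (element, weight) pairs; `combine` picks how
    # per-occurrence weights define w(x) (here: the maximum over occurrences).
    w = {}
    for x, wx in pairs:
        w[x] = combine(w[x], wx) if x in w else wx
    return sum(w.values())

# Stream (a,2), (b,5), (a,3): distinct elements a and b with w(a)=3 and w(b)=5, so d_w = 8.
print(exact_weighted_count([("a", 2), ("b", 5), ("a", 3)]))  # 8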

Estimation Methods

The weighted count-distinct problem adapts several unweighted estimation techniques, such as those discussed in the HyperLogLog and bottom-k sections, by incorporating item weights into the sketching process to approximate the weighted cardinality d_w = \sum_{i=1}^d w_i, where w_i > 0 are the weights of the d distinct elements. One prominent approach is the weighted HyperLogLog, which modifies the standard HyperLogLog registers to track a statistic that depends on the item's weight rather than the unweighted leading-zero count of the hash value. Specifically, for each item with hash value h and weight w, the hash is adjusted using a transformation such as h^{1/w} to simulate a rank distribution that reflects the weight's magnitude, and each register records the maximum such adjusted statistic per substream. The estimator is then updated using a weighted harmonic-mean adjustment to the sum of 2^{-\max \rho}, yielding an approximation of d_w with improved handling of varying weights. Another adaptation is the weighted bottom-k sketch, which preserves the k smallest weighted ranks by assigning each element a rank drawn from an exponential distribution parameterized by its weight w_i (e.g., with rate w_i), ensuring heavier items are more likely to be included in the sketch, as they have a higher probability of drawing small ranks. Estimation proceeds via weighted quantiles of these preserved ranks, conditioning on the (k+1)-th smallest rank to derive unbiased estimators for the total weighted distinct count, often with variance reduction achieved by mimicking multiple independent sketches. These methods achieve space complexity of O(1/\epsilon^2 \log(1/\epsilon)) bits, comparable to their unweighted counterparts, but incorporate transformations or clipping to manage variance, where \epsilon is the desired relative error. Error analysis guarantees a relative error of \epsilon for d_w with high probability, addressing heavy-tailed weight distributions through techniques like Beta transformations or clipping to prevent the dominance of a few large weights.

Other methods include sampling-based approaches like weighted reservoir sampling, which maintains a fixed-size sample in which inclusion probabilities are proportional to weights, suitable for estimating d_w in insertion-only streams by extrapolating from the weighted sample. In the turnstile model allowing updates and deletions, linear sketches can be extended to weighted distinct estimation, though they often require more space due to frequency tracking. These estimation techniques find applications in e-commerce analytics for approximating weighted unique purchases (e.g., total value of distinct customer transactions) and in sensor networks for weighted distinctness (e.g., aggregating unique signals weighted by intensity).
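As a hedged illustration of the exponential-rank idea, the following Python sketch uses a k-mins-style estimator, a close relative of the weighted bottom-k sketches described above rather than a specific published algorithm: each of k salted hashes assigns element x the rank -ln(h_j(x))/w(x), each per-hash minimum rank is exponentially distributed with rate d_w, and (k-1) divided by the sum of the minima is an unbiased estimate of d_w. It assumes the element-weighted variant, so duplicate occurrences carry the same fixed weight and reproduce the same ranks; names and parameters are illustrative.

import hashlib, math

def weighted_kmins_estimate(pairs, k=64):
    mins = [float("inf")] * k
    for x, w in pairs:                         # element-weighted variant: w fixed per element
        for j in range(k):                     # salted hashes stand in for k independent hashes
            u = (int.from_bytes(hashlib.sha1(f"{j}:{x}".encode()).digest()[:8], "big") + 1) / (2**64 + 1)
            r = -math.log(u) / w               # exponential rank with rate w(x)
            if r < mins[j]:
                mins[j] = r
    total = sum(mins)
    # each per-hash minimum is Exp(d_w), so (k-1)/sum of minima is unbiased for d_w
    return (k - 1) / total if math.isfinite(total) and total > 0 else 0.0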
