
Hash join

A hash join is a fundamental join algorithm in database management systems (DBMS) that combines two input relations, R and S, based on an equality condition on specified join attributes, by constructing an in-memory hash table from one relation and probing it with tuples from the other to identify matches. The algorithm operates in two primary phases: the build phase, where the smaller relation (typically R) is scanned to populate a hash table using a hash function on the join keys, and the probe phase, where the larger relation (S) is scanned, with each tuple hashed to locate potential matches in the table, followed by verification to handle collisions. This approach minimizes the number of comparisons required, making it particularly efficient for equi-joins on large datasets where the build relation fits in memory.

For cases where relations exceed available memory, variants like the Grace hash join extend the algorithm with an initial partitioning phase, where both relations are hashed into multiple buckets using a partitioning hash function, potentially recursively until buckets fit in memory, followed by independent joins on matching buckets to reduce I/O costs to approximately 3(M + N) page accesses, where M and N are the sizes of R and S in pages. The hybrid hash join further optimizes this by retaining frequently accessed (or "hot") partitions in memory during partitioning, avoiding disk spills for skewed data distributions and blending elements of both in-memory and partitioned strategies. These variants can also leverage techniques such as Bloom filters during probing to filter out non-matching tuples early, reducing unnecessary I/O.

Hash joins are widely used in modern DBMS for online analytical processing (OLAP) workloads due to their superior performance over alternatives like nested-loop joins (which scale poorly with O(m × n) comparisons) or sort-merge joins (which incur sorting overhead), especially on unsorted data with sufficient memory availability. However, they are memory-intensive, requiring space for the hash table proportional to the build relation's size, and are limited to equality conditions, with potential performance degradation from hash collisions or data skew. Optimizations in multi-core environments focus on parallel partitioning and probing to exploit hardware parallelism while minimizing synchronization overhead.

Fundamentals

Definition and Motivation

The hash join is an algorithm for performing equi-joins between two relations in a database management system, where tuples from R and S are matched based on equality conditions such as R.A = S.B. It applies a hash function to the join attributes of one relation (typically the smaller, called the build relation) to construct an in-memory hash table with buckets containing tuples grouped by their hash values. Tuples from the other relation (the probe relation) are then hashed and probed against this table to identify matching tuples efficiently, with collisions resolved within the same bucket. This hashing reduces the search space dramatically compared to brute-force methods, allowing rapid identification of join pairs through in-memory lookups.

The motivation for hash joins arises from the limitations of alternative join strategies, particularly nested-loop joins, which incur a time complexity of O(n × m) for relations of sizes n and m, rendering them inefficient for large-scale data. In contrast, hash joins achieve a linear time complexity of O(n + m) on average by scanning each relation once and leveraging hashing to limit comparisons to relevant subsets of tuples. Additionally, variants of hash joins excel in memory-limited settings by employing partitioning to process data in smaller, memory-resident units, thereby reducing disk I/O overhead and supporting scalability in environments where entire relations cannot fit in main memory. The classic hash join assumes sufficient memory for the build relation.

Hash joins trace their origins to the early 1980s amid research on optimizing relational query processing, with key developments including Kjell Bratbergsengen's hash-based relational algebra methods from the ASTRA project, first documented in 1980 and fully published in 1984. Concurrently, David DeWitt and colleagues advanced hash join techniques for multiprocessor systems in 1984-1985, influencing parallel database implementations. These foundational works enabled the integration of hash joins into major commercial systems by the mid-1990s.

Comparison to Other Join Methods

Hash joins are one of several algorithms used in database management systems (DBMS) to combine data from multiple tables based on join conditions. The nested-loop join is the simplest approach, iterating through each row of one table (the outer) and scanning the other table (the inner) for matches on the join predicate, resulting in a worst-case cost of O(n × m), where n and m are the sizes of the two relations. This method performs well for small datasets or when indexes are available on the inner table but scales poorly for large, unindexed tables due to its quadratic behavior. In contrast, the sort-merge join first sorts both input relations on the join keys, which takes O(n log n + m log m) time, followed by a linear merge pass in O(n + m) to produce matching pairs. It is particularly suitable for non-equi-join conditions, such as inequalities, and handles sorted data efficiently without requiring indexes, though the upfront sorting cost can be prohibitive for unsorted inputs.

Hash joins excel in equi-join scenarios—where the join condition is equality on specific attributes—achieving an average-case cost of O(n + m) under uniform distribution, making them asymptotically optimal for large equi-joins. Their primary strength lies in constant-time lookups via in-memory hash tables, avoiding the repeated scans of nested loops and the sorting overhead of sort-merge joins, particularly when sufficient memory is available to build the hash table. However, hash joins falter with skewed distributions, where collisions lead to uneven bucket sizes and potential degradation, and they are generally unsuitable for non-equi conditions due to their reliance on exact matches.

In modern DBMS such as PostgreSQL and SQL Server, hash joins are the preferred strategy for medium-to-large equi-joins on unindexed or unsorted tables, as they leverage available memory to minimize I/O and enable parallelism. They are commonly selected by query optimizers when the build phase fits in memory and the join keys exhibit good selectivity, outperforming alternatives in data warehousing workloads. Empirical studies on benchmarks like TPC-H demonstrate hash joins outperforming sort-merge and nested-loop joins by factors of 2-10x in equi-join dominated queries, especially at scale factors where memory constraints are manageable and data skew is moderate. For instance, in TPC-H query processing, hash-based implementations reduce execution time significantly compared to sort-merge variants by avoiding sorting overhead, though sort-merge may edge out hash joins in scenarios with pre-sorted inputs or non-equi joins.
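
To make the asymptotic differences concrete, consider an illustrative case (numbers chosen for arithmetic convenience, not drawn from a benchmark): joining two unindexed relations of 10⁶ tuples each, a nested-loop join performs on the order of 10⁶ × 10⁶ = 10¹² comparisons; a sort-merge join spends roughly 2 × 10⁶ × log₂(10⁶) ≈ 4 × 10⁷ comparison steps sorting before its linear merge; and a hash join performs about 2 × 10⁶ hash-and-lookup operations in total—one per tuple—assuming uniform keys and an in-memory build side.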

Core Mechanics

Build and Probe Phases

The hash join algorithm consists of two primary phases: the build phase, in which a hash table is constructed from one input relation, and the probe phase, in which the second relation is used to query the hash table for matches. This two-phase structure forms the foundational workflow for all hash join variants, enabling efficient equi-joins on large datasets when sufficient memory is available.

In the build phase, the smaller of the two relations (R and S) is conventionally selected as the build input to minimize memory consumption and maximize the likelihood that the hash table fits entirely in main memory. For each tuple in the build relation, a hash function is applied to the join key (or composite join keys) to compute a bucket index, and the tuple is inserted into the corresponding bucket of the in-memory hash table. If the entire build relation fits in memory, the process avoids disk I/O; otherwise, the algorithm may require extensions beyond the basic phases, though the core insertion logic remains the same. The hash table is typically sized to achieve a load factor below 1 to balance space and lookup performance.

Collisions in the hash table—arising when multiple tuples hash to the same bucket—are resolved using methods such as chaining, where the tuples in a bucket are stored in a linked list, or open addressing, where an alternative slot is probed via techniques like linear probing or double hashing. Chaining is particularly common in database implementations due to its simplicity and robustness against clustering, though it introduces pointer overhead. The algorithm assumes a high-quality hash function that distributes tuples uniformly across buckets, minimizing skew in which some buckets become disproportionately large and degrade performance through excessive chain lengths or probe sequences. This uniformity ensures average-case constant-time lookups and is a key precondition for the efficiency of the classic hash join, the simplest non-partitioned realization of these phases.

The probe phase processes the larger relation (the probe input) by iterating through its tuples, computing the hash of each join key, and looking up the corresponding bucket in the hash table. For each potential match in the bucket (e.g., by scanning the chain or probing slots), an equality check on the join keys confirms a join; matching tuples are then concatenated and output. Non-matching probe tuples are discarded for inner joins. This phase leverages the precomputed hash table for rapid access, typically achieving O(1) average time per probe under uniform hashing.

The following sketch illustrates the build and probe phases for an in-memory hash join, rendered here as runnable Python with chained buckets (a dictionary of lists):
from collections import defaultdict

def hash_join(build_rel, probe_rel, build_key, probe_key):
    # Build phase: hash each build tuple's join key into a bucket (chaining).
    hash_table = defaultdict(list)
    for t in build_rel:
        hash_table[hash(build_key(t))].append(t)
    # Probe phase: locate the candidate bucket, then verify key equality to
    # filter out hash collisions before emitting the joined pair.
    for s in probe_rel:
        for t in hash_table.get(hash(probe_key(s)), ()):
            if build_key(t) == probe_key(s):
                yield (t, s)
This sketch captures the essential steps; with open addressing instead of chaining, the insertion and bucket traversal would be adapted to probe alternative slots.
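
For example, list(hash_join(R, S, lambda t: t[0], lambda s: s[0])) would join two lists of tuples on their first fields; passing key functions rather than attribute names keeps the sketch agnostic to tuple layout.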

Classic Hash Join

The classic hash join is a fundamental join algorithm in database management systems that operates entirely in main memory, assuming the smaller relation fits completely within available memory to avoid disk I/O during execution. This method leverages hashing to enable efficient equi-joins by constructing an in-memory hash table from one relation and probing it with tuples from the other, making it suitable for scenarios where memory constraints are not prohibitive and relations are not excessively large.

The algorithm proceeds in two primary phases without any partitioning: first, the build phase constructs a hash table using the smaller relation (denoted as R) on the join attribute(s); second, the probe phase scans the larger relation (S) and uses each tuple's join attribute to probe the hash table for matches. In the build phase, each tuple from R is hashed based on the join key, and the tuple (or a pointer to it) is inserted into the corresponding bucket of the hash table, which may handle collisions via chaining or open addressing. During the probe phase, for each tuple in S, the system computes the hash value of its join attribute, locates the matching bucket, and compares the tuple against those in the bucket to identify exact matches; matching pairs are then output, while non-matches are discarded. This process assumes a good hash function that distributes keys uniformly to minimize collisions and bucket overflows.

To illustrate, consider a simple equi-join on attribute A between relations R (3 tuples) and S (5 tuples), where R is the build relation.

Relation R:
  ID | A  | B
  1  | 10 | x
  2  | 20 | y
  3  | 30 | z
Relation S:
  ID | A  | C
  1  | 10 | p
  2  | 15 | q
  3  | 20 | r
  4  | 25 | s
  5  | 30 | t
In the build phase, a hash table is constructed from R using attribute A as the key (assuming a simple hash function h(A) = A mod 7, with chaining for collisions):
  • h(10) = 3 → Bucket 3: (10, x)
  • h(20) = 6 → Bucket 6: (20, y)
  • h(30) = 2 → Bucket 2: (30, z)
In the probe phase, S is scanned sequentially:
  • For S tuple 1 (A=10): h(10)=3 → match with (10, x) → output (10, x, p)
  • For S tuple 2 (A=15): h(15)=1 → no match
  • For S tuple 3 (A=20): h(20)=6 → match with (20, y) → output (20, y, r)
  • For S tuple 4 (A=25): h(25)=4 → no match
  • For S tuple 5 (A=30): h(30)=2 → match with (30, z) → output (30, z, t)
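This walk-through can be reproduced directly in Python; the standalone sketch below mirrors the tables above and specializes the hash function to h(A) = A mod 7 (the tuple layouts are just the example's columns, and the printed output is noted in a comment):

# Worked example: build on R, probe with S, using h(A) = A mod 7.
R = [(1, 10, "x"), (2, 20, "y"), (3, 30, "z")]                              # (ID, A, B)
S = [(1, 10, "p"), (2, 15, "q"), (3, 20, "r"), (4, 25, "s"), (5, 30, "t")]  # (ID, A, C)

buckets = {}
for _id, a, b in R:                        # build phase: bucket index is A mod 7
    buckets.setdefault(a % 7, []).append((a, b))

for _id, a, c in S:                        # probe phase
    for a_r, b in buckets.get(a % 7, ()):
        if a_r == a:                       # equality check confirms the match
            print((a_r, b, c))             # prints (10,'x','p'), (20,'y','r'), (30,'z','t')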
The resulting join contains three tuples.

A key limitation of the classic hash join is its reliance on the entire build relation fitting in memory; if R exceeds available memory, the algorithm fails or degrades severely due to excessive paging in virtual-memory systems, leading to high I/O costs and poor performance. It also offers no mechanism for handling spills to disk, making it unsuitable for large-scale data beyond early prototypes. The classic hash join was employed in initial relational DBMS prototypes during the 1980s, particularly in research systems exploring efficient join strategies for main-memory operations.

Partitioned Approaches

Grace Hash Join

The Grace hash join is a variant of the hash join algorithm designed to handle relations larger than available main memory by recursively partitioning the input relations into smaller buckets that fit in memory, thereby minimizing disk I/O operations. This approach separates the process into distinct partitioning and joining phases, enabling efficient processing on disk-based storage systems where random access is feasible.

In the partitioning phase, both input relations are hashed on the join attribute using the same hash function to produce k buckets, where k is selected such that each resulting partition is small enough to fit into available memory during subsequent joining. The relations are read from disk, partitioned in a single pass, and the resulting buckets are written to disk, ensuring that corresponding partitions from each relation can be joined independently without cross-partition dependencies. The choice of k balances the number of partitions against I/O overhead, typically aiming for partitions that are approximately the size of memory minus space for output buffers.

Following partitioning, the recursive joining phase processes corresponding pairs of partitions from the two relations. For each pair, if the combined size fits in memory, a classic hash join is applied directly; otherwise, the process recurses by further partitioning the pair into sub-buckets and repeating the algorithm until the base case is reached. This ensures scalability to arbitrarily large datasets, assuming sufficient disk space and random-access capability for efficient reads and writes. The algorithm assumes storage supporting random access, which allows partitions to be read and written without sequential constraints, optimizing for multi-disk environments. The value of k is chosen to minimize total I/O, often set to the ratio of relation size to memory capacity, ensuring balanced load across recursion levels.

The name "Grace hash join" originates from the GRACE database machine project in the 1980s, an initiative at the University of Tokyo that developed hash-based relational processing techniques, though the method has since become a standard reference independent of the hardware. Compared to the classic hash join, which requires the build relation to fit in memory, the Grace hash join accommodates datasets of arbitrary size by distributing processing across multiple passes, reducing the effective I/O to approximately three passes over each input relation (two reads and one write), as the partitioning phase involves reading each relation once and writing the partitions to disk, while the joining phase involves reading the partition pairs. This results in significantly lower I/O costs for large-scale joins, making it suitable for traditional database systems with limited main memory.
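
A compact sketch of the two phases follows, with in-memory lists standing in for the on-disk partitions that a real system would write; the key functions and the in-memory hash_join helper from the earlier section are assumptions of the sketch, not a specific system's API:

from collections import defaultdict

def grace_hash_join(R, S, key_r, key_s, k):
    # Partitioning phase: route each tuple to one of k buckets using the same
    # hash function for both relations, so matches land in aligned partitions.
    r_parts, s_parts = defaultdict(list), defaultdict(list)
    for t in R:
        r_parts[hash(key_r(t)) % k].append(t)
    for s in S:
        s_parts[hash(key_s(s)) % k].append(s)
    # Joining phase: a classic in-memory hash join per aligned partition pair.
    # (If a pair still exceeds memory, Grace recurses with further partitioning.)
    for i in range(k):
        yield from hash_join(r_parts[i], s_parts[i], key_r, key_s)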

Hybrid Hash Join

The hybrid hash join algorithm optimizes hash-based equi-joins by selectively retaining a partition in memory during the initial partitioning phase, thereby minimizing disk I/O compared to fully disk-based approaches. Introduced to leverage increasing main memory sizes in database systems, it partitions both input relations using a hash function on the join attribute while estimating the distribution of join keys—typically assuming uniformity or using available statistics—to determine partition sizes.

In execution, the algorithm first hashes the build relation (R) into B+1 partitions, retaining one partition (R₀) in memory (sized to approximately M - B blocks, where M is the available memory in blocks and B is the number of buffer pages for partitioning). The probe relation (S) is similarly partitioned into corresponding buckets, with S₀ joined immediately against the in-memory hash table for R₀. The remaining partitions (Rᵢ and Sᵢ for i > 0) are spilled to disk and later joined using a Grace hash join as a fallback. Bucket sizing is determined by selecting the number of partitions such that one or two buckets can fit comfortably in memory, often starting with an estimate based on cardinalities and memory constraints; dynamic adjustments occur if skew causes overflows, such as by repartitioning oversized buckets or using sampling to refine estimates. This approach handles data skew effectively by isolating large buckets for disk processing while joining smaller ones on-the-fly, reducing overall latency on skewed datasets.

The primary advantages include fewer I/O operations—typically about two passes over the data versus three for the Grace hash join—due to immediate in-memory joining of retained partitions, along with lower CPU overhead from reduced record comparisons. It is widely adopted in modern database optimizers, serving as the default equi-join operator in systems such as Apache AsterixDB.

For example, consider relations R (1 million tuples) and S (2 million tuples) with skewed join keys where 80% of tuples share one key value, available memory M = 100 blocks, and B = 10 buffers. The algorithm partitions into 11 buckets, retaining R₀ and S₀ (the non-skewed portions, roughly 20% of tuples) in memory for immediate joining, while spilling the large skewed buckets to disk for subsequent Grace-style processing; this avoids unnecessary disk writes and reads for the retained partitions, potentially halving I/O compared to uniform spilling.
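
A minimal sketch of the control flow, under the same conventions as the Grace sketch (lists model disk partitions, partition 0 is the memory-resident one, and hash_join is the earlier in-memory helper):

def hybrid_hash_join(R, S, key_r, key_s, num_parts):
    # Partition R; partition 0 stays memory-resident as a ready hash table.
    r_parts = [[] for _ in range(num_parts)]
    for t in R:
        r_parts[hash(key_r(t)) % num_parts].append(t)
    in_memory = {}                                    # hash table for R0
    for t in r_parts[0]:
        in_memory.setdefault(key_r(t), []).append(t)
    # Partition S; probe S0 immediately, spill the rest (lists model disk).
    s_parts = [[] for _ in range(num_parts)]
    for s in S:
        i = hash(key_s(s)) % num_parts
        if i == 0:
            for t in in_memory.get(key_s(s), ()):     # immediate probe, no spill
                yield (t, s)
        else:
            s_parts[i].append(s)
    # Remaining partition pairs are joined afterwards, Grace-style.
    for i in range(1, num_parts):
        yield from hash_join(r_parts[i], s_parts[i], key_r, key_s)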

Specialized Hash-Based Joins

Hash Semi-Join

A hash semi-join is a variant of the semi-join operation in relational databases that returns only the tuples from one input relation (typically the left relation, R) for which there exists at least one matching tuple in the other input relation (S) under the specified join condition, while retaining only the attributes from R and suppressing any duplicates from the matching side. This operation is formally defined as R ⋉ S = π_R(R ⋈ S), where π_R projects onto R's attributes after performing the inner join, ensuring no attributes from S appear in the output.

The algorithm adapts the core hash join mechanism by building a hash table on the smaller relation (S) during the build phase and then probing with tuples from the larger relation (R) in the probe phase, outputting a probe tuple from R immediately upon finding the first match in the hash table without including any attributes from S. This adaptation leverages the same partitioning and hashing principles as classic hash joins but optimizes for existence checks rather than producing full cross-products.

In a left semi-join (R ⋉ S), the output consists of tuples from the left relation R that have matches in S; a right semi-join (S ⋉ R) is symmetric but less commonly implemented due to typical query patterns favoring left-side outputs. The operation is directional, meaning R ⋉ S generally differs from S ⋉ R unless the relations and join condition are symmetric.

Hash semi-joins are particularly useful in query optimization scenarios, such as rewriting EXISTS subqueries or applying filters to reduce intermediate result sizes in complex queries, thereby minimizing data transmission in distributed systems and avoiding the overhead of full joins. Implementation details include early stopping during the probe phase, where processing for a given probe tuple halts upon the first match to confirm existence without scanning the entire hash bucket, and the use of bitmap or bit-vector indexes in some database management systems to accelerate filtering by marking potential matches prior to probing.
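
As a minimal sketch (mirroring the earlier hash_join conventions): because only existence of a match matters and no attributes from S are output, a set of S's key values suffices in place of a full bucketed hash table, and the probe stops at the first hit by construction:

def hash_semi_join(R, S, key_r, key_s):
    # Build on S: only key values are needed for an existence check.
    keys_in_s = {key_s(s) for s in S}
    for t in R:                       # probe with R
        if key_r(t) in keys_in_s:     # first match confirms existence
            yield t                   # output R's attributes only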

Hash Anti-Join

A hash anti-join is a specialized variant of the hash join algorithm used in relational database systems to compute the anti-join operation, which returns tuples from one input relation that lack matching tuples in the other relation according to the specified join condition. This operation is particularly useful for identifying non-matching elements, forming the relational algebra equivalent of set difference under the join predicate. In a left anti-join, tuples from the left (outer) relation are output only if no corresponding tuples exist in the right (inner) relation; the right anti-join reverses this, outputting unmatched tuples from the right relation.

The algorithm typically selects the smaller relation as the build input to optimize memory usage. During the build phase, a hash table is constructed by hashing tuples from the inner relation on the join attributes, often employing hybrid hashing to handle spills to disk if memory is limited. In the probe phase, each tuple from the outer relation is hashed to locate the relevant bucket in the table; the entire bucket must then be scanned to verify the absence of any matching tuple, with non-matching outer tuples passed to the output. This assumes equi-join predicates and relies on full bucket scans for correctness, distinguishing it from inner joins where matches trigger output.

The technique is sensitive to hash collisions, which populate buckets with multiple tuples and necessitate exhaustive scans to confirm non-matches, thereby increasing computational overhead in skewed distributions. In exact implementations, these scans ensure no false negatives, but the cost rises with collision frequency; approximate variants may leverage probabilistic structures to mitigate this, though at the risk of erroneously excluding valid non-matches.

Hash anti-joins find application in evaluating NOT EXISTS subqueries and NOT IN predicates, where confirming the absence of related records is essential, such as in referential integrity validation. They support data cleaning by detecting discrepancies like missing references or duplicate exclusions in ETL tasks. For efficiency in distributed environments, they can be approximated using Bloom filters to pre-filter probes, complementing hash semi-joins that handle existence checks and reducing data transfer while accepting controlled error rates.
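
A sketch of the left anti-join under the same conventions as the earlier examples—relative to the semi-join sketch, only the match test is inverted:

def hash_anti_join(R, S, key_r, key_s):
    # Build on the inner relation S; for an equi-join predicate, a set of
    # key values is enough to decide that no match exists.
    keys_in_s = {key_s(s) for s in S}
    for t in R:                           # probe with the outer relation R
        if key_r(t) not in keys_in_s:     # emit only when no match exists
            yield t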

Performance Considerations

Complexity Analysis

The classic hash join achieves an average-case time complexity of O(n + m), where n and m denote the sizes of the two input relations, under the assumption that the smaller relation fits entirely in main memory and the hash function provides uniform distribution without significant collisions. This bound arises from the build phase scanning and hashing the smaller relation once (O(min(n, m))) and the probe phase scanning the larger relation once while performing constant-time lookups (O(max(n, m))). The space complexity for the in-memory hash table during the build phase is O(min(n, m)), limited to the hashed tuples plus overhead for buckets and chains.

In the average case, the total processing time can be formalized as T = n + m + h, where h represents the number of probes into the hash table, typically equaling the size of the probe relation (h = m), with successful or unsuccessful matches resolved in constant time. However, the worst case degrades to O(n · m) due to data skew, where uneven hashing causes long chains or clustered buckets, effectively reverting to a nested-loop comparison within buckets.

For the Grace hash join, which addresses inputs exceeding available memory through partitioning, the average-case complexity extends to O((n + m) log k), where k is the number of partitions determined by the partitioning fan-out, accounting for the logarithmic depth of recursive partitioning and sub-join phases. The memory requirement per partition reduces to O(√n) in Grace and hybrid variants, achieved by selecting k ≈ √(n / M) (with M as memory size) to balance partition sizes against memory constraints in a two-level partitioning process.

Performance is further influenced by the quality of the hash function, which must ensure balanced distribution to avoid skew-induced chaining or bucket overflows, and by I/O costs in the external memory model, where operations are analyzed in terms of block size B (tuples per block). Specifically, the I/O complexity for the Grace hash join is O(((n + m)/B) · log_{M/B}((n + m)/B)), reflecting multiple passes over partitioned blocks with fan-out M/B. These asymptotic bounds, derived from foundational analyses, underscore the scalability of hash joins for large relations while highlighting sensitivity to skew and memory availability.
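
As a worked illustration of the page-level estimate cited in the overview (figures chosen for round arithmetic): joining relations of M = 1,000 and N = 10,000 pages when neither fits in memory, the Grace hash join performs roughly 3(M + N) = 3 × 11,000 = 33,000 page I/Os—each relation is read and written once during partitioning and read once more during joining—whereas a block nested-loop join holding one page of R at a time would rescan all 10,000 pages of S for each page of R, on the order of 10⁷ page reads.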

Optimizations and Limitations

To mitigate data skew in hash joins, where uneven distribution of join key values can lead to "hot" buckets that overload specific partitions and degrade performance, universal hashing is employed as an optimization. Universal hashing, which selects hash functions randomly from a family that guarantees a low collision probability in expectation, reduces the probability of worst-case collisions and skew even with adversarial inputs.

Partitioning strategies in hash joins are tuned based on join selectivity—the expected fraction of matching tuples—to optimize performance. For high-selectivity joins (producing many output tuples), hash joins outperform alternatives like merge joins by avoiding costly sorting, as the build phase can efficiently handle larger inner relations when selectivity indicates a dense match set. In partitioned variants like the Grace hash join, selectivity estimates guide the choice of partitioning fan-out to balance disk I/O and memory usage across partitions.

In distributed systems such as Hadoop's MapReduce framework, parallel hash joins distribute the build and probe phases across nodes via hash-based partitioning of inputs during the map stage, enabling scalable processing of large data sets. This approach leverages reducer parallelism for the probe phase, with optimizations like skew-aware repartitioning to prevent stragglers from uneven loads.

A primary limitation of hash joins is vulnerability to data skew, which causes disproportionate tuple assignments to buckets, resulting in load imbalance, increased I/O, and up to several-fold slowdowns in parallel environments. Systems address this through techniques like dynamic bucket spreading during partitioning, but severe skew can still necessitate fallbacks to alternative algorithms.

Hash joins are inherently designed for equi-joins (equality conditions) and perform poorly on non-equi joins (e.g., range or inequality predicates), often requiring fallback to sort-merge joins, which incur sorting overhead but handle arbitrary conditions. This limitation stems from the inability to efficiently probe hash tables with non-equality keys, leading to full scans or degraded selectivity.

Memory overflow occurs when the build relation exceeds available memory, prompting spills to disk that introduce significant I/O penalties and can multiply execution time by factors of 10 or more. Database management systems (DBMS) handle this via hybrid hashing or adaptive algorithms that dynamically adjust partition sizes and spill partial buckets, though estimation errors in join cardinalities exacerbate the issue. Modern extensions include adaptive hashing in query optimizers, such as PostgreSQL's dynamic spilling mechanism, where hash tables may allocate up to work_mem × hash_mem_multiplier before spilling batches to temporary disk files, allowing graceful degradation for oversized builds without query failure. GPU acceleration targets the compute-intensive build phase by parallelizing hash table construction across thousands of threads, achieving 5-10× speedups over CPU baselines for large relations, though data transfer overhead limits gains for small inputs.

Hash joins are preferred over index nested-loop joins when no suitable indexes exist on the inner relation or when both inputs are large and unsorted, as the former avoids index maintenance costs and scales linearly with input size. Conversely, index nested-loop joins are chosen for small outer relations with indexed inner tables, minimizing I/O through direct seeks rather than building in-memory structures.
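
A minimal sketch of one classic universal family for integer keys, h(x) = ((a·x + b) mod p) mod k with a and b drawn at random and p a prime larger than any key (the function and parameter names are illustrative, not a particular system's API):

import random

def make_universal_hash(num_buckets, p=2_147_483_647):  # p: a Mersenne prime
    # Draw one function h(x) = ((a*x + b) mod p) mod num_buckets at random.
    # Over the random choice of (a, b), any two distinct keys below p collide
    # with probability about 1/num_buckets, whatever the input distribution.
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return lambda x: ((a * x + b) % p) % num_buckets

A partitioned join can draw a fresh function for each level of recursive partitioning, so a key set that happens to cluster badly at one level is unlikely to remain pathological at the next.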
