
Samplesort

Samplesort is a divide-and-conquer sorting algorithm that generalizes minimal storage tree sorting (quicksort) by employing random sampling from the input to select an efficient estimate of the median, which serves as a pivot for partitioning the data into smaller subarrays that are recursively sorted. Introduced by W. D. Frazer and A. C. McKellar in their 1970 paper "Samplesort: A Sampling Approach to Minimal Storage Tree Sorting," the algorithm operates by first drawing a random sample of size s from the input of n elements, sorting this sample, and using its median as the partitioning element to split the remaining elements into two roughly equal parts, thereby reducing the expected number of comparisons to O(n log n) on average. This approach makes Samplesort statistically insensitive to biases in the input distribution, unlike earlier algorithms such as quicksort, and it asymptotically approaches the information-theoretic lower bound of approximately n log n - 1.443n comparisons for n elements. The recursive partitioning continues until subarrays are small enough to sort using a base-case method, such as insertion sort, ensuring minimal additional storage beyond the input array. Samplesort's balanced partitioning has proven especially valuable in parallel and distributed environments, where it facilitates even load distribution across multiple processors by selecting multiple splitter values from a larger sample to create p buckets for p processors, minimizing communication overhead in message-passing systems. Variants like Parallel Sorting by Regular Sampling (PSRS) use regular sampling to select splitter values, achieving O(n log n / p) time on p processors when n ≥ p³. In modern multi-core settings, advanced implementations such as In-Place Parallel Super Scalar Samplesort (IPS⁴o) incorporate cache-aware techniques, branch-free classification, and in-place operations to achieve practical speedups of up to 2.5× over std::sort on arrays of up to 2³² elements while maintaining O(n log n) work complexity. These developments underscore Samplesort's enduring relevance as a foundation for high-performance sorting in both sequential and parallel contexts.

History and Introduction

History

Samplesort was invented in 1970 by W. D. Frazer and A. C. McKellar. Their seminal work introduced the algorithm as a method for efficient sorting with minimal storage requirements. The original paper, titled "Samplesort: A Sampling Approach to Minimal Storage Tree Sorting," was published in the Journal of the ACM, Volume 17, Issue 3, pages 496–507. The primary motivation for samplesort was to overcome the inefficiencies in root selection for minimal storage tree sorting algorithms, where estimating the median proved computationally expensive. By employing a random sample of the input to select multiple splitters, samplesort achieved better balance in partitioning compared to quicksort's reliance on a single pivot, while maintaining minimal additional storage and approaching the information-theoretic lower bound on comparisons. Samplesort can thus be viewed as a generalization of quicksort, extending its divide-and-conquer paradigm with sampling for improved statistical robustness against input biases. An early practical implementation of samplesort as a minimal storage sort was detailed in 1975 by J. G. Peters and P. S. Kritzinger in BIT Numerical Mathematics, Volume 15, pages 85–93, focusing on sequential execution and empirical performance. Initially developed with an emphasis on sequential efficiency for large datasets, samplesort's design principles were later adapted for parallel environments during the 1980s and 1990s. Key contributions include J. S. Huang and Y. C. Chow's 1983 paper on parallel sorting and data partitioning by sampling, presented at the 7th International Computer Software and Applications Conference, which extended the approach to distributed systems. In the 1990s, further refinements appeared, such as H. Shi and J. Schaeffer's 1992 algorithm for parallel sorting by regular sampling in the Journal of Parallel and Distributed Computing, Volume 14, Issue 4, pages 361–372, optimizing for multiprocessor architectures with reduced communication overhead.

Overview and Motivation

Samplesort is a divide-and-conquer sorting algorithm that generalizes quicksort by employing multiple splitters, derived from a random sample of the input array, to partition elements into a set of balanced buckets. Unlike quicksort, which relies on a single pivot to create two partitions, samplesort uses p-1 splitters to form p buckets, where p represents the desired number of partitions, often aligned with the number of available processors in parallel settings. This approach was introduced by Frazer and McKellar in 1970 as an enhancement to minimal storage tree sorting. The primary motivation for samplesort stems from quicksort's vulnerabilities to poor pivot selection, particularly with non-uniform distributions, which can lead to unbalanced partitions and degraded performance. By sampling to approximate the distribution, samplesort achieves more equitable bucket sizes, reducing recursion depth from O(\log n) in quicksort to O(\log_p n) and improving load balance. This makes it particularly advantageous for parallel execution, where buckets can be assigned to distinct processors for independent sorting, minimizing communication overhead and enhancing scalability on distributed systems. At a high level, samplesort begins by drawing a random sample from the input to select splitters that define bucket boundaries, assuming basic familiarity with comparison-based sorting. Elements are then classified into these buckets based on their relation to the splitters, followed by recursive sorting within each bucket and final concatenation of the sorted buckets to produce the overall ordered sequence. The sampling step serves as a statistical estimate of the data's distribution, enabling robust partitioning without exhaustive preprocessing.

Core Algorithm

Description

Samplesort operates as a divide-and-conquer algorithm that partitions the input array into multiple buckets using a sampled set of splitters to achieve balanced subproblems, which is particularly useful in both sequential and parallel contexts. The process begins with the selection of a random sample of size s from the n input elements, where s is typically chosen on the order of p \sqrt{n/p} (with p denoting the desired number of buckets) to approximate the distribution effectively and ensure the resulting buckets are of comparable size. This sampling step draws from the input to represent the overall data characteristics without requiring a full presort. The selected sample is then sorted using an efficient sequential sorting method, such as quicksort, to produce an ordered list from which p-1 splitters are extracted. These splitters are the elements at positions that divide the sorted sample into p approximately equal segments, such as every (s/p)-th position, serving as boundaries to define the ranges for the buckets. Each element of the original array is subsequently classified into one of the p buckets by comparing it against these splitters, often via a linear scan or binary search for efficiency, resulting in p disjoint subsets whose union covers the entire input. Each bucket is recursively sorted using the Samplesort procedure to maintain consistency, with a base case applied when the bucket size falls below a threshold (e.g., using insertion sort for small subarrays to optimize performance). For large inputs where deep recursion may be impractical, the algorithm can be implemented iteratively by sorting the buckets sequentially in a loop, avoiding stack overhead while preserving the divide-and-conquer structure. The fully sorted output is obtained by concatenating the sorted buckets in the order defined by the splitters. For illustration, consider an array of n = 100 elements to be divided into p = 4 buckets: a sample of size approximately 20 is drawn and sorted, yielding splitters at roughly the 25th, 50th, and 75th percentiles of the sample, which guide the bucketing to create four subsets, each expected to hold about 25 elements.

Pseudocode

The core samplesort algorithm can be expressed recursively in pseudocode, where the input is an array A of n elements to be sorted using p buckets (corresponding to processors in a parallel context). The procedure selects a random sample, derives splitters from it, partitions the array into buckets, recursively sorts each bucket, and concatenates the results. This formulation assumes a sequential setting but extends naturally to parallel execution by distributing buckets across processors.
pseudocode
function samplesort(A, p):
    n = |A|
    if n <= 1:
        return A
    // For small arrays, switch to insertion sort as base case
    if n < threshold:  // e.g., threshold = 2p
        return insertion_sort(A)
    s = sample_size(n, p)  // e.g., s = p * sqrt(n / p)
    S = random_sample(A, s)
    sort(S)  // Sort the sample (e.g., using quicksort)
    splitters = select_splitters(S, p-1)  // Select p-1 splitters at positions s/p, 2s/p, ..., (p-1)s/p
    buckets = [[] for i in 1 to p]
    for each x in A:
        bucket_index = find_bucket(x, splitters)  // Binary search to find index j where splitters[j-1] < x <= splitters[j]
        buckets[bucket_index].append(x)
    for i in 1 to p:
        // Recurse with adjusted processor count for small buckets
        buckets[i] = samplesort(buckets[i], min(p, |buckets[i]|))
    return concatenate(buckets[1], ..., buckets[p])
The find_bucket operation uses binary search on the sorted splitters array to determine the appropriate bucket index from 1 to p, with conventions that elements \leq the first splitter go to bucket 1 and elements > the last splitter go to bucket p. This ensures balanced partitioning in expectation. The sample size s is chosen to provide a good approximation of the data distribution while keeping sampling overhead low.
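
A direct, runnable rendering of this pseudocode in Python is sketched below. The threshold and sample-size choices (threshold ≈ 2p, s ≈ p·sqrt(n/p)) follow the heuristics above, the function names are illustrative, and an extra guard (not part of the pseudocode) falls back to a plain sort when a split makes no progress, e.g., when all sampled keys are equal.
python
import random
from bisect import bisect_left

def samplesort(a, p=4, threshold=None):
    """Sort list a with the recursive samplesort scheme sketched above."""
    n = len(a)
    if threshold is None:
        threshold = max(2 * p, 16)        # small-array cutoff (illustrative choice)
    if n <= threshold:
        return sorted(a)                  # base case: any simple sort
    # Draw and sort a random sample, then pick p-1 evenly spaced splitters.
    s = max(p, int(p * (n / p) ** 0.5))
    sample = sorted(random.sample(a, min(s, n)))
    step = len(sample) // p
    splitters = [sample[(i + 1) * step] for i in range(p - 1)]
    # Classify each element into one of p buckets via binary search,
    # so bucket j receives x with splitters[j-1] < x <= splitters[j].
    buckets = [[] for _ in range(p)]
    for x in a:
        buckets[bisect_left(splitters, x)].append(x)
    if max(len(b) for b in buckets) == n:  # degenerate split (e.g., many equal keys)
        return sorted(a)
    # Recurse on each bucket and concatenate the results in splitter order.
    out = []
    for b in buckets:
        out.extend(samplesort(b, p, threshold))
    return out

if __name__ == "__main__":
    data = [random.randint(0, 10**6) for _ in range(10**4)]
    assert samplesort(data) == sorted(data)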

Complexity Analysis

The Samplesort algorithm exhibits an average-case time complexity of O(n \log n), where n is the input size, due to the expected balanced partitioning into p buckets. The sampling phase involves sorting a random sample of size s, incurring O(s \log s) time, while the classification phase performs O(n \log p) comparisons by evaluating each of the n elements against the splitters derived from the sorted sample, at \log p comparisons per element via binary search. The subsequent recursive sorting of the p buckets then contributes a total expected time of O(n \log n) across all levels, as the bucket sizes are roughly equal with high probability under random sampling. In the worst case, poor representation of the data distribution by the sample can lead to highly unbalanced buckets, resulting in a time complexity of O(n^2); however, random selection of the sample makes such pathological cases improbable. The total number of comparisons in Samplesort can be approximated as n \log p + \sum_{i=1}^{p} n_i \log n_i, where \sum n_i = n and the bucket sizes satisfy n_i \approx n/p in the balanced case, simplifying the sum to n \log n - n \log p and yielding an overall O(n \log n) bound. Samplesort has a space complexity of O(n + s + p), accounting for the input array, the sample storage, and temporary space for the p output buckets during classification; the recursion depth adds O(\log n) stack space in the average case. The choice of sample size s critically influences bucket balance: larger values of s (typically \Theta(p \log n) for p buckets) decrease the variance in the n_i and mitigate worst-case risks, but elevate the sampling and splitter derivation costs. Relative to quicksort, Samplesort offers more predictable performance, as the multi-splitter approach based on a representative sample reduces the likelihood of degenerate partitions compared to quicksort's single-pivot selection, which can devolve to O(n^2) more readily.
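
For a concrete check of this approximation, take n = 2^{20} and p = 16 and assume perfectly balanced buckets (n_i = n/p = 2^{16}), an idealization of the average case:

n \log_2 p + \sum_{i=1}^{p} n_i \log_2 n_i = 2^{20} \cdot 4 + 16 \cdot \left( 2^{16} \cdot 16 \right) = 4 \cdot 2^{20} + 16 \cdot 2^{20} = 20 \cdot 2^{20} = n \log_2 n,

matching the O(n \log n) bound stated above.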

Sampling Techniques

Sample Selection

In Samplesort, the initial sample is drawn uniformly at random from the input array of n elements to approximate the underlying distribution of the keys. This random sampling ensures an unbiased estimate, allowing the algorithm to derive effective splitters without requiring prior knowledge of the data distribution. In the original sequential variant, the sample size is chosen to optimize the expected number of comparisons, typically on the order of \sqrt{2n \ln 2}, for the partitioning to approach the information-theoretic lower bound. In parallel or multi-bucket variants, the sample size s is typically O(p \sqrt{n/p}), where p is the number of processors or buckets, to achieve buckets of size approximately \sqrt{n} after partitioning; this choice balances the overhead of sampling and sorting the sample against the cost of local bucket sorting. The selected sample is then sorted using a standard comparison-based algorithm such as quicksort, incurring O(s \log s) time. From this sorted sample, p-1 splitters are chosen as the elements at positions s/p, 2(s/p), ..., (p-1)(s/p) (rounded appropriately), which define equal-probability intervals for partitioning the input into p buckets. These splitters function as approximate quantiles of the input distribution, promoting balanced bucket sizes and avoiding the need for a complete presort of the entire input, thereby enabling efficient execution. For instance, with p=5 and s=25, the splitters would be the 5th, 10th, 15th, and 20th elements in the sorted sample.
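
The p=5, s=25 example can be checked with a few lines of Python (illustrative only; the ranks are 1-based in the text and 0-based in the code):
python
import random

p, s = 5, 25
data = [random.random() for _ in range(1000)]

sample = sorted(random.sample(data, s))
# Every (s/p)-th element of the sorted sample: the 5th, 10th, 15th, and 20th
# elements in 1-based terms, i.e. indices 4, 9, 14, 19 here.
splitters = [sample[(i + 1) * (s // p) - 1] for i in range(p - 1)]
print(len(splitters), "splitters at sample ranks",
      [(i + 1) * (s // p) for i in range(p - 1)])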

Oversampling in samplesort involves selecting a sample size s larger than the minimum required for basic splitter estimation, such as s > p \sqrt{n/p} in parallel settings with p processors and n elements, to reduce the variance in splitter positions. This approach enhances the representativeness of the sample, particularly for non-uniform data distributions. The primary benefits of oversampling include minimizing the probability of unbalanced buckets by providing more accurate estimates of data quantiles, which leads to improved load balance across recursive phases. Consequently, it achieves near-O(n \log n) average-case performance with high probability, making samplesort more robust in practice. A key trade-off is the increased cost of sampling and sorting the larger candidate set, which requires O(s \log s) time, though this overhead is offset by reduced imbalance in subsequent bucket sorts. Oversampling was introduced in later refinements to address limitations in the original 1970 samplesort algorithm, which struggled with skewed data leading to poor bucket balance. In practice, after drawing the larger random sample, it is sorted to identify candidate splitters, from which the final p-1 splitters are selected (e.g., every (s/p)-th element); any excess samples are typically discarded before partitioning. The variance of the relative bucket sizes decreases as O(1/s), directly contributing to better overall load balance: \text{Var}(|B_i|) = O\left(\frac{1}{s}\right), where |B_i| denotes the size of the i-th bucket as a fraction of n.
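
The effect can be seen in a small Monte Carlo experiment (an illustration of the trend, not a proof; the sample sizes and data are arbitrary):
python
import random
from bisect import bisect_left
from statistics import pvariance

def bucket_sizes(data, s, p):
    """Partition data with splitters drawn from a random sample of size s."""
    sample = sorted(random.sample(data, s))
    splitters = [sample[(i + 1) * (len(sample) // p)] for i in range(p - 1)]
    sizes = [0] * p
    for x in data:
        sizes[bisect_left(splitters, x)] += 1
    return sizes

random.seed(1)
data = [random.random() for _ in range(100_000)]
p = 16
for s in (p, 4 * p, 64 * p):
    trials = [pvariance(bucket_sizes(data, s, p)) for _ in range(20)]
    # Larger samples give markedly smaller spread around the ideal n/p.
    print(f"s = {s:5d}: mean variance of bucket sizes ~ {sum(trials) / len(trials):,.0f}")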

Bucket Size Estimation

In samplesort, bucket size estimation occurs after splitter selection to predict and verify the distribution of elements across buckets, ensuring balanced loads for subsequent recursive sorting. The primary goal is to achieve buckets of approximately equal size, ideally n/p elements each, where n is the total number of elements and p is the number of buckets or processors, thereby minimizing load imbalance in parallel or divide-and-conquer execution. A key technique leverages the sample frequencies to provide an initial prediction of bucket sizes without a full scan. After sorting the sample of size s and deriving splitters that define the bucket intervals, the expected size of each bucket i is calculated as \text{size}_i = \left( \frac{s_i}{s} \right) \cdot n, where s_i is the count of sample elements in interval i. This proportional estimation assumes the sample is representative of the overall data distribution, offering a quick prediction of how elements will distribute. For precise sizing, a dedicated counting pass follows splitter selection, iterating over the entire input to classify each element into its corresponding bucket and increment counters. This yields exact bucket sizes, enabling efficient memory allocation and movement in the subsequent partitioning phase. If the resulting sizes deviate substantially from the target n/p (due to skewed distributions or poor sample representation), adjustments such as re-sampling to generate new splitters or applying weighted adjustments to existing ones can be employed to promote balance. In parallel settings, estimation adapts to distributed data by having each processor compute local counts of its subarray elements falling into the global bucket intervals, often using local sample subsets or histograms derived from initial splitter probes. These local estimates are aggregated or refined iteratively to approximate each processor's share, with the overall aim of n/p elements per bucket to equalize recursive workloads across processors. Oversampling, by increasing s beyond the minimal p-1 splitters, enhances estimation accuracy, reducing variance in predicted sizes and mitigating risks from non-uniform distributions. For illustration, if a sample reveals that 30% of its elements reside in the first bucket's interval, the prediction would allocate roughly 0.3n elements to that bucket, guiding preparations for partitioning and highlighting potential imbalances for adjustment.
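
Both the proportional estimate and the exact counting pass can be sketched as follows (function and variable names are illustrative, not from any particular implementation):
python
from bisect import bisect_left

def estimate_bucket_sizes(sample, splitters, n):
    """Predict bucket sizes as (s_i / s) * n from the sample interval counts."""
    s = len(sample)
    counts = [0] * (len(splitters) + 1)
    for x in sample:
        counts[bisect_left(splitters, x)] += 1
    return [c / s * n for c in counts]

def exact_bucket_sizes(data, splitters):
    """Counting pass over the full input: exact sizes used to preallocate buckets."""
    sizes = [0] * (len(splitters) + 1)
    for x in data:
        sizes[bisect_left(splitters, x)] += 1
    return sizes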

Special Cases

Many Identical Keys

In samplesort, the presence of many identical keys in the input data can lead to non-unique splitters when the sample is sorted, causing some buckets to become degenerate (either empty or disproportionately large), which disrupts load balancing across processors. This issue arises because duplicate keys in the sample result in coinciding splitter values, potentially forcing a significant portion of elements to traverse multiple recursion levels without effective partitioning. To address this, samplesort implementations often employ multiplicity counts during splitter selection, treating sequences of equal splitters as boundaries that define ranges rather than distinct points. Elements equal to such a splitter are then classified through an additional comparison to ensure even distribution, for example by directing equals into dedicated "equality buckets" for keys with multiplicity exceeding n/p, where n is the input size and p is the number of processors. This adjustment requires only a single extra comparison per element in robust designs, avoiding branching overhead while preserving parallelism. The impact of these adaptations is a slight increase in classification time due to the extra handling, but they maintain overall balance and prevent recursion on uniform equality buckets, reducing total workload. In the worst case, such as when all sample elements are identical, the algorithm effectively reduces to bucketing all elements into a single group or falls back to specialized handling of that bucket (for example, treating it as an equality bucket that needs no further sorting), thereby avoiding degenerate behavior. Regarding complexity, this handling may elevate the comparison count to O(n \log p + n \cdot f), where f is the fraction of duplicates, though high-probability analyses confirm the overall time remains O(n \log n / p) per processor with proper load balancing.
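
One way to realize the extra equality check described above is sketched below (a simplified illustration; implementations such as IPS⁴o fold this test into the branchless decision tree rather than using a separate function):
python
from bisect import bisect_left

def classify_with_equality_buckets(x, splitters):
    """Return (bucket_index, is_equality_bucket) for element x.

    Regular buckets are numbered 0..p-1; an element equal to splitter j is
    diverted to a dedicated equality bucket for that splitter, so heavily
    duplicated keys never inflate a regular bucket or trigger recursion.
    """
    j = bisect_left(splitters, x)
    if j < len(splitters) and splitters[j] == x:
        return j, True        # one extra comparison per element
    return j, False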

Skewed Data Distributions

Skewed data distributions, such as heavy-tailed or highly clustered ones, pose challenges to samplesort by causing uneven partitioning into buckets, even when initial sampling appears representative, due to varying densities across the key range. This can result in load imbalances during parallel execution, where some buckets receive disproportionately more elements, leading to increased communication overhead and degraded performance. To address this, implementations often employ higher oversampling rates, selecting more samples than the minimum required to better approximate the underlying distribution and ensure balanced buckets. For instance, selecting approximately n/p^2 samples per processor, where n is the input size and p the number of processors, provides robustness against skew by improving the statistical reliability of splitter selection. Adaptive techniques further mitigate these issues by adjusting the oversampling based on observed properties of the sample, such as its variance, to dynamically increase sampling in regions of high density or in the tails. Additionally, pre-sampling random permutations of the input can decorrelate clustered elements, promoting more even distribution across buckets regardless of the original order or distribution. Post-sampling, if bucket counts reveal imbalances exceeding a threshold (e.g., twice the average bucket size), elements can be redistributed via additional all-to-all communication, though this is rarely needed with sufficient oversampling. These methods build on the bucket size estimation techniques described above to predict and adjust for potential disparities. In terms of performance, these adaptations preserve the expected O(n \log n) time complexity of samplesort, as the extra sampling and potential corrections add only lower-order terms. With probability at least 1 - 1/n^c for some constant c > 0, no bucket exceeds n/p + O(\log^3 n) elements, making significant imbalance negligible for large n. For heavy-tailed distributions, where the tails contribute few but critical elements, increasing the sample size s in those regions, potentially combined with rough quantile estimation from the sample, helps maintain balance without excessive overhead. Early variants, such as those evaluated on real-world datasets like database keys, highlighted the necessity of these robustness measures, demonstrating consistent performance across uniform, Gaussian, and highly skewed inputs on architectures like the Cray T3D and IBM SP-2.

Parallel Applications

Uses in Parallel Systems

Samplesort is particularly suitable for parallel systems due to its bucketing approach, which naturally partitions the data into roughly equal-sized subsets that can be assigned to individual processors or cores, enabling local sorting with reduced synchronization overhead. This design minimizes inter-processor communication to essential phases like sample collection and result merging, making it well-adapted to bulk synchronous parallel (BSP) models where computation proceeds in supersteps separated by barriers. It also fits multi-core CPUs through thread-based distribution of bucketing and sorting tasks, as well as distributed systems employing message-passing interfaces such as MPI for data redistribution across nodes. Early adaptations of samplesort for parallel environments emerged in the 1980s on shared-nothing architectures, where sampling facilitated balanced data partitioning to address load imbalances in early multiprocessor systems. In contemporary software, variants like parallel super scalar samplesort are integrated into libraries such as Boost.Sort for efficient multi-core execution and HPX for scalable distributed sorting. Key benefits include inherent load balancing from even bucket sizing, which ensures even work distribution across processors, and constrained communication volumes limited to sample gathering via collective operations and final output concatenation. For instance, in a cluster with p processors, an initial global sample is aggregated using all-reduce or all-gather to select pivots, followed by local classification into buckets and independent sorting on each processor before merging the results. Samplesort achieves near-linear speedup in these settings for large n where p ≪ n, with implementations demonstrating up to 28.71× speedup over sequential versions on 32 cores in multi-core environments and stable weak scaling in distributed setups with 32 nodes handling billions of elements. As of 2025, samplesort variants continue to evolve, with optimizations for GPUs and FPGAs enhancing performance in heterogeneous computing environments.

Parallelization Strategies

Samplesort's parallelization strategies vary depending on the memory model, with distinct approaches for distributed-memory systems (e.g., using message passing such as MPI) and shared-memory systems (e.g., using multi-threading). In distributed-memory settings, the algorithm emphasizes efficient communication to balance load across processors while minimizing data movement. Each of the p processors begins by taking a local random sample from its portion of the input array, typically of size proportional to the local data length. These local samples are then combined globally through collective operations such as all-gather or reduce-scatter to form a unified sample set, which is sorted to select the splitters at regular intervals. The selected splitters, numbering p-1 for p-way partitioning, are broadcast to all processors to ensure uniform classification criteria. Each processor then performs local bucketing by classifying its elements into p buckets using the splitters, often via binary search for efficiency. This step determines the destination processor for each element, since each bucket is owned by one processor rank. Following classification, an all-to-all communication phase redistributes the buckets so that each processor receives the elements destined for its local bucket; this exchange handles irregular bucket sizes effectively. In MPI implementations, this is commonly achieved using MPI_Alltoallv, which supports variable message lengths per processor pair. Once the exchange completes, each processor sorts its received bucket locally using a sequential algorithm, such as mergesort for stability or recursive samplesort for larger subproblems. Synchronization occurs via barriers after key phases (such as sample combination, splitter broadcast, bucketing, and data exchange) to adhere to a bulk synchronous parallel (BSP) model, ensuring all processors progress in lockstep. For shared-memory environments, parallelization relies on threads for concurrent local sampling and bucketing, with atomic operations or lock-free techniques to manage shared structures during element assignment. Optimizations enhance performance by addressing load imbalance and communication overhead. Local oversampling increases the global sample size beyond the minimum (e.g., by a factor of 1.5 to 2), improving splitter quality and reducing variance in bucket sizes without excessive cost. Communication efficiency is further boosted by adapting the all-to-all exchange to the underlying network topology, such as hierarchical all-to-all schemes for multi-level clusters to minimize latency. These strategies yield near-linear speedups, with experimental results showing speedups approaching linear scaling on up to 128 processors for large inputs.
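
The distributed-memory flow just described can be sketched compactly with mpi4py (an illustrative sketch under assumed parameters: the pickle-based lowercase collectives allgather and alltoall are used for brevity and an oversampling factor of 2 is arbitrary; a production code would exchange typed buffers with MPI_Alltoallv instead):
python
# Run with, e.g.:  mpiexec -n 4 python samplesort_mpi.py
import random
from bisect import bisect_left
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, p = comm.Get_rank(), comm.Get_size()

# Each rank holds a local slice of the input.
local = [random.random() for _ in range(100_000)]

# 1. Local sampling, then an all-gather to build the global sample on every rank.
oversample = 2                                    # oversampling factor (assumption)
local_sample = random.sample(local, oversample * p)
global_sample = sorted(x for part in comm.allgather(local_sample) for x in part)

# 2. Every rank derives the same p-1 splitters from the gathered sample.
step = len(global_sample) // p
splitters = [global_sample[(i + 1) * step] for i in range(p - 1)]

# 3. Local bucketing: bucket i is owned by rank i.
buckets = [[] for _ in range(p)]
for x in local:
    buckets[bisect_left(splitters, x)].append(x)

# 4. All-to-all exchange, then a local sort of the received bucket.
received = comm.alltoall(buckets)
my_bucket = sorted(x for part in received for x in part)
print(f"rank {rank}: {len(my_bucket)} elements sorted locally")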

Efficient Implementations

Bucket Determination

In samplesort, bucket determination assigns each element to one of the k buckets defined by k-1 sorted splitters, typically via binary search on the splitters to find the appropriate position, requiring O(\log k) comparisons per element and O(n \log k) total time, where k is often proportional to the number of processors p. Efficient implementations optimize this process to reduce branch mispredictions and leverage hardware parallelism. Super Scalar Sample Sort (SSS) employs an implicit complete binary tree built over the splitters, where the root is the median splitter and navigation proceeds level by level using predicated instructions for branch-free execution; this decouples comparisons from memory accesses in a two-pass approach (first computing bucket indices and sizes, then placing elements), yielding a cost dominated by O(n) memory operations despite the O(n \log k) comparisons. Further enhancements include precomputing an array-based search tree from the splitters for faster lookups. In variants like In-Place Super Scalar Samplesort (IPS⁴o), the decision tree is unrolled into straight-line code with conditional increments (the index doubles at each level and is incremented if the element exceeds the splitter), eliminating branches. For example, with k=16 (a power of two, giving a four-level tree), an element traverses the tree by comparing against splitters such as the 8th, then conditionally the 4th or 12th, and so on, with loop unrolling and software pipelining enabling multiple comparisons to proceed in parallel via SIMD operations on modern processors. This approach achieves near-linear classification time in practice, particularly when k is small relative to cache sizes.
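
The index-doubling traversal of the implicit splitter tree can be sketched as follows (Python for clarity only; real SSS/IPS⁴o kernels emit branch-free machine code and SIMD, which Python cannot express):
python
def build_implicit_tree(splitters):
    """Store k-1 sorted splitters in heap order: tree[1] is the root (median),
    tree[2*i] and tree[2*i + 1] are the children of tree[i]."""
    k = len(splitters) + 1                 # number of buckets, assumed a power of two
    tree = [None] * k                      # index 0 is unused
    def fill(node, lo, hi):                # in-order placement of splitters[lo:hi]
        if lo >= hi:
            return
        mid = (lo + hi) // 2
        tree[node] = splitters[mid]
        fill(2 * node, lo, mid)
        fill(2 * node + 1, mid + 1, hi)
    fill(1, 0, len(splitters))
    return tree

def classify(x, tree, log_k):
    """Branch-free style traversal: the index doubles at every level and is
    incremented whenever x exceeds the splitter stored at that node."""
    j = 1
    for _ in range(log_k):
        j = 2 * j + (x > tree[j])
    return j - len(tree)                   # map leaf index k..2k-1 to bucket 0..k-1

splitters = sorted(range(10, 160, 10))     # 15 splitters -> k = 16 buckets
tree = build_implicit_tree(splitters)      # tree[1] is the 8th splitter (80)
print(classify(47, tree, log_k=4))         # -> bucket 4, since 40 < 47 <= 50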

Data Partitioning

In samplesort, the data partitioning phase follows splitter determination and involves classifying and redistributing elements into balanced buckets to enable local sorting. This phase ensures that elements destined for the same bucket are collected together, minimizing communication overhead in parallel settings while maintaining memory efficiency. Local partitioning begins with each processor scanning its assigned data portion and assigning elements to buckets via comparison against the splitters. Temporary arrays are created for each bucket, and elements are appended sequentially during a single classification pass to minimize memory accesses. Buffer sizes are preallocated using size estimates derived from the sampling step, preventing costly reallocations and promoting contiguous memory layouts for better cache performance. In the Super Scalar Sample Sort (SSS) implementation, local partitioning employs a two-pass strategy for efficiency: the first pass counts elements per bucket using an implicit search tree constructed from the splitters, while the second pass scatters elements into the preallocated buffers. This approach avoids branch mispredictions by leveraging predicated instructions and loop unrolling, ensuring high instruction throughput during classification. Elements are packed into cache-friendly blocks to optimize data movement within shared-memory systems. In parallel environments, the global exchange step redistributes the locally partitioned buffers across processors using an all-to-all communication primitive, such as MPI_Alltoallv in distributed-memory systems. Each processor sends its portion of bucket j to the owner of bucket j, with messages serialized and packed to reduce network latency and bandwidth usage. Upon receipt, the buffers are concatenated into a single local array for subsequent sorting, targeting cache-aligned layouts to enhance processing speed. This partitioning requires O(n) temporary space overall, dominated by the local buffers totaling approximately the input size n. For instance, in a system with p processors, processor i receives all elements satisfying s_{i-1} < x \leq s_i (where s denotes the splitters) from every processor's local data, forming a roughly equal share for local sorting.
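
The two-pass count-then-scatter placement can be illustrated as follows (the bucket indices are computed here with the same binary search used earlier; an actual SSS kernel would use the oracle array together with the implicit splitter tree):
python
from bisect import bisect_left
from itertools import accumulate

def two_pass_partition(data, splitters):
    """Pass 1: record each element's bucket and count elements per bucket.
    Pass 2: scatter every element into a preallocated output array at its
    bucket's running offset, so each bucket ends up contiguous."""
    p = len(splitters) + 1
    oracle = [bisect_left(splitters, x) for x in data]    # bucket index per element
    counts = [0] * p
    for b in oracle:
        counts[b] += 1
    offsets = [0] + list(accumulate(counts))[:-1]          # exclusive prefix sums
    out = [None] * len(data)
    cursor = offsets[:]                                    # running write positions
    for x, b in zip(data, oracle):                         # pass 2: scatter
        out[cursor[b]] = x
        cursor[b] += 1
    return out, counts, offsets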

In-Place Samplesort

Local Classification

In the local classification phase of in-place samplesort, each processing unit classifies its local portion of the input array into conceptual buckets without allocating additional arrays to store the partitioned elements, thereby minimizing auxiliary space usage. The input is divided into blocks of size b, processed by t threads in stripes. This phase employs local buffer blocks and a precomputed, branchless decision tree based on the global splitters, determined in the prior sampling phase and broadcast to all units, to classify elements into k buckets (where k approximates the number of processing units) in constant time per element. The classification proceeds in a single pass over the local stripe. Each thread scans its data, classifies elements using the decision tree, and places them into the local buffer blocks corresponding to their buckets. Full buffer blocks are written back to the input array, while histograms are built as a side effect to count elements per bucket. These local counts are then aggregated across units, often using parallel reduction, to compute global prefix sums that establish the exact starting positions for each bucket in the overall array. This partially moves elements into their target buckets, leaving unprocessed partial blocks for later phases. The time complexity of this phase is O(n), achieved through constant-time classification per element via the branchless search tree, where n is the local subarray size. Auxiliary space requirements are O(k b t) for local buffers, histograms, and prefix sums (sublinear in n, with k buckets, block size b, and t threads), enabling the in-place nature; a strictly in-place variant achieves O(1) space beyond the input by eliminating the recursion stack. This approach leverages splitters selected in parallel from a distributed sample to ensure balanced buckets with high probability, enabling efficient subsequent phases.
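
A much simplified, sequential sketch of the buffered classification idea follows (real IPS⁴o keeps one buffer block per bucket per thread and flushes full blocks back into the input array itself; here separate lists stand in for that write-back, and the block size is arbitrary):
python
from bisect import bisect_left

def classify_into_blocks(data, splitters, block_size=4):
    """Classify elements into per-bucket buffer blocks; whenever a buffer fills,
    flush it as one block and record the element in the per-bucket histogram."""
    p = len(splitters) + 1
    buffers = [[] for _ in range(p)]           # one buffer block per bucket
    flushed = [[] for _ in range(p)]           # blocks already "written back"
    histogram = [0] * p                        # elements seen per bucket
    for x in data:
        b = bisect_left(splitters, x)
        histogram[b] += 1
        buffers[b].append(x)
        if len(buffers[b]) == block_size:      # buffer full: flush as a whole block
            flushed[b].append(buffers[b])
            buffers[b] = []
    return flushed, buffers, histogram          # 'buffers' holds the partial blocks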

Block Permutation

In the block permutation phase of in-place samplesort, bucket counts obtained from the local classification step are used to compute the starting positions for each bucket within the array via prefix sums, with boundaries rounded up to the nearest block size to ensure contiguous block allocation. This preparation allows the remaining unprocessed blocks from classification, now in the partially partitioned array, to be rearranged into their designated positions without requiring additional memory. The permutation process involves multiple threads cycling through the buckets in a coordinated manner, employing multi-pointer techniques with read (r_i) and write (w_i) pointers for each bucket to track progress. Unprocessed blocks are read into temporary swap buffers and moved or swapped to their destination buckets based on the classification results and pointers: if the destination write pointer (w_{\text{dest}}) is less than or equal to the read pointer (r_{\text{dest}}), a direct swap occurs; otherwise, the block is moved to an empty slot, which is later refilled. This block-interchange approach ensures that entire blocks are placed contiguously, maintaining the invariant that each bucket contains only correct (already in place), unprocessed, or empty blocks. To handle potential overlaps between source and destination regions, the algorithm uses atomic operations, such as fetch-and-add on the bucket pointers, to synchronize access and prevent overwriting of unprocessed blocks, thereby avoiding data races in parallel execution. For instance, if bucket 1 is assigned positions 0 to 9 and the elements destined for it lie in blocks scattered elsewhere, the permutation swaps or moves those blocks sequentially into indices 0 through 9 while preserving order within each block. The overall complexity of this phase is O(n), as it performs a linear pass over the blocks after classification.

Cleanup

In the in-place samplesort algorithm, the cleanup phase addresses minor disorders that may persist at bucket boundaries following the block permutation step, primarily arising from equal keys or slight overlaps in partitioning. These irregularities occur because the permutation process, while largely correct, can leave a small number of elements misplaced across adjacent buckets due to the block-aligned, approximate nature of the initial placement. To resolve these issues, cleanup scans the boundaries between consecutive buckets and swaps misplaced elements, such as those from the head of the next bucket or from partial buffers, into appropriate empty slots within the current bucket or its unused blocks. For cases involving keys that span multiple blocks, dedicated equality buckets are formed, and these are handled separately to avoid inter-block disruptions. If boundary regions remain small but disordered after swapping, cleanup may recurse on those limited subarrays to ensure precise ordering. This phase integrates seamlessly by first applying an in-place local sort, such as insertion sort, to each individual bucket after permutation, which stabilizes intra-bucket order before addressing any cross-boundary equals. The cleanup thus finalizes the partitioning, confirming that all elements are correctly placed within their designated buckets in the original array, preparing the subarrays for subsequent recursive invocations of the samplesort procedure. The entire cleanup operates with O(1) extra space, relying solely on the input array and temporary swaps without additional buffers or recursion stacks, thereby maintaining the in-place property throughout the algorithm's execution.

References

  1. W. D. Frazer and A. C. McKellar. "Samplesort: A Sampling Approach to Minimal Storage Tree Sorting." Journal of the ACM, Volume 17, Issue 3, 1970, pp. 496–507.
  2. "In-Place Parallel Super Scalar Samplesort (IPS⁴o)." DROPS.
  3. J. G. Peters and P. S. Kritzinger. "Implementation of samplesort: A minimal storage tree sort." BIT Numerical Mathematics, Volume 15, 1975, pp. 85–93.
  4. H. Shi and J. Schaeffer. "Parallel sorting by regular sampling." Journal of Parallel and Distributed Computing, Volume 14, Issue 4, 1992, pp. 361–372.
  5. "Super Scalar Sample Sort." Algorithm Engineering.
  6. "Highly Scalable Parallel Sorting."
  7. Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Introduction to Parallel Computing, Second Edition.
  8. "Generalized Leapfrogging Samplesort: A Class of O(n log² n) Worst-Case Complexity and O(n log n) Average-Case Complexity Sorting Algorithms." arXiv, January 2018.
  9. "Theory of Computing Systems" (implementations of bitonic sort, radix sort, and sample sort).
  10. "Parallel Algorithms" (lecture notes).
  11. "In-place Parallel Super Scalar Samplesort (IPS⁴o)." arXiv, June 2017.
  12. "Engineering In-place (Shared-memory) Sorting Algorithms." January 2022.
  13. "A Randomized Parallel Sorting Algorithm With an Experimental Study." August 1996.
  14. "Robust Massively Parallel Sorting." arXiv.
  15. "Parallel selection on GPUs." Innovative Computing Laboratory.
  16. "A New Sample Sort Algorithm."
  17. J. S. Huang and Y. C. Chow. "Parallel Sorting and Data Partitioning by Sampling." Institute of Information Science, Academia Sinica.
  18. "Bulk Synchronous Message Passing: bspsort."
  19. "Parallel Sample Sort using MPI."
  20. "The worst-case time complexity of parallel::sort seems to be O(N^2)." Forum discussion, August 2017.
  21. "Parallel Sorting by Regular Sampling."
  22. "A Randomized Parallel Sorting Algorithm with an Experimental Study."
  23. "Engineering In-place (Shared-memory) Sorting Algorithms." arXiv, September 2020.
  24. Michael Axtmann et al. "In-place Parallel Super Scalar Samplesort (IPS⁴o)." arXiv:1705.02257, May 2017.