
Parallel processing

Parallel processing, also known as parallel computing, is the simultaneous use of multiple compute resources—such as central processing units (CPUs), cores, or computers—to solve a single computational problem by dividing it into smaller, concurrent tasks. This approach exploits concurrency to enhance performance, enabling faster execution of complex algorithms in fields like scientific simulation, data analytics, and machine learning.

The concept of parallel processing emerged in the mid-20th century alongside the development of early computers, with foundational work in the 1950s and 1960s focusing on multiprocessor systems to overcome the limitations of single-processor architectures. By the 1970s, advancements in vector processors and shared-memory multiprocessors, such as those from Cray Research, marked a significant evolution, driven by the need for high-speed computation in scientific applications. The 1990s and 2000s saw a shift toward commodity hardware, including clusters of off-the-shelf processors and the rise of multicore chips, fueled by Moore's law and the demand for scalable systems, with supercomputers achieving performance exceeding 1 exaflop as of 2025.

Parallel processing architectures are classified using frameworks like Flynn's taxonomy, which categorizes systems based on instruction and data streams: single instruction, single data (SISD) for sequential computing; single instruction, multiple data (SIMD) for vector operations; multiple instruction, single data (MISD) for fault-tolerant pipelines; and multiple instruction, multiple data (MIMD) for general-purpose parallelism. Common implementations include shared-memory systems, where multiple processors access a unified address space (e.g., multicore CPUs), and distributed-memory systems, where processors communicate via message passing (e.g., clusters using MPI). Hybrid models, combining both, are prevalent in modern supercomputers and GPUs for tasks requiring massive parallelism.

Key benefits of parallel processing include substantial speedups for divisible workloads, improved scalability for large datasets and simulations, and enhanced resource utilization in shared computing environments. For instance, it enables breakthroughs in climate modeling and drug discovery by distributing computations across thousands of nodes. However, challenges persist, such as managing synchronization to avoid race conditions, minimizing inter-processor communication overhead, ensuring load balancing, and developing portable software that scales efficiently across heterogeneous hardware.

Fundamentals

Definition and core principles

Parallel processing is a computational paradigm that involves the simultaneous execution of multiple processes or threads across multiple processing units to solve problems more efficiently than sequential execution on a single processor. This approach divides a computational task into independent subtasks that can be performed concurrently, leveraging the aggregate computational power of multiple processors to reduce overall execution time. The primary motivations for parallel processing stem from the demands of handling massive datasets, performing large-scale simulations, and tackling computationally intensive applications such as scientific modeling and climate forecasting, where single-processor systems fall short in terms of speed and memory capacity. By distributing workloads, parallel processing enables the handling of large-scale data volumes that would otherwise exceed the memory or processing capacity of individual machines, and it supports time-critical tasks requiring rapid results.

Central to parallel processing are principles that quantify performance gains, such as speedup and efficiency. Speedup S is defined as the ratio of the execution time of the best sequential algorithm T_s to the execution time of the parallel algorithm T_p on p processors:
S = \frac{T_s}{T_p}.
Efficiency E measures resource utilization as the speedup divided by the number of processors:
E = \frac{S}{p}.
These metrics highlight the ideal of linear scaling, where S = p and E = 1, though real-world factors often yield sublinear results. Amdahl's law bounds the speedup of a fixed-size problem by its serial fraction; Gustafson's law extends this by focusing on scalable parallelism for fixed-time problems, where the problem size grows with the number of processors. It posits a scaled speedup of S = s + p(1 - s), with s as the serial fraction, emphasizing that nearly all computation can be parallelized in appropriately scaled problems.
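As a concrete illustration, the following C snippet (with made-up example timings) computes speedup and efficiency from measured serial and parallel run times, and evaluates Gustafson's scaled speedup for a given serial fraction:

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical measurements: best sequential time and parallel time. */
    double t_s = 120.0;   /* seconds, sequential */
    double t_p = 8.5;     /* seconds, on p processors */
    int p = 16;

    double S = t_s / t_p; /* speedup S = T_s / T_p */
    double E = S / p;     /* efficiency E = S / p  */
    printf("speedup S = %.2f, efficiency E = %.2f\n", S, E);

    /* Gustafson's scaled speedup for serial fraction s: S = s + p(1 - s). */
    double s = 0.05;
    printf("Gustafson scaled speedup = %.2f\n", s + p * (1.0 - s));
    return 0;
}
```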
The degree of achievable parallelism is fundamentally constrained by dependencies within the computation. Data dependencies occur when an operation relies on the output of a preceding operation, enforcing sequential ordering to ensure correctness. Control dependencies arise from conditional branches or loops that dictate alternative execution paths, potentially serializing portions of the code. These dependencies limit the extent of parallelization, as unresolved conflicts can lead to synchronization overheads or reduced concurrency, impacting overall speedup and efficiency.
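A short C fragment makes the distinction concrete: the first loop carries a data dependency between iterations and must run in order, while the second has none and can be parallelized freely (array names and values here are purely illustrative):

```c
#include <stdio.h>

#define N 8

int main(void) {
    double a[N] = {1, 2, 3, 4, 5, 6, 7, 8}, b[N] = {0}, c[N];

    /* Loop-carried data dependency: iteration i reads a[i-1], which
     * iteration i-1 wrote, so the iterations cannot safely run in parallel. */
    for (int i = 1; i < N; i++)
        a[i] = a[i - 1] + b[i];

    /* Independent iterations: no value flows between them, so they can
     * execute concurrently on any number of processors. */
    for (int i = 0; i < N; i++)
        c[i] = a[i] * 2.0;

    printf("a[N-1] = %f, c[0] = %f\n", a[N - 1], c[0]);
    return 0;
}
```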

Types of parallelism

Parallelism in computing can be categorized based on the level of granularity at which tasks are divided and executed concurrently, as well as architectural classifications that describe how processors and instructions interact. These distinctions help in designing systems that exploit concurrency effectively, building on the core principles of dividing workloads to achieve speedup.

Granularity refers to the size of the computational units being parallelized, which influences the overhead from communication and synchronization. It is typically divided into three levels: fine-grained, medium-grained, and coarse-grained. Fine-grained parallelism operates at the instruction level, where individual instructions are executed simultaneously across multiple functional units, often requiring tight coordination to manage dependencies; this is common in superscalar processors that issue multiple instructions per cycle. Medium-grained parallelism targets loop-level operations, such as parallelizing the iterations of a for-loop where each iteration is independent, balancing computation with moderate communication costs. Coarse-grained parallelism involves task-level decomposition, assigning large, independent subtasks to separate processors, which minimizes overhead but requires careful partitioning to ensure load balance; this approach is suitable for applications with naturally separable components.

A foundational classification is Flynn's taxonomy, proposed in 1966, which categorizes computer architectures based on the number of instruction streams and data streams. Single Instruction, Single Data (SISD) represents sequential processing, where one instruction operates on one data item at a time, as in traditional uniprocessor architectures. Single Instruction, Multiple Data (SIMD) applies a single instruction to multiple data elements simultaneously, enabling efficient vector processing; examples include vector processors like the Cray-1, which accelerated scientific computations by handling arrays in parallel. Multiple Instruction, Single Data (MISD) involves multiple instructions operating on a single data stream, though this is rare and primarily theoretical, with limited practical implementations. Multiple Instruction, Multiple Data (MIMD) allows multiple instructions on multiple data streams, forming the basis for most modern parallel systems like multicore CPUs and distributed clusters. This taxonomy, while simplistic, remains influential for understanding architectural trade-offs in parallelism.

Beyond architectural taxonomies, parallelism is often classified by the nature of the workload: data parallelism, task parallelism, and pipeline parallelism. Data parallelism involves applying the same operation across multiple data elements, such as matrix operations where each processor computes a portion of the result; this is highly scalable for uniform tasks and underpins SIMD architectures. Task parallelism, in contrast, divides a problem into heterogeneous subtasks that execute different operations concurrently, like rendering different scenes in graphics processing; it suits applications with diverse computational requirements. Pipeline parallelism structures computation as a sequence of stages, where each stage processes data from the previous one in an overlapping manner, similar to an assembly line; this is effective for streaming workloads, such as video processing, where throughput is prioritized over latency.

A special case is embarrassing parallelism, where a problem decomposes into completely independent tasks with no interdependencies, requiring minimal coordination; classic examples include Monte Carlo simulations for estimating integrals, where multiple random samples are evaluated in parallel to approximate results with high accuracy. Such problems achieve near-linear speedup on parallel systems, making them ideal for exploiting available concurrency without complex synchronization.
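To make the Monte Carlo example concrete, here is a minimal OpenMP sketch in C that estimates pi from independent random samples; it assumes the POSIX rand_r function for per-thread random state, and the sample count is arbitrary:

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    const long samples = 10000000L;
    long hits = 0;

    /* Samples are fully independent: threads only combine their counts
     * at the final reduction, the hallmark of embarrassing parallelism. */
    #pragma omp parallel reduction(+:hits)
    {
        unsigned int seed = 1234u + omp_get_thread_num(); /* per-thread RNG state */
        #pragma omp for
        for (long i = 0; i < samples; i++) {
            double x = rand_r(&seed) / (double)RAND_MAX;
            double y = rand_r(&seed) / (double)RAND_MAX;
            if (x * x + y * y <= 1.0)
                hits++;                  /* point fell inside the unit circle */
        }
    }
    printf("pi ~ %f\n", 4.0 * hits / samples);
    return 0;
}
```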

Hardware architectures

Multicore and shared-memory systems

Shared-memory systems form a foundational architecture in parallel processing, where multiple processors access a common memory address space to enable efficient data sharing and communication. In this model, processors communicate implicitly through the shared memory rather than explicit message passing, simplifying programming for applications like scientific simulations and database operations. These systems are tightly coupled, with low-latency communication, but their performance depends on memory access uniformity and cache coherence mechanisms.

The shared-memory model is categorized into uniform memory access (UMA) and non-uniform memory access (NUMA) based on access latencies. In UMA architectures, all processors experience equal access times to any memory location, typically achieved through a centralized memory pool connected via a shared bus, making it suitable for small-scale systems with up to a few dozen processors. NUMA systems, in contrast, distribute memory across nodes, where processors access local memory faster than remote memory, allowing scalability to larger configurations while introducing variable latencies that require software optimizations for data locality.

To maintain data consistency across processor caches in shared-memory systems, cache coherence protocols are essential, as each processor may hold local copies of shared data. The MESI protocol, widely adopted in modern implementations, defines four states for cache lines—Modified (dirty data unique to one cache), Exclusive (clean data unique to one cache), Shared (clean data possibly in multiple caches), and Invalid (stale or unused)—ensuring that writes propagate correctly and reads reflect the latest values through snooping or directory-based mechanisms. This invalidate-based approach minimizes bandwidth usage compared to update protocols, though it can introduce overhead from frequent invalidations in write-heavy workloads.

Multicore processors integrate multiple processing units on a single chip, enhancing parallelism by reducing inter-core communication latency through on-chip interconnects like rings or meshes. Introduced commercially with IBM's POWER4 in 2001, this design has evolved to include features like simultaneous multithreading (marketed as Hyper-Threading in Intel architectures), which allows a single core to execute two threads concurrently by duplicating architectural registers, improving resource utilization without full core duplication. Representative examples include Intel's Core i series, starting with dual-core models in 2006, and AMD's Ryzen processors, which since 2017 have offered up to 16 cores per chip for multithreaded tasks.

Symmetric multiprocessing (SMP) extends shared-memory principles by connecting multiple identical processors to a single shared memory under one operating system, enabling balanced load distribution across cores. SMP systems typically scale to around 64 processors before diminishing returns, as the architecture supports equal access but relies on a unified memory controller. This setup is common in servers and workstations, where it facilitates task parallelism via directives like those in OpenMP, though it demands careful thread scheduling to avoid imbalances.

Performance in multicore and shared-memory systems is often constrained by bus contention, where simultaneous memory requests from multiple processors saturate the shared interconnect, leading to stalls and reduced throughput. Memory bandwidth emerges as another key bottleneck, as the mismatch between processor speed and memory delivery rates—known as the memory wall—limits effective parallelism, particularly in bandwidth-intensive applications where off-chip accesses dominate execution time. Mitigations include larger on-chip caches and NUMA-aware memory allocation, but these cannot fully eliminate contention in highly parallel workloads.
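The shared-memory programming style described above can be sketched with an OpenMP loop in C: every thread reads and writes the same arrays through one address space, and no explicit communication appears in the code (array size and values are arbitrary):

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

/* All threads touch the same arrays through a single shared address
 * space; the runtime splits the iterations among the cores. */
static double x[N], y[N];

int main(void) {
    double a = 2.0;
    for (int i = 0; i < N; i++) { x[i] = i; y[i] = 1.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];          /* iterations are independent */
    double t1 = omp_get_wtime();

    printf("saxpy with up to %d threads: %.3f ms\n",
           omp_get_max_threads(), 1e3 * (t1 - t0));
    return 0;
}
```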

Distributed and cluster computing

Distributed computing architectures extend parallel processing beyond single machines by interconnecting multiple independent nodes, each with its own private memory, to form scalable systems without a shared address space. In the distributed-memory model, processors communicate exclusively through explicit message passing over networks, requiring programmers to manage data exchange and synchronization manually. This approach contrasts with shared-memory systems by avoiding centralized memory contention, enabling greater scalability for large-scale computations, though it introduces challenges in handling non-uniform communication latencies. Common network fabrics include Ethernet for cost-effective, general-purpose connectivity and InfiniBand for high-bandwidth, low-latency transfers in demanding environments, supporting rates up to 400 Gbps in modern implementations.

Cluster computing represents a foundational implementation of distributed architectures, aggregating commodity off-the-shelf (COTS) hardware into tightly coupled systems for high-performance computing (HPC). Pioneered by the Beowulf project, begun in 1993 at NASA's Goddard Space Flight Center, these clusters initially used 16 processors linked via channel-bonded Ethernet to demonstrate affordable parallel performance exceeding 1 Gflop/s by 1996. Beowulf-style clusters revolutionized HPC by leveraging open-source software like Linux and message-passing libraries, making supercomputing accessible without proprietary hardware. Today, they dominate the TOP500 list of the world's fastest supercomputers, where systems like El Capitan (1.742 exaflop/s, an HPE Cray EX system with Slingshot interconnect) and Frontier (1.353 exaflop/s) employ massive node counts—over 11 million cores in El Capitan—to achieve exascale performance through distributed clustering.

Grid computing builds on distributed principles by pooling heterogeneous resources across geographically dispersed locations, enabling collaborative problem-solving for resource-intensive tasks like scientific simulations. Resources such as compute cycles, storage, and data are dynamically allocated via middleware that coordinates nodes from multiple organizations, often spanning continents, to form a virtual supercomputer without dedicated ownership. This model supports high-throughput jobs by federating idle capacities, as seen in projects like CERN's Worldwide LHC Computing Grid, which processes petabytes of data. Evolving from grids, cloud platforms further democratize access to distributed resources; AWS ParallelCluster automates HPC cluster deployment on EC2 instances, supporting schedulers like Slurm for scalable job queuing across virtual nodes. Similarly, Google Cloud's HPC toolkit enables workloads in areas such as computational fluid dynamics, for example running Fluent simulations on distributed GPU clusters for faster convergence in modeling.

Effective communication in these systems relies on interconnection topologies that balance latency—the startup time for message transfers—and bandwidth—the sustained data transfer rate—across nodes. The ring topology connects nodes in a cycle, offering simplicity with two neighbors per node but a diameter that grows linearly with the node count, leading to high communication latency in large systems. A mesh topology, often arranged as 2D or 3D grids, provides higher connectivity (up to 6 neighbors in 3D) with diameter scaling as √p, reducing the average distance to about 2(√p − 1) hops while supporting parallel paths for improved throughput in moderate-scale clusters. The hypercube topology excels in connectivity, linking 2^d nodes with d neighbors each, achieving logarithmic diameter (log p) and a bisection width of p/2, which minimizes latency and maximizes bandwidth for expansive distributed environments. These trade-offs guide topology selection: rings for low-cost setups, meshes for a balance of cost and performance in simulations, and hypercubes for low-latency, high-bandwidth demands in large networks.
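In contrast to the shared-memory example earlier, communication in a distributed-memory cluster is explicit. The following minimal MPI ping-pong in C shows the basic message-passing pattern; timed over many repetitions, this exchange is the usual way to estimate the latency and bandwidth terms discussed above:

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal MPI ping-pong between ranks 0 and 1: data moves only through
 * explicit messages over the interconnect. Run with: mpirun -np 2 ./a.out */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf = 3.14;
    if (rank == 0) {
        MPI_Send(&buf, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&buf, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 0 got reply: %f\n", buf);
    } else if (rank == 1) {
        MPI_Recv(&buf, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&buf, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
```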

Specialized parallel hardware

Specialized parallel hardware encompasses architectures designed for targeted parallel workloads, diverging from general-purpose multicore processors by optimizing for massive data parallelism, reconfigurability, or fixed-function acceleration. These systems leverage single-device parallelism to achieve high throughput in domains requiring intensive computations, such as scientific simulations and data processing. Unlike distributed clusters, they focus on intra-device efficiency, often incorporating SIMD (single instruction, multiple data) principles to process large arrays simultaneously.

General-purpose computing on graphics processing units (GPGPU) repurposes GPU architectures, originally designed for rendering, to execute non-graphics parallel tasks. GPUs feature thousands of lightweight cores organized in SIMD fashion, enabling massive parallelism for data-intensive operations like matrix multiplications or simulations. NVIDIA's CUDA, introduced in 2006, provides a programming model that allows developers to write kernels executed across these cores, achieving 10-100x speedups over CPUs for suitable workloads. OpenCL, an open standard from 2009, extends similar capabilities across vendors, supporting parallel programming on GPUs from multiple manufacturers. For example, GPUs in the NVIDIA Ampere series, such as the A100, contain 6,912 CUDA cores and deliver 19.5 TFLOPS of FP32 performance or up to 156 TFLOPS using TF32 Tensor Cores for data-parallel tasks; consumer models like the RTX 3090 exceed 10,000 cores with about 35.6 TFLOPS FP32. Subsequent architectures like NVIDIA's Hopper H100, as of 2023, offer up to 16,896 cores and 67 TFLOPS FP32.

Field-programmable gate arrays (FPGAs) offer reconfigurable logic blocks that can be programmed post-manufacturing to implement custom parallel circuits tailored to specific applications. This flexibility allows designers to define hardware parallelism, such as pipelined data paths or multiple processing units, optimizing for throughput and latency in tasks like signal processing. In digital signal processing, FPGAs accelerate filters and transforms by deploying parallel arithmetic units, achieving real-time performance unattainable on fixed architectures. For cryptography, dynamic partial reconfiguration enables on-the-fly adjustments for algorithms like AES, reducing resource overhead while maintaining high-speed throughput through parallel round computations. Surveys highlight FPGAs' advantages in low-power, customizable parallelism compared to GPUs, though they require hardware description languages like VHDL or Verilog for implementation.

Application-specific integrated circuits (ASICs) are fixed-function chips engineered for a narrow set of parallel operations, providing unparalleled efficiency for dedicated workloads. In cryptocurrency mining, Bitcoin ASICs perform trillions of hash computations per second via highly parallel SHA-256 pipelines, with modern designs like Bitmain's achieving over 100 TH/s at low power per hash, far surpassing general-purpose hardware. For AI inference, Google's Tensor Processing Unit (TPU), deployed since 2015, uses systolic arrays for matrix multiplications, delivering 92 tera-operations per second (TOPS) for INT8 operations in a 123W package optimized for datacenter deployments. These chips sacrifice versatility for 10-100x energy efficiency gains in their target domains, making them ideal for large-scale, repetitive parallel tasks.

Vector processors represent a class of hardware with dedicated units for SIMD operations on entire arrays, enabling efficient handling of long vectors without explicit looping overhead. Historically, systems like the Cray-1 from 1976 pioneered this approach, using chained pipelines to process vectors at rates up to 160 MFLOPS, influencing subsequent designs. Modern incarnations, such as NEC's SX-Aurora TSUBASA series introduced in 2019, feature vector engines with up to 48 vector pipelines per card, supporting vector lengths of 256 elements and achieving over 2.45 TFLOPS in dense linear algebra. These processors excel in scientific computing by masking memory latency through continuous pipelined operations, with the SX-Aurora's architecture allowing seamless scaling across multiple engines for array-based parallelism.

Software and programming models

Parallel programming languages

Parallel programming languages provide abstractions and constructs to express concurrency and parallelism, enabling developers to leverage multicore processors, distributed systems, and specialized accelerators without managing low-level details such as thread scheduling or memory access. These languages range from extensions of existing general-purpose languages to domain-specific designs, focusing on models like shared memory, message passing, and data parallelism to simplify the development of scalable applications.

High-level languages have incorporated parallel features to support concurrent execution on shared-memory systems. Fortran's coarrays, introduced in the Fortran 2008 standard, enable single-program multiple-data (SPMD) parallelism by allowing arrays to be distributed across multiple images (processes) with one-sided communication operations like coarray assignments and intrinsics such as SYNC ALL. This extension facilitates portable parallel code without external libraries, making it suitable for scientific computing on clusters. Java supports parallelism through its built-in threading model, where the Thread class and the Runnable interface allow creation of multiple threads that share memory and execute concurrently, often managed via the java.util.concurrent package for executors and locks. Go introduces goroutines as lightweight, concurrently executing functions spawned with the "go" keyword, paired with channels for safe communication and synchronization between them, promoting a model where concurrency is cheap and composable for scalable networked applications.

Parallel extensions to standard languages provide directive-based or library-driven approaches for specific memory models. OpenMP is an API specification for shared-memory parallelism in C, C++, and Fortran, using compiler directives like #pragma omp parallel to fork threads, work-sharing constructs such as parallel for loops, and clauses to manage data races and ensure scalability across multicore systems. The Message Passing Interface (MPI) standardizes communication in distributed-memory environments, offering point-to-point operations (e.g., MPI_Send and MPI_Recv) and collective routines (e.g., MPI_Allreduce) for SPMD programs running on clusters, with implementations like MPICH ensuring portability and high performance.

Functional and declarative languages emphasize immutability and higher-order abstractions for concurrency. Haskell provides parallelism primitives in its runtime, such as par and pseq from the Control.Parallel module, which spark parallel evaluations of expressions on multiple cores while controlling sequencing to avoid space leaks, integrated with strategies for divide-and-conquer patterns in pure functional code. Erlang's concurrency model treats processes as isolated lightweight entities communicating solely via asynchronous message passing with the ! operator, enabling fault-tolerant distributed systems where each process handles its own state without shared memory.

Domain-specific languages target hardware-accelerated parallelism for particular workloads. CUDA, NVIDIA's extension to C/C++, enables general-purpose computing on GPUs by defining kernels—functions executed in parallel across thousands of threads organized in grids and blocks—with memory hierarchies like on-chip shared memory for fast access and synchronization via barriers. Halide is a domain-specific language for image and array processing pipelines, separating algorithm specification from optimization schedules that automatically generate parallel code for CPUs, GPUs, or other accelerators, achieving performance comparable to hand-tuned implementations through autotuning of tiling, vectorization, and fusion.
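As a sketch of the kernel/grid/block model described for CUDA, the following CUDA C program adds two vectors with one GPU thread per element; unified (managed) memory is used here purely for brevity:

```c
#include <stdio.h>

/* CUDA kernel: one lightweight thread per array element. The grid/block
 * decomposition maps a data-parallel loop onto thousands of GPU threads. */
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);   /* unified memory, for brevity */
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;                        /* threads per block  */
    int blocks = (n + threads - 1) / threads; /* blocks in the grid */
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();                  /* wait for the kernel */

    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```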

Coordination and synchronization mechanisms

In parallel processing, coordination and synchronization mechanisms ensure that multiple tasks or threads interact correctly, avoiding errors such as race conditions, where concurrent access to shared resources produces inconsistent results. Mutual exclusion techniques are fundamental to preventing such issues by guaranteeing that only one thread accesses a critical section of code or data at a time. Semaphores, introduced by Edsger W. Dijkstra in his 1968 paper on the THE multiprogramming system, provide a counting mechanism for controlling access to shared resources; a binary semaphore acts as a simple lock, while general semaphores manage multiple permits. Locks, often implemented via hardware instructions like test-and-set, enforce mutual exclusion at a low level and form the basis for higher-level constructs in shared-memory systems. Monitors, proposed by C.A.R. Hoare in 1974, encapsulate shared data and procedures within a single module, using implicit mutual exclusion and condition variables to simplify concurrent programming by associating synchronization with data access.

Beyond mutual exclusion, synchronization primitives enable threads to coordinate progress and wait for specific conditions. Barriers ensure that all threads reach a designated point before any proceed, facilitating phased execution in parallel algorithms; for instance, dissemination barriers, as described in work on scalable synchronization algorithms, achieve logarithmic-time coordination by propagating signals across processors in a tree-like fashion. Condition variables, integrated into monitor designs, allow threads to wait until a shared predicate holds true, such as resource availability, with operations like wait (releasing the lock) and signal (notifying a waiting thread); this avoids busy-waiting and promotes efficiency in producer-consumer scenarios.

Memory models define the ordering and visibility of memory operations across threads, critical for predictable behavior in parallel execution. Sequential consistency, formalized by Leslie Lamport in 1979, requires that the results of all memory accesses appear as if executed in some global sequential order consistent with each thread's program order, ensuring intuitive correctness but imposing high hardware costs. Relaxed memory models, such as those relaxing write-to-read or write-to-write orders, improve performance by allowing compiler and hardware optimizations while providing fences or acquire/release semantics to restore necessary ordering; for example, the Total Store Order model used in SPARC systems permits loads to bypass buffered stores but enforces strict write serialization. Atomic operations, like compare-and-swap (CAS), support lock-free programming by ensuring indivisible updates to shared variables, underpinning non-blocking data structures without full mutual exclusion.

Advanced mechanisms address the limitations of traditional locks by reducing contention and complexity. Transactional memory, introduced by Maurice Herlihy and J. Eliot B. Moss in 1993, treats sequences of operations as atomic transactions that execute optimistically; if conflicts arise, transactions abort and retry, enabling lock-free synchronization for complex data structures like queues. In modern languages, async/await patterns provide syntactic support for asynchronous operations, suspending execution at await points without blocking threads, thus synchronizing on completions (e.g., I/O events) while maintaining sequential-like code readability; this builds on promise-based models to avoid callback hell in event-driven and distributed contexts.
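A compact C11 example of the compare-and-swap idiom mentioned above: the update retries until it installs its increment atomically, with no lock held. In real use, increment would be called concurrently from many threads; the single-threaded driver here only demonstrates the mechanics:

```c
#include <stdatomic.h>
#include <stdio.h>

atomic_int counter = 0;

/* Lock-free increment via compare-and-swap (CAS). */
void increment(void) {
    int old = atomic_load(&counter);
    /* If another thread changed counter since we read it, the CAS fails,
     * refreshes old with the current value, and we simply retry. */
    while (!atomic_compare_exchange_weak(&counter, &old, old + 1))
        ;  /* retry */
}

int main(void) {
    for (int i = 0; i < 5; i++) increment();
    printf("counter = %d\n", atomic_load(&counter));
    return 0;
}
```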

Algorithms and applications

Parallel algorithms

Parallel algorithms are designed to exploit multiple processing units by dividing computational tasks into concurrent subtasks that can execute independently or with minimal coordination, thereby improving performance over sequential counterparts. These algorithms emphasize work-depth trade-offs, where total work remains comparable to sequential versions while depth (critical path length) is reduced to logarithmic or constant factors. Seminal work in parallel algorithm design, such as that by Blelloch, highlights the importance of balancing computational load across processors to achieve near-linear speedup.

Divide-and-Conquer Approaches

Divide-and-conquer paradigms in parallel algorithms recursively partition problems into independent subproblems, solving them concurrently before combining results. This approach is particularly effective for problems with regular structure, enabling logarithmic-time execution on parallel models like the PRAM. For instance, parallel merge sort divides an array into halves, recursively sorts each in parallel, and merges the sorted halves using a parallel merging step that compares elements across subarrays. Cole's parallel merge sort achieves this in O(log n) time using n processors on a PRAM model, performing approximately 5/2 n log n comparisons, which is work-efficient compared to sequential merge sort's n log n operations. Similarly, parallel quicksort employs a divide step to partition the array around a pivot, followed by recursive sorting of the subarrays. To handle load imbalances from uneven partitions, work stealing integrates dynamic task redistribution: idle processors "steal" tasks from busy ones' deques, ensuring balanced execution. This technique, extended in mixed-mode parallelism frameworks, allows quicksort to scale on multicore systems by combining task-parallel recursion with data-parallel partitioning, achieving near-ideal speedup for large inputs.
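The recursive structure maps naturally onto task-parallel runtimes. Below is a hedged C sketch of parallel merge sort using OpenMP tasks; the 4096-element cutoff for spawning tasks is an arbitrary tuning choice:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>

/* Merge the sorted runs [lo, mid) and [mid, hi) through a scratch buffer. */
static void merge(int *a, int *tmp, int lo, int mid, int hi) {
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (hi - lo) * sizeof(int));
}

static void msort(int *a, int *tmp, int lo, int hi) {
    if (hi - lo < 2) return;
    int mid = lo + (hi - lo) / 2;
    if (hi - lo > 4096) {                 /* spawn a task for big ranges */
        #pragma omp task shared(a, tmp)
        msort(a, tmp, lo, mid);
        msort(a, tmp, mid, hi);           /* this task sorts the rest    */
        #pragma omp taskwait              /* both halves must be done    */
    } else {                              /* small ranges: stay sequential
                                             to keep task overhead low   */
        msort(a, tmp, lo, mid);
        msort(a, tmp, mid, hi);
    }
    merge(a, tmp, lo, mid, hi);
}

int main(void) {
    int n = 1 << 20;
    int *a = malloc(n * sizeof(int)), *tmp = malloc(n * sizeof(int));
    for (int i = 0; i < n; i++) a[i] = rand();
    #pragma omp parallel
    #pragma omp single                    /* one thread seeds the task tree */
    msort(a, tmp, 0, n);
    printf("spot check: %d <= %d <= %d\n", a[0], a[n / 2], a[n - 1]);
    free(a); free(tmp);
    return 0;
}
```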

Prefix Sum and Reduction Operations

Prefix sum (scan) and reduction operations compute cumulative or aggregate results over arrays using associative functions, forming building blocks for many parallel algorithms. In a parallel scan, an input of n elements is transformed such that each output position holds the sum (or other operation) of all preceding inputs; this is achieved through an upsweep (reduce) phase that builds partial sums in a tree-like manner, followed by a downsweep that propagates prefix values back down the tree. Blelloch's EREW PRAM algorithm computes the scan with p processors in O(n/p + log p) time while maintaining O(n) work, enabling applications like stream compaction or sorting primitives. Reductions, such as parallel summation, similarly use associative operators to combine elements in a tree fashion, reducing an array to a single value in O(log n) depth. These operations are crucial for data-parallel computations, as they allow independent processing of subarrays before aggregation, with implementations in libraries like CUDA's achieving high throughput on GPUs by minimizing memory traffic.
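The two phases can be written compactly in C. This sequential sketch mirrors Blelloch's exclusive scan; within each inner loop the updated elements are disjoint, which is exactly what a parallel implementation (e.g., on a GPU) exploits:

```c
#include <stdio.h>

/* Blelloch-style exclusive scan on a power-of-two-sized array. */
void exclusive_scan(int *a, int n) {
    /* Upsweep (reduce): build partial sums up the tree. */
    for (int d = 1; d < n; d *= 2)
        for (int i = 2 * d - 1; i < n; i += 2 * d)
            a[i] += a[i - d];

    a[n - 1] = 0;                         /* identity at the root */

    /* Downsweep: push prefix values back down the tree. */
    for (int d = n / 2; d >= 1; d /= 2)
        for (int i = 2 * d - 1; i < n; i += 2 * d) {
            int t = a[i - d];
            a[i - d] = a[i];
            a[i] += t;
        }
}

int main(void) {
    int a[8] = {3, 1, 7, 0, 4, 1, 6, 3};
    exclusive_scan(a, 8);
    for (int i = 0; i < 8; i++)
        printf("%d ", a[i]);              /* prints: 0 3 4 11 11 15 16 22 */
    printf("\n");
    return 0;
}
```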

Graph Algorithms

Graph algorithms in parallel settings adapt traversal and optimization techniques to distributed or shared-memory models, often using message passing for inter-processor communication. Parallel breadth-first search (BFS) explores graphs level by level, queuing vertices at each frontier for concurrent processing by multiple threads or nodes. A work-efficient multithreaded variant by Leiserson and Schardl processes each frontier in parallel while avoiding redundant traversals, achieving O(m + n) work with span proportional to the graph diameter times logarithmic factors on multicore processors, where m is the number of edges and n the number of vertices; this scales to graphs with billions of edges by partitioning frontiers dynamically. For shortest paths, parallel adaptations of Dijkstra's algorithm employ concurrent edge relaxation to propagate distance updates across partitions. In distributed environments, the DSMR (Dijkstra Strip-Mined Relaxation) method relaxes edges in strips—batches of vertices—using MPI for inter-node communication, enabling efficient single-source shortest paths on large sparse graphs with billions of edges across distributed systems by minimizing synchronization overhead through bounded relaxation sets.
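A level-synchronous parallel BFS can be sketched in C with OpenMP: each frontier is expanded in parallel, and atomic operations keep each vertex from being claimed twice. The GCC __atomic builtin and the tiny example graph are incidental choices, not part of any published algorithm:

```c
#include <stdio.h>
#include <stdlib.h>

/* Level-synchronous BFS on a graph in compressed sparse row (CSR) form:
 * offsets[] indexes into edges[]. dist[v] = -1 means "unvisited". */
void bfs(int n, const int *offsets, const int *edges, int src, int *dist) {
    for (int i = 0; i < n; i++) dist[i] = -1;
    int *frontier = malloc(n * sizeof(int));
    int *next     = malloc(n * sizeof(int));
    int fsize = 1, level = 0;
    frontier[0] = src;
    dist[src] = 0;

    while (fsize > 0) {
        int nsize = 0;
        /* All frontier vertices are expanded independently; dist[] is
         * claimed with an atomic CAS so each vertex is added only once. */
        #pragma omp parallel for
        for (int f = 0; f < fsize; f++) {
            int u = frontier[f];
            for (int e = offsets[u]; e < offsets[u + 1]; e++) {
                int v = edges[e];
                int expected = -1;
                if (__atomic_compare_exchange_n(&dist[v], &expected,
                        level + 1, 0, __ATOMIC_RELAXED, __ATOMIC_RELAXED)) {
                    int slot;
                    #pragma omp atomic capture
                    slot = nsize++;       /* reserve a slot in next[] */
                    next[slot] = v;
                }
            }
        }
        /* implicit barrier: the level is complete before the swap */
        int *t = frontier; frontier = next; next = t;
        fsize = nsize;
        level++;
    }
    free(frontier); free(next);
}

int main(void) {
    /* tiny 5-vertex undirected example: 0-1, 0-2, 1-3, 2-3, 3-4 */
    int offsets[] = {0, 2, 4, 6, 9, 10};
    int edges[]   = {1, 2, 0, 3, 0, 3, 1, 2, 4, 3};
    int dist[5];
    bfs(5, offsets, edges, 0, dist);
    for (int i = 0; i < 5; i++) printf("dist[%d] = %d\n", i, dist[i]);
    return 0;
}
```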

Load Balancing Techniques

Load balancing ensures equitable task distribution in parallel algorithms, especially for irregular workloads where computation varies unpredictably. Dynamic scheduling adjusts assignments at runtime based on current loads, using metrics like processor utilization to migrate tasks; for example, diffusion-based methods propagate load imbalances to neighboring processors, converging to balance in O(log n) steps on meshes. Partitioning strategies for irregular workloads, such as those in graph or mesh computations, divide data into chunks that minimize edge cuts while equalizing element counts. Spatial partitioning, as in adaptive mesh refinement, assigns subdomains to processors via space-filling curves, reducing communication volume by up to 50% compared to random partitions and enabling scalable execution on distributed systems. For amorphous parallelism in irregular algorithms, hybrid static-dynamic schemes pre-partition data logically before runtime adjustments, achieving near-linear speedup on workloads like n-body simulations.
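A small OpenMP example of dynamic scheduling for irregular work: per-iteration cost varies unpredictably (Collatz trajectory lengths here are just a convenient stand-in for irregular work), so chunks are handed to threads on demand rather than preassigned:

```c
#include <stdio.h>
#include <omp.h>

/* Unpredictable cost per input: the number of steps varies wildly. */
static long collatz_steps(long x) {
    long steps = 0;
    while (x != 1) { x = (x % 2) ? 3 * x + 1 : x / 2; steps++; }
    return steps;
}

int main(void) {
    const int n = 100000;
    long total = 0;

    /* schedule(dynamic, 64): idle threads grab the next 64-iteration
     * chunk at runtime, balancing load that static chunking would not. */
    #pragma omp parallel for schedule(dynamic, 64) reduction(+:total)
    for (int i = 1; i <= n; i++)
        total += collatz_steps(i);

    printf("total steps: %ld\n", total);
    return 0;
}
```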

Real-world applications

Parallel processing plays a pivotal role in scientific computing, enabling the simulation of complex phenomena that require immense computational resources. In weather forecasting, the European Centre for Medium-Range Weather Forecasts (ECMWF) employs massively parallel systems to run high-resolution numerical models, aiming for 5 km horizontal resolution for ensemble predictions, by distributing atmospheric simulations across thousands of processors to improve accuracy and timeliness. Similarly, molecular dynamics simulations leverage parallel architectures to model atomic interactions in biological systems; for instance, NAMD (Nanoscale Molecular Dynamics) uses domain decomposition to scale simulations of over 100,000 atoms across parallel clusters, facilitating drug discovery and protein folding studies.

In big data analytics and databases, parallel processing frameworks handle vast datasets by distributing workloads across clusters. Apache Hadoop enables distributed storage and processing of petabyte-scale data through its MapReduce paradigm, allowing fault-tolerant parallel execution on commodity hardware for tasks like log analysis and ETL operations. Complementing this, Apache Spark accelerates analytics with in-memory computing, achieving up to 100x speedups over Hadoop for iterative algorithms on large clusters, as demonstrated in logistic regression benchmarks on 100 GB datasets. For relational databases, parallel query execution in systems like PostgreSQL divides scans and joins across multiple workers, significantly reducing execution time for aggregate queries on large tables by utilizing all available CPU cores. Oracle Database similarly employs parallel execution to break down SQL statements into concurrent tasks, enhancing throughput for data warehousing applications.

Graphics and multimedia production benefit from parallel processing to render photorealistic visuals and compress media efficiently. In ray tracing for film rendering, Pixar's RenderMan integrates parallel ray tracing to compute light paths in complex scenes, as used in productions like Cars, where distributed computing across clusters handles billions of rays per frame to achieve global illumination effects. For video encoding, standards like H.264/AVC and HEVC (H.265) support parallelization through multi-slice and multi-frame techniques; for example, HEVC's tile-based partitioning allows independent processing of video regions on GPUs, supporting efficient encoding and decoding of high-definition streams.

In finance, parallel processing underpins risk management and trading operations requiring rapid computation. Monte Carlo simulations for risk analysis parallelize thousands of scenario paths across processors to estimate value-at-risk (VaR) and option prices; GPU-accelerated implementations can achieve 100x speedups for nonparametric methods in option pricing. High-frequency trading (HFT) systems use parallel architectures to process market data streams in microseconds; for instance, multi-core and GPU parallelization in C++11-based frameworks enables simultaneous order matching and risk checks, handling millions of messages per second with latencies under 1 μs.

Challenges and limitations

Performance bottlenecks

Parallel processing systems often fail to achieve ideal linear speedup due to inherent limitations in program structure and hardware capabilities. A fundamental theoretical bound is provided by Amdahl's law, which quantifies how the serial portion of a workload restricts overall performance gains. Formulated by Gene Amdahl in 1967, the law states that the maximum speedup S for a system with p processors is given by S \leq \frac{1}{f + \frac{1-f}{p}}, where f represents the fraction of the program that must execute serially. Even with an infinite number of processors, speedup is capped at 1/f, emphasizing that parallelizing only the non-serial components yields diminishing returns as f increases. For instance, if 5% of the workload is serial, the theoretical maximum speedup is 20, regardless of processor count. This law highlights the necessity of minimizing serial code to maximize parallel efficiency in applications like scientific simulations.

Communication overhead further erodes performance in distributed parallel systems, where data exchange between processors introduces latency and bandwidth limitations. In message-passing architectures, such as those using MPI, the time to transfer data is dominated by startup latency (a fixed delay for initiating communication) and transmission time (proportional to message size divided by bandwidth). High communication latency can cause processors to idle while awaiting data, potentially degrading performance in latency-sensitive applications. Bandwidth saturation occurs when aggregate traffic overwhelms interconnect links, such as InfiniBand or Ethernet, leading to queuing delays that reduce effective throughput to below 50% of peak in dense workloads. The LogP model, proposed by Culler et al. in 1993, captures these effects through parameters for latency (L), overhead (o), gap (g, related to bandwidth), and processors (P), providing a framework to predict and mitigate such bottlenecks in algorithm design.

Load imbalance and improper task granularity compound these issues by causing uneven processor utilization, thereby lowering overall efficiency. Load imbalance arises when tasks require varying computation times, forcing faster processors to wait at synchronization points, which can reduce efficiency to as low as 20-30% in irregular applications like graph processing. Granularity refers to the size of work units assigned to processors; coarse granularity (large tasks) amplifies imbalance and underutilizes resources, while fine granularity (small tasks) increases overhead from scheduling and communication, potentially negating parallel gains. Optimal granularity balances these trade-offs, often achieving 70-90% efficiency in balanced workloads on multicore systems, as demonstrated in analyses of parallel algorithms for numerical computations.

In dense parallel hardware, such as multi-core or many-core processors, power and thermal constraints impose additional limits on performance. High core densities lead to power densities exceeding 100 W/cm², triggering thermal throttling to prevent overheating, which caps clock speeds and reduces performance by 20-50% under sustained loads. Cooling solutions like liquid immersion or advanced heat sinks are essential but add complexity and cost, while power delivery networks must handle increased current without voltage droop. Research on multi-core architectures shows that thermal hotspots from uneven workloads exacerbate these issues, shifting optimal configurations toward fewer active cores to stay within thermal limits and sustain performance.
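A quick numeric check of the Amdahl bound from the start of this subsection, in C: with a 5% serial fraction the speedup saturates near 20 as processors are added, matching the example in the text:

```c
#include <stdio.h>

int main(void) {
    double f = 0.05;                      /* serial fraction */
    int procs[] = {1, 8, 64, 512, 4096};

    /* Amdahl's law: S(p) = 1 / (f + (1 - f) / p). */
    for (int i = 0; i < 5; i++) {
        int p = procs[i];
        printf("p = %5d  ->  S = %6.2f\n", p, 1.0 / (f + (1.0 - f) / p));
    }
    printf("limit as p -> infinity: %.1f\n", 1.0 / f);
    return 0;
}
```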

Fault tolerance and reliability

In parallel processing systems, particularly large-scale high-performance computing (HPC) environments, fault tolerance ensures continued operation despite hardware failures, software errors, or network issues, which become more frequent as system scale increases. For instance, in exascale HPC systems as of 2024, the mean time between failures (MTBF) is projected to be as low as one hour due to the scale of millions of components. Reliability mechanisms are essential for maintaining computational integrity in distributed setups, where a single fault can propagate across the system without proper safeguards.

Checkpointing and restart techniques address faults in long-running parallel jobs by periodically saving the application's state to stable storage, allowing recovery from the last valid checkpoint upon failure. This approach, widely used in HPC applications via tools integrated with the Message Passing Interface (MPI), minimizes downtime by restarting only affected processes while coordinating global consistency across nodes. Coordinated checkpointing synchronizes all processes to capture a consistent global state, though it introduces overhead; alternatives like uncoordinated or hierarchical methods reduce this by staggering saves or using intermediate storage layers.

Redundancy enhances reliability through task or data replication, ensuring that multiple instances execute the same computation to detect and mitigate failures via majority voting or failover. In distributed parallel systems, Byzantine fault tolerance (BFT) extends this by handling arbitrary failures, including malicious ones, as demonstrated in the Practical Byzantine Fault Tolerance (PBFT) protocol, which achieves consensus among replicas using a three-phase protocol tolerant of fewer than one-third faulty nodes. Error detection mechanisms, such as cyclic redundancy checks (CRCs) on inter-node communications, verify data integrity over HPC interconnects like InfiniBand by appending polynomial-based checksums that flag transmission errors with high probability. Similarly, error-correcting code (ECC) memory detects and corrects single-bit errors in DRAM, crucial for parallel workloads where silent data corruption could invalidate results; ECC-equipped systems in HPC clusters reduce undetected errors by orders of magnitude compared to non-ECC setups.

Self-healing capabilities enable automatic recovery in cloud-based parallel environments, such as replacing failed nodes via elastic scaling in platforms like Amazon EC2, which dynamically reprovisions resources to maintain workload distribution. Consensus algorithms like Paxos facilitate this by coordinating agreement on system state among surviving nodes, ensuring fault-tolerant coordination and log replication even under partial failures.
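The checkpoint/restart idea can be sketched at application level in C: the loop below saves its state periodically and resumes from the last checkpoint on startup. The file name, format, and interval are illustrative only; production HPC checkpointing coordinates such saves consistently across all MPI ranks:

```c
#include <stdio.h>

#define CHECK_EVERY 1000
#define TOTAL_ITERS 10000

int main(void) {
    long iter = 0;
    double state = 0.0;

    /* On startup, resume from the last checkpoint if one exists. */
    FILE *ck = fopen("checkpoint.bin", "rb");
    if (ck) {
        fread(&iter, sizeof iter, 1, ck);
        fread(&state, sizeof state, 1, ck);
        fclose(ck);
        printf("restarting from iteration %ld\n", iter);
    }

    for (; iter < TOTAL_ITERS; iter++) {
        state += 0.5 * iter;                    /* stand-in for real work */

        if ((iter + 1) % CHECK_EVERY == 0) {    /* periodic checkpoint */
            long next = iter + 1;               /* resume after this step */
            ck = fopen("checkpoint.bin", "wb");
            fwrite(&next, sizeof next, 1, ck);
            fwrite(&state, sizeof state, 1, ck);
            fclose(ck);
        }
    }
    printf("done: state = %f\n", state);
    return 0;
}
```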

History and evolution

Early developments

The development of parallel processing in the mid-20th century was driven by the need for greater computational power in scientific and military applications, leading to early innovations in hardware architectures that exploited multiple processing units. In 1964, Seymour Cray designed the CDC 6600, recognized as the first successful supercomputer, which featured a central processor augmented by ten peripheral processing units (PPUs) to handle parallel tasks, achieving speeds up to three million floating-point operations per second. This design introduced parallel functional units and scoreboarding to manage instruction dependencies, laying groundwork for future vector systems. In 1966, the ILLIAC IV project began at the University of Illinois, aiming to build the first massively parallel computer with 256 processing elements operating in a single instruction, multiple data (SIMD) configuration for array processing; although scaled down to 64 elements, it demonstrated the feasibility of large-scale parallelism when it became operational in 1972 at NASA Ames Research Center.

A pivotal theoretical contribution came in 1967 when Gene Amdahl published "Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities," introducing Amdahl's law, which quantified the fundamental limits of parallel speedup due to inherently sequential portions of programs. This law emphasized that overall performance gains are constrained by the serial fraction of execution time, influencing subsequent designs to prioritize balanced architectures.

Throughout the 1970s, early SIMD machines like the operational ILLIAC IV proved effective for array-oriented scientific computation, achieving up to 200 million operations per second despite reliability challenges. Vector processing advanced with Cray Research's 1976 Cray-1, the first commercially successful vector system, which used deep pipelines for chained floating-point operations, delivering peak performance of 160 megaflops and setting standards for supercomputing.

The 1980s saw further maturation with specialized hardware for parallelism. In 1982, the Cray X-MP extended vector processing to multiprocessor configurations, supporting up to four CPUs with shared memory, reaching peak speeds of 941 megaflops for applications like large-scale simulations. Concurrently, Danny Hillis conceptualized the Connection Machine in a 1979 MIT memo, envisioning a SIMD architecture with thousands of simple processors interconnected in a hypercube topology, realized in the 1985 CM-1 with 65,536 one-bit processors for artificial intelligence and data-parallel tasks. In 1985, INMOS released the T414 transputer, a microprocessor designed for parallel systems with built-in communication links, paired with the occam programming language based on communicating sequential processes, enabling scalable networks of up to thousands of nodes for embedded and general-purpose computing. These developments by figures like Cray, Amdahl, and Hillis established core principles of parallelism that persisted into later decades.

Modern advancements

In the 1990s, the development of Beowulf clusters marked a significant advancement in accessible parallel processing by leveraging commodity off-the-shelf hardware to create cost-effective supercomputing systems. The first Beowulf cluster was constructed in 1994 at NASA's Goddard Space Flight Center, consisting of interconnected PCs running Linux, which demonstrated scalable parallel performance for scientific workloads without relying on proprietary architectures. Concurrently, the Message Passing Interface (MPI) emerged as a standardized protocol for distributed-memory parallel programming, with the initial MPI-1.0 specification released on May 5, 1994, by the MPI Forum, enabling portable message passing across diverse systems and fostering widespread adoption in high-performance computing.

The 2000s saw the proliferation of multi-core processors and general-purpose computing on graphics processing units (GPGPU), shifting parallel processing toward mainstream consumer and enterprise hardware. Intel introduced its first dual-core processor, the Pentium D (Smithfield), on May 25, 2005, which integrated two cores on a single die to enhance multitasking and computational throughput in desktop and server environments. In parallel, NVIDIA launched CUDA in November 2006, a parallel computing platform and programming model that simplified GPGPU development by allowing developers to use C/C++ extensions for massive parallelism on GPU architectures, revolutionizing applications in scientific computing and machine learning.

During the 2010s, efforts toward exascale computing accelerated, aiming for systems capable of at least one exaFLOP (10^18 floating-point operations per second) to address grand challenges in science and engineering. The U.S. Department of Energy (DOE) outlined exascale goals in the early 2010s through initiatives like the Exascale Computing Project, targeting deployment by the early 2020s while addressing power efficiency and scalability hurdles. This culminated in milestones tracked by the TOP500 list, where the Frontier supercomputer at Oak Ridge National Laboratory achieved 1.102 exaFLOPS on the High-Performance Linpack benchmark in June 2022, becoming the first confirmed exascale system. Subsequent systems followed, with Aurora at Argonne National Laboratory reaching exascale performance in May 2024 and becoming fully operational in January 2025, and El Capitan at Lawrence Livermore National Laboratory topping the list in November 2024 with 1.742 exaFLOPS, enabling breakthroughs in climate modeling, drug discovery, and national security simulations.

Standardization efforts evolved to support these hardware advances, with OpenMP progressing from its 1997 inception as a shared-memory, directive-based API to versions like OpenMP 5.0 in 2018 and OpenMP 6.0 in November 2024, incorporating tasking, accelerator offloading, and SIMD directives for heterogeneous systems, along with enhanced support for easier parallel programming and fine-grained control. Additionally, cloud integration expanded parallel processing accessibility, as exemplified by AWS ParallelCluster, an open-source tool launched in 2017 that automates HPC cluster deployment on Amazon EC2, and the AWS Parallel Computing Service introduced in August 2024 for managed scaling of parallel workloads.

Integration with AI and machine learning

Parallel processing has become integral to advancing artificial intelligence (AI) and machine learning (ML) workloads, particularly in handling the computational demands of training large-scale models. By distributing computations across multiple processors or devices, parallel techniques enable efficient execution of training tasks, reducing training times from weeks to hours while managing vast datasets and model parameters. This integration leverages both hardware and algorithmic innovations to address the exponential growth in model complexity, allowing AI systems to process petabytes of data and billions of parameters effectively.

In distributed deep learning, data parallelism and model parallelism are foundational strategies that exploit parallel processing to train neural networks. Data parallelism involves replicating the model across multiple devices, each processing a subset of the training data in parallel, with gradients synchronized periodically to update a shared model; this approach is natively supported in frameworks like PyTorch through its DistributedDataParallel module, which facilitates scalable multi-GPU training by handling communication via all-reduce operations. Similarly, TensorFlow implements data parallelism via its DistributionStrategy API, enabling seamless distribution across clusters for synchronous or asynchronous updates. Model parallelism, conversely, partitions the model itself across devices to accommodate architectures too large for single-device memory, such as transformer-based large language models; for instance, techniques like tensor parallelism split attention layers, while pipeline parallelism stages the model sequentially across GPUs. These methods have been pivotal in training models like GPT-4, which reportedly employs hybrid parallelism to manage an estimated 1.8 trillion parameters, achieving near-linear scaling efficiency on distributed GPU setups.

GPU clusters further amplify parallel processing for AI training through frameworks like Horovod, which simplifies distributed execution across TensorFlow, PyTorch, and other libraries by integrating efficient ring-allreduce algorithms for gradient aggregation. Horovod enables scaling from single GPUs to clusters of thousands, as demonstrated in benchmarks where it achieves up to 90% efficiency on hundreds of GPUs for convolutional neural networks, minimizing communication overhead via optimized collective operations. This framework has been widely adopted for large-scale AI workloads, allowing organizations to train models on massive datasets without custom infrastructure modifications.

In federated learning, parallel processing extends to edge devices, where models are trained locally on decentralized data to preserve privacy, with only model updates aggregated centrally; this approach uses asynchronous parallel computations across devices like smartphones, reducing bandwidth needs by up to 100x compared to centralized training while maintaining model accuracy. Seminal implementations, such as those in TensorFlow Federated, parallelize gradient computations on heterogeneous edge hardware, enabling privacy-preserving AI for applications like mobile health monitoring.

Recent advances from 2024 to 2025 have focused on mixture-of-experts (MoE) models, which inherently support parallelization by activating only subsets of specialized "experts" per input, drastically reducing active parameters during training and inference. On TPUs, expert parallelism distributes these experts across cores, as seen in extensions of Switch Transformers, where sparse routing scales to trillion-parameter models with 7x faster pre-training than dense counterparts on TPU v3 pods. Innovations like shortcut-connected expert parallelism further optimize communication in MoE layers, achieving up to 1.5x speedup on GPUs by overlapping computations and reducing all-to-all bottlenecks in hybrid data-expert pipelines. These developments, including scalable adaptations for multi-domain tasks and frameworks like NeMo Automodel for efficient large-scale training, have enabled efficient training of models with hundreds of billions of parameters on TPU and GPU clusters, pushing the boundaries of scalability while maintaining computational efficiency.
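The synchronization step at the heart of data-parallel training can be sketched with MPI in C: each rank stands in for a model replica computing gradients on its own data shard, and an allreduce averages them—the same collective pattern that ring-allreduce implementations like Horovod optimize. The gradient values below are placeholders:

```c
#include <mpi.h>
#include <stdio.h>

#define NPARAMS 4

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Pretend these gradients came from this rank's mini-batch. */
    double grad[NPARAMS], avg[NPARAMS];
    for (int i = 0; i < NPARAMS; i++) grad[i] = rank + 0.1 * i;

    /* Sum gradients across all ranks, then divide to average; every
     * model replica ends up with the same update. */
    MPI_Allreduce(grad, avg, NPARAMS, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    for (int i = 0; i < NPARAMS; i++) avg[i] /= size;

    if (rank == 0)
        printf("averaged grad[0] = %f\n", avg[0]);

    MPI_Finalize();
    return 0;
}
```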

Quantum and neuromorphic computing

Quantum computing introduces a paradigm of parallelism fundamentally different from classical approaches, leveraging superposition and entanglement to evaluate multiple computational paths simultaneously. This concept, known as quantum parallelism, was first formalized by David Deutsch in his 1985 proposal of the universal quantum computer, which extends the classical Turing model by allowing qubits to exist in superposition states, enabling the machine to process exponentially many inputs in parallel through a single operation. Unlike classical parallel systems that distribute tasks across multiple processors, quantum parallelism arises inherently from the quantum mechanical principles of interference and superposition, allowing a quantum computer to explore a vast solution space without explicit replication of hardware.

Seminal algorithms exemplify this capability. Shor's algorithm for integer factorization exploits quantum parallelism to compute the period of a modular exponentiation function across all possible inputs in superposition, achieving an exponential speedup over classical methods for certain problems like breaking RSA encryption. Similarly, Grover's search algorithm uses quantum parallelism to amplify the probability of finding a target item in an unsorted database, providing a quadratic speedup by evaluating the search predicate on a superposition of all database entries in parallel. These demonstrations highlight how quantum parallelism enables efficient parallel exploration of problem spaces, though it requires careful use of measurement and interference to extract useful results, distinguishing it from classical massive parallelism.

Neuromorphic computing, inspired by the massively parallel architecture of biological neural systems, designs hardware that emulates spiking neural networks (SNNs) to achieve high-efficiency parallel processing. Pioneered by Carver Mead in the late 1980s, neuromorphic systems replace von Neumann architectures with distributed, event-driven networks where neurons and synapses operate asynchronously and in parallel, mimicking the brain's ability to process sensory data through localized computations without centralized control. This parallelism is inherent in the design: each core handles independent computations, enabling simultaneous updates across thousands or millions of units with minimal overhead, as synaptic weights and activations are co-located to reduce data movement.

Key implementations underscore this parallel paradigm. IBM's TrueNorth chip, released in 2014, integrates 1 million neurons and 256 million synapses across 4,096 neurosynaptic cores, supporting highly parallel, low-power simulation of SNNs for tasks like pattern recognition, where each core processes local events independently to achieve real-time performance at 65 mW. Intel's Loihi chip, introduced in 2018, advances this with on-chip learning and 128 neuromorphic cores, each managing up to 1,024 neurons, facilitating parallel spike routing and synaptic updates for adaptive computing in robotics applications, outperforming traditional GPUs in energy efficiency for sparse, event-based workloads. These systems prioritize conceptual parallelism over raw throughput, offering scalable alternatives to conventional parallel processing for brain-like tasks such as optimization and sensory fusion.

References

  1. [1]
    [PDF] INTRODUCTION TO PARALLEL COMPUTING - Harvard University
    Mar 25, 2016 · In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:.
  2. [2]
    Parallel Computing: Overview, Definitions, Examples and ...
    Parallel computing is the use of two or more processors (cores, computers) in combination to solve a single problem.
  3. [3]
    Parallel Computing: Theory and Practice
    The goal of this book is to cover the fundamental concepts of parallel computing, including models of computation, parallel algorithms, and techniques for ...<|separator|>
  4. [4]
    Parallel Processing, 1980 to 2020 - Illinois Experts
    Here, we cover the evolution of the field since 1980 in: parallel computers, ranging from the Cyber 205 to clusters now approaching an exaflop, to multicore ...
  5. [5]
    [PDF] A View of the Parallel Computing Landscape - People @EECS
    The whole microprocessor industry thus declared that its future was in parallel computing, with an increasing number of processors or cores each technology ...
  6. [6]
    Taxonomy of Parallel Computers - Cornell Virtual Workshop
    Flynn's taxonomy classifies parallel computers by instruction and data streams: SISD, SIMD, MISD, and MIMD, based on single or multiple streams.
  7. [7]
    Parallel | Princeton Research Computing
    Parallel programming involves breaking up code into smaller tasks or chunks that can be run simultaneously.Serial versus Parallel... · Four Basic Types of Parallel...
  8. [8]
    [PDF] Introduction to Parallel Programming and pMatlab v2.0
    4. Computation and Communication. Parallel programming improves performance by breaking down a problem into smaller sub- problems that are distributed to ...
  9. [9]
    [PDF] Parallel Computing- Pros and Cons
    Apr 20, 2021 · Parallel computation makes the compute resource more scalable and we can break the boundary of memory and the processors to solve a particular ...
  10. [10]
    Science and the Future of Computing: Parallel Processing to Meet ...
    Apr 13, 2011 · It sets a path forward to sustain growth in computer performance so that we can enjoy the next level of benefits to society. This book and ...
  11. [11]
    [PDF] Introduction to Parallel Computing Issues
    Parallel Computing Challenges. • It is not easy to develop an efficient parallel program. • Some Challenges: – Parallel Programming. – Complex Algorithms. – ...
  12. [12]
    [PDF] Fundamentals Of Parallel Computer Architecture
    Oct 9, 2025 · Common challenges include managing synchronization between processors, minimizing communication overhead, ensuring load balancing, and ...
  13. [13]
    Introduction to Parallel Computing Tutorial - | HPC @ LLNL
    Refers to a parallel system's (hardware and/or software) ability to demonstrate a proportionate increase in parallel speedup with the addition of more resources ...
  14. [14]
    [PDF] Validity of the Single Processor Approach to Achieving Large Scale ...
    This article was the first publica- tion by Gene Amdahl on what became known as Amdahl's Law. Interestingly, it has no equations and only a single figure. For ...
  15. [15]
    [PDF] REEVALUATING AMDAHL'S LAW - John Gustafson
    1. Amdahl, G.M. Validity of the single-processor approach to achieving large scale computing capabilities. In AFIPS Conference Proceedings vol. 30 (Atlantic ...
  16. [16]
    The Impact of Data Dependence Analysis on Compilation and ...
    Data dependence testing is very important for automatic parallelization, vectorization and any other code transformation. In this paper we examine the impact of ...
  17. [17]
    [PDF] Shared Memory - CMSC 611: Advanced Computer Architecture
(bus bandwidth, memory access time and support for address translation). Scalability is limited given that the communication model is so tightly coupled ...
  18. [18]
  19. [19]
    [PDF] An Introduction to the Intel QuickPath Interconnect
    The Intel® QuickPath Interconnect implements a modified format of the MESI coherence protocol. The standard MESI protocol maintains every cache line in one ...
  20. [20]
    [PDF] A Primer on Memory Consistency and Cache Coherence, Second ...
    ... specification of the MESI protocol, including transient states. Differences with respect to the MSI protocol are highlighted with boldface font. The protocol ...
  21. [21]
    [PDF] History and Future Trends of Multicore Computer Architecture
    The literature review focused on the architecture of a multicore processor by exploring the concept of multicore technology before documenting the details of ...
  22. [22]
    (PDF) Multi-core processors - An overview - ResearchGate
    This paper briefs on evolution of multi-core processors followed by introducing the technology and its advantages in today's world.
  23. [23]
    [PDF] 3.0—Multiprocessing - Higher Education | Pearson
    SMP systems do not usually exceed 16 processors, although newer machines released by Unix vendors support up to 64. Modern SMP software permits several CPUs to ...<|separator|>
  24. [24]
    [PDF] Performance Bottlenecks On Large-Scale Shared-Memory ...
    Contention for the shared bus limits the effective size of this architecture ... The RAC significantly reduces the memory bandwidth requirements of the memory.
  25. [25]
    [PDF] Evaluation and Optimization of Multicore Performance Bottlenecks in ...
    Due to shared resources in the memory hierarchy, multicore applications tend to be limited by off-chip bandwidth. At first glance, other optimization strategies ...
  26. [26]
    [PDF] Introduction to InfiniBand™ for End Users - Networking
    Although it is possible to run MPI on a shared memory system, the more common deployment is as the communication layer connecting the nodes of a cluster.
  27. [27]
    Overview -- History - Beowulf.org
    The Beowulf Project was started. The initial prototype was a cluster computer consisting of 16 DX4 processors connected by channel bonded Ethernet.
  28. [28]
    [PDF] History and overview of high performance computing
    Beowulf Clusters, 1994-present. In 1994 Donald Becker and Tom. Stirling, both at NASA, built a cluster using available PCs and networking hardware. 16 Intel ...
  29. [29]
    TOP500: Home -
The 65th edition of the TOP500 showed that the El Capitan system retains the No. 1 position. With El Capitan, Frontier, and Aurora, there are now 3 Exascale ...
  30. [30]
    What is Grid Computing? | IBM
    Grid computing is a type of distributed computing that brings together various compute resources located in different places to accomplish a common task.
  31. [31]
    AWS ParallelCluster - Amazon Web Services
AWS ParallelCluster is an open source cluster management tool that makes it easy for you to deploy and manage High Performance Computing (HPC) clusters on AWS.
  32. [32]
    HPC solution | Google Cloud
    Tackle your most demanding HPC workloads with confidence. Google Cloud gives you immediate access to the latest CPUs, GPUs, and storage, ...
  33. [33]
    [PDF] Parallel Computing Platforms - CS@Purdue
Parallel computing platforms address processor, memory, and datapath bottlenecks. Topics include communication models, physical organization, and mapping ...
  34. [34]
    [PDF] Chapter 2. Parallel Architectures and Interconnection Networks
    Parallel architectures include processor arrays, multiprocessors, and multicomputers. Network topologies describe how nodes are connected, and are the heart of ...
  35. [35]
    [PDF] GPGPU COMPUTING - arXiv
We will present the benefits of the CUDA programming model. We will also compare the two main approaches, CUDA and AMD APP (STREAM), and the new framework, ...
  36. [36]
    [PDF] GPGPU PROCESSING IN CUDA ARCHITECTURE - arXiv
In this paper, we will show how CUDA can fully utilize the tremendous power of these GPUs. CUDA is NVIDIA's parallel computing architecture. It enables dramatic ...
  37. [37]
    Trends of CPU, GPU and FPGA for high-performance computing
    In this paper, we compare the trends of these computing architectures for high-performance computing and survey these platforms in the execution of ...
  38. [38]
    [PDF] Self-Partial and Dynamic Reconfiguration Implementation for AES ...
    This paper presents an optimal implementation of the AES. (Advanced Encryption Standard) cryptography algorithm by the use of a dynamic partially reconfigurable ...
  39. [39]
    A Survey of Parallel Implementations for Model Predictive Control
    Mar 11, 2019 · This paper reviews methods to accelerate MPC, including parallel computing using FPGAs, multi-core CPUs, and many-core GPUs.
  40. [40]
    [PDF] Addressing the Environmental Impact of Bitcoin Mining - arXiv
    Nov 14, 2024 · The paper examines the fundamental process of. Bitcoin mining, highlighting its energy-intensive proof-of-work mechanism, and provides a ...
  41. [41]
    In-Datacenter Performance Analysis of a Tensor Processing Unit
    Apr 16, 2017 · This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU)---deployed in datacenters since 2015 that accelerates the inference phase of neural ...
  42. [42]
    [PDF] arXiv:2410.05686v2 [cs.DC] 12 Dec 2024
    Dec 12, 2024 · ASICs are specialized chips designed exclusively for cryptocurrency mining. They are far more powerful and energy-efficient than both GPUs ...
  43. [43]
    Overcoming the Limitations of Conventional Vector Processors
Vector processors traditionally require an aggressive memory system. High memory bandwidth is inherently necessary to match the high throughput of arithmetic ...
  44. [44]
    Performance Evaluation of a Next-Generation SX-Aurora TSUBASA ...
    Apr 24, 2023 · In this paper, we analyze the performance of a prototype SX-Aurora TSUBASA supercomputer equipped with the brand-new Vector Engine (VE30) processor.
  45. [45]
    Preparing for HPC on RISC-V: Examining Vectorization and ...
    The RISC-V vector specification follows in the tradition of vector processors found in the CDC STAR-100, the Cray-1, the Convex C-Series, and the NEC SX ...
  46. [46]
    A Comprehensive Exploration of Languages for Parallel Computing
    Jan 18, 2022 · In this article, we conduct a systematic literature review of programming and modeling languages for parallel computing platforms. This ...
  47. [47]
    [PDF] Parallel programming with Fortran 2008 and 2018 coarrays
    Coarrays were first introduced in Fortran 2008 standard. Coarrays are intended for single program - multiple data (SPMD) type parallel programming.
  48. [48]
    [PDF] Co-Array Fortran for parallel programming - UCLA CS
Abstract. Co-Array Fortran, formerly known as F--, is a small extension of Fortran 95 for parallel processing. A Co-Array Fortran program is interpreted as ...
  49. [49]
    Parallelism (The Java™ Tutorials > Collections > Aggregate ...
    You can execute streams in serial or in parallel. When a stream executes in parallel, the Java runtime partitions the stream into multiple substreams. Aggregate ...
  50. [50]
    Goroutines - A Tour of Go
A goroutine is a lightweight thread managed by the Go runtime. The evaluation of f, x, y, and z happens in the current goroutine and the execution of f ...
  51. [51]
    OpenMP: Home
The OpenMP API supports multi-platform shared-memory parallel programming in C/C++ and Fortran. The OpenMP API defines a portable, scalable model.
  52. [52]
    MPI Forum
This website contains information about the activities of the MPI Forum, which is the standardization forum for the Message Passing Interface (MPI).
  53. [53]
    Concurrent Programming — Erlang System Documentation v28.1.1
Erlang's ability to handle concurrency and distributed programming. By concurrency is meant programs that can handle several threads of execution at the same ...
  54. [54]
    CUDA C++ Programming Guide
    The programming guide to the CUDA model and interface.
  55. [55]
    Halide
Halide is a programming language designed to make it easier to write high-performance image and array processing code on modern machines.
  56. [56]
    Halide: a language and compiler for optimizing parallelism, locality ...
    Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. Authors: Jonathan Ragan-Kelley.
  57. [57]
    [PDF] Algorithms for scalable synchronization on shared-memory ...
Spin locks provide a means for achieving mutual exclusion (ensuring that only one processor can access a particular shared data structure at a time) and are a ...
  58. [58]
    Monitors: an operating system structuring concept
This paper develops Brinch-Hansen's concept of a monitor as a method of structuring an operating system. It introduces a form of synchronization, ...
  59. [59]
    [PDF] Shared Memory Consistency Models: A Tutorial - Computer Science
    The goal of this tutorial article is to provide a description of sequential consistency and other more relaxed memory consistency models in a way that would be ...
  60. [60]
    Transactional memory: architectural support for lock-free data ...
    This paper introduces transactional memory, a new multiprocessor architecture intended to make lock-free synchronization as efficient (and easy to use) as ...
  61. [61]
    [PDF] Prefix Sums and Their Applications
    This section describes an algorithm for calculating the scan operation in parallel. For p processors and a vector of length n on an EREW PRAM, the algorithm has ...
  62. [62]
    [PDF] parallel merge sort - CMU School of Computer Science
    A parallel merge sort on CREW PRAM uses n processors and O(log n) time, with a small constant, and performs 5/2n log n comparisons.
  63. [63]
    [PDF] Work-stealing for mixed-mode parallelism by deterministic team ...
    Dec 22, 2010 · Abstract. We show how to extend classical work-stealing to deal also with data parallel tasks that can require any number of threads r ≥ 1.
  64. [64]
    [PDF] Parallel Prefix Sum (Scan) with CUDA
    Apr 1, 2007 · In this document we introduce Scan and describe step-by-step how it can be implemented efficiently in NVIDIA CUDA. We start with a basic naïve ...
  65. [65]
    [PDF] A Work-Efficient Parallel Breadth-First Search Algorithm (or How to ...
Jun 15, 2010 · In this paper, we present a parallel BFS algorithm, called PBFS, whose performance scales linearly with the number of processors and for which ...
  66. [66]
    [PDF] DSMR: A Parallel Algorithm for Single-Source Shortest Path Problem
Jun 3, 2016 · In this paper, we introduce the Dijkstra Strip Mined Relaxation (DSMR) algorithm, an efficient parallel SSSP algorithm for shared and ...
  67. [67]
    Dynamic Load Balancing Strategy for Scalable Parallel Systems
    INTRODUCTION. This paper focuses on dynamic load balancing strategies designed to minimize the total execution time of a single application running in parallel ...
  68. [68]
    [PDF] Parallel Computing Strategies for Irregular Algorithms
    Partitioning the sparse matrix is required on distributed-memory architectures, but can be beneficial even on shared-memory machines by enforcing data locality.
  69. [69]
    Scalability - ECMWF
    Efficiency gains in all parts of the forecasting system are required in order to make a goal such as a 5 km horizontal resolution for ECMWF's ensemble forecasts ...
  70. [70]
    Scalable Molecular Dynamics with NAMD - PMC - PubMed Central
    NAMD is a parallel molecular dynamics code for high-performance simulation of large biomolecular systems, designed to enable simulation of 100,000+ atoms.
  71. [71]
    Apache Hadoop
    The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple ...
  72. [72]
    [PDF] Apache Spark: A Unified Engine for Big Data Processing
Nov 2, 2016 · Performance of logistic regression in Hadoop MapReduce vs. Spark for 100GB of data on 50 m2.4xlarge EC2 nodes.
  73. [73]
    Documentation: 18: Chapter 15. Parallel Query - PostgreSQL
Parallel query in PostgreSQL uses multiple CPUs to answer queries faster, often significantly speeding up queries that touch large amounts of data.
  74. [74]
    8.1 Parallel Execution Concepts - Oracle Help Center
    Parallel execution uses multiple CPU and I/O resources to execute a single SQL statement, breaking down tasks so many processes work simultaneously.
  75. [75]
    [PDF] Ray Tracing for the Movie 'Cars' - Pixar Graphics Technologies
    This paper describes how we extended Pixar's RenderMan renderer with ray tracing abilities. In order to ray trace highly complex.
  76. [76]
    [PDF] Parallel Tools in HEVC for High-Throughput Processing
    Oct 23, 2014 · 264/AVC video codec implementations. This makes it difficult to achieve the high throughput necessary for high resolution and frame-rate videos.
  77. [77]
    Parallel computing in finance for estimating risk-neutral densities ...
    Parallel computing, using GPUs, is used to estimate risk-neutral densities for option pricing, addressing computational challenges in nonparametric methods.
  78. [78]
    Parallelizing High-Frequency Trading Applications by Using C++11 ...
    The REPARA methodology consists in a systematic way to express parallel patterns by annotating the source code using C++11 attributes transformed automatically.
  79. [79]
    [PDF] Fault tolerance techniques for high-performance computing
Key fault tolerance techniques include checkpointing (coordinated and hierarchical), fault prediction, replication, and application-specific methods like ABFT.
  80. [80]
    A survey of fault tolerance mechanisms and checkpoint/restart ...
Feb 12, 2013 · In this paper, we briefly review the failure rates of HPC systems and also survey the fault tolerance approaches for HPC systems and issues with these ...
  81. [81]
    [PDF] Practical Byzantine Fault Tolerance
This paper describes a new replication algorithm that is able to tolerate Byzantine faults. We believe that Byzantine-fault-tolerant algorithms will be ...
  82. [82]
    A survey of fault tolerance in cloud computing - ScienceDirect.com
    This paper presents a comprehensive overview of fault tolerance-related issues in cloud computing; emphasizing upon the significant concepts, architectural ...
  83. [83]
    Parallel Processing - CHM Revolution - Computer History Museum
    ILLIAC IV wasn't very reliable, but did prove that "single instruction, multiple data" designs worked. It was particularly good for problems in computational ...
  84. [84]
    CRI Cray X-MP | Computational and Information Systems Lab
Each X-MP processor could execute two instructions in 8.5 nanoseconds, and the system as a whole had a peak computation rate of 941 million floating-point ...
  85. [85]
    Richard Feynman and The Connection Machine - Long Now
    The machine, as we envisioned it, would contain a million tiny computers, all connected by a communications network. We called it a "Connection Machine."
  86. [86]
    INMOS TN20 - Communicating processes and occam - transputer.net
    The body of an occam procedure may be any process, sequential or parallel. To ensure that expression evaluation has no side effects and always terminates ...
  87. [87]
    The Roots of Beowulf - NASA Technical Reports Server (NTRS)
    Oct 13, 2014 · The first Beowulf Linux commodity cluster was constructed at NASA's Goddard Space Flight Center in 1994 and its origins are a part of the ...
  88. [88]
    MPI Standard
    The MPI Forum home page has links to the official copies of both the MPI 1.1, 1.2, and 2.0 standard documents. The MPI-2 Forum has completed its work. The MPI-2 ...
  89. [89]
    Dual Core Era Begins, PC Makers Start Selling Intel-Based PCs
    Apr 18, 2005 · Intel's first dual-core processor-based platform includes the Intel® Pentium® Processor Extreme Edition 840 running at 3.2 GHz and the Intel® ...
  90. [90]
    CUDA Zone - Library of Resources | NVIDIA Developer
    Ian Buck later joined NVIDIA and led the launch of CUDA in 2006, the world's first solution for general-computing on GPUs. Since its inception, the CUDA ...
  91. [91]
    Overview of the ECP - Exascale Computing Project
    Exascale computing enables the capability to tackle challenges in scientific discovery, manufacturing R&D, and national security at levels of complexity and ...
  92. [92]
    June 2022 - TOP500
    The No. 1 spot is now held by the Frontier system at Oak Ridge National Laboratory (ORNL) in the US. Based on the latest HPE Cray EX235a architecture and ...
  93. [93]
    [PDF] A “Hands-on” Introduction to OpenMP*
OpenMP pre-history: OpenMP based upon SMP directive standardization efforts PCF and aborted ANSI X3H5 – late 80's. Nobody fully implemented either standard.
  94. [94]
    Megatron-LM: Training Multi-Billion Parameter Language Models ...
    Sep 17, 2019 · In this work, we present our techniques for training very large transformer models and implement a simple, efficient intra-layer model parallel approach.
  95. [95]
    [1911.07652] Information-Theoretic Perspective of Federated Learning
Nov 15, 2019 · An approach to distributed machine learning is to train models on local datasets and aggregate these models into a single, stronger model. A ...
  96. [96]
    Quantum theory, the Church–Turing principle and the universal ...
    A class of model computing machines that is the quantum generalization of the class of Turing machines is described, and it is shown that quantum theory and the ...
  97. [97]
    [PDF] Quantum theory, the Church-Turing principle and the universal ...
    Parallel processing on a serial computer. Quantum theory is a theory of parallel interfering universes. There are circumstances under which different ...
  98. [98]
    [PDF] Neuromorphic electronic systems - Proceedings of the IEEE
Carver A. Mead is Gordon and Betty Moore Professor of Computer Science at the California Institute of Technology, Pasadena, where he has ...
  99. [99]
  100. [100]
    Opportunities for neuromorphic computing algorithms and applications
    Jan 31, 2022 · Highly parallel operation: neuromorphic computers are inherently parallel, where all of the neurons and synapses can potentially be operating ...
  101. [101]
    [PDF] Loihi: A Neuromorphic Manycore Processor with On-Chip Learning
    Loihi is a 60-mm2 chip fabricated in Intel's 14-nm process that advances the state-of-the-art modeling of spiking neural networks in silicon.