
Parallel programming model

A parallel programming model is an abstraction provided by languages, libraries, or runtime systems that enables developers to express, manage, and execute concurrent computations across multiple processing units, such as multi-core CPUs, GPUs, or distributed clusters, to achieve improved performance and scalability over sequential programs. These models address key challenges in parallelism, including task decomposition, data distribution, synchronization, and communication, while abstracting complexities to enhance programmer productivity and portability across diverse architectures.

Parallel programming models can be broadly classified by their approach to memory access and communication. Shared-memory models, such as OpenMP and POSIX Threads, allow multiple threads to access a common address space, facilitating easier data sharing but requiring explicit mechanisms like locks or barriers to prevent race conditions. In contrast, message-passing models, exemplified by the Message Passing Interface (MPI), treat processes as independent entities communicating via explicit messages over distributed-memory systems, which is well suited for large-scale clusters but introduces overhead from network latency. Data-parallel models, like CUDA for GPUs, focus on applying the same operation across arrays or vectors simultaneously, leveraging SIMD-style architectures for high-throughput tasks such as scientific simulations or machine learning.

Emerging and hybrid models extend these foundations to handle heterogeneous environments, including many-core processors and accelerators. Partitioned global address space (PGAS) models, such as Unified Parallel C (UPC) and Chapel, provide a global view of data while partitioning it locally to minimize communication costs, supporting both task and data parallelism with features like automatic load balancing. Hybrid approaches, combining MPI for inter-node communication with OpenMP for intra-node threading, are common in high-performance computing (HPC) applications, where clusters dominate the supercomputing landscape, comprising nearly all systems on the TOP500 list as of November 2025. These models prioritize performance metrics like locality, load balance, and degree of parallelism, alongside productivity through higher-level abstractions and portability via compilers or runtimes.

The evolution of parallel programming models reflects advances in hardware, from multi-core processors to exascale systems, driving innovations in areas like memory consistency (e.g., sequential, weak, or release consistency) and execution paradigms (e.g., SPMD or MPMD under MIMD architectures). By enabling efficient exploitation of concurrency, these models underpin critical applications in fields like scientific computing, big data processing, and machine learning, though challenges such as debugging concurrent code and ensuring correctness persist.

Fundamentals

Definition and Scope

A parallel programming model is an abstraction that defines how computational tasks are divided into concurrent subtasks, coordinated for synchronization and communication, and executed across multiple processing units to achieve faster problem-solving or greater scalability compared to sequential approaches. This model serves as a bridge between application logic and underlying hardware, enabling developers to express parallelism without directly managing low-level details such as thread scheduling or memory allocation. The scope of parallel programming models spans multiple abstraction levels, from hardware architectures like multi-core CPUs and distributed clusters to software interfaces such as APIs (e.g., OpenMP for shared-memory parallelism) and high-level languages (e.g., Chapel for productivity-oriented parallel coding). These models emphasize scalability, particularly in distributed systems where tasks may run on thousands of nodes connected by high-speed networks, allowing applications to handle larger datasets or more complex computations by leveraging increased resources. Parallel programming models enable significant performance benefits, including reduced execution time, as quantified by Amdahl's law, which describes the theoretical speedup limit for a fixed-size problem when parallelized across p processors with a serial fraction f. The speedup S is given by
S = \frac{1}{f + \frac{1-f}{p}} ,
where the serial portion f (0 ≤ f ≤ 1) bottlenecks overall gains, even as p grows large; for example, if f = 0.05, S approaches 20 but never exceeds it regardless of p. For scaled problems where problem size increases with available processors, Gustafson's law provides a complementary perspective, showing that efficiency remains high since the parallelizable portion dominates, thus supporting near-linear speedup in many real-world scenarios like scientific simulations.
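The contrast between the two laws can be made concrete with a short calculation. The C program below is an illustrative sketch, not part of any standard library; the function names and the choice of f = 0.05 are this example's own. It evaluates Amdahl's fixed-size speedup and Gustafson's scaled speedup for increasing processor counts, showing the former saturating near 1/f while the latter keeps growing.

```c
#include <stdio.h>

/* Illustrative sketch: evaluate Amdahl's and Gustafson's speedup formulas
 * for a serial fraction f and processor count p. Function names are
 * arbitrary choices for this example. */
static double amdahl_speedup(double f, double p)    { return 1.0 / (f + (1.0 - f) / p); }
static double gustafson_speedup(double f, double p) { return f + p * (1.0 - f); }  /* scaled speedup */

int main(void) {
    const double f = 0.05;  /* 5% of the work is assumed inherently serial */
    for (int p = 1; p <= 1024; p *= 4) {
        printf("p=%4d  Amdahl S=%6.2f  Gustafson S=%7.2f\n",
               p, amdahl_speedup(f, p), gustafson_speedup(f, p));
    }
    /* Amdahl's speedup saturates near 1/f = 20, while Gustafson's scaled
     * speedup grows almost linearly with p. */
    return 0;
}
```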
While parallel programming models abstract concurrency, they are distinct from underlying hardware architectures, such as those classified by Flynn's taxonomy into SIMD (single instruction, multiple data streams, e.g., vector processors) and MIMD (multiple instruction, multiple data streams, e.g., multicore systems), which influence model selection but do not dictate implementation details.

Historical Evolution

The roots of parallel programming models trace back to the 1960s, when early efforts in supercomputing introduced vector processing and array-processor architectures as precursors to later parallel paradigms. Pipelined and vector designs of the mid-1960s enabled operations on arrays of data, laying groundwork for SIMD-style parallelism by processing multiple elements simultaneously through hardware instructions. Concurrently, the ILLIAC IV project, initiated in the mid-1960s and becoming operational in 1975, represented one of the first attempts at large-scale parallel computation with up to 256 processing elements organized in a SIMD array, though its programming complexity highlighted early challenges in model design. A foundational contribution came from Michael J. Flynn's 1966 taxonomy, which classified computer architectures into SISD, SIMD, MISD, and MIMD categories based on instruction and data stream multiplicity, providing a theoretical framework that directly influenced the development of subsequent parallel models.

The 1980s and 1990s marked the maturation of practical parallel programming models, driven by the proliferation of distributed and shared-memory systems in supercomputing. Shared-memory approaches gained traction with the standardization of POSIX threads (Pthreads) in 1995, which provided a portable API for multithreaded programming on symmetric multiprocessors, enabling lightweight task creation and synchronization within a unified address space. In parallel, message-passing models emerged to address distributed-memory architectures: the Parallel Virtual Machine (PVM), developed from 1989 at Oak Ridge National Laboratory as an early portable library for heterogeneous clusters, facilitated explicit communication via send-receive primitives and was first publicly released in 1991. This was followed by the Message Passing Interface (MPI) standard, formalized in 1994 by the MPI Forum (with roots in 1992 discussions), which became the de facto model for distributed-memory programming by offering robust collective operations and point-to-point messaging across scalable clusters.

Entering the 2000s, parallel programming models adapted to new hardware trends, including GPUs and hybrid systems, while addressing the shift from uniprocessor dominance to ubiquitous multicore processors in the mid-2000s, prompted by power and thermal constraints that ended traditional clock-speed scaling, a trend that has continued with processors reaching hundreds of cores into the 2020s, such as AMD's EPYC series reaching 192 cores in 2024. NVIDIA's CUDA framework, launched in 2006, revolutionized GPU computing by introducing a single-instruction multiple-thread (SIMT) model, allowing programmers to write kernels that exploit thousands of cores for data-parallel tasks like scientific simulations and rendering. Hybrid models evolved with OpenMP, first specified in 1997 for shared-memory directives but extended in subsequent versions (e.g., 3.0 in 2008) to support task-based parallelism and accelerators, bridging directive-based simplicity with multicore and heterogeneous environments. The partitioned global address space (PGAS) paradigm gained prominence through Unified Parallel C (UPC) in 1999, which enabled one-sided communication in a global address space partitioned among nodes, and IBM's X10 language released in 2004, which integrated PGAS with object-oriented features for distributed place-based programming. These developments were propelled by broader shifts, including the multicore era post-2005, where processors like Intel's Core Duo emphasized thread-level parallelism to sustain performance gains amid Dennard scaling's collapse.
The rise of big data analytics further drove innovations, such as Apache Spark in 2010 from UC Berkeley's AMPLab, which introduced resilient distributed datasets (RDDs) for fault-tolerant in-memory computation across clusters, unifying batch, streaming, and iterative workloads in a higher-level abstraction over distributed frameworks. Underpinning these evolutions were challenges from the end of Dennard scaling around 2005, which decoupled transistor density gains from power efficiency and led to "dark silicon", unused chip area constrained by power budgets, necessitating energy-efficient parallel models that prioritize sparsity, locality, and heterogeneous acceleration to maximize utilization without excessive power dissipation.

Subsequent decades saw further maturation to support exascale computing and diverse workloads. OpenMP advanced with version 4.0 in 2013 adding SIMD directives and accelerator offload, 5.0 in 2018 expanding tasking and device-offload capabilities, and 6.0 in November 2024 extending loop transformations and support for newer accelerator and AI workloads. MPI progressed through 3.0 (2012) with non-blocking collectives and 4.0 (2021) with sessions, persistent collectives, and improved error handling. Exascale systems, such as the U.S. DOE's Frontier supercomputer achieving 1.1 exaFLOPS in 2022, relied on hybrid models combining MPI for inter-node and OpenMP for intra-node parallelism, alongside PGAS languages like Chapel (stable releases from 2009 onward). Emerging paradigms addressed heterogeneous accelerators and AI, with task-based models in frameworks like Intel oneAPI (2020) and Legion (2010s) enabling asynchronous execution for irregular workloads.

Classifications

Interaction Paradigms

Interaction paradigms in parallel programming models refer to the mechanisms by which processes or threads communicate, coordinate, and share data during execution. These paradigms define how parallelism is expressed and managed at the level of inter-process interactions, influencing both programmer productivity and system performance. Common paradigms include shared memory, message passing, the partitioned global address space (PGAS), and implicit interaction, each offering distinct approaches to handling concurrency and data access in multiprocessor environments.

In the shared memory paradigm, multiple processes or threads access a common address space, allowing direct reads and writes to shared variables without explicit data transfer. This model simplifies programming by enabling straightforward data sharing, particularly for complex data structures like graphs or trees, as references can be passed among threads with minimal overhead. However, it introduces challenges such as cache coherence overhead, where maintaining consistent views of shared data across processors requires protocol enforcement, potentially leading to performance bottlenecks in large-scale systems. To ensure safe access, mechanisms like mutexes are commonly used for mutual exclusion, preventing concurrent modifications to critical sections.

The message passing paradigm relies on explicit communication primitives, such as send and receive operations, to exchange data between processes that do not share a memory space. This approach is particularly suited to distributed systems, like clusters of independent machines, where processes operate in isolated address spaces and coordinate solely through messages. Message passing supports both synchronous modes, where sender and receiver rendezvous before proceeding, and asynchronous modes, allowing non-blocking sends and receives for better overlap of computation and communication. Additionally, collective operations, such as broadcasts or reductions involving multiple processes, enable efficient group communication, as standardized in libraries like MPI.

The partitioned global address space (PGAS) paradigm offers a hybrid model, providing each process with a local, fast-access memory region while exposing a global address space for remote access. This design promotes data locality by assigning ownership of data partitions to specific processes, reducing unnecessary data movement and improving performance on heterogeneous architectures. PGAS facilitates one-sided communication, where a process can directly read from or write to another process's memory without involving the remote process in the transfer, enhancing expressiveness for irregular data access patterns. Languages like UPC and Chapel exemplify this model, balancing the ease of shared memory with the control of explicit messaging.

Implicit interaction paradigms abstract communication and coordination from the programmer, relying on runtime systems or compilers to manage parallelism automatically. In these models, the developer writes sequential-like code, and the system infers and enforces parallel execution, often through techniques like automatic thread scheduling or speculative execution. A prominent example is software transactional memory (STM), introduced in 1995, which treats groups of operations as atomic transactions; conflicts are detected at runtime, and transactions are rolled back and retried without explicit locks. This approach simplifies concurrent programming for dynamic data structures but may incur overhead from conflict resolution.

Comparing these paradigms reveals key trade-offs in scalability, programmability, and fault tolerance. Shared memory excels in symmetric multiprocessor (SMP) environments due to its programming simplicity but scales poorly beyond a few processors because of contention and cache-coherence costs.
Message passing supports better scalability on large clusters by avoiding shared state, though it requires more explicit effort from programmers. PGAS bridges these by offering scalable data distribution with global visibility, while implicit models like STM enhance productivity at the cost of potential runtime overheads. Regarding fault tolerance, message passing inherently supports distributed recovery through isolated processes, whereas shared memory models often assume reliable hardware and struggle with node failures due to centralized state.
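To make the message-passing style concrete, the following minimal C sketch uses two standard MPI point-to-point calls, MPI_Send and MPI_Recv, to pass a single integer from rank 0 to rank 1. It is illustrative only, with error handling omitted.

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal message-passing sketch: rank 0 sends an integer to rank 1,
 * which receives it with a blocking MPI_Recv. Compile with mpicc and
 * run with at least two processes (e.g., mpirun -np 2 ./a.out). */
int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, /*tag=*/0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, /*tag=*/0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```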

Decomposition Strategies

Decomposition strategies in parallel programming models refer to the methods used to partition a computational problem into concurrent subtasks or data portions, enabling efficient distribution across processing units. These strategies determine the granularity of parallelism, ranging from fine-grained (many small tasks) to coarse-grained (fewer larger tasks), and influence load balancing, communication overhead, and suitability for specific problem types. Key approaches include task parallelism, data parallelism, stream parallelism, implicit parallelism, and hybrid combinations, each tailored to exploit concurrency in different ways.

Task parallelism decomposes a problem into independent tasks that may vary in computational intensity and execution time, allowing dynamic assignment to processors for better load balancing. This strategy is particularly effective for irregular workloads where task dependencies form dynamic graphs, such as in tree traversals or recursive algorithms. Dynamic scheduling techniques, like work-stealing, enable idle processors to "steal" tasks from busy ones, minimizing overhead and achieving near-optimal load balance. For instance, in the Cilk runtime, work-stealing schedulers have demonstrated efficient handling of fully strict multithreaded computations on parallel machines, with theoretical guarantees of linear speedup for balanced workloads.

Data parallelism focuses on dividing data into partitions and applying identical operations across them simultaneously, often resembling single instruction multiple data (SIMD) execution. Partitioning strategies include block decomposition, where contiguous data chunks are assigned to processors, which suits problems with uniform access patterns but can lead to load imbalances if computation varies; cyclic decomposition, distributing data in a round-robin fashion to even out irregular workloads; and block-cyclic, combining both for improved balance in matrix operations or simulations. These methods ensure scalability in regular, data-intensive applications like numerical simulations, with cyclic approaches particularly reducing variance in execution times for non-uniform data. For example, in parallel matrix multiplication, block partitioning minimizes data movement while cyclic variants balance computation across processors with differing loads.

Stream parallelism organizes computation as a pipeline in which data flows through stages, each handled by a separate processing unit in a producer-consumer manner, enabling continuous overlap of operations. This model excels in applications with sequential dependencies but high throughput needs, such as video processing or encoding, where stages process unbounded data streams asynchronously. In GPUs, stream parallelism supports pipelined execution of kernels on data batches, allowing multiple stages to run concurrently for latency hiding. Seminal work on on-the-fly pipeline parallelism has shown its efficacy in organizing linear sequences of stages, achieving high utilization in stream-based computations without excessive buffering.

Implicit parallelism relies on compilers or runtimes to automatically detect and extract concurrently executable regions from sequential code, reducing programmer burden but requiring sophisticated analysis. A classic example is parallelizing independent loop iterations in Fortran using DOALL constructs, where the compiler identifies loops free of data dependences to distribute across processors. Challenges include accurate dependence analysis to avoid false dependences, privatization of scalar variables, and handling reductions, which can limit extraction to about 10-20% of loops in real codes without transformations.
Loop transformation techniques, such as those maximizing DOALL loops through loop splitting, have been pivotal in uncovering hidden parallelism in nested loops for scientific computing.

Hybrid approaches integrate multiple decomposition strategies to address complex applications requiring both task and data concurrency at varying granularities, often combining fine-grained (e.g., loop-level) and coarse-grained (e.g., function-level) parallelism. For instance, data-parallel threading within tasks can optimize inner loops, while task-level scheduling handles outer dependencies, as seen in hybrid MPI-OpenMP models for distributed shared-memory systems. Fine-grained hybrids suit dense computations with low communication, achieving better cache locality, whereas coarse-grained ones reduce overhead in sparse or irregular problems. These combinations have enabled scalable performance in multi-level parallel environments, such as finite element simulations, by mapping coarse-grained parallelism across nodes and fine-grained parallelism within them.
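The block, cyclic, and dynamic strategies described above map directly onto OpenMP's scheduling clauses. The C sketch below is illustrative only (the loop bound and printed labels are choices made for this example): schedule(static) yields a block decomposition, schedule(static, 1) a cyclic one, and schedule(dynamic) a runtime-balanced assignment for irregular workloads.

```c
#include <omp.h>
#include <stdio.h>

#define N 16

/* Illustrative sketch of data decomposition with OpenMP scheduling clauses.
 * schedule(static) assigns contiguous blocks of iterations (block decomposition);
 * schedule(static, 1) assigns iterations round-robin (cyclic decomposition);
 * schedule(dynamic) lets idle threads grab remaining chunks (load balancing). */
int main(void) {
    /* Block decomposition of the iteration space. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        printf("block:   iteration %2d on thread %d\n", i, omp_get_thread_num());

    /* Cyclic decomposition: chunk size 1, dealt out like cards. */
    #pragma omp parallel for schedule(static, 1)
    for (int i = 0; i < N; i++)
        printf("cyclic:  iteration %2d on thread %d\n", i, omp_get_thread_num());

    /* Dynamic scheduling for irregular per-iteration costs. */
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < N; i++)
        printf("dynamic: iteration %2d on thread %d\n", i, omp_get_thread_num());

    return 0;
}
```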

Core Concepts

Synchronization Mechanisms

Synchronization mechanisms in parallel programming ensure that concurrent executions maintain data consistency and avoid undefined behaviors such as race conditions, by coordinating access to shared resources and controlling the order of operations across threads or processes. These primitives range from simple locking tools to more sophisticated hardware-supported operations, each designed to balance correctness with performance in multi-threaded environments.

Barriers provide global synchronization points that all participating threads must reach before any can proceed, effectively halting execution until the entire group arrives. This is crucial for phases of parallel computation where all threads need to complete their work before the next stage begins, such as in iterative algorithms. A common efficient implementation uses sense-reversing tree barriers, where threads propagate arrival flags up a tree and flip a shared sense variable to signal completion, reducing contention and achieving latency logarithmic in the number of threads.

Locks and semaphores enforce mutual exclusion for critical sections, preventing multiple threads from simultaneously accessing shared data in ways that could lead to inconsistencies. A lock, often implemented as a binary semaphore, allows only one thread to enter a protected region via acquire (lock) and release (unlock) operations; spinlocks repeatedly poll the lock in a busy-wait loop for low-latency scenarios, while blocking locks suspend the waiting thread to save CPU cycles during longer waits. Semaphores generalize this to counting variants, supporting resource pools where the count represents available units, decremented on wait (P operation) and incremented on signal (V operation). Deadlock prevention in lock-based systems can employ strategies like the banker's algorithm, which simulates resource allocation to ensure a safe sequence exists in which all processes can complete without circular waits.

Condition variables facilitate signaling between threads, allowing one thread to wait until a specific predicate holds true, typically used in conjunction with a mutex that protects the associated shared state. In the POSIX threads (Pthreads) API, a thread calls pthread_cond_wait to atomically release the mutex and block until another thread invokes pthread_cond_signal or pthread_cond_broadcast to wake waiters, ensuring efficient notification without busy-waiting. This mechanism is foundational to higher-level constructs like monitors, enabling producer-consumer patterns where threads synchronize on shared queues.

Atomic operations and memory fences provide hardware-supported primitives for lock-free programming, allowing indivisible updates to shared variables without traditional locks. Atomics ensure that operations like compare-and-swap (CAS) or fetch-and-add execute as single indivisible instructions, preventing interleavings that could corrupt data. Fences enforce memory ordering, such as acquire fences (preventing subsequent reads from moving before the fence) and release fences (preventing prior writes from moving after), to maintain visibility of changes across threads. Memory models define these guarantees; sequential consistency requires all threads to see operations in a single total order, while relaxed models like total store order (TSO) in x86 allow optimizations for better performance at the cost of explicit fence usage.

Advanced mechanisms include futures and promises for handling asynchronous computations, where a future represents a pending result that threads can query without blocking the entire program, and a promise allows setting the value later to fulfill waiting dependents. This decouples task submission from result retrieval, improving parallelism in languages like C++ with std::future and std::promise.
Transactional memory offers composable atomicity by executing blocks of code as transactions that either commit fully or abort on conflicts, using hardware support such as cache-coherence protocols to detect and resolve races without explicit locking. Despite their utility, synchronization mechanisms introduce overheads like contention on shared primitives, which can degrade performance in large systems with thousands of threads, as tree-based barriers or lock queues may saturate at heavily contended nodes or hot spots. Debugging race conditions remains challenging, often requiring tools like thread sanitizers to detect non-deterministic errors, while high-contention scenarios may necessitate adaptive strategies that switch between locking and lock-free approaches.
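As a concrete illustration of the atomic primitives discussed above, the following C11 sketch (illustrative only; the thread and iteration counts are arbitrary choices for this example) increments a shared counter lock-free with a compare-and-swap retry loop from <stdatomic.h>, using release ordering on success so the update is visible to other threads.

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

/* Illustrative sketch of lock-free updates with C11 atomics: each thread
 * increments a shared counter with a compare-and-swap (CAS) retry loop
 * instead of a mutex. */
static _Atomic long counter = 0;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        long old = atomic_load_explicit(&counter, memory_order_relaxed);
        /* Retry until the CAS succeeds; a failure means another thread
         * updated the counter first, and 'old' is refreshed automatically. */
        while (!atomic_compare_exchange_weak_explicit(
                   &counter, &old, old + 1,
                   memory_order_release, memory_order_relaxed)) {
            /* loop until our increment is applied atomically */
        }
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("final counter = %ld\n", counter);  /* expect 400000 */
    return 0;
}
```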

Performance Considerations

Performance in parallel programming models is evaluated using key metrics that quantify how effectively additional processing resources improve execution time and resource utilization. Speedup measures the reduction in execution time when using p processors compared to a single processor, with Amdahl's law highlighting the fundamental limit imposed by the serial fraction of the workload, where even small serial portions can cap overall gains. Gustafson's law extends this by considering scaled problem sizes, emphasizing that large parallel fractions can yield near-linear speedups for larger workloads on more processors. Efficiency, defined as E = S / p, where S is the speedup, indicates the average utilization of processors, typically decreasing as p increases due to overheads. Scalability assesses how performance holds up with growing processor counts; strong scaling maintains a fixed problem size while increasing processors, often revealing limits from communication and synchronization, whereas weak scaling proportionally increases problem size with processors to sustain efficiency. The isoefficiency function provides a deeper scalability analysis by determining the problem size required to maintain constant efficiency as p grows, particularly accounting for communication overhead in distributed models; for example, in algorithms with total communication cost proportional to p \log p, the function scales as W = K p \log p, where W is the workload and K is a constant.

Bottlenecks significantly impact these metrics, with communication latency and synchronization being primary constraints in distributed systems. The LogP model captures this realistically through parameters including latency L (end-to-end delay for a small message), overhead o (processor busy time spent sending or receiving), gap g (minimum time between consecutive sends or receives), and processor count P, enabling predictions of how performance degrades for communication-intensive tasks. Load imbalance, where processors complete work at uneven rates, exacerbates inefficiencies, while Amdahl's serial fraction amplifies the issue by forcing idle time during non-parallelizable phases, often reducing efficiency below 50% beyond moderate p.

Optimization strategies target these bottlenecks to enhance performance. Improving data locality minimizes data transfers between processors or memory levels, reducing communication costs in both shared- and distributed-memory models by aligning data access patterns with hardware topology. Profiling tools such as TAU (Tuning and Analysis Utilities), which supports instrumentation across multiple languages and systems for tracing events and metrics, and Intel VTune Profiler, offering hardware-level counters for CPU, memory, and I/O analysis, enable identification of hotspots and imbalances. In NUMA systems, hybrid scaling combines thread-level parallelism (e.g., OpenMP) within nodes with process-level parallelism (e.g., MPI) across nodes to optimize memory affinity and reduce remote access latencies.

Energy and power considerations have gained prominence in parallel computing, especially for large-scale systems. Dynamic Voltage and Frequency Scaling (DVFS) adjusts processor voltage and frequency dynamically based on workload demands, cutting dynamic power, which grows roughly with the square of the supply voltage, while preserving throughput for parallel phases with variable intensity. Post-2010 trends in energy-efficient HPC emphasize energy proportionality, with initiatives like the Green500 list tracking efficiency in gigaflops per watt, showing steady improvements from hardware integration and software optimizations, though full exascale systems still face challenges in balancing performance and power budgets.
Standardized evaluations rely on benchmarks like the NAS Parallel Benchmarks, introduced in the early 1990s to assess parallel kernel and application performance across architectures. Research has extended these benchmarks to evaluate exascale-era features, measuring performance and scalability under modern hardware constraints.
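A brief worked example ties these metrics together; the timings are hypothetical, chosen only for illustration. Suppose a program takes T_1 = 100 seconds on one processor and T_p = 8 seconds on p = 16 processors. Then

S = \frac{T_1}{T_p} = \frac{100}{8} = 12.5 , \qquad E = \frac{S}{p} = \frac{12.5}{16} \approx 0.78 .

If the algorithm's isoefficiency function is W = K p \log p, keeping E constant while doubling p from 16 to 32 requires growing the workload by a factor of \frac{32 \log_2 32}{16 \log_2 16} = 2.5.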

Notable Implementations

Shared Memory Models

Shared memory models enable parallel execution by allowing multiple threads or processes within a single address space to access and modify common data structures, facilitating implicit communication through shared variables. These models are particularly suited to symmetric multiprocessing (SMP) systems and multicore processors, where low-latency access to unified memory simplifies programming compared to explicit message exchanges. Prominent implementations include directive-based approaches like OpenMP, low-level thread libraries such as POSIX Threads, language-integrated concurrency in Java, and higher-level abstractions like Intel Threading Building Blocks (TBB). While effective for exploiting intranode parallelism in high-performance computing (HPC) simulations and multicore applications, shared memory models face scalability limitations beyond a single node due to contention and coherence overheads in distributed environments.

OpenMP is a directive-based API for parallel programming in C, C++, and Fortran, using pragmas to annotate sequential code for parallel execution. Key constructs include #pragma omp parallel for for distributing loop iterations across threads and #pragma omp sections for parallel execution of independent code blocks. Introduced in 1997, OpenMP evolved significantly with version 3.0 in May 2008, which added task constructs (#pragma omp task) to support dynamic, irregular workloads by deferring execution of independent units until resources are available. Later versions added accelerator support through target directives (#pragma omp target), enabling offloading of computations to heterogeneous devices such as GPUs while managing data transfers implicitly; version 6.0, released in November 2024, further refines these features and simplifies parallelization of new applications with finer-grained control. In HPC simulations on multicore CPUs, OpenMP simplifies scaling to dozens of threads by handling synchronization via barriers and reductions, though NUMA effects can degrade performance at larger core counts.

POSIX Threads (Pthreads) provides a low-level, portable API defined in the POSIX.1-2001 standard for creating and managing threads in C programs on Unix-like systems. Core functions include pthread_create() to spawn a new thread with specified attributes and a start routine, pthread_join() to wait for completion, and pthread_attr_init()/pthread_attr_setdetachstate() for configuring properties like stack size or detachment. A classic use case is the producer-consumer pattern, where a producer enqueues data into a shared buffer protected by a mutex (pthread_mutex_lock()) and condition variable (pthread_cond_wait()), while consumers dequeue items upon signaling (pthread_cond_signal()), ensuring thread-safe access to the shared buffer; a condensed sketch appears at the end of this subsection. This approach is widely used in multicore applications for fine-grained control, such as real-time data processing, but requires explicit synchronization to avoid race conditions, increasing programmer burden compared to higher-level models.

Java's built-in threading support integrates shared memory concurrency directly into the language, with the java.lang.Thread class available for subclassing and the Runnable interface for task definition, plus higher-level abstractions like ExecutorService from the java.util.concurrent package for managing thread pools via submit() and shutdown(). The Java Memory Model (JMM), formalized in JSR-133 and effective from Java 5.0 in 2004, defines visibility guarantees for shared variables, ensuring that writes to volatile fields are immediately visible to other threads and establishing happens-before relationships to prevent reordering issues.
For instance, in a multicore application, an ExecutorService can parallelize tasks like image processing across cores, with volatile fields ensuring consistent visibility of shared state without full locking overhead. This model excels on multicore systems but inherits shared-memory limitations, such as potential deadlocks, and does not extend naturally to distributed JVM setups.

Modern extensions like Intel Threading Building Blocks (TBB), first released in 2006 as a C++ template library, abstract parallelism through high-level patterns, avoiding low-level thread management. TBB supports flow graphs for modeling dependencies via graph, node, and edge objects, enabling asynchronous execution of task pipelines, and provides concurrent containers like concurrent_queue and concurrent_hash_map for lock-free or fine-grained locked access in multithreaded environments. In HPC simulations on multicore CPUs, TBB's parallel_for and parallel_pipeline algorithms distribute workloads dynamically, achieving near-linear scaling up to hundreds of cores while adapting to load imbalances, though it remains confined to shared address spaces and is less suitable for distributed clusters.
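The condensed pthreads sketch below shows the producer-consumer pattern described above with a single-slot buffer. It is illustrative only (buffer size, item count, and names are choices made for this example); a realistic implementation would use a bounded queue and check return codes.

```c
#include <pthread.h>
#include <stdio.h>

/* Condensed producer-consumer sketch with a single-slot buffer, protected by
 * a mutex and two condition variables. */
static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
static int buffer;
static int full = 0;

static void *producer(void *arg) {
    (void)arg;
    for (int i = 0; i < 5; i++) {
        pthread_mutex_lock(&lock);
        while (full)                       /* wait until the slot is free */
            pthread_cond_wait(&not_full, &lock);
        buffer = i;
        full = 1;
        pthread_cond_signal(&not_empty);   /* wake a waiting consumer */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    for (int i = 0; i < 5; i++) {
        pthread_mutex_lock(&lock);
        while (!full)                      /* wait until an item is available */
            pthread_cond_wait(&not_empty, &lock);
        printf("consumed %d\n", buffer);
        full = 0;
        pthread_cond_signal(&not_full);    /* wake the producer */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```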

Message Passing Models

Message passing models enable parallel programs to operate in distributed environments by allowing independent processes to communicate explicitly through the transmission and reception of messages, without relying on a shared address space. This approach is ideal for systems comprising multiple nodes, such as clusters, where data locality and explicit control over communication are essential for scalability and efficiency. Unlike shared memory paradigms, message passing requires programmers to manage data movement, synchronization, and potential latency explicitly, fostering portability across heterogeneous hardware.

The Message Passing Interface (MPI) stands as the predominant standard for message passing, initially specified as MPI-1 on May 5, 1994, by the MPI Forum, a consortium of researchers, vendors, and users. Subsequent versions have expanded its capabilities, with the current MPI-5.0 approved on June 5, 2025, incorporating enhancements for modern HPC requirements. Core to MPI are point-to-point operations, such as MPI_Send for sending messages and MPI_Recv for receiving them, which facilitate direct, buffered or synchronous exchanges between two processes. Collective communication routines, including MPI_Allreduce, support group-wide operations like summations or broadcasts, optimizing data aggregation in parallel algorithms (see the sketch at the end of this subsection). One-sided communication primitives, such as MPI_Put and MPI_Get, first introduced in MPI-2 and substantially extended in MPI-3 on September 21, 2012, enable remote memory access without requiring active participation from the target process, reducing synchronization overhead in distributed applications.

Preceding MPI, the Parallel Virtual Machine (PVM) emerged in 1989 at Oak Ridge National Laboratory and was refined starting in 1991 with the University of Tennessee, providing an early framework for heterogeneous distributed computing. PVM emphasized dynamic process creation, management, and migration across networked machines, using message passing for inter-process coordination. Its influence on portable parallel programming persists, though it has been largely supplanted by MPI due to the latter's standardization and performance optimizations.

Unified Parallel C (UPC), an extension of ISO C for partitioned global address space (PGAS) computing, integrates message passing with shared data abstractions to simplify distributed programming. UPC allows declaration of shared arrays with the shared qualifier for global access, while unqualified data remains private to each thread, enabling implicit communication through remote references while maintaining explicit control over partitioning. This model supports one-sided operations for efficient data movement, making it suitable for applications requiring fine-grained parallelism on large-scale clusters.

In practice, message passing models excel in cluster-based scientific simulations, such as numerical weather prediction with the Weather Research and Forecasting (WRF) model, where MPI distributes computational domains across nodes to accelerate forecasts. These systems often incorporate fault tolerance through periodic checkpointing, saving process states to stable storage for restart after node failures, ensuring long-running simulations complete reliably on unreliable hardware. For less structured applications, alternatives like ZeroMQ, launched in 2007, offer lightweight, asynchronous messaging patterns such as publish-subscribe or request-reply without brokers, ideal for scalable, real-time distributed software.
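As a minimal illustration of a collective operation, the C sketch below (illustrative only; the per-rank value is arbitrary) uses MPI_Allreduce to sum one double contributed by every rank and deliver the total to all of them.

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal collective-communication sketch: every rank contributes a partial
 * sum and MPI_Allreduce combines them so all ranks see the global total.
 * Run with mpirun -np <N>. */
int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = (double)(rank + 1);  /* stand-in for a locally computed partial result */
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %.1f\n", size, global);

    MPI_Finalize();
    return 0;
}
```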
