Task parallelism
Task parallelism is a fundamental paradigm in parallel computing that involves decomposing a program into distinct, concurrently executable tasks distributed across multiple processors or cores, emphasizing the simultaneous performance of different functions or operations to enhance efficiency and scalability.[1] This approach contrasts with data parallelism, which applies the same operation to multiple data subsets simultaneously, by instead focusing on functional diversity where tasks may handle varied computations without uniform data partitioning.[2] Unlike finer-grained forms such as instruction-level parallelism, task parallelism operates at a coarser level, organizing code into processes, threads, or independent units that can run asynchronously.[3]
Key characteristics of task parallelism include the potential for tasks to be either fully independent—enabling embarrassingly parallel execution with minimal synchronization—or interdependent, requiring coordination mechanisms like locks, barriers, or futures to resolve data dependencies and maintain program correctness.[1] It is particularly suited to heterogeneous workloads, such as mixing intensive computations with input/output operations, and aligns with the Multiple Instruction, Multiple Data (MIMD) classification in Flynn's taxonomy, allowing flexible resource utilization on multicore systems.[3] For instance, in web applications, task parallelism can handle concurrent HTTP request processing, where each request operates as an independent task with little intercommunication.[3]
Task parallelism finds broad applications in domains requiring diverse concurrent operations, including multimedia processing—such as parallel video decoding and rendering—and scientific simulations where distinct algorithmic stages execute simultaneously.[4] It is also prevalent in high-performance computing for irregular workloads, like graph analytics or machine learning pipelines, where dynamic task scheduling improves throughput on distributed systems.[2] Support for task parallelism is integrated into modern programming environments, with languages and libraries like Cilk for lightweight task creation, OpenMP's task constructs[5] for directive-based parallelism, and Java's Executor framework for thread pool management, enabling developers to exploit multicore hardware without low-level thread handling.[6][7]
Core Concepts
Definition
Task parallelism is a form of parallel computing in which a computational problem is divided into multiple independent tasks—discrete units of work—that execute concurrently across different processors or cores to enhance throughput and reduce execution time compared to sequential processing.[8][9] This approach, also known as functional decomposition, emphasizes the concurrent performance of distinct operations rather than uniform processing of data elements.[8]
The key principles of task parallelism revolve around task independence, enabling minimal inter-task communication and synchronization to maximize efficiency; the potential for dynamic task creation and assignment during runtime to adapt to workload variations; and a focus on coarse-grained work division, where tasks encompass larger, self-contained computations suitable for distribution across heterogeneous resources.[9] These principles distinguish task parallelism within the broader context of parallel computing, which involves the simultaneous use of multiple compute resources to solve problems that would otherwise require sequential execution on a single processor.[10] Task parallelism first appeared in early multiprocessing systems of the 1960s and 1970s, where concurrent task handling became essential for leveraging multiple processors.[11]
The basic workflow of task parallelism begins with the identification and decomposition of a program into independent tasks, followed by their concurrent execution on available processing units, and concludes with result aggregation if dependencies exist.[8] This process improves resource utilization by allowing tasks to proceed asynchronously, with overall performance limited primarily by the longest-running task and any inherent serial components.[9]
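A minimal sketch of this workflow, assuming C++11 and using std::async purely for illustration (the two stage functions, sum_of_squares and count_even, are hypothetical placeholders rather than part of any library):

#include <functional>
#include <future>
#include <iostream>
#include <numeric>
#include <vector>

// Task 1: one kind of computation over the input.
long sum_of_squares(const std::vector<int>& v) {
    long s = 0;
    for (int x : v) s += static_cast<long>(x) * x;
    return s;
}

// Task 2: a functionally different computation over the same input.
int count_even(const std::vector<int>& v) {
    int c = 0;
    for (int x : v) c += (x % 2 == 0);
    return c;
}

int main() {
    std::vector<int> data(1000);
    std::iota(data.begin(), data.end(), 1);

    // Decomposition: two distinct tasks are launched concurrently.
    auto squares = std::async(std::launch::async, sum_of_squares, std::cref(data));
    auto evens   = std::async(std::launch::async, count_even, std::cref(data));

    // Aggregation: results are collected once both tasks complete.
    std::cout << "sum of squares = " << squares.get()
              << ", even values = " << evens.get() << '\n';
    return 0;
}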
Historical Development
The concept of task parallelism emerged in the 1960s and 1970s amid early efforts to harness multiprocessing for high-performance computing. By the mid-1970s, foundational ideas for dynamic task execution were advanced through dataflow architectures, where computations are broken into independent tasks activated by data availability rather than rigid control flow. Jack B. Dennis and David P. Misunas proposed a preliminary architecture for a basic data-flow processor in 1975, enabling asynchronous task firing and influencing subsequent models of irregular parallelism.[12]
In the 1980s, task parallelism gained practical expression in programming languages designed for concurrent systems. The Ada programming language, standardized as Ada 83 in 1983 under U.S. Department of Defense sponsorship, introduced native tasking facilities—including task types, rendezvous for synchronization, and select statements for conditional execution—to support real-time and embedded applications with reliable concurrency.[13] This marked a shift toward structured, language-level support for dynamic task creation and interaction, building on earlier multiprocessing concepts from the 1970s.[14]
The 2000s accelerated adoption due to hardware trends, particularly the transition to multicore processors. Intel's announcement in 2005 of a pivot from single-core frequency scaling to multicore designs, exemplified by the release of dual-core Pentium processors, underscored the need for software paradigms like task parallelism to exploit on-chip concurrency effectively.[11] A pivotal milestone came with OpenMP 3.0 in May 2008, which added task constructs to the API, enabling programmers to define and schedule independent tasks dynamically for irregular workloads, evolving from earlier static loop-based parallelism.[15]
Influential contributions in distributed contexts further shaped the field, with Ian Foster's work in the 1990s and 2000s on grid computing promoting task-based models for large-scale, heterogeneous environments, as detailed in his 1995 book Designing and Building Parallel Programs.[16] Post-2010 developments, driven by cloud computing and accelerators, extended dynamic tasking to scalable frameworks, transitioning from static scheduling in early multiprocessing to adaptive, runtime-managed task graphs in modern systems.[17]
Implementation Mechanisms
Task Models and Abstractions
In task-parallel systems, computations are often modeled using a directed acyclic graph (DAG), where nodes represent individual tasks and directed edges indicate dependencies between them, ensuring that dependent tasks execute only after their predecessors complete. This model captures the structure of parallel workloads by avoiding cycles that could lead to deadlocks or undefined behavior, allowing runtimes to identify opportunities for concurrent execution. The DAG approach has become a foundational abstraction for expressing irregular parallelism in applications with varying dependency patterns.[18]
A key abstraction for handling asynchronous results in these models is the use of futures and promises. A future acts as a placeholder for a value that will be computed asynchronously by a task, enabling the main program to continue without blocking until the result is needed, while a promise serves as the mechanism for the completing task to deliver that value. Futures were introduced in the context of concurrent symbolic computation to support lazy evaluation in parallel Lisp environments, promoting fine-grained parallelism without explicit synchronization. Promises, as precursors to modern asynchronous constructs, were proposed to represent eventual results in applicative programming paradigms, decoupling computation from result access.[19][20]
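A minimal sketch of the future/promise pairing, assuming a C++11 compiler and the standard <future> header (the produce function is an illustrative placeholder):

#include <future>
#include <iostream>
#include <thread>

// The completing task receives the promise and delivers the value through it.
void produce(std::promise<int> p) {
    p.set_value(6 * 7);              // fulfil the promise; readers of the future unblock
}

int main() {
    std::promise<int> result_promise;
    std::future<int> result_future = result_promise.get_future(); // placeholder for the eventual value

    std::thread producer(produce, std::move(result_promise));     // run the producer task concurrently

    // ... the main program can continue with unrelated work here ...

    std::cout << "answer = " << result_future.get() << '\n';      // block only when the result is needed
    producer.join();
    return 0;
}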
Tasks in these systems are typically treated as first-class objects, meaning they can be dynamically created, manipulated, and managed like any other data entity, with well-defined states such as created, submitted, running, and completed. This abstraction allows programmers to submit tasks to a runtime scheduler and query their status, facilitating composable parallelism where tasks can depend on or spawn others. Dependency graphs, frequently implemented as DAGs, further refine this by explicitly encoding relationships, such as data or control dependencies, to guide execution order while maximizing concurrency.[21]
Important distinctions in task models include implicit versus explicit parallelism. In implicit models, the runtime or compiler automatically detects and exploits parallel opportunities from high-level specifications, reducing programmer burden but potentially limiting control over irregular workloads; explicit models require developers to annotate or define parallel regions and dependencies directly, offering precision at the cost of added complexity. Additionally, tasks are often designed as lightweight entities, similar to threads that share process resources like memory and file descriptors for low overhead, in contrast to heavyweight processes that maintain isolated address spaces and incur higher creation and context-switching costs.[22][23]
To illustrate basic task creation and management, consider the following pseudocode, which captures the essence of submitting a task and awaiting its result in a generic task-parallel runtime:
task T = create_task(compute_function, arguments);
submit(T, scheduler);
result = wait_for_completion(T);
This pattern encapsulates task instantiation as a first-class operation, with submission integrating it into the dependency graph for execution. Scheduling mechanisms may operate on such models to prioritize ready tasks, but the abstractions themselves focus on declarative specification.[24]
Scheduling and Synchronization
In task parallelism, scheduling involves assigning tasks to available processing resources to optimize execution time and resource utilization. Static scheduling pre-allocates tasks to processors based on a known task graph prior to runtime, assuming predictable execution times and no variations in load, which minimizes runtime overhead but can lead to imbalances if assumptions fail.[25] Dynamic scheduling, in contrast, adapts task assignments at runtime to current system conditions, such as varying task durations or processor loads, enabling better responsiveness in irregular workloads.[25] A prominent dynamic strategy is work-stealing, where idle processors "steal" tasks from busy processors' queues to balance load, as implemented in systems like Cilk; this approach keeps contention low and carries provable overhead bounds, with the Cilk work-stealing scheduler executing a computation with total work T_1 and critical-path length T_\infty on P processors in expected time T_1/P + O(T_\infty).
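The following simplified sketch illustrates the work-stealing idea under stated assumptions: each worker owns a mutex-protected deque, pops its own work from the back, and steals from the front of another worker's deque when idle. Production schedulers such as Cilk's rely on lock-free deques and randomized victim selection instead; the WorkStealingPool class and its members are purely illustrative names.

#include <atomic>
#include <deque>
#include <functional>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

class WorkStealingPool {
public:
    explicit WorkStealingPool(unsigned n) : queues_(n), locks_(n), pending_(0), done_(false) {
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this, i] { run(i); });
    }

    // Push a task onto worker i's local deque.
    void submit(unsigned i, std::function<void()> task) {
        ++pending_;
        std::lock_guard<std::mutex> guard(locks_[i]);
        queues_[i].push_back(std::move(task));
    }

    // Wait until every submitted task has run, then stop the workers.
    void wait_and_shutdown() {
        while (pending_.load() != 0)
            std::this_thread::yield();
        done_ = true;
        for (std::thread& w : workers_) w.join();
    }

private:
    bool try_pop_local(unsigned i, std::function<void()>& task) {
        std::lock_guard<std::mutex> guard(locks_[i]);
        if (queues_[i].empty()) return false;
        task = std::move(queues_[i].back());               // owners pop newest work from the back (LIFO)
        queues_[i].pop_back();
        return true;
    }

    bool try_steal(unsigned thief, std::function<void()>& task) {
        for (unsigned victim = 0; victim < queues_.size(); ++victim) {
            if (victim == thief) continue;
            std::lock_guard<std::mutex> guard(locks_[victim]);
            if (queues_[victim].empty()) continue;
            task = std::move(queues_[victim].front());     // thieves take oldest work from the front (FIFO)
            queues_[victim].pop_front();
            return true;
        }
        return false;
    }

    void run(unsigned i) {
        std::function<void()> task;
        while (!done_) {
            if (try_pop_local(i, task) || try_steal(i, task)) {
                task();                        // execute the claimed task
                --pending_;
            } else {
                std::this_thread::yield();     // idle: nothing local and nothing to steal
            }
        }
    }

    std::vector<std::deque<std::function<void()>>> queues_;  // one deque per worker
    std::vector<std::mutex> locks_;
    std::vector<std::thread> workers_;
    std::atomic<int> pending_;
    std::atomic<bool> done_;
};

int main() {
    WorkStealingPool pool(4);
    std::atomic<int> sum{0};
    for (int i = 0; i < 100; ++i)
        pool.submit(0, [&sum, i] { sum += i; });   // all work lands on worker 0; idle workers steal it
    pool.wait_and_shutdown();
    std::cout << "sum = " << sum << '\n';          // 0 + 1 + ... + 99 = 4950
    return 0;
}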
Priority-based task queues extend these strategies by ordering tasks according to assigned priorities, often using heaps or multi-level queues, to ensure high-priority tasks (e.g., those on critical paths) execute first, reducing overall completion time in dependency-heavy graphs.[26] Task models like directed acyclic graphs (DAGs) inform these schedulers by representing dependencies, allowing runtime systems to select only ready (unblocked) tasks for assignment. Synchronization in task parallelism coordinates task execution to respect dependencies and protect shared resources, using primitives tailored to minimize blocking.
Barriers synchronize groups of tasks by requiring all to reach a common point before any proceeds, ensuring collective progress in phases like iterative algorithms.[10] Mutexes provide mutual exclusion for shared data access, preventing race conditions during critical sections, though they can introduce serialization if overused in fine-grained tasks.[10] Atomic operations offer lightweight synchronization for signaling task completion, such as incrementing counters or setting flags without full locks, enabling efficient dependency resolution via mechanisms like reference counting on futures or promises.[10]
Runtime considerations for scheduling and synchronization emphasize load balancing across cores, achieved through decentralized mechanisms like work-stealing deques per processor, which distribute tasks without central bottlenecks and adapt to heterogeneity. Handling task dependencies at runtime involves maintaining a ready-task pool derived from the DAG, where incoming edges are decremented upon predecessor completion (often atomically), releasing successors when counts reach zero; this ensures tasks execute only after prerequisites, with schedulers prioritizing ready tasks to minimize idle time.[27]
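A minimal sketch of this dependency-counting mechanism, assuming C++14 (TaskNode, execute_dag, and the diamond-shaped example graph are illustrative constructions, and the driver runs ready tasks sequentially where a real runtime would hand them to worker threads):

#include <atomic>
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

struct TaskNode {
    std::function<void()> work;
    std::vector<TaskNode*> successors;    // outgoing DAG edges
    std::atomic<int> unresolved_deps{0};  // count of incoming edges not yet satisfied
};

void execute_dag(std::vector<TaskNode*>& nodes) {
    std::queue<TaskNode*> ready;
    for (TaskNode* n : nodes)
        if (n->unresolved_deps.load() == 0)
            ready.push(n);                             // tasks with no predecessors start ready

    while (!ready.empty()) {
        TaskNode* n = ready.front();
        ready.pop();
        n->work();                                     // run the task
        for (TaskNode* s : n->successors)
            if (s->unresolved_deps.fetch_sub(1) == 1)
                ready.push(s);                         // last predecessor finished: successor is ready
    }
}

int main() {
    // Diamond-shaped DAG: A -> B, A -> C, B -> D, C -> D.
    TaskNode A{[] { std::puts("A"); }}, B{[] { std::puts("B"); }},
             C{[] { std::puts("C"); }}, D{[] { std::puts("D"); }};
    A.successors = {&B, &C};
    B.successors = {&D};
    C.successors = {&D};
    B.unresolved_deps = 1; C.unresolved_deps = 1; D.unresolved_deps = 2;

    std::vector<TaskNode*> nodes = {&A, &B, &C, &D};
    execute_dag(nodes);   // prints A, then B and C, then D
    return 0;
}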
Challenges in these mechanisms include overhead from context switching, where frequent task migrations between cores incur costs for saving and restoring execution state and registers and for refilling caches, potentially dominating execution for fine-grained tasks and reducing effective parallelism.[28] To quantify the impact, consider the speedup equation, which measures parallel efficiency. Let T_{\text{serial}} be the execution time of the sequential version, encompassing all computation without parallelism. In the parallel case, T_{\text{parallel}} includes the parallelized computation time divided by the number of processors p, plus synchronization and communication times T_{\text{sync}}, and scheduling overhead T_{\text{sched}} (e.g., from stealing or priority queue operations). Thus,
T_{\text{parallel}} = \frac{T_{\text{comp}}}{p} + T_{\text{sync}} + T_{\text{sched}},
where T_{\text{comp}} is the total computational work (approximating T_{\text{serial}} if fully parallelizable). Speedup is then
S = \frac{T_{\text{serial}}}{T_{\text{parallel}}} = \frac{T_{\text{serial}}}{\frac{T_{\text{comp}}}{p} + T_{\text{sync}} + T_{\text{sched}}}.
Deriving from first principles, as p increases, S \to p only if the overhead terms are negligible; in practice T_{\text{sched}} grows as tasks become finer-grained (each steal or enqueue is cheap, but the number of such operations scales with the task count), capping S below p. Even if the overhead terms stay constant, S asymptotes to T_{\text{serial}} / (T_{\text{sync}} + T_{\text{sched}}) no matter how many processors are added. This formulation highlights how scheduling costs directly limit scalability in task-parallel systems.[10]
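As a purely illustrative numerical case (the figures are assumptions, not measurements), suppose T_{\text{serial}} = T_{\text{comp}} = 100 ms, p = 8, and the combined overheads amount to T_{\text{sync}} + T_{\text{sched}} = 5 ms; then
S = \frac{100}{\frac{100}{8} + 5} = \frac{100}{17.5} \approx 5.7,
well short of the ideal speedup of 8, and no processor count can raise S above 100/5 = 20.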
Language and Framework Support
The standardization of task parallelism in C++ emerged in response to the widespread adoption of multicore processors starting around 2005, which necessitated higher-level abstractions for scalable concurrent programming beyond low-level threads.[29] The C++ standards committee, through Working Group 21 (WG21), introduced foundational concurrency features in C++11 to enable asynchronous task execution, addressing the limitations of explicit thread management for irregular workloads on multicore systems.[30]
In C++11, the <future> header provides core primitives for task-based parallelism, including std::async, std::future, and std::packaged_task. std::async launches a callable object asynchronously, potentially on a new thread, and returns a std::future object that allows the caller to retrieve the result or handle exceptions once the task completes.[31] This mechanism supports deferred or concurrent execution policies, facilitating task parallelism without direct thread creation. std::future represents the shared state of an asynchronous operation, offering methods like wait() and get() to synchronize and access results, while std::packaged_task wraps a callable into a task that stores its outcome in a shared state accessible via a future.[32] These features were refined in C++17 with improvements to exception propagation and in C++20 with coroutines enhancing asynchronous task composition, though the core task model remains centered on futures for multicore scalability.[33]
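A brief sketch of these primitives in use, with a trivial placeholder computation:

#include <future>
#include <iostream>
#include <thread>

int heavy_computation(int n) { return n * n; }   // stand-in for real work

int main() {
    // std::async: launch a callable asynchronously and obtain a future for its result.
    std::future<int> f1 = std::async(std::launch::async, heavy_computation, 6);

    // std::packaged_task: wrap a callable so that its result (or exception) lands in a shared state.
    std::packaged_task<int(int)> task(heavy_computation);
    std::future<int> f2 = task.get_future();
    std::thread worker(std::move(task), 7);      // the wrapped task can be run on any thread

    // get() waits for completion, returns the value, and rethrows any stored exception.
    std::cout << f1.get() + f2.get() << '\n';    // 36 + 49 = 85
    worker.join();
    return 0;
}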
OpenMP, a directive-based API for shared-memory parallelism, integrated task parallelism starting with version 3.0 in 2008 to handle dynamic, irregular workloads on multicore architectures.[34] The #pragma omp task directive generates a task from a structured block, allowing deferred execution by the runtime scheduler, while #pragma omp taskwait suspends the current task until all its child tasks complete, enabling dependency management without explicit synchronization.[35] This model, which builds on work-sharing constructs from earlier versions, promotes scalability by distributing tasks across threads dynamically, and it has been extended in subsequent standards like OpenMP 4.0 and 5.0 for better dependency graphs and device offloading, and in OpenMP 6.0 (released November 2024) with improved tasking support and features for easier parallel programming.[36][37]
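A short sketch of the task and taskwait directives, using the naive Fibonacci recursion that is the customary illustration (assumes an OpenMP-enabled compiler, e.g. invoked with -fopenmp):

#include <cstdio>

long fib(int n) {
    if (n < 2) return n;
    long a, b;
    #pragma omp task shared(a)        // child task; the runtime may defer it to another thread
    a = fib(n - 1);
    #pragma omp task shared(b)
    b = fib(n - 2);
    #pragma omp taskwait              // wait for both child tasks before combining their results
    return a + b;
}

int main() {
    long result = 0;
    #pragma omp parallel              // create the thread team
    #pragma omp single                // exactly one thread spawns the root of the task tree
    result = fib(20);
    std::printf("fib(20) = %ld\n", result);
    return 0;
}

Real codes typically add a sequential cutoff so that small subproblems are computed without spawning further tasks, keeping task granularity coarse enough to amortize scheduling overhead.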
Complementing the standard library facilities, Intel's oneAPI Threading Building Blocks (oneTBB), formerly Threading Building Blocks, offers a library-based approach to task parallelism.[38] oneTBB provides task_group for enclosing parallel tasks executed by a work-stealing scheduler, ensuring load balancing on multicore systems, and flow graphs, a node-based model for composing task dependencies as directed acyclic graphs, suitable for pipeline and irregular parallelism.[39] First released in 2006 to address the programming challenges of the multicore transition, it has evolved into an open-source component of the oneAPI initiative, integrating with C++11 and later concurrency primitives.[40]
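A brief sketch using task_group, assuming an installed oneTBB (with older TBB releases the header is <tbb/task_group.h> and the namespace is tbb); the two lambdas stand in for arbitrary independent work:

#include <oneapi/tbb/task_group.h>
#include <iostream>

int main() {
    int left = 0, right = 0;
    oneapi::tbb::task_group tg;

    tg.run([&] { left = 21; });    // spawn a task; the work-stealing scheduler assigns it to a worker
    tg.run([&] { right = 21; });   // spawn a second, independent task

    tg.wait();                     // block until every task spawned in this group has completed
    std::cout << left + right << '\n';
    return 0;
}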
Support in Java and Other Languages
Java provides robust support for task parallelism through its java.util.concurrent package, introduced in Java 5, which includes the ExecutorService interface for managing thread pools and executing asynchronous tasks without directly handling threads. ExecutorService allows developers to submit tasks via methods like submit() or invokeAll(), enabling efficient distribution of work across a pool of worker threads, which helps in achieving parallelism for independent units of computation. This framework abstracts low-level thread management, promoting scalable task execution in multi-core environments.[41]
Building on this, Java 7 introduced the ForkJoinPool, a specialized ExecutorService implementation that employs a work-stealing scheduler to balance workloads dynamically among threads, particularly suited for recursive divide-and-conquer algorithms like parallel quicksort. In this model, idle threads "steal" tasks from busy threads' queues, minimizing synchronization overhead and maximizing CPU utilization for fine-grained tasks. Java 8 further enhanced task composition with CompletableFuture, a class that represents a pending completion stage and supports chaining operations (e.g., thenApply() for transformations or allOf() for combining multiple futures), facilitating non-blocking, asynchronous workflows that compose parallel tasks declaratively.[42][43]
Java 21 (released in 2023) advanced task parallelism significantly with virtual threads under Project Loom, which are lightweight, JVM-managed threads that map to carrier (platform) threads, allowing millions of them to run concurrently with minimal overhead compared to traditional platform threads. Virtual threads enable scalable task execution for I/O-bound applications by reducing context-switching costs, while structured concurrency constructs like StructuredTaskScope ensure safe grouping and cancellation of related tasks. However, in managed runtime environments like the JVM, garbage collection pauses—such as those in the parallel collector—can introduce stop-the-world interruptions, potentially degrading latency-sensitive task parallelism by halting all threads during heap reclamation.[44][45]
Beyond Java, other languages offer distinct paradigms for task parallelism. In Go, goroutines provide lightweight concurrency primitives, launched with the go keyword, that enable thousands of tasks to run multiplexed on a smaller number of OS threads managed by the Go runtime; channels facilitate safe communication and synchronization between goroutines, supporting patterns like fan-out/fan-in for parallel data processing. Python's concurrent.futures module, available since Python 3.2, abstracts task execution through ThreadPoolExecutor for I/O-bound parallelism and ProcessPoolExecutor for CPU-bound tasks, allowing submission of callables via submit() or map(), with futures for result retrieval, though the Global Interpreter Lock limits true thread parallelism.[46][47]
Rust emphasizes safe, high-performance task parallelism via its async/await syntax (stable since Rust 1.39 in 2019), where asynchronous functions spawn non-blocking tasks; the Tokio runtime, a popular async executor, schedules these tasks across a multi-threaded worker pool using work-stealing, enabling efficient parallelism for both I/O and compute-intensive workloads while leveraging Rust's ownership model to prevent data races. Cross-language trends highlight the actor model, notably in Erlang, where lightweight processes act as isolated actors that communicate solely via asynchronous message passing, inherently supporting distributed task parallelism across nodes with fault tolerance through process supervision.[48]
Applications and Examples
Practical Examples
One practical application of task parallelism is in parallel image processing, where different operations such as edge detection on one section and region growing on another are assigned to separate tasks executing concurrently on available processors. This approach allows functionally diverse processing—such as applying distinct algorithms to image regions—to proceed with coordination for dependencies, enabling efficient utilization of multicore systems.[49]
The following pseudocode illustrates the task decomposition and submission for this image processing example, where the image is partitioned into tasks that can be executed in parallel:
function parallel_image_process(image):
    regions = decompose_image_into_regions(image)   // partition the image into regions for different operations
    for each region in regions:
        if region.type == "edge":
            submit_task(edge_detection, region)
        elif region.type == "grow":
            submit_task(region_growing, region)
    wait_for_all_tasks()                            // synchronize on completion of all submitted tasks
    return combine_processed_regions(regions)       // assemble the processed regions into the output image
In this decomposition, each task handles a distinct operation, reducing overall execution time for mixed workloads by distributing diverse computations across processors.[49]
A more complex example arises in scientific simulations, such as Monte Carlo methods for estimating probabilities in physical systems, where the large number of independent iterations can be divided into separate tasks for parallel execution. For instance, in simulating particle interactions or financial risk assessments, each task performs a subset of random sampling iterations autonomously, with results aggregated post-execution to compute the final estimate.[50]
Pseudocode for task decomposition in a Monte Carlo simulation might appear as follows, emphasizing the independent nature of iteration tasks:
function parallel_monte_carlo_simulation(num_iterations, simulation_function):
    batches = partition_iterations(num_iterations)   // divide the total iterations into equal batches
    for each batch in batches:
        submit_task(monte_carlo_batch, batch, simulation_function)   // each task runs an independent batch of samples
    wait_for_all_tasks()                             // synchronize on completion
    partial_results = [get_result(batch) for batch in batches]
    return aggregate_results(partial_results)        // combine for the final estimate, e.g., via averaging
This task-based partitioning is particularly effective for embarrassingly parallel workloads, such as those involving random number generation in simulations, as it minimizes idle time by allowing concurrent task progression.[51]
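A concrete counterpart to the pseudocode above, assuming C++11 and estimating π from independent sampling batches launched with std::async (the batch count, seeds, and sample sizes are arbitrary illustrative choices):

#include <future>
#include <iostream>
#include <random>
#include <vector>

// One independent task: count random points that fall inside the unit quarter-circle.
long run_batch(unsigned seed, long samples) {
    std::mt19937_64 rng(seed);                          // each task owns its own generator
    std::uniform_real_distribution<double> coord(0.0, 1.0);
    long hits = 0;
    for (long i = 0; i < samples; ++i) {
        double x = coord(rng), y = coord(rng);
        if (x * x + y * y <= 1.0) ++hits;
    }
    return hits;
}

int main() {
    const int batches = 8;
    const long samples_per_batch = 1000000;

    // Submit the batches as independent tasks.
    std::vector<std::future<long>> futures;
    for (int b = 0; b < batches; ++b)
        futures.push_back(std::async(std::launch::async, run_batch,
                                     static_cast<unsigned>(1234 + b), samples_per_batch));

    // Aggregate the partial results once all tasks finish.
    long total_hits = 0;
    for (auto& f : futures) total_hits += f.get();

    double pi_estimate = 4.0 * total_hits / (static_cast<double>(batches) * samples_per_batch);
    std::cout << "pi is approximately " << pi_estimate << '\n';
    return 0;
}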
In scientific simulations with functional diversity, task parallelism can assign different stages to tasks, such as one for solving differential equations and another for data analysis or visualization, enabling pipeline-like execution on multicore systems. For example, in computational fluid dynamics, a solver task computes flow fields while a separate visualization task renders results asynchronously.[10]
Frameworks like Intel TBB or OpenMP provide abstractions that facilitate such task submissions in languages like C++, enabling these examples without explicit thread management.[10]
Task parallelism offers significant performance benefits on multicore hardware by enabling the concurrent execution of independent tasks across multiple processors, thereby achieving scalable speedups as the number of cores increases.[52] This approach improves resource utilization by dynamically assigning tasks to idle processors, reducing wait times and maximizing CPU occupancy in workloads with irregular or unpredictable task durations.[53]
The theoretical limits of these benefits are captured by Amdahl's law, which quantifies the maximum speedup achievable in a parallel program. According to this principle, the overall speedup depends on the fraction of the workload that remains serial, as parallelization cannot accelerate inherently sequential portions. The formula is given by:
\text{Speedup} = \frac{1}{f + \frac{1 - f}{p}}
where f represents the serial fraction of the execution time, and p is the number of processors. For instance, if only 5% of the workload is serial (f = 0.05), the theoretical maximum speedup approaches 20x with sufficient processors, but it diminishes rapidly if the serial fraction is larger.
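The same formula shows how far practical configurations fall short of that asymptote; for example, with f = 0.05 and p = 16,
\text{Speedup} = \frac{1}{0.05 + \frac{0.95}{16}} \approx 9.1,
less than half of the 20x limit.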
Despite these advantages, task parallelism introduces challenges such as overhead from task creation and synchronization, which can erode gains in fine-grained applications. Creating numerous small tasks incurs costs from scheduling and queue management, while synchronization points like barriers or joins introduce waiting times that limit scalability.[54] Additionally, the inherent non-determinism arising from varying task execution orders complicates debugging, as race conditions or subtle errors may produce inconsistent outputs across runs, making reproducibility difficult.[55]
To mitigate these issues, optimization strategies focus on tuning task granularity, balancing task size so that scheduling overhead stays small without leaving cores underutilized, and on reducing dependencies through careful dependency analysis.[54][56] In practice, such tuning yields measurable throughput gains in benchmark suites like DaCapo and SPEC OMP, the latter of which evaluates OpenMP-based task parallelism on shared-memory systems.
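One common granularity-tuning technique is a sequential cutoff below which a task performs its work serially instead of spawning children; the following sketch, assuming C++11 and an arbitrary cutoff value, applies it to a recursive parallel sum:

#include <future>
#include <iostream>
#include <numeric>
#include <vector>

// Recursive parallel sum with a sequential cutoff: small ranges are summed serially
// so that the cost of spawning a task never exceeds the work it performs.
long parallel_sum(const long* first, const long* last, long cutoff) {
    const long n = last - first;
    if (n <= cutoff)
        return std::accumulate(first, last, 0L);         // below the cutoff: no new task

    const long* mid = first + n / 2;
    auto right = std::async(std::launch::async,          // spawn a task only for large halves
                            parallel_sum, mid, last, cutoff);
    long left = parallel_sum(first, mid, cutoff);        // recurse locally on the other half
    return left + right.get();
}

int main() {
    std::vector<long> data(1 << 20, 1L);
    std::cout << parallel_sum(data.data(), data.data() + data.size(), 1 << 15) << '\n';
    return 0;
}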
Comparisons to Other Parallelism Models
Differences from Data Parallelism
Task parallelism and data parallelism represent two fundamental approaches to achieving concurrency in computing, differing primarily in how they partition workloads. In task parallelism, the program is decomposed into distinct, independent tasks that execute different functions or operations, often on varied data sets, allowing for heterogeneous workloads where each task contributes uniquely to the overall computation.[2] In contrast, data parallelism divides a large data set into subsets and applies the identical operation to each subset simultaneously, emphasizing homogeneity where the same code runs across all data partitions, such as in single instruction, multiple data (SIMD) or single instruction, multiple threads (SIMT) paradigms.[57] This functional division in task parallelism contrasts with the data-centric division in data parallelism, leading to task models being more irregular and requiring explicit management of dependencies, while data models leverage regularity for simpler specification and execution.[58]
Use cases for task parallelism are particularly suited to applications with diverse computational stages, such as pipeline processing in multimedia applications where one task handles filtering, another transformation, and a third rendering, enabling efficient exploitation of multi-core processors for non-uniform operations.[57] Data parallelism, however, excels in scenarios involving large, homogeneous arrays, like matrix multiplication or image convolution, where the workload scales with data volume and benefits from vectorized or array-based operations.[2] Task parallelism typically maps to general-purpose CPUs that handle complex, branching control flows across a moderate number of powerful cores, whereas data parallelism aligns with GPUs or vector processing units optimized for massive, uniform throughput on thousands of simpler cores.[59][60]
Hybrid approaches that combine task and data parallelism have emerged to leverage the strengths of both, such as using task parallelism for coarse-grained orchestration and data parallelism for fine-grained computations within tasks, as seen in high-performance scientific simulations where overall efficiency improves significantly over pure models.[61] This integration allows for better load balancing in mixed workloads but introduces additional complexity in synchronization and resource allocation.[62]
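The distinction, and the way the two models can be combined, can be made concrete with OpenMP, which offers both loop-based data parallelism and task constructs in a single program (assumes an OpenMP-enabled compiler; the computations are illustrative placeholders):

#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> a(1000000, 1.0), b(1000000, 2.0);
    double sum_a = 0.0, sum_sq_b = 0.0;

    // Data parallelism: the same operation applied to every element of one array,
    // with iterations split across the thread team.
    #pragma omp parallel for
    for (long i = 0; i < 1000000; ++i)
        a[i] = std::sqrt(a[i]) + 1.0;

    // Task parallelism: two different reductions over different arrays run as
    // independent tasks that the runtime schedules onto available threads.
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task shared(sum_a, a)
        { for (double x : a) sum_a += x; }

        #pragma omp task shared(sum_sq_b, b)
        { for (double x : b) sum_sq_b += x * x; }

        #pragma omp taskwait
    }

    std::printf("sum(a) = %f, sum(b^2) = %f\n", sum_a, sum_sq_b);
    return 0;
}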
Differences from Thread-Based Parallelism
Task parallelism operates at a higher level of abstraction than thread-based parallelism, treating tasks as independent units of work that are dynamically scheduled by a runtime system, in contrast to threads, which are low-level execution entities requiring explicit programmer control for creation, synchronization, and termination. For instance, in thread-based models like POSIX threads (pthreads), developers must manually manage thread pools, queues, and joining operations, often resulting in verbose code and error-prone synchronization. Task-based systems, such as those using Intel Threading Building Blocks (TBB) or OpenMP tasks, abstract these details, allowing the runtime to handle load balancing and dependency resolution automatically.
This abstraction in task parallelism reduces boilerplate code significantly—up to 81% fewer lines of code in some benchmarks—while improving scalability, as demonstrated by up to 42% better performance on 16-core systems compared to pthread implementations for irregular workloads like bodytrack.[63] However, tasks introduce potential runtime overhead from dependency tracking and scheduling, which can impact fine-grained operations where direct thread control is more efficient.[64] Thread-based approaches, by contrast, offer finer-grained control but demand more effort to achieve balanced execution, often leading to load imbalances in dynamic scenarios.
The evolution from thread-based to task-based parallelism reflects a shift toward easier multicore programming, beginning with the standardization of pthreads in 1995 for low-level concurrency support.[65] By the mid-2000s, libraries like Intel TBB (introduced in 2006) emerged to simplify task expression and scheduling, addressing the complexities of manual threading on increasingly parallel hardware.[66] This progression prioritizes developer productivity and scalability for modern many-core systems over the explicit management required in early thread models.[64]
Task parallelism is particularly suited for dynamic workloads with irregular dependencies and load imbalances, such as pipeline or graph-based computations, where runtime adaptability shines.[64] Thread-based parallelism, however, remains preferable for scenarios demanding precise, low-overhead control, like simple data-parallel loops with minimal synchronization needs.