Task parallelism
Task parallelism is a fundamental paradigm in parallel computing that involves decomposing a program into distinct, concurrently executable tasks distributed across multiple processors or cores, emphasizing the simultaneous performance of different functions or operations to enhance efficiency and scalability.[1] This approach contrasts with data parallelism, which applies the same operation to multiple data subsets simultaneously, by instead focusing on functional diversity where tasks may handle varied computations without uniform data partitioning.[2] Unlike finer-grained forms such as instruction-level parallelism, task parallelism operates at a coarser level, organizing code into processes, threads, or independent units that can run asynchronously.[3]
Key characteristics of task parallelism include the potential for tasks to be either fully independent—enabling embarrassingly parallel execution with minimal synchronization—or interdependent, requiring coordination mechanisms like locks, barriers, or futures to resolve data dependencies and maintain program correctness.[1] It is particularly suited to heterogeneous workloads, such as mixing intensive computations with input/output operations, and aligns with the Multiple Instruction, Multiple Data (MIMD) classification in Flynn's taxonomy, allowing flexible resource utilization on multicore systems.[3] For instance, in web applications, task parallelism can handle concurrent HTTP request processing, where each request operates as an independent task with little intercommunication.[3]
Task parallelism finds broad applications in domains requiring diverse concurrent operations, including multimedia processing—such as parallel video decoding and rendering—and scientific simulations where distinct algorithmic stages execute simultaneously.[4] It is also prevalent in high-performance computing for irregular workloads, like graph analytics or machine learning pipelines, where dynamic task scheduling improves throughput on distributed systems.[2] Support for task parallelism is integrated into modern programming environments, with languages and libraries like Cilk for lightweight task creation, OpenMP's task constructs[5] for directive-based parallelism, and Java's Executor framework for thread pool management, enabling developers to exploit multicore hardware without low-level thread handling.[6][7]
Core Concepts
Definition
Task parallelism is a form of parallel computing in which a computational problem is divided into multiple independent tasks—discrete units of work—that execute concurrently across different processors or cores to enhance throughput and reduce execution time compared to sequential processing.[8][9] This approach, also known as functional decomposition, emphasizes the concurrent performance of distinct operations rather than uniform processing of data elements.[8]
The key principles of task parallelism revolve around task independence, enabling minimal inter-task communication and synchronization to maximize efficiency; the potential for dynamic task creation and assignment during runtime to adapt to workload variations; and a focus on coarse-grained work division, where tasks encompass larger, self-contained computations suitable for distribution across heterogeneous resources.[9] These principles distinguish task parallelism within the broader context of parallel computing, which involves the simultaneous use of multiple compute resources to solve problems that would otherwise require sequential execution on a single processor.[10] Task parallelism first appeared in early multiprocessing systems of the 1960s and 1970s, where concurrent task handling became essential for leveraging multiple processors.[11]
The basic workflow of task parallelism begins with the identification and decomposition of a program into independent tasks, followed by their concurrent execution on available processing units, and concludes with result aggregation if dependencies exist.[8] This process improves resource utilization by allowing tasks to proceed asynchronously, with overall performance limited primarily by the longest-running task and any inherent serial components.[9]
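A minimal sketch of this workflow, assuming C++11 and using std::async purely for illustration (the two stage functions, sum_of_squares and count_even, are hypothetical placeholders rather than part of any library):

#include <functional>
#include <future>
#include <iostream>
#include <numeric>
#include <vector>

// Task 1: one kind of computation over the input.
long sum_of_squares(const std::vector<int>& v) {
    long s = 0;
    for (int x : v) s += static_cast<long>(x) * x;
    return s;
}

// Task 2: a functionally different computation over the same input.
int count_even(const std::vector<int>& v) {
    int c = 0;
    for (int x : v) c += (x % 2 == 0);
    return c;
}

int main() {
    std::vector<int> data(1000);
    std::iota(data.begin(), data.end(), 1);

    // Decomposition: two distinct tasks are launched concurrently.
    auto squares = std::async(std::launch::async, sum_of_squares, std::cref(data));
    auto evens   = std::async(std::launch::async, count_even, std::cref(data));

    // Aggregation: results are collected once both tasks complete.
    std::cout << "sum of squares = " << squares.get()
              << ", even values = " << evens.get() << '\n';
    return 0;
}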
Historical Development
The concept of task parallelism emerged in the 1960s and 1970s amid early efforts to harness multiprocessing for high-performance computing. By the mid-1970s, foundational ideas for dynamic task execution were advanced through dataflow architectures, where computations are broken into independent tasks activated by data availability rather than rigid control flow. Jack B. Dennis and David P. Misunas proposed a preliminary architecture for a basic data-flow processor in 1975, enabling asynchronous task firing and influencing subsequent models of irregular parallelism.[12]
In the 1980s, task parallelism gained practical expression in programming languages designed for concurrent systems. The Ada programming language, standardized as Ada 83 in 1983 under U.S. Department of Defense sponsorship, introduced native tasking facilities—including task types, rendezvous for synchronization, and select statements for conditional execution—to support real-time and embedded applications with reliable concurrency.[13] This marked a shift toward structured, language-level support for dynamic task creation and interaction, building on earlier multiprocessing concepts from the 1970s.[14]
The 2000s accelerated adoption due to hardware trends, particularly the transition to multicore processors. Intel's announcement in 2005 of a pivot from single-core frequency scaling to multicore designs, exemplified by the release of dual-core Pentium processors, underscored the need for software paradigms like task parallelism to exploit on-chip concurrency effectively.[11] A pivotal milestone came with OpenMP 3.0 in May 2008, which added task constructs to the API, enabling programmers to define and schedule independent tasks dynamically for irregular workloads, evolving from earlier static loop-based parallelism.[15]
Influential contributions in distributed contexts further shaped the field, with Ian Foster's work in the 1990s and 2000s on grid computing promoting task-based models for large-scale, heterogeneous environments, as detailed in his 1995 book Designing and Building Parallel Programs.[16] Post-2010 developments, driven by cloud computing and accelerators, extended dynamic tasking to scalable frameworks, transitioning from static scheduling in early multiprocessing to adaptive, runtime-managed task graphs in modern systems.[17]
Implementation Mechanisms
Task Models and Abstractions
In task-parallel systems, computations are often modeled using a directed acyclic graph (DAG), where nodes represent individual tasks and directed edges indicate dependencies between them, ensuring that dependent tasks execute only after their predecessors complete. This model captures the structure of parallel workloads by avoiding cycles that could lead to deadlocks or undefined behavior, allowing runtimes to identify opportunities for concurrent execution. The DAG approach has become a foundational abstraction for expressing irregular parallelism in applications with varying dependency patterns.[18]
A key abstraction for handling asynchronous results in these models is the use of futures and promises. A future acts as a placeholder for a value that will be computed asynchronously by a task, enabling the main program to continue without blocking until the result is needed, while a promise serves as the mechanism for the completing task to deliver that value. Futures were introduced in the context of concurrent symbolic computation to support lazy evaluation in parallel Lisp environments, promoting fine-grained parallelism without explicit synchronization. Promises, as precursors to modern asynchronous constructs, were proposed to represent eventual results in applicative programming paradigms, decoupling computation from result access.[19][20]
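A minimal sketch of the future/promise pairing, assuming a C++11 compiler and the standard <future> header (the produce function is an illustrative placeholder):

#include <future>
#include <iostream>
#include <thread>

// The completing task receives the promise and delivers the value through it.
void produce(std::promise<int> p) {
    p.set_value(6 * 7);              // fulfil the promise; readers of the future unblock
}

int main() {
    std::promise<int> result_promise;
    std::future<int> result_future = result_promise.get_future(); // placeholder for the eventual value

    std::thread producer(produce, std::move(result_promise));     // run the producer task concurrently

    // ... the main program can continue with unrelated work here ...

    std::cout << "answer = " << result_future.get() << '\n';      // block only when the result is needed
    producer.join();
    return 0;
}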
Tasks in these systems are typically treated as first-class objects, meaning they can be dynamically created, manipulated, and managed like any other data entity, with well-defined states such as created, submitted, running, and completed. This abstraction allows programmers to submit tasks to a runtime scheduler and query their status, facilitating composable parallelism where tasks can depend on or spawn others. Dependency graphs, frequently implemented as DAGs, further refine this by explicitly encoding relationships, such as data or control dependencies, to guide execution order while maximizing concurrency.[21]
Important distinctions in task models include implicit versus explicit parallelism. In implicit models, the runtime or compiler automatically detects and exploits parallel opportunities from high-level specifications, reducing programmer burden but potentially limiting control over irregular workloads; explicit models require developers to annotate or define parallel regions and dependencies directly, offering precision at the cost of added complexity. Additionally, tasks are often designed as lightweight entities, similar to threads that share process resources like memory and file descriptors for low overhead, in contrast to heavyweight processes that maintain isolated address spaces and incur higher creation and context-switching costs.[22][23]
To illustrate basic task creation and management, consider the following pseudocode, which captures the essence of submitting a task and awaiting its result in a generic task-parallel runtime:
task T = create_task(compute_function, arguments);
submit(T, scheduler);
result = wait_for_completion(T);
This pattern encapsulates task instantiation as a first-class operation, with submission integrating it into the dependency graph for execution. Scheduling mechanisms may operate on such models to prioritize ready tasks, but the abstractions themselves focus on declarative specification.[24]
Scheduling and Synchronization
In task parallelism, scheduling involves assigning tasks to available processing resources to optimize execution time and resource utilization. Static scheduling pre-allocates tasks to processors based on a known task graph prior to runtime, assuming predictable execution times and no variations in load, which minimizes runtime overhead but can lead to imbalances if assumptions fail.[25] Dynamic scheduling, in contrast, adapts task assignments at runtime to current system conditions, such as varying task durations or processor loads, enabling better responsiveness in irregular workloads.[25] A prominent dynamic strategy is work-stealing, where idle processors "steal" tasks from busy processors' queues to balance load, as implemented in systems like Cilk; this approach keeps contention low and carries provable overhead bounds, with the Cilk work-stealing scheduler executing a computation with total work T_1 and critical-path length T_\infty on P processors in expected time T_1/P + O(T_\infty).
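The following simplified sketch illustrates the work-stealing idea under stated assumptions: each worker owns a mutex-protected deque, pops its own work from the back, and steals from the front of another worker's deque when idle. Production schedulers such as Cilk's rely on lock-free deques and randomized victim selection instead; the WorkStealingPool class and its members are purely illustrative names.

#include <atomic>
#include <deque>
#include <functional>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

class WorkStealingPool {
public:
    explicit WorkStealingPool(unsigned n) : queues_(n), locks_(n), pending_(0), done_(false) {
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this, i] { run(i); });
    }

    // Push a task onto worker i's local deque.
    void submit(unsigned i, std::function<void()> task) {
        ++pending_;
        std::lock_guard<std::mutex> guard(locks_[i]);
        queues_[i].push_back(std::move(task));
    }

    // Wait until every submitted task has run, then stop the workers.
    void wait_and_shutdown() {
        while (pending_.load() != 0)
            std::this_thread::yield();
        done_ = true;
        for (std::thread& w : workers_) w.join();
    }

private:
    bool try_pop_local(unsigned i, std::function<void()>& task) {
        std::lock_guard<std::mutex> guard(locks_[i]);
        if (queues_[i].empty()) return false;
        task = std::move(queues_[i].back());               // owners pop newest work from the back (LIFO)
        queues_[i].pop_back();
        return true;
    }

    bool try_steal(unsigned thief, std::function<void()>& task) {
        for (unsigned victim = 0; victim < queues_.size(); ++victim) {
            if (victim == thief) continue;
            std::lock_guard<std::mutex> guard(locks_[victim]);
            if (queues_[victim].empty()) continue;
            task = std::move(queues_[victim].front());     // thieves take oldest work from the front (FIFO)
            queues_[victim].pop_front();
            return true;
        }
        return false;
    }

    void run(unsigned i) {
        std::function<void()> task;
        while (!done_) {
            if (try_pop_local(i, task) || try_steal(i, task)) {
                task();                        // execute the claimed task
                --pending_;
            } else {
                std::this_thread::yield();     // idle: nothing local and nothing to steal
            }
        }
    }

    std::vector<std::deque<std::function<void()>>> queues_;  // one deque per worker
    std::vector<std::mutex> locks_;
    std::vector<std::thread> workers_;
    std::atomic<int> pending_;
    std::atomic<bool> done_;
};

int main() {
    WorkStealingPool pool(4);
    std::atomic<int> sum{0};
    for (int i = 0; i < 100; ++i)
        pool.submit(0, [&sum, i] { sum += i; });   // all work lands on worker 0; idle workers steal it
    pool.wait_and_shutdown();
    std::cout << "sum = " << sum << '\n';          // 0 + 1 + ... + 99 = 4950
    return 0;
}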
Priority-based task queues extend these strategies by ordering tasks according to assigned priorities, often using heaps or multi-level queues, to ensure high-priority tasks (e.g., those on critical paths) execute first, reducing overall completion time in dependency-heavy graphs.[26] Task models like directed acyclic graphs (DAGs) inform these schedulers by representing dependencies, allowing runtime systems to select only ready (unblocked) tasks for assignment. Synchronization in task parallelism coordinates task execution to respect dependencies and protect shared resources, using primitives tailored to minimize blocking.
Barriers synchronize groups of tasks by requiring all to reach a common point before any proceeds, ensuring collective progress in phases like iterative algorithms.[10] Mutexes provide mutual exclusion for shared data access, preventing race conditions during critical sections, though they can introduce serialization if overused in fine-grained tasks.[10] Atomic operations offer lightweight synchronization for signaling task completion, such as incrementing counters or setting flags without full locks, enabling efficient dependency resolution via mechanisms like reference counting on futures or promises.[10]
Runtime considerations for scheduling and synchronization emphasize load balancing across cores, achieved through decentralized mechanisms like work-stealing deques per processor, which distribute tasks without central bottlenecks and adapt to heterogeneity. Handling task dependencies at runtime involves maintaining a ready-task pool derived from the DAG, where incoming edges are decremented upon predecessor completion (often atomically), releasing successors when counts reach zero; this ensures tasks execute only after prerequisites, with schedulers prioritizing ready tasks to minimize idle time.[27]
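A minimal sketch of this dependency-counting mechanism, assuming C++14 (TaskNode, execute_dag, and the diamond-shaped example graph are illustrative constructions, and the driver runs ready tasks sequentially where a real runtime would hand them to worker threads):

#include <atomic>
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

struct TaskNode {
    std::function<void()> work;
    std::vector<TaskNode*> successors;    // outgoing DAG edges
    std::atomic<int> unresolved_deps{0};  // count of incoming edges not yet satisfied
};

void execute_dag(std::vector<TaskNode*>& nodes) {
    std::queue<TaskNode*> ready;
    for (TaskNode* n : nodes)
        if (n->unresolved_deps.load() == 0)
            ready.push(n);                             // tasks with no predecessors start ready

    while (!ready.empty()) {
        TaskNode* n = ready.front();
        ready.pop();
        n->work();                                     // run the task
        for (TaskNode* s : n->successors)
            if (s->unresolved_deps.fetch_sub(1) == 1)
                ready.push(s);                         // last predecessor finished: successor is ready
    }
}

int main() {
    // Diamond-shaped DAG: A -> B, A -> C, B -> D, C -> D.
    TaskNode A{[] { std::puts("A"); }}, B{[] { std::puts("B"); }},
             C{[] { std::puts("C"); }}, D{[] { std::puts("D"); }};
    A.successors = {&B, &C};
    B.successors = {&D};
    C.successors = {&D};
    B.unresolved_deps = 1; C.unresolved_deps = 1; D.unresolved_deps = 2;

    std::vector<TaskNode*> nodes = {&A, &B, &C, &D};
    execute_dag(nodes);   // prints A, then B and C, then D
    return 0;
}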
Challenges in these mechanisms include overhead from context switching, where frequent task migrations between cores incur costs for saving and restoring execution state and registers and for refilling caches, potentially dominating execution for fine-grained tasks and reducing effective parallelism.[28] To quantify the impact, consider the speedup equation, which measures parallel efficiency. Let T_{\text{serial}} be the execution time of the sequential version, encompassing all computation without parallelism. In the parallel case, T_{\text{parallel}} includes the parallelized computation time divided by the number of processors p, plus synchronization and communication times T_{\text{sync}}, and scheduling overhead T_{\text{sched}} (e.g., from stealing or priority queue operations). Thus,
T_{\text{parallel}} = \frac{T_{\text{comp}}}{p} + T_{\text{sync}} + T_{\text{sched}},
where T_{\text{comp}} is the total computational work (approximating T_{\text{serial}} if fully parallelizable). Speedup is then
S = \frac{T_{\text{serial}}}{T_{\text{parallel}}} = \frac{T_{\text{serial}}}{\frac{T_{\text{comp}}}{p} + T_{\text{sync}} + T_{\text{sched}}}.
Deriving from first principles, as p increases, S \to p only if the overhead terms are negligible; in practice T_{\text{sched}} grows as tasks become finer-grained (each steal or enqueue is cheap, but the number of such operations scales with the task count), capping S below p. Even if the overhead terms stay constant, S asymptotes to T_{\text{serial}} / (T_{\text{sync}} + T_{\text{sched}}) no matter how many processors are added. This formulation highlights how scheduling costs directly limit scalability in task-parallel systems.[10]
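As a purely illustrative numerical case (the figures are assumptions, not measurements), suppose T_{\text{serial}} = T_{\text{comp}} = 100 ms, p = 8, and the combined overheads amount to T_{\text{sync}} + T_{\text{sched}} = 5 ms; then
S = \frac{100}{\frac{100}{8} + 5} = \frac{100}{17.5} \approx 5.7,
well short of the ideal speedup of 8, and no processor count can raise S above 100/5 = 20.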
Language and Framework Support
The standardization of task parallelism in C++ emerged in response to the widespread adoption of multicore processors starting around 2005, which necessitated higher-level abstractions for scalable concurrent programming beyond low-level threads.[29] The C++ standards committee, through Working Group 21 (WG21), introduced foundational concurrency features in C++11 to enable asynchronous task execution, addressing the limitations of explicit thread management for irregular workloads on multicore systems.[30]
In C++11, the <future> header provides core primitives for task-based parallelism, including std::async, std::future, and std::packaged_task. std::async launches a callable object asynchronously, potentially on a new thread, and returns a std::future object that allows the caller to retrieve the result or handle exceptions once the task completes.[31] This mechanism supports deferred or concurrent execution policies, facilitating task parallelism without direct thread creation. std::future represents the shared state of an asynchronous operation, offering methods like wait() and get() to synchronize and access results, while std::packaged_task wraps a callable into a task that stores its outcome in a shared state accessible via a future.[32] These features were refined in C++17 with improvements to exception propagation and in C++20 with coroutines enhancing asynchronous task composition, though the core task model remains centered on futures for multicore scalability.[33]
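A brief sketch of these primitives in use, with a trivial placeholder computation:

#include <future>
#include <iostream>
#include <thread>

int heavy_computation(int n) { return n * n; }   // stand-in for real work

int main() {
    // std::async: launch a callable asynchronously and obtain a future for its result.
    std::future<int> f1 = std::async(std::launch::async, heavy_computation, 6);

    // std::packaged_task: wrap a callable so that its result (or exception) lands in a shared state.
    std::packaged_task<int(int)> task(heavy_computation);
    std::future<int> f2 = task.get_future();
    std::thread worker(std::move(task), 7);      // the wrapped task can be run on any thread

    // get() waits for completion, returns the value, and rethrows any stored exception.
    std::cout << f1.get() + f2.get() << '\n';    // 36 + 49 = 85
    worker.join();
    return 0;
}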
OpenMP, a directive-based API for shared-memory parallelism, integrated task parallelism starting with version 3.0 in 2008 to handle dynamic, irregular workloads on multicore architectures.[34] The #pragma omp task directive generates a task from a structured block, allowing deferred execution by the runtime scheduler, while #pragma omp taskwait suspends the current task until all its child tasks complete, enabling dependency management without explicit synchronization.[35] This model, which builds on work-sharing constructs from earlier versions, promotes scalability by distributing tasks across threads dynamically, and it has been extended in subsequent standards like OpenMP 4.0 and 5.0 for better dependency graphs and device offloading, and in OpenMP 6.0 (released November 2024) with improved tasking support and features for easier parallel programming.[36][37]
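A short sketch of the task and taskwait directives, using the naive Fibonacci recursion that is the customary illustration (assumes an OpenMP-enabled compiler, e.g. invoked with -fopenmp):

#include <cstdio>

long fib(int n) {
    if (n < 2) return n;
    long a, b;
    #pragma omp task shared(a)        // child task; the runtime may defer it to another thread
    a = fib(n - 1);
    #pragma omp task shared(b)
    b = fib(n - 2);
    #pragma omp taskwait              // wait for both child tasks before combining their results
    return a + b;
}

int main() {
    long result = 0;
    #pragma omp parallel              // create the thread team
    #pragma omp single                // exactly one thread spawns the root of the task tree
    result = fib(20);
    std::printf("fib(20) = %ld\n", result);
    return 0;
}

Real codes typically add a sequential cutoff so that small subproblems are computed without spawning further tasks, keeping task granularity coarse enough to amortize scheduling overhead.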
Complementing the standard library facilities, Intel's oneAPI Threading Building Blocks (oneTBB), formerly Threading Building Blocks, offers a library-based approach to task parallelism.[38] oneTBB provides task_group for enclosing parallel tasks executed by a work-stealing scheduler, ensuring load balancing on multicore systems, and flow graphs, a node-based model for composing task dependencies as directed acyclic graphs, suitable for pipeline and irregular parallelism.[39] First released in 2006 to address the programming challenges of the multicore transition, it has evolved into an open-source component of the oneAPI initiative, integrating with C++11 and later concurrency primitives.[40]
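A brief sketch using task_group, assuming an installed oneTBB (with older TBB releases the header is <tbb/task_group.h> and the namespace is tbb); the two lambdas stand in for arbitrary independent work:

#include <oneapi/tbb/task_group.h>
#include <iostream>

int main() {
    int left = 0, right = 0;
    oneapi::tbb::task_group tg;

    tg.run([&] { left = 21; });    // spawn a task; the work-stealing scheduler assigns it to a worker
    tg.run([&] { right = 21; });   // spawn a second, independent task

    tg.wait();                     // block until every task spawned in this group has completed
    std::cout << left + right << '\n';
    return 0;
}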
Support in Java and Other Languages
Java provides robust support for task parallelism through its java.util.concurrent package, introduced in Java 5, which includes the ExecutorService interface for managing thread pools and executing asynchronous tasks without directly handling threads. ExecutorService allows developers to submit tasks via methods like submit() or invokeAll(), enabling efficient distribution of work across a pool of worker threads, which helps in achieving parallelism for independent units of computation. This framework abstracts low-level thread management, promoting scalable task execution in multi-core environments.[41]
Building on this, Java 7 introduced the ForkJoinPool, a specialized ExecutorService implementation that employs a work-stealing scheduler to balance workloads dynamically among threads, particularly suited for recursive divide-and-conquer algorithms like parallel quicksort. In this model, idle threads "steal" tasks from busy threads' queues, minimizing synchronization overhead and maximizing CPU utilization for fine-grained tasks. Java 8 further enhanced task composition with CompletableFuture, a class that represents a pending completion stage and supports chaining operations (e.g., thenApply() for transformations or allOf() for combining multiple futures), facilitating non-blocking, asynchronous workflows that compose parallel tasks declaratively.[42][43]
Java 21 (released in 2023) advanced task parallelism significantly with virtual threads under Project Loom, which are lightweight, JVM-managed threads that map to carrier (platform) threads, allowing millions of them to run concurrently with minimal overhead compared to traditional platform threads. Virtual threads enable scalable task execution for I/O-bound applications by reducing context-switching costs, while structured concurrency constructs like StructuredTaskScope ensure safe grouping and cancellation of related tasks. However, in managed runtime environments like the JVM, garbage collection pauses—such as those in the parallel collector—can introduce stop-the-world interruptions, potentially degrading latency-sensitive task parallelism by halting all threads during heap reclamation.[44][45]
Beyond Java, other languages offer distinct paradigms for task parallelism. In Go, goroutines provide lightweight concurrency primitives, launched with the go keyword, that enable thousands of tasks to run multiplexed on a smaller number of OS threads managed by the Go runtime; channels facilitate safe communication and synchronization between goroutines, supporting patterns like fan-out/fan-in for parallel data processing. Python's concurrent.futures module, available since Python 3.2, abstracts task execution through ThreadPoolExecutor for I/O-bound parallelism and ProcessPoolExecutor for CPU-bound tasks, allowing submission of callables via submit() or map(), with futures for result retrieval, though the Global Interpreter Lock limits true thread parallelism.[46][47]
Rust emphasizes safe, high-performance task parallelism via its async/await syntax (stable since Rust 1.39 in 2019), where asynchronous functions spawn non-blocking tasks; the Tokio runtime, a popular async executor, schedules these tasks across a multi-threaded worker pool using work-stealing, enabling efficient parallelism for both I/O and compute-intensive workloads while leveraging Rust's ownership model to prevent data races. Cross-language trends highlight the actor model, notably in Erlang, where lightweight processes act as isolated actors that communicate solely via asynchronous message passing, inherently supporting distributed task parallelism across nodes with fault tolerance through process supervision.[48]
Applications and Examples
Practical Examples
One practical application of task parallelism is in parallel image processing, where different operations such as edge detection on one section and region growing on another are assigned to separate tasks executing concurrently on available processors. This approach allows functionally diverse processing—such as applying distinct algorithms to image regions—to proceed with coordination for dependencies, enabling efficient utilization of multicore systems.[49]
The following pseudocode illustrates the task decomposition and submission for this image processing example, where the image is partitioned into tasks that can be executed in parallel:
function parallel_image_process(image):
    regions = decompose_image_into_regions(image)   // partition the image into regions for different operations
    for each region in regions:
        if region.type == "edge":
            submit_task(edge_detection, region)
        elif region.type == "grow":
            submit_task(region_growing, region)
    wait_for_all_tasks()                            // synchronize on completion of all submitted tasks
    return combine_processed_regions(regions)       // assemble the processed regions into the output image
In this decomposition, each task handles a distinct operation, reducing overall execution time for mixed workloads by distributing diverse computations across processors.[49]
A more complex example arises in scientific simulations, such as Monte Carlo methods for estimating probabilities in physical systems, where the large number of independent iterations can be divided into separate tasks for parallel execution. For instance, in simulating particle interactions or financial risk assessments, each task performs a subset of random sampling iterations autonomously, with results aggregated post-execution to compute the final estimate.[50]
Pseudocode for task decomposition in a Monte Carlo simulation might appear as follows, emphasizing the independent nature of iteration tasks:
function parallel_monte_carlo_simulation(num_iterations, simulation_function):
    batches = partition_iterations(num_iterations)   // divide the total iterations into equal batches
    for each batch in batches:
        submit_task(monte_carlo_batch, batch, simulation_function)   // each task runs an independent batch of samples
    wait_for_all_tasks()                             // synchronize on completion
    partial_results = [get_result(batch) for batch in batches]
    return aggregate_results(partial_results)        // combine for the final estimate, e.g., via averaging
This task-based partitioning is particularly effective for embarrassingly parallel workloads, such as those involving random number generation in simulations, as it minimizes idle time by allowing concurrent task progression.[51]
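A concrete counterpart to the pseudocode above, assuming C++11 and estimating π from independent sampling batches launched with std::async (the batch count, seeds, and sample sizes are arbitrary illustrative choices):

#include <future>
#include <iostream>
#include <random>
#include <vector>

// One independent task: count random points that fall inside the unit quarter-circle.
long run_batch(unsigned seed, long samples) {
    std::mt19937_64 rng(seed);                          // each task owns its own generator
    std::uniform_real_distribution<double> coord(0.0, 1.0);
    long hits = 0;
    for (long i = 0; i < samples; ++i) {
        double x = coord(rng), y = coord(rng);
        if (x * x + y * y <= 1.0) ++hits;
    }
    return hits;
}

int main() {
    const int batches = 8;
    const long samples_per_batch = 1000000;

    // Submit the batches as independent tasks.
    std::vector<std::future<long>> futures;
    for (int b = 0; b < batches; ++b)
        futures.push_back(std::async(std::launch::async, run_batch,
                                     static_cast<unsigned>(1234 + b), samples_per_batch));

    // Aggregate the partial results once all tasks finish.
    long total_hits = 0;
    for (auto& f : futures) total_hits += f.get();

    double pi_estimate = 4.0 * total_hits / (static_cast<double>(batches) * samples_per_batch);
    std::cout << "pi is approximately " << pi_estimate << '\n';
    return 0;
}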
In scientific simulations with functional diversity, task parallelism can assign different stages to tasks, such as one for solving differential equations and another for data analysis or visualization, enabling pipeline-like execution on multicore systems. For example, in computational fluid dynamics, a solver task computes flow fields while a separate visualization task renders results asynchronously.[10]
Frameworks like Intel TBB or OpenMP provide abstractions that facilitate such task submissions in languages like C++, enabling these examples without explicit thread management.[10]
Task parallelism offers significant performance benefits on multicore hardware by enabling the concurrent execution of independent tasks across multiple processors, thereby achieving scalable speedups as the number of cores increases.[52] This approach improves resource utilization by dynamically assigning tasks to idle processors, reducing wait times and maximizing CPU occupancy in workloads with irregular or unpredictable task durations.[53]
The theoretical limits of these benefits are captured by Amdahl's law, which quantifies the maximum speedup achievable in a parallel program. According to this principle, the overall speedup depends on the fraction of the workload that remains serial, as parallelization cannot accelerate inherently sequential portions. The formula is given by:
\text{Speedup} = \frac{1}{f + \frac{1 - f}{p}}
where f represents the serial fraction of the execution time, and p is the number of processors. For instance, if only 5% of the workload is serial (f = 0.05), the theoretical maximum speedup approaches 20x with sufficient processors, but it diminishes rapidly if the serial fraction is larger.
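The same formula shows how far practical configurations fall short of that asymptote; for example, with f = 0.05 and p = 16,
\text{Speedup} = \frac{1}{0.05 + \frac{0.95}{16}} \approx 9.1,
less than half of the 20x limit.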
Despite these advantages, task parallelism introduces challenges such as overhead from task creation and synchronization, which can erode gains in fine-grained applications. Creating numerous small tasks incurs costs from scheduling and queue management, while synchronization points like barriers or joins introduce waiting times that limit scalability.[54] Additionally, the inherent non-determinism arising from varying task execution orders complicates debugging, as race conditions or subtle errors may produce inconsistent outputs across runs, making reproducibility difficult.[55]
To mitigate these issues, optimization strategies focus on tuning task granularity, balancing task size so that scheduling overhead stays small without leaving cores underutilized, and on reducing dependencies through careful dependency analysis.[54][56] In practice, such tuning yields measurable throughput gains in benchmark suites like DaCapo and SPEC OMP, the latter of which evaluates OpenMP-based task parallelism on shared-memory systems.
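One common granularity-tuning technique is a sequential cutoff below which a task performs its work serially instead of spawning children; the following sketch, assuming C++11 and an arbitrary cutoff value, applies it to a recursive parallel sum:

#include <future>
#include <iostream>
#include <numeric>
#include <vector>

// Recursive parallel sum with a sequential cutoff: small ranges are summed serially
// so that the cost of spawning a task never exceeds the work it performs.
long parallel_sum(const long* first, const long* last, long cutoff) {
    const long n = last - first;
    if (n <= cutoff)
        return std::accumulate(first, last, 0L);         // below the cutoff: no new task

    const long* mid = first + n / 2;
    auto right = std::async(std::launch::async,          // spawn a task only for large halves
                            parallel_sum, mid, last, cutoff);
    long left = parallel_sum(first, mid, cutoff);        // recurse locally on the other half
    return left + right.get();
}

int main() {
    std::vector<long> data(1 << 20, 1L);
    std::cout << parallel_sum(data.data(), data.data() + data.size(), 1 << 15) << '\n';
    return 0;
}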
Comparisons to Other Parallelism Models
Differences from Data Parallelism
Task parallelism and data parallelism represent two fundamental approaches to achieving concurrency in computing, differing primarily in how they partition workloads. In task parallelism, the program is decomposed into distinct, independent tasks that execute different functions or operations, often on varied data sets, allowing for heterogeneous workloads where each task contributes uniquely to the overall computation.[2] In contrast, data parallelism divides a large data set into subsets and applies the identical operation to each subset simultaneously, emphasizing homogeneity where the same code runs across all data partitions, such as in single instruction, multiple data (SIMD) or single instruction, multiple threads (SIMT) paradigms.[57] This functional division in task parallelism contrasts with the data-centric division in data parallelism, leading to task models being more irregular and requiring explicit management of dependencies, while data models leverage regularity for simpler specification and execution.[58]
Use cases for task parallelism are particularly suited to applications with diverse computational stages, such as pipeline processing in multimedia applications where one task handles filtering, another transformation, and a third rendering, enabling efficient exploitation of multi-core processors for non-uniform operations.[57] Data parallelism, however, excels in scenarios involving large, homogeneous arrays, like matrix multiplication or image convolution, where the workload scales with data volume and benefits from vectorized or array-based operations.[2] Task parallelism typically maps to general-purpose CPUs that handle complex, branching control flows across a moderate number of powerful cores, whereas data parallelism aligns with GPUs or vector processing units optimized for massive, uniform throughput on thousands of simpler cores.[59][60]
Hybrid approaches that combine task and data parallelism have emerged to leverage the strengths of both, such as using task parallelism for coarse-grained orchestration and data parallelism for fine-grained computations within tasks, as seen in high-performance scientific simulations where overall efficiency improves significantly over pure models.[61] This integration allows for better load balancing in mixed workloads but introduces additional complexity in synchronization and resource allocation.[62]
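The distinction, and the way the two models can be combined, can be made concrete with OpenMP, which offers both loop-based data parallelism and task constructs in a single program (assumes an OpenMP-enabled compiler; the computations are illustrative placeholders):

#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> a(1000000, 1.0), b(1000000, 2.0);
    double sum_a = 0.0, sum_sq_b = 0.0;

    // Data parallelism: the same operation applied to every element of one array,
    // with iterations split across the thread team.
    #pragma omp parallel for
    for (long i = 0; i < 1000000; ++i)
        a[i] = std::sqrt(a[i]) + 1.0;

    // Task parallelism: two different reductions over different arrays run as
    // independent tasks that the runtime schedules onto available threads.
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task shared(sum_a, a)
        { for (double x : a) sum_a += x; }

        #pragma omp task shared(sum_sq_b, b)
        { for (double x : b) sum_sq_b += x * x; }

        #pragma omp taskwait
    }

    std::printf("sum(a) = %f, sum(b^2) = %f\n", sum_a, sum_sq_b);
    return 0;
}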
Differences from Thread-Based Parallelism
Task parallelism operates at a higher level of abstraction than thread-based parallelism, treating tasks as independent units of work that are dynamically scheduled by a runtime system, in contrast to threads, which are low-level execution entities requiring explicit programmer control for creation, synchronization, and termination. For instance, in thread-based models like POSIX threads (pthreads), developers must manually manage thread pools, queues, and joining operations, often resulting in verbose code and error-prone synchronization. Task-based systems, such as those using Intel Threading Building Blocks (TBB) or OpenMP tasks, abstract these details, allowing the runtime to handle load balancing and dependency resolution automatically.
This abstraction in task parallelism reduces boilerplate code significantly—up to 81% fewer lines of code in some benchmarks—while improving scalability, as demonstrated by up to 42% better performance on 16-core systems compared to pthread implementations for irregular workloads like bodytrack.[63] However, tasks introduce potential runtime overhead from dependency tracking and scheduling, which can impact fine-grained operations where direct thread control is more efficient.[64] Thread-based approaches, by contrast, offer finer-grained control but demand more effort to achieve balanced execution, often leading to load imbalances in dynamic scenarios.
The evolution from thread-based to task-based parallelism reflects a shift toward easier multicore programming, beginning with the standardization of pthreads in 1995 for low-level concurrency support.[65] By the mid-2000s, libraries like Intel TBB (introduced in 2006) emerged to simplify task expression and scheduling, addressing the complexities of manual threading on increasingly parallel hardware.[66] This progression prioritizes developer productivity and scalability for modern many-core systems over the explicit management required in early thread models.[64]
Task parallelism is particularly suited for dynamic workloads with irregular dependencies and load imbalances, such as pipeline or graph-based computations, where runtime adaptability shines.[64] Thread-based parallelism, however, remains preferable for scenarios demanding precise, low-overhead control, like simple data-parallel loops with minimal synchronization needs.