
High-throughput computing

High-throughput computing (HTC) is a computing paradigm that employs distributed and often heterogeneous resources, such as clusters, workstations, and volunteer computing networks, to execute large volumes of independent or loosely coupled tasks over extended durations, prioritizing sustained throughput—typically measured in jobs completed per month—over instantaneous peak performance. The concept of HTC emerged in the mid-1990s from research at the University of Wisconsin-Madison, where Miron Livny and colleagues developed foundational mechanisms to manage resources across distributed environments, emphasizing fault tolerance, opportunistic resource utilization, and checkpointing to handle long-running workloads reliably. Key characteristics include the use of middleware like HTCondor for job scheduling, matchmaking, and resource discovery; support for heterogeneous hardware without requiring high-efficiency components; and a focus on resilience to failures in dynamic, distributively owned systems.

In contrast to high-performance computing (HPC), which relies on tightly coupled parallel processing for low-latency, high-speed simulations on specialized supercomputers measured in floating-point operations per second (FLOPS), HTC excels in scenarios involving massive parallelism of smaller, embarrassingly parallel jobs, leveraging grids and clouds for scalability. Prominent applications span scientific domains, including bioinformatics for genomic sequencing analysis, drug discovery through virtual screening of molecular compounds, high-energy physics data processing (e.g., at CERN), climate modeling ensembles, and big data analytics in agriculture and social sciences. Modern HTC infrastructures, such as the Open Science Grid (OSG) and tools like Pegasus for workflow management or SLURM for batch processing, enable researchers to access national-scale resources, delivering more than 1.2 billion core-hours annually and supporting over 240 projects in fields like the life sciences and physics as of 2025.

Definition and Principles

Core Concepts

High-throughput computing (HTC) is a computing paradigm that prioritizes the sustained execution of large-scale, independent jobs over extended periods to maximize overall productivity, rather than focusing on instantaneous peak performance. This model is particularly suited to opportunistic or distributed environments where resources may be heterogeneous and intermittently available, enabling the processing of vast numbers of loosely coupled tasks that do not demand immediate synchronization.

The primary metric in HTC is throughput, defined as the number of jobs or tasks completed per unit of time, which quantifies the system's ability to deliver cumulative computational results over long durations. This contrasts with metrics like floating-point operations per second in other paradigms, emphasizing total output such as "floating-point operations per year" for applications requiring prolonged computation. Throughput can be formally expressed as:

Throughput = Number of tasks completed / Total execution time

This measure underscores HTC's goal of optimizing resource utilization for continuous operation, often in 24/7 environments where failures are anticipated and managed to maintain steady progress.

In HTC contexts, batch processing predominates, involving the submission and automated execution of large queues of predefined jobs without real-time user intervention, which aligns with the paradigm's focus on high-volume, non-interactive workloads. This differs from interactive computing, where users engage directly with the system for immediate responses; batch approaches in HTC allow efficient handling of independent tasks that can tolerate delays in favor of maximizing completed output. The independence of tasks in HTC further supports this, as jobs operate without requiring tight coupling, synchronization, or direct communication, relying instead on cooperative protocols for coordination and data exchange across distributed nodes.

Key Characteristics

High-throughput computing (HTC) emphasizes opportunistic resource usage, harnessing idle computational cycles from distributed environments such as desktop grids or cloud infrastructures to maximize overall productivity without dedicated hardware ownership. This approach allows systems to dynamically acquire resources from volunteer workstations or underutilized cloud instances, enabling cost-effective scaling for large-scale computations. For instance, systems like HTCondor facilitate the capture of unused processing power from campus or global networks of desktops, migrating jobs as needed to avoid interruptions when owners resume activity.

A defining trait of HTC is its scalability to thousands of heterogeneous nodes, accommodating variability in performance across diverse hardware and network conditions while maintaining robust operation. Systems tolerate fluctuations in resource availability and speed by employing flexible matchmaking and fault-tolerant mechanisms, ensuring continuous progress even in unreliable environments like the Open Science Grid, which spans over 100 institutions. This heterogeneity contrasts with more uniform setups, allowing HTC to aggregate power from disparate sources without stringent synchronization requirements.

HTC workflows are inherently long-running, often extending over days or weeks to complete extensive task sets, such as parameter sweeps in scientific simulations that explore multiple variable configurations. These protracted executions suit applications requiring sustained computation, like Monte Carlo methods or sensitivity analyses, where the focus is on cumulative output rather than immediate results. Automation through directed acyclic graphs (DAGs) supports the orchestration of interdependent tasks over these durations, as demonstrated in workflows managing hundreds of thousands of jobs.
Beyond core throughput metrics, HTC evaluates performance using indicators like makespan, the total time required to complete an entire job set, whose minimization optimizes overall completion in distributed settings. Resource efficiency is another critical measure, quantified as the utilization rate—calculated as (busy time / total available time) × 100%—to assess how effectively cycles are exploited across the system. These metrics highlight HTC's emphasis on sustained, efficient resource harnessing over peak instantaneous performance.
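These definitions can be made concrete with a short calculation. The following Python sketch (using made-up task records, not data from any real system) computes throughput, makespan, and utilization for a finished job set:

```python
from dataclasses import dataclass

@dataclass
class TaskRecord:
    start: float  # seconds since the job set began
    end: float

def htc_metrics(tasks, node_count):
    """Compute throughput, makespan, and utilization for a finished job set.

    Assumes each task occupied one node and all nodes were available for
    the whole makespan (a simplification; real systems track per-node uptime).
    """
    makespan = max(t.end for t in tasks) - min(t.start for t in tasks)
    throughput = len(tasks) / makespan           # tasks completed per unit time
    busy = sum(t.end - t.start for t in tasks)   # total busy node-seconds
    utilization = 100.0 * busy / (node_count * makespan)
    return throughput, makespan, utilization

# Two nodes, four tasks over a 10-second window:
tasks = [TaskRecord(0, 5), TaskRecord(0, 4), TaskRecord(5, 10), TaskRecord(4, 9)]
tp, ms, util = htc_metrics(tasks, node_count=2)
```

Here the makespan is 10 seconds, throughput is 0.4 tasks per second, and utilization is 95%, since 19 of the 20 available node-seconds were busy.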

History and Development

Origins in Distributed Systems

High-throughput computing (HTC) emerged in the 1990s as an extension of distributed systems research, particularly through initiatives in grid computing that sought to harness geographically dispersed resources for sustained, large-scale computations. Early efforts were driven by the need to aggregate idle computing power from workstations and clusters across institutions, moving beyond the limitations of dedicated hardware. The Condor project, initiated at the University of Wisconsin-Madison in 1984 but gaining prominence in the mid-1990s, exemplified this approach by enabling opportunistic scheduling of jobs on non-dedicated resources, achieving fault-tolerant execution through mechanisms like process checkpointing and job migration to handle node failures in unreliable networks. By the late 1990s, Condor's matchmaking algorithms facilitated efficient resource allocation for high-volume workloads, establishing HTC as a paradigm for delivering vast computational capacity over extended periods.

Grid computing initiatives further solidified HTC's foundations, with key contributions from researchers like Ian Foster and Carl Kesselman, who developed grid middleware to enable seamless resource sharing across wide-area networks. Their 1998 book, The Grid: Blueprint for a New Computing Infrastructure, outlined a vision for coordinating heterogeneous systems, while the Globus Toolkit—released that same year—provided essential protocols for job submission, security, and data transfer, supporting loosely coupled applications typical of HTC. This work responded to the escalating data volumes in scientific fields, such as high-energy physics and astronomy, where centralized supercomputers proved insufficient for processing petabyte-scale datasets generated by experiments like those at particle accelerators. The shift emphasized decentralized pooling of internet-scale resources, allowing scientists to access on-demand computing without owning expensive hardware.
A landmark influence on HTC's practical adoption came from volunteer computing projects, notably SETI@home, launched in 1999 by the University of California, Berkeley. This initiative distributed radio signal analysis tasks to millions of volunteered personal computers worldwide, demonstrating HTC's viability in public-resource environments by achieving peak throughputs exceeding 27 teraFLOPS through independent, fault-tolerant work units that tolerated intermittent connectivity and volunteer attrition. Early grid efforts, including integrations like Condor-G, bridged these volunteer models with institutional grids, prioritizing resilience in heterogeneous, failure-prone distributed setups.

Major Milestones and Evolutions

The rise of high-throughput computing (HTC) gained momentum in the mid-2000s amid the big data era, with the release of Apache Hadoop in 2006 marking a pivotal advancement. Hadoop adapted Google's MapReduce programming model to enable distributed, fault-tolerant processing of massive datasets across clusters of commodity hardware, prioritizing sustained throughput over low-latency performance for batch-oriented workloads.

In the 2010s, middleware systems evolved significantly to enhance HTC scalability in distributed environments. HTCondor, originally developed as Condor in 1984 and renamed in 2012, underwent key updates including the adoption of multicore pilots by 2015, enabling efficient resource utilization on grid sites and scaling to tens of thousands of CPU cores for experiments like those at CERN's Large Hadron Collider (LHC). Concurrently, pilot agent frameworks such as glideinWMS, introduced around 2008-2009, revolutionized workload management by dynamically provisioning virtual pools of resources, abstracting site heterogeneities and improving job throughput to handle up to 100,000 queued jobs across hundreds of sites.

The integration of HTC with cloud computing accelerated in the late 2000s and 2010s, exemplified by Amazon Web Services' launch of EC2 Spot Instances in December 2009, which allowed bidding on unused capacity for up to 90% cost savings on flexible, interruptible workloads. This mechanism proved particularly effective for HTC applications, where jobs could be decomposed into small, preemptible tasks to maximize utilization and cost efficiency in hybrid batch systems like HTCondor.

By the 2020s, HTC transitioned toward hybrid models that seamlessly combine traditional grids, public clouds, and emerging edge resources to address demands in scientific workflows. This evolution emphasizes tighter integration of workflow management systems with diverse infrastructures, enabling efficient execution in multi-site, multi-provider environments for data-intensive domains like high-energy physics.
In 2024, the HTCondor project celebrated its 40th anniversary, recognizing its foundational role in HTC since 1984.

Architectures and Technologies

Resource Management Systems

Resource management systems in high-throughput computing (HTC) environments serve as the foundational layer for dynamically allocating computational resources to jobs across distributed, often heterogeneous, networks of machines. These systems typically employ brokers and schedulers that evaluate job requirements against available capabilities, matching them based on predefined policies to optimize throughput while accommodating variability in availability. A key policy in such systems is fair-share allocation, which ensures equitable distribution of computing capacity among users or groups by prioritizing jobs from those with historically lower utilization, thereby preventing resource monopolization and promoting sustained high throughput over time.

In prominent HTC implementations like HTCondor, resource matchmaking is facilitated through mechanisms such as ClassAds, which enable dynamic advertising of resource attributes and job requirements in a classified-advertisement style framework. Introduced in the late 1990s, ClassAds allow nodes to advertise their current state—such as CPU availability, memory, and operating system—while jobs specify needs like execution environment or minimum memory, enabling a centralized matchmaker to pair them efficiently without tight coupling to specific hardware. This approach supports opportunistic usage of resources, including transient or volunteered nodes, by evaluating constraints and rank expressions to select optimal matches. The seminal design of ClassAds, as detailed in early publications, has influenced subsequent distributed scheduling paradigms by decoupling resource discovery from allocation decisions.

To address heterogeneity in HTC environments—where nodes may vary in architecture, software stacks, or availability—lightweight virtualization via containers has become integral since the introduction of Docker in 2013.
Containers encapsulate job dependencies and runtime environments, allowing seamless deployment across diverse nodes without full virtualization overhead, thus maintaining high throughput by minimizing setup times and ensuring reproducibility. In scientific HTC, Apptainer (formerly Singularity) is also widely used, enabling non-privileged users to run containers securely on shared clusters. For instance, Docker's layered filesystem and image-based distribution enable rapid instantiation of isolated workloads, which is particularly beneficial in opportunistic grids where node configurations fluctuate. Studies on container performance in HPC contexts confirm near-native efficiency for many HTC workloads.

Monitoring tools are essential for tracking resource utilization and enabling proactive management in HTC systems, providing real-time metrics on cluster health, job progress, and bottlenecks. Ganglia, an open-source distributed monitoring system developed for high-performance clusters and grids, aggregates data from nodes using a multicast-based protocol to visualize metrics like CPU load, network traffic, and memory usage across large-scale deployments. In HTC setups, such as those integrated with HTCondor, Ganglia's hierarchical design supports scalable monitoring of thousands of nodes, facilitating fault detection and load balancing to sustain throughput. Its extensibility allows custom metrics tailored to HTC needs, like job queue depths, ensuring administrators can optimize resource policies based on empirical data.
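The matchmaking cycle described above — both sides advertise attributes, both sides state constraints, and a matchmaker ranks the mutual matches — can be sketched in a few lines of Python. The attribute and expression names here are illustrative stand-ins, not the actual HTCondor ClassAd language:

```python
def matchmake(job, machines):
    # A machine is a candidate only if the job's requirements accept the
    # machine AND the machine's requirements accept the job (two-sided match).
    candidates = [
        m for m in machines
        if job["Requirements"](m) and m["Requirements"](job)
    ]
    # The job's Rank expression orders candidates; higher is better.
    return max(candidates, key=job["Rank"], default=None)

machines = [
    {"Name": "slot1", "OpSys": "LINUX", "Memory": 4096,
     "Requirements": lambda j: j["ImageSize"] <= 4096},
    {"Name": "slot2", "OpSys": "LINUX", "Memory": 16384,
     "Requirements": lambda j: j["ImageSize"] <= 16384},
]
job = {
    "ImageSize": 8192,
    "Requirements": lambda m: m["OpSys"] == "LINUX" and m["Memory"] >= 8192,
    "Rank": lambda m: m["Memory"],  # prefer the machine with the most memory
}
best = matchmake(job, machines)  # only slot2 satisfies both sides
```

The design point this illustrates is the decoupling noted above: neither side names the other directly, so transient or volunteered machines can join the pool simply by advertising an ad.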

Workflow and Job Scheduling Frameworks

In high-throughput computing (HTC), workflows are commonly modeled using Directed Acyclic Graphs (DAGs), where nodes represent individual tasks or jobs, and directed edges denote dependencies between them, ensuring no cycles to prevent indefinite waits. This structure allows for the orchestration of complex, data-intensive computations across distributed resources, such as grids or clusters, by enabling parallel execution of independent tasks while respecting sequential constraints. The execution model processes DAG nodes in topological order, submitting a job only after all its predecessor tasks have completed successfully, which maximizes resource utilization and throughput in environments with variable availability.

Workflow schedulers in HTC, such as Pegasus, facilitate the management of these DAG-based workflows by mapping abstract, high-level descriptions—defined in formats like XML or YAML—into concrete executable plans tailored to available resources. Developed from 2001 at the University of Southern California's Information Sciences Institute, Pegasus automates the planning process, including data staging, task transformation, and resource mapping across heterogeneous systems like grids and clouds, thereby enabling scalable execution of scientific computations. Similarly, HTCondor's DAGMan tool supports DAG construction and submission, integrating seamlessly with the broader HTC ecosystem to handle workflows comprising thousands of interdependent jobs.

Fault recovery mechanisms are integral to HTC frameworks to maintain reliability in unreliable distributed environments, where node failures or network issues are common. Checkpointing periodically saves the state of executing tasks, allowing restarts from the last valid point without recomputing prior work, while resubmission automatically retries failed tasks up to a configurable limit, often with exponential backoff to avoid overwhelming resources.
In Pegasus, for instance, these features generate rescue DAGs that resume workflows from the point of failure, preserving provenance data to track execution history and debug issues. DAGMan employs analogous recovery by monitoring job status and requeueing aborted jobs, ensuring minimal disruption to overall throughput.

Integration with batch systems enhances HTC schedulers' ability to handle large-scale job queuing on cluster infrastructures. Pegasus, for example, interfaces with SLURM through intermediaries like HTCondor, submitting tasks as batch jobs while leveraging SLURM's partitioning and priority queuing for efficient scheduling in high-volume scenarios. SLURM itself supports HTC by optimizing for short, numerous jobs via features like job arrays and fair-share scheduling, allowing frameworks to enqueue thousands of tasks without manual intervention. This synergy enables seamless scaling, where abstract workflows are decomposed into queued executions that adapt to cluster dynamics.
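The submit-when-parents-finish cycle, combined with the retry-and-backoff policy described above, can be sketched as follows. This is a simplified serial executor for illustration; real DAG schedulers dispatch all ready tasks in parallel:

```python
import time
from collections import deque

def run_dag(deps, run_task, max_retries=3, base_delay=0.01):
    """Execute a DAG of tasks in topological order: a task is submitted only
    once all its parents have succeeded, and a failed task is resubmitted
    with exponential backoff up to max_retries attempts.

    deps maps task -> set of parent tasks; run_task(name) returns True/False.
    """
    indegree = {t: len(parents) for t, parents in deps.items()}
    children = {t: set() for t in deps}
    for t, parents in deps.items():
        for p in parents:
            children[p].add(t)
    ready = deque(t for t, d in indegree.items() if d == 0)
    done = []
    while ready:
        task = ready.popleft()
        for attempt in range(max_retries):
            if run_task(task):
                break
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
        else:
            raise RuntimeError(f"task {task!r} failed after {max_retries} tries")
        done.append(task)
        for child in children[task]:
            indegree[child] -= 1
            if indegree[child] == 0:    # all parents finished: child is ready
                ready.append(child)
    return done  # a valid topological order of the completed tasks

# A three-node workflow: "merge" may run only after "align" and "filter".
order = run_dag(
    {"align": set(), "merge": {"align", "filter"}, "filter": set()},
    run_task=lambda name: True,
)
```

A rescue-DAG mechanism amounts to persisting `done` so a rerun can skip completed nodes and resume from the failure point.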

Differences from High-Performance Computing

High-throughput computing (HTC) and high-performance computing (HPC) represent distinct paradigms in distributed and parallel computing, each optimized for different workload characteristics and resource utilization strategies. HTC emphasizes the execution of numerous independent, loosely coupled tasks over extended periods to maximize overall system throughput, often measured in jobs completed per month or floating-point operations per year (FLOPY), whereas HPC focuses on accelerating tightly coupled, synchronous computations that require rapid data exchange among processes to achieve high peak performance. This contrast arises from HTC's suitability for embarrassingly parallel workloads, where tasks lack interdependencies and can run autonomously across heterogeneous resources, in opposition to HPC's reliance on coordinated message passing for complex simulations.

In terms of hardware, HPC systems typically employ specialized clusters with low-latency interconnects, such as InfiniBand, to minimize communication overhead in tightly coupled applications, enabling efficient synchronization across hundreds or thousands of nodes. Conversely, HTC leverages commodity hardware, including desktops, departmental servers, and cloud instances, without requiring high-speed fabrics, as tasks are designed to tolerate variable latencies and operate independently on distributed, opportunistic resources. Performance objectives further diverge: HPC prioritizes floating-point operations per second (FLOPS) to deliver rapid results for compute-intensive problems, while HTC optimizes for sustained job completion rates, ensuring high aggregate output over time despite potential delays in individual runs.

A representative example illustrates these differences: in HPC, weather modeling often involves a single large-scale simulation using the Message Passing Interface (MPI) for tightly coupled communication across nodes to solve coupled atmospheric equations in near-real time.
In contrast, HTC supports genomic sequencing through millions of independent sequence alignments or marker evaluations, processed as separate jobs on distributed clusters to achieve high throughput for large datasets. These trade-offs highlight HTC's focus on aggregate throughput for batch-oriented, long-running computations versus HPC's emphasis on low-latency synchronization for interactive, high-speed analysis.
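The contrast can be made concrete: an HTC-style run is simply an unsynchronized map over independent tasks. The sketch below uses a toy scoring function standing in for a real alignment tool, and a local thread pool standing in for the cluster nodes a production system would fan the same tasks out to:

```python
from concurrent.futures import ThreadPoolExecutor

def score_alignment(pair):
    # Toy stand-in for one independent HTC task (scoring one sequence pair);
    # a real workload would invoke an external tool such as BLAST here.
    a, b = pair
    return sum(1 for x, y in zip(a, b) if x == y)

def farm_out(pairs, workers=4):
    # No inter-task messaging, barriers, or shared state: each pair is scored
    # on its own, which is the HTC model (contrast with MPI's tight coupling).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(score_alignment, pairs))

scores = farm_out([("GATTACA", "GATTGCA"), ("ACGT", "TGCA")])
```

Because no task ever waits on another, a straggling or failed task delays only itself, which is why this model tolerates commodity hardware and variable latencies.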

Relation to Many-Task Computing

Many-task computing (MTC) represents an extension of high-throughput computing (HTC) by accommodating diverse task granularities, ranging from short bursts of seconds to prolonged jobs lasting hours or days, all orchestrated within a single cohesive framework. This approach enhances HTC's focus on sustained resource utilization by introducing flexibility for heterogeneous workloads, enabling efficient execution on grids, clusters, and supercomputers without rigid assumptions about task uniformity. The MTC paradigm was proposed by Ian Foster and collaborators in 2008 as a means to unify HTC with high-performance computing (HPC), leveraging HTC's scheduling model for data-intensive, task-parallel applications while adapting HPC infrastructures for broader workloads.

A pivotal tool in this integration is the Swift parallel scripting language, developed in 2007, which facilitates HTC workflows by allowing users to define implicitly parallel scripts that manage variable task durations through dataflow-driven execution and dynamic load balancing. Swift's model abstracts file-based data handling, supporting tasks from brief computations to extended simulations, thus bridging HTC's batch-oriented style with MTC's adaptability. While HTC and MTC share common challenges, such as minimizing data movement overheads through techniques like local caching and data-aware scheduling, HTC generally presumes uniform, batch-style tasks with consistent durations and minimal inter-task communication. In contrast, MTC extends this by incorporating support for interactive, communication-intensive elements and dynamic task graphs, fostering more versatile applications in scientific computing.
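Swift's dataflow idea — downstream work fires once its inputs exist, regardless of how long each producer takes — can be approximated with futures. This is a sketch of the concept only, not Swift syntax:

```python
from concurrent.futures import ThreadPoolExecutor

def simulate(seed):
    # Stand-in for a long-running task (hours or days in a real campaign).
    return seed * seed

def summarize(values):
    # Stand-in for a short downstream task (seconds).
    return sum(values)

with ThreadPoolExecutor(max_workers=4) as pool:
    runs = [pool.submit(simulate, s) for s in range(5)]  # fan-out, any duration
    results = [f.result() for f in runs]                 # blocks only until ready
    answer = summarize(results)                          # fires once inputs exist
```

The mix of granularities is the point: the five `simulate` tasks may each run for days while `summarize` takes seconds, and the same dataflow description covers both without the uniform-batch assumption of classic HTC.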

Applications and Use Cases

Scientific Research Domains

High-throughput computing (HTC) has become integral to scientific research domains where vast datasets and numerous independent computations are processed in parallel, enabling researchers to tackle complex problems that require extensive computation and data handling. In fields like bioinformatics, astronomy, climate modeling, and high-energy physics, HTC systems distribute workloads across distributed resources, such as grids and clusters, to achieve scalability and fault tolerance without tight coupling. This approach contrasts with HPC by emphasizing resilience and high job throughput over peak speed for individual tasks.

In bioinformatics, HTC facilitates large-scale sequence alignment tasks, such as running thousands of Basic Local Alignment Search Tool (BLAST) jobs against massive protein databases like those from the National Center for Biotechnology Information (NCBI). Since the early 2000s, tools like Soap-HT-BLAST have leveraged web services and distributed clusters to parallelize these alignments, processing terabytes of genomic data for applications in gene discovery and evolutionary studies. For instance, frameworks such as Divide and Conquer BLAST (DCBLAST) enable rapid execution of NCBI BLAST+ on cluster environments, reducing turnaround times for aligning millions of sequences from high-throughput sequencing experiments. This has been crucial for projects analyzing whole-genome data, where HTC workflows handle the embarrassingly parallel nature of pairwise alignments without requiring specialized hardware.

Astronomy benefits from HTC in processing pipelines for telescope data, particularly in projects generating enormous volumes of images and catalogs. The Vera C. Rubin Observatory's Legacy Survey of Space and Time (LSST), which began operations in 2025, relies on HTC infrastructures like the Open Science Grid (OSG) to manage the influx of 20 terabytes of nightly data for tasks such as transient detection and image calibration. As of June 2025, the observatory released its first images, with full survey operations commencing later that year, further demonstrating HTC's role in handling initial data volumes.
Systems like RabbitQR integrate HTC with message queuing to handle LSST's data rates, distributing reduction steps across fluctuating worker pools for analysis of cosmic events. The LSST Data Facilities plan incorporates grid-based HTC for very high-throughput storage and computation, ensuring scalable handling of petabyte-scale archives for multi-wavelength studies.

Climate modeling employs HTC for ensemble runs that explore parameter variations to quantify uncertainty in projections. Large ensembles, such as those from the Community Earth System Model (CESM), use distributed computing to simulate thousands of scenarios varying initial conditions and forcings, providing probabilistic insights into future climate states. HTC platforms, including grids, have enabled studies like the BRIDGE HadCM3 family of models, where high-throughput execution across facilities processes perturbed parameter ensembles to assess regional impacts like precipitation changes. For example, the Magellan project demonstrated HTC's role in scaling climate simulations for ensemble analysis, supporting workflows that integrate observational data with model outputs for uncertainty quantification.

In high-energy physics, HTC supports event simulation for particle colliders, exemplified by Large Hadron Collider (LHC) data processing since its 2008 startup. The ATLAS and CMS experiments use global HTC systems like the Worldwide LHC Computing Grid (WLCG) to generate and simulate billions of proton-proton collision events, distributing simulations across thousands of sites to model detector responses and backgrounds. Post-2008, this infrastructure has processed exabytes of data through high-throughput pipelines, with tools like BOINC enabling volunteer and opportunistic computing for event generation. Recent integrations, such as ATLAS's shift to high-performance platforms for simulation, maintain HTC principles to handle the increased luminosity in LHC Run 3, ensuring timely physics analyses.

Industrial and Data-Intensive Workloads

In industrial sectors, high-throughput computing (HTC) enables the processing of vast workloads at scale to deliver business value through rapid analysis and simulation, often leveraging distributed resources like clusters or cloud systems to handle independent tasks. This approach is particularly suited to data-intensive applications where volume and speed directly impact competitiveness and revenue, such as in finance, media production, e-commerce, and pharmaceuticals. By utilizing opportunistic and dedicated resources, HTC systems like HTCondor manage thousands of independent jobs, reducing turnaround times from days to hours while optimizing costs on heterogeneous infrastructure.

In finance, HTC supports Monte Carlo simulations for risk management, where millions of market scenarios are generated and evaluated daily to compute metrics like value at risk (VaR). These simulations model price distributions using historical data and stochastic processes, such as normally distributed changes in interest rates and equities, to quantify potential losses at confidence levels like 95% or 99%. Distributed environments, such as those built with the Globus Toolkit or HTCondor, parallelize the independent evaluation of scenarios across nodes, enabling faster execution than single-machine setups—for instance, processing large portfolios with payments in multiple currencies. This scalability allows firms to handle the computational demands of real-time risk management, where a single simulation might require evaluating millions of paths to inform hedging and investment decisions.

Media and entertainment industries rely on HTC for rendering farms that produce computer-generated imagery (CGI) in films through frame-by-frame parallelism. Render farms distribute rendering tasks—such as ray tracing and global illumination for complex scenes—across clusters of commodity hardware, harnessing idle cycles from desktop workstations to avoid dedicated high-cost setups. For example, open-source tools like Blender integrated with HTCondor can render a 400-frame animation in under an hour on a 60-core pool, compared to over seven hours on a single machine, achieving approximately 7.6-fold speedup.
Similarly, TeraGrid-based environments using RenderMan-compliant renderers have scaled to 200 nodes, reducing a 3,600-frame high-definition sequence from 120 hours to 36 minutes. This parallelism is essential for tight production deadlines, where each frame may involve intensive computations for lighting and textures in films.

E-commerce platforms employ HTC for log analysis and training recommendation engines on petabyte-scale data, processing user interactions, transaction histories, and behavioral logs to personalize offerings and optimize inventory. Distributed frameworks like Apache Spark enable collaborative filtering algorithms, such as alternating least squares (ALS), to factorize massive user-item matrices across clusters, handling billions of events from diverse sources including clickstreams and purchases. This high-throughput processing supports real-time insights, for instance, by analyzing petabytes of data to generate recommendations that drive sales through pattern detection in shopping behaviors. Scalable ingestion and computation ensure low-latency updates, with systems distributing workloads to manage the velocity of incoming logs from millions of users daily.

In pharmaceutical drug discovery, HTC facilitates virtual screening of compound libraries against protein targets, evaluating docking scores for millions of molecules to identify potential leads. Frameworks like HHVSF, built on HTCondor, automate ligand-based and structure-based screening in distributed environments, parallelizing independent docking simulations to scan ultra-large libraries efficiently. For example, opportunistic grids via the Open Science Grid (OSG) have scaled screenings to billions of compounds, reducing discovery timelines from months to weeks by leveraging heterogeneous resources for tasks like binding affinity predictions. This approach prioritizes cost-effective exploration of chemical spaces, focusing on hit identification before costly wet-lab validation.
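The value-at-risk workload described in the finance paragraph above translates directly into code: VaR is a percentile of a simulated profit-and-loss distribution, and every path is independent, which is why the job splits cleanly across HTC nodes. A minimal single-node sketch with illustrative parameters (not any firm's actual model):

```python
import random

def simulate_pnl(n_paths, mu=0.0005, sigma=0.02, value=1_000_000, seed=7):
    # One-day portfolio P&L under normally distributed returns, the
    # stochastic model described above. Paths are independent, so an HTC
    # system can shard n_paths across many nodes and merge the results.
    rng = random.Random(seed)
    return [value * rng.gauss(mu, sigma) for _ in range(n_paths)]

def value_at_risk(pnl, confidence=0.99):
    # VaR at the chosen confidence level: the loss exceeded in only
    # (1 - confidence) of scenarios, reported as a positive number.
    losses = sorted(-x for x in pnl)
    idx = min(int(confidence * len(losses)), len(losses) - 1)
    return losses[idx]

pnl = simulate_pnl(100_000)
var_99 = value_at_risk(pnl, confidence=0.99)  # 99% one-day VaR in dollars
```

In an HTC deployment each node would run `simulate_pnl` on its own shard with an independent seed, and only the merged percentile computation needs the combined results.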

Challenges and Future Directions

Technical Hurdles

High-throughput computing (HTC) systems often incorporate resources from heterogeneous environments, such as clusters with varying CPU architectures, memory capacities, and network interfaces, which complicates effective workload distribution. This variability leads to load imbalances where some nodes remain underutilized while others are overburdened, reducing overall system throughput and increasing completion times for bag-of-tasks applications. For instance, in medical image processing workflows deployed on HPC-HTC platforms, small task sizes exacerbate imbalances across GPUs, resulting in GPU utilization below 50% and performance degradation as parallelism increases.

Data transfer bottlenecks pose another major challenge in HTC, particularly across wide-area networks where latency and bandwidth limitations hinder efficient movement of input/output files for distributed tasks. In disk-to-disk transfers common to scientific workflows, multiple subsystems—including storage I/O, network protocols, and application parameters—create identifiable chokepoints that limit achievable throughput, often requiring manual tuning to approach peak performance. These issues become pronounced in large-scale deployments, where the cumulative cost of data staging and retrieval can scale linearly with the number of tasks (O(n) for n independent jobs), amplifying delays in resource-opportunistic environments like grids.

Security concerns are acute in volunteer-based HTC grids, where untrusted nodes from public contributors execute code, raising risks of malicious behavior such as result falsification or data theft. To mitigate this, systems employ sandboxing techniques that isolate applications, for example by running them under unprivileged user accounts or within virtual machines to prevent access to host resources and block harmful code execution.
In platforms like BOINC, code signing ensures only verified applications run, while replication on multiple nodes validates results against potential tampering, though anonymous participants still introduce vulnerabilities in result integrity.

Energy consumption in large-scale HTC deployments is inefficient due to the opportunistic harvesting of idle cycles from desktop and grid resources, where baseline power draw persists even during low-utilization periods. Simulations of grid environments reveal that resources operating at 5-10% CPU load contribute substantially to wasted energy, as evicted jobs and migration overheads prevent full exploitation of harvested cycles. These harvesting dynamics are compounded by the inability to power down nodes without risking job loss in fault-prone settings.

Emerging Trends

One prominent emerging trend in high-throughput computing (HTC) is the integration of serverless computing paradigms to enable automatic scaling of batch jobs and workflows. Serverless platforms, such as AWS Lambda, launched in 2014, allow HTC applications to execute functions on-demand without provisioning or managing servers, facilitating elastic resource allocation for variable workloads like scientific simulations and data processing pipelines. This approach addresses scalability challenges by dynamically adjusting compute resources, reducing idle times, and improving overall throughput; recent frameworks enhance GPU serverless execution with pre-execution data analysis to achieve faster setup times and up to 2.5x higher throughput compared to baseline systems in inference tasks. Similarly, adaptations of serverless runtimes for HPC environments, including HTC, enable seamless integration with existing batch schedulers like HTCondor or Slurm.

AI-driven scheduling represents another key innovation, leveraging machine learning to predict resource demands and optimize job placement for maximized throughput.
Reinforcement learning-based algorithms, such as the multi-action deep reinforcement learning used in GAS-MARL, align job scheduling with renewable energy availability to improve performance in HPC clusters. For HTC-specific scenarios involving large-scale data analytics, AI schedulers incorporate predictive models to handle bursty arrivals, as seen in throughput-optimal policies for AI inference serving that guarantee maximum system capacity under varying request sizes. These methods extend traditional HTC schedulers by integrating neural-network models into placement decisions.

Extensions to edge computing are enabling low-latency HTC for continuous data streams, where distributed edge nodes handle high-volume, real-time tasks closer to data sources. Edge-oriented scheduling frameworks employ topology-aware graph partitioning and matching algorithms to place jobs across edge devices, reportedly delivering up to 3x higher throughput and sub-10ms latencies for latency-sensitive streaming applications. This paradigm mitigates central bottlenecks in HTC by offloading compute-intensive filtering and aggregation, with studies showing 40% reductions in end-to-end delay for geo-distributed networks while sustaining gigabit-per-second data rates. In practice, such integrations support scalable HTC for edge-IoT ecosystems, as evidenced by models that balance local execution with opportunistic bursting to remote resources.

Sustainability efforts in HTC focus on green scheduling algorithms designed to minimize energy consumption and carbon footprints amid growing deployment scales. Algorithms like GAS-MARL utilize multi-action deep reinforcement learning to align job scheduling with renewable energy availability in HPC and HTC clusters, prioritizing low-carbon time slots. Cloud-based frameworks that incorporate real-time carbon-intensity data into task allocation further optimize for eco-efficiency, as demonstrated by reduced carbon costs for data-intensive workloads through predictive placement on greener data centers. These approaches, often plugin-compatible with standard HTC resource managers, emphasize holistic metrics such as power usage effectiveness (PUE), promoting sustainable scaling as systems approach and exceed exascale thresholds.
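The low-carbon time-slot prioritization described above can be sketched as carbon-aware delay scheduling: given a carbon-intensity forecast, a deferrable batch job is placed in the window with the lowest total intensity. This is a minimal sketch in the spirit of green schedulers such as GAS-MARL, not their algorithm; real systems also weigh deadlines, queue fairness, and renewable-supply predictions. The forecast values are hypothetical.

```python
def greenest_window(intensity_forecast, job_slots):
    """Pick the start index of the `job_slots`-long window with the lowest
    total forecast carbon intensity (e.g. gCO2/kWh per hourly slot).

    Exhaustive scan is fine here: forecasts are short (hours to a few
    days), so the O(n * job_slots) cost is negligible next to the job.
    """
    best_start, best_cost = 0, float("inf")
    for start in range(len(intensity_forecast) - job_slots + 1):
        cost = sum(intensity_forecast[start:start + job_slots])
        if cost < best_cost:
            best_start, best_cost = start, cost
    return best_start

# Hourly forecast: the 3-hour job is deferred into the low-carbon trough.
forecast = [450, 400, 120, 100, 110, 380]
print(greenest_window(forecast, 3))  # 2
```

Because HTC workloads are throughput-oriented rather than latency-oriented, they tolerate exactly this kind of temporal shifting, which is what makes carbon-aware placement attractive for them.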
