
High-throughput computing

High-throughput computing (HTC) is a computing paradigm that employs distributed and often heterogeneous resources, such as clusters, workstations, and volunteer computing networks, to execute large volumes of independent or loosely coupled tasks over extended durations, prioritizing sustained throughput—typically measured in jobs completed per month—over instantaneous peak performance. The concept of HTC emerged in the mid-1990s from research at the University of Wisconsin-Madison, where Miron Livny and colleagues developed foundational mechanisms to manage resources across distributed environments, emphasizing fault tolerance, opportunistic resource utilization, and checkpointing to handle long-running workloads reliably. Key characteristics include the use of middleware like HTCondor for job scheduling, matchmaking, and resource discovery; support for heterogeneous hardware without requiring high-efficiency components; and a focus on resilience to failures in dynamic, distributively owned systems.

In contrast to high-performance computing (HPC), which relies on tightly coupled parallel processing for low-latency, high-speed simulations on specialized supercomputers measured in floating-point operations per second (FLOPS), HTC excels in scenarios involving massive parallelism of smaller, embarrassingly parallel jobs, leveraging grids and clouds for scalability. Prominent applications span scientific domains, including bioinformatics for genomic sequencing analysis, drug discovery through virtual screening of molecular compounds, high-energy physics data processing (e.g., at CERN), climate modeling ensembles, and big data analytics in agriculture and social sciences. Modern HTC infrastructures, such as the Open Science Grid (OSG) and tools like Pegasus for workflow management or SLURM for batch processing, enable researchers to access national-scale resources, delivering more than 1.2 billion core-hours annually and supporting over 240 projects in fields like the life sciences and physics as of 2025.

Definition and Principles

Core Concepts

High-throughput computing (HTC) is a computing paradigm that prioritizes the sustained execution of large-scale, independent jobs over extended periods to maximize overall productivity, rather than focusing on instantaneous peak performance. This model is particularly suited to opportunistic or distributed environments where resources may be heterogeneous and intermittently available, enabling the processing of vast numbers of loosely coupled tasks that do not demand immediate synchronization.

The primary metric in HTC is throughput, defined as the number of jobs or tasks completed per unit of time, which quantifies the system's ability to deliver cumulative computational results over long durations. This contrasts with metrics like floating-point operations per second in other paradigms, emphasizing total output such as "floating-point operations per year" for applications requiring prolonged computation. Throughput can be formally expressed as:

Throughput = Number of tasks completed / Total execution time

This measure underscores HTC's goal of optimizing resource utilization for continuous operation, often in 24/7 environments where failures are anticipated and managed to maintain steady progress.

In HTC contexts, batch processing predominates, involving the submission and automated execution of large queues of predefined jobs without real-time user intervention, which aligns with the paradigm's focus on high-volume, non-interactive workloads. This differs from interactive computing, where users engage directly with the system for immediate responses; batch approaches in HTC allow efficient handling of independent tasks that can tolerate delays in favor of maximizing completed output. The independence of tasks in HTC further supports this, as jobs operate without requiring tight coupling, synchronization, or direct communication, relying instead on cooperative protocols for coordination and data exchange across distributed nodes.

Key Characteristics

High-throughput computing (HTC) emphasizes opportunistic resource usage, harnessing idle computational cycles from distributed environments such as desktop grids or cloud infrastructures to maximize overall productivity without dedicated hardware ownership. This approach allows systems to dynamically acquire resources from volunteer workstations or underutilized cloud instances, enabling cost-effective scaling for large-scale computations. For instance, systems like HTCondor facilitate the capture of unused processing power from campus or global networks of desktops, migrating jobs as needed to avoid interruptions when owners resume activity.

A defining trait of HTC is its scalability to thousands of heterogeneous nodes, accommodating variability in performance across diverse hardware and network conditions while maintaining robust operation. Systems tolerate fluctuations in resource availability and speed by employing flexible matchmaking and fault-tolerant mechanisms, ensuring continuous progress even in unreliable environments like the Open Science Grid, which spans over 100 institutions. This heterogeneity contrasts with more uniform setups, allowing HTC to aggregate power from disparate sources without stringent synchronization requirements.

HTC workflows are inherently long-running, often extending over days or weeks to complete extensive task sets, such as parameter sweeps in scientific simulations that explore multiple variable configurations. These protracted executions suit applications requiring sustained computation, like Monte Carlo methods or sensitivity analyses, where the focus is on cumulative output rather than immediate results. Automation through directed acyclic graphs (DAGs) supports the orchestration of interdependent tasks over these durations, as demonstrated in workflows managing hundreds of thousands of jobs.
Beyond core throughput metrics, HTC evaluates performance using indicators like makespan, the total time required to complete an entire job set, whose minimization optimizes overall completion in distributed settings. Resource efficiency is another critical measure, quantified as the utilization rate—calculated as (busy time / total available time) × 100%—to assess how effectively cycles are exploited across the system. These metrics highlight HTC's emphasis on sustained, efficient resource harnessing over peak instantaneous performance.
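These definitions can be made concrete with a short calculation. The following Python sketch (using made-up task records, not data from any real system) computes throughput, makespan, and utilization for a finished job set:

```python
from dataclasses import dataclass

@dataclass
class TaskRecord:
    start: float  # seconds since the job set began
    end: float

def htc_metrics(tasks, node_count):
    """Compute throughput, makespan, and utilization for a finished job set.

    Assumes each task occupied one node and all nodes were available for
    the whole makespan (a simplification; real systems track per-node uptime).
    """
    makespan = max(t.end for t in tasks) - min(t.start for t in tasks)
    throughput = len(tasks) / makespan           # tasks completed per unit time
    busy = sum(t.end - t.start for t in tasks)   # total busy node-seconds
    utilization = 100.0 * busy / (node_count * makespan)
    return throughput, makespan, utilization

# Two nodes, four tasks over a 10-second window:
tasks = [TaskRecord(0, 5), TaskRecord(0, 4), TaskRecord(5, 10), TaskRecord(4, 9)]
tp, ms, util = htc_metrics(tasks, node_count=2)
```

Here the makespan is 10 seconds, throughput is 0.4 tasks per second, and utilization is 95%, since 19 of the 20 available node-seconds were busy.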

History and Development

Origins in Distributed Systems

High-throughput computing (HTC) emerged in the 1990s as an extension of distributed systems research, particularly through initiatives in grid computing that sought to harness geographically dispersed resources for sustained, large-scale computations. Early efforts were driven by the need to aggregate idle computing power from workstations and clusters across institutions, moving beyond the limitations of dedicated hardware. The Condor project, initiated at the University of Wisconsin-Madison in 1984 but gaining prominence in the mid-1990s, exemplified this approach by enabling opportunistic scheduling of jobs on non-dedicated resources, achieving fault-tolerant execution through mechanisms like process checkpointing and job migration to handle node failures in unreliable networks. By the late 1990s, Condor's matchmaking algorithms facilitated efficient resource allocation for high-volume workloads, establishing HTC as a paradigm for delivering vast computational capacity over extended periods.

Grid computing initiatives further solidified HTC's foundations, with key contributions from researchers like Ian Foster and Carl Kesselman, who developed grid middleware to enable seamless resource sharing across wide-area networks. Their 1998 book, The Grid: Blueprint for a New Computing Infrastructure, outlined a vision for coordinating heterogeneous systems, while the Globus Toolkit—released that same year—provided essential protocols for job submission, security, and data transfer, supporting loosely coupled applications typical of HTC. This work responded to the escalating data volumes in scientific fields, such as high-energy physics and astronomy, where centralized supercomputers proved insufficient for processing petabyte-scale datasets generated by experiments like those at particle accelerators. The shift emphasized decentralized pooling of internet-scale resources, allowing scientists to access on-demand computing without owning expensive hardware.
A landmark influence on HTC's practical adoption came from volunteer computing projects, notably SETI@home, launched in 1999 by the University of California, Berkeley. This initiative distributed radio signal analysis tasks to millions of volunteered personal computers worldwide, demonstrating HTC's viability in public-resource environments by achieving peak throughputs exceeding 27 teraFLOPS through independent, fault-tolerant work units that tolerated intermittent connectivity and volunteer attrition. Early grid efforts, including integrations like Condor-G, bridged these volunteer models with institutional grids, prioritizing resilience in heterogeneous, failure-prone distributed setups.

Major Milestones and Evolutions

The rise of high-throughput computing (HTC) gained momentum in the mid-2000s amid the big data era, with the release of Apache Hadoop in 2006 marking a pivotal advancement. Hadoop adapted Google's MapReduce programming model to enable distributed, fault-tolerant processing of massive datasets across clusters of commodity hardware, prioritizing sustained throughput over low-latency performance for batch-oriented workloads.

In the 2010s, middleware systems evolved significantly to enhance HTC scalability in distributed environments. HTCondor, originally developed as Condor in 1984 and renamed in 2012, underwent key updates including the adoption of multicore pilots by 2015, enabling efficient resource utilization on grid sites and scaling to tens of thousands of CPU cores for experiments like those at CERN's Large Hadron Collider (LHC). Concurrently, pilot agent frameworks such as glideinWMS, introduced around 2008-2009, revolutionized workload management by dynamically provisioning virtual pools of resources, abstracting site heterogeneities and improving job throughput to handle up to 100,000 queued jobs across hundreds of sites.

The integration of HTC with cloud computing accelerated in the late 2000s and 2010s, exemplified by Amazon Web Services' launch of EC2 Spot Instances in December 2009, which allowed bidding on unused capacity for up to 90% cost savings on flexible, interruptible workloads. This mechanism proved particularly effective for HTC applications, where jobs could be decomposed into small, preemptible tasks to maximize utilization and cost efficiency in hybrid batch systems like HTCondor.

By the 2020s, HTC transitioned toward hybrid models that seamlessly combine traditional grids, public clouds, and emerging edge resources to address demands in scientific workflows. This evolution emphasizes tighter integration of workflow management systems with diverse infrastructures, enabling efficient execution in multi-site, multi-provider environments for data-intensive domains like high-energy physics.
In 2024, the HTCondor project celebrated its 40th anniversary, recognizing its foundational role in HTC since 1984.

Architectures and Technologies

Resource Management Systems

Resource management systems in high-throughput computing (HTC) environments serve as the foundational layer for dynamically allocating computational resources to jobs across distributed, often heterogeneous, networks of machines. These systems typically employ brokers and schedulers that evaluate job requirements against available capabilities, matching them based on predefined policies to optimize throughput while accommodating variability in availability. A key policy in such systems is fair-share allocation, which ensures equitable distribution of computing capacity among users or groups by prioritizing jobs from those with historically lower utilization, thereby preventing resource monopolization and promoting sustained high throughput over time.

In prominent HTC implementations like HTCondor, resource matchmaking is facilitated through mechanisms such as ClassAds, which enable dynamic advertising of resource attributes and job requirements in a classified-advertisement style framework. Introduced in the late 1990s, ClassAds allow nodes to advertise their current state—such as CPU availability, memory, and operating system—while jobs specify needs like execution environment or minimum memory, enabling a centralized matchmaker to pair them efficiently without tight coupling to specific hardware. This approach supports opportunistic usage of resources, including transient or volunteered nodes, by evaluating constraints and rank expressions to select optimal matches. The seminal design of ClassAds, as detailed in early publications, has influenced subsequent distributed scheduling paradigms by decoupling resource discovery from allocation decisions.

To address heterogeneity in HTC environments—where nodes may vary in architecture, software stacks, or availability—lightweight virtualization via containers has become integral since the introduction of Docker in 2013.
Containers encapsulate job dependencies and runtime environments, allowing seamless deployment across diverse nodes without full virtualization overhead, thus maintaining high throughput by minimizing setup times and ensuring reproducibility. In scientific HTC, Apptainer (formerly Singularity) is also widely used, enabling non-privileged users to run containers securely on shared clusters. For instance, Docker's layered filesystem and image-based distribution enable rapid instantiation of isolated workloads, which is particularly beneficial in opportunistic grids where node configurations fluctuate. Studies on container performance in HPC contexts confirm near-native efficiency for many HTC workloads.

Monitoring tools are essential for tracking resource utilization and enabling proactive management in HTC systems, providing real-time metrics on cluster health, job progress, and bottlenecks. Ganglia, an open-source distributed monitoring system developed for high-performance clusters and grids, aggregates data from nodes using a multicast-based protocol to visualize metrics like CPU load, network traffic, and memory usage across large-scale deployments. In HTC setups, such as those integrated with HTCondor, Ganglia's hierarchical design supports scalable monitoring of thousands of nodes, facilitating fault detection and load balancing to sustain throughput. Its extensibility allows custom metrics tailored to HTC needs, like job queue depths, ensuring administrators can optimize resource policies based on empirical data.
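The matchmaking cycle described above — both sides advertise attributes, both sides state constraints, and a matchmaker ranks the mutual matches — can be sketched in a few lines of Python. The attribute and expression names here are illustrative stand-ins, not the actual HTCondor ClassAd language:

```python
def matchmake(job, machines):
    # A machine is a candidate only if the job's requirements accept the
    # machine AND the machine's requirements accept the job (two-sided match).
    candidates = [
        m for m in machines
        if job["Requirements"](m) and m["Requirements"](job)
    ]
    # The job's Rank expression orders candidates; higher is better.
    return max(candidates, key=job["Rank"], default=None)

machines = [
    {"Name": "slot1", "OpSys": "LINUX", "Memory": 4096,
     "Requirements": lambda j: j["ImageSize"] <= 4096},
    {"Name": "slot2", "OpSys": "LINUX", "Memory": 16384,
     "Requirements": lambda j: j["ImageSize"] <= 16384},
]
job = {
    "ImageSize": 8192,
    "Requirements": lambda m: m["OpSys"] == "LINUX" and m["Memory"] >= 8192,
    "Rank": lambda m: m["Memory"],  # prefer the machine with the most memory
}
best = matchmake(job, machines)  # only slot2 satisfies both sides
```

The design point this illustrates is the decoupling noted above: neither side names the other directly, so transient or volunteered machines can join the pool simply by advertising an ad.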

Workflow and Job Scheduling Frameworks

In high-throughput computing (HTC), workflows are commonly modeled using Directed Acyclic Graphs (DAGs), where nodes represent individual tasks or jobs, and directed edges denote dependencies between them, ensuring no cycles to prevent indefinite waits. This structure allows for the orchestration of complex, data-intensive computations across distributed resources, such as grids or clusters, by enabling parallel execution of independent tasks while respecting sequential constraints. The execution model processes DAG nodes in topological order, submitting a job only after all its predecessor tasks have completed successfully, which maximizes resource utilization and throughput in environments with variable availability.

Workflow schedulers in HTC, such as Pegasus, facilitate the management of these DAG-based workflows by mapping abstract, high-level descriptions—defined in formats like XML or YAML—into concrete executable plans tailored to available resources. Developed from 2001 at the University of Southern California's Information Sciences Institute, Pegasus automates the planning process, including data staging, task transformation, and resource mapping across heterogeneous systems like grids and clouds, thereby enabling scalable execution of scientific computations. Similarly, HTCondor's DAGMan tool supports DAG construction and submission, integrating seamlessly with the broader HTC ecosystem to handle workflows comprising thousands of interdependent jobs.

Fault recovery mechanisms are integral to HTC frameworks to maintain reliability in unreliable distributed environments, where node failures or network issues are common. Checkpointing periodically saves the state of executing tasks, allowing restarts from the last valid point without recomputing prior work, while resubmission automatically retries failed tasks up to a configurable limit, often with exponential backoff to avoid overwhelming resources.
In Pegasus, for instance, these features generate rescue DAGs that resume workflows from the point of failure, preserving provenance data to track execution history and debug issues. DAGMan employs analogous recovery by monitoring job status and requeueing aborted jobs, ensuring minimal disruption to overall throughput.

Integration with batch systems enhances HTC schedulers' ability to handle large-scale job queuing on cluster infrastructures. Pegasus, for example, interfaces with SLURM through intermediaries like HTCondor, submitting tasks as batch jobs while leveraging SLURM's partitioning and priority queuing for efficient scheduling in high-volume scenarios. SLURM itself supports HTC by optimizing for short, numerous jobs via features like job arrays and fair-share scheduling, allowing frameworks to enqueue thousands of tasks without manual intervention. This synergy enables seamless scaling, where abstract workflows are decomposed into queued executions that adapt to cluster dynamics.
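The submit-when-parents-finish cycle, combined with the retry-and-backoff policy described above, can be sketched as follows. This is a simplified serial executor for illustration; real DAG schedulers dispatch all ready tasks in parallel:

```python
import time
from collections import deque

def run_dag(deps, run_task, max_retries=3, base_delay=0.01):
    """Execute a DAG of tasks in topological order: a task is submitted only
    once all its parents have succeeded, and a failed task is resubmitted
    with exponential backoff up to max_retries attempts.

    deps maps task -> set of parent tasks; run_task(name) returns True/False.
    """
    indegree = {t: len(parents) for t, parents in deps.items()}
    children = {t: set() for t in deps}
    for t, parents in deps.items():
        for p in parents:
            children[p].add(t)
    ready = deque(t for t, d in indegree.items() if d == 0)
    done = []
    while ready:
        task = ready.popleft()
        for attempt in range(max_retries):
            if run_task(task):
                break
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
        else:
            raise RuntimeError(f"task {task!r} failed after {max_retries} tries")
        done.append(task)
        for child in children[task]:
            indegree[child] -= 1
            if indegree[child] == 0:    # all parents finished: child is ready
                ready.append(child)
    return done  # a valid topological order of the completed tasks

# A three-node workflow: "merge" may run only after "align" and "filter".
order = run_dag(
    {"align": set(), "merge": {"align", "filter"}, "filter": set()},
    run_task=lambda name: True,
)
```

A rescue-DAG mechanism amounts to persisting `done` so a rerun can skip completed nodes and resume from the failure point.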

Differences from High-Performance Computing

High-throughput computing (HTC) and high-performance computing (HPC) represent distinct paradigms in distributed and parallel computing, each optimized for different workload characteristics and resource utilization strategies. HTC emphasizes the execution of numerous independent, loosely coupled tasks over extended periods to maximize overall system throughput, often measured in jobs completed per month or floating-point operations per year (FLOPY), whereas HPC focuses on accelerating tightly coupled, synchronous computations that require rapid data exchange among processes to achieve high peak performance. This contrast arises from HTC's suitability for embarrassingly parallel workloads, where tasks lack interdependencies and can run autonomously across heterogeneous resources, in opposition to HPC's reliance on coordinated message passing for complex simulations.

In terms of hardware, HPC systems typically employ specialized clusters with low-latency interconnects, such as InfiniBand, to minimize communication overhead in tightly coupled applications, enabling efficient synchronization across hundreds or thousands of nodes. Conversely, HTC leverages commodity hardware, including desktops, departmental servers, and cloud instances, without requiring high-speed fabrics, as tasks are designed to tolerate variable latencies and operate independently on distributed, opportunistic resources. Performance objectives further diverge: HPC prioritizes floating-point operations per second (FLOPS) to deliver rapid results for compute-intensive problems, while HTC optimizes for sustained job completion rates, ensuring high aggregate output over time despite potential delays in individual runs.

A representative example illustrates these differences: in HPC, weather modeling often involves a single large-scale simulation using the Message Passing Interface (MPI) for tightly coupled communication across nodes to solve coupled atmospheric equations in near-real time.
In contrast, HTC supports genomic sequencing through millions of independent sequence alignments or marker evaluations, processed as separate jobs on distributed clusters to achieve high throughput for large datasets. These trade-offs highlight HTC's focus on aggregate throughput for batch-oriented, long-running computations versus HPC's emphasis on low-latency synchronization for interactive, high-speed analysis.
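The contrast can be made concrete: an HTC-style run is simply an unsynchronized map over independent tasks. The sketch below uses a toy scoring function standing in for a real alignment tool, and a local thread pool standing in for the cluster nodes a production system would fan the same tasks out to:

```python
from concurrent.futures import ThreadPoolExecutor

def score_alignment(pair):
    # Toy stand-in for one independent HTC task (scoring one sequence pair);
    # a real workload would invoke an external tool such as BLAST here.
    a, b = pair
    return sum(1 for x, y in zip(a, b) if x == y)

def farm_out(pairs, workers=4):
    # No inter-task messaging, barriers, or shared state: each pair is scored
    # on its own, which is the HTC model (contrast with MPI's tight coupling).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(score_alignment, pairs))

scores = farm_out([("GATTACA", "GATTGCA"), ("ACGT", "TGCA")])
```

Because no task ever waits on another, a straggling or failed task delays only itself, which is why this model tolerates commodity hardware and variable latencies.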

Relation to Many-Task Computing

Many-task computing (MTC) represents an extension of high-throughput computing (HTC) by accommodating diverse task granularities, ranging from short bursts of seconds to prolonged jobs lasting hours or days, all orchestrated within a single cohesive framework. This approach enhances HTC's focus on sustained resource utilization by introducing flexibility for heterogeneous workloads, enabling efficient execution on grids, clusters, and supercomputers without rigid assumptions about task uniformity. The MTC paradigm was proposed by Ian Foster and collaborators in 2008 as a means to unify HTC with high-performance computing (HPC), leveraging HTC's scheduling model for data-intensive, task-parallel applications while adapting HPC infrastructures for broader workloads.

A pivotal tool in this integration is the Swift parallel scripting language, developed in 2007, which facilitates HTC workflows by allowing users to define implicitly parallel scripts that manage variable task durations through dataflow-driven execution and dynamic load balancing. Swift's model abstracts file-based data handling, supporting tasks from brief computations to extended simulations, thus bridging HTC's batch-oriented style with MTC's adaptability. While HTC and MTC share common challenges, such as minimizing data movement overheads through techniques like local caching and data-aware scheduling, HTC generally presumes uniform, batch-style tasks with consistent durations and minimal inter-task communication. In contrast, MTC extends this by incorporating support for interactive, communication-intensive elements and dynamic task graphs, fostering more versatile applications in scientific computing.
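Swift's dataflow idea — downstream work fires once its inputs exist, regardless of how long each producer takes — can be approximated with futures. This is a sketch of the concept only, not Swift syntax:

```python
from concurrent.futures import ThreadPoolExecutor

def simulate(seed):
    # Stand-in for a long-running task (hours or days in a real campaign).
    return seed * seed

def summarize(values):
    # Stand-in for a short downstream task (seconds).
    return sum(values)

with ThreadPoolExecutor(max_workers=4) as pool:
    runs = [pool.submit(simulate, s) for s in range(5)]  # fan-out, any duration
    results = [f.result() for f in runs]                 # blocks only until ready
    answer = summarize(results)                          # fires once inputs exist
```

The mix of granularities is the point: the five `simulate` tasks may each run for days while `summarize` takes seconds, and the same dataflow description covers both without the uniform-batch assumption of classic HTC.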

Applications and Use Cases

Scientific Research Domains

High-throughput computing (HTC) has become integral to scientific research domains where vast datasets and numerous independent computations are processed in parallel, enabling researchers to tackle complex problems that require extensive computation and data handling. In fields like bioinformatics, astronomy, climate modeling, and high-energy physics, HTC systems distribute workloads across distributed resources, such as grids and clusters, to achieve scalability and fault tolerance without tight coupling. This approach contrasts with HPC by emphasizing resilience and high job throughput over peak speed for individual tasks.

In bioinformatics, HTC facilitates large-scale sequence alignment tasks, such as running thousands of Basic Local Alignment Search Tool (BLAST) jobs against massive protein databases like those from the National Center for Biotechnology Information (NCBI). Since the early 2000s, tools like Soap-HT-BLAST have leveraged web services and distributed clusters to parallelize these alignments, processing terabytes of genomic data for applications in gene discovery and evolutionary studies. For instance, frameworks such as Divide and Conquer BLAST (DCBLAST) enable rapid execution of NCBI BLAST+ on cluster environments, reducing turnaround times for aligning millions of sequences from high-throughput sequencing experiments. This has been crucial for projects analyzing whole-genome data, where HTC workflows handle the embarrassingly parallel nature of pairwise alignments without requiring specialized hardware.

Astronomy benefits from HTC in processing pipelines for telescope data, particularly in projects generating enormous volumes of images and catalogs. The Vera C. Rubin Observatory's Legacy Survey of Space and Time (LSST), which began operations in 2025, relies on HTC infrastructures like the Open Science Grid (OSG) to manage the influx of 20 terabytes of nightly data for tasks such as transient detection and image calibration. As of June 2025, the observatory released its first images, with full survey operations commencing later that year, further demonstrating HTC's role in handling initial data volumes.
Systems like RabbitQR integrate HTC with message queuing to handle LSST's data rates, distributing reduction steps across fluctuating worker pools for analysis of cosmic events. The LSST Data Facilities plan incorporates grid-based HTC for very high-throughput storage and computation, ensuring scalable handling of petabyte-scale archives for multi-wavelength studies.

Climate modeling employs HTC for ensemble runs that explore parameter variations to quantify uncertainty in projections. Large ensembles, such as those from the Community Earth System Model (CESM), use distributed computing to simulate thousands of scenarios varying initial conditions and forcings, providing probabilistic insights into future climate states. HTC platforms, including grids, have enabled studies like the BRIDGE HadCM3 family of models, where high-throughput execution across facilities processes perturbed parameter ensembles to assess regional impacts like precipitation changes. For example, the Magellan project demonstrated HTC's role in scaling climate simulations for ensemble analysis, supporting workflows that integrate observational data with model outputs for uncertainty quantification.

In high-energy physics, HTC supports event simulation for particle colliders, exemplified by Large Hadron Collider (LHC) data processing since its 2008 startup. The ATLAS and CMS experiments use global HTC systems like the Worldwide LHC Computing Grid (WLCG) to generate and simulate billions of proton-proton collision events, distributing simulations across thousands of sites to model detector responses and backgrounds. Post-2008, this infrastructure has processed exabytes of data through high-throughput pipelines, with tools like BOINC enabling volunteer and opportunistic computing for event generation. Recent integrations, such as ATLAS's shift to high-performance platforms for simulation, maintain HTC principles to handle the increased luminosity in LHC Run 3, ensuring timely physics analyses.

Industrial and Data-Intensive Workloads

In industrial sectors, high-throughput computing (HTC) enables the processing of vast workloads at scale to deliver business value through rapid analysis and simulation, often leveraging distributed resources like clusters or cloud systems to handle independent tasks. This approach is particularly suited to data-intensive applications where volume and speed directly impact competitiveness and revenue, such as in finance, media production, e-commerce, and pharmaceuticals. By utilizing opportunistic and dedicated resources, HTC systems like HTCondor manage thousands of independent jobs, reducing turnaround times from days to hours while optimizing costs on heterogeneous infrastructure.

In finance, HTC supports Monte Carlo simulations for risk management, where millions of market scenarios are generated and evaluated daily to compute metrics like value at risk (VaR). These simulations model price distributions using historical data and stochastic processes, such as normally distributed changes in interest rates and equities, to quantify potential losses at confidence levels like 95% or 99%. Distributed environments, such as those built with the Globus Toolkit or HTCondor, parallelize the independent evaluation of scenarios across nodes, enabling faster execution than single-machine setups—for instance, processing large portfolios with payments in multiple currencies. This scalability allows firms to handle the computational demands of real-time risk management, where a single simulation might require evaluating millions of paths to inform hedging and investment decisions.

Media and entertainment industries rely on HTC for rendering farms that produce computer-generated imagery (CGI) in films through frame-by-frame parallelism. Render farms distribute rendering tasks—such as ray tracing and global illumination for complex scenes—across clusters of commodity hardware, harnessing idle cycles from desktop workstations to avoid dedicated high-cost setups. For example, open-source tools like Blender integrated with HTCondor can render a 400-frame animation in under an hour on a 60-core pool, compared to over seven hours on a single machine, achieving approximately 7.6-fold speedup.
Similarly, TeraGrid-based environments using RenderMan-compliant renderers have scaled to 200 nodes, reducing a 3,600-frame high-definition sequence from 120 hours to 36 minutes. This parallelism is essential for tight production deadlines, where each frame may involve intensive computations for lighting and textures in films.

E-commerce platforms employ HTC for log analysis and training recommendation engines on petabyte-scale data, processing user interactions, transaction histories, and behavioral logs to personalize offerings and optimize inventory. Distributed frameworks like Apache Spark enable collaborative filtering algorithms, such as alternating least squares (ALS), to factorize massive user-item matrices across clusters, handling billions of events from diverse sources including clickstreams and purchases. This high-throughput processing supports real-time insights, for instance, by analyzing petabytes of data to generate recommendations that drive sales through pattern detection in shopping behaviors. Scalable ingestion and computation ensure low-latency updates, with systems distributing workloads to manage the velocity of incoming logs from millions of users daily.

In pharmaceutical drug discovery, HTC facilitates virtual screening of compound libraries against protein targets, evaluating docking scores for millions of molecules to identify potential leads. Frameworks like HHVSF, built on HTCondor, automate ligand-based and structure-based screening in distributed environments, parallelizing independent docking simulations to scan ultra-large libraries efficiently. For example, opportunistic grids via the Open Science Grid (OSG) have scaled screenings to billions of compounds, reducing discovery timelines from months to weeks by leveraging heterogeneous resources for tasks like binding affinity predictions. This approach prioritizes cost-effective exploration of chemical spaces, focusing on hit identification before costly wet-lab validation.
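The value-at-risk workload described in the finance paragraph above translates directly into code: VaR is a percentile of a simulated profit-and-loss distribution, and every path is independent, which is why the job splits cleanly across HTC nodes. A minimal single-node sketch with illustrative parameters (not any firm's actual model):

```python
import random

def simulate_pnl(n_paths, mu=0.0005, sigma=0.02, value=1_000_000, seed=7):
    # One-day portfolio P&L under normally distributed returns, the
    # stochastic model described above. Paths are independent, so an HTC
    # system can shard n_paths across many nodes and merge the results.
    rng = random.Random(seed)
    return [value * rng.gauss(mu, sigma) for _ in range(n_paths)]

def value_at_risk(pnl, confidence=0.99):
    # VaR at the chosen confidence level: the loss exceeded in only
    # (1 - confidence) of scenarios, reported as a positive number.
    losses = sorted(-x for x in pnl)
    idx = min(int(confidence * len(losses)), len(losses) - 1)
    return losses[idx]

pnl = simulate_pnl(100_000)
var_99 = value_at_risk(pnl, confidence=0.99)  # 99% one-day VaR in dollars
```

In an HTC deployment each node would run `simulate_pnl` on its own shard with an independent seed, and only the merged percentile computation needs the combined results.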

Challenges and Future Directions

Technical Hurdles

High-throughput computing (HTC) systems often incorporate resources from heterogeneous environments, such as clusters with varying CPU architectures, memory capacities, and network interfaces, which complicates effective workload distribution. This variability leads to load imbalances where some nodes remain underutilized while others are overburdened, reducing overall system throughput and increasing completion times for bag-of-tasks applications. For instance, in medical image processing workflows deployed on HPC-HTC platforms, small task sizes exacerbate imbalances across GPUs, resulting in GPU utilization below 50% and performance degradation as parallelism increases.

Data transfer bottlenecks pose another major challenge in HTC, particularly across wide-area networks where latency and bandwidth limitations hinder efficient movement of input/output files for distributed tasks. In disk-to-disk transfers common to scientific workflows, multiple subsystems—including storage I/O, network protocols, and application parameters—create identifiable chokepoints that limit achievable throughput, often requiring manual tuning to approach peak performance. These issues become pronounced in large-scale deployments, where the cumulative cost of data staging and retrieval can scale linearly with the number of tasks (O(n) for n independent jobs), amplifying delays in resource-opportunistic environments like grids.

Security concerns are acute in volunteer-based HTC grids, where untrusted nodes from public contributors execute code, raising risks of malicious behavior such as result falsification or data theft. To mitigate this, systems employ sandboxing techniques that isolate applications, for example by running them under unprivileged user accounts or within virtual machines to prevent access to host resources and block harmful code execution.
In platforms like BOINC, code signing ensures only verified applications run, while replication on multiple nodes validates results against potential tampering, though anonymous participants still introduce vulnerabilities in result integrity.

Energy consumption in large-scale HTC deployments is inefficient due to the opportunistic harvesting of idle cycles from desktop and grid resources, where baseline power draw persists even during low-utilization periods. Simulations of grid environments reveal that resources operating at 5-10% CPU load contribute substantially to wasted energy, as evicted jobs and migration overheads prevent full exploitation of harvested cycles. These harvesting dynamics are compounded by the inability to power down nodes without risking job loss in fault-prone settings.

Emerging Trends

One prominent emerging trend in high-throughput computing (HTC) is the integration of serverless computing paradigms to enable automatic scaling of batch jobs and workflows. Serverless platforms, such as AWS Lambda, launched in 2014, allow HTC applications to execute functions on-demand without provisioning or managing servers, facilitating elastic resource allocation for variable workloads like scientific simulations and data processing pipelines. This approach addresses scalability challenges by dynamically adjusting compute resources, reducing idle times, and improving overall throughput; recent frameworks enhance GPU serverless execution with pre-execution data analysis to achieve faster setup times and up to 2.5x higher throughput compared to baseline systems in inference tasks. Similarly, adaptations of serverless runtimes for HPC environments, including HTC, enable seamless integration with existing batch schedulers like HTCondor or Slurm.

AI-driven scheduling represents another key innovation, leveraging machine learning to predict resource demands and optimize job placement for maximized throughput.
Reinforcement learning-based algorithms, such as the multi-action deep reinforcement learning used in GAS-MARL, align job scheduling with renewable energy availability to improve performance in HPC clusters. For HTC-specific scenarios involving large-scale data analytics, AI schedulers incorporate predictive models to handle bursty arrivals, as seen in throughput-optimal policies for AI inference serving that guarantee maximum system capacity under varying request sizes. These methods extend traditional HTC schedulers by integrating neural-network models into placement decisions.

Extensions to edge computing are enabling low-latency HTC for continuous data streams, where distributed edge nodes handle high-volume, real-time tasks closer to data sources. Edge-oriented scheduling frameworks employ topology-aware graph partitioning and matching algorithms to place jobs across edge devices, reportedly delivering up to 3x higher throughput and sub-10ms latencies for latency-sensitive streaming applications. This paradigm mitigates central bottlenecks in HTC by offloading compute-intensive filtering and aggregation, with studies showing 40% reductions in end-to-end delay for geo-distributed networks while sustaining gigabit-per-second data rates. In practice, such integrations support scalable HTC for edge-IoT ecosystems, as evidenced by models that balance local execution with opportunistic bursting to remote resources.

Sustainability efforts in HTC focus on green scheduling algorithms designed to minimize energy consumption and carbon footprints amid growing deployment scales. Algorithms like GAS-MARL utilize multi-action deep reinforcement learning to align job scheduling with renewable energy availability in HPC and HTC clusters, prioritizing low-carbon time slots. Cloud-based frameworks that incorporate real-time carbon-intensity data into task allocation further optimize for eco-efficiency, as demonstrated by reduced carbon costs for data-intensive workloads through predictive placement on greener data centers. These approaches, often plugin-compatible with standard HTC resource managers, emphasize holistic metrics such as power usage effectiveness (PUE), promoting sustainable scaling as systems approach and exceed exascale thresholds.
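The low-carbon time-slot prioritization described above can be sketched as carbon-aware delay scheduling: given a carbon-intensity forecast, a deferrable batch job is placed in the window with the lowest total intensity. This is a minimal sketch in the spirit of green schedulers such as GAS-MARL, not their algorithm; real systems also weigh deadlines, queue fairness, and renewable-supply predictions. The forecast values are hypothetical.

```python
def greenest_window(intensity_forecast, job_slots):
    """Pick the start index of the `job_slots`-long window with the lowest
    total forecast carbon intensity (e.g. gCO2/kWh per hourly slot).

    Exhaustive scan is fine here: forecasts are short (hours to a few
    days), so the O(n * job_slots) cost is negligible next to the job.
    """
    best_start, best_cost = 0, float("inf")
    for start in range(len(intensity_forecast) - job_slots + 1):
        cost = sum(intensity_forecast[start:start + job_slots])
        if cost < best_cost:
            best_start, best_cost = start, cost
    return best_start

# Hourly forecast: the 3-hour job is deferred into the low-carbon trough.
forecast = [450, 400, 120, 100, 110, 380]
print(greenest_window(forecast, 3))  # 2
```

Because HTC workloads are throughput-oriented rather than latency-oriented, they tolerate exactly this kind of temporal shifting, which is what makes carbon-aware placement attractive for them.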
