Portable Batch System
The Portable Batch System (PBS) is a workload management and job scheduling software suite designed for high-performance computing (HPC) environments, enabling the efficient allocation and execution of computational tasks across distributed resources such as clusters, clouds, and supercomputers.[1] Originally developed in the early 1990s as an enhancement to the Network Queuing System (NQS), PBS adheres to POSIX 1003.2d standards for batch job processing and supports both batch and interactive workloads by managing job queues, resource allocation, and execution monitoring.[2]
PBS originated as a collaborative project between NASA's Ames Research Center Numerical Aerospace Simulation (NAS) Systems Division and the National Energy Research Supercomputer Center (NERSC) at Lawrence Livermore National Laboratory (LLNL), with key contributions from developers including Albeaus Bayucan, Robert L. Henderson, and others at MRJ Technology Solutions.[2] The system was first released in alpha form in June 1994 (version 1.0) and evolved through milestones like version 2.0 in 1998, which introduced advanced features such as job dependency management and pluggable schedulers programmable in Tcl or the Batch Scheduling Language (BASL).[2] In 2016, Altair Engineering released the code base as open source, rebranded as OpenPBS in 2020, fostering community-driven enhancements; PBS has been deployed at thousands of sites worldwide over more than two decades. As of 2023, the latest OpenPBS release is version 23.06, with community-driven maintenance continuing into 2025.[1]
At its core, PBS comprises several interconnected components: the pbs_server for central job and queue management; the pbs_sched for policy-driven scheduling cycles that balance turnaround time and resource utilization; and the pbs_mom (Machine-Oriented Mini-Server) daemons on execution hosts for local job execution, resource monitoring, and fault detection via health checks.[2] Additional elements include the Interface Facility (IFF) for secure user authentication and the Batch Interface Library (IFL) for developing custom clients through APIs like pbs_submit and pbs_statjob.[2] These components enable scalability to millions of cores—tested on over 50,000 nodes—and resiliency features such as automatic failover with no single point of failure, making PBS a foundational tool for optimizing HPC productivity.[1]
Key commands in PBS, such as qsub for job submission, qstat for monitoring, and qdel for deletion, allow users to script and manage workflows, often via shell scripts with directives like #PBS -l nodes=1:ppn=16 to request specific resources.[3] PBS's flexible plugin framework supports customization for modern middleware and applications, while its policy-driven approach ensures fair resource sharing and efficient handling of dependencies in parallel computing tasks.[1] Widely used in facilities like national laboratories and supercomputing centers, PBS continues to influence HPC ecosystem tools, with ongoing developments emphasizing portability and integration with emerging technologies.[1]
Overview
Definition and Scope
The Portable Batch System (PBS) is a computer software system for job scheduling and workload management, designed to allocate computational tasks to resources in distributed computing environments. It originated as a flexible solution for managing batch jobs across heterogeneous systems, enabling efficient execution of non-interactive workloads on clusters, supercomputers, and grids. PBS operates by queuing jobs, assigning them to available compute nodes based on resource requirements and policies, and monitoring their progress to optimize hardware utilization and throughput.[4][1]
The scope of PBS centers on batch processing in multi-node systems, where it handles the full lifecycle of job queuing, execution, and resource allocation without requiring interactive user intervention during runtime. Batch jobs in PBS are typically non-interactive scripts or executables submitted to the system, which stages input data, executes the tasks on designated hosts, and returns output to specified files upon completion, ensuring seamless management of computational workloads in resource-constrained environments. This focus distinguishes PBS from tools geared toward real-time or interactive computing, prioritizing automated, scalable processing for high-volume tasks.[5][6]
A key concept in PBS is the distinction between batch jobs—non-interactive, script-based tasks that run independently—and interactive sessions, which connect directly to user terminals but still leverage PBS for resource scheduling. The system's portability across Unix-like operating systems, including Linux variants, as well as Windows platforms, allows it to be deployed in diverse infrastructures without major modifications to job scripts, supporting broad adoption in high-performance computing ecosystems.[4][1]
The Portable Batch System (PBS) serves as a critical workload manager in high-performance computing (HPC) clusters, facilitating the distribution of parallel batch jobs across multiple nodes to optimize the utilization of CPU, memory, and storage resources. By queuing and dispatching jobs to available compute nodes, PBS ensures that computational tasks, such as large-scale simulations, are executed efficiently without manual intervention, allowing users to submit jobs via scripts that specify resource requirements like node count and runtime limits. This role is particularly vital in distributed environments where resources must be dynamically allocated to match varying workloads, preventing bottlenecks and enabling seamless integration with parallel programming models like MPI.[7][5]
In supercomputing facilities, PBS has been extensively applied for scientific simulations, data processing, and large-scale computations, notably at NASA's Numerical Aerospace Simulation (NAS) facility at Ames Research Center, where it manages jobs on systems like Electra and Aitken. Originally developed as a joint project involving NASA Ames, Lawrence Livermore National Laboratory, and others, PBS replaced earlier systems like NQS to handle complex aerospace and scientific workloads, supporting exclusive access to compute nodes for resource-intensive tasks such as climate modeling and fluid dynamics simulations. Its deployment across all NAS supercomputers underscores its reliability in government-funded HPC infrastructures for advancing research in physics, engineering, and bioinformatics.[8][5][3]
PBS enhances HPC throughput by minimizing resource idle time through intelligent queuing and backfilling techniques, achieving utilization rates up to 85.6% on petaflop-scale systems like the Kraken supercomputer, where it processed over 4 million jobs from 2009 to 2014. It supports scalability for managing thousands to millions of jobs concurrently, as demonstrated by its handling of 2.1 million jobs on the Oakley system with queue prioritization to balance loads across clusters. Additionally, PBS promotes fair sharing among users via policies such as mission-specific resource limits and historical usage tracking, ensuring equitable access in multi-user environments without compromising overall system performance.[9][10][7][5]
Architecture
Core Components
The Portable Batch System (PBS) architecture relies on a set of fundamental daemons and modules that enable distributed job management across high-performance computing environments. These core components include the central server daemon, execution agents on compute nodes, the scheduler, the communication daemon, and client interaction tools, which collectively handle job queuing, execution, and resource allocation.[11]
pbs_server serves as the central management daemon in PBS, acting as the primary point of contact for all client communications and overseeing the overall state of the batch system. It accepts job submissions from users, maintains the job queue and associated metadata, and routes jobs to appropriate execution hosts or queues based on configured policies. Additionally, pbs_server manages system resources at the server level, enforces access controls such as user and host lists, processes administrative commands, and logs events and accounting data to track system activity. It communicates with other components by updating node status files, handling failover in multi-server setups, and authenticating incoming requests to ensure secure operations.[11]
pbs_mom, or Machine Oriented Mini-server, operates as the execution daemon on individual compute nodes (or virtual nodes) within the PBS cluster. This component is responsible for launching jobs on its host, monitoring resource usage during execution—such as CPU, memory, and network utilization—and reporting status updates back to pbs_server. pbs_mom enforces resource limits, manages file staging and transfers for job inputs and outputs, and executes prologue and epilogue scripts to prepare and clean up the execution environment. It supports advanced features like checkpointing and dynamic resource detection through customizable scripts, while coordinating with sister pbs_mom instances for multi-node parallel jobs. By maintaining vnode (virtual node) associations and sending periodic resource reports, pbs_mom ensures accurate tracking of available compute capacity across the cluster.[11]
pbs_sched functions as the dedicated scheduler daemon, which analyzes the job queue and available resources to determine the optimal execution order for pending jobs. It applies configurable scheduling policies, such as first-in-first-out (FIFO), fairshare, or backfilling, to select and prioritize jobs while optimizing cluster utilization. pbs_sched runs periodic cycles to evaluate queue states, calculate estimated start times, and issue directives to pbs_server for job initiation on suitable nodes. This daemon supports advanced optimizations like preemption, reservations, and placement sets to handle complex workloads, and it logs its decisions for auditing purposes. Through queries to pbs_server and feedback from pbs_mom, pbs_sched maintains an up-to-date view of system resources to make informed allocation decisions.[11]
pbs_comm is the communication daemon that facilitates secure and efficient inter-daemon communication within the PBS complex, particularly in multi-host and failover configurations. Introduced in PBS version 13.0, it handles TCP-based messaging between pbs_server, pbs_sched, pbs_mom, and other components, supporting features like leaf routers for large-scale clusters to reduce network overhead. pbs_comm runs on server hosts and execution nodes as needed, ensuring reliable data exchange for status updates, job directives, and resource queries.[12]
Client commands in PBS provide the interface for users and administrators to interact with the system daemons without delving into their internal operations. For instance, qsub is the primary command for submitting batch jobs or scripts to the queue managed by pbs_server, allowing specification of resource requirements and job attributes that influence scheduling by pbs_sched. Other commands, such as qstat for querying job status and qdel for deletion, facilitate basic interactions but defer detailed monitoring and management to the core daemons. These tools communicate directly with pbs_server to relay user requests, ensuring seamless integration with the underlying architecture.[11]
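A minimal command-line sketch of this interaction (the job script name and server hostname are hypothetical):

```bash
# Submit a job script; pbs_server returns a unique job identifier
qsub myjob.pbs            # e.g., prints 1234.server.example.com

# Query the job's status (state, queue, resource usage) from pbs_server
qstat 1234.server.example.com

# Delete the job if it is no longer needed
qdel 1234.server.example.com
```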
Job Lifecycle and Workflow
The job lifecycle in the Portable Batch System (PBS) encompasses a structured sequence of stages managed by its core daemons, including the server, scheduler, and execution daemons, ensuring efficient processing of batch workloads in high-performance computing environments. Upon submission, a job is sent to the PBS server daemon (pbs_server), which parses the input script, validates directives, and assigns the job to an appropriate queue based on specified attributes such as resource requirements and dependencies.[13] This initial stage establishes the job's metadata, including its unique identifier, and places it in a pending state within the queue system, where it awaits further processing.[13]
Once queued, the job enters the scheduling phase, handled by the PBS scheduler daemon (pbs_sched), which evaluates queue priorities, resource availability, and site-specific policies to determine the execution order.[13] If resources are available, the scheduler dispatches the job to the designated execution hosts via the machine-oriented mini-server daemons (pbs_mom) on those nodes.[13] The pbs_mom daemons then prepare the environment, such as creating staging directories for private sandboxes if configured, and initiate job execution under the submitting user's account, managing process spawning across allocated nodes or virtual nodes.[13] During execution, the job runs until completion, interruption, or resource exhaustion, with output streams (standard output and error) captured for later retrieval.[13]
Monitoring occurs continuously throughout the lifecycle, allowing administrators and users to track job status, resource utilization, and progress via server queries, with historical data retained for completed jobs based on configuration settings.[13] Upon termination, the pbs_mom daemons execute a job epilogue to handle cleanup, including staging out files, removing temporary directories, and releasing allocated resources back to the pool.[13] The PBS server then updates the job's final state, archiving logs and notifying stakeholders if mail events are enabled, completing the workflow.[13]
Error handling mechanisms are integrated at each stage to maintain system stability. During submission or queuing, jobs may be rejected if resource requests exceed queue limits or if dependencies cannot be resolved, preventing invalid entries from proceeding.[13] In the scheduling and dispatch phases, jobs can be placed on hold due to insufficient resources, security issues, or administrative intervention, allowing for resolution before resumption.[13] Execution errors, such as staging failures, trigger retries with escalating delays (e.g., 1-second, 11-second, and 21-second intervals for stage-out attempts) or requeuing, while node failures may lead to partial completion or abortion with resource reclamation.[13] These processes ensure minimal disruption and provide diagnostic feedback through job attributes and notifications.[13]
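The lifecycle can be observed from the command line; a brief sketch, assuming a job ID of 1234 and a server with job history enabled:

```bash
qsub myjob.pbs                  # submission: the server queues the job (state Q)
qstat 1234                      # during scheduling/execution the state moves Q -> R
qstat -f 1234 | grep comment    # the scheduler's comment attribute explains holds
                                # or why a job is still waiting
qstat -xf 1234                  # after termination: -x includes finished jobs,
                                # provided job_history_enable is set on the server
```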
Features
Scheduling and Resource Management
The Portable Batch System (PBS) employs a suite of scheduling algorithms to allocate computational resources efficiently across high-performance computing clusters, ensuring optimal job throughput and reduced idle time. The core scheduler, pbs_sched, implements first-in-first-out (FIFO) scheduling as a baseline mechanism, processing jobs in the order of submission while respecting queue priorities and resource availability.[14] This approach provides predictable execution for simple workloads but can lead to fragmentation if not augmented by advanced policies. To enhance fairness, PBS integrates fair-share scheduling, which adjusts job priorities based on historical resource usage by users, groups, or projects, favoring entities that have underutilized their allocated shares over time.[15] Fair-share calculations use a decaying usage metric—typically CPU time—updated cyclically and weighted by predefined shares in a configuration file, promoting equitable distribution without strict quotas.[14] Additionally, backfill scheduling addresses inefficiencies in FIFO by permitting lower-priority jobs to execute in idle slots ahead of higher-priority ones, provided they do not delay the latter, thereby minimizing overall wait times and improving cluster utilization.[14]
Resource management in PBS centers on tracking and allocating node attributes such as CPU cores, memory, and GPUs to match job requirements precisely. The system maintains a vnode (virtual node) model to represent compute resources, querying availability for attributes like ncpus (number of CPU cores), mem (memory in bytes or GB), and custom resources for accelerators like ngpus (number of GPUs).[14] Users specify limits during job submission via the -l directive in the qsub command or #PBS pragmas in scripts; for instance, -l nodes=2:ppn=8 requests two nodes with eight processors per node, while modern equivalents use -l select=2:ncpus=8:mem=16gb:ngpus=1 to define resource chunks more flexibly.[13] The scheduler enforces these by subtracting allocated resources from total availability, supporting dynamic tracking of external factors like licenses and applying placement policies (e.g., scatter or pack) to optimize distribution across nodes.[14] This granular control prevents oversubscription and enables efficient handling of heterogeneous hardware, with defaults ensuring minimum viable allocations if unspecified.
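For illustration, the legacy and select syntaxes described above might be used as follows (script names are hypothetical):

```bash
# Legacy syntax: two nodes, eight processors per node
qsub -l nodes=2:ppn=8 myjob.pbs

# Modern select syntax: two chunks, each with 8 cores, 16 GB memory, and 1 GPU
qsub -l select=2:ncpus=8:mem=16gb:ngpus=1 -l walltime=02:00:00 myjob.pbs

# Placement policy: spread the chunks across distinct nodes
qsub -l select=2:ncpus=8 -l place=scatter myjob.pbs
```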
PBS further refines resource allocation through configurable policies that enforce limits and sequencing. User and group quotas are implemented via attributes like max_run and max_queued, capping concurrent or pending jobs per entity to prevent monopolization, often integrated with fair-share for soft enforcement that triggers preemption if exceeded.[14] Job dependencies, set with the -W depend option (e.g., -W depend=afterok:12345), ensure a job waits for a predecessor to complete successfully before starting, facilitating complex workflows without manual intervention.[13] Priority adjustments allow fine-tuning via the -p flag (range: -1024 to +1023) or tools like qorder to reorder queues, combined with formula-based sorting that incorporates fair-share metrics and eligible wait time to expedite critical tasks.[13] These policies collectively balance equity and performance, adapting to site-specific needs while minimizing disruptions through mechanisms like checkpointing during preemption.[14]
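A sketch of chaining and prioritizing jobs with these options (script names are hypothetical; qsub prints the new job's identifier, which can be captured in a shell variable):

```bash
# Submit a preprocessing job and capture its identifier
JOBID=$(qsub preprocess.pbs)

# Run the main job only if the predecessor exits successfully
qsub -W depend=afterok:$JOBID main.pbs

# Raise a job's priority within the allowed range (-1024 to +1023)
qsub -p 512 urgent.pbs
```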
Advanced Capabilities
The Portable Batch System (PBS) supports interactive jobs, enabling users to execute pseudo-interactive sessions for tasks such as debugging or testing without submitting a traditional batch script. These jobs are initiated using the qsub -I command, which allocates resources and connects the user's terminal directly to the execution environment, mimicking a login session while adhering to PBS resource limits. This feature is particularly valuable in high-performance computing environments for real-time interaction, supporting graphical user interfaces on Linux systems via the -X option or Windows via -G, though it does not support job arrays or reruns. Interactive jobs in containerized environments, such as Docker or Singularity, are limited to single-vnode or multi-host configurations and require explicit port specifications for networked applications.[4]
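For example, an interactive session might be requested as follows (resource amounts are illustrative):

```bash
# Request an interactive session with 4 cores and 8 GB of memory for one hour;
# the terminal attaches to the allocated node once the job starts
qsub -I -l select=1:ncpus=4:mem=8gb -l walltime=01:00:00

# With X11 forwarding for graphical tools (Linux)
qsub -I -X -l select=1:ncpus=1
```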
Job arrays represent a key advanced capability in PBS, allowing the submission of numerous similar tasks through a single script to facilitate parameter sweeps or high-throughput computations. Submitted via qsub -J <start>-<end>[:step][%<max>], such as qsub -J 1-10000%500, which generates 10,000 indexed subjobs while capping concurrent execution at 500, arrays expose each subjob's index through the PBS_ARRAY_INDEX environment variable, enabling efficient parameterization without multiple submissions. Subjobs progress through states like queued, running, or held, with dependencies enforceable between arrays and non-array jobs but not among subjobs themselves; file staging and monitoring via qstat with array indices further streamline management. This mechanism optimizes resource utilization for repetitive workloads, such as simulations varying input parameters.[4]
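A minimal array script sketch, assuming input files named params_<index>.dat exist:

```bash
#!/bin/bash
#PBS -N sweep
#PBS -J 1-100                     # 100 subjobs, indices 1 through 100
#PBS -l select=1:ncpus=1
#PBS -l walltime=00:30:00
cd $PBS_O_WORKDIR
# Each subjob reads its own index to select its input parameters
./simulate --input params_${PBS_ARRAY_INDEX}.dat
```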
Reservations in PBS provide advanced resource booking for time-sensitive or guaranteed allocations, extending beyond standard scheduling by reserving nodes for specific durations or recurring patterns. Created with pbs_rsub, reservations include advance types (e.g., pbs_rsub -R 1130 -D 00:30:00 for a future slot), standing reservations (e.g., pbs_rsub -r "FREQ=WEEKLY;COUNT=10" for periodic access), and job-specific variants triggered as soon as possible or immediately. These can be modified with pbs_ralter, queried via pbs_rstat, or deleted using pbs_rdel, supporting exclusive placement (-l place=excl) and chunk-level resource allocation, though shrink-to-fit is unavailable. Administrators typically manage reservations to ensure predictable access for critical workloads.[4]
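A sketch of the reservation workflow (the reservation ID and its queue name depend on what pbs_rsub returns; R123 is illustrative):

```bash
# Advance reservation: 4 cores for 30 minutes starting at 11:30 today
pbs_rsub -R 1130 -D 00:30:00 -l select=1:ncpus=4
# Returns an ID such as R123.server; PBS creates a matching queue

# Inspect and, when finished, delete the reservation
pbs_rstat
pbs_rdel R123.server

# Jobs target the reservation by submitting to its queue
qsub -q R123 myjob.pbs
```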
Hooks enhance PBS's extensibility through customizable Python scripts invoked at lifecycle events, enabling site-specific plugins for validation, optimization, and resource management without altering core code. Types include pre-execution hooks (e.g., queuejob for post-submission validation before queuing), execution hooks (e.g., execjob_prologue before job startup on hosts), periodic hooks for server tasks, and reservation-specific hooks like resvsub to approve or reject bookings based on criteria such as user privileges or resource availability. For instance, a queuejob hook can enforce mandatory attributes like walltime or adjust priorities, while resvsub modifies reservation durations or resources during creation, facilitating time-based booking and custom logic for heterogeneous clusters. These hooks run with restricted access—pre-execution on the server and execution on hosts—promoting secure, event-driven customization.[16][17]
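Hook bodies are written in Python, but they are installed and wired to events through qmgr; a minimal sketch, assuming the hook logic is already saved as check_walltime.py:

```bash
# Create a hook, load its Python body, and attach it to the queuejob event
qmgr -c "create hook check_walltime"
qmgr -c "import hook check_walltime application/x-python default check_walltime.py"
qmgr -c "set hook check_walltime event = queuejob"
qmgr -c "set hook check_walltime enabled = true"
```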
PBS integrates seamlessly with Message Passing Interface (MPI) environments to support parallel workloads, leveraging tools like mpiexec or pbs_mpirun for process launching across allocated nodes. Compatible with implementations such as Open MPI, Intel MPI, MPICH, and MVAPICH, integration relies on PBS-generated nodefiles (e.g., at $PBS_NODEFILE) listing host allocations, with resources specified via ncpus and mpiprocs to map chunks per process. Administrators configure MPI support for full tracking of ranks and accounting, ensuring processes are confined to PBS-allocated vnodes; multi-host container jobs further extend this for encapsulated parallel executions. This capability is essential for distributed applications in high-performance computing, where PBS handles launch and termination natively without external SSH dependencies.[4]
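A sketch of an MPI job script under these conventions; the hostfile flag shown is Open MPI's (other implementations differ), and PBS-integrated MPI builds may need neither -n nor a hostfile:

```bash
#!/bin/bash
#PBS -N mpi_job
#PBS -l select=4:ncpus=8:mpiprocs=8   # 4 chunks, 8 MPI ranks each = 32 ranks
#PBS -l walltime=04:00:00
cd $PBS_O_WORKDIR
# $PBS_NODEFILE lists one host entry per MPI rank
mpiexec -n 32 --hostfile $PBS_NODEFILE ./my_mpi_app
```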
Usage
Job Submission and Scripting
Job submission in the Portable Batch System (PBS) typically involves creating a script that combines PBS directives with executable commands to define and execute computational tasks. A PBS job script begins with a shebang line specifying the shell interpreter, such as #!/bin/[bash](/page/Bash) or #!/bin/[tcsh](/page/Tcsh), followed by PBS directives on subsequent lines starting with #PBS. These directives set job attributes like name and resource requests; for instance, #PBS -N jobname assigns a user-defined name to the job, while #PBS -l walltime=01:00:00 specifies a maximum runtime limit of one hour.[2][18] The directives are scanned by the qsub command until the first non-directive executable line, after which the script contains the actual commands to run, such as program invocations or shell operations.[19][2]
To submit a job, users invoke the qsub command with the script as an argument, such as qsub myscript.pbs, which queues the job on the PBS server and returns a unique job identifier like 123.server.domain. Command-line options to qsub can override or supplement script directives; for example, qsub -N alternate_name myscript.pbs changes the job name without modifying the file. Standard output and error streams are directed to files by default, named from the job name and sequence number (e.g., jobname.o123 for output and jobname.e123 for error), but can be customized using directives like #PBS -o /path/to/output.txt or #PBS -e /path/to/error.txt, or via flags such as qsub -o custom.out -e custom.err myscript.pbs. Merging output and error into a single file is possible with #PBS -j oe.[20][2][19]
During execution, PBS sets environment variables to provide job context to the script. PBS_O_HOME holds the submitting user's home directory from the host where qsub was run, ensuring consistent access to user files. Similarly, PBS_NODEFILE points to a temporary file listing the nodes allocated to the job, with one node per line, allowing scripts to iterate over resources for parallel execution, such as in MPI applications. These variables, along with others like PBS_JOBID and PBS_O_WORKDIR, are automatically exported unless the job specifies otherwise via directives.[2][19][18]
```bash
#!/bin/bash
#PBS -N example_job
#PBS -l walltime=01:00:00
#PBS -o job_output.txt
#PBS -e job_error.txt
echo "Job started on $(hostname)"
# Example command: run a program
./myprogram arg1 arg2
# Iterate over the allocated nodes
while read node; do
  echo "Node: $node"
done < $PBS_NODEFILE
echo "Home directory: $PBS_O_HOME"
```
This example illustrates a basic PBS script structure, where directives precede commands, and environment variables are utilized within the executable section.[18][2]
Monitoring and Administrative Commands
The Portable Batch System (PBS) provides a suite of command-line tools for users and administrators to monitor job progress and manage system resources post-submission. These commands enable tracking of job states, resource allocation, and system configuration without altering the underlying scheduling policies.[21]
Users primarily rely on the qstat command to query job and queue status from the batch server. Invoked as qstat [options] [job_id], it displays summaries including job identifiers, owners, states, and resource usage such as CPU time consumed. The -f option produces a full report with detailed attributes like execution hosts, queue names, and limits, aiding in diagnosing delays or resource contention. Job states are denoted by single letters: Q for queued (waiting for resources), H for held (paused due to holds or errors), and R for running (actively executing). Resource usage reports in the output highlight metrics like walltime utilized versus requested, providing insight into efficiency without exhaustive logs.[22][21]
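Typical invocations (the user name and job ID are illustrative):

```bash
# Summary of all jobs, and a view filtered to one user's jobs
qstat
qstat -u alice

# Full attribute listing for one job: execution host, resources, comments
qstat -f 1234
```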
For job management, users employ qdel to terminate jobs, issued as qdel job_id, which sends a delete request to the server and processes identifiers sequentially until completion or error. To pause execution, qhold places a hold on a job via qhold [-h hold_type] job_id, where hold types include user (u), system (s), or operator (o) levels, rendering the job ineligible for scheduling. Conversely, qrls releases holds with qrls [-h hold_type] job_id, restoring eligibility; by default, it targets user holds if unspecified. These operations are restricted to job owners or authorized users, ensuring controlled intervention.[23][24][25]
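For example:

```bash
# Place a user-level hold on a queued job, then release it
qhold -h u 1234
qrls -h u 1234

# Delete jobs; identifiers are processed in order
qdel 1234 1235
```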
Administrators use qmgr for configuring queues and the server, executed interactively or via scripts as qmgr [-c "directive"]. Directives like create queue name, set queue name attribute = value (e.g., adjusting max running jobs), or list queue allow querying and modifying parameters such as priorities or enabled states, requiring manager privileges for alterations. Complementing this, pbsnodes reports and alters node status with pbsnodes [options] [node_name], listing attributes like availability (free, busy, down) and resources (CPUs, memory) for all nodes via -a. Options such as -o mark nodes offline to prevent allocations, while -c clears such states, facilitating maintenance without disrupting active jobs. Node outputs include state-unknown flags for unreachable hosts, enabling proactive troubleshooting.[26][27]
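An illustrative administrative session (queue and node names are hypothetical; attribute names such as max_running can vary across PBS versions):

```bash
# Inspect and adjust a queue (requires manager privilege)
qmgr -c "list queue workq"
qmgr -c "set queue workq max_running = 64"

# List all nodes with state and resources; take one offline for maintenance
pbsnodes -a
pbsnodes -o node042
pbsnodes -c node042    # clear the offline state afterwards
```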
History
Origins and Early Development
The Portable Batch System (PBS) originated as a joint project between the Numerical Aerospace Simulation (NAS) Systems Division at NASA Ames Research Center and the National Energy Research Supercomputer Center (NERSC) at Lawrence Livermore National Laboratory (LLNL), initiated in 1991 to address workload management challenges in high-performance computing environments.[2] This collaboration focused on developing a robust batch queuing system capable of handling diverse aerospace simulations at NASA and computational tasks at NERSC, where heterogeneous Unix-based supercomputers required efficient resource allocation for compute-intensive tasks.[2] The effort was driven by the limitations of existing systems like the Network Queuing System (NQS), which lacked sufficient flexibility and portability across varying hardware architectures prevalent in early 1990s supercomputing facilities.[28]
A primary motivation for PBS was to create a standards-compliant batch system that adhered to the emerging POSIX 1003.2d Batch Environment Standard, ensuring interoperability and portability among heterogeneous Unix systems without vendor-specific dependencies.[3] This standard, approved in 1994, defined interfaces for job submission, queuing, and execution in distributed environments, and PBS was engineered from the outset to conform to its requirements, including support for job dependencies, resource reservations, and multi-node execution.[3] By prioritizing POSIX compliance, the developers aimed to facilitate seamless job management across sites like NASA Ames' Cray systems and NERSC's computational clusters, reducing administrative overhead and enabling scalable scientific workloads.[2]
Early development milestones included the project's formal start on June 17, 1991, under a NASA contract with MRJ Technology Solutions as the primary developer.[28] The initial alpha release, version 1.0, occurred in June 1994, focused on core batch queuing functionalities, with an emphasis on testing conformance to POSIX 1003.2d drafts.[2] Beta testing followed at sites including the University of New Hampshire and Purdue University, refining features like job routing and resource monitoring to support the demanding, multi-user environments of aerospace and scientific research.[2]
Evolution of Versions
The Portable Batch System (PBS) underwent significant development in its early years under NASA's Numerical Aerospace Simulation (NAS) facility, with the project initiating as a contract in 1991 to replace the aging Network Queuing System (NQS). The first alpha test release, version 1.0, occurred in June 1994, followed by version 1.1 on March 15, 1995, marking the initial deployment for testing on NAS supercomputers. This version focused on basic job queuing and resource allocation for parallel and distributed systems, establishing PBS as a flexible workload manager compliant with POSIX P1003.2 batch services standards.[2]
The 2.x series, spanning the late 1990s into the early 2000s, introduced key enhancements to support evolving high-performance computing needs. Version 2.0 was released on October 14, 1998, coinciding with the transition to open source distribution by Veridian (formerly MRJ Technology Solutions), which made the software freely available to the broader community after NASA distributed it to approximately 70 U.S. sites between 1995 and 1998. Subsequent iterations, such as version 2.1 in May 1999 and version 2.2 by late 1999, added features like job dependencies, allowing child jobs to wait on parent job completion for workflow orchestration.[2][29][2]
Further advancements in the 2.x series included scheduler enhancements for better resource allocation and fair-share policies, as well as initial support for Windows execution hosts to enable heterogeneous environments. Standardization efforts culminated in version 2.3 around 2000, refining interoperability and POSIX compliance for multi-vendor clusters. These updates solidified PBS's role in the Department of Defense High Performance Computing Modernization Program, where it became the standard batch system by 1998.[30][31][32]
Development of PBS began under contract to MRJ Technology Solutions (later Veridian) in 1991, with commercialization authorized by NASA in 1998. In 2001, Veridian asserted copyright over the software, leading to the release of the PBS Professional Edition. This shift to commercial entities was completed when Altair Engineering acquired development rights in 2003, sustaining evolution amid growing adoption.[33][34]
Implementations
Open Source Variants
The original open source variant of the Portable Batch System, known as OpenPBS, was released in 1998 by MRJ Technology Solutions, the R&D contractor that had developed the original PBS for NASA. This release emphasized core POSIX compliance for batch job processing and provided foundational support for high-performance computing (HPC) environments, including job queuing, resource allocation, and basic monitoring on Unix-like systems. Development of this original OpenPBS continued through the early 2000s, with the last major version, 2.3, released in September 2000 (patch 2.3.16 in 2002), focusing on stability and interoperability rather than extensive new features. By the mid-2000s, active maintenance had largely ceased, leaving this OpenPBS as a stable but unmaintained codebase for smaller-scale HPC deployments.[35]
In 2016, Altair Engineering released an open-source version of its commercial PBS Professional, rebranded as OpenPBS in 2020, to foster community-driven development and unite the HPC ecosystem. This modern OpenPBS shares the core architecture and commands of PBS but includes enhancements for scalability, modern OS support (e.g., Ubuntu 22.04, RHEL 8/9), and integration with contemporary HPC tools. Actively maintained by the OpenPBS community, it has seen regular releases, such as v20.0.1 in 2020 and v23.06.06 in June 2023, with ongoing updates emphasizing security fixes, plugin extensibility, and support for large-scale clusters up to tens of thousands of nodes. As of November 2025, it remains a key open-source option for production HPC workloads, distinct from the 1998 version.[1][36]
In 2003, Cluster Resources Inc. (now part of Adaptive Computing) forked the 1998 OpenPBS to create TORQUE (Terascale Open-source Resource and Queue manager), addressing limitations in scalability and integration for growing cluster sizes. TORQUE retained the core PBS commands and architecture while introducing enhancements such as improved node failure detection, better handling of large job arrays, and support for resource managers in clusters exceeding thousands of nodes. Later versions, starting with 6.0 in 2015, integrated Linux control groups (cgroups) for finer-grained resource enforcement, including CPU and memory limits per job, enhancing isolation and accounting in multi-tenant environments. TORQUE also deepened integration with the Maui scheduler (and its successor Moab), enabling advanced policy-based scheduling like fairshare and reservations, which were not native to the original OpenPBS.
Key differences between the variants lie in their scope and evolution: The 1998 OpenPBS prioritized POSIX standards and basic HPC functionality for modest clusters, whereas TORQUE extended this for terascale systems with features like dynamic node power management and extensible plugins for custom resources, making it more suitable for enterprise-level deployments. TORQUE's active maintenance through the 2010s and beyond, under Adaptive Computing, has resulted in versions up to 7.0.1 (released in 2023), incorporating modern OS support and security fixes.[37]
Commercial Derivatives
PBS Professional, often referred to as PBS Pro, is the primary commercial derivative of the Portable Batch System, originally developed by Veridian Information Solutions in the 1990s as a proprietary workload management solution for high-performance computing environments.[30] Initially tailored for NASA's need to replace the Network Queuing System (NQS), it evolved into a robust enterprise-grade scheduler with advanced resource allocation capabilities, distinguishing it from open-source variants like TORQUE through exclusive support and feature extensions.[30]
In 2003, Veridian's PBS Products business unit was acquired by Altair Engineering, Inc., which established it as a dedicated division to further innovate on the technology.[38] Under Altair's stewardship, PBS Professional has incorporated proprietary modules for hybrid cloud-on-premises environments, enabling seamless workload bursting to cloud resources via integration with Altair HPCWorks, and advanced analytics tools for resource utilization and cost reporting.[6] These enhancements support complex, multi-site deployments, with features like policy-driven scheduling and plugin frameworks allowing customization for diverse hardware ecosystems.[6]
Versions from 2025 onward, such as PBS Professional 2025.2.1 (as of February 2025), emphasize exascale scalability—tested across over 50,000 nodes—and incorporate AI-driven scheduling through Altair's Liquid Scheduling, which optimizes mixed AI and traditional HPC workloads by dynamically adjusting priorities and resources in real-time.[6][39] Multi-cluster management capabilities further enable federated operations across geographically distributed sites, reducing silos and improving overall efficiency.[6] Security features, including EAL3+ certification and SELinux integration, ensure compliance in sensitive environments.[6]
PBS Professional is widely deployed for production workloads on major supercomputers, powering NASA's Pleiades cluster at the Ames Research Center for aerospace simulations and managing resources at the Department of Energy's Argonne National Laboratory to support exascale-era scientific research.[5][40] Its adoption in these facilities underscores its reliability for handling million-core jobs and ensuring high utilization in mission-critical applications.[41]
Licensing and Distribution
Open Source Licensing
The open source variants of the Portable Batch System, particularly OpenPBS and TORQUE, operate under licenses designed to promote community access, modification, and redistribution while imposing specific obligations on users and developers. The modern OpenPBS project, maintained under the Linux Foundation's umbrella, releases its software under the GNU Affero General Public License version 3.0 (AGPLv3). This copyleft license permits free use, study, modification, and distribution of the software in both source and binary forms, including for commercial purposes, provided that any modifications or derivative works are also released under AGPLv3 and the source code is made available to users who interact with the software over a network.[42]
TORQUE, developed as a community fork of the original OpenPBS codebase, utilizes a custom open source license for versions 2.5 and later, known as the TORQUE v2.5+ Software License v1.1. This license allows modification and redistribution in source and binary forms, supports commercial use, and requires that source code for derivatives be included in distributions or made available at no more than the cost of distribution plus a nominal fee, along with retention of copyright notices and attribution in advertising materials.[43] Although subsequent developments by Adaptive Computing have incorporated proprietary elements in newer releases, the core TORQUE codebase remains accessible under these open source terms via public repositories.[44]
These licensing models facilitate broad deployment in academic, research, and non-profit settings by eliminating royalty requirements and enabling cost-free access, which has supported extensive use in high-performance computing environments for workload management without financial barriers to entry.[45]
Commercial Licensing Models
Commercial implementations of the Portable Batch System, particularly PBS Professional from Altair Engineering, operate under proprietary licensing models designed for enterprise environments. These models emphasize subscription-based structures, where licenses are typically acquired on an annual lease basis, ensuring ongoing access to software enhancements and technical support.[46][47]
Licensing for PBS Professional is primarily calculated on a per-socket or per-node basis, with each PBSProSockets license covering one physical CPU or GPU socket regardless of core count, while PBSProNodes licenses apply to entire physical nodes supporting up to four devices such as accelerators. This approach allows scalability for high-performance computing clusters, where costs are tied to hardware resources rather than usage metrics, though on-demand options for cloud bursting are available via time-based tokens like PBSWorksBurstNodeHours. Subscriptions include comprehensive enterprise support from Altair's global HPC experts, regular updates to the software, and exclusive access to proprietary features such as advanced analytics for workload optimization, Liquid Scheduling for dynamic resource allocation, and integrations with security standards like EAL3+ certification.[48][6][49]
As of PBS Professional version 2025.1, Altair introduced the Altair Unit Licensing Scheme, which replaces previous per-socket and per-node models with a flexible system based on Altair Units managed via the Altair License Management System (ALM) version 2025.0.0 or newer. This per-core licensing approach (including GPUs) uses the formula: Units needed = Ceil((Cores + (GPUs * 64)) / 32). For example, a system with 5000 cores and 25 GPUs requires 207 units. Administrators can determine required units using the pbs_topologyinfo -au or -auv command, with configuration via parameters like pbs_license_file_location and tools such as pbs_license_info. This scheme enhances scalability for modern multi-core and GPU-accelerated environments.[50]
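The formula is simple enough to verify in a shell; a hypothetical helper, not part of PBS itself:

```bash
# Hypothetical helper implementing the documented formula:
# Units = Ceil((Cores + GPUs * 64) / 32)
units_needed() {
  local cores=$1 gpus=$2
  echo $(( (cores + gpus * 64 + 31) / 32 ))   # integer ceiling division
}

units_needed 5000 25    # prints 207, matching the worked example above
```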
The evolution of these models traces back to Veridian Corporation, which originally commercialized PBS with perpetual licenses that required separate annual maintenance fees for support and updates following the initial purchase. In 2003, Altair Engineering acquired the PBS technology and intellectual property from Veridian, transitioning toward more flexible structures that incorporated subscription elements to better align with evolving HPC needs. By 2007, with the release of PBS Professional 9.0, Altair introduced an on-demand licensing variant priced at $13.50 per concurrent license (North American pricing as of 2007), marking a shift from purely perpetual models to annual subscriptions that bundle support and facilitate easier scaling for multi-core systems.[51][52][49][53]
Distribution under commercial licenses is restricted to binary executables only, prohibiting access to source code to protect proprietary enhancements and maintain competitive advantages. Certain advanced modules or documentation may require non-disclosure agreements (NDAs) for access, ensuring confidentiality of implementation details beyond core functionality. These restrictions differentiate commercial PBS Professional from open-source variants, focusing on reliability and vendor-managed evolution for mission-critical deployments.[48][54][55]