
Distributed operating system

A distributed operating system (DOS) is a software layer that coordinates the operation of multiple independent, networked computers, presenting them to users and applications as a single, unified system while managing resources, communication, and computation across the nodes without shared physical memory. Unlike traditional centralized operating systems that run on a single machine, or network operating systems that provide only loose connectivity between separate machines, a DOS achieves a high degree of transparency, hiding details such as resource location, access methods, migration, concurrency, replication, and failures, to create the illusion of a single uniprocessor. This is facilitated by mechanisms such as remote procedure calls (RPC) and remote method invocation (RMI), which enable seamless inter-machine interaction over unreliable networks.

Key goals of a DOS include enhancing resource sharing (e.g., CPU, memory, and storage across nodes), improving scalability through incremental addition of processors, boosting reliability via redundancy and fault tolerance, and optimizing performance via load balancing and parallel execution. Challenges arise, however, from network-induced issues such as communication overhead (remote operations can be significantly slower than local ones), the absence of a global clock or global state, and the need to handle failures in components like disks, links, or software without disrupting the overall system.

Historically, DOS concepts emerged in the 1970s and 1980s amid advances in networking and workstation hardware, with influential research projects including the Distributed Computing System (emphasizing object-based distribution), Amoeba (a microkernel-based system for transparent distributed computing), LOCUS (focused on distributed file systems), and the V System (pioneering RPC). These efforts, often from universities such as UCLA, Stanford, and the Vrije Universiteit, laid the groundwork for modern distributed systems seen in cloud computing and large-scale internet services.

Design considerations for a DOS encompass resource-management techniques such as sender- or receiver-initiated load balancing, distributed locking for mutual exclusion, client caching for efficiency (e.g., in distributed file systems such as AFS), and replication strategies to ensure availability, all while addressing the inherent complexities of decentralized control.

Introduction

Definition and Characteristics

A distributed operating system (DOS) is software designed to manage a collection of independent computers, making them appear to users as a single coherent system. It handles resource sharing, communication, and coordination in a transparent manner, abstracting the underlying distribution of hardware and software components. This contrasts with a single-node operating system, where resource management is confined to local hardware without the need for inter-machine coordination or network latency considerations.

Key characteristics of a DOS include various forms of transparency that mask the complexities of distribution from users and applications. Location transparency hides the physical location of resources, allowing users to access files or processes without knowing which machine hosts them. Access transparency ensures uniform interaction with resources regardless of their type or location, as if they were local. Migration transparency permits the movement of processes or resources between machines without user intervention or awareness. Replication transparency conceals the existence of multiple copies of resources for improved availability or load balancing. Failure transparency enables the system to recover from hardware or software faults automatically, maintaining service continuity. Concurrency transparency manages simultaneous access to shared resources without conflicts visible to users. Performance and scaling transparency allow the system to adapt to varying loads and sizes while preserving consistent behavior. These transparencies collectively create the illusion of a single system, where users perceive a unified environment despite the underlying multiplicity of nodes. Additionally, a DOS emphasizes hardware independence, enabling software to run across heterogeneous processors without modification, and fault tolerance through redundancy, such as data replication and failover mechanisms, to enhance reliability in the face of component failures.

Classic examples of DOS include Amoeba, developed by Andrew Tanenbaum's group at the Vrije Universiteit, which uses a microkernel architecture to support object-based resource management across workstations and servers. Sprite, from UC Berkeley, focuses on high-performance file caching and process migration within workstation clusters to achieve a single system image. Plan 9, from Bell Labs, extends Unix principles into a distributed environment with a global file system and networked resource naming for seamless multi-machine operation.
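
These transparencies can be made concrete with a small sketch. The following Python fragment is a hypothetical illustration, not drawn from any of the systems named above: a global name service maps location-independent names to replicas, so a client resolving a name never learns, or chooses, the node that serves it.

```python
# Hypothetical sketch: a global namespace that hides resource location.
# Node names, paths, and the replica-selection policy are illustrative only.
import random

class GlobalNameService:
    """Maps location-independent names to one or more (node, local_path) replicas."""

    def __init__(self):
        self._registry: dict[str, list[tuple[str, str]]] = {}

    def register(self, name: str, node: str, local_path: str) -> None:
        self._registry.setdefault(name, []).append((node, local_path))

    def resolve(self, name: str) -> tuple[str, str]:
        # Replication transparency: the caller never picks a replica explicitly.
        return random.choice(self._registry[name])

def read_file(ns: GlobalNameService, name: str) -> str:
    node, path = ns.resolve(name)           # location transparency
    return f"(fetched {path} from {node})"  # stand-in for an RPC to that node

ns = GlobalNameService()
ns.register("/global/docs/report.txt", "node-a", "/disk1/report.txt")
ns.register("/global/docs/report.txt", "node-b", "/mirror/report.txt")
print(read_file(ns, "/global/docs/report.txt"))
```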

Distinction from Networked and Clustered Systems

Distributed operating systems (DOS) are distinguished from networked operating systems (NOS) primarily by the degree of transparency and integration they provide to users. In a NOS, resource sharing is facilitated through protocols such as the Network File System (NFS) for remote file access or remote procedure calls (RPC) for inter-machine communication, but users must explicitly manage these interactions and remain aware of the underlying distribution across separate machines. This approach treats the network as an extension for ad hoc sharing rather than as a unified whole, with each machine running its own independent OS instance. In contrast, a DOS abstracts distribution to create the illusion of a single coherent system, enabling seamless resource access without user intervention in location-specific details.

Clustered systems, such as Beowulf clusters, represent another category of loosely coupled computing environments that differ from DOS in their hardware-centric design and limited OS-level abstraction. These systems aggregate commodity-grade computers into a single computational resource, often using a standard OS such as Linux enhanced with parallel programming libraries such as the Message Passing Interface (MPI) to coordinate tasks across nodes. While clusters achieve high-performance parallel execution through aggregated hardware resources, they do not provide the full OS integration of a DOS; instead, distribution is managed at the application or middleware level, requiring developers to handle node coordination explicitly.

The core distinctions lie in the scope of system integration and the resulting trade-offs. A DOS offers OS-level mechanisms like unified naming spaces and process migration to maintain the single-system abstraction, which incurs higher overhead but enhances usability in transparent environments. Networked and clustered systems, by relying on application-level tools, provide greater flexibility for heterogeneous setups at the cost of reduced transparency and an increased user or programmer burden. These differences position DOS within a broader spectrum of computing paradigms, evolving from centralized operating systems on isolated machines, which offer no distribution, to DOS with their unified illusion, and further to loosely coupled grids in which resources span wide-area networks with minimal central coordination.
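
To make the contrast concrete, the fragment below uses mpi4py (a Python binding for MPI, chosen here purely for illustration) to show the explicit, application-level coordination typical of cluster computing: the programmer must reason about ranks and combine partial results manually, whereas a DOS would aim to hide such placement and gathering behind its single-system abstraction.

```python
# Illustrative only: explicit, application-level coordination on a cluster
# using mpi4py. Run with e.g. `mpiexec -n 4 python sum_ranks.py`.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()     # the programmer must know which process this is
size = comm.Get_size()

local_value = rank * rank  # some per-node partial result

# The programmer explicitly reduces the results; nothing is transparent.
total = comm.reduce(local_value, op=MPI.SUM, root=0)

if rank == 0:
    print(f"sum of squares of ranks 0..{size - 1} = {total}")
```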

Historical Evolution

Pioneering Systems (1950s-1970s)

Early developments in computer systems from the 1950s through the 1970s, driven by Cold War imperatives for fault-tolerant and scalable computing, laid foundational concepts for distributed computing, though true distributed operating systems emerged later with networking advances. Military needs demanded reliable systems capable of withstanding disruptions, leading to designs that emphasized modularity and redundancy amid constraints like the high failure rates of vacuum tube technology. Batch processing was prevalent, focusing on modular hardware-software integration that foreshadowed distributed approaches.

One of the earliest examples was the DYSEAC, completed in 1954 by the National Bureau of Standards for the U.S. Army Signal Corps as a transportable, truck-mounted computer for field computations. Weighing approximately 18 tons and using over 3,000 vacuum tubes, DYSEAC featured a design that allowed supervisory control to be distributed across system components for task execution, enabling flexible reconfiguration for military field operations. This approach addressed tube unreliability through distributed logic functions within a single installation, influencing later fault-tolerant designs in computing systems.

The Lincoln TX-2, operational from 1958 at MIT's Lincoln Laboratory, advanced modularity through interrupt-driven mechanisms that supported concurrent operations across multiple instruction sequences. This transistor-based system, with 65,536 words of core memory, employed a "multiple-sequence" program technique in which independent sequences could run concurrently and interleave, facilitating early concepts in concurrency and input/output overlap. Designed under military sponsorship for defense research, TX-2's architecture emphasized modularity by isolating I/O and computation, reducing single-point failures in batch-oriented setups.

In the early 1960s, theoretical concepts like intercommunicating cells emerged as foundations for cellular architectures in computing, notably proposed by C. Y. Lee in a 1962 paper. These cells, each containing simple logic and memory, communicated locally to form networks for solving complex problems, such as associative search and retrieval, without central coordination. Burroughs explored similar ideas in the D825 multiprocessor, introduced around 1962, which used up to four CPUs interconnected via crossbar switches for symmetrical MIMD processing, aiming at scalable, fault-tolerant computation for military command-and-control demands. Such designs prioritized redundancy and modularity over centralized processing, laying groundwork for later distributed architectures despite hardware limitations.

By the 1970s, the advent of networking marked a pivotal shift. The ARPANET, operational from 1969, enabled the first wide-area experiments, such as early resource-sharing programs in 1971-1972, which demonstrated resource sharing and communication across networked nodes, key precursors to distributed operating systems.

Key Foundational Developments (1980s-1990s)

During the 1980s and 1990s, the rise of affordable workstations and local area networks spurred innovations in software abstractions that masked the distributed nature of resources, enabling more seamless distributed operating system (DOS) functionality. Researchers shifted focus from hardware-centric approaches to higher-level mechanisms for resource sharing, consistency, and fault tolerance, laying groundwork for scalable DOS environments.

A pivotal advancement was the development of distributed shared memory (DSM) systems, which provided programmers with the illusion of a single, coherent address space across networked machines. The Ivy system, implemented in 1987 by Kai Li, pioneered software-based DSM by layering a shared virtual memory on top of the Apollo Domain network OS. Ivy used page-level granularity for memory sharing, employing broadcast-based invalidation protocols to maintain coherence; upon a page fault, the requesting processor fetched the page from its owner, ensuring strict consistency while supporting parallel applications on clusters of up to 64 processors. Building on this, researchers at Rice University introduced lazy release consistency in 1992, a relaxed memory model that deferred update propagation until synchronization points (acquires), using vector timestamps and write notices to track causal dependencies. Compared with the eager release consistency of the earlier Munin system, this approach reduced communication overhead by up to 50% in benchmarks such as SPLASH, minimizing false sharing and enabling efficient multiple-writer protocols for shared variables.

File system abstractions also evolved to support location transparency and availability in distributed settings. The Andrew File System (AFS), developed at Carnegie Mellon University in the mid-1980s, offered a unified namespace across thousands of workstations through Vice file servers and Venus client caches. AFS employed whole-file caching on local disks, with a callback mechanism to invalidate caches only on modifications, achieving high performance by reducing server polls; read-only volume replication at higher levels of the namespace further enhanced scalability and fault tolerance. Extending these ideas, the Coda file system, also from Carnegie Mellon and released in 1991, incorporated server replication and disconnected operation. Coda used volume-level replication with server-to-server synchronization protocols, combined with client-side hoarding and reintegration for caching, ensuring availability even during network partitions while resolving conflicting updates through resolution logs.

Transaction abstractions emerged to guarantee atomicity in distributed operations, addressing failures in multi-site updates. The two-phase commit (2PC) protocol, formalized in the late 1970s and widely adopted in DOS research, coordinates participants via a prepare phase (logging intents) followed by a commit or abort phase, ensuring all-or-nothing outcomes despite crashes (a simplified coordinator is sketched at the end of this section). This was exemplified in the Argus system from MIT, developed in the 1980s by Barbara Liskov's group, in which atomic actions served as distributed transactions spanning multiple guardians (protected processes). Argus implemented 2PC for top-level actions, using version numbers on atomic objects for recovery and nested actions for modularity, providing serializability and failure atomicity in applications like distributed databases.

Middleware for group communication and coordination further abstracted persistence and reliability. The ISIS toolkit, created at Cornell University in 1985 by Kenneth Birman and colleagues, introduced virtual synchrony semantics for process groups, enabling reliable multicast with ordering guarantees and membership notifications. ISIS facilitated fault-tolerant replication by treating groups as resilient objects, where upcalls handled view changes during failures, supporting applications like bulletin boards with automatic recovery.

Reliability mechanisms emphasized recovery from node failures without full system halts. In the Amoeba DOS, whose development was led by Andrew Tanenbaum at the Vrije Universiteit from the early 1980s, reliability relied on object replication in the directory service and at-most-once RPC semantics, with user-level servers handling recovery via stable storage logs. Amoeba's design isolated faults, allowing crashed processes to restart from checkpoints in user space, while planned multicasting in version 5.0 (early 1990s) aimed to coordinate replicated state across processors.

Key experimental systems integrated these abstractions for dynamic resource management. The Sprite OS from UC Berkeley, developed in the late 1980s, supported transparent process migration to balance load across workstations, using a global naming scheme and checkpointing of execution state to transfer running processes mid-execution with minimal overhead (under 1 second for large processes). Sprite's migration facility, combined with a network-wide shared file system, demonstrated practical DOS operation on Sun workstations, influencing later cluster computing.
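
The following is a minimal, single-process sketch of the two-phase commit idea described above, not the Argus implementation; real coordinators add durable logging, timeouts, and recovery after crashes.

```python
# Minimal sketch of two-phase commit (2PC), for illustration only.
from enum import Enum

class Vote(Enum):
    YES = "yes"
    NO = "no"

class Participant:
    def __init__(self, name, will_commit=True):
        self.name = name
        self.will_commit = will_commit
        self.state = "init"

    def prepare(self) -> Vote:
        # Phase 1: the participant records its intent and votes.
        self.state = "prepared" if self.will_commit else "aborted"
        return Vote.YES if self.will_commit else Vote.NO

    def commit(self): self.state = "committed"
    def abort(self):  self.state = "aborted"

def two_phase_commit(participants):
    # Phase 1: the coordinator collects votes.
    votes = [p.prepare() for p in participants]
    decision = "commit" if all(v is Vote.YES for v in votes) else "abort"
    # Phase 2: the coordinator broadcasts the decision.
    for p in participants:
        p.commit() if decision == "commit" else p.abort()
    return decision

nodes = [Participant("guardian-a"), Participant("guardian-b", will_commit=False)]
print(two_phase_commit(nodes), [(p.name, p.state) for p in nodes])
```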

Core Components and Mechanisms

Kernel and Process Management

In distributed operating systems (DOS), the kernel serves as the core coordinator for processes spanning multiple nodes, emphasizing modularity and fault isolation to handle the inherent distribution of resources. Unlike traditional monolithic kernels that bundle services into a single address space for efficiency, microkernel designs in DOS minimize kernel functionality to essentials like inter-process communication (IPC) and basic scheduling, delegating higher-level services to user-space servers. This approach enhances reliability in distributed environments by containing failures to individual nodes and facilitating portability across heterogeneous hardware. For instance, the Amoeba DOS employs a microkernel that runs identically on all machines, managing only process creation, threads, and RPC-based communication while offloading file and directory services to dedicated servers. L4-based microkernels further exemplify this design in modern DOS, prioritizing minimalism with capabilities for secure resource delegation and high-performance IPC to support distributed execution on multicore and networked systems. These kernels, such as seL4, achieve sub-microsecond IPC latencies, enabling efficient message exchanges across nodes without shared memory assumptions. In contrast to shared memory models, which rely on uniform address spaces and are challenging to implement over networks due to consistency overheads, message-passing kernels like L4 and Amoeba use explicit IPC primitives (e.g., synchronous RPC) to provide reliable, ordered communication in distributed settings.

Process management in DOS extends beyond local creation to location-transparent operations, allowing processes to spawn on optimal nodes without user awareness of physical locations. This is achieved through global naming services and kernel-level abstractions that hide node boundaries, as seen in Amoeba's thread-based process model, where process creation via RPC enables remote instantiation with shared code and data across machines. Load balancing complements this by dynamically redistributing processes to prevent hotspots; sender-initiated diffusion algorithms, for example, let overloaded nodes probe neighbors and transfer tasks based on local load thresholds (a simplified version is sketched below), improving overall throughput in unstructured networks.

Process migration mechanisms in DOS facilitate moving executing processes between nodes for load balancing or fault resilience, involving the transfer of process state including memory, open files, and communication links. Seminal implementations, such as in the Sprite operating system, support transparent migration at any time by checkpointing the process image, capturing CPU registers, memory pages, and open files, and resuming on the destination node, with global file naming ensuring resource continuity. State transfer protocols typically employ a two-phase approach: pre-migration demand paging to move only active pages and post-migration link updating to redirect communication endpoints, minimizing downtime to seconds in local-area environments.

Scheduling in DOS balances local autonomy with global coordination: local schedulers handle per-node priorities using traditional algorithms such as priority-based round-robin, while global schedulers aggregate load information to assign or migrate tasks across the system for balanced utilization. Global approaches, often using heuristic or graph-based models, can outperform isolated local scheduling in simulated heterogeneous clusters under varying loads, though they incur communication overheads. In distributed systems, priority inheritance protocols extend this by propagating priorities across nodes during remote blocking; for example, when a high-priority task is blocked on a remote low-priority one, the latter inherits the higher priority via timestamped tokens, bounding inversion delays to network latencies and ensuring schedulability in embedded DOS like those based on L4.
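
The sketch below illustrates the sender-initiated load-balancing idea referenced above in simplified form; the thresholds, probe limit, and random neighbor selection are assumptions chosen for demonstration rather than parameters of any particular system.

```python
# Hedged sketch of sender-initiated load balancing: an overloaded node probes
# a few neighbors and transfers tasks to the least-loaded one.
import random

THRESHOLD = 5      # queue length above which a node tries to shed load
PROBE_LIMIT = 3    # how many neighbors an overloaded node asks

class Node:
    def __init__(self, name):
        self.name = name
        self.queue = []          # pending tasks

    def load(self):
        return len(self.queue)

    def maybe_shed_load(self, neighbours):
        if self.load() <= THRESHOLD:
            return
        probed = random.sample(neighbours, min(PROBE_LIMIT, len(neighbours)))
        target = min(probed, key=Node.load)
        while self.load() > THRESHOLD and target.load() < THRESHOLD:
            task = self.queue.pop()          # "migrate" one task
            target.queue.append(task)
            print(f"{self.name} -> {target.name}: migrated {task}")

nodes = [Node(f"n{i}") for i in range(4)]
nodes[0].queue = [f"task{i}" for i in range(10)]   # n0 starts overloaded
nodes[0].maybe_shed_load(nodes[1:])
print({n.name: n.load() for n in nodes})
```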

Inter-Process Communication and Synchronization

In distributed operating systems, inter-process communication (IPC) enables coordination among processes executing on separate nodes without shared memory, relying instead on network-based exchanges to maintain correctness and efficiency. The two dominant IPC paradigms are remote procedure calls (RPC) and message passing, each addressing different needs for abstraction and performance in heterogeneous environments. These mechanisms must handle challenges like network latency, partial failures, and message loss to ensure reliable interaction.

Remote procedure calls provide a familiar programming abstraction by allowing a client to invoke a procedure on a remote server as if it were local, with the operating system handling parameter marshalling, transmission, and result return. The foundational design by Birrell and Nelson emphasized exception-based error handling and at-most-once semantics, where if a result is returned, the procedure was invoked exactly once; otherwise, it may have been invoked zero or one time, balancing reliability with the impossibility of exactly-once guarantees in asynchronous networks. Sun's ONC RPC, an early commercial implementation, extended this model using UDP for low-overhead transport and relied on idempotent operations to approximate at-most-once semantics, reducing duplicate effects through client timeouts and server-side duplicate suppression. These semantics ensure that RPC failures manifest as exceptions rather than silent errors, facilitating robust distributed applications like file servers.

Message passing, in contrast, offers explicit control over communication, suitable for systems requiring fine-grained coordination or multicast. Message-passing DOS kernels typically support both synchronous (blocking until reply) and asynchronous (non-blocking send) modes for point-to-point exchanges, with messages encapsulated in fixed-size headers carrying operation codes and parameters, plus data buffers (up to 32 KB in some designs). Such designs hide low-level details via stub routines that marshal arguments, similar to RPC, but allow direct buffering for high-throughput scenarios like remote file I/O, where a read operation sends an offset and length in the header and receives data in the buffer. This approach scales to wide-area networks by minimizing overhead, though it requires explicit error checking for lost messages via sequence numbers and acknowledgments.

Synchronization primitives in distributed operating systems extend local constructs like mutexes and semaphores to prevent race conditions across nodes, often using message exchanges for coordination. A distributed mutex enforces mutual exclusion for a critical section by requiring processes to request permission via a central coordinator or a peer-to-peer protocol, with the Ricart-Agrawala algorithm achieving message optimality through timestamped requests broadcast to all N processes, granting access after 2(N-1) messages per entry in the worst case and ensuring fairness via total ordering. Distributed semaphores generalize this to counting mechanisms, allowing up to a specified number of concurrent accesses; they operate via wait and signal operations propagated through messages, maintaining a global count at a coordinating process or via decentralized agreement to handle failures without centralized bottlenecks.

To enforce causal ordering, ensuring events appear in a sequence respecting their dependencies, distributed systems employ vector clocks, an extension of Lamport's logical clocks that captures partial orders in asynchronous environments (a minimal implementation is sketched at the end of this section). Each process maintains a vector of size N (one entry per process), initialized to zeros; on a local event, the process increments its own entry, and on sending a message, it attaches a copy of its vector, while the receiver updates its vector by taking the component-wise maximum and then incrementing its own entry. Two events are causally related if their vectors V_a and V_b satisfy V_a < V_b (every component of V_a is less than or equal to the corresponding component of V_b, with at least one strict inequality); otherwise, they are concurrent, enabling detection of consistent global states without a shared clock. This mechanism, formalized by Mattern, underpins causal multicast protocols by delaying messages that would violate ordering, reducing inconsistencies in group settings.

Group communication enhances IPC for collaborative processes by providing reliable, ordered delivery to multiple recipients, critical for fault-tolerant applications like replicated databases. Atomic broadcast guarantees that a message is delivered either to all members of a group or to none, with a consistent delivery order across members, often implemented atop reliable broadcast layers. Virtual synchrony, a higher-level abstraction, presents consistent group views by delivering messages in a stable order and notifying members of membership changes (e.g., failures) consistently, as in Birman's ISIS toolkit, where group multicasts use sequence numbers and checkpoints to ensure consistency without full replication overhead. In practice, this model supports up to hundreds of processes with low latency, as atomic delivery completes in a few message delays under normal conditions, making it suitable for coordination in distributed simulations.

Deadlock detection in distributed systems identifies circular waits for resources across nodes, using graph-based models adapted for decentralization. The wait-for graph represents processes as nodes, with a directed edge from a waiting process to the process holding the resource it awaits, aggregated globally via message probes to avoid constructing the full graph at one site. The Chandy-Misra-Haas algorithm employs edge-chasing: a blocked process initiates a probe message containing its identifier, which travels along wait-for edges to successors; if a probe returns to the initiator, a cycle exists, confirming deadlock with O(E) messages, where E is the number of edges, and no false positives under the AND model of resource requests. Resource allocation graphs extend this by including resource nodes and request/assignment edges, enabling distributed detection through periodic merging of local subgraphs or probe diffusion, though they incur higher communication costs in large systems due to the need for consistent snapshots. These techniques prioritize detection over prevention, allowing systems like Amoeba to resolve deadlocks by preempting resources from lower-priority processes.
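
The vector-clock mechanism described above can be sketched in a few lines; the three-process scenario and method names below are illustrative only.

```python
# Minimal vector-clock sketch (after Mattern/Fidge).
class VectorClock:
    def __init__(self, pid: int, n: int):
        self.pid = pid
        self.clock = [0] * n

    def local_event(self):
        self.clock[self.pid] += 1

    def send(self):
        self.local_event()
        return list(self.clock)          # timestamp attached to the message

    def receive(self, msg_clock):
        # Component-wise maximum, then increment the receiver's own entry.
        self.clock = [max(a, b) for a, b in zip(self.clock, msg_clock)]
        self.clock[self.pid] += 1

def happened_before(va, vb):
    return all(a <= b for a, b in zip(va, vb)) and va != vb

p0, p1 = VectorClock(0, 3), VectorClock(1, 3)
m = p0.send()                          # event a on process 0
p1.receive(m)                          # event b on process 1
print(happened_before(m, p1.clock))    # True: a causally precedes b
```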

Architectural Models

Centralized vs. Decentralized Structures

In distributed operating systems, centralized structures rely on a master-slave model in which a single coordinator node, often referred to as the master, makes global decisions and delegates tasks to subordinate slave nodes. This approach simplifies system management by concentrating control flow in one entity, enabling straightforward scheduling and resource allocation across nodes. For instance, the master assigns computational tasks to slaves, which execute them and report results, reducing the complexity of distributed coordination. However, this model introduces a single point of failure; if the master node crashes, the entire system may halt until recovery or failover occurs, limiting availability.

Decentralized structures, in contrast, distribute control without a central authority, allowing nodes to collaborate through fully distributed mechanisms for decision-making and state propagation. Gossip protocols exemplify this by enabling nodes to periodically exchange information with randomly selected peers, mimicking epidemic spreading to achieve eventual consistency across the system. Consensus algorithms like Lamport's Paxos further support decentralization by ensuring agreement on a single value among nodes despite failures, using phases in which proposers, acceptors, and learners coordinate via majority quorums to tolerate up to floor((n-1)/2) faulty processes in a system of n nodes. These methods enhance scalability by supporting growth in node count without bottlenecks, as each node handles local decisions while propagating updates asynchronously.

Hybrid approaches combine elements of both paradigms to balance simplicity and resilience, often employing hierarchical decentralization in which higher-level coordinators oversee lower-level distributed operations. Google's Borg system illustrates this through a replicated central master that manages cluster-wide decisions, such as task scheduling and resource allocation, while delegating execution to distributed agents on individual machines. Fault tolerance in such systems is bolstered by leader election mechanisms, like the Raft algorithm, which uses randomized timeouts to select a leader from a cluster and replicates logs to maintain consistency, allowing the system to continue operating with a majority of nodes intact. This structure scales to tens of thousands of nodes by sharding responsibilities and using asynchronous scheduling, mitigating the risks of pure centralization. Control messages in these architectures often leverage inter-process communication primitives to coordinate between centralized coordinators and decentralized nodes.

Overall, centralized models prioritize ease of implementation for smaller scales, while decentralized and hybrid variants offer superior fault tolerance and scalability for large, unreliable environments.
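
A push-style gossip round can be sketched as follows; the round count, fan-out of one peer per round, and versioned key-value state are illustrative assumptions rather than a specific published protocol.

```python
# Illustrative push-gossip sketch: each round, every node shares its known
# state with one random peer; information spreads epidemically.
import random

class GossipNode:
    def __init__(self, name):
        self.name = name
        self.state = {}                  # key -> (value, version)

    def update(self, key, value, version):
        current = self.state.get(key)
        if current is None or version > current[1]:
            self.state[key] = (value, version)

    def gossip_with(self, peer):
        for key, (value, version) in self.state.items():
            peer.update(key, value, version)

nodes = [GossipNode(f"n{i}") for i in range(8)]
nodes[0].update("leader", "n0", version=1)   # one node learns a new fact

for _ in range(5):                            # a few gossip rounds
    for node in nodes:
        node.gossip_with(random.choice([p for p in nodes if p is not node]))

print(sum(1 for n in nodes if "leader" in n.state), "of", len(nodes), "nodes know the fact")
```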

Client-Server and Peer-to-Peer Paradigms

In distributed operating systems, the client-server paradigm organizes interactions through asymmetric roles, where clients request services or resources from specialized servers that manage and provide them. This model centralizes resource management on servers, enabling efficient handling of shared services like file storage, while clients focus on user-specific tasks. For instance, in the Sprite operating system, the file system employs a client-server design in which servers maintain stateful information about client caches to ensure consistency, contrasting with stateless implementations like the standard Network File System (NFS). Scalability in this paradigm is often achieved through server replication, distributing load across multiple identical servers to handle increased client demands without compromising performance.

The peer-to-peer (P2P) paradigm, in contrast, relies on symmetric roles among nodes, where each participant functions as both a client and a server, contributing resources like storage and bandwidth to the collective system. This symmetry fosters decentralized cooperation, with overlay networks facilitating efficient resource discovery and routing; for example, the Chord distributed hash table (DHT) organizes nodes in a ring structure to route queries in a logarithmic number of hops, enabling scalable lookup of data across dynamic networks. In applications such as file-sharing distributed systems, extensions inspired by systems like Freenet demonstrate P2P's utility by distributing encrypted data fragments across nodes for anonymous storage and retrieval, enhancing privacy and availability without central points of control.

Modern distributed operating systems have evolved from predominantly client-server models toward hybrid or peer-to-peer approaches to leverage greater decentralization, particularly in resource-constrained or highly dynamic environments. This transition is supported by discovery protocols like JXTA, which enable peers to advertise and locate services through XML-based messaging, allowing seamless formation of peer groups for collaborative tasks. A key trade-off between these paradigms involves latency and robustness: client-server systems typically offer lower latency for direct requests due to centralized servers, but they risk single points of failure that reduce overall system availability. Conversely, P2P systems provide enhanced robustness through resource distribution and fault-tolerant routing, though they may incur higher latency from multi-hop paths in overlay networks.
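
The ring-based lookup that underlies Chord can be sketched as follows; for brevity this version searches the ring linearly instead of using Chord's O(log N) finger tables, and the 6-bit identifier space is an arbitrary choice for demonstration.

```python
# Hedged sketch of Chord-style lookup on a small identifier ring.
import hashlib

M = 6                      # identifier space: 2**6 = 64 slots
RING = 2 ** M

def chord_id(key: str) -> int:
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % RING

def successor(node_ids, ident):
    """Return the first node clockwise from `ident` on the ring."""
    candidates = sorted(node_ids)
    for nid in candidates:
        if nid >= ident:
            return nid
    return candidates[0]    # wrap around the ring

nodes = sorted(chord_id(f"node-{i}") for i in range(5))
for key in ["report.txt", "movie.mp4", "index.db"]:
    ident = chord_id(key)
    print(f"key {key!r} (id {ident}) -> node id {successor(nodes, ident)}")
```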

Design Principles and Considerations

Transparency and Resource Management

Transparency in distributed operating systems (DOS) refers to the mechanisms that hide the complexities of distribution from users and applications, allowing interaction with resources as if they were part of a single, centralized system. This is achieved through various forms of transparency, enabling seamless access to distributed components without requiring knowledge of their physical locations, failure states, or access protocols. Key types include location transparency, which provides uniform naming via global namespaces to decouple resource references from their physical hosts; access transparency, which ensures consistent application programming interfaces (APIs) across distributed resources regardless of their implementation; and failure transparency, which masks component crashes using proxies or redirection to maintain service continuity.

Resource management in DOS extends traditional techniques to handle allocation across multiple nodes while preventing issues like deadlocks. For distributed allocation, hierarchical extensions to classical allocation and deadlock-handling algorithms impose an ordering on system nodes and apply modified checks at each level, avoiding centralized bottlenecks by allowing local decisions with global coordination. Naming services further support transparency by organizing resources into hierarchical structures; for instance, the Distributed Computing Environment (DCE) employs cell-based naming, where resources within an administrative unit (a "cell") are referenced via location-independent paths, resolved through directory services that link local and global namespaces.

Migration and replication enhance resource management by enabling transparent relocation of processes or files to optimize load or availability, while maintaining data consistency. Process migration allows a running process to transfer between nodes without user intervention, as demonstrated in early systems like the Galaxy OS, where sender-initiated transfers handle state capture and restart to preserve execution continuity. Replication involves duplicating resources across nodes, with consistency models defining update propagation: sequential consistency ensures that all operations appear in a single order visible to all nodes, akin to a uniprocessor execution, whereas eventual consistency permits temporary divergences that resolve over time once updates propagate, balancing availability with reduced synchronization overhead.

A notable tool for achieving these transparencies is the Globe system's location service, which uses a scalable, worldwide distributed search tree to map object handles to contact addresses for distributed objects. This service supports location, migration, and replication transparency by allowing objects to expand or shrink their contact points dynamically, such as adding replicas or relocating to new hosts, without altering client references, thus hiding distribution details through a two-level naming scheme.
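
A two-level naming scheme of the kind described above can be sketched as follows; the class names, handles, and addresses are hypothetical and only illustrate how migration leaves client-visible names unchanged.

```python
# Hedged sketch of two-level naming: human-readable names resolve to stable
# object handles, and a separate location service maps handles to addresses.
class NameService:
    def __init__(self):
        self.names = {}          # "/docs/report" -> handle (never changes)

    def bind(self, name, handle):
        self.names[name] = handle

    def lookup(self, name):
        return self.names[name]

class LocationService:
    def __init__(self):
        self.contacts = {}       # handle -> set of current contact addresses

    def register(self, handle, address):
        self.contacts.setdefault(handle, set()).add(address)

    def move(self, handle, old, new):
        # Migration transparency: the handle (and hence the name) is unchanged.
        self.contacts[handle].discard(old)
        self.contacts[handle].add(new)

    def resolve(self, handle):
        return next(iter(self.contacts[handle]))

names, locs = NameService(), LocationService()
names.bind("/docs/report", handle="obj-42")
locs.register("obj-42", "node-a:9000")
locs.move("obj-42", "node-a:9000", "node-b:9000")    # object migrates
print(locs.resolve(names.lookup("/docs/report")))     # client code unchanged
```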

Reliability, Availability, and Fault Tolerance

In distributed operating systems (DOS), reliability refers to the probability that the system performs its intended functions without failure over a specified period, often modeled to handle arbitrary faults including malicious ones. A foundational approach to reliability is Byzantine fault tolerance (BFT), which addresses scenarios where components may fail in arbitrary ways, such as sending conflicting messages, yet the system must still reach agreement. The seminal Byzantine Generals Problem, formulated by Lamport, Shostak, and Pease, demonstrates that with oral messages, agreement is achievable only if more than two-thirds of the processes are non-faulty, using an algorithm that iteratively exchanges values and computes majorities over multiple rounds. This model underpins reliability in DOS by ensuring correct operation despite up to f faulty nodes in a system of 3f+1 total nodes.

Redundancy via replication enhances reliability by maintaining multiple identical copies of data or processes across nodes, allowing the system to mask failures by voting among replicas or switching to healthy ones. In DOS, replication strategies such as state machine replication ensure that all replicas process the same sequence of operations to maintain consistency, thereby tolerating crashes or omissions without loss of service. This approach is critical in environments like cluster computing, where node failures are common, and has been widely adopted in systems like Google's Spanner for linearizable transactions.

Availability in DOS focuses on minimizing downtime, often targeting 99.99% or higher uptime, through techniques that enable seamless service continuation during failures. Failover clustering groups multiple nodes to monitor each other, automatically transferring workloads from a failed node to a healthy one within seconds, using shared storage or virtual network addresses to maintain service accessibility. For instance, in enterprise DOS implementations, failover ensures that if a primary node crashes, a secondary assumes its role without user intervention, reducing outage impact. Hot standby nodes complement this by keeping idle replicas fully synchronized and ready to activate immediately upon detection of a primary failure, as seen in monitoring systems where standby processes mirror active ones in real time.

Fault tolerance in DOS extends beyond failure detection to recovery, enabling the system to continue or resume operations after a fault. Checkpoint/restart mechanisms periodically save process states across distributed nodes, allowing restart from the last consistent checkpoint after a crash, which is vital for long-running computations in clusters. The Distributed MultiThreaded CheckPointing (DMTCP) tool exemplifies this by transparently checkpointing multithreaded and distributed applications without modification, achieving restart times under 2 seconds on 128 cores with minimal runtime overhead. Distributed debugging supports fault tolerance by enabling developers to trace and reproduce failures across nodes, using techniques like record-replay to capture non-deterministic events for post-mortem analysis, thus facilitating robust error diagnosis.

Key metrics for evaluating these aspects in DOS include mean time between failures (MTBF), which measures the average operational period before a failure occurs, and mean time to repair (MTTR), the average duration needed to restore functionality after a failure; in distributed contexts, these are aggregated across nodes, with high MTBF-to-MTTR ratios indicating effective fault management. For example, BFT systems aim for an MTBF exceeding the system lifetime by tolerating f faults without compromising correctness. N-version programming bolsters reliability through software diversity, executing multiple independently developed versions of the same module and selecting outputs via majority voting, reducing common-mode failures; experiments by Knight and Leveson showed that while independence assumptions hold imperfectly due to correlated faults, the theoretical model with N=4 versions and low individual failure probabilities can achieve error rates below 10^{-9}. These mechanisms collectively hide failures from users, aligning with the principle of failure transparency.
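
The majority-voting step of N-version programming, together with the availability relation implied by the MTBF and MTTR metrics above, can be illustrated with a short sketch; the three toy "versions" and the numeric values are invented for demonstration.

```python
# Hedged sketch of N-version majority voting: run independently developed
# implementations of the same function and accept the majority output.
from collections import Counter

def version_a(x): return x * x
def version_b(x): return x ** 2
def version_c(x): return x * x + (1 if x == 3 else 0)   # deliberately faulty

def n_version_vote(x, versions):
    outputs = Counter(v(x) for v in versions)
    result, count = outputs.most_common(1)[0]
    if count <= len(versions) // 2:
        raise RuntimeError("no majority; escalate to recovery")
    return result

print(n_version_vote(3, [version_a, version_b, version_c]))  # 9, outvoting the fault

# Availability from the metrics above: A = MTBF / (MTBF + MTTR).
mtbf_hours, mttr_hours = 10_000, 1
print(f"availability = {mtbf_hours / (mtbf_hours + mttr_hours):.5f}")
```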

Challenges and Trade-offs

Performance and Scalability Issues

Distributed operating systems face significant performance challenges primarily due to the inherent overhead of communication, which introduces latency that can dominate execution times in distributed environments. Latency overhead arises from delays in data transmission between nodes, often exacerbated by variations in network conditions, and has been reported to cause up to 3.5 times slower runtime in communication-intensive applications. This overhead is particularly pronounced in scenarios where even small variations in latency (jitter) can degrade overall system performance more than the mean latency itself.

Another key performance bottleneck is the communication-to-computation ratio (C2C), which measures the balance between data transfer costs and local processing effort, often expressed as the total communication traffic divided by the total computation. In distributed deep learning systems, a high C2C ratio, such as in models like BERT-Large with low model intensity (I = 248), limits scalability, resulting in only 1.2x speedup across 4 GPUs due to excessive synchronization overhead. Extensions of Amdahl's law to distributed settings account for this by incorporating communication as an additional overhead component, highlighting how even highly parallelizable workloads are constrained by network dependencies.

Scalability in distributed operating systems is hindered by state explosion in large clusters, where the combinatorial growth of possible execution states from concurrent operations makes exhaustive verification or testing infeasible, often requiring techniques like dynamic partial order reduction to prune redundant paths and achieve up to 920x speedup with 1,024 workers. To mitigate these limits, partitioning strategies such as hash-based sharding distribute data across nodes using a sharding key (e.g., hashed identifiers) to balance load and prevent hot spots, enabling horizontal scaling while maintaining consistent schemas across shards.

Performance in distributed operating systems is typically evaluated using metrics like throughput (e.g., operations or requests per second) and response time, which quantify the system's capacity to handle workloads under varying conditions. Benchmarks for distributed file systems measure these by simulating remote access and execution, revealing scaling trade-offs in which throughput may plateau due to network or server bottlenecks. For instance, in high-performance distributed environments, workload injectors demonstrate how response times increase nonlinearly with load, guiding optimizations for real-world deployments.

Mitigation strategies focus on optimizations like caching hierarchies and prefetching in distributed file systems to reduce latency impacts. Caching hierarchies, including client- and server-driven approaches with techniques like LRU replacement, can achieve up to 83% hit rates by exploiting temporal locality, thereby cutting server load and client latency in large-scale setups. Prefetching algorithms, such as those based on highly relevant frequent patterns, anticipate access needs to improve read performance by 29% to 77% through proactive loading, enhancing overall throughput without excessive overhead. These methods intersect with data-locality principles by prioritizing placement that minimizes communication costs.
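
One common way to express the communication-extended speedup models mentioned above is speedup(N) = 1 / (s + p/N + c(N)), where s is the serial fraction, p = 1 - s the parallel fraction, and c(N) a communication-overhead term; the linear cost model used in the sketch below is an assumption for illustration, since real systems substitute measured overheads.

```python
# Hedged sketch: Amdahl's law extended with a communication-overhead term.
# speedup(N) = 1 / (s + p/N + c(N)); c(N) = c0 * N is an illustrative model.
def extended_speedup(serial_fraction: float, n_nodes: int, c0: float) -> float:
    parallel_fraction = 1.0 - serial_fraction
    comm_overhead = c0 * n_nodes
    return 1.0 / (serial_fraction + parallel_fraction / n_nodes + comm_overhead)

for n in (1, 4, 16, 64, 256):
    no_comm = extended_speedup(0.05, n, c0=0.0)
    with_comm = extended_speedup(0.05, n, c0=0.001)
    print(f"N={n:3d}  ideal Amdahl={no_comm:6.2f}  with communication={with_comm:6.2f}")
```

The output shows speedup with communication overhead peaking and then declining as nodes are added, which is the scaling behavior the text describes.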

Security and Complexity Management

Distributed operating systems (DOS) face unique security vulnerabilities stemming from their reliance on multiple interconnected nodes, where trust must be established across potentially untrusted environments. Unlike centralized systems, DOS require mechanisms to verify the identity and integrity of nodes that may be geographically dispersed and operated by different entities. A primary challenge is ensuring authentication to prevent unauthorized access, as compromised nodes can propagate malicious actions throughout the network. For instance, authentication protocols like Kerberos address this by providing a ticket-based scheme in which a trusted key distribution center issues time-limited credentials, allowing secure verification without transmitting passwords over the network (a simplified ticket flow is sketched at the end of this section). This approach mitigates risks such as credential theft and replay attacks in distributed environments.

Secure communication channels are essential to protect data in transit across distributed nodes, where eavesdropping and tampering pose significant threats. Integration of protocols like IPsec enables encryption, authentication, and integrity checks at the IP layer, ensuring that communications between DOS components remain confidential even over public networks. IPsec's use of security associations and key management further supports dynamic policy enforcement tailored to DOS requirements, such as protecting resource-sharing operations. However, implementing these in DOS introduces overhead, as nodes must negotiate policies without a central authority, potentially exposing vulnerabilities if not properly managed.

Attacks on DOS often exploit the decentralized nature of the system, amplifying threats like the Byzantine generals problem, where faulty or malicious nodes send conflicting information, leading to inconsistent states across the network. This problem, formalized in seminal work, illustrates how even a small fraction of unreliable nodes can prevent agreement in critical coordinated operations. In peer-to-peer (P2P) DOS architectures, insider threats, where legitimate peers are compromised or act maliciously, pose particular risks, as they can inject false data or disrupt services without external intrusion. Reputation-based mechanisms, such as those using distributed directories, help detect and isolate such insiders by tracking behavior over time. These threats underscore the need for robust security mechanisms integrated into DOS kernels to maintain system integrity.

Managing the inherent complexity of DOS designs requires structured approaches to ensure reliability without overwhelming development and maintenance efforts. Modular design principles decompose the system into independent components, such as separate modules for communication, naming, and fault handling, allowing isolated testing and easier maintenance. This reduces interdependencies, facilitating updates to specific parts without affecting the entire system, though it demands clear interfaces to avoid hidden coupling. Verification tools like TLA+ further aid complexity management by enabling formal specification and model checking of distributed algorithms, catching errors in concurrency and fault scenarios before deployment. TLA+'s temporal-logic-based approach has been applied to verify protocols in real-world distributed systems, ensuring that properties like safety and liveness hold under diverse failure modes.

The trade-offs of this complexity are evident in challenges like debugging distributed deadlocks, where resource contention across nodes creates wait cycles that are harder to detect and resolve than in single-machine systems. Tools for distributed debugging, such as those employing global snapshots or trace analysis, help identify these issues but increase system overhead and require coordinated instrumentation across nodes. To simplify management, virtualization layers abstract underlying hardware heterogeneity, allowing workloads to run as virtual machines or containers that encapsulate complexity and enable easier deployment and migration. While virtualization introduces its own performance costs, it streamlines operation by providing a uniform execution environment, balancing the price of complexity with practical usability.
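
The ticket-based idea described at the start of this section can be illustrated with a deliberately simplified sketch; this is not the Kerberos protocol (which involves multiple exchanges and session keys), only a demonstration of time-limited credentials that a service can verify without handling passwords.

```python
# Hedged sketch of time-limited, ticket-style authentication (NOT Kerberos).
import hashlib, hmac, json, time

ISSUER_KEY = b"shared-secret-between-issuer-and-service"   # assumption for demo

def issue_ticket(client: str, service: str, lifetime_s: int = 300) -> dict:
    body = {"client": client, "service": service,
            "expires": time.time() + lifetime_s}
    mac = hmac.new(ISSUER_KEY, json.dumps(body, sort_keys=True).encode(),
                   hashlib.sha256).hexdigest()
    return {"body": body, "mac": mac}

def verify_ticket(ticket: dict, service: str) -> bool:
    expected = hmac.new(ISSUER_KEY,
                        json.dumps(ticket["body"], sort_keys=True).encode(),
                        hashlib.sha256).hexdigest()
    fresh = ticket["body"]["expires"] > time.time()       # reject expired tickets
    right_service = ticket["body"]["service"] == service  # reject replay elsewhere
    return hmac.compare_digest(expected, ticket["mac"]) and fresh and right_service

t = issue_ticket("alice", "file-server")
print(verify_ticket(t, "file-server"), verify_ticket(t, "print-server"))
```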

Modern Research and Applications

Advances in Heterogeneous and Multi-Core Environments

Recent advances in distributed operating systems (DOS) have increasingly addressed the challenges posed by multi-core architectures, particularly in non-uniform memory access (NUMA) and multi-socket configurations, where traditional shared-memory models falter due to scalability limits. Barrelfish, a multikernel OS developed by researchers at ETH Zurich and Microsoft Research, exemplifies this shift by treating the machine as a distributed network of nodes, with each core running an independent kernel that communicates via explicit messages rather than shared state (the structure is illustrated in the sketch below). This design enables efficient handling of NUMA topologies through a system knowledge base (SKB) that maintains topology information, facilitating topology-aware communication and multicast-based TLB shootdowns that scale to 32 cores on multi-socket systems. In scheduling, Barrelfish employs distributed CPU drivers and monitors with per-core dispatchers, incorporating locality awareness by pinning tasks to specific cores based on SKB-derived topology to minimize inter-core communication overheads. Measurements on a 2x2-core system showed Barrelfish's message-passing implementation outperforming Linux's shared-memory IP loopback by up to 18% in throughput (2154 Mbit/s vs. 1823 Mbit/s). Although the project became inactive in 2020, its principles have influenced subsequent multikernel and multi-core operating system research.

Heterogeneous environments combining CPUs, GPUs, and FPGAs introduce further complexities for DOS, requiring middleware to abstract device differences and enable seamless resource orchestration across nodes. Distributed extensions to OpenCL have emerged as key enablers, allowing kernels to execute across networked heterogeneous devices without low-level messaging interfaces like MPI. For instance, PoCL-Remote provides a client-server model where a daemon on remote nodes exposes OpenCL devices as local, supporting peer-to-peer data transfers and event synchronization to optimize bandwidth in CPU-GPU clusters. This approach virtualizes heterogeneous hardware, enabling a DOS to treat distributed GPUs or FPGAs as a unified compute fabric, with efficient buffer management reducing client-side overheads in multi-node setups. Similarly, SnuCL-D decentralizes OpenCL execution by redundantly running host programs on all nodes in CPU/GPU clusters, virtualizing remote devices to eliminate single-host bottlenecks and achieving up to 45x speedup over centralized frameworks on 512-node systems for compute-intensive benchmarks. These middleware layers integrate into DOS by providing device-neutral APIs, allowing adaptive task offloading to specialized accelerators while maintaining fault tolerance through redundant computation.

In the 2020s, research has focused on lightweight frameworks and adaptive mechanisms to better exploit heterogeneous multi-core resources. Unikraft, an open-source unikernel development kit, supports constructing specialized, minimal OS images for distributed cloud and edge deployments, reducing boot times to milliseconds and memory footprints by up to 90% compared to full kernels for single-purpose applications. Its modular architecture allows fine-grained integration of components such as networking and filesystem stacks, enabling efficient scaling across heterogeneous nodes in distributed systems without unnecessary bloat. Complementing this, adaptive resource partitioning techniques dynamically allocate shared resources such as cache and memory bandwidth in multi-core DOS to mitigate interference. For example, OS-anchored cache budgeting in multicores adjusts access to the last-level cache based on observed usage, prioritizing cache-efficient threads. Holistic partitioning frameworks further extend this to distributed settings, dividing resources among cores and nodes while monitoring interference to rebalance tasks, as demonstrated in real-time systems with improved schedulability and reduced worst-case execution time (WCET) interference. These efforts address post-2010 hardware diversity, prioritizing scalability and adaptability over legacy monolithic designs.
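
The multikernel structure referenced above can be sketched schematically; the version below uses ordinary OS processes and queues as stand-ins for per-core kernels and inter-core message channels, which is an illustrative simplification rather than how Barrelfish is implemented.

```python
# Hedged sketch of the multikernel idea: per-"core" kernels share no state and
# coordinate only through explicit messages.
from multiprocessing import Process, Queue

def per_core_kernel(core_id: int, inbox: Queue, outbox: Queue) -> None:
    # Each kernel keeps purely local state; nothing is shared across cores.
    local_capabilities = {f"cap-{core_id}-{i}" for i in range(3)}
    while True:
        msg = inbox.get()
        if msg["op"] == "shutdown":
            break
        if msg["op"] == "query":
            outbox.put({"core": core_id, "caps": sorted(local_capabilities)})

if __name__ == "__main__":
    replies: Queue = Queue()
    inboxes = [Queue() for _ in range(4)]
    kernels = [Process(target=per_core_kernel, args=(i, q, replies))
               for i, q in enumerate(inboxes)]
    for k in kernels:
        k.start()
    for q in inboxes:              # "broadcast" by sending one message per core
        q.put({"op": "query"})
    for _ in kernels:
        print(replies.get())
    for q in inboxes:
        q.put({"op": "shutdown"})
    for k in kernels:
        k.join()
```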

Integration with Cloud and Edge Computing

Distributed operating systems (DOS) integrate closely with cloud computing paradigms, particularly in Infrastructure-as-a-Service (IaaS) environments, where they manage pooled resources across geographically dispersed nodes to support scalable workloads. OpenStack, an open-source IaaS platform, exemplifies this by providing modular components for orchestrating compute (Nova), storage (Swift and Cinder), and networking (Neutron) in distributed setups, enabling private and public cloud deployments with fault-tolerant resource allocation. Extensions such as OpenStack++ build on this foundation to support hybrid cloudlet environments, dynamically provisioning virtual machines and services at the network edge while maintaining centralized control for efficiency. In serverless architectures, DOS principles underpin lightweight virtualization technologies like AWS Firecracker, which acts as a minimal, secure virtual machine monitor for function execution without a full traditional OS stack. Firecracker employs microVMs to isolate workloads, achieving startup times under 125 milliseconds and memory footprints as low as 5 MiB, thus enabling dense, distributed serverless execution on cloud infrastructure.

Edge computing extends DOS capabilities to resource-constrained environments, particularly for Internet of Things (IoT) applications, through platforms that decentralize processing to reduce dependency on remote clouds. Platforms like EdgeX Foundry provide a vendor-neutral, microservices-based architecture for edge gateways, facilitating device connectivity and local data orchestration in distributed networks. This latency-aware distribution routes tasks to nearby nodes, minimizing end-to-end delays compared to centralized models in real-time scenarios.

As of the mid-2020s, emerging trends in DOS research emphasize AI-driven enhancements, such as federated learning frameworks integrated into operating system layers for collaborative model training across distributed devices without centralizing sensitive data. One proposal is a horizontal operating system that unifies edge resources for agent-based AI workloads, improving responsiveness in 5G/6G networks while preserving privacy through on-device computation. Container orchestration tools like Kubernetes further embody this evolution, functioning as a DOS-like abstraction that schedules and scales containerized workloads across clusters, abstracting underlying hardware heterogeneity for cloud-native applications.

Sustainability considerations are increasingly central to DOS design in cloud and edge contexts, with energy-efficient distribution strategies optimizing workload placement to curb power usage amid growing data volumes. For instance, distributed edge computing models in dense IoT deployments have achieved 19% higher energy efficiency by intelligently offloading tasks from battery-limited devices to fog nodes, alongside 54% better resource utilization and 86% improved latency management.
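
Latency-aware task placement of the kind used at the edge can be sketched as a simple cost comparison; the sites, latency figures, and additive cost model below are invented for illustration.

```python
# Hedged sketch of latency-aware placement: pick the site with the lowest
# estimated completion time, whether an edge gateway or a cloud region.
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    rtt_ms: float          # round-trip network latency to the requesting device
    queue_ms: float        # estimated queueing delay from current load
    service_ms: float      # estimated execution time on this site's hardware

    def estimated_completion_ms(self) -> float:
        return self.rtt_ms + self.queue_ms + self.service_ms

def place_task(sites: list[Site]) -> Site:
    return min(sites, key=Site.estimated_completion_ms)

candidates = [
    Site("edge-gw-1", rtt_ms=5, queue_ms=40, service_ms=30),
    Site("edge-gw-2", rtt_ms=8, queue_ms=5, service_ms=30),
    Site("cloud-region", rtt_ms=60, queue_ms=1, service_ms=10),
]
print("placed on:", place_task(candidates).name)
```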

References

  1. [1]
    [PDF] Distributed Operating Systems - Computer Science
    A distributed operating system runs on multiple, independent CPUs, where each computer has its own OS, and users don't know which machine their programs are ...
  2. [2]
    [PDF] Distributed and Operating Systems Course Notes - LASS
    1.8.2 Distributed Operating System. Distributed operating systems are operating systems that manage resources in a distributed system. However, from a user ...
  3. [3]
    [PDF] Distributed Systems - cs.wisc.edu
    Distributed systems involve many machines cooperating, like web services. Communication is unreliable, and failure is a major challenge.
  4. [4]
    A brief introduction to distributed systems | Computing
    Aug 16, 2016 · A distributed system is a collection of autonomous computing elements that appears to its users as a single coherent system.
  5. [5]
    [PDF] The Amoeba Distributed Operating System - CERN Document Server
    Amoeba consists of two basic pieces: a microkernel, which runs on every processor, and a collection of servers that provide most of the traditional operating ...
  6. [6]
    Sprite - UC Berkeley EECS
    Sprite is a distributed operating system that provides a single system image to a cluster of workstations. It provides very high file system performance through ...
  7. [7]
    [PDF] Plan 9 from Bell Labs
    Plan 9 is a distributed computing environment. It is assembled from separate machines acting as CPU servers, file servers, and terminals.
  8. [8]
    Distributed operating systems | ACM Computing Surveys
    This paper is intended as an introduction to distributed operating systems, and especially to current university research about them.
  9. [9]
    Differences Between Distributed and Parallel Systems - ResearchGate
    This report characterizes the differences between distributed systems, networks of workstations, and massively parallel systems and analyzes the impact of these ...
  10. [10]
  11. [11]
    Cold War Computer Arms Race - Marine Corps University
    The theory critical to the success of networking was “distributed communication,” proposed by Paul Baran at Rand in the early 1960s.9 The human brain, Baran ...
  12. [12]
    [PDF] A New Era in Computation - American Academy of Arts and Sciences
    "batch processing" on computers: When the output appears as interminable pages of printout and numerical tables, it is difficult to uncover significant or ...
  13. [13]
    [PDF] computer development (SEAC and DTSEAC) at the National Bureau ...
    Such topics as systems development, engineering development, design, construction, and maintenance of computer equipment are covered. The introduction ...
  14. [14]
    [PDF] 3. DYSEAC - Ed Thelen
    As a consequence, the supervisory control over the common task may initially be loosely distributed throughout the system and then temporarily concentrated in ...
  15. [15]
    The Lincoln TX-2 computer development - ACM Digital Library
    THE TX-2 is the newest member of a growing fam- ily of experimental computers designed and con- structed at the Lincoln Laboratory of M.I.T. as.Missing: modularity | Show results with:modularity
  16. [16]
    [PDF] Memory Units in the Lincoln TX-2. - Bitsavers.org
    The Lincoln TX-2 incorporates several new developments in high-speed transistor circuits, large capacity magnetic-core-memories, and flexibility in machine ...Missing: 1958 | Show results with:1958
  17. [17]
    The History of the Development of Parallel Computing
    [17] Burroughs introduces the D825 symmetrical MIMD multiprocessor. 1 to 4 CPUs access 1 to 16 memory modules using a crossbar switch. The CPUs are similar to ...
  18. [18]
    [PDF] Experiences with the Amoeba Distributed Operating System
    The Amoeba distributed operating system has been in development and use for over eight years now. In this paper we describe the present system and our.
  19. [19]
    [PDF] A Shared Memory Architecture For Distributed Computing
    A prototype loosely-coupled shared memory system called Ivy has been implemented in software on top of the Apollo Domain system [Li86]. It allows a ...
  20. [20]
    [PDF] Lazy Release Consistency for Software Distributed Shared Memory
    distributed shared mem- ory system to use release consistency. Munin's inlplementa- tion of release consistency merges updates at release time, rather than.
  21. [21]
    [PDF] Scale and Performance in a Distributed File System - andrew.cmu.ed
    The Andrew File System is a location-transparent distributed tile system that will eventually span more than 5000 workstations at Carnegie Mellon University ...
  22. [22]
    [PDF] Coda: A Highly Available File System for a Distributed Workstation ...
    Coda integrates the use of two different mechanisms, whole-file caching and replication, while Locus relies solely on replication. Coda clients directly update ...
  23. [23]
    None
    **Summary of Argus Use of Transactions and Two-Phase Commit in Distributed OS**
  24. [24]
    [PDF] Reliable communication in the presence of failures
    A review of several uses for the protocols in the ISIS system, which supports fault-tolerant resilient objects and bulletin boards, illustrates the significant ...
  25. [25]
    [PDF] Process Migration in the Sprite Operating System - UC Berkeley EECS
    Feb 11, 1987 · This paper describes a process migration facility for the Sprite operating system. In order to provide location-transparent remote execution, ...
  26. [26]
    [PDF] The Amoeba Distributed Operating System
    Tanenbaum at the Vrije Universiteit (VU) in Amsterdam (The Netherlands) has been doing research since 1980 in the area of distributed computer systems. This.
  27. [27]
    [PDF] From L3 to seL4 What Have We Learnt in 20 Years of L4 ...
    The L4 microkernel has undergone 20 years of use and evolution. It has an active user and developer commu- nity, and there are commercial versions which are ...<|separator|>
  28. [28]
    [PDF] The Amoeba Distributed Operating System - A Status Report
    This mechanism for process creation is much more location transparent and efficient in a distributed system than the UNIX fork system call. Like all operating ...Missing: Plan | Show results with:Plan
  29. [29]
    [PDF] Strategies for dynamic load balancing on highly parallel computers
    Sender Initiated Diffusion (SID)¹ is a highly distributed local approach which makes use of near-neighbor load information to apportion surplus load from ...
  30. [30]
    Process migration | ACM Computing Surveys - ACM Digital Library
    Process migration is the act of transferring a process between two machines, enabling dynamic load distribution and fault resilience.
  31. [31]
    [PDF] Scheduling in Distributed Systems - Computer Science
    In a distributed system, the local scheduler may need global information from other workstations to achieve the optimal overall performance of the entire system ...
  32. [32]
    [PDF] Distributed Priority Inheritance for Real-Time and Embedded ...
    We study the problem of priority inversion in distributed real-time and embedded systems and propose a solution based on a dis- tributed version of the priority ...
  33. [33]
    [PDF] Implementing Remote Procedure Calls
    This paper describes a package providing a remote procedure call facility, the options that face the designer of such a package, and the decisions we made. We ...
  34. [34]
    [PDF] Comparing remote procedure calls
    This report describes and compares three significant RPCs: • Open Network Computing (ONC) RPC from Sun Microsystems [SUN90][MS91]. • Distributed Computing ...
  35. [35]
    An optimal algorithm for mutual exclusion in computer networks
    Ricart, G., and Agrawala, A.K. Using exact timing to implement mutual exclusion in a distributed network. Tech. Rept. TR-742, Dept. Comptr. Sci., Univ. of ...
  36. [36]
    Synchronization in Distributed Programs
    A distributed semaphore is a distributed synchronization mechanism that behaves in much the same way as a semaphore [5]. Two operations are defined on ...
  37. [37]
    [PDF] Virtual time and global states of distributed systems
    Abstract. A distributed system can be characterized by the fact that the global state is distributed and that a common time base does not exist.
  38. [38]
    [PDF] EXPLOITING VIRTUAL SYNCHRONY IN DISTRIBUTED SYSTEMS
    The premise of the ISIS project and this paper is that the existing software development methodology is inadequate to address this class of applications ...
  39. [39]
    [PDF] Distributed Deadlock Detection - CSE IIT KGP
    CHANDY, K.M., AND MISRA, J. A distributed algorithm for detecting resource deadlocks in distributed systems. In Proc. ACM SIGACT-SIGOPS Symp. Principles of ...
  40. [40]
    [PDF] Deadlock Detection in Distributed Systems
    Chandy-Misra-Haas's distributed deadlock detection algorithm for AND model is based on edge-chasing. The algorithm uses a special message called probe, which is ...
  41. [41]
    A Master-Slave Operating System Architecture for Multiprocessor ...
    This motivated us to reexamine the master-slave approach. In this paper, we attempt to address real-time and performance issues associated with the master-slave ...
  42. [42]
    [PDF] Operating Systems on a Master-Slave Mode
    ABSTRACT. Master-Slave mode has been known in operating systems for many years. It refers to one processor which is the master that controls other ...
  43. [43]
    Gossiping in distributed systems - ACM Digital Library
    In this paper, we present a brief introduction to the field of gossiping in distributed systems, by providing a simple framework and using that framework to ...
  44. [44]
    [PDF] Paxos Made Simple - Leslie Lamport
    The Paxos algorithm, when presented in plain English, is very simple.
  45. [45]
    [PDF] In Search of an Understandable Consensus Algorithm
    May 20, 2014 · Strong leader: Raft uses a stronger form of leadership than other consensus algorithms. For example, log entries only flow from the leader to ...
  46. [46]
    Client-server computing | Communications of the ACM
    Client-server computing is a distributed computing model in which client applications request services from server processes. Clients and servers typically run ...
  47. [47]
    [PDF] The Sprite Network Operating System - UCSD CSE
    Nov 19, 1987 · Sprite is a network OS for workstations, designed for memory sharing and process migration, with a transparent network file system and similar ...
  48. [48]
    [PDF] Spritely NFS: Experiments with Cache-Consistency Protocols
    Stateless-server systems, such as NFS, cannot properly manage client file caches. Stateful systems, such as Sprite, can use explicit cache consistency.
  49. [49]
    Peer-to-peer information systems - ACM Digital Library
    P2P systems offer an alternative to traditional client/server systems: Every node acts both as a client and a server and "pays" its participation by providing ...
  50. [50]
    [PDF] Chord: A Scalable Peer-to-peer Lookup Service for Internet
    This paper presents Chord, a distributed lookup protocol that addresses this problem. Chord provides support for just one operation: given a key, it maps ...
  51. [51]
    [PDF] A Distributed Anonymous Information Storage and Retrieval System ...
    Freenet is a peer-to-peer network for anonymous, location-independent storage and retrieval of data, using a distributed file system.
  52. [52]
    From Client-Server to P2P Networking | SpringerLink
    In this chapter, an investigation of network architecture evolution, from client-server to P2P networking, will be given, underlining the benefits and the ...
  53. [53]
    [PDF] Project JXTA: A Technology Overview - cs.Princeton
    Apr 25, 2001 · Peer Discovery Protocol. This protocol enables a peer to find advertisements on other peers, and can be used to find any of the peer, peer ...
  54. [54]
    [PDF] Names in Distributed Systems
    Naming Services in Distributed Systems in general - provide clients with values of attributes of named objects; name space - the collection of valid names ...
  55. [55]
    [PDF] DISTRIBUTED SYSTEMS PRINCIPLES AND PARADIGMS ...
    Q: An alternative definition for a distributed system is that of a collection of independent computers providing the view of being a single system, that is, ...
  56. [56]
    Extension of the banker's algorithm for resource allocation in a ...
    The Banker's algorithm for resource allocation prevents deadlocks. Its straightforward extension to a distributed environment would require a centralized ...
  57. [57]
    DCE 1.1: Directory Services - Information Model
    The DCE directory information model consists of a set of composite name spaces. Composite name spaces are linked together in a hierarchical order. This document ...
  58. [58]
    Principles of Eventual Consistency - ACM Digital Library
    In this tutorial, we carefully examine both the what and the how of consistency in distributed systems.
  59. [59]
    [PDF] A Location Service for Worldwide Distributed Objects
    We introduced the Globe object model and its location service which provides transparency of location, migration, distribution, and replication. We can ...
  60. [60]
    [PDF] The Byzantine Generals Problem - Leslie Lamport
    We devote the major part of the paper to a discussion of this abstract problem and conclude by indicating how our solutions can be used in implementing a ...
  61. [61]
    Failover Clustering | Microsoft Learn
    Jun 25, 2025 · Failover clustering is a powerful strategy to ensure high availability and uninterrupted operations in critical environments.
  62. [62]
    [PDF] IBM Tivoli Monitoring: High Availability Guide for Distributed Systems
    The hot standby operation is best illustrated by describing a scenario. In this example, all monitoring components are started in order, a scenario that ...
  63. [63]
    [PDF] DMTCP: Transparent Checkpointing for Cluster Computations and ...
    On 128 distributed cores (32 nodes), checkpoint and restart times are typically 2 seconds, with negligible run-time overhead. Typical checkpoint times are ...
  64. [64]
    [PDF] Debugging Distributed Systems - UBC Computer Science
    Achieving such fault tolerance, however, requires developers to reason through complex failure modes. For most distributed systems, fault tolerance cannot ...
  65. [65]
    Distributed system availability - Availability and Beyond
    The calculation of availability using MTBF and MTTR has its roots in hardware systems. However, distributed systems fail for very different reasons than a ...
  66. [66]
    [PDF] AN EXPERIMENTAL EVALUATION OF THE ASSUMPTION OF ...
    N-version programming has been proposed as a method of incorporating fault tolerance into software. Multiple versions of a program (i.e., "N") are prepared ...
  67. [67]
    [PDF] Measuring Network Latency Variation Impacts to High Performance ...
    Apr 9, 2018 · ABSTRACT. In this paper, we study the impacts of latency variation versus latency mean on application runtime, library performance, ...
  68. [68]
    [PDF] A Quantitative Survey of Communication Optimizations in Distributed ...
    Due to the dependency between communication tasks and computation tasks (Fig. 1b), the C2C ratio is the key factor that affects the system scalability. In ...
  69. [69]
    On Extending Amdahl's law to Learn Computer Performance - arXiv
    Oct 15, 2021 · We propose to (1) extend Amdahl's law to accommodate multiple configurable resources into the overall speedup equation, and (2) transform the speedup equation ...
  70. [70]
    [PDF] Scalable Dynamic Partial Order Reduction? - Parallel Data Lab
    In this paper we are concerned with scaling dynamic partial order reduction, a key technique for mitigating the state space explosion problem, to very large ...
  71. [71]
    Data partitioning guidance - Azure Architecture Center
    Horizontal partitioning (often called sharding). In this strategy, each partition is a separate data store, but all partitions have the same schema. Each ...
  72. [72]
    A benchmark for performance evaluation of a distributed file system
    This paper presents a performance study of a distributed file system. The distributed system allows for remote file access and remote process execution.
  73. [73]
    [PDF] CACHING IN LARGE-SCALE DISTRIBUTED FILE SYSTEMS
    We introduce the notion of dynamic hierarchical caching, in which adaptive client hierarchies are constructed on a file-by-file basis. Trace-driven simulation ...
  74. [74]
    HRFP: Highly Relevant Frequent Patterns-Based Prefetching and ...
    Mar 1, 2023 · In this research, we introduced a novel highly relevant frequent patterns (HRFP)-based algorithm that prefetches content from the distributed file system ...
  75. [75]
    Kerberos: An Authentication Service for Computer Networks
    Kerberos is a distributed authentication service that allows a process (a client) running on behalf of a principal (a user) to prove its identity to a verifier ...
  76. [76]
    [PDF] A Survey of Security Threats in Distributed Operating System
    Abstract—Over the last years, the security into software systems especially in the distributed system becomes complex day by day.
  77. [77]
    [PDF] Guide to IPsec VPNs - NIST Technical Series Publications
    Jun 1, 2020 · Backing up the router operating system and configuration files is a necessity since the prototype is being implemented on production ...
  78. [78]
    Distributed Automatic Configuration of Complex IPsec-Infrastructures
    We analyze the security requirements and further desirable properties of IPsec policy negotiation, and show that the distribution of security policy ...
  79. [79]
    Distributed systems security - ScienceDirect.com
    This multiuser distributed environment proposes unique security challenges ... Journal of Network and Computer Applications, Volume 229, 2024, Article 103917.
  80. [80]
    ReDS: reputation for directory services in P2P systems
    As a system distributed over a diverse set of untrusted nodes, however, directory services must be resilient to adversarial behavior by such malicious insiders.
  81. [81]
    [PDF] Security Issues in Distributed Systems – A survey - CORE
    Based on these considerations the aim of this paper is to describe the concept of security issues in distributed systems considering some important concerns ...
  82. [82]
    modular architectures for distributed and databases systems
    This paper describes the importance of modularity in systems and lists a number of reasons why systems will become increasingly modular.
  83. [83]
    [PDF] Specifying and Verifying Systems With TLA - Leslie Lamport
    Correct code is an important component of software reliability. Modern operating systems make extensive use of concurrent and distributed algorithms.
  84. [84]
    Formal Verification Tool TLA+: An Introduction from the Perspective ...
    Dec 20, 2021 · TLA+ verification is a basic standard for proposing a new algorithm from an existing algorithm in the research fields of distributed algorithms ...
  85. [85]
    [PDF] Virtualization in Distributed System: A Brief Overview
    There are various types of virtualizations that can be used to increase the performance in distributed systems. Some of them are OS virtualization, storage ...
  86. [86]
    [PDF] Design and Modular Verification of Distributed Transactions in ...
    We developed our specification using a modular technique, which allowed us to formalize the interactions between the distributed transactions protocol and the ...
  87. [87]
    [PDF] The Multikernel: A new OS architecture for scalable multicore systems
    We present a multikernel, Barrelfish, which explores the implications of applying the model to a concrete OS implementation. We show through measurement ...
  88. [88]
    The Barrelfish Operating System
    Barrelfish is a new research operating system being built from scratch and released by ETH Zurich in Switzerland.
  89. [89]
    No-MPI OpenCL-Only Distributed Computing With PoCL-Remote
    Sep 4, 2023 · PoCL-Remote enables distributed OpenCL computing without MPI, using a client-server architecture with a daemon on any machine with OpenCL, and ...
  90. [90]
    [PDF] A Distributed OpenCL Framework using Redundant Computation ...
    The host processor executes the host program, and compute devices execute kernels. Kernels are built online or offline.
  91. [91]
  92. [92]
    unikraft/unikraft - GitHub
    Unikraft is a radical, yet Linux-compatible (with effortless tooling), technology for running applications as highly optimized, lightweight and single-purpose ...
  93. [93]
    [PDF] Adaptive Resource Sharing in Multicores - People at MPI-SWS
    Abstract—This short paper presents an adaptive, operating system (OS) anchored budgeting mechanism for controlling access to a shared resource.
  94. [94]
    [PDF] Holistic resource allocation for multicore real-time systems ...
    One effective approach to mitigating interferences among concurrent tasks is resource partitioning: by dividing the cache and memory bandwidth among the cores, ...
  95. [95]
    OpenStack: Open Source Cloud Computing Infrastructure
    OpenStack is a set of software components that provide common services for cloud infrastructure.
  96. [96]
    [PDF] OpenStack++ for Cloudlet Deployment - Carnegie Mellon University
    To provide a systematic way to expedite cloudlet deployment, I implemented OpenStack++ that extends OpenStack, an open source ecosystem for cloud computing. In ...
  97. [97]
    Firecracker: Lightweight Virtualization for Serverless Applications
    We developed Firecracker, a new open source Virtual Machine Monitor (VMM) specialized for serverless workloads, but generally useful for containers, functions ...
  98. [98]
    All one needs to know about fog computing and related edge ...
    Fog nodes can be placed close to IoT source nodes, allowing latency to be noticeably reduced compared to traditional cloud computing. While this example gives ...
  99. [99]
    Overview | Kubernetes
    Sep 11, 2024 · Kubernetes is a portable, extensible, open source platform for managing containerized workloads and services, that facilitates both ...
  100. [100]
    Energy-Efficient Distributed Edge Computing to Assist Dense ... - MDPI
    The proposed model achieved a higher energy efficiency by 19%, resource utilization by 54%, latency efficiency by 86%, and reduced network congestion by 92% ...