
Distributed operating system

A distributed operating system (DOS) is a software layer that coordinates the operation of multiple independent, networked computers, presenting them to users and applications as a single, unified system while managing resources, communication, and computation across the nodes without shared physical memory. Unlike traditional centralized operating systems that run on a single machine, or network operating systems that provide only loose connectivity between separate machines, a DOS achieves a high degree of transparency, hiding details such as resource location, access methods, migration, concurrency, replication, and failures, to create the illusion of a single uniprocessor. This is facilitated by mechanisms such as remote procedure calls (RPC) and remote method invocation (RMI), which enable seamless inter-machine interaction over unreliable networks.

Key goals of a DOS include enhancing resource sharing (e.g., CPU, memory, and storage across nodes), improving scalability through incremental addition of processors, boosting reliability via redundancy and fault tolerance, and optimizing performance via load balancing and parallel execution. Challenges arise, however, from network-induced issues such as communication overhead (remote operations can be significantly slower than local ones), the absence of a global clock or global state, and the need to handle failures in components like disks, links, or software without disrupting the overall system.

Historically, DOS concepts emerged in the 1970s and 1980s amid advances in networking and workstation hardware, with influential research projects including the Distributed Computing System (emphasizing object-based distribution), Amoeba (a microkernel-based system for transparent distributed computing), LOCUS (focused on distributed file systems), and the V System (pioneering RPC). These efforts, often from universities such as UCLA, Stanford, and the Vrije Universiteit, laid the groundwork for modern distributed systems seen in cloud computing and large-scale internet services.

Design considerations for a DOS encompass resource-management techniques such as sender- or receiver-initiated load balancing, distributed locking for mutual exclusion, client caching for efficiency (e.g., in distributed file systems such as AFS), and replication strategies to ensure availability, all while addressing the inherent complexities of decentralized control.

Introduction

Definition and Characteristics

A distributed operating system (DOS) is software designed to manage a collection of independent computers, making them appear to users as a single coherent system. It handles resource sharing, communication, and coordination in a transparent manner, abstracting the underlying distribution of hardware and software components. This contrasts with a single-node operating system, where resource management is confined to local hardware without the need for inter-machine coordination or network latency considerations.

Key characteristics of a DOS include various forms of transparency that mask the complexities of distribution from users and applications. Location transparency hides the physical location of resources, allowing users to access files or processes without knowing which machine hosts them. Access transparency ensures uniform interaction with resources regardless of their type or location, as if they were local. Migration transparency permits the movement of processes or resources between machines without user intervention or awareness. Replication transparency conceals the existence of multiple copies of resources for improved availability or load balancing. Failure transparency enables the system to recover from hardware or software faults automatically, maintaining service continuity. Concurrency transparency manages simultaneous access to shared resources without conflicts visible to users. Performance and scaling transparency allow the system to adapt to varying loads and sizes while preserving consistent behavior. These transparencies collectively create the illusion of a single system, where users perceive a unified environment despite the underlying multiplicity of nodes. Additionally, a DOS emphasizes hardware independence, enabling software to run across heterogeneous processors without modification, and fault tolerance through redundancy, such as data replication and failover mechanisms, to enhance reliability in the face of component failures.

Classic examples of DOS include Amoeba, developed by Andrew Tanenbaum's group at the Vrije Universiteit, which uses a microkernel architecture to support object-based resource management across workstations and servers. Sprite, from UC Berkeley, focuses on high-performance file caching and process migration within workstation clusters to achieve a single system image. Plan 9, from Bell Labs, extends Unix principles into a distributed environment with a global file system and networked resource naming for seamless multi-machine operation.
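
These transparencies can be made concrete with a small sketch. The following Python fragment is a hypothetical illustration, not drawn from any of the systems named above: a global name service maps location-independent names to replicas, so a client resolving a name never learns, or chooses, the node that serves it.

```python
# Hypothetical sketch: a global namespace that hides resource location.
# Node names, paths, and the replica-selection policy are illustrative only.
import random

class GlobalNameService:
    """Maps location-independent names to one or more (node, local_path) replicas."""

    def __init__(self):
        self._registry: dict[str, list[tuple[str, str]]] = {}

    def register(self, name: str, node: str, local_path: str) -> None:
        self._registry.setdefault(name, []).append((node, local_path))

    def resolve(self, name: str) -> tuple[str, str]:
        # Replication transparency: the caller never picks a replica explicitly.
        return random.choice(self._registry[name])

def read_file(ns: GlobalNameService, name: str) -> str:
    node, path = ns.resolve(name)           # location transparency
    return f"(fetched {path} from {node})"  # stand-in for an RPC to that node

ns = GlobalNameService()
ns.register("/global/docs/report.txt", "node-a", "/disk1/report.txt")
ns.register("/global/docs/report.txt", "node-b", "/mirror/report.txt")
print(read_file(ns, "/global/docs/report.txt"))
```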

Distinction from Networked and Clustered Systems

Distributed operating systems (DOS) are distinguished from networked operating systems (NOS) primarily by the degree of transparency and integration they provide to users. In a NOS, resource sharing is facilitated through protocols such as the Network File System (NFS) for remote file access or remote procedure calls (RPC) for inter-machine communication, but users must explicitly manage these interactions and remain aware of the underlying distribution across separate machines. This approach treats the network as an extension for ad hoc sharing rather than as a unified whole, with each machine running its own independent OS instance. In contrast, a DOS abstracts distribution to create the illusion of a single coherent system, enabling seamless resource access without user intervention in location-specific details.

Clustered systems, such as Beowulf clusters, represent another category of loosely coupled computing environments that differ from DOS in their hardware-centric design and limited OS-level abstraction. These systems aggregate commodity-grade computers into a single computational resource, often using a standard OS such as Linux enhanced with parallel programming libraries such as the Message Passing Interface (MPI) to coordinate tasks across nodes. While clusters achieve high-performance parallel execution through aggregated hardware resources, they do not provide the full OS integration of a DOS; instead, distribution is managed at the application or middleware level, requiring developers to handle node coordination explicitly.

The core distinctions lie in the scope of system integration and the resulting trade-offs. A DOS offers OS-level mechanisms like unified naming spaces and process migration to maintain the single-system abstraction, which incurs higher overhead but enhances usability in transparent environments. Networked and clustered systems, by relying on application-level tools, provide greater flexibility for heterogeneous setups at the cost of reduced transparency and an increased user or programmer burden. These differences position DOS within a broader spectrum of computing paradigms, evolving from centralized operating systems on isolated machines, which offer no distribution, to DOS with their unified illusion, and further to loosely coupled grids in which resources span wide-area networks with minimal central coordination.
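
To make the contrast concrete, the fragment below uses mpi4py (a Python binding for MPI, chosen here purely for illustration) to show the explicit, application-level coordination typical of cluster computing: the programmer must reason about ranks and combine partial results manually, whereas a DOS would aim to hide such placement and gathering behind its single-system abstraction.

```python
# Illustrative only: explicit, application-level coordination on a cluster
# using mpi4py. Run with e.g. `mpiexec -n 4 python sum_ranks.py`.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()     # the programmer must know which process this is
size = comm.Get_size()

local_value = rank * rank  # some per-node partial result

# The programmer explicitly reduces the results; nothing is transparent.
total = comm.reduce(local_value, op=MPI.SUM, root=0)

if rank == 0:
    print(f"sum of squares of ranks 0..{size - 1} = {total}")
```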

Historical Evolution

Pioneering Systems (1950s-1970s)

Early developments in computer systems from the 1950s through the 1970s, driven by Cold War imperatives for fault-tolerant and scalable computing, laid foundational concepts for distributed computing, though true distributed operating systems emerged later with networking advances. Military needs demanded reliable systems capable of withstanding disruptions, leading to designs that emphasized modularity and redundancy amid constraints like the high failure rates of vacuum tube technology. Batch processing was prevalent, focusing on modular hardware-software integration that foreshadowed distributed approaches.

One of the earliest examples was the DYSEAC, completed in 1954 by the National Bureau of Standards for the U.S. Army Signal Corps as a transportable, truck-mounted computer for field computations. Weighing approximately 18 tons and using over 3,000 vacuum tubes, DYSEAC featured a design that allowed supervisory control to be distributed across system components for task execution, enabling flexible reconfiguration for military field operations. This approach addressed tube unreliability through distributed logic functions within a single installation, influencing later fault-tolerant designs in computing systems.

The Lincoln TX-2, operational from 1958 at MIT's Lincoln Laboratory, advanced modularity through interrupt-driven mechanisms that supported concurrent operations across multiple instruction sequences. This transistor-based system, with 65,536 words of core memory, employed a "multiple-sequence" program technique in which independent sequences could run concurrently and interleave, facilitating early concepts in concurrency and input/output overlap. Designed under military sponsorship for defense research, TX-2's architecture emphasized modularity by isolating I/O and computation, reducing single-point failures in batch-oriented setups.

In the early 1960s, theoretical concepts like intercommunicating cells emerged as foundations for cellular architectures in computing, notably proposed by C. Y. Lee in a 1962 paper. These cells, each containing simple logic and memory, communicated locally to form networks for solving complex problems, such as associative search and retrieval, without central coordination. Burroughs explored similar ideas in the D825 multiprocessor, introduced around 1962, which used up to four CPUs interconnected via crossbar switches for symmetrical MIMD processing, aiming at scalable, fault-tolerant computation for military command-and-control demands. Such designs prioritized redundancy and modularity over centralized processing, laying groundwork for later distributed architectures despite hardware limitations.

By the 1970s, the advent of networking marked a pivotal shift. The ARPANET, operational from 1969, enabled the first wide-area experiments, such as early resource-sharing programs in 1971-1972, which demonstrated resource sharing and communication across networked nodes, key precursors to distributed operating systems.

Key Foundational Developments (1980s-1990s)

During the 1980s and 1990s, the rise of affordable workstations and local area networks spurred innovations in software abstractions that masked the distributed nature of resources, enabling more seamless distributed operating system (DOS) functionality. Researchers shifted focus from hardware-centric approaches to higher-level mechanisms for resource sharing, consistency, and fault tolerance, laying groundwork for scalable DOS environments.

A pivotal advancement was the development of distributed shared memory (DSM) systems, which provided programmers with the illusion of a single, coherent address space across networked machines. The Ivy system, implemented in 1987 by Kai Li, pioneered software-based DSM by layering a shared virtual memory on top of the Apollo Domain network OS. Ivy used page-level granularity for memory sharing, employing broadcast-based invalidation protocols to maintain coherence; upon a page fault, the requesting processor fetched the page from its owner, ensuring strict consistency while supporting parallel applications on clusters of up to 64 processors. Building on this, researchers at Rice University introduced lazy release consistency in 1992, a relaxed memory model that deferred update propagation until synchronization points (acquires), using vector timestamps and write notices to track causal dependencies. Compared with the eager release consistency of the earlier Munin system, this approach reduced communication overhead by up to 50% in benchmarks such as SPLASH, minimizing false sharing and enabling efficient multiple-writer protocols for shared variables.

File system abstractions also evolved to support location transparency and availability in distributed settings. The Andrew File System (AFS), developed at Carnegie Mellon University in the mid-1980s, offered a unified namespace across thousands of workstations through Vice file servers and Venus client caches. AFS employed whole-file caching on local disks, with a callback mechanism to invalidate caches only on modifications, achieving high performance by reducing server polls; read-only volume replication at higher levels of the namespace further enhanced scalability and fault tolerance. Extending these ideas, the Coda file system, also from Carnegie Mellon and released in 1991, incorporated server replication and disconnected operation. Coda used volume-level replication with server-to-server synchronization protocols, combined with client-side hoarding and reintegration for caching, ensuring availability even during network partitions while resolving conflicting updates through resolution logs.

Transaction abstractions emerged to guarantee atomicity in distributed operations, addressing failures in multi-site updates. The two-phase commit (2PC) protocol, formalized in the late 1970s and widely adopted in DOS research, coordinates participants via a prepare phase (logging intents) followed by a commit or abort phase, ensuring all-or-nothing outcomes despite crashes (a simplified coordinator is sketched at the end of this section). This was exemplified in the Argus system from MIT, developed in the 1980s by Barbara Liskov's group, in which atomic actions served as distributed transactions spanning multiple guardians (protected processes). Argus implemented 2PC for top-level actions, using version numbers on atomic objects for recovery and nested actions for modularity, providing serializability and failure atomicity in applications like distributed databases.

Middleware for group communication and coordination further abstracted persistence and reliability. The ISIS toolkit, created at Cornell University in 1985 by Kenneth Birman and colleagues, introduced virtual synchrony semantics for process groups, enabling reliable multicast with ordering guarantees and membership notifications. ISIS facilitated fault-tolerant replication by treating groups as resilient objects, where upcalls handled view changes during failures, supporting applications like bulletin boards with automatic recovery.

Reliability mechanisms emphasized recovery from node failures without full system halts. In the Amoeba DOS, whose development was led by Andrew Tanenbaum at the Vrije Universiteit from the early 1980s, reliability relied on object replication in the directory service and at-most-once RPC semantics, with user-level servers handling recovery via stable storage logs. Amoeba's design isolated faults, allowing crashed processes to restart from checkpoints in user space, while planned multicasting in version 5.0 (early 1990s) aimed to coordinate replicated state across processors.

Key experimental systems integrated these abstractions for dynamic resource management. The Sprite OS from UC Berkeley, developed in the late 1980s, supported transparent process migration to balance load across workstations, using a global naming scheme and checkpointing of execution state to transfer running processes mid-execution with minimal overhead (under 1 second for large processes). Sprite's migration facility, combined with a network-wide shared file system, demonstrated practical DOS operation on Sun workstations, influencing later cluster computing.
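
The following is a minimal, single-process sketch of the two-phase commit idea described above, not the Argus implementation; real coordinators add durable logging, timeouts, and recovery after crashes.

```python
# Minimal sketch of two-phase commit (2PC), for illustration only.
from enum import Enum

class Vote(Enum):
    YES = "yes"
    NO = "no"

class Participant:
    def __init__(self, name, will_commit=True):
        self.name = name
        self.will_commit = will_commit
        self.state = "init"

    def prepare(self) -> Vote:
        # Phase 1: the participant records its intent and votes.
        self.state = "prepared" if self.will_commit else "aborted"
        return Vote.YES if self.will_commit else Vote.NO

    def commit(self): self.state = "committed"
    def abort(self):  self.state = "aborted"

def two_phase_commit(participants):
    # Phase 1: the coordinator collects votes.
    votes = [p.prepare() for p in participants]
    decision = "commit" if all(v is Vote.YES for v in votes) else "abort"
    # Phase 2: the coordinator broadcasts the decision.
    for p in participants:
        p.commit() if decision == "commit" else p.abort()
    return decision

nodes = [Participant("guardian-a"), Participant("guardian-b", will_commit=False)]
print(two_phase_commit(nodes), [(p.name, p.state) for p in nodes])
```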

Core Components and Mechanisms

Kernel and Process Management

In distributed operating systems (DOS), the kernel serves as the core coordinator for processes spanning multiple nodes, emphasizing modularity and fault isolation to handle the inherent distribution of resources. Unlike traditional monolithic kernels that bundle services into a single address space for efficiency, microkernel designs in DOS minimize kernel functionality to essentials like inter-process communication (IPC) and basic scheduling, delegating higher-level services to user-space servers. This approach enhances reliability in distributed environments by containing failures to individual nodes and facilitating portability across heterogeneous hardware. For instance, the Amoeba DOS employs a microkernel that runs identically on all machines, managing only process creation, threads, and RPC-based communication while offloading file and directory services to dedicated servers. L4-based microkernels further exemplify this design in modern DOS, prioritizing minimalism with capabilities for secure resource delegation and high-performance IPC to support distributed execution on multicore and networked systems. These kernels, such as seL4, achieve sub-microsecond IPC latencies, enabling efficient message exchanges across nodes without shared memory assumptions. In contrast to shared memory models, which rely on uniform address spaces and are challenging to implement over networks due to consistency overheads, message-passing kernels like L4 and Amoeba use explicit IPC primitives (e.g., synchronous RPC) to provide reliable, ordered communication in distributed settings.

Process management in DOS extends beyond local creation to location-transparent operations, allowing processes to spawn on optimal nodes without user awareness of physical locations. This is achieved through global naming services and kernel-level abstractions that hide node boundaries, as seen in Amoeba's thread-based process model, where process creation via RPC enables remote instantiation with shared code and data across machines. Load balancing complements this by dynamically redistributing processes to prevent hotspots; sender-initiated diffusion algorithms, for example, let overloaded nodes probe neighbors and transfer tasks based on local load thresholds (a simplified version is sketched below), improving overall throughput in unstructured networks.

Process migration mechanisms in DOS facilitate moving executing processes between nodes for load balancing or fault resilience, involving the transfer of process state including memory, open files, and communication links. Seminal implementations, such as in the Sprite operating system, support transparent migration at any time by checkpointing the process image, capturing CPU registers, memory pages, and open files, and resuming on the destination node, with global file naming ensuring resource continuity. State transfer protocols typically employ a two-phase approach: pre-migration demand paging to move only active pages and post-migration link updating to redirect communication endpoints, minimizing downtime to seconds in local-area environments.

Scheduling in DOS balances local autonomy with global coordination: local schedulers handle per-node priorities using traditional algorithms such as priority-based round-robin, while global schedulers aggregate load information to assign or migrate tasks across the system for balanced utilization. Global approaches, often using heuristic or graph-based models, can outperform isolated local scheduling in simulated heterogeneous clusters under varying loads, though they incur communication overheads. In distributed systems, priority inheritance protocols extend this by propagating priorities across nodes during remote blocking; for example, when a high-priority task is blocked on a remote low-priority one, the latter inherits the higher priority via timestamped tokens, bounding inversion delays to network latencies and ensuring schedulability in embedded DOS like those based on L4.
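
The sketch below illustrates the sender-initiated load-balancing idea referenced above in simplified form; the thresholds, probe limit, and random neighbor selection are assumptions chosen for demonstration rather than parameters of any particular system.

```python
# Hedged sketch of sender-initiated load balancing: an overloaded node probes
# a few neighbors and transfers tasks to the least-loaded one.
import random

THRESHOLD = 5      # queue length above which a node tries to shed load
PROBE_LIMIT = 3    # how many neighbors an overloaded node asks

class Node:
    def __init__(self, name):
        self.name = name
        self.queue = []          # pending tasks

    def load(self):
        return len(self.queue)

    def maybe_shed_load(self, neighbours):
        if self.load() <= THRESHOLD:
            return
        probed = random.sample(neighbours, min(PROBE_LIMIT, len(neighbours)))
        target = min(probed, key=Node.load)
        while self.load() > THRESHOLD and target.load() < THRESHOLD:
            task = self.queue.pop()          # "migrate" one task
            target.queue.append(task)
            print(f"{self.name} -> {target.name}: migrated {task}")

nodes = [Node(f"n{i}") for i in range(4)]
nodes[0].queue = [f"task{i}" for i in range(10)]   # n0 starts overloaded
nodes[0].maybe_shed_load(nodes[1:])
print({n.name: n.load() for n in nodes})
```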

Inter-Process Communication and Synchronization

In distributed operating systems, inter-process communication (IPC) enables coordination among processes executing on separate nodes without shared memory, relying instead on network-based exchanges to maintain correctness and efficiency. The two dominant IPC paradigms are remote procedure calls (RPC) and message passing, each addressing different needs for abstraction and performance in heterogeneous environments. These mechanisms must handle challenges like network latency, partial failures, and message loss to ensure reliable interaction.

Remote procedure calls provide a familiar programming abstraction by allowing a client to invoke a procedure on a remote server as if it were local, with the operating system handling parameter marshalling, transmission, and result return. The foundational design by Birrell and Nelson emphasized exception-based error handling and at-most-once semantics, where if a result is returned, the procedure was invoked exactly once; otherwise, it may have been invoked zero or one time, balancing reliability with the impossibility of exactly-once guarantees in asynchronous networks. Sun's ONC RPC, an early commercial implementation, extended this model using UDP for low-overhead transport and relied on idempotent operations to approximate at-most-once semantics, reducing duplicate effects through client timeouts and server-side duplicate suppression. These semantics ensure that RPC failures manifest as exceptions rather than silent errors, facilitating robust distributed applications like file servers.

Message passing, in contrast, offers explicit control over communication, suitable for systems requiring fine-grained coordination or multicast. Message-passing DOS kernels typically support both synchronous (blocking until reply) and asynchronous (non-blocking send) modes for point-to-point exchanges, with messages encapsulated in fixed-size headers carrying operation codes and parameters, plus data buffers (up to 32 KB in some designs). Such designs hide low-level details via stub routines that marshal arguments, similar to RPC, but allow direct buffering for high-throughput scenarios like remote file I/O, where a read operation sends an offset and length in the header and receives data in the buffer. This approach scales to wide-area networks by minimizing overhead, though it requires explicit error checking for lost messages via sequence numbers and acknowledgments.

Synchronization primitives in distributed operating systems extend local constructs like mutexes and semaphores to prevent race conditions across nodes, often using message exchanges for coordination. A distributed mutex enforces mutual exclusion for a critical section by requiring processes to request permission via a central coordinator or a peer-to-peer protocol, with the Ricart-Agrawala algorithm achieving message optimality through timestamped requests broadcast to all N processes, granting access after 2(N-1) messages per entry in the worst case and ensuring fairness via total ordering. Distributed semaphores generalize this to counting mechanisms, allowing up to a specified number of concurrent accesses; they operate via wait and signal operations propagated through messages, maintaining a global count at a coordinating process or via decentralized agreement to handle failures without centralized bottlenecks.

To enforce causal ordering, ensuring events appear in a sequence respecting their dependencies, distributed systems employ vector clocks, an extension of Lamport's logical clocks that captures partial orders in asynchronous environments (a minimal implementation is sketched at the end of this section). Each process maintains a vector of size N (one entry per process), initialized to zeros; on a local event, the process increments its own entry, and on sending a message, it attaches a copy of its vector, while the receiver updates its vector by taking the component-wise maximum and then incrementing its own entry. Two events are causally related if their vectors V_a and V_b satisfy V_a < V_b (every component of V_a is less than or equal to the corresponding component of V_b, with at least one strict inequality); otherwise, they are concurrent, enabling detection of consistent global states without a shared clock. This mechanism, formalized by Mattern, underpins causal multicast protocols by delaying messages that would violate ordering, reducing inconsistencies in group settings.

Group communication enhances IPC for collaborative processes by providing reliable, ordered delivery to multiple recipients, critical for fault-tolerant applications like replicated databases. Atomic broadcast guarantees that a message is delivered either to all members of a group or to none, with a consistent delivery order across members, often implemented atop reliable broadcast layers. Virtual synchrony, a higher-level abstraction, presents consistent group views by delivering messages in a stable order and notifying members of membership changes (e.g., failures) consistently, as in Birman's ISIS toolkit, where group multicasts use sequence numbers and checkpoints to ensure consistency without full replication overhead. In practice, this model supports up to hundreds of processes with low latency, as atomic delivery completes in a few message delays under normal conditions, making it suitable for coordination in distributed simulations.

Deadlock detection in distributed systems identifies circular waits for resources across nodes, using graph-based models adapted for decentralization. The wait-for graph represents processes as nodes, with a directed edge from a waiting process to the process holding the resource it awaits, aggregated globally via message probes to avoid constructing the full graph at one site. The Chandy-Misra-Haas algorithm employs edge-chasing: a blocked process initiates a probe message containing its identifier, which travels along wait-for edges to successors; if a probe returns to the initiator, a cycle exists, confirming deadlock with O(E) messages, where E is the number of edges, and no false positives under the AND model of resource requests. Resource allocation graphs extend this by including resource nodes and request/assignment edges, enabling distributed detection through periodic merging of local subgraphs or probe diffusion, though they incur higher communication costs in large systems due to the need for consistent snapshots. These techniques prioritize detection over prevention, allowing systems like Amoeba to resolve deadlocks by preempting resources from lower-priority processes.
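
The vector-clock mechanism described above can be sketched in a few lines; the three-process scenario and method names below are illustrative only.

```python
# Minimal vector-clock sketch (after Mattern/Fidge).
class VectorClock:
    def __init__(self, pid: int, n: int):
        self.pid = pid
        self.clock = [0] * n

    def local_event(self):
        self.clock[self.pid] += 1

    def send(self):
        self.local_event()
        return list(self.clock)          # timestamp attached to the message

    def receive(self, msg_clock):
        # Component-wise maximum, then increment the receiver's own entry.
        self.clock = [max(a, b) for a, b in zip(self.clock, msg_clock)]
        self.clock[self.pid] += 1

def happened_before(va, vb):
    return all(a <= b for a, b in zip(va, vb)) and va != vb

p0, p1 = VectorClock(0, 3), VectorClock(1, 3)
m = p0.send()                          # event a on process 0
p1.receive(m)                          # event b on process 1
print(happened_before(m, p1.clock))    # True: a causally precedes b
```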

Architectural Models

Centralized vs. Decentralized Structures

In distributed operating systems, centralized structures rely on a master-slave model in which a single coordinator node, often referred to as the master, makes global decisions and delegates tasks to subordinate slave nodes. This approach simplifies system management by concentrating control flow in one entity, enabling straightforward scheduling and resource allocation across nodes. For instance, the master assigns computational tasks to slaves, which execute them and report results, reducing the complexity of distributed coordination. However, this model introduces a single point of failure; if the master node crashes, the entire system may halt until recovery or failover occurs, limiting availability.

Decentralized structures, in contrast, distribute control without a central authority, allowing nodes to collaborate through fully distributed mechanisms for decision-making and state propagation. Gossip protocols exemplify this by enabling nodes to periodically exchange information with randomly selected peers, mimicking epidemic spreading to achieve eventual consistency across the system. Consensus algorithms like Lamport's Paxos further support decentralization by ensuring agreement on a single value among nodes despite failures, using phases in which proposers, acceptors, and learners coordinate via majority quorums to tolerate up to floor((n-1)/2) faulty processes in a system of n nodes. These methods enhance scalability by supporting growth in node count without bottlenecks, as each node handles local decisions while propagating updates asynchronously.

Hybrid approaches combine elements of both paradigms to balance simplicity and resilience, often employing hierarchical decentralization in which higher-level coordinators oversee lower-level distributed operations. Google's Borg system illustrates this through a replicated central master that manages cluster-wide decisions, such as task scheduling and resource allocation, while delegating execution to distributed agents on individual machines. Fault tolerance in such systems is bolstered by leader election mechanisms, like the Raft algorithm, which uses randomized timeouts to select a leader from a cluster and replicates logs to maintain consistency, allowing the system to continue operating with a majority of nodes intact. This structure scales to tens of thousands of nodes by sharding responsibilities and using asynchronous scheduling, mitigating the risks of pure centralization. Control messages in these architectures often leverage inter-process communication primitives to coordinate between centralized coordinators and decentralized nodes.

Overall, centralized models prioritize ease of implementation for smaller scales, while decentralized and hybrid variants offer superior fault tolerance and scalability for large, unreliable environments.
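
A push-style gossip round can be sketched as follows; the round count, fan-out of one peer per round, and versioned key-value state are illustrative assumptions rather than a specific published protocol.

```python
# Illustrative push-gossip sketch: each round, every node shares its known
# state with one random peer; information spreads epidemically.
import random

class GossipNode:
    def __init__(self, name):
        self.name = name
        self.state = {}                  # key -> (value, version)

    def update(self, key, value, version):
        current = self.state.get(key)
        if current is None or version > current[1]:
            self.state[key] = (value, version)

    def gossip_with(self, peer):
        for key, (value, version) in self.state.items():
            peer.update(key, value, version)

nodes = [GossipNode(f"n{i}") for i in range(8)]
nodes[0].update("leader", "n0", version=1)   # one node learns a new fact

for _ in range(5):                            # a few gossip rounds
    for node in nodes:
        node.gossip_with(random.choice([p for p in nodes if p is not node]))

print(sum(1 for n in nodes if "leader" in n.state), "of", len(nodes), "nodes know the fact")
```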

Client-Server and Peer-to-Peer Paradigms

In distributed operating systems, the client-server paradigm organizes interactions through asymmetric roles, where clients request services or resources from specialized servers that manage and provide them. This model centralizes resource management on servers, enabling efficient handling of shared services like file storage, while clients focus on user-specific tasks. For instance, in the Sprite operating system, the file system employs a client-server design in which servers maintain stateful information about client caches to ensure consistency, contrasting with stateless implementations like the standard Network File System (NFS). Scalability in this paradigm is often achieved through server replication, distributing load across multiple identical servers to handle increased client demands without compromising performance.

The peer-to-peer (P2P) paradigm, in contrast, relies on symmetric roles among nodes, where each participant functions as both a client and a server, contributing resources like storage and bandwidth to the collective system. This symmetry fosters decentralized cooperation, with overlay networks facilitating efficient resource discovery and routing; for example, the Chord distributed hash table (DHT) organizes nodes in a ring structure to route queries in a logarithmic number of hops, enabling scalable lookup of data across dynamic networks. In applications such as file-sharing distributed systems, extensions inspired by systems like Freenet demonstrate P2P's utility by distributing encrypted data fragments across nodes for anonymous storage and retrieval, enhancing privacy and availability without central points of control.

Modern distributed operating systems have evolved from predominantly client-server models toward hybrid or peer-to-peer approaches to leverage greater decentralization, particularly in resource-constrained or highly dynamic environments. This transition is supported by discovery protocols like JXTA, which enable peers to advertise and locate services through XML-based messaging, allowing seamless formation of peer groups for collaborative tasks. A key trade-off between these paradigms involves latency and robustness: client-server systems typically offer lower latency for direct requests due to centralized servers, but they risk single points of failure that reduce overall system availability. Conversely, P2P systems provide enhanced robustness through resource distribution and fault-tolerant routing, though they may incur higher latency from multi-hop paths in overlay networks.
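
The ring-based lookup that underlies Chord can be sketched as follows; for brevity this version searches the ring linearly instead of using Chord's O(log N) finger tables, and the 6-bit identifier space is an arbitrary choice for demonstration.

```python
# Hedged sketch of Chord-style lookup on a small identifier ring.
import hashlib

M = 6                      # identifier space: 2**6 = 64 slots
RING = 2 ** M

def chord_id(key: str) -> int:
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % RING

def successor(node_ids, ident):
    """Return the first node clockwise from `ident` on the ring."""
    candidates = sorted(node_ids)
    for nid in candidates:
        if nid >= ident:
            return nid
    return candidates[0]    # wrap around the ring

nodes = sorted(chord_id(f"node-{i}") for i in range(5))
for key in ["report.txt", "movie.mp4", "index.db"]:
    ident = chord_id(key)
    print(f"key {key!r} (id {ident}) -> node id {successor(nodes, ident)}")
```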

Design Principles and Considerations

Transparency and Resource Management

Transparency in distributed operating systems (DOS) refers to the mechanisms that hide the complexities of distribution from users and applications, allowing interaction with resources as if they were part of a single, centralized system. This is achieved through various forms of transparency, enabling seamless access to distributed components without requiring knowledge of their physical locations, failure states, or access protocols. Key types include location transparency, which provides uniform naming via global namespaces to decouple resource references from their physical hosts; access transparency, which ensures consistent application programming interfaces (APIs) across distributed resources regardless of their implementation; and failure transparency, which masks component crashes using proxies or redirection to maintain service continuity.

Resource management in DOS extends traditional techniques to handle allocation across multiple nodes while preventing issues like deadlocks. For distributed allocation, hierarchical extensions to classical allocation and deadlock-handling algorithms impose an ordering on system nodes and apply modified checks at each level, avoiding centralized bottlenecks by allowing local decisions with global coordination. Naming services further support transparency by organizing resources into hierarchical structures; for instance, the Distributed Computing Environment (DCE) employs cell-based naming, where resources within an administrative unit (a "cell") are referenced via location-independent paths, resolved through directory services that link local and global namespaces.

Migration and replication enhance resource management by enabling transparent relocation of processes or files to optimize load or availability, while maintaining data consistency. Process migration allows a running process to transfer between nodes without user intervention, as demonstrated in early systems like the Galaxy OS, where sender-initiated transfers handle state capture and restart to preserve execution continuity. Replication involves duplicating resources across nodes, with consistency models defining update propagation: sequential consistency ensures that all operations appear in a single order visible to all nodes, akin to a uniprocessor execution, whereas eventual consistency permits temporary divergences that resolve over time once updates propagate, balancing availability with reduced synchronization overhead.

A notable tool for achieving these transparencies is the Globe system's location service, which uses a scalable, worldwide distributed search tree to map object handles to contact addresses for distributed objects. This service supports location, migration, and replication transparency by allowing objects to expand or shrink their contact points dynamically, such as adding replicas or relocating to new hosts, without altering client references, thus hiding distribution details through a two-level naming scheme.
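
A two-level naming scheme of the kind described above can be sketched as follows; the class names, handles, and addresses are hypothetical and only illustrate how migration leaves client-visible names unchanged.

```python
# Hedged sketch of two-level naming: human-readable names resolve to stable
# object handles, and a separate location service maps handles to addresses.
class NameService:
    def __init__(self):
        self.names = {}          # "/docs/report" -> handle (never changes)

    def bind(self, name, handle):
        self.names[name] = handle

    def lookup(self, name):
        return self.names[name]

class LocationService:
    def __init__(self):
        self.contacts = {}       # handle -> set of current contact addresses

    def register(self, handle, address):
        self.contacts.setdefault(handle, set()).add(address)

    def move(self, handle, old, new):
        # Migration transparency: the handle (and hence the name) is unchanged.
        self.contacts[handle].discard(old)
        self.contacts[handle].add(new)

    def resolve(self, handle):
        return next(iter(self.contacts[handle]))

names, locs = NameService(), LocationService()
names.bind("/docs/report", handle="obj-42")
locs.register("obj-42", "node-a:9000")
locs.move("obj-42", "node-a:9000", "node-b:9000")    # object migrates
print(locs.resolve(names.lookup("/docs/report")))     # client code unchanged
```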

Reliability, Availability, and Fault Tolerance

In distributed operating systems (DOS), reliability refers to the probability that the system performs its intended functions without failure over a specified period, often modeled to handle arbitrary faults including malicious ones. A foundational approach to reliability is Byzantine fault tolerance (BFT), which addresses scenarios where components may fail in arbitrary ways, such as sending conflicting messages, yet the system must still reach agreement. The seminal Byzantine Generals Problem, formulated by Lamport, Shostak, and Pease, demonstrates that with oral messages, agreement is achievable only if more than two-thirds of the processes are non-faulty, using an algorithm that iteratively exchanges values and computes majorities over multiple rounds. This model underpins reliability in DOS by ensuring correct operation despite up to f faulty nodes in a system of 3f+1 total nodes.

Redundancy via replication enhances reliability by maintaining multiple identical copies of data or processes across nodes, allowing the system to mask failures by voting among replicas or switching to healthy ones. In DOS, replication strategies such as state machine replication ensure that all replicas process the same sequence of operations to maintain consistency, thereby tolerating crashes or omissions without loss of service. This approach is critical in environments like cluster computing, where node failures are common, and has been widely adopted in systems like Google's Spanner for linearizable transactions.

Availability in DOS focuses on minimizing downtime, often targeting 99.99% or higher uptime, through techniques that enable seamless service continuation during failures. Failover clustering groups multiple nodes to monitor each other, automatically transferring workloads from a failed node to a healthy one within seconds, using shared storage or virtual network addresses to maintain service accessibility. For instance, in enterprise DOS implementations, failover ensures that if a primary node crashes, a secondary assumes its role without user intervention, reducing outage impact. Hot standby nodes complement this by keeping idle replicas fully synchronized and ready to activate immediately upon detection of a primary failure, as seen in monitoring systems where standby processes mirror active ones in real time.

Fault tolerance in DOS extends beyond failure detection to recovery, enabling the system to continue or resume operations after a fault. Checkpoint/restart mechanisms periodically save process states across distributed nodes, allowing restart from the last consistent checkpoint after a crash, which is vital for long-running computations in clusters. The Distributed MultiThreaded CheckPointing (DMTCP) tool exemplifies this by transparently checkpointing multithreaded and distributed applications without modification, achieving restart times under 2 seconds on 128 cores with minimal runtime overhead. Distributed debugging supports fault tolerance by enabling developers to trace and reproduce failures across nodes, using techniques like record-replay to capture non-deterministic events for post-mortem analysis, thus facilitating robust error diagnosis.

Key metrics for evaluating these aspects in DOS include mean time between failures (MTBF), which measures the average operational period before a failure occurs, and mean time to repair (MTTR), the average duration needed to restore functionality after a failure; in distributed contexts, these are aggregated across nodes, with high MTBF-to-MTTR ratios indicating effective fault management. For example, BFT systems aim for an MTBF exceeding the system lifetime by tolerating f faults without compromising correctness. N-version programming bolsters reliability through software diversity, executing multiple independently developed versions of the same module and selecting outputs via majority voting, reducing common-mode failures; experiments by Knight and Leveson showed that while independence assumptions hold imperfectly due to correlated faults, the theoretical model with N=4 versions and low individual failure probabilities can achieve error rates below 10^{-9}. These mechanisms collectively hide failures from users, aligning with the principle of failure transparency.
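
The majority-voting step of N-version programming, together with the availability relation implied by the MTBF and MTTR metrics above, can be illustrated with a short sketch; the three toy "versions" and the numeric values are invented for demonstration.

```python
# Hedged sketch of N-version majority voting: run independently developed
# implementations of the same function and accept the majority output.
from collections import Counter

def version_a(x): return x * x
def version_b(x): return x ** 2
def version_c(x): return x * x + (1 if x == 3 else 0)   # deliberately faulty

def n_version_vote(x, versions):
    outputs = Counter(v(x) for v in versions)
    result, count = outputs.most_common(1)[0]
    if count <= len(versions) // 2:
        raise RuntimeError("no majority; escalate to recovery")
    return result

print(n_version_vote(3, [version_a, version_b, version_c]))  # 9, outvoting the fault

# Availability from the metrics above: A = MTBF / (MTBF + MTTR).
mtbf_hours, mttr_hours = 10_000, 1
print(f"availability = {mtbf_hours / (mtbf_hours + mttr_hours):.5f}")
```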

Challenges and Trade-offs

Performance and Scalability Issues

Distributed operating systems face significant performance challenges primarily due to the inherent overhead of communication, which introduces latency that can dominate execution times in distributed environments. Latency overhead arises from delays in data transmission between nodes, often exacerbated by variations in network conditions, and has been reported to cause up to 3.5 times slower runtime in communication-intensive applications. This overhead is particularly pronounced in scenarios where even small variations in latency (jitter) can degrade overall system performance more than the mean latency itself.

Another key performance bottleneck is the communication-to-computation ratio (C2C), which measures the balance between data transfer costs and local processing effort, often expressed as the total communication traffic divided by the total computation. In distributed deep learning systems, a high C2C ratio, such as in models like BERT-Large with low model intensity (I = 248), limits scalability, resulting in only 1.2x speedup across 4 GPUs due to excessive synchronization overhead. Extensions of Amdahl's law to distributed settings account for this by incorporating communication as an additional overhead component, highlighting how even highly parallelizable workloads are constrained by network dependencies.

Scalability in distributed operating systems is hindered by state explosion in large clusters, where the combinatorial growth of possible execution states from concurrent operations makes exhaustive verification or testing infeasible, often requiring techniques like dynamic partial order reduction to prune redundant paths and achieve up to 920x speedup with 1,024 workers. To mitigate these limits, partitioning strategies such as hash-based sharding distribute data across nodes using a sharding key (e.g., hashed identifiers) to balance load and prevent hot spots, enabling horizontal scaling while maintaining consistent schemas across shards.

Performance in distributed operating systems is typically evaluated using metrics like throughput (e.g., operations or requests per second) and response time, which quantify the system's capacity to handle workloads under varying conditions. Benchmarks for distributed file systems measure these by simulating remote access and execution, revealing scaling trade-offs in which throughput may plateau due to network or server bottlenecks. For instance, in high-performance distributed environments, workload injectors demonstrate how response times increase nonlinearly with load, guiding optimizations for real-world deployments.

Mitigation strategies focus on optimizations like caching hierarchies and prefetching in distributed file systems to reduce latency impacts. Caching hierarchies, including client- and server-driven approaches with techniques like LRU replacement, can achieve up to 83% hit rates by exploiting temporal locality, thereby cutting server load and client latency in large-scale setups. Prefetching algorithms, such as those based on highly relevant frequent patterns, anticipate access needs to improve read performance by 29% to 77% through proactive loading, enhancing overall throughput without excessive overhead. These methods intersect with data-locality principles by prioritizing placement that minimizes communication costs.
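
One common way to express the communication-extended speedup models mentioned above is speedup(N) = 1 / (s + p/N + c(N)), where s is the serial fraction, p = 1 - s the parallel fraction, and c(N) a communication-overhead term; the linear cost model used in the sketch below is an assumption for illustration, since real systems substitute measured overheads.

```python
# Hedged sketch: Amdahl's law extended with a communication-overhead term.
# speedup(N) = 1 / (s + p/N + c(N)); c(N) = c0 * N is an illustrative model.
def extended_speedup(serial_fraction: float, n_nodes: int, c0: float) -> float:
    parallel_fraction = 1.0 - serial_fraction
    comm_overhead = c0 * n_nodes
    return 1.0 / (serial_fraction + parallel_fraction / n_nodes + comm_overhead)

for n in (1, 4, 16, 64, 256):
    no_comm = extended_speedup(0.05, n, c0=0.0)
    with_comm = extended_speedup(0.05, n, c0=0.001)
    print(f"N={n:3d}  ideal Amdahl={no_comm:6.2f}  with communication={with_comm:6.2f}")
```

The output shows speedup with communication overhead peaking and then declining as nodes are added, which is the scaling behavior the text describes.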

Security and Complexity Management

Distributed operating systems (DOS) face unique security vulnerabilities stemming from their reliance on multiple interconnected nodes, where trust must be established across potentially untrusted environments. Unlike centralized systems, DOS require mechanisms to verify the identity and integrity of nodes that may be geographically dispersed and operated by different entities. A primary challenge is ensuring authentication to prevent unauthorized access, as compromised nodes can propagate malicious actions throughout the network. For instance, authentication protocols like Kerberos address this by providing a ticket-based scheme in which a trusted key distribution center issues time-limited credentials, allowing secure verification without transmitting passwords over the network (a simplified ticket flow is sketched at the end of this section). This approach mitigates risks such as credential theft and replay attacks in distributed environments.

Secure communication channels are essential to protect data in transit across distributed nodes, where eavesdropping and tampering pose significant threats. Integration of protocols like IPsec enables encryption, authentication, and integrity checks at the IP layer, ensuring that communications between DOS components remain confidential even over public networks. IPsec's use of security associations and key management further supports dynamic policy enforcement tailored to DOS requirements, such as protecting resource-sharing operations. However, implementing these in DOS introduces overhead, as nodes must negotiate policies without a central authority, potentially exposing vulnerabilities if not properly managed.

Attacks on DOS often exploit the decentralized nature of the system, amplifying threats like the Byzantine generals problem, where faulty or malicious nodes send conflicting information, leading to inconsistent states across the network. This problem, formalized in seminal work, illustrates how even a small fraction of unreliable nodes can prevent agreement in critical coordinated operations. In peer-to-peer (P2P) DOS architectures, insider threats, where legitimate peers are compromised or act maliciously, pose particular risks, as they can inject false data or disrupt services without external intrusion. Reputation-based mechanisms, such as those using distributed directories, help detect and isolate such insiders by tracking behavior over time. These threats underscore the need for robust security mechanisms integrated into DOS kernels to maintain system integrity.

Managing the inherent complexity of DOS designs requires structured approaches to ensure reliability without overwhelming development and maintenance efforts. Modular design principles decompose the system into independent components, such as separate modules for communication, naming, and fault handling, allowing isolated testing and easier maintenance. This reduces interdependencies, facilitating updates to specific parts without affecting the entire system, though it demands clear interfaces to avoid hidden coupling. Verification tools like TLA+ further aid complexity management by enabling formal specification and model checking of distributed algorithms, catching errors in concurrency and fault scenarios before deployment. TLA+'s temporal-logic-based approach has been applied to verify protocols in real-world distributed systems, ensuring that properties like safety and liveness hold under diverse failure modes.

The trade-offs of this complexity are evident in challenges like debugging distributed deadlocks, where resource contention across nodes creates wait cycles that are harder to detect and resolve than in single-machine systems. Tools for distributed debugging, such as those employing global snapshots or trace analysis, help identify these issues but increase system overhead and require coordinated instrumentation across nodes. To simplify management, virtualization layers abstract underlying hardware heterogeneity, allowing workloads to run as virtual machines or containers that encapsulate complexity and enable easier deployment and migration. While virtualization introduces its own performance costs, it streamlines operation by providing a uniform execution environment, balancing the price of complexity with practical usability.
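
The ticket-based idea described at the start of this section can be illustrated with a deliberately simplified sketch; this is not the Kerberos protocol (which involves multiple exchanges and session keys), only a demonstration of time-limited credentials that a service can verify without handling passwords.

```python
# Hedged sketch of time-limited, ticket-style authentication (NOT Kerberos).
import hashlib, hmac, json, time

ISSUER_KEY = b"shared-secret-between-issuer-and-service"   # assumption for demo

def issue_ticket(client: str, service: str, lifetime_s: int = 300) -> dict:
    body = {"client": client, "service": service,
            "expires": time.time() + lifetime_s}
    mac = hmac.new(ISSUER_KEY, json.dumps(body, sort_keys=True).encode(),
                   hashlib.sha256).hexdigest()
    return {"body": body, "mac": mac}

def verify_ticket(ticket: dict, service: str) -> bool:
    expected = hmac.new(ISSUER_KEY,
                        json.dumps(ticket["body"], sort_keys=True).encode(),
                        hashlib.sha256).hexdigest()
    fresh = ticket["body"]["expires"] > time.time()       # reject expired tickets
    right_service = ticket["body"]["service"] == service  # reject replay elsewhere
    return hmac.compare_digest(expected, ticket["mac"]) and fresh and right_service

t = issue_ticket("alice", "file-server")
print(verify_ticket(t, "file-server"), verify_ticket(t, "print-server"))
```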

Modern Research and Applications

Advances in Heterogeneous and Multi-Core Environments

Recent advances in distributed operating systems (DOS) have increasingly addressed the challenges posed by multi-core architectures, particularly in non-uniform memory access (NUMA) and multi-socket configurations, where traditional shared-memory models falter due to scalability limits. Barrelfish, a multikernel OS developed by researchers at ETH Zurich and Microsoft Research, exemplifies this shift by treating the machine as a distributed network of nodes, with each core running an independent kernel that communicates via explicit messages rather than shared state (the structure is illustrated in the sketch below). This design enables efficient handling of NUMA topologies through a system knowledge base (SKB) that maintains topology information, facilitating topology-aware communication and multicast-based TLB shootdowns that scale to 32 cores on multi-socket systems. In scheduling, Barrelfish employs distributed CPU drivers and monitors with per-core dispatchers, incorporating locality awareness by pinning tasks to specific cores based on SKB-derived topology to minimize inter-core communication overheads. Measurements on a 2x2-core system showed Barrelfish's message-passing implementation outperforming Linux's shared-memory IP loopback by up to 18% in throughput (2154 Mbit/s vs. 1823 Mbit/s). Although the project became inactive in 2020, its principles have influenced subsequent multikernel and multi-core operating system research.

Heterogeneous environments combining CPUs, GPUs, and FPGAs introduce further complexities for DOS, requiring middleware to abstract device differences and enable seamless resource orchestration across nodes. Distributed extensions to OpenCL have emerged as key enablers, allowing kernels to execute across networked heterogeneous devices without low-level messaging interfaces like MPI. For instance, PoCL-Remote provides a client-server model where a daemon on remote nodes exposes OpenCL devices as local, supporting peer-to-peer data transfers and event synchronization to optimize bandwidth in CPU-GPU clusters. This approach virtualizes heterogeneous hardware, enabling a DOS to treat distributed GPUs or FPGAs as a unified compute fabric, with efficient buffer management reducing client-side overheads in multi-node setups. Similarly, SnuCL-D decentralizes OpenCL execution by redundantly running host programs on all nodes in CPU/GPU clusters, virtualizing remote devices to eliminate single-host bottlenecks and achieving up to 45x speedup over centralized frameworks on 512-node systems for compute-intensive benchmarks. These middleware layers integrate into DOS by providing device-neutral APIs, allowing adaptive task offloading to specialized accelerators while maintaining fault tolerance through redundant computation.

In the 2020s, research has focused on lightweight frameworks and adaptive mechanisms to better exploit heterogeneous multi-core resources. Unikraft, an open-source unikernel development kit, supports constructing specialized, minimal OS images for distributed cloud and edge deployments, reducing boot times to milliseconds and memory footprints by up to 90% compared to full kernels for single-purpose applications. Its modular architecture allows fine-grained integration of components such as networking and filesystem stacks, enabling efficient scaling across heterogeneous nodes in distributed systems without unnecessary bloat. Complementing this, adaptive resource partitioning techniques dynamically allocate shared resources such as cache and memory bandwidth in multi-core DOS to mitigate interference. For example, OS-anchored cache budgeting in multicores adjusts access to the last-level cache based on observed usage, prioritizing cache-efficient threads. Holistic partitioning frameworks further extend this to distributed settings, dividing resources among cores and nodes while monitoring interference to rebalance tasks, as demonstrated in real-time systems with improved schedulability and reduced worst-case execution time (WCET) interference. These efforts address post-2010 hardware diversity, prioritizing scalability and adaptability over legacy monolithic designs.
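
The multikernel structure referenced above can be sketched schematically; the version below uses ordinary OS processes and queues as stand-ins for per-core kernels and inter-core message channels, which is an illustrative simplification rather than how Barrelfish is implemented.

```python
# Hedged sketch of the multikernel idea: per-"core" kernels share no state and
# coordinate only through explicit messages.
from multiprocessing import Process, Queue

def per_core_kernel(core_id: int, inbox: Queue, outbox: Queue) -> None:
    # Each kernel keeps purely local state; nothing is shared across cores.
    local_capabilities = {f"cap-{core_id}-{i}" for i in range(3)}
    while True:
        msg = inbox.get()
        if msg["op"] == "shutdown":
            break
        if msg["op"] == "query":
            outbox.put({"core": core_id, "caps": sorted(local_capabilities)})

if __name__ == "__main__":
    replies: Queue = Queue()
    inboxes = [Queue() for _ in range(4)]
    kernels = [Process(target=per_core_kernel, args=(i, q, replies))
               for i, q in enumerate(inboxes)]
    for k in kernels:
        k.start()
    for q in inboxes:              # "broadcast" by sending one message per core
        q.put({"op": "query"})
    for _ in kernels:
        print(replies.get())
    for q in inboxes:
        q.put({"op": "shutdown"})
    for k in kernels:
        k.join()
```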

Integration with Cloud and Edge Computing

Distributed operating systems (DOS) integrate closely with cloud computing paradigms, particularly in Infrastructure-as-a-Service (IaaS) environments, where they manage pooled resources across geographically dispersed nodes to support scalable workloads. OpenStack, an open-source IaaS platform, exemplifies this by providing modular components for orchestrating compute (Nova), storage (Swift and Cinder), and networking (Neutron) in distributed setups, enabling private and public cloud deployments with fault-tolerant resource allocation. Extensions such as OpenStack++ build on this foundation to support hybrid cloudlet environments, dynamically provisioning virtual machines and services at the network edge while maintaining centralized control for efficiency. In serverless architectures, DOS principles underpin lightweight virtualization technologies like AWS Firecracker, which acts as a minimal, secure virtual machine monitor for function execution without a full traditional OS stack. Firecracker employs microVMs to isolate workloads, achieving startup times under 125 milliseconds and memory footprints as low as 5 MiB, thus enabling dense, distributed serverless execution on cloud infrastructure.

Edge computing extends DOS capabilities to resource-constrained environments, particularly for Internet of Things (IoT) applications, through platforms that decentralize processing to reduce dependency on remote clouds. Platforms like EdgeX Foundry provide a vendor-neutral, microservices-based architecture for edge gateways, facilitating device connectivity and local data orchestration in distributed networks. This latency-aware distribution routes tasks to nearby nodes, minimizing end-to-end delays compared to centralized models in real-time scenarios.

As of the mid-2020s, emerging trends in DOS research emphasize AI-driven enhancements, such as federated learning frameworks integrated into operating system layers for collaborative model training across distributed devices without centralizing sensitive data. One proposal is a horizontal operating system that unifies edge resources for agent-based AI workloads, improving responsiveness in 5G/6G networks while preserving privacy through on-device computation. Container orchestration tools like Kubernetes further embody this evolution, functioning as a DOS-like abstraction that schedules and scales containerized workloads across clusters, abstracting underlying hardware heterogeneity for cloud-native applications.

Sustainability considerations are increasingly central to DOS design in cloud and edge contexts, with energy-efficient distribution strategies optimizing workload placement to curb power usage amid growing data volumes. For instance, distributed edge computing models in dense IoT deployments have achieved 19% higher energy efficiency by intelligently offloading tasks from battery-limited devices to fog nodes, alongside 54% better resource utilization and 86% improved latency management.
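
Latency-aware task placement of the kind used at the edge can be sketched as a simple cost comparison; the sites, latency figures, and additive cost model below are invented for illustration.

```python
# Hedged sketch of latency-aware placement: pick the site with the lowest
# estimated completion time, whether an edge gateway or a cloud region.
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    rtt_ms: float          # round-trip network latency to the requesting device
    queue_ms: float        # estimated queueing delay from current load
    service_ms: float      # estimated execution time on this site's hardware

    def estimated_completion_ms(self) -> float:
        return self.rtt_ms + self.queue_ms + self.service_ms

def place_task(sites: list[Site]) -> Site:
    return min(sites, key=Site.estimated_completion_ms)

candidates = [
    Site("edge-gw-1", rtt_ms=5, queue_ms=40, service_ms=30),
    Site("edge-gw-2", rtt_ms=8, queue_ms=5, service_ms=30),
    Site("cloud-region", rtt_ms=60, queue_ms=1, service_ms=10),
]
print("placed on:", place_task(candidates).name)
```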

References

  1. [1]
    [PDF] Distributed Operating Systems - Computer Science
    A distributed operating system runs on multiple, independent CPUs, where each computer has its own OS, and users don't know which machine their programs are ...
  2. [2]
    [PDF] Distributed and Operating Systems Course Notes - LASS
    1.8.2 Distributed Operating System. Distributed operating systems are operating systems that manage resources in a distributed system. However, from a user ...
  3. [3]
    [PDF] Distributed Systems - cs.wisc.edu
    Distributed systems involve many machines cooperating, like web services. Communication is unreliable, and failure is a major challenge.
  4. [4]
    A brief introduction to distributed systems | Computing
    Aug 16, 2016 · A distributed system is a collection of autonomous computing elements that appears to its users as a single coherent system.
  5. [5]
    [PDF] The Amoeba Distributed Operating System - CERN Document Server
    Amoeba consists of two basic pieces: a microkernel, which runs on every processor, and a collection of servers that provide most of the traditional operating ...
  6. [6]
    Sprite - UC Berkeley EECS
    Sprite is a distributed operating system that provides a single system image to a cluster of workstations. It provides very high file system performance through ...
  7. [7]
    [PDF] Plan 9 from Bell Labs
    Plan 9 is a distributed computing environment. It is assembled from separate machines acting as CPU servers, file servers, and terminals.
  8. [8]
    Distributed operating systems | ACM Computing Surveys
    This paper is intended as an introduction to distributed operating systems, and especially to current university research about them.
  9. [9]
    Differences Between Distributed and Parallel Systems - ResearchGate
    This report characterizes the differences between distributed systems, networks of workstations, and massively parallel systems and analyzes the impact of these ...
  10. [10]
  11. [11]
    Cold War Computer Arms Race - Marine Corps University
    The theory critical to the success of networking was “distributed communication,” proposed by Paul Baran at Rand in the early 1960s.9 The human brain, Baran ...
  12. [12]
    [PDF] A New Era in Computation - American Academy of Arts and Sciences
    "batch processing" on computers: When the output appears as interminable pages of printout and numerical tables, it is difficult to uncover significant or ...
  13. [13]
    [PDF] computer development (SEAC and DTSEAC) at the National Bureau ...
    Such topics as systems development, engineering development, design, construction, and maintenance of computer equipment are covered. The introduction ...
  14. [14]
    [PDF] 3. DYSEAC - Ed Thelen
    As a consequence, the supervisory control over the common task may initially be loosely distributed throughout the system and then temporarily concentrated in ...
  15. [15]
    The Lincoln TX-2 computer development - ACM Digital Library
    THE TX-2 is the newest member of a growing fam- ily of experimental computers designed and con- structed at the Lincoln Laboratory of M.I.T. as.Missing: modularity | Show results with:modularity
  16. [16]
    [PDF] Memory Units in the Lincoln TX-2. - Bitsavers.org
    The Lincoln TX-2 incorporates several new developments in high-speed transistor circuits, large capacity magnetic-core-memories, and flexibility in machine ...Missing: 1958 | Show results with:1958
  17. [17]
    The History of the Development of Parallel Computing
    [17] Burroughs introduces the D825 symmetrical MIMD multiprocessor. 1 to 4 CPUs access 1 to 16 memory modules using a crossbar switch. The CPUs are similar to ...
  18. [18]
    [PDF] Experiences with the Amoeba Distributed Operating System
    The Amoeba distributed operating system has been in development and use for over eight years now. In this paper we describe the present system and our.
  19. [19]
    [PDF] A Shared Memory Architecture For Distributed Computing
    A prototype loosely-coupled shared memory system called Ivy has been implemented in software on top of the Apollo Domain system [Li86]. It allows a ...
  20. [20]
    [PDF] Lazy Release Consistency for Software Distributed Shared Memory
    distributed shared mem- ory system to use release consistency. Munin's inlplementa- tion of release consistency merges updates at release time, rather than.
  21. [21]
    [PDF] Scale and Performance in a Distributed File System - andrew.cmu.ed
    The Andrew File System is a location-transparent distributed tile system that will eventually span more than 5000 workstations at Carnegie Mellon University ...
  22. [22]
    [PDF] Coda: A Highly Available File System for a Distributed Workstation ...
    Coda integrates the use of two different mechanisms, whole-file caching and replication, while Locus relies solely on replication. Coda clients directly update ...
  23. [23]
    None
    **Summary of Argus Use of Transactions and Two-Phase Commit in Distributed OS**
  24. [24]
    [PDF] Reliable communication in the presence of failures
    A review of several uses for the protocols in the ISIS system, which supports fault-tolerant resilient objects and bulletin boards, illustrates the significant ...
  25. [25]
    [PDF] Process Migration in the Sprite Operating System - UC Berkeley EECS
    Feb 11, 1987 · This paper describes a process migration facility for the Sprite operating system. In order to provide location-transparent remote execution, ...
  26. [26]
    [PDF] The Amoeba Distributed Operating System
    Tanenbaum at the Vrije Universiteit (VU) in Amsterdam (The Netherlands) has been doing research since 1980 in the area of distributed computer systems. This.
  27. [27]
    [PDF] From L3 to seL4 What Have We Learnt in 20 Years of L4 ...
    The L4 microkernel has undergone 20 years of use and evolution. It has an active user and developer commu- nity, and there are commercial versions which are ...<|separator|>
  28. [28]
    [PDF] The Amoeba Distributed Operating System - A Status Report
    This mechanism for process creation is much more location transparent and efficient in a distributed system than the UNIX fork system call. Like all operating ...Missing: Plan | Show results with:Plan
  29. [29]
    [PDF] Strategies for dynamic load balancing on highly parallel computers
    Sender Initiated Diffusion (SID)¹ is a highly distributed local approach which makes use of near-neighbor load information to apportion surplus load from ...
  30. [30]
    Process migration | ACM Computing Surveys - ACM Digital Library
    Process migration is the act of transferring a process between two machines, enabling dynamic load distribution and fault resilience.
  31. [31]
    [PDF] Scheduling in Distributed Systems - Computer Science
    In a distributed system, the local scheduler may need global information from other workstations to achieve the optimal overall performance of the entire system ...
  32. [32]
    [PDF] Distributed Priority Inheritance for Real-Time and Embedded ...
    We study the problem of priority inversion in distributed real-time and embedded systems and propose a solution based on a dis- tributed version of the priority ...
  33. [33]
    [PDF] Implementing Remote Procedure Calls
    This paper describes a package providing a remote procedure call facility, the options that face the designer of such a package, and the decisions we made. We ...
  34. [34]
    [PDF] Comparing remote procedure calls
    This report describes and compares three significant RPCs: • Open Network Computing (ONC) RPC from Sun Microsystems [SUN90][MS91]. • Distributed Computing ...
  35. [35]
    An optimal algorithm for mutual exclusion in computer networks
    Ricart, G., and Agrawala, A.K. Using exact timing to implement mutual exclusion in a distributed network. Tech. Rept. TR-742, Dept. Comptr. Sci., Univ. of ...
  36. [36]
    Synchronization in Distributed Programs
    A distributed semaphore is a distributed synchronization mechanism that behaves in much the same way as a semaphore [5]. Two operations are defined on ...
  37. [37]
    [PDF] Virtual time and global states of distributed systems
    Abstract. A distributed system can be characterized by the fact that the global state is distributed and that a common time base does not exist.
  38. [38]
    [PDF] EXPLOITING VIRTUAL SYNCHRONY IN DISTRIBUTED SYSTEMS
    The premise of the ISIS project and this paper is that the existing software development methodology is inadequate to address this class of applications ...
  39. [39]
    [PDF] Distributed Deadlock Detection - CSE IIT KGP
    CHANDY, K.M., AND MISRA, J. A distributed algorithm for detecting resource deadlocks in distributed systems. In Proc. ACM SIGACT-SIGOPS Symp. Principles of ...
  40. [40]
    [PDF] Deadlock Detection in Distributed Systems
    Chandy-Misra-Haas's distributed deadlock detection algorithm for AND model is based on edge-chasing. The algorithm uses a special message called probe, which is ...
  41. [41]
    A Master-Slave Operating System Architecture for Multiprocessor ...
    This motivated us to reexamine the master-slave approach. In this paper, we attempt to address real-time and performance issues associated with the master-slave ...
  42. [42]
    [PDF] Operating Systems on a Master-Slave Mode
    ABSTRACT. Master-Slave mode has been known in operating systems for many years. It refers to one processor which is the master that controls other ...
  43. [43]
    Gossiping in distributed systems - ACM Digital Library
    In this paper, we present a brief introduction to the field of gossiping in distributed systems, by providing a simple framework and using that framework to ...
  44. [44]
    [PDF] Paxos Made Simple - Leslie Lamport
    The Paxos algorithm, when presented in plain English, is very simple.
  45. [45]
    [PDF] In Search of an Understandable Consensus Algorithm
    May 20, 2014 · Strong leader: Raft uses a stronger form of leadership than other consensus algorithms. For example, log entries only flow from the leader to ...
  46. [46]
    Client-server computing | Communications of the ACM
    Client-server computing is a distributed computing model in which client applications request services from server processes. Clients and servers typically run ...
  47. [47]
    [PDF] The Sprite Network Operating System - UCSD CSE
    Nov 19, 1987 · Sprite is a network OS for workstations, designed for memory sharing and process migration, with a transparent network file system and similar ...
  48. [48]
    [PDF] Spritely NFS: Experiments with Cache-Consistency Protocols
    Stateless-server systems, such as NFS, cannot properly manage client file caches. Stateful systems, such as Sprite, can use explicit cache consistency.
  49. [49]
    Peer-to-peer information systems - ACM Digital Library
    P2P systems offer an alternative to traditional client/server systems: Every node acts both as a client and a server and "pays" its participation by providing ...
  50. [50]
    [PDF] Chord: A Scalable Peer-to-peer Lookup Service for Internet
    This paper presents Chord, a distributed lookup protocol that addresses this problem. Chord provides support for just one operation: given a key, it maps ...
  51. [51]
    [PDF] A Distributed Anonymous Information Storage and Retrieval System ...
    Freenet is a peer-to-peer network for anonymous, location-independent storage and retrieval of data, using a distributed file system.
  52. [52]
    From Client-Server to P2P Networking | SpringerLink
    In this chapter, an investigation of network architecture evolution, from client-server to P2P networking, will be given, underlining the benefits and the ...
  53. [53]
    [PDF] Project JXTA: A Technology Overview - cs.Princeton
    Apr 25, 2001 · Peer Discovery Protocol. This protocol enables a peer to find advertisements on other peers, and can be used to find any of the peer, peer ...
  54. [54]
    [PDF] Names in Distributed Systems
    Naming Services in Distributed Systems in general - provide clients with values of attributes of named objects; name space - the collection of valid names ...
  55. [55]
    [PDF] DISTRIBUTED SYSTEMS PRINCIPLES AND PARADIGMS ...
    Q: An alternative definition for a distributed system is that of a collection of independent computers providing the view of being a single system, that is, ...
  56. [56]
    Extension of the banker's algorithm for resource allocation in a ...
    The Banker's algorithm for resource allocation prevents deadlocks. Its straightforward extension to a distributed environment would require a centralized ...
  57. [57]
    DCE 1.1: Directory Services - Information Model
    The DCE directory information model consists of a set of composite name spaces. Composite name spaces are linked together in a hierarchical order. This document ...
  58. [58]
    Principles of Eventual Consistency - ACM Digital Library
    In this tutorial, we carefully examine both the what and the how of consistency in distributed systems.
  59. [59]
    [PDF] A Location Service for Worldwide Distributed Objects
    We introduced the Globe object model and its location service which provides transparency of location, migration, distribution, and replication. We can ...
  60. [60]
    [PDF] The Byzantine Generals Problem - Leslie Lamport
    We devote the major part of the paper to a discussion of this abstract problem and conclude by indicating how our solutions can be used in implementing a ...
  61. [61]
    Failover Clustering | Microsoft Learn
    Jun 25, 2025 · Failover clustering is a powerful strategy to ensure high availability and uninterrupted operations in critical environments.
  62. [62]
    [PDF] IBM Tivoli Monitoring: High Availability Guide for Distributed Systems
    The hot standby operation is best illustrated by describing a scenario. In this example, all monitoring components are started in order, a scenario that ...
  63. [63]
    [PDF] DMTCP: Transparent Checkpointing for Cluster Computations and ...
    On 128 distributed cores (32 nodes), checkpoint and restart times are typically 2 seconds, with negligible run-time overhead. Typical checkpoint times are ...
  64. [64]
    [PDF] Debugging Distributed Systems - UBC Computer Science
    Achieving such fault tolerance, however, requires developers to reason through complex failure modes. For most distributed systems, fault tolerance cannot ...
  65. [65]
    Distributed system availability - Availability and Beyond
    The calculation of availability using MTBF and MTTR has its roots in hardware systems. However, distributed systems fail for very different reasons than a ...
  66. [66]
    [PDF] AN EXPERIMENTAL EVALUATION OF THE ASSUMPTION OF ...
    N-version programming has been proposed as a method of incorporating fault tolerance into software. Multiple versions of a program (i.e., "N") are prepared ...
  67. [67]
    [PDF] Measuring Network Latency Variation Impacts to High Performance ...
    Apr 9, 2018 · ABSTRACT. In this paper, we study the impacts of latency variation versus latency mean on application runtime, library performance, ...
  68. [68]
    [PDF] A Quantitative Survey of Communication Optimizations in Distributed ...
    Due to the dependency between communication tasks and computation tasks (Fig. 1b), the C2C ratio is the key factor that affects the system scalability. In ...
  69. [69]
    On Extending Amdahl's law to Learn Computer Performance - arXiv
    Oct 15, 2021 · We propose to (1) extend Amdahl's law to accommodate multiple configurable resources into the overall speedup equation, and (2) transform the speedup equation ...
  70. [70]
    [PDF] Scalable Dynamic Partial Order Reduction? - Parallel Data Lab
    In this paper we are concerned with scaling dynamic partial order reduction, a key technique for mitigating the state space explosion problem, to very large ...
  71. [71]
    Data partitioning guidance - Azure Architecture Center
    Horizontal partitioning (often called sharding). In this strategy, each partition is a separate data store, but all partitions have the same schema. Each ...
  72. [72]
    A benchmark for performance evaluation of a distributed file system
    This paper presents a performance study of a distributed file system. The distributed system allows for remote file access and remote process execution.
  73. [73]
    [PDF] CACHING IN LARGE-SCALE DISTRIBUTED FILE SYSTEMS
    We introduce the notion of dynamic hierarchical caching, in which adaptive client hierarchies are constructed on a file-by-file basis. Trace-driven simulation ...
  74. [74]
    HRFP: Highly Relevant Frequent Patterns-Based Prefetching and ...
    Mar 1, 2023 · In this research, we introduced a novel highly relevant frequent patterns (HRFP)-based algorithm that prefetches content from the distributed file system ...
  75. [75]
    Kerberos: An Authentication Service for Computer Networks
    Kerberos is a distributed authentication service that allows a process (a client) running on behalf of a principal (a user) to prove its identity to a verifier ...
  76. [76]
    [PDF] A Survey of Security Threats in Distributed Operating System
    Abstract—Over the last years, the security into software systems especially in the distributed system becomes complex day by day.
  77. [77]
    [PDF] Guide to IPsec VPNs - NIST Technical Series Publications
    Jun 1, 2020 · Backing up the router operating system and configuration files is a necessity since the prototype is being implemented on production ...
  78. [78]
    Distributed Automatic Configuration of Complex IPsec-Infrastructures
    We analyze the security requirements and further desirable properties of IPsec policy negotiation, and show that the distribution of security policy ...
  79. [79]
    Distributed systems security - ScienceDirect.com
    This multiuser distributed environment proposes unique security challenges ... Journal of Network and Computer Applications, Volume 229, 2024, Article 103917.
  80. [80]
    ReDS: reputation for directory services in P2P systems
    As a system distributed over a diverse set of untrusted nodes, however, directory services must be resilient to adversarial behavior by such malicious insiders.
  81. [81]
    [PDF] Security Issues in Distributed Systems – A survey - CORE
    Based on these considerations the aim of this paper is to describe the concept of security issues in distributed systems considering some important concerns ...
  82. [82]
    modular architectures for distributed and databases systems
    This paper describes the importance of modularity in systems and lists a number of reasons why systems will become increasingly modular.
  83. [83]
    [PDF] Specifying and Verifying Systems With TLA - Leslie Lamport
    Correct code is an important component of software reliability. Modern operating systems make extensive use of concurrent and distributed algorithms.
  84. [84]
    Formal Verification Tool TLA+: An Introduction from the Perspective ...
    Dec 20, 2021 · TLA+ verification is a basic standard for proposing a new algorithm from an existing algorithm in the research fields of distributed algorithms ...
  85. [85]
    [PDF] Virtualization in Distributed System: A Brief Overview
    There are various types of virtualizations that can be used to increase the performance in distributed systems. Some of them are OS virtualization, storage ...
  86. [86]
    [PDF] Design and Modular Verification of Distributed Transactions in ...
    We developed our specification using a modular technique, which allowed us to formalize the interactions between the distributed transactions protocol and the ...
  87. [87]
    [PDF] The Multikernel: A new OS architecture for scalable multicore systems
    We present a multikernel, Barrelfish, which explores the implications of applying the model to a concrete OS implementation. We show through measurement ...
  88. [88]
    The Barrelfish Operating System
    Barrelfish is a new research operating system being built from scratch and released by ETH Zurich in Switzerland.
  89. [89]
    No-MPI OpenCL-Only Distributed Computing With PoCL-Remote
    Sep 4, 2023 · PoCL-Remote enables distributed OpenCL computing without MPI, using a client-server architecture with a daemon on any machine with OpenCL, and ...
  90. [90]
    [PDF] A Distributed OpenCL Framework using Redundant Computation ...
    The host processor executes the host program, and compute devices execute kernels. Kernels are built online or offline.
  91. [91]
  92. [92]
    unikraft/unikraft - GitHub
    Unikraft is a radical, yet Linux-compatible (with effortless tooling), technology for running applications as highly optimized, lightweight and single-purpose ...
  93. [93]
    [PDF] Adaptive Resource Sharing in Multicores - People at MPI-SWS
    Abstract—This short paper presents an adaptive, operating system (OS) anchored budgeting mechanism for controlling access to a shared resource.
  94. [94]
    [PDF] Holistic resource allocation for multicore real-time systems ...
    One effective approach to mitigating interferences among concurrent tasks is resource partitioning: by dividing the cache and memory bandwidth among the cores, ...
  95. [95]
    OpenStack: Open Source Cloud Computing Infrastructure
    OpenStack is a set of software components that provide common services for cloud infrastructure.
  96. [96]
    [PDF] OpenStack++ for Cloudlet Deployment - Carnegie Mellon University
    To provide a systematic way to expedite cloudlet deployment, I implemented OpenStack++ that extends OpenStack, an open source ecosystem for cloud computing. In ...
  97. [97]
    Firecracker: Lightweight Virtualization for Serverless Applications
    We developed Firecracker, a new open source Virtual Machine Monitor (VMM) specialized for serverless workloads, but generally useful for containers, functions ...
  98. [98]
    All one needs to know about fog computing and related edge ...
    Fog nodes can be placed close to IoT source nodes, allowing latency to be noticeably reduced compared to traditional cloud computing. While this example gives ...
  99. [99]
    Overview | Kubernetes
    Sep 11, 2024 · Kubernetes is a portable, extensible, open source platform for managing containerized workloads and services, that facilitates both ...
  100. [100]
    Energy-Efficient Distributed Edge Computing to Assist Dense ... - MDPI
    The proposed model achieved a higher energy efficiency by 19%, resource utilization by 54%, latency efficiency by 86%, and reduced network congestion by 92% ...