
Distributed transaction

A distributed transaction is a set of operations that updates data across two or more distinct nodes or networked computer systems in a distributed environment, ensuring that the entire set either commits successfully or aborts entirely to maintain data integrity. Distributed transactions extend the core principles of traditional database transactions—known as the ACID properties (atomicity, consistency, isolation, and durability)—to scenarios where data is partitioned or replicated across multiple sites, such as in distributed databases or cloud systems. This allows applications to perform reliable operations in heterogeneous environments, where local transaction managers on each participating node coordinate to achieve global consistency. They are essential for building fault-tolerant systems in areas like enterprise resource planning, financial services, and large-scale web applications that require synchronized updates across geographically dispersed resources.

The coordination of distributed transactions typically relies on atomic commitment protocols, with the two-phase commit (2PC) algorithm being the most widely used standard. In 2PC, a global coordinator first solicits prepare votes from all participants in the first phase; if all confirm readiness, it issues a commit directive in the second phase, or an abort if any participant fails. This mechanism ensures that no partial updates occur, even in the presence of failures. Despite their robustness, distributed transactions face challenges such as network partitions, node failures, and message loss, which can lead to in-doubt states requiring resolution by recovery processes. Modern implementations often incorporate optimizations like presumed-abort variants or integration with distributed transaction coordinators to mitigate blocking and improve performance in scalable systems.

Introduction

Definition and Scope

A distributed transaction is defined as a set of operations executed across multiple autonomous nodes or processes in a distributed system, which must be treated as a single unit to ensure either complete execution or total rollback, thereby preserving the illusion of a unified system. This extends beyond single-node operations by incorporating inter-node communication and handling potential failures across distributed components, such as processes or resource managers. The scope of distributed transactions encompasses operations spanning heterogeneous environments, including multiple databases, services, or resources in client-server architectures, cloud infrastructures, or global-scale systems with data replicated across datacenters. In contrast, local transactions are confined to a single node or resource manager, lacking the inter-node coordination required for distributed scenarios. These transactions are particularly relevant in systems where consistency must be maintained across geographically dispersed or independently managed components, such as in large-scale web applications.

Key characteristics of distributed transactions include the need for coordination mechanisms to reach collective decisions on committing or aborting operations, often facilitated by atomic commitment protocols to achieve global agreement despite partial failures or network partitions. Representative examples illustrate this scope: a funds transfer between accounts at different institutions, where debiting one account and crediting another must occur atomically to avoid inconsistencies; or an e-commerce order process involving simultaneous updates to inventory, payment, and shipping services across separate systems. Such transactions aim to uphold core properties like atomicity and consistency in distributed settings.

Historical Development

Distributed transactions emerged in the 1970s as a response to the need for fault-tolerant computing in mainframe environments, where systems required reliable processing across multiple components to handle high-volume operations without failure. Tandem Computers, founded in 1974, pioneered fault-tolerant systems with its NonStop architecture, designed specifically for online transaction processing in critical applications like banking and telecommunications, emphasizing continuous availability through redundant hardware and software mechanisms. Similarly, IBM's Information Management System (IMS), introduced in 1968, with IMS/Virtual Storage (IMS/VS) in 1977 extending capabilities to virtual storage systems, supported hierarchical databases and transaction management, enabling atomic updates in multi-user environments for industries such as airlines and finance. These early systems laid the groundwork for distributed transaction concepts by addressing atomicity and recovery in shared-resource settings, with innovations like the two-phase commit protocol originating in this era to coordinate commits across nodes.

The 1980s saw further standardization efforts driven by the growing complexity of transaction processing monitors. In the late 1980s, the X/Open consortium—formed in 1984 to promote portability in Unix environments—began developing the eXtended Architecture (XA) interface, which was formalized in 1991 as a specification for coordinating distributed transactions between transaction managers and resource managers. This specification facilitated interoperability in heterogeneous systems, building on earlier proprietary protocols to enable atomic operations across diverse databases and other resource managers.

By the 1990s, distributed transactions integrated into enterprise middleware standards amid the shift to client-server architectures. The Object Management Group's Common Object Request Broker Architecture (CORBA), with its Object Transaction Service (OTS) defined in the mid-1990s, extended transaction coordination to object-oriented distributed systems, supporting atomicity in remote method invocations across networks. Concurrently, Sun Microsystems introduced the Java Transaction API (JTA) in the late 1990s as part of the Java Enterprise Edition (J2EE) platform, providing a standard interface for managing transactions in Java-based applications, often leveraging XA for resource integration. These developments were propelled by the proliferation of client-server models, which demanded reliable coordination over networks, transitioning from monolithic mainframes to decentralized setups.

The evolution accelerated in the 2010s with the rise of microservices architectures in cloud environments, where traditional two-phase commit protocols faced scalability limits, prompting alternatives like saga patterns for eventual consistency in loosely coupled services. Influential contributions include Jim Gray's 1981 paper, "The Transaction Concept: Virtues and Limitations," which formalized the transaction abstraction and its role in fault-tolerant systems at Tandem Computers. This was expanded in the seminal 1993 book Transaction Processing: Concepts and Techniques by Gray and Andreas Reuter, which provided a comprehensive framework for distributed transaction mechanisms, influencing decades of research and implementation.

Core Concepts

ACID Properties in Distributed Environments

In distributed environments, the ACID properties—originally formulated for centralized database systems—are adapted to ensure reliable transaction processing across multiple networked nodes, where failures, latency, and partitions introduce additional complexities. These properties maintain data integrity by coordinating operations globally, preventing scenarios like partial updates that could lead to inconsistent states across the system. While centralized systems rely on local mechanisms, distributed adaptations emphasize global synchronization to achieve the same guarantees.

Atomicity ensures that a distributed transaction executes as a single, indivisible unit across all participating nodes, meaning either all operations succeed (commit) or none do (abort), avoiding partial commits that result in dangling or inconsistent states. This requires global coordination to propagate commit decisions uniformly, often using techniques such as deferred updates that track changes in private workspaces before finalization. Without such mechanisms, network failures could leave some nodes committed while others remain unchanged, violating the all-or-nothing rule essential for reliability.

Consistency guarantees that a distributed transaction brings the entire system from one valid state to another, enforcing not only local database invariants but also global system-wide constraints, such as referential integrity across nodes. In distributed settings, this extends beyond individual node rules to mitigate network-induced inconsistencies, like delayed propagations that temporarily misalign data views. Achieving this often involves serializability, where concurrent transactions produce results equivalent to sequential execution, preserving application-specific rules across the distributed state.

Isolation prevents concurrent distributed transactions from interfering with each other, ensuring that intermediate states remain hidden and that the outcome matches some serial execution. This is implemented through distributed locking mechanisms that span nodes, coordinating access to shared resources to avoid phenomena like dirty reads or lost updates in multi-node environments. Such cross-node isolation levels, often stricter than local ones, maintain transaction independence despite varying network conditions.

Durability ensures that once a distributed transaction commits, its effects are permanently persisted across the system, surviving node crashes, power failures, or other disruptions. This is achieved via replication to multiple stable storage locations and distributed commit logging, where commit acknowledgments confirm data safety before finalization. In practice, durability in distributed systems balances persistence guarantees with write latency, using techniques like synchronous replication to guarantee recovery from failures.

Distributed extensions to ACID highlight trade-offs inherent in scaled environments. The CAP theorem posits that distributed systems cannot simultaneously provide strong consistency, availability, and partition tolerance, forcing choices that impact ACID fulfillment—such as prioritizing consistency over availability during network partitions. As an alternative for scalability, BASE properties—Basically Available, Soft state, and Eventually consistent—relax immediate guarantees, allowing temporary inconsistencies for higher availability in large-scale, partition-prone systems.
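As a rough illustration of durability through synchronous replication, the following sketch reports a commit as durable only after every replica acknowledges persisting the commit record; the Replica interface, timeout, and thread pool are assumptions made for the example, not a specific system's API.

import java.util.*;
import java.util.concurrent.*;

// Minimal sketch of durability via synchronous replication: the commit is
// acknowledged only after every replica confirms it has persisted the record.
public class SynchronousReplicationSketch {

    interface Replica {
        // Persist the record to stable storage and return true on success (hypothetical).
        boolean persist(String transactionId, byte[] record) throws Exception;
    }

    private final List<Replica> replicas;
    private final ExecutorService pool = Executors.newCachedThreadPool();

    SynchronousReplicationSketch(List<Replica> replicas) {
        this.replicas = replicas;
    }

    // Returns true only if all replicas durably stored the commit record.
    boolean commitDurably(String txId, byte[] commitRecord) throws InterruptedException {
        List<Future<Boolean>> acks = new ArrayList<>();
        for (Replica r : replicas) {
            acks.add(pool.submit(() -> r.persist(txId, commitRecord)));
        }
        for (Future<Boolean> ack : acks) {
            try {
                if (!ack.get(5, TimeUnit.SECONDS)) return false; // a replica refused the write
            } catch (ExecutionException | TimeoutException e) {
                return false; // a replica failed or timed out: not durable yet
            }
        }
        return true; // safe to report the transaction as committed
    }
}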

Atomicity and Consistency Challenges

In distributed transactions, achieving atomicity—the guarantee that a transaction either fully commits or fully aborts across all involved nodes—faces significant hurdles due to the inherent unreliability of distributed environments. Network partitions, where communication links fail and divide the system into isolated subsets, can lead to scenarios in which disconnected groups of nodes independently commit transactions, resulting in divergent states that violate global atomicity. For instance, if a partition occurs during a multi-node update, one subset may proceed to commit while the other remains stalled, creating inconsistent outcomes that require manual reconciliation. Similarly, node crashes mid-execution can produce orphan transactions—ongoing operations detached from their coordinating transaction—that continue executing unintended actions, potentially corrupting data or wasting resources.

Consistency, which ensures that all nodes observe a coherent view of the database, is equally challenged in distributed settings without a central authority. The trade-off between strong consistency (where updates are immediately visible to all) and eventual consistency (where updates propagate over time) arises because enforcing strong consistency often reduces availability during failures, as per the CAP theorem's implications for partitioned systems. Handling concurrent updates across replicas exacerbates this, as without synchronized clocks or a global lock, race conditions can lead to lost updates or write skews, where conflicting modifications occur simultaneously on different nodes.

Various failure modes further disrupt agreement on global state in distributed transactions. Byzantine faults, where malicious or arbitrarily behaving nodes send conflicting messages, can mislead honest nodes into inconsistent decisions, undermining both atomicity and consistency. Timeouts, intended to detect failures, may trigger prematurely in high-latency networks, causing unnecessary aborts, while message losses—due to congestion or dropped packets—prevent critical coordination signals from reaching destinations, stalling transactions indefinitely. The probability of such failures scales unfavorably with system size; in a network of n nodes, if each link has an independent failure probability p, the overall transaction failure rate approximates 1 - (1 - p)^(n-1), which grows roughly linearly with n for small p, highlighting how larger systems amplify unreliability risks. This vulnerability is theoretically bounded by the FLP impossibility result, which proves that no deterministic protocol can guarantee agreement in an asynchronous system in the presence of even a single crash failure, emphasizing the fundamental limits on achieving atomicity and consistency.
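The following short program works the failure-rate estimate through for sample values, comparing the exact expression 1 - (1 - p)^(n-1) with its linear approximation p(n - 1); the chosen per-link probability is illustrative only.

// Worked example of the failure-rate estimate: with per-link failure
// probability p and n participating nodes, the chance that at least one of
// the coordinator's n - 1 links fails during a coordination round.
public class LinkFailureEstimate {
    public static void main(String[] args) {
        double p = 0.001; // assumed per-link failure probability
        for (int n : new int[] {2, 10, 100, 1000}) {
            double exact = 1.0 - Math.pow(1.0 - p, n - 1);
            double linearApprox = p * (n - 1); // valid while p * n stays small
            System.out.printf("n=%4d  exact=%.4f  approx=%.4f%n", n, exact, linearApprox);
        }
    }
}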

Protocols and Mechanisms

Two-Phase Commit Protocol

The two-phase commit (2PC) protocol is a foundational atomic commitment protocol designed to ensure that a distributed transaction either commits entirely or aborts completely across multiple participating nodes, thereby preserving atomicity in distributed systems. Introduced by Jim Gray in 1978 as part of his work on database operating system design, the protocol coordinates a single coordinator and multiple participants, where the coordinator drives the decision-making process while participants execute local transaction components. Each participant maintains a local log to record its state, enabling recovery from failures by replaying logs to determine prior decisions. The protocol assumes reliable but asynchronous messaging and crash-stop failures, with no Byzantine faults.

The protocol operates in two distinct phases: the prepare phase and the commit phase. In the prepare phase, the coordinator sends a PREPARE message to all participants, prompting each to check whether it can commit the transaction locally by locking resources and performing preliminary writes to its log without finalizing the changes. Each participant responds with a YES vote if prepared (indicating it has locked resources and logged a prepare record) or a NO vote if unable to commit (e.g., due to local constraints), also logging its decision. The coordinator waits for responses from all participants; if any NO vote is received or a timeout occurs, it immediately decides to abort. In the commit phase, if all participants vote YES, the coordinator logs a commit decision, broadcasts a COMMIT message to all participants, and waits for acknowledgments. Upon receiving COMMIT, each participant logs the commit, releases locks, applies changes durably, and acknowledges; conversely, an ABORT message triggers rollback, abort logging, lock release, and acknowledgment. If the coordinator fails to receive all votes, it aborts the transaction.

Participants play a reactive role, ensuring local durability through forced log writes during preparation to survive crashes. Upon receiving PREPARE, a participant transitions to a prepared state, holding locks until a final decision arrives; if no decision is received within a timeout, a participant that has already voted YES cannot safely abort on its own, since the coordinator may already have decided to commit—it must remain blocked and query the coordinator or its peers for the outcome. In recovery scenarios, a restarted participant queries the coordinator (or uses a cooperative termination protocol with other participants) to resolve its state if it was in the prepared phase. The coordinator, as the decision authority, must remain available after vote collection to broadcast the outcome, force-writing its decision durably before notifying participants.

The formal steps of 2PC can be illustrated through the following pseudocode, adapted from standard implementations in distributed database systems:

Coordinator Pseudocode:
1. Begin transaction (assign unique ID)
2. Send PREPARE to all participants
3. Wait for votes from all participants
4. If any participant votes NO or timeout:
     - Log ABORT decision (force write)
     - Send ABORT to all participants
     - Wait for ACKs from all
   Else (all YES):
     - Log COMMIT decision (force write)
     - Send COMMIT to all participants
     - Wait for ACKs from all
5. Log completion (DONE)
Participant Pseudocode (per node):
1. Receive PREPARE from coordinator
2. Execute local transaction steps (lock resources)
3. If local commit possible:
     - Log PREPARE (force write)
     - Send YES vote to coordinator
     - Wait for COMMIT or ABORT
     - On COMMIT: Log COMMIT (force), apply changes, release locks, send ACK
     - On ABORT: Log ABORT (force), rollback, release locks, send ACK
   Else:
     - Log ABORT (force)
     - Send NO vote to coordinator
     - Rollback and release locks
4. On timeout waiting for decision (after voting YES): remain blocked; query the coordinator or other participants for the outcome (a unilateral abort could violate atomicity)
This structure ensures all-or-nothing semantics but introduces a blocking nature, where prepared participants hold locks and cannot proceed with other transactions until the coordinator's decision arrives, potentially leading to resource contention and reduced throughput during failures or network partitions. Key limitations of 2PC stem from its synchronous messaging requirements and centralized coordination. The coordinator represents a single point of failure: if it crashes after the prepare phase but before broadcasting the decision, participants remain blocked in the prepared state, unable to unilaterally commit or abort without risking inconsistency, which can cause deadlocks or prolonged resource holds. Additionally, the protocol incurs performance overhead from multiple synchronous network round-trips (typically 4 messages per participant: PREPARE/YES, COMMIT/ACK) and forced disk writes for logging, making it costly in high-latency environments like wide-area networks. Mohan et al. note that these overheads motivate optimizations, but the basic protocol's blocking and coordinator dependency limit scalability in failure-prone systems.
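The pseudocode above can be condensed into the following simplified, in-memory coordinator sketch; the Participant interface, thread pool, and timeout handling are assumptions made for the example, and the durable logging and crash recovery a real coordinator requires are only noted in comments.

import java.util.*;
import java.util.concurrent.*;

// Minimal in-memory sketch of a 2PC coordinator, following the pseudocode above.
public class TwoPhaseCommitSketch {

    enum Vote { YES, NO }
    enum Decision { COMMIT, ABORT }

    interface Participant {
        Vote prepare(UUID txId);  // lock resources, force-log PREPARE, return vote
        void commit(UUID txId);   // force-log COMMIT, apply changes, release locks
        void abort(UUID txId);    // force-log ABORT, roll back, release locks
    }

    private final List<Participant> participants;
    private final ExecutorService pool = Executors.newCachedThreadPool();

    TwoPhaseCommitSketch(List<Participant> participants) {
        this.participants = participants;
    }

    Decision execute(UUID txId, long timeoutMillis) throws InterruptedException {
        // Phase 1: solicit votes from every participant in parallel.
        List<Future<Vote>> votes = new ArrayList<>();
        for (Participant p : participants) {
            votes.add(pool.submit(() -> p.prepare(txId)));
        }
        Decision decision = Decision.COMMIT;
        for (Future<Vote> v : votes) {
            try {
                if (v.get(timeoutMillis, TimeUnit.MILLISECONDS) != Vote.YES) {
                    decision = Decision.ABORT; // any NO vote aborts globally
                }
            } catch (ExecutionException | TimeoutException e) {
                decision = Decision.ABORT;     // a missing vote is treated as NO
            }
        }
        // A durable coordinator would force-write the decision to its log here,
        // before notifying anyone, so recovery can replay it after a crash.
        // Phase 2: broadcast the decision (a full implementation also waits for ACKs).
        for (Participant p : participants) {
            if (decision == Decision.COMMIT) p.commit(txId); else p.abort(txId);
        }
        return decision;
    }
}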

Three-Phase Commit and Alternatives

The three-phase commit (3PC) protocol extends the two-phase commit (2PC) mechanism by introducing an additional coordination phase to mitigate blocking issues in distributed systems. In 3PC, after the prepare phase, where participants vote on transaction readiness, the coordinator enters a pre-commit phase, broadcasting a tentative commit message to all participants that have voted yes. Participants then acknowledge this pre-commit, entering a prepared-to-commit state without yet finalizing the transaction. Only upon receiving all acknowledgments does the coordinator proceed to the final commit phase, issuing the commit directive; if any participant fails to acknowledge, the transaction aborts globally. This separation ensures that participants do not remain indefinitely locked in an uncertain state during coordinator failures.

Compared to 2PC, 3PC offers significant advantages in fault tolerance, particularly non-blocking behavior in most failure scenarios. In 2PC, a coordinator crash after the prepare phase can leave participants blocked, awaiting a resolution that may never arrive. 3PC addresses this by allowing participants to recover independently: if a participant detects a failure before receiving a pre-commit message, it can safely abort, as no final decision can have been made; after pre-commit, participants use timeouts and peer queries to elect a new coordinator or infer the outcome, preventing indefinite blocking. This enhances resilience against single-point failures at the coordinator. Additionally, 3PC bounds the time a participant spends blocked, as participants transition through well-defined states that facilitate unanimous decision-making upon recovery.

Despite these benefits, 3PC incurs trade-offs, primarily in increased communication overhead. The protocol requires up to three rounds of messaging—prepare, pre-commit, and commit—compared to 2PC's two, increasing the message count by roughly half in normal operation (e.g., 3 coordinator-to-participant messages per participant in 3PC versus 2 in 2PC for a system with one coordinator and n participants). This added latency makes 3PC less suitable for high-throughput environments. Furthermore, 3PC assumes a reasonably stable, synchronous network for its non-blocking guarantees; it remains vulnerable to network partitions and simultaneous failures across multiple sites, limiting its fault tolerance to f < n/2 in some variants.

Alternatives to 3PC leverage consensus protocols for more robust agreement in asynchronous, failure-prone distributed systems. The Paxos algorithm, a foundational consensus mechanism, achieves fault-tolerant agreement through a multi-proposer, multi-acceptor model that tolerates up to f failures in a group of 2f+1 nodes. Unlike 3PC's centralized coordinator, Paxos uses a prepare phase for proposal numbering, an accept phase for value selection, and a learn phase for dissemination, ensuring safety without blocking by allowing any surviving proposer to drive progress. This makes Paxos well suited for replicating the state machines that implement distributed commits. Similarly, the Raft protocol simplifies consensus by emphasizing leader election and log replication, partitioning the problem into leader selection via randomized timeouts, log appending with heartbeats for replication, and safety checks to maintain consistency. Raft matches Paxos in fault tolerance (tolerating f < n/2 failures) but reduces complexity through clearer role separation, facilitating implementation in practical systems. Both protocols suit environments where 3PC's overhead and centralization are prohibitive, though they introduce algorithmic complexity and require quorums for every operation. In practice, 3PC finds application in high-availability clusters requiring strict atomicity with reduced blocking risk, such as certain legacy systems prioritizing coordinator resilience.
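A compact way to see the difference from 2PC is the participant-side state machine sketched below: the added PRE_COMMITTED state determines what a participant may safely do on timeout. The state and message names are illustrative rather than drawn from any specific implementation, and the real termination protocol involves coordinator election rather than a direct local transition.

// Sketch of participant-side state transitions in 3PC. The extra PRE_COMMITTED
// state is what lets a participant that times out after pre-commit move forward
// instead of blocking, unlike a 2PC participant stuck in PREPARED.
public class ThreePhaseParticipantSketch {

    enum State { INIT, PREPARED, PRE_COMMITTED, COMMITTED, ABORTED }
    enum Message { PREPARE, PRE_COMMIT, DO_COMMIT, ABORT }

    private State state = State.INIT;

    State onMessage(Message m) {
        switch (state) {
            case INIT:
                if (m == Message.PREPARE) state = State.PREPARED;        // vote yes
                else if (m == Message.ABORT) state = State.ABORTED;
                break;
            case PREPARED:
                if (m == Message.PRE_COMMIT) state = State.PRE_COMMITTED; // acknowledge tentative commit
                else if (m == Message.ABORT) state = State.ABORTED;
                break;
            case PRE_COMMITTED:
                if (m == Message.DO_COMMIT) state = State.COMMITTED;
                break;
            default:
                break; // COMMITTED and ABORTED are terminal
        }
        return state;
    }

    State onTimeout() {
        // Before pre-commit, no global commit decision can exist, so aborting is safe.
        // After pre-commit, every reachable participant has agreed to commit, so the
        // recovery/termination protocol drives the transaction forward instead.
        if (state == State.INIT || state == State.PREPARED) state = State.ABORTED;
        else if (state == State.PRE_COMMITTED) state = State.COMMITTED;
        return state;
    }
}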
Consensus alternatives like Paxos and Raft are widely adopted in databases; for instance, Apache Cassandra employs a Paxos variant for lightweight transactions, enabling conditional updates (e.g., "insert if not exists") across replicas with linearizable consistency, coordinating via phase 1 proposals and phase 2 accepts to resolve conflicts without global locks. Raft powers systems like etcd and Consul for distributed coordination, ensuring replicated logs for transactional commits in cloud-native environments. These choices highlight a shift toward quorum-based consensus for fault tolerance over 3PC's structured phases.

Implementation Strategies

Database Systems

Relational database systems provide support for distributed transactions primarily through the XA standard, which enables coordination between transaction managers and resource managers to ensure atomicity across multiple databases. In Oracle Database, the XA interface allows applications to integrate with external transaction monitors, facilitating global transactions that span multiple resource managers using the two-phase commit (2PC) protocol. Similarly, Microsoft SQL Server supports XA transactions via the Microsoft Distributed Transaction Coordinator (MSDTC), where the database acts as a resource manager participating in distributed commits coordinated by an external transaction manager. These implementations rely on resource managers to handle local transaction branches and transaction managers to orchestrate the overall commit process.

Horizontal scaling in relational databases often involves sharding and replication, where distributed transactions ensure consistency across data partitions. MySQL Cluster, utilizing the NDB storage engine, coordinates transactions across multiple data nodes using an internal 2PC mechanism to manage commits and aborts in a distributed environment. The PostgreSQL Citus extension transforms Postgres into a distributed system by sharding tables across nodes and supporting full SQL and distributed transactions through PostgreSQL's extension APIs, including 2PC for cross-shard coordination.

NoSQL databases typically offer limited support for distributed transactions, favoring eventual consistency for scalability, but some have introduced mechanisms for stronger guarantees. MongoDB, starting with version 4.0 (and extending to sharded clusters in version 4.2), supports multi-document transactions across collections, databases, and shards using a snapshot-based mechanism that incorporates elements of 2PC for commit coordination in distributed setups. By default, however, such systems prioritize availability and low latency to minimize coordination overhead in high-throughput scenarios.

Distributed transactions introduce performance overhead due to mechanisms like locking and coordination, particularly from acquiring and releasing distributed locks across nodes, which can increase latency by factors of 2-10 compared to local transactions, depending on network conditions and participant count. Optimizations mitigate this, such as treating read-only transaction branches as non-voting participants in 2PC to allow early lock release without full protocol involvement, or using deferred commit strategies that acknowledge prepares early to reduce blocking. These techniques balance consistency guarantees with throughput, though they require careful application design.

Standards like the Java Transaction API (JTA) enable distributed transactions in Java environments by providing interfaces for transaction managers to coordinate XA-compliant resources, including multiple databases. JTA integrates with JDBC extensions, such as XADataSource implementations in drivers like Oracle's, to enlist connections in global transactions supporting distributed queries and updates. Similarly, ODBC supports distributed transactions through integration with MSDTC, allowing ODBC connections to participate in XA-style commits across heterogeneous data sources.
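A typical JTA usage pattern looks like the following sketch, which assumes a container-provided UserTransaction and two XADataSource-backed JNDI entries; the JNDI names and table layout are placeholders, and newer platforms use the jakarta.transaction namespace in place of javax.transaction.

import java.sql.Connection;
import java.sql.PreparedStatement;
import javax.naming.InitialContext;
import javax.sql.DataSource;
import javax.transaction.UserTransaction;

// Illustrative JTA usage inside a Java EE / Jakarta EE container: one global
// transaction spans two XA-capable data sources, and the container's
// transaction manager runs two-phase commit against both on commit().
public class TransferService {

    public void transfer(String fromAccount, String toAccount, long cents) throws Exception {
        InitialContext ctx = new InitialContext();
        UserTransaction utx = (UserTransaction) ctx.lookup("java:comp/UserTransaction");
        DataSource bankA = (DataSource) ctx.lookup("java:comp/env/jdbc/BankA"); // XADataSource-backed
        DataSource bankB = (DataSource) ctx.lookup("java:comp/env/jdbc/BankB"); // XADataSource-backed

        utx.begin();
        try (Connection a = bankA.getConnection();
             Connection b = bankB.getConnection();
             PreparedStatement debit = a.prepareStatement(
                     "UPDATE accounts SET balance = balance - ? WHERE id = ?");
             PreparedStatement credit = b.prepareStatement(
                     "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {
            debit.setLong(1, cents);
            debit.setString(2, fromAccount);
            debit.executeUpdate();
            credit.setLong(1, cents);
            credit.setString(2, toAccount);
            credit.executeUpdate();
            utx.commit();   // transaction manager drives XA prepare/commit on both branches
        } catch (Exception e) {
            utx.rollback(); // both branches abort atomically
            throw e;
        }
    }
}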

Microservices and Cloud Environments

In microservices architectures, the emphasis on loose coupling and independent deployment of services renders traditional distributed transaction protocols like two-phase commit (2PC) impractical due to their reliance on synchronous coordination and potential for blocking across network boundaries. Instead, these systems often adopt event-driven patterns to achieve eventual consistency, where services communicate asynchronously via message brokers, allowing each service to maintain its own local transaction boundaries while propagating changes through events. This approach mitigates latency issues and supports scalability, but it introduces complexities in ensuring data alignment without strong guarantees across services.

Cloud platforms address these challenges through specialized orchestration services and globally consistent databases. AWS Step Functions enable the coordination of distributed workflows across multiple services and databases by implementing saga-like orchestration, where each step represents a local transaction and failures trigger compensating actions to maintain consistency. Similarly, Azure Durable Functions provide stateful orchestration in a serverless environment, allowing developers to define durable workflows that handle retries, timeouts, and error compensation for long-running operations spanning multiple services. For scenarios requiring stronger consistency, Google Cloud Spanner leverages TrueTime, a distributed clock API that bounds clock uncertainty, in combination with Paxos-based replication and two-phase commit to ensure external consistency in globally distributed transactions.

Frameworks like Spring Cloud integrate with Netflix OSS components to facilitate distributed tracing and resilience in microservices, enabling visibility into transaction flows across services via tools such as Zipkin for end-to-end monitoring. Spring Cloud Stream further supports event-driven communication using binders for Kafka or RabbitMQ, promoting loose coupling and eventual consistency without centralized transaction coordinators. In Kubernetes environments, transaction coordinators such as Oracle's Transaction Manager for Microservices provide XA-compliant support for coordinating distributed transactions across containerized services, ensuring atomicity in polyglot setups while scaling horizontally.

Scalability in these environments is achieved through asynchronous messaging systems like Apache Kafka, which handle transaction events across thousands of services by partitioning topics for high throughput and durability, allowing independent scaling of producers and consumers. Fault isolation is a core benefit, as failures in one microservice do not propagate to others, thanks to decentralized data ownership and circuit breakers that prevent cascading outages. In e-commerce platforms, such as those modeled after Amazon's architecture, saga-like flows orchestrate order fulfillment by sequencing steps like inventory reservation, payment processing, and shipping across independent services, using AWS services to manage compensations for partial failures and ensure reliable completion. Netflix employs similar event-driven patterns for subscription and billing workflows, coordinating across hundreds of microservices to process millions of transactions daily while maintaining resilience through asynchronous event propagation.
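As one concrete illustration of event-driven propagation, the sketch below uses Kafka's transactional producer so that a group of related order events becomes visible atomically to consumers configured with isolation.level=read_committed; the topic names, keys, and broker address are placeholders, not part of any system described above.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch: publish related saga/order events inside a single Kafka transaction,
// so downstream read_committed consumers see either both events or neither.
public class OrderEventPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("transactional.id", "order-service-tx-1"); // enables the transactional producer
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            try {
                producer.beginTransaction();
                producer.send(new ProducerRecord<>("orders", "order-42", "ORDER_PLACED"));
                producer.send(new ProducerRecord<>("payments", "order-42", "PAYMENT_REQUESTED"));
                producer.commitTransaction(); // both events become visible together
            } catch (Exception e) {
                producer.abortTransaction();  // neither event is exposed to read_committed consumers
                throw e;
            }
        }
    }
}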

Advanced Topics

Saga Pattern

The saga pattern is a design approach for managing distributed transactions that involve long-lived operations, where a global transaction is decomposed into a sequence of local sub-transactions, each executed independently across participating services or databases. If any sub-transaction fails, compensating transactions—semantically inverse actions—are invoked to undo the effects of previously completed sub-transactions, ensuring that the overall process either completes successfully or reverts to a consistent state without requiring global coordination. This pattern, originally proposed to address the limitations of traditional long-lived transactions that hold locks for extended periods, promotes interleaving with other concurrent operations to enhance system throughput.

Sagas can be implemented through two primary coordination mechanisms: choreography and orchestration. In choreography, execution is decentralized, with each service reacting to events published by others to trigger the next local transaction, fostering loose coupling among participants. Conversely, orchestration employs a central coordinator that directs the sequence of transactions and manages compensating actions upon failures, providing a more structured flow but introducing a potential single point of control. These approaches allow sagas to avoid the distributed locking and blocking inherent in atomic commit protocols, making them particularly suitable for microservices architectures where availability and scalability are critical.

Key advantages of the saga pattern include reduced locking durations, as resources are released after each local transaction, enabling better concurrency and performance in environments prone to failures or partitions. It also aligns well with distributed systems by delivering eventual consistency, where the system recovers from partial failures through compensations without enforcing strict isolation across the entire transaction. However, drawbacks involve the complexity of designing reliable compensating actions, which may not always perfectly reverse effects (e.g., in irreversible operations), and the need for application-level logic to detect and handle inconsistencies arising from visible intermediate states.

A representative example is an online travel booking system, where a saga coordinates reserving a flight and then a hotel room. The flight reservation completes locally, publishing an event to trigger the hotel reservation; if the hotel step fails, a compensating transaction cancels the flight reservation to maintain consistency across services.
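An orchestration-style saga can be reduced to pairing each step with its compensation, as in the following sketch; the Step interface and the reverse-order compensation loop are illustrative, not a specific framework's API. For the travel example above, the list would contain a flight-reservation step followed by a hotel-reservation step, and a hotel failure would invoke the flight step's compensation.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Minimal orchestration-style saga runner: each step pairs a local action with
// a compensating action, and a failure triggers the compensations of the steps
// that already completed, in reverse order. Step bodies stand in for calls to
// independent services.
public class SagaOrchestratorSketch {

    interface Step {
        void execute() throws Exception; // local transaction in one service
        void compensate();               // semantic inverse of execute()
    }

    static void run(List<Step> steps) throws Exception {
        Deque<Step> completed = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.execute();
                completed.push(step);              // remember for possible undo
            } catch (Exception failure) {
                while (!completed.isEmpty()) {
                    completed.pop().compensate();  // undo in reverse order
                }
                throw failure;                     // saga ends in a consistent, aborted state
            }
        }
    }
}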

Compensation and Rollback Techniques

Compensating transactions provide a mechanism to reverse the effects of completed local transactions in distributed systems when a failure occurs in a subsequent step, ensuring eventual consistency without relying on traditional atomic commits. Introduced in the saga model, these transactions execute inverse operations to undo partial progress, such as debiting an account to counteract a prior credit if the overall workflow fails. For instance, in an order processing saga, reserving inventory and charging a payment are followed by compensating actions like releasing the inventory and refunding the payment if shipping fails.

To support retries in unreliable networks, compensating transactions must be idempotent, meaning repeated executions produce the same result as a single execution without unintended side effects. This is enforced by associating operations with unique transaction identifiers and maintaining versioned state to detect and ignore duplicates, preventing issues like double refunds. In practice, services track processed events per partner to ensure that compensation only applies once, even amid message redeliveries common in distributed messaging systems.

Rollback strategies in distributed transactions vary by context: immediate aborts leverage protocols like two-phase commit, where the coordinator directs all participants to undo prepared changes using local logs if any node votes no, releasing locks promptly to avoid prolonged blocking. For long-running operations, such as those in the saga pattern, deferred compensation allows partial commits to persist until failure, at which point inverse transactions are invoked in reverse order or in parallel to minimize latency. Error propagation ensures failures trigger appropriate rollbacks across nodes, often by forwarding abort signals through message queues or event streams. For uncompensatable actions—those without viable inverses, like deleting data—dead-letter queues capture failed messages or events, isolating them for manual review or reprocessing to prevent system-wide stalls. This approach, common in message-driven architectures, logs error details including transaction context for auditing.

Best practices for implementing compensation and rollback emphasize designing services with reversible operations from the outset, prioritizing business rules that allow clean inverses over complex state manipulations. Developers should enforce idempotency at the API level, use workflow engines to orchestrate compensation sequences, and monitor for deviations from the expected "happy path" through metrics on failure rates and recovery times. Additionally, recording undo metadata during forward execution facilitates automated retries, while avoiding tight coupling between services enables independent evolution.
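A minimal sketch of an idempotent compensation handler is shown below, assuming each request carries the saga's transaction identifier; in a production service the set of processed identifiers would be persisted in the same local transaction as the refund rather than held in memory, and the payment-gateway call is a placeholder.

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of an idempotent compensation handler: already-processed transaction
// identifiers are remembered so redelivered compensation requests do not
// refund twice.
public class RefundCompensationHandler {

    private final Set<String> processedTxIds = ConcurrentHashMap.newKeySet();

    public void onCompensationRequest(String txId, String paymentId, long amountCents) {
        // add() returns false if the id was already seen, making retries harmless.
        if (!processedTxIds.add(txId)) {
            return; // duplicate delivery: the compensation was already applied
        }
        issueRefund(paymentId, amountCents);
        // A real service would persist the processed-id record and the refund
        // atomically, so a crash between them cannot cause a double refund.
    }

    private void issueRefund(String paymentId, long amountCents) {
        System.out.printf("refunding %d cents for payment %s%n", amountCents, paymentId);
    }
}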
