FoundationDB
FoundationDB is an open-source, distributed database management system designed as an ordered key-value store that supports ACID-compliant transactions across clusters of commodity servers, enabling scalable storage and retrieval of large volumes of structured data.[1][2] It serves as a foundational layer for multi-model data storage, allowing developers to build various database interfaces—such as document-oriented, relational, or graph models—on top of its core key-value API without sacrificing consistency or performance.[3][2] Originally developed in 2009 by the company FoundationDB, the technology emphasizes reliability through a unique deterministic simulation testing framework that models entire cluster behaviors in a single-threaded process to uncover bugs and ensure fault tolerance under diverse failure scenarios.[4][5]

Apple acquired the company in March 2015 to enhance its cloud services infrastructure, after which development continued internally.[6] In April 2018, Apple open-sourced FoundationDB under the Apache 2.0 license, fostering community contributions while maintaining its production-grade stability for handling read/write-intensive workloads at low cost.[7][1]

Key strengths of FoundationDB include its shared-nothing architecture for horizontal scalability, automatic data replication and recovery from hardware failures, and industry-leading throughput on standard hardware, making it suitable for applications requiring high availability and strong consistency.[2][8] The system supports stateless layers that extend its functionality, such as the Document Layer for MongoDB-compatible APIs and the Record Layer for structured record storage with indexing and querying, enabling flexible data modeling within a unified, transactionally consistent environment.[9][10]

Overview
Description
FoundationDB is a free and open-source, multi-model distributed NoSQL database with a shared-nothing architecture, owned by Apple Inc. since its acquisition in 2015.[6][3] It serves primarily as an ordered key-value store designed to handle large volumes of structured data across clusters of commodity servers, supporting ACID transactions for all operations.[3][4] The database employs an unbundled design that decouples transaction management from storage, enabling independent scaling of components and the flexible layering of higher-level data models—such as relational or document stores—on its foundational key-value interface.[4][8] Development of FoundationDB began in 2009, when founders Nick Lavezzo, Dave Scherer, and Dave Rosenthal set out to address limitations in existing distributed databases by combining NoSQL scalability with ACID guarantees.[11][4]

Key Characteristics
FoundationDB distinguishes itself by providing strict serializability for all transactions, ensuring a global order across the entire database without relying on relaxed consistency models. This ACID-compliant approach uses optimistic concurrency control combined with multi-version concurrency control to guarantee that committed transactions appear to execute in a single, sequential order, even in a distributed environment.[12][4]

The system achieves fault tolerance via automatic leader election among coordinator processes and replication of transaction logs across multiple storage nodes, allowing it to maintain high availability during node failures. With a typical replication factor of three, FoundationDB can tolerate up to two simultaneous failures per shard while continuing operations, and recovery from faults occurs in under five seconds in most cases.[4][2]

High throughput and low latency are enabled by in-memory transaction processing on stateless proxy servers and parallel conflict resolution by dedicated resolver processes at commit time, which minimizes coordination overhead. Under moderate loads on commodity hardware, individual reads typically complete in about 1 millisecond, while the system scales to handle heavy workloads, such as up to 8.2 million operations per second (90% reads, 10% writes) on clusters of 24 machines.[12][13] As of November 2025, the latest stable release is 7.4, introducing enhancements like Backup V2 that reduce log writes by 50% and improve overall performance.[14]

FoundationDB supports multi-model data storage through its layered architecture, where higher-level APIs for key-value, document, graph, and other models are built atop the core ordered key-value store without altering the underlying engine. Examples include the Record Layer for relational-like data and integrations with graph databases like JanusGraph.[2][4]

The database employs an ordered key space based on lexicographic ordering of byte strings, facilitating efficient range queries and scans. The tuple layer provides an order-preserving encoding for composite keys, such as nesting strings and integers while maintaining sort order from left to right, with keys recommended to be under 1 KB for optimal performance (maximum 10 KB).[15][12]

Architecture
Core Components
FoundationDB's architecture is built around a set of core processes that enable distributed operation while maintaining strong consistency. These components include coordinators for cluster oversight, storage processes for data persistence, and proxy processes for request handling, all operating within a versioned data model that timestamps mutations with global version numbers to ensure serializability. This design supports a shared-nothing paradigm, where individual nodes lack shared state and instead rely on transactional coordination for synchronization.[16][4]

Coordinator processes form a highly available Paxos group that persists essential system metadata on disk, including the cluster file specifying access points like IP:PORT pairs. They facilitate master election to select a singleton cluster controller, which monitors server health, recruits other processes, and stores cluster configuration to enable fault-tolerant management. This setup ensures that even in the presence of failures, the cluster can rapidly re-elect leadership and maintain operational continuity.[16][4][17]

Storage processes, known as storage servers, manage data persistence on disk using a B-tree structure implemented with a modified SQLite engine, supported by log-structured transaction logs for mutations. They maintain multi-version concurrency control (MVCC) within a 5-second mutation window, buffering recent changes in memory before durable writes, which allows efficient handling of versioned updates without immediate full persistence. This versioning aligns with FoundationDB's ordered key-value model, where keys maintain a total order to support range queries and efficient storage.[16][4]

Proxy processes consist of stateless GRV (Get Read Version) proxies and commit proxies that collectively handle client interactions. GRV proxies issue snapshot read versions to clients, while commit proxies route transaction requests, perform load balancing across the cluster, and orchestrate commit sequencing to assign global commit versions. In the versioned data model, all data changes receive a unique global version number—advancing at up to 1 million per second—sourced from GRV proxies for reads and the master-coordinated sequencer for writes, guaranteeing consistent views across the system.[16][4][17]

The shared-nothing design decouples these components into independent nodes that scale horizontally without shared memory or disks, coordinating solely through the transaction system for operations like version assignment and data replication. Coordinators bootstrap the cluster by electing the master, which in turn directs proxies to distribute load and route requests to storage processes; storage servers then pull necessary logs asynchronously to apply versions, forming a cohesive system that isolates failures and optimizes throughput. This interaction allows clients to buffer writes and observe their own uncommitted changes locally, while consistency is enforced on the server side at commit.[16][4]
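How these pieces appear to a client can be sketched with the Python binding; the cluster-file path, the coordinator addresses in the comment, and api_version(710) are placeholders and assumptions rather than defaults:

    import fdb

    fdb.api_version(710)  # assumes a 7.1 or newer client library

    # The cluster file is a single line of the form description:id@ip:port,...
    # for example: testdb:Ab12Cd34@10.0.0.1:4500,10.0.0.2:4500,10.0.0.3:4500
    # Clients contact these coordinators, which direct them to the rest of the cluster.
    db = fdb.open('/etc/foundationdb/fdb.cluster')

    @fdb.transactional
    def show_read_version(tr):
        # The read version obtained from the GRV proxies is the global version at
        # which this transaction's snapshot reads are served.
        print('read version:', tr.get_read_version().wait())

    show_read_version(db)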
Transaction Management
FoundationDB employs optimistic concurrency control (OCC) for transaction management, allowing transactions to proceed without locks until commit time, where conflicts are detected and resolved.[4] This approach minimizes contention in distributed environments by enabling parallel execution of transactions across shards.[4]

A complementary reliability mechanism is the deterministic simulation framework, which exercises the entire transaction and recovery logic in a controlled, single-threaded environment before production deployment, ensuring robust handling of edge cases like network partitions or failures during commits.[5] This simulation tests the entire cluster behavior, including transaction logic, under millions of fault scenarios to verify correctness without nondeterminism.[4]

Conflict detection occurs at commit via read-set and write-set comparisons, where each transaction records the keys read and written along with their versions.[18] Resolvers, each responsible for a portion of the key space, use these sets to check for read-write conflicts by comparing against concurrent transactions' write sets within the read version and proposed commit version; if a read key was written after the transaction's read version, it aborts to maintain serializability.[4] While global versioning provides the temporal ordering for these checks, the per-transaction version tracking ensures efficient parallel resolution.[18]

The commit protocol coordinates atomicity through a handshake among proxies, coordinators (via the master server), and storage components. Clients submit batched mutations to proxies, which request a monotonically increasing commit version from the master server before dispatching to resolvers for conflict checks.[18] Upon approval, proxies append the mutations to replicated transaction logs (as redo records) on log servers, ensuring durability across a configurable replication factor; storage servers then asynchronously apply these redo records to persistent data.[4] Proxies only acknowledge success to clients after confirming writes to the required number of log replicas.[4]

Transactions operate under snapshot isolation for reads, capturing a consistent view at the assigned read version to avoid anomalies during execution.[4] Serializability is enforced at commit by the conflict detection, rejecting transactions that would violate a serial order relative to concurrent commits.[18] Mutations are batched into redo records at the proxy level, enabling efficient append-only writes to transaction logs and reducing overhead for high-throughput workloads.[4]
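A sketch of how the optimistic model surfaces in application code, using the Python binding's @fdb.transactional decorator, which re-runs the function whenever the resolvers reject its commit; the account keys and tuple-encoded balances are illustrative, and api_version(710) assumes a 7.1 or newer client:

    import fdb
    import fdb.tuple

    fdb.api_version(710)
    db = fdb.open()

    @fdb.transactional
    def transfer(tr, src, dst, amount):
        # Reads are served at the transaction's read version and recorded in its
        # read set; writes are buffered locally and shipped only at commit.
        src_balance = fdb.tuple.unpack(tr[src])[0]
        dst_balance = fdb.tuple.unpack(tr[dst])[0]
        if src_balance < amount:
            raise ValueError('insufficient funds')
        tr[src] = fdb.tuple.pack((src_balance - amount,))
        tr[dst] = fdb.tuple.pack((dst_balance + amount,))
        # If a concurrent transaction wrote src or dst after this read version, the
        # resolvers reject the commit and the decorator re-runs this function.

    # Illustrative keys; both balances must already exist as tuple-packed integers.
    transfer(db, fdb.tuple.pack(('acct', 'alice')), fdb.tuple.pack(('acct', 'bob')), 10)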
Storage and Distribution
FoundationDB employs a log-structured storage model where mutations are recorded in append-only transaction logs on dedicated log servers, ensuring fast commit latencies through synchronous replication and fsync operations for durability.[16] These logs capture changes in version order, with storage servers asynchronously pulling and applying mutations to maintain a durable, versioned key-value store.[4] The storage engines, such as the default SSD B-tree or the higher-throughput Redwood B-tree introduced in version 7.0, periodically compact applied mutations into efficient on-disk structures, reducing write amplification and optimizing read performance by merging versions and reclaiming space from deletions.[19][20]

Data distribution in FoundationDB is achieved through automatic sharding of the key space into contiguous ranges, typically sized between 125 MB and 500 MB, assigned to storage servers for horizontal scaling.[21] Shards are dynamically split or merged based on size or write hotspots to prevent imbalances, with the data distributor managing assignments to ensure even load across the cluster.[16] Replication occurs via redundancy groups, or "teams," where each shard maintains multiple copies—defaulting to three replicas—distributed across fault domains like machines or racks to tolerate failures without data loss.[4][21]

Background rebalancing handles data movement to maintain uniform distribution and recover from failures, such as restoring replication in unhealthy teams or relocating shards after machine removals.[21] The data distributor monitors storage metrics, like bytes stored, and initiates shard migrations without considering read traffic, prioritizing byte-level balance to minimize latency impacts during ongoing operations.[16] This process ensures fault tolerance by continuously adapting to cluster changes, such as adding or removing nodes.

FoundationDB supports backup and restore operations through versioned snapshots that capture consistent point-in-time states without downtime, using tools such as fdbbackup to stream data to external storage.[22] Introduced in version 7.4, Backup V2 optimizes this by reducing log system writes by up to 50%, lowering commit latency and decreasing the required number of transaction logs through partitioned log handling and incremental options.[14]
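Shard boundaries and replica placement described above are normally invisible to applications, but they can be inspected through the client locality API; a sketch in Python, with an arbitrary probe key and api_version(710) assumed:

    import fdb
    import fdb.locality

    fdb.api_version(710)
    db = fdb.open()

    # Each boundary key starts a contiguous shard assigned to one team of storage
    # servers; counting boundaries gives a rough picture of the current sharding.
    boundaries = list(fdb.locality.get_boundary_keys(db, b'', b'\xff'))
    print('approximate shard count:', len(boundaries))

    # The storage servers holding the replicas of a particular key can be listed too.
    @fdb.transactional
    def replica_addresses(tr, key):
        return list(fdb.locality.get_addresses_for_key(tr, key).wait())

    print(replica_addresses(db, b'some application key'))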
Encryption is configurable both at rest and in transit to secure data. At rest, FoundationDB has supported native encryption using AES-256 in CTR mode since version 7.2, integrated with external key management services (KMS) via a generic connector framework; data and metadata are encrypted on flush to disk, with headers preserved for decryption during reads.[23] In transit, Transport Layer Security (TLS) is enabled cluster-wide using LibreSSL, requiring certificate and key files for all inter-process communications to ensure authenticated and encrypted connections.[24]
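In client code, TLS is configured through network options set before the first connection is opened; a minimal sketch in Python with placeholder certificate paths, where the option setter names and string parameters are assumptions based on the client's TLS network options:

    import fdb

    fdb.api_version(710)

    # TLS options must be set on the network before fdb.open() starts it.
    # Paths are placeholders; server processes need matching certificate, key,
    # and CA files so that connections are mutually authenticated.
    fdb.options.set_tls_cert_path('/etc/foundationdb/tls/client-cert.pem')
    fdb.options.set_tls_key_path('/etc/foundationdb/tls/client-key.pem')
    fdb.options.set_tls_ca_path('/etc/foundationdb/tls/ca.pem')
    fdb.options.set_tls_verify_peers('Check.Valid=1')

    db = fdb.open()  # traffic to a TLS-enabled cluster is now encrypted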
Features
ACID Compliance and Serializability
FoundationDB ensures full ACID (Atomicity, Consistency, Isolation, Durability) compliance for all transactions, providing robust guarantees in a distributed environment through optimistic concurrency control and multiversion concurrency control (MVCC).[12] This design allows developers to rely on strong consistency without manual conflict resolution, making it suitable for applications requiring reliable data integrity across clusters.[25]

Atomicity is achieved via an all-or-nothing commit protocol, where a transaction's writes are either fully applied or entirely rolled back in case of conflicts or failures. During commit, the system assigns a version and checks for read-write or write-write conflicts; if any are detected, the transaction aborts and rolls back automatically, ensuring no partial updates occur.[26] This protocol, involving sequencers for version stamping and resolvers for conflict detection, guarantees that concurrent transactions do not interfere partially.[4]

Consistency is maintained through global versioning and the absence of partial writes, where every transaction operates on and produces a consistent database snapshot. The system uses MVCC to assign read versions at transaction start and commit versions only upon successful conflict resolution, preventing any intermediate states from being visible to other transactions.[12] This ensures that application-defined invariants, such as data relationships, remain intact even under high concurrency.[25]

Isolation is provided via snapshot reads and conflict-free serialization, allowing transactions to read from a point-in-time view without blocking writers. Reads are performed against a snapshot determined by a get-read-version (GRV) request, while writes are buffered locally until commit, where conflicts are resolved optimistically.[26] This mechanism supports concurrent execution without dirty reads, non-repeatable reads, or phantom reads, as conflicting transactions are serialized at commit time.[4]

Durability is ensured through synchronous replication and explicit disk synchronization, where committed writes are persisted to stable storage on multiple nodes before acknowledgment. Upon commit, data is replicated to a quorum of log servers (typically f+1 for fault tolerance against f failures), with fsync operations confirming writes to disk, guaranteeing recovery even after crashes.[12] This adds a small latency overhead but provides strong persistence guarantees in distributed setups.[4]

FoundationDB achieves strict serializability, the strongest form of isolation, ensuring that the execution of transactions is equivalent to some serial order that respects both the real-time order of non-overlapping transactions and the commit order. This is proven through the system's versioning mechanism: a central sequencer assigns monotonically increasing read and commit versions based on transaction start times, while resolvers detect and prevent cycles in the serialization graph via conflict ranges.[25] As a result, committed transactions appear to execute in a total order matching their start times, with no transaction observing changes from later-starting but earlier-committing ones, thus eliminating anomalies like write skew.[4] This guarantee holds across the entire distributed database, simplifying reasoning about concurrent operations.[26]
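A brief sketch of these guarantees from the application's point of view, using the Python binding with illustrative keys and api_version(710) assumed:

    import fdb

    fdb.api_version(710)
    db = fdb.open()

    @fdb.transactional
    def enroll(tr, student, course):
        if tr[b'course/closed/' + course].present():
            # Raising before commit discards every buffered write: atomicity means
            # the two mutations below are applied together or not at all.
            raise ValueError('course is closed')
        tr[b'attends/by_course/' + course + b'/' + student] = b''
        tr[b'attends/by_student/' + student + b'/' + course] = b''

    @fdb.transactional
    def count_enrollments(tr):
        # Snapshot reads see the same consistent version but are excluded from the
        # read conflict range, so concurrent writers will not force a retry.
        return sum(1 for _ in tr.snapshot.get_range(b'attends/', b'attends0'))

    enroll(db, b'alice', b'cs101')
    print(count_enrollments(db))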
Scalability Mechanisms
FoundationDB achieves horizontal scalability by allowing the addition of storage nodes to the cluster, which enables linear scaling of read operations as more Storage Servers are introduced. The system automatically partitions the key space into ranges distributed across these nodes, with the Data Distributor continuously monitoring and relocating data shards to maintain balance based on load and storage utilization. This dynamic relocation ensures even distribution without manual intervention, supporting clusters that span from a single machine to dozens of multicore servers.[4][27]

Throughput in FoundationDB scales to millions of transactions per second through parallel processing across multiple nodes and minimized contention via optimistic concurrency control, where transactions proceed in parallel and conflicts are resolved at commit time with a low conflict rate of approximately 0.73%. Writes scale by adding Proxies, Resolvers, and Log Servers, while the system's unbundled architecture separates transaction management from storage to avoid bottlenecks. In benchmarks, configurations with 24 machines have demonstrated up to 2.779 million operations per second.[4][28]

Elasticity is provided through live reconfiguration capabilities that allow cluster resizing without downtime, as the system supports adding or removing processes dynamically while the Data Distributor rebalances data in the background. Recovery from failures or changes occurs rapidly, with median recovery times under 5 seconds, enabling seamless adaptation to varying workloads. Data redistribution for hot spots completes in milliseconds, and larger adjustments take minutes, ensuring continuous availability during scaling events.[4][28][27]

Performance tuning in FoundationDB includes configurable redundancy levels, where replication factors (such as k = f + 1 replicas, with f being the number of failures tolerated) can be adjusted to balance durability and throughput. Batch sizes for transaction commits are dynamically tuned by the system to optimize latency and throughput, adapting to current load conditions. These parameters allow operators to fine-tune the cluster for specific performance requirements without altering the core architecture.[4][27]

Monitoring and metrics in FoundationDB track key indicators such as throughput (e.g., 390.4K reads/s and 138.5K writes/s in tested configurations), latency (average 1ms for reads and 22ms for commits), and cluster health through components like the Ratekeeper, which monitors system load and adjusts transaction rates to prevent overload. The Cluster Controller oversees process health and coordinates reconfiguration, providing operators with insights into storage utilization, replication status, and overall performance to maintain scalability under load.[4][28][27]
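Operators typically read such workload and Ratekeeper metrics from the cluster's machine-readable status document; a hedged sketch in Python that tolerates missing fields, since the exact JSON layout can vary across versions:

    import fdb
    import json

    fdb.api_version(710)
    db = fdb.open()

    # Machine-readable status is published under a special key; .get() lookups are
    # used below because field names may differ between versions.
    status = json.loads(db[b'\xff\xff/status/json'])
    cluster = status.get('cluster', {})

    workload = cluster.get('workload', {}).get('operations', {})
    print('reads/s :', workload.get('reads', {}).get('hz'))
    print('writes/s:', workload.get('writes', {}).get('hz'))

    qos = cluster.get('qos', {})
    print('ratekeeper limit:', qos.get('performance_limited_by', {}).get('name'))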
Layered Design and APIs
FoundationDB employs a layered architecture that allows developers to build higher-level data models on top of its core ordered key-value store, enabling extensibility without altering the underlying storage engine.[29] This design separates the low-level transactional storage from application-specific abstractions, ensuring that layers remain stateless and can scale independently while leveraging FoundationDB's ACID guarantees.[29] Layers are implemented as client-side libraries or microservices that translate higher-level operations into base key-value transactions, facilitating the creation of relational, document-oriented, or custom data models.[10]

At its foundation, the key-value API provides basic operations for data manipulation within ACID transactions: get retrieves the value associated with a specific key; set stores or updates a value at a given key; clear removes a key-value pair; and range reads fetch all key-value pairs within a specified key range, preserving the ordered nature of keys for efficient prefix-based queries.[30] These operations form the minimal interface, treating all data as byte strings, which supports arbitrary serialization but requires careful key design to avoid hotspots or inefficient scans.[15]
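A minimal sketch of these four operations in the Python binding; the keys are illustrative and api_version(710) assumes a 7.1 or newer client:

    import fdb

    fdb.api_version(710)
    db = fdb.open()

    @fdb.transactional
    def demo(tr):
        tr[b'temp/berlin'] = b'13'                   # set
        tr[b'temp/boston'] = b'9'
        value = tr[b'temp/berlin']                   # get (a lazy future)
        if value.present():
            print('berlin:', value)
        for kv in tr.get_range(b'temp/', b'temp0'):  # range read over the ordered keys
            print(kv.key, kv.value)
        del tr[b'temp/boston']                       # clear

    demo(db)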
The tuple layer builds directly on this base API by providing a structured encoding scheme for composite data types, allowing developers to pack and unpack tuples—such as strings, integers, booleans, UUIDs, or nested structures—into ordered keys that maintain lexicographic sorting.[15] For instance, a tuple like (state, county) can be encoded as a single key prefix, enabling range queries over subsets of data, such as all counties in a given state, without custom serialization logic.[15] This layer is integrated into all official language bindings, ensuring cross-language compatibility for key construction and decoding.[31]
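The same pattern in code, using the Python binding's tuple layer with illustrative state and county data:

    import fdb
    import fdb.tuple

    fdb.api_version(710)
    db = fdb.open()

    @fdb.transactional
    def load_and_query(tr):
        # Packed tuples sort in the same order as their elements, so every county
        # of one state falls inside a single contiguous key range.
        tr[fdb.tuple.pack(('temp', 'CA', 'Alameda'))] = fdb.tuple.pack((18,))
        tr[fdb.tuple.pack(('temp', 'CA', 'Yolo'))] = fdb.tuple.pack((22,))
        tr[fdb.tuple.pack(('temp', 'NY', 'Kings'))] = fdb.tuple.pack((11,))

        for k, v in tr[fdb.tuple.range(('temp', 'CA'))]:
            print(fdb.tuple.unpack(k)[2], fdb.tuple.unpack(v)[0])

    load_and_query(db)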
For hierarchical organization and indexing, the directory layer manages namespaces as a tree structure, where paths like ('users', 'profiles') map to dedicated key subspaces for isolated data storage and efficient relocation.[32] It supports operations such as creating, opening, moving, and listing subdirectories, which allocate unique prefixes to prevent key collisions and facilitate scalable indexing for relational or nested models.[33] This enables tree-based data partitioning, where related records are grouped under common prefixes for fast range reads, akin to file system directories but optimized for distributed key-value storage.[12]
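A short sketch of the directory layer in the Python binding; the path, user IDs, and profile payloads are illustrative:

    import fdb

    fdb.api_version(710)
    db = fdb.open()

    # create_or_open allocates a short unique prefix for the path, so everything
    # stored under ('users', 'profiles') can later be moved or removed as a unit.
    profiles = fdb.directory.create_or_open(db, ('users', 'profiles'))

    @fdb.transactional
    def put_profile(tr, user_id, blob):
        tr[profiles.pack((user_id,))] = blob

    @fdb.transactional
    def list_profiles(tr):
        return [profiles.unpack(k)[0] for k, v in tr[profiles.range()]]

    put_profile(db, 'u42', b'{"name": "Ada"}')
    print(list_profiles(db))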
The Record Layer extends these foundations to offer SQL-like semantics for structured data, including schema definition, primary and secondary indexes, and declarative queries over records with nested types.[10] It stores records as serialized values under indexed keys, ensuring transactional consistency for index updates and supporting multi-record operations like joins or aggregations in a single transaction.[34] Designed for multi-tenancy, this layer allows elastic scaling across stateless servers, making it suitable for high-volume applications requiring relational features without a full RDBMS.[10]
FoundationDB provides official language bindings for C, C++, Java, Python, Go, Node.js, Ruby, and PHP, each exposing the base API and higher layers with asynchronous support to handle concurrent operations efficiently—such as Python's integration with gevent for non-blocking I/O.[30] These bindings ensure low-latency access to the core operations and layers, with async patterns allowing thousands of concurrent transactions per client.[12]
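A sketch of the asynchronous read pattern in the Python binding, where futures are issued eagerly and only block when consumed; the keys are illustrative:

    import fdb

    fdb.api_version(710)
    db = fdb.open()

    @fdb.transactional
    def which_exist(tr, keys):
        # Each tr[key] returns a future immediately, so the reads are issued in
        # parallel; present() only blocks when that particular result is consumed.
        futures = [(k, tr[k]) for k in keys]
        return [k for k, v in futures if v.present()]

    print(which_exist(db, [b'temp/berlin', b'temp/boston', b'temp/chicago']))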
Developers can create custom layers using the extensible layer API, which involves defining stateless translators that map domain-specific models to key-value transactions, often combining tuple encoding for keys and directory structures for organization.[29] This API supports the development of specialized abstractions, such as sharded counters or graph stores, by ensuring all reads and writes occur atomically.[35] For example, custom layers have been built for document-oriented APIs, enabling FoundationDB to serve as a backend for custom sharded systems or higher-level databases.[9]
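As an illustration, a toy custom layer can be sketched in a few lines of Python: a sharded counter that spreads increments over several keys with the atomic ADD mutation so that concurrent increments never conflict. The subspace name and shard count are arbitrary choices:

    import fdb
    import random
    import struct

    fdb.api_version(710)
    db = fdb.open()

    # Increments land on one of SHARDS keys via the atomic ADD mutation (no read,
    # so they never conflict); the total is a range read over the counter subspace.
    counter = fdb.Subspace(('layers', 'counter', 'page_views'))
    SHARDS = 16

    @fdb.transactional
    def increment(tr, amount=1):
        tr.add(counter.pack((random.randrange(SHARDS),)), struct.pack('<q', amount))

    @fdb.transactional
    def read_total(tr):
        return sum(struct.unpack('<q', bytes(v))[0] for k, v in tr[counter.range()])

    increment(db)
    print(read_total(db))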