Datomic
Datomic is a distributed database system designed for data-of-record applications. Data is stored as a collection of immutable atomic facts known as datoms, enabling complete audit trails and time-based queries without deletions or updates to existing records.[1] Developed since 2010 by Rich Hickey, the creator of the Clojure programming language, together with the team at Relevance and Metadata Partners (which merged to form Cognitect, acquired by Nubank in 2020), Datomic applies functional programming principles and draws inspiration from persistent data structures; it provides ACID transaction guarantees and a total ordering of transactions.[2][3][4] Written primarily in Clojure and running on the Java Virtual Machine (JVM), it supports a flexible schema that allows entities to have any attributes without relying on NULL values, facilitating relational modeling with hierarchical navigation.[1]
At its core, Datomic's architecture separates storage from querying: the datoms are entity-attribute-value triples augmented with transaction metadata, stored in a chronological log that serves as an indelible history, while multiple read-only indexes (EAVT, AVET, AEVT, and VAET) enable efficient querying across diverse patterns.[5] Queries are expressed in Datalog, a declarative logic programming language that supports joins, rules, and historical as-of queries, allowing applications to reconstruct any past state of the database.[1] Datomic offers editions for different deployment needs, including a local embedded version for development, Datomic Pro for distributed setups with pluggable storage backends like DynamoDB or SQL databases, and Datomic Cloud, which is optimized for AWS with automated scaling and serverless deployment via Ions for application logic.[6]
Released under the Apache 2.0 license, Datomic is free for production use and has been adopted by hundreds of organizations for read-heavy workloads requiring strong consistency and scalability, such as financial systems and content management, by leveraging horizontal read scaling through peer nodes that cache indexes locally.[7][1] Its immutable design eliminates many traditional database challenges like locking contention and replication lag, instead appending new facts to build a comprehensive, queryable timeline of data evolution.[5]
Overview
Introduction
Datomic is a distributed, immutable database system that implements Datalog as its query language, enabling flexible and powerful data retrieval. Designed primarily as a general-purpose database for data-of-record applications, it excels in scenarios requiring inherent auditability, historical access, and point-in-time queries, such as in finance, healthcare, and compliance-heavy industries.[1]
A key differentiator of Datomic is its treatment of data as immutable facts, referred to as datoms, rather than mutable records found in conventional relational databases; this approach ensures that all changes are append-only additions, preserving the full history of data without overwrites or deletions.[1]
Datomic is free software with binaries licensed under the Apache 2.0 License, allowing broad adoption for both commercial and non-commercial use.[8][9] Originally developed by Rich Hickey and his team, initially under the Relevance banner, and first released in 2012, the project was stewarded by Cognitect until the company's acquisition by Nubank in 2020; it continues to be maintained there and is integrated into large-scale production systems.[2][10]
History
Datomic was developed by Rich Hickey, the creator of the Clojure programming language, in collaboration with the team at Relevance, a software consultancy firm. The project began around 2010, with the initial public announcement occurring on March 29, 2012, introducing Datomic as a next-generation database emphasizing immutability and distributed architecture. The first public release, version 0.8.3335, followed on July 24, 2012, offering both free and professional editions with early support for Clojure integration to facilitate adoption within the functional programming community.[11][12]
In September 2013, Relevance merged with Metadata Partners to form Cognitect, which took over stewardship of Datomic and continued its development. Early releases, such as version 0.8.4020 in June 2013, enhanced transaction data handling and solidified the peer library for embedding database operations directly into application code, promoting seamless Clojure-based workflows. A stable release in the 0.9 series arrived in January 2014 with version 0.9.5130, marking Datomic Pro as suitable for production environments through features like schema alterations and improved reliability.[3][12]
Datomic Cloud was introduced on December 17, 2017, providing a managed AWS deployment option to simplify scaling and operations for cloud-native applications. In July 2020, Nubank, a Brazilian digital banking firm, acquired Cognitect, ensuring continued investment in Datomic's maintenance and ecosystem growth without disrupting customer access. The project transitioned to the 1.0 series in November 2020 with version 1.0.6222.[13] This acquisition paved the way for further accessibility improvements, including the release of Datomic binaries under the Apache 2.0 license on April 27, 2023, allowing free use via Maven Central for non-enterprise scenarios.[10][9]
As of 2025, Datomic remains in the 1.0 series with no major version increments since its introduction in 2020, focusing instead on incremental enhancements for cloud scalability, such as performance optimizations in version 1.0.7469 released on October 23, 2025, alongside improved documentation and integration tools to support evolving deployment needs.[12][14]
Design Principles
Core Data Model
Datomic's core data model revolves around immutable atomic facts known as datoms, which form the foundational building blocks of the database. Each datom is a 4-tuple consisting of an entity ID (E), an attribute (A), a value (V), and a transaction ID (Tx), representing a single, asserted fact about the world.[1] This structure captures relationships and properties in a universal relation, where datoms are never modified once added, ensuring a consistent historical record.[1]
Entities in Datomic emerge as dynamic collections of datoms that share the same entity ID, providing a lazy, associative view of related attributes and values at a specific point in time. Accessed via the entity API, such as (d/entity db id), an entity functions like a map where attributes serve as keys and values as their associated data, including references to other entities for relational navigation.[15] This design allows entities to evolve organically without rigid predefined structures, supporting flexible data representation across diverse domains.[15]
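A minimal Clojure sketch of entity access via the Peer API, assuming a database value db obtained from a connection and the :artist/name attribute used in the query examples later in this article:
(require '[datomic.api :as d])

(let [eid    (d/q '[:find ?e .
                    :where [?e :artist/name "The Beatles"]]
                  db)
      artist (d/entity db eid)]   ; lazy, map-like view over the entity's datoms
  (:artist/name artist))          ; => "The Beatles"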
Attributes define the properties of entities and are themselves entities in the schema, specified with a value type (e.g., :db.type/string, :db.type/ref) and cardinality (:db.cardinality/one for single values or :db.cardinality/many for sets).[16] Uniqueness constraints can be applied via :db/unique, either as :db.unique/identity for upserting based on domain keys like email addresses, or :db.unique/value to enforce that a given value can be asserted for at most one entity.[17] Entity IDs, which are database-unique 64-bit longs assigned by the transactor, ensure stable identification, while temporary IDs facilitate client-side entity creation before resolution.[17]
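A hedged sketch of installing such an attribute, assuming a Peer API connection conn and a hypothetical :person/email attribute:
(require '[datomic.api :as d])

@(d/transact conn
   [{:db/ident       :person/email
     :db/valueType   :db.type/string
     :db/cardinality :db.cardinality/one
     :db/unique      :db.unique/identity   ; upsert on this domain key
     :db/doc         "A person's email address."}])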
Schema evolution in Datomic is inherently flexible, as attributes can be added, modified, or extended transactionally without impacting existing datoms or requiring data migration.[16] For instance, new attributes can be introduced to entities over time, preserving historical integrity while accommodating changing requirements. Unlike traditional relational models with fixed tables and columns, Datomic's entity-centric approach supports joins through referential attributes but eschews tabular storage, enabling a more fluid, schema-optional structure that emphasizes relationships via datoms.[15] This immutability of datoms underpins the model's time-aware nature; the mechanics of adding facts transactionally are described in the following section.[1]
Immutability and Transactions
Datomic enforces immutability by treating databases as append-only ledgers of facts, where no in-place updates or deletions occur.[18] Instead, changes are represented by adding new datoms—atomic facts consisting of an entity ID, attribute, value, and transaction identifier—while preserving the entire history of prior states.[18] This design ensures that every database value is an immutable set of all datoms ever added, enabling reliable auditing and temporal queries without mutable state conflicts.[18]
Transactions in Datomic are declarative and submitted via the d/transact API, which atomically accrues a set of datoms to the database.[18] Each transaction specifies additions or retractions using forms like [:db/add entity-id attribute value] or [:db/retract entity-id attribute old-value], processed as a cohesive unit without intermediate visibility to other operations.[19] The transactor, a centralized component, serializes and applies these transactions in the order received, ensuring that the resulting database reflects the complete set of changes or none at all.[20]
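A minimal sketch of submitting such a transaction, assuming a Peer API connection conn, the hypothetical :person/email and :person/name attributes, a string temporary ID ("new-person") resolved by the transactor, and existing-id standing for a previously resolved entity ID:
@(d/transact conn
   [[:db/add "new-person" :person/email "ada@example.com"]        ; list form
    {:db/id "new-person" :person/name "Ada Lovelace"}             ; map form, same entity
    [:db/retract existing-id :person/email "old@example.com"]])   ; retraction of a prior fact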
Datomic provides full ACID guarantees for transactions. Atomicity is achieved through a single transactor that performs writes in one atomic operation to durable storage, preventing partial commits.[21] Consistency is maintained via schema validation, which enforces rules like unique attributes and entity predicates before accepting changes, alongside a global transaction time basis for ordered state transitions.[21] Isolation ensures serializability by delivering point-in-time database views to peers, where reads are monotonic and writes form a total order across the system.[21] Durability relies on the underlying storage backends, such as DynamoDB or Cassandra, which confirm writes before transaction completion.[21]
Transaction functions extend Datomic's capabilities by allowing custom Clojure code to be invoked during transaction processing for business logic, such as validation or derived fact generation.[22] These pure functions take the database state before the transaction and arguments, returning additional transaction data or aborting via d/cancel if rules are violated.[22] Deployed either as database functions (via transactions) or classpath functions (on the transactor), they integrate seamlessly with the immutable model while maintaining ACID properties.[22]
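A hedged sketch of a classpath transaction function, assuming a hypothetical :account/balance attribute; throwing an exception aborts the enclosing transaction:
(require '[datomic.api :as d])

(defn debit
  "Transaction function: subtracts amount from the account's balance,
   aborting the whole transaction if funds are insufficient."
  [db account-id amount]
  (let [balance (or (:account/balance (d/entity db account-id)) 0)]
    (when (< balance amount)
      (throw (ex-info "Insufficient funds"
                      {:account account-id :balance balance :requested amount})))
    [[:db/add account-id :account/balance (- balance amount)]]))
;; Once the namespace is available on the transactor classpath, the function is
;; invoked from transaction data, roughly as [(my.app/debit account-id 25)].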
Error handling in Datomic ensures transactional integrity by failing the entire operation if any constraint is breached, such as schema violations or function aborts, with no partial effects persisted.[20] The transactor processes transactions serially from a queue, returning detailed reports on success or failure, including exceptions for issues like timeouts or invalid data.[20] Peers can monitor outcomes asynchronously to confirm the database state post-submission.[20]
Querying and Data Access
Datalog Language
Datomic employs Datalog as its primary query language, a declarative, logic-based system designed for retrieving and manipulating data from its immutable database of facts known as datoms.[23] Datalog in this context allows users to express queries as logical patterns and rules, focusing on what data to retrieve rather than how to access it, which facilitates complex relational queries without procedural code.[24]
At its core, Datomic's Datalog operates on relations represented by triples of entity ID, attribute, and value, enabling joins and pattern matching akin to Prolog but optimized for database operations.[23] Queries are structured using keywords like :find to specify output variables, :where for pattern clauses, and :in for input parameters; reusable rules can be supplied as an additional input, conventionally bound to the % symbol.[24] Variables, denoted by ? prefixes (e.g., ?e for an entity), bind to values during execution, while clauses such as [?e :artist/name ?name] match entities with specific attributes. Rules support recursion, for instance, by defining transitive closures over relationships like parent-child hierarchies through repeated applications of base patterns.[24]
Query execution in Datomic returns unordered sets of bindings for the :find variables, typically as tuples or collections, executed against a database value that represents a point-in-time view of the data.[25] Inputs can include constants, entity IDs, or even subqueries via :in, allowing parameterized and composable queries, while outputs can leverage the Pull API for hierarchical entity attribute selection beyond flat bindings.[26]
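A hedged sketch of a Pull API call, reusing the artist and track attributes from the examples below; artist-id is assumed to be a previously resolved entity ID, and the reverse reference :track/_artists navigates from the artist back to its tracks:
(d/pull db
        '[:artist/name
          {:track/_artists [:track/name :track/duration]}]
        artist-id)
;; => a nested map such as
;; {:artist/name "The Beatles"
;;  :track/_artists [{:track/name "Here Comes the Sun" :track/duration 186000} ...]}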
Compared to SQL, Datomic's Datalog offers advantages in handling complex joins and recursion natively without subqueries or recursive common table expressions, and it provides schema flexibility by querying schema-optional data without requiring fixed table structures.[27] Its use of data structures for queries avoids SQL injection vulnerabilities inherent in string-based SQL and enables engine-level optimizations that evolve independently of query logic.[27]
For a simple entity lookup, a query might find all entities named "The Beatles" as follows:
[:find ?e
:where [?e :artist/name "The Beatles"]]
This returns a set containing the entity ID, such as #{[17592186045470]}.[24]
A multi-hop relationship query could retrieve track names and durations for songs by that artist, joining across entities:
[:find ?name ?duration
:where [?e :artist/name "The Beatles"]
[?track :track/artists ?e]
[?track :track/name ?name]
[?track :track/duration ?duration]]
Executing this yields bindings like [["Here Comes the Sun" 186000]], demonstrating traversal of artist-to-track references.[24]
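The rules mentioned above package reusable :where logic and support recursion; a hedged sketch using a hypothetical :person/parent reference attribute, with db and ancestor-id assumed bindings, finds all descendants of a given ancestor:
(def rules
  '[[(ancestor ?a ?d)
     [?d :person/parent ?a]]
    [(ancestor ?a ?d)
     [?d :person/parent ?p]
     (ancestor ?a ?p)]])

(d/q '[:find ?d
       :in $ % ?a
       :where (ancestor ?a ?d)]
     db rules ancestor-id)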
Time Travel Features
Datomic's time travel features allow users to query the database at any historical point, providing immutable snapshots and change logs without requiring separate auditing mechanisms. This capability stems from the system's design, where every transaction appends new facts to an indelible log, enabling retrospective analysis of data evolution.[28]
As-of queries retrieve a consistent view of the database as it existed at a specific transaction or timestamp, excluding all subsequent changes. For instance, invoking db.asOf(t) returns a database value containing only facts asserted up to time t, which can be specified as a transaction ID, basis-t, or Java Date. This facilitates auditing past states, such as verifying decisions based on outdated inventory levels where a query might show a stock count of 7 for an item before a correction transaction updated it to 100.[28][29]
History views, obtained via db.history(), expose the complete audit trail of all datoms ever asserted or retracted, including both additions (true) and removals (false) across the database's lifetime. Queries against this view can reveal the full sequence of changes for an entity, such as tracking updates to an item's count from 100 to 1000 over multiple transactions, complete with operation flags and timestamps. This unfiltered perspective supports detailed forensic analysis without data loss.[28][29]
Since and until filters enable targeted examination of changes within temporal bounds by combining functions like db.since(t1).asOf(t2), which yields datoms added after t1 but present up to t2. The since filter isolates facts transacted after a given point, useful for detecting deltas, while chaining with as-of refines the window for precise change detection. Because a since view omits facts transacted before t1, lookups that depend on earlier datoms (such as lookup refs) must be resolved against an unfiltered database value.[28]
Each transaction in Datomic is associated with a :db/txInstant attribute, a timestamp recording when the transaction occurred, serving as the anchor for all temporal queries. This built-in metadata ensures that time points are precise and queryable, allowing users to reference exact moments like #inst "2014-01-01" for reproducible historical views.[28][16]
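A minimal Clojure sketch of these views via the Peer API, assuming a connection conn, a query query, and time points t1 and t2 (transaction IDs, basis-t values, or java.util.Date instances):
(require '[datomic.api :as d])

(let [db (d/db conn)]
  (d/q query (d/as-of db t1))                      ; the database as it existed at t1
  (d/q query (d/history db))                       ; every assertion and retraction ever made
  (d/q query (-> db (d/since t1) (d/as-of t2))))   ; facts added after t1, viewed as of t2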
These features underpin key use cases such as regulatory compliance auditing, where full histories satisfy retention requirements; debugging application logic by replaying data states; and versioning without auxiliary logs, as the persistent fact log inherently supports rollback simulations and trend analysis.[29][28]
Architecture
System Components
Datomic's system architecture revolves around a distributed setup comprising peers, a transactor, and a pluggable storage layer, designed to separate read and write operations for reliability and performance.[30] Peers act as application-side clients that handle both querying and transaction submission, connecting directly to the storage service to discover the transactor's location via periodic heartbeats.[30] The transactor serves as the central authority for all write operations, while the storage layer provides durable persistence across various backends.[31] This separation ensures that read workloads can scale independently without impacting writes, though the single active transactor imposes a natural limit on write throughput.[32]
Peers are read-oriented JVM processes embedded in applications, caching database indexes locally to enable fast, consistent queries even during transactor outages.[30] They maintain a local view of the database value, allowing queries to run against the most recent consistent state available in memory, and can use functions like d/sync to align with storage for point-in-time accuracy.[30] For writes, peers submit transactions to the transactor over a secure connection but do not process them locally, ensuring all modifications are sequenced centrally.[31] Multiple peers can connect to the same database concurrently, distributing read load across application instances without requiring additional coordination.[30]
The transactor is a single, active process responsible for processing all transactions across one or more databases, guaranteeing ACID properties through atomic sequencing and validation.[31] It receives transaction requests from peers, orders them linearly, and applies changes to the database while writing heartbeats to the storage layer to advertise its endpoint for peer reconnection.[30] In production, a standby transactor monitors the active one for failover, enabling high availability without data loss, though frequent failovers signal underlying issues.[30] The transactor's design emphasizes fail-fast behavior, isolating it on dedicated hardware to minimize interference from peer or storage loads.[31]
The storage layer is a pluggable abstraction for persistent data storage, supporting backends such as AWS DynamoDB for scalable NoSQL persistence, Apache Cassandra for distributed clusters with a minimum of three nodes and replication factor of three, and relational databases like PostgreSQL or MySQL via JDBC.[33] These backends store the database's datoms and indexes durably, with API compatibility allowing seamless switching via connection strings.[33] The transactor interacts with storage to persist transaction outcomes and index segments, while peers read from it for cache population and synchronization.[30]
Index building occurs on the transactor, which accumulates recent datom changes in memory until a threshold (default 32MB) triggers background indexing into immutable segments of up to approximately 50KB each.[32] These segments, covering all index types like EAVT and AEVT, are then pushed to the storage layer for persistence, ensuring each datom is replicated at least three times for redundancy.[32] If memory usage approaches the maximum (default 512MB), the transactor applies back pressure to throttle incoming transactions until indexing completes, preventing overload.[32] Parallelism in indexing, configurable up to eight threads on multi-CPU systems with scalable storage, accelerates this process for high-write scenarios.[32]
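A hedged sketch of the relevant settings in Datomic Pro's transactor properties file (values mirror the defaults discussed above; storage and protocol settings are omitted):
# transactor.properties (excerpt)
# accumulated novelty that triggers a background indexing job
memory-index-threshold=32m
# ceiling at which the transactor applies back pressure to incoming transactions
memory-index-max=512m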
For scalability, Datomic leverages multiple peers to horizontally scale read queries and local caching, allowing applications to handle increased load by adding instances without affecting the transactor.[30] Writes, however, are constrained by the single transactor's capacity, which can be tuned via CPU allocation, write concurrency (e.g., four threads for 800KB/second throughput on DynamoDB), and storage provisioning.[32] The architecture's process isolation—running peers, transactor, and storage on separate hardware—ensures that spikes in one area minimally impact others, supporting reliable operation in distributed environments.[30] High availability is further enhanced by the standby transactor and storage replication, though overall write scaling requires careful capacity planning to avoid indexing bottlenecks.[32]
Indexing and Storage
Datomic employs four built-in covering indexes to organize datoms for efficient data access patterns. These indexes maintain ordered sets of datoms, enabling optimized lookups without requiring additional schema configurations for most queries. The primary index, EAVT (Entity-Attribute-Value-Transaction), sorts datoms by entity ID (E) ascending, followed by attribute (A), value (V) ascending, and transaction ID (T) descending; it facilitates entity lookups, akin to accessing rows in a relational database, grouping all facts associated with a specific entity for master-detail operations.[34]
The AEVT index sorts by attribute (A) ascending, entity (E) ascending, value (V) ascending, and transaction (T) descending, supporting attribute scans that retrieve all values of a given attribute across entities, similar to column-wise access in SQL. The AVET index orders by attribute (A) ascending, value (V) ascending, entity (E) ascending, and transaction (T) descending, enabling efficient lookups and range scans on specific attribute-value pairs; it requires explicit schema configuration (:db/index true) in Datomic Pro but is always enabled in Datomic Cloud. The VAET index, a reverse index covering reference attributes (:db.type/ref), sorts by value (V) ascending, attribute (A), entity (E), and transaction (T) descending, enabling value-based access and relationship traversal, such as finding all entities linked to a particular reference value.[34][35]
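A hedged Peer API sketch of walking these indexes directly with d/datoms, assuming a database value db and the :artist/name attribute from the query examples:
(require '[datomic.api :as d])

;; AEVT: every entity/value pair for one attribute ("column" style access).
(take 5 (d/datoms db :aevt :artist/name))

;; AVET: entities carrying a specific attribute-value pair
;; (requires :db/index true or a uniqueness constraint in Datomic Pro).
(seq (d/datoms db :avet :artist/name "The Beatles"))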
Index segments form the foundational units of these indexes: immutable blocks of sorted datoms written during indexing that capture snapshots over transaction intervals. Each index is a shallow tree of segments with a wide branching factor (approximately 1,000), whose leaf segments hold a few thousand datoms each, ensuring compact storage and fast traversal. The transactor periodically rebuilds these indexes through background indexing jobs, merging recent datoms from an in-memory index into the durable tree; this adaptive process scales sublinearly with data volume, minimizing rewritten segments and maintaining query performance.[36][34]
Datomic supports multiple storage backends for persisting datoms and indexes, prioritizing durability and scalability. In Datomic Cloud, AWS DynamoDB serves as the default backend, providing ACID-compliant storage with automatic replication across multiple availability zones. For on-premises or hybrid deployments, PostgreSQL is a common choice, utilizing a dedicated table (datomic_kvs) for key-value storage of datoms and segments. Development environments typically use either a purely in-memory database or the dev storage protocol, which persists to local disk files through an embedded JDBC-accessible server; both are suitable for non-production testing but lack enterprise-grade persistence. Other options like Cassandra are available for high-availability needs, requiring at least three nodes with a replication factor of three.[33][37]
To mitigate latency in distributed reads, peers employ caching mechanisms for index replicas. An LRU object cache holds frequently accessed index and log segments as Java objects directly in memory, requiring no explicit configuration. For larger-scale caching, Valcache provides a Memcached-compatible interface using local SSD storage on supported instances (e.g., AWS i3), with fallback to shared EFS for broader durability; this layered approach ensures segments are readily available without repeated fetches from the primary storage backend. Local storage options further allow peers to maintain personal replicas of indexes, reducing network overhead in multi-node setups.[37]
Data durability in Datomic is ensured through backend-specific replication guarantees, applying to both datoms and derived indexes. DynamoDB, for instance, replicates data across three facilities by default, offering very high durability, while PostgreSQL configurations can achieve similar redundancy via clustering. Indexes inherit this durability as immutable artifacts stored alongside datoms, with background jobs ensuring consistency without risking data loss during rebuilds. This stratified persistence model, combining transactional storage, durable caches, and archival layers, provides robust recovery from failures across the system.[33][37]
Deployment and Integrations
Deployment Options
Datomic offers three primary deployment options tailored to different scales and environments: Datomic Local for development and testing, Datomic Pro for distributed on-premises or hybrid setups, and Datomic Cloud for fully managed AWS deployments.[38] Each option leverages the same core data model and query language while varying in infrastructure management and scalability features.
Datomic Local provides an embedded, single-process database suitable for local development, continuous integration, and small applications without external dependencies. It stores data in local files or in-memory, requiring no network connectivity or separate server processes, and supports the full Datomic API for transactions and queries. Ideal for testing, it allows rapid iteration by adding the Datomic Local library to the application classpath and configuring storage via a .datomic/local.edn file to specify directories for databases.[39]
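A hedged sketch of a Datomic Local setup through the Client API; the storage directory, system, and database names are illustrative:
;; ~/.datomic/local.edn
{:storage-dir "/var/data/datomic-local"}

;; Application code, assuming the com.datomic/local library is on the classpath:
(require '[datomic.client.api :as d])

(def client (d/client {:server-type :datomic-local :system "dev"}))
(d/create-database client {:db-name "movies"})
(def conn (d/connect client {:db-name "movies"}))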
Datomic Pro enables distributed deployments with high availability, supporting on-premises, private cloud, or hybrid environments through custom storage backends like SQL databases, DynamoDB, or Cassandra. It requires managing an active transactor for writes—optionally with a standby for failover—and multiple peers for read scaling, all running on separate JVM processes for production reliability. Configuration uses properties files, such as dev-transactor-template.properties for initial setup, and the system supports manual scaling without automated cloud orchestration.[30][40]
Datomic Cloud delivers a serverless, fully managed experience exclusively on AWS, automating infrastructure with services like DynamoDB for transaction logs, S3 for indexes, and EC2 Auto Scaling Groups for compute. It eliminates transactor management, providing elastic scaling, built-in backups via S3 retention, and seamless integration with AWS features like API Gateway. Deployment occurs through AWS CloudFormation templates, focusing on VPC setup, IAM roles, and KMS encryption for security.[37][41]
All Datomic editions require a Java Virtual Machine, with Pro and Local needing Java 11 or later (LTS versions recommended) and Cloud using Java 17 on compute nodes; Clojure is essential for advanced peer or ion integrations. Configuration often involves EDN files for Local and Cloud ions, or properties files for Pro transactors, ensuring reproducible setups.[12][42][31]
Migration paths facilitate scaling: from Local to Pro or Cloud by exporting databases and reconnecting via compatible URIs, preserving the immutable data model; from Pro to Cloud by porting applications to client-only access and leveraging ions for AWS-native features, though peer-dependent code may need refactoring. Cognitect support is recommended for complex transitions.[43][39]
Client Interfaces
Datomic provides two primary programmatic interfaces for applications to interact with the database: the Peer API and the Client API. The Peer API is a full-featured library designed for embedding directly within application processes, offering direct access to queries and transactions on the JVM. In contrast, the Client API serves as a lightweight interface for remote connections, particularly suited for cloud deployments and short-lived services, routing requests through a peer server or cloud infrastructure.[44][45]
The Peer API, available as the datomic.api namespace in Clojure and the datomic.Peer class in Java, enables direct database connections for embedded use cases. Applications connect using a database URI that specifies the protocol and storage backend, such as datomic:dev://localhost:4334/hello for local development or datomic:sql://host:port?jdbc:postgresql://.../mydb for SQL-based storage. Key functions include connect(uri) to establish a thread-safe connection, q(query, db) for executing Datalog queries, pull(selector, eid) for retrieving entity data, and transact(conn, tx-data) for submitting transactions, which blocks until completion or can use transact-async for non-blocking operation. Connections are automatically cached for reuse, providing implicit pooling without manual management.[46][47][48]
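A minimal Peer API sketch against a local dev transactor, using the hello URI from above and a hypothetical :greeting/text attribute:
(require '[datomic.api :as d])

(def uri "datomic:dev://localhost:4334/hello")
(d/create-database uri)
(def conn (d/connect uri))

;; Install a schema attribute, add a fact, then query it back.
@(d/transact conn [{:db/ident       :greeting/text
                    :db/valueType   :db.type/string
                    :db/cardinality :db.cardinality/one}])
@(d/transact conn [{:greeting/text "hello, world"}])

(d/q '[:find ?text :where [_ :greeting/text ?text]] (d/db conn))
;; => #{["hello, world"]}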
The Client API, exposed through the datomic.client.api namespace, offers a synchronous interface wrapping an asynchronous core, ideal for remote access in distributed environments like Datomic Cloud. It begins with client(config-map) where the :server-type is set to :cloud (specifying :region, :endpoint, etc.) or :peer-server for on-premises setups, followed by connect(db-name) to obtain a connection. This API supports equivalent operations to the Peer API, including q(query) for queries, pull(selector, eid) for data retrieval, and transact(tx-data) for transactions, with results returned directly or via channels in async mode. Designed for smaller footprints, it communicates over HTTP to a peer server gateway, enabling scalability in microservices.[49][50][45]
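A hedged Client API sketch for a peer-server deployment, reusing the hypothetical :greeting/text attribute; the endpoint, access key, and secret are illustrative placeholders:
(require '[datomic.client.api :as d])

(def client
  (d/client {:server-type        :peer-server
             :endpoint           "localhost:8998"
             :access-key         "myaccesskey"
             :secret             "mysecret"
             :validate-hostnames false}))

(def conn (d/connect client {:db-name "hello"}))

(d/transact conn {:tx-data [{:greeting/text "hello from the client API"}]})
(d/q '[:find ?text :where [_ :greeting/text ?text]] (d/db conn))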
Integrations leverage these APIs natively: Clojure applications use datomic.api for seamless access, while Java interop employs the Peer class directly. For non-JVM languages, the Client API facilitates bindings via its HTTP-based protocol, though the legacy REST API (accessible at endpoints like https://localhost:8001) provides an alternative EDN-formatted HTTP interface for programmatic calls, albeit not recommended for new development.[46][47][51]
Connection management relies on URI schemes to abstract backends, with Peer API URIs like datomic:ddb://us-east-1/my-table/my-db for DynamoDB or datomic:mem://my-db for in-memory testing, and Client API using configuration maps for cloud or dev-local modes. Best practices include utilizing async transactions (transact-async in Peer, channel-based in Client) to achieve high throughput without blocking, retrying on transient errors like :busy with exponential backoff, and relying on built-in caching for efficient connection reuse in both APIs.[48][49][50]
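A hedged sketch of the retry guidance, assuming the synchronous Client API surfaces transient failures as exceptions whose ex-data carries a :cognitect.anomalies/category of :cognitect.anomalies/busy:
(require '[datomic.client.api :as d])

(defn transact-with-retry
  "Submits tx-data, retrying with exponential backoff on transient :busy errors."
  [conn tx-data]
  (loop [attempt 1]
    (let [result (try
                   (d/transact conn {:tx-data tx-data})
                   (catch Exception e
                     (if (and (< attempt 5)
                              (= :cognitect.anomalies/busy
                                 (:cognitect.anomalies/category (ex-data e))))
                       ::retry
                       (throw e))))]
      (if (= ::retry result)
        (do (Thread/sleep (* 100 (long (Math/pow 2 attempt))))
            (recur (inc attempt)))
        result))))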