ArangoDB
ArangoDB is a native multi-model, open-source NoSQL database system designed to handle graph, document, key-value, and full-text search data models within a unified core, enabling flexible data storage and querying without the need for multiple specialized databases.[1] It is developed by ArangoDB Inc., founded in 2014 by Claudius Weinberger and Frank Celler in Cologne, Germany. The database employs the ArangoDB Query Language (AQL), a declarative, SQL-like language that allows complex traversals and joins across all supported data models in a single query.[1] As of November 2025, the latest stable version is 3.12.6.1, available in community and enterprise editions; the community edition, licensed under the Business Source License (BSL), provides full access to features without time limits for non-commercial use and for internal commercial use up to a 100 GiB dataset size.[2][3] Key features of ArangoDB include horizontal scalability through sharding and replication, support for ACID transactions, and integration with machine learning workflows via graph analytics engines for algorithms such as PageRank and connected components.[1] It also offers advanced capabilities such as full-text search with ArangoSearch and vector search for AI applications, which the vendor claims can reduce infrastructure costs by up to 70% compared to siloed systems.[4] The system is particularly suited to connected data in scenarios requiring real-time analytics, such as recommendation engines, fraud detection, supply chain optimization, and generative AI platforms like chatbots and copilots.[5] Notable adopters include enterprises in the finance, healthcare, and technology sectors, such as Deloitte, Cloudera, and NVIDIA, which use its performance for scalable AI and graph-based workloads.[6]
Introduction
Overview
ArangoDB is an open-source, native multi-model NoSQL database that supports graph, document, key-value, vector, and search data models within a single core, allowing seamless integration of diverse data structures without the need for multiple specialized databases.[5] This architecture enables developers to handle complex, interconnected data workloads efficiently in one unified system. The primary purpose of ArangoDB is to unify data management for applications requiring flexible querying across models, facilitating use cases such as AI-driven contextual analytics, real-time recommendations, and knowledge graphs.[5] By combining these capabilities, it addresses the challenges of siloed data systems, promoting faster development and more agile data processing in modern applications.[5] As a foundation for AI data platforms, ArangoDB is marketed as reducing integration costs by up to 70% through its native support for multiple paradigms, enabling enterprises to build scalable solutions for generative AI, fraud detection, and personalized services without extensive custom engineering.[5]
Key Characteristics
ArangoDB is distinguished by its native multi-model architecture, which integrates support for graph, document, key-value, vector, and search data models within a single database engine, enabling developers to perform seamless operations across these models without requiring data duplication or complex external joins. This unified approach allows diverse data types to be queried with a single declarative language, reducing the need for multiple specialized databases and minimizing integration overhead by up to 70%.[7]

At its core, ArangoDB stores data in JSON format, represented internally in the efficient VelocyPack binary format, providing schema flexibility that accommodates evolving application requirements without predefined structures. This design supports full ACID-compliant transactions across all supported models in single-server deployments, ensuring atomicity, consistency, isolation, and durability for multi-document and multi-collection operations; in distributed setups, it maintains ACID properties for operations within the same shard.[8]

For high-performance workloads, ArangoDB incorporates GPU acceleration, particularly through integration with NVIDIA's cuGraph for graph analytics, enabling faster processing of complex computations such as pattern detection and centrality measures. It also offers both horizontal scaling via distributed clustering and auto-sharding, and vertical scaling, to handle varying loads efficiently, making it suitable for enterprise-scale applications.[9][10]

Developer-friendly aspects are further enhanced by ArangoDB's schema-free nature, which promotes agile development, and by its native support for vector embeddings and search, facilitating integration with modern AI tools such as large language models (LLMs) for applications like GraphRAG and contextual intelligence systems that ground AI outputs in trusted enterprise data.[8][11]
History and Development
Founding and Early Development
ArangoDB originated in 2011 in Cologne, Germany, when developers Claudius Weinberger, Frank Celler, and Lucas Dohmen began working on a new database project named AvocadoDB. The initiative aimed to develop a flexible NoSQL database capable of handling multiple data models, including key-value, document, and graph structures, to address limitations in existing systems that often required separate databases for different data types.[12] In May 2012, the project was renamed ArangoDB to avoid potential legal conflicts associated with the original name, while retaining the avocado-inspired logo as a nod to its versatile design. Shortly thereafter, in spring 2012, the first version of ArangoDB was released as an open-source project under the Apache 2.0 license, reflecting its early focus on integrating document and graph storage for more efficient data management.[13][14][15] The project's growth led to the formal establishment of ArangoDB GmbH in May 2014 by Weinberger, Celler, and Dohmen, marking the transition from a personal development effort to a commercial entity dedicated to further developing, maintaining, and supporting the database. This company formation in Cologne laid the groundwork for professionalizing the open-source project while continuing to foster community contributions.[16]
Funding and Growth
ArangoDB received its first external funding in February 2015 with a €1.85 million seed round led by Machao Holdings AG and triAGENS.[16] In June 2017, ArangoDB secured €4.2 million in seed funding led by Target Partners, with participation from CP Ventures and others, to accelerate its international expansion, particularly strengthening its presence in the US market.[17] This investment supported the company's efforts to build on its multi-model database foundation, originally developed from the open-source AvocadoDB project started in 2011.[18]

Building on this momentum, ArangoDB raised $10 million in a Series A round in March 2019, led by Bow Capital with involvement from Target Partners and existing investors.[19] The funds were allocated toward global expansion, including hiring additional engineering and sales personnel to meet rising demand for its native multi-model database and to drive product development.[20] This round coincided with the relocation of its headquarters to San Francisco, California, marking a key step in establishing a stronger foothold in the North American market while maintaining operations in Cologne, Germany.[21]

In October 2021, ArangoDB announced a $27.8 million Series B funding round led by Iris Capital, with participation from Bow Capital, Target Partners, and New Forge, bringing total financing to approximately $47 million.[22] The investment aimed to advance graph machine learning capabilities, enhance analytics and AI integrations, and support cloud-native services for enterprise-scale deployments.[23]

These funding rounds fueled significant organizational growth, including the expansion of the workforce to over 100 employees across three continents by 2023.[24] The company maintained its engineering hub in Cologne, Germany, while the San Francisco office served as the primary headquarters, enabling a distributed team to serve a global customer base in industries such as finance, healthcare, and technology.[25]
Major Releases and Milestones
ArangoDB's major releases have progressively enhanced its multi-model capabilities, performance, and integration with emerging technologies. Version 3.0, released in June 2016, marked a significant milestone by unifying document, graph, and key-value models into a single, cohesive architecture, enabling seamless queries across data types.[26] This release laid the foundation for ArangoDB's native multi-model support, allowing developers to mix and match data models without application-level sharding.[27]

Subsequent versions focused on scalability and advanced analytics. ArangoDB 3.8, generally available on July 29, 2021, introduced new graph algorithms, including support for weighted traversals and k-shortest paths, improving analytics at scale for complex networks.[28] In September 2022, version 3.10 added native ARM architecture support, broadening deployment options for edge and cloud environments, alongside computed values and automated graph sharding.[29] Version 3.11, released on May 30, 2023, optimized search and graph query performance with features like improved AQL execution and enhanced view management, boosting usability for large-scale data operations.[30] The 3.12 series, starting with its general availability on March 27, 2024, integrated vector search capabilities and AI-focused optimizations, such as improved memory accounting and parallel AQL execution, to support generative AI workloads. As of November 2025, the latest stable release is 3.12.6.1 from November 8, 2025, which includes enhancements to the Kubernetes operator for better orchestration in containerized environments.[26][31]

A key product milestone was the launch of ArangoDB Oasis, the company's managed cloud service, on November 20, 2019, simplifying deployment and scaling for multi-model databases across AWS and Google Cloud.
By 2025, ArangoDB emphasized generative AI integrations through the Arango AI Suite, featuring tools for multimodal data ingestion, LLM connectivity, and graph-powered RAG systems to enable contextual AI applications.[32] In October 2023, ArangoDB announced a shift in its licensing model to promote sustainability. Starting with version 3.12, the source code adopted the Business Source License (BSL) 1.1, while binaries fell under the ArangoDB Community License, which limits commercial use of the Community Edition to datasets under 100 GiB per cluster. This change drew criticism from parts of the open-source community for restricting commercial applications compared to the previous Apache 2.0 license.[33]

| Version | Release Date | Key Milestones |
|---|---|---|
| 3.0 | June 2016 | Unified multi-model architecture |
| 3.8 | July 2021 | Weighted graph traversals and analytics |
| 3.10 | September 2022 | ARM support and automated sharding |
| 3.11 | May 2023 | Search and graph performance enhancements |
| 3.12 | March 2024 | Vector search and AI optimizations |
Technical Architecture
Core Components
ArangoDB's storage engine is built on RocksDB, a persistent key-value store optimized for handling large datasets with fast read and write operations. It persists documents on disk while maintaining hot data in memory, using a log-structured merge-tree design to ensure efficient storage and recovery. The engine supports native handling of JSON documents in a schema-optional manner, allowing flexible, semi-structured data storage without rigid schema enforcement. Write-ahead logging (WAL) is employed for durability and replication, with WAL files typically sized around 64 MiB and configurable via options like --rocksdb.write-buffer-size. Compression using the LZ4 algorithm is enabled by default starting from level 2 of the storage hierarchy to optimize disk usage.[34]
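The write path described above (append to the WAL first, then update in-memory structures, with periodic flushes to disk) can be sketched in a few lines. This is a toy illustration of the general log-structured pattern, not ArangoDB's or RocksDB's actual internals; the class and method names are hypothetical.

```python
import json

class SketchStorageEngine:
    """Toy WAL + memtable write path, loosely mirroring the
    log-structured design described above. Hypothetical code."""

    def __init__(self):
        self.wal = []        # append-only log; lives on disk in a real engine
        self.memtable = {}   # hot data held in memory
        self.sstable = {}    # flushed, persistent key-value data

    def put(self, key, doc):
        # 1. Append to the write-ahead log first, for durability/recovery.
        self.wal.append(json.dumps({"key": key, "doc": doc}))
        # 2. Then update the in-memory memtable.
        self.memtable[key] = doc

    def flush(self):
        # Periodically merge the memtable down into persistent storage
        # (an LSM tree does this via compaction into sorted files).
        self.sstable.update(self.memtable)
        self.memtable.clear()

    def recover(self):
        # After a crash, replay the WAL to rebuild unflushed state.
        for entry in self.wal:
            record = json.loads(entry)
            self.memtable[record["key"]] = record["doc"]

engine = SketchStorageEngine()
engine.put("users/1", {"name": "Ada"})
engine.flush()                         # users/1 now in persistent storage
engine.put("users/2", {"name": "Grace"})  # still only in WAL + memtable
```

Because every write hits the log before the memtable, a crash between `put` and `flush` loses nothing: `recover` replays the log.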
The execution engine processes AQL (ArangoDB Query Language) queries by generating and optimizing execution plans through a cost-based optimizer. This optimizer creates multiple potential plans for a query, evaluates their estimated costs, and selects the one with the lowest cost to ensure efficient execution while preserving query semantics. Key optimization rules include index usage, filter removal when covered by indexes, and asynchronous prefetching to improve performance. Parallel execution is supported, particularly in distributed environments, using nodes like ScatterNode and GatherNode to distribute and collect data across shards, though core plan optimization occurs even in standalone setups. The engine represents queries as pipelines of execution nodes, such as IndexNode for index scans and ReturnNode for result output, enabling targeted optimizations like index-only or scan-only paths.[35]
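The cost-based selection step can be sketched as follows. The plan node names echo those mentioned above, but the candidate plans and cost figures are invented for illustration; the real optimizer estimates costs from statistics and applies many more rewrite rules.

```python
# Minimal sketch of cost-based plan selection: generate candidate
# execution plans, estimate a cost for each, keep the cheapest.
# Costs here are made-up numbers, not ArangoDB's estimates.

def candidate_plans(has_index):
    plans = [{"nodes": ["EnumerateCollectionNode", "FilterNode", "ReturnNode"],
              "cost": 1000.0}]  # baseline: full collection scan + filter
    if has_index:
        # An index allows replacing the scan+filter pair with an IndexNode
        # (the "filter removal when covered by indexes" rule above).
        plans.append({"nodes": ["IndexNode", "ReturnNode"], "cost": 40.0})
    return plans

def pick_plan(plans):
    # The optimizer keeps the plan with the lowest estimated cost.
    return min(plans, key=lambda p: p["cost"])

best = pick_plan(candidate_plans(has_index=True))
print(best["nodes"])  # ['IndexNode', 'ReturnNode']
```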
ArangoDB provides several index types to accelerate data retrieval, all integrated with the RocksDB storage engine for persistence. Persistent indexes serve as the primary type for equality matches, range queries, and sorting, offering logarithmic time complexity and supporting options like sparsity control and caching; hash and skiplist indexes are legacy aliases for this type and are no longer recommended for new implementations. Full-text indexes enable word-based searches on attributes, supporting prefix and exact word matching, though they are deprecated since version 3.10 in favor of the more advanced ArangoSearch views. Geo-spatial indexes facilitate location-based queries, such as radius searches or nearest-neighbor lookups, using 2D coordinates or GeoJSON objects, and are invoked via specific AQL functions or automatic optimization. All these indexes are stored on disk with in-memory caches configurable via parameters like --cache.size and --rocksdb.block-cache-size to balance performance and resource usage.[36]
The transaction manager in ArangoDB ensures ACID compliance for operations spanning multiple collections and graphs by leveraging RocksDB's built-in transaction capabilities. For standalone AQL queries, it implements atomicity, consistency, isolation, and durability, where changes are isolated until commit and persisted via WAL for recovery. Transactions can involve multiple document collections, treating graphs as interconnected collections to maintain integrity across edges and vertices. Stream transactions allow explicit begin/commit/abort control for multi-document operations, while JavaScript transactions (deprecated in version 3.12) provide a programmatic interface with automatic commit handling. Durability is configurable, but committed changes are guaranteed to survive server restarts.[34][37]
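The all-or-nothing behaviour of a stream transaction (begin, stage writes in isolation, then commit or abort) can be sketched like this. The class and the plain-dict store are hypothetical stand-ins for illustration, not the ArangoDB transaction API.

```python
class SketchTransaction:
    """Toy begin/stage/commit/abort flow mirroring the stream-transaction
    semantics described above. Hypothetical code."""

    def __init__(self, store):
        self.store = store
        self.staged = {}   # writes isolated from the store until commit

    def insert(self, key, doc):
        self.staged[key] = doc          # visible only inside this transaction

    def commit(self):
        self.store.update(self.staged)  # publish all staged writes at once
        self.staged = {}

    def abort(self):
        self.staged = {}                # discard everything; store untouched

db = {}
txn = SketchTransaction(db)
# A graph update touches both a vertex collection and an edge collection;
# atomicity means both documents appear, or neither does.
txn.insert("vertices/a", {"v": 1})
txn.insert("edges/a-b", {"_from": "vertices/a", "_to": "vertices/b"})
txn.commit()
```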
Clustering and Scaling
ArangoDB achieves distributed deployment through its Cluster mode, which distributes data across multiple nodes using automatic sharding and synchronous leader-follower replication to ensure high availability and fault tolerance.[38] In this setup, collections are partitioned into shards based on a configurable shard key, typically the document's _key field via consistent hashing, allowing data to be evenly spread across DB-Server nodes without manual intervention.[39] Each shard maintains one leader replica responsible for handling writes, with one or more follower replicas that synchronously replicate changes to maintain consistency; the replication factor, set per collection, determines the total number of copies (e.g., 3 for one leader and two followers).[40]
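Hash-based shard assignment of the kind described can be sketched as follows, assuming a simple hash-modulo scheme; ArangoDB's actual hash function and shard bookkeeping differ, but the routing property is the same.

```python
import hashlib

def shard_for(key, num_shards):
    """Assign a document to a shard by hashing its shard key (by default
    the _key attribute). Illustrative MD5-mod scheme, not ArangoDB's
    actual implementation."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

# The same key always maps to the same shard, so a Coordinator can
# route a lookup by _key directly to the responsible DB-Server.
assert shard_for("users/12345", 4) == shard_for("users/12345", 4)

# Hashing spreads many keys roughly evenly across the shards.
shards = {shard_for(f"users/{i}", 4) for i in range(1000)}
print(sorted(shards))
```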
The system supports both active-passive and active-active configurations for resilience. In active-passive setups, such as the deprecated Active Failover mode for single-server instances, one active leader handles operations while passive followers asynchronously replicate data for automatic failover.[41] For active-active clustering in distributed environments, particularly in the Enterprise Edition, datacenter-to-datacenter replication enables bidirectional synchronization across geographically separated clusters, allowing read and write operations from multiple active sites.[42] Leader election occurs automatically if a leader fails, with configurable timeouts (e.g., 15 seconds), ensuring minimal downtime through the resilient Agency component that coordinates the cluster using Raft consensus.[38]
Horizontal scaling in ArangoDB is achieved by dynamically adding DB-Server nodes, which triggers shard rebalancing to distribute load evenly and increase overall throughput linearly with the number of nodes; the architecture has no inherent limits on scalability, supporting hundreds of DB-Servers and Coordinators constrained only by hardware resources like CPU, memory, and network bandwidth.[43] This enables handling large-scale workloads, such as terabyte-sized datasets or high query volumes, by scaling out across commodity hardware while maintaining performance through the stateless Coordinator nodes that route client requests.[44]
To address challenges in geo-distributed data access, ArangoDB introduces satellite collections, which replicate an entire collection synchronously to every DB-Server node in the cluster, allowing joins with sharded data to execute locally on each node and minimizing cross-node network traffic—ideal for scenarios requiring low-latency operations across distributed locations.[45] Complementing this, SmartJoins optimize cross-shard queries by enforcing identical sharding on related collections (via the distributeShardsLike property), enabling the query optimizer to perform co-located joins without routing data through the Coordinator, thus reducing latency and inter-node communication for complex operations like graph traversals or analytical joins.[46]
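The co-location idea behind SmartJoins can be illustrated in miniature: when two collections are sharded by the same key function, as distributeShardsLike enforces, a document and all documents related to it land on the same shard, so the join can execute locally. The collection names, keys, and hash scheme below are illustrative.

```python
import hashlib

def shard_for(key, num_shards=4):
    # Toy hash-modulo sharding, as a stand-in for the real scheme.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_shards

# Imagine 'orders' sharded by its customer key, mirroring how
# 'customers' is sharded by _key (distributeShardsLike in miniature).
customer_key = "c42"
order_shard_keys = [customer_key, customer_key, customer_key]

# Every order shares a shard with its customer, so joining
# customers-to-orders needs no cross-shard network traffic.
same_shard = all(
    shard_for(k) == shard_for(customer_key) for k in order_shard_keys
)
print(same_shard)  # True
```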
Deployment and management of scaled clusters are streamlined in containerized environments via the ArangoDB Kubernetes Operator (kube-arangodb), which received enhancements in version 3.12 and automates provisioning, scaling, backups, and failover handling within Kubernetes clusters, supporting elastic resource allocation and seamless integration with cloud-native infrastructures.[47]
Features
Data Models Supported
ArangoDB supports multiple native data models, allowing users to store and query data in key-value, document, graph, and vector formats within the same database instance. This multi-model approach enables seamless integration across models without data duplication or complex ETL processes.[48] The document model in ArangoDB is based on JSON objects stored in collections, supporting nested structures and flexible schemas without rigid upfront definitions. Documents can contain structured or semi-structured data, with each document being self-contained and capable of having unique attributes. This model facilitates granular queries on individual attributes, aggregation operations, and the use of secondary indexes for efficient retrieval. For example, a document might represent a user profile with embedded arrays for preferences, allowing direct access to nested elements.[48][49] The key-value model serves as a foundational subset of the document model, providing simple persistent storage where each entry is identified by an immutable string key (_key). It leverages a primary index on the key for fast lookups and includes a unique identifier (_id) in the format <collection>/<key>. This model is particularly suited for caching scenarios, with support for time-to-live (TTL) settings to automatically expire entries after a specified duration. Users can store arbitrary JSON values associated with keys, enabling straightforward get, set, and delete operations.[48][50]
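The key-value usage pattern with TTL expiry can be sketched as follows; the store class is a hypothetical stand-in (ArangoDB implements expiry server-side via TTL indexes rather than per-read checks), but the observable behaviour is similar.

```python
import time

class SketchKVStore:
    """Toy key-value store with per-entry TTL expiry, mirroring the
    caching pattern described above. Hypothetical code."""

    def __init__(self):
        self._data = {}   # key -> (value, expiry timestamp or None)

    def set(self, key, value, ttl=None):
        expires = time.time() + ttl if ttl is not None else None
        self._data[key] = (value, expires)

    def get(self, key):
        value, expires = self._data.get(key, (None, None))
        if expires is not None and time.time() >= expires:
            del self._data[key]   # lazily expire on access
            return None
        return value

store = SketchKVStore()
store.set("session/abc", {"user": "ada"}, ttl=3600)  # expires in an hour
store.set("permanent", "kept")                       # no expiry
```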
ArangoDB's graph model employs a property graph structure, consisting of vertices (nodes as documents) and edges (documents with _from and _to attributes linking vertices). Edges are directed, supporting traversals in outbound, inbound, or bidirectional directions. Native graph algorithms, such as shortest path and neighborhood queries, are built-in for efficient pattern matching and relationship analysis. For instance, in a social network graph, vertices could represent users, and edges could denote friendships, allowing queries to traverse multi-hop connections. These models can be queried across boundaries using AQL.[48][51]
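A breadth-first OUTBOUND traversal over edge documents of the shape described (edges carrying _from and _to vertex identifiers) can be sketched in plain Python; the vertices and edges are illustrative, and a real traversal would use edge indexes rather than scanning a list.

```python
from collections import deque

# Edge documents in ArangoDB style: _from and _to hold vertex _id values.
edges = [
    {"_from": "users/alice", "_to": "users/bob"},
    {"_from": "users/bob", "_to": "users/carol"},
    {"_from": "users/carol", "_to": "users/dave"},
]

def outbound_neighbors(vertex):
    return [e["_to"] for e in edges if e["_from"] == vertex]

def traverse(start, max_depth):
    """Breadth-first OUTBOUND traversal up to max_depth hops,
    sketching what an AQL graph traversal computes."""
    seen, result = {start}, []
    frontier = deque([(start, 0)])
    while frontier:
        vertex, depth = frontier.popleft()
        if depth == max_depth:
            continue
        for nxt in outbound_neighbors(vertex):
            if nxt not in seen:
                seen.add(nxt)
                result.append(nxt)
                frontier.append((nxt, depth + 1))
    return result

print(traverse("users/alice", 2))  # ['users/bob', 'users/carol']
```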
Introduced in version 3.12.4, the vector model enables storage of embeddings—arrays of numerical vectors generated by machine learning models to capture semantic meanings—as attributes within documents. These embeddings support similarity searches using indexes powered by the Faiss library, with configurable distance metrics like cosine similarity, inner product, or L2 distance. Vector indexes must be created on pre-populated data, and new embeddings are dynamically assigned to clusters for ongoing searches. This model integrates natively with graph and document structures, allowing hybrid queries that combine semantic similarity with relational traversals, such as retrieving similar documents connected via graph edges in AI-driven applications.[52][53]
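Cosine similarity, one of the metrics listed above, can be computed directly to show what a vector ranking does. The embeddings below are tiny toy vectors, and the real index delegates nearest-neighbor search to Faiss rather than scanning every document.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy 3-dimensional embeddings stored as document attributes
# (document keys and values are illustrative).
docs = {
    "articles/graphs": [0.9, 0.1, 0.0],
    "articles/cooking": [0.0, 0.2, 0.9],
    "articles/networks": [0.8, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]

# Rank documents by similarity to the query embedding, best first.
ranked = sorted(docs, key=lambda k: cosine_similarity(query, docs[k]),
                reverse=True)
print(ranked[0])  # articles/graphs
```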
Query Language and Processing
ArangoDB's primary query interface is the ArangoDB Query Language (AQL), a declarative language designed for manipulating data across document, graph, and key-value models within a unified syntax.[54] AQL allows users to express desired results using SQL-like constructs, including operations for reading, writing, and modifying data without specifying the underlying execution details. It supports joins to combine data from multiple collections, subqueries for nested logic, and graph traversals to navigate relationships, enabling complex queries like finding connected components or shortest paths in a single statement.[55] For example, a traversal query might use the FOR ... IN ... GRAPH syntax to explore edges from a starting vertex, applying filters and options for direction, depth, and uniqueness.[55]
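A traversal query of the shape just described might look like the following string, written as a client driver would submit it to the server. The graph name 'social', the filter attribute, and the bind parameter value are all illustrative.

```python
# An AQL graph traversal embedded as a Python string, as a driver
# would send it. Graph name and filter are hypothetical examples.
aql = """
FOR vertex, edge, path
  IN 1..3 OUTBOUND @start GRAPH 'social'
  OPTIONS { uniqueVertices: 'path' }
  FILTER vertex.active == true
  RETURN vertex._key
"""
# Bind parameters (see below) keep user input out of the query text.
bind_vars = {"start": "users/alice"}
print(aql.strip().startswith("FOR"))  # True
```

The `1..3` range bounds the traversal depth, `OUTBOUND` fixes the direction, and `OPTIONS` controls uniqueness, matching the direction/depth/uniqueness options mentioned above.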
Query processing in ArangoDB begins with parsing the AQL statement on the server, followed by optimization to generate an efficient execution plan. The optimizer employs cost-based planning, evaluating multiple potential plans and selecting the one with the lowest estimated cost based on heuristics such as data access patterns and index usage.[35] Early pruning is achieved through rules that reposition filters closer to data sources, reducing the volume of intermediate results; for instance, the move-filters-up rule shifts conditions before joins or traversals.[35] Parallel execution is facilitated in clustered environments via rules like async-prefetch, which enables asynchronous loading of data batches, and parallelize-gather, which distributes computation across shards for scalable performance across data models.[35]
To handle large datasets securely and efficiently, AQL incorporates bind parameters for injecting values into queries, preventing injection attacks while allowing parameterized reuse.[56] Parameters are denoted with @ for values (e.g., FOR doc IN collection FILTER doc.age > @minAge RETURN doc) or @@ for collection names, passed separately via APIs like bindVars.[56] Results are streamed using cursor-based interfaces, which return data in configurable batches (via batchSize) rather than loading everything into memory at once.[57] This streaming mode, enabled with the stream: true option, processes results lazily on the server, minimizing memory overhead for voluminous outputs and supporting iterative client-side consumption through subsequent cursor requests.[57]
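The batched cursor behaviour can be simulated in a few lines: the server hands back fixed-size batches and the client consumes them one at a time, so the full result set never has to sit in client memory at once. This is a pure-Python sketch of the mechanism, not a real driver call.

```python
def server_cursor(results, batch_size):
    """Yield successive batches of results, as an AQL cursor endpoint
    would return them for a given batchSize. Illustrative only."""
    for i in range(0, len(results), batch_size):
        yield results[i:i + batch_size]

# Ten result documents, fetched in batches of four: the client makes
# an initial request plus follow-up cursor requests for later batches.
results = [{"_key": str(i)} for i in range(10)]
batches = list(server_cursor(results, batch_size=4))
print([len(b) for b in batches])  # [4, 4, 2]
```

With stream mode, the server additionally computes each batch lazily instead of materializing the whole result first; the client-side iteration pattern stays the same.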