Log-structured merge-tree

The log-structured merge-tree (LSM-tree) is a disk-based indexing structure designed to enable efficient, low-cost insertions and deletions for files experiencing high rates of update operations. It batches changes in memory and cascades them through a series of sorted, immutable components on disk via periodic merge processes. Introduced in 1996 by Patrick O'Neil, Elizabeth O'Neil, Edward Cheng, and Dieter Gawlick, the LSM-tree addresses the limitations of traditional structures like B-trees, which suffer from expensive random disk I/O during frequent updates, by instead favoring sequential appends to log files and background merging to maintain sorted order. At its core, an LSM-tree organizes data across multiple levels: an in-memory write buffer (often called a memtable) captures incoming writes until it reaches capacity, at which point it is flushed to an immutable on-disk component (typically a sorted string table, or SSTable), and subsequent merges propagate data to deeper, larger levels to resolve overlaps and deletions. This tiered architecture, with each level exponentially larger than the previous, minimizes random disk I/O while supporting tunable trade-offs between write throughput and read performance through compaction strategies such as leveled or tiered merging.

The LSM-tree's primary advantages include dramatically reduced I/O costs for insert-heavy workloads (up to two orders of magnitude lower than B-trees), owing to its reliance on multi-page block reads and sequential writes; this makes it particularly suitable for applications such as time-series databases, audit trails, and high-volume logging. However, it incurs higher read latencies for point queries and scans, since these may need to search multiple components even with Bloom filters to skip irrelevant data, and compactions can introduce temporary write stalls or space overhead from duplicates that persist until merging.

Since its inception, the LSM-tree has become foundational to numerous production storage systems, powering write-optimized key-value stores such as Google's LevelDB (2011), Facebook's RocksDB (2012), and Apache Cassandra (2008), where it enables scalable handling of petabyte-scale datasets with sustained ingestion rates exceeding millions of operations per second. Ongoing research continues to refine LSM-tree variants, incorporating techniques such as learned indexes to mitigate amplification issues and adapt to flash-based storage.

Overview

Definition

A log-structured merge-tree (LSM-tree) is a disk-based data structure designed to provide low-cost indexing for files with high rates of record inserts and deletes over extended periods. It achieves this by deferring and batching index changes, cascading them from a memory-based component through one or more disk components in a manner similar to merge sort, thereby optimizing for sequential write operations on disk. At its core, the LSM-tree trades increased read latency for superior write efficiency through a tree-like hierarchy of levels, each containing multiple immutable sorted string tables (SSTables) that store key-value pairs in sorted order. This structure appends new writes sequentially to avoid random I/O, deferring reorganization to background processes that merge data across levels to keep the structure stable. The LSM-tree guarantees correct results for reads by propagating updates through merges, though point queries may require scanning and merging data from multiple levels to retrieve the most current value. In its basic architecture, recent writes are buffered in an in-memory memtable; once full, it is flushed to disk as an immutable SSTable in level 0 (L0), with subsequent levels (L1 to Ln) holding progressively larger, more stable datasets formed by merging lower-level SSTables. Each level is typically 10 times larger than the preceding one, ensuring efficient scaling for write-heavy workloads.
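
As a back-of-the-envelope sketch of this geometric sizing (assuming a memtable of size M, a size ratio T between adjacent levels so that level i holds roughly M \cdot T^i bytes, and a total dataset of size D; these symbols are illustrative and not taken from the original paper):

|L_i| \approx M \cdot T^{i}, \qquad L_{\max} \approx \left\lceil \log_T \frac{D}{M} \right\rceil

For example, with M = 64 MB and T = 10, a 640 GB dataset gives D/M \approx 10^4, so roughly four disk levels suffice, which is why practical deployments rarely need more than a handful of levels.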

Motivation

Traditional B-tree indexes exhibit significant overhead in environments with high random write rates, as each update often triggers page splits, index rebalancing, and multiple random disk I/Os (typically O(log N) operations per write on rotating magnetic media), leading to inefficient performance for insert-heavy workloads. The log-structured merge-tree (LSM-tree) was designed to mitigate these issues through a log-structured approach that favors sequential writes to disk, achieving an amortized write cost close to O(1) per operation and minimizing seek times. This strategy is especially advantageous for flash and SSD-based storage, where random writes cause uneven wear on flash cells and reduce lifespan, enabling LSM-trees to sustain higher write throughput. In exchange for these write optimizations, LSM-trees incur higher read complexity by requiring the merging of data from multiple sorted components on disk, but this trade-off yields write performance 10-100 times faster than B-trees for append-dominated scenarios. LSM-trees target workloads involving massive data ingestion, such as time-series monitoring, event logging, and databases handling millions of writes per second. A key comparison metric is space amplification: relative to the logical data size, LSM-trees can exhibit roughly 1.1x under leveled compaction and 2x or more under tiered compaction, owing to temporary key duplicates that arise during the asynchronous merging process, though compaction removes these over time.

History

Invention

The Log-Structured Merge-Tree (LSM-tree) was proposed in 1996 by Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil as a disk-based indexing structure designed to support high rates of insert and delete operations in database systems. This work originated from a 1991 technical report by the same authors in the Mathematics and Computer Science Department and was formally published in the journal Acta Informatica. The proposal aimed to overcome the write bottlenecks inherent in traditional B-tree indexes, which require frequent random I/O updates that scale poorly with transaction volumes.

The invention was motivated by the need to handle high-volume updates in online transaction processing (OLTP) environments, such as those measured by the TPC-A benchmark, where tables like the History table could grow rapidly (potentially adding millions of entries per day) without incurring the full cost of immediate index maintenance. Drawing inspiration from log-structured file systems (LFS), introduced by Mendel Rosenblum and John K. Ousterhout in 1992, the LSM-tree adapts the idea of sequential logging to indexed data by deferring and batching changes to amortize I/O costs across multiple operations. In LFS, all file system modifications are written sequentially to a log on disk, minimizing seek times and enabling efficient garbage collection; the LSM-tree extends this to key-value indexing by combining an in-memory buffer with a multi-level merge process.

Key innovations include an in-memory buffer called the C0 component (analogous to a memtable in modern implementations) that accumulates recent writes before flushing them to immutable disk files in levels C1, C2, and beyond, where older data is merged in a rolling fashion to resolve overlaps and deletions. This structure leverages sequential multi-page writes, similar to LFS block sizes, to reduce random I/O, achieving an expected I/O cost per insert of approximately 2 × (sequential page I/O cost) / (pages per block), compared to the higher random I/O demands of B-trees. For space usage, the design anticipates efficient storage for growing datasets; for instance, indexing 576 million entries in a TPC-A-like History table over 20 days requires about 9.2 GB, including redundancy from unmerged levels that is progressively resolved. These bounds highlight the LSM-tree's ability to trade some space for significantly improved write throughput in write-heavy workloads.

Evolution

Following the initial proposal of LSM-trees in the mid-1990s, their practical adoption accelerated in the 2000s through integration into distributed storage systems. Google's Bigtable, described publicly in 2006, was among the first major implementations, employing LSM-tree structures to handle structured data at petabyte scale across commodity hardware clusters. This approach influenced open-source projects like Apache HBase, which modeled its column-family storage on Bigtable's LSM-based design for scalable, random read-write access to large datasets. Similarly, Apache Cassandra, launched in 2008, adopted LSM principles with SSTables for decentralized, high-availability storage, enabling fault-tolerant writes in wide-column databases.

Refinements in the late 2000s and early 2010s focused on optimizing compaction and read efficiency. Earlier LSM designs, such as Bigtable's size-tiered compaction, prioritized write throughput but suffered from higher space amplification due to overlapping key ranges across tiers. In 2011, Google's LevelDB introduced leveled compaction, which enforces non-overlapping key ranges per level to cap space usage at approximately 10% amplification while trading off some write cost. Concurrently, Bloom filters emerged as a key read optimization, integrated into production LSM systems around 2008-2009 to probabilistically filter absent keys before SSTable scans, reducing unnecessary I/O by up to 90% in point lookups.

The 2010s saw broader enhancements for production workloads, particularly on flash storage. Facebook's RocksDB, forked from LevelDB in 2012, added configurable compaction styles (including leveled and universal variants), per-column-family compression (e.g., Snappy or LZ4), and SSD-specific tuning to lower write amplification from 10-20x to under 5x in typical setups by optimizing merge scheduling and buffer management. These features addressed flash endurance limits, enabling LSM-trees in high-velocity applications like social graphs.

Recent developments through 2025 have emphasized hardware-aware designs and intelligent structures for cloud-native storage. Integration with NVMe SSDs has reduced tail latencies by exploiting parallel I/O queues, as demonstrated in 2021 designs achieving 2-5x faster throughput over SATA-based LSM stores without sacrificing durability. Hybrid models blending LSM persistence with in-memory caching gained traction, notably in Apache Ignite 3 (2021), which layers RocksDB-backed LSM-trees under an in-memory data grid for unified transactional and analytical processing at terabyte scales. Meanwhile, research from 2020-2024 on learned indexes has targeted lookup and merge overhead; for instance, the 2020 Bourbon system uses learned models to approximate key positions within LSM runs, improving point lookup performance by 1.23× to 1.78× while maintaining query accuracy. In 2025, further advances include HotRAP, for efficient hot-data retention and promotion in tiered LSM-trees with under 4% overhead on uniform workloads, and TieredKV, a tiered LSM design integrating learned indexes for higher throughput and reduced amplification. These advances have extended LSM-trees' viability to elastic, multi-tenant cloud databases, filling gaps in post-2010 research for diverse storage media.

Data structure

Memtable

The memtable serves as the primary in-memory write buffer in an LSM-tree, capturing incoming writes at high throughput without immediate disk I/O. It is typically realized as a sorted in-memory structure, such as a skip list, red-black tree, or other balanced search tree, which stores key-value pairs in sorted order, supports insertions in logarithmic time, and allows efficient range scans. These structures ensure that data remains ordered by key, facilitating subsequent merging with on-disk components.

In the write path, updates are applied directly to the memtable: insertions add new key-value pairs, updates overwrite existing entries for the same key, and deletions insert a tombstone to mark the key as invalid without immediate removal. This approach avoids costly in-place modifications on disk, converting random writes into efficient in-memory operations. The memtable is engineered for thread safety, often using lock-free or fine-grained locking mechanisms to handle concurrent writes from multiple threads without significant contention. Durability is maintained by first logging each write to a write-ahead log (WAL) on persistent storage before committing it to the memtable, enabling recovery of unpersisted data in the event of a crash by replaying the WAL.

Common configurations set the memtable capacity between 64 MB and 256 MB, balancing memory footprint with performance; RocksDB, for instance, defaults to 64 MB per memtable. Upon reaching capacity, the active memtable is sealed as immutable, a new mutable memtable is initialized for ongoing writes, and the immutable one is flushed to disk as a level-0 sorted string table (SSTable). This process triggers a background flush without blocking new writes, though frequent flushes can introduce temporary stalls if memory pressure causes multiple immutable memtables to accumulate. The memtable's size configuration embodies key memory trade-offs: larger allocations decrease flush frequency and mitigate write stalls by amortizing I/O overhead, but they increase memory consumption and potential recovery time after failures. Optimizing this balance is critical for workloads with varying write rates, as undersized buffers force frequent flushes that amplify disk writes, while oversized buffers strain available memory.
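
The following minimal Python sketch illustrates the memtable behavior described above. The class name, the sorted-list representation, and the TOMBSTONE sentinel are illustrative assumptions rather than the design of any particular engine; production systems use skip lists or balanced trees with fine-grained concurrency control.

```python
import bisect

TOMBSTONE = object()  # sentinel marking a deleted key until compaction removes it

class Memtable:
    """Sorted in-memory write buffer (illustrative; real engines use skip lists)."""

    def __init__(self, capacity_bytes=64 * 2**20):   # 64 MB, a common default
        self.keys = []        # keys kept in sorted order for range scans and flushes
        self.entries = {}     # key -> latest value (or TOMBSTONE)
        self.size = 0
        self.capacity = capacity_bytes

    def put(self, key, value):
        if key not in self.entries:
            bisect.insort(self.keys, key)             # maintain key order
        self.entries[key] = value
        self.size += len(key) + (len(value) if isinstance(value, (str, bytes)) else 0)

    def delete(self, key):
        self.put(key, TOMBSTONE)                      # deletes are ordinary inserts

    def get(self, key):
        return self.entries.get(key)                  # None: absent here; TOMBSTONE: deleted

    def is_full(self):
        return self.size >= self.capacity             # caller seals and flushes when True

    def sorted_items(self):
        """Entries in key order, ready to be written out as an L0 SSTable."""
        return [(k, self.entries[k]) for k in self.keys]
```

Once `is_full()` returns true, the caller would seal this instance, swap in a fresh memtable, and hand the sealed one to a background flush thread.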

SSTables and levels

SSTables, or Sorted String Tables, serve as the fundamental on-disk storage units in LSM-tree implementations, consisting of immutable, append-only files that store key-value pairs in sorted order by key. Each SSTable is typically sized between 1 MB and 64 MB to balance I/O efficiency and manageability, and it is divided into fixed-size data blocks (often around 4 KB) for sequential reads, with blocks compressed using algorithms such as Snappy or LZ4 to reduce storage overhead. Once written, an SSTable is never modified, ensuring write-once semantics that align with the log-structured nature of the LSM-tree.

In the LSM-tree's hierarchical structure, SSTables are organized into multiple levels, denoted L0 through L_max (commonly up to 7-10 levels), where each level represents a progressively larger and more stable layer of data. Level 0 (L0) contains SSTables flushed directly from the in-memory memtable, and these files may overlap in their key ranges, potentially covering the same keys multiple times due to recent writes. From L1 onward, SSTables are non-overlapping within a level, with each subsequent level designed to hold approximately 10 times the total data of the previous one, creating an exponentially growing hierarchy that accommodates the full dataset while minimizing redundancy. This tiered arrangement, inspired by the component hierarchy in the original LSM design (C0 in memory, C1 and beyond on disk), ensures that newer data resides in upper levels and older, more persistent data in lower ones.

Data placement across levels follows a range-partitioning scheme, where keys are divided into disjoint intervals that cover the entire key space. In L0, the overlapping layout allows quick ingestion of new data without immediate reorganization, while L1 and beyond enforce strict non-overlap, with each SSTable responsible for a key range that tiles the key space without gaps or significant redundancy. Higher levels thus provide broad coverage of the key space with minimal duplication, facilitating scalable storage as the dataset grows.

For efficient querying, each SSTable includes internal indexing structures, such as block indexes that map keys to specific data blocks, enabling fast seeking without scanning the entire file. The file concludes with a footer containing metadata such as block offsets, checksums, and pointers to auxiliary structures, while optional Bloom filters (probabilistic membership structures) are embedded to quickly determine whether a key is absent from the SSTable, avoiding unnecessary I/O for irrelevant files. These mechanisms, building on the sorted file indexes of early LSM designs, support efficient, near-logarithmic access by allowing reads to probe only a small number of SSTables per level.

The space usage of this leveled organization yields a total amplification factor equal to the combined size of all levels divided by the size of the unique data, S = \frac{\sum_{i=0}^{L_{\max}} \text{size}_i}{\text{unique data size}}, which typically ranges from about 1.1x under leveled compaction to 2x or more under tiered schemes, depending on update frequency and compaction efficiency. This overhead arises from temporary duplicates during merging but remains modest compared to alternatives like B-trees, since exponential sizing ensures that the lowest level dominates storage without proportional redundancy.
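
A simplified sketch of SSTable lookups using a sparse block index and a membership filter follows. The class name and the in-memory representation are hypothetical; real SSTables keep the blocks, index, Bloom filter, and footer on disk, usually compressed.

```python
from bisect import bisect_right

class SSTable:
    """Immutable sorted table split into fixed-size blocks with a sparse index."""

    def __init__(self, items, block_size=4):
        items = sorted(items)                              # (key, value) pairs
        self.blocks = [items[i:i + block_size]             # small data blocks (~4 KB on disk)
                       for i in range(0, len(items), block_size)]
        self.index = [blk[0][0] for blk in self.blocks]    # first key of each block
        self.keys = frozenset(k for k, _ in items)         # stand-in for a Bloom filter

    def might_contain(self, key):
        # A Bloom filter answers this probabilistically with a small false-positive rate.
        return key in self.keys

    def get(self, key):
        if not self.might_contain(key):
            return None                                    # skip without reading data blocks
        i = bisect_right(self.index, key) - 1              # candidate block via the sparse index
        if i < 0:
            return None
        for k, v in self.blocks[i]:                        # scan only one block
            if k == key:
                return v
        return None

table = SSTable([("apple", "1"), ("banana", "2"), ("cherry", "3"), ("date", "4"), ("fig", "5")])
print(table.get("cherry"), table.get("grape"))             # -> 3 None
```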

Merging process

The merging process in a log-structured merge-tree (LSM-tree) operates as an incremental, rolling merge analogous to external merge sort, in which multiple sorted runs of key-value pairs are combined across levels, resolving duplicates and tombstones by retaining only the most recent value for each key. Level transitions are triggered when the size of level L_i exceeds a predefined threshold, typically around 10 times the maximum size of level L_{i-1}, prompting a compaction that selects SSTables from L_i and merges them with the overlapping SSTables in L_{i+1} to propagate data downward. In level L0, key-range overlaps are common and tolerated across many SSTables, with read operations resolving them by selecting the latest value; in contrast, higher levels under leveled compaction maintain disjoint, non-overlapping key ranges that are fully sorted to avoid such overlaps. This multi-level propagation results in an amortized I/O write cost per entry of O(\log N / B), where N is the total number of keys and B is the block size, since each entry is rewritten roughly once per level during merges. LSM-tree implementations vary in merging strategy: tiered compaction permits multiple overlapping runs per level to favor write throughput, while leveled compaction enforces a single sorted run per level for better space efficiency and read performance (both are detailed under Implementations). In either case the process executes asynchronously in background threads so that it does not block foreground write operations.
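
The following sketch shows the core of such a merge: sorted runs are combined newest-first, only the most recent version of each key is kept, and tombstones are dropped once the merge reaches the bottom level. The TOMBSTONE sentinel and function names are illustrative assumptions, not taken from any particular implementation.

```python
import heapq

TOMBSTONE = None  # deletion marker; production systems use an explicit typed marker

def merge_runs(runs, bottom_level=False):
    """Merge sorted runs (newest first) into one sorted run, keeping only the
    most recent version of each key and dropping tombstones at the bottom level."""
    # Tag each entry with its run's age so that, for equal keys, the newest sorts first.
    tagged = ([(key, age, value) for key, value in run]
              for age, run in enumerate(runs))
    merged, last_key = [], object()
    for key, _age, value in heapq.merge(*tagged):
        if key == last_key:
            continue                      # older duplicate of a key already handled
        last_key = key
        if value is TOMBSTONE and bottom_level:
            continue                      # deletion fully resolved at the last level
        merged.append((key, value))
    return merged

# Example: the newer run overrides the older one, and the delete of "b" survives
# until the bottom level, where it is discarded.
newer = [("a", "2"), ("b", TOMBSTONE)]
older = [("a", "1"), ("b", "0"), ("c", "9")]
print(merge_runs([newer, older]))                     # [('a', '2'), ('b', None), ('c', '9')]
print(merge_runs([newer, older], bottom_level=True))  # [('a', '2'), ('c', '9')]
```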

Operations

Write operation

In an LSM-tree, the write operation for inserting or updating a key-value pair begins by appending the entry to a write-ahead log (WAL) on disk to ensure durability in case of a crash. This step guarantees that the data can be recovered and replayed into the in-memory structure if needed, providing persistence without immediate random I/O overhead. Next, the entry is inserted into the memtable, an in-memory sorted structure such as a skip list or red-black tree, which supports efficient lookups and updates. The insertion is O(log M), where M is the number of entries in the memtable, enabling low-latency operations typically under 1 ms. Updates to existing keys are handled by overwriting the prior value in the memtable, while deletes insert a special deletion marker known as a tombstone, which logically removes the key without immediate physical erasure.

When the memtable reaches a configured size threshold, it is frozen (made immutable to new writes) and a new active memtable is created to continue accepting incoming operations, ensuring concurrency without blocking. The frozen memtable is then flushed sequentially to disk as a new immutable sorted string table (SSTable) at level 0 (L0), amortizing the cost over batched writes and avoiding random I/O. This sequential flushing contributes to high write throughput by leveraging the storage device's strength in sequential writes. The write amplification (WA), defined as the ratio of total bytes written to disk versus user data bytes, arises from repeated copying during subsequent merges and is approximated by \text{WA} \approx \log_L N, where L is the size ratio (fanout) between levels, typically 10, and N is the total number of keys in the tree. This metric highlights the efficiency of the write path, though it grows with tree size. Subsequent compaction of these L0 SSTables into higher levels manages overlap but is handled separately.
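
A condensed sketch of this write path, with WAL-before-memtable ordering and a flush to an L0 run, is shown below. The file format, class names, and the dict-based memtable are simplifying assumptions, not the API of any real store.

```python
import json, os

class WriteAheadLog:
    """Append-only log; each record is flushed and fsync'd before acknowledgment."""
    def __init__(self, path):
        self.file = open(path, "a", encoding="utf-8")

    def append(self, key, value):
        self.file.write(json.dumps({"k": key, "v": value}) + "\n")
        self.file.flush()
        os.fsync(self.file.fileno())       # durable before the in-memory insert

class SimpleLSMWriter:
    def __init__(self, wal_path, flush_threshold=4):
        self.wal = WriteAheadLog(wal_path)
        self.memtable = {}                 # stand-in for a sorted structure (skip list, etc.)
        self.flush_threshold = flush_threshold
        self.l0_runs = []                  # immutable sorted runs flushed to "disk"

    def put(self, key, value):
        self.wal.append(key, value)        # 1. log first, so a crash can be replayed
        self.memtable[key] = value         # 2. then apply to the memtable
        if len(self.memtable) >= self.flush_threshold:
            self._flush()                  # 3. seal and flush when the buffer is full

    def delete(self, key):
        self.put(key, None)                # None plays the role of a tombstone here

    def _flush(self):
        self.l0_runs.append(sorted(self.memtable.items()))  # sequential, sorted L0 run
        self.memtable = {}                 # fresh memtable; the WAL would rotate as well

db = SimpleLSMWriter("example.wal")
for i in range(5):
    db.put(f"user:{i}", f"payload-{i}")
print(f"{len(db.l0_runs)} L0 run(s); {len(db.memtable)} entry(ies) still buffered")
```

Using the approximation above with L = 10 and N = 10^9 keys, \text{WA} \approx \log_{10} 10^9 = 9, i.e., each entry is rewritten on the order of nine times over its lifetime.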

Read operation

In LSM-trees, point lookups for a specific key begin by searching the in-memory memtable, typically implemented as a balanced structure such as an AVL or red-black tree, which provides O(log M) lookup time where M is the number of entries in the memtable. If the key is not found, the search proceeds to the Level 0 (L0) SSTables on disk, whose key ranges may overlap and therefore may require checking multiple files, though Bloom filters are used to skip irrelevant files with over 90% efficiency by probabilistically determining whether a key is absent. For higher levels (L1 and beyond), the process involves a binary search across the sorted list of SSTables to identify the candidate file (O(log N), where N is the number of files), followed by a binary search within that file's block index, incurring O(log B) time per file where B is the number of blocks.

Range queries operate similarly but extend to scanning contiguous key ranges across relevant components; iterators are created for the memtable and each qualifying SSTable, then merged in key order to produce a sorted result set, ensuring all overlapping files from the memtable down to the highest-numbered level are examined. This merging leverages the sorted nature of SSTables, with optimizations like cached block indexes to minimize disk seeks by mapping keys directly to data blocks.

Key optimizations further enhance read efficiency: Bloom filters, built per SSTable during creation, enable skipping non-matching files for point lookups, while in-memory caching of file indexes and data blocks reduces repeated I/O for hot keys. The worst-case cost of a point lookup involves O(L × log B) I/O operations, where L is the number of levels (typically around 7 in practical systems such as LevelDB and RocksDB), but with Bloom filters and caching, average latencies are significantly lower. For consistency, reads retrieve the latest value by checking the memtable first and descending through the levels, returning the most recent version encountered; the append-only, immutable nature of on-disk components eliminates the need for read locks.
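
The lookup order described above can be sketched as follows. This simplification treats each deeper level as a single sorted run with disjoint key ranges, uses plain dictionaries for the memtable and L0 runs, and notes Bloom-filter checks in comments rather than implementing them.

```python
from bisect import bisect_left

def lsm_get(key, memtable, l0_runs, deeper_levels):
    """Point lookup: newest data is consulted first, so the first hit wins.

    memtable      -- dict acting as the active in-memory buffer
    l0_runs       -- list of dicts for flushed L0 tables, oldest first
    deeper_levels -- list of sorted (key, value) lists standing in for L1..Ln
    """
    if key in memtable:                    # 1. active memtable
        return memtable[key]
    for run in reversed(l0_runs):          # 2. L0 files may overlap: check newest first
        if key in run:                     #    (a Bloom filter would skip most of them)
            return run[key]
    for run in deeper_levels:              # 3. at most one candidate run per deeper level
        i = bisect_left(run, (key,))       #    binary search within the sorted run
        if i < len(run) and run[i][0] == key:
            return run[i][1]
    return None                            # absent, or deleted if a tombstone was found

memtable = {"k3": "new"}
l0 = [{"k1": "old"}, {"k1": "newer"}]
l1 = [("k1", "oldest"), ("k2", "v2")]
print(lsm_get("k1", memtable, l0, [l1]))   # -> "newer" (the L0 version shadows L1)
print(lsm_get("k2", memtable, l0, [l1]))   # -> "v2"
```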

Compaction

Compaction in log-structured merge-trees (LSM-trees) reorganizes on-disk data, primarily by merging sorted string tables (SSTables) across levels to reduce key overlaps, reclaim storage occupied by tombstones (markers for deleted keys), and prevent excessive growth in individual levels that would degrade read performance and increase space amplification. This background maintenance ensures the tree remains efficient for write-heavy workloads by resolving conflicts among multiple versions of the same key and eliminating obsolete entries that accumulate due to deferred updates.

Two primary compaction strategies are commonly employed: leveled and size-tiered. In the leveled strategy, SSTables are merged sequentially from one level to the next, with each level limited to a target size typically 10 times larger than the previous one; this approach aggressively reduces overlaps by ensuring non-overlapping key ranges within a level, but it incurs higher write amplification, often around 10× the ingested data size, due to repeated rewriting of data across levels. Conversely, the size-tiered strategy groups SSTables of similar sizes within a level before compacting them into larger files in the next level, which allows multiple overlapping files per key range and results in lower write amplification, approximately 4×, at the expense of increased read amplification from more frequent key lookups across files.

The compaction process runs in background threads to avoid interfering with foreground operations, selecting candidate SSTables based on a score metric that prioritizes the levels or files most in need of compaction, for example score = total level size / target level size. Overlapping files are then read, their keys merged in sorted order (resolving duplicates, applying tombstones to remove deleted entries, and discarding superseded values), and the resulting sorted data is written as new, larger SSTables at the next level, with old files marked for deletion once the new ones are durable. Compaction is computationally intensive in the key-merging step but primarily I/O-bound from reading and writing large volumes of data, with configurable heuristics (maximum background threads, minimum file sizes for inclusion, score thresholds) to balance resource usage against workload demands.

Triggers are based on compaction debt, defined as the sum of file sizes exceeding level thresholds across the tree, D = \sum (\text{file sizes exceeding level thresholds}); compaction initiates when this debt surpasses a system-defined limit \theta, that is, when D > \theta, in order to maintain space and performance invariants.
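
A sketch of the score-based trigger logic described above is given below. The parameter names, the L0 file-count trigger, and the exponentially growing targets are illustrative defaults, not the configuration of any specific engine.

```python
def compaction_scores(level_sizes, l0_file_count,
                      l0_trigger=4, base_target=64 * 2**20, fanout=10):
    """Return a score per level; the highest score above 1.0 is compacted next.

    L0 is scored by file count because its files overlap; deeper levels are
    scored by their total size relative to an exponentially growing target.
    """
    scores = {0: l0_file_count / l0_trigger}
    for level, size in enumerate(level_sizes, start=1):
        target = base_target * fanout ** (level - 1)
        scores[level] = size / target
    return scores

# Example: 6 overlapping files in L0, 100 MB in L1 (target 64 MB), 500 MB in L2 (target 640 MB).
scores = compaction_scores([100 * 2**20, 500 * 2**20], l0_file_count=6)
print(scores)                         # {0: 1.5, 1: 1.5625, 2: 0.78125}
print(max(scores, key=scores.get))    # level 1 carries the most urgent compaction debt
```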

Performance characteristics

Advantages

LSM-trees excel at write optimization by buffering updates in memory before flushing them to disk in sequential batches, which minimizes random I/O operations and enables high throughput on both hard disk drives (HDDs) and solid-state drives (SSDs). This append-only approach contrasts with B-trees, which require in-place updates leading to frequent random writes, allowing LSM-trees to achieve insert costs approximately 50 times lower than B-trees under high-update workloads. Additionally, the in-memory memtable supports low tail latency for write bursts, as operations complete quickly without immediate disk involvement.

In distributed systems, LSM-trees facilitate horizontal scalability through sharding, where data is partitioned across multiple nodes to handle petabyte-scale datasets without performance degradation. This design supports elastic scaling by adding nodes as data volume grows, making it well suited to large-scale key-value stores such as those in NoSQL databases. LSM-trees also match modern hardware, particularly flash-based SSDs: their typical write amplification of 10-20x is significantly lower than the 100x or more that B-trees can incur, reducing wear and extending device lifespan. Compaction processes, which merge sorted files, can be parallelized across multi-core processors to improve efficiency without blocking foreground writes.

Fault tolerance is enhanced by the write-ahead log (WAL), which ensures durability by persisting updates before acknowledgment and allows reconstruction of the memtable after crashes. The immutability of SSTables simplifies crash recovery, as files remain internally consistent and the memtable can be replayed from the WAL without complex in-place repairs. Benchmarks such as those using the YCSB workload show LSM-trees achieving 5-50x higher write throughput than B+ trees in update-intensive scenarios, underscoring their suitability for write-heavy applications. Space efficiency is further improved during compaction, where compression techniques and duplicate removal can reduce storage overhead by up to 50% in leveled implementations.

Disadvantages

While LSM-trees excel in write-heavy workloads, they incur notable penalties in read performance, particularly for point queries. Point lookups often require consulting multiple levels of sorted string tables (SSTables), leading to higher read amplification than B-trees, which typically reach data in logarithmic time with fewer I/O operations. In benchmarks, LSM-tree-based systems exhibit read throughput that is 1.5x to 3x lower than B-tree implementations for random point queries, translating into proportionally higher latencies due to the need to merge results from overlapping levels. Range scans, although they benefit from the sorted layout of each component, still suffer from merge-heavy operations across levels, exacerbating latency in deep multi-level structures.

Space overhead is another key limitation, stemming from data duplication during the asynchronous merging process. LSM-trees can experience space amplification of roughly 1.2x to 2x the logical data size, depending on workload and compaction policy, primarily because outdated versions and duplicates are stored temporarily in intermediate SSTables until compaction resolves them. Deletes and updates introduce tombstones, markers that propagate through the levels until compacted, further inflating space usage, since these entries persist and consume resources even after the targeted key is logically removed. This lag in cleanup can lead to sustained overhead, especially in workloads with high update rates, where tombstones accumulate until compaction removes them.

Compaction, though essential for maintaining performance, introduces maintenance overheads that can disrupt operations. Background compaction causes I/O spikes and write stalls, in which incoming writes are throttled when levels become imbalanced, potentially increasing tail latencies by orders of magnitude during peak activity. Tuning parameters such as size ratios and compaction policies can mitigate these compaction "storms," but suboptimal configurations interfere with foreground reads and writes, reducing overall throughput.

Implementations

Key databases

Apache Cassandra, first released in 2008, is an open-source distributed NoSQL database that employs LSM-trees as its core storage engine to handle high-write-throughput workloads across clusters. It uses size-tiered compaction strategies to manage SSTables, enabling efficient data partitioning and replication in decentralized environments. Apache HBase, initially released in 2007, is a distributed, scalable big data store modeled after Google's Bigtable and built on Hadoop's HDFS, incorporating LSM-tree principles for storing sparse data tables in a master-slave architecture. It supports tiered and date-tiered compaction policies within its region-based storage model to optimize column-family-oriented data access.

LevelDB, developed by Google and released in 2011, is an embeddable persistent key-value store that implements a leveled LSM-tree structure, popularizing leveled compaction in which each level holds approximately ten times more data than the previous one. It focuses on simple operations such as puts, gets, and scans, making it suitable for lightweight, high-performance storage needs. RocksDB, a 2012 fork of LevelDB developed by Facebook, is an optimized LSM-tree-based key-value store tailored for fast storage media like SSDs, offering advanced features such as dynamic level sizing, multiple compaction styles, and transaction support. It powers numerous production systems, including Facebook's infrastructure, handling petabyte-scale data across more than 30 applications.

Other notable implementations include ScyllaDB, a high-performance Cassandra-compatible database rewritten in C++ with a shard-per-core architecture for low-latency LSM-tree operations, first released in 2015; and TiKV, a distributed transactional key-value store written in Rust as part of the TiDB ecosystem, which leverages RocksDB's LSM-tree engine for consistent, scalable storage and reached its first general-availability release in 2017. LSM-tree-based systems such as these power thousands of production deployments worldwide, supporting applications from social media to big data analytics.

Variants

LSM-trees have evolved through various compaction strategies to balance write and read performance. In size-tiered compaction, also known as tiering, immutable sorted runs of similar sizes are grouped within each level before being merged into the next, which reduces write amplification by allowing multiple runs to accumulate per level but increases read amplification because lookups may have to consult several overlapping runs. Conversely, leveled compaction spreads data across non-overlapping key ranges in each level, compacting runs from the previous level into a single sorted run per level, thereby minimizing read amplification and space usage at the expense of higher write amplification from more frequent full-level merges. These strategies represent the fundamental variants, with tiering favoring write-intensive workloads and leveling suiting read-heavy scenarios; a cost summary contrasting the two appears at the end of this section.

The Bw-tree, introduced by Microsoft Research in 2013, extends the log-structured paradigm by layering a latch-free B-tree-like structure over log-structured storage, using delta updates to apply modifications to pages logically while preserving sequential logging for writes. This hybrid design achieves LSM-like write throughput through elastic, discontiguous pages and a mapping table, while providing B-tree-style efficiency for range queries and point lookups on modern hardware such as multicore processors and flash storage.

In the 2020s, learned LSM variants integrate machine-learned models to optimize index structures and reduce scan overhead during reads. For instance, Bourbon (2020) employs learned indexes within each LSM run, predicting key locations to skip unnecessary index searches and improving lookup latency by 1.23× to 1.78× compared to state-of-the-art LSM implementations such as WiscKey. These approaches use models trained on key distributions to approximate the layout of sorted data, adapting to data patterns for better space efficiency and query speed in write-optimized settings. More recent variants, such as HotRAP (2025) for handling hot data in tiered LSM-trees and EcoTune (2025) for optimized compaction policies, continue to address performance trade-offs in modern workloads.

Hybrid approaches address specialized storage constraints, such as zoned namespaces (ZNS) in SSDs and shingled magnetic recording (SMR) drives. Zoned LSM designs, emerging around 2022, adapt compaction to sequential-write zones by aligning merges with zone boundaries, minimizing random I/O and extending device lifespan; for example, lifetime-leveling compaction predicts run lifetimes to distribute writes evenly across zones, reducing space amplification by approximately 30% and achieving up to 1.7× better performance than traditional LSM designs on ZNS SSDs. Embedded variants such as Pebble, a RocksDB-compatible key-value store written in Go for CockroachDB, refine the LSM design with optimized memtables and concurrent compaction, prioritizing stability and throughput in distributed systems while maintaining RocksDB compatibility.
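
As a rough summary of these trade-offs, following the standard asymptotic analysis in the LSM design-space literature (with size ratio T between levels, L levels, and B entries per disk block; the point-read figures count runs probed in the worst case without filters, and exact constants depend on the implementation):

\text{Leveling:}\quad \text{write cost per entry} = O\!\left(\frac{T \cdot L}{B}\right), \quad \text{point read} = O(L), \quad \text{space amplification} \approx \frac{T+1}{T}

\text{Tiering:}\quad \text{write cost per entry} = O\!\left(\frac{L}{B}\right), \quad \text{point read} = O(T \cdot L), \quad \text{space amplification} = O(T)

With the common setting T = 10, this matches the qualitative picture above: leveling keeps space overhead near 1.1x and reads cheap while paying roughly T times more on writes, whereas tiering does the opposite.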
