Apache Iceberg
Apache Iceberg is an open-source, high-performance table format designed for managing huge analytic datasets in data lakes, bringing the reliability, simplicity, and atomicity of SQL tables to big data environments.[1] It enables efficient handling of petabyte-scale tables with features like ACID-compliant transactions, schema evolution, time travel queries, and hidden partitioning, allowing users to query historical snapshots, merge or update data expressively, and optimize performance without manual reconfiguration.[1] Developed initially at Netflix in 2017 to overcome limitations in Apache Hive for incremental processing and large-scale analytics, Iceberg was open-sourced and donated to the Apache Software Foundation in November 2018.[2][3] It achieved top-level Apache project status in May 2020, marking its maturity and broad community adoption.[4] Iceberg's architecture separates metadata from data files, using a layered structure of manifests and snapshots to track table state efficiently, which supports concurrent operations across multiple engines without conflicts.[5] This design facilitates safe, multi-engine access, data compaction strategies like bin-packing or sorting, and rollback capabilities for error recovery or auditing.[1] It integrates seamlessly with popular compute engines such as Apache Spark, Trino, Apache Flink, Presto, Apache Hive, and Cloudera Impala, as well as storage systems like Amazon S3, Azure Data Lake, and Google Cloud Storage.[1] By addressing challenges in traditional data lake formats, such as rigid partitioning and lack of transactional support, Iceberg has become a foundational technology for modern data lakehouses, enabling reliable analytics on streaming and batch data at scale.[6]
Overview
Definition and Purpose
Apache Iceberg is an open-source table format designed for managing huge analytic datasets, providing a high-performance structure that enables the use of SQL-like tables in big data environments.[5] Licensed under the Apache 2.0 license as a project of the Apache Software Foundation, it supports compute engines such as Spark, Trino, Flink, and Hive by layering reliable table management over distributed file systems like cloud object stores.[7] Unlike traditional file formats such as Parquet or ORC, which focus solely on encoding data within individual files, Iceberg functions as a complete table format that separates metadata from data files, allowing for efficient schema enforcement, partitioning, and query optimization without requiring data rewrites.[8] The primary purpose of Apache Iceberg is to deliver the reliability, simplicity, and performance of SQL databases to large-scale analytic workloads in data lakes, particularly addressing the limitations of earlier systems like Apache Hive that struggle with atomicity and consistency at petabyte scales.[9] It enables atomic operations, such as concurrent writes with serializable isolation, through optimistic concurrency control and atomic metadata swaps, ensuring that updates to massive tables—potentially comprising billions of files—can be performed reliably without full data rewrites or directory-based locking.[10] This design tackles challenges in eventually consistent storage systems, like Amazon S3, by maintaining table state in dedicated metadata files that track snapshots, manifests, and file locations.[11] Iceberg's development was initially motivated by Netflix's need for efficient incremental processing of petabyte-scale data in 2017, where traditional Hive tables led to slow query planning (up to 9.6 minutes) and execution times due to scanning millions of files and partitions.[12] By introducing features like hidden partitioning and time travel queries, Iceberg reduced these bottlenecks 
dramatically—for instance, cutting query planning to 10-25 seconds in early benchmarks—while supporting ACID transactions for reliable analytics.[12]
Core Principles
Apache Iceberg's design is guided by principles that emphasize reliability, scalability, and flexibility for managing large-scale analytic datasets in distributed environments. These principles enable the format to provide SQL-like guarantees on data lakes while addressing limitations of traditional file-based systems, such as inefficiency in metadata handling and lack of transactional support. A core principle is the separation of metadata from data files, where table state is tracked in dedicated JSON metadata files rather than embedded within the data. This separation allows query engines to perform rapid operations, such as scan planning, by reading only metadata without accessing the underlying data files, which can number in the billions for massive tables.[8] Another key principle is the immutability of data files, ensuring that once written, files are never modified or deleted arbitrarily to maintain data integrity and support reliable concurrent operations. Updates and deletes are instead handled atomically by adding new files and updating metadata, preventing partial or inconsistent states during concurrent reads and writes.[9] Iceberg adheres to an open format principle to promote interoperability across diverse compute engines, including Apache Spark, Trino, Flink, and Hive, without tying users to a specific vendor or ecosystem. This design uses standard file formats like Parquet, Avro, and ORC, with uniform schema evolution rules that ensure seamless data sharing and avoid lock-in.[9] The format supports schema-on-read semantics combined with enforcement mechanisms, allowing tables to evolve over time—such as adding, dropping, or reordering columns—without rewriting existing data files, while providing validation to maintain compatibility and prevent errors. 
Field IDs in the schema enable projection and evolution, ensuring that reads adapt to changes reliably.[13] Performance optimization is achieved through comprehensive metadata structures, particularly manifests and manifest lists, which store statistics like partition values, null counts, and file paths to enable query engines to prune irrelevant files early in the planning phase. This metadata-driven approach minimizes data scanned during queries, supporting efficient operations on petabyte-scale tables with constant-time planning complexity.[14]
History
Origins and Development
Development of Apache Iceberg began in 2017 at Netflix, where engineers sought to overcome key limitations of Apache Hive in managing large-scale analytic tables. Specifically, Hive struggled with incremental data updates, atomic commits, and reliable processing in petabyte-scale environments, prompting the need for a more robust table format.[2][15] The primary challenges addressed included Hive's poor performance during schema evolution, absence of ACID transaction support, and inefficiencies arising from directory-based partitioning in Hadoop-based data lakes, which often led to bottlenecks in concurrent access and maintenance. These issues were particularly acute for Netflix's data-intensive workloads involving streaming analytics and machine learning pipelines.[2][16] In November 2018, Netflix open-sourced the project under the Apache 2.0 license, making it available for broader adoption and contribution within the big data community. Early development emphasized features like metadata management to ensure reliability without relying on Hive's metastore.[16][17] Initial contributions were led by teams from Netflix, Apple, and Airbnb, who prioritized integrations with query engines such as Apache Spark and Presto (whose PrestoSQL fork was later renamed Trino) to enable seamless operation across diverse ecosystems. The project's first public release, version 0.1.0, arrived in July 2019 during its Apache Incubator phase, highlighting metadata logging as a core mechanism for atomic operations and snapshot isolation.[18][19]
Apache Project Milestones
Apache Iceberg entered the Apache Incubator on November 16, 2018, marking the beginning of its formal open-source governance under the Apache Software Foundation.[20] The project graduated from incubation to become a top-level Apache project on May 20, 2020, after achieving sufficient community consensus, active participation, and demonstration of maturity in its codebase and processes.[20] Key releases have driven the project's evolution, with version 0.11.0 released on January 27, 2021, introducing support for the REST catalog to enable broader interoperability with external systems.[21] Version 1.0.0 followed on November 3, 2022, stabilizing core APIs and establishing production-ready guarantees for table format management.[22] Subsequent updates included version 1.4.0 on October 4, 2023, which added enhancements for row-level deletes to improve data manipulation efficiency.[23] Version 1.5.0, released on March 11, 2024, advanced branching and tagging capabilities for better version control in collaborative environments.[24] The most recent major release, version 1.10.0 on September 11, 2025, incorporated bug fixes alongside advancements in the format specification and API stability.[25] As of November 2025, the Apache Iceberg community had grown to over 200 contributors, fostering active development through the project's GitHub repository, where ongoing contributions address scalability and ecosystem integrations. Significant milestones include the integration with the AWS Glue Catalog in 2021, enabling seamless metadata management for Iceberg tables in Amazon S3 environments.[26] The project has seen widespread adoption across major cloud providers, including AWS, Google Cloud, and Microsoft Azure, supporting large-scale analytic workloads.[27] Additionally, the formation of Iceberg working groups has facilitated extensions, such as geospatial data support and Rust bindings, expanding the format's applicability in diverse domains.[28]
Architecture
Metadata Management
Apache Iceberg manages table metadata through a hierarchical tree structure that enables efficient tracking of table state changes and supports large-scale analytic workloads. At the root, a table metadata pointer references the current metadata file, which contains the table's schema, partition specification, properties, and a list of snapshots. This pointer facilitates atomic updates by swapping references to new metadata files without altering existing ones. Below the metadata file, snapshots represent immutable versions of the table, each pointing to a manifest list that organizes access to underlying manifest files. These manifest files, in turn, index the data files, forming a layered abstraction that separates metadata from data storage for scalability and reliability.[29] Snapshots serve as point-in-time captures of the entire table state, ensuring immutability to prevent inconsistencies during concurrent operations. Each snapshot includes a unique identifier, a timestamp, a manifest list file, optional summary information, and in version 3 (ratified 2025), fields such as sequence-number for ordering, first-row-id, and added-rows to enable row lineage tracking for advanced auditing and optimization.[30] The manifest list within a snapshot is an Avro-encoded file that aggregates manifest files, providing partition-level statistics to optimize query planning across distributed systems. Manifest files themselves are also Avro-encoded and contain detailed entries for data files, including file paths, partition values, and columnar statistics such as minimum and maximum values, null counts, and value counts. In version 3, manifests also track deletion vectors for improved file pruning. 
These statistics enable predicate pushdown and file pruning, reducing the volume of data scanned during queries by filtering irrelevant files early in the process.[30][31][32] Commits in Iceberg are atomic, achieved through optimistic concurrency control where writers stage new snapshots and metadata in temporary locations before swapping the table metadata pointer to the new version. This pointer swap ensures that only one snapshot becomes the current table state, providing serializable isolation without requiring distributed locks, even in multi-writer environments. If conflicts arise, failed attempts are discarded, preserving the existing table state. To support introspection and maintenance, Iceberg provides built-in metadata tables, such as $snapshots, which exposes the list of all snapshots with their IDs, timestamps, and manifest lists for querying table history, and $manifests, which details individual manifest files including paths, added/removed counts, and partition summaries. These system tables allow users to audit changes and diagnose issues directly via SQL.[29][10]
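These metadata tables can be queried like any other table. A minimal sketch in Spark SQL, where they are exposed as dotted suffixes on the table name (some engines instead use a $ suffix, e.g. "sample$snapshots"); the table name prod.db.sample is hypothetical:

```sql
-- List all snapshots with their commit timestamps and operations
SELECT snapshot_id, committed_at, operation
FROM prod.db.sample.snapshots;

-- Inspect manifest files, including added-file counts and partition summaries
SELECT path, added_data_files_count, partition_summaries
FROM prod.db.sample.manifests;
```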
Versioning in Iceberg relies on the chronological sequence of snapshot IDs, which form a timeline of table modifications from initial creation onward. This structure supports operations like rollback by referencing a prior snapshot ID to restore the table to an earlier state, as well as auditing by tracing changes through the snapshot log in the metadata file. Snapshot expiration policies can prune old snapshots to manage storage, but the metadata pointer always ensures access to the active version. Integrated catalogs, such as Hive Metastore or AWS Glue, store the table metadata pointer to coordinate access across engines. As of version 3 (2025), metadata supports table-level encryption keys for securing sensitive data.[30][29]
Data File Organization
Apache Iceberg stores data in columnar file formats such as Parquet, ORC, or Avro, providing a high-level table abstraction that manages these underlying files without altering their structure.[33] This separation enables compatibility with existing data lakes while leveraging the efficiency of columnar storage for analytical workloads.[5] Data files in Iceberg follow an immutable, append-only design, where new data is always written to fresh files rather than modifying existing ones.[34] For updates or deletes, Iceberg creates additional files—such as equality delete files referencing conditions or, in version 3 (ratified 2025), binary deletion vectors for efficient row-level operations—while marking obsolete files as deleted solely through metadata updates, preserving the integrity of stored data without in-place mutations. Position delete files, used in earlier versions, are deprecated in version 3.[35][36] This approach ensures reliability in distributed environments, including eventually consistent object stores like Amazon S3. 
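Row-level changes that produce delete files (or deletion vectors in version 3) are expressed as ordinary DML; the engine never mutates existing data files. A sketch assuming a Spark SQL session with an Iceberg catalog, where prod.db.sample and the updates source are hypothetical:

```sql
-- Records deletes in metadata; existing data files are left untouched
DELETE FROM prod.db.sample WHERE id = 42;

-- Upsert: matched rows are marked deleted and rewritten, new rows appended
MERGE INTO prod.db.sample t
USING updates u
ON t.id = u.id
WHEN MATCHED THEN UPDATE SET t.value = u.value
WHEN NOT MATCHED THEN INSERT *;
```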
To facilitate efficient query planning and parallel processing, Iceberg organizes data files into manifest files, which list the paths, partitions, and statistics of individual files within a snapshot.[32] These manifests are further grouped by manifest lists, creating logical collections that allow engines to prune irrelevant files based on embedded statistics, such as column value ranges or null counts, before scanning data.[31] This grouping reduces overhead during reads by enabling coarse-grained filtering at the metadata level.[37] Unlike traditional Hive-style tables, Iceberg does not rely on directory-based partitioning in storage; instead, all data files are stored in a flat structure within the table's location, with partition layouts defined entirely in metadata.[38] This metadata-driven approach avoids physical reorganization when partition specs evolve and supports hidden partitioning, where query engines automatically apply filters without users specifying partition keys.[39] For performance optimization, Iceberg includes compaction as a maintenance procedure that merges multiple small data files into fewer, larger ones, reducing metadata volume and improving scan efficiency. 
This process operates by rewriting the selected files into new ones—adhering to the append-only principle—while updating metadata to reference the consolidated output and obsolete the inputs, often run as a background task via tools like Spark.[40] Iceberg fully supports nested data types, including structs, lists, and maps, and in version 3 (2025), extends to new types such as variant for semi-structured data and geometry for geospatial use cases, allowing complex schemas to be represented within the supported file formats.[41][36] Additionally, compression can be configured at the table level through properties such as write.parquet.compression-codec, enabling choices like GZIP, ZSTD, or Snappy to balance storage efficiency and read performance across all data files in the table.[42]
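Both maintenance knobs can be exercised from SQL; a sketch assuming Spark with an Iceberg catalog named prod (the table name is hypothetical):

```sql
-- Choose a compression codec for all future Parquet data files
ALTER TABLE prod.db.sample
SET TBLPROPERTIES ('write.parquet.compression-codec' = 'zstd');

-- Bin-pack small files into larger ones as a maintenance job
CALL prod.system.rewrite_data_files(table => 'db.sample', strategy => 'binpack');
```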
Key Features
ACID Compliance
Apache Iceberg implements ACID (Atomicity, Consistency, Isolation, Durability) properties to provide reliable transaction semantics for large-scale analytic tables in distributed environments, enabling safe concurrent operations without traditional locking.[5] Atomicity is ensured through all-or-nothing commits achieved by atomic swaps of the table's metadata pointer to a new snapshot. When a writer stages changes—such as adding or removing data files—it creates a new metadata version based on the current snapshot; the commit succeeds only if the pointer swap is atomic, fully applying all changes or rolling back to the previous snapshot in case of partial failure. This mechanism prevents incomplete states from becoming visible to readers.[30][10] Consistency is maintained via schema validation and enforcement of table invariants during write operations. Writers validate data against the current schema before committing files, ensuring all columns are populated and partition transforms are compatible; metadata structures further enforce invariants like file coverage and deletion consistency across the table.[43][33] Isolation follows a snapshot isolation model, where each read operation loads a specific committed snapshot at the time of query initiation, providing a consistent view of the table without interference from concurrent writes and avoiding dirty reads. Readers remain isolated from uncommitted changes until they refresh to a newer snapshot. As detailed in the Metadata Management section, this relies on Iceberg's snapshot mechanism for immutable, versioned table states.[30][10] Durability is achieved by logging all write operations in the table metadata before committing and flushing data files to durable storage in the underlying file system. 
Once committed, snapshots and associated files are immutable, ensuring changes persist even in the event of failures post-commit.[44][30] To handle concurrent writes, Iceberg employs optimistic concurrency control, where writers assume no conflicts and detect them via checks against the current snapshot ID during the metadata swap. If a conflict arises—indicating another writer has advanced the snapshot—the operation retries by reapplying changes to the updated base snapshot, with mechanisms like sequence numbers to manage file rewrites efficiently.[10][45] Recovery is supported by an append-only transaction log in the form of the snapshot log within table metadata, which records timestamped entries of snapshot transitions. This log allows reconstruction of the table state by replaying operations from the latest consistent metadata file, facilitating auditing and failure recovery without data loss.[44][46]
Schema Evolution
Apache Iceberg supports schema evolution through metadata-only updates, allowing changes to table schemas without rewriting data files or incurring downtime. This feature enables atomic operations such as adding, dropping, renaming, or reordering columns and nested fields, ensuring that schema updates are independent and free of side effects.[13][47] Backward compatibility is maintained by assigning new columns default values, typically null for optional fields, so existing readers continue to function without accessing the new columns. When columns are dropped, the data remains in files but is hidden from new readers, preventing accidental exposure of previously deleted information. Forward compatibility ensures that added columns do not read values from existing data files, avoiding inconsistencies during mixed-version access.[13][43] Type promotions follow strict rules to preserve data integrity and partition compatibility. Primitive types can be widened, such as promoting an integer to a long, a float to a double, or a decimal with increased precision while maintaining scale. Nested structures support restructuring, like adding or removing fields in structs, maps, or lists, provided the changes do not alter map key equality or partition transform outputs. Required fields added during evolution must include non-null default values to ensure write consistency.[47][48] Schema evolution operations are performed using standard SQL commands like ALTER TABLE ADD COLUMN, DROP COLUMN, or RENAME COLUMN, which update the table's metadata atomically under Iceberg's ACID guarantees. These operations leverage unique field IDs—persistent integers assigned to each column or nested field—to track changes stably, decoupling evolution from position-based or name-based dependencies that could cause issues like column shifting or undeletion.
While position-based schemas are supported for compatibility with formats like Parquet, field ID-based evolution provides greater stability for long-term maintenance.[13][49]
For interoperability, Iceberg enforces schema validation during reads and writes to handle mismatches, with mechanisms to resolve missing fields via defaults or nulls, promoting robustness across query engines without requiring full schema alignment.[43]
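These metadata-only operations look like ordinary DDL; a sketch in Spark SQL (Iceberg's SQL extensions are assumed for the type change, and all names are hypothetical):

```sql
ALTER TABLE prod.db.sample ADD COLUMN region string;             -- new optional column, reads as null
ALTER TABLE prod.db.sample RENAME COLUMN region TO sales_region; -- field ID unchanged, so reads stay stable
ALTER TABLE prod.db.sample ALTER COLUMN id TYPE bigint;          -- int-to-long widening
ALTER TABLE prod.db.sample DROP COLUMN legacy_flag;              -- data hidden, files untouched
```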
Hidden Partitioning and Sorting
Apache Iceberg implements hidden partitioning by defining partition layouts entirely within the table's metadata, eliminating the need for users to manage physical directories or include partition columns in their data files. For instance, a table can be partitioned by date or region through transforms applied to column values, such as extracting the day from a timestamp or bucketing a categorical field, without altering the underlying storage structure. This approach ensures that queries automatically benefit from partition pruning, as Iceberg pushes down filters to skip irrelevant data files based on the metadata alone, preventing common errors like full table scans due to mismatched partition predicates.[39] Partition evolution in Iceberg allows schema administrators to modify partition specifications over time without rewriting existing data, enabling adaptations to changing query patterns or data volumes. For example, a table initially partitioned by day can evolve to hourly granularity by updating the metadata spec, where new writes follow the refined layout while historical files retain their original partitioning; transforms like truncation (e.g., rounding to the nearest hour) or bucketing (e.g., hashing into fixed buckets) facilitate these changes seamlessly. Identity partitions serve as the default for unpartitioned tables, using raw column values without transformation, which supports the addition of partitioning later without disrupting existing data organization.[13] Iceberg supports table-level sorting defined in metadata to optimize write operations and improve read performance through data clustering. Administrators can specify sort orders, such as ascending by category followed by descending by identifier, which engines like Spark enforce during inserts and merges to group similar rows together within files. 
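Such a layout can be declared entirely in DDL; a sketch in Spark SQL with Iceberg's extensions (catalog, table, and column names are hypothetical):

```sql
-- Hidden partitioning: transforms are applied by the engine,
-- and partition columns never appear in user queries
CREATE TABLE prod.db.events (
    id       bigint,
    category string,
    ts       timestamp
) USING iceberg
PARTITIONED BY (days(ts), bucket(16, category));

-- Sort order enforced by engines on write to cluster rows within files
ALTER TABLE prod.db.events WRITE ORDERED BY category ASC, id DESC;
```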
This sorting complements partitioning by reducing the amount of data scanned during queries, as metadata statistics—including value ranges and null counts—enable dynamic filter pushdown to prune files early in the query planning phase.[50][51] The combination of hidden partitioning and sorting yields significant performance gains by minimizing I/O overhead; for common workloads involving time-based or categorical filters, Iceberg can achieve up to 10x performance improvements through effective file pruning, substantially reducing scan volumes compared to traditional directory-based schemes.[51]
Time Travel and Snapshots
Apache Iceberg provides time travel capabilities that allow users to query historical states of a table as they existed at specific points in time, enabling reproducible analyses and examination of data changes without altering the current table state.[7] This feature relies on atomic snapshots, which represent immutable, point-in-time views of the table's data files and metadata, as briefly referenced in the snapshot structure.[52] Each snapshot is assigned a unique identifier and timestamp, capturing the results of operations like appends, merges, or deletes.[53] Time travel queries in Iceberg support SQL syntax using the TIMESTAMP AS OF or VERSION AS OF clauses to specify a past state. For example, SELECT * FROM prod.db.sample FOR TIMESTAMP AS OF '2017-11-10 00:00:00.000 America/Los_Angeles' retrieves data as it appeared at that timestamp, while SELECT * FROM prod.db.sample FOR VERSION AS OF 10963874102873 uses a specific snapshot ID for exact reproducibility.[54] These queries can also reference branches or tags, such as VERSION AS OF 'audit-branch', allowing access to experimental or historical versions without impacting production data.[54]
Snapshot expiration helps manage storage costs by removing old snapshots and their associated data files that are no longer needed. The expire_snapshots procedure automates this cleanup, retaining a configurable number of recent snapshots—by default, the last one—based on age or count thresholds. For instance, CALL prod.system.expire_snapshots('db.sample', TIMESTAMP '2021-06-30 00:00:00.000', 100) deletes snapshots older than the specified timestamp while keeping the 100 most recent ones.[55]
Iceberg supports branching and tagging for version management, similar to Git workflows, enabling isolated development and testing. Branches can be created to fork from an existing snapshot, such as for schema evolution experiments, and merged via fast-forward operations using the fast_forward procedure: CALL catalog.system.fast_forward('my_table', 'main', 'dev-branch') updates the target branch to match the source's latest snapshot.[56] Tags provide immutable labels for specific snapshots, queryable via VERSION AS OF 'tag-name', facilitating A/B comparisons or production rollouts without affecting the main lineage.[54]
Rollback operations revert a table to a previous state by setting the current snapshot to an earlier one. The rollback_to_snapshot procedure achieves this atomically: CALL catalog.system.rollback_to_snapshot('db.sample', 1) sets the table to snapshot ID 1, discarding intermediate changes.[57] Similarly, rollback_to_timestamp uses a timestamp for the reversion point.[58]
Audit capabilities are provided through system tables that log snapshot history, allowing users to track changes including who made them, when, and what operations occurred. Querying SELECT * FROM prod.db.sample.history returns details like made_current_at timestamps, operation types (e.g., 'append'), and summary metadata such as the Spark application ID responsible.[59] The snapshots table offers deeper insights, including parent snapshot IDs for lineage tracing via the ancestors_of procedure: CALL spark_catalog.system.ancestors_of('db.tbl', 1).[60] These features support compliance and debugging by providing a complete audit trail of table modifications.[53]
Branching in Iceberg is particularly useful for testing schema changes or running A/B comparisons in isolation, as branches maintain separate snapshot lineages that can be queried independently before merging into production.[54] This approach ensures safe experimentation on large-scale tables without risking data integrity.[7]
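Branch and tag creation is itself DDL; a sketch in Spark SQL with Iceberg's extensions (table and ref names are hypothetical):

```sql
-- Fork an isolated branch and pin an immutable tag at the current snapshot
ALTER TABLE prod.db.sample CREATE BRANCH dev;
ALTER TABLE prod.db.sample CREATE TAG `v1-release`;

-- Writes to the branch leave main untouched until fast-forwarded
INSERT INTO prod.db.sample.branch_dev VALUES (1, 'a');
```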