Apache Iceberg
Apache Iceberg is an open-source, high-performance table format designed for managing huge analytic datasets in data lakes, bringing the reliability, simplicity, and atomicity of SQL tables to big data environments.[1] It enables efficient handling of petabyte-scale tables with features like ACID-compliant transactions, schema evolution, time travel queries, and hidden partitioning, allowing users to query historical snapshots, merge or update data expressively, and optimize performance without manual reconfiguration.[1] Developed initially at Netflix in 2017 to overcome limitations in Apache Hive for incremental processing and large-scale analytics, Iceberg was open-sourced and donated to the Apache Software Foundation in November 2018.[2][3] It achieved top-level Apache project status in May 2020, marking its maturity and broad community adoption.[4] Iceberg's architecture separates metadata from data files, using a layered structure of manifests and snapshots to track table state efficiently, which supports concurrent operations across multiple engines without conflicts.[5] This design facilitates safe, multi-engine access, data compaction strategies like bin-packing or sorting, and rollback capabilities for error recovery or auditing.[1] It integrates seamlessly with popular compute engines such as Apache Spark, Trino, Apache Flink, Presto, Apache Hive, and Cloudera Impala, as well as storage systems like Amazon S3, Azure Data Lake, and Google Cloud Storage.[1] By addressing challenges in traditional data lake formats, such as rigid partitioning and lack of transactional support, Iceberg has become a foundational technology for modern data lakehouses, enabling reliable analytics on streaming and batch data at scale.[6]
Overview
Definition and Purpose
Apache Iceberg is an open-source table format designed for managing huge analytic datasets, providing a high-performance structure that enables the use of SQL-like tables in big data environments.[5] Licensed under the Apache 2.0 license as a project of the Apache Software Foundation, it supports compute engines such as Spark, Trino, Flink, and Hive by layering reliable table management over distributed file systems like cloud object stores.[7] Unlike traditional file formats such as Parquet or ORC, which focus solely on encoding data within individual files, Iceberg functions as a complete table format that separates metadata from data files, allowing for efficient schema enforcement, partitioning, and query optimization without requiring data rewrites.[8] The primary purpose of Apache Iceberg is to deliver the reliability, simplicity, and performance of SQL databases to large-scale analytic workloads in data lakes, particularly addressing the limitations of earlier systems like Apache Hive that struggle with atomicity and consistency at petabyte scales.[9] It enables atomic operations, such as concurrent writes with serializable isolation, through optimistic concurrency control and atomic metadata swaps, ensuring that updates to massive tables—potentially comprising billions of files—can be performed reliably without full data rewrites or directory-based locking.[10] This design tackles challenges in eventually consistent storage systems, like Amazon S3, by maintaining table state in dedicated metadata files that track snapshots, manifests, and file locations.[11] Iceberg's development was initially motivated by Netflix's need for efficient incremental processing of petabyte-scale data in 2017, where traditional Hive tables led to slow query planning (up to 9.6 minutes) and execution times due to scanning millions of files and partitions.[12] By introducing features like hidden partitioning and time travel queries, Iceberg reduced these bottlenecks 
dramatically—for instance, cutting query planning to 10-25 seconds in early benchmarks—while supporting ACID transactions for reliable analytics.[12]
Core Principles
Apache Iceberg's design is guided by principles that emphasize reliability, scalability, and flexibility for managing large-scale analytic datasets in distributed environments. These principles enable the format to provide SQL-like guarantees on data lakes while addressing limitations of traditional file-based systems, such as inefficiency in metadata handling and lack of transactional support. A core principle is the separation of metadata from data files, where table state is tracked in dedicated JSON metadata files rather than embedded within the data. This separation allows query engines to perform rapid operations, such as scan planning, by reading only metadata without accessing the underlying data files, which can number in the billions for massive tables.[8] Another key principle is the immutability of data files, ensuring that once written, files are never modified or deleted arbitrarily to maintain data integrity and support reliable concurrent operations. Updates and deletes are instead handled atomically by adding new files and updating metadata, preventing partial or inconsistent states during concurrent reads and writes.[9] Iceberg adheres to an open format principle to promote interoperability across diverse compute engines, including Apache Spark, Trino, Flink, and Hive, without tying users to a specific vendor or ecosystem. This design uses standard file formats like Parquet, Avro, and ORC, with uniform schema evolution rules that ensure seamless data sharing and avoid lock-in.[9] The format supports schema-on-read semantics combined with enforcement mechanisms, allowing tables to evolve over time—such as adding, dropping, or reordering columns—without rewriting existing data files, while providing validation to maintain compatibility and prevent errors. 
Field IDs in the schema enable projection and evolution, ensuring that reads adapt to changes reliably.[13] Performance optimization is achieved through comprehensive metadata structures, particularly manifests and manifest lists, which store statistics like partition values, null counts, and file paths to enable query engines to prune irrelevant files early in the planning phase. This metadata-driven approach minimizes data scanned during queries, supporting efficient operations on petabyte-scale tables with constant-time planning complexity.[14]
History
Origins and Development
Development of Apache Iceberg began in 2017 at Netflix, where engineers sought to overcome key limitations of Apache Hive in managing large-scale analytic tables. Specifically, Hive struggled with incremental data updates, atomic commits, and reliable processing in petabyte-scale environments, prompting the need for a more robust table format.[2][15] The primary challenges addressed included Hive's poor performance during schema evolution, absence of ACID transaction support, and inefficiencies arising from directory-based partitioning in Hadoop-based data lakes, which often led to bottlenecks in concurrent access and maintenance. These issues were particularly acute for Netflix's data-intensive workloads involving streaming analytics and machine learning pipelines.[2][16] In November 2018, Netflix open-sourced the project under the Apache 2.0 license, making it available for broader adoption and contribution within the big data community. Early development emphasized features like metadata management to ensure reliability without relying on Hive's metastore.[16][17] Initial contributions were led by teams from Netflix, Apple, and Airbnb, who prioritized integrations with query engines such as Apache Spark and Presto (whose PrestoSQL fork was later renamed Trino) to enable seamless operation across diverse ecosystems. The project's first public release, version 0.1.0, arrived in July 2019 during its Apache Incubator phase, highlighting metadata logging as a core mechanism for atomic operations and snapshot isolation.[18][19]
Apache Project Milestones
Apache Iceberg entered the Apache Incubator on November 16, 2018, marking the beginning of its formal open-source governance under the Apache Software Foundation.[20] The project graduated from incubation to become a top-level Apache project on May 20, 2020, after achieving sufficient community consensus, active participation, and demonstration of maturity in its codebase and processes.[20] Key releases have driven the project's evolution, with version 0.11.0 released on January 27, 2021, introducing support for the REST catalog to enable broader interoperability with external systems.[21] Version 1.0.0 followed on November 3, 2022, stabilizing core APIs and establishing production-ready guarantees for table format management.[22] Subsequent updates included version 1.4.0 on October 4, 2023, which added enhancements for row-level deletes to improve data manipulation efficiency.[23] Version 1.5.0, released on March 11, 2024, advanced branching and tagging capabilities for better version control in collaborative environments.[24] The most recent major release, version 1.10.0 on September 11, 2025, incorporated bug fixes alongside advancements in the format specification and API stability.[25] As of November 2025, the Apache Iceberg community had grown to over 200 contributors, fostering active development through the project's GitHub repository, where ongoing contributions address scalability and ecosystem integrations. Significant milestones include the integration with the AWS Glue Catalog in 2021, enabling seamless metadata management for Iceberg tables in Amazon S3 environments.[26] The project has seen widespread adoption across major cloud providers, including AWS, Google Cloud, and Microsoft Azure, supporting large-scale analytic workloads.[27] Additionally, the formation of Iceberg working groups has facilitated extensions, such as geospatial data support and Rust bindings, expanding the format's applicability in diverse domains.[28]
Architecture
Metadata Management
Apache Iceberg manages table metadata through a hierarchical tree structure that enables efficient tracking of table state changes and supports large-scale analytic workloads. At the root, a table metadata pointer references the current metadata file, which contains the table's schema, partition specification, properties, and a list of snapshots. This pointer facilitates atomic updates by swapping references to new metadata files without altering existing ones. Below the metadata file, snapshots represent immutable versions of the table, each pointing to a manifest list that organizes access to underlying manifest files. These manifest files, in turn, index the data files, forming a layered abstraction that separates metadata from data storage for scalability and reliability.[29] Snapshots serve as point-in-time captures of the entire table state, ensuring immutability to prevent inconsistencies during concurrent operations. Each snapshot includes a unique identifier, a timestamp, a manifest list file, optional summary information, and in version 3 (ratified 2025), fields such as sequence-number for ordering, first-row-id, and added-rows to enable row lineage tracking for advanced auditing and optimization.[30] The manifest list within a snapshot is an Avro-encoded file that aggregates manifest files, providing partition-level statistics to optimize query planning across distributed systems. Manifest files themselves are also Avro-encoded and contain detailed entries for data files, including file paths, partition values, and columnar statistics such as minimum and maximum values, null counts, and value counts. In version 3, manifests also track deletion vectors for improved file pruning. 
These statistics enable predicate pushdown and file pruning, reducing the volume of data scanned during queries by filtering irrelevant files early in the process.[30][31][32] Commits in Iceberg are atomic, achieved through optimistic concurrency control where writers stage new snapshots and metadata in temporary locations before swapping the table metadata pointer to the new version. This pointer swap ensures that only one snapshot becomes the current table state, providing serializable isolation without requiring distributed locks, even in multi-writer environments. If conflicts arise, failed attempts are discarded, preserving the existing table state. To support introspection and maintenance, Iceberg provides built-in metadata tables, such as $snapshots, which exposes the list of all snapshots with their IDs, timestamps, and manifest lists for querying table history, and $manifests, which details individual manifest files including paths, added/removed counts, and partition summaries. These system tables allow users to audit changes and diagnose issues directly via SQL.[29][10]
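These metadata tables can be queried like any other table. A minimal sketch in Spark SQL, where they are exposed as dotted suffixes on the table name (some engines instead use a $ suffix, e.g. "sample$snapshots"); the table name prod.db.sample is hypothetical:

```sql
-- List all snapshots with their commit timestamps and operations
SELECT snapshot_id, committed_at, operation
FROM prod.db.sample.snapshots;

-- Inspect manifest files, including added-file counts and partition summaries
SELECT path, added_data_files_count, partition_summaries
FROM prod.db.sample.manifests;
```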
Versioning in Iceberg relies on the chronological sequence of snapshot IDs, which form a timeline of table modifications from initial creation onward. This structure supports operations like rollback by referencing a prior snapshot ID to restore the table to an earlier state, as well as auditing by tracing changes through the snapshot log in the metadata file. Snapshot expiration policies can prune old snapshots to manage storage, but the metadata pointer always ensures access to the active version. Integrated catalogs, such as Hive Metastore or AWS Glue, store the table metadata pointer to coordinate access across engines. As of version 3 (2025), metadata supports table-level encryption keys for securing sensitive data.[30][29]
Data File Organization
Apache Iceberg stores data in columnar file formats such as Parquet, ORC, or Avro, providing a high-level table abstraction that manages these underlying files without altering their structure.[33] This separation enables compatibility with existing data lakes while leveraging the efficiency of columnar storage for analytical workloads.[5] Data files in Iceberg follow an immutable, append-only design, where new data is always written to fresh files rather than modifying existing ones.[34] For updates or deletes, Iceberg creates additional files—such as equality delete files referencing conditions or, in version 3 (ratified 2025), binary deletion vectors for efficient row-level operations—while marking obsolete files as deleted solely through metadata updates, preserving the integrity of stored data without in-place mutations. Position delete files, used in earlier versions, are deprecated in version 3.[35][36] This approach ensures reliability in distributed environments, including eventually consistent object stores like Amazon S3. 
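Row-level changes that produce delete files (or deletion vectors in version 3) are expressed as ordinary DML; the engine never mutates existing data files. A sketch assuming a Spark SQL session with an Iceberg catalog, where prod.db.sample and the updates source are hypothetical:

```sql
-- Records deletes in metadata; existing data files are left untouched
DELETE FROM prod.db.sample WHERE id = 42;

-- Upsert: matched rows are marked deleted and rewritten, new rows appended
MERGE INTO prod.db.sample t
USING updates u
ON t.id = u.id
WHEN MATCHED THEN UPDATE SET t.value = u.value
WHEN NOT MATCHED THEN INSERT *;
```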
To facilitate efficient query planning and parallel processing, Iceberg organizes data files into manifest files, which list the paths, partitions, and statistics of individual files within a snapshot.[32] These manifests are further grouped by manifest lists, creating logical collections that allow engines to prune irrelevant files based on embedded statistics, such as column value ranges or null counts, before scanning data.[31] This grouping reduces overhead during reads by enabling coarse-grained filtering at the metadata level.[37] Unlike traditional Hive-style tables, Iceberg does not rely on directory-based partitioning in storage; instead, all data files are stored in a flat structure within the table's location, with partition layouts defined entirely in metadata.[38] This metadata-driven approach avoids physical reorganization when partition specs evolve and supports hidden partitioning, where query engines automatically apply filters without users specifying partition keys.[39] For performance optimization, Iceberg includes compaction as a maintenance procedure that merges multiple small data files into fewer, larger ones, reducing metadata volume and improving scan efficiency. 
This process operates by rewriting the selected files into new ones—adhering to the append-only principle—while updating metadata to reference the consolidated output and obsolete the inputs, often run as a background task via tools like Spark.[40] Iceberg fully supports nested data types, including structs, lists, and maps, and in version 3 (2025), extends to new types such as variant for semi-structured data and geometry for geospatial use cases, allowing complex schemas to be represented within the supported file formats.[41][36] Additionally, compression can be configured at the table level through properties such as write.parquet.compression-codec, enabling choices like GZIP, ZSTD, or Snappy to balance storage efficiency and read performance across all data files in the table.[42]
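Both maintenance knobs can be exercised from SQL; a sketch assuming Spark with an Iceberg catalog named prod (the table name is hypothetical):

```sql
-- Choose a compression codec for all future Parquet data files
ALTER TABLE prod.db.sample
SET TBLPROPERTIES ('write.parquet.compression-codec' = 'zstd');

-- Bin-pack small files into larger ones as a maintenance job
CALL prod.system.rewrite_data_files(table => 'db.sample', strategy => 'binpack');
```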
Key Features
ACID Compliance
Apache Iceberg implements ACID (Atomicity, Consistency, Isolation, Durability) properties to provide reliable transaction semantics for large-scale analytic tables in distributed environments, enabling safe concurrent operations without traditional locking.[5] Atomicity is ensured through all-or-nothing commits achieved by atomic swaps of the table's metadata pointer to a new snapshot. When a writer stages changes—such as adding or removing data files—it creates a new metadata version based on the current snapshot; the commit succeeds only if the pointer swap is atomic, fully applying all changes or rolling back to the previous snapshot in case of partial failure. This mechanism prevents incomplete states from becoming visible to readers.[30][10] Consistency is maintained via schema validation and enforcement of table invariants during write operations. Writers validate data against the current schema before committing files, ensuring all columns are populated and partition transforms are compatible; metadata structures further enforce invariants like file coverage and deletion consistency across the table.[43][33] Isolation follows a snapshot isolation model, where each read operation loads a specific committed snapshot at the time of query initiation, providing a consistent view of the table without interference from concurrent writes and avoiding dirty reads. Readers remain isolated from uncommitted changes until they refresh to a newer snapshot. As detailed in the Metadata Management section, this relies on Iceberg's snapshot mechanism for immutable, versioned table states.[30][10] Durability is achieved by logging all write operations in the table metadata before committing and flushing data files to durable storage in the underlying file system. 
Once committed, snapshots and associated files are immutable, ensuring changes persist even in the event of failures post-commit.[44][30] To handle concurrent writes, Iceberg employs optimistic concurrency control, where writers assume no conflicts and detect them via checks against the current snapshot ID during the metadata swap. If a conflict arises—indicating another writer has advanced the snapshot—the operation retries by reapplying changes to the updated base snapshot, with mechanisms like sequence numbers to manage file rewrites efficiently.[10][45] Recovery is supported by an append-only transaction log in the form of the snapshot log within table metadata, which records timestamped entries of snapshot transitions. This log allows reconstruction of the table state by replaying operations from the latest consistent metadata file, facilitating auditing and failure recovery without data loss.[44][46]
Schema Evolution
Apache Iceberg supports schema evolution through metadata-only updates, allowing changes to table schemas without rewriting data files or incurring downtime. This feature enables atomic operations such as adding, dropping, renaming, or reordering columns and nested fields, ensuring that schema updates are independent and free of side effects.[13][47] Backward compatibility is maintained by assigning new columns default values, typically null for optional fields, so existing readers continue to function without accessing the new columns. When columns are dropped, the data remains in files but is hidden from new readers, preventing accidental exposure of previously deleted information. Forward compatibility ensures that added columns do not read values from existing data files, avoiding inconsistencies during mixed-version access.[13][43] Type promotions follow strict rules to preserve data integrity and partition compatibility. Primitive types can be widened, such as promoting an integer to a long, a float to a double, or a decimal with increased precision while maintaining scale. Nested structures support restructuring, like adding or removing fields in structs, maps, or lists, provided the changes do not alter map key equality or partition transform outputs. Required fields added during evolution must include non-null default values to ensure write consistency.[47][48] Schema evolution operations are performed using standard SQL commands like ALTER TABLE ADD COLUMN, DROP COLUMN, or RENAME COLUMN, which update the table's metadata atomically under Iceberg's ACID guarantees. These operations leverage unique field IDs—persistent integers assigned to each column or nested field—to track changes stably, decoupling evolution from position-based or name-based dependencies that could cause issues like column shifting or undeletion.
While position-based schemas are supported for compatibility with formats like Parquet, field ID-based evolution provides greater stability for long-term maintenance.[13][49]
For interoperability, Iceberg enforces schema validation during reads and writes to handle mismatches, with mechanisms to resolve missing fields via defaults or nulls, promoting robustness across query engines without requiring full schema alignment.[43]
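These metadata-only operations look like ordinary DDL; a sketch in Spark SQL (Iceberg's SQL extensions are assumed for the type change, and all names are hypothetical):

```sql
ALTER TABLE prod.db.sample ADD COLUMN region string;             -- new optional column, reads as null
ALTER TABLE prod.db.sample RENAME COLUMN region TO sales_region; -- field ID unchanged, so reads stay stable
ALTER TABLE prod.db.sample ALTER COLUMN id TYPE bigint;          -- int-to-long widening
ALTER TABLE prod.db.sample DROP COLUMN legacy_flag;              -- data hidden, files untouched
```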
Hidden Partitioning and Sorting
Apache Iceberg implements hidden partitioning by defining partition layouts entirely within the table's metadata, eliminating the need for users to manage physical directories or include partition columns in their data files. For instance, a table can be partitioned by date or region through transforms applied to column values, such as extracting the day from a timestamp or bucketing a categorical field, without altering the underlying storage structure. This approach ensures that queries automatically benefit from partition pruning, as Iceberg pushes down filters to skip irrelevant data files based on the metadata alone, preventing common errors like full table scans due to mismatched partition predicates.[39] Partition evolution in Iceberg allows schema administrators to modify partition specifications over time without rewriting existing data, enabling adaptations to changing query patterns or data volumes. For example, a table initially partitioned by day can evolve to hourly granularity by updating the metadata spec, where new writes follow the refined layout while historical files retain their original partitioning; transforms like truncation (e.g., rounding to the nearest hour) or bucketing (e.g., hashing into fixed buckets) facilitate these changes seamlessly. Identity partitions serve as the default for unpartitioned tables, using raw column values without transformation, which supports the addition of partitioning later without disrupting existing data organization.[13] Iceberg supports table-level sorting defined in metadata to optimize write operations and improve read performance through data clustering. Administrators can specify sort orders, such as ascending by category followed by descending by identifier, which engines like Spark enforce during inserts and merges to group similar rows together within files. 
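Such a layout can be declared entirely in DDL; a sketch in Spark SQL with Iceberg's extensions (catalog, table, and column names are hypothetical):

```sql
-- Hidden partitioning: transforms are applied by the engine,
-- and partition columns never appear in user queries
CREATE TABLE prod.db.events (
    id       bigint,
    category string,
    ts       timestamp
) USING iceberg
PARTITIONED BY (days(ts), bucket(16, category));

-- Sort order enforced by engines on write to cluster rows within files
ALTER TABLE prod.db.events WRITE ORDERED BY category ASC, id DESC;
```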
This sorting complements partitioning by reducing the amount of data scanned during queries, as metadata statistics—including value ranges and null counts—enable dynamic filter pushdown to prune files early in the query planning phase.[50][51] The combination of hidden partitioning and sorting yields significant performance gains by minimizing I/O overhead; for common workloads involving time-based or categorical filters, Iceberg can achieve up to 10x performance improvements through effective file pruning, substantially reducing scan volumes compared to traditional directory-based schemes.[51]
Time Travel and Snapshots
Apache Iceberg provides time travel capabilities that allow users to query historical states of a table as they existed at specific points in time, enabling reproducible analyses and examination of data changes without altering the current table state.[7] This feature relies on atomic snapshots, which represent immutable, point-in-time views of the table's data files and metadata, as briefly referenced in the snapshot structure.[52] Each snapshot is assigned a unique identifier and timestamp, capturing the results of operations like appends, merges, or deletes.[53] Time travel queries in Iceberg support SQL syntax using the TIMESTAMP AS OF or VERSION AS OF clauses to specify a past state. For example, SELECT * FROM prod.db.sample FOR TIMESTAMP AS OF '2017-11-10 00:00:00.000 America/Los_Angeles' retrieves data as it appeared at that timestamp, while SELECT * FROM prod.db.sample FOR VERSION AS OF 10963874102873 uses a specific snapshot ID for exact reproducibility.[54] These queries can also reference branches or tags, such as VERSION AS OF 'audit-branch', allowing access to experimental or historical versions without impacting production data.[54]
Snapshot expiration helps manage storage costs by removing old snapshots and their associated data files that are no longer needed. The expire_snapshots procedure automates this cleanup, retaining a configurable number of recent snapshots—by default, the last one—based on age or count thresholds. For instance, CALL prod.system.expire_snapshots('db.sample', TIMESTAMP '2021-06-30 00:00:00.000', 100) deletes snapshots older than the specified timestamp while keeping the 100 most recent ones.[55]
Iceberg supports branching and tagging for version management, similar to Git workflows, enabling isolated development and testing. Branches can be created to fork from an existing snapshot, such as for schema evolution experiments, and merged via fast-forward operations using the fast_forward procedure: CALL catalog.system.fast_forward('my_table', 'main', 'dev-branch') updates the target branch to match the source's latest snapshot.[56] Tags provide immutable labels for specific snapshots, queryable via VERSION AS OF 'tag-name', facilitating A/B comparisons or production rollouts without affecting the main lineage.[54]
Rollback operations revert a table to a previous state by setting the current snapshot to an earlier one. The rollback_to_snapshot procedure achieves this atomically: CALL catalog.system.rollback_to_snapshot('db.sample', 1) sets the table to snapshot ID 1, discarding intermediate changes.[57] Similarly, rollback_to_timestamp uses a timestamp for the reversion point.[58]
Audit capabilities are provided through system tables that log snapshot history, allowing users to track changes including who made them, when, and what operations occurred. Querying SELECT * FROM prod.db.sample.history returns details like made_current_at timestamps, operation types (e.g., 'append'), and summary metadata such as the Spark application ID responsible.[59] The snapshots table offers deeper insights, including parent snapshot IDs for lineage tracing via the ancestors_of procedure: CALL spark_catalog.system.ancestors_of('db.tbl', 1).[60] These features support compliance and debugging by providing a complete audit trail of table modifications.[53]
Branching in Iceberg is particularly useful for testing schema changes or running A/B comparisons in isolation, as branches maintain separate snapshot lineages that can be queried independently before merging into production.[54] This approach ensures safe experimentation on large-scale tables without risking data integrity.[7]
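Branch and tag creation is itself DDL; a sketch in Spark SQL with Iceberg's extensions (table and ref names are hypothetical):

```sql
-- Fork an isolated branch and pin an immutable tag at the current snapshot
ALTER TABLE prod.db.sample CREATE BRANCH dev;
ALTER TABLE prod.db.sample CREATE TAG `v1-release`;

-- Writes to the branch leave main untouched until fast-forwarded
INSERT INTO prod.db.sample.branch_dev VALUES (1, 'a');
```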