
Apache Iceberg

Apache Iceberg is an open-source, high-performance table format designed for managing huge analytic datasets in data lakes, bringing the reliability, simplicity, and atomicity of SQL tables to big data environments. It enables efficient handling of petabyte-scale tables with features like ACID-compliant transactions, schema evolution, time travel queries, and hidden partitioning, allowing users to query historical snapshots, merge or update data expressively, and optimize performance without manual reconfiguration. Developed initially at Netflix in 2017 to overcome limitations in Apache Hive for incremental processing and large-scale analytics, Iceberg was open-sourced and donated to the Apache Software Foundation in November 2018. It achieved top-level Apache project status in May 2020, marking its maturity and broad community adoption.

Iceberg's architecture separates metadata from data files, using a layered structure of manifests and snapshots to track table state efficiently, which supports concurrent operations across multiple engines without conflicts. This design facilitates safe, multi-engine access, compaction strategies like bin-packing or sorting, and rollback capabilities for error recovery or auditing. It integrates seamlessly with popular compute engines such as Apache Spark, Trino, Flink, Presto, Hive, and Impala, as well as storage systems like Amazon S3, Azure Blob Storage, and Google Cloud Storage. By addressing challenges in traditional formats, such as rigid partitioning and lack of transactional support, Iceberg has become a foundational technology for modern data lakehouses, enabling reliable analytics on streaming and batch data at scale.

Overview

Definition and Purpose

Apache Iceberg is an open-source table format designed for managing huge analytic datasets, providing a high-performance structure that enables the use of SQL-like tables in big data environments. Licensed under the Apache 2.0 license as a project of the Apache Software Foundation, it supports compute engines such as Spark, Trino, Flink, and Hive by layering reliable table management over distributed file systems and cloud object stores. Unlike traditional file formats such as Parquet or ORC, which focus solely on encoding data within individual files, Iceberg functions as a complete table format that separates metadata from data files, allowing for efficient schema enforcement, partitioning, and query optimization without requiring data rewrites.

The primary purpose of Apache Iceberg is to deliver the reliability, simplicity, and performance of SQL databases to large-scale analytic workloads in data lakes, particularly addressing the limitations of earlier systems like Apache Hive that struggle with atomicity and consistency at petabyte scales. It enables atomic operations, such as concurrent writes with serializable isolation, through optimistic concurrency and atomic metadata swaps, ensuring that updates to massive tables, potentially comprising billions of files, can be performed reliably without full data rewrites or directory-based locking. This design tackles challenges in eventually consistent storage systems, like Amazon S3, by maintaining table state in dedicated metadata files that track snapshots, manifests, and file locations.

Iceberg's development was initially motivated by Netflix's need for efficient incremental processing of petabyte-scale data in 2017, where traditional Hive tables led to slow query planning (up to 9.6 minutes) and execution times due to scanning millions of files and partitions. By introducing features like hidden partitioning and time travel queries, Iceberg reduced these bottlenecks dramatically, for instance cutting query planning to 10-25 seconds in early benchmarks, while supporting ACID transactions for reliable analytics.
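
Creating and querying an Iceberg table looks like ordinary SQL. A minimal Spark SQL sketch, assuming a session with an Iceberg catalog configured under the hypothetical name demo:

    -- Create an Iceberg table; "demo" is a hypothetical Iceberg catalog.
    CREATE TABLE demo.db.events (
        id      BIGINT,
        level   STRING,
        ts      TIMESTAMP,
        message STRING
    ) USING iceberg;

    -- Reads and writes are plain SQL; Iceberg handles metadata and commits.
    INSERT INTO demo.db.events
    VALUES (1, 'INFO', current_timestamp(), 'service started');

    SELECT count(*) FROM demo.db.events WHERE level = 'INFO';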

Core Principles

Apache Iceberg's design is guided by principles that emphasize reliability, scalability, and flexibility for managing large-scale analytic datasets in distributed environments. These principles enable Iceberg to provide SQL-like guarantees on data lakes while addressing limitations of traditional file-based systems, such as inefficiency in metadata handling and lack of transactional support.

A core principle is the separation of metadata from data files, where table state is tracked in dedicated metadata files rather than embedded within the data. This separation allows query engines to perform rapid operations, such as scan planning, by reading only metadata without accessing the underlying data files, which can number in the billions for massive tables. Another key principle is the immutability of data files, ensuring that once written, files are never modified or deleted arbitrarily, to maintain consistency and support reliable concurrent operations. Updates and deletes are instead handled atomically by adding new files and updating metadata, preventing partial or inconsistent states during concurrent reads and writes.

Iceberg adheres to an open format principle to promote interoperability across diverse compute engines, including Spark, Trino, Flink, and Hive, without tying users to a specific engine or vendor. This relies on standard file formats like Parquet, Avro, and ORC, with uniform schema evolution rules that ensure seamless data sharing and avoid lock-in. The format supports schema-on-read semantics combined with enforcement mechanisms, allowing tables to evolve over time, such as adding, dropping, or reordering columns, without rewriting existing data files, while providing validation to maintain data integrity and prevent errors. Field IDs in the schema enable stable column tracking and resolution, ensuring that reads adapt to schema changes reliably.

Performance optimization is achieved through comprehensive metadata structures, particularly manifests and manifest lists, which store statistics like partition values, null counts, and file paths to enable query engines to prune irrelevant files early in the planning phase. This metadata-driven approach minimizes data scanned during queries, supporting efficient operations on petabyte-scale tables with constant-time planning complexity.
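
The statistics that drive this pruning can be inspected directly through Iceberg's metadata tables. A short Spark SQL sketch, reusing the hypothetical demo.db.events table from above:

    -- Per-file metadata used for pruning: partition values, row counts,
    -- and per-column lower/upper bounds.
    SELECT file_path, partition, record_count, lower_bounds, upper_bounds
    FROM demo.db.events.files;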

History

Origins and Development

Development of Apache Iceberg began in 2017 by engineers at Netflix, who sought to overcome key limitations of Apache Hive in managing large-scale analytic tables. Specifically, Hive struggled with incremental data updates, atomic commits, and reliable processing in petabyte-scale environments, prompting the need for a more robust table format. The primary challenges addressed included Hive's poor performance during schema and partition changes, absence of transaction support, and inefficiencies arising from directory-based partitioning in Hadoop-based data lakes, which often led to bottlenecks in concurrent access and maintenance. These issues were particularly acute for Netflix's data-intensive workloads involving streaming and batch pipelines.

In November 2018, Netflix open-sourced the project under the Apache 2.0 license, making it available for broader adoption and contribution within the community. Early development emphasized features like snapshot-based metadata management to ensure reliability without relying on Hive's metastore. Initial contributions were led by teams from Netflix and Apple, among others, who prioritized integrations with query engines such as Spark and Presto (later renamed Trino) to enable seamless operation across diverse ecosystems. The project's first public release, version 0.1.0, arrived in July 2019 during its Apache Incubator phase, highlighting metadata logging as a core mechanism for atomic operations and snapshot isolation.

Apache Project Milestones

Apache Iceberg entered the Apache Incubator on November 16, 2018, marking the beginning of its formal incubation under the Apache Software Foundation. The project graduated from incubation to become a top-level Apache project on May 20, 2020, after achieving sufficient community consensus, active participation, and demonstration of maturity in its codebase and processes.

Key releases have driven the project's evolution, with version 0.11.0 released on January 27, 2021, introducing support for the AWS Glue catalog to enable broader interoperability with external systems. Version 1.0.0 followed on November 3, 2022, stabilizing core APIs and establishing production-ready guarantees for table format management. Subsequent updates included version 1.4.0 on October 4, 2023, which added enhancements for row-level deletes to improve data manipulation efficiency. Version 1.5.0, released on March 11, 2024, advanced branching and tagging capabilities for better version management in collaborative environments. The most recent major release, version 1.10.0 on September 11, 2025, incorporated bug fixes alongside advancements in the format specification and stability.

As of November 2025, the Apache Iceberg community had grown to over 200 contributors, fostering active development through the project's repository, where ongoing contributions address scalability and ecosystem integrations. Significant milestones include the integration with the AWS Glue Catalog in 2021, enabling seamless metadata management for Iceberg tables in AWS environments. The project has seen widespread adoption across major cloud providers, including AWS, Google Cloud, and Microsoft Azure, supporting large-scale analytic workloads. Additionally, the formation of Iceberg working groups has facilitated extensions, such as geospatial data support and language bindings, expanding the format's applicability in diverse domains.

Architecture

Metadata Management

Apache Iceberg manages table metadata through a hierarchical tree structure that enables efficient tracking of table state changes and supports large-scale analytic workloads. At the root, a table metadata pointer references the current metadata file, which contains the table's schema, partition specification, properties, and a list of snapshots. This pointer facilitates atomic updates by swapping references to new metadata files without altering existing ones. Below the metadata file, snapshots represent immutable versions of the table, each pointing to a manifest list that organizes access to underlying manifest files. These manifest files, in turn, index the data files, forming a layered abstraction that separates metadata from data storage for scalability and reliability.

Snapshots serve as point-in-time captures of the entire table, ensuring immutability to prevent inconsistencies during concurrent operations. Each snapshot includes a snapshot ID, a timestamp, a manifest list file, optional summary information, and, in version 3 of the format specification (ratified 2025), fields such as sequence-number for ordering, first-row-id, and added-rows to enable row lineage tracking for advanced auditing and optimization. The manifest list within a snapshot is an Avro-encoded file that aggregates manifest files, providing partition-level statistics to optimize query planning across distributed systems. Manifest files themselves are also Avro-encoded and contain detailed entries for data files, including file paths, partition values, and columnar statistics such as minimum and maximum values, null counts, and value counts. In version 3, manifests also track deletion vectors for improved file pruning. These statistics enable predicate pushdown and file pruning, reducing the volume of data scanned during queries by filtering irrelevant files early in the process.

Commits in Iceberg are atomic, achieved through optimistic concurrency: writers stage new data and metadata files in temporary locations before swapping the table pointer to the new metadata file. This pointer swap ensures that only one commit becomes the current table state, providing serializable isolation without requiring distributed locks, even in multi-writer environments. If conflicts arise, failed attempts are discarded, preserving the existing table state.

To support introspection and maintenance, Iceberg provides built-in metadata tables, such as $snapshots, which exposes the list of all snapshots with their IDs, timestamps, and manifest lists for querying table history, and $manifests, which details individual manifest files including paths, added/removed file counts, and partition summaries. These system tables allow users to audit changes and diagnose issues directly via SQL.

Versioning in Iceberg relies on the chronological sequence of snapshot IDs, which form a linear history of table modifications from initial creation onward. This structure supports operations like rollback, by referencing a prior snapshot ID to restore the table to an earlier state, as well as auditing, by tracing changes through the snapshot log in the metadata file. Snapshot expiration policies can prune old snapshots to manage storage, but the metadata pointer always ensures access to the active version. Integrated catalogs, such as the Hive Metastore or AWS Glue, store the table metadata pointer to coordinate access across engines. As of version 3 (2025), table metadata supports table-level encryption keys for securing sensitive data.
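
These metadata tables can be queried like any other table. A brief Spark SQL sketch against the hypothetical demo.db.events table (Spark exposes the $snapshots and $manifests views under the names snapshots and manifests):

    -- Snapshot history: one row per committed snapshot.
    SELECT committed_at, snapshot_id, parent_id, operation
    FROM demo.db.events.snapshots;

    -- Manifest-level detail backing the current snapshot.
    SELECT path, added_data_files_count, existing_data_files_count
    FROM demo.db.events.manifests;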

Data File Organization

Apache Iceberg stores data in columnar file formats such as Parquet, ORC, or Avro, providing a high-level abstraction that manages these underlying files without altering their contents. This separation enables compatibility with existing data lakes while leveraging the efficiency of columnar storage for analytical workloads.

Data files in Iceberg follow an immutable, append-only design, where new data is always written to fresh files rather than modifying existing ones. For updates or deletes, Iceberg creates additional files, such as equality delete files referencing column-value conditions or, in version 3 (ratified 2025), binary deletion vectors for efficient row-level operations, while marking obsolete files as deleted solely through metadata updates, preserving the integrity of stored data without in-place mutations. Position delete files, used in earlier versions, are deprecated in version 3. This approach ensures reliability in distributed environments, including eventually consistent object stores like Amazon S3.

To facilitate efficient query planning and pruning, Iceberg organizes data files into manifest files, which list the paths, partitions, and statistics of individual files within a snapshot. These manifests are further grouped by manifest lists, creating logical collections that allow engines to prune irrelevant files based on embedded statistics, such as column value ranges or null counts, before scanning data. This grouping reduces overhead during reads by enabling coarse-grained filtering at the metadata level. Unlike traditional Hive-style tables, Iceberg does not rely on directory-based partitioning in storage; instead, all data files are stored in a flat structure within the table's location, with partition layouts defined entirely in metadata. This metadata-driven approach avoids physical reorganization when partition specs evolve and supports hidden partitioning, where query engines automatically apply filters without users specifying partition keys.

For performance optimization, Iceberg includes compaction as a maintenance procedure that merges multiple small data files into fewer, larger ones, reducing metadata volume and improving scan efficiency. This operates by rewriting the selected files into new ones, adhering to the immutability principle, while updating metadata to reference the consolidated output and obsolete the inputs, often run as a background task via engines like Spark.

Iceberg fully supports nested data types, including structs, lists, and maps, and in version 3 (2025) extends to new types such as variant for semi-structured data and geometry and geography for geospatial use cases, allowing complex schemas to be represented within the supported file formats. Additionally, compression can be configured at the table level through properties such as write.parquet.compression-codec, enabling choices like Zstandard, gzip, or Snappy to balance storage efficiency and read performance across all files in the table.
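
Compaction and file-level settings are exposed through stored procedures and table properties. A Spark SQL sketch with hypothetical catalog and table names; rewrite_data_files and write.parquet.compression-codec are standard Iceberg interfaces, and the values are illustrative:

    -- Bin-pack small files into ~512 MB outputs; inputs are only
    -- dereferenced in metadata, never mutated in place.
    CALL demo.system.rewrite_data_files(
        table => 'db.events',
        strategy => 'binpack',
        options => map('target-file-size-bytes', '536870912')
    );

    -- Choose the Parquet codec for newly written data files.
    ALTER TABLE demo.db.events
    SET TBLPROPERTIES ('write.parquet.compression-codec' = 'zstd');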

Key Features

ACID Compliance

Apache Iceberg implements ACID (atomicity, consistency, isolation, durability) properties to provide reliable transactional semantics for large-scale analytic tables in distributed environments, enabling safe concurrent operations without traditional locking.

Atomicity is ensured through all-or-nothing commits achieved by atomic swaps of the table's metadata pointer to a new metadata file. When a writer stages changes, such as adding or removing data files, it creates a new metadata version based on the current snapshot; the commit succeeds only if the pointer swap is atomic, fully applying all changes or falling back to the previous state in case of partial failure. This mechanism prevents incomplete states from becoming visible to readers.

Consistency is maintained via schema validation and enforcement of invariants during write operations. Writers validate changes against the current schema before committing, ensuring required columns are populated and partition transforms remain compatible; manifest structures further enforce invariants like partition coverage and consistent file deletion across the table.

Isolation follows a snapshot isolation model, where each read operation loads a specific snapshot committed at the time of query planning, providing a consistent view of the table without interference from concurrent writes and avoiding dirty reads. Readers remain isolated from uncommitted changes until they refresh to a newer snapshot. As detailed in the Metadata Management section, this relies on Iceberg's snapshot mechanism for immutable, versioned states.

Durability is achieved by logging all write operations in the table metadata before committing and flushing data files to durable storage in the underlying file system or object store. Once committed, snapshots and associated files are immutable, ensuring changes persist even in the event of failures post-commit.

To handle concurrent writes, Iceberg employs optimistic concurrency control, where writers assume no conflicts and detect them via checks against the current snapshot ID during the metadata swap. If a conflict arises, indicating another writer has advanced the table state, the operation retries by reapplying changes to the updated base snapshot, with mechanisms like sequence numbers to manage file rewrites efficiently. Recovery is supported by an append-only transaction log in the form of the snapshot log within table metadata, which records timestamped entries of snapshot transitions. This log allows reconstruction of the table state by replaying operations from the latest consistent metadata file, facilitating auditing and failure recovery without data loss.
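
Commit behavior under optimistic concurrency is tunable per table. A small Spark SQL sketch using Iceberg's documented commit retry properties; the table name and values are illustrative:

    -- Allow more retries with a short backoff for write-heavy tables, so
    -- conflicting committers reapply their changes instead of failing.
    ALTER TABLE demo.db.events SET TBLPROPERTIES (
        'commit.retry.num-retries' = '10',
        'commit.retry.min-wait-ms' = '100'
    );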

Schema Evolution

Apache Iceberg supports schema evolution through metadata-only updates, allowing changes to table schemas without rewriting data files or incurring downtime. This feature enables operations such as adding, dropping, renaming, or reordering columns and nested fields, ensuring that schema updates are independent and free of side effects. Backward compatibility is maintained by assigning new columns default values, typically null for optional fields, so existing readers continue to function without accessing the new columns. When columns are dropped, the data remains in files but is hidden from new readers, preventing accidental exposure of previously deleted information. Iceberg's correctness guarantee ensures that added columns never read values from existing data files, avoiding inconsistencies during mixed-version access.

Type promotions follow strict rules to preserve correctness and partition compatibility. Primitive types can be widened, such as promoting an int to a long, a float to a double, or a decimal to increased precision while maintaining scale. Nested structures support restructuring, like adding or removing fields in structs, maps, or lists, provided the changes do not alter map key equality or partition transform outputs. Required fields added during evolution must include non-null default values to ensure write consistency.

Schema evolution operations are performed using standard SQL commands like ALTER TABLE ADD COLUMN, DROP COLUMN, or RENAME COLUMN, which update the table's metadata atomically under Iceberg's transactional guarantees. These operations leverage unique field IDs, persistent integers assigned to each column or nested field, to track changes stably, decoupling evolution from position-based or name-based dependencies that could cause issues like column shifting or undeletion. While position-based schema resolution is supported for compatibility with existing data files, field ID-based evolution provides greater stability for long-term maintenance. For interoperability, Iceberg enforces schema validation during reads and writes to handle mismatches, with mechanisms to resolve missing fields via defaults or nulls, promoting robustness across query engines without requiring full schema alignment.
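
In Spark SQL, each of these evolutions is a single metadata-only statement. A sketch with hypothetical table and column names:

    -- Each statement commits a new schema; no data files are rewritten.
    ALTER TABLE demo.db.events ADD COLUMN severity INT;
    ALTER TABLE demo.db.events RENAME COLUMN message TO msg;
    ALTER TABLE demo.db.events ALTER COLUMN severity TYPE BIGINT; -- int -> long widening
    ALTER TABLE demo.db.events DROP COLUMN level;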

Hidden Partitioning and Sorting

Apache Iceberg implements hidden partitioning by defining partition layouts entirely within the table's metadata, eliminating the need for users to manage physical directories or include partition columns in their data files. For instance, a table can be partitioned by date or region through transforms applied to column values, such as extracting the day from a timestamp or bucketing a categorical field, without altering the underlying storage structure. This approach ensures that queries automatically benefit from partition pruning, as Iceberg pushes down filters to skip irrelevant data files based on the metadata alone, preventing common errors like full table scans due to mismatched partition predicates.

Partition evolution in Iceberg allows schema administrators to modify partition specifications over time without rewriting existing data, enabling adaptations to changing query patterns or data volumes. For example, a table initially partitioned by day can evolve to hourly granularity by updating the metadata spec, where new writes follow the refined layout while historical files retain their original partitioning; transforms like truncation (e.g., rounding to the nearest hour) or bucketing (e.g., hashing into fixed buckets) facilitate these changes seamlessly. Identity partitions serve as the default for unpartitioned tables, using raw column values without transformation, which supports the addition of partitioning later without disrupting existing data organization.

Iceberg supports table-level sorting defined in metadata to optimize write operations and improve read performance through data clustering. Administrators can specify sort orders, such as ascending by category followed by descending by identifier, which engines like Spark enforce during inserts and merges to group similar rows together within files. This sorting complements partitioning by reducing the amount of data scanned during queries, as metadata statistics, including value ranges and null counts, enable dynamic pushdown to prune files early in the query planning phase. The combination of hidden partitioning and sorting yields significant performance gains by minimizing I/O overhead; for common workloads involving time-based or categorical filters, Iceberg can achieve up to 10x performance improvements through effective file pruning, substantially reducing scan volumes compared to traditional directory-based schemes.
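
A Spark SQL sketch of hidden partitioning, partition evolution, and a table-level sort order; the names are hypothetical, and the ALTER statements assume Iceberg's Spark SQL extensions are enabled:

    -- Partition by transforms; queries filter on ts and id directly,
    -- with no separate partition columns.
    CREATE TABLE demo.db.logs (
        id    BIGINT,
        ts    TIMESTAMP,
        level STRING
    ) USING iceberg
    PARTITIONED BY (days(ts), bucket(16, id));

    -- Evolve day-level partitioning to hourly; existing files keep the
    -- old spec while new writes use the new one.
    ALTER TABLE demo.db.logs REPLACE PARTITION FIELD days(ts) WITH hours(ts);

    -- Cluster rows within data files at write time.
    ALTER TABLE demo.db.logs WRITE ORDERED BY level ASC, id DESC;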

Time Travel and Snapshots

Apache Iceberg provides time travel capabilities that allow users to query historical states of a table as they existed at specific points in time, enabling reproducible analyses and examination of changes without altering the current table state. This feature relies on snapshots, which represent immutable, point-in-time views of the table's files and metadata, as described in the Metadata Management section. Each snapshot is assigned a unique ID and timestamp, capturing the results of operations like appends, merges, or deletes.

Time travel queries in Iceberg support SQL syntax using the TIMESTAMP AS OF or VERSION AS OF clauses to specify a past state. For example, SELECT * FROM prod.db.sample FOR TIMESTAMP AS OF '2017-11-10 00:00:00.000 America/Los_Angeles' retrieves data as it appeared at that timestamp, while SELECT * FROM prod.db.sample FOR VERSION AS OF 10963874102873 uses a specific snapshot ID for exact reproducibility. These queries can also reference branches or tags, such as VERSION AS OF 'audit-branch', allowing access to experimental or historical versions without impacting production data.

Snapshot expiration helps manage storage costs by removing old snapshots and their associated data files that are no longer needed. The expire_snapshots procedure automates this cleanup, retaining a configurable number of recent snapshots (by default, the last one) based on age or count thresholds. For instance, CALL prod.system.expire_snapshots('db.sample', TIMESTAMP '2021-06-30 00:00:00.000', 100) deletes snapshots older than the specified timestamp while keeping the 100 most recent ones.

Iceberg supports branching and tagging for version management, similar to Git workflows, enabling isolated development and testing. Branches can be created to fork from an existing snapshot, such as for schema evolution experiments, and merged via fast-forward operations using the fast_forward procedure: CALL catalog.system.fast_forward('my_table', 'main', 'dev-branch') updates the target branch to match the source's latest snapshot. Tags provide immutable labels for specific snapshots, queryable via VERSION AS OF 'tag-name', facilitating A/B comparisons or production rollouts without affecting the main lineage.

Rollback operations revert a table to a previous state by setting the current snapshot to an earlier one. The rollback_to_snapshot procedure achieves this atomically: CALL catalog.system.rollback_to_snapshot('db.sample', 1) sets the table to snapshot ID 1, discarding intermediate changes. Similarly, rollback_to_timestamp uses a timestamp for the reversion point.

Audit capabilities are provided through system tables that log snapshot history, allowing users to track changes including who made them, when, and what operations occurred. Querying SELECT * FROM prod.db.sample.history returns details like made_current_at timestamps and, joined with the snapshots table, operation types (e.g., 'append') and summary metadata such as the Spark application ID responsible. The snapshots table offers deeper insights, including parent snapshot IDs for tracing via the ancestors_of procedure: CALL spark_catalog.system.ancestors_of('db.tbl', 1). These features support compliance and debugging by providing a complete audit trail of table modifications.

Branching in Iceberg is particularly useful for testing schema changes or running comparisons in isolation, as branches maintain separate snapshot lineages that can be queried independently before merging into production. This approach ensures safe experimentation on large-scale tables without risking production data.
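
Branches and tags are created with DDL and addressed by name. A Spark SQL sketch with hypothetical names, assuming Iceberg's SQL extensions:

    -- Fork the current state and label a release.
    ALTER TABLE demo.db.events CREATE BRANCH dev;
    ALTER TABLE demo.db.events CREATE TAG v1;

    -- Writes to the branch leave main untouched.
    INSERT INTO demo.db.events.branch_dev
    VALUES (2, 'WARN', current_timestamp(), 'experimental row');

    -- Read the tagged version or the branch by name.
    SELECT * FROM demo.db.events VERSION AS OF 'v1';
    SELECT * FROM demo.db.events VERSION AS OF 'dev';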

Integrations

Query Engines

Apache Spark provides native integration with Apache Iceberg since Spark 3.1, released in 2021, enabling full support for reading and writing Iceberg tables, including advanced features like schema evolution, time travel, and ACID transactions through SQL and DataFrame APIs. This integration allows Spark users to leverage Iceberg's metadata for efficient query planning and data compaction without requiring additional extensions.

Trino, formerly known as Presto, offers a dedicated Iceberg connector that supports querying Iceberg tables with optimizations such as predicate pushdown and file-level pruning based on Iceberg's manifest files, improving performance on large datasets. The connector enables read operations, including support for time travel and schema evolution, as well as writes and table maintenance through statements like OPTIMIZE for data compaction.

Apache Flink integrates with Apache Iceberg to support both batch and streaming writes, including change data capture (CDC) workflows, where Flink processes real-time events and commits them to Iceberg tables with exactly-once semantics. This is facilitated through Flink's Table API and SQL, with compatibility for catalogs like the Hive Metastore, allowing seamless streaming ingestion into Iceberg formats.

Enterprise platforms like Dremio and Starburst extend Iceberg support with enhanced governance features, such as data versioning, access controls, and automated query acceleration. Dremio's SQL query engine optimizes Iceberg tables using reflections for up to 100x faster performance and integrates with catalogs for unified management. Starburst, built on Trino, provides full DML operations on Iceberg tables, including support for Iceberg v3 specifications, with built-in row-level lineage and security integrations for governed data lakes.

Other query engines include Apache Hive, which uses the HiveIcebergStorageHandler to read and write Iceberg tables via the Hive Metastore, supporting DDL and DML operations in existing Hive environments. Apache Impala offers read and write support for Iceberg tables, including row-level deletes and schema evolution, optimized for high-performance analytics on Parquet-based data. Emerging support is available in DuckDB through its iceberg extension, enabling lightweight querying and writing of Iceberg tables in embedded analytical workflows, with full read and write capabilities as of version 1.4.0 (September 2025).

For custom applications, Apache Iceberg provides Rust and Python clients that allow programmatic access to table metadata and data without relying on full query engines. The Rust implementation offers native bindings for building high-performance tools, with improvements in version 0.7.0 (released October 2025) enhancing compatibility and API stability. PyIceberg, the official Python library, supports catalog interactions, table creation, and scans, enabling Python-based data pipelines and integrations with libraries like PyArrow.
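
Engine-specific syntax differs while the underlying table stays the same. A brief Trino SQL sketch, with hypothetical catalog and table names:

    -- Compact small files from Trino via the Iceberg connector.
    ALTER TABLE iceberg.db.events EXECUTE optimize(file_size_threshold => '128MB');

    -- Time travel from Trino using a snapshot ID.
    SELECT * FROM iceberg.db.events FOR VERSION AS OF 10963874102873;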

Storage Systems

Apache Iceberg supports a variety of object storage systems for persisting data files, enabling multi-cloud deployments. Commonly used options include Amazon S3, Azure Blob Storage, and Google Cloud Storage, which provide scalable, durable storage for large analytic datasets. These systems are compatible due to Iceberg's design that avoids operations like file renames and listings, relying instead on atomic file writes and immutable file structures.

For metadata management, Iceberg uses catalogs to track table locations and schemas, with several implementations available. The Hive Metastore serves as a legacy option, integrating with existing Hadoop ecosystems via a configurable URI and supporting locking mechanisms. AWS Glue acts as a cloud-native catalog, mapping Iceberg namespaces to Glue databases and Iceberg tables to Glue tables for seamless integration with AWS services. JDBC-based catalogs, such as those backed by PostgreSQL, allow relational databases to store metadata through standard SQL connections, requiring database support for atomic transactions. The Hadoop catalog enables on-premises setups by leveraging HDFS paths, while REST catalogs provide a protocol-agnostic API for cloud-native environments, facilitating distributed access without direct database dependencies.

Iceberg catalogs support multi-table namespaces, allowing organization of tables into hierarchical structures. In AWS Glue, namespaces correspond to databases, with access control enforced via AWS Identity and Access Management (IAM) policies that provide fine-grained permissions equivalent to ACLs. Custom catalog implementations can extend this with bespoke ACL mechanisms for advanced security needs. On-premises deployments utilize HDFS as the primary file system, ensuring compatibility through POSIX-like semantics for file operations. This setup maintains the same immutability and atomicity guarantees as cloud object stores.

Table properties in Iceberg allow storage-specific optimizations, such as enabling encryption through key management systems or selecting compression codecs like Zstandard for data files. These configurations are set at the table level to tailor performance and cost to the underlying storage. Iceberg requires atomic operations for metadata updates via snapshot swaps and does not support non-atomic file systems that could lead to inconsistent states. Additionally, metadata must be stored on durable systems to preserve table history and enable features like time travel.
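
Storage-oriented tuning is expressed as table properties. A sketch using documented Iceberg properties against a hypothetical table; values are illustrative:

    -- Hash-prefixed object keys spread writes across S3 key ranges, and
    -- old metadata files are pruned after commits to bound metadata growth.
    ALTER TABLE demo.db.events SET TBLPROPERTIES (
        'write.object-storage.enabled' = 'true',
        'write.metadata.delete-after-commit.enabled' = 'true',
        'write.metadata.previous-versions-max' = '50'
    );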

Adoption and Use Cases

Industry Adoption

Apache Iceberg has seen widespread adoption since its open-sourcing, beginning with early pioneers in the technology sector. Netflix, one of the original developers, has utilized Iceberg in production since 2017 to manage its exabyte-scale data lake, migrating over 1.5 million tables to the format for reliable analytics on petabyte-plus datasets. Similarly, Airbnb adopted Iceberg for its event data pipelines and data warehousing needs, launching the "Airbnb Icehouse" initiative to upgrade infrastructure and handle large-scale analytics from millions of user interactions.

Major cloud providers have integrated Iceberg to enhance their data services, accelerating enterprise uptake. Amazon Web Services (AWS) introduced support for Iceberg in Amazon Athena in 2021 (preview) and 2022 (general availability), enabling ACID transactions on S3 data lakes. Google Cloud incorporated Iceberg into BigLake, announced in 2022 with general availability in 2023, allowing unified querying across multi-cloud storage for high-performance lakehouses. Microsoft added Iceberg format support to Azure Synapse Analytics in 2024 via Azure Data Factory, facilitating seamless data processing in Synapse pipelines.

Beyond the cloud giants, numerous enterprises across industries rely on Iceberg for data management. DoorDash employs it with Apache Flink for real-time ingestion of over 30 million events per second, improving scalability in its real-time processing platform. Expedia Group has leveraged Iceberg since 2021 for petabyte-scale tables, contributing read support to enhance interoperability. Siemens uses Iceberg in lakehouse architectures for industrial data collaboration, as highlighted in its 2025 open-source initiatives. By 2025, adoption has expanded to hundreds of organizations, with community analyses tracking over 200 companies among large enterprises.

Iceberg's growth reflects its transition from an Apache Incubator project in 2018 to a top-level project in 2020, with sustained community momentum evidenced by active development and surveys showing it in use by about 30% of data lakehouse implementations by 2024. The 1.5.0 release in March 2024 introduced enhancements like improved catalog support, and the version 3 specification added deletion vectors, further accelerating adoption. Key contributors include Apple, which has deployed it across hundreds of teams and added features like merge-on-read optimizations; Tabular, founded by Iceberg's creators to offer commercial cataloging and support (acquired by Databricks in 2024); and Dremio, a primary contributor that drives extensions for query performance.

A primary adoption driver has been overcoming Hive migration challenges, where organizations report significant query performance gains post-transition. For instance, AWS benchmarks using TPC-H workloads demonstrate improved query execution on Iceberg-sorted tables compared to Hive setups, reinforcing the case for migration in production environments.

Common Applications

Apache Iceberg is widely applied in data lakehouse architectures, where it unifies storage and compute layers to support business intelligence (BI) and machine learning (ML) workloads on a single set of tables. By enabling multiple query engines to access the same datasets concurrently with ACID guarantees, Iceberg facilitates scalable analytics without data duplication or silos, allowing organizations to perform ad-hoc queries, reporting, and model training on petabyte-scale data stored in cloud object stores like Amazon S3.

In ETL pipelines, Iceberg supports incremental data processing through operations like MERGE INTO, which handle change data capture (CDC) and upsert scenarios efficiently, as sketched at the end of this section. This allows for atomic updates to large tables without full rewrites, ensuring data consistency during batch or streaming ingestion from sources such as Kafka or databases, and reducing processing overhead in production workflows.

For audit-oriented applications, particularly in finance, Iceberg's time travel feature enables querying historical snapshots of data, which is essential for auditing financial transactions or log events. Users can reconstruct past states to investigate anomalies or comply with reporting requirements, with hidden partitioning optimizing query performance by skipping irrelevant files automatically.

In ML pipelines, Iceberg serves as a foundation for feature stores by accommodating schema evolution as models iterate, allowing additions or modifications to columns without disrupting ongoing jobs. This supports reproducible experiments through branching and tagging capabilities, ensuring feature data remains consistent across distributed teams and engines like Spark or Trino.

Regulatory compliance benefits from Iceberg's immutable snapshot history, which provides verifiable audit trails for standards like GDPR or HIPAA. Organizations can query or restore specific data versions to demonstrate data handling practices, delete records in compliance with right-to-be-forgotten requests, and maintain long-term retention of records such as call logs or patient data without performance degradation.

Practical examples include real-time dashboards in streaming services, where Flink processes live user interaction data for immediate insights, and fraud detection in financial institutions, leveraging branching to test detection algorithms on production-like data without risking operational integrity.
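
A minimal MERGE INTO sketch for the CDC pattern described above (Spark SQL; table, source, and column names are hypothetical):

    -- Apply a batch of change records: deletes, updates, and inserts
    -- commit atomically as one new snapshot.
    MERGE INTO demo.db.accounts AS t
    USING updates AS s
    ON t.id = s.id
    WHEN MATCHED AND s.op = 'D' THEN DELETE
    WHEN MATCHED THEN UPDATE SET t.balance = s.balance
    WHEN NOT MATCHED THEN INSERT (id, balance) VALUES (s.id, s.balance);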
