
Apache Parquet

Apache Parquet is an open-source, column-oriented data file format designed for efficient data storage and retrieval, particularly in big data analytics ecosystems like Apache Hadoop. Developed collaboratively by Twitter and Cloudera, Parquet was first released in July 2013 and later donated to the Apache Incubator, drawing inspiration from the columnar storage and query techniques outlined in Google's 2010 Dremel paper. The format addresses limitations of row-oriented storage by organizing data into row groups—horizontal partitions of rows—each containing column chunks for individual columns, which are subdivided into pages as the basic units for encoding and compression. This hierarchical structure supports complex nested data schemas, advanced encodings (such as dictionary, run-length, and bit-packing), and pluggable compression algorithms like Snappy or GZIP, enabling significant reductions in storage footprint and I/O overhead.

Parquet's columnar layout offers key advantages for analytical processing, including faster query execution by scanning only required columns and improved compression ratios due to grouping values of similar types together—often achieving 75% or more space savings over text-based formats like CSV. Unlike row-based formats such as Avro, which favor record-oriented writes and streaming ingestion, Parquet excels in read-heavy workloads typical of data lakes and warehouses, with built-in support for schema evolution to handle evolving structures without rewriting files. It integrates seamlessly with major frameworks including Apache Spark, Apache Hive, Presto, and Apache Impala, as well as cross-language libraries like Apache Arrow for in-memory data interchange, making it a de facto standard for distributed computing environments. Widely adopted in cloud platforms such as Amazon Web Services, Google Cloud, and Microsoft Azure, Parquet facilitates scalable analytics on petabyte-scale datasets while maintaining splittability for parallel processing across clusters.

History

Origins and Initial Development

Apache Parquet originated from a collaborative effort between engineers at Twitter and Cloudera in early 2013, aimed at addressing the limitations of existing row-oriented data formats in the Hadoop ecosystem. At the time, row-based formats were efficient for record-level operations but inefficient for analytical workloads that required scanning large volumes of data across specific columns, leading to excessive I/O and poor query performance. The project sought to introduce a columnar storage format that could support complex nested data structures while enabling better compression and faster scans for analytical processing.

The development was heavily inspired by the columnar storage techniques described in Google's 2010 Dremel paper, which outlined methods for efficient analytical querying on massive datasets using record shredding and assembly algorithms to handle nested data without flattening it. This influence guided Parquet's design to prioritize column-wise storage, repetition and definition levels for reconstructing nested structures, and per-column encoding schemes to optimize for modern hardware and reduce I/O overhead. By adapting these principles to an open-source context, the collaborators aimed to make advanced columnar capabilities accessible to the broader Hadoop community, filling a gap in tools for high-performance analytics.

Parquet was first introduced as an open-source project in March 2013, with the initial version, Parquet 1.0, released in July 2013. This release focused on core functionalities, including a basic columnar layout for primitive and nested types, embedded metadata for schema description, and initial support for encodings like dictionary and bit-packing to enable efficient data representation. Early implementations targeted integration with core Hadoop ecosystem components, setting the foundation for its use in analytical pipelines.

Adoption and Milestones

Apache Parquet was donated to the Apache Software Foundation and entered the Incubator on May 20, 2014, after initial development by Twitter and Cloudera. It graduated to a top-level Apache project on April 27, 2015, marking its full integration into the open-source ecosystem and community-driven governance.

The project saw several key version releases that advanced its capabilities. Parquet 1.0 was released on July 30, 2013, establishing the core columnar file format optimized for Hadoop analytics. Parquet 1.10.0 followed on April 5, 2018, introducing support for Zstandard (ZSTD) compression alongside other codecs to enhance storage efficiency. Parquet 1.12.0 arrived on March 17, 2021, with enhancements to schema resolution and compatibility for evolving data structures. Parquet 1.13.0 was issued on April 6, 2023, including optimizations for processing nested and complex data types to improve query performance. Subsequent releases included Parquet 1.14.0 on May 7, 2024, focusing on bug fixes and stability improvements; 1.15.0 on December 2, 2024; and 1.16.0 on September 3, 2025. The Parquet format specification advanced to version 2.12.0 on August 28, 2025, adding support for the VARIANT logical type to handle semi-structured data more efficiently.

Adoption grew rapidly within the big data landscape. By 2015, Parquet had been incorporated into major Hadoop distributions, including Cloudera CDH and Hortonworks HDP, facilitating its use in enterprise analytics workflows. By 2020, it had achieved widespread use as a de facto standard for columnar storage in big data environments, powering efficient data processing in tools like Apache Spark and Presto across industries. A significant milestone came in 2016 with its integration into the newly incubating Apache Arrow project, which provided in-memory columnar data interchange and cross-language compatibility, further boosting Parquet's interoperability.

Design and File Format

Core Principles

Apache Parquet is fundamentally designed as a columnar storage format, where data is organized by columns rather than rows, allowing for more efficient compression and selective reading of only the relevant columns during query execution, particularly for analytical workloads that apply predicates on specific fields. This columnar model minimizes I/O overhead by enabling systems to skip irrelevant data, making it ideal for scan-heavy operations in big data environments.

A key principle of Parquet is its self-describing nature, achieved by embedding the schema and metadata directly in the file's footer, which allows files to be read independently without requiring external schema definitions or catalogs. The footer includes details such as column locations, data types, and statistics, facilitating interoperability across different tools and languages while supporting schema evolution over time.

Parquet is optimized for write-once, read-many scenarios common in batch processing, prioritizing high scan speeds and low storage costs over frequent updates or transactional support, as updates would require rewriting entire row groups. This design choice aligns with its roots in handling bulk analytical queries, where data is typically appended in immutable batches rather than modified incrementally. From its inception, Parquet has supported complex data types, including nested structures such as maps, lists, and repeated fields, using techniques like record shredding to flatten hierarchical data into columnar form while preserving logical relationships. This capability, inspired by the columnar storage approach in Google's Dremel system, enables efficient handling of semi-structured data from sources such as JSON in analytical pipelines.
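
A minimal sketch of these principles using PyArrow (the file name, column names, and values are illustrative, not from the original text): a nested table is written once, the embedded schema is read back from the footer with no external catalog, and a query pulls only the columns it needs.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A small table with a nested (struct) column.
table = pa.table({
    "user_id": pa.array([1, 2, 3], type=pa.int64()),
    "name": ["alice", "bob", "carol"],
    "address": pa.array(
        [{"city": "Berlin", "zip": "10115"},
         {"city": "Austin", "zip": "73301"},
         {"city": "Oslo",   "zip": "0150"}],
        type=pa.struct([("city", pa.string()), ("zip", pa.string())])),
})
pq.write_table(table, "users.parquet")

# The schema travels in the file footer, so no external schema definition is needed.
print(pq.read_schema("users.parquet"))

# Column pruning: read only the columns a query actually needs.
subset = pq.read_table("users.parquet", columns=["user_id", "address"])
print(subset.num_rows, subset.column_names)
```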

Structure Components

Apache Parquet files follow a structured layout consisting of a header, a data section, and a footer to enable efficient columnar storage and access. The header begins with the 4-byte magic bytes "PAR1" in ASCII encoding, identifying the file as Parquet. The data section contains the actual columnar data organized into row groups, while the footer holds all file metadata and ends with another 4-byte "PAR1" magic number, preceded by the footer length as a 4-byte little-endian integer.

Row groups represent the primary horizontal partitioning unit in a Parquet file, dividing the dataset into independent chunks for parallelism and storage optimization. By default, row groups are sized at approximately 128 MB, though recommendations suggest larger sizes of 512 MB to 1 GB to align with typical distributed file system block sizes such as those in HDFS. Each row group encompasses a subset of rows and includes column chunks for every column in the schema, allowing readers to process entire groups in parallel for column-wise operations. Within each row group, data is further partitioned vertically into column chunks, one per column, to support selective column reading and encoding tailored to individual columns. A column chunk comprises one or more pages: data pages that store the actual values along with repetition and definition levels for handling nesting and nulls, and optional dictionary pages that hold unique values for dictionary-based encodings. This page-level subdivision enables fine-grained access, with data pages typically limited to 1 MB for efficient buffering during reads and writes.

The footer contains comprehensive file metadata serialized using Apache Thrift's TCompactProtocol, ensuring compact and efficient parsing. This metadata includes the schema definition in a depth-first traversal, detailing field names, physical and logical types (such as INT32, STRING, or nested groups/maps), and repetition annotations (required, optional, or repeated) used to manage optional fields and repeated structures in nested data. Repetition levels track the hierarchy of repeated elements by indicating the depth at which repetition occurs relative to the previous value, while definition levels specify how many optional or repeated fields in the path to the current value are actually defined, enabling compact representation of complex schemas without explicit null markers. Additionally, the footer lists row group and column chunk metadata, including offsets, sizes, and statistics like min/max values for query optimization.
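
The physical structure described above can be inspected programmatically. The following sketch (PyArrow; "events.parquet" and its columns are placeholders) writes a file with two row groups and then walks the footer metadata: row groups, per-column chunks, and their statistics.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write a file with two row groups by capping the rows per group.
table = pa.table({"ts": list(range(10_000)), "level": ["info"] * 10_000})
pq.write_table(table, "events.parquet", row_group_size=5_000)

pf = pq.ParquetFile("events.parquet")
meta = pf.metadata                      # parsed from the Thrift footer
print(meta.num_row_groups, meta.num_rows, meta.created_by)
print(pf.schema_arrow)                  # schema reconstructed from the footer

for rg in range(meta.num_row_groups):
    rg_meta = meta.row_group(rg)
    for col in range(rg_meta.num_columns):
        chunk = rg_meta.column(col)
        stats = chunk.statistics        # min/max/null_count, if written
        print(rg, chunk.path_in_schema, chunk.compression,
              chunk.total_compressed_size,
              stats.min if stats and stats.has_min_max else None,
              stats.max if stats and stats.has_min_max else None)
```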

Features

Key Capabilities

Apache Parquet's columnar storage structure enables column pruning, which allows query engines to load only the specific columns required for a given analytical query, thereby minimizing I/O operations and improving performance on large datasets. This capability is facilitated by the organization of data into independent column chunks within row groups, where describes the layout and content of each chunk without necessitating the reading of irrelevant columns. Complementing column pruning is predicate pushdown, a technique where filtering predicates from queries are evaluated against the file's to skip entire row groups or column chunks that do not satisfy the conditions, reducing the volume of data scanned during reads. This optimization is particularly effective for selective queries in environments, as it pushes the filtering logic to the storage level before data is loaded into memory. Parquet supports nested data structures through group types for structs and logical types for lists and maps, enabling the representation of complex, hierarchical data common in modern applications. To encode these hierarchies efficiently in a columnar format, Parquet employs repetition levels and definition levels: repetition levels indicate the level at which a value is repeated within a nested path, while definition levels track the number of optional fields that are null or defined, allowing reconstruction of the original without storing explicit null markers for every position. Each column chunk in Parquet includes in its , such as minimum and maximum values along with counts, which aid query planners in determining data relevance and enabling further optimizations like skipping non-qualifying chunks. These are computed per chunk to provide granular insights into data distribution, supporting efficient query planning without full data scans. In October 2025, Apache Parquet ratified the Variant logical type, a new for storing . This binary format uses shredding to extract common fields into typed column chunks, improving for dynamic schemas and nested data by enabling better , faster reads (up to 8x ), and enhanced data skipping.
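
A hedged sketch of column pruning and predicate pushdown with PyArrow (the file name, column names, and sizes are illustrative assumptions): because the values are sorted, each row group carries a tight min/max range, so a selective filter lets the reader skip most groups without decoding them.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Sorted values give each row group a narrow min/max range in its statistics.
table = pa.table({"day": list(range(1_000_000)), "amount": [1.0] * 1_000_000})
pq.write_table(table, "sales.parquet", row_group_size=100_000)

# Column pruning: only "amount" is decoded.
# Predicate pushdown: row groups whose day statistics cannot satisfy the
# filter are skipped entirely.
result = pq.read_table(
    "sales.parquet",
    columns=["amount"],
    filters=[("day", ">=", 950_000)],
)
print(result.num_rows)   # 50_000
```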

Schema Evolution

Apache Parquet supports schema evolution, enabling datasets to adapt to structural changes over time without necessitating complete data rewrites. This capability is built into the file format's design, which embeds schema metadata within each file, allowing readers to interpret varying schemas across files in a dataset. Like other formats such as Avro, Protocol Buffers, and Thrift, Parquet facilitates gradual schema modifications, primarily through the use of optional fields and logical type annotations.

Backward compatibility in Parquet ensures that newer readers can process older files by ignoring any newly added optional fields, provided that all required fields from prior schemas remain unchanged and in the same positions. Required fields cannot be removed or altered in type without breaking compatibility, as their presence is strictly enforced during reads. This design prevents errors when evolving schemas in production environments, where data from multiple versions may coexist. Forward compatibility allows older readers to handle files with additional or unknown fields by skipping them, leveraging definition levels to determine value presence without attempting to parse incompatible elements. Definition levels, stored alongside data pages, encode whether a value exists for a given field in nested or repeated structures, enabling safe navigation of schema differences. This mechanism supports reading across schema versions without data loss or corruption.

Schema merging is handled by processing tools like Apache Spark and Apache Hive, which combine schemas from multiple files during ingestion. Merging promotes missing fields to optional status, resolves naming conflicts by favoring the latest definitions, and performs type promotions (such as integer to long) when compatible. To enable this in Spark, the mergeSchema option must be set to true, though it incurs computational overhead by scanning file footers. Hive similarly supports evolution via its metastore, applying union-like operations to derive a superset schema.

Despite these features, Parquet's evolution has practical limitations: it lacks native support for type changes (e.g., converting a string to an integer) or permanent field deletions, requiring external tools, data compaction, or table formats such as Apache Iceberg or Delta Lake for such operations. Complex evolutions, including renaming fields or handling incompatible promotions, depend on reader-side logic and may lead to null values or errors if not managed carefully. Nested types benefit from the same mechanisms, as repetition and definition levels accommodate variability in hierarchical schemas. A representative example of schema evolution involves appending a new optional column, such as "user_age" of type INT32, to an existing dataset of user records. Older files lack this field, but when merged in Spark with mergeSchema=true, the resulting DataFrame includes "user_age" as optional, populated only in newer files while older rows receive nulls, preserving query functionality across versions.
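
The "user_age" example can be sketched with PySpark as follows (paths, values, and the local Spark session are assumptions; the inferred integer type may differ from INT32 depending on how the column is created):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-schema-evolution").getOrCreate()

# Version 1 of the dataset: no user_age column.
spark.createDataFrame([(1, "alice"), (2, "bob")], ["user_id", "name"]) \
     .write.mode("overwrite").parquet("users/v1")

# Version 2 adds an optional user_age column (inferred as a long here).
spark.createDataFrame([(3, "carol", 34)], ["user_id", "name", "user_age"]) \
     .write.mode("overwrite").parquet("users/v2")

# mergeSchema=true unions the footer schemas; old rows receive NULL user_age.
merged = spark.read.option("mergeSchema", "true").parquet("users/v1", "users/v2")
merged.printSchema()
merged.orderBy("user_id").show()
```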

Compression and Encoding

Encoding Techniques

Apache Parquet employs several encoding techniques to represent column data efficiently within its columnar structure, primarily applied at the level of data pages to minimize size before any compression is applied. These encodings target different data types and patterns, such as runs of repeated values, dictionaries of unique values, and sequential differences, enabling compact representation of both flat and nested data. By optimizing how values, repetition levels, and definition levels are stored, Parquet achieves significant space savings, particularly for analytical workloads involving selective column access.

Dictionary encoding replaces repeated values in a column chunk with indices that reference a shared dictionary of unique values, making it particularly effective for columns with low cardinality. The dictionary is built once per column chunk and stored in a dedicated page using plain encoding, while subsequent data pages store only the indices, which are encoded using run-length encoding (RLE) or bit-packing to further compact the representation. If the dictionary grows beyond a configured size threshold, the encoding falls back to plain to avoid excessive overhead. This approach reduces redundancy by mapping strings or numbers to compact integer IDs, with dictionary encoding supporting all data types including byte arrays.

Run-length encoding (RLE), used in a hybrid form with bit-packing, encodes sequences of identical values or levels (such as repetition and definition levels in nested structures) as pairs of value and run length, ideal for data with long runs of repeats like flags or sparse nested fields. In Parquet's hybrid scheme, the encoder alternates between bit-packed runs for short sequences of varying values and direct RLE runs for identical values, using a fixed bit width determined by the maximum value in the sequence (up to 32 bits). For example, a run of 100 identical levels of 0 would be stored as the value 0 followed by the length 100 in a compact format, with each run prefixed by a header indicating the run type and length. This minimizes space for both clustered and scattered patterns in levels, which are crucial for representing optional or repeated fields without storing explicit null indicators for every row.

Bit packing encodes small integers or levels into tightly packed bit fields, eliminating byte-level padding to achieve near-optimal density for fixed-width values like flags or low-range integers. In Parquet, bit packing is integrated into the RLE hybrid for repetition and definition levels, where groups of values are packed into whole bytes from the least significant bit (LSB) first, with any incomplete byte at the end left unpadded. For instance, a sequence of eight 1-bit values can be packed into a single byte, representing 0s and 1s directly. While a standalone bit-packed encoding exists, it is deprecated in favor of the more versatile RLE hybrid, which handles transitions between packed and run-length modes seamlessly. This is especially useful for data like levels, where values rarely exceed a few bits.

Delta encoding stores differences between consecutive values rather than the values themselves, providing efficiency for monotonically increasing sequences such as integers or timestamps in time-series data. For 32-bit and 64-bit integers, Parquet uses a binary-packed scheme that divides the sequence into blocks of 128 values, each block further split into miniblocks of 32 values; within each block, deltas (differences from the previous value) are bit-packed relative to a stored minimum delta, with the first value of the sequence stored plainly in the header. These components are packed using variable-length encodings such as ULEB128 for lengths and bit-packing for the deltas, and the scheme works best when the data is sorted so that deltas remain small. This method, inspired by techniques in columnar storage systems, can reduce storage by orders of magnitude for sorted numeric data, as small deltas fit into fewer bits.

For variable-length data like strings (byte arrays), delta length byte array encoding first applies delta encoding to the lengths of the arrays and then concatenates the payloads without separators, optimizing for repeated prefixes or similar lengths. The lengths are delta-encoded as described for integers, storing the base length followed by compact differences, while the actual byte data follows in sequence, relying on the decoded lengths for reconstruction. This is particularly beneficial for columns with strings of varying but clustered lengths, such as log messages or IDs. Similarly, delta byte array encoding (also known as delta strings) extends this by encoding both prefix lengths (shared starts between consecutive strings) and suffix lengths, storing only the differing suffix payloads after the prefixes; for fixed-length byte arrays, the same prefix/suffix scheme is applied to the full values. These techniques leverage incremental similarities in string data, common in sorted or grouped datasets, to achieve denser storage than plain length-prefixed encoding.

Byte stream split encoding, a more recent addition to the format specification, is designed for floating-point types (FLOAT and DOUBLE). It splits the values into byte streams based on the type size (e.g., four streams for FLOAT), scattering each value's bytes across the streams, and concatenates them without additional metadata or padding. While it does not reduce storage size on its own, it improves compression ratios and speeds when combined with subsequent compression algorithms by reorganizing the data for better compressibility.
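
Several of these encodings can be requested explicitly when writing with PyArrow, as in the sketch below (column names and values are illustrative; option names follow pyarrow.parquet.write_table, but exact support varies by PyArrow version, so treat this as an assumption to verify against your installed release).

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "event_id": pa.array(range(100_000), type=pa.int64()),    # sorted integers
    "country":  ["DE", "US", "NO", "US"] * 25_000,             # low cardinality
    "reading":  pa.array([0.5] * 100_000, type=pa.float32()),  # floating point
})

# Default behaviour: dictionary encoding is attempted for every column,
# with RLE/bit-packing applied to the dictionary indices.
pq.write_table(table, "dict_encoded.parquet")

# Explicit per-column encodings require dictionary encoding to be disabled.
pq.write_table(
    table, "delta_encoded.parquet",
    use_dictionary=False,
    column_encoding={
        "event_id": "DELTA_BINARY_PACKED",   # deltas of sorted integers
        "country":  "DELTA_BYTE_ARRAY",      # shared prefixes between strings
        "reading":  "BYTE_STREAM_SPLIT",     # reorder float bytes for compression
    },
)

# Inspect which encodings were actually written for each column chunk.
for name in ("dict_encoded.parquet", "delta_encoded.parquet"):
    rg = pq.ParquetFile(name).metadata.row_group(0)
    print(name, [(rg.column(i).path_in_schema, rg.column(i).encodings)
                 for i in range(rg.num_columns)])
```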

Compression Algorithms

Apache Parquet supports a pluggable set of compression codecs applied to the encoded data within dictionary pages and data pages, allowing users to configure the codec per column to balance storage efficiency and processing speed. This compression layer operates after encoding, reducing file sizes without loss of information, and the codec used is recorded in the column chunk metadata so readers can decompress transparently.

The default codec in many tools is Snappy, which prioritizes speed with low CPU overhead during compression and decompression, typically achieving 2-3x compression ratios for common datasets like web data or logs. Snappy, developed by Google, is designed for real-time use cases where rapid access is critical, such as interactive analytics pipelines. GZIP offers higher compression ratios, often up to 5x for text-heavy or repetitive data, making it suitable for archival storage where file size is prioritized over query speed; however, it incurs higher latency due to its deflate-based algorithm. This codec is widely supported and follows the RFC 1952 standard, ensuring interoperability across systems. LZ4 provides a balance between compression ratio and speed, with decompression rates sometimes exceeding Snappy's in benchmarks, while maintaining ratios around 2-3x for varied data types. Parquet uses the LZ4_RAW variant (block format without framing) to avoid compatibility issues, as the original LZ4 codec definition is deprecated. Zstandard (ZSTD), introduced in Parquet 1.10, is a modern codec with tunable compression levels (1-22), delivering superior ratios of 3-4x and faster performance than GZIP for most workloads, especially at levels 3-9. It excels in scenarios requiring both efficiency and speed and is specified in RFC 8478. Brotli, added in Parquet format version 2.4.0, targets high compression for static or web-like data, achieving ratios comparable to or better than GZIP with moderate speed trade-offs, leveraging its dictionary-based approach for efficiency in archival or transmission use cases. Other codecs like LZO and the uncompressed option are available but less commonly used; LZO focuses on fast decompression. Overall, codec selection depends on data characteristics and workload, with tools like Spark allowing runtime configuration of the codec.
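
A brief sketch of the per-column codec configuration with PyArrow (column names and the level choice are illustrative assumptions): Snappy for a numeric column, ZSTD at a tunable level for a text-heavy one.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "log_line": ["GET /index.html 200"] * 50_000,   # repetitive text
    "latency_ms": list(range(50_000)),
})

# Snappy for the numeric column, ZSTD (level 6) for the text-heavy column.
pq.write_table(
    table, "logs.parquet",
    compression={"latency_ms": "snappy", "log_line": "zstd"},
    compression_level={"log_line": 6},
)

# The codec and sizes are recorded per column chunk in the footer metadata.
rg = pq.ParquetFile("logs.parquet").metadata.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    print(col.path_in_schema, col.compression,
          col.total_compressed_size, col.total_uncompressed_size)
```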

Use Cases and Applications

Big Data Processing

Apache Parquet plays a central role in traditional big data pipelines, particularly for batch analytics and extract-transform-load (ETL) processes within the Hadoop ecosystem. It offers native support in Apache Hive starting from version 0.13, allowing users to create and query Parquet tables directly using HiveQL for distributed SQL operations on large datasets. Similarly, Apache Pig integrates with Parquet through storage functions, enabling data loading and processing in scripts for complex ETL tasks across Hadoop clusters. These integrations facilitate efficient querying of Parquet files in MapReduce jobs, where the columnar structure minimizes data shuffling and I/O overhead during batch computations.

In Apache Spark, Parquet enhances DataFrame operations by leveraging predicate pushdown, which filters data at the storage level using embedded statistics in Parquet files, thereby reducing the volume of data loaded into memory for distributed processing. This optimization supports efficient execution of distributed joins and aggregations, as Spark can skip irrelevant row groups and columns during scans, accelerating analytical workloads in cluster-based environments. For instance, when performing group-by aggregations on multi-terabyte datasets, predicate pushdown can significantly reduce unnecessary data reads in selective queries.

ETL workflows commonly involve writing partitioned Parquet files from diverse sources such as application logs or relational databases, organizing data by keys like date or category to enable partition pruning in downstream batch jobs. This approach allows for scalable ingestion and transformation, where ETL tools extract raw data, apply schema enforcement, and output optimized Parquet partitions for reliable storage in the Hadoop Distributed File System (HDFS). Partitioning further boosts performance by limiting scans to specific directories during ETL reloads or incremental updates. However, users should ensure Parquet libraries are updated to the latest versions (e.g., 1.15.1 or later as of 2025) to mitigate vulnerabilities like CVE-2025-30065 in the parquet-avro module, which could allow remote code execution when processing untrusted files.

Parquet delivers significant performance gains in Hadoop environments due to its columnar access pattern, which avoids reading entire rows for column-wise operations. This efficiency stems from selective column retrieval and built-in compression, reducing I/O and CPU costs in batch pipelines. Encoding techniques such as dictionary encoding contribute to these gains by minimizing storage footprints, often achieving 75% or more reduction compared to uncompressed text formats. A representative example is the processing of terabyte-scale event data in batch jobs at companies such as Uber, where analytics on service logs rely on Parquet files stored in HDFS for efficient querying.
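
An illustrative ETL sketch with PySpark (paths, columns, and values are assumptions): write date-partitioned Parquet, then read with filters so Spark prunes partitions and pushes the remaining predicate down to the Parquet scan.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("parquet-etl").getOrCreate()

raw = spark.createDataFrame(
    [("2024-11-01", "click", 3), ("2024-11-02", "view", 7)],
    ["event_date", "event_type", "count"],
)

# Partitioned write: one directory per event_date value (Hive-style layout).
raw.write.mode("overwrite").partitionBy("event_date").parquet("warehouse/events")

# Downstream batch job: the partition filter limits which directories are
# listed, and the event_type predicate is pushed down to row-group statistics.
daily = (spark.read.parquet("warehouse/events")
         .where((F.col("event_date") == "2024-11-01") &
                (F.col("event_type") == "click"))
         .groupBy("event_type").sum("count"))
daily.explain()   # plan shows PartitionFilters and PushedFilters
daily.show()
```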

Cloud and Data Lakes

Apache Parquet is optimized for storage in cloud object stores such as Amazon S3, Google Cloud Storage (GCS), and Azure Blob Storage, where its columnar format and built-in compression enable efficient handling of large-scale data without the need for traditional file systems. This compatibility allows serverless querying services like Amazon Athena, Google BigQuery, and Azure Synapse serverless SQL pools to directly access and analyze Parquet files stored in these object stores, leveraging pushdown predicates and column pruning to minimize data transfer and processing overhead.

In data lake architectures, Parquet serves as the foundational storage format for open table formats like Apache Iceberg and Delta Lake, which build upon Parquet files to provide atomicity and ACID transactions through metadata layers and transaction logs. These integrations enable reliable updates, deletes, and schema evolution in cloud-based data lakes, ensuring data consistency across distributed object storage without requiring centralized servers. Parquet's compression techniques, such as Snappy or Zstandard, deliver significant cost efficiencies in pay-per-use models by reducing storage requirements—often achieving up to 75% savings compared to uncompressed formats—and accelerating query times through smaller data scans. In serverless environments, this translates to lower compute charges, as services like Athena bill based on scanned data volume, making Parquet ideal for cost-effective scaling in petabyte-scale lakes.

The schema-on-read capability of Parquet facilitates ad-hoc analysis on raw data dumps in cloud data lakes, as query engines automatically infer the schema from the file metadata without requiring upfront ETL transformations. This approach supports flexible exploration of diverse datasets, such as log files or sensor data, directly from object storage. For instance, analytics pipelines on petabyte-scale data lakes often partition files by date and region to enable efficient filtering and parallelism; queries can then target specific partitions (e.g., year=2024/month=11/region=us-east), reducing scan volumes and enabling faster insights in tools like Athena or BigQuery.
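
A hedged sketch of schema-on-read over a Hive-partitioned lake in object storage using pyarrow.dataset (the bucket name, layout, and column names are hypothetical, and credentials for the object store are assumed to be configured): the schema is inferred from file footers, and the filter limits the scan to year=2024/month=11/region=us-east.

```python
import pyarrow.dataset as ds

lake = ds.dataset(
    "s3://example-data-lake/events/",   # placeholder bucket/prefix
    format="parquet",
    partitioning="hive",                 # parses year=.../month=.../region=...
)
print(lake.schema)                       # inferred; no upfront ETL or catalog

table = lake.to_table(
    filter=(ds.field("year") == 2024)
           & (ds.field("month") == 11)
           & (ds.field("region") == "us-east"),
    columns=["user_id", "event_type"],   # assumed column names
)
print(table.num_rows)
```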

Comparisons with Other Formats

Columnar Formats

Both Apache Parquet and Optimized Row Columnar (ORC) are columnar storage formats designed to optimize analytical queries by storing data column-wise, enabling efficient compression and selective column reads in big data environments. Parquet provides broader language and ecosystem support compared to ORC, with native integrations across tools like Spark, Presto, and Impala, making it suitable for diverse, multi-engine environments, while ORC is more tightly coupled to its Hadoop and Hive origins. Parquet also excels in handling nested data structures through its Dremel-inspired model using repetition and definition levels, which supports complex schemas more flexibly than ORC's length-and-presence model. In contrast, ORC includes Hive-specific optimizations such as bloom filters at finer granularity (every 10,000 rows) and index streams like zone maps for improved predicate pushdown and data skipping in selection-heavy queries.

Regarding compression trade-offs, Parquet supports a wider range of flexible codecs including GZIP, Snappy, Zstandard, LZ4, and LZO, allowing users to balance compression ratios and speed based on workload needs, whereas ORC defaults to ZLIB with options like Snappy and LZ4 but incurs lighter metadata overhead in some cases. Both formats achieve similar 3-5x compression ratios on analytical datasets, with Parquet often producing slightly smaller file sizes in string-heavy and log workloads due to default dictionary encoding on all columns. Performance-wise, Parquet enables faster cross-engine reads and sequential scans through its simpler encoding and streaming layout, while ORC performs better in update-heavy workloads leveraging its fine-grained indexes for projection and filtering.

In terms of ecosystem positioning, Parquet's general-purpose design and extensive community adoption make it ideal for heterogeneous pipelines involving multiple processing engines, whereas ORC's strengths in Hive-optimized analytics suit environments focused on Hadoop-based querying. Users should choose Parquet for multi-tool, read-intensive analytics with complex data types, and ORC for Hive-centric deployments requiring advanced indexing for frequent updates and selective reads.

Row-Based Formats

Row-based formats, such as Avro and CSV, store data sequentially by records, contrasting with Parquet's columnar organization that optimizes for selective column access in analytical workloads. This row-oriented approach suits scenarios requiring full record retrieval or frequent updates, but it often leads to inefficiencies in storage and query performance when compared to Parquet's design for large-scale analytics.

In comparison to Avro, Parquet offers superior compression through column-specific techniques like dictionary and run-length encoding, resulting in smaller file sizes and up to 10x faster scan speeds for analytical queries due to predicate pushdown and metadata skipping. Avro, however, excels in schema evolution for streaming applications, providing full support for type changes, additions, and removals via its embedded schema, making it more suitable for dynamic, real-time pipelines. Additionally, Avro's compact binary encoding produces smaller records optimized for transmission in messaging systems like Kafka, whereas Parquet prioritizes analytics over such portability. Performance-wise, Parquet aligns with write-once, read-many patterns in data lakes, delivering efficient reads for OLAP tasks, while Avro's row-based structure enables faster writes and appends in write-heavy OLTP or streaming environments.

Parquet natively handles complex, nested data structures and enforces schemas at the file level, capabilities absent in CSV, which relies on simple delimited text without built-in schema support. CSV files lack compression and columnar statistics, often resulting in 10-20x larger sizes compared to Parquet—for instance, a 1 TB CSV dataset can compress to 130 GB in Parquet, an 87% reduction—while also missing predicate pushdown, forcing full scans that increase I/O overhead. This parsing and deserialization burden makes CSV unsuitable for large-scale analytical processing, where Parquet reduces scanned data by up to 99% and accelerates queries by 34x in large-scale scenarios.

Use cases diverge accordingly: Parquet thrives in OLAP and data warehousing for read-intensive analytics, Avro in OLTP, streaming, and record-oriented transmission, and CSV in lightweight, small-scale exports or human-readable interchange. Trade-offs include Parquet's emphasis on immutability for reliable, long-term storage in analytical pipelines versus Avro's compactness and flexibility for evolving, record-centric workflows.
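
The CSV-versus-Parquet trade-off can be measured directly, as in the small sketch below (synthetic data, so the exact sizes and ratios will differ from the figures quoted above): the same table is written in both formats, on-disk sizes are compared, and a single column is read back from the Parquet file.

```python
import os
import pyarrow as pa
import pyarrow.csv as pacsv
import pyarrow.parquet as pq

table = pa.table({
    "id": list(range(500_000)),
    "status": ["ok", "ok", "ok", "error"] * 125_000,   # highly repetitive
})

pacsv.write_csv(table, "events.csv")
pq.write_table(table, "events.parquet", compression="zstd")

print("csv bytes:    ", os.path.getsize("events.csv"))
print("parquet bytes:", os.path.getsize("events.parquet"))

# Column pruning: Parquet decodes only "status"; a CSV reader must parse
# every row in full before discarding the unwanted field.
status_only = pq.read_table("events.parquet", columns=["status"])
print(status_only.num_rows)
```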

Implementations and Ecosystem

Libraries and APIs

Apache Parquet provides official and community-maintained libraries across multiple programming languages, enabling developers to read, write, and manipulate Parquet files with language-specific APIs that adhere to the format's columnar structure and metadata standards. These implementations emphasize efficiency in handling large datasets, supporting features like schema handling and compression, while integrating with broader ecosystems such as Apache Arrow for cross-language data interchange.

The primary Java implementation, known as parquet-java (formerly parquet-mr), serves as the reference implementation and includes modules for core format handling and Hadoop integration. It offers ParquetInputFormat and ParquetOutputFormat for reading and writing in MapReduce jobs, allowing configuration of input splits and output compression through Hadoop's JobConf. Schema definition in Java uses MessageType builders or integrations with serialization formats like Avro and Thrift, enabling nested data structures. Writer options include selectable compression codecs (e.g., Snappy, GZIP) and row group sizes, typically set to 128 MB for optimal performance, while readers access per-column statistics for filtering and validation.

For C++, the parquet-cpp library, now merged into the Apache Arrow project, delivers high-performance I/O operations tightly coupled with Arrow's in-memory columnar format. Key classes include parquet::arrow::FileReader for loading files into Arrow tables via ReadTable(), and parquet::arrow::FileWriter for batch-wise writing with WriteTable(). This implementation powers backends in tools like DuckDB and supports schema introspection through parquet::schema::GroupNode. Writer properties allow customization of compression (e.g., ZSTD, LZ4) and row group length, while reader properties enable statistics collection for min/max values and null counts per column.

Python support is provided through PyArrow, the Python bindings for Apache Arrow, which expose intuitive functions for Parquet operations integrated with libraries like Pandas. Users can write DataFrames to Parquet using pandas.DataFrame.to_parquet() or Arrow's pq.write_table(table, 'file.parquet'), specifying options like compression ('snappy') and row group size. Reading occurs via pandas.read_parquet() or pq.read_table(), returning Arrow tables or Pandas objects with access to file metadata and column statistics for query optimization. Schema handling leverages Arrow's type system, supporting complex types like lists and structs.

Implementations in other languages are maintained under the Apache Arrow and Parquet projects, though with varying levels of maturity. The Go library, part of Apache Arrow Go, provides APIs for reading and writing Parquet files, supporting schema parsing and basic writer options. In Rust, the parquet crate offers a native implementation with modules for serialization (parquet::file::writer::SerializedFileWriter) and deserialization, including schema resolution and statistics access, optimized for use in Arrow-based data pipelines.

Across libraries, common APIs focus on schema definition using physical and logical types (e.g., INT64, STRING, DECIMAL), writer configurations for compression algorithms and row group sizes to balance I/O and memory usage, and reader access to statistics for efficient skipping of irrelevant data chunks. These APIs standardize interactions while allowing language-specific optimizations, ensuring compatibility with the format specification.
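
A minimal sketch of the pandas-level workflow mentioned above (the file name and columns are illustrative): DataFrame.to_parquet() delegates to PyArrow, read_parquet() restores types from the embedded schema, and the same file remains visible to the lower-level metadata API.

```python
import pandas as pd
import pyarrow.parquet as pq

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "signup": pd.to_datetime(["2024-01-05", "2024-02-11", "2024-03-20"]),
    "plan": pd.Categorical(["free", "pro", "free"]),
})

# engine and compression are pass-through writer options.
df.to_parquet("accounts.parquet", engine="pyarrow", compression="snappy", index=False)

restored = pd.read_parquet("accounts.parquet", columns=["user_id", "plan"])
print(restored.dtypes)

# The footer metadata of the same file, seen through the lower-level API.
print(pq.ParquetFile("accounts.parquet").metadata)
```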

Integrations with Frameworks

Apache Parquet is natively supported in Apache Spark through its Spark SQL and DataFrame APIs, with initial integration introduced in Spark 1.1.0 in 2014, allowing seamless reading and writing of Parquet files while preserving schema information. This support enables efficient columnar storage and query execution in distributed environments, with key optimizations including vectorized readers that process data in batches for up to 10x faster performance compared to non-vectorized decoding, and built-in partition discovery to align with Hive-style partitioning for reduced data shuffling during operations.

In engines like Presto and its fork Trino, Parquet serves as a core format for querying data lakes, leveraging the Hive connector to enable federated SQL queries across heterogeneous sources such as HDFS and cloud object storage without data movement. This integration supports interactive analytics on large-scale datasets stored in data lakes, with Trino's architecture distributing query execution to handle petabyte-scale federated operations efficiently. Apache Hive integrates Parquet via a dedicated SerDe (Serializer/Deserializer) added natively in Hive 0.13.0, facilitating SQL-based querying and table management on storage systems like HDFS and Amazon S3. This allows Hive users to define external tables over Parquet files, enabling schema evolution and data warehouse workflows over Parquet data.

Beyond these, Parquet is embedded in parallel computing frameworks like Dask, where it supports scalable reads and writes through engines such as PyArrow, enabling distributed Python operations on large datasets without loading everything into memory. Additionally, table formats like Delta Lake and Apache Iceberg build atop Parquet files to provide advanced features such as time travel, schema enforcement, and transactional consistency in data lakes. As of 2025, Apache Spark 4.0 enhances schema handling with SQL MERGE syntax that supports schema evolution during writes to accommodate dynamic data pipelines. Similarly, Trino version 476 introduces support for comparing geometry types, enhancing geospatial capabilities with OGC-compliant functions for spatial analytics on Parquet-stored geographic data.
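
An illustrative Dask sketch of the framework integration described above (the path and column names are assumptions, e.g. the partitioned output of an earlier Spark job): a lazy, partition-aware read of Parquet through PyArrow, without loading the whole dataset into memory.

```python
import dask.dataframe as dd

# Dask reads Parquet through PyArrow under the hood; the filter is pushed
# down to partitions and row-group statistics, and columns are pruned.
events = dd.read_parquet(
    "warehouse/events/",
    columns=["event_type", "count"],
    filters=[("event_date", "==", "2024-11-01")],
)

summary = events.groupby("event_type")["count"].sum()
print(summary.compute())    # triggers the parallel, distributed read
```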

Limitations and Future Directions

Challenges and Limitations

Apache Parquet's columnar storage design introduces several performance challenges, particularly in write operations. Writing data to Parquet files is generally slower than to row-based formats like Avro or CSV, primarily due to the overhead involved in reorganizing data into columns, applying encodings, and generating metadata. Benchmarks indicate that this can result in significantly slower write speeds for small or incremental writes compared to row-oriented formats, making Parquet less ideal for write-heavy workloads.

A core limitation stems from Parquet's immutability, where files are treated as immutable and cannot be modified in place. Any updates, deletions, or insertions require rewriting entire files or partitions, which is computationally expensive and inefficient for frequent changes. This design renders Parquet unsuitable for online transaction processing (OLTP) scenarios that demand low-latency updates, favoring instead batch-oriented analytical processing.

Schema evolution in Parquet is constrained, supporting primarily the addition of new fields or marking existing ones as optional to maintain compatibility. More complex changes, such as altering data types or reordering fields, often necessitate external tools, full data rewrites, or table formats like Apache Iceberg and Delta Lake for handling, as Parquet's embedded schema metadata lacks robust forward evolution mechanisms. These restrictions can complicate maintenance in evolving data pipelines.

Security concerns have also emerged, exemplified by CVE-2025-30065, a critical remote code execution (RCE) vulnerability disclosed in early 2025 affecting the parquet-avro module in versions up to 1.15.0. This flaw, stemming from insecure schema parsing, allows attackers to execute arbitrary code by crafting malicious Parquet files, impacting readers in big data systems; users must apply patched releases (1.15.1 or later) to mitigate risks.

Additionally, Parquet performs poorly for frequent small reads or ad-hoc queries requiring row-level access, as its columnar layout and metadata scanning introduce unnecessary overhead. In scenarios with very small files, the embedded metadata—such as footers detailing column chunks and statistics—can consume a disproportionate amount of space and processing time, exacerbating the "small files problem" in distributed systems. Parquet files are also not human-readable without specialized tools, hindering quick inspections.

Ongoing Developments

Apache Parquet's development continues through regular version releases by the Apache Parquet community, with the 1.14 series introduced in 2024 featuring bug fixes and performance enhancements, including improved dictionary filtering to reduce false positives during reads. The 1.15 series began with the 1.15.0 release in December 2024, followed by 1.15.1 in March 2025, primarily addressing critical security vulnerabilities but also incorporating ongoing optimizations for better compatibility and efficiency in data processing pipelines.

Community efforts have focused on extending Parquet's applicability to specialized domains, notably through the GeoParquet specification, which published its 1.0 release in September 2023 and is advancing through the Open Geospatial Consortium as an incubating standard. This extension embeds geospatial primitives like points, lines, and polygons directly into Parquet files, enabling efficient storage and querying of spatial data in columnar formats without requiring separate metadata layers. Additionally, Parquet's integration with Apache Arrow has advanced in-memory read capabilities, allowing seamless data sharing across languages and systems to minimize overhead in analytical workflows.

In the broader ecosystem, Parquet is seeing deeper integration with table formats such as Apache Iceberg version 3, released in mid-2025, which introduces deletion vectors stored as compact sidecar Puffin files alongside Parquet data files to enable efficient row-level updates and deletes without rewriting entire datasets. This evolution supports time-travel queries and regulatory compliance in data lakes, enhancing Parquet's role in managed storage systems.

Recent discussions in 2025 highlight challenges with Parquet in machine learning workloads, where its columnar scan patterns can create bottlenecks for random access and iterative training on large datasets. Proposals for formats combining Parquet's strengths with row-oriented or vectorized structures aim to address these limitations, potentially reducing I/O overhead for ML-specific operations like feature sampling.

Looking ahead, future directions emphasize enhanced write parallelism to scale with distributed systems, with recent advancements claiming 40-60% faster write speeds through optimized compression and partitioning techniques. Security hardening remains a priority following the disclosure of CVE-2025-30065 in April 2025, a critical remote code execution vulnerability in the parquet-avro module affecting versions up to 1.15.0, which was mitigated in 1.15.1 by restricting unsafe schema deserialization. In September 2025, version 1.16.0 was released, introducing further performance improvements and bug fixes. However, security issues persist, with CVE-2025-46762 disclosed later in 2025 as another RCE vulnerability in the parquet-avro module affecting versions through 1.15.1. The Parquet specification was updated in 2025 to include geometry and geography types for spatial data support, and in October 2025, the Variant type was introduced as a new open standard for efficient storage of semi-structured data, compatible with Parquet, Delta Lake, and Iceberg.
