Apache Parquet
Apache Parquet is an open-source, column-oriented data file format designed for efficient data storage and retrieval, particularly in big data analytics ecosystems like Apache Hadoop.[1] Developed collaboratively by Twitter and Cloudera, Parquet was first released in July 2013 and later entered the Apache Incubator, drawing inspiration from the columnar storage and query techniques outlined in Google's 2010 Dremel paper.[2] The format addresses limitations of row-oriented storage by organizing data into row groups (horizontal partitions of rows), each containing column chunks for individual columns, which are subdivided into pages as the basic units for encoding and compression.[3] This hierarchical structure supports complex nested data schemas, advanced encodings (such as dictionary, run-length, and bit-packing), and pluggable compression algorithms like Snappy or GZIP, enabling significant reductions in storage footprint and I/O overhead.[4]

Parquet's columnar layout offers key advantages for analytical processing, including faster query execution by scanning only required columns and improved compression ratios due to grouping similar data types together, often achieving 75% or more space savings over text-based formats like CSV.[5] Unlike row-based formats such as Avro, which prioritize transactional updates, Parquet excels in read-heavy workloads typical of data lakes and warehouses, with built-in support for schema evolution to handle evolving data structures without rewriting files.[5]

It integrates seamlessly with major frameworks including Apache Spark, Hive, Presto, and Drill, as well as cross-language libraries like Apache Arrow for in-memory data transfer, making it a standard for distributed computing environments.[1] Widely adopted in cloud platforms such as Amazon S3 and Google Cloud Storage, Parquet facilitates scalable analytics on petabyte-scale datasets while maintaining splittability for parallel processing across clusters.[5]

History
Origins and Initial Development
Apache Parquet originated from a collaborative effort between engineers at Twitter and Cloudera in early 2013, aimed at addressing the limitations of existing row-oriented data formats in the Hadoop ecosystem.[6][7] At the time, formats like Apache Avro were efficient for row-based operations but inefficient for analytical workloads that required scanning large volumes of data across specific columns, leading to excessive I/O and poor query performance.[6][2] The project sought to introduce a columnar storage format that could support complex nested data structures while enabling better compression and faster scans for big data processing.[7]

The development was heavily inspired by the columnar storage techniques described in Google's 2010 Dremel paper, which outlined methods for efficient analytical querying on massive datasets using record shredding and assembly algorithms to handle nested data without flattening it.[8][2] This influence guided Parquet's design to prioritize column-wise storage, repetition and definition levels for reconstructing nested structures, and per-column encoding schemes to optimize for modern hardware and reduce storage overhead.[7][2] By adapting these principles to an open-source context, the collaborators aimed to make advanced columnar capabilities accessible to the broader Hadoop community, filling a gap in tools for high-performance analytics.[6]

Parquet was first introduced as an open-source project in March 2013, with the initial version, Parquet 1.0, released in July 2013.[6] This release focused on core functionalities, including a basic columnar layout for primitive and nested types, embedded metadata for schema description, and initial support for encodings like dictionary and bit-packing to enable efficient data representation.[6][7] Early implementations targeted integration with Hadoop components such as MapReduce and Hive, setting the foundation for its use in analytical pipelines.[6]

Adoption and Milestones
Apache Parquet was donated to the Apache Software Foundation and entered the Incubator on May 20, 2014, after initial development by Twitter and Cloudera. It graduated to a top-level Apache project on April 27, 2015, marking its full integration into the open-source ecosystem and community-driven governance.[9][10]

The project saw several key version releases that advanced its capabilities. Parquet 1.0 was released on July 30, 2013, establishing the core columnar file format optimized for Hadoop analytics. Parquet 1.10.0 followed on April 5, 2018, introducing support for Zstandard (ZSTD) compression alongside other codecs to enhance storage efficiency.[6][11] Parquet 1.12.0 arrived on March 17, 2021, with enhancements to schema resolution and compatibility for evolving data structures. Parquet 1.13.0 was issued on April 6, 2023, including optimizations for processing nested and complex data types to improve query performance.[12][13] Subsequent releases included Parquet 1.14.0 on May 7, 2024, focusing on bug fixes and stability improvements; 1.15.0 on December 2, 2024; and 1.16.0 on September 3, 2025. The Parquet format specification advanced to version 2.12.0 on August 28, 2025, adding support for the VARIANT logical type to handle semi-structured data more efficiently.[14][15][16][17]

Adoption grew rapidly within the big data landscape. By 2015, Parquet had been incorporated into major Hadoop distributions, including Cloudera CDH and Hortonworks HDP, facilitating its use in enterprise analytics workflows. By 2020, it had achieved widespread use as a de facto standard for columnar storage in big data environments, powering efficient data processing in tools like Apache Spark and Hive across industries. A significant milestone came in 2016 with its integration into the newly incubating Apache Arrow project, which provided in-memory columnar data interchange and cross-language compatibility, further boosting Parquet's interoperability.[18][19]

Design and File Format
Core Principles
Apache Parquet is fundamentally designed as a columnar storage format, where data is organized by columns rather than rows, allowing for more efficient compression and selective reading of only the relevant columns during query execution, particularly for analytical workloads that apply predicates on specific fields.[1] This columnar model minimizes I/O overhead by enabling systems to skip irrelevant data, making it ideal for scan-heavy operations in big data environments.[20]

A key principle of Parquet is its self-describing nature, achieved by embedding the schema and metadata directly in the file's footer, which allows files to be read independently without requiring external schema definitions or catalogs.[4] The footer includes details such as column locations, data types, and statistics, facilitating interoperability across different tools and languages while supporting schema evolution over time.[20]

Parquet is optimized for write-once, read-many scenarios common in big data processing, prioritizing high scan speeds and low storage costs over frequent updates or transactional support, as updates would require rewriting entire row groups.[7] This design choice aligns with its roots in handling bulk analytical queries, where data is typically appended in immutable batches rather than modified incrementally.[1]

From its inception, Parquet has supported complex data types, including nested structures such as maps, lists, and repeated fields, using techniques like record shredding to flatten hierarchical data into columnar form while preserving logical relationships.[20] This capability, inspired by the columnar storage approach in Google's Dremel system, enables efficient handling of semi-structured data from sources like JSON or Protocol Buffers in analytical pipelines.[20]

Structure Components
Apache Parquet files follow a structured layout consisting of a header, a data section, and a footer to enable efficient columnar storage and metadata access.[4] The header begins with 4-byte magic bytes "PAR1" in ASCII encoding, identifying the file as Parquet format.[4] The data section contains the actual columnar data organized into row groups, while the footer holds all metadata and ends with another 4-byte "PAR1" magic number preceded by the metadata length in little-endian format.[4]

Row groups represent the primary horizontal partitioning unit in a Parquet file, dividing the dataset into independent chunks for parallel processing and storage optimization.[4] By default, row groups are sized at approximately 128 MB, though recommendations suggest larger sizes of 512 MB to 1 GB to align with typical distributed file system block sizes like HDFS.[21][20] Each row group encompasses a subset of rows and includes column chunks for every column in the schema, allowing readers to process entire groups in memory for column-wise operations.[4]

Within each row group, data is further partitioned vertically into column chunks, one per column, to support selective column reading and compression tailored to individual columns.[4] A column chunk comprises one or more pages: data pages that store the actual values along with repetition and definition levels for handling nesting and nulls, and optional dictionary pages that hold unique values for dictionary-based encodings.[4] This page-level subdivision enables fine-grained access, with data pages typically limited to 1 MB for efficient buffering during reads and writes.[20]

The footer contains comprehensive file metadata serialized using Apache Thrift's TCompactProtocol, ensuring compact and efficient parsing.[22] This metadata includes the schema definition in a depth-first traversal, detailing field names, physical and logical types (such as INT32, STRING, or nested groups/maps), and annotations for repetition (required, optional, or repeated) and definition levels to manage optional fields and repeated structures in nested data.[22][20] Repetition levels track the hierarchy of repeated elements by indicating the depth of repetition from the previous value, while definition levels specify the number of optional fields that are null or defined in the path to the current value, enabling compact representation of complex schemas without explicit null markers.[20] Additionally, the footer lists row group and column chunk metadata, including offsets, sizes, and statistics like min/max values for query optimization.[22]
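Because the footer is self-contained, this layout can be inspected without scanning any data pages. The following is a minimal sketch using PyArrow (the file name events.parquet is an illustrative placeholder) that prints the embedded schema along with row-group and column-chunk metadata such as compression and min/max statistics:

```python
# Sketch: inspect the footer of a Parquet file with PyArrow.
# "events.parquet" is a hypothetical file name used only for illustration.
import pyarrow.parquet as pq

pf = pq.ParquetFile("events.parquet")

print(pf.schema_arrow)                      # schema reconstructed from the footer
meta = pf.metadata
print("row groups:", meta.num_row_groups, "rows:", meta.num_rows)

for rg in range(meta.num_row_groups):
    row_group = meta.row_group(rg)
    for col in range(row_group.num_columns):
        chunk = row_group.column(col)       # column-chunk metadata
        stats = chunk.statistics            # min/max/null counts, if written
        print(rg, chunk.path_in_schema, chunk.compression,
              chunk.total_compressed_size,
              None if stats is None else (stats.min, stats.max, stats.null_count))
```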
Features
Key Capabilities
Apache Parquet's columnar storage structure enables column pruning, which allows query engines to load only the specific columns required for a given analytical query, thereby minimizing I/O operations and improving performance on large datasets. This capability is facilitated by the organization of data into independent column chunks within row groups, where metadata describes the layout and content of each chunk without necessitating the reading of irrelevant columns.[23]

Complementing column pruning is predicate pushdown, a technique where filtering predicates from queries are evaluated against the file's metadata to skip entire row groups or column chunks that do not satisfy the conditions, reducing the volume of data scanned during reads. This optimization is particularly effective for selective queries in big data environments, as it pushes the filtering logic to the storage level before data is loaded into memory.[22]

Parquet supports nested data structures through group types for structs and logical types for lists and maps, enabling the representation of complex, hierarchical data common in modern applications. To encode these hierarchies efficiently in a columnar format, Parquet employs repetition levels and definition levels: repetition levels indicate the level at which a value is repeated within a nested path, while definition levels track the number of optional fields that are null or defined, allowing reconstruction of the original schema without storing explicit null markers for every position.[24]

Each column chunk in Parquet includes statistics in its metadata, such as minimum and maximum values along with null counts, which aid query planners in determining data relevance and enabling further optimizations like skipping non-qualifying chunks. These statistics are computed per chunk to provide granular insights into data distribution, supporting efficient query planning without full data scans.[20]

In October 2025, Apache Parquet ratified the Variant logical type, a new open standard for storing semi-structured data. This binary format uses shredding to extract common fields into typed column chunks, improving performance for dynamic schemas and nested data by enabling better compression, faster reads (up to 8x improvement), and enhanced data skipping.[25]
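A short sketch of both optimizations using PyArrow (the file and column names are assumptions for illustration): the columns argument restricts reads to the named column chunks, while filters is evaluated against row-group statistics so non-qualifying row groups can be skipped.

```python
# Sketch: column pruning and predicate pushdown with PyArrow.
# "sales.parquet", "region", and "amount" are hypothetical names.
import pyarrow.parquet as pq

table = pq.read_table(
    "sales.parquet",
    columns=["region", "amount"],        # only these column chunks are read
    filters=[("amount", ">", 1000)],     # row groups whose statistics cannot match are skipped
)
print(table.num_rows)
```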
Schema Evolution
Apache Parquet supports schema evolution, enabling datasets to adapt to structural changes over time without necessitating complete data rewrites. This capability is built into the file format's design, which embeds schema metadata within each file, allowing readers to interpret varying schemas across files in a dataset. Like other formats such as Protocol Buffers, Avro, and Thrift, Parquet facilitates gradual schema modifications, primarily through the use of optional fields and logical type annotations.[26]

Backward compatibility in Parquet ensures that newer readers can process older files by ignoring any newly added optional fields, provided that all required fields from prior schemas remain unchanged and in the same positions. Required fields cannot be removed or altered in type without breaking compatibility, as their presence is strictly enforced during reads. This design prevents errors when evolving schemas in production environments, where data from multiple versions may coexist.[26]

Forward compatibility allows older readers to handle files with additional or unknown fields by skipping them, leveraging definition levels to determine value presence without attempting to parse incompatible elements. Definition levels, stored alongside data pages, encode whether a value exists for a given field in nested or repeated structures, enabling safe navigation of schema differences. This mechanism supports reading across schema versions without data loss or corruption.[26]

Schema merging is handled by processing tools like Apache Spark and Apache Hive, which combine schemas from multiple Parquet files during ingestion. Merging promotes missing fields to optional status, resolves naming conflicts by favoring the latest definitions, and performs type promotions (such as integer to long) when compatible. To enable this in Spark, the mergeSchema option must be set to true, though it incurs computational overhead by scanning file footers. Hive similarly supports evolution via its metastore, applying union-like operations to derive a superset schema.[26]
Despite these features, Parquet's schema evolution has practical limitations: it lacks native support for type changes (e.g., converting a string to an integer) or permanent field deletions, requiring external orchestration, data compaction, or rewriting for such operations. Complex evolutions, including renaming fields or handling incompatible promotions, depend on reader-side logic and may lead to null values or errors if not managed carefully. Nested data types are subject to the same constraints, although repetition and definition levels accommodate structural variability in hierarchical schemas.[26]
A representative example of schema evolution involves appending a new optional column, such as "user_age" of type INT32, to an existing dataset of user records. Older files lack this field, but when merged in Spark with mergeSchema=true, the resulting DataFrame includes "user_age" as optional, populated only in newer files while older rows receive nulls, preserving query functionality across versions.[26]
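A minimal PySpark sketch of this scenario, assuming two hypothetical directories /data/users_v1 (written before "user_age" existed) and /data/users_v2 (written with it):

```python
# Sketch: merging evolving Parquet schemas in PySpark.
# Paths and column names are assumptions for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-evolution-demo").getOrCreate()

merged = (
    spark.read.option("mergeSchema", "true")      # scan footers and union the schemas
    .parquet("/data/users_v1", "/data/users_v2")
)

merged.printSchema()                 # includes user_age as a nullable column
merged.select("user_age").show()     # nulls for rows written before the column existed
```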
Compression and Encoding
Encoding Techniques
Apache Parquet employs several encoding techniques to represent column data efficiently within its columnar structure, primarily applied at the level of data pages to minimize storage overhead before any compression is applied. These encodings target different data types and patterns, such as repetitions, dictionaries of unique values, and sequential differences, enabling compact representation of both primitive and nested data. By optimizing how values, repetition levels, and definition levels are stored, Parquet achieves significant space savings, particularly for analytical workloads involving selective column access.[27]

Dictionary encoding replaces repeated values in a column chunk with integer indices that reference a shared dictionary of unique values, making it particularly effective for columns with low cardinality. The dictionary is built once per column chunk and stored in a dedicated dictionary page using plain encoding, while subsequent data pages store only the indices, which are encoded using run-length encoding (RLE) or bit-packing to further compact the representation. If the dictionary size exceeds a threshold (typically 50% of the page size), the encoding falls back to plain to avoid excessive overhead. This approach reduces redundancy by mapping strings or integers to compact IDs, with the dictionary supporting all data types including byte arrays.[27]

Run-length encoding (RLE), often in a hybrid form with bit-packing, encodes sequences of identical values or levels (such as repetition and definition levels in nested structures) as pairs of value and run length, ideal for data with long runs of repeats like boolean flags or sparse nested fields. In Parquet's implementation, the RLE hybrid alternates between bit-packed runs for short sequences of varying values and direct RLE runs for identical values, using a fixed bit width determined by the maximum value in the sequence (up to 32 bits). For example, a run of 100 identical repetition levels of 0 would be stored as the value 0 followed by the length 100 encoded in a compact binary format, with the entire structure prefixed by metadata indicating the encoding type and bit width. This hybrid minimizes space for both clustered and scattered patterns in levels, which are crucial for representing optional or repeated fields without storing explicit null indicators for every row.[27]

Bit packing encodes small integers or booleans into tightly packed bit fields, eliminating byte-level padding to achieve near-optimal density for fixed-width data like flags or low-range integers. In Parquet, bit packing is integrated into the RLE hybrid for repetition and definition levels, where groups of values are packed into whole bytes from least significant bit (LSB) first, with any incomplete byte at the end unpadded. For instance, a sequence of 8 boolean values can be packed into a single byte, representing 0s and 1s directly. While a standalone bit-packed encoding exists, it is deprecated in favor of the more versatile RLE hybrid, which handles transitions between packed and run-length modes seamlessly. This technique is especially useful for metadata like levels, where values rarely exceed a few bits.[27]

Delta encoding stores differences between consecutive sorted values rather than the values themselves, providing efficiency for monotonically increasing sequences such as integers or timestamps in time-series data.
For 32-bit and 64-bit integers, Parquet uses a binary-packed delta scheme (DELTA_BINARY_PACKED) that divides the sequence into blocks of 128 values, each block further split into miniblocks of 32 values; the first value and block parameters are stored in a header, and each block records a minimum delta, with the remaining deltas bit-packed per miniblock relative to that minimum. Header fields and minimum deltas use variable-length encodings such as ULEB128 (with zigzag for signed values), so sorted or slowly changing data yields very small deltas. This method, inspired by techniques in columnar storage systems, can reduce storage by orders of magnitude for sorted numeric data, as small deltas fit into fewer bits.[27][28]

For variable-length data like strings (byte arrays), delta length byte array encoding first applies delta encoding to the lengths of the arrays and then concatenates the payloads without separators, optimizing for repeated prefixes or similar lengths. The lengths are delta-encoded as described for integers, storing the base length followed by compact differences, while the actual byte data follows in sequence, relying on the decoded lengths for parsing. This is particularly beneficial for columns with strings of varying but clustered lengths, such as log messages or IDs. Similarly, delta byte array encoding (also known as delta strings or incremental encoding) extends this by storing, for each value, the length of the prefix shared with the previous value together with the remaining suffix; the prefix lengths are delta-encoded and the suffixes are stored as delta length byte arrays, and the encoding can also be applied to fixed-length byte arrays. These techniques leverage incremental similarities in string data, common in sorted or grouped datasets, to achieve denser storage than plain length-prefixed encoding.[27]

Byte stream split encoding, originally defined for floating-point types (FLOAT and DOUBLE) and later extended to other fixed-width types, splits the data into byte streams based on the type size (e.g., 4 streams for FLOAT), scattering the bytes of each value across the streams, and concatenates them without additional metadata or padding. While it does not reduce storage size on its own, it enhances compression ratios and speeds when combined with subsequent compression algorithms by reorganizing the data for better compressibility.[27]
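In recent PyArrow versions, specific encodings can be requested per column when writing. The following is a hedged sketch; the column names and data are assumptions, and the column_encoding option takes effect only when dictionary encoding is disabled.

```python
# Sketch: requesting per-column encodings when writing with PyArrow.
# Column names and values are illustrative assumptions.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "event_id": pa.array(range(1000), type=pa.int64()),                        # sorted -> small deltas
    "temperature": pa.array([20.5 + i / 100 for i in range(1000)], type=pa.float32()),
})

pq.write_table(
    table,
    "encodings_demo.parquet",
    use_dictionary=False,                        # required to choose encodings explicitly
    column_encoding={
        "event_id": "DELTA_BINARY_PACKED",       # binary-packed delta for sorted integers
        "temperature": "BYTE_STREAM_SPLIT",      # byte stream split for floats
    },
)
```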
Compression Algorithms
Apache Parquet supports a pluggable set of lossless compression codecs applied to the encoded data within dictionary pages and data pages, allowing users to configure the codec per column to balance storage efficiency and processing speed.[29] This compression layer operates after encoding, reducing file sizes without loss of information, and is managed via metadata in the page header to indicate the codec used.[29]

The default codec is Snappy, which prioritizes speed with low CPU overhead during compression and decompression, typically achieving 2-3x compression ratios for common datasets like web data or logs.[29][30] Snappy, developed by Google, is designed for real-time use cases where rapid access is critical, such as in big data analytics pipelines.[30]

GZIP offers higher compression ratios, often up to 5x for text-heavy or repetitive data, making it suitable for archival storage where file size is prioritized over query speed; however, it incurs higher decompression latency due to its deflate-based algorithm.[29][31] This codec is widely supported and follows RFC 1952 standards, ensuring interoperability across systems.[29]

LZ4 provides a balance between compression ratio and speed, with decompression rates sometimes exceeding Snappy's in benchmarks, while maintaining ratios around 2-3x for varied data types.[29] Parquet uses the LZ4_RAW variant (block format without framing) to avoid compatibility issues, as the original LZ4 codec is deprecated.[29]

Zstandard (ZSTD), introduced in Parquet 1.10, is a modern codec with tunable compression levels (1-22), delivering superior ratios of 3-4x and faster performance than GZIP for most workloads, especially at levels 3-9.[29][32][33] It excels in scenarios requiring both efficiency and speed, such as cloud storage, and is based on RFC 8478.[29]

Brotli, added in Parquet format version 2.4.0, targets high compression for static or web-like data, achieving ratios comparable to or better than GZIP with moderate speed trade-offs, leveraging its dictionary-based approach for efficiency in archival or transmission use cases.[29][34][35]

Other codecs like LZO and uncompressed options are available but less commonly used; LZO focuses on fast decompression for legacy systems.[29] Overall, codec selection depends on data characteristics and workload, with tools like Spark allowing runtime configuration.[36]
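The trade-offs can be explored empirically by writing the same table with different codecs. The PyArrow sketch below uses made-up data, so the resulting sizes are only indicative; it also shows ZSTD's tunable level.

```python
# Sketch: comparing on-disk sizes across codecs with PyArrow.
# The synthetic data and file names are illustrative only.
import os
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"msg": ["status=ok user=%d" % (i % 50) for i in range(100_000)]})

for codec in ["none", "snappy", "gzip", "zstd", "lz4"]:
    path = f"demo_{codec}.parquet"
    pq.write_table(table, path, compression=codec)
    print(codec, os.path.getsize(path), "bytes")

# ZSTD additionally accepts a tunable compression level, e.g.:
pq.write_table(table, "demo_zstd9.parquet", compression="zstd", compression_level=9)
```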
Use Cases and Applications
Big Data Processing
Apache Parquet plays a central role in traditional big data pipelines, particularly for batch analytics and extract-transform-load (ETL) processes within the Hadoop ecosystem. It offers native support in Apache Hive starting from version 0.13, allowing users to create and query Parquet tables directly using HiveQL for distributed SQL operations on large datasets. Similarly, Apache Pig integrates with Parquet through storage functions, enabling data loading and processing in Pig Latin scripts for complex ETL tasks across Hadoop clusters. These integrations facilitate efficient querying of Parquet files in MapReduce jobs, where the columnar structure minimizes data shuffling and I/O overhead during batch computations.

In Apache Spark, Parquet enhances DataFrame operations by leveraging predicate pushdown, which filters data at the storage level using embedded statistics in Parquet files, thereby reducing the volume of data loaded into memory for distributed processing.[26] This optimization supports efficient execution of distributed joins and aggregations, as Spark can skip irrelevant row groups and columns during scans, accelerating analytical workloads in cluster-based environments. For instance, when performing group-by aggregations on multi-terabyte datasets, predicate pushdown can significantly reduce unnecessary data reads in selective queries.

ETL workflows commonly involve writing partitioned Parquet files from diverse sources such as application logs or relational databases, organizing data by keys like date or category to enable partition pruning in downstream batch jobs. This approach allows for scalable ingestion and transformation, where tools like Spark or Hive extract raw data, apply schema enforcement, and output optimized Parquet partitions for reliable storage in the Hadoop Distributed File System (HDFS). Partitioning in Parquet further boosts performance by limiting scans to specific directories during ETL reloads or incremental updates. However, users should ensure Parquet libraries are updated to the latest versions (e.g., 1.15.1 or later as of 2025) to mitigate vulnerabilities like CVE-2025-30065 in the parquet-avro module, which could allow remote code execution when processing untrusted files.[37][5]

Parquet delivers significant performance gains in big data environments due to its columnar access pattern, which avoids reading entire rows for column-wise operations. This efficiency stems from selective column retrieval and built-in compression, reducing I/O and CPU costs in batch pipelines. Compression techniques like dictionary encoding contribute to these gains by minimizing storage footprints, often achieving 75% or more reduction compared to uncompressed text formats.[5] A representative example is the batch processing of terabyte-scale event data at Uber, where Parquet files storing service logs in HDFS enable efficient querying for analytics.[38]
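A sketch of such a partitioned ETL write in PySpark, with the source path, derived partition keys, and output location chosen purely for illustration:

```python
# Sketch: writing partitioned Parquet output from raw logs in PySpark.
# Input path, column names, and output location are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-partitioned-write").getOrCreate()

logs = spark.read.json("/raw/app_logs")                   # raw source data (assumed JSON logs)
logs = logs.withColumn("event_date", F.to_date("timestamp"))

(logs.write
     .mode("overwrite")
     .partitionBy("event_date", "category")               # enables directory-level partition pruning
     .parquet("hdfs:///warehouse/app_logs_parquet"))
```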
Cloud and Data Lakes
Apache Parquet is optimized for storage in cloud object stores such as Amazon S3, Google Cloud Storage (GCS), and Azure Blob Storage, where its columnar format and built-in compression enable efficient handling of large-scale data without the need for traditional file systems.[39][40] This compatibility allows serverless querying services like Amazon Athena, Google BigQuery, and Azure Synapse Analytics to directly access and analyze Parquet files stored in these object stores, leveraging pushdown predicates and column pruning to minimize data transfer and processing overhead.[41][42][43]

In data lake architectures, Parquet serves as the foundational storage format for open table formats like Apache Iceberg and Delta Lake, which build upon Parquet files to provide atomicity and ACID transactions through metadata layers and transaction logs.[44] These integrations enable reliable updates, deletes, and schema evolution in cloud-based data lakes, ensuring data consistency across distributed object storage without requiring centralized metadata servers.[45][46]

Parquet's compression techniques, such as Snappy or Zstandard, deliver significant cost efficiencies in pay-per-use cloud models by reducing storage requirements (often achieving up to 75% savings compared to uncompressed formats) and accelerating query times through smaller data scans.[47][29] In serverless environments, this translates to lower compute charges, as services like Athena bill based on scanned data volume, making Parquet ideal for cost-effective scaling in petabyte-scale lakes.[41]

The schema-on-read capability of Parquet facilitates ad-hoc analysis on raw data dumps in cloud data lakes, as querying engines automatically infer the schema from the file metadata without requiring upfront ETL transformations.[48] This approach supports flexible exploration of diverse datasets, such as log files or sensor data, directly from object storage. For instance, analytics pipelines on petabyte-scale data lakes often partition Parquet files by date and region to enable efficient filtering and parallelism; queries can then target specific partitions (e.g., year=2024/month=11/region=us-east), reducing scan volumes and enabling faster insights in tools like Athena or BigQuery.[49][50]
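A hedged sketch of querying such a partitioned layout directly from object storage with pyarrow.dataset; the bucket, prefix, partition keys, and column names are placeholders, and S3 credentials are assumed to come from the environment.

```python
# Sketch: reading a Hive-partitioned Parquet layout from S3 with PyArrow datasets.
# Bucket, prefix, and column names are hypothetical; credentials come from the environment.
import pyarrow.dataset as ds

dataset = ds.dataset(
    "s3://example-bucket/events/",     # e.g. .../year=2024/month=11/region=us-east/...
    format="parquet",
    partitioning="hive",               # infer year/month/region from directory names
)

table = dataset.to_table(
    columns=["event_id", "latency_ms"],
    filter=(ds.field("year") == 2024) & (ds.field("region") == "us-east"),
)
print(table.num_rows)
```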
Comparisons with Other Formats
Columnar Formats
Both Apache Parquet and Optimized Row Columnar (ORC) are columnar storage formats designed to optimize analytical queries by storing data column-wise, enabling efficient compression and selective column reads in big data environments.[51] Parquet provides broader language and ecosystem support compared to ORC, with native integrations across tools like Apache Spark, Presto, and Arrow, making it suitable for diverse, multi-engine environments, while ORC is more tightly coupled to Hadoop and Hive origins.[52][36] Parquet also excels in handling nested data structures through its Dremel-inspired model using repetition and definition levels, which supports complex schemas more flexibly than ORC's length and presence model.[51] In contrast, ORC includes Hive-specific optimizations such as bloom filters at finer granularity (every 10,000 rows) and index streams like zone maps for improved predicate pushdown and data skipping in selection-heavy queries.[51]

Regarding compression trade-offs, Parquet supports a wider range of flexible codecs including Gzip, Snappy, Zstd, LZ4, and LZO, allowing users to balance compression ratios and speed based on workload needs, whereas ORC defaults to ZLIB with options like Snappy and LZ4 but incurs lighter metadata overhead in some cases.[36] Both formats achieve similar 3-5x compression ratios on analytical datasets, with Parquet often producing slightly smaller file sizes in machine learning and log workloads due to default dictionary encoding on all columns.[51] Performance-wise, Parquet enables faster cross-engine reads and sequential scans through its simpler integer encoding and streaming API, while ORC performs better in update-heavy Hive workloads leveraging its fine-grained indexes for projection and filtering.[51][36]

In terms of ecosystem positioning, Parquet's general-purpose design and extensive community adoption make it ideal for heterogeneous big data pipelines involving multiple processing engines, whereas ORC's strengths in Hive-optimized analytics suit environments focused on Hadoop-based querying.[52][36] Users should choose Parquet for multi-tool, read-intensive analytics with complex data types, and ORC for pure Hive deployments requiring advanced indexing for frequent updates and selective reads.[51]

Row-Based Formats
Row-based formats, such as Apache Avro and CSV, store data sequentially by records, contrasting with Parquet's columnar organization that optimizes for selective column access in analytical workloads.[53] This row-oriented approach suits scenarios requiring full record retrieval or frequent updates, but it often leads to inefficiencies in storage and query performance when compared to Parquet's design for large-scale data processing.[54]

In comparison to Avro, Parquet offers superior compression through column-specific techniques like dictionary and run-length encoding, resulting in smaller file sizes and up to 10x faster scan speeds for analytical queries due to predicate pushdown and metadata skipping.[55] Avro, however, excels in schema evolution for streaming applications, providing full support for type changes, additions, and removals with backward compatibility via its embedded schema, making it more suitable for dynamic, real-time data pipelines.[56] Additionally, Avro's compact binary encoding produces smaller files optimized for transmission in messaging systems like Kafka, whereas Parquet prioritizes analytics over such portability.[53] Performance-wise, Parquet aligns with write-once, read-many patterns in data lakes, delivering efficient reads for OLAP tasks, while Avro's row-based structure enables faster writes and appends in write-heavy OLTP or streaming environments.[54]

Parquet natively handles complex, nested data structures and enforces schemas at the file level, capabilities absent in CSV, which relies on simple delimited text without built-in schema support.[5] CSV files lack compression, often resulting in 10-20x larger sizes compared to Parquet (for instance, a 1 TB CSV dataset can compress to 130 GB in Parquet, an 87% reduction), while also missing predicate pushdown, forcing full scans that increase I/O overhead.[5] This parsing and deserialization burden makes CSV unsuitable for big data processing, where Parquet reduces scanned data by up to 99% and accelerates queries by 34x in large-scale scenarios.[5]

Use cases diverge accordingly: Parquet thrives in OLAP and data warehousing for read-intensive analytics, Avro in OLTP, streaming, and record-oriented transmission, and CSV for lightweight, small-scale exports or human-readable interchange.[55] Trade-offs include Parquet's emphasis on immutability for reliable, append-only storage in analytical pipelines versus Avro's compactness and flexibility for evolving, record-centric workflows.[54]

Implementations and Ecosystem
Libraries and APIs
Apache Parquet provides official and community-maintained libraries across multiple programming languages, enabling developers to read, write, and manipulate Parquet files with language-specific APIs that adhere to the format's columnar structure and metadata standards.[1] These implementations emphasize efficiency in handling large datasets, supporting features like schema evolution and compression, while integrating with broader ecosystems such as Apache Arrow for cross-language data interchange.[57] The primary Java implementation, known as parquet-java (formerly parquet-mr), serves as the reference for Parquet and includes modules for core format handling and Hadoop integration.[58] It offers APIs viaParquetInputFormat and ParquetOutputFormat for reading and writing in MapReduce jobs, allowing configuration of input splits and output compression through Hadoop's JobConf.[59] Schema definition in Java uses MessageType builders or integrations with serialization formats like Avro and Thrift, enabling nested data structures.[60] Writer options include selectable compression codecs (e.g., Snappy, GZIP) and row group sizes, typically set to 128 MB for optimal performance, while readers access per-column statistics for filtering and validation.[58]
For C++, the parquet-cpp library, now merged into the Apache Arrow project, delivers high-performance I/O operations tightly coupled with Arrow's in-memory columnar format.[61] Key classes include parquet::arrow::FileReader for loading files into Arrow tables via ReadTable(), and parquet::arrow::FileWriter for batch-wise writing with WriteTable().[57] This implementation powers backends in tools like DuckDB and supports schema introspection through parquet::schema::GroupNode. Writer properties allow customization of compression (e.g., ZSTD, LZ4) and row group length (default 64,000 rows), while reader properties enable statistics collection for min/max values and null counts per column.[57]
Python support is provided through PyArrow, the Python bindings for Apache Arrow, which expose intuitive functions for Parquet operations integrated with libraries like Pandas.[62] Users can write DataFrames to Parquet using pandas.DataFrame.to_parquet() or Arrow's pq.write_table(table, 'file.parquet'), specifying options such as the compression codec ('snappy') and the row group size (expressed as a maximum number of rows per row group). Reading occurs via pandas.read_parquet() or pq.read_table(), returning Arrow tables or Pandas objects with access to file metadata and column statistics for query optimization.[62] Schema handling leverages Arrow's type system, supporting complex types like lists and structs.
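A minimal sketch of this round trip through Pandas and PyArrow (the file name, data, and option values are illustrative):

```python
# Sketch: writing and reading Parquet via Pandas and PyArrow.
# File name, columns, and values are assumptions for the example.
import pandas as pd
import pyarrow.parquet as pq

df = pd.DataFrame({"user_id": [1, 2, 3], "score": [0.7, 0.9, 0.4]})

# Write through pandas with the PyArrow engine, choosing the codec explicitly.
df.to_parquet("users.parquet", engine="pyarrow", compression="snappy")

# Read back as a DataFrame (with column pruning) or as an Arrow table with footer metadata.
df2 = pd.read_parquet("users.parquet", columns=["score"])
table = pq.read_table("users.parquet")
print(df2.head(), table.schema, pq.ParquetFile("users.parquet").metadata)
```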
Implementations in other languages include official bindings via Apache Arrow projects, though with varying levels of maturity. The Go library, part of Apache Arrow Go (parquet-go), provides APIs for reading and writing via parquet.NewFileReader() and parquet.NewWriter(), supporting schema parsing and basic compression options.[63] In Rust, the parquet crate offers a native implementation with modules for serialization (parquet::file::writer::SerializedFileWriter) and deserialization, including schema resolution and statistics access, optimized for zero-copy operations in data processing pipelines.[64]
Across libraries, common APIs focus on schema definition using logical types (e.g., INT64, STRING, LIST), writer configurations for compression algorithms and row group sizes to balance I/O and memory usage, and reader access to metadata statistics for efficient skipping of irrelevant data chunks. These standardize interactions while allowing language-specific ergonomics, ensuring compatibility with the Parquet format specification.[4]