Apache Drill
Apache Drill is an open-source distributed massively parallel processing (MPP) query engine designed for interactive analysis and exploration of large-scale, semi-structured big data without requiring predefined schemas.[1] It supports standard ANSI SQL queries across diverse data sources, including Hadoop file systems, NoSQL databases, and cloud storage, enabling low-latency ad hoc querying on formats such as JSON, Parquet, Avro, text files, HBase, Hive tables, Kafka streams, ESRI Shapefiles, PCAP files, and OpenTSDB.[1] Inspired by Google's Dremel system, Drill employs a columnar, in-memory execution model to handle complex, nested data structures efficiently, making it suitable for data-intensive applications in Hadoop ecosystems and similar environments.[2][3] Drill entered the Apache Incubator in 2012, became a top-level Apache project in December 2014, and reached version 1.0 in 2015; it emphasizes plug-and-play integration with existing infrastructure, such as Apache Hive and HBase, and provides JDBC and ODBC drivers for compatibility with business intelligence tools and SQL clients.[1] Its architecture features a distributed execution engine that scales horizontally across clusters, supporting dynamic schema discovery for self-describing data and alternative query languages beyond SQL.[1] Notable for its focus on high-performance analytics on evolving datasets, Drill avoids the need for data warehousing or ETL processes, allowing users to query raw data directly from storage systems.[1]

Introduction
History and Development
Apache Drill's origins trace back to Google's 2010 Dremel paper, which described a scalable system for ad hoc querying of large, nested datasets using a distributed SQL query engine.[4] This inspired the development of an open-source equivalent aimed at interactive analysis of petabyte-scale data without predefined schemas.[5] Initial development began at MapR Technologies in 2012, and the project was announced as an open-source initiative in August of that year to replicate Dremel's capabilities for Hadoop and NoSQL environments.[6] MapR released the first public version, MapR Drill 0.1, in October 2012, shortly after donating the codebase to the Apache Software Foundation as an incubating project in August 2012.[7] The project graduated to a top-level Apache project in December 2014, reflecting its maturity and growing adoption.[8] Key early contributors included the MapR team, led by founder Tomer Shiran, who served as the project's initial architect and as vice president of product management at MapR.

In 2019, MapR Technologies was acquired by Hewlett Packard Enterprise (HPE), shifting Drill's primary stewardship to the broader Apache community while HPE integrated related technologies into its data fabric offerings.[9]

Major milestones include the 1.0 release in May 2015, which introduced schema-free SQL querying for Hadoop, NoSQL, and cloud storage.[10] Subsequent versions added dynamic user-defined functions in 1.9 (November 2016), cryptographic functions and PCAP file support in 1.11 (July 2017), and, most recently, version 1.22.0 in June 2025, featuring bug fixes, DNS lookup functions, SFTP support, and removal of Java 8 compatibility.[11][12][13] Under the Apache License 2.0, Drill has seen sustained community growth, with contributions from organizations such as Dremio, founded by Shiran, accelerating after 2020 through enhancements to the query engine and deeper integration with big data ecosystems.[14][15]

Overview and Purpose
Apache Drill is an open-source, schema-free SQL query engine designed for interactive analysis of large-scale, multi-structured data without requiring ETL preprocessing.[1] It supports standard ANSI SQL for querying diverse data sources, including Hadoop, NoSQL databases, and cloud storage, enabling ad hoc exploration of petabyte-scale datasets across thousands of servers.[16] This distributed architecture allows users to perform low-latency queries directly on raw or semi-structured data, such as JSON, Parquet, and text files, without predefined schemas or metadata catalogs.[1][17]

Unlike traditional relational databases focused on transactional processing, Apache Drill emphasizes data exploration and analytics through its schema-optional model, which automatically infers data structures at query time.[17] Inspired by Google's Dremel, it prioritizes flexibility for rapidly evolving datasets over rigid schema enforcement, reducing setup time and enabling self-service analytics.[16] Common use cases include discovering insights in raw logs and nested files, as well as integrating with business intelligence tools like Tableau for ad hoc reporting across heterogeneous sources.[17] Implemented in Java for cross-platform compatibility, Apache Drill operates under the Apache License 2.0, fostering community contributions and broad adoption in big data environments.[2][18]

Architecture
Core Components
Apache Drill's distributed architecture relies on several fundamental components that enable its schema-free querying capabilities across diverse data sources.

At the heart of the system is the Drillbit, a core daemon process that runs on each node in the cluster. The Drillbit serves as the primary execution unit, responsible for handling client requests, coordinating query processing, and delivering results. It encompasses an executor for running query fragments and a controller for planning, with no master-slave hierarchy: any Drillbit can accept and drive queries, and fragments are assigned to the nodes closest to the data to maximize locality.[16][19]

In distributed mode, Apache Drill integrates with ZooKeeper for essential cluster coordination. ZooKeeper manages cluster membership, performs leader election among Drillbits, and handles metadata management, allowing clients to discover available nodes through a quorum without direct knowledge of cluster topology changes, such as node additions or failures. This integration ensures fault-tolerant operation by providing health-check information and node availability data to the driving Drillbit for efficient query distribution.[16][19]

The Storage Engine Plugin Framework provides a modular abstraction layer for interacting with various data sources. This framework defines interfaces for metadata retrieval, read and write operations, and optimization rules, enabling dynamic loading of plugins that connect Drill to underlying storage systems like Hadoop file systems or HBase. It supports extensibility, allowing new plugins to be added without modifying the core engine, and facilitates schema-on-read interactions tailored to each data source.[16][19]

For efficient in-memory data handling, Apache Drill employs Value Vectors as its primary data structure. These are columnar constructs consisting of one or more contiguous buffers that store sequences of values along with metadata, passed between operators during query execution. Value Vectors enable vectorized processing, leveraging modern CPU architectures for high-performance columnar operations, and inherently support dynamic schemas by allowing schema changes mid-query through the creation of new vectors in record batches. They accommodate various data types, including fixed-width, nullable, repeated, and variable-width values, using techniques like offset tables for indirection and selection vectors for subsets, ensuring constant-time random access and immutability after construction.[16][20]

Complementing these is the Metadata Provider, a decentralized caching mechanism that derives schema information directly from storage plugins. This provider eliminates the need for a centralized metadata store by inferring and caching schemas on demand, optimizing repeated queries through schema discovery and integration with sources like multiple Hive metastores. Accessible via the INFORMATION_SCHEMA, it supports dynamic schema handling and enhances query planning efficiency without relying on external execution engines.[16][19]
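Because schema information is derived from the storage plugins rather than a central catalog, the inferred metadata can be inspected with ordinary SQL. A minimal sketch follows; the workspace name dfs.tmp is an illustrative placeholder rather than a fixed default:

    -- List the schemas (storage plugin workspaces) currently visible to Drill
    SELECT SCHEMA_NAME FROM INFORMATION_SCHEMA.SCHEMATA;

    -- List the tables a particular workspace exposes
    SELECT TABLE_NAME, TABLE_TYPE
    FROM INFORMATION_SCHEMA.`TABLES`
    WHERE TABLE_SCHEMA = 'dfs.tmp';

Query Processing Pipeline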
Apache Drill's query processing pipeline orchestrates the transformation of user-submitted SQL queries into executable operations across a distributed cluster, emphasizing schema discovery and efficient data movement without requiring predefined metadata. The pipeline begins with query ingestion and progresses through optimization and parallel execution, leveraging storage plugins for data access as needed.[21][22]

The initial stage involves parsing and validation, where incoming ANSI SQL queries (including extensions for complex types such as arrays and maps) are processed by the Apache Calcite framework to generate a preliminary logical plan. This plan represents an abstract sequence of operators, such as scans, projections, and joins, while validating syntax and semantics against Drill's supported constructs. The parser ensures compatibility with the SQL:2003 standard, converting the query into a language-agnostic relational algebra tree that captures the intended data flow.[19][21]

Following parsing, logical planning occurs via Drill's cost-based optimizer, which refines the initial logical plan into an optimized version tailored to the target data sources. During this phase, schema discovery happens on the fly: for schema-free data like JSON or Parquet files, the optimizer probes samples from storage plugins to infer structures dynamically, enabling queries without upfront schema definitions. Rule-based transformations, such as predicate pushdown and join reordering, are applied to minimize data movement and computation, producing a high-level logical plan that balances flexibility with preliminary efficiency. This step incorporates statistics from data sources where available, ensuring the plan accounts for data distribution and cardinality.[22][23]

The optimized logical plan is then converted into a physical plan, which specifies concrete execution strategies, including operator implementations and parallelism decisions. A parallelizer divides the physical plan into major fragments (distinct phases separated by data exchange boundaries, such as sorts or shuffles) and further slices them into minor fragments for concurrent execution. Major fragments delineate pipeline stages, while minor fragments enable intra-node and inter-node parallelism, with data shuffling handled through network exchanges between sender and receiver operators. This fragmentation allows queries to scale horizontally across multiple nodes, distributing workload without centralized coordination beyond planning.[21][23]

At the core of execution lies the execution engine, which processes minor fragments in a pipelined, vectorized manner to achieve low-latency columnar operations. Data is handled using Value Vectors, specialized structures that store columnar values in contiguous memory buffers, supporting fixed-width, variable-width, and nullable types with metadata for efficient access. Vectorization enables operators to process batches of records simultaneously (typically 4,096 rows per vector), leveraging SIMD instructions and avoiding the overhead of row-by-row iteration. In memory-constrained scenarios, certain operators such as hash aggregates, hash joins, and external sorts can spill to disk, temporarily writing excess data to configurable temporary directories before resuming in-memory processing, thus preventing out-of-memory failures while maintaining query progress. Fragments execute as leaf (data scans), intermediate (transformations), or root (final assembly) types, with data flowing from the leaves up through the pipeline.[20][24][21]

Finally, result delivery consolidates outputs in the root fragment, performing any remaining aggregations, projections, and formatting before streaming results back to the client. This stage ensures that only requested columns and rows are materialized, minimizing unnecessary computation and network transfer. The entire pipeline operates optimistically, scheduling all fragments at once for pipelined execution, and supports iterative refinement through tools like EXPLAIN for plan inspection.[21][23]
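The plans produced at each stage can be inspected directly from any client. A brief sketch, with a hypothetical file path:

    -- Show the physical plan, including major/minor fragment structure;
    -- the JSON path below is a placeholder
    EXPLAIN PLAN FOR
    SELECT t.user_id, COUNT(*) AS events
    FROM dfs.`/data/logs/events.json` t
    GROUP BY t.user_id;

    -- Show only the logical plan, before parallelization decisions are made
    EXPLAIN PLAN WITHOUT IMPLEMENTATION FOR
    SELECT t.user_id FROM dfs.`/data/logs/events.json` t;

Data Support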
Back-end Storage Plugins
Apache Drill employs pluggable storage plugins to connect to a wide array of back-end data sources, enabling schema-free SQL queries against diverse systems without requiring data movement or transformation. These plugins are essential to Drill's distributed execution engine, allowing it to read data directly from storage while optimizing query plans based on the underlying system's capabilities.[25]

For file systems, Drill provides a versatile file system storage plugin that supports distributed and local storage via the Hadoop FileSystem API, including the Hadoop Distributed File System (HDFS) for scalable big data processing and local file systems for direct access to on-premises data. This plugin extends to cloud-based object stores, such as Amazon S3 through a dedicated S3 storage plugin that handles AWS credentials and bucket configurations for seamless integration with cloud workloads. Similarly, the Azure Blob storage plugin leverages Hadoop-compatible layers to query data in Azure Blob Storage, supporting secure access via account keys or shared access signatures. Google Cloud Storage is also accessible through the file system plugin when configured with the appropriate Hadoop GCS connector, enabling queries across multi-cloud environments. Drill 1.22.0 added support for SFTP file systems, allowing secure access to remote servers over the SSH File Transfer Protocol for scenarios requiring encrypted connections.[26][27][28][13]

Drill's NoSQL database connectors include dedicated plugins for popular columnar and document stores. The HBase storage plugin enables columnar access to HBase tables, integrating with HBase's client libraries so that scans and filters can be pushed down to the storage layer. For document-oriented databases, the MongoDB storage plugin uses the MongoDB Java driver to read collections, supporting operations like aggregation pipelines translated from SQL. The Cassandra storage plugin, introduced in Drill 1.19, allows SQL queries against Cassandra tables, using Apache Calcite for optimization and pushdown of predicates to Cassandra's query engine. Additionally, the MapR-DB plugin provides native support for MapR's distributed NoSQL database, treating it as a binary store with schema-on-read capabilities similar to HBase. The Hive storage plugin connects to the Hive metastore, enabling queries on Hive tables across various underlying storage formats like ORC and Parquet, with support for ACID tables since Drill 1.13.[29][30][31][32][33]

For relational databases, the RDBMS storage plugin uses JDBC drivers to query traditional SQL databases such as PostgreSQL, MySQL, Oracle, and SQL Server, facilitating joins between relational data and big data sources.[34] Drill also supports specialized plugins for streaming and time-series data: the Kafka storage plugin, introduced in Drill 1.12, allows querying Kafka topics as tables for real-time stream processing, while the OpenTSDB storage plugin enables SQL access to OpenTSDB time-series data via its REST API.[35][36]

Storage plugins are configured using JSON-formatted definitions, which specify connection parameters such as URIs, credentials, and authentication modes, managed through the Drill Web UI, REST API, or configuration files like storage-plugins-override.conf. This JSON structure includes attributes for enabling the plugin, defining workspaces, and setting formats, with support for credential providers to handle sensitive information securely. Authentication modes, enhanced in Drill 1.21, include plain credentials, shared user translation, and credential provider integration on a per-plugin basis.[37][38][39]
A key strength of these plugins is federation, which permits joining data across multiple back-ends in a single query—for instance, combining files from S3 with documents from MongoDB—leveraging Drill's optimizer to push computations to the appropriate storage systems for efficiency. This capability plays a role in schema inference by allowing Drill to discover structures dynamically from heterogeneous sources during query execution.[1][25]
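A minimal sketch of such a federated query, assuming an S3 plugin named s3 and a MongoDB plugin named mongo (the plugin names, paths, and field names are illustrative, not defaults):

    -- Join headerless CSV order records in S3 with customer documents in MongoDB;
    -- Drill exposes each CSV line as the `columns` array
    SELECT c.name, SUM(CAST(o.columns[2] AS DOUBLE)) AS total_spend
    FROM s3.`orders/2024/q1.csv` o
    JOIN mongo.crm.customers c
      ON o.columns[0] = c.customer_id
    GROUP BY c.name;

The optimizer pushes eligible scans and filters down to each back-end, so only the data needed for the join moves between systems.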
Supported Data Formats
Apache Drill natively supports structured file formats optimized for efficient querying. Parquet files benefit from columnar storage, enabling projection pushdown that reads only the required columns to minimize I/O and improve performance on large datasets.[40] Avro files are supported with schema evolution capabilities, allowing queries on files with evolving structures without metadata reconfiguration.[41] Delimited formats like CSV and TSV are handled through text parsing, where delimiters are specified or inferred from file extensions and configurations to extract columnar data dynamically.[42]

For semi-structured data, Drill provides robust handling of JSON, treating nested objects as MAP types and repeated elements as ARRAY types, which can be queried and manipulated using built-in SQL functions for accessing hierarchical fields.[43] XML files are supported via a dedicated format plugin introduced in version 1.19, enabling schema-free reading by converting XML to a queryable structure, though capabilities are limited to basic element extraction without advanced XPath processing.[44]

Unstructured and raw data formats are accessible for specialized analysis. PCAP files, used for network packet captures, have been supported since version 1.11 with built-in decoding functions that extract fields like timestamps, IP addresses, ports, and payloads for protocols including TCP, UDP, and ICMP. ESRI Shapefiles, supported since version 1.17, allow querying geospatial vector data such as points, lines, and polygons using a dedicated format plugin. Log files can be parsed using the logfile plugin, which applies regular expression patterns to split lines into columns, facilitating ad hoc querying of unstructured log entries without preprocessing.[45][46][47]

Drill's schema discovery mechanism allows on-the-fly inference when querying directories containing mixed formats, automatically detecting and adapting to varying schemas across files to avoid errors from inconsistencies.[5] This dynamic approach supports evolving data without requiring upfront metadata definitions. Complex data types are integral to Drill's querying model, with full support for arrays, maps, and unions derived from formats like JSON, Parquet, and Avro. Flattening operators, such as FLATTEN, enable unnesting of repeated or nested structures into relational tables for standard SQL operations.[48] These capabilities are accessed through configured back-end storage plugins that integrate the formats into Drill's query engine.[29]
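Two short sketches of these behaviors, with placeholder paths throughout:

    -- Headerless delimited files expose each line as the `columns` array
    SELECT columns[0] AS ip, columns[6] AS status
    FROM dfs.`/data/web/access_log.csv`
    WHERE columns[6] = '404';

    -- Querying a directory applies schema discovery across every file inside it
    SELECT COUNT(*) FROM dfs.`/data/web/logs`;

Features and Capabilities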
Schema-Free Querying
Apache Drill enables schema-free querying by performing dynamic schema discovery at query execution time, allowing users to access data without predefined schemas or type specifications. This feature leverages self-describing data formats such as JSON and Parquet to automatically infer the structure, including nested and repeated elements, as the query processes the data. For instance, Drill can query evolving datasets where fields may vary across records or files, adapting operators on the fly without requiring centralized metadata storage.[5]

This dynamic resolution supports direct exploration of raw, multi-structured data, eliminating the need for upfront schema definition or maintenance. In practice, users can issue SQL queries against files or directories without loading data into a traditional database, as Drill derives the schema from the source during execution. This approach is particularly valuable for ad hoc analysis of semi-structured data, where traditional relational systems would demand rigid schemas.[49]

To handle nested and complex data types inherent in schema-free environments, Drill extends standard SQL with functions tailored for hierarchical structures. The FLATTEN function, for example, expands repeated fields like arrays into individual rows while preserving associations with other columns, enabling queries like SELECT FLATTEN(employees.skills) AS skill FROM hr.json; to transform nested employee skills into a flat result set. Similarly, UNNEST combined with a LATERAL subquery provides SQL-standard support for unnesting collections, such as arrays of maps, as in SELECT * FROM emp, LATERAL (SELECT t.sk.name AS skill FROM UNNEST(emp.skills) t(sk)) s WHERE s.skill = 'SQL';, which processes nested data in chunks for efficiency and handles schema variations across elements. These extensions facilitate intuitive access to complex types, such as SELECT t.trans_info.prod_id[0] FROM transactions.json AS t;, without prior flattening.[50][51][52]
Drill's schema-free model obviates the need for Extract, Transform, Load (ETL) processes, permitting queries on raw data in its native location across sources like HDFS or cloud storage. This reduces preparation time for data exploration, as users can combine disparate formats—such as JSON files and Hive tables—in a single query without schema alignment or preprocessing. For example, a query might join nested JSON documents with relational Hive data directly, bypassing traditional warehousing steps.[49][53]
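A sketch of such a cross-source join, with every table and path name hypothetical:

    -- Combine raw JSON events with a curated Hive dimension table in one query,
    -- with no ETL step; dfs.`/raw/events.json` and hive.sales.customers are placeholders
    SELECT c.region, COUNT(*) AS event_count
    FROM dfs.`/raw/events.json` e
    JOIN hive.sales.customers c
      ON e.customer_id = c.id
    GROUP BY c.region;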
In cases of missing or inconsistent fields, Drill employs graceful degradation, treating absent elements as null values during schema discovery to ensure query continuity rather than failure. Users can opt for strict mode via session settings if validation is required, but the default behavior prioritizes flexibility for evolving datasets. This handling supports robust exploration of dynamic structures, where fields may appear or change across records without halting execution.[5][54]
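A small sketch of this behavior, assuming a JSON file in which only some records carry an optional referrer field:

    -- Records lacking `referrer` yield NULL instead of failing the query;
    -- COALESCE substitutes a default for those rows
    SELECT t.user_id,
           COALESCE(t.referrer, 'unknown') AS referrer
    FROM dfs.`/raw/clicks.json` t;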
Since version 1.9, Drill has supported user-defined functions (UDFs) written in Java, including dynamic registration via commands like CREATE FUNCTION USING JAR 'myudf.jar';, which registers the functions packaged in a JAR without cluster restarts. These UDFs extend schema-free capabilities by enabling tailored logic for nested or irregular data, such as parsing custom JSON variants, and are staged through distributed file systems for multi-tenant environments.[55][56]
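A sketch of the dynamic registration workflow; exec.udf.enable_dynamic_support is the documented option name, while the JAR name is a placeholder:

    -- Dynamic UDF support must be enabled before registration
    ALTER SYSTEM SET `exec.udf.enable_dynamic_support` = true;

    -- Register every UDF packaged in a JAR already copied to the DFS staging area
    CREATE FUNCTION USING JAR 'myudf-1.0.jar';

    -- Unregister the same functions when they are no longer needed
    DROP FUNCTION USING JAR 'myudf-1.0.jar';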
Performance and Scalability
Apache Drill achieves high performance and scalability through its distributed execution model, which supports horizontal scaling across thousands of nodes by breaking queries into parallel fragments executed on multiple drillbits.[16] This fragment parallelism allows queries to process large datasets efficiently, with data locality ensuring that computations occur close to the data sources to minimize network shuffle and latency.[16] By co-locating drillbits with storage nodes, Drill reduces data movement, enabling it to handle petabyte-scale workloads interactively.[16]

The query optimizer employs rule-based transformations to restructure plans for efficiency, incorporating cost estimates based on resource usage such as I/O, CPU, and memory to guide decisions.[23] Key optimizations include join reordering to minimize intermediate data sizes and predicate pushdown to filter data early in the execution pipeline, reducing the volume of data processed downstream.[23] Additionally, datastore-aware rules tailor optimizations to specific back-ends, such as leveraging native indexing in Hive or HBase, pushing computations to the storage layer where possible.[23]

Drill's vectorized execution engine processes data in columnar format using Value Vectors, which represent batches of records as contiguous memory arrays to exploit CPU vectorization instructions.[16] This approach improves CPU efficiency by operating on multiple values simultaneously, avoiding row-by-row processing and reducing serialization overhead between operators.[57] As a result, queries benefit from higher throughput, particularly for analytical workloads involving aggregations and scans over large datasets.[57]

Memory management in Drill includes adaptive spilling mechanisms for memory-intensive operators like sorts and hash joins, which automatically offload excess data to disk when direct memory limits are approached, preventing out-of-memory errors.[24] Query profiles provide detailed metrics on memory usage, operator costs, and spill events, enabling administrators to tune configurations such as direct memory allocation for optimal performance.[58] Version 1.22.0, released in June 2025, improved reliability for large-scale queries with fixes for memory statistics accuracy and enhanced pagination in storage plugins, and further extended schema-free querying with a new REGEXP_EXTRACT function and improved handling of empty JSON keys.[13]

Benchmarks demonstrate Drill's capability for petabyte-scale queries at interactive speeds, reflecting the inspiration it draws from Google's Dremel for interactive analysis on massive clusters.[16]
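Many of these performance controls are exposed as system and session options settable through SQL. A hedged sketch using documented option names; the values shown are illustrative, not recommendations:

    -- Raise the per-node memory the planner allots to a single query (bytes)
    ALTER SESSION SET `planner.memory.max_query_memory_per_node` = 4294967296;

    -- Cap the number of minor fragments run per node for this session
    ALTER SESSION SET `planner.width.max_per_node` = 8;

    -- Inspect the current planner settings
    SELECT * FROM sys.options WHERE name LIKE 'planner.%';

Interfaces and Integrations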
Client Interfaces
Apache Drill provides several native client interfaces for users to interact with the system, enabling query submission, execution monitoring, and cluster management. These interfaces support both interactive and programmatic access, allowing integration with standard tools while maintaining Drill's schema-free querying capabilities. The primary interfaces include the command-line shell, standard database drivers, a RESTful API, and a web-based console, with configuration options for security and customization.

The Drill Shell is a command-line interface for interactive SQL execution, built on SQLLine, a pure-Java console utility; through it, users issue queries in Drill's ANSI SQL dialect, which extends the standard with support for nested data. Users can start the shell in embedded mode using the command ./sqlline -u jdbc:drill:drillbit=local for local development, or in distributed mode via ZooKeeper connection strings like ./sqlline -u jdbc:drill:zk=<zk-hosts>:2181 to connect to a cluster. It features auto-completion for SQL commands and keywords, configurable via the fastConnect option, and command history management with commands such as !history to view past queries, with a customizable maximum history size. Additional customizations include prompt formatting (e.g., !set prompt good-drill) and output formats like table or CSV via !set outputFormat table. The shell is particularly useful for ad hoc querying and testing storage plugins without requiring external tools.[59]
For broader integration with business intelligence and analytics tools, Apache Drill offers JDBC and ODBC drivers as standard connectors. The JDBC driver, distributed as drill-jdbc-all-<version>.jar within the Drill installation's jars/jdbc-driver directory or downloadable from the Apache mirrors, supports connections via URLs such as jdbc:drill:zk=<zk-quorum> and the class org.apache.drill.jdbc.Driver. It enables parameterized queries through methods like setMaxRows to limit result sets (introduced in Drill 1.16) and setQueryTimeout for execution time limits (since Drill 1.13), facilitating use with tools like Tableau, Excel, and SQuirreL SQL client by adding the JAR to the client's classpath and configuring connection parameters. The ODBC driver, installable on Windows, Linux, and macOS, creates Data Source Names (DSNs) for seamless connectivity to BI applications, with configuration options in odbc.ini or connection strings to control behaviors like query timeouts and schema handling; installation involves downloading the driver package and verifying it in the system's ODBC administrator. These drivers allow BI tools to leverage Drill's querying without custom code, supporting direct data exploration from spreadsheets or dashboards.[60][61][62]
The REST API provides HTTP-based programmatic access to Drill, enabling query submission, status checks, and result retrieval through endpoints exposed at the default port 8047 (e.g., http://<drill-host>:8047). Key endpoints include POST /query.json for submitting SQL queries with options like queryType=SQL, autoLimit for result capping (added in Drill 1.16), and defaultSchema (since Drill 1.18); GET /status for cluster health; and GET /profiles.json for monitoring running or completed queries. Result streaming over HTTP responses in JSON format was enhanced in Drill 1.19 to reduce memory overhead during large data transfers, with verbose error reporting configurable via http.rest.errors.verbose=true. Authentication supports Basic auth using base64-encoded credentials in the Authorization header (Drill 1.18+) or form-based login with cookies, making it suitable for scripting and application integration.[63]
The built-in Web UI offers a browser-based console accessible at http://<drill-host>:8047 (or HTTPS if enabled), serving as an intuitive interface for query execution, profile analysis, and metrics viewing without additional setup. Users submit queries on the Query page using keyboard shortcuts like Ctrl+Enter, view execution profiles on the Profiles page (with admin access to all queries and user-limited views), and monitor cluster metrics such as CPU and memory usage under the Metrics tab. Authentication is optional but can be enforced via FORM-based login with username/password (integrated with PAM) or SPNEGO for Kerberos, providing role-based views where administrators access full options including Storage and Threads tabs. The UI is particularly valuable for non-technical users to explore data visually and troubleshoot queries interactively.[64]
Interface configurations, including authentication and security, are managed primarily through Drill's configuration files, chiefly drill-override.conf. For user authentication, Plain security mode integrates with Pluggable Authentication Modules (PAM) for Linux-based validation against /etc/passwd, LDAP, or other backends, requiring users to provide credentials via clients like the shell or JDBC; this approach requires consistent usernames and UIDs across cluster nodes. LDAP support is achieved through PAM configuration without dedicated Drill settings, while custom authenticators can extend username/password handling. HTTPS for the Web UI and REST API is enabled by setting SSL options, such as keystore paths, in drill-override.conf to encrypt communications. These options allow administrators to tailor access controls, with changes requiring Drillbit restarts to take effect.[65][66]