
HPCC

HPCC Systems, also known as High-Performance Computing Cluster, is an open-source, distributed computing platform designed for big data processing, analytics, and management, enabling scalable handling of massive datasets through parallel processing. Developed initially in 1999 at Seisint for managing large-scale datasets and formally released as open source by LexisNexis Risk Solutions in 2011, it provides an alternative to traditional big data frameworks like Hadoop by emphasizing simplicity, performance, and enterprise-grade reliability. The platform's core architecture revolves around two primary components: Thor, a data-centric batch cluster for tasks such as data cleansing, transformation, and enrichment at scales of billions of records per second, and Roxie, a high-performance query engine supporting real-time, sub-second responses for thousands of concurrent users. Programming is facilitated by ECL (Enterprise Control Language), a declarative, dataflow-oriented language that allows developers to define processing logic without low-level distributed programming, promoting efficient parallel execution across clusters. The system integrates seamlessly with cloud environments, including deployments on AWS and Azure, and supports storage options such as Amazon S3 or Azure Blob Storage, ensuring elasticity and cost-effectiveness for large-scale operations. Since its open-source debut, HPCC Systems has fostered a global developer community exceeding 2,000 ECL programmers, with adoption in sectors like finance, healthcare, and insurance by organizations such as universities and enterprises including Quod in Brazil. It emphasizes security features like LDAP integration, OAuth 2.0, and service meshes (e.g., Linkerd or Istio), while tools such as ECL Watch for monitoring and Real BI for visualization enhance its usability for end-to-end data workflows. This combination of data-centric design, high throughput, and open extensibility positions HPCC Systems as a robust solution for modern big data challenges.

Overview and History

Definition and Purpose

HPCC Systems is an open-source platform designed for scalable, high-performance data processing and analytics. Developed by LexisNexis Risk Solutions, it originated from internal needs for handling massive datasets and was released as open source in 2011 to enable broader adoption in data-intensive applications. The primary purposes of HPCC Systems include facilitating scalable data ingestion from diverse sources, performing ETL (Extract, Transform, Load) operations, conducting advanced analytics, and supporting machine learning workflows, all optimized for commodity hardware to achieve cost-effective scalability. This addresses the challenges of petabyte-scale data lakes by providing near real-time results and unified processing for both batch and streaming workloads. Unlike alternatives such as Hadoop, which rely on MapReduce models and separate ecosystems for batch and stream processing, HPCC Systems offers a single, end-to-end architecture with native support for both paradigms in a homogeneous environment. Its core principles emphasize a data-centric design that places data at the heart of operations, leveraging parallelism across distributed nodes for efficiency, and employing declarative programming via the ECL language to simplify development and ensure implicit parallelism without manual optimization.

Development Timeline

The development of HPCC Systems originated in 1999 at Seisint, a data analytics company and predecessor to LexisNexis Risk Solutions, where it was initially conceived as a memory-based system designed to handle large-scale queries on massive datasets for applications such as credit scoring and fraud detection. Following Seisint's acquisition by LexisNexis in 2004, the platform underwent extensive in-house development for over a decade, evolving to meet the demands of risk management, insurance analytics, and large-scale data processing needs, including the integration of technologies from subsequent acquisitions like ChoicePoint in 2008. On June 15, 2011, LexisNexis Risk Solutions publicly released HPCC Systems as an open-source project under the Apache License 2.0, marking a pivotal shift that allowed broader adoption and community involvement in its evolution. Early post-release milestones included the December 2011 announcement of the Thor Data Refinery Cluster's availability on Amazon Web Services (AWS) EC2, enabling scalable cloud-based processing for big data workloads. In January 2012, the platform introduced its extensible Machine Learning Library, providing parallel implementations of supervised and unsupervised learning algorithms accessible via the ECL programming language to support advanced analytics at scale. The project reached its 10th open-source anniversary on June 15, 2021, by which point it had adopted industry standards, enhanced its security features, and expanded its capabilities. Today, HPCC Systems remains an active open-source initiative with quarterly releases that incorporate community contributions and refinements. Version 10.0, released in 2025, emphasizes reductions in cloud operational costs through optimized resource usage, alongside performance enhancements and improved user interfaces for administrative tasks. Having been in productive use for over 20 years, the platform supports thousands of deployments across enterprises and academic institutions worldwide.

System Architecture

Thor Cluster

The Thor cluster serves as the primary data refinery engine within the HPCC Systems platform, designed for batch-oriented tasks such as extract, transform, and load (ETL) operations, record linking, and large-scale analytics on distributed commodity hardware. It processes vast datasets by importing raw data, performing transformations such as cleansing and linking to other sources, and outputting enriched files, enabling efficient handling of bulk data volumes that can reach billions of records in minutes. Built to operate on cost-effective, off-the-shelf servers, Thor leverages parallel execution to achieve high throughput without specialized hardware requirements. The cluster follows a master-slave topology, where the master node coordinates job scheduling and data distribution, while multiple slave nodes execute the processing in parallel. Data is partitioned across slave nodes using key-based methods, which determine how records are sorted and distributed for balanced workload allocation, ensuring efficient computation. Each slave node typically requires balanced resources, such as 4 CPU cores, 8 GB RAM, 1 Gb/sec network connectivity, and 200 MB/sec disk I/O, to optimize performance, with multiple slaves possible per physical server for finer-grained parallelism. Thor achieves horizontal scalability by expanding from a single node to thousands, supporting petabyte-scale datasets through seamless addition of nodes without manual reconfiguration of parallelism. This design incorporates fault tolerance via data replication, typically maintaining at least one or two copies of files across nodes, allowing automatic or manual failover to replicas if a slave fails, and recovery mechanisms like node replacement or data copying to maintain operations. In terms of performance, Thor employs a map-reduce-like paradigm but is optimized through dataflow execution graphs, where processing nodes execute in parallel as data flows continuously between them, avoiding the sequential map-shuffle-reduce cycles common in traditional MapReduce implementations. This enables Thor to handle petabyte-scale batch workflows efficiently on commodity clusters.
ECL queries are compiled into these execution graphs for deployment on Thor.
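As an illustrative sketch (the record layout and logical file names are hypothetical), a typical Thor ETL job distributes records by a hash of the match key so that subsequent sort and deduplication steps can run locally on each slave node:

```ecl
// Hypothetical record layout and logical file names, for illustration only
personRec := RECORD
    STRING20 firstName;
    STRING20 lastName;
    STRING10 zip;
END;

raw := DATASET('~demo::raw_persons', personRec, THOR);

// Spread records across slave nodes by a hash of the match key,
// so potential duplicates land on the same node
spread := DISTRIBUTE(raw, HASH32(lastName, firstName));

// LOCAL operations then run on each node independently,
// avoiding cross-node traffic during the sort and dedup
cleaned := DEDUP(SORT(spread, lastName, firstName, LOCAL),
                 lastName, firstName, LOCAL);

OUTPUT(cleaned, , '~demo::clean_persons', OVERWRITE);
```

The explicit DISTRIBUTE plus LOCAL pattern is what lets Thor keep the heavy sorting work node-local; without it, the global SORT would shuffle data across the cluster.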

Roxie Cluster

The Roxie cluster in HPCC Systems functions as the dedicated online query processing engine, optimized for delivering sub-second response times on indexed datasets to support real-time access and analytics. It operates as a high-performance component, enabling efficient handling of concurrent user queries through a scalable, distributed architecture. The cluster's design emphasizes distributed storage of indexes across multiple nodes, featuring load-balanced worker nodes, known as agents, that process incoming requests in parallel. This setup includes a combination of server and agent roles, where servers manage incoming queries and agents execute operations on partitioned index data, supporting key-value lookups for rapid retrieval and complex joins for advanced analytical computations. The cluster leverages a shared-nothing model, allowing seamless scaling from single nodes to thousands while maintaining data locality for optimal performance. Key optimizations in Roxie involve pre-building indexes from outputs generated by the Thor cluster, which are then preloaded into memory across nodes for immediate availability. Dynamic distribution of queries ensures balanced workload allocation, facilitating high throughput rates of thousands of requests per node per second and supporting extensive concurrency without bottlenecks. In hybrid deployments, Roxie complements Thor by serving query results derived from processed data lakes, providing a streamlined pathway for real-time insights on refined datasets.
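The division of labor, where Thor builds indexes and Roxie serves queries against them, can be sketched in ECL roughly as follows (file names and the record layout are hypothetical):

```ecl
// Hypothetical layout for a file previously written by Thor;
// recPos captures each record's file position for index lookups
personRec := RECORD
    STRING20 lastName;
    STRING10 zip;
    UNSIGNED8 recPos {VIRTUAL(fileposition)};
END;

persons := DATASET('~demo::clean_persons', personRec, THOR);

// Build the index on Thor; Roxie later serves lookups against it
personIdx := INDEX(persons, {zip, recPos}, '~demo::person_zip_idx');
BUILD(personIdx, OVERWRITE);

// Published as a Roxie query, STORED exposes a runtime parameter;
// FETCH resolves index hits back to full records
STRING10 searchZip := '' : STORED('searchZip');
OUTPUT(FETCH(persons, personIdx(zip = searchZip), RIGHT.recPos));
```

The same ECL compiles for either engine; only the BUILD step targets Thor, while the parameterized lookup is what gets published to Roxie.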

Software Architecture

ECL Programming Language

ECL (Enterprise Control Language) is a high-level, data-centric language designed specifically for defining data transformations, analytics, and processing on massive datasets within the HPCC Systems platform. It enables developers to express complex data operations in a non-procedural manner, focusing on what needs to be achieved rather than how, which facilitates portability across execution environments. ECL's syntax revolves around reusable attributes and definitions that build upon one another, allowing for efficient query composition and reuse. The language employs a declarative paradigm with a rich set of operators tailored for parallel execution, such as JOIN for combining datasets, PROJECT for transforming records, and SORT for ordering data. For instance, a simple projection might be written as:
projected := PROJECT(inputDataset, TRANSFORM(outRec, SELF.outputField := LEFT.inputField));
These operators abstract low-level details of data distribution and parallelism, compiling directly to optimized C++ code for high-performance execution on clusters. ECL supports introspection through activity graphs, which visualize the sequence of operations as a directed acyclic graph, aiding in debugging and optimization. Key constructs include dataset definitions using the DATASET keyword, such as myDataset := DATASET('filePath', recordStructure, THOR);, and inline datasets for embedding small collections directly, like inlineData := DATASET([{'value1'}, {'value2'}], {STRING field});. ECL's advantages stem from its ability to abstract distribution details, ensuring that code remains portable across different cluster configurations without modification. This portability allows the same ECL queries to run efficiently on both batch (Thor) and query (Roxie) engines with minimal adjustments. Additionally, ECL includes modular libraries for advanced analytics, such as machine learning modules for tasks like clustering and classification, promoting code reusability and rapid development in big data environments.
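Putting these constructs together, a complete, self-contained workunit (with illustrative field names) might look like:

```ecl
// Inline dataset plus a named TRANSFORM -- a complete ECL workunit
inRec := RECORD
    STRING10 name;
    UNSIGNED1 age;
END;

people := DATASET([{'Ann', 34}, {'Bob', 17}], inRec);

outRec := RECORD
    STRING10 name;
    BOOLEAN  isAdult;
END;

outRec flag(inRec L) := TRANSFORM
    SELF.name := L.name;
    SELF.isAdult := L.age >= 18;
END;

OUTPUT(PROJECT(people, flag(LEFT)));
```

Note that nothing in this code references nodes or partitioning; the compiler decides how the PROJECT parallelizes across whatever cluster executes it.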

Middleware and Integration Components

The middleware layer of HPCC Systems consists of system servers that facilitate workflow control, inter-component communication, and distributed job execution across clusters. Key components include the ESP (Enterprise Services Platform) server, which serves as the external communications layer by providing a framework for services like WsECL for query submission and ECL Watch for web-based management, supporting protocols such as XML, SOAP, JSON, and secure HTTPS/SSL. Client APIs enable programmatic interaction, with the HPCC4J library offering Java-based access to web services and C++ tools, while PyHPCC provides a Python wrapper for communicating with HPCC instances via these services. Auxiliary components support system reliability and abstraction. The Dali server acts as a distributed abstract layer (DAL), managing system metadata such as workunit records, logical file directories, message queues, and locking to abstract the underlying file system. Configuration is handled through the Configuration Manager, a graphical utility that edits the environment.xml file to define global settings like paths and component placements, ensuring consistent deployment. Security modules integrate LDAP for granular access control to files and workunits, alongside basic htpasswd authentication and SSL encryption for communications, configurable via ECL Watch or utilities like initldap. The integration ecosystem extends HPCC Systems to third-party tools and hybrid environments. Plugins support streaming ingestion from Apache Kafka via an optional kafka embed module and a Spring Framework-based HTTP server, enabling publish-subscribe messaging for real-time processing. JDBC drivers allow direct SQL access without writing ECL, while ODBC support facilitates connections from tools like Excel or BI platforms; Spark integration occurs through a stand-alone distributed connector and Java library, permitting user-managed Spark clusters to query and write HPCC data. Compatibility with cloud services is achieved through deployment options that support hybrid setups, such as linking to AWS or Azure storage for scalable data lakes.
Management capabilities are centralized in ECL Watch, a web interface accessible at port 8010, which monitors job status, resource usage, and error handling by browsing workunits, viewing data flow graphs, and accessing system logs. Additional system servers, such as Sasha for archiving workunits and the ECL Scheduler for event-based automation, enhance operations without requiring external load balancers for most components.

HPCC Systems Platform

Key Features and Capabilities

The HPCC Systems platform distinguishes itself through its lightweight core architecture, enabling high-speed data engineering with near real-time query results in sub-second response times for thousands of concurrent users via the Roxie cluster. This performance is complemented by the Thor cluster's ability to process billions of records per second in batch operations, supporting efficient resource utilization on commodity hardware and reducing total cost of ownership (TCO) compared to more resource-intensive alternatives. The platform's design emphasizes low operational overhead, allowing organizations to achieve significant cost savings in cloud environments through optimized scaling and minimal infrastructure demands. Key capabilities include a built-in machine learning library offering scalable algorithms such as K-Means for clustering and Decision Trees for supervised learning, integrated directly into the ECL programming environment. Additional features encompass data profiling tools via the Scalable Automated Linking Technology (SALT) for tasks like record linking and quality assessment, alongside graph analytics modules that facilitate relationship mapping and network analysis on large datasets. The platform provides full-spectrum data support, handling both structured and unstructured data through distributed file systems in Thor for ETL processes and Roxie for indexed queries, enabling seamless integration across diverse data types without proprietary storage requirements. HPCC Systems offers unified processing for both batch and real-time workloads, eliminating the need for separate systems and enhancing productivity with the declarative ECL language, which significantly reduces code volume relative to imperative languages by leveraging modular, parallelizable constructs that compile to optimized C++. It incorporates robust fault tolerance through data replication across nodes in both Thor and Roxie clusters, ensuring no single points of failure and maintaining operations even under node loss.
As of 2025, enhancements include native Kubernetes deployments with improved Helm chart support for automated scaling on cloud providers like AWS and Azure, alongside security advancements such as OAuth 2.0 authentication and updated cryptographic libraries for stronger protections. Machine learning and analytics extensions continue to evolve, with ongoing support for advanced algorithms.

Deployment Options and Editions

HPCC Systems is available in two primary editions: the Community Edition, which is free and open-source under the Apache 2.0 license, suitable for development, testing, and production use by organizations seeking a cost-effective solution supported by community forums and resources; and the Enterprise Edition, a paid offering provided through partners such as ClearFunnel, which includes professional support, advanced security features, performance optimizations, and customized implementations for large-scale enterprise environments. Deployment options for HPCC Systems encompass on-premises installations on bare-metal clusters using operating systems like Ubuntu 22.04 or 24.04 and CentOS 7 and 8, allowing users to configure custom hardware setups for specific performance needs. Cloud deployments are facilitated through a containerized platform compatible with major providers including AWS, Microsoft Azure, and Google Cloud Platform, leveraging pre-built Amazon Machine Images (AMIs) or equivalent templates for rapid provisioning. Additionally, containerization with Docker and orchestration via Kubernetes and Helm charts enables easy scaling and management, while single-node installations serve as entry points for testing and learning on local machines or virtual environments like Minikube. Scalability in HPCC Systems ranges from small single-node setups for prototyping to expansive multi-petabyte clusters spanning thousands of nodes, with automated provisioning tools and support for rolling upgrades to minimize downtime during expansions. The platform's design allows seamless growth from development environments to production-scale data lakes, handling massive data volumes across Thor and Roxie components. The latest version, HPCC Systems 10.0.10-1, released on November 20, 2025, emphasizes cloud-native enhancements including cost optimizations for containerized deployments, with quarterly updates delivered through the official repository to incorporate community contributions and patches.
Post-deployment management can leverage middleware components such as ECL Watch for monitoring and administration, as outlined in related documentation.
