
Apache Storm

Apache Storm is a free and open-source distributed real-time computation system designed for reliably processing unbounded streams of data, serving as the Hadoop equivalent for real-time processing rather than batch jobs. It enables the development of scalable, fault-tolerant applications that handle continuous data flows, such as real-time analytics, online machine learning, continuous computation, distributed RPC, and ETL processes. Originating from work at BackType by Nathan Marz, the project was open-sourced by Twitter following its 2011 acquisition of BackType, entered the Apache Incubator in September 2013, and graduated to a top-level project in 2014.

At its core, Apache Storm operates through topologies, which package real-time application logic as directed acyclic graphs of computation components that run indefinitely. These topologies consist of streams (unbounded sequences of tuples, which are named lists of values such as integers or strings) sourced by spouts that ingest data from external systems like queues or APIs, and processed by bolts that perform operations like filtering, aggregation, or joins. Spouts can be reliable (enabling tuple replay on failure) or unreliable, while bolts use an OutputCollector to emit new tuples and acknowledge processing to track dependencies. Storm can be used with any programming language, integrates with queueing and database systems through abstractions like spouts, and guarantees that every tuple is fully processed (at least once, with configurable timeouts for failure handling) so that no data is lost.

Key features include high performance (over one million tuples processed per second per node), horizontal scalability by distributing topologies across clusters, and ease of setup, with fault tolerance provided via task reassignment on node failures. It has been widely adopted across industries for handling high-velocity data and remains actively maintained, with the latest release being version 2.8.3 as of November 2025.

Introduction

Overview and Purpose

Apache Storm is a free and open-source distributed computation system designed for reliably processing unbounded streams of data in real time with low latency. It enables the development of scalable applications that handle continuous, high-volume data flows without the delays inherent in traditional batch systems. The primary purpose of Apache Storm is to process ongoing data streams from diverse sources, such as sensors, server logs, or feeds, supporting applications like real-time analytics, extract-transform-load (ETL) processes, and online machine learning. For instance, companies like Twitter use it for personalization and revenue optimization, while Spotify leverages it for music recommendations and ad targeting. This focus on real-time processing allows for immediate insights and actions, addressing the need for time-sensitive computations in modern data environments.

Storm emerged to overcome the limitations of batch-oriented frameworks like Hadoop MapReduce, which are ill-suited for scenarios requiring sub-second response times on live data. By providing primitives for parallel real-time computation, it simplifies the creation of robust, distributed systems, much as Hadoop eased batch workloads. In terms of scale, Apache Storm can process over a million tuples per second per node. Benchmarks have demonstrated it handling up to 1,000,000 messages per second on a 10-node cluster, underscoring its efficiency for large-scale deployments.

Key Features and Benefits

Apache Storm excels in horizontal scalability, distributing processing tasks across clusters to manage large-scale, unbounded streams of data without requiring downtime; new machines can be added dynamically, and running topologies can be rebalanced across them. This design supports efficient handling of high-volume workloads by parallelizing topologies over multiple nodes.

A core strength is its fault tolerance, providing at-least-once processing semantics through acknowledgment-based tracking and tuple replay mechanisms, with exactly-once semantics available via the Trident extension; this ensures that every input is processed reliably even in the event of failures.

Storm's language agnosticism allows developers to write components in multiple programming languages through its multi-language protocol, with topologies defined and submitted via Thrift. This flexibility accommodates diverse development environments while keeping all components within a single topology.

Key benefits include sub-second processing latency, making Storm well suited to time-sensitive use cases such as real-time analytics. The Trident extension unifies batch and stream processing by offering high-level abstractions for operations like aggregations and joins, supporting exactly-once semantics and transactional persistence across batches. Additionally, Storm integrates with message queues like Kafka, enabling efficient data ingestion and output through dedicated spouts and bolts. Performance-wise, Storm achieves up to 1 million tuples per second per node on standard hardware while preserving these processing guarantees at scale.

Development History

Origins and Early Development

Apache Storm originated at BackType, a startup founded in 2008 by Christopher Golda and Michael Montano. The project was conceived by Nathan Marz, BackType's first employee and lead architect, in December 2010 to address the limitations of existing stream processing systems, which relied on brittle combinations of queues and worker processes. Marz aimed to create a unified framework that could handle unbounded data streams reliably, automating deployment, scaling, and fault tolerance to simplify real-time processing of high-velocity data.

Development began in earnest in early 2011, with Marz prototyping core abstractions like streams, spouts for ingestion, and bolts for processing over a five-month period. Key innovations included a broker-free algorithm for guaranteeing message processing, implemented primarily in Clojure with Java APIs for user-facing components. Early contributions came from BackType team members, including intern Jason Jackson, who helped automate AWS deployments. The motivation intensified after Twitter acquired BackType in July 2011, integrating the team to tackle real-time analytics on Twitter's massive firehose of tweets, which demanded low-latency processing at unprecedented scale.

Storm was open-sourced on September 19, 2011, during Marz's presentation at the Strange Loop conference, under the Eclipse Public License to encourage community adoption for distributed real-time computation. This release marked the transition from an internal tool to a publicly available project, initially hosted on GitHub and rapidly gaining attention for its Hadoop-like approach to real-time processing.

Apache Project Milestones

Apache Storm entered the Apache Incubator on September 18, 2013, marking the formal beginning of its adoption into the Apache Software Foundation (ASF) ecosystem. This step followed Twitter's open-sourcing of the project in 2011, after its initial development at BackType, and represented the initial phase of transitioning Storm from a company-specific tool to an open, community-governed initiative under ASF oversight. During incubation, the project focused on aligning with Apache standards, including licensing and community building, while addressing key issues such as releasing version 0.9.0 with essential features for distributed stream processing.

Storm graduated from the Apache Incubator to become a Top-Level Project (TLP) on September 29, 2014, signifying its maturity and self-sufficiency within the ASF. This elevation granted the project greater autonomy in governance and development, allowing it to operate independently while benefiting from the broader Apache community's resources and visibility. The progression from incubation entry to TLP status, which took just over a year, highlighted the project's strong technical foundation and growing adoption for real-time data processing needs.

The transition to Apache governance shifted Storm from Twitter-led development to a fully community-driven model under the ASF's Project Management Committee (PMC). Composed of active contributors who demonstrated merit through contributions such as code and documentation, the PMC ensured decentralized decision-making, aligning with Apache's meritocratic principles. This change fostered broader participation from external developers and organizations, reducing reliance on any single corporate sponsor and strengthening the project's long-term sustainability.

Key milestones during this period included deepened integration with the Hadoop ecosystem in 2014, enabling Storm to complement Hadoop by handling real-time workloads on the same clusters.
This synergy allowed Hadoop users to process unbounded data streams efficiently alongside interactive and batch tasks, broadening Storm's applicability in enterprise environments. In 2016, the release of Storm 1.0.0 on April 12 further solidified its stability, introducing performance optimizations, improved logging with Log4j 2, and native support for streaming windows to enhance reliability in production deployments. These advancements marked Storm's evolution into a robust, enterprise-ready platform within the Apache portfolio. Ongoing community contributions have continued to drive refinements, supporting its integration into modern data pipelines.

Recent Releases and Evolution

Apache Storm has undergone steady evolution since its maturation as an Apache top-level project, with releases emphasizing performance optimizations, dependency updates, and compatibility enhancements for contemporary deployment environments. Version 2.0.0, released on May 30, 2019, represented a pivotal update by rewriting the core engine in Java, replacing much of the original Clojure codebase to improve maintainability and contributor accessibility. This release introduced the Streams API, a typed and functional approach to stream processing that optimizes computational pipelines, alongside a high-performance core achieving latencies under 1 microsecond through a leaner threading model and efficient backpressure handling.

Subsequent versions have focused on refinement and integration. Storm 2.5.0, released on August 4, 2023, brought dependency upgrades including RocksDB to 6.27.3, removed Python 2 support to align with end-of-life practices, and added features like a scheduler with node constraints. The most recent release, 2.8.3 on November 2, 2025, primarily addresses maintenance through upgrades to key dependencies (such as the Kafka client to version 4.0, which requires Kafka 2.1+ brokers, Netty to 4.2.7.Final, and Jetty to 11.0.26), along with bug fixes for blob store synchronization and the removal of deprecated storm-sql modules.

The project's evolution reflects adaptations to modern computing paradigms, including enhanced container support via official Docker images available since early versions and improved compatibility for Kubernetes deployments through community Helm charts and configurations starting around version 2.2.0 in June 2020. Security has also advanced, with Kerberos authentication integrated for secure multi-tenant clusters and ongoing refinements in releases like 2.4.0 (March 2022) to support automated credential reloading for components including DRPC. As an active Apache project, Storm maintains regular release cycles prioritizing stability and bug resolution over major overhauls since 2020, supported by a community of 48 committers.
This approach ensures reliability for established real-time stream processing workloads.

Architecture

Core Components

Apache Storm's runtime environment is built around a distributed architecture with several key daemons and services that manage resource allocation, task execution, and coordination across a cluster. The primary components include the Nimbus master daemon, Supervisor worker daemons, ZooKeeper for coordination, and worker processes that handle actual computation.

Nimbus serves as the central master daemon responsible for distributing code around the cluster, assigning tasks to worker nodes, and monitoring for failures. It is designed to be stateless and fail-fast, meaning it can be restarted without losing state, as all persistent data is stored externally in ZooKeeper or on local disks. Nimbus uses ZooKeeper to coordinate with other components, ensuring high availability through leader election in multi-node setups.

Supervisors are the worker daemons that run on each node in the cluster, listening for work assignments from Nimbus and managing local worker processes accordingly. Each Supervisor starts and stops these processes based on the assignments it receives and ensures that the assigned tasks are executed reliably. Like Nimbus, Supervisors are stateless and fail-fast, relying on ZooKeeper for heartbeat reporting and configuration synchronization to maintain cluster health.

ZooKeeper acts as an external coordination service essential for maintaining distributed state, tracking heartbeats from Nimbus and Supervisors, and storing cluster configuration information. It enables fault-tolerant operation by providing a centralized yet highly available repository for metadata, such as task assignments and topology details, without which the cluster cannot function.

Worker processes are Java Virtual Machine (JVM) instances launched by Supervisors to execute the tasks of a specific topology, each occupying one slot (a port) on a Supervisor. These processes run multiple executor threads, each handling one or more tasks from spouts or bolts, allowing for parallel processing across the cluster.
The number of worker processes is defined by the topology's worker count, which determines how many JVMs the topology occupies and therefore its resource consumption on the nodes. For internal messaging between components, Apache Storm relies on Netty as the default transport starting from version 0.9, offering improved performance over the earlier ZeroMQ implementation used in pre-0.9 releases. This shift to Netty improves throughput and reliability in distributed communication, such as between worker processes on different nodes.

Data Flow and Topology

In Apache Storm, a topology defines the logical structure of a stream processing application as a directed acyclic graph (DAG) consisting of spouts and bolts interconnected by streams. Spouts serve as the sources of data, injecting streams into the topology, while bolts represent the processing units that transform, filter, or aggregate the incoming data and emit new streams for further processing. This graph-based model enables the definition of complex, multi-stage pipelines for real-time computation, where data flows continuously from spouts through multiple bolts without interruption.

Data in Storm flows as unbounded sequences known as streams, each comprising tuples: ordered lists of values with a predefined schema of named fields, whose values may be strings, integers, or other serializable objects. Tuples are emitted by spouts or bolts and routed to downstream components, allowing for parallel and distributed processing across a cluster. The flow is inherently asynchronous, with each tuple processed independently to ensure scalability and low latency in handling high-velocity data.

Stream groupings determine how tuples from an upstream component are distributed to the tasks of a downstream bolt, enabling various partitioning strategies to balance load and preserve order where necessary. Common groupings include shuffle grouping, which randomly distributes tuples for even load balancing; fields grouping, which routes tuples to tasks based on hashed values of specified fields to ensure related data follows the same path; all grouping, which replicates tuples to every task for broadcast; global grouping, which directs all tuples to the task with the lowest ID; direct grouping, where the producer explicitly selects the target task; and local-or-shuffle grouping, which prefers target tasks within the same worker process before falling back to an ordinary shuffle. These strategies allow developers to tailor data distribution to the application's semantics, such as ensuring key-based aggregation or maximizing parallelism.
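
The routing rules for three of these groupings can be made concrete with a small model. The sketch below is plain Java and does not use Storm's own classes: GroupingSketch and its method names are hypothetical, the hash-modulo rule is an assumed simplification of how fields grouping picks a task, and shuffle grouping is approximated by round-robin rather than random selection.

```java
import java.util.List;
import java.util.Objects;

/** Illustrative model of three stream groupings; not Storm's internal code. */
final class GroupingSketch {

    /** Fields grouping: equal key values always map to the same task index. */
    static int fieldsGrouping(List<?> keyFields, int numTasks) {
        // Assumption: plain hash-modulo; Storm's real hashing may differ.
        return Math.floorMod(Objects.hashCode(keyFields), numTasks);
    }

    /** Shuffle grouping, modeled here as round-robin to spread load evenly. */
    static int shuffleGrouping(long tupleSequence, int numTasks) {
        return (int) Math.floorMod(tupleSequence, (long) numTasks);
    }

    /** Global grouping: every tuple goes to the task with the lowest ID. */
    static int globalGrouping(int[] taskIds) {
        int lowest = taskIds[0];
        for (int id : taskIds) {
            lowest = Math.min(lowest, id);
        }
        return lowest;
    }

    public static void main(String[] args) {
        int tasks = 4;
        // The same word is always routed to the same task index.
        System.out.println(fieldsGrouping(List.of("storm"), tasks));
        System.out.println(fieldsGrouping(List.of("storm"), tasks));
        // Consecutive tuples rotate across the available tasks.
        for (long seq = 0; seq < 4; seq++) {
            System.out.println(shuffleGrouping(seq, tasks));
        }
        System.out.println(globalGrouping(new int[] {7, 3, 9}));
    }
}
```

The property that matters for fields grouping is determinism: a counting bolt only produces correct per-key totals if every tuple carrying a given key reaches the same task.
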
Once defined, a topology is submitted to a Storm cluster for execution, where it runs indefinitely until explicitly killed. Parallelism is achieved through tasks, which are instances of a spout or bolt, and executors, which are JVM threads that each run one or more tasks of the same component. The number of tasks and executors for each component is configurable, and Storm automatically distributes them across available worker nodes to process streams in parallel, with reliability features like tuple acknowledgments ensuring reliable delivery during flow (detailed in the Fault Tolerance Mechanisms section).
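
The relationship between tasks and executors can be illustrated with a small calculation. The sketch below is plain Java, not Storm code: ExecutorAssignmentSketch and assign are hypothetical names, and the even, contiguous split is an assumption made for illustration rather than a statement of Storm's actual scheduling logic.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative split of a component's tasks across its executor threads. */
final class ExecutorAssignmentSketch {

    /** Returns, for each executor, the list of task indices it would run. */
    static List<List<Integer>> assign(int numTasks, int numExecutors) {
        if (numExecutors <= 0 || numTasks < numExecutors) {
            throw new IllegalArgumentException("need 1 <= executors <= tasks");
        }
        List<List<Integer>> executors = new ArrayList<>();
        int base = numTasks / numExecutors;   // tasks every executor receives
        int extra = numTasks % numExecutors;  // first 'extra' executors get one more
        int next = 0;
        for (int e = 0; e < numExecutors; e++) {
            int size = base + (e < extra ? 1 : 0);
            List<Integer> tasks = new ArrayList<>();
            for (int i = 0; i < size; i++) {
                tasks.add(next++);
            }
            executors.add(tasks);
        }
        return executors;
    }

    public static void main(String[] args) {
        // A component with 4 tasks and 2 executors: each thread runs 2 tasks.
        System.out.println(assign(4, 2));
    }
}
```

Configuring more tasks than executors leaves headroom: the task count stays fixed for the lifetime of the topology in this model, while the number of threads running those tasks can change.
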

Fault Tolerance Mechanisms

Apache Storm provides fault tolerance through a combination of daemon-level recovery mechanisms and message processing guarantees, ensuring continuous operation in distributed environments despite node or process failures. The system's master daemon, Nimbus, and worker daemons, Supervisors, are designed to be fail-fast and stateless, with all state persisted in ZooKeeper or on local disk. This allows for automatic recovery without data loss. When a worker dies, the Supervisor detects the failure via heartbeat timeouts and restarts the process locally; if the node itself fails, Nimbus detects the absence of heartbeats and reassigns the affected tasks to other available nodes in the cluster.

A core aspect of Storm's reliability is its message processing guarantees, achieved via an acknowledgment protocol that tracks tuples through the topology. Each tuple emitted by a spout is assigned a unique 64-bit ID, and bolts anchor their output tuples to input ones, forming a tuple tree (a directed acyclic graph of dependencies). Special-purpose acker bolts monitor these trees by maintaining a running XOR of tuple IDs; once acknowledgments have arrived from all downstream tasks, the acker notifies the spout to ack the original tuple, and the tuple is failed if the tree does not complete within the timeout (default 30 seconds). This protocol ensures at-least-once processing semantics by enabling spouts to replay failed tuples, with external queuing systems handling re-emission of unacknowledged messages.

Storm supports three configurable reliability modes per topology to balance reliability with performance. In at-most-once mode, acker bolts are disabled (via TOPOLOGY_ACKERS = 0), allowing potential message loss but minimizing overhead. At-least-once, the default mode, uses the full acknowledgment protocol for reliable delivery with possible duplicates. Exactly-once semantics are achieved through the Trident API, which introduces transactional topologies and state management; each batch is assigned a unique transaction ID, and state updates (e.g., in external databases) are made idempotent by checking IDs before committing, preventing duplicates on retries.
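
The XOR bookkeeping described above is compact enough to model directly. The sketch below is plain Java, not Storm's acker implementation: AckerSketch and its methods are hypothetical names, and it assumes that a bolt's new output tuples are registered before the input tuple is acked, so the running value cannot reach zero prematurely.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative model of XOR-based tuple-tree tracking; not Storm's code. */
final class AckerSketch {

    // Maps a spout tuple's root ID to the running XOR of all tuple IDs
    // that have been anchored or acked in its tree so far.
    private final Map<Long, Long> ackVals = new HashMap<>();

    /** Registers a tuple ID that was created in the tree rooted at rootId. */
    void anchored(long rootId, long tupleId) {
        xor(rootId, tupleId);
    }

    /** Records an ack. Returns true once every anchored ID has been acked. */
    boolean acked(long rootId, long tupleId) {
        // Each ID is XORed in twice (anchor + ack), so a complete tree yields 0.
        return xor(rootId, tupleId) == 0L;
    }

    private long xor(long rootId, long tupleId) {
        return ackVals.merge(rootId, tupleId, (a, b) -> a ^ b);
    }

    public static void main(String[] args) {
        AckerSketch acker = new AckerSketch();
        long root = 1L;
        acker.anchored(root, 0x1111L);                    // spout emits tuple A
        acker.anchored(root, 0x2222L);                    // bolt emits B anchored to A
        System.out.println(acker.acked(root, 0x1111L));   // A acked, B still pending
        System.out.println(acker.acked(root, 0x2222L));   // tree fully processed
    }
}
```

Because XOR is commutative and self-inverse, the tracker needs only a fixed amount of memory per spout tuple regardless of how large the tuple tree grows, which is the main appeal of this design.
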
For stateful processing, Storm incorporates checkpointing to capture periodic snapshots of bolt states, facilitating recovery and replay after failures. Stateful bolts implementing IStatefulBolt (or extending BaseStatefulBolt) use key-value state stores, with a dedicated checkpoint spout emitting special tuples at a configurable interval (default 1 second via topology.state.checkpoint.interval.ms). These trigger a three-phase commit across the topology: prepare (save tentative state), commit (finalize on acknowledgments), and notify (record completion). Upon failure, the system rolls back to the last committed checkpoint, replaying tuples from that point to restore consistency, often integrated with persistent backends like Redis or HBase for durability.
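
The prepare, commit, and rollback behaviour described above can be sketched with an in-memory key-value state. This is a plain-Java illustration rather than Storm's IStatefulBolt machinery: CheckpointedCounter and its methods are hypothetical, and a real state store would persist the prepared and committed snapshots durably instead of holding them in memory.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative checkpointed counter state; not Storm's state API. */
final class CheckpointedCounter {

    private Map<String, Long> committed = new HashMap<>();  // last finalized snapshot
    private Map<String, Long> prepared = null;              // tentative snapshot
    private final Map<String, Long> live = new HashMap<>(); // working state

    /** Normal tuple processing: update the working state. */
    void increment(String key) {
        live.merge(key, 1L, Long::sum);
    }

    /** Prepare phase: save a tentative copy of the working state. */
    void prepare() {
        prepared = new HashMap<>(live);
    }

    /** Commit phase: promote the tentative copy to the committed snapshot. */
    void commit() {
        if (prepared == null) {
            throw new IllegalStateException("commit without prepare");
        }
        committed = prepared;
        prepared = null;
    }

    /** Failure handling: discard uncommitted work and restore the snapshot. */
    void rollback() {
        prepared = null;
        live.clear();
        live.putAll(committed);
    }

    long get(String key) {
        return live.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        CheckpointedCounter state = new CheckpointedCounter();
        state.increment("storm");
        state.increment("storm");
        state.prepare();
        state.commit();
        state.increment("storm");   // processed after the checkpoint
        state.rollback();           // simulated failure
        System.out.println(state.get("storm"));
    }
}
```

After a rollback the tuples processed since the last commit are gone from the state, which is why recovery also has to replay them from the source.
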

Programming and Usage

Building Topologies

Apache Storm topologies are constructed using a declarative API that defines spouts as data sources and bolts as processing units, connected via streams to form a directed acyclic graph (DAG) for stream processing. Developers implement custom logic by extending base interfaces or classes, enabling integration with various data sources and processing requirements. This approach supports both simple and complex stream transformations, with topologies defined programmatically in languages like Java.

Spouts serve as the entry points for data streams in a topology, responsible for emitting tuples from external sources such as message queues or files. They implement the ISpout interface, which includes the non-blocking nextTuple() method to produce tuples and the ack() and fail() methods for reliability tracking. Spouts can be reliable, capable of replaying tuples in case of processing failures, or unreliable for higher throughput at the cost of potential data loss. A common example is the KafkaSpout, which reads messages from Kafka topics and emits them as tuples, supporting offset management for fault tolerance.

Bolts process incoming tuples from spouts or other bolts, performing operations like filtering, aggregation, or joining, and may emit new tuples to downstream components. For simple transformations without state or complex logic, developers use the IBasicBolt interface, which automatically acknowledges tuples after the execute() method returns. More advanced processing, such as maintaining state or emitting to multiple streams, is handled by extending BaseRichBolt, which provides lifecycle methods like prepare() for initialization and manual acknowledgment control. Bolts declare output fields using OutputFieldsDeclarer to define the schema of emitted tuples, so downstream components can refer to fields by name.

Topologies are built using the TopologyBuilder class in Java, which allows developers to specify spouts, bolts, and connections with groupings like shuffle, fields, or all.
For instance, a basic topology might add a spout with setSpout("kafka-spout", new KafkaSpout(...)), a bolt with setBolt("word-splitter", new SplitBolt()).shuffleGrouping("kafka-spout"), and then compile it via builder.createTopology(). Since Storm 2.0, the Streams API offers a typed, functional alternative for declaring pipelines, though Java remains the primary implementation language.

For stateful stream processing, Storm provides Trident as a high-level abstraction layer on top of the core API, enabling exactly-once semantics through transactional batches and aggregations. Trident processes streams in micro-batches with unique transaction IDs, ensuring idempotent updates to state stores like databases or caches during failures. It supports operations such as joins, grouping by fields, and aggregations (e.g., counting occurrences across batches), compiling them into optimized Storm topologies that minimize network shuffling. Developers define Trident topologies using a fluent API, for example: TridentTopology topology = new TridentTopology(); TridentState wordCounts = topology.newStream("kafka", new KafkaSpout(...)).each(new Fields("sentence"), new Split(), new Fields("word")).groupBy(new Fields("word")).persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));. Note that the chain ends in persistentAggregate, which yields a TridentState rather than a Stream.

Before submitting topologies to a cluster, developers test them in local mode, which simulates a full cluster within a single process using threads for worker nodes. This mode allows rapid iteration by running topologies via the storm local command or programmatically with LocalCluster, capturing logs and exceptions for debugging without cluster overhead. Configurations like TOPOLOGY_DEBUG enable verbose logging, and options such as --java-debug facilitate breakpoints, helping confirm that topologies behave correctly prior to production deployment.
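
The idempotent state update that underpins Trident's exactly-once behaviour can be shown without any Storm dependency. The sketch below is plain Java under stated assumptions: TransactionalCountState and applyBatch are hypothetical names, batches are assumed to be applied in transaction-ID order, and a replayed batch is assumed to contain the same tuples as the original.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative transaction-ID check for idempotent counts; not Trident's code. */
final class TransactionalCountState {

    /** Stored value: the count plus the ID of the last batch applied to it. */
    private static final class Entry {
        long lastTxId;
        long count;
    }

    private final Map<String, Entry> store = new HashMap<>();

    /** Applies a batch's partial count for a key, skipping replayed batches. */
    void applyBatch(long txId, String key, long batchCount) {
        Entry entry = store.get(key);
        if (entry == null) {
            entry = new Entry();
            entry.lastTxId = -1L;  // no batch applied yet
            store.put(key, entry);
        }
        if (entry.lastTxId == txId) {
            return;  // this batch was already applied before a failure
        }
        entry.count += batchCount;
        entry.lastTxId = txId;
    }

    long count(String key) {
        Entry entry = store.get(key);
        return entry == null ? 0L : entry.count;
    }

    public static void main(String[] args) {
        TransactionalCountState state = new TransactionalCountState();
        state.applyBatch(1L, "apple", 3L);
        state.applyBatch(1L, "apple", 3L);  // replay of batch 1 is ignored
        state.applyBatch(2L, "apple", 2L);
        System.out.println(state.count("apple"));
    }
}
```

Storing the transaction ID next to the value is what turns at-least-once delivery of batches into exactly-once state updates: duplicates still arrive, but they no longer change the result.
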

Deployment and Configuration

Apache Storm supports multiple deployment modes to accommodate development, testing, and production environments. In local mode, topologies run on a single machine without a distributed cluster, ideal for development and testing as it simulates the full environment using in-memory messaging. Pseudo-distributed mode operates on one machine but launches multiple processes for Nimbus, Supervisors, and workers, providing an environment closer to a full cluster for testing. Full cluster mode distributes components across multiple machines for production-scale processing, requiring coordination via ZooKeeper.

Setting up a Storm cluster begins with installing ZooKeeper, a distributed coordination service essential for leader election and configuration management; it should be deployed on separate nodes with log compaction enabled for reliability. Next, ensure Java 11 or higher and Python 3.x are installed on all Nimbus and worker nodes. Download the latest Storm release from the official site, extract it to a consistent directory on each machine, and configure the storm.yaml file, which overrides defaults from defaults.yaml. Key configurations include storm.zookeeper.servers to specify ZooKeeper hosts (e.g., a list of IP addresses), storm.local.dir for the local state directory (e.g., /mnt/storm with sufficient disk space), nimbus.seeds listing Nimbus hostnames or IPs for discovery, and supervisor.slots.ports defining available ports for worker processes (defaults to 6700-6703, allowing up to four parallel workers per Supervisor). Launch the daemons using bin/storm nimbus on the master node, bin/storm supervisor on worker nodes, and bin/storm ui for the web interface, ensuring all run under a process supervisor like systemd for persistence.

Scaling in Storm involves adjusting resource allocation through configuration. Cluster-wide capacity is increased by adding more nodes or modifying supervisor.slots.ports in storm.yaml to increase the number of worker slots per machine, each handling one worker process.
For fine-grained control, topology-specific settings such as the number of executors (the parallelism hint) set for each bolt or spout in the topology determine parallelism, with topology.max.task.parallelism capping the overall limit to prevent overload. These adjustments allow horizontal scaling to handle higher throughput without restarting the cluster.

Monitoring deployed topologies relies on the Storm UI, accessible at http://{ui-host}:8080, which displays metrics such as throughput, latency, resource usage, and topology health, enabling debugging of bottlenecks. Logs in the logs/ directory provide detailed traces, while configurable health checks in storm.health.check.dir (default: healthchecks) run scripts to verify daemon status, with timeouts set via storm.health.check.timeout.ms (default: 5000 ms). This setup supports proactive issue resolution in production.

For cloud environments, Storm can be integrated with resource managers like YARN and Mesos, allowing dynamic allocation of containers for workers in Hadoop or Mesos clusters. Since version 2.2, Kubernetes deployment is possible via community Helm charts, facilitating orchestrated rollouts and scaling in containerized setups.

Integration with Ecosystems

Apache Storm facilitates end-to-end data pipelines by integrating with various input sources and output sinks through its spout and bolt abstractions, enabling real-time data ingestion and processing. This connectivity extends to broader ecosystems, allowing Storm to serve as a versatile component in distributed data workflows.

For input sources, Storm supports dedicated spouts to consume streams from messaging systems such as Apache Kafka, Amazon Kinesis, and RabbitMQ. The KafkaSpout, for instance, reads from Kafka topics using KafkaSpoutConfig, supporting offset strategies like EARLIEST or LATEST and processing guarantees such as AT_LEAST_ONCE, making it suitable for reliable stream ingestion. Similarly, the KinesisSpout fetches records from Kinesis streams, managing shard iterators and storing progress in ZooKeeper for fault-tolerant restarts, with configurations for retry handlers and record limits. RabbitMQ integration is achieved via community-maintained spouts that poll queues and emit messages as tuples, leveraging Storm's spout API for custom queueing brokers.

On the output side, Storm employs bolts to persist processed data to storage systems including HDFS, HBase, Cassandra, and Elasticsearch. The HdfsBolt writes delimited files or sequence files to HDFS, with configurable rotation policies (e.g., by size or count) and sync intervals for efficient batch integration. For HBase, the HBaseBolt uses mappers like SimpleHBaseMapper to insert tuples as rows or counters, supporting WAL writes and batching for high-throughput operations. Cassandra integration maps tuples to CQL statements, enabling inserts or batches across tables with node and port configurations for cluster connectivity. Elasticsearch bolts, such as EsIndexBolt, index tuples as documents using EsTupleMapper to define fields like source and ID, facilitating real-time search indexing.
Within the Apache ecosystem, Storm complements Hadoop by writing intermediate results to HDFS, enabling hybrid batch-stream processing where Storm handles real-time computation and Hadoop performs batch analytics on accumulated data. Flux simplifies topology management by defining topologies in YAML, allowing declarative specification of spouts, bolts, and configurations with property substitution for flexible deployment across environments.

Modern integrations position Storm for advanced analytics, such as piping outputs to Apache Spark for post-processing via shared intermediaries like Kafka or HDFS, where Storm enriches streams before Spark's MLlib applies models. For real-time analytics, Storm connects to Apache Druid by emitting events to Kafka, which Druid ingests via its indexing service for sub-second queries on streaming data.

Best practices recommend deploying Storm as the ETL layer in Lambda architectures, where it serves as the speed layer for low-latency transformations, complementing a batch layer (e.g., Hadoop) that recomputes views from immutable data to ensure accuracy. This approach leverages Storm's scalability and fault tolerance for continuous ETL pipelines handling unbounded streams.

Comparisons and Alternatives

Similar Stream Processing Systems

Apache Flink is an open-source distributed framework that unifies batch and stream processing paradigms, providing advanced stateful processing capabilities through its state backend and checkpointing mechanisms. It offers exactly-once guarantees natively, whereas core Storm defaults to at-least-once semantics and relies on Trident for exactly-once, which makes Flink a more reliable fit for complex, stateful stream jobs. In benchmarks for intricate topologies involving windowing and joins, Flink has demonstrated lower end-to-end latency compared to Storm due to its optimized runtime. Like Storm, Flink supports horizontal scalability by distributing tasks across clusters and ensures fault tolerance via asynchronous, incremental checkpoints that allow quick recovery from failures without replaying the entire stream.

Apache Spark Streaming extends the Spark ecosystem to handle streaming data through a micro-batch processing model, where incoming data is aggregated into small batches for processing at regular intervals, typically every few seconds. This approach simplifies development for users already familiar with Spark's batch APIs but introduces higher latency than Storm's tuple-at-a-time streaming, making it suitable for applications that tolerate slight delays in exchange for unified batch-stream workflows. Spark Streaming inherits Spark's scalability features, processing data across distributed nodes, and provides fault tolerance through RDD lineage for recomputation of lost batches, aligning with Storm's emphasis on resilient, large-scale stream handling.

Apache Kafka Streams is a lightweight, client-side library embedded within Kafka applications for building and running stream processing pipelines directly on Kafka topics, eliminating the need for a separate processing cluster. It excels in simple, Kafka-centric tasks like filtering, aggregations, and joins, offering lower operational overhead than Storm for basic real-time processing without the complexity of full topologies.
Kafka Streams shares Storm's focus on fault tolerance, using Kafka's durable logs to provide exactly-once semantics in stateful operations, and scales by distributing stream tasks across application instances, supporting unbounded data flows efficiently.

Other notable systems include Apache Samza, a YARN-integrated stream processor tightly coupled with Kafka for processing large-scale, stateful streams in a coordinated manner, emphasizing high throughput in enterprise environments. Apache Beam provides a portable, unified programming model for defining batch and streaming pipelines that can execute on various runners, such as Apache Flink or Apache Spark, promoting code reusability across engines while handling unbounded datasets with windowing and state support. These systems, including Storm, commonly process unbounded data streams in real-time or near-real-time fashion, prioritizing horizontal scalability to manage growing data volumes and robust fault tolerance mechanisms to ensure continuity during node failures or network issues.

Distinctions from Batch Processing Frameworks

Apache Storm fundamentally differs from batch processing frameworks like Hadoop MapReduce in its processing paradigm, focusing on continuous, real-time computation over unbounded streams of data that arrive indefinitely, whereas batch systems handle finite, bounded datasets submitted as jobs for offline analysis. This stream-oriented approach allows Storm to ingest and process events as they occur, enabling applications to react to incoming data without waiting for a complete dataset to accumulate. In contrast, Hadoop MapReduce organizes computations around map and reduce phases applied to static datasets stored in distributed file systems, prioritizing fault-tolerant, disk-based operations over immediacy.

A key distinction lies in latency trade-offs: Storm achieves sub-second processing latencies and can handle over one million tuples per second per node, making it suitable for time-sensitive use cases like fraud detection alerts or live monitoring systems. Batch frameworks, however, introduce delays ranging from minutes to hours due to job scheduling, data shuffling, and I/O operations, which align better with non-urgent tasks such as periodic reporting or historical trend analysis. These differences stem from Storm's in-memory, distributed execution versus the disk-persistent, job-based model of batch systems.

Storm's data model supports unbounded, high-velocity streams that reflect real-world data generation patterns, such as sensor feeds or user interactions, allowing it to manage both volume and velocity on the fly without predefined endpoints. Batch processing, by design, assumes bounded datasets with known sizes and structures, facilitating optimizations for large-scale but static computations. This enables Storm to address scenarios where data freshness is paramount, while batch frameworks provide robust handling for exhaustive, retrospective analysis.
In hybrid architectures, Storm often complements batch tools. For instance, in the Lambda architecture, originated by Storm's creator Nathan Marz, Storm powers the speed layer for real-time views, while Hadoop handles the batch layer for comprehensive historical recomputation, and a serving layer merges the results to answer queries. Similarly, the Kappa architecture leverages stream processors like Storm for both real-time handling and replaying historical data from logs, minimizing the need for separate batch pipelines. Storm is particularly advantageous for time-critical applications demanding immediate insights and responsiveness, such as real-time analytics or event-driven systems, whereas batch frameworks like Hadoop are preferred for cost-efficient, large-scale offline processing where latency is not a constraint.
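The division of labor among the three Lambda layers can be sketched in a few lines of Python. This is an illustrative, single-process model under stated assumptions: the class `LambdaPageViews` and its methods are hypothetical names, an in-memory list stands in for the master dataset that Hadoop would process, and a counter stands in for the real-time view that a Storm topology would maintain.

```python
from collections import Counter
from typing import List


class LambdaPageViews:
    """Toy page-view counter organized as batch, speed, and serving layers."""

    def __init__(self) -> None:
        self.master_log: List[str] = []       # append-only log of raw events
        self.batch_view: Counter = Counter()  # recomputed from the full log
        self.realtime_view: Counter = Counter()  # increments since last batch run

    def ingest(self, page: str) -> None:
        self.master_log.append(page)   # stored for later batch recomputation
        self.realtime_view[page] += 1  # speed layer: incremental update

    def run_batch(self) -> None:
        # Batch layer: recompute the view from scratch over all history.
        self.batch_view = Counter(self.master_log)
        # Events now covered by the batch view are dropped from the speed layer.
        self.realtime_view.clear()

    def query(self, page: str) -> int:
        # Serving layer: merge the batch result with recent real-time increments.
        return self.batch_view[page] + self.realtime_view[page]


views = LambdaPageViews()
for page in ["home", "home", "about"]:
    views.ingest(page)
print(views.query("home"))  # 2, served entirely from the real-time view
views.run_batch()
views.ingest("home")
print(views.query("home"))  # 3, batch view (2) plus real-time increment (1)
```

Clearing the real-time view right after the batch run is safe here only because everything happens in one thread; a distributed deployment would need to track which events each batch run has covered.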
