Apache Flink
Apache Flink is an open-source framework and distributed processing engine for stateful computations over unbounded and bounded data streams.[1] Originating from the Stratosphere research project at the Technical University of Berlin in 2009, Flink entered the Apache Incubator in April 2014 and graduated to a top-level Apache project in December 2014.[2][3][4] The project achieved its first stable release, version 1.0, in March 2016, marking a milestone in unified stream and batch processing capabilities.[5] In March 2025, Flink released version 2.0, with the 2.1 series following in July 2025 and 2.1.1 as the latest stable release in November 2025, further advancing real-time data processing with enhanced support for AI integrations and scalable stream analytics.[6][7]

Flink's architecture features a scale-out design with incremental checkpoints, enabling high-availability deployments and efficient handling of large-scale state.[1] Key strengths include exactly-once state consistency, event-time processing, and support for late-arriving data, ensuring reliable computations in distributed environments.[1] It provides multiple APIs, such as the DataStream API for low-level stream processing, the Table API and SQL for declarative queries on streams and batches, and the ProcessFunction for fine-grained control over state and timers.[1] Performance is optimized through low-latency, high-throughput execution via in-memory computing and flexible windowing mechanisms.[1]

Common use cases for Flink encompass event-driven applications that react to real-time event streams, stream and batch analytics for unified querying, and data pipelines for ETL processes involving data conversion and movement across systems.[1] It integrates seamlessly with ecosystems like Apache Kafka for ingestion, Hadoop for batch storage, and cloud platforms for deployment, powering mission-critical applications in industries such as finance, e-commerce, and telecommunications.[1] Flink's community-driven development under the Apache Software Foundation continues to evolve, with recent enhancements focusing on AI/ML pipelines and Kubernetes-native operations.[1]

Introduction
Overview
Apache Flink is an open-source, distributed processing engine designed for stateful computations over unbounded and bounded data streams.[1] It provides a unified architecture that treats batch processing as a special case of streaming, allowing developers to build applications that handle both real-time and historical data with the same codebase and runtime. This stream-first design enables low-latency, high-throughput processing suitable for modern data-intensive applications.[1] At its core, Flink adheres to principles such as exactly-once semantics, which guarantee that each input event affects the final output exactly once, even in the presence of failures.[8] It scales horizontally to thousands of cores, managing very large state through features like incremental checkpoints.[9] Additionally, Flink supports event-time processing, which aligns computations with the timestamps of events rather than processing time, enabling accurate handling of out-of-order data.[1]

In comparison to predecessors like Hadoop MapReduce, which relies on disk-based batch processing and lacks native streaming support, Flink offers in-memory computation and true streaming capabilities for faster, more efficient real-time analytics. Relative to contemporaries like Apache Spark, Flink's stream-native architecture provides lower latency and better state management for continuous data flows, though Spark remains dominant for batch-heavy workloads.[10] Flink has seen widespread adoption for real-time analytics by companies including Alibaba, Netflix, and Uber.[11]

Key Features
Apache Flink provides exactly-once processing guarantees, ensuring that each input event is processed precisely once even in the presence of failures, through its checkpointing mechanism that captures consistent snapshots of application state.[9] Savepoints extend this capability by allowing manual, consistent snapshots for application upgrades, scaling, or migration without data loss.[12] Flink natively supports both event-time and processing-time semantics for streaming data, enabling accurate handling of out-of-order events based on their occurrence timestamps rather than arrival times, which is essential for reliable analytics in real-world scenarios like sensor data or financial transactions.[9]

The framework offers flexible windowing mechanisms, including tumbling windows for non-overlapping fixed-size intervals, sliding windows for overlapping periods with specified slide durations, and session windows for dynamic grouping based on inactivity gaps.[13] These can be customized with triggers to control evaluation timing and evictors to remove elements before computation, allowing tailored aggregation over unbounded streams.[13]

Flink integrates machine learning through the FlinkML library, which enables scalable implementations of algorithms such as alternating least squares for matrix factorization in recommendation systems.[14][15] The Complex Event Processing (CEP) library supports pattern detection in event streams, allowing users to define sequences of events with conditions and temporal constraints for applications like fraud detection or monitoring.[16]

In terms of performance, Flink achieves sub-second latencies for real-time processing, as demonstrated in benchmarks like Nexmark, while supporting horizontal scalability across clusters to handle terabyte-scale state.[6][17] It also features built-in backpressure handling to prevent overload by dynamically adjusting data flow rates between operators.[18]

Architecture
Distributed Runtime
Apache Flink's distributed runtime forms the core execution engine that enables scalable processing of streaming and batch data across clusters of machines. It orchestrates the deployment and execution of user-defined jobs by transforming high-level application logic into a distributed dataflow graph, which is then optimized and executed in a fault-tolerant manner. The runtime supports both bounded and unbounded data processing, ensuring low-latency and high-throughput performance through efficient resource utilization and data locality.[19]

The runtime is composed of two primary process types: the JobManager and TaskManagers. The JobManager serves as the central coordinator, responsible for scheduling tasks, managing resources, and overseeing job execution. It includes subcomponents such as the ResourceManager for allocating slots and containers, the Dispatcher for handling job submissions via REST or Web UI, and the JobMaster for managing individual JobGraphs by translating them into execution plans and distributing tasks. In high-availability setups, multiple JobManagers operate with one active leader and others in standby mode to ensure continuous operation. TaskManagers, on the other hand, execute the actual dataflow tasks assigned by the JobManager. Each TaskManager runs one or more task slots, which are fixed-size resource units that isolate tasks and manage their memory allocation, buffering input/output streams and exchanging data between subtasks.[19]

Flink's pipeline execution model optimizes the dataflow graph through techniques like operator chaining and fusion to minimize overhead and latency. Upon job submission, the JobManager generates an execution graph from the logical JobGraph by applying optimizations such as fusing compatible operators (e.g., consecutive map or filter operations) into single tasks. This chaining allows multiple operators to run in the same thread on a TaskManager, reducing serialization, deserialization, and network shuffling costs while improving data locality. Chaining is enabled by default but can be controlled programmatically—for instance, using startNewChain() to break chains at specific points or disableChaining() for individual operators—to balance performance and debugging needs.[20]
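The following sketch illustrates how chaining can be influenced from the DataStream API using a placeholder pipeline of map and filter operators; it is an illustrative example rather than a recommended configuration.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.fromElements("1", "2", "3")
    .map(value -> Integer.parseInt(value))   // chained with the source by default
    .filter(v -> v > 1).startNewChain()      // the filter starts a new chain, breaking the link to the map
    .map(v -> v * 10).disableChaining()      // this map always runs in its own task
    .print();

env.execute("chaining-example");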
Deployment of Flink jobs occurs in various modes to support different environments and resource managers. In standalone mode, Flink runs directly on a cluster of machines without an external orchestrator, where the user starts the JobManager and TaskManagers manually or via scripts for simple, on-premises setups. For Hadoop ecosystems, YARN mode integrates with YARN's ResourceManager, allowing dynamic allocation of containers for JobManager and TaskManagers upon job submission. Kubernetes mode deploys Flink as containerized applications, leveraging Kubernetes for pod orchestration, scaling, and service discovery, which is ideal for cloud-native infrastructures. Additionally, cloud-native options from providers like Amazon EMR or Alibaba Cloud Realtime Compute offer managed deployments that handle infrastructure provisioning and integration with cloud services. In all modes, a client submits the JobGraph to the JobManager, which then provisions TaskManagers as needed for execution.[21]
Resource allocation in the runtime relies on a slot-based model to control parallelism and scaling. Each TaskManager exposes configurable task slots, the smallest allocatable unit, where each slot dedicates a portion of the TaskManager's resources (primarily memory) to one or more subtasks, enabling concurrent execution. Parallelism is set at the job or operator level, and slot sharing is enabled by default to maximize utilization: a single slot may hold one parallel subtask of each operator in the job, so one slot can run an entire pipeline of the application. Dynamic scaling is facilitated by the Adaptive Scheduler, which adjusts job parallelism reactively based on available slots—scaling up when TaskManagers are added or down when slots are lost—without manual intervention. This is particularly useful in elastic environments like Kubernetes, where resources can be redeclared at runtime via APIs to respond to workload variations.[19][22]
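As a rough illustration of how parallelism interacts with slots, the sketch below sets a job-level parallelism and overrides it for one operator; the slot count itself is normally configured on the TaskManager side (for example via the taskmanager.numberOfTaskSlots and parallelism.default options), and the numbers here are arbitrary.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(8);                 // job-level default: eight parallel subtasks per operator

env.fromElements(1, 2, 3, 4)
    .map(v -> v * 2)
    .name("double")
    .setParallelism(2)                 // operator-level override: only two subtasks for this map
    .print();

env.execute("parallelism-example");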
To handle varying data rates and prevent job failures from overload, Flink implements a credit-based backpressure mechanism at the task level. When a downstream task cannot consume data as quickly as an upstream task produces it—due to bottlenecks like slow sinks or complex computations—the upstream task experiences backpressure through depleted output buffers. This triggers a flow control signal that propagates upstream, slowing input rates (e.g., throttling sources) without dropping records or crashing the job. Buffering in TaskManagers absorbs temporary spikes, while metrics such as backPressuredTimeMsPerSecond (tracking time spent waiting on buffers) allow monitoring and diagnosis, with status levels from OK (under 10% backpressure) to HIGH (over 50%). This ensures stable execution even under imbalanced loads.[23]
State Management and Fault Tolerance
Apache Flink's state management enables the persistence and consistent handling of application state across distributed tasks, crucial for maintaining computational progress in long-running stream processing jobs. State in Flink represents the data maintained by operators, such as accumulators in window functions or key-value mappings in keyed streams, and is stored locally on TaskManagers unless configured otherwise. This local storage facilitates fast access during normal operation, while fault tolerance mechanisms ensure state recovery upon failures without data loss or duplication.

Flink provides several state backends to manage how state is stored and checkpointed, balancing performance, scalability, and persistence needs. The HeapStateBackend (also known as HashMapStateBackend) stores state as objects directly in the Java heap memory of TaskManagers, offering high-speed access suitable for applications with moderate state sizes or when high availability is ensured through replication. However, it is limited by available memory and does not persist state to disk by default, making it vulnerable to TaskManager failures without checkpoints. In contrast, the RocksDBStateBackend (EmbeddedRocksDBStateBackend) serializes state into RocksDB, a key-value store embedded in each TaskManager, writing data to local disk directories for persistence. This backend excels in handling very large states—potentially terabytes—by spilling to disk when memory is constrained, though it incurs overhead from serialization and I/O operations. For global state storage, Flink supports filesystem options, where checkpoints are written to remote durable storage like HDFS or S3; the experimental ForStStateBackend further enables disaggregated state by storing it asynchronously on remote filesystems, allowing unlimited state sizes and decoupling from local TaskManager resources. The choice of backend impacts checkpointing efficiency and recovery times, with RocksDB and ForSt supporting incremental updates to minimize data transfer.[24]

Checkpoints form the core of Flink's fault tolerance, capturing periodic, consistent snapshots of the entire application state and input stream positions to enable recovery as if no failure occurred. Enabled via configuration (e.g., env.enableCheckpointing(1000) for one-second intervals), checkpoints use a snapshot-and-restore mechanism where the JobManager coordinates the process across all tasks. This involves streaming barriers—special markers injected into data streams—that trigger state serialization without halting processing, ensuring a globally consistent view. The protocol employs a two-phase commit approach: in the first phase, tasks write state snapshots to durable storage and acknowledge; in the second phase, upon all acknowledgments, the JobManager finalizes the checkpoint, guaranteeing atomicity. This design supports exactly-once semantics internally by pre-aggregating state changes before barriers and replaying from the last checkpoint upon failure, restoring operators to their exact prior state.[25]
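A minimal configuration sketch along these lines is shown below; it assumes the RocksDB state backend dependency is on the classpath, the storage path is a placeholder, and the exact class and package names vary somewhat between Flink releases.

import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Trigger a checkpoint every second.
env.enableCheckpointing(1000);

// Keep working state in embedded RocksDB; `true` enables incremental checkpoints.
env.setStateBackend(new EmbeddedRocksDBStateBackend(true));

// Write checkpoint data to durable remote storage (placeholder path).
env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/flink/checkpoints");

// Optional tuning: bound checkpoint duration and enforce a pause between checkpoints.
env.getCheckpointConfig().setCheckpointTimeout(10 * 60 * 1000);
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500);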
For end-to-end consistency with external systems, Flink extends exactly-once guarantees through transactional integrations, such as with Apache Kafka via idempotent producers or two-phase commit sinks that coordinate with external transaction managers. For instance, when Kafka serves as both source and sink, Flink's checkpoint-aligned offsets ensure that upon recovery, messages are neither skipped nor duplicated across the pipeline. These guarantees hold for stateful operators, where pre-aggregated results (e.g., window sums) are committed only after barrier alignment, preventing partial updates.
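For example, an exactly-once Kafka sink can be declared roughly as follows using the KafkaSink builder from the Kafka connector; the broker address, topic, and transactional-id prefix are placeholders, and resultStream stands for an existing DataStream<String>.

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;

KafkaSink<String> sink = KafkaSink.<String>builder()
    .setBootstrapServers("broker:9092")
    .setRecordSerializer(
        KafkaRecordSerializationSchema.builder()
            .setTopic("results")
            .setValueSerializationSchema(new SimpleStringSchema())
            .build())
    .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)  // two-phase commit tied to checkpoints
    .setTransactionalIdPrefix("flink-results")             // required for transactional writes
    .build();

resultStream.sinkTo(sink);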
Savepoints build on checkpoints as manually triggered, portable snapshots that allow pausing, resuming, or migrating jobs across clusters or versions. Unlike automatic checkpoints, savepoints are initiated via commands (e.g., flink savepoint <job-id> <path>), storing state in a version-agnostic format compatible with future Flink releases, facilitating upgrades without downtime. They are stored in configurable directories on durable filesystems, enabling blueprinting of production jobs for development or scaling.
To optimize for large-state applications, Flink supports incremental checkpoints, which capture only changes (deltas) since the previous checkpoint rather than full snapshots. Enabled by setting execution.checkpointing.incremental: true, this feature—available in RocksDB and ForSt backends—significantly reduces checkpointing time and storage overhead; for example, in scenarios with gigabytes of state, incremental mode can cut completion times from hours to minutes by avoiding redundant serialization of unchanged data. The backend maintains versioned logs of modifications, merging them periodically to bound storage growth. This efficiency is vital for real-time applications processing high-volume streams, ensuring fault tolerance without compromising latency.[26]
Programming Model
DataStream API
The DataStream API serves as the foundational programming interface in Apache Flink for developing scalable stream and batch processing applications, enabling developers to ingest, transform, and output continuous or finite data streams in a distributed environment.[27] It supports both unbounded streams, which represent potentially infinite data flows like real-time sensor inputs, and bounded streams for finite datasets, allowing unified processing logic across streaming and batch workloads.[27] This API is implemented in Java and Python, with transformations applied lazily to form a directed acyclic graph (DAG) that Flink's runtime optimizes and executes. In Flink 2.0, an experimental DataStream API V2 was introduced to enhance the original API with improved ergonomics and performance.[28]

At its core, the DataStream abstraction encapsulates immutable collections of elements of the same type, sourced from external systems and manipulated through a rich set of operators.[27] A DataStream can be created from various inputs, such as files or message queues, and undergoes transformations that produce new DataStreams without altering the original.[27] Key transformations include map, which applies a one-to-one function to each element—for instance, converting a string to an integer via input.map(value -> Integer.parseInt(value))[29]—and filter, which selectively retains elements based on a predicate, such as excluding zero values with data.filter(value -> value != 0).[29] For aggregation, keyBy logically partitions the stream by a key selector, enabling keyed stateful operations like reduce, which iteratively combines elements within a key group, e.g., summing integers through keyedStream.reduce((a, b) -> a + b).[29]
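A compact, self-contained example in this style—here a running word count over an in-memory bounded input, with class and job names chosen arbitrarily—might look as follows.

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class WordCountExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("flink", "stream", "state", "flink")
            .map(word -> Tuple2.of(word, 1))
            .returns(Types.TUPLE(Types.STRING, Types.INT))   // type hint needed for the generic lambda
            .filter(t -> !t.f0.isEmpty())
            .keyBy(t -> t.f0)                                // partition by word
            .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1))  // running count per word
            .print();

        env.execute("wordcount-example");
    }
}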
Flink's handling of time is crucial for accurate stream processing, distinguishing between processing time, which relies on the system's clock during computation and is sensitive to delays; ingestion time, assigned when data arrives at the source operator; and event time, derived from timestamps embedded in the data records themselves to reflect real-world occurrence order.[30] Event time is preferred for out-of-order arrivals common in distributed systems, managed via watermarks—special records that indicate the latest observed event time and bound lateness, allowing Flink to close windows and discard excessively delayed elements.[30] Developers assign timestamps and watermarks using the TimestampAssigner and WatermarkGenerator interfaces, often with strategies like bounded-out-of-orderness to tolerate minor delays.[30]
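As an illustration, a bounded-out-of-orderness strategy with a five-second tolerance could be attached roughly as follows; SensorReading and its getTimestampMillis() accessor are hypothetical, and events stands for an existing DataStream<SensorReading>.

import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;

DataStream<SensorReading> withEventTime = events.assignTimestampsAndWatermarks(
    WatermarkStrategy
        .<SensorReading>forBoundedOutOfOrderness(Duration.ofSeconds(5))  // tolerate 5 s of disorder
        .withTimestampAssigner((reading, recordTimestamp) -> reading.getTimestampMillis()));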
Windowing in the DataStream API groups unbounded streams into finite subsets for aggregation, supporting count-based windows that trigger after a fixed number of elements, such as every 100 tuples, and time-based windows aligned to event, ingestion, or processing time.[31] Common types include tumbling windows for non-overlapping intervals (e.g., 5-minute slots via TumblingEventTimeWindows.of(Time.minutes(5))) and sliding windows that overlap by a specified duration.[31] Applied on keyed streams, windows use functions like reduce for incremental updates or aggregate for custom logic, such as computing averages with a user-defined aggregator.[31] Late data, arriving after a watermark passes a window's end, can be handled by allowing configurable lateness or redirecting to side outputs, ensuring robustness in asynchronous environments.[31]
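Building on the previous sketch, a keyed tumbling-window aggregation might be written roughly as follows; the sensor field names are hypothetical, withEventTime is the stream from the sketch above, and depending on the Flink release the window size is expressed with the windowing Time class (as here) or with java.time.Duration.

import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// Maximum reading per sensor over non-overlapping 5-minute event-time windows.
withEventTime
    .keyBy(r -> r.getSensorId())
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .reduce((a, b) -> a.getValue() > b.getValue() ? a : b)
    .print();

// The same aggregation over 10-minute windows that slide every minute.
withEventTime
    .keyBy(r -> r.getSensorId())
    .window(SlidingEventTimeWindows.of(Time.minutes(10), Time.minutes(1)))
    .reduce((a, b) -> a.getValue() > b.getValue() ? a : b);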
For advanced patterns, side outputs enable operators to emit secondary streams of data that do not fit the primary result type, such as tagging anomalous events with OutputTag for separate processing.[32] Broadcasting complements this by distributing a control stream to all parallel instances of a downstream operator, facilitating dynamic rule updates in keyed computations without full repartitioning.[29] These mechanisms support complex, stateful applications while integrating with Flink's fault-tolerant state backend for consistency.[32]
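A sketch of the side-output pattern is shown below, routing hypothetical high-value readings to a secondary stream; the threshold and type names are illustrative, and withEventTime again stands for an existing DataStream<SensorReading>.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

final OutputTag<SensorReading> anomalies = new OutputTag<SensorReading>("anomalies") {};

SingleOutputStreamOperator<SensorReading> mainStream = withEventTime.process(
    new ProcessFunction<SensorReading, SensorReading>() {
        @Override
        public void processElement(SensorReading r, Context ctx, Collector<SensorReading> out) {
            if (r.getValue() > 100.0) {
                ctx.output(anomalies, r);   // secondary stream for anomalous readings
            } else {
                out.collect(r);             // primary result stream
            }
        }
    });

DataStream<SensorReading> anomalyStream = mainStream.getSideOutput(anomalies);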
DataStream applications connect to external systems via connectors for sources and sinks, with built-in support for reading from files (e.g., using FileSource for continuous monitoring) or sockets, and writing to similar targets.[33] Third-party connectors, such as those for Apache Kafka, allow scalable ingestion from topics using KafkaSource and output via KafkaSink, configured with serialization schemas for seamless integration into event-driven architectures.[34]
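For instance, a Kafka source can be wired in roughly as follows; the broker address, topic, and consumer group are placeholders, and env is an existing StreamExecutionEnvironment.

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;

KafkaSource<String> source = KafkaSource.<String>builder()
    .setBootstrapServers("broker:9092")
    .setTopics("events")
    .setGroupId("flink-consumer")
    .setStartingOffsets(OffsetsInitializer.earliest())
    .setValueOnlyDeserializer(new SimpleStringSchema())
    .build();

DataStream<String> events = env.fromSource(
    source, WatermarkStrategy.noWatermarks(), "kafka-events");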
Table API and SQL
The Table API and SQL in Apache Flink provide declarative interfaces for relational data processing, enabling unified handling of both streaming and batch data through a high-level abstraction over the underlying DataStream API.[35] These APIs allow users to express queries using familiar relational concepts, such as tables and joins, while leveraging Flink's distributed runtime for scalable execution. The Table API offers a fluent, expression-based approach in languages like Java, Scala, and Python, whereas SQL provides a standardized query language for more direct declarative programming. Both integrate seamlessly to support complex analytics workloads, from real-time aggregations to historical batch computations.[35]

The Table API is a language-integrated query builder that enables the creation of tables from input streams or batches using methods like TableEnvironment.from() to reference tables registered in a catalog, or StreamTableEnvironment.fromDataStream() to convert DataStream objects into relational tables without altering the underlying data flow.[36] It supports a range of relational operations, including selections for projecting and computing fields, joins for combining tables based on conditions, and groupBy for aggregations over grouped data. For instance, in Java, a simple aggregation might be expressed as orders.groupBy($("category")).select($("category"), $("price").avg().as("avgPrice")), where $ denotes column references, demonstrating the API's concise syntax with IDE-friendly autocompletion and type safety.[36] This fluent style abstracts away low-level stream manipulations, focusing on relational logic while maintaining unified semantics for bounded (batch) and unbounded (streaming) inputs.[36]
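A small end-to-end Table API sketch in this style is shown below; it assumes a table named Orders with category and price columns is already registered in the catalog.

import static org.apache.flink.table.api.Expressions.$;

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;

TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

// Read the registered Orders table and compute the average price per category.
Table orders = tEnv.from("Orders");

Table avgPrices = orders
    .groupBy($("category"))
    .select($("category"), $("price").avg().as("avgPrice"));

avgPrices.execute().print();   // prints the continuously updating result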
Flink's SQL integration, built on Apache Calcite, offers compliance with ANSI SQL:2011 standards, supporting core Data Definition Language (DDL) for creating tables and views, Data Manipulation Language (DML) for inserts and updates, and standard query constructs like SELECT for filtering and projecting data.[37] It extends SQL for streaming scenarios through dynamic tables, which represent changing data sources as append-only or updatable streams, allowing queries to produce continuously updating results.[38] Temporal joins further enhance streaming capabilities by correlating rows based on event time or processing time against versioned tables, such as enriching transaction data with historical exchange rates at specific timestamps via syntax like SELECT * FROM transactions AS t JOIN rates FOR SYSTEM_TIME AS OF t.proctime AS r ON t.currency = r.currency.[39] These features enable expressive queries over time-sensitive data without manual state management.
Catalog management in Flink unifies metadata handling across Table API and SQL, providing a persistent namespace for databases, tables, functions, and views.[40] The Hive Catalog integrates with Apache Hive Metastore for storing Flink metadata alongside existing Hive tables, supporting case-insensitive operations and seamless access to Hive-compatible data warehouses via configurations like HiveCatalog hive = new HiveCatalog("myhive", "default", "/path/to/hive-conf"), where the configuration directory contains a hive-site.xml pointing at the metastore (for example, thrift://hive-metastore:9083).[41] Similarly, the JDBC Catalog connects to relational databases such as PostgreSQL or MySQL, automatically mapping schemas and enabling DDL operations like CREATE TABLE to propagate metadata changes. This abstraction allows users to reference external metadata sources directly in queries, facilitating hybrid stream-batch environments.
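Registering and using such a catalog from the Table API might look roughly like this; the catalog name, default database, and configuration directory are placeholders, and tEnv is the TableEnvironment from the earlier sketch.

import org.apache.flink.table.catalog.hive.HiveCatalog;

HiveCatalog hive = new HiveCatalog("myhive", "default", "/opt/hive-conf");
tEnv.registerCatalog("myhive", hive);
tEnv.useCatalog("myhive");
tEnv.useDatabase("default");

// Tables from the Hive metastore can now be referenced directly in queries.
tEnv.executeSql("SHOW TABLES").print();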
Continuous queries in Flink transform batch processing into streaming by treating finite datasets as changelog streams, where operations like aggregations produce ongoing updates maintained as materialized views.[38] For example, a windowed aggregation over batch input can be expressed as a continuous query using TUMBLE functions in SQL, yielding incremental results that evolve with new data appends, thus bridging traditional batch jobs with real-time requirements through eager view maintenance.[38]
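Expressed through the SQL interface, such a windowed continuous query might look like the following, using the TUMBLE table-valued function; the Orders table, its order_time time attribute, and the amount column are assumed to exist, and tEnv is an existing TableEnvironment.

tEnv.executeSql(
    "SELECT window_start, window_end, SUM(amount) AS total_amount " +
    "FROM TABLE(" +
    "  TUMBLE(TABLE Orders, DESCRIPTOR(order_time), INTERVAL '5' MINUTES)) " +
    "GROUP BY window_start, window_end").print();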
Flink's query optimizer, powered by Apache Calcite, combines rule-based rewrites—such as subquery decorrelation, filter push-down, and join reordering—with cost-based optimization that evaluates execution plans using statistics on data volume, cardinality, and resource costs like CPU and I/O.[42] This dual approach ensures efficient plans; for instance, rule-based pruning eliminates unnecessary projections early, while cost-based decisions select optimal join strategies based on table sizes, configurable via options like table.optimizer.join-reorder-enabled=true.[42] The optimizer applies these transformations transparently, enhancing performance for both simple filters and complex multi-table queries without user intervention.[42]
Batch and Extension APIs
The DataSet API, Flink's legacy imperative interface for batch processing, had been marked as legacy and soft-deprecated since Flink 1.12 and was officially deprecated in version 1.18, as the framework unified stream and batch processing under the DataStream API and Table API/SQL.[44][45] It was removed in Apache Flink 2.0 (released March 2025), and batch workloads are now handled through the unified DataStream API in batch execution mode or the Table API and SQL.[6][43] The Apache Beam Flink Runner enables the execution of portable Beam pipelines on the Flink runtime, translating Beam's PTransforms and DoFns into Flink jobs for both streaming and batch processing.[46] Available in classic (Java-only) and portable (supporting Java, Python, Go) flavors, it leverages Flink's capabilities for high-throughput, low-latency execution with exactly-once semantics, deployable on clusters like YARN or Kubernetes.[46] For batch scenarios, it processes bounded inputs efficiently, with capabilities detailed in Beam's runner matrices.[47]
FlinkML extends Flink for machine learning pipelines, offering Table API-based components like Estimator for training models on batch datasets and Transformer for feature engineering and predictions.[48] It supports batch workflows, such as fitting models on tabular data via methods like fit(), and includes algorithms for tasks like clustering and regression.[48] Similarly, the Complex Event Processing (CEP) library detects patterns in event sequences using a declarative Pattern API, applicable to batch data when executed via the DataStream API in batch mode for finite datasets.[49]
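A CEP pattern along these lines—three consecutive failed logins within ten seconds—might be declared roughly as follows; LoginEvent, isFailure(), getUserId(), and the logins stream are hypothetical, and newer releases also accept java.time.Duration in within().

import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternStream;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.windowing.time.Time;

Pattern<LoginEvent, ?> pattern = Pattern.<LoginEvent>begin("failures")
    .where(new SimpleCondition<LoginEvent>() {
        @Override
        public boolean filter(LoginEvent e) {
            return e.isFailure();
        }
    })
    .times(3).consecutive()          // exactly three matching events in a row
    .within(Time.seconds(10));       // temporal constraint on the whole sequence

// Matched sequences are then extracted with select() or process() on the PatternStream.
PatternStream<LoginEvent> matches = CEP.pattern(logins.keyBy(e -> e.getUserId()), pattern);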
Migration from the DataSet API to unified APIs involves replacing ExecutionEnvironment with StreamExecutionEnvironment set to BATCH mode for DataStream equivalents, or adopting the Table API/SQL for relational batch queries.[50] Direct mappings exist for transformations like map and filter, while operations like join may require windowing adjustments; connectors for sources and sinks remain compatible with minimal changes.[50] For Table API migrations, users convert datasets to tables using as() methods and express logic declaratively, ensuring semantic equivalence for batch processing.[50]
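A minimal migration sketch, assuming a formerly DataSet-based grouping job, is shown below: the same DataStream program runs with batch scheduling simply by switching the runtime mode, and the bounded in-memory source makes the job terminate once the input is exhausted.

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.BATCH);   // batch scheduling and blocking shuffles

env.fromElements(3, 1, 4, 1, 5)
    .keyBy(v -> v % 2)            // group odd and even values
    .reduce(Integer::sum)         // aggregate within each group; emitted once at end of input
    .print();

env.execute("batch-mode-example");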
History
Origins and Early Development
The Stratosphere project originated in 2009 as a collaborative research initiative led by the Technical University of Berlin (TU Berlin) and Humboldt University of Berlin, with the goal of advancing distributed data processing systems for large-scale analytics.[4] Funded initially by the German Research Foundation (DFG), the effort brought together academic researchers to explore innovative architectures beyond existing paradigms.[51] Key contributors included Stephan Ewen and Kostas Tzoumas, who served as lead developers from TU Berlin, alongside Volker Markl, Alexander Alexandrov, and others from the involved institutions, with growing involvement from industry partners over time.[52]

The core motivation for Stratosphere stemmed from the recognized shortcomings of Apache Hadoop's MapReduce model, which excelled at simple batch jobs but struggled with iterative algorithms essential for machine learning, graph processing, and optimization tasks due to its acyclic execution and repeated data reloading overhead. To overcome these, the project introduced the PACT (Parallelization Contracts) programming model, a generalization of MapReduce that supported higher-order functions and delta iterations for efficient incremental updates in loops. Additionally, Stratosphere emphasized pipelined dataflow execution to enable low-latency processing, laying groundwork for unified handling of both batch and streaming workloads, though initial emphasis was on batch-oriented analytics.

Early development progressed with the project's first open-source release, Stratosphere 0.1, in 2011, which centered on batch processing capabilities while incorporating prototype support for streaming through pipelined operators.[53] Subsequent versions built on this foundation, refining the runtime and APIs. In April 2014, the Stratosphere team donated the codebase to the Apache Software Foundation Incubator, renaming it Apache Flink to resolve trademark concerns and align with open-source governance.[54][4] This transition marked the shift from academic prototyping to a broader community-driven project, culminating in Flink's graduation to top-level Apache status in December 2014.[4]

Major Releases and Evolution
Apache Flink achieved top-level project status within the Apache Software Foundation in December 2014, marking a significant milestone in its maturation as an open-source framework.[55] This transition from the Apache Incubator enabled greater community governance and resource allocation for development. In March 2016, Flink released version 1.0, which introduced stable APIs with guaranteed backward compatibility across the 1.x series, facilitating reliable production deployments and encouraging wider adoption by ensuring application portability.[5]

A pivotal architectural shift occurred with Flink 1.12 in December 2020, which unified batch and stream processing under a single runtime. This release deprecated the legacy DataSet API in favor of the DataStream API in batch mode, simplifying development and operations by eliminating the need for separate batch and streaming pipelines.[56] Subsequent releases built on this foundation with targeted enhancements. Flink 1.18, released in October 2023, improved PyFlink support through better user-defined function (UDF) integration and Python packaging, making it easier for Python developers to build scalable streaming applications.[57] Flink 1.20, announced in August 2024, advanced Change Data Capture (CDC) capabilities with refined Debezium connector support and schema evolution handling, streamlining real-time data synchronization from databases.[58]

The transition to the 2.x series accelerated Flink's evolution toward modern infrastructures. Flink 2.0, released on March 24, 2025, introduced AI integrations such as dynamic model invocation in Flink CDC for real-time processing with external AI services like OpenAI, alongside Kubernetes-native features including disaggregated state management on distributed file systems for enhanced scalability in cloud environments.[6] Flink 2.1, launched on July 31, 2025, further upgraded real-time AI capabilities with AI Model DDL for SQL-based model management and realtime AI functions, enabling seamless embedding of machine learning workflows into streaming pipelines.[59]

Over these releases, Flink has trended toward cloud-native architectures, exemplified by native Kubernetes integration and elastic scaling optimizations that support terabyte-scale state rescaling without downtime. SQL performance has seen iterative gains, such as adaptive batch execution in 2.0. Ecosystem expansions include Flink CDC 3.4 in May 2025, which added pipeline connectors for targets like Apache Iceberg with batch execution support, broadening integration options for data lakes and warehouses.[60] Subsequent minor releases, including Flink 2.1.1 on November 10, 2025, and Flink CDC 3.5.0 in September 2025, provided bug fixes and additional enhancements.[7][61] These advancements have driven Flink's adoption in real-time AI and streaming ETL scenarios, with surveys indicating faster decision-making through low-latency processing of event data for AI agents and ETL pipelines.[62]

Community and Ecosystem
Development Process
Apache Flink operates under the governance of the Apache Software Foundation, with a Project Management Committee (PMC) consisting of 55 members as of August 2025 who oversee the project's direction, release verification, and community decisions. The PMC, established in December 2014, uses consensus-driven processes for major decisions, facilitated by tools such as JIRA for issue tracking and reporting bugs or features.[63] Significant changes and enhancements are proposed and documented through Flink Improvement Proposals (FLIPs), which serve as a centralized mechanism to outline planned major developments while JIRA handles ongoing task tracking and resolution.[64]

Contributions to Flink follow structured guidelines to ensure code quality and maintainability, requiring developers to first create a JIRA ticket, discuss the approach on the dev mailing list to reach consensus, and obtain assignment from a committer before submitting pull requests.[65] Adherence to the project's code style is enforced via the Code Style and Quality Guide, with all changes needing to pass mvn clean verify builds, including unit and end-to-end tests; unrelated formatting alterations are discouraged to streamline reviews.[65] New contributors are supported through labeled "starter" issues in JIRA, which provide guided entry points for learning the codebase and participating effectively, fostering community involvement without a formal mentorship program.[65]
The project maintains a time-based release cadence of approximately three months for minor versions, enabling regular delivery of features and improvements, while major releases occur roughly annually to introduce significant architectural advancements.[66] Security vulnerabilities and critical bugs receive ad-hoc patch releases on dedicated branches, ensuring timely fixes without disrupting the main development cycle.[66]
Since entering the Apache incubator in 2014, Flink has grown to over 1,300 unique contributors worldwide, reflecting sustained activity and expansion.[67] This includes notable growth in Asia, driven by contributions from organizations like Alibaba, which have enhanced adoption and development in the region through events and code integrations.[68]
Collaboration occurs via mirrored repositories on GitHub for code hosting and pull requests, supplemented by Apache mailing lists for discussions and a dedicated Slack workspace for real-time community interactions.[69]