Apache Flink
Apache Flink is an open-source framework and distributed processing engine for stateful computations over unbounded and bounded data streams.[1] Originating from the Stratosphere research project at the Technical University of Berlin in 2009, Flink entered the Apache Incubator in April 2014 and graduated to a top-level Apache project in December 2014.[2][3][4] The project achieved its first stable release, version 1.0, in March 2016, marking a milestone in unified stream and batch processing capabilities.[5] In March 2025, Flink released version 2.0, with the 2.1 series following in July 2025 and 2.1.1 as the latest stable release in November 2025, further advancing real-time data processing with enhanced support for AI integrations and scalable stream analytics.[6][7]

Flink's architecture features a scale-out design with incremental checkpoints, enabling high-availability deployments and efficient handling of large-scale state.[1] Key strengths include exactly-once state consistency, event-time processing, and support for late-arriving data, ensuring reliable computations in distributed environments.[1] It provides multiple APIs, such as the DataStream API for low-level stream processing, the Table API and SQL for declarative queries on streams and batches, and the ProcessFunction for fine-grained control over state and timers.[1] Performance is optimized through low-latency, high-throughput execution via in-memory computing and flexible windowing mechanisms.[1]

Common use cases for Flink encompass event-driven applications that react to real-time event streams, stream and batch analytics for unified querying, and data pipelines for ETL processes involving data conversion and movement across systems.[1] It integrates seamlessly with ecosystems like Apache Kafka for ingestion, Hadoop for batch storage, and cloud platforms for deployment, powering mission-critical applications in industries such as finance, e-commerce, and telecommunications.[1] Flink's community-driven development under the Apache Software Foundation continues to evolve, with recent enhancements focusing on AI/ML pipelines and Kubernetes-native operations.[1]

Introduction
Overview
Apache Flink is an open-source, distributed processing engine designed for stateful computations over unbounded and bounded data streams.[1] It provides a unified architecture that treats batch processing as a special case of streaming, allowing developers to build applications that handle both real-time and historical data with the same codebase and runtime. This stream-first design enables low-latency, high-throughput processing suitable for modern data-intensive applications.[1] At its core, Flink adheres to principles such as exactly-once semantics, which guarantee that each input event affects the final output exactly once, even in the presence of failures.[8] It scales horizontally to thousands of cores, managing very large state through features like incremental checkpoints.[9] Additionally, Flink supports event-time processing, which aligns computations with the timestamps of events rather than processing time, enabling accurate handling of out-of-order data.[1]

In comparison to predecessors like Hadoop MapReduce, which relies on disk-based batch processing and lacks native streaming support, Flink offers in-memory computation and true streaming capabilities for faster, more efficient real-time analytics. Relative to contemporaries like Apache Spark, Flink's stream-native architecture provides lower latency and better state management for continuous data flows, though Spark remains dominant for batch-heavy workloads.[10] Flink has seen widespread adoption for real-time analytics by companies including Alibaba, Netflix, and Uber.[11]

Key Features
Apache Flink provides exactly-once processing guarantees, ensuring that each input event is processed precisely once even in the presence of failures, through its checkpointing mechanism that captures consistent snapshots of application state.[9] Savepoints extend this capability by allowing manual, consistent snapshots for application upgrades, scaling, or migration without data loss.[12] Flink natively supports both event-time and processing-time semantics for streaming data, enabling accurate handling of out-of-order events based on their occurrence timestamps rather than arrival times, which is essential for reliable analytics in real-world scenarios like sensor data or financial transactions.[9]

The framework offers flexible windowing mechanisms, including tumbling windows for non-overlapping fixed-size intervals, sliding windows for overlapping periods with specified slide durations, and session windows for dynamic grouping based on inactivity gaps.[13] These can be customized with triggers to control evaluation timing and evictors to remove elements before computation, allowing tailored aggregation over unbounded streams.[13]

Flink integrates machine learning through the FlinkML library, which enables scalable implementations of algorithms such as alternating least squares for matrix factorization in recommendation systems.[14][15] The Complex Event Processing (CEP) library supports pattern detection in event streams, allowing users to define sequences of events with conditions and temporal constraints for applications like fraud detection or monitoring.[16]

In terms of performance, Flink achieves sub-second latencies for real-time processing, as demonstrated in benchmarks like Nexmark, while supporting horizontal scalability across clusters to handle terabyte-scale state.[6][17] It also features built-in backpressure handling to prevent overload by dynamically adjusting data flow rates between operators.[18]

Architecture
Distributed Runtime
Apache Flink's distributed runtime forms the core execution engine that enables scalable processing of streaming and batch data across clusters of machines. It orchestrates the deployment and execution of user-defined jobs by transforming high-level application logic into a distributed dataflow graph, which is then optimized and executed in a fault-tolerant manner. The runtime supports both bounded and unbounded data processing, ensuring low-latency and high-throughput performance through efficient resource utilization and data locality.[19]

The runtime is composed of two primary process types: the JobManager and TaskManagers. The JobManager serves as the central coordinator, responsible for scheduling tasks, managing resources, and overseeing job execution. It includes subcomponents such as the ResourceManager for allocating slots and containers, the Dispatcher for handling job submissions via REST or Web UI, and the JobMaster for managing individual JobGraphs by translating them into execution plans and distributing tasks. In high-availability setups, multiple JobManagers operate with one active leader and others in standby mode to ensure continuous operation. TaskManagers, on the other hand, execute the actual dataflow tasks assigned by the JobManager. Each TaskManager runs one or more task slots, which are fixed-size resource units that isolate tasks and manage their memory allocation, buffering input/output streams and exchanging data between subtasks.[19]

Flink's pipeline execution model optimizes the dataflow graph through techniques like operator chaining and fusion to minimize overhead and latency. Upon job submission, the JobManager generates an execution graph from the logical JobGraph by applying optimizations such as fusing compatible operators (e.g., consecutive map or filter operations) into single tasks. This chaining allows multiple operators to run in the same thread on a TaskManager, reducing serialization, deserialization, and network shuffling costs while improving data locality. Chaining is enabled by default but can be controlled programmatically—for instance, using startNewChain() to break chains at specific points or disableChaining() for individual operators—to balance performance and debugging needs.[20]
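The following sketch illustrates how chaining can be influenced from the DataStream API using a placeholder pipeline of map and filter operators; it is an illustrative example rather than a recommended configuration.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.fromElements("1", "2", "3")
    .map(value -> Integer.parseInt(value))   // chained with the source by default
    .filter(v -> v > 1).startNewChain()      // the filter starts a new chain, breaking the link to the map
    .map(v -> v * 10).disableChaining()      // this map always runs in its own task
    .print();

env.execute("chaining-example");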
Deployment of Flink jobs occurs in various modes to support different environments and resource managers. In standalone mode, Flink runs directly on a cluster of machines without an external orchestrator, where the user starts the JobManager and TaskManagers manually or via scripts for simple, on-premises setups. For Hadoop ecosystems, YARN mode integrates with YARN's ResourceManager, allowing dynamic allocation of containers for JobManager and TaskManagers upon job submission. Kubernetes mode deploys Flink as containerized applications, leveraging Kubernetes for pod orchestration, scaling, and service discovery, which is ideal for cloud-native infrastructures. Additionally, cloud-native options from providers like Amazon EMR or Alibaba Cloud Realtime Compute offer managed deployments that handle infrastructure provisioning and integration with cloud services. In all modes, a client submits the JobGraph to the JobManager, which then provisions TaskManagers as needed for execution.[21]
Resource allocation in the runtime relies on a slot-based model to control parallelism and scaling. Each TaskManager exposes configurable task slots, the smallest allocatable unit, where each slot dedicates a portion of the TaskManager's resources (primarily memory) to one or more subtasks, enabling concurrent execution. Parallelism is set at the job or operator level, and slot sharing is enabled by default to maximize utilization: a single slot may hold one parallel subtask of each operator in the job, so one slot can run an entire pipeline of the application. Dynamic scaling is facilitated by the Adaptive Scheduler, which adjusts job parallelism reactively based on available slots—scaling up when TaskManagers are added or down when slots are lost—without manual intervention. This is particularly useful in elastic environments like Kubernetes, where resources can be redeclared at runtime via APIs to respond to workload variations.[19][22]
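As a rough illustration of how parallelism interacts with slots, the sketch below sets a job-level parallelism and overrides it for one operator; the slot count itself is normally configured on the TaskManager side (for example via the taskmanager.numberOfTaskSlots and parallelism.default options), and the numbers here are arbitrary.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(8);                 // job-level default: eight parallel subtasks per operator

env.fromElements(1, 2, 3, 4)
    .map(v -> v * 2)
    .name("double")
    .setParallelism(2)                 // operator-level override: only two subtasks for this map
    .print();

env.execute("parallelism-example");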
To handle varying data rates and prevent job failures from overload, Flink implements a credit-based backpressure mechanism at the task level. When a downstream task cannot consume data as quickly as an upstream task produces it—due to bottlenecks like slow sinks or complex computations—the upstream task experiences backpressure through depleted output buffers. This triggers a flow control signal that propagates upstream, slowing input rates (e.g., throttling sources) without dropping records or crashing the job. Buffering in TaskManagers absorbs temporary spikes, while metrics such as backPressuredTimeMsPerSecond (tracking time spent waiting on buffers) allow monitoring and diagnosis, with status levels from OK (under 10% backpressure) to HIGH (over 50%). This ensures stable execution even under imbalanced loads.[23]
State Management and Fault Tolerance
Apache Flink's state management enables the persistence and consistent handling of application state across distributed tasks, crucial for maintaining computational progress in long-running stream processing jobs. State in Flink represents the data maintained by operators, such as accumulators in window functions or key-value mappings in keyed streams, and is stored locally on TaskManagers unless configured otherwise. This local storage facilitates fast access during normal operation, while fault tolerance mechanisms ensure state recovery upon failures without data loss or duplication.

Flink provides several state backends to manage how state is stored and checkpointed, balancing performance, scalability, and persistence needs. The HeapStateBackend (also known as HashMapStateBackend) stores state as objects directly in the Java heap memory of TaskManagers, offering high-speed access suitable for applications with moderate state sizes or when high availability is ensured through replication. However, it is limited by available memory and does not persist state to disk by default, making it vulnerable to TaskManager failures without checkpoints. In contrast, the RocksDBStateBackend (EmbeddedRocksDBStateBackend) serializes state into RocksDB, a key-value store embedded in each TaskManager, writing data to local disk directories for persistence. This backend excels in handling very large states—potentially terabytes—by spilling to disk when memory is constrained, though it incurs overhead from serialization and I/O operations. For global state storage, Flink supports filesystem options, where checkpoints are written to remote durable storage like HDFS or S3; the experimental ForStStateBackend further enables disaggregated state by storing it asynchronously on remote filesystems, allowing unlimited state sizes and decoupling from local TaskManager resources. The choice of backend impacts checkpointing efficiency and recovery times, with RocksDB and ForSt supporting incremental updates to minimize data transfer.[24]

Checkpoints form the core of Flink's fault tolerance, capturing periodic, consistent snapshots of the entire application state and input stream positions to enable recovery as if no failure occurred. Enabled via configuration (e.g., env.enableCheckpointing(1000) for one-second intervals), checkpoints use a snapshot-and-restore mechanism where the JobManager coordinates the process across all tasks. This involves streaming barriers—special markers injected into data streams—that trigger state serialization without halting processing, ensuring a globally consistent view. The protocol employs a two-phase commit approach: in the first phase, tasks write state snapshots to durable storage and acknowledge; in the second phase, upon all acknowledgments, the JobManager finalizes the checkpoint, guaranteeing atomicity. This design supports exactly-once semantics internally by pre-aggregating state changes before barriers and replaying from the last checkpoint upon failure, restoring operators to their exact prior state.[25]
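A minimal configuration sketch along these lines is shown below; it assumes the RocksDB state backend dependency is on the classpath, the storage path is a placeholder, and the exact class and package names vary somewhat between Flink releases.

import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Trigger a checkpoint every second.
env.enableCheckpointing(1000);

// Keep working state in embedded RocksDB; `true` enables incremental checkpoints.
env.setStateBackend(new EmbeddedRocksDBStateBackend(true));

// Write checkpoint data to durable remote storage (placeholder path).
env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/flink/checkpoints");

// Optional tuning: bound checkpoint duration and enforce a pause between checkpoints.
env.getCheckpointConfig().setCheckpointTimeout(10 * 60 * 1000);
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500);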
For end-to-end consistency with external systems, Flink extends exactly-once guarantees through transactional integrations, such as with Apache Kafka via idempotent producers or two-phase commit sinks that coordinate with external transaction managers. For instance, when Kafka serves as both source and sink, Flink's checkpoint-aligned offsets ensure that upon recovery, messages are neither skipped nor duplicated across the pipeline. These guarantees hold for stateful operators, where pre-aggregated results (e.g., window sums) are committed only after barrier alignment, preventing partial updates.
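For example, an exactly-once Kafka sink can be declared roughly as follows using the KafkaSink builder from the Kafka connector; the broker address, topic, and transactional-id prefix are placeholders, and resultStream stands for an existing DataStream<String>.

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;

KafkaSink<String> sink = KafkaSink.<String>builder()
    .setBootstrapServers("broker:9092")
    .setRecordSerializer(
        KafkaRecordSerializationSchema.builder()
            .setTopic("results")
            .setValueSerializationSchema(new SimpleStringSchema())
            .build())
    .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)  // two-phase commit tied to checkpoints
    .setTransactionalIdPrefix("flink-results")             // required for transactional writes
    .build();

resultStream.sinkTo(sink);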
Savepoints build on checkpoints as manually triggered, portable snapshots that allow pausing, resuming, or migrating jobs across clusters or versions. Unlike automatic checkpoints, savepoints are initiated via commands (e.g., flink savepoint <job-id> <path>), storing state in a version-agnostic format compatible with future Flink releases, facilitating upgrades without downtime. They are stored in configurable directories on durable filesystems, enabling blueprinting of production jobs for development or scaling.
To optimize for large-state applications, Flink supports incremental checkpoints, which capture only changes (deltas) since the previous checkpoint rather than full snapshots. Enabled by setting execution.checkpointing.incremental: true, this feature—available in RocksDB and ForSt backends—significantly reduces checkpointing time and storage overhead; for example, in scenarios with gigabytes of state, incremental mode can cut completion times from hours to minutes by avoiding redundant serialization of unchanged data. The backend maintains versioned logs of modifications, merging them periodically to bound storage growth. This efficiency is vital for real-time applications processing high-volume streams, ensuring fault tolerance without compromising latency.[26]
Programming Model
DataStream API
The DataStream API serves as the foundational programming interface in Apache Flink for developing scalable stream and batch processing applications, enabling developers to ingest, transform, and output continuous or finite data streams in a distributed environment.[27] It supports both unbounded streams, which represent potentially infinite data flows like real-time sensor inputs, and bounded streams for finite datasets, allowing unified processing logic across streaming and batch workloads.[27] This API is implemented in Java and Python, with transformations applied lazily to form a directed acyclic graph (DAG) that Flink's runtime optimizes and executes. In Flink 2.0, an experimental DataStream API V2 was introduced to enhance the original API with improved ergonomics and performance.[28]

At its core, the DataStream abstraction encapsulates immutable collections of elements of the same type, sourced from external systems and manipulated through a rich set of operators.[27] A DataStream can be created from various inputs, such as files or message queues, and undergoes transformations that produce new DataStreams without altering the original.[27] Key transformations include map, which applies a one-to-one function to each element—for instance, converting a string to an integer via input.map(value -> Integer.parseInt(value))[29]—and filter, which selectively retains elements based on a predicate, such as excluding zero values with data.filter(value -> value != 0).[29] For aggregation, keyBy logically partitions the stream by a key selector, enabling keyed stateful operations like reduce, which iteratively combines elements within a key group, e.g., summing integers through keyedStream.reduce((a, b) -> a + b).[29]
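A compact, self-contained example in this style—here a running word count over an in-memory bounded input, with class and job names chosen arbitrarily—might look as follows.

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class WordCountExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("flink", "stream", "state", "flink")
            .map(word -> Tuple2.of(word, 1))
            .returns(Types.TUPLE(Types.STRING, Types.INT))   // type hint needed for the generic lambda
            .filter(t -> !t.f0.isEmpty())
            .keyBy(t -> t.f0)                                // partition by word
            .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1))  // running count per word
            .print();

        env.execute("wordcount-example");
    }
}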
Flink's handling of time is crucial for accurate stream processing, distinguishing between processing time, which relies on the system's clock during computation and is sensitive to delays; ingestion time, assigned when data arrives at the source operator; and event time, derived from timestamps embedded in the data records themselves to reflect real-world occurrence order.[30] Event time is preferred for out-of-order arrivals common in distributed systems, managed via watermarks—special records that indicate the latest observed event time and bound lateness, allowing Flink to close windows and discard excessively delayed elements.[30] Developers assign timestamps and watermarks using the TimestampAssigner and WatermarkGenerator interfaces, often with strategies like bounded-out-of-orderness to tolerate minor delays.[30]
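As an illustration, a bounded-out-of-orderness strategy with a five-second tolerance could be attached roughly as follows; SensorReading and its getTimestampMillis() accessor are hypothetical, and events stands for an existing DataStream<SensorReading>.

import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;

DataStream<SensorReading> withEventTime = events.assignTimestampsAndWatermarks(
    WatermarkStrategy
        .<SensorReading>forBoundedOutOfOrderness(Duration.ofSeconds(5))  // tolerate 5 s of disorder
        .withTimestampAssigner((reading, recordTimestamp) -> reading.getTimestampMillis()));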
Windowing in the DataStream API groups unbounded streams into finite subsets for aggregation, supporting count-based windows that trigger after a fixed number of elements, such as every 100 tuples, and time-based windows aligned to event, ingestion, or processing time.[31] Common types include tumbling windows for non-overlapping intervals (e.g., 5-minute slots via TumblingEventTimeWindows.of(Time.minutes(5))) and sliding windows that overlap by a specified duration.[31] Applied on keyed streams, windows use functions like reduce for incremental updates or aggregate for custom logic, such as computing averages with a user-defined aggregator.[31] Late data, arriving after a watermark passes a window's end, can be handled by allowing configurable lateness or redirecting to side outputs, ensuring robustness in asynchronous environments.[31]
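Building on the previous sketch, a keyed tumbling-window aggregation might be written roughly as follows; the sensor field names are hypothetical, withEventTime is the stream from the sketch above, and depending on the Flink release the window size is expressed with the windowing Time class (as here) or with java.time.Duration.

import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// Maximum reading per sensor over non-overlapping 5-minute event-time windows.
withEventTime
    .keyBy(r -> r.getSensorId())
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .reduce((a, b) -> a.getValue() > b.getValue() ? a : b)
    .print();

// The same aggregation over 10-minute windows that slide every minute.
withEventTime
    .keyBy(r -> r.getSensorId())
    .window(SlidingEventTimeWindows.of(Time.minutes(10), Time.minutes(1)))
    .reduce((a, b) -> a.getValue() > b.getValue() ? a : b);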
For advanced patterns, side outputs enable operators to emit secondary streams of data that do not fit the primary result type, such as tagging anomalous events with OutputTag for separate processing.[32] Broadcasting complements this by distributing a control stream to all parallel instances of a downstream operator, facilitating dynamic rule updates in keyed computations without full repartitioning.[29] These mechanisms support complex, stateful applications while integrating with Flink's fault-tolerant state backend for consistency.[32]
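A sketch of the side-output pattern is shown below, routing hypothetical high-value readings to a secondary stream; the threshold and type names are illustrative, and withEventTime again stands for an existing DataStream<SensorReading>.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

final OutputTag<SensorReading> anomalies = new OutputTag<SensorReading>("anomalies") {};

SingleOutputStreamOperator<SensorReading> mainStream = withEventTime.process(
    new ProcessFunction<SensorReading, SensorReading>() {
        @Override
        public void processElement(SensorReading r, Context ctx, Collector<SensorReading> out) {
            if (r.getValue() > 100.0) {
                ctx.output(anomalies, r);   // secondary stream for anomalous readings
            } else {
                out.collect(r);             // primary result stream
            }
        }
    });

DataStream<SensorReading> anomalyStream = mainStream.getSideOutput(anomalies);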
DataStream applications connect to external systems via connectors for sources and sinks, with built-in support for reading from files (e.g., using FileSource for continuous monitoring) or sockets, and writing to similar targets.[33] Third-party connectors, such as those for Apache Kafka, allow scalable ingestion from topics using KafkaSource and output via KafkaSink, configured with serialization schemas for seamless integration into event-driven architectures.[34]
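For instance, a Kafka source can be wired in roughly as follows; the broker address, topic, and consumer group are placeholders, and env is an existing StreamExecutionEnvironment.

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;

KafkaSource<String> source = KafkaSource.<String>builder()
    .setBootstrapServers("broker:9092")
    .setTopics("events")
    .setGroupId("flink-consumer")
    .setStartingOffsets(OffsetsInitializer.earliest())
    .setValueOnlyDeserializer(new SimpleStringSchema())
    .build();

DataStream<String> events = env.fromSource(
    source, WatermarkStrategy.noWatermarks(), "kafka-events");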
Table API and SQL
The Table API and SQL in Apache Flink provide declarative interfaces for relational data processing, enabling unified handling of both streaming and batch data through a high-level abstraction over the underlying DataStream API.[35] These APIs allow users to express queries using familiar relational concepts, such as tables and joins, while leveraging Flink's distributed runtime for scalable execution. The Table API offers a fluent, expression-based approach in languages like Java, Scala, and Python, whereas SQL provides a standardized query language for more direct declarative programming. Both integrate seamlessly to support complex analytics workloads, from real-time aggregations to historical batch computations.[35]

The Table API is a language-integrated query builder that enables the creation of tables from input streams or batches using methods like TableEnvironment.from() to reference tables registered in a catalog, or StreamTableEnvironment.fromDataStream() to convert DataStream objects into relational tables without altering the underlying data flow.[36] It supports a range of relational operations, including selections for projecting and computing fields, joins for combining tables based on conditions, and groupBy for aggregations over grouped data. For instance, in Java, a simple aggregation might be expressed as orders.groupBy($("category")).select($("category"), $("price").avg().as("avgPrice")), where $ denotes column references, demonstrating the API's concise syntax with IDE-friendly autocompletion and type safety.[36] This fluent style abstracts away low-level stream manipulations, focusing on relational logic while maintaining unified semantics for bounded (batch) and unbounded (streaming) inputs.[36]
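A small end-to-end Table API sketch in this style is shown below; it assumes a table named Orders with category and price columns is already registered in the catalog.

import static org.apache.flink.table.api.Expressions.$;

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;

TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

// Read the registered Orders table and compute the average price per category.
Table orders = tEnv.from("Orders");

Table avgPrices = orders
    .groupBy($("category"))
    .select($("category"), $("price").avg().as("avgPrice"));

avgPrices.execute().print();   // prints the continuously updating result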
Flink's SQL integration, built on Apache Calcite, offers compliance with ANSI SQL:2011 standards, supporting core Data Definition Language (DDL) for creating tables and views, Data Manipulation Language (DML) for inserts and updates, and standard query constructs like SELECT for filtering and projecting data.[37] It extends SQL for streaming scenarios through dynamic tables, which represent changing data sources as append-only or updatable streams, allowing queries to produce continuously updating results.[38] Temporal joins further enhance streaming capabilities by correlating rows based on event time or processing time against versioned tables, such as enriching transaction data with historical exchange rates at specific timestamps via syntax like SELECT * FROM transactions AS t JOIN rates FOR SYSTEM_TIME AS OF t.proctime AS r ON t.currency = r.currency.[39] These features enable expressive queries over time-sensitive data without manual state management.
Catalog management in Flink unifies metadata handling across Table API and SQL, providing a persistent namespace for databases, tables, functions, and views.[40] The Hive Catalog integrates with Apache Hive Metastore for storing Flink metadata alongside existing Hive tables, supporting case-insensitive operations and seamless access to Hive-compatible data warehouses via configurations like HiveCatalog hive = new HiveCatalog("myhive", "default", "/path/to/hive-conf"), where the configuration directory contains a hive-site.xml pointing at the metastore (for example, thrift://hive-metastore:9083).[41] Similarly, the JDBC Catalog connects to relational databases such as PostgreSQL or MySQL, automatically mapping schemas and enabling DDL operations like CREATE TABLE to propagate metadata changes. This abstraction allows users to reference external metadata sources directly in queries, facilitating hybrid stream-batch environments.
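Registering and using such a catalog from the Table API might look roughly like this; the catalog name, default database, and configuration directory are placeholders, and tEnv is the TableEnvironment from the earlier sketch.

import org.apache.flink.table.catalog.hive.HiveCatalog;

HiveCatalog hive = new HiveCatalog("myhive", "default", "/opt/hive-conf");
tEnv.registerCatalog("myhive", hive);
tEnv.useCatalog("myhive");
tEnv.useDatabase("default");

// Tables from the Hive metastore can now be referenced directly in queries.
tEnv.executeSql("SHOW TABLES").print();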
Continuous queries in Flink transform batch processing into streaming by treating finite datasets as changelog streams, where operations like aggregations produce ongoing updates maintained as materialized views.[38] For example, a windowed aggregation over batch input can be expressed as a continuous query using TUMBLE functions in SQL, yielding incremental results that evolve with new data appends, thus bridging traditional batch jobs with real-time requirements through eager view maintenance.[38]
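Expressed through the SQL interface, such a windowed continuous query might look like the following, using the TUMBLE table-valued function; the Orders table, its order_time time attribute, and the amount column are assumed to exist, and tEnv is an existing TableEnvironment.

tEnv.executeSql(
    "SELECT window_start, window_end, SUM(amount) AS total_amount " +
    "FROM TABLE(" +
    "  TUMBLE(TABLE Orders, DESCRIPTOR(order_time), INTERVAL '5' MINUTES)) " +
    "GROUP BY window_start, window_end").print();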
Flink's query optimizer, powered by Apache Calcite, combines rule-based rewrites—such as subquery decorrelation, filter push-down, and join reordering—with cost-based optimization that evaluates execution plans using statistics on data volume, cardinality, and resource costs like CPU and I/O.[42] This dual approach ensures efficient plans; for instance, rule-based pruning eliminates unnecessary projections early, while cost-based decisions select optimal join strategies based on table sizes, configurable via options like table.optimizer.join-reorder-enabled=true.[42] The optimizer applies these transformations transparently, enhancing performance for both simple filters and complex multi-table queries without user intervention.[42]
Batch and Extension APIs
The DataSet API, Flink's legacy imperative interface for batch processing, had been marked as legacy and soft-deprecated since Flink 1.12 and was officially deprecated in version 1.18, as the framework unified stream and batch processing under the DataStream API and Table API/SQL.[44][45] It was removed in Apache Flink 2.0 (released March 2025), and batch workloads are now handled through the unified DataStream API in batch execution mode or the Table API and SQL.[6][43] The Apache Beam Flink Runner enables the execution of portable Beam pipelines on the Flink runtime, translating Beam's PTransforms and DoFns into Flink jobs for both streaming and batch processing.[46] Available in classic (Java-only) and portable (supporting Java, Python, Go) flavors, it leverages Flink's capabilities for high-throughput, low-latency execution with exactly-once semantics, deployable on clusters like YARN or Kubernetes.[46] For batch scenarios, it processes bounded inputs efficiently, with capabilities detailed in Beam's runner matrices.[47]
FlinkML extends Flink for machine learning pipelines, offering Table API-based components like Estimator for training models on batch datasets and Transformer for feature engineering and predictions.[48] It supports batch workflows, such as fitting models on tabular data via methods like fit(), and includes algorithms for tasks like clustering and regression.[48] Similarly, the Complex Event Processing (CEP) library detects patterns in event sequences using a declarative Pattern API, applicable to batch data when executed via the DataStream API in batch mode for finite datasets.[49]
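A CEP pattern along these lines—three consecutive failed logins within ten seconds—might be declared roughly as follows; LoginEvent, isFailure(), getUserId(), and the logins stream are hypothetical, and newer releases also accept java.time.Duration in within().

import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternStream;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.windowing.time.Time;

Pattern<LoginEvent, ?> pattern = Pattern.<LoginEvent>begin("failures")
    .where(new SimpleCondition<LoginEvent>() {
        @Override
        public boolean filter(LoginEvent e) {
            return e.isFailure();
        }
    })
    .times(3).consecutive()          // exactly three matching events in a row
    .within(Time.seconds(10));       // temporal constraint on the whole sequence

// Matched sequences are then extracted with select() or process() on the PatternStream.
PatternStream<LoginEvent> matches = CEP.pattern(logins.keyBy(e -> e.getUserId()), pattern);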
Migration from the DataSet API to unified APIs involves replacing ExecutionEnvironment with StreamExecutionEnvironment set to BATCH mode for DataStream equivalents, or adopting the Table API/SQL for relational batch queries.[50] Direct mappings exist for transformations like map and filter, while operations like join may require windowing adjustments; connectors for sources and sinks remain compatible with minimal changes.[50] For Table API migrations, users convert datasets to tables using as() methods and express logic declaratively, ensuring semantic equivalence for batch processing.[50]
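A minimal migration sketch, assuming a formerly DataSet-based grouping job, is shown below: the same DataStream program runs with batch scheduling simply by switching the runtime mode, and the bounded in-memory source makes the job terminate once the input is exhausted.

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.BATCH);   // batch scheduling and blocking shuffles

env.fromElements(3, 1, 4, 1, 5)
    .keyBy(v -> v % 2)            // group odd and even values
    .reduce(Integer::sum)         // aggregate within each group; emitted once at end of input
    .print();

env.execute("batch-mode-example");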
History
Origins and Early Development
The Stratosphere project originated in 2009 as a collaborative research initiative led by the Technical University of Berlin (TU Berlin) and Humboldt University of Berlin, with the goal of advancing distributed data processing systems for large-scale analytics.[4] Funded initially by the German Research Foundation (DFG), the effort brought together academic researchers to explore innovative architectures beyond existing paradigms.[51] Key contributors included Stephan Ewen and Kostas Tzoumas, who served as lead developers from TU Berlin, alongside Volker Markl, Alexander Alexandrov, and others from the involved institutions, with growing involvement from industry partners over time.[52]

The core motivation for Stratosphere stemmed from the recognized shortcomings of Apache Hadoop's MapReduce model, which excelled at simple batch jobs but struggled with iterative algorithms essential for machine learning, graph processing, and optimization tasks due to its acyclic execution and repeated data reloading overhead. To overcome these, the project introduced the PACT (Parallelization Contracts) programming model, a generalization of MapReduce that supported higher-order functions and delta iterations for efficient incremental updates in loops. Additionally, Stratosphere emphasized pipelined dataflow execution to enable low-latency processing, laying groundwork for unified handling of both batch and streaming workloads, though initial emphasis was on batch-oriented analytics.

Early development progressed with the project's first open-source release, Stratosphere 0.1, in 2011, which centered on batch processing capabilities while incorporating prototype support for streaming through pipelined operators.[53] Subsequent versions built on this foundation, refining the runtime and APIs. In April 2014, the Stratosphere team donated the codebase to the Apache Software Foundation Incubator, renaming it Apache Flink to resolve trademark concerns and align with open-source governance.[54][4] This transition marked the shift from academic prototyping to a broader community-driven project, culminating in Flink's graduation to top-level Apache status in December 2014.[4]

Major Releases and Evolution
Apache Flink achieved top-level project status within the Apache Software Foundation in December 2014, marking a significant milestone in its maturation as an open-source framework.[55] This transition from the Apache Incubator enabled greater community governance and resource allocation for development. In March 2016, Flink released version 1.0, which introduced stable APIs with guaranteed backward compatibility across the 1.x series, facilitating reliable production deployments and encouraging wider adoption by ensuring application portability.[5]

A pivotal architectural shift occurred with Flink 1.12 in December 2020, which unified batch and stream processing under a single runtime. This release deprecated the legacy DataSet API in favor of the DataStream API in batch mode, simplifying development and operations by eliminating the need for separate batch and streaming pipelines.[56] Subsequent releases built on this foundation with targeted enhancements. Flink 1.18, released in October 2023, improved PyFlink support through better user-defined function (UDF) integration and Python packaging, making it easier for Python developers to build scalable streaming applications.[57] Flink 1.20, announced in August 2024, advanced Change Data Capture (CDC) capabilities with refined Debezium connector support and schema evolution handling, streamlining real-time data synchronization from databases.[58]

The transition to the 2.x series accelerated Flink's evolution toward modern infrastructures. Flink 2.0, released on March 24, 2025, introduced AI integrations such as dynamic model invocation in Flink CDC for real-time processing with external AI services like OpenAI, alongside Kubernetes-native features including disaggregated state management on distributed file systems for enhanced scalability in cloud environments.[6] Flink 2.1, launched on July 31, 2025, further upgraded real-time AI capabilities with AI Model DDL for SQL-based model management and realtime AI functions, enabling seamless embedding of machine learning workflows into streaming pipelines.[59]

Over these releases, Flink has trended toward cloud-native architectures, exemplified by native Kubernetes integration and elastic scaling optimizations that support terabyte-scale state rescaling without downtime. SQL performance has seen iterative gains, such as adaptive batch execution in 2.0. Ecosystem expansions include Flink CDC 3.4 in May 2025, which added pipeline connectors for targets like Apache Iceberg with batch execution support, broadening integration options for data lakes and warehouses.[60] Subsequent minor releases, including Flink 2.1.1 on November 10, 2025, and Flink CDC 3.5.0 in September 2025, provided bug fixes and additional enhancements.[7][61] These advancements have driven Flink's adoption in real-time AI and streaming ETL scenarios, with surveys indicating faster decision-making through low-latency processing of event data for AI agents and ETL pipelines.[62]

Community and Ecosystem
Development Process
Apache Flink operates under the governance of the Apache Software Foundation, with a Project Management Committee (PMC) consisting of 55 members as of August 2025 who oversee the project's direction, release verification, and community decisions. The PMC, established in December 2014, uses consensus-driven processes for major decisions, facilitated by tools such as JIRA for issue tracking and reporting bugs or features.[63] Significant changes and enhancements are proposed and documented through Flink Improvement Proposals (FLIPs), which serve as a centralized mechanism to outline planned major developments while JIRA handles ongoing task tracking and resolution.[64]

Contributions to Flink follow structured guidelines to ensure code quality and maintainability, requiring developers to first create a JIRA ticket, discuss the approach on the dev mailing list to reach consensus, and obtain assignment from a committer before submitting pull requests.[65] Adherence to the project's code style is enforced via the Code Style and Quality Guide, with all changes needing to pass mvn clean verify builds, including unit and end-to-end tests; unrelated formatting alterations are discouraged to streamline reviews.[65] New contributors are supported through labeled "starter" issues in JIRA, which provide guided entry points for learning the codebase and participating effectively, fostering community involvement without a formal mentorship program.[65]
The project maintains a time-based release cadence of approximately three months for minor versions, enabling regular delivery of features and improvements, while major releases occur roughly annually to introduce significant architectural advancements.[66] Security vulnerabilities and critical bugs receive ad-hoc patch releases on dedicated branches, ensuring timely fixes without disrupting the main development cycle.[66]
Since entering the Apache incubator in 2014, Flink has grown to over 1,300 unique contributors worldwide, reflecting sustained activity and expansion.[67] This includes notable growth in Asia, driven by contributions from organizations like Alibaba, which have enhanced adoption and development in the region through events and code integrations.[68]
Collaboration occurs via mirrored repositories on GitHub for code hosting and pull requests, supplemented by Apache mailing lists for discussions and a dedicated Slack workspace for real-time community interactions.[69]