Glossary of Terms Used in Apache Flink
Apache Flink | Apache Flink is an open-source stream processing and batch processing framework for high-throughput, low-latency, big data processing and analytics. |
Apache Flink Table Catalog | A catalog that contains metadata about tables, including their schema and connection information for external data sources. |
Application Cluster | A Flink Application Cluster is a dedicated Flink Cluster that only executes Flink Jobs from one Flink Application. The lifetime of the Flink Cluster is bound to the lifetime of the Flink Application. |
At-Least-Once Processing | A processing guarantee in Flink that ensures that events are processed at least once but may be duplicated in case of failures. |
At-Most-Once Semantics | A guarantee that each record will be processed at most once, but possibly not at all. |
Backpressure | A mechanism in Flink to deal with overload situations where data processing cannot keep up with the incoming data rate, allowing for controlled slowdown or buffering. |
Batch Interval | Flink does not micro-batch streams; the closest notion is the mini-batch interval in Flink SQL (table.exec.mini-batch.allow-latency), which controls how long the planner may buffer records before processing them together, trading latency for throughput. |
Batch Iterations | A feature of Flink's DataSet API for iterative processing of data (bulk and delta iterations), often used in graph algorithms and machine learning. |
Batch Processing | The processing of finite, bounded sets of data. Flink can handle both batch and stream processing. |
Broadcast State | A type of state in Flink that can be broadcast to all parallel instances of an operator, enabling efficient sharing of data across tasks. |
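A minimal sketch of broadcast state in the DataStream API, assuming two hypothetical streams: events (a DataStream<String>) and rules (a DataStream<String>) whose latest element should be visible to every parallel instance:

    MapStateDescriptor<String, String> rulesDesc =
        new MapStateDescriptor<>("rules", Types.STRING, Types.STRING);

    DataStream<String> result = events
        .connect(rules.broadcast(rulesDesc))
        .process(new BroadcastProcessFunction<String, String, String>() {
            @Override
            public void processElement(String event, ReadOnlyContext ctx, Collector<String> out) throws Exception {
                // read-only access to the broadcast state on the event side
                String rule = ctx.getBroadcastState(rulesDesc).get("current");
                out.collect(event + " / rule=" + rule);
            }

            @Override
            public void processBroadcastElement(String rule, Context ctx, Collector<String> out) throws Exception {
                // every parallel instance receives the broadcast element and updates its local copy
                ctx.getBroadcastState(rulesDesc).put("current", rule);
            }
        });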
Catalog Connector | A component in Flink that connects to external catalog systems, allowing for metadata management and integration with external storage and data sources. |
CEP (Complex Event Processing) | A Flink library that allows for the detection of patterns and complex events within streaming data. |
Pattern (CEP) | In Flink's Complex Event Processing (CEP) library, a Pattern specifies the sequence or combination of events that you want to detect in a data stream. |
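A small sketch of defining and applying a Pattern, assuming a hypothetical LoginEvent type with a status field and a DataStream<LoginEvent> named logins (SimpleCondition.of requires Flink 1.16 or later):

    Pattern<LoginEvent, ?> twoFailures = Pattern.<LoginEvent>begin("first")
        .where(SimpleCondition.of(e -> "FAIL".equals(e.status)))
        .next("second")
        .where(SimpleCondition.of(e -> "FAIL".equals(e.status)))
        .within(Time.seconds(10)); // both failures must occur within 10 seconds

    PatternStream<LoginEvent> matches = CEP.pattern(logins, twoFailures);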
Checkpoint | A consistent snapshot of the application's state, taken at regular intervals, to provide fault tolerance and exactly-once processing semantics. |
Checkpoint Retention Policy | A policy that specifies how long checkpoints should be retained, helping manage the storage space consumed by checkpoints. |
Checkpoint Storage | The location where the State Backend stores its snapshots during a checkpoint, such as the Java heap of the JobManager or a filesystem. |
Checkpointing | The process of taking a snapshot of the state of a Flink application, so it can be restored in case of failures. |
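A minimal configuration sketch; the interval and pause values are illustrative, not recommendations:

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.enableCheckpointing(60_000L); // trigger a checkpoint every 60 seconds
    env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
    env.getCheckpointConfig().setMinPauseBetweenCheckpoints(10_000L); // breathing room between checkpoints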
Cluster | A group of machines or nodes that run Flink jobs, managed by Flink's cluster manager. |
Cluster Manager | Software responsible for managing the deployment and execution of Flink jobs on a cluster of machines. Common cluster managers include Apache YARN and Kubernetes; support for Apache Mesos was removed in Flink 1.14. |
Cluster Resource Management | The allocation and management of CPU, memory, and other resources across JobManagers and TaskManagers in a Flink cluster. |
Connector | A library or module in Flink that provides integration with external data sources and sinks, such as Apache Kafka, Apache Cassandra, or relational databases. |
Continuous Processing Mode | Flink's ability to process data in a low-latency, continuous manner, suitable for use cases where sub-second processing times are required. |
CoProcessFunction | A function in Flink for processing two connected input streams, with a separate processing method for each stream (processElement1 and processElement2). |
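A sketch connecting two hypothetical streams, clicks (DataStream<Click>) and purchases (DataStream<Purchase>), with separate logic per side:

    DataStream<String> joined = clicks
        .connect(purchases)
        .process(new CoProcessFunction<Click, Purchase, String>() {
            @Override
            public void processElement1(Click c, Context ctx, Collector<String> out) {
                out.collect("click by " + c.userId);
            }

            @Override
            public void processElement2(Purchase p, Context ctx, Collector<String> out) {
                out.collect("purchase of " + p.amount);
            }
        });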
Custom Partitioners | Implementations of the Partitioner interface that define custom logic for distributing records across parallel instances, applied via DataStream#partitionCustom. |
DataSet API | The API in Flink for processing bounded data sets, typically used for batch processing. It has been deprecated since Flink 1.12 in favor of unified batch execution on the DataStream API. |
DataStream | A core data structure in Flink for representing and processing streams of data. It represents an unbounded stream of events or records. |
DataStream API | The API in Flink used for processing unbounded streams of data, supporting event-time processing, windowing, and stateful operations. |
DataStream API Windows | A feature in the DataStream API that allows you to define windows for grouping and aggregating data within a time-based or count-based interval. |
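For example, a tumbling one-minute event-time window over a keyed stream (a fragment assuming a hypothetical event type with userId and clicks fields, and timestamps/watermarks already assigned):

    DataStream<ClickEvent> perUserCounts = events
        .keyBy(e -> e.userId)
        .window(TumblingEventTimeWindows.of(Time.minutes(1)))
        .sum("clicks"); // aggregate the clicks field within each window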
DataStream SQL | The ability to run SQL queries on streaming data in Flink; streaming SQL is provided by the Table API & SQL layer, which interoperates with the DataStream API. |
Dynamic Scaling | The ability to dynamically adjust the parallelism of operators in a running Flink job to adapt to changing data processing requirements. |
Dynamic Table | A table in Flink that can be updated or modified in real-time as new data arrives or changes in external systems. |
Dynamic Table Connector | A connector in Flink that enables dynamic updates to table data as it changes in external systems, providing real-time access to changing data. |
Event | An event is a statement about a change of the state of the domain modelled by the application. Events can be input and/or output of a stream or batch processing application. Events are special types of records. |
Event Time | The time at which an event occurred in the real world, as opposed to processing time or ingestion time. |
Event Time Alignment | Techniques for aligning events in a distributed stream processing system like Flink to ensure correct event time processing. |
Event Time Timers | Timers in Flink that allow you to schedule actions or computations based on event time, useful for complex event processing. |
Event-Driven | A paradigm where the flow of the program is determined by events, such as user actions, sensor outputs, or messages from other programs. |
Event-driven Architecture | A software architecture paradigm where components communicate primarily through events or messages. Flink supports event-driven processing. |
Exactly-Once Kafka Semantics | Integration with Apache Kafka that ensures exactly-once processing when reading from and writing to Kafka topics. |
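A sketch of an exactly-once KafkaSink using the flink-connector-kafka API (the topic name, bootstrap servers, and transactional-id prefix are placeholders; EXACTLY_ONCE also requires checkpointing to be enabled):

    KafkaSink<String> sink = KafkaSink.<String>builder()
        .setBootstrapServers("localhost:9092")
        .setRecordSerializer(
            KafkaRecordSerializationSchema.builder()
                .setTopic("output-topic")
                .setValueSerializationSchema(new SimpleStringSchema())
                .build())
        .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
        .setTransactionalIdPrefix("my-app") // required for transactional (exactly-once) writes
        .build();

    stream.sinkTo(sink);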
Exactly-Once Processing | A processing guarantee in Flink that ensures that each event is processed only once, even in the presence of failures. |
Exactly-Once Semantics | A guarantee that each record will be processed exactly once, ensuring no data is lost or seen more than once. |
Exactly-Once Sink Connectors | External connectors in Flink that provide exactly-once semantics when writing data to specific sinks or storage systems. |
Exactly-Once Sink Semantics | A mechanism in Flink that ensures that data is written to sinks (e.g., databases) exactly once when using exactly-once processing. |
Exactly-Once Source Semantics | A mechanism in Flink that ensures that data is read from sources exactly once when using exactly-once processing. |
Execution Mode | DataStream API programs can be executed in one of two execution modes: BATCH or STREAMING. BATCH mode applies scheduling and shuffle optimizations for bounded input, while STREAMING mode processes records continuously with checkpoint-based fault tolerance. |
ExecutionGraph | The result of translating a Logical Graph for execution in a distributed runtime. The nodes are Tasks and the edges indicate input/output-relationships or partitions of data streams or data sets. |
Externalized Checkpoints | A checkpoint mechanism in Flink that stores checkpoint data externally, often in a distributed file system like HDFS or cloud storage. |
Fault Tolerance Mechanisms | The strategies and mechanisms in Flink that ensure that jobs continue processing in the presence of failures or errors. |
Flink Application | A Flink application is a Java Application that submits one or multiple Flink Jobs from the main() method (or by some other means). Submitting jobs is usually done by calling execute() on an execution environment. The jobs of an application can either be submitted to a long running Flink Session Cluster, to a dedicated Flink Application Cluster, or to a Flink Job Cluster. |
Flink Catalog | A catalog in Flink that manages metadata about external data sources and tables, facilitating integration with external storage systems. |
Flink CEP (Complex Event Processing) | The Complex Event Processing library in Flink, which provides features for detecting patterns and complex event sequences in event streams. |
Flink CLI | A command-line interface for interacting with and managing Flink applications, including submitting jobs and monitoring job status. |
Flink Cluster | A distributed system consisting of (typically) one JobManager and one or more Flink TaskManager processes. |
Flink Cluster Coordination | The process by which Flink's JobManager and TaskManagers coordinate and distribute tasks across the cluster. |
Flink Cluster Manager | A component responsible for managing the allocation and coordination of resources across the Flink cluster, such as Apache YARN, Kubernetes, or standalone cluster managers. |
Flink Connectors | Pre-built connectors in Flink for integrating with external systems and data sources, such as connectors for Apache Kafka, Apache Cassandra, and more. |
Flink Data Encryption | Configurations and practices for encrypting sensitive data in Flink applications to enhance security. |
Flink Data Serialization | Techniques and configurations for efficiently serializing and deserializing data in Flink applications. |
Flink DataSet API | A batch processing API in Flink for working with static, bounded datasets, providing a familiar batch processing programming model. |
Flink Event Time Processing Patterns | Common patterns and techniques used in Flink for processing events based on their event time characteristics. |
Flink Gelly | A library for graph processing in Flink, allowing the analysis of large-scale graph data. |
Flink Job | A Flink Job is the runtime representation of a logical graph, often referred to as a dataflow graph or directed acyclic graph (DAG) of operators, that is created and submitted for execution when calling execute() in a Flink Application. It represents an instance of a Flink application or program that is submitted to a Flink cluster for execution. |
Flink Job Recovery | The process of recovering and resuming the execution of a Flink job after a failure or interruption. |
Flink Job Scheduling | The process by which Flink schedules and assigns tasks to TaskManagers based on available resources and parallelism settings. |
Flink Metrics | Metrics and monitoring data provided by Flink to track the performance and health of running Flink jobs. |
Flink REST API | An API that allows you to interact with a Flink cluster programmatically, enabling job submission, monitoring, and control. |
Flink REST Proxy | A component that allows you to interact with a Flink cluster using RESTful HTTP endpoints, enabling remote management and job submission. |
Flink Savepoint | A snapshot of the entire application state, including operator states and configuration, which can be used to stop and resume a Flink job. |
Flink Savepoint Cleanup | The management of older savepoints to ensure efficient storage usage and to remove unnecessary savepoints. |
Flink Savepoint ID | A unique identifier associated with a specific savepoint, which is used when restoring the state of a Flink application. |
Flink Security | Measures and configurations for securing Flink clusters and applications, including authentication, authorization, and encryption. |
Flink SQL | A component of Apache Flink that enables you to run SQL queries on streaming data. It allows for easy integration of SQL queries into Flink applications. |
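A small sketch using the Table API entry point; the datagen connector generates random rows, so the query runs without any external systems:

    TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

    tEnv.executeSql(
        "CREATE TABLE clicks (" +
        "  user_id STRING," +
        "  url STRING" +
        ") WITH ('connector' = 'datagen')");

    // a continuous aggregation over the unbounded clicks table
    tEnv.executeSql("SELECT user_id, COUNT(url) AS cnt FROM clicks GROUP BY user_id").print();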
Flink State Access Patterns | Best practices and patterns for accessing and updating state data efficiently in Flink applications. |
Flink State Evolution | Managing changes to the schema or structure of Flink state to accommodate evolving data requirements. |
Flink State Processor API | An API for reading, modifying, and writing Flink application state externally, enabling integration with external tools and systems. |
Flink State Size Tuning | The process of optimizing the size of Flink application state to ensure efficient memory usage and performance. |
Flink State Snapshot | A snapshot of the state of a Flink job at a specific point in time, used for fault tolerance and recovery. |
Flink Stateful Operators | Operators in Flink that maintain state and allow for complex, stateful computations on data streams. |
Flink Task Slots | Fixed units of resources on a TaskManager in which tasks execute. The number of slots on a TaskManager bounds how many parallel tasks it can run concurrently. |
Flink UI | A web-based user interface that provides monitoring and management capabilities for Flink jobs, including metrics, checkpoints, and job status. |
FlinkML | The legacy machine learning library built on Flink's DataSet API. It has been superseded by the newer Flink ML project, which provides machine learning pipelines on top of the Table API. |
Function | Functions are implemented by the user and encapsulate the application logic of a Flink program. Most Functions are wrapped by a corresponding Operator. |
Ingestion Time | The time at which an event is ingested by the system. |
Instance | The term instance is used to describe a specific instance of a specific type (usually Operator or Function) during runtime. As Apache Flink is mostly written in Java, this corresponds to the definition of Instance or Object in Java. In the context of Apache Flink, the term parallel instance is also frequently used to emphasize that multiple instances of the same Operator or Function type are running in parallel. |
Job | A Flink application or program that defines a data processing pipeline. |
Job Cluster | A Flink Job Cluster is a dedicated Flink Cluster that only executes a single Flink Job. The lifetime of the Flink Cluster is bound to the lifetime of the Flink Job. This deployment mode has been deprecated since Flink 1.15. |
JobGraph | A directed graph where the nodes are Operators and the edges define input/output-relationships of the operators and correspond to data streams or data sets. A job graph is created by submitting jobs from a Flink Application. |
JobManager | A component of the Flink cluster responsible for coordinating and managing the execution of Flink jobs. It oversees the lifecycle of a job, schedules tasks, and coordinates checkpoints. |
Key Selector | A function that extracts a key from an element in a KeyedStream. It is used to partition the data for operations like windowing and state management. |
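For example (a fragment assuming a hypothetical Order type with a customerId field):

    // the lambda is a KeySelector<Order, String>: it extracts the partitioning key
    KeyedStream<Order, String> byCustomer = orders.keyBy(order -> order.customerId);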
Keyed State | State in Flink that is partitioned by a key, allowing for efficient access to state data for a specific key in KeyedStreams. |
KeyedStream | A DataStream that has been partitioned or grouped by a specific key. KeyedStreams are used for operations like windowing and aggregation. |
Late Data | Data that arrives after the window it belongs to has already been processed. |
Late Data Handling | Strategies and techniques for handling out-of-order data or late-arriving events in Flink applications. |
Late Events | Events that arrive after their expected processing time, which can be handled using event time processing and watermarks in Flink. |
Managed State | A type of state in Flink that is registered with the framework and is automatically handled by the system. For Managed State, Apache Flink takes care of persistence, rescaling, checkpointing, and distribution across TaskManagers among other responsibilities. |
Operator | A basic processing unit in a Flink job that represents a processing node and performs transformations on data. Examples of operators include actions like map, filter, reduce, and aggregation. |
Operator Chain | An Operator Chain consists of two or more consecutive Operators without any repartitioning in between. Operators within the same Operator Chain forward records to each other directly without going through serialization or Flink’s network stack. |
Operator State | State that is local to a specific operator in a Flink job. It allows an operator to maintain and update its internal state. |
Parallelism | The degree to which a Flink job can be divided into parallel tasks, allowing for concurrent execution and optimal use of available resources. In Flink, parallelism can be set at different levels: per operator, for the execution environment, at the client, or as a system-wide default. |
Partition | A partition is an independent subset of the overall data stream or data set. A data stream or data set is divided into partitions by assigning each record to one or more partitions. Partitions of data streams or data sets are consumed by Tasks during runtime. A transformation which changes the way a data stream or data set is partitioned is often called repartitioning. |
ProcessFunction | A low-level API in Flink that enables fine-grained control over event processing, including timers, state management, and side outputs. |
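A sketch of a KeyedProcessFunction combining keyed state with an event-time timer (Event is a hypothetical type; the example assumes timestamps and watermarks are assigned upstream):

    public class CountWithTimeout extends KeyedProcessFunction<String, Event, String> {

        private transient ValueState<Long> count;

        @Override
        public void open(Configuration parameters) {
            count = getRuntimeContext().getState(new ValueStateDescriptor<>("count", Long.class));
        }

        @Override
        public void processElement(Event e, Context ctx, Collector<String> out) throws Exception {
            Long current = count.value();
            count.update(current == null ? 1L : current + 1);
            // fire one minute (event time) after this element's timestamp
            ctx.timerService().registerEventTimeTimer(ctx.timestamp() + 60_000L);
        }

        @Override
        public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
            out.collect(ctx.getCurrentKey() + " -> " + count.value());
        }
    }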
Processing Time | The time at which an event is processed by the system. |
ProcessWindowFunction | A function in Flink used in combination with window operations to perform custom processing on windowed data. |
Queryable State | A feature in Flink that allows you to query the state of a running job, enabling real-time analytics and debugging. |
Queryable State Client | A client library or tool that allows external applications to query the state of a running Flink job using the Queryable State feature. |
Reactive Programming | A programming paradigm that focuses on building systems that are responsive to changes and events in real-time, often used in Flink applications. |
Record | Records are the constituent elements of a data set or data stream. Operators and Functions receive records as input and emit records as output. |
RichFunction | An interface in Flink that allows for more complex operations than regular functions, including access to Flink's runtime context. |
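A sketch of a RichMapFunction using the open/close lifecycle; DictionaryClient is a hypothetical external resource, not a Flink API:

    public class EnrichingMapper extends RichMapFunction<String, String> {

        private transient DictionaryClient dict; // hypothetical client, created per parallel instance

        @Override
        public void open(Configuration parameters) {
            dict = DictionaryClient.connect(); // runs once before any map() call
        }

        @Override
        public String map(String word) {
            return word + ":" + dict.lookup(word);
        }

        @Override
        public void close() {
            dict.close(); // release the resource on shutdown
        }
    }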
Savepoint | A manually triggered checkpoint in Flink that captures the application's state at a specific point in time. It can be used for purposes such as resuming, rolling back, debugging, upgrading, or migrating applications. |
Savepoint and Checkpoint Compatibility | Ensuring that savepoints and checkpoints remain compatible across different Flink versions for seamless state migration. |
Savepoint ID | A unique identifier associated with a specific savepoint, allowing you to reference and restore to that specific point in your Flink application's state. |
Savepoint Migration | The process of migrating a Flink application's state from one Flink version to another, often necessary when upgrading Flink. |
Savepoint Path | The location where Flink stores savepoint data, which can be configured to use distributed storage systems like HDFS or a shared file system. |
Session Cluster | A long-running Flink Cluster which accepts multiple Flink Jobs for execution. The lifetime of this Flink Cluster is not bound to the lifetime of any Flink Job. Formerly, a Flink Session Cluster was also known as a Flink Cluster in session mode. Compare to Flink Application Cluster. |
Session Window | A type of window that groups events into sessions based on gaps of inactivity: a session closes when no events arrive for the configured gap. Session windows enable flexible windowing for irregular event patterns. |
Side Output | In Flink, a mechanism that allows an operator to emit data to multiple output streams from a single operator. This is useful for scenarios such as handling errors, routing data differently, or producing multiple distinct output streams. |
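A sketch routing a hypothetical Event stream into a main output and a side output (the isLate flag is illustrative):

    final OutputTag<Event> lateTag = new OutputTag<Event>("late-events") {}; // anonymous subclass keeps type info

    SingleOutputStreamOperator<Event> main = events
        .process(new ProcessFunction<Event, Event>() {
            @Override
            public void processElement(Event e, Context ctx, Collector<Event> out) {
                if (e.isLate) {
                    ctx.output(lateTag, e); // side output
                } else {
                    out.collect(e);         // main output
                }
            }
        });

    DataStream<Event> late = main.getSideOutput(lateTag);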
Sink | A component in a Flink job that writes data to external destinations, including databases, file systems, messaging systems, or other external systems. |
Sink Parallelism | The degree of parallelism at which data is written to external sinks. It can be configured separately from the main job parallelism. |
Sliding Window | A window of fixed length that advances by a fixed slide interval. When the slide is smaller than the window size, consecutive windows overlap and a single event can belong to multiple windows. |
Snapshot Interval | The frequency at which Flink triggers checkpoints of a running job, configured as a time interval (for example, via enableCheckpointing). |
Source | A component in a Flink job that reads data from an external source, such as Kafka, Apache Pulsar, or file systems. |
State | The ability to maintain and update mutable state within a Flink application. It's often used for keeping track of aggregations and other application-specific data. |
State Backend | In Flink, the State Backend configuration determines how the state of a job is stored and managed on each TaskManager. It allows users to choose from various storage systems, such as the Java Heap of the TaskManager, filesystem, embedded RocksDB, or other external systems. This choice affects how state is persisted and accessed during stream processing. |
State Processor API | An API in Flink for reading, modifying, and writing the state of a Flink application externally, enabling stateful external operations. |
State Retention Time | A configuration setting in Flink that determines how long state data should be retained before it's considered stale and eligible for eviction. |
State Size | The amount of memory used by the state of a Flink application, which can impact performance and resource allocation. |
State TTL (Time-to-Live) | A feature in Flink that allows you to set a time limit on the retention of state data, automatically cleaning up expired state. |
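A configuration sketch attaching a one-day TTL to a state descriptor (the values are illustrative):

    StateTtlConfig ttl = StateTtlConfig
        .newBuilder(Time.days(1)) // expire entries one day after the last write
        .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
        .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
        .build();

    ValueStateDescriptor<Long> desc = new ValueStateDescriptor<>("count", Long.class);
    desc.enableTimeToLive(ttl);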
Stateful Functions | A Flink extension that allows you to build serverless applications using stateful, event-driven functions with a consistent state management model. |
Stateful Window Operator | An operator in Flink that combines stateful processing with windowing, allowing for advanced analytics on streaming data. |
Stateless Function | An operator or function in Flink that does not maintain any internal state; each record is processed independently of earlier records, which makes such functions easy to parallelize and restart. |
Streaming Analytics | The practice of performing real-time data analysis and generating insights from streaming data using Flink. |
Sub-Task | A Sub-Task is a Task responsible for processing a partition of the data stream. The term “Sub-Task” emphasizes that there are multiple parallel Tasks for the same Operator or Operator Chain. |
Table | A structured representation of data in Flink that resembles a database table. Tables can be created from DataStreams or other sources and used for querying and transformation. |
Table API | A language-integrated, relational API for Flink that lets users express queries and transformations through a fluent, SQL-like programming interface. |
Table API & SQL | High-level APIs in Flink for relational data processing. |
Table API and SQL Planner | The component in Flink responsible for parsing and optimizing SQL queries written using the Table API or Flink SQL. |
Table Program | A generic term for pipelines declared with Flink’s relational APIs (Table API or SQL). |
Task | In Flink, a task represents the smallest unit of work and is the node of a physical graph. It encapsulates the execution of exactly one parallel instance of an operator or an operator chain, serving as the basic unit executed by Flink’s runtime. |
TaskManager | A component of the Flink cluster responsible for executing individual tasks within a job and managing their state. TaskManagers handle data processing and, when working in tandem, enable parallelized processing across the cluster. |
Temporal Join | A type of join operation in Flink that allows you to match events from two or more streams based on their event time or processing time characteristics. |
Time Window | A window defined by a start and end time. |
Transformation | A Transformation is applied on one or more data streams or data sets and results in one or more output data streams or data sets. A transformation might change a data stream or data set on a per-record basis, but might also only change its partitioning or perform an aggregation. While Operators and Functions are the “physical” parts of Flink’s API, Transformations are only an API concept. Specifically, most transformations are implemented by certain Operators. |
Tumbling Window | A fixed-size window that advances by its own length, so consecutive windows never overlap and each event belongs to exactly one window. |
UID | A unique identifier of an Operator, either provided by the user or determined from the structure of the job. When the Application is submitted this is converted to a UID hash. |
UID hash | A unique identifier of an Operator at runtime, otherwise known as “Operator ID” or “Vertex ID” and generated from a UID. It is commonly exposed in logs, the REST API or metrics, and most importantly is how operators are identified within savepoints. |
Ververica Cloud | A fully managed, cloud-native service for real-time data processing with Apache Flink. |
Ververica Community Edition | Ververica Platform Community Edition is a free-to-use version of the Ververica Platform, which is designed to provide an enterprise-ready platform for managing Apache Flink applications. |
Ververica Platform | Ververica Platform is an Enterprise Stream Processing Platform available on-premise or in the private cloud. |
Ververica GmbH | Ververica was founded by the original creators of the Apache Flink project. It provides commercial support and services around Apache Flink, including Ververica Platform, Ververica Cloud, and the Community Edition. |
Watermark | A mechanism in Flink used to denote and track progress in event time. It's essential for time-based operations like windowing and handling out-of-order events. |
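A sketch assigning timestamps and bounded-out-of-orderness watermarks (SensorReading is a hypothetical type with a timestampMillis field; the strategy tolerates up to five seconds of disorder):

    WatermarkStrategy<SensorReading> strategy = WatermarkStrategy
        .<SensorReading>forBoundedOutOfOrderness(Duration.ofSeconds(5))
        .withTimestampAssigner((reading, previousTimestamp) -> reading.timestampMillis);

    DataStream<SensorReading> stamped = readings.assignTimestampsAndWatermarks(strategy);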
Window | A window represents a finite set of data in a stream, divided into fixed or dynamic intervals based on criteria like time, count, or other factors. These intervals facilitate aggregation or analysis of the streaming data. Flink supports various types of windows, including tumbling, sliding, and session windows. |