Exactly Once in Flink | Youth Training Camp Notes


This is day 3 of my participation in the note-writing event of the 4th Youth Training Camp.

How Exactly-Once semantics are implemented in Flink

Extended reading after class

Using SQL in Flink

  • The earliest distributed stream processing system, Apache Storm, offered low-latency processing but could only guarantee at-least-once semantics.
  • To let people work with streams without having to reason about concepts such as unbounded streams, windows, and state, Apache Flink added an API in version 0.9 that processes relational data with SQL-like expressions, called the Table API.
Table API

The Table API is tightly integrated with the DataSet and DataStream APIs. A Table can easily be created from a DataSet or DataStream and can also be converted back into one.
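
As a quick illustration, here is a minimal sketch of round-tripping between a DataStream and a Table (the class name, field names, and sample data are made up; the API shown is the unified StreamTableEnvironment of recent Flink releases):

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

import static org.apache.flink.table.api.Expressions.$;

public class TableConversion {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // A toy stream of (word, count) pairs.
        DataStream<Tuple2<String, Integer>> stream =
                env.fromElements(Tuple2.of("a", 1), Tuple2.of("b", 2));

        // DataStream -> Table, naming the columns.
        Table table = tableEnv.fromDataStream(stream).as("word", "cnt");

        // Table -> DataStream (append-only result).
        DataStream<Row> back = tableEnv.toDataStream(table.select($("word"), $("cnt")));
        back.print();

        env.execute("table-conversion");
    }
}
```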

However, the original Table API had a few shortcomings:
(1) Table API queries always had to be embedded in a DataSet or DataStream program.
(2) Queries on batch Tables did not support outer joins, sorting, and many scalar functions that are common in SQL queries.
(3) Queries on streaming tables supported only filters, unions, and projections, not aggregations or joins.
(4) The translation process did not apply any query-optimization techniques beyond the physical optimizations that apply to all DataSet programs.

Combining the Table API and SQL
  • The new Table API is built on top of Apache Calcite, a popular SQL parser and optimizer framework.

The new architecture combining the Table API and SQL queries is shown below:
[Figure: the unified Table API / SQL architecture built on Apache Calcite]

The new architecture features two integrated APIs to specify relational queries, the Table API and SQL.

Queries of both APIs are validated against a catalog of registered tables and converted into Calcite’s representation for logical plans. In this representation, stream and batch queries look exactly the same.

Next, Calcite’s cost-based optimizer applies transformation rules and optimizes the logical plans. Depending on the nature of the sources (streaming or static) we use different rule sets.

Finally, the optimized plan is translated into a regular Flink DataStream or DataSet program. This step again involves code generation to compile relational expressions into Flink functions.
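
To make this concrete, the sketch below (reusing the env/tableEnv setup from the earlier snippet; the Orders view, its columns, and ordersStream are hypothetical) expresses the same aggregation once through the Table API and once as a SQL query; both are validated against the catalog and converted through the same Calcite logical-plan representation:

```java
// Register the input stream as a view in the catalog (schema is made up).
tableEnv.createTemporaryView("Orders", ordersStream);

// Table API style.
Table apiResult = tableEnv.from("Orders")
        .groupBy($("product"))
        .select($("product"), $("amount").sum().as("total"));

// Equivalent SQL style; both end up as the same logical plan.
Table sqlResult = tableEnv.sqlQuery(
        "SELECT product, SUM(amount) AS total FROM Orders GROUP BY product");
```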

Two-phase commit protocol (2PC)

  • 2PC is an atomic commitment protocol (ACP).
  • 2PC is a distributed algorithm that coordinates all the processes participating in a distributed atomic transaction on whether to commit or abort (roll back) the transaction.
  • To accommodate recovery from failure, the protocol's participants use logging of the protocol's states. Log records, which are typically slow to generate but survive failures, are used by the protocol's recovery procedures.

A normal 2PC run consists of two phases:

  1. The commit-request phase (or voting phase): a coordinator process attempts to prepare all the transaction's participating processes (named participants, cohorts, or workers) to take the necessary steps for either committing or aborting the transaction and to vote, either "Yes": commit, or "No": abort (if a problem has been detected with the local portion)
  2. The commit phase: based on voting of the participants, the coordinator decides whether to commit (only if all have voted "Yes") or abort the transaction (otherwise), and notifies the result to all the participants. The participants then follow with the needed actions (commit or abort) with their local transactional resources

Drawback of 2PC: 2PC is a blocking protocol. If the coordinator fails permanently, some participants will never resolve their transactions: after a participant has sent an agreement message to the coordinator, it blocks until a commit or rollback is received.
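
A toy, in-memory sketch of the protocol in Java (all names are hypothetical; real implementations also write durable log records around each step so the recovery procedure can resume after a crash):

```java
import java.util.List;

interface Participant {
    boolean prepare();   // voting phase: true = "Yes" (ready to commit), false = "No"
    void commit();
    void abort();
}

class Coordinator {
    private final List<Participant> participants;

    Coordinator(List<Participant> participants) {
        this.participants = participants;
    }

    void runTransaction() {
        // Phase 1: commit-request (voting) phase.
        boolean allYes = true;
        for (Participant p : participants) {
            if (!p.prepare()) {        // any "No" vote forces an abort
                allYes = false;
                break;
            }
        }
        // Phase 2: commit phase -- broadcast the decision to everyone.
        for (Participant p : participants) {
            if (allYes) {
                p.commit();
            } else {
                p.abort();
            }
        }
        // Blocking hazard: a participant that voted "Yes" must wait for this
        // decision; if the coordinator dies before sending it, it is stuck.
    }
}
```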

End-to-end Exactly-Once processing in Apache Flink (implemented with checkpoints + 2PC)

checkpoint

A checkpoint in Flink is a consistent snapshot of:

  1. The current state of an application
  2. The position in an input stream
  • Flink generates checkpoints periodically and writes them to a persistent storage system such as HDFS. Writing checkpoint data to persistent storage happens asynchronously, i.e., the Flink application continues to process data while a checkpoint is being taken.
  • If a machine or software failure occurs, the Flink application resumes processing from the most recently completed checkpoint upon restart: Flink restores the application state, rolls back to the correct position in the input stream recorded by the checkpoint, and then starts processing again.
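
Enabling checkpoints takes only a few lines of application code; a minimal sketch (the interval and storage path are made-up values):

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetup {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a consistent snapshot every 10 seconds with exactly-once mode.
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

        // Persist checkpoint data to durable storage such as HDFS.
        env.getCheckpointConfig().setCheckpointStorage("hdfs:///flink/checkpoints");
    }
}
```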

Before Flink 1.4.0, exactly-once semantics were limited to the scope of a Flink application and did not extend to most of the external systems that Flink sends data to after processing. But Flink applications run with a wide variety of data sinks, and developers should be able to maintain exactly-once semantics beyond the context of a single component.

To provide end-to-end exactly-once semantics, these external systems must, in addition to the state inside the Flink application, provide a way to commit or roll back writes that coordinates with Flink's checkpoints. A common approach for coordinating commits and rollbacks in a distributed system is the two-phase commit protocol (2PC).

2PC

In the sample Flink application, we have:

  • A data source that reads from Kafka (in Flink, a KafkaConsumer)
  • A windowed aggregation
  • A data sink that writes data back to Kafka (in Flink, a KafkaProducer)

For the data sink to provide exactly-once guarantees, it must write all data to Kafka within the scope of one transaction, and all writes between two checkpoints must be committed together; this is what makes it possible to roll the writes back in case of a failure.

However, in a distributed system with multiple concurrently running sink tasks, a simple commit or rollback is not enough, because all components must "agree" on committing or rolling back to guarantee a consistent result. Flink uses the two-phase commit protocol and its pre-commit phase to address this challenge.

pre-commit phase

[Figure: the pre-commit phase, with a checkpoint barrier flowing through the operators]

The starting of a checkpoint represents the “pre-commit” phase of our two-phase commit protocol.

  1. When a checkpoint starts, the Flink JobManager injects a checkpoint barrier into the data stream (the barrier separates the records that go into the current checkpoint from the records that go into the next one).

  2. The barrier is passed from operator to operator. For every operator, it triggers the operator's state backend to take a snapshot of its state.

  3. This approach works if an operator has internal state only, but there is also external state. External state usually comes in the form of writes to an external system such as Kafka. In that case, to provide exactly-once guarantees, the external system must support transactions that integrate with the two-phase commit protocol. So during the pre-commit phase, the data sink must pre-commit its external transaction in addition to writing its state to the state backend.

  4. The pre-commit phase finishes when the checkpoint barrier has passed through all of the operators and the triggered snapshot callbacks have completed. At this point the checkpoint has completed successfully and contains the state of the entire application, including the pre-committed external state. In case of a failure, we re-initialize the application from this checkpoint.

  5. The final step is to notify all operators that the checkpoint has succeeded.

This is the commit phase of the two-phase commit protocol: the JobManager issues a checkpoint-completed callback for every operator in the application. The data source and the window operator have no external state, so these operators do not have to take any action during the commit phase. The data sink, however, does have external state and uses the callback to commit the transaction containing the external writes.

[Figure: the commit phase, triggered by the JobManager's checkpoint-completed callbacks]

  • Once all of the operators complete their pre-commit, they issue a commit.
  • If at least one pre-commit fails, all others are aborted, and we roll back to the previous successfully-completed checkpoint.
  • After a successful pre-commit, the commit must be guaranteed to eventually succeed – both our operators and our external system need to make this guarantee. If a commit fails (for example, due to an intermittent network issue), the entire Flink application fails, restarts according to the user’s restart strategy, and there is another commit attempt. This process is critical because if the commit does not eventually succeed, data loss occurs.

Therefore, we can be sure that all operators agree on the final outcome of the checkpoint: all operators agree that the data is either committed or that the commit is aborted and rolled back.
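
Flink packages these hooks in the abstract TwoPhaseCommitSinkFunction class, which transactional sinks such as the exactly-once Kafka producer extend. Below is a skeletal sketch; the class name and the in-memory BufferTransaction are hypothetical stand-ins for a real external transaction:

```java
import org.apache.flink.api.common.ExecutionConfig;
import org.apache.flink.api.common.typeutils.base.VoidSerializer;
import org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer;
import org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction;

import java.util.ArrayList;
import java.util.List;

public class BufferingTwoPhaseSink
        extends TwoPhaseCommitSinkFunction<String, BufferingTwoPhaseSink.BufferTransaction, Void> {

    public static class BufferTransaction {
        final List<String> buffered = new ArrayList<>();
    }

    public BufferingTwoPhaseSink() {
        super(new KryoSerializer<>(BufferTransaction.class, new ExecutionConfig()),
              VoidSerializer.INSTANCE);
    }

    @Override
    protected BufferTransaction beginTransaction() {
        // Called when a new checkpoint period starts.
        return new BufferTransaction();
    }

    @Override
    protected void invoke(BufferTransaction txn, String value, Context context) {
        // All records between two checkpoints go into the open transaction.
        txn.buffered.add(value);
    }

    @Override
    protected void preCommit(BufferTransaction txn) {
        // Pre-commit phase: flush, but do not make the data visible yet.
        // (A Kafka sink would flush its producer here.)
    }

    @Override
    protected void commit(BufferTransaction txn) {
        // Commit phase: runs after the JobManager's checkpoint-completed
        // callback; must eventually succeed, so keep it idempotent.
        txn.buffered.forEach(System.out::println);
    }

    @Override
    protected void abort(BufferTransaction txn) {
        // Roll back: discard everything written since the last checkpoint.
        txn.buffered.clear();
    }
}
```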

Review questions

  1. Why do operators in stream processing have state?

Because in stream processing all data is treated as a stream that flows in continuously, an operator has to record how far it has gotten so that on the next read it picks up exactly the records that have not been processed yet. For example, a sum operator needs to know its current running sum so that when the next record arrives it can add the new value to the previous sum to produce the result.
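
A minimal sketch of such a running-sum operator using Flink's keyed ValueState (the class and state names are made up):

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Keeps the current sum in keyed state so each new record can be added to it.
public class RunningSum extends KeyedProcessFunction<String, Long, Long> {
    private transient ValueState<Long> sum;

    @Override
    public void open(Configuration parameters) {
        sum = getRuntimeContext().getState(
                new ValueStateDescriptor<>("sum", Long.class));
    }

    @Override
    public void processElement(Long value, Context ctx, Collector<Long> out) throws Exception {
        Long current = sum.value();            // null before the first record
        long updated = (current == null ? 0L : current) + value;
        sum.update(updated);                   // checkpointed with the operator
        out.collect(updated);
    }
}
```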

  2. How are data streams and dynamic tables converted into each other?

Converting a dynamic table into a real-time stream:

  • Append-only stream: a stream containing only INSERT messages
  • Retract stream: a stream containing both INSERT and DELETE messages
  • Upsert stream: a stream containing both UPSERT and DELETE messages
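
For example (a sketch: the tableEnv setup mirrors the earlier snippets, and the Words table is hypothetical), an aggregating query produces updates and therefore cannot be emitted as an append-only stream, but it can be emitted as a retract stream:

```java
// The boolean flag marks each message: true = INSERT, false = DELETE (retraction).
Table counts = tableEnv.sqlQuery(
        "SELECT word, COUNT(*) AS cnt FROM Words GROUP BY word");
DataStream<Tuple2<Boolean, Row>> retractStream =
        tableEnv.toRetractStream(counts, Row.class);
```
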
  3. Why does a Flink job need to consider failure recovery?

Unlike batch jobs, a real-time job generally runs indefinitely unless the user actively stops it. While it runs, it may encounter machine failures, network problems, issues with external storage, and so on. For a real-time job to keep running stably, it must be able to recover from failures automatically.

  4. Why does Flink need checkpoints for failure recovery?

When a batch job hits an exception, it can simply be recomputed from scratch. A real-time job, because it runs continuously, would be very costly to recompute from the beginning, especially one that has been running for a long time. Enabling checkpointing also shortens the time needed for fault-tolerant recovery, because state is restored from the latest checkpoint rather than from the state at program start.

  5. Why can't the state at an arbitrary moment be kept as the point of failure recovery?

A valid recovery point must guarantee that all data the source had consumed up to its saved offset has already been processed downstream; we must wait until every piece of processing logic has finished consuming the data at and before the source's saved position. Only such a point ensures that the states of all streams are consistent with one another.

  6. How much does Flink checkpointing affect job performance?

(1) Processing has to pause during snapshotting and barrier alignment, which still adds latency to data processing.
(2) Persisting the snapshot to remote storage can also be very time-consuming.

  7. How much does the two-phase commit protocol affect performance?
  • The coordinator is a single point of failure. If the coordinator goes down, the entire 2PC procedure stops working.
  • The protocol executes fully synchronously: every participant blocks while waiting for the other participants to respond, which causes performance problems under high concurrency.
  8. If the downstream system being written to does not support transactional reads and writes, can exactly-once semantics be achieved?

No.