What Are Apache Flink Watermarks? A Beginner’s Guide to Handling Late Arrival Data

What Are Flink Watermarks? A Beginner’s Guide to Handling Late Arrival Data | Image By Author

When I first started working with streaming data, I realized it was very different from batch data processing. We are no longer just processing static files or bounded datasets; we are attempting to make sense of a never-ending stream of data. In the beginning, some of the core concepts seemed strange. Terms like “late arrival data” and “watermarks” appeared abstract and confusing.

That is exactly why I decided to write this post. I will walk you through one of my favorite data processing tools—Apache Flink. My goal is to break down the core ideas behind the concept of watermarks in a way that is clear and intuitive, hopefully saving you some of the headaches I experienced.

What’s streaming data? 

You might already be very familiar with what a batch job is: you process data in a bounded manner, usually in daily (24-hour) chunks.

Imagine you’re not just processing a file of data that has a clear beginning and end, but instead, you’re dealing with a never-ending river of information. This “river” is what we call a data stream. It could be anything from user clicks on a website to transactions from a point-of-sale system to sensor readings from IoT devices.

Think of real-world data less like a steady faucet and more like ocean waves—sometimes it crashes in as a torrent, and other times it gently rolls in.

Batch is just a way humans define a boundary on top of streaming data for easier understanding and processing. 

Apache Flink is an open-source stream processing framework. Its primary job is to perform calculations and transformations on these continuous streams of data in real time.
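To make that concrete, here is a minimal sketch of a Flink streaming job in Java (the socket source, host, and port are illustrative assumptions, not part of this article's examples): it consumes an unbounded stream of lines and keeps running until you stop it.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.socketTextStream("localhost", 9999)   // unbounded source: lines from a socket
   .filter(line -> !line.isEmpty())       // a per-event transformation
   .print();                              // continuous sink: results flow out as long as data arrives

env.execute("streaming-demo");            // runs until the job is cancelled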

Why does time matter in streaming data?

In the world of streaming data, time is a crucial but tricky concept. We generally deal with two notions of time:

  • Processing Time: This refers to the moment when the Flink application actually processes the data. For the machine running the code, it represents the present.
  • Event Time: This is the time at which the event actually occurred at its source. Consider the moment when a user pressed a button within a mobile application. (The snippet after this list shows the two side by side.)
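As a quick illustration (click and its getTimestamp() method are placeholders assumed for this sketch), the same event carries both notions of time:

// Event time: when the user actually clicked, carried inside the record itself.
long eventTime = click.getTimestamp();

// Processing time: the moment this machine happens to handle the record.
long processingTime = System.currentTimeMillis();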

Why is knowing the difference between these two times so critical? Data seldom reaches its destination in the same sequence in which it originated. Due to network latency, system load, or other issues, an event that was created at 10:00:01 AM might arrive at the processing layer after an event that happened at 10:00:03 AM.

So, if you process data based on processing time, you might get inaccurate results. 

Imagine you’re counting the number of clicks per minute. If a click from 10:00:59 AM arrives at 10:01:02 AM, a processing-time-based system would incorrectly count it in the 10:01 minute window. In fact, we care about when the event actually happened, which brings us to the problem of late arrival data.
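In Flink, that difference comes down to which window assigner you choose. A minimal sketch (the surrounding stream and window function are omitted):

// Processing-time windows bucket events by arrival time, so the 10:00:59
// click that arrives at 10:01:02 lands in the wrong minute:
.window(TumblingProcessingTimeWindows.of(Time.minutes(1)))

// Event-time windows bucket events by when they actually happened:
.window(TumblingEventTimeWindows.of(Time.minutes(1)))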

Event Time vs. Processing Time | Image By Author

Dealing with the Unpredictable: Late Arrival Data

Late arrival data refers to data that arrives after you thought you had finished processing a specific time frame. Sticking with the click-count example: the 10:00:59 AM click that arrived on the server at 10:01:02 AM is considered late, and we can label such data as out of order (OOO).

Ignoring late data can result in inaccurate analytics and poor business decisions. So, how should we handle this? This is where watermarks come into play.

The Magic of Watermarks: Flink’s Internal Clock

A watermark is a special type of message that flows through your Flink stream alongside your data. It carries a timestamp and essentially tells Flink: “I believe that no more data with a timestamp earlier than this watermark’s timestamp will arrive.”

A Moving Line in the Sand | Image By Author

Think of it like a moving line in the sand. As Flink processes data, it observes the timestamps of the events and periodically generates watermarks based on them. When a watermark arrives, Flink treats it as a signal that all events with earlier timestamps can safely be assumed to have arrived.
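That “periodically” is literal: Flink emits watermarks on a configurable interval, for example:

// Generate watermarks every 200 milliseconds (Flink's default interval).
env.getConfig().setAutoWatermarkInterval(200);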

When Flink is working with different time windows, it will buffer the data in memory or local storage, then wait until a watermark passes the end of a window before it triggers the computation for that window. 

Flink knows that data can be out of order, so you can configure an out-of-orderness bound. For instance, you can instruct Flink to produce watermarks that lag 5 seconds behind the highest event timestamp it has seen so far. In effect, we postpone processing and give the data a 5-second grace period.
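Here is a minimal sketch of that configuration using Flink’s WatermarkStrategy API (the Event class and its getTimestamp() method, returning epoch milliseconds, are assumptions):

import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;

// Watermarks trail 5 seconds behind the highest event timestamp seen so far,
// giving out-of-order events a 5-second grace period.
WatermarkStrategy<Event> strategy = WatermarkStrategy
    .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
    .withTimestampAssigner((event, recordTimestamp) -> event.getTimestamp());

DataStream<Event> withWatermarks = stream.assignTimestampsAndWatermarks(strategy);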

Watermark Flow: Visualizing the Concept

Let’s break down the key parts and see how watermarks work in Flink.

The Data Stream (Flowing from Left to Right)

The Data Stream (Flowing from Left to Right) | Image By Author

Imagine a stream where each data point has a timestamp. You can see the data events, represented as blocks, flowing from left to right. Each block has its timestamp, representing the “event time” when the action actually occurred.

In the diagram above, you can see that the block for 10:02 comes after the block for 10:03. This shows that some events arrive out of order.

The Watermark (The Wavy, Progressing Line)

The Watermark (The Wavy, Progressing Line) | Image By Author

The glowing line moving through the events is the Watermark.

Think of it as Flink’s internal clock. It represents a point in time, indicating that the system believes all events that occurred prior to the watermark’s time have been received. Notice how the events behind the watermark are slightly dimmed; the effect visually represents that they have been “seen” and acknowledged by the system, bringing a sense of order to the stream.

Time Windows (The Grouped Blocks)

Time Windows (The Grouped Blocks) | Image By Author

The visualization depicts the stream being divided into “Time Windows.” These are the buckets you create in your Flink application to organize data for calculations (for example, “count all clicks between 10:00 and 10:05”).

A window is considered “open” and will continue to collect data until the watermark indicates the end of the window’s timeframe.

Triggering the Calculation

Triggering the Calculation | Image By Author

This is the most critical aspect of the concept. When the watermark reaches the end of a time window, it initiates the processing for that window.

When the watermark for 10:05:00 crosses the 10:00 - 10:05 window, Flink acknowledges that it has received all data for that 5-minute interval. It then performs the calculation (like an aggregation or sum) on all the events collected in that window and produces a result.
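A minimal sketch of such a windowed calculation (ClickEvent, getPageId(), and the clicks stream are assumptions, and the stream must already have timestamps and watermarks assigned):

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// Count clicks per page in 5-minute event-time windows. Each window's result
// is emitted only after the watermark passes the window's end time.
DataStream<Tuple2<String, Long>> clicksPerWindow = clicks
    .map(click -> Tuple2.of(click.getPageId(), 1L))
    .returns(Types.TUPLE(Types.STRING, Types.LONG))
    .keyBy(t -> t.f0)
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .sum(1);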

What if data is really late?

The grace period doesn’t catch everything; it is a buffer that gives Flink the chance to capture the majority of the data within our agreed SLA. Even so, some data may still arrive exceptionally late. Flink gives you options to handle these stragglers:

  1. Dropping: The simplest option is to just ignore them.
  2. Allowed Lateness: You can define an “allowed lateness” period after a window has been initially processed. If data for that window arrives within this period, Flink will re-run the calculation for that window and update the result, as the snippet below shows. But you’d need to make sure your calculation is idempotent.
// Keep each 1-minute window around for 5 extra seconds after its watermark
// passes; late events arriving in that period re-trigger the calculation.
.window(TumblingEventTimeWindows.of(Time.minutes(1)))
.allowedLateness(Time.seconds(5))

  3. Side Output: You can redirect truly late data to a separate stream or a dead-letter queue (DLQ). This is useful for logging, debugging, or applying a different kind of processing to the late events, so you can reprocess the data later in a batch manner.

// The trailing {} creates an anonymous subclass so Flink can capture the
// element type, which a plain new OutputTag<>() would lose to type erasure.
OutputTag<Event> lateOutputTag = new OutputTag<Event>("late-data") {};

SingleOutputStreamOperator<Result> result = stream
    .windowAll(...)
    .sideOutputLateData(lateOutputTag)
    .process(...); // your window function; Result stands in for its output type

DataStream<Event> lateStream = result.getSideOutput(lateOutputTag);

Final Thoughts

Watermarks are a fundamental concept in Apache Flink that enables accurate processing of out-of-order data. By providing a mechanism for tracking event-time progress and handling the inherent challenges of late-arriving data, they let Flink extract meaningful insights from a continuous flow of information.

Understanding watermarks becomes increasingly important as you go deeper into stream processing and real-time applications; they are critical to extracting correct insights from streaming data.

 

About Me

I hope my stories are helpful to you. 

For data engineering posts, you can also subscribe to my new articles or become a referred Medium member, which also gets you full access to stories on Medium.

In case of questions/comments, do not hesitate to write in the comments of this story or reach me directly through LinkedIn or Twitter.
