Streaming data is an exciting space in the data field, and it has gained tremendous traction in recent years. With so much excitement, the open-source landscape has become crowded. Many technologies have made stream processing more straightforward than ever: Kafka, Flink, Spark, Storm, Beam, and others have been on the market for years and have built solid user bases.
“Let’s do streaming processing.”
It is an inevitable topic for data professionals. However, before anyone sells you on streaming, step back and ask yourself a simple question: Do I need streaming data for this use case? Before jumping in, let’s face the facts about streaming data in this story.
Before we look at the facts about streaming data, let’s first look at what streaming data is. Hadoop set the foundation for processing large datasets and empowered data professionals to design more sophisticated data processing tools.
Tyler Akidau’s 2013 paper, MillWheel: Fault-Tolerant Stream Processing at Internet Scale, set the basis for modern streaming and inspired streaming frameworks like Apache Flink.
I use the term “streaming,” you can safely assume I mean an execution engine designed for unbounded data sets, and nothing more. — Tyler Akidau
Let’s adopt Tyler’s definition and focus on unbounded data throughout this story.
Kappa vs. Lambda Architecture
We are all familiar with Lambda architecture: we run two independent data processing systems, batch and streaming, writing similar logic twice and processing the same data twice. The streaming layer is for speculation, and the batch layer is for accuracy and completeness.
On the other hand, we have Kappa architecture: a single pipeline with no duplicated code, which leverages Kafka to replay data when we need accuracy and completeness.
Ultimately, Kappa is a great idea for a well-designed system. However, such a system needs to treat data processing as a first-class citizen.
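To make the replay idea concrete, here is a minimal sketch in Python. It assumes an in-memory `ReplayableLog` class as a stand-in for a retained Kafka topic, and a hypothetical `count_clicks_per_user` function as the single piece of business logic. Neither name comes from Kafka or any real library. The point is only that the same function serves both the live run and the replay.

```python
from collections import defaultdict


def count_clicks_per_user(events):
    """The single piece of business logic, shared by live and replay runs."""
    counts = defaultdict(int)
    for event in events:
        if event["type"] == "click":
            counts[event["user"]] += 1
    return dict(counts)


class ReplayableLog:
    """Hypothetical in-memory stand-in for a retained, append-only log."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the appended record

    def read_from(self, offset=0):
        # Reading never consumes records, so any offset can be re-read later.
        return iter(self._records[offset:])


log = ReplayableLog()
for event in [
    {"user": "a", "type": "click"},
    {"user": "b", "type": "view"},
    {"user": "a", "type": "click"},
]:
    log.append(event)

# First pass over the log: the fast, possibly incomplete answer.
live_result = count_clicks_per_user(log.read_from(0))

# A late record lands in the log afterwards.
log.append({"user": "b", "type": "click"})

# Replay from offset 0 with the SAME code to get the complete answer.
replayed_result = count_clicks_per_user(log.read_from(0))

print(live_result)
print(replayed_result)
```

In a Lambda setup, the second pass would be a separate batch job with its own copy of the counting logic. Here it is the same function pointed at an earlier offset.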
Streaming Data Is Not a Silver Bullet
A while ago, there was an impression in the data processing world that “Streaming Data Is a Silver Bullet”: we would all move to streaming, and batch processing was an antique.
Before long, people realized that streaming data isn’t a silver bullet and could even make things worse:
- Streaming alone often isn’t enough to produce a complete dataset for analysis. A batch job is still required to close the gap caused by late-arriving data or processing errors.
- Streaming and batch processing usually speak different languages. Streaming usually runs in Java, Scala, or Go with frameworks like Apache Flink or Kafka Streams. Batch processing usually runs in Python, SQL, or R with frameworks like Apache Spark or a SQL engine. Duplicating the same logic for both batch and streaming is a headache, and it is one of the most challenging problems when running Lambda architecture in production.
Streaming Data Trade-offs
Data naturally arrives in a streaming fashion. Solving data problems in batch seems inappropriate at first, but batch has been popular for decades for a reason: processing data in batches is a simplifying approach to a complex problem.
There are significant trade-offs between batch processing and streaming processing.
Completeness vs. Speculation
Many data sources inevitably deliver data with a delay, especially when your analysis combines multiple sources. Batch processing is in an excellent position to handle completeness: it simply delays processing until everything is ready.
Streaming, on the other hand, can only do the same by waiting longer, which means keeping data in memory for hours or even a day, and that is expensive. Streaming can also deliver a complete dataset, but it requires the upstream data producers to cooperate on data consolidation, and it adds extra delay.
The right SLA for your use case
How fast do you need your streaming system to process data, and how much latency can you accept for your use case? Many batch ETL pipelines run daily. Is that too slow for your business? Many use cases are NOT bound by a strict SLA. Unlike advertising or day trading, a delay of a few hours won’t stop the company from operating normally.
Streaming Data is not easy to maintain
Late arrival data
One inevitable fact for any data processing system is: Data Arrives Late. A well-designed system can dodge this problem only occasionally.
In batch processing, late-arriving data is not a big concern, since data is processed well after its event date and the SLA isn’t strict down to minutes or hours. People who work in batch processing have looser expectations: data may take 24 hours or more to arrive.
Streaming isn’t a “catch-all” solution. Concepts like the watermark give us an additional buffer to process late-arriving data. However, the watermark is just another way of keeping data in memory for some time. Memory isn’t free: at some point, the watermark has to advance, and you must decide whether to drop a late record or send it to a dead-letter queue for another process (batch processing) to reprocess.
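The trade-off can be sketched in a few lines of plain Python. This is a simplified illustration, not how Flink or Beam implement watermarks. It assumes tumbling event-time windows, a watermark that trails the largest event time seen by a fixed amount, and hypothetical values for `WINDOW_SIZE` and `ALLOWED_LATENESS`. The function name `process` and the sample events are made up for the example.

```python
from collections import defaultdict

WINDOW_SIZE = 60        # seconds per tumbling window (hypothetical value)
ALLOWED_LATENESS = 30   # how far the watermark trails the max event time (hypothetical value)


def process(events):
    """events: iterable of (event_time_seconds, value) in arrival order."""
    open_windows = defaultdict(list)  # window_start -> values held in memory
    results = {}                      # window_start -> sum, emitted on close
    dead_letter = []                  # records whose window had already closed
    watermark = float("-inf")

    for event_time, value in events:
        window_start = (event_time // WINDOW_SIZE) * WINDOW_SIZE

        # The window for this record is already closed: too late to include.
        if window_start + WINDOW_SIZE <= watermark:
            dead_letter.append((event_time, value))
            continue

        open_windows[window_start].append(value)
        watermark = max(watermark, event_time - ALLOWED_LATENESS)

        # Close every window that ends at or before the watermark, freeing memory.
        for start in sorted(open_windows):
            if start + WINDOW_SIZE <= watermark:
                results[start] = sum(open_windows.pop(start))

    return results, dead_letter, dict(open_windows)


events = [(5, 1), (20, 2), (70, 3), (50, 4), (95, 5), (10, 6)]
results, dead_letter, still_open = process(events)
print(results)      # closed windows and their sums
print(dead_letter)  # records left for a later (batch) reprocess
print(still_open)   # windows still buffered in memory
```

The record with event time 50 arrives out of order but still lands in its window, because the watermark has not passed it yet. The record with event time 10 arrives after its window closed, so it goes to the dead-letter list. Raising `ALLOWED_LATENESS` would rescue it, at the cost of holding every window in memory longer.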
Maintaining a streaming application is demanding. With batch processing, you have a downtime window in which you can relax, fix a bug, or drink a coffee.
Streaming data processing, by contrast, has to run 24/7 with minimal downtime. Your on-call team must monitor and fix potential data issues to keep the pipeline running. Streaming might sound exciting, but being on call as a data engineer who maintains a streaming pipeline takes a lot of work.
Joining data is far more complex
Joining data across multiple streams isn’t trivial. In batch, joining is easy: you stitch two bounded tables together on a set of common keys. In streaming, joining two unbounded datasets requires careful consideration.
A bigger question arises: How do we know whether there are still incoming records we need to consider? Tyler’s Streaming 102 has a great example that demonstrates this. tl;dr: joining data across different streams is far more complex than in batch processing.
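A toy version of the problem, sketched in Python, shows where the difficulty comes from. It assumes at most one record per key on each side and a hypothetical retention limit `JOIN_WINDOW`. The function `stream_join` and the sample order and payment records are invented for illustration and do not mirror any framework's join API.

```python
JOIN_WINDOW = 100  # seconds an unmatched record is kept in memory (hypothetical value)


def stream_join(arrivals):
    """arrivals: iterable of (side, key, event_time, payload); side is "left" or "right"."""
    buffers = {"left": {}, "right": {}}  # side -> key -> (event_time, payload)
    joined, expired = [], []

    for side, key, event_time, payload in arrivals:
        other = "right" if side == "left" else "left"

        # Evict buffered records that have waited longer than JOIN_WINDOW.
        for buffered_side in ("left", "right"):
            stale = [
                k for k, (t, _) in buffers[buffered_side].items()
                if event_time - t > JOIN_WINDOW
            ]
            for k in stale:
                expired.append((buffered_side, k, buffers[buffered_side].pop(k)[1]))

        match = buffers[other].pop(key, None)
        if match is None:
            # No partner yet: we cannot know whether one will ever arrive.
            buffers[side][key] = (event_time, payload)
        elif side == "left":
            joined.append((key, payload, match[1]))
        else:
            joined.append((key, match[1], payload))

    return joined, expired, buffers


arrivals = [
    ("left", "o1", 0, "order-1"),
    ("left", "o2", 10, "order-2"),
    ("right", "o1", 40, "payment-1"),
    ("left", "o3", 120, "order-3"),
    ("right", "o2", 130, "payment-2"),
]
joined, expired, buffers = stream_join(arrivals)
print(joined)   # pairs matched in time
print(expired)  # records evicted before their partner showed up
print(buffers)  # records still waiting in memory
```

Here `order-2` is evicted before `payment-2` arrives, so the pair is never joined even though both records exist. A batch join over the two complete tables would have matched them. A longer `JOIN_WINDOW` fixes this case but grows the state held in memory.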
Before adopting a streaming application, it’s critical to understand whether your use case suits it. Processing data in a streaming fashion is exciting and attractive.
However, the excitement comes at a cost. Batch processing is more straightforward and has been proven over decades. Carefully weigh the pros and cons before blindly jumping into stream processing.
I hope my stories are helpful to you.