Introduction to Spark Structured Streaming - Part 10 : Ingestion Time

Structured Streaming is a new streaming API, introduced in spark 2.0, rethinks stream processing in spark land. It models stream
as an infinite table, rather than discrete collection of data. It’s a radical departure from models of other stream processing frameworks like
storm, beam, flink etc. Structured Streaming is the first API to build stream processing on top of SQL engine.

Structured Streaming was in alpha in 2.0 and 2.1. But with release 2.2 it has hit stable status. In next few releases,
it’s going to be de facto way of doing stream processing in spark. So it will be right time to make ourselves familiarise
with this new API.

In this series of posts, I will be discussing about the different aspects of the structured streaming API. I will be discussing about
new API’s, patterns and abstractions to solve common stream processing tasks.

This is the tenth post in the series. In this post, we discuss about ingestion time abstraction. You
can read all the posts in the series here.

Ingestion Time Abstraction

In last post we discussed about the processing time abstraction. In this post, we will discuss about ingestion time abstraction.

Ingestion time, as name indicates, is a time captured at the ingestion of data. Sources like kafka, file stream capture the time of ingestion to provide the ordering guarantees. We can use this time captured at the source as mechanism for the processing data.

Ingestion time abstraction is useful when each batch of data takes considerable amount to process and we still want to capture the matrices depending upon time of capture. Let’s say we have some sensor data coming in and we want to know what happened in last 10 seconds. If our source captures the data as soon as it’s sent from sensors, ingestion time will be closer to the real time. For some reason, stream processing took more time than usual. If we have used processing time abstraction, we would have lost some records for our metric calculation. But in the ingestion time abstraction we will still get all the right records.

In this post, we will discuss how to define a window on ingestion time with simple wordcount example.

Windowed WordCount using Ingestion Time

The below are the steps to implement windowed word count using ingestion time abstraction.

Defining Window

In above code, we define window as part of groupby. It looks exactly same as we saw in processing time example in last post. Only the difference is that this groups the data depending upon when it’s ingested whereas earlier example groups the data upon when it’s received for processing.

Once we do groupBy, we do the count and sort by window to observe the results.

Query

Once we have defined the window, we can setup the execution using query.

As you can observe from the above output, we have counted both records in same window, even though processing may have taken more than 30 seconds. This shows that spark is using ingested time for grouping than the time at which it has received data.