4.
@s_kontopoulos
Insights
Data Insight: a conclusion, a piece of information that can be used to take
action and optimize a decision-making process.
Customer Insight: a non-obvious understanding of your customers which, if
acted upon, has the potential to change their behaviour for mutual benefit.
Customer insight, Wikipedia
DATA → INFO → INSIGHTS → ACTIONS

7.
Streaming Analytics
“Streaming Analytics is the acquisition and analysis of data at the moment it
streams into the system. It is done in a near-real-time (NRT) fashion, and the
analysis results trigger specific actions for the system to execute.“
● No constraints or deadlines of the kind that exist in real-time (RT) systems
● The end-to-end processing delay varies and depends on the application (from
under 1 ms to minutes)

13.
Streaming Platforms
It is an ecosystem/environment that supports building and running streaming
applications. At its core it uses a streaming engine. Typical components:
● A durable pub/sub component to fetch or store data
● A streaming engine
● A registry for storing metadata about the data, such as its format

18.
DataFlow Execution Model
The user defines computations/operations (map, flatMap, etc.) on the data sets
(bounded or unbounded) as a DAG. The data sets are treated as immutable
distributed data. The DAG is shipped to the nodes where the data lie, the
computation is executed there, and the results are sent back to the user.
[Figures: Spark model example; Flink model (FLIP-6)]
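The model above can be sketched in a few lines of plain Python: a toy, single-node simulation (not any engine's actual API) in which map/flatMap calls only build a plan, and nothing runs until results are requested.

```python
# Toy sketch of the dataflow model: operations are recorded as a deferred
# plan (here a linear chain, the simplest DAG) over an immutable source,
# and executed only when collect() is called.

class Dataset:
    def __init__(self, source, ops=()):
        self.source = source      # the (immutable) input data
        self.ops = list(ops)      # deferred operations forming the plan

    def map(self, f):
        return Dataset(self.source, self.ops + [("map", f)])

    def flat_map(self, f):
        return Dataset(self.source, self.ops + [("flatMap", f)])

    def collect(self):
        # "Execution": run the recorded plan over the data.
        data = list(self.source)
        for kind, f in self.ops:
            if kind == "map":
                data = [f(x) for x in data]
            else:  # flatMap
                data = [y for x in data for y in f(x)]
        return data

lines = Dataset(["a b", "c"])
words = lines.flat_map(str.split).map(str.upper)
print(words.collect())  # ['A', 'B', 'C']
```

In a real engine the plan is additionally optimized, partitioned, and shipped to the nodes holding the data, but the user-facing idea is the same: declare transformations, then trigger execution.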

23.
Analyzing Data Streams
● Data flows from one or more sources through the engine and is written to one
or more sinks.
● Two cases of processing:
○ Single-event processing: event transformation, e.g. triggering an alarm on an error event
○ Event aggregations: summary statistics, group-by, join, and similar queries. For example,
compute the average temperature over the last 5 minutes from a sensor data stream.
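The temperature example can be sketched as a toy, in-memory aggregation (an assumed simplification: tumbling 5-minute windows over a finite list, rather than a true unbounded stream):

```python
from collections import defaultdict

# Toy event aggregation: average temperature per tumbling 5-minute window,
# from a stream of (timestamp_in_seconds, temperature) readings.

WINDOW = 5 * 60  # 5 minutes in seconds

def tumbling_averages(readings):
    sums = defaultdict(lambda: [0.0, 0])   # window_start -> [sum, count]
    for ts, temp in readings:
        start = (ts // WINDOW) * WINDOW    # assign the reading to its window
        acc = sums[start]
        acc[0] += temp
        acc[1] += 1
    return {start: s / n for start, (s, n) in sums.items()}

stream = [(10, 20.0), (150, 22.0), (400, 30.0)]  # windows [0,300) and [300,600)
print(tumbling_averages(stream))  # {0: 21.0, 300: 30.0}
```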

24.
Analyzing Data Streams
● Event aggregation introduces the concept of windowing with respect to the
notion of time selected:
○ Event time (the time the events happened): important for most use cases where context and
correctness matter at the same time. Examples: billing applications, anomaly detection.
○ Processing time (the time the events are observed during processing): for use cases where
only what is processed within a window matters. Example: accumulated clicks on a page per second.
○ System arrival or ingestion time (the time the events arrived at the streaming system).
● Ideally event time = processing time. In reality, there is skew.
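The skew is easy to see in a toy example (assumed setup: each event carries both timestamps, windows are 10 seconds): the same out-of-order events give different window counts depending on which notion of time is used.

```python
# Toy comparison: group the same events into 10-second windows by event time
# vs. by processing (arrival) time.

WINDOW = 10

def window_counts(events, key):
    counts = {}
    for e in events:
        w = (key(e) // WINDOW) * WINDOW
        counts[w] = counts.get(w, 0) + 1
    return counts

# (event_time, processing_time): the third event arrives much later than it happened
events = [(1, 2), (3, 4), (9, 21), (12, 13)]

by_event_time = window_counts(events, key=lambda e: e[0])
by_processing_time = window_counts(events, key=lambda e: e[1])
print(by_event_time)       # {0: 3, 10: 1}
print(by_processing_time)  # {0: 2, 20: 1, 10: 1}
```

The event that happened at t=9 but arrived at t=21 is counted in the first window under event time, but in a later window under processing time.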

26.
Analyzing Data Streams
● Watermarks: a watermark indicates that no elements with a timestamp older
than or equal to the watermark timestamp should still arrive for the specific
window of data. It marks the progress of event time.
● Triggers: decide when a window is evaluated or purged. They affect latency
and the state kept.
● Late data: a threshold defines how late data may be compared to the current
watermark value.
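These three ideas can be combined in a toy, single-key simulation (assumptions: 10-second windows, a watermark that simply trails the maximum seen event time by a fixed allowed lateness, and a trigger that fires once per window when the watermark passes its end):

```python
# Toy watermark/trigger simulation: buffer values per window, advance a
# watermark as events arrive, fire windows the watermark has passed, and
# drop data that is later than the allowed lateness.

WINDOW = 10
LATENESS = 5

def process(events):
    buffers, emitted, max_ts = {}, [], 0
    for ts, value in events:
        max_ts = max(max_ts, ts)
        watermark = max_ts - LATENESS      # watermark trails max event time
        w = (ts // WINDOW) * WINDOW
        if w + WINDOW <= watermark:
            continue                       # too late for its window: dropped
        buffers.setdefault(w, []).append(value)
        # Trigger: emit (sum) every window whose end the watermark passed.
        for start in sorted(list(buffers)):
            if start + WINDOW <= watermark:
                emitted.append((start, sum(buffers.pop(start))))
    return emitted, buffers

emitted, pending = process([(1, 1), (9, 2), (14, 3), (16, 4), (3, 5)])
print(emitted)  # [(0, 3)] -- window [0,10) fires once the watermark reaches 11
print(pending)  # {10: [3, 4]} -- still open, watermark has not passed 20
```

Note how the late event (3, 5) arrives after the watermark passed its window and is dropped, and how a larger LATENESS would keep windows open longer, trading latency and state for completeness.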

31.
Processing Guarantees
Many things can go wrong…
● At-most once
● At-least once
● Exactly once
What are the boundaries?
● Within the streaming engine?
● End-to-end, including sources and sinks?
● Side effects, like calling an external service?
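A toy illustration of why the boundary matters (an assumed, simplified setup, not any engine's mechanism): at-least-once delivery may redeliver a message after a failure, and a sink that deduplicates by message id turns that into effectively-exactly-once results at the sink.

```python
# Toy idempotent sink: at-least-once delivery can redeliver "m1", but
# deduplicating on the message id keeps the result correct.

class DedupSink:
    def __init__(self):
        self.seen = set()   # ids already applied
        self.total = 0

    def write(self, msg_id, amount):
        if msg_id in self.seen:
            return          # duplicate redelivery: ignore
        self.seen.add(msg_id)
        self.total += amount

sink = DedupSink()
deliveries = [("m1", 10), ("m2", 5), ("m1", 10)]  # m1 redelivered after a retry
for msg_id, amount in deliveries:
    sink.write(msg_id, amount)
print(sink.total)  # 15, not 25
```

Side effects such as calling an external service are harder: unless the call itself is idempotent, a redelivery still performs it twice.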

32.
Table Stream Duality
Stream → table: the aggregation of a stream of updates over time yields a
table.
Table → stream: the observation of changes to a table over time yields a
stream.
Why is this useful?
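Both directions fit in a few lines of toy Python (an assumed simplification: a changelog of (key, value) updates where the latest value per key wins):

```python
# Toy table/stream duality: fold a changelog stream into a table, and
# record table updates back out as a changelog stream.

def stream_to_table(changelog):
    table = {}
    for key, value in changelog:
        table[key] = value          # latest update per key wins
    return table

def table_updates_to_stream(table, updates):
    changelog = []
    for key, value in updates:
        table[key] = value
        changelog.append((key, value))  # observe each change as an event
    return changelog

changelog = [("alice", 1), ("bob", 2), ("alice", 3)]
print(stream_to_table(changelog))                     # {'alice': 3, 'bob': 2}
print(table_updates_to_stream({}, changelog) == changelog)  # True
```

This is useful because the two views are interchangeable: a table can be rebuilt from its changelog (e.g. for recovery), and a stream of changes can always be materialized as queryable state.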

33.
Streaming SQL Queries
Semantics? How do we define a join on an unbounded stream? A table join?
There is joint work documented at:
https://docs.google.com/document/d/1wrla8mF_mmq-NW9sdJHYVgMyZsgCmHumJJ5f5WUzTiM/
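One common way to give a join meaning on unbounded streams is to bound it by time. A toy sketch (assumptions: tiny finite lists standing in for streams, a 10-second join window, nested-loop matching rather than real streaming state):

```python
# Toy time-windowed stream join: match records from two streams with the
# same key whose event times lie within JOIN_WINDOW of each other.

JOIN_WINDOW = 10

def windowed_join(left, right):
    # left/right: lists of (event_time, key, value)
    results = []
    for lt, lk, lv in left:
        for rt, rk, rv in right:
            if lk == rk and abs(lt - rt) <= JOIN_WINDOW:
                results.append((lk, lv, rv))
    return results

clicks = [(5, "u1", "click"), (50, "u2", "click")]
views = [(8, "u1", "view"), (100, "u2", "view")]
print(windowed_join(clicks, views))  # [('u1', 'click', 'view')]
```

u2's view is 50 seconds after its click, outside the window, so no result is produced; bounding the join in time is what keeps the state finite on an unbounded stream.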

39.
Kafka Streams vs Beam Model
- A trigger is more of an operational aspect compared to business parameters
like the window length: how often the computation is updated (affecting
latency and state size) is a non-functional requirement.
- A table covers both the case of immutable data and the case of updatable
data.