Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our User Agreement and Privacy Policy.

Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our Privacy Policy and User Agreement for details.

Data all over the place! How SQL and Apache Calcite bring sanity to streaming and spatial data

The revolution has happened. We are living the age of the deconstructed database. The modern enterprises are powered by data, and that data lives in many formats and locations, in-flight and at rest, but somewhat surprisingly, the lingua franca for remains SQL.

In this talk, Julian describes Apache Calcite, a toolkit for relational algebra that powers many systems including Apache Beam, Flink and Hive. He discusses some areas of development in Calcite: streaming SQL, materialized views, enabling spatial query on vanilla databases, and what a mash-up of all three might look like.

He also describes how SQL is being extended to handle streaming, and the challenges that will need to be solved if it is to become standard.

17.
Stream-table duality
select *
from Orders
where units > 1000
➢ Yes, you can use a stream as
a table
➢ And you can use a table as a
stream
➢ Actually, Orders is both
➢ Use the stream keyword
➢ Where to actually find the
data? That’s up to the system
select stream *
from Orders
where units > 1000

18.
Hybrid query combines a stream with its
own history
● Orders is used as both as stream
and as “stream history” virtual table
● “Average order size over last year”
should be maintained by the system,
i.e. a materialized view
select stream *
from Orders as o
where units > (
select avg(units)
from Orders as h
where h.productId = o.productId
and h.rowtime > o.rowtime
- interval ‘1’ year)
“Orders” used
as a stream
“Orders” used as
a “stream history”
virtual table

19.
Other operations
Other relational operations make sense on streams (usually only if there is an
implicit time bound).
Examples:
● order by – E.g. Each hour emit the top 10 selling products
● union – E.g. Merge streams of orders and shipments
● insert, update, delete – E.g. Continuously insert into an external table
● exists, in sub-queries – E.g. Show me shipments of products for which
there has been no order in the last hour
● view – Expanded when query is parsed; zero runtime cost
● match_recognize – Complex event processing (CEP)

20.
“Standard streaming SQL”
Consistent semantics over tables (if database contains both streams and tables)
Consistent semantics over streams. The replay principle:
A streaming query produces the same result as the corresponding
non-streaming query would if given the same data in a table.
Consistent look and feel (e.g. same data types, keywords)
Some hard problems remain... mainly concerning time…

21.
Controlling when data is emitted
Early emission is the defining
characteristic of a streaming query.
The emit clause is a SQL extension
inspired by Apache Beam’s “trigger”
notion. (Still experimental… and
evolving.)
A relational (non-streaming) query is
just a query with the most conservative
possible emission strategy.
select stream productId,
count(*) as c
from Orders
group by productId,
floor(rowtime to hour)
emit at watermark,
early interval ‘2’ minute,
late limit 1
select *
from Orders
emit when complete

22.
Event time
Example streams and query so far have used a rowtime column
● Is rowtime a system column, or can streams choose their own sort key?
● Can a stream have more than one sort key?
● Is rowtime’s type always TIMESTAMP?
● Is a stream totally sorted on rowtime? Alternatives:
○ K-sorted: Every row is within 100 rows of sorted
○ T-sorted: Every row is within 5 minutes of sorted
○ Bounded latency: Every row is within 10 minutes of wallclock time
○ Watermark: An upper-bound on rows’ delay, allows early emit

23.
Event time in queries
We’ve sorted out event time in streams… what
about derived streams (queries / views)?
Given a streaming query:
● What are its sort keys?
● What is the maximum delay of rows?
● Does the system refuse to run queries with
unbounded delay?
SQL syntax to promote a query column to a sort key
select count(*)
from Orders
group by floor(
rowtime to hour)
select count(*)
from Orders
select shipTime
as rowtime
from Orders
order by shipTime
?

24.
Models of relations
A table is a multi-set of records; contents vary over time, with some kind
transactional consistency
A temporal table is like a table, but you can also query its contents at previous
times; usually mapped to a base table with (start, end) columns
So, what is a stream?
1. A stream is an append-only table with an event-time column
2. Or, if you speak monads[3], a stream is a function: Stream x → Time → Bag x
3. Or, a stream is the time-derivative of a table
Are a table, its temporal table, and its stream one catalog object or three?

25.
The “pie chart” problem
● Task – Write dashboard of orders over last hour
● Problem – The Orders stream only contains
the current few records
● Solution – Materialize short-term history
● Question – Does the developer maintain that
mapping, or does the DBMS do it automatically?
Orders over the last hour
Beer
48%
Cheese
30%
Wine
22%
select productId, count(*)
from Orders
where rowtime > current_timestamp - interval ‘1’ hour
group by productId

26.
Transactions & isolation
In a regular database, INSERT, UPDATE and DELETE are transactional
operations; SELECT is not
In a streaming database, SELECT might be transactional too
● Consider a pub-sub system: the system acknowledges when you send a
message, you must acknowledge when you receive a message
In a regular database query, snapshot isolation freezes the database in time; in a
streaming query, we don’t want that

32.
Hilbert space-filling curve
● A space-filling curve invented by mathematician David Hilbert
● Every (x, y) point has a unique position on the curve
● Points near to each other typically have Hilbert indexes close together

35.
More spatial queries
What state am I in? (point-to-polygon)
Which states does Yellowstone NP intersect?
(polygon-to-polygon)

36.
More spatial queries
What state am I in? (point-to-polygon)
Which states does Yellowstone NP intersect?
(polygon-to-polygon)
SELECT *
FROM States AS s
WHERE ST_Intersects(s.geometry,
ST_MakePoint(6, 7))
SELECT *
FROM States AS s
WHERE ST_Intersects(s.geometry,
ST_GeomFromText('LINESTRING(...)'))

37.
Point-to-polygon query
What state am I in? (point-to-polygon)
1. Divide the plane into tiles, and pre-compute
the state-tile intersections
2. Use this ‘tile index’ to narrow list of states
SELECT s.*
FROM States AS s
WHERE s.stateId IN (SELECT stateId
FROM StateTiles AS t
WHERE t.tileId = 8)
AND ST_Intersects(s.geometry, ST_MakePoint(6, 7))

38.
Building the tile index
Use the ST_MakeGrid function to decompose each
state into a series of tiles
Store the results in a table, sorted by tile id
A materialized view is a table that remembers how
it was computed, so the planner can rewrite queries
to use it
CREATE MATERIALIZED VIEW StateTiles AS
SELECT s.stateId, t.geometry, t.tileId, t.id_col, t.id_row
FROM States AS s,
LATERAL TABLE(ST_MakeGrid(s.stateId, 4, 4)) AS t

39.
Streaming + spatial
Every minute, emit the number of journeys that have intersected each city. (Some
journeys intersect multiple cities.)
(Efficient implementation is left as an exercise to the reader. Probably involves
splitting journeys into tiles, partitioning by tile hash-code, intersecting with cities
in those tiles, then rolling up cities.)
SELECT STREAM c.name, COUNT(*)
FROM Journeys AS j
CROSS JOIN Cities AS c
ON ST_Intersects(c.geometry, j.geometry)
GROUP BY c.name, FLOOR(j.rowtime TO HOUR)

44.
Talk proposal
Bio
Julian Hyde is an expert in query optimization and in-memory analytics. He founded Apache Calcite, a framework for
query optimization and data virtualization. He also founded Mondrian, the popular open source OLAP engine, and
co-founded SQLstream, an early streaming SQL platform. He is an architect at Looker.
Title
Data all over the place! How Apache Calcite brings SQL and sanity to streaming and spatial data
Abstract
The revolution has happened. We are living the age of the deconstructed database. The modern enterprises are powered
by data, and that data lives in many formats and locations, in-flight and at rest, but somewhat surprisingly, the lingua
franca for remains SQL. In this talk, Julian describes Apache Calcite, a toolkit for relational algebra that powers many
systems including Apache Beam, Flink and Hive. He discusses some areas of development in Calcite: streaming SQL,
materialized views, enabling spatial query on vanilla databases, and what a mash-up of all three might look like.

45.
Other applications of data profiling
Query optimization:
● Planners are poor at estimating selectivity of conditions after N-way join
(especially on real data)
● New join-order benchmark: “Movies made by French directors tend to have
French actors”
● Predict number of reducers in MapReduce & Spark
“Grokking” a data set
Identifying problems in normalization, partitioning, quality
Applications in machine learning?

46.
Further improvements
● Build sketches in parallel
● Run algorithm in a distributed framework (Spark or MapReduce)
● Compute histograms
○ For example, Median age for male/female customers
● Seek out functional dependencies
○ Once you know FDs, a lot of cardinalities are no longer “surprising”
○ FDs occur in denormalized tables, e.g. star schemas
● Smarter criteria for stopping algorithm
● Skew/heavy hitters. Are some values much more frequent than others?
● Conditional cardinalities and functional dependencies
○ Does one partition of the data behave differently from others? (e.g. year=2005, state=LA)

51.
Join stream to a changing table
Execution is more difficult if the
Products table is being changed
while the query executes.
To do things properly (e.g. to get the
same results when we re-play the
data), we’d need temporal database
semantics.
(Sometimes doing things properly is
too expensive.)
select stream *
from Orders as o
join Products as p
on o.productId = p.productId
and o.rowtime
between p.startEffectiveDate
and p.endEffectiveDate

54.
Aggregation and windows
on streams
GROUP BY aggregates multiple rows into sub-totals
➢ In regular GROUP BY each row contributes to
exactly one sub-total
➢ In multi-GROUP BY (e.g. HOP, GROUPING
SETS) a row can contribute to more than one
sub-total
Window functions (OVER) leave the number of rows
unchanged, but compute extra expressions for
each row (based on neighboring rows)
Multi
GROUP BY
Window
functions
GROUP BY