A perfect Hive query for a perfect meeting (Hadoop Summit 2014)

During one of our epic parties, Martin Lorentzon (chairman of Spotify) agreed to help me arrange a dinner with Timbuktu (my favourite Swedish rap and reggae artist) if I could somehow prove that I am Timbuktu's biggest fan in my home country. Because at Spotify we attack all problems with data-driven approaches, I decided to implement a Hive query that processes real datasets to figure out who streams Timbuktu most frequently in my country. Although the problem seems well-defined, implementing the query efficiently raises many challenges, related to sampling, testing, debugging, troubleshooting, optimizing and executing it over terabytes of data on a Hadoop-YARN cluster with hundreds of nodes. In this talk, I will describe all of them and share tips and best practices for analyzing large datasets with Hive on YARN that increase your (and the cluster's) productivity. I will also explain how newly-added Hive features (e.g. join optimizations, the ORC file format and the upcoming Tez integration) can be used to make your query extremely fast.

6.
The Deal
"If Adam somehow proves that he is the biggest fan of Timbuktu in his home country, Martin will invite Adam and Timbuktu, his favourite Swedish artist, for a beer or coke or whatever to drink."
by Martin

30.
Query
SELECT s.user_id, COUNT(*) AS count
FROM stream s
JOIN track t ON t.track_id = s.track_id
JOIN user u ON u.user_id = s.user_id
WHERE lower(get_json_object(t.json_data, '$.album.artistname')) = 'timbuktu'
  AND u.country = 'PL'
  AND s.date BETWEEN 20130212 AND 20140415
  AND u.date = 20140415 AND t.date = 20140415
GROUP BY s.user_id
ORDER BY count DESC
LIMIT 100;
A line where I may have a bug?!
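Since debugging this query is a theme of the talk, one quick sanity check for the suspect get_json_object() call is to evaluate the JSON path on a handful of rows before filtering on it; a minimal sketch of my own, using only the tables and columns from the query above:

-- Inspect what the JSON path actually extracts on a small sample
SELECT t.json_data,
       get_json_object(t.json_data, '$.album.artistname') AS extracted_artist
FROM track t
WHERE t.date = 20140415
LIMIT 10;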

42.
For each line of table.ddl:
1. Creates a table with a given schema
   - In a local Hive database called beetest
2. Loads the table.txt file into the table
   - A list of files can also be explicitly specified

table.ddl:
stream(track_id STRING, user_id STRING, date BIGINT)
user(user_id STRING, country STRING, date BIGINT)
track(track_id STRING, json_data STRING, date BIGINT)
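For illustration only, the statements generated for the first line of table.ddl could look roughly like the sketch below; the delimiter, the file path and the backquoting of `date` are assumptions, not documented behaviour of the tool:

-- Create the target database and a table for the first schema line
CREATE DATABASE IF NOT EXISTS beetest;
CREATE TABLE beetest.stream (
  track_id STRING,
  user_id STRING,
  `date` BIGINT  -- backquoted in case it clashes with a keyword
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';  -- delimiter is an assumption
-- Load the accompanying text file into the table
LOAD DATA LOCAL INPATH 'stream/table.txt' INTO TABLE beetest.stream;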

115.
Vectorization
✓ Processes data in batches of 1024 rows
  - Instead of processing one row at a time
✓ Reduces CPU usage
✓ Scales very well, especially with large datasets
  - Kicks in after a number of function calls
✓ Supported by ORC
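For reference, vectorized execution is enabled with a single property (hive.vectorized.execution.enabled is a real Hive setting); the CTAS below, with the made-up table name stream_orc, is just one way to satisfy the ORC requirement mentioned above:

-- Turn on batch-of-1024 vectorized execution
SET hive.vectorized.execution.enabled = true;
-- Vectorization reads ORC data, so store a copy of the table as ORC first
CREATE TABLE stream_orc STORED AS ORC AS SELECT * FROM stream;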

117.
Vectorized Map Join
✗ Not yet fully supported by vectorization
✓ Still less GC
✓ Still better use of memory-to-CPU bandwidth
✗ Cost of switching from vectorized to row mode
✓ Will be revised soon by the community
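For context, map joins themselves are controlled by standard Hive properties (both settings below exist in Hive; the 25 MB threshold is just an example value):

-- Automatically convert joins with one small side into in-memory map joins
SET hive.auto.convert.join = true;
-- Tables smaller than this many bytes qualify as the in-memory side
SET hive.mapjoin.smalltable.filesize = 25000000;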

127.
Summary
■ Deals at 5 AM might be the best idea ever
■ Hive can be a single solution for all SQL queries
  - Both interactive and batch
  - Regardless of input dataset size
■ Tez is really MapReduce v2
  - Very flexible and smart
  - Improves what was inefficient in MapReduce
  - Easy to deploy
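As a closing pointer, once Tez is deployed on the cluster, switching Hive over to it is a one-line setting (hive.execution.engine is the actual property; the query text itself stays unchanged):

-- Run subsequent Hive queries on Tez instead of classic MapReduce
SET hive.execution.engine = tez;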