snowflake data warehouse date range based joins 45x speedup

I have been using http://www.snowflake.net for a new data processing at work for a few months, and it’s just amazing, to be able to run large queries over large data-sets, and the ability to increase the cluster size, when doing development work to get faster turnarounds, that are not impacting the production cluster, brilliant.

One of the thing I have noticed is slower than I would like is joins based on tableA.time being inside a time-range of tableB.start and tableB.end when the time period being is in the months not days.

So the pattern mapping a value from TABLE_I onto all rows in the time span (not including the end)

For one set of data spanning 3 months the above takes 45 minutes on a small cluster for TABLE_P 65M/TABLE_I 10M rows.
Where-as for a similar set of 4 days, and ~45M rows this takes 30 seconds.

So I add some TO_DATE(time), TO_DATE(start_time) columns to the two tables, and then added AND tp.time_day = i.start_time_day and the first query went to ~60 seconds. But I was missing a few million rows as my time ranges span multiple days…

So I did many things that didn’t work (like trying to use a GENERATOR with dynamic input) and settled on a simple solution

So the above code creates another temp table with a row per table B record with every Day is a duplicate row, now we have more rows in the seconds table, but we can do a date based match to speedup the query.

CREATE OR REPLACE TEMPORARY TABLE WORKING_B_B AS
SELECT tp.u_id, tp.time, i.value
FROM TABLE_P_B tp
JOIN TABLE_I_B i
ON tp.u_id = i.u_id AND tp.time_day = i.range_day_part
AND tp.time >= i.start_time AND tp.time < i.end_time;
This code runs in 60 seconds and gives the same results as the 45 minute code.

Things to note, putting LATERAL table joins on a selects with CTE’s presently breaks the SQL parser, in fact even nested selects and LATERAL don’t mix, thus the extra tables with _B etc. Also CTE’s make life so much easier, but as you start joining to them a lot, performance slips, I have found where I do a complex join is a good time to output to a temporary table, and the performance again is crazy…