Presto 0.188 to 0.205 Migration
AWS USAWS TokyoIDCF

Presto 0.205 in Production Now - Prepare for Upgrade

Treasure Data’s new Presto release (available 1 October 2018) is based on the open source Presto 0.205 release. Our previous release was based on Presto 0.188.

We have done our own internal testing to identify compatibility and performance regression issues, and we document those below.

Side-by-Side Environments

We understand that customers will need to do their own testing as well. Therefore, in this release, Presto 0.188 and Presto 0.205 available side-by-side to allow customers to test code themselves before the upgrade.

Migration Timeline

Until 31 October 2018, the default environment will be Presto 0.188. On 1 November 2018, the default environment will switch to Presto 0.205. Presto 0.188 will remain available until 15 November 2018 to allow time to finish migration.

If you finish your compatibility testing before 31 October, you can contact Treasure Data Support and we can change your default to Presto 0.205 early.

Side-by-side environments and gradual migration periods will be the practice in future releases as well.

Running Code in Presto 0.205 vs. Presto 0.188

During the transition from Presto 0.188 and Presto 0.205 the two releases will be available side-by-side from 1 October 2018 to 15 November 2018.

You can control which Presto release runs your code by using the following query hint:

-- @TD engine_version: 0.188

or

-- @TD engine_version: 0.205

Required SQL Changes for Presto 0.205

For this release there are a number of Presto syntax changes, mostly focused on closer adherence to the ANSI SQL standard.

A small number of existing queries will require correction before they will run in the new Presto environment. Note that for the changes described below, the queries will also be compatible with Presto 0.188. Therefore you can make these modifications in the Presto 0.188 environment, without changing your results. When you migrate to the new release your queries will continue to work.

For each change we describe the change in Presto, the error message if you have incompatible Presto SQL, and the recommended rewrite.

New System Function current_user

Presto has added a new current_user() function that returns the name of the current user. As a result, if you have a column or alias named current_user(), you will have to quote the references to that column.

Syntax Error-Producing Queries

Changes in ORDER BY behavior

Presto is now more closely compliant with ANSI SQL standard ORDER BY clause.

The general rules for ORDER BY behavior are now:

ORDER BY is executed after projection

ORDER BY prefers an ordinal alias wherever possible

Input and output columns can be used in the ORDER BY clause, but if there is ambiguity, the output columns have higher precedence

There are several error messages that may be caused by this change to Presto.

The issues in all the cases below are due to naming conflicts in the query -- use of aliases in the SELECT or ORDER BY that collide with column name or table name in the input. The use of the alias prevents the reference to the original input table or column.

Rewrites generally involve using aliases that don’t match the input table or column names, or using ordinals in the ORDER BY clause.

For each case we show the message, a query that causes the error, and an example of how to rewrite the query.

Error Message: Mismatched types: varchar(1) vs varchar

Original code that causes the error:

SELECT * FROM page_view WHERE CAST(week_page_total_actions as VARCHAR) IS DISTINCT FROM '0';

Modified code that fixes the error:

SELECT * FROM page_view WHERE CAST(week_page_total_actions as VARCHAR (1)) IS DISTINCT FROM '0';

Error Message: Non deterministic ORDER BY expression is not supported with SELECT DISTINCT

SELECT week_page_total_actionsFROM ( SELECT DISTINCT week_page_total_actions, rand() FROM page_view WHERE week_page_total_actions NOT LIKE '1' LIMIT 20 ) sub-- Remove the function or expression from the ORDER BY clause

Error Message: Invalid reference to output projection attribute from ORDER BY aggregation

Original code that causes the error:

SELECT
code, count(code) AS cnt
FROM www_access GROUP BY code ORDER BY count(code)
-- output column, code, is used at ORDER BY aggregation

Modified code that fixes the error:

SELECT
code, count(code) AS cnt
FROM www_access GROUP BY code ORDER BY count(www_access.code)
-- (or) ORDER BY cnt (or) ORDER BY 2

Note that the following query (where the name clash occurs on the identifier time) is not equivalent to the rewrite above. The return value of TD_TIME_FORMAT aliased as time will be a date-formatted VARCHAR, not a timestamp:

SELECT
TD_TIME_FORMAT(time,
'yyyy-MM-dd HH:mm:ss z',
'JST') AS time, -- alias has higher precedence, is used at ORDER BY
count(*) AS access
FROM access_log access
ORDER BY time -- (or) ORDER BY 1

Error Message: For SELECT DISTINCT, ORDER BY expressions must appear in select list

Example code that causes the error:

SELECT
DISTINCT TD_TIME_FORMAT(time,
'yyyy-MM-dd HH:mm:ss z',
'JST') AS time,
code
FROM www_access access
ORDER BY TD_TIME_FORMAT(time, 'yyyy-MM-dd HH:mm:ss z', 'JST')
-- this TD_TIME_FORMAT call does not match an expression in the SELECT list, because
-- time in ORDER BY refers to the varchar returned by TD_TIME_FORMAT

Modified code that fixes the error:

SELECT
DISTINCT TD_TIME_FORMAT(time,
'yyyy-MM-dd HH:mm:ss z',
'JST') AS time, -- higher precedence, overrides input time column
code
FROM www_access access
ORDER BY time
-- could also use ORDER BY 1

SELECT
TD_TIME_FORMAT(time, 'yyyy-MM-dd HH:mm:ss z', 'JST') AS time, -- higher precedence, overrides input time column
code
FROM www_access access
ORDER BY TD_TIME_FORMAT(time, 'yyyy-MM-dd HH:mm:ss z', 'JST')-- time here is the varchar output column in select list,
-- not the timestamp input column
-- so the expression here is not the same as the
-- expression in the select listed

Modified code for td_time_format that fixes the error:

SELECT
TD_TIME_FORMAT(time, 'yyyy-MM-dd HH:mm:ss z', 'JST') AS time, -- higher precedence
code
FROM www_access access
ORDER BY time -- (or) ORDER BY 1

JOIN with USING

With SELECT… JOIN... with the USING clause for an equijoin, any columns mentioned in the USING clause will appear only once in the output, with an unqualified name, rather than once for each table in the join. It is an error to reference the column from the source tables with qualified names.

The following error messages that may be caused by this change vs. previous Presto. For each we show an example of how to rewrite the query.

Note that for this issue, you should review your code closely to make sure your changes do not change the meaning of the code.

Recommended Update: Replace TD_SESSIONIZE with TD_SESSIONIZE_WINDOW

While reviewing code for this update you may also wish to make an important upgrade to any pre-existing queries that use TD_SESSIONIZE. The TD_SESSIONIZE() function has been deprecated because of performance issues and inconsistent results. The replacement Presto window function, TD_SESSIONIZE_WINDOW(), was added in 2016. It addresses the issues with TD_SESSIONIZE. The detailed instructions on rewriting your queries are here: