What functions are available in Presto?

My query is not working. How can I fix it?

Solution: Start with a smaller data set, then look into subqueries.

Presto can process millions of rows in a second. If you see query errors or unexpected results, try to minimize your input data set first. Because checking millions of rows by hand is impractical, the following steps can help you isolate the problem:

Solution: Ask Our Support Team

For Treasure Data customers, if problems persist, see Support Channels or send an e-mail to support@treasuredata.com with the job IDs of your queries. If you can, include information about the expected results and the meaning of your data set.

Query failed with a PAGE_TRANSPORT_TIMEOUT error. What does it mean?

See the answer to the following question.

How reliable is Presto?

Presto transfers the data through the network. Due to connection timeout or some problem in worker nodes, this network data transfer may fail occasionally (PAGE_TRANSPORT_TIMEOUT, WORKER_RESTARTED). Presto is designed for faster query processing when compared to Hive, so it sacrifices fault-tolerance, to some extent. Typically, more than 99.5% of Presto queries finish without any error on the first run. In addition, Treasure Data provides a query retry mechanism on query failures, so nearly 100% of queries finish successfully after being retried.

Because of the nature of network-based distributed query processing, if your query tries to process billions of records, the chance of hitting network failures increases. If you see PAGE_TRANSPORT_TIMEOUT errors frequently, try to reduce the input data size by narrowing the TD_TIME_RANGE or reducing the number of columns in your SELECT statements.

My query is slow. How can I improve it?

Solution: Avoid single-node operations.

Presto is a distributed query engine, but some operations must run in a single process. For example, ORDER BY over the entire result set and duplicate elimination with DISTINCT are performed on a single worker node.

Solution: Avoid SELECT * by using CREATE TABLE AS or INSERT INTO.

See Query that produces a huge result is slow topic below.

Solution: Aggregate a series of LIKE clauses into one regexp_like expression.

Presto’s query optimizer is unable to improve queries where several LIKE clauses are used. As a consequence, the query execution can be slower than expected. To improve the performance, you can substitute a series of LIKE clauses chained in an OR with a single regexp_like clause. For example:
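A sketch of this rewrite, assuming a hypothetical access_log table with a method column:

```sql
-- Before: several LIKE clauses chained with OR
SELECT count(*) FROM access_log
WHERE method LIKE '%GET%'
   OR method LIKE '%POST%'
   OR method LIKE '%PUT%'
   OR method LIKE '%DELETE%'

-- After: one regexp_like expression
SELECT count(*) FROM access_log
WHERE regexp_like(method, 'GET|POST|PUT|DELETE')
```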

Solution: Check the 1-hour data distribution

If you have a huge number of rows in a 1-hour partition, processing that partition can become the performance bottleneck. To check the number of rows contained in each partition, run a query similar to the following:
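For example, a query along these lines (my_table is a placeholder for your table name):

```sql
SELECT TD_TIME_FORMAT(time, 'yyyy-MM-dd HH') AS hour, count(*) AS cnt
FROM my_table
WHERE TD_TIME_RANGE(time, '2015-10-01', '2015-11-01')
GROUP BY 1
ORDER BY cnt DESC
LIMIT 100
```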

This query shows the top 100 partitions containing the highest number of records between 2015-10 and 2015-11.

How can I improve the join performance?

Solution: Use simple equi-joins.

The equi-join concatenates tables by comparing join keys using the equal (=) operator. If this comparison becomes complex, the join processing slows down. For example, if one table stores dates as strings such as '2015-10-01', but the other table only has separate columns for year, month, and day values, you can write the following kind of query to generate date strings:
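A sketch, assuming hypothetical tables orders(date_str, …) and calendar(year, month, day, …):

```sql
SELECT *
FROM orders o
JOIN calendar c
  ON o.date_str = concat(
       cast(c.year AS varchar), '-',
       lpad(cast(c.month AS varchar), 2, '0'), '-',
       lpad(cast(c.day AS varchar), 2, '0'))
```

Where possible, precompute such derived join keys in a subquery so the join condition itself stays a simple equality.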

In GROUP BY and ORDER BY clauses you can also refer to output columns by ordinal, e.g. GROUP BY 1, 2. These numbers correspond to the column indexes (1-origin) of the SELECT statement.
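For example (access_log is a placeholder table name):

```sql
SELECT TD_TIME_FORMAT(time, 'yyyy-MM-dd') AS day, method, count(*) AS cnt
FROM access_log
GROUP BY 1, 2
```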

Exceeded max (local) memory xxGB error.

Rewrite your SQL to use less memory.

Presto tracks the memory usage of each query. While the available memory varies according to your price plan, in most cases it is possible to rewrite your query to resolve this issue. Here is a list of memory-intensive operations:

distinct

UNION

ORDER BY

GROUP BY (of many columns)

joins

Solution: Avoid using the distinct operator.

distinct eliminates duplicate rows. For example, the following query returns the unique set of (c1, c2, c3) tuples in your table:

SELECT distinct c1, c2, c3 FROM my_table

This stores the entire set of columns c1, c2, and c3 in the memory of a single worker node to check the uniqueness of the tuples. The required memory increases with the number of columns and their size. Remove distinct from your query, or apply it only after reducing the number of input rows with a subquery.

Solution: Replace count(distinct x) with approx_distinct(x).

If you are counting the number of distinct users or events, you would typically use the count(distinct id) operation, but this can cause memory issues:

SELECT count(distinct id) FROM my_table

To reduce the memory usage, use an approximate version of count(distinct x):

SELECT approx_distinct(id) FROM my_table

approx_distinct(x) returns an approximate count of the distinct values; a result far from the true value is returned only with low probability. If an approximate figure is acceptable for summarizing the characteristics of your data set, use this version.

Solution: Prefer UNION ALL over UNION.

For the same reason as with distinct, a UNION of SQL queries performs duplicate elimination and requires a substantial amount of memory:
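A minimal sketch of the difference, assuming placeholder tables table1 and table2:

```sql
-- UNION performs duplicate elimination (memory-intensive):
SELECT c1 FROM table1
UNION
SELECT c1 FROM table2

-- UNION ALL simply concatenates the two results:
SELECT c1 FROM table1
UNION ALL
SELECT c1 FROM table2
```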

UNION ALL requires less memory and is faster because it skips duplicate elimination. If you need to concatenate two or more SQL query results, use UNION ALL.

Solution: Avoid ORDER BY.

SELECT c1, c2 FROM my_table ORDER BY c1

Presto performs sorting using a single worker node. The entire data set must fit within the memory limit of a worker, which is usually less than 5GB.

If you are sorting a small number of rows (~ 10,000 rows), using ORDER BY is fine. If you are going to sort a data set with GBs of data, you might need to find an alternative strategy. If sorting the entire data set is necessary, you can combine Hive and Presto:

Store the results of your Presto query by using CREATE TABLE AS or INSERT INTO query.

Use Hive to sort the data set.

Solution: Reduce the number of GROUP BY columns.

If you enumerate many target columns in a GROUP BY clause, storing the set of keys consisting of (c1, c2, c3, c4, …) requires a lot of memory. Reducing the number of columns in the GROUP BY clause will reduce the memory usage.

Solution: Reduce the size of GROUP BY keys with smart_digest().

We provide a smart_digest(key) UDF for Treasure Data customers, which creates smaller hash values to reduce the size of keys.
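A sketch of replacing a long string key with its digest (www_access and path are placeholder names):

```sql
SELECT smart_digest(path) AS path_digest, count(*) AS cnt
FROM www_access
GROUP BY 1
```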

Presto performs a broadcast join by default, which partitions the left side table across several worker nodes, then sends a copy of the entire right side table to each worker node that holds a partition. If the right side table is too large to fit in the memory of a worker node, this causes an error.

Reordering the table join to put the largest table first typically solves the issue:
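A sketch with placeholder table names:

```sql
-- Put the largest table first (on the left side of the join)
SELECT *
FROM large_table l
JOIN medium_table m ON l.id = m.id
JOIN small_table s  ON l.id = s.id
```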

The distributed join algorithm partitions both the left and right side tables using hash values of the join keys as the partitioning key. It works even if the right side table is large, but it can increase the number of network data transfers and is usually slower than a broadcast join.

Query that produces a huge result is slow.

Consider using CREATE TABLE AS or INSERT INTO.

SELECT * FROM my_table

Presto uses JSON text to materialize query results. If the table above contains 100GB of data, the coordinator transfers more than 100GB of JSON text to save the query result. So even if the query computation is almost finished, outputting the JSON results takes a long time.

Solution: Use CREATE TABLE (table) AS SELECT …

You can parallelize the query result output process by using a CREATE TABLE AS SELECT statement. To clean up the result table beforehand, add a DROP TABLE statement at the top of your query. The result output performance can be 5x or more faster than running SELECT *, because Presto on Treasure Data skips the JSON output process and directly produces a 1-hour partitioned table:
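For example (my_result is a placeholder result table name):

```sql
DROP TABLE IF EXISTS my_result;
CREATE TABLE my_result AS
SELECT * FROM my_table;
```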

Solution: Use INSERT INTO (table) SELECT …

You can also use INSERT INTO (table) SELECT … to append data to an existing table. It can likewise improve the query result output performance:
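For example (my_result is a placeholder for an existing table):

```sql
INSERT INTO my_result
SELECT * FROM my_table
```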

If the subsequent SELECT statement does not produce the time column, INSERT INTO attaches the query execution time as time column values. You can find the inserted rows using the query execution time.

Solution: Avoid using SELECT *.

SELECT * FROM my_table

Treasure Data uses a column-oriented storage format, so accessing a small set of columns is fast, but query performance deteriorates as the number of columns in your query increases. Be selective in choosing columns:

SELECT id, name, address FROM my_table

Solution: [Experimental] Use result_output_redirect

(This is an experimental feature and is subject to change in the future.) By setting result_output_redirect='true' in a magic comment, you can make the query output faster:

-- set session result_output_redirect='true'
SELECT * FROM my_table

With this query hint our Presto produces the query results in parallel and can skip the JSON conversion process at the coordinator.

Note that this magic comment is ignored if the query uses an ORDER BY clause or an uppercase field name.

How do I report daily statistics into the columns in a row?

Sometimes, you may need to report access counts of different date ranges into the columns in a row. For example:
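A sketch using count_if together with TD_TIME_RANGE (access_log and the date ranges are placeholders):

```sql
SELECT
  count_if(TD_TIME_RANGE(time, '2015-12-01', '2015-12-02')) AS day1,
  count_if(TD_TIME_RANGE(time, '2015-12-02', '2015-12-03')) AS day2,
  count_if(TD_TIME_RANGE(time, '2015-12-03', '2015-12-04')) AS day3
FROM access_log
```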

How can I specify JSON keys with white spaces or special symbols?

Solution: Use JSON paths with square brackets and double quotation.

SELECT json_extract('{"a b":1}', '$["a b"]');

This returns 1.

Query failed with a “Compiler failed and interpreter is disabled” error. What does it mean?

Too many OR and AND conditions in a WHERE clause (more than about 1,000 in a job) can cause the bytecode generation for query execution to fail. There is no solution other than reducing the number of conditions, for example by splitting the query with a UNION or by using an intermediate table.

How can I use binary functions?

Presto supports binary functions. To use a binary function on Treasure Data, you need to convert binary/varbinary values to the varchar type.
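For example, md5() takes and returns varbinary, so you can wrap the input with to_utf8() and the output with to_hex() to work with varchar values:

```sql
SELECT to_hex(md5(to_utf8('Presto')))
```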

How do I use Presto geospatial functions?

Solution: How to store geospatial data.

The building block of the geospatial functions is the Geometry data type, which is expressed using the WKT format. Well-known text (WKT) is a text markup language for representing vector geometry objects on a map. Examples include:
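For example, common WKT representations look like this:

```
POINT (0 0)
LINESTRING (0 0, 1 1, 1 2)
POLYGON ((0 0, 4 0, 4 4, 0 4, 0 0))
MULTIPOINT (0 0, 1 2)
```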

Because WKT is a text format, Geometry values are stored as VARCHAR or STRING data (excluding the Point data type, which can be stored as a pair of DOUBLE values representing the latitude and longitude). For example, to store a polygon in a table you would store its WKT representation; the actual text content stored in the TD table would be 'POLYGON ((0 0, 4 0, 4 4, 0 4, 0 0), (1 1, 2 1, 2 2, 1 2, 1 1))'. If you needed to store a multipoint line, the actual text stored in the TD table would be 'MULTIPOINT (0 0, 1 2)'.

Solution: How to convert string data to Geometry type?

All Presto geospatial functions require the Geometry data type. The data is stored in TD as a WKT formatted string. As a result, the WKT string data has to be converted into a Geometry type before it can be used in geospatial functions. The conversion from WKT string data to Geometry type is done using geospatial constructor functions. These functions accept WKT formatted text as arguments and return the Geometry data type required by geospatial functions. Examples of constructor functions include:
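Some of the constructor functions available in Presto are:

```sql
ST_Point(longitude, latitude)                       -- builds a Point from two DOUBLEs
ST_LineFromText('LINESTRING (0 0, 1 1, 1 2)')       -- builds a LineString from WKT
ST_Polygon('POLYGON ((0 0, 4 0, 4 4, 0 4, 0 0))')   -- builds a Polygon from WKT
ST_GeometryFromText('MULTIPOINT (0 0, 1 2)')        -- builds any Geometry from WKT
```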

Solution: How to run geospatial queries in Presto.

Let’s assume you have a database with two tables: cities and trips. The cities table has the columns city and area: city is a string (e.g., 'Dallas') and area is a WKT string representation of the city’s area as a polygon, e.g. 'POLYGON ((0.3528644 32.5869589, 0.3587434 32.5963149, 0.3353124 32.6015929, 0.3528644 32.5869589))'. The second table, trips, is a list of bus rides taken. Its columns are username, a string with the user’s name (e.g., 'Peter'), and journey, a WKT string representation of a linestring for each trip taken, e.g. 'LINESTRING (0 0, 1 1, 1 2)'.

Given that, let us assume that you want to find how many trips passed through certain cities. The geospatial function for that is:

ST_Intersects(Geometry, Geometry) → boolean

An example query is:

SELECT c.city AS City, count(*) AS NumTrips
FROM trips AS t
JOIN cities AS c
  ON ST_Intersects(ST_Polygon(c.area), ST_LineFromText(t.journey))
GROUP BY 1

The output would be:

City      NumTrips
Ames      10
New York  11
Dallas    3
Vegas     44

Solution: How to calculate the distance in kilometers between two points using latitude and longitude.

For Presto, ST_Distance returns a distance in the units of the spatial reference, which can be impractical in use cases where you need the distance between two points in kilometers. If you need to calculate the distance in kilometers, you have to compute it directly in SQL. Assuming two locations A(lat1, long1) and B(lat2, long2), the SQL query to find the distance between the two points in kilometers is:
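A sketch using the haversine formula (lat1/long1/lat2/long2 are placeholder column names holding coordinates in degrees; 6371 is the Earth’s mean radius in km):

```sql
SELECT 6371 * 2 * asin(sqrt(
         pow(sin(radians(lat2 - lat1) / 2), 2)
         + cos(radians(lat1)) * cos(radians(lat2))
         * pow(sin(radians(long2 - long1) / 2), 2)
       )) AS distance_km
FROM locations
```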