Expanding JSON arrays to rows with SQL on Redshift

12 December 2013 — Torsten Becker

Amazon's Redshift is a really neat product that solves a lot of our problems at work. However, its SQL dialect has some limitations when compared to Hive or PostgreSQL. I hit one of those limits when I needed table-generating functions, but found a work-around.

Some of the data we store in Redshift contains JSON arrays. However, when running analytical queries, there is no out-of-the-box way to join on "nested data" inside of arrays, so until now this data was very hard to use.

As a work-around, I came up with a simple hack: Joining the JSON array with a predefined sequence of integers and then extracting the element at each index into a new relation through that join.

If you want to follow along with the queries and play with the data, I created a Gist which has all the queries to create the dummy tables and fill them with test data.

In this example, I am assuming a table clusters where each row represents a cluster of "things" and each cluster consists of many nodes modeled as a JSON array. Each node has its size stored in this array, so you could ask: "What is the maximum node size over all clusters?"

| id | node_sizes   |
|----|--------------|
| 1  | '[1, 2]'     |
| 2  | '[5, 1, 3]'  |
| 3  | '[2]'        |

Assuming the above data in the table clusters, you can use the following SQL query in Redshift to extract the maximum node size from all arrays:
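A query of roughly the following shape does the job. (The exact query is not reproduced here; this is a sketch reconstructed from the description around it, using Redshift's built-in JSON_EXTRACT_ARRAY_ELEMENT_TEXT and JSON_ARRAY_LENGTH functions together with the seq_0_to_100 view explained further down.)

```sql
-- Explode each JSON array into one row per element by joining
-- against a sequence of integers, then aggregate over the result.
WITH exploded_array AS (
    SELECT
        c.id,
        JSON_EXTRACT_ARRAY_ELEMENT_TEXT(c.node_sizes, seq.i) AS size
    FROM clusters c, seq_0_to_100 AS seq
    -- Only keep indexes that actually exist in each array.
    WHERE seq.i < JSON_ARRAY_LENGTH(c.node_sizes)
)
SELECT MAX(size::int) AS max_node_size
FROM exploded_array;
```

Note that JSON_EXTRACT_ARRAY_ELEMENT_TEXT returns text, hence the cast to int before taking the maximum.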

The example query above uses a Common Table Expression to create an intermediate relation exploded_array which looks like this:

| id | size |
|----|------|
| 1  | 1    |
| 1  | 2    |
| 2  | 5    |
| 2  | 1    |
| 2  | 3    |
| 3  | 2    |

As you can see, each array was expanded into many rows, and each row keeps the id of the array it came from.

So where does the magical seq_0_to_100 in the above queries come from?

Since Redshift is currently missing any kind of sequence-generating function, I had to emulate that as well. For that, I created a view like this:

```sql
CREATE VIEW seq_0_to_100 AS (
    SELECT 0 AS i UNION ALL
    SELECT 1 UNION ALL
    -- You get the idea...
    SELECT 99 UNION ALL
    SELECT 100
);
```

One constraint of the presented technique is that the maximum length of the longest array has to be known upfront in order to generate the sequence of integers. In practice, this constraint has not been an obstacle for me so far, since that length can simply be queried in most cases:

```sql
SELECT MAX(JSON_ARRAY_LENGTH(node_sizes)) FROM clusters;
```

Of course, this example is not all you can do. Once the data is in the shape of exploded_array you can work with the resulting intermediate relation in any other way and join it with all the things in your data warehouse.
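For instance, a sketch of such a join could look like this. (The node_labels table here is purely hypothetical and not part of the example schema; it just stands in for "any other table keyed by cluster id".)

```sql
-- Hypothetical example: enrich the exploded node sizes with labels
-- from another table and aggregate per label.
WITH exploded_array AS (
    SELECT
        c.id,
        JSON_EXTRACT_ARRAY_ELEMENT_TEXT(c.node_sizes, seq.i) AS size
    FROM clusters c, seq_0_to_100 AS seq
    WHERE seq.i < JSON_ARRAY_LENGTH(c.node_sizes)
)
SELECT l.label, AVG(e.size::int) AS avg_node_size
FROM exploded_array e
JOIN node_labels l ON l.cluster_id = e.id   -- node_labels is illustrative
GROUP BY l.label;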