CTEs and Recursive CTEs appeared in MySQL 8.0; first in a Labs release and now in the official release 8.0.1. Several blogs have been published: here, here and here ; my colleague Øystein also wrote about how using a CTE made one DBT3 query run twice faster.

Today I will continue the series, focusing on these points of recursive CTEs:

Depth-first or breadth-first

I already covered this previously (in paragraph retrieving a full tree) but let’s recap the problem and solution. We assume we have a table of people with a parent-child
relationship:

1

2

3

4

5

6

7

8

9

10

11

CREATE TABLE tree(person CHAR(20),parentCHAR(20));

INSERT INTO tree VALUES

('Robert I',NULL),

('Thurimbert','Robert I'),

('Robert II','Thurimbert'),

('Cancor','Thurimbert'),

('Landrade','Thurimbert'),

('Ingramm','Thurimbert'),

('Robert III','Robert II'),

('Chaudegrand','Landrade'),

('Ermengarde','Ingramm');

This is a part of the genealogy of some ancient French kings, around the 8th century AD. In a picture:

1

2

3

4

RobertI->Thurimbert-->Robert II->Robert III

+->Cancor

+->Landrade->Chaudegrand

+->Ingramm->Ermengarde

Now we want to know all descendants of Thurimbert:

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

WITH RECURSIVE descendants AS

(

SELECT person

FROM tree

WHERE person='Thurimbert'

UNION ALL

SELECTt.person

FROM descendantsd,treet

WHEREt.parent=d.person

)

SELECT *FROM descendants;

+-------------+

|person|

+-------------+

|Thurimbert|

|Robert II|

|Cancor|

|Landrade|

|Ingramm|

|Robert III|

|Chaudegrand|

|Ermengarde|

+-------------+

Breadth-first ordering means we want to see all children (direct descendants) first, grouped together, and only then should their own children appear. This is what the query returned, but you should not rely on it (per the usual rule in SQL: “no ORDER BY specified? no predictable order!”). As direct descendants all appear at the first iteration, we just need to count iterations in a new “level” column, and sort by it:

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

WITH RECURSIVE descendants AS

(

SELECT person,1aslevel

FROM tree

WHERE person='Thurimbert'

UNION ALL

SELECTt.person,d.level+1

FROM descendantsd,treet

WHEREt.parent=d.person

)

SELECT *FROM descendants ORDER BY level;

+-------------+-------+

|person|level|

+-------------+-------+

|Thurimbert|1|

|Robert II|2|

|Cancor|2|

|Landrade|2|

|Ingramm|2|

|Robert III|3|

|Chaudegrand|3|

|Ermengarde|3|

+-------------+-------+

Depth-first ordering means we want to see children grouped immediately under their parent. For that we build a “path” column and sort by it:

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

WITH RECURSIVE descendants AS

(

SELECT person,CAST(person ASCHAR(500))ASpath

FROM tree

WHERE person='Thurimbert'

UNION ALL

SELECTt.person,CONCAT(d.path,',',t.person)

FROM descendantsd,treet

WHEREt.parent=d.person

)

SELECT *FROM descendants ORDER BY path;

+-------------+---------------------------------+

|person|path|

+-------------+---------------------------------+

|Thurimbert|Thurimbert|

|Cancor|Thurimbert,Cancor|

|Ingramm|Thurimbert,Ingramm|

|Ermengarde|Thurimbert,Ingramm,Ermengarde|

|Landrade|Thurimbert,Landrade|

|Chaudegrand|Thurimbert,Landrade,Chaudegrand|

|Robert II|Thurimbert,Robert II|

|Robert III|Thurimbert,Robert II,Robert III|

+-------------+---------------------------------+

I had already mentioned the necessity of CAST here : the type of the “path” column is determined from the initial SELECT, and it has to be long enough to fit the path of the deepest leaf in the tree: CAST ensures that.

In the SQL standard, there is an optional SEARCH BY clause to request breadth-first or depth-first ordering; as we saw above, it is easy to emulate in MySQL with a “level” or “path” column.

Computing transitive closures with simple cycle avoidance

In year 2300, the Earth is overcrowded and people are encouraged to emigrate to nearby planets by boarding on these space rockets:

1

2

3

4

5

6

CREATE TABLE rockets

(origin CHAR(20),destination CHAR(20),trip_time INT);

INSERT INTO rockets VALUES

('Earth','Mars',2),

('Mars','Jupiter',3),

('Jupiter','Saturn',4);

Note that the rulers of the Earth have intentionally not set up any way to return from those planets back to the Earth.

Let’s list all destinations which can be reached from the Earth: we first find the planets which can be directly reached, then we find what can be directly reached from these first planets, and so on:

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

WITH RECURSIVE all_destinations AS

(

SELECT destination ASplanet

FROM rockets

WHERE origin='Earth'

UNION ALL

SELECTr.destination

FROM rocketsr,all_destinationsd

WHEREr.origin=d.planet

)

SELECT *FROM all_destinations;

+---------+

|planet|

+---------+

|Mars|

|Jupiter|

|Saturn|

+---------+

Now it is year 2400 and the population of the Earth has decreased so much that the rulers decide to bring some emigrates back, so they set up a new rocket from Saturn to Earth:

1

INSERT INTO rockets VALUES('Saturn','Earth',9);

Let’s repeat our query to list all destinations which can be reached from the Earth:

1

2

3

4

5

6

7

8

9

10

11

12

13

WITH RECURSIVE all_destinations AS

(

SELECT destination ASplanet

FROM rockets

WHERE origin='Earth'

UNION ALL

SELECTr.destination

FROM rocketsr,all_destinationsd

WHEREr.origin=d.planet

)

SELECT *FROM all_destinations;

^C--query aborted

ERROR1317(70100):Query execution was interrupted

The query seemed to never finish, I had to interrupt it manually (note that starting from MySQL 8.0.3, the query would instead abort automatically, thanks to @@cte_max_recursion_depth). Why is it so? We have a cycle in our data: starting from the Earth, we add Mars, then Jupiter, then Saturn, then the Earth (because of the new rocket), so we’re back to the beginning – the Earth. Then we add Mars, and so on, forever.

Setting max_execution_time will surely ensure that such runaway query is automatically killed, but… how do we modify the query to work properly, i.e. not run away?

A simple solution is to not add a planet to the result if it’s already there. This elimination of duplicates is done by using UNION DISTINCT instead of UNION ALL:

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

WITH RECURSIVE all_destinations AS

(

SELECT destination ASplanet

FROM rockets

WHERE origin='Earth'

UNION DISTINCT

SELECTr.destination

FROM rocketsr,all_destinationsd

WHEREr.origin=d.planet

)

SELECT *FROM all_destinations;

+---------+

|planet|

+---------+

|Mars|

|Jupiter|

|Saturn|

|Earth|

+---------+

More complex cycle avoidance

So we have a nicely working query with UNION DISTINCT. Let’s extend it just a bit, to additionally show the time taken by each trip (pretty innocent idea, right?):

1

2

3

4

5

6

7

8

9

10

11

12

13

WITH RECURSIVE all_destinations AS

(

SELECT destination ASplanet,trip_time AStotal_time

FROM rockets

WHERE origin='Earth'

UNION DISTINCT

SELECTr.destination,d.total_time+r.trip_time

FROM rocketsr,all_destinationsd

WHEREr.origin=d.planet

)

SELECT *FROM all_destinations;

^C--query aborted

ERROR1317(70100):Query execution was interrupted

What? again it never ends? Indeed, consider what the first generated rows must be:

Initial SELECT: Mars, 2

Iteration 1: Jupiter, 5 (2+3)

Iteration 2: Saturn, 9 (5+4)

Iteration 3: Earth, 18 (9+9)

Iteration 4: Mars, 20 (18+2)

The time of the Earth->Mars->Jupiter->Saturn->Earth->Mars trip is 20, while the time of the Earth->Mars trip is 2; the two trips now form two different rows as the their differing time is part of the rows: (Mars, 2) and (Mars, 20). UNION DISTINCT will not help with cycle elimination, anymore, as it eliminates only rows for which all columns match… So (Mars, 20) is put into the result, and it leads to (Jupiter, 23) and so on.

The SQL standard has an optional CYCLE clause for that, which we can easily emulate: the solution is to build a “path” column (as in the depth/breadth-first paragraph above), and break when we see we’re adding something which is already in the path. We needn’t use DISTINCT anymore so we use ALL which avoids the overhead of (pointless) duplicate elimination:

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

WITH RECURSIVE all_destinations AS

(

SELECT destination ASplanet,trip_time AStotal_time,

CAST(destination ASCHAR(500))ASpath

FROM rockets

WHERE origin='Earth'

UNION ALL

SELECTr.destination,d.total_time+r.trip_time,

CONCAT(d.path,',',r.destination)

FROM rocketsr,all_destinationsd

WHEREr.origin=d.planet

ANDFIND_IN_SET(r.destination,d.path)=0

)

SELECT *FROM all_destinations;

+---------+------------+---------------------------+

|planet|total_time|path|

+---------+------------+---------------------------+

|Mars|2|Mars|

|Jupiter|5|Mars,Jupiter|

|Saturn|9|Mars,Jupiter,Saturn|

|Earth|18|Mars,Jupiter,Saturn,Earth|

+---------+------------+---------------------------+

Note that if the “row keys” (the planet’s name, here) are allowed to contain a comma, it may be safer to encode the keys when adding them to the path, for example with TO_BASE64 or HEX, so that FIND_IN_SET works reliably.

If, in your particular use case, cycles are considered an anomaly which you want to hunt for, a variant of our query will flag them. We just need to allow the cycle to enter our result and advertise itself (with a new “is_cycle” column), but we still must prevent the query from running forever, by not iterating on rows where is_cycle=1:

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

WITH RECURSIVE all_destinations AS

(

SELECT destination ASplanet,trip_time AStotal_time,

CAST(destination ASCHAR(500))ASpath,0ASis_cycle

FROM rockets

WHERE origin='Earth'

UNION ALL

SELECTr.destination,d.total_time+r.trip_time,

CONCAT(d.path,',',r.destination),

FIND_IN_SET(r.destination,d.path)!=0

FROM rocketsr,all_destinationsd

WHEREr.origin=d.planet

ANDis_cycle=0

)

SELECT *FROM all_destinations;

+---------+------------+--------------------------------+----------+

|planet|total_time|path|is_cycle|

+---------+------------+--------------------------------+----------+

|Mars|2|Mars|0|

|Jupiter|5|Mars,Jupiter|0|

|Saturn|9|Mars,Jupiter,Saturn|0|

|Earth|18|Mars,Jupiter,Saturn,Earth|0|

|Mars|20|Mars,Jupiter,Saturn,Earth,Mars|1|

+---------+------------+--------------------------------+----------+

A “1” in the “is_cycle” column points us to the cycle.

That’s all for today. I hope the above tricks will help you write powerful queries to navigate your hierarchical or graph data. As always: Thank you for using MySQL!

SELECT * FROM search_path
WHERE path_ids->’$[0]’ = 1 AND path_ids->’$[last]’ = 6
ORDER BY length;

I took your PostgreSQL query and did the following changes:
– MySQL doesn’t have an array type, but it has a JSON type, in particular it supports JSON arrays, so I’m using a JSON array
– PG’s || concatenation becomes json_array_append
– PG’s =ANY becomes json_contains (the value to search for must be a valid JSON value, so I cast the integer to JSON)
– PG’s array indexing with the length:
f.path_ids[array_length(path_ids, 1)] = d.node_id becomes
f.path_ids->’$[last]’ = d.node_id
– the recursive CTE (search_path) references the non-recursive one (node_links_select) so I swapped their definitions.

You’re right; all quote symbols which WordPress’ formatting has inserted in my query should be changed to ordinary ASCII single-quotes, or ordinary ASCII double-quotes (except that the latter won’t work if @@sql_mode contains the ANSI_QUOTES flag).

I’m using a recursive cte query to generate a path (a “WBS” of sorts) using a “listorder” column to control the ranking (eg “1.2.3”). However, as the path is a string, “20” displays before “3”, “30” displays before “4” and so on.

To get around this problem, I use a custom function to add leading zeros to each path “segment”. ( eg “001.002.003”). So, 1.2.3 lists before 1.2.10.

Your query is ordering solely on the “path”; “child a” and “child a.1” have the same parent (root) and the same listorder (1) so they have the same “path” (1.1 and 1.1), so they will come out sorted next to each other. So, sub-tasks of “child a” will be separated from “child a” in the output. Is it really what’s wanted? What’s the desired output’s order, for the sample data you showed?

Assuming you want all children listed under their parent, and between these children, you want them ordered by “listorder”, then I would build a “path” which is made of the IDs of all parents from root to current node (excluding current node), then I’d do
order by path, listorder;
This way, all children of the same parent would be together under their parent, as they would have the same path, and inside their group they would be sorted by listorder.
Such path made of PIDs may not have the padding issue, if all PIDs are of same length.
I still think that using only a concatenation of listorders cannot work, because “child a” and “child a.1” have the same listorder so they would sort next to each other.