Solutions to Related Tables Challenge

Last week I provided a T-SQL Challenge involving writing a recursive query that returns tables related to an input table directly or indirectly through foreign key relationships. You can find the challenge details here. The tricky part wasn’t really to come up with any solution that works, rather to come up with a solution using a recursive query. With this constraint the challenge became much more difficult—especially coming up with a solution that performs reasonably.

I’d like to thank all those who sent solutions, including Peter Larsson (Peso), Geri Reshef and Rubén Garrigós. Special thanks go to Peso for his help in benchmarking solutions in his environment and in sending corrections to other solutions. I know that many others tried to solve the puzzle and would like to thank you for your efforts—it is a difficult challenge!

The sample data I provided last week is far too simple to be used to test the performance of the solutions. Peso tested some of the solutions in his data warehouse with 160 tables and this way we got more realistic measures. I created a new set of tables—smaller than Peso’s environment, yet more complex than last week’s—and this was sufficient to show the performance differences between the solutions. Here’s the new sample data:

USE testfk;

CREATETABLE dbo.T2(c2 INT PRIMARY KEY);

CREATETABLE dbo.T3(c3 INT PRIMARY KEY);

CREATETABLE dbo.T4(c4 INT PRIMARY KEY);

CREATETABLE dbo.T5(c5 INT PRIMARY KEY);

CREATETABLE dbo.T1(c1 INT PRIMARY KEY,

c2 INT REFERENCES dbo.T2,

c3 INT REFERENCES dbo.T3,

c4 INT REFERENCES dbo.T4,

c5 INT REFERENCES dbo.T5);

CREATETABLE dbo.T6(c6 INT PRIMARY KEY,

c2 INT REFERENCES dbo.T2,

c3 INT REFERENCES dbo.T3);

CREATETABLE dbo.T7(c7 INT PRIMARY KEY,

c3 INT REFERENCES dbo.T3,

c4 INT REFERENCES dbo.T4);

CREATETABLE dbo.T8(c8 INT PRIMARY KEY,

c4 INT REFERENCES dbo.T4,

c5 INT REFERENCES dbo.T5);

CREATETABLE dbo.T9(c9 INT PRIMARY KEY,

c2 INT REFERENCES dbo.T2,

c5 INT REFERENCES dbo.T5);

CREATETABLE dbo.T10(c10 INT PRIMARY KEY,

c6 INT REFERENCES dbo.T6,

c3 INT REFERENCES dbo.T3);

CREATETABLE dbo.T11(c11 INT PRIMARY KEY,

c3 INT REFERENCES dbo.T3,

c7 INT REFERENCES dbo.T7);

CREATETABLE dbo.T12(c12 INT PRIMARY KEY,

c7 INT REFERENCES dbo.T7,

c4 INT REFERENCES dbo.T4);

CREATETABLE dbo.T13(c13 INT PRIMARY KEY,

c4 INT REFERENCES dbo.T4,

c8 INT REFERENCES dbo.T8);

CREATETABLE dbo.T14(c14 INT PRIMARY KEY,

c5 INT REFERENCES dbo.T5,

c8 INT REFERENCES dbo.T8);

CREATETABLE dbo.T15(c15 INT PRIMARY KEY,

c5 INT REFERENCES dbo.T5,

c9 INT REFERENCES dbo.T9);

CREATETABLE dbo.T16(c16 INT PRIMARY KEY,

c2 INT REFERENCES dbo.T2,

c9 INT REFERENCES dbo.T9);

CREATETABLE dbo.T17(c17 INT PRIMARY KEY,

c2 INT REFERENCES dbo.T2,

c6 INT REFERENCES dbo.T6);

And here’s a graphical depiction of the relationships:

These tables are created in the testfk database used last week in addition to the other tables already there (tables A, B, C, D, E, F, G).

The performance measures I’ll provide are against this data model, using warm cache, not including parse and compile time. I used the following input for the performance test:

DECLARE @table AS NVARCHAR(261) = N'dbo.T1';

The expected output is (not necessarily in this order):

obj_schema_name obj_name

---------------- ---------

dbo T15

dbo T16

dbo T17

dbo T2

dbo T3

dbo T4

dbo T5

dbo T1

dbo T6

dbo T7

dbo T8

dbo T9

dbo T10

dbo T11

dbo T12

dbo T13

dbo T14

First, I’d like to emphasize that the constraint to come up with a solution based on a recursive query was just to increase the puzzle’s level of difficulty. In practical terms, it makes sense to address such need using a loop-based solution. Loop-based solutions in this case are far simpler and more efficient than recursive solutions. Here’s the one I showed last week (call it Itzik Loop) with a minor revision to return the input table when it has no related tables (remember to include in the code the declaration and initialization of the input @table provided earlier):

DECLARE @T AS TABLE ( id INT NOT NULL PRIMARY KEY );

INSERTINTO @T(id)

SELECT OBJECT_ID(@table) AS id

WHERE OBJECT_ID(@table) IS NOT NULL

UNION

SELECT referenced_object_id

FROM sys.foreign_keys

WHERE parent_object_id = OBJECT_ID(@table)

UNION

SELECT parent_object_id

FROM sys.foreign_keys

WHERE referenced_object_id = OBJECT_ID(@table);

WHILE@@ROWCOUNT > 0

INSERT INTO @T(id)

SELECT referenced_object_id AS id

FROM sys.foreign_keys AS FK

JOIN @T

ON FK.parent_object_id = [@T].id

UNION

SELECT parent_object_id

FROM sys.foreign_keys AS FK

JOIN @T

ON FK.referenced_object_id = [@T].id

EXCEPT

SELECT id FROM @T;

SELECTOBJECT_SCHEMA_NAME(id) AS obj_schema_name, OBJECT_NAME(id) AS obj_name

FROM @T;

The statistics for this solution on my system with the new sample data are: Duration: 4 ms, Reads: 587.

WHERE NOT EXISTS(SELECT * FROM @Tables AS t WHERE t.TableID = w.TableID)

SELECT

OBJECT_SCHEMA_NAME(TableID) AS obj_schema_name,

OBJECT_NAME(TableID) AS obj_name

FROM @Tables;

The stats I got for this solution are: Duration: 4 ms, reads: 728. In short, both loop-based solutions are pretty efficient as you can see.

As for recursive solutions, I’ll start with the ones I came up with, then I’ll provide also the ones by Peso, Geri and Rubén.

My first attempt at a solution using a recursive query is the following (call it Itzik Recursive 1):

WITH C AS

(

SELECT OBJECT_ID(@table) AS obj,

'.' + CAST(OBJECT_ID(@table) AS VARCHAR(MAX)) + '.' AS objpath

WHERE OBJECT_ID(@table) IS NOT NULL

UNION ALL

SELECT D.obj, D.objpath + CAST( D.obj AS VARCHAR(MAX) ) + '.'

FROM ( SELECT

CASE WHEN C.obj = F.parent_object_id

THEN F.referenced_object_id

ELSE F.parent_object_id

END AS obj, objpath

FROM C

JOIN sys.foreign_keys AS F

ON C.obj IN (F.parent_object_id, F.referenced_object_id) ) AS D

WHERE objpath NOT LIKE '%.' + CAST(obj AS VARCHAR(10)) + '.%'

)

SELECT DISTINCT

OBJECT_SCHEMA_NAME(obj) AS obj_schema_name,

OBJECT_NAME(obj) AS obj_name

FROM C;

The anchor member returns the input table. The recursive member returns a row for each counterpart of the tables that appeared in the previous round as either a referencing or a referenced table. For each edge, the code constructs a dot separated path with all of the object ids of the objects leading to the current. In each recursive round, the code filters out nodes that already appear in their parent node’s path.

This solution is short (in terms of amount of code) and elegant, but isn’t very fast. It creates all possible paths starting with the input node and made of unique nodes. As you can imagine, the number of paths can explode quickly to a large number with additions of new related tables.

The statistics I got for this solution are: Duration: 7,812, Reads: 2,949,749.

In order for a recursive solution to be really fast you need to avoid constructing multiple different paths and visit each node only once. The two main challenges are: 1. How to apply DISTINCT logic in the recursive member without using the DISTINCT clause (not allowed in recursive member). 2. Avoid building multiple different paths when multiple nodes are counterparts in a foreign key relationship with a single node from the previous round.

Both challenges have a solution. Challenge 1 can be handled by calculating a row number partitioned by the node and then filtering only rows with row number 1. Challenge 2 can be handled by using a FOR XML PATH query to concatenate multiple nodes into one. Here’s the complete solution code (call it Itzik Recursive 2):

ROW_NUMBER() OVER(PARTITION BY CASE n WHEN 1 THEN entirepath ELSE obj END

ORDER BY (SELECT NULL)) AS rownum

FROM (

SELECT

LEFT(entirepath, LEN(entirepath) - 1) AS entirepath,

'~' + STR(obj, 11) AS obj

FROM (

SELECT

C.entirepath,

CAST(

CASE WHEN C.lastpath LIKE '%~' + STR(F.parent_object_id, 11) + '~%'

THEN F.referenced_object_id

ELSE F.parent_object_id

END AS VARCHAR(10)) AS obj

FROM C

JOIN sys.foreign_keys AS F

ON C.lastpath LIKE '%~' + STR(F.parent_object_id, 11) + '~%'

OR C.lastpath LIKE '%~' + STR(F.referenced_object_id, 11) + '~%'

) AS D1

WHERE entirepath NOT LIKE '%.' + STR(obj, 11) + '.%'

) AS D2

CROSS JOIN (SELECT 1 UNION ALL SELECT 2) AS Nums(n)

) AS D3

WHERE rownum = 1

ORDER BY element

FOR XML PATH('')

) AS D4(entirepath)

WHERE entirepath IS NOT NULL

) AS D5

),

Split AS

(

SELECT

CAST(SUBSTRING(lastpath, 2, 11) AS INT) AS obj,

STUFF(lastpath, 1, 12, '') AS lastpath

FROM C

UNION ALL

SELECT

CAST(SUBSTRING(lastpath, 2, 11) AS INT) AS obj,

STUFF(lastpath, 1, 12, '') AS lastpath

FROM Split

WHERE lastpath '~'

)

SELECT

OBJECT_SCHEMA_NAME(obj) AS obj_schema_name,

OBJECT_NAME(obj) AS obj_name

FROM Split;

Observe in the CTE C that only one path is built in each iteration of the recursive query, concatenating the path built by the previous round and the qualifying items from the current round (concatenation done with FOR XML PATH). The code uses a trick to distinguish between the part of the path representing nodes accessed prior to current round and nodes added in this round (could be more than one). The trick is to use different separators for the two sections in the layer of the code that concatenates all parts—old (. as separator) and new (~ as separator). The reason to include both sections in the same string and not as different strings is that a FOR XML PATH query can return only one string. Then a layer of code on top can extract the last part of the path (section starting with ~) giving it the attribute name lastpart, and replace all ocurrences of the character ~ with . to return the entire path, giving it the attribute name entirepath.

Finally, another recursive query called Split extracts the node ids from the lastpath attribute calculated in each round. First round (anchor member) handles first node in each lastpath, second round (already recursive query) handles second node in each last path, and so on, until no more nodes exist.

Since this solution visits each node only once it is very fast. The performance stats I got for it are: Duration: 17, Reads: 2,129.

Following are solutions using recursive queries by Peso, Geri and Ruben and their performance statistics.

Discuss this Blog Entry 3

I'll test them all on the data warehouse (with 160+ tables) later this morning.Earlier I only measured the duration for my and Itzik's solutions.

In short, the recursive CTE's (except Itzik 2) were killed after 12 hours!Itzik 2 ran for 400ms which is really good, and Peso loop ran for 40 ms.I posted another loop based solution based on composable dml which ran for 1200ms, and there is no point displaying that solution when my other loop based was so much faster.

Hre we go. For the published solutions above, only "Itzik Loop", "Peso Loop" and "Itzik Recursive 2" passed before the 30 minute mark on our Data Warehouse, which has 160 tables. I used the "anchor" table "dimMember" to start with as this has most connections through the database. The solutions passed the 30 minute mark all returns 144 tables.