Database Administrators Stack Exchange is a question and answer site for database professionals who wish to improve their database skills and learn from others in the community. It's 100% free, no registration required.

Each node is identified by (forest_id, node_id) (there can be another node with the same name in another forest). Each tree starts at a root node (where parent_id is null), although I'm only expecting one per forest.

As expected, this isn't very efficient. I'm partly surprised it doesn't seem to make use of any index.

Considering that this data would be read often but rarely modified (perhaps a small modification every couple of weeks), what possible techniques are there to optimise such queries and/or data representation?

EDIT: I would also like to retrieve the tree in depth-first order. Using ORDER BY path also degrades substantially the speed of the query above.

Sample Python program to populate the table with test data (requires Psycopg2), probably a bit more than I expect to have in a more realistic situation:

1 Answer
1

If you really have to modify these data rarely, then you can simply store the result of the CTE in a table, and run queries against this table. You can define indexes based on your typical queries.
Then TRUNCATE and repopulate (and ANALYZE) as necessary.

On the other hand, if you can put the CTE in separate stored procedures rather than a view, you can easily put your conditions in the CTE part rather then the final SELECT (which is basically what you do querying against tree_view_1), so that much less rows will be involved in the recursion. From the query plan it looks like that PostgreSQL estimates row numbers based on some far-from-true assumptions, probably producing suboptimal plans - this effect can be reduced somewhat with the SP solution.

EDIT I may miss something, but just noticed that in the non-recursive term you don't filter the rows. Possibly you want to include only root nodes there (WHERE parent_id IS NULL) - I'd expect much less rows and recursions this way.

EDIT 2 AS it slowly became clear for me from the comments, I misthought the recursion in the original question going the other way. Here I mean starting from the root nodes and going deeper in the recursion.

Thank you. I'm not sure what you mean by putting WHERE parent_id IS NULL in the non-recursive term (or generally filtering there): this prevents the recursion from happening.
–
BrunoJul 17 '12 at 12:47

Why would it prevent recursion? The non-recursive term just sets the 'anchor' for starting the recursion. So first you collect all your root nodes and then go through all the forests starting from there.
–
dezsoJul 17 '12 at 13:04

If I do this query without a view (directly WITH RECURSIVE ...) and use SELECT td.forest_id, td.node_id, td.parent_id, 0, ARRAY[td.node_id], FALSE FROM tree_data_1 td WHERE parent_id IS NULL, only the root nodes (where parent_id is null) are returned by the overall query. The other nodes are not taken into account (that's what I would expect).
–
BrunoJul 17 '12 at 13:11

Hm. Hmm. I think I see what I've missed: you state rec.parent_id = td.node_id there. But if the non-recursive term collects the root nodes then td.parent_id = rec.node_id is the condition of the next level. Isn't this what you wanted to mean? (I think not, you started the other way round...)
–
dezsoJul 17 '12 at 13:17

Ah, I see where you're getting at. If I change completely the order of the recursion (putting the WHERE parent_id IS NULL in the non-recursive term, selecting td.node_id instead of rec.node_id and using td.parent_id = rec.node_id in the recursive term), I do get the expected results faster indeed!
–
BrunoJul 17 '12 at 13:29