
I noticed a question on AskTom last November concerning SQL for splitting delimited strings, Extract domain names from a column having multiple email addresses, a kind of question that arises frequently on the forums. There was some debate between reviewers Rajeshwaran Jeyabal, Stew Ashton and the AskTom guys on whether an XML-based solution performs better or worse than a more 'classic' solution based on the Substr and Instr functions and collections. AskTom's Chris Saxon noted:

For me this just highlights the importance of testing in your own environment with your own data. Just because someone produced a benchmark showing X is faster, doesn't mean it will be for you.

For me, relative performance is indeed frequently dependent on the size and 'shape' of the data used for testing. As I have my own 'dimensional' benchmarking framework, A Framework for Dimensional Benchmarking of SQL Performance, I was able to very quickly adapt Rajesh's test data to benchmark across numbers of records and numbers of delimiters, and I put the results on the thread. I then decided to take the time to expand the scope to include other solutions, and to use more general data sets, where the token lengths vary as well as the number of tokens per record.

In fact the scope expanded quite a bit, as I found more and more ways to solve the problem, and I have only now found the time to write it up. Here is a list of all the queries considered:

Queries using Connect By for row generation

MUL_QRY - Cast/Multiset to correlate Connect By

LAT_QRY - v12 Lateral correlated Connect By

UNH_QRY - Uncorrelated Connect By unhinted

RGN_QRY - Uncorrelated Connect By with leading hint

GUI_QRY - Connect By in main query using sys_guid trick

RGX_QRY - Regular expression function, Regexp_Substr

Queries not using Connect By for row generation

XML_QRY - XMLTABLE

MOD_QRY - Model clause

PLF_QRY - database pipelined function

WFN_QRY - 'WITH' PL/SQL function directly in query

RSF_QRY - Recursive Subquery Factor

RMR_QRY - Match_Recognize

Test Problem

DELIMITED_LISTS Table

CREATE TABLE delimited_lists(id INT, list_col VARCHAR2(4000))
/

Functional Test Data

The test data consist of pipe-delimited tokens ('|') in a VARCHAR2(4000) column in a table with a separate integer unique identifier. For functional testing we will add a single 'normal' record with two tokens, plus four more records designed to validate null-token edge cases as follows:

All queries returned the expected results above, except that the XML query returned 12 rows, with only a single null token for record 5. In the performance testing no null tokens were included, and all queries returned the same results.

Performance Test Data

Each test set consisted of 3,000 records with the list_col column containing the delimited string dependent on width (w) and depth (d) parameters, as follows:

Each record contains w tokens

Each token contains d characters from the sequence 1234567890 repeated as necessary

The output from the test queries therefore consists of 3,000*w records with a unique identifier and a token of length d. For performance testing purposes the benchmarking framework writes the results to a file in csv format, while counting only the query steps in the query timing results.
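For illustration, a test set for given width and depth might be generated along these lines (a sketch, not the framework's actual generation code):

DECLARE
  l_w     PLS_INTEGER := 150;  -- tokens per record (w)
  l_d     PLS_INTEGER := 18;   -- characters per token (d)
  l_token VARCHAR2(4000);
BEGIN
  -- token of d characters from the repeating sequence 1234567890
  l_token := Rpad('1234567890', l_d, '1234567890');
  INSERT INTO delimited_lists
  SELECT LEVEL,
         Rtrim(Rpad(l_token || '|', (l_d + 1) * l_w, l_token || '|'), '|')
    FROM DUAL
  CONNECT BY LEVEL <= 3000;
  COMMIT;
END;
/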

In Oracle, up to and including version 11.2, SQL VARCHAR2 values cannot be longer than 4,000 bytes, so I decided to run the framework for four sets of parameters, as follows:

Depth fixed, high; width range low: d=18, w=(50,100,150,200)

Depth fixed, low; width range high: d=1, w=(450,900,1350,1800)

Width fixed, low; depth range high: w=5, d=(195,390,585,780)

Width fixed, high; depth range low: w=150, d=(6,12,18,24)

All the queries showed strong time correlation with width, while a few also showed strong correlation with depth.

Queries

All execution plans are from the data point with Width=1800, Depth=1, which has the largest number of tokens per record.

MUL_QRY - Cast/Multiset to correlate Connect By

This is the 'classic' CONNECT BY solution referred to above, which appears frequently on AskTom and elsewhere, and I copied the version used by Rajesh. The somewhat convoluted casting between subquery and array, and back to SQL records via Multiset, allows the prior table in the from list to be referenced within the inline view, which is otherwise not permitted in versions earlier than 12.1, where the LATERAL keyword was introduced.
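A minimal sketch of the Cast/Multiset pattern against the DELIMITED_LISTS table (the benchmarked version differs in detail):

SELECT d.id,
       Substr('|' || d.list_col || '|',
              Instr('|' || d.list_col || '|', '|', 1, t.COLUMN_VALUE) + 1,
              Instr('|' || d.list_col || '|', '|', 1, t.COLUMN_VALUE + 1)
            - Instr('|' || d.list_col || '|', '|', 1, t.COLUMN_VALUE) - 1) token
  FROM delimited_lists d,
       TABLE(CAST(MULTISET(
             -- correlated row generator: one row per token in the record
             SELECT LEVEL FROM DUAL
            CONNECT BY LEVEL <= Length(d.list_col)
                              - Length(Replace(d.list_col, '|')) + 1
           ) AS SYS.ODCINumberList)) t;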

Despite this query being regarded as the 'classic' CONNECT BY solution to string-splitting, we will find that it is inferior in performance, across all data points considered, to a query I wrote myself. The new query is also simpler, but is not the most efficient of all methods, as we see later.

LAT_QRY - v12 Lateral correlated Connect By

This query is taken from Splitting Strings: Proof!, and uses a v12.1 new feature, described with examples in LATERAL Inline Views. The new feature allows you to correlate an inline view directly, without the convoluted Multiset code, and can also be used with the keywords CROSS APPLY instead of LATERAL. It is sometimes regarded as having performance advantages, but in this context we will see that avoiding the correlation altogether is best for performance.
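A sketch of the same splitter using the v12.1 LATERAL syntax (an assumed form; the benchmarked query may differ):

SELECT d.id,
       Substr('|' || d.list_col || '|',
              Instr('|' || d.list_col || '|', '|', 1, l.pos) + 1,
              Instr('|' || d.list_col || '|', '|', 1, l.pos + 1)
            - Instr('|' || d.list_col || '|', '|', 1, l.pos) - 1) token
  FROM delimited_lists d,
       LATERAL (
         -- directly correlated row generator, no Multiset needed
         SELECT LEVEL pos
           FROM DUAL
        CONNECT BY LEVEL <= Length(d.list_col)
                          - Length(Replace(d.list_col, '|')) + 1) l;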

UNH_QRY and RGN_QRY - Uncorrelated Connect By, unhinted and hinted

I wrote the UNH_QRY query in an attempt to avoid the convoluted Multiset approach of the 'classic' solution. The reason for the use of arrays and Multiset seems to be that, while we need to 'generate' multiple rows for each source row, the number of rows generated has to vary by source record, and so the row-generating inline view computes the number of tokens for each record in its where clause.

The use of row-generating subqueries is quite common, but in other cases one often has a fixed number of rows to generate, as in data densification scenarios for example. It occurred to me that, although we don't know the number to generate, we do have an upper bound, dependent on the maximum number of characters, and we could generate that many in a subquery, then join only as many as are required to the source record.

This approach resulted in a simpler and more straightforward query, but it turned out in its initial form to be very slow. The execution plan above shows that the row generator is driving a nested loops join within which a full scan is performed on the table. The CBO is not designed to optimise this type of algorithmic query, so I added a leading hint to reverse the join order, and this resulted in much better performance. In fact, as we see later the hinted query outperforms the other CONNECT BY queries, including the v12.1 LAT_QRY query at all data points considered.
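A sketch of the hinted, uncorrelated approach (RGN_QRY-style; the generator bound of 2,000 reflects the maximum possible tokens in a 4,000-character column, and the hint alias is an assumption):

WITH row_gen AS (
  -- uncorrelated generator: enough rows for the maximum possible token count
  SELECT LEVEL pos FROM DUAL CONNECT BY LEVEL <= 2000
)
SELECT /*+ LEADING(d) */
       d.id,
       Substr('|' || d.list_col || '|',
              Instr('|' || d.list_col || '|', '|', 1, r.pos) + 1,
              Instr('|' || d.list_col || '|', '|', 1, r.pos + 1)
            - Instr('|' || d.list_col || '|', '|', 1, r.pos) - 1) token
  FROM delimited_lists d
  JOIN row_gen r
    ON r.pos <= Length(d.list_col) - Length(Replace(d.list_col, '|')) + 1;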

GUI_QRY - Connect By in main query using sys_guid trick

This query also generates rows using CONNECT BY, but differs from the others shown by integrating the row-generation code with the main rowset and avoiding a separate subquery against DUAL. This seems to be a more recent approach than the traditional Multiset solution. It uses a trick involving the system function sys_guid() to avoid the 'connect by cycle' error that you would otherwise get, as explained in this OTN thread: Reg : sys_guid().
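A sketch of the sys_guid approach (assumed form):

SELECT id,
       Substr('|' || list_col || '|',
              Instr('|' || list_col || '|', '|', 1, LEVEL) + 1,
              Instr('|' || list_col || '|', '|', 1, LEVEL + 1)
            - Instr('|' || list_col || '|', '|', 1, LEVEL) - 1) token
  FROM delimited_lists
CONNECT BY LEVEL <= Length(list_col) - Length(Replace(list_col, '|')) + 1
       AND PRIOR id = id
       -- sys_guid() changes on every row, so no two rows compare equal
       -- and the 'connect by cycle' error is avoided
       AND PRIOR sys_guid() IS NOT NULL;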

Unfortunately, and despite its current popularity on OTN, it turns out to be even less efficient than the earlier approaches, by quite a margin.

PLF_QRY - database pipelined function

This is a fairly well-known approach to the problem that involves doing the string-splitting within a pipelined database function that is passed the delimited string as a parameter. I wrote my own version for this article, taking care to make only one call to each of the Oracle functions Instr and Substr within a loop over the tokens.
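A sketch of such a pipelined function and its use (the names are my own assumptions; the article's version may differ):

CREATE OR REPLACE TYPE token_list IS TABLE OF VARCHAR2(4000);
/
CREATE OR REPLACE FUNCTION split_pipe(p_string VARCHAR2,
                                      p_delim  VARCHAR2 DEFAULT '|')
  RETURN token_list PIPELINED IS
  l_start PLS_INTEGER := 1;
  l_pos   PLS_INTEGER;
BEGIN
  LOOP
    -- one Instr and one Substr call per token
    l_pos := Instr(p_string, p_delim, l_start);
    IF l_pos = 0 THEN
      PIPE ROW(Substr(p_string, l_start));
      EXIT;
    END IF;
    PIPE ROW(Substr(p_string, l_start, l_pos - l_start));
    l_start := l_pos + 1;
  END LOOP;
  RETURN;
END;
/
SELECT d.id, t.COLUMN_VALUE token
  FROM delimited_lists d,
       TABLE(split_pipe(d.list_col)) t;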

The results confirm that it is in fact the fastest approach over all data points considered, and CPU time increased approximately linearly with number of tokens.

WFN_QRY - 'WITH' PL/SQL function directly in query

Oracle introduced the ability to include a PL/SQL function definition directly in a query in version 12.1. I converted my pipelined function into a function within a query, returning an array of character strings.
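A sketch of the WITH-function form (again an assumed version, returning a varray that is then table-unnested):

WITH FUNCTION split_w(p_string VARCHAR2) RETURN SYS.ODCIVarchar2List IS
  l_ret   SYS.ODCIVarchar2List := SYS.ODCIVarchar2List();
  l_start PLS_INTEGER := 1;
  l_pos   PLS_INTEGER;
BEGIN
  LOOP
    l_pos := Instr(p_string, '|', l_start);
    l_ret.EXTEND;
    IF l_pos = 0 THEN
      l_ret(l_ret.COUNT) := Substr(p_string, l_start);
      RETURN l_ret;
    END IF;
    l_ret(l_ret.COUNT) := Substr(p_string, l_start, l_pos - l_start);
    l_start := l_pos + 1;
  END LOOP;
END;
SELECT d.id, t.COLUMN_VALUE token
  FROM delimited_lists d,
       TABLE(split_w(d.list_col)) t
/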

As we would expect from the results of the similar pipelined function approach, this also turns out to be a very efficient solution. However, it may be surprising to many that it is significantly slower (20-30%) than using the separate database function, given the prominence that is usually assigned to context-switching.

RMR_QRY - Match_Recognize

Oracle introduced Match_Recognize in v12.1 as a mechanism for pattern matching along the lines of regular expressions, but matching patterns across records rather than within strings. I wrote the query for this article, converting each character in the input strings into a separate record to allow for its use.

This approach might seem somewhat convoluted, and one might expect it to be correspondingly slow. As it turns out, though, for most datasets it is faster than many of the other methods, the ones with very long tokens being the exception, and CPU time increased linearly with both number of tokens and number of characters per token. It is notable that, apart from the exception mentioned, it outperformed the regular expression query.

The CPU times are listed below; elapsed times were much the same. Each table has its columns in order of increasing CPU time by query at the last data point.

Depth fixed, high; width range low: d=18, w=(50,100,150,200)

Depth fixed, low; width range high: d=1, w=(450,900,1350,1800)

Width fixed, low; depth range high: w=5, d=(195,390,585,780)

Width fixed, high; depth range low: w=150, d=(6,12,18,24)

A Note on the Row Generation by Connect By Results

It is interesting to observe that the 'classical' mechanism for row-generation in string-splitting and similar scenarios turns out to be much slower than a simpler approach that removes the correlation of the row-generating subquery. This 'classical' mechanism has been proposed on Oracle forums over many years, while the simpler and faster approach seems to have gone undiscovered. The reason for its performance deficit is simply that running a Connect By query for every master row is inefficient. Use of the v12.1 LATERAL correlation syntax simplifies the code but does not improve performance by much.

The more recent approach to Connect By row-generation is to use the sys_guid 'trick' to embed the Connect By in the main query rather than in a correlated subquery, and this has become very popular on forums such as OTN. As we have seen, although simpler, this is even worse for performance: Turning the whole query into a tree-walk isn't good for performance either. It's better to isolate the tree-walk, execute it once, and then just join its result set as in RGN_QRY.

Conclusions

The database pipelined function solution (PLF_QRY) is generally the fastest across all data points

Using the v12.1 feature of a PL/SQL function embedded within the query is almost always next best, although slower by up to about a third; its being slower than a database function may surprise some

Times generally increased uniformly with numbers of tokens, usually either linearly or quadratically

Times did not seem to increase so uniformly with token size, except for XML (XML_QRY), Match_Recognize (RMR_QRY) and regular expression (RGX_QRY)

For larger numbers of tokens, three methods all showed quadratic variation and were very inefficient: Model (MOD_QRY), regular expression (RGX_QRY), and recursive subquery factors (RSF_QRY)

We have highlighted two inefficient but widespread approaches to row-generation by Connect By SQL, and pointed out a better method

These conclusions are based on the string-splitting problem considered, but no doubt would apply to other scenarios involving requirements to split rows into multiple rows based on some form of string-parsing.

Networks or hierarchies of arbitrary depth are difficult to traverse in SQL without using recursion. However, there also exist hierarchies of fixed and fairly small depths, and these can be traversed either recursively or by a sequence of joins for each of the levels. In this article I compare the performance characteristics of three traversal methods, two recursive and one non-recursive, using my own benchmarking package (A Framework for Dimensional Benchmarking of SQL Performance), on a test problem of a fixed level organization structure hierarchy, with 5 levels for performance testing and 3 levels for functional testing.

The three queries tested were:

JNS_QRY: Sequence of joins

PLF_QRY: Recursive pipelined function

RSF_QRY: Recursive subquery factors

Fixed Level Hierarchy Problem Definition

A hierarchy is assumed in which there are a number of root records, and at each level a parent can have multiple child records and a child can also have multiple parents. Each level in the hierarchy corresponds to an entity of a particular type. Each parent-child record is associated with a numerical factor, and the products of these propagate down the levels.

The problem considered is to report all root/leaf combinations with their associated products. There may of course be multiple paths between any root and leaf, and in a real world example one would likely want to aggregate. However, in order to keep it simple and focus on the traversal performance, I do not perform any aggregation.
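To make the sequence-of-joins idea concrete, a version for three link levels might look like the sketch below (table and column names are assumptions, not the actual test schema):

-- links(lvl, par_id, chi_id, fact): one row per parent-child link, with
-- lvl identifying the level of the parent entity
SELECT l1.par_id                   root_id,
       l3.chi_id                   leaf_id,
       l1.fact * l2.fact * l3.fact product
  FROM links l1
  JOIN links l2 ON l2.par_id = l1.chi_id AND l2.lvl = 2
  JOIN links l3 ON l3.par_id = l2.chi_id AND l3.lvl = 3
 WHERE l1.lvl = 1;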

To simplify functional validation a 3-level hierarchy was taken, with a relatively small number of records. The functional test data were generated by the same automated approach used for performance testing. The factor was obtained as a random number between 0 and 1, and, to keep it simple, duplicate pairs were permitted.

The test data were parametrised by width and depth as follows (the exact logic is a little complicated, but can be seen in the code itself):

Width corresponds to a percentage increase in the number of child entities relative to the number of parents

Depth corresponds to the proportion of the parent entity records a child is (randomly) assigned. Each child has a minimum of 1 parent (lowest depth), and a maximum of all parent entities (highest depth)

It is interesting to note that all joins in the execution plan are hash joins, and in the sequence you would expect. The first three are in the default join 'sub-order' that defines whether the joined table or the prior rowset (the default) is used to form the hash table, while the last two are in the reverse order, corresponding to the swap_join_inputs hint. I wrote a short note on that subject, A Note on Oracle Join Orders and Hints, last year, and have now written an article using the largest data point in the current problem to explore performance variation across the possible sub-orders.

For simplicity a stand-alone database function was used here. The query execution plan was obtained by the benchmarking framework, and the plan for the highest data point is listed. The query within the function was extracted and an Explain Plan performed manually, which showed the expected index range scan.

JNS_QRY is faster than RSF_QRY, which is faster than PLF_QRY at all data points

PLF_QRY tracks the number of output records very closely. This is likely because the function executes a query at every node in the hierarchy that uses an indexed search.

The pure SQL methods scale better through being able to do full table scans, and avoiding multiple query executions

Deep Slice Elapsed - CPU Times

The elapsed times minus the CPU times are shown in the first graph below, followed by the disk writes. The disk writes (and reads) are computed as the maximum values across the execution plan at the given data point, and are obtained from the system view v$sql_plan_statistics_all. The benchmarking framework gathers these and other statistics automatically.

The graphs show that the elapsed times minus CPU times track the disk accesses reasonably well

RSF_QRY performs nearly twice as many disk writes as the other two

Wide Slice Results [w=180, d=(100, 120, 140, 160, 180)]

The performance characteristics of the three methods across the wide slice data points are pretty similar to those across the deep slice. The graphs are shown below.

Conclusions

For the example problem taken, the most efficient way to traverse fixed-level hierarchies is by a sequence of joins

Recursive methods are significantly worse, and the recursive function is especially inefficient because it performs large numbers of query executions using indexed searches, instead of full scans

I recently posted an article on Dimensional Benchmarking of Oracle v10-v12 Queries for SQL Bursting Problems. That article added an Oracle v12 SQL solution, using Match_Recognize, to benchmark against some v10 and v11 solutions that I had posted on Scribd a few years ago. A few days before posting it I noticed an OTN thread with a problem that struck me as being of a similar type, Amalgamating groups to be beyond a given size threshold. Where in my original 'bursting' problem a group is defined by a maximum interval from its starting date, in the OTN problem a group is defined by the cumulative sum of a numeric attribute from the group starting record.

I added a comment on the thread at the time mentioning the results that I had got on the original problem, and adding a model solution for the problem raised on the new thread. I have now taken this second 'bursting'-type problem and have benchmarked both the main two solutions proposed on that thread (by other posters), as well as two versions of my own model solution, and a variant of the recursive subquery factor solution that uses a temporary table to achieve much faster performance.

Also, just yesterday I noticed a question on AskTom posing essentially the same problem as in my earlier article (which itself came from AskTom several years ago 🙂 ), Complex sql.

The results show that Match_Recognize, as before, is by far the most efficient solution. They also show that the faster solutions vary linearly with dataset size (within a given partition), while the slow ones vary quadratically. One interesting finding is that the Model clause solution can be changed from very slow, with quadratic variation, to linear variation, second in performance only to Match_Recognize, by using a rule ordering clause (which avoids the need for automatic rule ordering).

The problem is to determine break groups using a running aggregate based on some function of the record attributes, with a defined ordering, starting from the group starting record, and with a group's end record defined by the aggregate reaching (or exceeding) some limit. One may consider the first record reaching (or exceeding) the limit to define the first record in the next group, as in the original bursting problem, or to be the last record in the current group, as in the OTN example.

The data are partitioned by some key in general.

OTN-like Item Weights 'Bursting' Problem

The data structure used in this article is based on that of the original poster in the OTN thread, but with more generic table and column names.

I created test data with a test weight limit of 10, as follows, with groups shown at detailed level. The first two categories are taken from the OTN problem, while I added a third category to test the case where the limit is not reached.

In the Match_Recognize query proposed in the OTN thread the pattern is defined in terms of two categories, say s and t, where:

s denotes a record where the running sum < the limit

t denotes a record where the running sum >= the limit

The pattern to match can be written as (s* t?) meaning zero or more category s records, followed by zero or one category t records. This immediately suggests that any given match falls into one of the following scenarios for frequencies of (s, t):

(0, 0) - this looks like an empty set of records, but could be non-empty if null values were allowed for the weight

(1+, 0) - the case where the limit is not reached, which must be the last match if there are no null weights

(0, 1) - where the first record in a group reaches the limit by itself

(1+, 1) - where one or more records in a group are below the limit, followed by a record that reaches the limit

In the results above, we see that group 22 matches scenario 2, while groups 10, 16 and 17 match scenario 3, and the remainder match scenario 4. We take the weight to be not null so scenario 1 is not possible. This kind of 'scenario coverage' is much more important than the 'code coverage' that is often focussed on in testing, especially by object oriented programmers.

In the following sections for individual queries, the query (and other SQL) is listed first, followed by the execution plan for the largest problem (W40-D8000).

In some cases, Oracle Database may not be able to ascertain that your model is acyclic even though there is no cyclical dependency among the rules. This can happen if you have complex expressions in your cell references. Oracle Database assumes that the rules are cyclic and employs a CYCLIC algorithm that evaluates the model iteratively based on the rules and data. Iteration stops as soon as convergence is reached and the results are returned. Convergence is defined as the state in which further executions of the model will not change values of any of the cell in the model. Convergence is certain to be reached when there are no cyclical dependencies.

When we specify automatic order, the solution is obtained without error using Oracle's cyclic algorithm (operation SQL MODEL CYCLIC). Unfortunately, in this case there is a large performance impact, and we will see in the results section that execution time varies as the square of the number of records within a partition, i.e. quadratically.

In the query above, the rules order clause is omitted, thus defaulting to sequential, while avoiding the ORA-32637 error. This is achieved by specifying ORDER BY rn DESC on the left side of the second rule. The solution, via operation SQL MODEL ORDERED is much faster, and we will see in the results section that execution time now varies linearly with the number of records within a partition.
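A sketch of a Model query along these lines, assuming a table items(cat, seq, weight) and the functional-test weight limit of 10 (my own reconstruction, not the benchmarked query):

SELECT cat, seq, weight, sub_weight, final_grp
  FROM items
 MODEL
   PARTITION BY (cat)
   DIMENSION BY (Row_Number() OVER (PARTITION BY cat ORDER BY seq) rn)
   MEASURES (seq, weight, weight sub_weight, seq final_grp)
   RULES (
     -- running sum, resetting after a group-closing record (ascending order)
     sub_weight[rn > 1] ORDER BY rn =
       CASE WHEN sub_weight[cv() - 1] >= 10 THEN weight[cv()]
            ELSE sub_weight[cv() - 1] + weight[cv()]
       END,
     -- propagate the group-closing seq backwards; the descending order on
     -- the left side is what avoids the ORA-32637 error under the default
     -- sequential rule order
     final_grp[ANY] ORDER BY rn DESC =
       CASE WHEN sub_weight[cv()] >= 10 THEN seq[cv()]
            ELSE Nvl(final_grp[cv() + 1], seq[cv()])
       END
   )
 ORDER BY cat, seq;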

The query above is essentially the same as one proposed by a poster on the OTN thread, with a slight tweak to the pattern that does not alter its meaning, and a change to return one row per match. The query performs much more efficiently than any of the others, using the Match_Recognize clause introduced in Oracle 12.1 SQL for Pattern Matching.
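A sketch of the Match_Recognize form, on the same assumed items(cat, seq, weight) table and limit of 10:

SELECT cat, start_seq, end_seq, num_rows, grp_weight
  FROM items
 MATCH_RECOGNIZE (
   PARTITION BY cat
   ORDER BY seq
   MEASURES FIRST(seq)  start_seq,
            LAST(seq)   end_seq,
            COUNT(*)    num_rows,
            SUM(weight) grp_weight
   ONE ROW PER MATCH
   -- zero or more under-limit records, then at most one that reaches it
   PATTERN (s* t?)
   DEFINE s AS SUM(weight) < 10,
          t AS SUM(weight) >= 10
 );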

This is based on the second of the recursive subquery factor queries in the OTN thread, and we can see the performance issue in the plan above. The recursive branch of the UNION ALL executes once for each record within a partition and performs a full scan on the items table each time. This results in execution time varying as the square of the number of records within a partition, as can be seen in the results section later. The performance can be much improved by using a temporary table, as in the next query.

In this solution, the initial subquery from the previous query is written to a temporary table that is indexed on the join column. This means that the join in the recursive branch of the UNION ALL is indexed and much quicker, resulting in linear variation in execution time with the number of records in a partition.

Notice that it was necessary to hint the index usage. It is possible to achieve the indexed join without a hint by including a call to gather statistics in the pre-query SQL. Unfortunately, Oracle's DBMS_Stats procedure performs a commit - which clears the data from the temporary table. Although we could get around the clearing of the table by making it a normal table and manually truncating it, it is probably better to accept this as a valid use-case for a hint - after all, the whole purpose of the temporary table is to permit index use.
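A sketch of the temporary-table variant (table, index and hint names are assumptions):

CREATE GLOBAL TEMPORARY TABLE items_tmp (
  cat    VARCHAR2(30),
  rn     NUMBER,
  seq    NUMBER,
  weight NUMBER
) ON COMMIT DELETE ROWS;

CREATE INDEX items_tmp_n1 ON items_tmp (cat, rn);

INSERT INTO items_tmp
SELECT cat, Row_Number() OVER (PARTITION BY cat ORDER BY seq), seq, weight
  FROM items;

WITH rsf (cat, rn, seq, weight, sub_weight) AS (
  SELECT cat, rn, seq, weight, weight
    FROM items_tmp
   WHERE rn = 1
   UNION ALL
  SELECT /*+ INDEX(i items_tmp_n1) */
         i.cat, i.rn, i.seq, i.weight,
         CASE WHEN r.sub_weight >= 10 THEN i.weight
              ELSE r.sub_weight + i.weight END
    FROM rsf r
    JOIN items_tmp i ON i.cat = r.cat AND i.rn = r.rn + 1
)
SELECT cat, seq, weight, sub_weight
  FROM rsf
 ORDER BY cat, seq;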

Performance Testing Results

The 'width' parameter is taken to be the number of cat values partitioning the dataset, while the 'depth' parameter is taken to be the number of records within each category. The weight is assigned a random integer between 1 and 100, and the weight limit is 5,000.

Record Counts Table

Input Record Counts

Depth      W10        W20        W40
D1000      10,000     20,000     40,000
D2000      20,000     40,000     80,000
D4000      40,000     80,000     160,000
D8000      80,000     160,000    320,000

Output Record Counts

Depth      W10        W20        W40
D1000      105        209        422
D2000      205        411        829
D4000      406        815        1,625
D8000      808        1,618      3,229

Elapsed Times Table (elapsed seconds)

MOD_QRY

          Elapsed Seconds        Depth Ratios to Prior    Width Ratios to Prior
Depth     W10    W20     W40     W10     W20     W40      W20     W40
D1000      16     47      99                              2.9     2.1
D2000      62    190     397     3.9     4.0     4.0      3.1     2.1
D4000     243    762   1,390     3.9     4.0     3.5      3.1     1.8
D8000     962  2,082   5,566     4.0     2.7     4.0      2.2     2.7
Average                          3.9     3.6     3.8      2.8     2.2

MOD_QRY_D

          Elapsed Seconds        Depth Ratios to Prior    Width Ratios to Prior
Depth     W10    W20    W40      W10     W20     W40      W20     W40
D1000     0.08   0.16   0.31                              2.0     1.9
D2000     0.16   0.30   0.61     2.0     1.9     2.0      1.9     2.0
D4000     0.30   0.59   1.19     1.9     2.0     2.0      2.0     2.0
D8000     0.59   1.20   2.42     2.0     2.0     2.0      2.0     2.0
Average                          1.9     2.0     2.0      2.0     2.0

MTH_QRY

          Elapsed Seconds           Depth Ratios to Prior       Width Ratios to Prior
Depth     W10      W20      W40     W10       W20      W40      W20       W40
D1000     0        0.016    0.016                               #DIV/0!   1.0
D2000     0.016    0.016    0.047   #DIV/0!   1.0      2.9      1.0       2.9
D4000     0.016    0.031    0.078   1.0       1.9      1.7      1.9       2.5
D8000     0.047    0.094    0.172   2.9       3.0      2.2      2.0       1.8
Average                             2.0       2.0      2.3      1.6       2.1

RSF_QRY

          Elapsed Seconds        Depth Ratios to Prior    Width Ratios to Prior
Depth     W10    W20    W40      W10     W20     W40      W20     W40
D1000       6     12     24                               2.0     2.0
D2000      23     46     94      3.8     3.8     3.9      2.0     2.0
D4000      92    185    377      4.0     4.0     4.0      2.0     2.0
D8000     369    750  1,513      4.0     4.1     4.0      2.0     2.0
Average                          3.9     4.0     4.0      2.0     2.0

RSF_TMP

          Elapsed Seconds        Depth Ratios to Prior    Width Ratios to Prior
Depth     W10    W20    W40      W10     W20     W40      W20     W40
D1000     0.09   0.19   0.36                              2.1     1.9
D2000     0.19   0.38   0.73     2.1     2.0     2.0      2.0     1.9
D4000     0.39   0.77   1.53     2.1     2.0     2.1      2.0     2.0
D8000     0.77   1.55   3.49     2.0     2.0     2.3      2.0     2.3
Average                          2.0     2.0     2.1      2.0     2.0

Slice Graphs

Performance Discussion

Variation with Width

The width parameter represents the number of categories here, and category (CAT) is the query partitioning key. We might therefore expect that the execution time would be proportional to the width when the depth parameter is fixed. The width values used were 10, 20 and 40, so we would expect times to double between W10 and W20, and again between W20 and W40.

In fact, we see from the width ratios columns in the tables that this expectation is very closely matched in the cases of MOD_QRY_D, RSF_QRY and RSF_TMP.

For MOD_QRY, the ratios are quite variable, and mostly above 2, so that the CYCLIC Model algorithm does not meet our expectation.

For MTH_QRY (Match_Recognize), the elapsed times are very small, 0.17 seconds for the largest problem (14 times faster than the next best, MOD_QRY_D), and that likely explains the variance.

Variation with Depth

The depth parameter represents the number of records for each category. The depth ratios show that two of the queries show very close to quadratic variation of time with depth, while three show very close to linear variation, and the linear queries are unsurprisingly much faster.

MOD_QRY and RSF_QRY vary quadratically with depth (number of records per partition key).

As in the earlier article, the new v12.1 feature Match_Recognize proved to be much faster than the other techniques for this problem

The solution using the Model clause with the operation SQL MODEL CYCLIC showed quadratic variation in execution times with size, but a very simple change to allow the SQL MODEL ORDERED operation produced linear variation, and was second only to Match_Recognize in performance

Recursive subquery factoring had timings that increased quadratically with number of records; this was due to a combination of the number of starts of a subquery, and full scans within it

In March 2013 I wrote an article on the use of SQL to group network-structured records into their distinct connected subnetworks, SQL for Network Grouping. I looked at two solution approaches commonly put forward on Oracle forums for these types of problem, using Oracle's Connect By recursion, and the more recent recursive subquery factoring, and also put forward a new solution of my own using the Model clause. I noted however that SQL solutions are generally very inefficient compared with a good PL/SQL solution, such as I posted here, PL/SQL Pipelined Function for Network Analysis. For the first two methods, I noted:

Non-hierarchical networks have no root nodes, so the traversal needs to be repeated from every node in the network set

Hierarchical queries retrieve all possible routes through a network

I also noted that Connect By is more inefficient than recursive subquery factoring, but did not say why, promising a more detailed explanation at a later date. In this article I illustrate the behaviour of both recursive SQL methods through a series of five elementary networks, followed by a simple combination of the five. I then use the foreign key network from Oracle's HR demo (v12 version, with OE and PM schemas included) as a final example.

In this article I consider traversal of a single connected network from a given root node (or several if each root node is specified).

It is shown that the behaviour of Connect By can be understood best by considering it to traverse all paths through a network that is dual to the original network.

Dual Networks

Dual network definition

The dual network consists of a set of nodes and links (d-nodes and d-links say) defined thus:

the d-nodes correspond to each link in the original network that is adjacent (via a node) to at least one other link, including itself if its start and end nodes are the same

the d-links correspond to each pair of adjacent links where the 'from' link identifier is alphabetically smaller than that of the 'to' link, except for the case of links that are adjacent to themselves where a single d-link has the same 'from' and 'to' link

Dual network SQL

The d-node identifiers are just the link identifiers, while the d-link identifiers use the adjacency-defining node identifiers with a sequential number (partitioned by node) attached.

Dual networks defined as above are generally larger than the original networks and are usually more heavily looped, which explains the inferior performance of Connect by compared with recursive subquery factor solutions. The PL/SQL solution mentioned above, while traversing the entire network, does not traverse all possible routes through it and its performance is thus not adversely affected by the degree of looping.

SQL Queries

The recursive SQL queries return all routes through the network from the roots supplied. In my attached script I also have versions that filter out repeated links. The pipelined function query returns a single, exhaustive route through the network, distinguishing a set of tree links from loop-closing links; it also returns all subnetworks without requiring input roots.

The CONNECT_BY_ISCYCLE pseudocolumn returns 1 if the current row has a child which is also its ancestor. Otherwise it returns 0

Connect By queries do not return loop-closing nodes, and the prior node is marked as the cycle node.

Recursive Subquery Factor Cycles

A row is considered to form a cycle if one of its ancestor rows has the same values for the cycle columns.

Recursive Subquery Factor queries do return loop-closing nodes, and these nodes are marked as the cycle nodes.

We will see this differing behaviour clearly in the following examples. We will also see that the Connect By output on the original network has exactly the same structure as recursive subquery factor output on the dual network if the loop-closing rows are disregarded. Cycle nodes on both definitions are marked with a '*' in the outputs below.

In the output below I deleted all the loop rows from the RSF output for the dual network and placed the result beside the output for CBY on the original network, using a column-wise copy and paste. It is then easy to see their equivalent structure. Both have 69 rows.

Neither of the two SQL recursion methods completed within an hour, and both had to be terminated. The result for CBY on the original network suggests that RSF on the dual network would return somewhere above 4,414,420 rows.

Conclusions

We have shown by examples how network traversal by the Connect By (CBY) approach in SQL corresponds to traversal of all routes in a type of dual version of the original network

This dual version, which has forks converted to loops, tends to be larger and more heavily looped, resulting in worse performance compared with solution by recursive subquery factors (RSF)

The examples illustrate the different treatment of loop-closing links between the two types of SQL recursion

The RSF solution on the dual network, in the simpler examples where it completes, is seen to be equivalent to the CBY solution on the original network, after allowing for the different treatment of loop-closing links

I wrote an article a couple of weeks ago, SQL for Shortest Path Problems, in which analytic functions are used to truncate sub-optimal routes in SQL recursions for shortest paths through networks. The problem was posed by an OTN poster, How to use Recursive Subquery Factoring (RSF) to Implement Dijkstra’s shortest path algorithm?, who referenced a very simple test network, and included his own SQL to solve it, which turned out to be quite similar to my own effort. The solutions are guaranteed to be optimal if the algorithm terminates normally, which it does on the small test network, and will on any network unless resources such as memory or time are exhausted owing to problem size.

In that article I referenced two earlier articles that I had written (in June/July 2013) that used analytic functions for other combinatorial problems. The usage in those cases was similar syntactically, but pruned out routes that looked inferior to others at a given iteration, so that the final solutions were not guaranteed to be optimal. The motivation was to be able to use the SQL for exact solutions on smaller problems, and for good, maybe sub-optimal, solutions on problems too large to solve exactly.

I wondered how the SQL in the last article would perform on larger networks, and whether further tuning methods could be found, perhaps based on some form of search truncation, as in the earlier articles. The resulting solution methods can be considered as branch and bound algorithms in SQL.

Both data sets are of the non-physical type, where there are no differential costs associated with links, and the problem therefore reduces to determining the minimum number of links between a node and other reachable nodes. These non-physical networks tend to be heavily looped, owing to the essentially zero cost of adding a new link. For that reason, I change my problem definition in this article to that of finding a single best path to each reachable node, rather than all, which reduces the solution set size considerably.

It is well known that in non-physical networks, such as social media networks like LinkedIn and Facebook, the minimum paths between members tend to remain relatively small as the network size goes up. This will influence the type of algorithm that will be more efficient.

Approximation Methods: Simple truncation

The most obvious approximate approach is simply to truncate the search after a certain depth (or level). This actually works quite well and gives good results for highly looped networks, where the minimum paths tend to be much shorter than the number of nodes. However, there is no guarantee of optimality, and it will be less effective for less looped networks with longer minimum paths.

Row_Number gives a single row with value 1 per partitioning node, so that we retain only one row per node for the previous iteration

The global pruning can be done without an additional subquery by grouping with KEEP, since we only want one optimal row per node

Note that in the double Max to get maxlev, the inner one is the grouping Max, while the outer is an analytic Max over the whole (grouped) result set

intnod obtains the maximum intermediate value of lev for a given node

intmax obtains the maximum intermediate value of lev over all nodes

Approximation Methods: Preliminary approximate subquery

A less obvious approach is based on the fact that during the recursion our path pruning can only take into account information available to the current iteration: Other than loops, we can prune out only paths to a given node that are longer than another at the same level. But what if we ran an approximate search in advance, in a prior subquery? Then we could outer-join the subquery by node and prune out any paths for which the subquery has found a better cost. This would potentially reduce the total searching without sacrificing guaranteed optimality.

approx_best_paths gets the global best paths found from the approximate recursion

paths now has an outer join to path_0 that is used to prune paths that are inferior to any path to the same node found in the prior subquery

the approximate recursion may not reach all reachable nodes, but the outer join ensures this does not cut off any such nodes incorrectly

Approximation Methods: Preliminary approximate subquery to GTT

We will see later that the second approach works quite well, but that the CBO does not process the preliminary query very efficiently. For this reason, writing the query result to a temporary table may be more efficient overall. The table can be indexed, and dynamic sampling allows the CBO to estimate the cardinalities more accurately.

SQL for SP_GTTRSF_I and SP_GTTRSF_Q (preliminary approximate query to GTT)

The Arxiv GR-QC (General Relativity and Quantum Cosmology) collaboration network is from the e-print arXiv and covers scientific collaborations between authors of papers submitted to the General Relativity and Quantum Cosmology category. If author i co-authored a paper with author j, the graph contains an undirected edge from i to j. If a paper is co-authored by k authors, this generates a completely connected (sub)graph on k nodes.

The data set comes with the reverse arcs already present, making a total of 28,980.

I took the first node in the first line in the data set file as the initial root node, 3466, and tested the three methods above for values of LEVMAX of 5, 10, 15, 20, 25 and 30. The complete results, including execution plans are in the attached file.

The exact solution has 4,158 nodes reached from the source node 3466, with a maximum level of 11. The listing below gives the output from SP_GTTRSF_Q for LEVMAX=5.

SP_RSFTWO ran in from 27 seconds to 552 seconds, and returned the exact solution in all cases

From the intmax value (Excel file) we can see that only in the case of LEVMAX=5 did the second recursion iterate beyond the optimum level of 11 (in fact, to a level of 45). This would be due to the preliminary approximate subquery allowing it to discard sub-optimal paths in the second recursion

The execution plan above shows that Oracle CBO has chosen not to materialize the first recursive subquery, and essentially reruns it at each of the 45 outer iterations

A better approach for the CBO to have taken would appear to be to form a hash table in memory (where possible) of the first subquery, and run each iteration of the outer recursion outer-joining to that unchanging hash table; or alternatively, to materialize it and outer-join in any other way deemed appropriate

I tried hinting the subquery to get the CBO to materialise or avoid the repetition of the first subquery, but without success, and so decided that materialising it myself using a temporary table would be a good idea

SP_GTTRSF_I and SP_GTTRSF_Q ran in from 5 seconds to 54 seconds combined

LEVMAX=10 gave slightly the better result here: Evidently, the extra work in the insert compared with LEVMAX=5 was over-compensated by the benefit of a better approximate solution

Once LEVMAX is sufficiently large to give a good approximate solution it is most efficient not to increase it further

This method is almost as efficient as simply truncating at a given level, but guarantees optimality

Network analysis

I have my own network analysis program, implemented as a PL/SQL pipelined function. I thought this might be useful to help validate the results. The function returns all distinct subnetworks and is called three times to give different levels of detail. It runs against a view links_v that must be created for the data source containing the network links. Here is the output:

The results show that the data set contains 354 connected networks, with one much larger than the rest, having 4,158 nodes. This (more or less) matches the results we got from source 3466. Actually, we should get one fewer record back than the number of nodes in the network, but as in the earlier article the SQL returns the source node in one record - we can easily fix this, but it's not worthwhile redoing all the results, so I leave it as is.

As another check, we can run against the second largest network, using 10677 as the source, which should give 14 records. Here is the result for SP_GTTRSF_Q, LEVMAX=10.

Brightkite was once a location-based social networking service provider where users shared their locations by checking-in. The friendship network was collected using their public API, and consists of 58,228 nodes and 214,078 edges. The network is originally directed but we have constructed a network with undirected edges when there is a friendship in both ways.

The data set comes with the reverse arcs already present, making a total of 428,156 arcs.

I took the first node in the first line in the data set file as the initial root node, 0, and tested only the method SP_GTTRSF_I/Q, for LEVMAX values of 5 and 10. The complete results, including execution plans, are in the attached file.

The exact solution has 56,739 nodes reached from the source node 0, with a maximum level of 10. The output is a bit large to embed, so I attach it in a file, but here are the last few lines of the exact solution:

We can do a further test by using a source from a smaller network. 51944 is a root of the network of 49 nodes (as the full output shows). Here is the result of sourcing from that, again consistent with the network analysis program, which uses a completely different algorithm:

It has a master entity with two independent detail entities, and therefore requires a minimum of two queries.

But why does such a structure require two queries? And can we determine the minimum number of queries for reports in general? To start with, let's define a report in this context as being a hierarchy of record groups, where:

a record group is a set of records having the same columns with (possibly) differing values

each group is linked to a single record in its (single) parent group by values in the parent record, except the top level (or root) group

For example, in the earlier post the root group is a set of bank accounts, with the two detail (or child) groups being the set of owners of the bank account and the set of audit records for the bank account parent record. Corresponding to this group structure, each bank account record is the root of the data hierarchies, comprising two sets of records below the bank account record, one for the owners and one for the audit records linked to the root record by the bank account id.

A (relational) query always returns a flat record set, and it's this fact that determines the minimum number of queries required for a given group structure. A master-detail group structure can be flattened in the query by simply copying master fields on to the child record sets. The set cardinality is then the cardinality of the child set. The report designer uses their chosen reporting tool to specify display of the queried data in either flat, or in master-detail format.

In fact this approach works for any number of child levels, with the query cardinality being the number of bottom level descendants (using null records for potential parents that are in fact childless). It's clear though that the approach will not work for any second child at the same level because there would be two cardinalities and no meaningful single record for both child groups could be constructed within a flat query.

This reasoning leads to the conclusion that the minimum number of queries required in general is equal to the number of groups minus the number of parent groups. In the bank account example there are three groups, of which one (the accounts root) is a parent, giving a minimum of 3 - 1 = 2 queries.

In the earlier post I also stated:

This minimum number of queries is usually the best choice...

There are two main reasons for this:

each child query fires for every record returned by its parent, with associated performance impact

maintenance tends to be more difficult with extra queries; this is much worse when the individual groups, which should almost always be implemented by a maximum of one query each, are split, and then need to be joined back together procedurally

On thinking about this, it occurred to me that if the group structure were defined in a metadata table we might be able to return minimum query structures using an SQL query. Just one, obviously 🙂 . To save effort we could use Oracle's handy HR demo schema with the employee hierarchy representing groups.

The remainder of this article describes the query I came up with. As it's about hierarchies, recursion is the technique to use, and this is one of those cases where Oracle's old tree-walk syntax is too limited, so I am using the Oracle 11.2 recursive subquery factoring feature.

The query isn't going to be of practical value for report group structures since these are always quite small in size, but I expect there are different applications where this kind of Primogeniture Recursion would be useful.

Oracle eBusiness applications allow audit history records to be automatically maintained on database tables, as explained in the release 12 System Administrator's guide, Reporting On AuditTrail Data.

Oracle E-Business Suite provides an auditing mechanism based on Oracle database triggers. AuditTrail stores change information in a "shadow table" of the audited table. This mechanism saves audit data in an uncompressed but "sparse" format, and you enable auditing for particular tables and groups of tables ("audit groups").

Oracle provides an Audit Query Navigator page where it is possible to search for changes by primary key values. For reporting purposes, the manual says:

You should write audit reports as needed. Audit Trail provides the views of your shadow tables to make audit reporting easier; you can write your reports to use these views.

In fact the views are of little practical use, and it is quite hard to develop reports that are user-friendly, efficient and not over-complex, owing, amongst other things, to the "sparse" data format. However, once you have developed one you can use that as a design pattern and starting point for audit reporting on any of the eBusiness tables.

In this article I provide such a report for auditing external bank account changes on Oracle eBusiness 12.1. The report displays the current record, with lookup information, followed by a list of the changes within an input date range. Only records that have changes within that range are included, and for each change only the fields that were changed are listed. The lookup information includes a list of detail records, making the report overall pretty general in structure: It has a master entity with two independent detail entities, and therefore requires a minimum of two queries. This minimum number of queries is usually the best choice and is what I have implemented (it's fine to have an extra query for global data, but I don't have any here). The main query makes extensive use of analytic functions, case expressions and subquery factors to achieve the necessary data transformations as simply and efficiently as possible.

The report is implemented in XML (or BI) Publisher, which is the main batch reporting tool for Oracle eBusiness.

I start by showing sample output from the report, followed by the RTF template. The queries are then documented, starting with query structure diagrams with annotations explaining the logic. A link is included to a zip file with all the code and templates needed to install the report. Oracle provides extensive documentation on the setup of Auditing and developing in XML Publisher, so I will not cover this.

Report Layout

Example Output in Excel Format

There are three regions:

Bank Account Current Record - the master record

Owners - first detail block, listing the owners of the bank account

Bank Account Changes - the second detail block, listing the audit history. Note that unchanged fields are omitted

Note that some audit fields are displayed directly, such as account number, while for others, such as branch number, the display value is on a referenced table

XML Publisher RTF Template

Note that each audit field has its own row in the table, but the if-block excludes it if both old and new values are null

Recently an OTN poster asked how to return values from a column in multiple records into a single output record and column in SQL, Multiple Rows Into One Column Field. In the usual version of this list aggregation problem, one record is required for each distinct combination of grouping columns, with the aggregation fields delimited within the output field, and from Oracle v11.2 there is a built-in function for it, ListAgg. However, in this case the poster wanted a maximum of three values in the output field, with overflow records as necessary. Tom Kyte solved the problem in that thread essentially by adding in a calculated row number to the grouping columns, and concatenating the aggregation fields directly in the code.

Tim Hall has compiled a list of the main techniques available for the standard problem for different versions of the database, up to v11.2, here: String Aggregation Techniques

I thought it would be an interesting and useful variation on the problem to base record overflow on concatenated length rather than on number of values. This would provide an alternative to the CLOB-based variations for cases where the length exceeds 4KB in v11.2 or earlier (v12.1 raises the string limit to 32KB), and cases where the records just need to be of limited length for display or other purposes. Here's an example on another forum of a requirement to handle long strings: Ordering by list of strings in Oracle SQL without LISTAGG.

On the OTN thread, I provided an SQL solution for this variation, using recursion and the new v12.1 MATCH_RECOGNIZE syntax for row pattern matching, and an alternative using the MODEL clause. In this article I provide modified versions with explanations and execution plans.

SQL Analytics and Recursion

The problem variation as I have defined it is harder than it may at first appear, and in fact, can't be solved by SQL grouping and analytics alone: Recursion is required. The reason is that when considering whether a source record needs to start a new grouping record, or can be included on the prior record, the answer depends on the lengths of the fields of an unknown number of prior records. Analytic functions can only sum over a known number of prior records, and, although 'known' includes values that can be computed via a prior subquery, this is not possible here.

The approach we take involves two logical steps. In the first step, the records are processed in sequence within the partitions, and aggregate strings are accumulated for each record. When the string would exceed the maximum length a new aggregate is started and a flag is set.

Now, from the first step we have the input source number of records, with the desired aggregate strings being on the last records before any new aggregate, marked by the flag. We then just need to filter out the intermediate records.

Test Example

We will use Oracle's standard HR schema for test data, and will aggregate employee names by department with the name list field having maximum length 80. There are 107 employees, and the desired output is:

Recursive subquery factors are available from Oracle v11.2 up, and can be used to implement the required recursion. In this solution the flag denoting an overspill line has to be set initially on that overspill line, rather than on the preceding line, and on the last line in the partition, which are the lines we want to display. We therefore need another step.

Setting print flag via analytic function

In v11.2, an additional subquery can be added that uses the analytic function Lead to set a flag on the required lines. This works by looking at the previously set flag on the next record (if the record exists), and setting the desired line print flag accordingly on the current record.
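A minimal sketch of such a query (my own reconstruction, on the HR schema with the 80-character limit of the test example):

WITH emps_ordered AS (
  SELECT department_id dept,
         first_name || ' ' || last_name ename,
         Row_Number() OVER (PARTITION BY department_id
                            ORDER BY first_name || ' ' || last_name) rn
    FROM hr.employees
), rsf (dept, rn, ename_list, new_line) AS (
  SELECT dept, rn, ename, 1
    FROM emps_ordered
   WHERE rn = 1
   UNION ALL
  SELECT e.dept, e.rn,
         -- reset the list, and flag the line, when the length would overspill
         CASE WHEN Length(r.ename_list || ', ' || e.ename) > 80
              THEN e.ename
              ELSE r.ename_list || ', ' || e.ename END,
         CASE WHEN Length(r.ename_list || ', ' || e.ename) > 80
              THEN 1 ELSE 0 END
    FROM rsf r
    JOIN emps_ordered e ON e.dept = r.dept AND e.rn = r.rn + 1
), leads_v AS (
  SELECT dept, rn, ename_list,
         -- print a line if the next line starts a new aggregate, or if it
         -- is the last line in the partition
         CASE WHEN Nvl(Lead(new_line) OVER
                         (PARTITION BY dept ORDER BY rn), 1) = 1
              THEN 1 END line_print
    FROM rsf
)
SELECT dept, ename_list
  FROM leads_v
 WHERE line_print = 1
 ORDER BY dept, ename_list;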

emps_ordered subquery: Formats the name field and gets a row number within department ordering by the name

rsf recursive subquery: Anchor branch selects first employee in each department; recursive branch joins the next employee based on the row number, and accumulates the name list, resetting and flagging when length dictates a new overspill line

leads_v subquery: Use Lead analytic function to set the print flag

Main query: Selects rows where the print flag = 1

The output from the leads_v subquery, before filtering, illustrates how it works:

MATCH_RECOGNIZE clause: defines matching row sets that end in the lines to print

MEASURES section: includes two built-in functions for illustration purposes only: Classifier() = grouping, and Match_Number() = match number of the record

DEFINE section: defines a grouping, sm, based on the previously set flag, new_line, that applies if the flag = 0

PATTERN section: ( strt sm* ) includes an undefined grouping strt, and means to match an ordered set of records beginning with a record that does not fall into any defined grouping and continuing with zero or more records (but as many as possible) that are in the sm grouping
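A sketch of the Match_Recognize version, again my own reconstruction, with the rsf subquery as in the previous solution and the illustration-only measures included:

WITH emps_ordered AS (
  SELECT department_id dept,
         first_name || ' ' || last_name ename,
         Row_Number() OVER (PARTITION BY department_id
                            ORDER BY first_name || ' ' || last_name) rn
    FROM hr.employees
), rsf (dept, rn, ename_list, new_line) AS (
  SELECT dept, rn, ename, 1
    FROM emps_ordered
   WHERE rn = 1
   UNION ALL
  SELECT e.dept, e.rn,
         CASE WHEN Length(r.ename_list || ', ' || e.ename) > 80
              THEN e.ename
              ELSE r.ename_list || ', ' || e.ename END,
         CASE WHEN Length(r.ename_list || ', ' || e.ename) > 80
              THEN 1 ELSE 0 END
    FROM rsf r
    JOIN emps_ordered e ON e.dept = r.dept AND e.rn = r.rn + 1
)
SELECT dept, name_list, cls, mno
  FROM rsf
 MATCH_RECOGNIZE (
   PARTITION BY dept ORDER BY rn
   MEASURES LAST(ename_list) name_list,
            Classifier()     cls,   -- illustration only
            Match_Number()   mno    -- illustration only
   ONE ROW PER MATCH
   -- each match starts at a line-starting row (strt is undefined, so it
   -- matches any row) and extends over the following non-overspill rows
   PATTERN (strt sm*)
   DEFINE sm AS new_line = 0
 )
 ORDER BY dept, name_list;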

The output, with the extra built-in fields, helps to show how it works:

As in the earlier query, the employees table is accessed 46 times, indicating an obvious performance issue.

MODEL Solution

Oracle introduced the MODEL clause to its SQL syntax in v10. It has a reputation for often leading to SQL that is difficult to understand and sometimes inefficient, but it is well suited to this problem.

mod subquery: the aggregation and line print flags are calculated in a single subquery using the MODEL clause

RULES section: first rule accumulates the aggregated lines, resetting when overspill occurs, with similar logic to that in the recursive subquery solution; the rule relies on the calculation occurring in the default order, by ascending dimension value; second rule relies on all the aggregates being calculated first by the first rule, then looks one row ahead within the partition to set the print flag
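A sketch of a Model query implementing those rules, again my own reconstruction on the HR example (the Cast sizes the list measure, since Model fixes character measure widths from their initial expressions):

SELECT dept, ename_list
  FROM (
  SELECT dept, rn, ename_list, line_print
    FROM (SELECT department_id dept,
                 first_name || ' ' || last_name ename,
                 Row_Number() OVER (PARTITION BY department_id
                                    ORDER BY first_name || ' ' || last_name) rn
            FROM hr.employees)
  MODEL
    PARTITION BY (dept)
    DIMENSION BY (rn)
    MEASURES (ename, Cast(ename AS VARCHAR2(4000)) ename_list, 0 line_print)
    RULES (
      -- first rule: accumulate the list, resetting on overspill
      ename_list[rn > 1] ORDER BY rn =
        CASE WHEN Length(ename_list[cv() - 1] || ', ' || ename[cv()]) > 80
             THEN ename[cv()]
             ELSE ename_list[cv() - 1] || ', ' || ename[cv()]
        END,
      -- second rule: look one row ahead; print if the next row resets
      -- (its list is just its own name) or if there is no next row
      line_print[ANY] =
        CASE WHEN ename_list[cv() + 1] IS NULL
               OR ename_list[cv() + 1] = ename[cv() + 1]
             THEN 1 ELSE 0
        END
    )
  )
 WHERE line_print = 1
 ORDER BY dept, ename_list;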