SQL and PL/SQL techniques and solutions

Menu

Database 12c MATCH_RECOGNIZE: for all sizes of data

Oracle has been marketing the new MATCH_RECOGNIZE clause as a pattern matching solution for “Big Data”. I would add that it is a general-purpose tool that belongs in the toolbox of every SQL programmer. My next several blog posts will compare MATCH_RECOGNIZE to existing techniques; you will see how simple and straightforward it is to use once you get the hang of it.

As an introduction, here is a question from the AskTom site: Ask Tom: Grouping Ranges. Tom actually made this question into an Oracle Magazine article: On Unlearning and Modeling (scroll down to “Grouping Ranges”). The requirement is to order the data by STUDY_SITE and calculate a running total, except that the running total cannot exceed 65,000. If the next row makes us go over 65,000, we need to start over with a new “group”. Here is the data from the original question (see the next page for the CREATE TABLE and INSERT statements):

STUDY_SITE

CNT

1001

3407

1002

4323

1004

1623

1008

1991

1011

885

1012

11597

1014

1989

1015

5282

1017

2841

1018

5183

1020

6176

1022

2784

1023

25865

1024

3734

1026

137

1028

6005

1029

76

1031

4599

1032

1989

1034

3427

1036

879

1038

6485

1039

3

1040

1105

1041

6460

1042

968

1044

471

1045

3360

The desired output is the first and last STUDY_SITE for each group, and the running total:

FIRST_SITE

LAST_SITE

SUM_CNT

1001

1022

48081

1023

1044

62203

1045

1045

3360

Tom answered this question using a pipelined table function in PL/SQL. This prompted volunteers to post several answers in SQL:

there were a handful of attempts using analytic functions, none of which worked;

plus two solutions by Jichao Li, using either the MODEL clause or recursive subquery factoring. In his article, Tom recommended the MODEL solution as the most efficient.

You can tell something is hard to do in SQL when Tom Kyte uses PL/SQL and his readers come up with all kinds of solutions, half of which are wrong.

Think “row pattern matching”

Before looking at code, let’s try to think of this problem in terms of what Oracle calls row pattern matching. In plain English, our pattern is “as many consecutive rows as possible, as long as the running total of CNT does not exceed 65,000″. Each set of rows that fits this pattern will be called a match. We want to:

Order the rows by STUDY_SITE;

Starting from the first row, gather as many consecutive rows as we can, as long as the running total of CNT does not exceed 65,000. This set of rows is our first match.

Now we skip past the last row of the preceding match and we start over, looking for the next set of rows that fit the same pattern.

From each match, return one row with three columns: the STUDY_SITE from the first row, the STUDY_SITE from the last row, and the total CNT.

The syntax of the MATCH_RECOGNIZE clause is practically the same as the language I just used, except for two things:

First, it uses the MEASURES keyword to identify the list of output columns;

Next, it uses the PATTERN and DEFINE keywords to define the pattern.

The code (at last)

SELECT * FROM t
MATCH_RECOGNIZE (
ORDER BY study_site
MEASURES
FIRST(study_site) first_site,
LAST(study_site) last_site,
SUM(cnt) sum_cnt
ONE ROW PER MATCH -- this is the default
AFTER MATCH SKIP PAST LAST ROW -- this is the default
PATTERN (A+)
DEFINE A AS SUM(cnt) <= 65000
);

I hope you agree that there is not that much difference between my “plain English” description and this code. The only part that should need further explanation is the PATTERN clause: it is a series of identifiers that follow a “regular expression” syntax. A normal regular expression is a series of characters that you want to find in a text string. If you searched for ‘A+’ you would match one or more consecutive uppercase A’s. Here we are not searching for characters, but rows, so A is just an identifier or placeholder for a rule that we DEFINE. In this case, the rule is that the running total cannot exceed 65,000.

Compare this code to the MODEL solution and decide which one is easier to understand:

What next?

After this introduction, there is much to explain about the syntax and the possibilities of the MATCH_RECOGNIZE clause. My main goal is to show you that this clause belongs in your SQL toolkit, so I’ll continue to compare it to some of the best pre-12c solutions.

(The next page contains the CREATE TABLE and INSERT statements for the test data.)