TR93-036
Abstract
This paper introduces a new discrimination network structure called Gator that is a generalization of
the widely known Rete and TREAT algorithms. Gator can be used as a replacement for Rete or TREAT
in active database rule systems and production system interpreters. In Gator, β-memory nodes that hold
intermediate join results can have two or more inputs, not exactly two inputs as they do in Rete. Gator
is designed as the target structure for a discrimination network optimizer. Algorithms for performing
pattern matching using a Gator network to see if a rule condition has been satisfied are given. Cost
estimation functions for Gator networks based on cardinality, predicate selectivity, and update frequency
information are introduced, and a technique for constructing optimized Gator networks is described. A
Gator network optimizer has been developed, and the optimizer has been tested with simulated inputs.
The results show that in terms of time, Gator can substantially outperform unoptimized Rete networks,
TREAT networks, and optimized Rete networks, often by an order of magnitude. Gator also uses space
effectively, requiring only slightly more space than TREAT and much less than Rete.

1 Introduction

Both production systems and active database systems must perform rule condition matching during execution

to determine which rules to fire. The most successful rule condition testing mechanisms developed for main-memory

production systems are discrimination networks known as Rete [5] and TREAT [12]. Like

production systems, active database systems must also test rule conditions, and we believe some kind of

discrimination network will be the best tool for doing so. However, choosing a good discrimination network

structure is far more important in the active database environment than in main-memory production systems

because of the volume of data involved, and the fact that some or all of the data may be on secondary storage.

[Graph: nodes R1 (selection R1.a>17), R2, R3 (selection R3.c="on"), R4, and R5 (selection R5.b="Friday");
join edges R1.d=R2.d, R2.e=R4.e, R4.f=R5.f, and R3.g=R2.g.]

Figure 1: Example rule condition graph.

Previous work has shown that the TREAT algorithm usually out-performs the Rete algorithm [12]. A

α-memory One α-memory holds all tuples from a single relation that match the selection condition of one

RCE.

β-memory One β-memory holds combinations of tuples that match a subset of the selection and join

conditions of the rule condition. The tree of nodes rooted at the β-memory defines this subset of the

conditions.

P-node A P-node is a special node that takes the place of a β-memory at the bottom of the network. There

is one P-node for each rule. The tree rooted at the P-node corresponds to all the selection and join

conditions in the rule condition.

TREAT networks have only α-memory nodes. Rete networks have a left-deep binary tree format and

maintain internal (β-memory) nodes, which always have two inputs. Unlike a Rete network, a Gator network

may have internal memory nodes with two or more inputs, not just two. These multiple-input nodes are

called β-memories to be consistent with Rete terminology. Children of a multiple-input node may be a

combination of leaves (α-memories) and other multiple-input nodes.

Sample Rete, TREAT, and Gator networks for the rule whose condition graph is shown in figure 1 are

shown in figure 2. The Gator network subtree with three children has the structure of a TREAT network.
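
The three network shapes can be sketched as plain data structures. The following Python fragment is only an illustration under our own assumptions: the class names, the helper functions beta_count and max_fan_in, and the particular grouping of the five α-memories are hypothetical and are not taken from figure 2.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class AlphaMemory:
    """Leaf: tuples of one relation that satisfy one RCE's selection condition."""
    name: str


@dataclass
class BetaMemory:
    """Internal memory holding joined token combinations (two or more inputs)."""
    name: str
    inputs: List["Node"] = field(default_factory=list)


@dataclass
class PNode:
    """One per rule; its subtree covers every selection and join condition."""
    rule: str
    inputs: List["Node"] = field(default_factory=list)


Node = Union[AlphaMemory, BetaMemory, PNode]


def beta_count(node: Node) -> int:
    """Number of beta-memories in the subtree rooted at node."""
    if isinstance(node, AlphaMemory):
        return 0
    own = 1 if isinstance(node, BetaMemory) else 0
    return own + sum(beta_count(child) for child in node.inputs)


def max_fan_in(node: Node) -> int:
    """Largest input count of any beta-memory or P-node in the subtree."""
    if isinstance(node, AlphaMemory):
        return 0
    return max([len(node.inputs)] + [max_fan_in(child) for child in node.inputs])


a1, a2, a3, a4, a5 = (AlphaMemory(f"alpha{i}") for i in range(1, 6))

# TREAT shape: the P-node joins the alpha-memories directly, no beta-memories.
treat = PNode("r", [a1, a2, a3, a4, a5])

# Rete shape: left-deep binary tree, every internal node has exactly two inputs.
b1 = BetaMemory("beta1", [a1, a2])
b2 = BetaMemory("beta2", [b1, a3])
b3 = BetaMemory("beta3", [b2, a4])
rete = PNode("r", [b3, a5])

# One possible Gator shape: a three-input beta-memory and a two-input one.
g1 = BetaMemory("beta1", [a1, a2, a3])
g2 = BetaMemory("beta2", [a4, a5])
gator = PNode("r", [g1, g2])

for label, net in [("TREAT", treat), ("Rete", rete), ("Gator", gator)]:
    print(label, beta_count(net), max_fan_in(net))
```

In this sketch the Gator shape sits between the other two: it stores fewer intermediate results than the Rete shape but, unlike the TREAT shape, it does keep some.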

3 Virtual α-memories

Another property of the Gator network is that α-memory nodes may be either materialized, and thus contain

all the tokens that match their selection condition, or they may be virtual, in which case they contain only

their selection predicate but not the tokens matching the predicate. The concept of virtual α-memories

has been used in a variation of the TREAT algorithm called A-TREAT [6]. Use of virtual α-memories for

rule condition matching in Gator is identical to that in A-TREAT. Virtual α-memories save space since

the matching tokens need not be stored in the memory node. This is particularly important in a database

environment since the underlying data sets can be huge.
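
The difference between the two kinds of α-memory can be illustrated with a small Python sketch. It assumes that a virtual α-memory answers a scan by filtering the base relation with its predicate; the class names, the scan method, and the sample rows are hypothetical and are not the A-TREAT implementation.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


class StoredAlphaMemory:
    """Materialized alpha-memory: keeps a copy of every matching token."""

    def __init__(self, predicate: Callable[[Row], bool]) -> None:
        self.predicate = predicate
        self.tokens: List[Row] = []

    def insert(self, row: Row) -> None:
        if self.predicate(row):
            self.tokens.append(row)

    def scan(self, base_relation: Iterable[Row]) -> Iterator[Row]:
        # The stored tokens are returned; the base relation is not read.
        return iter(self.tokens)


class VirtualAlphaMemory:
    """Virtual alpha-memory: keeps only the selection predicate."""

    def __init__(self, predicate: Callable[[Row], bool]) -> None:
        self.predicate = predicate

    def insert(self, row: Row) -> None:
        # Nothing is stored, which is where the space saving comes from.
        pass

    def scan(self, base_relation: Iterable[Row]) -> Iterator[Row]:
        # Matching tokens are recomputed from the base relation on demand.
        return (row for row in base_relation if self.predicate(row))


# Made-up contents of relation R3, with the selection R3.c = "on".
r3 = [{"c": "on", "g": 1}, {"c": "off", "g": 2}, {"c": "on", "g": 3}]
is_on = lambda row: row["c"] == "on"

stored, virtual = StoredAlphaMemory(is_on), VirtualAlphaMemory(is_on)
for row in r3:
    stored.insert(row)
    virtual.insert(row)
```

Both objects yield the same tokens when scanned; the virtual one trades storage for the cost of re-evaluating the predicate against the base relation.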

4 Rule condition matching in a Gator network

When a base relation tuple is inserted, deleted or modified, rule condition matching must be performed to

see if any rules are triggered or made no longer eligible to fire. The rule condition matching strategy of

Gator is similar to the algorithms for rule condition matching in Rete and TREAT [5, 12]. Gator performs

rule condition matching by propagating tokens through the network. When a tuple is inserted into a base

relation table, an insert token, or "+" token, is created by tagging a copy of the tuple with a + tag. This +

token is then propagated through the network. When a tuple is deleted from a base relation, a delete token,

or "-" token, is created similarly and propagated through the network. Modifications are treated as deletes

followed by inserts. The rest of this section covers handling of + tokens and - tokens. The algorithms for

processing tokens are described here in a set-oriented style that is suitable for use in active database systems.

A tuple-at-a-time, recursive style of the algorithm suitable for main-memory production systems is presented

elsewhere [7].

In the following discussion, memory nodes of type α and β will be referred to together as memory nodes.

Nodes that can have multiple inputs, including β-memories and P-nodes, will be called multiple input nodes.

The term node may be used to describe an α-memory, β-memory, or P-node.

4.1 Handling + tokens

The set-oriented version of the Gator algorithm keeps around at each step a set of tokens called a temporary

join result (TR). For processing a + token, this algorithm is implemented in terms of a recursive function,

InsertPlusTempResult. As part of each memory node N in the network, there is a list of pairs, one pair for

each multiple-input node of which N is a child (N may have more than one parent if it is part of a shared

subexpression). Each pair contains the following:

a multiple input node Parent which has N as one of its inputs, and

a plan P for how to join a temporary result inserted into N to the other input nodes of Parent. This

plan is a list of the identifiers of these other nodes, in the order in which they should be joined to the

temporary result.

The list of pairs of the above form is called the parent/plan pair list, or PPPlist. The function Parent(PPpair)

extracts the parent node from the parent/plan pair PPpair. The function plan(PPpair) extracts from PPpair

the list of nodes specifying the join order. Based on this terminology, InsertPlusTempResult is shown in

Figure 3. To initiate match processing when a new token t is inserted into a leaf node LEAF of the Gator

network (LEAF is either a P-node or an α-memory), an initial temporary result TR1 is constructed. TR1 is

a set containing only one token, t. Then InsertPlusTempResult is called with TR1 and LEAF as arguments.

The algorithm InsertPlusTempResult terminates when the temporary result is empty or the final temporary

result is added to the P-node.

As an example, referring to figure 1 and the Gator network in figure 2, suppose that a tuple was added

to R3 with a value of "on" for attribute c. TR1 would contain an entry for just this tuple. TR1 would arrive

at α3. TR1 would then be joined to α2 via the join condition on the dashed edge from α3 to α2, yielding

TR2. TR2 would similarly be joined to α1, yielding TR3. The contents of TR3 would be added to β1, and

then joined to β2, yielding TR4. TR4 would then be added to the P-node. If any of the temporary results

TR2, TR3 or TR4 were empty upon creation, processing would stop at that point.

Selection of a good join order plan for each input of a multiple-input node is considered later. The nested

loop join method will normally be used to join a TR to a memory node, and will make use of any existing

InsertPlusTempResult(TR,Node)
{
    /* TR is a temporary result.
       Node is a memory node or P-node in a Gator network. */
    If TR is empty, return.
    Insert the tokens in TR into the collection of tokens
    belonging to Node.
    If Node is a P-node,
        adjust the rule agenda if necessary, and return.

    For each parent/plan pair x in PPPlist(Node) {
        For each node y in plan(x),
            in order from first to last,
        {
            /* Join the current temporary result to the next
               memory node specified by the join order plan,
               forming the next temporary result. */
            TR = join(TR,y)
        }
        /* Insert the final temporary result into the
           multiple input node for which
           new matching tokens are being found. */
        InsertPlusTempResult(TR,Parent(x))
    }
}

Figure 3: Procedure for inserting a temporary result into a node

index on the join attribute of the memory node. A sort-merge or other join strategy could be used sometimes

if appropriate.
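
The procedure of Figure 3 and the R3 example above can be traced with a minimal Python sketch. It rests on several assumptions of ours: tokens are dictionaries keyed by relation name, only the equality joins of Figure 1 are checked, the network shape (β1 over α1, α2, α3 and β2 over the memories for R4 and R5) is inferred from the example, and the sample tuples are made up. The sketch starts each parent/plan pair from the same incoming TR, which matters only when a node has more than one parent.

```python
from typing import Dict, List, Tuple

Token = Dict[str, dict]  # relation name -> tuple, stored as a dict

# Equality join edges of the Figure 1 rule condition graph.
JOIN_EDGES = [("R1", "d", "R2", "d"), ("R2", "e", "R4", "e"),
              ("R4", "f", "R5", "f"), ("R3", "g", "R2", "g")]


class Node:
    def __init__(self, name: str, is_pnode: bool = False) -> None:
        self.name = name
        self.is_pnode = is_pnode
        self.tokens: List[Token] = []
        # Parent/plan pair list: (parent node, ordered list of sibling nodes).
        self.pp_list: List[Tuple["Node", List["Node"]]] = []


def join(tr: List[Token], node: Node) -> List[Token]:
    """Nested-loop join of a temporary result with a node's stored tokens."""
    out = []
    for left in tr:
        for right in node.tokens:
            merged = {**left, **right}
            # Check every join edge whose two relations are both present.
            if all(merged[r1][x] == merged[r2][y]
                   for r1, x, r2, y in JOIN_EDGES
                   if r1 in merged and r2 in merged):
                out.append(merged)
    return out


def insert_plus_temp_result(tr: List[Token], node: Node) -> None:
    if not tr:
        return
    node.tokens.extend(tr)
    if node.is_pnode:
        return  # a full system would adjust the rule agenda here
    for parent, plan in node.pp_list:
        result = tr  # fresh per pair, so each parent starts from the same TR
        for other in plan:
            result = join(result, other)
        insert_plus_temp_result(result, parent)


# Assumed Gator shape: beta1 = (alpha1, alpha2, alpha3), beta2 = (alpha4, alpha5).
a1, a2, a3 = Node("alpha1"), Node("alpha2"), Node("alpha3")
b1, b2, p = Node("beta1"), Node("beta2"), Node("P", is_pnode=True)
a3.pp_list = [(b1, [a2, a1])]
b1.pp_list = [(p, [b2])]

# Memories already populated by earlier inserts (made-up sample data).
a1.tokens = [{"R1": {"a": 20, "d": 1}}]
a2.tokens = [{"R2": {"d": 1, "e": 2, "g": 7}}]
b2.tokens = [{"R4": {"e": 2, "f": 3}, "R5": {"f": 3, "b": "Friday"}}]

# A new R3 tuple with c = "on" arrives at alpha3 as TR1.
insert_plus_temp_result([{"R3": {"c": "on", "g": 7}}], a3)
```

After the call, β1 holds the three-relation combination (TR3) and the P-node holds the full five-relation match (TR4). An R3 tuple whose g value matches no R2 tuple would stop after the first join, because the temporary result becomes empty.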

4.2 Handling - tokens

Handling - (delete) tokens is slightly different from handling of + tokens. The standard delete optimization

familiar from implementations of Rete is used. This optimization does not do joins during deletions. Rather,

when a - token t enters a node, the token is deleted from the node. Then, if that node feeds into a multiple

input node, the tokens in the multiple input node are scanned to see if they contain t as a component. If

so, they are deleted. In turn, more - tokens are generated and passed to the successor of the multiple input

node. Detailed algorithms for - tokens are not presented.
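
Since detailed algorithms are not given, the following Python fragment is only one possible reading of the delete optimization described above, not the authors' procedure. It assumes that every memory is materialized, that a token records the identifier of each component tuple, and that each node knows the multiple input nodes it feeds; the names Node, parents, and delete_minus_token are hypothetical.

```python
from typing import Dict, List

Token = Dict[str, int]  # relation name -> tuple identifier


class Node:
    def __init__(self, name: str) -> None:
        self.name = name
        self.tokens: List[Token] = []
        self.parents: List["Node"] = []  # multiple-input nodes fed by this node


def delete_minus_token(rel: str, tid: int, node: Node) -> int:
    """Delete every token in node that has tuple (rel, tid) as a component,
    then scan the nodes it feeds. Returns the number of tokens removed."""
    kept = [tok for tok in node.tokens if tok.get(rel) != tid]
    removed = len(node.tokens) - len(kept)
    node.tokens = kept
    if removed == 0:
        # With materialized memories, nothing matched here means nothing
        # derived from this tuple can be stored further up.
        return 0
    for parent in node.parents:
        removed += delete_minus_token(rel, tid, parent)
    return removed


# Made-up state: alpha3 feeds beta1, which feeds the P-node.
a3, b1, p = Node("alpha3"), Node("beta1"), Node("P")
a3.parents, b1.parents = [b1], [p]
a3.tokens = [{"R3": 1}, {"R3": 2}]
b1.tokens = [{"R1": 10, "R2": 20, "R3": 1}, {"R1": 11, "R2": 20, "R3": 2}]
p.tokens = [{"R1": 10, "R2": 20, "R3": 1, "R4": 40, "R5": 50}]

# Tuple 1 of R3 is deleted from the base relation.
removed = delete_minus_token("R3", 1, a3)
```

No join is evaluated at any point; each node is scanned once and tokens are dropped by component identity.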

A discussion of how Gator supports negated condition elements like those that can be specified in OPS5

Figure 6: When a temporary result TR is propagated out of a node N that is an input node to a node β,
TR must be joined to the other input nodes of β. This procedure computes both the cost of doing these
joins, and the size of the final temporary result generated after doing the joins.

UpdateCost(β,TRsize) {
    /* A TRsize value less than one is significant
       since small join selectivities can produce temporary
       results that are small on average. The Yao function
       takes this into consideration. */
    cost = Yao(pages(β),TRsize) * 2 * I/Oweight + TRsize * CPUweight
    /* If TR is larger than one page, add the cost to
       allocate or delete pages in β. */
    if TRsize / tuplesPerPage(β) > 1
        cost = cost + (TRsize / tuplesPerPage(β)) * 2 * I/Oweight
    return(cost)
}

Figure 7: Function to compute cost to update a β-memory node.
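
To make the shape of this cost function concrete, the following Python sketch gives one reading of the procedure in Figure 7. Several parts are assumptions of ours rather than the paper's definitions: the factor of 2 is read as one page read plus one page write, the Yao function is replaced by the simpler Cardenas approximation capped at the number of tokens, and the parameter names and default weights are hypothetical.

```python
def yao(pages: float, k: float) -> float:
    """Approximate expected number of pages touched when k tokens are accessed.

    Stand-in for the Yao function: Cardenas' approximation
    pages * (1 - (1 - 1/pages)**k), capped at k because k tokens
    cannot touch more than k pages. Fractional k is allowed.
    """
    if pages <= 0 or k <= 0:
        return 0.0
    pages = max(pages, 1.0)  # avoid a negative base for fractional powers
    touched = pages * (1.0 - (1.0 - 1.0 / pages) ** k)
    return min(k, touched)


def update_cost(beta_pages: float, tuples_per_page: float, tr_size: float,
                io_weight: float = 1.0, cpu_weight: float = 0.01) -> float:
    """Cost of applying a temporary result of tr_size tokens to a beta-memory."""
    # Each touched page is assumed to be read and written: two I/Os.
    cost = yao(beta_pages, tr_size) * 2 * io_weight + tr_size * cpu_weight
    # A temporary result larger than one page also allocates or frees pages.
    if tr_size / tuples_per_page > 1:
        cost += (tr_size / tuples_per_page) * 2 * io_weight
    return cost


# Example: a 100-page beta-memory, 50 tokens per page, average TR of 0.5 tokens.
small = update_cost(beta_pages=100, tuples_per_page=50, tr_size=0.5)
large = update_cost(beta_pages=100, tuples_per_page=50, tr_size=200)
```

With an average temporary result of half a token, the estimate charges about half a page update, which is the behavior the comment about TRsize values below one calls for.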

The procedure PerChildDelCost(N, β) has a slightly different structure than the procedure

PerChildInsCost(N, β). It is assumed that the standard delete optimization discussed earlier that is often

used in Rete and TREAT implementations is employed.

The function PerChildDelCost(N, β) must account for the cost to read all pages of the β node (no index

is available to support this operation), plus the cost to write pages of β that contain tokens with components

from child N. Each token in β must be examined, so one CPUweight factor must be paid for each token in

β as well. This function is most easily represented as the following procedure:

best Gator network. Rules in any database will rarely contain more than 12 RCE's, so the performance of

the dynamic programming-based optimizer will be adequate for most situations. Randomized optimization

methods are being investigated to handle rules with more than 12 RCE's [13].

6.4 Generality of the Gator network structure

One of the design goals of the Gator network optimizer was that it should not favor any particular

discrimination network configuration; it should always choose the best, in terms of network cost, regardless

of whether it is Rete, TREAT or anything in between. Results obtained from the optimizer show that the

optimizer is indeed choosing the cheapest network without favoring any particular configuration. It has been

found that, at very low levels of network activity, characterized by small values of update frequency, selectivity

and JSF, Rete style networks are better than TREAT style networks. By the term Rete style, we mean

networks where most β nodes have two inputs. These networks have almost a binary tree structure. When the

network is a left deep or right deep binary tree, it is a pure Rete network. As the level of database activity increases,

[Plot: OPTIMIZATION TIME VS. RULE SIZE. Horizontal axis: NUMBER OF RCE (2 to 14). Curves: GATOR and RETE.]

Figure 11: Time taken to produce optimized Gator and Rete networks using dynamic programming vs. the
number of rule condition elements. A TREAT network can be constructed in a fraction of a second without
optimization for rules with any reasonable number of RCE's.

TREAT style networks seem to be a better choice. These networks have fewer β nodes, thus most of the

β nodes have multiple inputs. This transition of choice is smooth and consistent. As one or more of the

parameters are varied from low to high values, the transition from one type of network to the other can

be seen. Figure 12(b) shows the network for the rule graph in figure 12(a) when selectivity threshold is

very small, 0.0008. All the β nodes have two inputs, except one that has three inputs. In figure 12(c), the

selectivity values are larger, which has eliminated one β node, in addition to rearranging other α and β

nodes. Figure 12(d) has only three β nodes, each taking inputs from three α nodes.

7 Related work

The Gator network is a descendant of Rete and TREAT [5, 12]. The Gator network is only useful with an

optimizer or at least a good set of heuristics for constructing a network for a particular rule or set of rules. The

feasibility of generating an optimizer for Rete networks was demonstrated by Ishida [11]. Use of heuristics to

construct a good discrimination network for testing active database rule conditions was discussed by Fabret

[9] Yiannis Ioannidis and Younkyung Cha Kang. Randomized algorithms for optimizing large join queries.
In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 312-321,
May 1990.

[10] Yiannis Ioannidis and Younkyung Cha Kang. Left-deep vs. bushy trees: An analysis of strategy spaces and
its implications for query optimization. In Proceedings of the ACM SIGMOD International Conference
on Management of Data, pages 168-177, May 1991.

[11] Toru Ishida. Optimizing rules in production system programs. In Proc. AAAI National Conference on
Artificial Intelligence, pages 699-704, 1988.

[15] P. Selinger et al. Access path selection in a relational database management system. In Proceedings of
the ACM SIGMOD International Conference on Management of Data, June 1979. (reprinted in [17]).