About Association

Association is a data mining function that discovers the probability of the co-occurrence of items in a collection. The relationships between co-occurring items are expressed as association rules.

Oracle Data Mining does not support the scoring operation for association modeling. The results of an association model are the rules that identify patterns of association within the data. Association rules can be ranked by support (How often do these items occur together in the data?) and confidence (How likely are these items to occur together in the data?).

Association rules are often used to analyze sales transactions. For example, it might be noted that customers who buy cereal at the grocery store often buy milk at the same time. In fact, association analysis might find that 85% of the checkout sessions that include cereal also include milk. This relationship could be formulated as the following rule.

Cereal implies milk with 85% confidence

This application of association modeling is called market-basket analysis. It is valuable for direct marketing, sales promotions, and for discovering business trends. Market-basket analysis can also be used effectively for store layout, catalog design, and cross-sell.
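As an illustration of how support and confidence relate to transaction counts, the following Python sketch computes both measures for the cereal and milk rule. The checkout sessions are hypothetical, and the function names are assumptions made for this example; they are not part of Oracle Data Mining.

```python
# Hypothetical checkout sessions; the item names are illustrative only.
transactions = [
    {"cereal", "milk", "bread"},
    {"cereal", "milk"},
    {"cereal", "eggs"},
    {"milk", "bread"},
    {"cereal", "milk", "eggs"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= t)


def support(itemset, transactions):
    """Fraction of all transactions in which the items occur together."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Fraction of transactions containing the antecedent that also
    contain the consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)


rule_support = support({"cereal", "milk"}, transactions)
rule_confidence = confidence({"cereal"}, {"milk"}, transactions)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

In this made-up data, cereal and milk appear together in three of five sessions, and three of the four sessions that include cereal also include milk.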

Association modeling has important applications in other domains as well. For example, in e-commerce applications, association rules may be used for Web page personalization. An association model might find that a user who visits pages A and B is 70% likely to also visit page C in the same session. Based on this rule, a dynamic link could be created for users who are likely to be interested in page C. The association rule could be expressed as follows.

A and B imply C with 70% confidence

Transactions

Unlike other data mining functions, association is transaction-based. In transaction processing, a case consists of a transaction such as a market basket or Web session. The collection of items in the transaction is an attribute of the transaction. Other attributes might be the date, time, location, or user ID associated with the transaction.

The collection of items in the transaction is a multi-record attribute. Transactional data is said to be in multi-record case format. An example is shown in Figure 8-1.

Association models can be built using either transactional or nontransactional (single-record case) data. For all other types of models, Oracle Data Mining requires nontransactional data. To build any model other than an association model on transactional data, the data must first be transformed to single-record case format.
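The difference between the two formats can be sketched in Python. The rows below are assumed sample data, chosen to be consistent with the itemsets listed in Table 8-1, and the helper `to_single_record` is a hypothetical name. It only illustrates the idea of pivoting multi-record data into one record per case and does not represent the transformation that Oracle Data Mining performs.

```python
# Multi-record (transactional) format: one (transaction ID, item) pair per row.
# Assumed sample data, consistent with the itemsets in Table 8-1.
rows = [
    (11, "B"), (11, "D"), (11, "E"),
    (12, "A"), (12, "B"), (12, "C"), (12, "E"),
    (13, "B"), (13, "C"), (13, "D"), (13, "E"),
]


def to_single_record(rows):
    """Pivot (case ID, item) rows into one record per case,
    with a 0/1 column for every item seen in the data."""
    items = sorted({item for _, item in rows})
    cases = {}
    for case_id, item in rows:
        record = cases.setdefault(case_id, {name: 0 for name in items})
        record[item] = 1
    return cases


single_record = to_single_record(rows)
for case_id, record in single_record.items():
    print(case_id, record)
```

Each case now occupies exactly one record, at the cost of storing an explicit 0 for every item that is absent from the transaction.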

Items and Collections

In transactional data, a collection of items is associated with each case. The collection could theoretically include all possible members of the collection. For example, all products could theoretically be purchased in a single market-basket transaction. In practice, however, only a tiny subset of all possible items is present in a given transaction; the items in the market basket represent only a small fraction of the items available for sale in the store.

Sparse Data

When an item is not present in a collection, it may have a null value or it may simply be missing. Many of the items may be missing or null, since many of the items that could be in the collection are probably not present in any individual transaction.

Missing rows in a collection indicate sparsity. This means that a high proportion of the nested rows are not populated. The Oracle Data Mining association algorithm is optimized for processing sparse data.

Itemsets

The first step in association analysis is the enumeration of itemsets. An itemset is any combination of two or more items in a transaction.

The maximum number of items in an itemset is user-specified. If the maximum is two, all the item pairs will be counted. If the maximum is greater than two, all the item pairs, all the item triples, and all the item combinations up to the specified maximum will be counted.

The maximum number of items in an itemset is specified by the ASSO_MAX_RULE_LENGTH setting, which also applies to the rules derived from the itemsets.

Table 8-1 shows the itemsets derived from the transactions in Figure 8-1, assuming that ASSO_MAX_RULE_LENGTH is set to 3.

Table 8-1 Itemsets

Transaction  Itemsets
-----------  --------
11           (B,D) (B,E) (D,E) (B,D,E)
12           (A,B) (A,C) (A,E) (B,C) (B,E) (C,E) (A,B,C) (A,B,E) (A,C,E) (B,C,E)
13           (B,C) (B,D) (B,E) (C,D) (C,E) (D,E) (B,C,D) (B,C,E) (B,D,E) (C,D,E)
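The enumeration in Table 8-1 can be reproduced with a short Python sketch. The contents of each transaction are inferred from the itemsets in the table, and `enumerate_itemsets` is a hypothetical helper written for this example. The `max_length` parameter plays the role that ASSO_MAX_RULE_LENGTH plays in the text.

```python
from itertools import combinations

# Items per transaction, inferred from the itemsets listed in Table 8-1.
transactions = {
    11: {"B", "D", "E"},
    12: {"A", "B", "C", "E"},
    13: {"B", "C", "D", "E"},
}


def enumerate_itemsets(items, max_length):
    """Return every combination of 2..max_length items, pairs first."""
    ordered = sorted(items)
    result = []
    for size in range(2, max_length + 1):
        result.extend(combinations(ordered, size))
    return result


itemsets = {tid: enumerate_itemsets(items, 3)
            for tid, items in transactions.items()}

for tid, sets in itemsets.items():
    print(tid, " ".join("(" + ",".join(s) + ")" for s in sets))
```

Because the number of combinations grows quickly with the maximum length, lowering `max_length` reduces the number of itemsets that must be counted.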

Tip:

Decrease the maximum rule length if you want to decrease the build time for the model and generate simpler rules.

Frequent Itemsets

Association rules are calculated from itemsets. If rules are generated from all possible itemsets, there may be a very high number of rules and the rules may not be very meaningful. Also, the model may take a long time to build. Typically it is desirable to only generate rules from itemsets that are well-represented in the data. Frequent itemsets are those that occur with a minimum frequency specified by the user.

The minimum frequent itemset support is a user-specified percentage that limits the number of itemsets used for association rules. An itemset must appear in at least this percentage of all the transactions if it is to be used as a basis for rules.

The ASSO_MIN_SUPPORT setting specifies the minimum frequent itemset support. It also applies to the rules derived from the frequent itemsets.
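The effect of a minimum support threshold can be sketched in Python as follows. This is a brute-force count over assumed sample transactions (consistent with Table 8-1), not necessarily the way Oracle Data Mining finds frequent itemsets. The function name is hypothetical, and `min_support` is expressed as a fraction between 0 and 1, which is an assumption of this example.

```python
from collections import Counter
from itertools import combinations

# Assumed sample transactions, consistent with Table 8-1.
transactions = [
    {"B", "D", "E"},
    {"A", "B", "C", "E"},
    {"B", "C", "D", "E"},
]


def frequent_itemsets(transactions, min_support, max_length):
    """Count every itemset of 2..max_length items and keep those whose
    support (fraction of transactions) is at least min_support."""
    counts = Counter()
    for items in transactions:
        ordered = sorted(items)
        for size in range(2, max_length + 1):
            counts.update(combinations(ordered, size))
    total = len(transactions)
    return {itemset: n / total
            for itemset, n in counts.items()
            if n / total >= min_support}


frequent = frequent_itemsets(transactions, min_support=0.5, max_length=3)
for itemset, itemset_support in sorted(frequent.items()):
    print(itemset, round(itemset_support, 2))
```

With a threshold of 0.5, only the seven itemsets that appear in at least two of the three transactions remain; raising the threshold to 1.0 leaves only (B,E).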

A model with default settings built on this data generates many rules. One way to limit the number of rules is to raise the minimum support and confidence. Figure 8-4 shows Confidence raised to 65% and Support raised to 75% in the Advanced Settings dialog.

You can filter the rules in a number of different ways. The dialog in Figure 8-6 specifies that only rules with "Mouse Pad" in the antecedent, and "Keyboard Wrist Rest" in the consequent should be returned.

Figure 8-7 shows the three rules that result from the filtering criteria specified in Figure 8-6. The first rule states that a customer who purchases a mouse pad and a 1.44 MB External 3.5 Diskette is likely to also buy a keyboard wrist rest at the same time. The confidence for this rule is 99%. The support is 77%.
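The kind of filtering described above can be sketched in Python. The `Rule` class and `filter_rules` function are hypothetical names for this example and do not correspond to any Oracle Data Mining interface. Only the first rule comes from the text; the other two rules and their values are invented to show rules being excluded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    antecedent: frozenset
    consequent: frozenset
    support: float
    confidence: float


rules = [
    # The rule described in the text: support 77%, confidence 99%.
    Rule(frozenset({"Mouse Pad", "1.44 MB External 3.5 Diskette"}),
         frozenset({"Keyboard Wrist Rest"}), 0.77, 0.99),
    # Hypothetical rules, invented for illustration only.
    Rule(frozenset({"Mouse Pad"}), frozenset({"Item X"}), 0.80, 0.90),
    Rule(frozenset({"Item Y"}), frozenset({"Keyboard Wrist Rest"}), 0.78, 0.70),
]


def filter_rules(rules, antecedent_item, consequent_item,
                 min_support=0.0, min_confidence=0.0):
    """Keep rules that mention antecedent_item in the antecedent and
    consequent_item in the consequent, and that meet both thresholds."""
    return [
        r for r in rules
        if antecedent_item in r.antecedent
        and consequent_item in r.consequent
        and r.support >= min_support
        and r.confidence >= min_confidence
    ]


matches = filter_rules(rules, "Mouse Pad", "Keyboard Wrist Rest",
                       min_support=0.75, min_confidence=0.65)
for r in matches:
    print(sorted(r.antecedent), "->", sorted(r.consequent),
          r.support, r.confidence)
```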