http://prdeepakbabu.wordpress.com

Association Rule Mining

Association rule mining is one of the classical data mining (DM) techniques. It is a very powerful technique for analysing and finding patterns in a data set. It is usually classed as an unsupervised learning technique: we feed the association algorithm a data set (called experience E in the machine learning context), without any labels, and it formulates hypotheses (H) in the form of rules. The input data to an association rule mining algorithm requires a particular format, which will be detailed shortly.

Let me first introduce some of the application areas of this DM technique and the motivation for studying association analysis. The classic application of association rule mining is to analyse the market basket data of a retail store. For example, retail stores like Wal-Mart, Reliance Fresh and Big Bazaar gather data about customer purchase behaviour, and they have complete details of the goods purchased as part of a single bill. This is called market basket data and its analysis is termed “market basket analysis”. An often-cited finding is that customers who buy diapers are more likely to buy beer. This is the kind of pattern discovered by association analysis. Other applications include, but are not limited to, scientific data analysis (earth science, to study ocean, land and atmospheric processes) and bioinformatics (genome sequence mining, etc.). It is also used in document analysis to determine the words that often occur together, and in weblog mining to look for temporal patterns in online behaviour and website navigation. There are numerous other applications of association analysis, bounded only by human imagination and capability.

Let’s start association mining with market basket data as the example. An itemset is a group of items, and a k-itemset is an itemset that contains k items. A transaction (a purchase by a customer) includes one or more of the items under study. The purchase of an item in a transaction is indicated by the value 1, while non-inclusion is indicated by the value 0. Hence typical market basket data has one row per transaction and one 0/1 column per item, like the one below:

Book Pen Pencil Eraser Sharpener Crayons Maps A4 sheets
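The 0/1 representation can be sketched in a few lines of Python. The item columns come from the header above; the four bills and the helper name `encode` are invented purely for illustration and are not part of the original data.

```python
# Item columns, in the same order as the table header.
ITEMS = ["Book", "Pen", "Pencil", "Eraser", "Sharpener", "Crayons", "Maps", "A4 sheets"]

# Hypothetical bills (one set of items per transaction), made up for this sketch.
transactions = [
    {"Pencil", "Eraser", "Sharpener"},
    {"Book", "Maps", "Pen"},
    {"Pencil", "Eraser"},
    {"Book", "Pen", "A4 sheets"},
]

def encode(transaction, items=ITEMS):
    """Return one 0/1 row: 1 if the item is on the bill, 0 otherwise."""
    return [1 if item in transaction else 0 for item in items]

# One row per transaction, one column per item.
matrix = [encode(t) for t in transactions]

for row in matrix:
    print(row)
```

Each printed row is one transaction; quantities and amounts are deliberately left out, as in the representation above.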

Looking at the above representation of market basket data, one may think that some additional information is missing, such as the quantity purchased and the amount involved in each transaction. Of course, association analysis can be extended to involve such detail.

Applying an association rule mining algorithm results in the discovery of rules/patterns of the following form:

{Pencil} -> {Eraser}
{Book, Maps} -> {Pen}

The first rule simply says “If a customer bought a pencil, he is more likely to buy an eraser”. Mathematically, it says “Purchase of pencil implies purchase of eraser”. Once a pattern is discovered, it can be integrated into a decision support system to form strategies based on the rule. In the above case, the company may use this rule for cross-selling, i.e. placing pencils and erasers close to each other, which increases sales and hence profit. Just imagine: if a number of such strong rules were discovered in a jewellery shop, it would result in tremendous value. Now let me answer what a “strong” rule means.

Now that we have defined what a rule is, we are faced with two important questions. Are all the rules discovered by my algorithm really useful / meaningful? How confident am I about a rule? To answer these questions, we use mathematical measures to quantify usefulness and confidence.

The most common evaluation measures for a rule are support and confidence. There are other measures, namely lift, interest factor and correlation; we will talk about them a bit later. The support measure answers the first question, the interestingness of the rule. It is expressed as a percentage and defines how many of my transactions support the rule, i.e. contain all the items in it. If it is, say, 4/100, just 4 out of 100 transactions involve this rule, so it is probably uninteresting and we may choose to ignore it. Hence our association rule mining algorithm sets a threshold (minimum value) for the support to eliminate uninteresting rules and retain the interesting ones.
An example of an uninteresting rule could be {Pen} -> {Eraser}: pen and eraser might be purchased together merely as a matter of chance, i.e. the rule has low support.

Having answered the interestingness criterion, we are left with determining the confidence of the rule. The confidence measure is the ratio of the number of transactions for which the rule holds (those containing the items on both the left-hand and right-hand sides) to the number of transactions containing the left-hand side items. The higher the value, the more reliable the rule. A strong rule is a rule with a high confidence value.
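The two measures can be made concrete with a small sketch. The five bills below are invented for illustration, and the function names `support` and `confidence` are my own choice rather than any library's API.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = support(set(antecedent) | set(consequent), transactions)
    base = support(antecedent, transactions)
    # A rule whose left-hand side never occurs has no meaningful confidence.
    return both / base if base > 0 else 0.0

# Hypothetical bills, made up for this sketch.
transactions = [
    {"Pencil", "Eraser"},
    {"Pencil", "Eraser", "Sharpener"},
    {"Pencil", "Book"},
    {"Pen", "Book"},
    {"Pen", "Eraser"},
]

# {Pencil} -> {Eraser}: both items appear in 2 of 5 bills,
# and in 2 of the 3 bills that contain a pencil.
print(support({"Pencil", "Eraser"}, transactions))
print(confidence({"Pencil"}, {"Eraser"}, transactions))

# {Pen} -> {Eraser}: only 1 of 5 bills contains both items.
print(support({"Pen", "Eraser"}, transactions))
```

With a minimum support of, say, 0.4, the rule {Pen} -> {Eraser} would be dropped in this toy data while {Pencil} -> {Eraser} would be kept.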

Let’s quickly jump into the details of the algorithm. Association rule mining is carried out using the famous Apriori algorithm. We will also talk about variations of this algorithm that apply it to continuous data and hierarchical data. Before that, let’s formalize the definition of the association analysis problem: “Given a set of transactions, the problem is to find all rules/patterns with support >= minsup and confidence >= minconf”

The Apriori Algorithm:

A brute force approach, which computes support and confidence for every possible rule, is a very expensive task. Hence the approach followed by the Apriori algorithm is to break up the computation of support and confidence into two separate tasks. In the first step, frequent itemsets are generated, i.e. those itemsets which satisfy the minimum support criterion. In the second and final step, rules are generated from the frequent itemsets by evaluating the confidence measure. Let’s visualize the approach diagrammatically as shown below:

Step 1: Frequent itemset generation
    Reduce the number of candidate itemsets using the support count
    Support counting: reduce the number of comparisons using advanced data structures (hash sets)
    Algorithms: Apriori, FP-Growth

Step 2: Rule generation

Fig: Apriori Algorithm
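The two steps in the figure can be sketched as follows. This is a minimal, unoptimized illustration and not a reference implementation: the function names `frequent_itemsets` and `generate_rules`, the thresholds and the five bills are all assumptions made for the example, and support counting simply rescans the data instead of using the hash-based structures mentioned in the figure.

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    """Step 1: level-wise search. Returns {frozenset: support}."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        # Support counting: one scan of the data per candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= minsup}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune every
        # candidate that has an infrequent k-subset (the Apriori principle).
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent

def generate_rules(frequent, minconf):
    """Step 2: split each frequent itemset into antecedent -> consequent."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                # Every subset of a frequent itemset is itself frequent,
                # so its support is already in the dictionary.
                conf = sup / frequent[antecedent]
                if conf >= minconf:
                    rules.append((set(antecedent), set(itemset - antecedent), sup, conf))
    return rules

# Hypothetical bills, made up for this sketch.
transactions = [
    {"Pencil", "Eraser", "Sharpener"},
    {"Pencil", "Eraser"},
    {"Book", "Maps", "Pen"},
    {"Book", "Maps", "Pen", "Pencil"},
    {"Book", "Pen"},
]

freq = frequent_itemsets(transactions, minsup=0.4)
rules = generate_rules(freq, minconf=0.6)
for antecedent, consequent, sup, conf in rules:
    print(sorted(antecedent), "->", sorted(consequent), round(sup, 2), round(conf, 2))
```

On this toy data the output includes, among others, the two rules used as examples earlier, {Pencil} -> {Eraser} and {Book, Maps} -> {Pen}. The pruning in step 1 is what keeps the search manageable: a candidate is only counted if all of its smaller subsets were already found to be frequent.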

Measures can be classified into two categories: subjective and objective. A subjective measure often involves heuristics and domain expertise to eliminate uninteresting rules, while objective measures are domain independent. Support and confidence are good examples of objective measures. Objective measures can be either symmetric (the value is the same for X -> Y and Y -> X) or asymmetric (the value depends on the direction of the rule). The choice of measure depends on the type of application, and it must be made carefully to get quality results.

Simpson’s paradox is a reminder that results can be misinterpreted when a hidden variable that is not part of the analysis influences the rules/patterns.
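To illustrate the difference between a symmetric and an asymmetric objective measure, here is a small sketch comparing lift with confidence. Lift is computed with its usual definition, support(X and Y together) / (support(X) * support(Y)); the bills and function names are invented for the example, and the code assumes both sides of the rule occur at least once in the data.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y, transactions):
    """Asymmetric: depends on which side is the antecedent."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

def lift(x, y, transactions):
    """Symmetric: a value of 1.0 means X and Y look independent."""
    return support(set(x) | set(y), transactions) / (
        support(x, transactions) * support(y, transactions)
    )

# Hypothetical bills, made up for this sketch.
transactions = [
    {"Pencil", "Eraser"},
    {"Pencil", "Eraser", "Sharpener"},
    {"Pencil", "Book"},
    {"Pen", "Book"},
    {"Pen", "Maps"},
]

# Confidence changes when the rule is reversed ...
print(confidence({"Pencil"}, {"Eraser"}, transactions))
print(confidence({"Eraser"}, {"Pencil"}, transactions))

# ... while lift gives the same value in both directions.
print(lift({"Pencil"}, {"Eraser"}, transactions))
print(lift({"Eraser"}, {"Pencil"}, transactions))
```

Which of the two behaviours is wanted depends on the application: a symmetric measure suits questions about co-occurrence, an asymmetric one suits questions about what follows from what.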

The Apriori algorithm can be extended to solve various other problems by making small modifications to the data representation, data structures and algorithm.