Using association rules to perform a market basket analysis

Imagine you have an online shop and you would like to know which products are often bought together. This task is known as market basket analysis, in which retailers seek to understand the purchase behavior of their customers. This information can then be used for cross-selling and up-selling (Wikipedia).

Let us assume we have a data set that lists the customers of an online shop and the products they have bought (or viewed) in the past. Each row holds one customer-product pair, so a customer who bought several products appears in several rows.

| UserID | ProductID |
|---|---|
| 10039052252084471969 | Product_587 |
| 10039052252084471969 | Product_40 |
| 10039052252084471969 | Product_154 |
| 10046183258816255929 | Product_256 |
| 10046183258816255929 | Product_44 |
| 10047293680636077566 | Product_1184 |
| 10055849645924040293 | Product_334 |
| 10060944748730254910 | Product_306 |
| 10060944748730254910 | Product_154 |
| 10060944748730254910 | Product_78 |
| … | … |

We will use a rule-based machine learning algorithm called Apriori to perform our market basket analysis. It is intended to identify strong rules (relations) in a data set. The easiest way to understand association rule mining is to look at the results of such an analysis. To do that, we first read in our data set from above as transactions in single format, where each line holds one transaction ID and one item. I saved my data as a csv file with two columns called mydata. After this we use the apriori algorithm from the arules package to identify strong rules in the data set.
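The two steps just described can be sketched as follows. This is a minimal sketch, not the original code of this post: it writes the ten example rows from the table above to a temporary file so that it runs on its own, and the thresholds `supp = 0.005` and `conf = 0.05` are assumed placeholder values. Only the minimum length of two is taken from the text. For the real data you would point `file` at your own csv instead.

```r
library(arules)

# Write the example rows shown above to a temporary csv file.
# (Stand-in for the real "mydata" file, whose path is not given in the post.)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "UserID,ProductID",
  "10039052252084471969,Product_587",
  "10039052252084471969,Product_40",
  "10039052252084471969,Product_154",
  "10046183258816255929,Product_256",
  "10046183258816255929,Product_44",
  "10047293680636077566,Product_1184",
  "10055849645924040293,Product_334",
  "10060944748730254910,Product_306",
  "10060944748730254910,Product_154",
  "10060944748730254910,Product_78"
), csv_path)

# "single" format: every line is one (transaction id, item) pair.
# All rows with the same UserID are merged into one transaction.
trans <- read.transactions(
  file   = csv_path,
  format = "single",
  sep    = ",",
  header = TRUE,
  cols   = c("UserID", "ProductID")
)

# Mine association rules. supp and conf are assumed values;
# minlen = 2 keeps only rules that cover at least two products.
rules <- apriori(
  trans,
  parameter = list(supp = 0.005, conf = 0.05, minlen = 2),
  control   = list(verbose = FALSE)
)

# Print the first few rules with their quality measures
inspect(head(rules, n = 5))
```

On such a tiny sample almost every combination passes the thresholds, so the printed rules are only useful for checking that the pipeline works.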

Running the apriori algorithm gives us a list of association rules based on our input data. Let us have a look at the output of the model to see what these rules look like. You can use the inspect command from the arules package to print the rules to the console.

| lhs | rhs | support | confidence | lift |
|---|---|---|---|---|
| {Product_125} | {Product_306} | 0.006 | 0.387 | 4.040 |
| {Product_306} | {Product_125} | 0.006 | 0.072 | 4.040 |
| {Product_63} | {Product_385} | 0.005 | 0.400 | 19.472 |
| {Product_385} | {Product_63} | 0.005 | 0.285 | 19.472 |
| {Product_264} | {Product_92} | 0.005 | 0.378 | 27.143 |
| {Product_92} | {Product_264} | 0.005 | 0.360 | 27.143 |
| {Product_523} | {Product_306} | 0.005 | 0.369 | 3.859 |
| {Product_306} | {Product_523} | 0.005 | 0.061 | 3.858 |
| {Product_102} | {Product_120} | 0.005 | 0.506 | 8.460 |
| … | … | … | … | … |

An example rule for our data set is {Product_125} ⇒ {Product_306}, meaning that customers who buy Product_125 also tend to buy Product_306. To select interesting rules from the set of all possible rules, constraints on various measures of significance and interest can be used. The best-known constraints are minimum thresholds on support and confidence.

The support of a rule is the proportion of transactions in the data set that contain all products in the rule. In the table above, the rule {Product_125} ⇒ {Product_306} has a support of 0.006, meaning that the two products were bought together in 0.6% of all transactions. The confidence is another important measure of interest. The same rule has a confidence of 0.387, which means that 38.7% of the transactions that contain the LHS also contain the RHS. To run the Apriori algorithm you need to define both a minimum support and a minimum confidence constraint, which helps you filter for interesting rules. We also defined a minimum length of two because we want each rule to cover at least two products.

Another popular measure of interest is the lift of an association rule. The lift is defined as lift(X ⇒ Y) = supp(X ∪ Y) / (supp(X) · supp(Y)) and can be interpreted as the deviation of the support of the whole rule from the support expected under independence, given the supports of the LHS and the RHS. A lift of 1 means that the two sides occur together exactly as often as expected under independence, and greater lift values indicate stronger associations. There is a lot more to discover about association rule mining with the arules package if you look at its reference manual.
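To make the three measures concrete, here is a small worked example in base R that computes them by hand. The five baskets and the item names `A`, `B` and `C` are made up for illustration and have nothing to do with the shop data. The helper `support()` is likewise a hypothetical name, not an arules function.

```r
# Toy baskets (made-up data for illustration only)
baskets <- list(
  c("A", "B"),
  c("A", "B", "C"),
  c("A"),
  c("B", "C"),
  c("C")
)

# Support: share of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

supp_x  <- support("A", baskets)          # 3 of 5 baskets
supp_y  <- support("B", baskets)          # 3 of 5 baskets
supp_xy <- support(c("A", "B"), baskets)  # 2 of 5 baskets

# Measures for the rule {A} => {B}
confidence <- supp_xy / supp_x            # share of A-baskets that also hold B
lift       <- supp_xy / (supp_x * supp_y) # observed vs. expected co-occurrence

cat(sprintf("support = %.3f, confidence = %.3f, lift = %.3f\n",
            supp_xy, confidence, lift))
# prints: support = 0.400, confidence = 0.667, lift = 1.111
```

The lift is only slightly above 1 here, so in this toy data A and B appear together just a little more often than independence would predict.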

| lhs | rhs | support | confidence | lift |
|---|---|---|---|---|
| {Product_92} | {Product_264} | 0.005 | 0.360 | 27.143 |
| {Product_374} | {Product_378} | 0.006 | 0.398 | 21.923 |
| {Product_98} | {Product_929} | 0.012 | 0.556 | 20.165 |
| {Product_375} | {Product_376} | 0.007 | 0.365 | 20.139 |
| {Product_257} | {Product_880} | 0.006 | 0.378 | 19.847 |
| {Product_63} | {Product_385} | 0.005 | 0.400 | 19.472 |
| {Product_908} | {Product_98} | 0.007 | 0.412 | 18.702 |
| {Product_376} | {Product_378} | 0.006 | 0.331 | 18.338 |
| {Product_378} | {Product_375} | 0.006 | 0.384 | 17.824 |
| {Product_54} | {Product_719} | 0.005 | 0.256 | 17.415 |
| … | … | … | … | … |

In the table above the rules are sorted by lift, with redundant rules removed, so we can see the top 10 most interesting associations in our data set. This information can now be used for cross-selling and up-selling. The arules package also makes it easy to filter for rules that involve a specific product. Find the whole code along with other projects on my GitHub.
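The post-processing steps mentioned here (removing redundant rules, sorting by lift, filtering for one product) might look roughly like this. It is a sketch under assumptions, not the original code: the six baskets are made up so that the block runs on its own, and the thresholds `supp = 0.1` and `conf = 0.5` are chosen only to suit this toy data.

```r
library(arules)

# Made-up baskets that reuse some product names from the tables above
baskets <- list(
  c("Product_125", "Product_306"),
  c("Product_125", "Product_306", "Product_523"),
  c("Product_306", "Product_523"),
  c("Product_63", "Product_385"),
  c("Product_63", "Product_385"),
  c("Product_306")
)
trans <- as(baskets, "transactions")

# Thresholds are assumptions that fit the toy data, not the post's settings
rules <- apriori(
  trans,
  parameter = list(supp = 0.1, conf = 0.5, minlen = 2),
  control   = list(verbose = FALSE)
)

# Drop rules for which a more general rule has the same or higher confidence
rules <- rules[!is.redundant(rules)]

# Keep the (up to) ten rules with the highest lift
top <- head(sort(rules, by = "lift"), n = 10)
inspect(top)

# Filter: only rules that recommend Product_306 on the right-hand side
rules_306 <- subset(rules, subset = rhs %in% "Product_306")
inspect(rules_306)
```

Swapping `rhs` for `lhs` in the `subset()` call would instead return the rules that start from a given product, which is the more natural view for "customers who bought this also bought" recommendations.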