Association Rule Learning | Apriori | Machine Learning


In this tutorial, we are going to understand association rule learning and implement the Apriori algorithm in Python.


Association Rule Learning: Association rule learning is a machine learning method for discovering interesting relations between variables in large databases, e.g. the transaction database of a store. It identifies frequent associations among variables, called association rules, each of which consists of an antecedent (if) and a consequent (then).


For example, if a person buys a burger, then there is a good chance that they will buy some french fries too. This is because there is a relationship between french fries and burgers (they are often bought together).


The task of association rule learning is to discover this kind of relationship and identify the rules that describe it.


Association rule learning algorithms are used extensively in data mining for market basket analysis, which means finding dependencies among the products that customers purchase by analyzing the customer transaction database.


Two widely used association rule learning algorithms are Apriori and Eclat. In this article, we are going to implement the Apriori algorithm.



Apriori Intuition: This is a classic algorithm in data mining. It is used for analyzing frequent itemsets and relevant association rules. It can operate on databases containing a lot of transactions.

Let's take an example of transactions made by customers in a grocery shop.



From the above transactions, the potential association rules can be:

If a customer buys Burgers, they may also buy French fries

If a customer buys Vegetables, they may also buy Fruits

If a customer buys Pasta, they may also buy Butter



These associations are measured using three common metrics: Support, Confidence and Lift.


Support: Support is the fraction of all transactions in which an item (or set of items) appears, for example the share of all transactions that contain a burger. Mathematically, for an item I,

support(I) = (number of transactions containing I) / (total number of transactions)


Confidence: Confidence is the conditional probability that the consequent (then) occurs given that the antecedent (if) has occurred. It is a way of testing a rule: if a customer buys a burger (antecedent), how often do they also buy french fries (consequent)? Mathematically, the confidence of I2 given I1 is

confidence(I1 -> I2) = (number of transactions containing both I1 and I2) / (number of transactions containing I1)

                       
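Lift: Lift is the ratio of a rule's confidence to the support of its consequent. It tells how much more likely the consequent is to be purchased when the antecedent is purchased than it is in general. A lift above 1 means that buying a burger makes buying french fries more likely, and a lift of 1 means the two purchases are independent. Mathematically,

lift(I1 -> I2) = confidence(I1 -> I2) / support(I2)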
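To make the three metrics concrete, here is a minimal sketch that computes them on a tiny transaction list. The five transactions and the helper names (support, confidence, lift) are invented for illustration and are not taken from the example table or from any library.

```python
# Hypothetical toy transactions, invented purely for illustration.
transactions = [
    {"burger", "french fries"},
    {"burger", "french fries", "cola"},
    {"burger", "cola"},
    {"pasta", "butter"},
    {"vegetables", "fruits"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence of the rule divided by the support of the consequent."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

support_value = support({"burger"}, transactions)
confidence_value = confidence({"burger"}, {"french fries"}, transactions)
lift_value = lift({"burger"}, {"french fries"}, transactions)
print(f"support={support_value:.2f} confidence={confidence_value:.2f} lift={lift_value:.2f}")
```

In this toy data the lift of burger -> french fries is above 1, so the rule points to a positive association rather than a coincidence.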

                                                       

The algorithm itself works in four steps.

Step 1: Set a minimum support and confidence.

Step 2: Take all the subsets in transactions having higher support than the minimum support.

Step 3: Take all the rules of these subsets having higher confidence than minimum confidence.

Step 4: Sort the rules by decreasing lift.
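The four steps can be sketched in plain Python as follows. This is a simplified illustration, not the implementation used later in this tutorial: the function name apriori_sketch and the toy transactions are made up, and the thresholds are treated as inclusive (a value equal to the minimum passes).

```python
from itertools import combinations

def apriori_sketch(transactions, min_support, min_confidence):
    """Return (antecedent, consequent, support, confidence, lift) tuples sorted by lift."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Step 2: grow frequent itemsets level by level (1 item, 2 items, ...).
    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = {}
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset)
        # Join frequent k-itemsets into (k+1)-item candidates, keep the frequent ones.
        size = len(next(iter(level))) + 1
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        level = {c for c in candidates if support(c) >= min_support}

    # Step 3: split each frequent itemset into antecedent -> consequent rules.
    rules = []
    for itemset, supp in frequent.items():
        for k in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, k)):
                consequent = itemset - antecedent
                conf = supp / frequent[antecedent]
                if conf >= min_confidence:
                    lift = conf / frequent[consequent]
                    rules.append((set(antecedent), set(consequent), supp, conf, lift))

    # Step 4: strongest rules first.
    return sorted(rules, key=lambda rule: rule[4], reverse=True)

# Hypothetical toy transactions; Step 1 is the choice of the two thresholds below.
toy = [
    ["burger", "french fries"],
    ["burger", "french fries", "cola"],
    ["burger", "cola"],
    ["pasta", "butter"],
    ["vegetables", "fruits"],
]
rules = apriori_sketch(toy, min_support=0.4, min_confidence=0.5)
for antecedent, consequent, supp, conf, lift in rules:
    print(antecedent, "->", consequent, f"support={supp:.2f} confidence={conf:.2f} lift={lift:.2f}")
```

The level-by-level growth relies on the fact that every subset of a frequent itemset must itself be frequent, so itemsets that fail the support threshold never need to be extended.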

 

Let's apply these steps one by one to the above example.

First, we will calculate the frequency table for the itemset.

                                                                                                 


Let's say we need to find the association rule for Burger->French fries

Step 1: Let's say we have set both the minimum support and the minimum confidence to 15%. That means no item with a support below 15% will be considered.

Then we are left with the following items.

                                                                                                   


Step 2: Now our possible subsets for the above itemsets will be {Burger, French Fries}, {Burger, Vegetables}, {French Fries, Vegetables} etc.
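As a small illustration of this step, candidate pairs can be generated with itertools.combinations. The list below contains only the three items named above; the actual list of frequent items may be longer.

```python
from itertools import combinations

# Only the three items named in the text; the real frequent-item list may be longer.
frequent_items = ["Burger", "French Fries", "Vegetables"]

# Every unordered pair of frequent items is a candidate 2-itemset.
candidate_pairs = [set(pair) for pair in combinations(frequent_items, 2)]
print(candidate_pairs)
```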

Step 3: With our confidence threshold, we are left with only one pair, and therefore one rule: {Burger, French Fries}, or Burger --> French Fries.


Step 4: As we are left with only one rule, there is nothing to sort. We calculate the lift for this rule, which is approximately 3.7.


Implementation in Python: Now, we will implement the Apriori algorithm in Python. For this task, we are using a dataset called "Market_Basket_Optimisation.csv" that contains the transactions made by customers of a grocery store. Our goal is to use the Apriori algorithm to find potential association rules among the products.


You can download the dataset from here.


First of all, we will import the essential libraries.


# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd


The next step is to preprocess our data. Each row of the dataset holds the products bought by a single customer in one transaction. The first row is therefore a transaction too, not a header, so we read the file with header = None. The Apriori implementation works with strings, which means we need to build a list of string values from the dataset. To do so, we will use Python's str function.

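# Data Preprocessing
# The file holds 7501 transactions of up to 20 products each
dataset = pd.read_csv('Market_Basket_Optimisation.csv', header = None)
# Build one list of product-name strings per transaction, skipping empty (NaN)
# cells so that 'nan' is not treated as a product
transactions = []
for row in dataset.values:
    transactions.append([str(item) for item in row if pd.notna(item)])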
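As an alternative sketch (not the approach used in this tutorial), the rows can also be read with Python's standard csv module. pandas fills the unused cells of shorter rows with NaN, whereas reading the rows directly gives lists of varying length with no placeholders to handle. The three-row sample below is invented so that the snippet runs on its own.

```python
import csv
import io

# Hypothetical three-row sample standing in for the real CSV file.
sample_csv = io.StringIO(
    "burger,french fries,cola\n"
    "pasta,butter\n"
    "vegetables,fruits,milk,eggs\n"
)

# Each CSV row becomes one transaction; blank cells are dropped.
sample_transactions = [
    [item.strip() for item in row if item.strip()]
    for row in csv.reader(sample_csv)
]
print(sample_transactions)
```

For the real dataset, the StringIO object would be replaced by the opened file, e.g. open('Market_Basket_Optimisation.csv', newline='').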

Now, we reach the part where we apply the Apriori algorithm to our dataset. To use the apriori function in our program, we need apyori.py, an open-source Python module for the Apriori algorithm.

You can find the module here.


We pass the list transactions to the function as its first argument. The crucial step in performing Apriori is to set the minimum values for support, confidence and lift. To find the most valuable association rules, we need a suitable combination of these values. Otherwise, the algorithm returns either too few rules or too many weak ones.

For our dataset, we have found that a minimum support of 0.3% (0.003), a minimum confidence of 20% (0.2) and a minimum lift of 3 work well. But these values vary across different datasets and business problems. So, you should carefully observe the dataset to set these values.


The minimum length is set to two, which means we want associations among at least two products.


# Training Apriori on the dataset
from apyori import apriori
rules = apriori(transactions, min_support = 0.003, min_confidence = 0.2, min_lift = 3, min_length = 2)
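To get a feel for what the support threshold means in practice, it can be converted into a number of transactions. The short calculation below assumes the 7501 transactions loaded during preprocessing; the variable names are illustrative.

```python
import math

n_transactions = 7501   # transactions in the dataset, per the preprocessing step
min_support = 0.003     # value passed to apriori() above

# Smallest number of transactions an itemset must appear in to pass the filter.
min_count = math.ceil(min_support * n_transactions)
print(min_count)
```

So with these settings an itemset has to appear in at least 23 transactions before it is considered at all.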



Now, we will look at the potential association rules the algorithm has found.

# Visualising the results
results = list(rules)


This will generate a list like the following. It contains all the potential rules that satisfy the given constraints.
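The raw list is nested, so a small helper can flatten it into one row per rule. The sketch below assumes that each record exposes the fields items, support and ordered_statistics, and that each ordered statistic exposes items_base, items_add, confidence and lift. The namedtuples are stand-ins that mimic this layout so the snippet runs without apyori installed, and the numbers in the demo record are made up.

```python
from collections import namedtuple

# Stand-ins that mirror the assumed field names of apyori's result records.
OrderedStatistic = namedtuple("OrderedStatistic", "items_base items_add confidence lift")
RelationRecord = namedtuple("RelationRecord", "items support ordered_statistics")

def flatten(records):
    """Turn nested records into (antecedent, consequent, support, confidence, lift) rows."""
    rows = []
    for record in records:
        for stat in record.ordered_statistics:
            rows.append((
                ", ".join(sorted(stat.items_base)),
                ", ".join(sorted(stat.items_add)),
                record.support,
                stat.confidence,
                stat.lift,
            ))
    # Strongest associations first.
    return sorted(rows, key=lambda row: row[4], reverse=True)

# Hypothetical record with made-up numbers, used only to exercise the helper.
demo = [RelationRecord(
    items=frozenset({"burger", "french fries"}),
    support=0.4,
    ordered_statistics=[
        OrderedStatistic(frozenset({"burger"}), frozenset({"french fries"}), 0.67, 1.67),
    ],
)]
flat_rows = flatten(demo)
for row in flat_rows:
    print(row)
```

If the record layout matches this assumption, calling flatten(results) on the list created above should give one readable row per rule.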