Upper Confidence Bound (UCB) | Machine Learning


In this tutorial, I will explain to you the application of Upper Confidence Bound(UCB) algorithm to solve the Multi Bandit problem and show you the whole coding process in Python.

What is the Multi-Armed Bandit Problem?
Before going to learn the multi-armed bandit problem, first, take a look at the exploration vs. exploitation dilemma. This dilemma exists in many aspects of our life. Well, say you take your lunch at your favorite restaurant every day as you are confident about what you get from there is good. But you may be missing the chances of discovering even a better option. If you explore all the restaurants in your locality one by one, the probability of tasting the worst food in your life would be pretty high. But then again, there is also a probability that you get an even better option! This dilemma is called exploration vs. exploitation dilemma. Now lets go to the classic example of this dilemma- the Multi-Armed Bandit Problem

Multi-Armed Bandit Problem:

This is a use case of reinforcement learning, where we are given a slot machine called multi-armed bandit( the slot machines in casinos are called bandit as it turns out all casinos configure these machines in such a way that all gamblers end up losing money!) with each arm having its own rigged probability distribution of success. Pulling any one of these arms gives you a stochastic reward of either 1, for success or 0, for a failure. Now your task is to find such an optimal strategy that will give you the highest rewards in the long run without the prior knowledge of the probability distribution of success of the machines.

This problem can be related to more business examples such as displaying the optimal ads to the viewer.

So in this tutorial, we are going to solve a use case of the multi-armed bandit problem. We take an example of an online advertising campaign dataset where we have 10 different versions of a similar ad. You see the dataset looks similar to that of a multi-armed bandit problem!

Let's have a quick check on our multi-armed bandit problem


                                   21_ucb_1

You can download the dataset from here.

UCB Algorithm Intuition:

You can find the best option by using an A/B test. Where you can show a different ad to a user and take his feedback, that is whether the user clicks or not. And summing up all the feedback, you can find the best ad to display. Bat wait, there is a problem, in A/B test you are incurring more cost and time by doing exploration and exploitation separately. So we need to combine exploration and exploitation and get the result as soon as possible. We do not have enough time and money to actually do this exploitation, so we will do this during the campaign.

The Upper Confidence Bound algorithm the kind of algorithm that helps us to perform exploitation and exploration together.

How the UCB Algorithm Works?

At the start of the campaign, we don't know what is the best arm or ad. So we cannot distinguish or discriminate any arm or ad. So the UCB algorithm assumes they all have the same observed average value. Then the algorithm creates confidence bound for each arm or ad. So it randomly picks any of the arms or ads. Then two things can happen- the user clicks the ad or the arm gives reward or does not. Let's say the ad did not have a click or the arm was a failure. So the observed average of the ad or arm will go down. And the confidence bound will also go down. If it had a click, the observed average would go up and the confidence bound also goes up. By exploiting the best one we are decreasing the confidence bound. As we are adding more and more samples, the probability of other arms or ad doing well is also going up. This is the fundamental concept of UCB.

Now let's see what steps are involved to perform this algorithm on our dataset:

                                             21_ucb_2


UCB in Python:

We will implement the whole algorithm in Python. First of all, we need to import some essential libraries.

# Importing the Essential Libraries 
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Now, let's import the dataset-

# Importing the Dataset 
dataset = pd.read_csv('Ads_CTR_Optimisation.csv')

Now, implement the UCB algorithm:

# Implementing UCB
import math
N = 10000
d = 10
ads_selected = []
numbers_of_selections = [0] * d
sums_of_rewards = [0] * d
total_reward = 0
for n in range(0, N):
    ad = 0
    max_upper_bound = 0
    for i in range(0, d):
        if (numbers_of_selections[i] > 0):
            average_reward = sums_of_rewards[i] / numbers_of_selections[i]
            delta_i = math.sqrt(3/2 * math.log(n + 1) / numbers_of_selections[i])
            upper_bound = average_reward + delta_i
        else:
            upper_bound = 1e400
        if upper_bound > max_upper_bound:
            max_upper_bound = upper_bound
            ad = i
    ads_selected.append(ad)
    numbers_of_selections[ad] = numbers_of_selections[ad] + 1
    reward = dataset.values[n, ad]
    sums_of_rewards[ad] = sums_of_rewards[ad] + reward
    total_reward = total_reward + reward
Code Explanation:
Here, three vectors are used.
# ads_selected = [], is used to append the different types of ads selected in each round 
# numbers_of_selections = [0] * d, is used to count the number of time an ad was selected 
# sums_of_rewards = [0] * d, is used to calculate the cumulative sum of rewards at each round

As we don't have any prior knowledge about the selection of each ad, we will take the first 10 rounds as trial rounds. So, we set the if condition:  if (numbers_of_selections[i] > 0) so that the ads are selected at least once before entering into the main algorithm.

Then we implement the second step of the algorithm, computing the average reward of each ad i up to round n and the upper confidence bound for each ad.

Here we applied a trick in else condition by taking the variable upper_bound  to a huge number. This is because we want the first 10 rounds as trial rounds where the 10 ads are selected at least once. This trick will help us to do so.

After 10 trial rounds, the algorithm will work as the steps explained earlier.

Visualizing the Result:
Now, it's time to see what we get from our model. For this, we will visualize the output.

plt.hist(ads_selected)
plt.title('Histogram of ads selections')
plt.xlabel('Ads')
plt.ylabel('Number of times each ad was selected')
plt.show()

                                                                21_ucb_3

From the above visualization, we can see that the fourth ad got the highest click. So our model advice us to place the fourth version of the ad to the user for getting the highest number of clicks!