Multi-arm Bandits

Introduction

Today we will explore an alternative to statistical significance for picking between treatments. This alternative is less complicated and often finds the best treatment quickly.

Imagine you own an email marketing campaign for your company. Your manager asks you to increase engagement in the campaign by improving the title. Your team creates two new titles, and you reach out to the analytics team to learn how to evaluate them. They tell you to run a three-way experiment comparing the title you had been using previously to the two new ones to determine which one gets the highest click-through rate (the percent of recipients who click on the email to read it). The analytics team performs a power analysis and expects that the experiment will need to run for at least three months to collect enough data to show a statistically significant difference. It will take the team a month to design the experiment and create the three experiment groups. It will take another month after the experiment is over to run the analysis. You start to get a little worried about how long it will take you to report back to your manager with success on this project.

This type of problem comes up at every company: a car dealership may want to explore which advertisements to run during the holidays, a manufacturing company may want to figure out the right price for a part it sells, etc. Teams typically turn to statistical significance checks to answer these questions, maybe because that is what shows up in undergraduate statistics courses. This approach is not well suited for business decisions: business decisions rely on many unknown factors (for example, how much demand will there be in the market for your product this year?), and they often need to be made quickly since opportunities do not last forever. As Gedeck says in Practical Statistics for Data Scientists, "We are often more concerned with optimizing overall effort and results."

Today we will explore the multi-arm bandit problem and how its solutions are well suited to making business decisions.

The Multi-arm Bandit Problem

Multi-arm bandit algorithms get their name from slot machines, which are nicknamed "one-armed bandits". Conceptually, we can imagine a slot machine with many arms, each of which pays out at a different rate. Our goal is to win as much money as possible.

When we start, we don't know the payout rates for the arms though.

Imagine a machine with three arms. We pull each arm 50 times and get the following results:

Arm  Wins  Losses
A    10    40
B    2     48
C    4     46

If these were all the pulls we could perform and we had to decide which arm we preferred, we would probably go with arm A. Alternatively, we could decide that the differences look like they were caused by chance and do some more testing. But in that case, how long do we keep going? A multi-arm bandit algorithm takes a hybrid approach: we start pulling the winning arm more often since it will likely win us more money, and we keep pulling A more as it continues to outperform B and C. The same is true for B and C if over time they start outperforming.
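To make that comparison concrete, here is a quick sketch (not part of the original example) that computes the observed win rate of each arm from the table above:

results = {"A": (10, 40), "B": (2, 48), "C": (4, 46)}  # (wins, losses) from the table

for arm, (wins, losses) in results.items():
  winRate = wins / (wins + losses)
  print(arm, winRate)  # A: 0.20, B: 0.04, C: 0.08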

Epsilon Greedy Algorithm

Now let's translate this to the web testing examples we have looked at. Instead of different arms on a slot machine, we are looking at changes we can make to our site, like different headlines, prices, etc.

Let's look at a simple greedy algorithm for an A/B test:

  1. Generate a uniformly distributed random number between 0 and 1

  2. If the number is between 0 and epsilon (a small number between 0 and 1), flip a fair coin: on heads show offer A, otherwise show offer B

  3. If the number is greater than epsilon show whichever offer has the highest response rate so far

    Epsilon is a parameter that we have to pick a priori. If we pick 1, then this is just an old-fashioned A/B test.

I’ll walk through code to execute this below and the entire thing is available as a Colab. I used a GeeksForGeeks example as a jumping off point.

import numpy as np

np.random.binomial(1, 0.7, 1)[0]

We use NumPy’s random.binomial method to simulate pulling an arm one time with a 70% success rate. This returns 1 when the bandit pays out and 0 when it does not.
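As a quick sanity check (this cell is not in the original Colab), drawing many samples shows the success rate landing near the 70% we asked for:

import numpy as np

pulls = np.random.binomial(1, 0.7, 10000)
print(pulls.mean())  # roughly 0.7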

import numpy as np

class Bandit:
  def __init__(self, rewardLikelihood, reward):
    self.rewardLikelihood = rewardLikelihood # How likely is this bandit to give a reward
    self.reward = reward # How large is the reward that this bandit gives
    self.mean = 0 # What is the average reward of this bandit so far
    self.pullCount = 0 # How many times have we "pulled" this bandit

  def pull(self):
    return np.random.binomial(1, self.rewardLikelihood, 1)[0] * self.reward

  def update(self, result):
    self.pullCount += 1
    self.mean = (1 - 1.0 / self.pullCount) * self.mean + 1.0 / self.pullCount * result

We create a simple object to maintain the state of each Bandit. It has four member variables:

  • rewardLikelihood: how likely this bandit is to pay out on any given pull

  • reward: how much does this bandit pay out when it succeeds

  • mean: what is the mean of the payouts so far, i.e., sum of payouts divided by number of pulls we have made

  • pullCount: how many times we have pulled the arm on this bandit; this is needed to compute the mean

The class has two methods:

  • pull(): Execute one pull of the arm. Randomly (using binomial() as discussed above) decide if the pull was a success and return the reward amount if it was. Otherwise return zero.

  • update(): Given a result (the reward returned by pull(), which is zero when the pull did not pay out), update the pull count and mean for this bandit. Note that we use an incremental technique: we take the weighted average of the first n-1 pulls and add in the weighted contribution of the one new pull, see this StackOverflow for details. A quick sanity check of this is sketched below.
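Here is that sanity check: a minimal sketch (not part of the original code) that pulls a single Bandit many times and confirms that the running mean maintained by update() matches the plain average of the observed rewards, up to floating-point error:

import numpy as np

bandit = Bandit(0.7, 1)
rewards = []
for _ in range(1000):
  reward = bandit.pull()
  bandit.update(reward)
  rewards.append(reward)

print(bandit.mean)       # running mean maintained by update()
print(np.mean(rewards))  # the same value computed directly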

def runExperiment(numberTrials, epsilon):
  allRewards = np.empty(numberTrials)

  bandits = [Bandit(0.5, 2), Bandit(0.1, 1000), Bandit(0.7, 1)]
  numberOfBandits = len(bandits)

  for i in range(numberTrials):
    # Explore with probability epsilon, otherwise exploit the best arm so far
    explore = np.random.random()
    if explore < epsilon:
      selectedBandit = np.random.choice(numberOfBandits)  # explore: pick an arm at random
    else:
      selectedBandit = np.argmax([b.mean for b in bandits])  # exploit: highest mean reward so far
    reward = bandits[selectedBandit].pull()
    bandits[selectedBandit].update(reward)
    allRewards[i] = reward  # record the reward history

  # Cumulative average reward after each iteration
  cumulativeAverage = np.cumsum(allRewards) / (np.arange(numberTrials) + 1)
  for b in bandits:
    print(b.mean)
  return cumulativeAverage

We use our Bandit class to create runExperiment(), which uses the Epsilon Greedy algorithm to maximize the rewards we get from three Bandits: one with a 50% chance to pay out $2, one with a 10% chance to pay out $1000, and one with a 70% chance to pay out $1. The expected payouts of the Bandits are therefore $1, $100, and $0.70 respectively.
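As a quick check on those expected payouts, each one is just the payout likelihood multiplied by the reward amount:

for likelihood, reward in [(0.5, 2), (0.1, 1000), (0.7, 1)]:
  print(likelihood * reward)  # 1.0, 100.0, 0.7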

We run the experiment numberTrials times. During each iteration, we use epsilon to decide whether to pick a bandit at random or to use the arm with the highest mean reward so far. In either case we then:

  • pull the arm

  • update the bandit with the resulting reward, if any

  • add the reward (if any) to the array allRewards, which records the history at each iteration, i. We will use this later to report on how long it took the algorithm to find the optimal arm to pull

After we have iterated numberTrials times, we use allRewards to compute the cumulative average reward at each iteration, i, and return the resulting array.
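To make the cumulative-average computation concrete, here is a small sketch on a toy reward history (the values are made up for illustration):

import numpy as np

toyRewards = np.array([0, 2, 0, 1000, 2])
cumulativeAverage = np.cumsum(toyRewards) / (np.arange(len(toyRewards)) + 1)
print(cumulativeAverage)  # [0., 1., 0.667, 250.5, 200.8]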

import matplotlib.pyplot as plt

epsilon10Results = runExperiment(100000, 0.1)
epsilon5Results = runExperiment(100000, 0.05)
epsilon1Results = runExperiment(100000, 0.01)

plt.plot(epsilon10Results, label ='eps = 0.1')
plt.plot(epsilon5Results, label ='eps = 0.05')
plt.plot(epsilon1Results, label ='eps = 0.01')
plt.legend()
plt.xscale('log')
plt.show()

To get an understanding of how this function performs, we execute runExperiment() on three different epsilon values but with the same number of trials (100,000 each). We try epsilon values of 10%, 5%, and 1%. Remember, this is the likelihood that we will try a random arm on each iteration, which gives the algorithm a chance to explore and find arms that are less likely to succeed but which pay out far more than the others.

We store the cumulative average for each epsilon value and plot each one, shown below; note that the x-axis is on a log scale.

Each of the runs eventually converges to a cumulative average of about $100, the expected payout of the best arm.

Let’s look at what is happening for epsilon = 1% first (the green line). Very early on, the cumulative average is only slightly above $0, which means it is likely mostly pulling Bandit 3, whose expected payout is $0.70.

It rises quickly around iteration 10 and holds at this value even past iteration 1,000. During this time, it has settled on Bandit 1, whose expected payout matches the cumulative average, $1.

Finally, somewhere between iteration 1,000 and 10,000 we see a sharp spike upward in the cumulative average. At this inflection point, the algorithm has finally randomly explored Bandit 2 enough times to learn that its expected payout is $100 and begins to pick that bandit over the others, increasing our average payout faster than the previous arms did. Eventually it converges (plateaus) at $100 around iteration 100,000.

It is interesting to note that all three runs converge on the same number. In this case, based on the parameters we arbitrarily picked, the highest epsilon value converges the fastest. This doesn’t mean a higher epsilon is always better in general, though.

It is neat to see that we can find the best treatment using a simple greedy algorithm, without predefining experiment groups and without running statistical significance tests with arbitrary p-values.

Wrapping it up

In this blog, we looked at a Monte Carlo style simulation of three arms where we hard-coded the likelihood of a success and the payout. I hope that it is easy to see how we can use this in the real world to solve our “pick the best treatment” problem discussed at the beginning.

In our example, we had three email titles that we wanted to experiment with, and the reward is whether a recipient clicks to read the email. Each title can be viewed as its own Bandit with an unknown success likelihood and an equal payout amount; we only care whether a recipient reads the email or not. We could use the same runExperiment() method and set numberTrials to the number of customers we want to send an email to. For each customer, we would either pick a random title (based on epsilon) or pick the title that has the best click-to-read rate so far. Then we send the email and use our email marketing software to track whether the customer reads it. Depending on the result, we update the rewards for the selected bandit. A rough sketch of this adaptation is below.
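In this sketch, the titles, the epsilon value, and the sendEmail() and didRecipientClick() functions are hypothetical stand-ins for whatever your email marketing software provides; the Bandit class is the one defined above, with rewardLikelihood unused because real clicks replace pull():

import numpy as np

def sendEmail(customer, title):
  pass  # placeholder: call your email marketing software here

def didRecipientClick(customer):
  return False  # placeholder: query your email marketing software for the click result

titles = ["original title", "new title 1", "new title 2"]
bandits = [Bandit(0.0, 1) for _ in titles]  # rewardLikelihood is unused; a reward of 1 means a click
epsilon = 0.1
customers = ["customer1@example.com", "customer2@example.com"]  # hypothetical recipient list

for customer in customers:
  if np.random.random() < epsilon:
    selected = np.random.choice(len(bandits))  # explore: random title
  else:
    selected = np.argmax([b.mean for b in bandits])  # exploit: best click-to-read rate so far
  sendEmail(customer, titles[selected])
  reward = 1 if didRecipientClick(customer) else 0
  bandits[selected].update(reward)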

Overall, we can adapt this method to most treatment selection problems. The next time one comes up consider giving this algorithm a shot.

Jim Herold

Jim Herold is a Catholic, Husband, Father, and Software Engineer.

He has a PhD in Computer Science with a focus on machine learning and how it improves natural, sketch-based interfaces.

Jim researched at Harvey Mudd and UC Riverside, taught at Cal Poly Pomona and UC Riverside, and worked at JPL NASA and Google.
