A Monte Carlo Significance Test

The math behind statistical significance testing is unintuitive to many folks. In this article, I present a Monte Carlo approach that is easier to grok and visualize.

Background

Imagine we work for the marketing department of direct-sails.com, a web-based company with a high-value product: sails for sailboats. It's our job to test two different versions of the web page that allows customers to buy this product: version A (the current version of the page) and version B (a new version with large, gorgeous photos of our sails).

Our product is expensive and sells infrequently, and our executives do not want to wait long periods of time for sales data to accumulate. Instead, we use a proxy variable to measure the efficacy of the two web page versions: the amount of time customers spend on the page. Our hypothesis is that more time spent on the page indicates that a customer is more likely to buy a sail.

We can run an A/B test to see which web page version leads to more time spent on the page. When a customer visits, we randomly decide which version to show them and record the time they spend viewing the page. We can then use a null hypothesis test to determine whether random chance is responsible for the effect we observe, or whether there is a true difference between the two web pages. The book Practical Statistics says of null hypotheses,

Given the human tendency to react to unusual but random behavior and interpret it as something meaningful and real, in our analyses we will require proof that the difference between groups is more extreme than what chance might reasonably produce.
— Bruce, Bruce, and Gedeck

This means that in a null hypothesis test, by definition, we start with the unintuitive assumption that the data from the two web page versions are effectively the same. The goal of the null hypothesis test is to prove this assumption wrong.

There is an intuitive way to perform this check with a Monte Carlo approach, which we will demonstrate in this article. We can mix all the data from the two web page versions together into one big data set. Then we can randomly select from that data to create two new groups (with the same sizes as the original groups). By comparing the data in these two random groups, we are directly testing what the null hypothesis asks us to test: treating the data from the two groups as if they were the same. If we repeat this process many times and the differences between the random groups are similar to the difference between the original groups, then the data is interchangeable and the null hypothesis was right. In that case we would conclude there is no significant difference between versions A and B.

Permutation Test

The process of randomly rearranging the original data into two groups is called a permutation test. Permute is defined as, “to submit to a process of alteration or rearrangement.” We will permute our data into two groups and compare them to each other using the following steps:

  1. Create a new data set by combining all the data from the two groups into one and randomizing the order.

  2. Randomly draw (without replacement) rows from the combined data set until we have as many rows as were originally in group A. This creates a new data set which we will call modified A.

  3. Randomly draw (without replacement) rows from the combined data set until we have as many rows as were originally in group B. This creates a new data set which we will call modified B.

  4. Compute the metric we want to compare for each modified data set, e.g., the average time spent viewing the web page.

  5. Repeat these steps N times, keeping track of the metric computed at each iteration.

Code Example

In this section, we will walk through Python code that implements the permutation test steps defined above. All the code is available in a Colab notebook and is based on a small fake data set I generated. This code is adapted from the Permutation Test section of Chapter 3 of Practical Statistics.

from google.colab import drive
import pandas as pd

# Mount Google Drive and load the fake session data
drive.mount('/content/drive')
sessionTimes = pd.read_csv('/content/drive/MyDrive/Teaching/DATA 294/data/sessionTimes.csv')
sessionTimes.head()
  sessionTimeMinutes version
0           2.779620       A
1           0.378777       A
...

The first step simply loads the sample data from Google Drive. Each row has a session time and a version. The session time is the number of minutes a customer spent viewing our web page, and the version indicates whether the customer was viewing version A or B of the page.

import seaborn as sea
sea.boxplot(x="version", y="sessionTimeMinutes", data=sessionTimes)

If we look at a box plot of the data, we can see that version B's session times tend to be longer, but there is a big overlap between the two distributions.
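
To put numbers behind that visual impression, we can also print a quick per-group summary. This is a minimal sketch that simply reuses the sessionTimes data frame loaded above:

# Count, mean, spread, and quartiles of session time for each web page version
sessionTimes.groupby('version')['sessionTimeMinutes'].describe()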

originalA = sessionTimes.loc[sessionTimes['version'] == 'A']
originalA = originalA.drop(columns=['version'])
originalB = sessionTimes.loc[sessionTimes['version'] == 'B']
originalB = originalB.drop(columns=['version'])
originalA.mean()
originalB.mean()
originalMeanDifference = originalB.mean() - originalA.mean()
originalMeanDifference = originalMeanDifference.max() # Max to get a scalar instead of a one-element Series
originalMeanDifference

We put the session times for the original data into their own data frames (one for A and one for B). We see that version A has an average of 1.313191 minutes and version B has an average of 1.797602 minutes. Group B's sessions appear to be 0.4844 minutes longer on average, but if we assume the two groups are equivalent, could random chance explain this difference? We need to run our permutation test to find out.

def permutationTestIteration(combinedDataFrame, groupASize, groupBSize):
  modifiedA = combinedDataFrame.sample(groupASize) # Default is without replacement
  modifiedB = combinedDataFrame.sample(groupBSize)
  return modifiedB.mean() - modifiedA.mean()

combinedDataFrame = pd.concat([originalA, originalB])
originalASize = len(originalA.index)
originalBSize = len(originalB.index)

iterations = 1000
permutationDifferences = [
    permutationTestIteration(combinedDataFrame, originalASize, originalBSize)
    for _ in range(iterations)]  # So much in one line
df = pd.DataFrame(permutationDifferences)
df.head()

That’s all it takes to execute 1,000 permutation tests! We start by defining a function that creates one permutation and returns the difference of the averages between the two modified groups. We use the sample function from pandas to do all the heavy lifting in creating the randomized data sets (or data frames, as they are called in pandas). Then we call this function 1,000 times and store the results from each iteration in a data frame:

   sessionTimeMinutes
0            0.098339
1            0.178706
...

If we take the average of this data frame we get 0.002492, which is close to zero, as we would expect when the two groups are drawn from the same pool. But how do we check whether our original observed difference could be explained by random chance? We can create a 95% confidence interval based on the quantiles:

df.quantile([0.025, 0.975])
       sessionTimeMinutes
0.025           -0.426334
0.975            0.439625

In other words, when we assumed that the data from the two versions was the same, 95% of the time the average difference between the groups fell between -0.426334 and 0.439625. If our original observed difference is outside of these bounds, we can feel confident that it was not caused by chance. Fortunately, we have all the data from the permutation tests, so we can create an intuitive graph:

ax = sea.histplot(data=df)
ax.axvline(x=0.484411, ymin=0, ymax=1, color='red')    # Observed difference
ax.axvline(x=-0.426334, ymin=0, ymax=1, color='green') # Threshold lower bound
ax.axvline(x=0.439625, ymin=0, ymax=1, color='green')  # Threshold upper bound
ax

Here we see the distribution (blue bars) of differences between the modified groups from 1,000 iterations of our permutation test; these bars literally show the distribution of differences we see when we assume that the data from the two versions are the same, i.e., the null hypothesis. The green lines show our 95% confidence interval based on the corresponding quantiles (2.5% and 97.5%). The red line shows the original observed difference. Because the red line is outside of the 95% confidence interval, we can say that the difference is significant. The marketing team’s new web page and its fancy photos have done the trick, and we should use the new version for all customers going forward :)
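
If we want a numeric check to go along with the graph, we can compare the observed difference to the permutation distribution directly. The following is a minimal sketch that reuses the df and originalMeanDifference variables from above and also reports an empirical two-sided p-value, i.e., the fraction of permutation differences at least as extreme as the one we observed:

# 95% confidence interval bounds from the permutation distribution
lowerBound, upperBound = df['sessionTimeMinutes'].quantile([0.025, 0.975])

# Is the observed difference outside the range of differences produced by chance?
isSignificant = (originalMeanDifference < lowerBound) or (originalMeanDifference > upperBound)
print(f"Observed difference: {originalMeanDifference:.4f}")
print(f"95% interval under the null hypothesis: [{lowerBound:.4f}, {upperBound:.4f}]")
print(f"Significant at the 5% level: {isSignificant}")

# Empirical two-sided p-value: fraction of permutation differences at least
# as extreme (in absolute value) as the observed difference
pValue = (df['sessionTimeMinutes'].abs() >= abs(originalMeanDifference)).mean()
print(f"Empirical p-value: {pValue:.3f}")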

This makes sense given how I created the fake data: each session time is a random, uniformly distributed number in [0, 1] multiplied by 3 for version A and by 4 for version B.
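
For reference, here is a rough sketch of how a data set like this could be generated. The group size and random seed here are made up for illustration and will not reproduce the exact numbers above:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # Arbitrary seed, purely for illustration
nPerGroup = 50                   # Hypothetical group size; the real data set may differ

# Uniform [0, 1] samples scaled by 3 for version A and by 4 for version B
fakeData = pd.DataFrame({
    'sessionTimeMinutes': np.concatenate([rng.uniform(0, 1, nPerGroup) * 3,
                                          rng.uniform(0, 1, nPerGroup) * 4]),
    'version': ['A'] * nPerGroup + ['B'] * nPerGroup,
})
fakeData.to_csv('sessionTimes.csv', index=False)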

Wrapping it up

This approach to significance testing has a couple of advantages that make it worth using in the real world. First, it produces a nice graph that clearly shows the 95% confidence interval and whether or not the observed difference falls within it. Second, it is based on a simple algorithm that I believe is easier to explain to someone who doesn’t have a deep understanding of the calculus behind the traditional statistical definition of significance. Give it a shot sometime!

Jim Herold

Jim Herold is a Catholic, Husband, Father, and Software Engineer.

He has a PhD in Computer Science with a focus on machine learning and how it improves natural, sketch-based interfaces.

Jim researched at Harvey Mudd and UC Riverside, taught at Cal Poly Pomona and UC Riverside, and worked at JPL NASA and Google.
