Removing Outliers

In this article we will show how to use Cook’s distance to find and remove outliers in a data set.

Background

Wikipedia says that an outlier is a “data point that differs significantly from other observations.” Outliers can cause a lot of problems when we perform data-driven tasks like machine learning or A/B tests. Outliers can degrade a machine learning model’s ability to predict a trend in a data set. In an A/B test, outliers can have an outsized impact on the performance of group A or B and lead to a wrong conclusion about the best treatment.

Outliers can happen for a variety of reasons. Maybe there was a mistake in the device recording the data. Maybe the wrong subject was measured; one of my professors always joked about a time when data from an orange orangutan was included in a data set about chimpanzees. Oftentimes, the best solution for outliers is simply to remove them.

Defining what an outlier is can be tricky, though. What does “differs significantly” mean, and what is the threshold for how much an outlier should differ from the rest of the data? Let’s explore outliers in the context of linear regression to understand better. All the code presented today can be found in a Colab.

Linear Regression

Simple linear regression estimates how much a response variable, Y, changes when an independent variable, X, changes by a certain amount. (I am sticking with a 1-dimensional problem to keep things simple.)

To concretize things, let’s look at the tips data set, which contains 244 tips a waiter recorded over the course of a couple of months. In particular, we will look at the relationship between the total bill and the tip that the waiter received, as shown in the figure below.

import seaborn as sns
tips = sns.load_dataset("tips")
plot = sns.scatterplot(x="total_bill", y="tip", data=tips)
plot

Linear regression tries to fit a simple line to the data, Y = b0 + b1*X. We define the error (residual) of this line as the difference between the predicted Y values and the actual Y values.

Linear regression minimizes the residual sum of squares (RSS): the sum over all rows i of (yi - (b0 + b1*xi))^2.
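To make the RSS definition concrete, here is a small sketch that computes it directly for a candidate line (the numbers and the candidate coefficients are hypothetical, not from the tips data):

```python
import numpy as np

# Hypothetical data and a candidate line y = b0 + b1*x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
b0, b1 = 0.0, 2.0

predictions = b0 + b1 * x      # predicted Y values from the candidate line
residuals = y - predictions    # observed minus predicted
rss = np.sum(residuals ** 2)   # residual sum of squares
print(rss)
```

Linear regression searches for the b0 and b1 that make this quantity as small as possible.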

Luckily, we can take the derivative of this function, set it to zero, and compute the unique solution to the resulting equation efficiently. I know I am hand-waving a lot here, but the goal of this article is the outlier discussion, not linear regression. The code below shows how to create a linear regression model in scikit-learn and plot the resulting best-fit line.

from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

numSamples = tips["total_bill"].values.size
x = tips["total_bill"].values.reshape(numSamples, 1)
y = tips["tip"].values.reshape(numSamples, 1)

regressionModel = linear_model.LinearRegression()
modelFit = regressionModel.fit(x, y)
trainingPredictions = regressionModel.predict(x)
graph = sns.scatterplot(x="total_bill", y="tip", data=tips)
plt.plot(x, trainingPredictions, color='r')
plt.show()
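As a sanity check on the closed-form solution mentioned above, we can compute the slope and intercept directly from the derivative-set-to-zero formulas and confirm they match what scikit-learn finds. This sketch uses hypothetical synthetic data rather than the tips data set:

```python
import numpy as np
from sklearn import linear_model

# Hypothetical 1-D data (not the tips data set)
rng = np.random.default_rng(42)
x = rng.uniform(0, 50, 100)
y = 1.0 + 0.15 * x + rng.normal(0, 1.0, 100)

# Closed-form least-squares solution obtained by setting the
# derivative of the RSS to zero
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# scikit-learn arrives at the same line
model = linear_model.LinearRegression().fit(x.reshape(-1, 1), y)
print(b0, b1)
print(model.intercept_, model.coef_[0])
```

The two pairs of numbers should agree to within floating-point precision, which is why we can trust the library to do the minimization for us.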

Outliers

Outliers cause big problems for estimates of location like linear regression. See that point way in the upper right corner, the one with a tip around $10 and a total bill of around $50? It is causing the red line to have a much larger slope (b1 from our earlier equation). Is it really a valid data point? Maybe it was a mistake by the waiter recording the data. Maybe there is another signal we are missing that would give us more predictive power. For example, what if this was a wealthy customer who always tips more than normal? If we had income as one of the input variables, maybe we could better predict this particular instance. In any case, based on the data we have, it may make sense to remove this point. But what can we use to label this data point as an outlier?

The book Practical Statistics suggests that we consider influential observations when identifying outliers. The book defines them as,

A value whose absence would significantly change the regression equation.
— Bruce, Bruce, Gedeck

We can think of outliers as those data points with a Y value that is far distant from the prediction. This allows us to identify outliers via the standardized residual, i.e., the residual divided by the standard error of the residuals. Recall that earlier we defined residuals as the difference (error) between the observed and fitted values.
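Standardizing residuals and flagging large ones can be sketched in a few lines. The data, the fitted values, and the cutoff of 2 standard errors below are all hypothetical choices for illustration, not values from the article:

```python
import numpy as np

# Hypothetical observed and fitted values; row 4 sits far from its prediction
observed = np.array([2.0, 3.5, 5.1, 4.0, 12.0, 6.2])
fitted   = np.array([2.2, 3.4, 4.9, 4.3,  5.8, 6.0])

residuals = observed - fitted
standardized = residuals / residuals.std()  # residual / std. error of residuals

# Flag rows whose standardized residual exceeds an (assumed) cutoff of 2
outliers = np.where(np.abs(standardized) > 2.0)[0]
print(outliers)
```

Rows whose standardized residual is several standard errors from zero are the candidates worth inspecting.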

Luckily for us, Cook’s distance exists and can quantify the impact a single data point has on a regression model. It is defined as, “the sum of all the changes in a regression model when observation i is removed.” The idea here is that we first train a regression model using all the data. Then we train n more models, where we remove the ith training data row for each one. We then compute the difference between that model’s parameters (b0 and b1 from earlier) and the first model’s. Rows that drastically change the model parameters are identified as significant and receive a large Cook’s distance.
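The leave-one-out idea can be sketched directly. The example below is a naive illustration of that intuition on hypothetical data (not the standard closed-form formula for Cook’s distance): refit the line n times, each time dropping one row, and measure how far the coefficients move.

```python
import numpy as np

# Hypothetical data with an artificial outlier injected at row 5
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 20)
y = 1.0 + 0.5 * x + rng.normal(0, 0.2, 20)
y[5] += 5.0

def fit_line(x, y):
    # Ordinary least squares for y = b0 + b1*x
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    return y.mean() - b1 * x.mean(), b1

b0_full, b1_full = fit_line(x, y)

shifts = []
for i in range(len(x)):
    xi, yi = np.delete(x, i), np.delete(y, i)
    b0_i, b1_i = fit_line(xi, yi)
    # How far the parameters move when row i is removed
    shifts.append(abs(b0_i - b0_full) + abs(b1_i - b1_full))

print(np.argmax(shifts))  # the injected outlier should dominate the shifts
```

In practice we do not refit n models by hand; statsmodels computes Cook’s distance efficiently, as the next code block shows.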

The code below computes the Cook’s distance for every row in the tips data set.

from sklearn.metrics import mean_squared_error, r2_score
import math
import statsmodels.api as sm
import matplotlib.pyplot as plt

rmse = math.sqrt(mean_squared_error(y, trainingPredictions))
r2 = r2_score(y, trainingPredictions)

# Ordinary Least Squares: the response (y) comes first, and we add a
# column of ones so the model includes an intercept term
tipsOutlier = sm.OLS(y, sm.add_constant(x))
result = tipsOutlier.fit()
sm.graphics.influence_plot(result)
plt.show()

There is a lot to unpack in this image. Each point corresponds to the ith row of the training data, and its size reflects the Cook’s distance for removing that row. The number next to the point is the row number (i.e., i) of the removed point. The y-axis (studentized residuals) is the ratio of the residual divided by the standard error of all the residuals. The x-axis (leverage) is a measure of how far away the independent (X) variable is from the other observations. Points with a high leverage or studentized residual should be evaluated as outliers.


Jim Herold

Jim Herold is a Catholic, Husband, Father, and Software Engineer.

He has a PhD in Computer Science with a focus on machine learning and how it improves natural, sketch-based interfaces.

Jim researched at Harvey Mudd and UC Riverside, taught at Cal Poly Pomona and UC Riverside, and worked at JPL NASA and Google.
