Removing Outliers
In this article, we will show how to use Cook’s distance to find and remove outliers in a data set.
Background
Wikipedia says that an outlier is a “data point that differs significantly from other observations.” Outliers can cause a lot of problems when we are trying to perform data-driven tasks like machine learning or A/B tests. They can hurt a machine learning model’s ability to learn the true trend in a data set. In an A/B test, outliers can have an outsized impact on the measured performance of group A or B and lead to a wrong conclusion about the best treatment.
Outliers can happen for a variety of reasons. Maybe there was a mistake in the device recording the data. Maybe the wrong subject was measured; one of my professors always joked about a time when data from an orange orangutan was included in a data set about chimpanzees. Oftentimes, the best solution is simply to remove the outliers.
Defining what an outlier is can be tricky, though. What does “differs significantly” mean, and what is the threshold for how much a point must differ from the rest of the data? Let’s explore outliers in the context of linear regression to build a better intuition. All the code presented today can be found in a Colab.
Linear Regression
Simple linear regression estimates how much a response variable, Y, changes when an independent variable, X, changes by a certain amount. (I am sticking with a one-dimensional problem to keep things simple.)
To make things concrete, let’s look at the tips dataset. It contains 244 tips that a waiter recorded over the course of a few months. In particular, we will look at the relationship between the total bill and the tip the waiter received, as shown in the figure below.
import seaborn as sns

tips = sns.load_dataset("tips")
plot = sns.scatterplot(x="total_bill", y="tip", data=tips)
plot
Linear regression tries to fit a simple line to the data, Y = b0 + b1*X. We define the residual (error) of this line at each point as the difference between the predicted Y value and the observed Y value.
Linear regression chooses b0 and b1 to minimize the residual sum of squares (RSS): RSS = sum over all points i of (y_i - (b0 + b1*x_i))^2.
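To make RSS concrete, here is a minimal sketch that evaluates it for a hypothetical candidate line; the intercept and slope values below are made-up guesses for illustration, not fitted coefficients.

import seaborn as sns

tips = sns.load_dataset("tips")

# Hypothetical candidate line: tip = b0 + b1 * total_bill (values chosen for illustration)
b0, b1 = 1.0, 0.1

predictions = b0 + b1 * tips["total_bill"]
residuals = tips["tip"] - predictions

# RSS is the quantity linear regression minimizes over b0 and b1
rss = (residuals ** 2).sum()
print(rss)

Trying a few different b0 and b1 values by hand shows how the RSS shrinks as the line gets closer to the cloud of points.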
Luckily, we can take the derivative of this function, set it to zero, and solve for b0 and b1 in closed form. I know I am hand-waving a lot here, but the goal of this article is the outlier discussion, not linear regression. The code below fits a linear regression model with scikit-learn and plots the resulting best-fit line.
from sklearn import linear_model
import matplotlib.pyplot as plt

# Reshape the columns into the (n_samples, 1) arrays scikit-learn expects
numSamples = tips["total_bill"].values.size
x = tips["total_bill"].values.reshape(numSamples, 1)
y = tips["tip"].values.reshape(numSamples, 1)

# Fit a simple linear regression and predict on the training data
regressionModel = linear_model.LinearRegression()
modelFit = regressionModel.fit(x, y)
trainingPredictions = regressionModel.predict(x)

# Overlay the best-fit line on the scatter plot
graph = sns.scatterplot(x="total_bill", y="tip", data=tips)
plt.plot(x, trainingPredictions, color='r')
plt.show()
Outliers
Outliers cause big problems for least-squares estimates like linear regression. See that point in the upper right corner, the one with a tip of around $10 on a total bill of around $50? It is pulling the red line toward a much larger slope (b1 from our earlier equation). Is it really a valid data point? Maybe it was a mistake by the waiter recording the data. Maybe there is another signal we are missing that would give us more predictive power; for example, if this was a wealthy customer who always tips more than normal, and we had income as one of the input variables, we might predict this particular instance better. In any case, based on the data we have, it may make sense to remove this point. But what can we use to label this data point as an outlier?
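If you want to confirm which row that is, a quick ad-hoc filter works; the $9 cutoff below is just an illustrative value I picked to isolate the largest tips, not anything from the data set's documentation.

# Ad-hoc check: rows with unusually large tips (cutoff chosen for illustration)
largeTips = tips[tips["tip"] >= 9]
print(largeTips[["total_bill", "tip"]])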
The book Practical Statistics suggests that we consider influential observations, data points whose removal would meaningfully change the fitted regression, when identifying outliers.
We can think of outliers as data points whose Y value is far from the prediction. This lets us identify outliers via the standardized residual, i.e., the residual divided by the standard error of the residuals. Recall that we defined residuals earlier as the difference (error) between the observed and fitted values.
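As a rough sketch, assuming the y and trainingPredictions arrays from the regression code above, we can compute standardized residuals by hand; note that this simple version ignores the leverage adjustment that proper studentized residuals apply.

import numpy as np

# Residuals of the fitted line and their standard error (two parameters estimated: b0 and b1)
residuals = (y - trainingPredictions).ravel()
standardError = residuals.std(ddof=2)

# Standardized residuals: how many standard errors each point sits from the line
standardizedResiduals = residuals / standardError

# Points several standard errors from the line deserve a closer look; 3 is an arbitrary cutoff
print(np.where(np.abs(standardizedResiduals) > 3)[0])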
Luckily for us, Cook’s distance exists and quantifies the impact a single data point has on a regression model. It is defined as “the sum of all the changes in a regression model when observation i is removed.” The idea is that we first train a regression model using all the data. Then we train n more models, removing the ith training row for each one. For each left-out row, we compare that model’s fit (driven by the parameters b0 and b1 from earlier) against the full model. Rows that drastically change the model are flagged as influential, and Cook’s distance assigns them a large value.
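To make that leave-one-out idea concrete, here is a deliberately naive sketch that refits the model n times; it assumes the x and y arrays from the earlier scikit-learn code. Statsmodels computes the same quantity far more efficiently, as shown next.

import numpy as np
from sklearn.linear_model import LinearRegression

def naive_cooks_distance(x, y):
    # Leave-one-out Cook's distance for a simple linear regression (illustrative, refits n times)
    n = len(y)
    p = 2  # parameters in the model: intercept and slope
    full = LinearRegression().fit(x, y)
    fullPred = full.predict(x)
    mse = ((y - fullPred) ** 2).sum() / (n - p)

    distances = np.zeros(n)
    for i in range(n):
        keep = np.arange(n) != i
        reduced = LinearRegression().fit(x[keep], y[keep])
        reducedPred = reduced.predict(x)
        # How much do the fitted values change when row i is removed?
        distances[i] = ((fullPred - reducedPred) ** 2).sum() / (p * mse)
    return distances

cooks = naive_cooks_distance(x, y)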
The code below computes Cook’s distance for every row in the tips data set and draws an influence plot.
import math

import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error, r2_score

# Overall fit quality of the scikit-learn model, for reference
rmse = math.sqrt(mean_squared_error(y, trainingPredictions))
r2 = r2_score(y, trainingPredictions)

# Refit with statsmodels: ordinary least squares with an intercept term.
# Note the argument order: the response (y) comes first, then the predictors.
tipsOutlier = sm.OLS(y, sm.add_constant(x))
result = tipsOutlier.fit()

# The influence plot sizes each point by its Cook's distance
sm.graphics.influence_plot(result)
plt.show()
There is a lot to unpack in this image. Each point corresponds to a row in the training data, and its size reflects Cook’s distance when that row (the ith row) is removed. The number next to a point is its row number, i.e., i. The y-axis (studentized residuals) is the residual divided by the standard error of the residuals. The x-axis (leverage) measures how far a row’s independent (X) value is from the other observations. Points with high leverage or a large studentized residual should be evaluated as outliers.
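To actually drop the flagged rows, one option is to pull Cook’s distance out of the statsmodels results and apply a cutoff; the 4/n threshold below is a common rule of thumb I am assuming here, not something dictated by this data set.

# Extract Cook's distance for every row from the fitted statsmodels results
influence = result.get_influence()
cooksDistance = influence.cooks_distance[0]

# Rule-of-thumb threshold: 4 / n (an assumption; adjust for your data)
threshold = 4 / len(tips)
outlierRows = tips.index[cooksDistance > threshold]

# Drop the flagged rows to get a cleaned data set
tipsClean = tips.drop(index=outlierRows)
print(len(outlierRows), "rows flagged as outliers")

After dropping those rows, we can refit the regression on tipsClean and see how much the slope changes.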