5 basic approaches to effectively detect outliers

# Introduction

Have you ever come across strange data points while exploring a dataset? One or a few that differ markedly from the vast majority of observations, drastically skewing means and inflating variances? I have been there too. These points are outliers. Their impact goes beyond distorting summary statistics: outliers can easily wreck the performance of any predictive model you build, so detecting and handling them reliably is crucial to any data project. This article lists and compares five basic approaches to detecting them, along with compact Python examples for each.

# 1. Z-Score method

Calculating a Z-score is a simple method that works best with normally distributed variables. It measures how many standard deviations each point lies from the mean. Typically, a data point whose Z-score is greater than 3 in absolute value is flagged as an outlier, meaning it lies more than three standard deviations from the mean. Despite its simplicity, the method has a drawback: the mean and standard deviation are themselves very sensitive to extreme values, so a large outlier inflates the very statistics used to detect it.

import numpy as np
from scipy import stats

data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 250])

z_scores = np.abs(stats.zscore(data))
outliers = data[z_scores > 3]

print(outliers)

Output:

[250]
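The sensitivity mentioned above can be seen in a minimal sketch. It assumes the same toy data with one ordinary point removed, leaving 10 values; the variable names are illustrative. With the population standard deviation that `stats.zscore` uses by default (`ddof=0`), no absolute Z-score in a sample of size n can exceed sqrt(n - 1), so in this smaller sample the value 250 can no longer cross the threshold of 3.

```python
import numpy as np
from scipy import stats

# Assumption: same toy data as above with one ordinary point removed (n = 10).
small = np.array([10, 12, 11, 13, 12, 11, 10, 12, 11, 250])

# stats.zscore uses the population standard deviation (ddof=0) by default.
z_small = np.abs(stats.zscore(small))
flagged = small[z_small > 3]

# Upper bound on |z| for a sample of size n when ddof=0.
z_cap = np.sqrt(len(small) - 1)

print(z_small.max(), z_cap)  # the largest |z| stays just under the cap of 3.0
print(flagged)               # empty: 250 is masked by its own effect on the std
```

In small samples, then, a fixed threshold of 3 may never fire at all, which is one reason to consider the more robust methods that follow.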

# 2. Interquartile range (IQR) method

Are your variables not normally distributed? Then the IQR method is a better, more robust choice than Z-scores. It relies on percentiles, specifically the range between the first quartile (Q1, 25th percentile) and the third quartile (Q3, 75th percentile). The boundaries lying 1.5 times the IQR below Q1 and above Q3 act as “fences”: any point outside these two fences, on either side, is flagged as an outlier. The good news is that the IQR is robust because extreme values do not shift quartiles the way they shift means and standard deviations.

import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 250])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(outliers)

Output:

[250]
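To make the fences themselves visible, here is a small sketch that wraps the same calculation in a helper. The function name `iqr_fences` and its multiplier parameter `k` are hypothetical, introduced only for illustration.

```python
import numpy as np

def iqr_fences(values, k=1.5):
    """Return (lower_fence, upper_fence) for a 1-D array.

    `k` is the fence multiplier; 1.5 matches the rule described above.
    """
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 250])

lower, upper = iqr_fences(data)
mask = (data < lower) | (data > upper)

print(lower, upper)  # 8.75 14.75
print(data[mask])    # [250]
```

Here Q1 is 11 and Q3 is 12.5, so the IQR is 1.5 and the fences sit at 8.75 and 14.75. The value 250 has almost no influence on where those fences land.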

# 3. Isolation forests

When dealing with complex, high-dimensional datasets, traditional methods such as Z-score and IQR may no longer be effective. Enter isolation forests, a machine learning technique that learns to isolate anomalies from “normal” data. The idea builds on the decision trees used for classification and regression: the data is repeatedly split by random partitions, and because outliers are few and different, they tend to be isolated after far fewer splits. Therefore, a point that is separated from the rest very easily is likely to be an outlier.

import numpy as np
from sklearn.ensemble import IsolationForest

data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 250]).reshape(-1, 1)

model = IsolationForest(contamination=0.1, random_state=42)
predictions = model.fit_predict(data)
outliers = data[predictions == -1]

print(outliers)

Output:

[[250]]
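Since the appeal of isolation forests is multivariate data, a two-dimensional sketch may help. It assumes synthetic data: 100 points drawn around the origin plus one planted point at (8, 8). Neither the data nor the variable names come from the original example. Instead of hard labels it inspects `score_samples`, where lower scores indicate points that are easier to isolate.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Assumption: synthetic 2-D data, 100 points near the origin plus one planted point.
normal_points = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
planted = np.array([[8.0, 8.0]])
X = np.vstack([normal_points, planted])

model = IsolationForest(random_state=42).fit(X)

# Lower score = isolated in fewer splits = more anomalous.
scores = model.score_samples(X)
most_anomalous = int(np.argmin(scores))

print(most_anomalous, X[most_anomalous])
```

Ranking by score avoids having to guess a `contamination` value up front. A threshold can be chosen afterwards by looking at how the scores are distributed.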

# 4. Median Absolute Deviation (MAD)

MAD is, so to speak, a much more robust version of the Z-score: it uses the median, which is insensitive to extreme values, and the absolute deviations from it to compute a “modified Z-score”. Note, however, that although it can be applied to non-normal variables, it is a univariate technique.

import numpy as np
from scipy.stats import median_abs_deviation

data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 250])

mad = median_abs_deviation(data, scale="normal")
median = np.median(data)
modified_z_scores = np.abs(data - median) / mad
outliers = data[modified_z_scores > 3]

print(outliers)

Output:

[250]
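To show what `scale="normal"` does in the example above, the following sketch recomputes the same quantity by hand. It assumes the standard convention of dividing the raw MAD by the 0.75 quantile of the standard normal distribution (about 0.6745), so that the result is comparable to a standard deviation for normally distributed data. The zero-MAD guard is an addition for illustration, not part of the original example.

```python
import numpy as np
from scipy.stats import median_abs_deviation, norm

data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 250])

median = np.median(data)
raw_mad = np.median(np.abs(data - median))  # median of absolute deviations

# Rescale so the MAD estimates the standard deviation under normality.
scaled_mad = raw_mad / norm.ppf(0.75)

# Guard: if more than half the values are identical, the MAD is 0
# and the modified Z-score is undefined.
if scaled_mad == 0:
    raise ValueError("MAD is zero; modified Z-scores are undefined")

modified_z = np.abs(data - median) / scaled_mad

print(raw_mad, scaled_mad)
print(data[modified_z > 3])  # [250]
```

The raw MAD here is 1.0, and the scaled value should agree with `median_abs_deviation(data, scale="normal")`.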

# 5. Density-based clustering: DBSCAN

This is an excellent approach for identifying outliers in spatial data or in datasets with complex cluster structure. The DBSCAN algorithm builds clusters around points that lie close to each other in high-density areas. Data points isolated in lower-density areas are automatically labeled as noise, i.e. outliers. Like method number 3 (isolation forests), this is a multivariate technique: it evaluates data points across several variables at once.

import numpy as np
from sklearn.cluster import DBSCAN

data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 250]).reshape(-1, 1)

model = DBSCAN(eps=5, min_samples=2)
labels = model.fit_predict(data)
outliers = data[labels == -1]

print(outliers)

Output:

[[250]]
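Because DBSCAN is most useful beyond one dimension, here is a minimal two-dimensional sketch. It assumes a hand-made dataset with two tight clusters and one isolated point; the coordinates and the `eps` and `min_samples` values are chosen for illustration only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Assumption: two small, tight clusters and one isolated point.
cluster_a = [[0, 0], [0, 1], [1, 0], [1, 1]]
cluster_b = [[10, 10], [10, 11], [11, 10], [11, 11]]
isolated = [[5, 30]]
X = np.array(cluster_a + cluster_b + isolated, dtype=float)

# eps=1.5 covers each unit square (diagonal is about 1.41);
# min_samples counts the point itself.
labels = DBSCAN(eps=1.5, min_samples=3).fit_predict(X)

print(labels)           # cluster ids, with -1 marking noise
print(X[labels == -1])  # the isolated point
```

Note that `eps` is expressed in the units of the data, so features on very different scales usually need to be standardized before DBSCAN's notion of "close" is meaningful.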

# Summary

Choosing the right outlier detection method comes down to understanding your data. Z-score and IQR are quick, simple options for univariate data, with IQR being the safer choice when variables are not normally distributed. MAD offers a more robust univariate alternative in cases where extreme values would otherwise distort the result. When data has many dimensions or a complex structure, isolation forests and DBSCAN extend outlier detection beyond simple statistical thresholds, capturing relationships that simpler methods miss entirely. There is no single best approach, only the one that best suits the shape and scale of your data.

Ivan Palomares Carrascosa is a thought leader, writer, speaker and advisor in the fields of Artificial Intelligence, Machine Learning, Deep Learning and LLMs. He trains and advises others on the use of artificial intelligence in the real world.
