Saturday, March 7, 2026

Why most people use SMOTE incorrectly and how to do it right


# Introduction

Obtaining labeled data, that is, data with ground-truth target labels, is a fundamental step in building most supervised machine learning models, such as random forests, logistic regression, or neural network-based classifiers. While one of the main difficulties in many real-world applications is obtaining enough labeled data, there are situations where, even after checking this box, another critical challenge can arise: class imbalance.

Class imbalance occurs when a labeled dataset contains classes with very different numbers of observations, usually with one or more classes significantly underrepresented. This issue often causes problems when building a machine learning model: training a predictive model such as a classifier on imbalanced data leads to biased decision boundaries, poor minority-class recall, and deceptively high accuracy. In practice, this means the model works well "on paper" but fails in the critical cases we care about most once deployed. A clear example is fraud detection in banking transactions, where datasets are extremely imbalanced because roughly 99% of transactions are legitimate.
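To see why accuracy is deceptive here, consider a quick sketch (not from the original article, using a synthetic 99:1 dataset): a trivial model that always predicts "legitimate" reaches 99% accuracy while catching zero fraudulent transactions.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 4))
y = np.zeros(10_000, dtype=int)
y[:100] = 1  # 1% "fraudulent" transactions

# A baseline that always predicts the majority class
clf = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = clf.predict(X)

print(f"Accuracy: {accuracy_score(y, y_pred):.2%}")         # 99.00%
print(f"Minority recall: {recall_score(y, y_pred):.2%}")    # 0.00%
```

The 99% accuracy tells us nothing about the model's ability to detect the minority class, which is exactly the failure mode described above.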

Synthetic Minority Oversampling Technique (SMOTE) is a data-centric resampling technique that addresses this problem by synthetically generating new minority class samples, e.g., fraudulent transactions, by interpolating between existing real-world instances.

This article briefly introduces SMOTE, then explains how to use it correctly, why it is often used incorrectly, and how to avoid the most common pitfalls.

# What is SMOTE and how does it work?

SMOTE is a data augmentation technique for addressing class imbalance in machine learning, especially in supervised models such as classifiers. In classification, when one or more classes are significantly underrepresented compared to the others, the model can easily become biased toward the majority class, leading to poor performance, especially when predicting the rare class.

To address this challenge, SMOTE creates synthetic data examples for the minority class, not by replicating existing instances as-is, but by interpolating between each minority class sample and its nearest neighbors in feature space: a process akin to "filling in" the gaps in the regions where existing minority instances lie, thus helping to balance the dataset.

SMOTE iterates over each minority example, identifies its k nearest neighbors, and then generates a new synthetic point along the line segment between the sample and a randomly selected neighbor. Iteratively applying these simple steps yields a new set of minority class examples, so that model training is based on a richer representation of the minority classes in the dataset, resulting in a more effective, less biased model.

How SMOTE works | Image by the author
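The interpolation step described above can be sketched in a few lines of NumPy. This is a simplified illustration of the core idea, not the imbalanced-learn implementation; `smote_sample` and its behavior of generating one synthetic point per minority sample are choices made here for clarity.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, k=5, rng=None):
    """Generate one synthetic point per minority sample (simplified SMOTE)."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Find each minority sample's k nearest minority neighbors
    # (n_neighbors=k+1 because each point is its own nearest neighbor)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    synthetic = []
    for i, x in enumerate(X_min):
        neighbor = X_min[rng.choice(idx[i, 1:])]   # pick a random true neighbor
        gap = rng.random()                          # random position on the segment
        synthetic.append(x + gap * (neighbor - x))  # interpolate between the two
    return np.array(synthetic)

X_min = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
                  [3.0, 2.5], [2.5, 3.0], [1.2, 2.8]])
X_new = smote_sample(X_min, k=3)
print(X_new.shape)  # (6, 2): one synthetic point per minority sample
```

Because every synthetic point lies on a segment between two real minority samples, the new points always fall inside the region already occupied by the minority class.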

# Proper implementation of SMOTE in Python

To avoid data leakage issues, it is best to use a pipeline. The imbalanced-learn library provides a Pipeline class that ensures SMOTE is applied to the training data only, whether during each cross-validation fold or during a simple holdout split, leaving the test set untouched and representative of real-world data.

The following example shows how to integrate SMOTE into a machine learning workflow using scikit-learn and imblearn:

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Example imbalanced dataset (replace with your own X and y)
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)

# Split data into training and testing sets first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Define the pipeline: Resampling then modeling
# The imblearn Pipeline only applies SMOTE to the training data
pipeline = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Fit the pipeline on training data
pipeline.fit(X_train, y_train)

# Evaluate on the untouched test data
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

By using the Pipeline, you can be sure that the resampling step runs only in the training context. This prevents synthetic samples from leaking into the evaluation set, providing a much more honest assessment of how the model will handle imbalanced classes in production.

# Common misuses of SMOTE

Let’s look at three common ways SMOTE can be misused in machine learning workflows, and how to avoid these misuses:

  1. Applying SMOTE before splitting the dataset into training and test sets: This is a very common mistake that inexperienced data scientists often (and in most cases accidentally) make. SMOTE generates new synthetic examples based on all the available data, and injecting synthetic points into what later becomes the test portion is a recipe for artificially inflating model evaluation metrics. The right approach is simple: split the data first, then apply SMOTE only to the training set. Thinking about using k-fold cross-validation? Even better, as long as SMOTE is applied inside each fold.
  2. Over-balancing: Another common error is blindly resampling until class proportions match exactly. In many cases, achieving a perfect balance is not only unnecessary but may also be counterproductive and unrealistic given the structure of the domain or the classes. This is especially true for multi-class datasets with several rare minority classes, where SMOTE may produce synthetic examples that cross class boundaries or lie in regions where no real data exists: in other words, noise may be inadvertently introduced, with undesirable consequences such as overfitting. The general advice is to tread lightly and train the model with subtle, gradual increases in the proportion of minority classes.
  3. Ignoring context about metrics and models: Overall accuracy is a simple metric to obtain and interpret, but it can also be a misleading, "empty" metric that hides the model’s inability to detect minority class cases. This is a key issue in high-stakes fields such as banking and healthcare, for scenarios such as rare disease detection. Meanwhile, SMOTE can help improve metrics such as recall, but it may reduce recall’s counterpart, precision, by introducing noisy synthetic samples that do not align with business goals. To properly evaluate not only the model but also how well SMOTE performs, focus on metrics such as recall, the F1 score, the Matthews correlation coefficient (MCC, a summary of the entire confusion matrix), or the area under the precision-recall curve (PR-AUC). Similarly, consider alternative strategies, such as class weighting or threshold tuning, alongside SMOTE to further improve results.

# Final remarks

This article introduced SMOTE, a commonly used technique to address class imbalance when building machine learning classifiers on real-world datasets. We identified some common misuses of this technique and provided practical advice on how to avoid them.

Ivan Palomares Carrascosa is a thought leader, writer, speaker, and advisor in the fields of Artificial Intelligence, Machine Learning, Deep Learning, and LLMs. He trains and advises others on the use of artificial intelligence in the real world.
