Friday, March 6, 2026

7 XGBoost Tricks for More Accurate Predictive Models


# Introduction

Ensemble methods such as XGBoost (Extreme Gradient Boosting) are powerful implementations of gradient-boosted decision trees that combine many weak estimators into a robust predictive model. These ensembles are very popular due to their accuracy, efficiency, and strong performance on structured (tabular) data. Although scikit-learn, a widely used machine learning library, does not provide a native implementation of XGBoost, there is a separate library, aptly named XGBoost, which offers a scikit-learn-compatible API.

All you need to do is import it as follows:

from xgboost import XGBClassifier

Here are 7 Python tricks that will help you get the most out of this standalone XGBoost implementation, especially if you want to build more accurate predictive models.

To illustrate these tricks, we will use the breast cancer dataset bundled with scikit-learn and define a baseline model with mostly default settings. Be sure to run this code first before you start experimenting with the seven tricks below:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline model
model = XGBClassifier(eval_metric="logloss", random_state=42)
model.fit(X_train, y_train)
print("Baseline accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 1. Tuning the learning rate and number of estimators

While this is not a universal rule, significantly reducing the learning rate while increasing the number of estimators (trees) in the XGBoost ensemble often improves accuracy. The lower learning rate allows the model to learn more gradually, while the additional trees compensate for the smaller step size.

Here is an example. Try it yourself and compare the accuracy you get with your initial baseline:

model = XGBClassifier(
    learning_rate=0.01,
    n_estimators=5000,
    eval_metric="logloss",
    random_state=42
)
model.fit(X_train, y_train)
print("Model accuracy:", accuracy_score(y_test, model.predict(X_test)))

To be clear, the final print() statement will be omitted in the remaining examples. Just append it to any of the snippets below when you test them yourself.

# 2. Adjusting the maximum tree depth

The max_depth argument is a key hyperparameter inherited from classical decision trees. It limits how deep each tree in the ensemble can grow. Capping tree depth may seem like a crude control, but surprisingly, shallow trees often generalize better than deeper ones.

This example limits trees to a maximum depth of 2:

model = XGBClassifier(
    max_depth=2,
    eval_metric="logloss",
    random_state=42
)
model.fit(X_train, y_train)

# 3. Reducing overfitting through subsampling

The subsample argument randomly samples a fraction of the training data (e.g. 80%) before growing each tree in the ensemble. This simple technique acts as an effective regularization strategy and helps prevent overfitting. The related colsample_bytree argument does the same for features, sampling a fraction of the columns for each tree.

If not specified, both hyperparameters default to 1.0, which means 100% of the training examples (and features) are used:

model = XGBClassifier(
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric="logloss",
    random_state=42
)
model.fit(X_train, y_train)

Please note that this approach is most effective on reasonably sized datasets. If the dataset is already small, aggressive subsampling can lead to underfitting.

# 4. Adding regularization terms

To further control overfitting, complex trees can be penalized using conventional regularization strategies such as L1 (Lasso) and L2 (Ridge). In XGBoost these are controlled by the reg_alpha and reg_lambda parameters, respectively.

model = XGBClassifier(
    reg_alpha=0.2,   # L1
    reg_lambda=0.5,  # L2
    eval_metric="logloss",
    random_state=42
)
model.fit(X_train, y_train)

# 5. Using early stopping

Early stopping is a performance-oriented mechanism that halts training when performance on a validation set stops improving after a certain number of rounds.

Depending on your coding environment and the version of the XGBoost library you are using, you may need to upgrade to a newer version to use the implementation shown below. Note that early_stopping_rounds is specified during model initialization rather than passed to the fit() method.

model = XGBClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    eval_metric="logloss",
    early_stopping_rounds=20,
    random_state=42
)

model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    verbose=False
)

To update the library, run:

!pip uninstall -y xgboost
!pip install xgboost --upgrade

# 6. Performing a hyperparameter search

With a more systematic approach, a hyperparameter search can help identify combinations of settings that maximize model performance. Below is an example of using grid search to explore combinations of three key hyperparameters introduced earlier:

param_grid = {
    "max_depth": [3, 4, 5],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [200, 500]
}

grid = GridSearchCV(
    XGBClassifier(eval_metric="logloss", random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy"
)

grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)

best_model = XGBClassifier(
    **grid.best_params_,
    eval_metric="logloss",
    random_state=42
)

best_model.fit(X_train, y_train)
print("Tuned accuracy:", accuracy_score(y_test, best_model.predict(X_test)))

# 7. Correcting class imbalance

This last trick is especially useful when working with highly imbalanced datasets (the breast cancer dataset is relatively balanced, so don’t worry if you observe minimal changes). The scale_pos_weight parameter is particularly helpful when the class proportions are heavily skewed, e.g. 90/10, 95/5, or 99/1.

Here’s how to calculate and apply it to your training data:

ratio = np.sum(y_train == 0) / np.sum(y_train == 1)

model = XGBClassifier(
    scale_pos_weight=ratio,
    eval_metric="logloss",
    random_state=42
)

model.fit(X_train, y_train)
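Before applying scale_pos_weight, it is worth checking the actual class counts; np.bincount gives a quick view. A small self-contained sketch on the full breast cancer dataset (no XGBoost needed for this step):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

# Count label 0 (negative class) and label 1 (positive class)
counts = np.bincount(y)
ratio = counts[0] / counts[1]
print("Class counts:", counts)               # [212 357]
print("scale_pos_weight:", round(ratio, 3))  # 0.594
```

A ratio close to 1.0, as here, confirms the classes are nearly balanced and scale_pos_weight will have little effect.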

# Summary

In this article, we’ve covered seven practical tricks for improving your XGBoost ensemble models using the dedicated Python library. Thoughtful tuning of the learning rate, tree depth, sampling strategy, regularization, and class weighting – combined with a systematic hyperparameter search – often makes the difference between a decent model and a highly accurate one.

Ivan Palomares Carrascosa is a thought leader, writer, speaker, and advisor in the fields of Artificial Intelligence, Machine Learning, Deep Learning, and LLMs. He trains and advises others on the use of artificial intelligence in the real world.
