10 Python One-Liners to Optimize Machine Learning Pipelines

Image by author | ChatGPT

# Introduction

In machine learning, efficiency is crucial. Writing clean, readable, and concise code not only speeds up development but also makes machine learning pipelines easier to understand, share, and debug. Python, with its natural and expressive syntax, is well suited to writing powerful one-liners that handle common tasks in a single line of code.

This tutorial focuses on ten practical one-liners that use the Scikit-learn and Pandas libraries to streamline machine learning workflows. We will cover everything from data preparation and model training to evaluation and feature analysis.

Let’s start.

# Environment Setup

Before we start writing code, let's import the libraries used throughout the examples.

import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

With that out of the way, let's code … one line at a time.

# 1. Loading a Dataset

Let's start with the basics. Getting started on a project often means loading data. Scikit-learn ships with several toy datasets that are ideal for testing models and workflows. You can load both the features and the target variable in one neat line.

X, y = load_iris(return_X_y=True)

This one-liner calls the `load_iris` function with `return_X_y=True` to return the feature matrix `X` and the target vector `y` directly, avoiding the need to unpack a dictionary-like object.
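For comparison, here is a minimal sketch of what the one-liner saves. The names `iris`, `X_long`, and `y_long` are illustrative only and are not used elsewhere in this tutorial.

```python
from sklearn.datasets import load_iris

# Default call: returns a dictionary-like Bunch object that must be unpacked
iris = load_iris()
X_long, y_long = iris.data, iris.target

# One-liner: returns the (features, target) pair directly
X, y = load_iris(return_X_y=True)

print(X.shape, y.shape)  # (150, 4) (150,)
```

Both routes yield the same arrays; the one-liner simply skips the intermediate object.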

# 2. Splitting Data into Training and Test Sets

Another fundamental step in any machine learning project is splitting the data into separate sets for different purposes. The `train_test_split` function is the standard tool for this; a single call returns four separate arrays for the training and test sets.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

Here we use `test_size=0.3` to hold out 30% of the data for testing, and `stratify=y` to ensure that the class proportions in the training and test sets mirror those of the original dataset.
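One way to check the effect of stratification is to count the samples per class in each split. The sketch below assumes the Iris data from step 1, which has 50 samples in each of its three classes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# With 150 balanced samples, a stratified 70/30 split keeps the classes balanced
print(np.bincount(y_train))  # [35 35 35]
print(np.bincount(y_test))   # [15 15 15]
```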

# 3. Creating and Training a Model

Why use two lines to instantiate a model and then train it? You can chain the `fit` method directly onto the model constructor for a compact, readable line of code:

model = LogisticRegression(max_iter=1000, random_state=42).fit(X_train, y_train)

This single line creates a `LogisticRegression` model and immediately trains it on the training data, returning the fitted model object.
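The chaining works because Scikit-learn estimators return themselves from `fit`. A minimal sketch to confirm this, with `clf` and `returned` as illustrative names:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

clf = LogisticRegression(max_iter=1000, random_state=42)
returned = clf.fit(X, y)  # fit() trains in place and returns the estimator

print(returned is clf)  # True
```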

# 4. Performing K-Fold Cross-Validation

Cross-validation gives you a more robust estimate of your model's performance than a single train-test split. Scikit-learn's `cross_val_score` makes this evaluation a one-step operation.

scores = cross_val_score(LogisticRegression(max_iter=1000, random_state=42), X, y, cv=5)

This one-liner initializes a new logistic regression model, splits the data into 5 folds, trains and evaluates the model 5 times (`cv=5`), and returns an array with the score from each fold.
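The per-fold scores are usually summarized by their mean and standard deviation. A short sketch, assuming the same Iris data as above:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000, random_state=42), X, y, cv=5)

# One score per fold; the spread hints at how stable the estimate is
print(scores)
print(f"mean accuracy: {scores.mean():.3f} (std: {scores.std():.3f})")
```

For a classifier, an integer `cv` uses stratified folds, so each fold keeps roughly the same class balance as the full dataset.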

# 5. Making Predictions and Calculating Accuracy

After training the model, you will want to evaluate its performance on the test set. You can do this and obtain an accuracy score with a single method call.

accuracy = model.score(X_test, y_test)

The `.score()` method conveniently combines the prediction and accuracy-calculation steps, returning the model's accuracy on the provided test data.
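To make the shortcut concrete, the sketch below compares `.score()` with the explicit two-step version using `predict` and `accuracy_score`. The variable `manual` is an illustrative name.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
model = LogisticRegression(max_iter=1000, random_state=42).fit(X_train, y_train)

accuracy = model.score(X_test, y_test)                  # one step
manual = accuracy_score(y_test, model.predict(X_test))  # predict, then score

print(accuracy == manual)  # True
```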

# 6. Scaling Numerical Features

Feature scaling is a common preprocessing step, especially for algorithms that are sensitive to the scale of the input features, including SVMs and logistic regression. You can fit a scaler and transform your data at the same time with this single line of Python:

X_scaled = StandardScaler().fit_transform(X)

The `fit_transform` method is a convenient shortcut that learns the scaling parameters from the data and applies the transformation in one go.
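When a separate test set is involved, a common practice is to fit the scaler on the training data only and reuse it on the test data, so that test-set statistics do not influence preprocessing. A minimal sketch, assuming the same Iris split as in step 2:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Learn the mean and standard deviation from the training data only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuses the training statistics

# The training features are now centered with unit variance
print(np.allclose(X_train_scaled.mean(axis=0), 0.0))  # True
print(np.allclose(X_train_scaled.std(axis=0), 1.0))   # True
```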

# 7. Applying One-Hot Encoding to Categorical Data

One-hot encoding is a standard technique for handling categorical features. While Scikit-learn has a powerful `OneHotEncoder` class, the Pandas `get_dummies` function allows for a true one-liner for this task.

df_encoded = pd.get_dummies(pd.DataFrame(X, columns=['f1', 'f2', 'f3', 'f4']), columns=['f1'])

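This line converts a specific column (`f1`) of the Pandas DataFrame into new columns with binary values, one per distinct value of `f1`, while leaving `f2`, `f3`, and `f4` unchanged. Note that the Iris features are continuous, so `f1` stands in for a categorical column purely for illustration.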
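The effect is easier to see on a genuinely categorical column. The small DataFrame below is made up for illustration; `color` and `size` are hypothetical column names.

```python
import pandas as pd

# Hypothetical toy data with one categorical and one numeric column
df = pd.DataFrame({"color": ["red", "green", "red"], "size": [1, 2, 3]})

# Encode only the 'color' column; 'size' passes through unchanged
df_encoded = pd.get_dummies(df, columns=["color"])

print(list(df_encoded.columns))  # ['size', 'color_green', 'color_red']
```

Each distinct value of `color` becomes its own indicator column, named with the original column as a prefix.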

# 8. Defining a Scikit-Learn Pipeline

Scikit-learn pipelines make it straightforward to chain multiple preprocessing steps with a final estimator. They help prevent data leakage and simplify your workflow. Defining a pipeline is a clean one-liner:

pipeline = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])

This creates a pipeline that first scales the data with `StandardScaler` and then passes the result to a support vector classifier.
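A pipeline behaves like a single estimator, so it can be passed straight to `cross_val_score`. In the sketch below, which assumes the Iris data from step 1, the scaler is refit on the training portion of each fold, which is how a pipeline helps avoid leakage.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipeline = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])

# Scaling and classification are fit together inside every fold
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"mean accuracy: {scores.mean():.3f}")
```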

# 9. Tuning Hyperparameters with GridSearchCV

Finding the best hyperparameters for your model can be tedious. `GridSearchCV` can help automate this process. By chaining `.fit()`, you can initialize the search, define the parameter grid, and run it all in one line.

grid_search = GridSearchCV(SVC(), {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}, cv=3).fit(X_train, y_train)

This sets up a grid search for an `SVC` model, tests different values for `C` and `kernel`, performs 3-fold cross-validation (`cv=3`), and fits it to the training data to find the best combination.
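Once the search has been fitted, the results can be read from its attributes. A sketch of inspecting them, assuming the same Iris split as in step 2:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
grid_search = GridSearchCV(
    SVC(), {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}, cv=3
).fit(X_train, y_train)

print(grid_search.best_params_)           # best C and kernel found
print(grid_search.best_score_)            # mean cross-validated accuracy
print(grid_search.score(X_test, y_test))  # accuracy of the refit best model
```

By default the best combination is refit on the full training data, so the fitted search object can be used directly for prediction and scoring.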

# 10. Extracting Feature Importances

For tree-based models such as random forests, understanding which features are most influential is crucial for building a useful and efficient model. Combining `sorted` with `zip` is a classic Python one-liner for extracting and sorting feature importances. Note that this snippet first builds the model and then uses a single line to rank the feature importances.

# First, train a model
feature_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
rf_model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# The one-liner
importances = sorted(zip(feature_names, rf_model.feature_importances_), key=lambda x: x[1], reverse=True)

This one-liner pairs each feature name with its importance score, then sorts the list in descending order so the most important features appear first.
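The result is a list of `(name, score)` tuples that can be printed as a simple ranking. A self-contained sketch, assuming the same Iris split as in step 2:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

feature_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
rf_model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
importances = sorted(
    zip(feature_names, rf_model.feature_importances_),
    key=lambda x: x[1], reverse=True
)

# Print the ranking, most important feature first
for name, score in importances:
    print(f"{name}: {score:.3f}")
```

The importance scores of a random forest sum to 1, so each value can be read as a share of the total.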

# Wrapping Up

These ten one-liners show how Python's concise syntax can help you write more efficient and readable machine learning code. Integrate these shortcuts into your daily workflow to reduce boilerplate, minimize errors, and spend more time on what really matters: building effective models and extracting valuable insights from data.

Matthew Mayo (@mattmayo13) holds a master's degree in computer science and a graduate diploma in data mining. As managing editor of KDnuggets & Statology, and contributing editor at Machine Learning Mastery, Matthew aims to make complex data science concepts accessible. His professional interests include natural language processing, language models, machine learning algorithms, and exploring emerging AI. He is driven by a mission to democratize knowledge in the data science community. Matthew has been coding since he was 6 years old.
