Tuesday, March 10, 2026

From dataset to DataFrame to deployment: Your first project with Pandas and Scikit-learn


# Introduction

You want to start your first easy-to-manage machine learning project with the popular Python libraries Pandas and Scikit-learn, but you don’t know where to begin? Look no further.

In this article, I’ll walk you through a gentle, beginner-friendly machine learning project in which we will work together to build a regression model that predicts employee income based on socioeconomic attributes. Along the way, we’ll learn some key machine learning concepts and basic tricks.

# From raw dataset to clean DataFrame

First, as with any Python-based project, it’s a good idea to start by importing the necessary libraries, modules, and components that we will use throughout the process:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
import joblib

The following instructions load a publicly available dataset from this repository into a Pandas DataFrame object: a neat data structure for loading, analyzing, and managing fully structured data, i.e. data in a tabular format. Once loaded, we inspect its first rows and the data types of its attributes.

url = "https://raw.githubusercontent.com/gakudo-ai/open-datasets/main/employees_dataset_with_missing.csv"
df = pd.read_csv(url)
print(df.head())
print(df.info())

You’ll notice that the dataset contains 1,000 entries or instances – that is, data describing 1,000 employees – but for most attributes, such as age, income, etc., the count of non-null values is less than 1,000. Why? Because there are missing values in this dataset, a common problem with real-world data that needs to be addressed.

In our project, we aim to predict an employee’s income based on the other attributes. Therefore, we will discard rows (employees) that are missing a value for this attribute. While for predictor attributes it often makes sense to estimate or impute missing values, for the target variable we need fully known labels to train our machine learning model: the catch is that a machine learning model learns by exposure to examples with known prediction results.

Pandas also makes it easy to check exactly how many values are missing in each column, using a single instruction.
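A minimal sketch of this check, on a toy DataFrame made up for illustration (the real dataset requires the download above), using `isnull().sum()`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the employees dataset, with deliberate gaps
df_demo = pd.DataFrame({
    "age": [25.0, np.nan, 40.0],
    "income": [50000.0, 62000.0, np.nan],
})

# Count missing values per column
missing_counts = df_demo.isnull().sum()
print(missing_counts)
```

Each entry of the resulting Series tells you how many values are missing in that column, which is exactly the gap between the 1,000 total entries and the non-null counts reported by `df.info()`.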

So let’s clean up our DataFrame by removing the missing values of the target variable: income. This code drops the entries with missing values for this attribute specifically.

target = "income"
train_df = df.dropna(subset=[target])

X = train_df.drop(columns=[target])
y = train_df[target]

What about missing values in the other attributes? We’ll get to those soon, but first we need to split our dataset into two main subsets: a training set for training the model, and a test set for assessing the model’s performance after training, consisting of examples different from those the model observed during training. Scikit-learn provides a single instruction for a random split:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
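As a quick sanity check on how the split behaves, here is a self-contained sketch on toy data (`X_demo` and `y_demo` are made up for illustration): with `test_size=0.2`, 20% of the rows go to the test set and the remaining 80% to the training set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for X and y: 10 rows
X_demo = pd.DataFrame({"age": range(10)})
y_demo = pd.Series(range(10))

# 80/20 random split; random_state makes it reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(len(X_tr), len(X_te))  # 8 2
```

Fixing `random_state` means you get the same split every time you run the code, which makes your experiments reproducible.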

The next step transforms the data into the right form for training a machine learning model: constructing a preprocessing pipeline. This preprocessing should typically distinguish between numerical and categorical features, so that each feature type undergoes its own preprocessing steps in the pipeline. For example, numerical features typically need to be scaled or imputed, while categorical features must be encoded as numbers for a machine learning model to digest them. The following code builds the complete preprocessing pipeline, automatically identifying numerical and categorical features so that each type can be handled correctly.

numeric_features = X.select_dtypes(include=["int64", "float64"]).columns
categorical_features = X.select_dtypes(exclude=["int64", "float64"]).columns

numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median"))
])

categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features)
])
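To see what this preprocessing actually does, here is a self-contained sketch on a tiny made-up frame (the `age` and `dept` columns are illustrative, not from the real dataset): the imputers fill the gaps, and the one-hot encoder turns the single categorical column into one binary column per category.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy frame: one numeric and one categorical column, each with a gap
X_demo = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "dept": ["sales", "hr", np.nan, "sales"],
})

numeric_transformer = Pipeline([("imputer", SimpleImputer(strategy="median"))])
categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer([
    ("num", numeric_transformer, ["age"]),
    ("cat", categorical_transformer, ["dept"]),
])

# 4 rows in, 3 columns out: 1 numeric + 2 one-hot columns ("hr", "sales")
Xt = preprocessor.fit_transform(X_demo)
print(Xt.shape)
```

Note `handle_unknown="ignore"`: if the model later encounters a category it never saw during training, the encoder outputs all zeros instead of raising an error.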

You can learn more about data preprocessing pipelines in this article.

This pipeline, when applied to the DataFrame, will yield a clean, ready-to-use version of the data for machine learning. However, we will apply it in the next step, where we encapsulate both the data preprocessing and the model training in one overarching pipeline.

# From clean DataFrame to deployment-ready model

Now we will define a parent pipeline that:

  1. Applies the predefined preprocessing steps stored in the preprocessor variable – for both numerical and categorical attributes.
  2. Trains a regression model – namely a random forest regressor – to predict income using the preprocessed training data.

model = Pipeline([
    ("preprocessor", preprocessor),
    ("regressor", RandomForestRegressor(random_state=42))
])

model.fit(X_train, y_train)

Importantly, the training stage only receives the training subset we created earlier during the split, not the entire dataset.

Now we take the other subset of the data, the test set, and use it to evaluate the model’s performance on unseen employees. We will use mean absolute error (MAE) as the evaluation metric:

preds = model.predict(X_test)
mae = mean_absolute_error(y_test, preds)
print(f"\nModel MAE: {mae:.2f}")

You should get an MAE of around 13,000, which is acceptable but not brilliant considering most incomes are in the 60–90k range. Anyway, not bad for a first machine learning model!

Finally, I’ll show you how to save the trained model to a file for future deployment.

joblib.dump(model, "employee_income_model.joblib")
print("Model saved as employee_income_model.joblib")

Saving the trained model to a .joblib file is useful for future deployment, allowing it to be reloaded and reused immediately without retraining from scratch. Think of it as “freezing” your entire preprocessing pipeline and trained model into a portable artifact. Quick options for future use and deployment include plugging it into a plain Python script or notebook, or building a lightweight web application with tools such as Streamlit, Gradio, or Flask.
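As a sketch of the reload step, here is a self-contained round trip on a tiny stand-in model (the toy data, the smaller forest, and the temp-file path are made up for illustration – the real pipeline would also include the preprocessor):

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

# Tiny stand-in model trained on made-up data
X_demo = pd.DataFrame({"age": [25, 30, 40, 50]})
y_demo = pd.Series([40000, 50000, 65000, 80000])

model = Pipeline([
    ("regressor", RandomForestRegressor(n_estimators=10, random_state=42))
])
model.fit(X_demo, y_demo)

# Round trip: save, reload, predict - no retraining needed
path = os.path.join(tempfile.gettempdir(), "demo_model.joblib")
joblib.dump(model, path)
reloaded = joblib.load(path)

preds = reloaded.predict(pd.DataFrame({"age": [35]}))
print(preds)
```

The reloaded pipeline behaves exactly like the original: `predict` on the same inputs yields identical outputs, so anything built on top of it (a script, notebook, or web app) can treat the file as a drop-in replacement for the in-memory model.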

# Summary

In this article, we built an initial machine learning regression model – for employee income prediction – identifying the necessary steps from a raw dataset to a clean, preprocessed DataFrame, and from the DataFrame to a model ready for deployment.

Ivan Palomares Carrascosa is a thought leader, writer, speaker, and advisor in the fields of Artificial Intelligence, Machine Learning, Deep Learning, and LLMs. He trains and advises others on the use of artificial intelligence in the real world.
