Building Data Science Pipelines with Pandas

Pandas is one of the most popular data manipulation and analysis tools available, known for its ease of use and powerful capabilities. But did you know that you can also use it to create and execute data pipelines for processing and analyzing datasets?

In this tutorial, we will learn how to use the Pandas `pipe` method to build end-to-end data science pipelines. The pipeline includes steps such as data ingestion, data cleaning, data analysis, and data visualization. To highlight the benefits of this approach, we will also compare the pipeline-based code with a non-pipeline alternative, so you can see the differences and advantages.

What is a Pandas pipe?

The Pandas `pipe` method is a powerful tool that lets you chain multiple data processing functions in a clean and readable way. It forwards both positional and keyword arguments to the function it calls, which makes it flexible enough for a wide range of custom functions.

In short, the `pipe` method in Pandas:

Increases code readability
Enables function chaining
Supports custom functions
Improves code organization
Works well for complex transformations

Here is an example of the `pipe` method. We apply the Python functions `tidy` and `analysis` to a Pandas DataFrame: `pipe` first passes the DataFrame to `tidy`, then passes the cleaned result to `analysis`, and returns the final output.

(
    df.pipe(tidy)
    .pipe(analysis)
)
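To make the snippet above concrete, here is a minimal runnable sketch on a tiny in-memory DataFrame. The bodies of `tidy` and `analysis`, the `column` and `scale` parameters, and the sample values are assumptions made up for illustration. It shows that extra positional and keyword arguments given to `pipe` are forwarded to the function, and that the chained form gives the same result as nested function calls.

```python
import pandas as pd

def tidy(data):
    # Drop duplicate rows and rows with missing values.
    return data.drop_duplicates().dropna().reset_index(drop=True)

def analysis(data, column, scale=1):
    # Mean of one column, optionally scaled. `column` and `scale` are
    # hypothetical parameters used only to show argument forwarding.
    return data[column].mean() * scale

# Made-up sample data: one duplicate row and one row with a missing value.
df = pd.DataFrame({"units": [2, 2, 4, None], "price": [10.0, 10.0, 20.0, 5.0]})

# "units" is forwarded as a positional argument, scale=2 as a keyword argument.
piped = df.pipe(tidy).pipe(analysis, "units", scale=2)

# The same computation written as nested calls, read inside-out.
nested = analysis(tidy(df), "units", scale=2)

print(piped, nested)
```

The chained version reads top to bottom in the order the steps run, while the nested version has to be read from the innermost call outward.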

Pandas Code Without Pipeline

First, we will write basic data analysis code without a pipeline, so that we have a clear baseline for comparison when we later use `pipe` to simplify the same processing steps.

In this tutorial, we will use the Online Sales Dataset – Popular Market Data from Kaggle, which contains information about online sales transactions across various product categories.

We will load the CSV file and display the top three rows from the dataset.

import pandas as pd
df = pd.read_csv('/work/Online Sales Data.csv')
df.head(3)

Clean the dataset by removing duplicates and missing values, then reset the index.
Convert column types. We convert “Product Category” and “Product Name” to string, and the “Date” column to datetime.
To perform the analysis, we create a “month” column from the “Date” column. Then, we calculate the average number of units sold in each month.
Visualize the average number of units sold per month as a bar chart.

# data cleaning
df = df.drop_duplicates()
df = df.dropna()
df = df.reset_index(drop=True)

# convert types
df['Product Category'] = df['Product Category'].astype('str')
df['Product Name'] = df['Product Name'].astype('str')
df['Date'] = pd.to_datetime(df['Date'])

# data analysis
df['month'] = df['Date'].dt.month
new_df = df.groupby('month')['Units Sold'].mean()

# data visualization
new_df.plot(kind='bar', figsize=(10, 5), title="Average Units Sold by Month");

This code is very basic; if you are a data scientist, or even a data science student, you will already know how to do most of these tasks.

Building Data Science Pipelines with Pandas Pipe

To create an end-to-end data analysis pipeline, we first need to refactor the above code into Python functions that can be chained.

We will create Python functions for:

Loading data: Takes the path to a CSV file and returns a DataFrame.
Data cleaning: Takes a raw DataFrame and returns a cleaned DataFrame.
Convert column types: Takes a cleaned DataFrame and a dictionary of data types, and returns a DataFrame with the correct data types.
Data analysis: Takes the DataFrame from the previous step and returns the average units sold per month as a Series indexed by month.
Data visualization: Takes the aggregated result and a visualization type, and generates the plot.

def load_data(path):
    return pd.read_csv(path)

def data_cleaning(data):
    data = data.drop_duplicates()
    data = data.dropna()
    data = data.reset_index(drop=True)
    return data

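def convert_dtypes(data, types_dict=None):
    # only cast columns when a mapping of column name -> dtype is given
    if types_dict:
        data = data.astype(dtype=types_dict)
    # convert the date column to datetime (assign avoids mutating the input)
    data = data.assign(Date=pd.to_datetime(data['Date']))
    return data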


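def data_analysis(data):
    # add a month column without mutating the input DataFrame
    data = data.assign(month=data['Date'].dt.month)
    # average units sold per month (a Series indexed by month)
    new_df = data.groupby('month')['Units Sold'].mean()
    return new_df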

def data_visualization(new_df,vis_type="bar"):
    new_df.plot(kind=vis_type, figsize=(10, 5), title="Average Units Sold by Month")
    return new_df

Building data pipelines allows us to experiment with different scenarios without changing all the code. Wrapping each step in a function also standardizes the code and makes it more readable.

path = "/work/Online Sales Data.csv"
# load_data already returns a DataFrame, so the chain can start from it
df = (load_data(path)
            .pipe(data_cleaning)
            .pipe(convert_dtypes, types_dict={'Product Category': 'str', 'Product Name': 'str'})
            .pipe(data_analysis)
            .pipe(data_visualization, vis_type='line')
           )

The result is a line chart of the average units sold per month, produced by a single chained expression.
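As a rough illustration of experimenting with different scenarios, the sketch below runs a similar pipeline on a small made-up CSV string instead of the Kaggle file, and skips the plotting step so it runs without a display. The sample rows, the `units_by_month` helper, and its `agg` parameter are hypothetical and not part of the code above. They only show how one keyword argument can switch the analysis step while the rest of the chain stays the same.

```python
import io
import pandas as pd

# Made-up sample with only three columns; the real dataset has more.
csv_text = """Date,Product Category,Units Sold
2024-01-05,Electronics,2
2024-01-05,Electronics,2
2024-01-20,Books,4
2024-02-10,Books,
2024-02-15,Clothing,5
"""

def load_data(source):
    # Accepts a file path or any file-like object.
    return pd.read_csv(source)

def data_cleaning(data):
    # Remove duplicate rows and rows with missing values.
    return data.drop_duplicates().dropna().reset_index(drop=True)

def convert_dtypes(data, types_dict=None):
    # Cast the requested columns, then parse the Date column.
    if types_dict:
        data = data.astype(types_dict)
    return data.assign(Date=pd.to_datetime(data["Date"]))

def units_by_month(data, agg="mean"):
    # Hypothetical variant of data_analysis with a selectable aggregation.
    with_month = data.assign(month=data["Date"].dt.month)
    return with_month.groupby("month")["Units Sold"].agg(agg)

cleaned = (
    load_data(io.StringIO(csv_text))
    .pipe(data_cleaning)
    .pipe(convert_dtypes, types_dict={"Product Category": "str"})
)

# Two scenarios that differ only in one keyword argument.
monthly_mean = cleaned.pipe(units_by_month)
monthly_sum = cleaned.pipe(units_by_month, agg="sum")

print(monthly_mean)
print(monthly_sum)
```

Because each step is a plain function, a scenario can be changed by swapping one `pipe` call or one argument rather than editing the whole script.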

Conclusion

In this short tutorial, we learned about the Pandas `pipe` method and how to use it to build and execute end-to-end data science pipelines. A pipeline makes your code more readable, repeatable, and better organized. By integrating the `pipe` method into your workflow, you can streamline your data processing tasks and improve the overall efficiency of your projects. Additionally, some users report faster execution after replacing `.apply()` with `pipe`. Note that `pipe` itself only calls your function once with the whole DataFrame, so any speedup comes from using vectorized operations inside that function rather than row-by-row calls.
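The comparison with `.apply()` can be sketched as follows. This is a minimal illustration on made-up data, and the function names and the call counter are assumptions introduced for the example. It does not measure timing; it only shows that a function passed to `pipe` runs once on the whole DataFrame, while a function passed to `apply(axis=1)` runs for every row.

```python
import pandas as pd

# Made-up sample data.
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "units": [1, 2, 3]})

# Counts how often each function is invoked.
calls = {"pipe": 0, "apply": 0}

def add_revenue(data):
    # Receives the whole DataFrame; the multiplication is vectorized.
    calls["pipe"] += 1
    return data.assign(revenue=data["price"] * data["units"])

def row_revenue(row):
    # Receives one row at a time when used with apply(axis=1).
    calls["apply"] += 1
    return row["price"] * row["units"]

piped = df.pipe(add_revenue)
applied = df.assign(revenue=df.apply(row_revenue, axis=1))

print(calls)
print(piped["revenue"].tolist() == applied["revenue"].tolist())
```

Both approaches produce the same revenue column here. On large DataFrames the per-row calls are typically where the time goes, so the gain comes from the vectorized function body, not from `pipe` as such.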

Abid Ali Awan (@1abidaliawan) is a certified data science professional who loves building machine learning models. He currently focuses on content creation and writing technical blogs on machine learning and data science technologies. Abid has a Masters in Technology Management and a Bachelors in Telecommunication Engineering. His vision is to build an AI product using Graph Neural Network for students struggling with mental illness.
