How I built a data cleansing pipeline using one messy DoorDash dataset

Share

How I built a data cleansing pipeline using one messy DoorDash dataset
Photo by the editor

# Entry

According to CrowdFlower SurveyData scientists spend 60% of their time organizing and cleaning data.

In this article, we’ll walk through the process of creating a data cleansing pipeline using an actual dataset from DoorDash. It contains nearly 200,000 food delivery records, each with dozens of features such as delivery time, total number of items, and store category (e.g., Mexican, Thai, or American).

# Predicting food delivery times based on DoorDash data

Predicting food delivery times based on DoorDash dataPredicting food delivery times based on DoorDash data

DoorDash aims to accurately estimate the time it takes to deliver food from the moment a customer places an order to when it arrives at their door. IN this data projectwe are tasked with developing a model that predicts total delivery time based on historical delivery data.

However, we will not complete the entire project, i.e. we will not build a predictive model. Instead, we will employ the dataset provided in the project and create a data cleansing pipeline.

Our workflow consists of two main steps.

Data cleansing pipelineData cleansing pipeline

# Data mining

Data cleansing pipelineData cleansing pipeline

Let’s start by loading and reviewing the first few rows of the dataset.

// Load and preview the dataset

import pandas as pd
df = pd.read_csv("historical_data.csv")
df.head()

Here’s the result.

Data cleansing pipelineData cleansing pipeline

This dataset contains datetime columns that record the order creation time and actual delivery time, which can be used to calculate delivery times. It also includes other features such as store category, total item count, subtotal, and minimum item price, making it suitable for various types of data analysis. We already see that there are some NaN values, which we will take a closer look at in the next step.

// Browse columns with info()

Let’s check all the column names with information() method. We will employ this method throughout the article to see changes in column value counters; this is a good indicator of missing data and overall data health.

Here’s the result.

Data cleansing pipelineData cleansing pipeline

As you can see, we have 15 columns, but the number of non-null values ​​in them varies. This means that some columns contain missing values ​​that could impact our analysis if not handled appropriately. One last thing: created_at AND actual_delivery_time data types are objects; these should be datetime.

# Building a data cleansing pipeline

In this step, we create a structured data cleansing pipeline to prepare the dataset for modeling. Each step addresses common issues such as timestamp formatting, missing values, and irrelevant features.

Building a data cleansing pipelineBuilding a data cleansing pipeline

// Fix the data types of date and time columns

Before we start analyzing the data, we need to fix the time columns. Otherwise, the calculations we mentioned (actual_delivery_time – created_at) will go wrong.

What we repair:

  • created_at: when the order was placed
  • actual_delivery_time: when the food arrived

These two columns are stored as objects, so in order to perform the calculations correctly, we need to convert them to datetime format. To do this we can employ the datetime function in pandas. Here’s the code.

import pandas as pd
df = pd.read_csv("historical_data.csv")
# Convert timestamp strings to datetime objects
df["created_at"] = pd.to_datetime(df["created_at"], errors="coerce")
df["actual_delivery_time"] = pd.to_datetime(df["actual_delivery_time"], errors="coerce")
df.info()

Here’s the result.

Building a data cleansing pipelineBuilding a data cleansing pipeline

As you can see in the screenshot above, the file created_at AND actual_delivery_time are now datetime objects.

Building a data cleansing pipelineBuilding a data cleansing pipeline

Key columns include: store_basic_category has the fewest non-null values ​​(192,668), which means it has the most missing data. Therefore, we will focus on cleaning it first.

// Data imputation using mode()

One of the messiest columns in the dataset, as evidenced by the huge number of missing values, is: store_basic_category. It tells us what types of grocery stores are available, such as Mexican, American, and Thai. However, many rows are missing this information, which is a problem. For example, it may limit how data is grouped or analyzed. So how do we fix this?

We will fill these lines instead of deleting them. For this purpose, we will employ smarter imputation.

We write a dictionary that maps each of them store_id to its most common category, then employ this mapping to fill in the missing values. Before we do that, let’s see the dataset.

Data imputation in modeData imputation in mode

Here’s the code.

import numpy as np

# Global most-frequent category as a fallback
global_mode = df["store_primary_category"].mode().iloc[0]

# Build store-level mapping to the most common category (rapid and strong)
store_mode = (
    df.groupby("store_id")["store_primary_category"]
      .agg(lambda s: s.mode().iloc[0] if not s.mode().empty else np.nan)
)

# Fill missing categories using the store-level mode, then fall back to global mode
df["store_primary_category"] = (
    df["store_primary_category"]
      .fillna(df["store_id"].map(store_mode))
      .fillna(global_mode)
)

df.info()

Here’s the result.

Data imputation in modeData imputation in mode

As you can see in the screenshot above, the file store_basic_category the column now has a larger non-null number. But let’s check again with this code.

df["store_primary_category"].isna().sum()

Here is the result showing the number of NaN values. It’s zero; we got rid of them all.

Data imputation in modeData imputation in mode

Let’s look at the data set after imputation.

Data imputation in modeData imputation in mode

// Dropping the remaining NaN

In the previous step we corrected store_basic_categorybut did you notice anything? Non-null numbers in columns still don’t match!

This is a clear sign that we are still dealing with missing values ​​in some part of the dataset. When it comes to data cleansing, we have two options:

  • Complete the missing values
  • Throw them

Considering this dataset contains almost 200,000 rows, we can afford to lose some. With smaller data sets, you need to be more careful. In this case, it is advisable to analyze each column, establish standards (decide how the missing values ​​will be filled in – using the mean, median, most frequently occurring values, or domain-specific default values), and then fill them in.

To remove NaN, we will employ the method drop() method from pandas library. Let’s set it up place = True to apply changes directly to the DataFrame without having to reassign it. Let’s look at the data set at this point.

Dropping NaNDropping NaN

Here’s the code.

df.dropna(inplace=True)
df.info()

Here’s the result.

Dropping NaNDropping NaN

As you can see in the screenshot above, each column now has the same number of non-null values.

Let’s see the dataset after all the changes.

Dropping NaNDropping NaN

// What can you do next?

Now that we have a neat dataset, here are some things you can do next:

  • Perform EDA to understand supply patterns.
  • Develop recent features like delivery times or busy rate metrics to add more meaning to your analytics.
  • Analyze correlations between variables to improve model performance.
  • Create various regression models and find the most effective model.
  • Predict delivery times using the most effective model.

# Final thoughts

In this article, we cleaned the real DoorDash dataset by solving common data quality issues such as fixing invalid data types and handling missing values. We built a straightforward data cleansing pipeline tailored to this data project and explored potential next steps.

Real-world datasets can be messier than you think, but there are also many methods and tricks to solve these problems. Thank you for reading!

Nate Rosidi is a data scientist and product strategist. He is also an adjunct professor of analytics and the founder of StrataScratch, a platform that helps data scientists prepare for job interviews using real interview questions from top companies. Nate writes about the latest career trends, gives interview advice, shares data science projects, and discusses all things SQL.

Latest Posts

More News