
Photo by the editor
# Entry
According to CrowdFlower SurveyData scientists spend 60% of their time organizing and cleaning data.
In this article, we’ll walk through the process of creating a data cleansing pipeline using an actual dataset from DoorDash. It contains nearly 200,000 food delivery records, each with dozens of features such as delivery time, total number of items, and store category (e.g., Mexican, Thai, or American).
# Predicting food delivery times based on DoorDash data


DoorDash aims to accurately estimate the time it takes to deliver food from the moment a customer places an order to when it arrives at their door. IN this data projectwe are tasked with developing a model that predicts total delivery time based on historical delivery data.
However, we will not complete the entire project, i.e. we will not build a predictive model. Instead, we will employ the dataset provided in the project and create a data cleansing pipeline.
Our workflow consists of two main steps.

# Data mining

Let’s start by loading and reviewing the first few rows of the dataset.
// Load and preview the dataset
import pandas as pd
df = pd.read_csv("historical_data.csv")
df.head()
Here’s the result.

This dataset contains datetime columns that record the order creation time and actual delivery time, which can be used to calculate delivery times. It also includes other features such as store category, total item count, subtotal, and minimum item price, making it suitable for various types of data analysis. We already see that there are some NaN values, which we will take a closer look at in the next step.
// Browse columns with info()
Let’s check all the column names with information() method. We will employ this method throughout the article to see changes in column value counters; this is a good indicator of missing data and overall data health.
Here’s the result.

As you can see, we have 15 columns, but the number of non-null values in them varies. This means that some columns contain missing values that could impact our analysis if not handled appropriately. One last thing: created_at AND actual_delivery_time data types are objects; these should be datetime.
# Building a data cleansing pipeline
In this step, we create a structured data cleansing pipeline to prepare the dataset for modeling. Each step addresses common issues such as timestamp formatting, missing values, and irrelevant features.

// Fix the data types of date and time columns
Before we start analyzing the data, we need to fix the time columns. Otherwise, the calculations we mentioned (actual_delivery_time – created_at) will go wrong.
What we repair:
- created_at: when the order was placed
- actual_delivery_time: when the food arrived
These two columns are stored as objects, so in order to perform the calculations correctly, we need to convert them to datetime format. To do this we can employ the datetime function in pandas. Here’s the code.
import pandas as pd
df = pd.read_csv("historical_data.csv")
# Convert timestamp strings to datetime objects
df["created_at"] = pd.to_datetime(df["created_at"], errors="coerce")
df["actual_delivery_time"] = pd.to_datetime(df["actual_delivery_time"], errors="coerce")
df.info()
Here’s the result.

As you can see in the screenshot above, the file created_at AND actual_delivery_time are now datetime objects.

Key columns include: store_basic_category has the fewest non-null values (192,668), which means it has the most missing data. Therefore, we will focus on cleaning it first.
// Data imputation using mode()
One of the messiest columns in the dataset, as evidenced by the huge number of missing values, is: store_basic_category. It tells us what types of grocery stores are available, such as Mexican, American, and Thai. However, many rows are missing this information, which is a problem. For example, it may limit how data is grouped or analyzed. So how do we fix this?
We will fill these lines instead of deleting them. For this purpose, we will employ smarter imputation.
We write a dictionary that maps each of them store_id to its most common category, then employ this mapping to fill in the missing values. Before we do that, let’s see the dataset.

Here’s the code.
import numpy as np
# Global most-frequent category as a fallback
global_mode = df["store_primary_category"].mode().iloc[0]
# Build store-level mapping to the most common category (rapid and strong)
store_mode = (
df.groupby("store_id")["store_primary_category"]
.agg(lambda s: s.mode().iloc[0] if not s.mode().empty else np.nan)
)
# Fill missing categories using the store-level mode, then fall back to global mode
df["store_primary_category"] = (
df["store_primary_category"]
.fillna(df["store_id"].map(store_mode))
.fillna(global_mode)
)
df.info()
Here’s the result.

As you can see in the screenshot above, the file store_basic_category the column now has a larger non-null number. But let’s check again with this code.
df["store_primary_category"].isna().sum()
Here is the result showing the number of NaN values. It’s zero; we got rid of them all.

Let’s look at the data set after imputation.


// Dropping the remaining NaN
In the previous step we corrected store_basic_categorybut did you notice anything? Non-null numbers in columns still don’t match!
This is a clear sign that we are still dealing with missing values in some part of the dataset. When it comes to data cleansing, we have two options:
- Complete the missing values
- Throw them
Considering this dataset contains almost 200,000 rows, we can afford to lose some. With smaller data sets, you need to be more careful. In this case, it is advisable to analyze each column, establish standards (decide how the missing values will be filled in – using the mean, median, most frequently occurring values, or domain-specific default values), and then fill them in.
To remove NaN, we will employ the method drop() method from pandas library. Let’s set it up place = True to apply changes directly to the DataFrame without having to reassign it. Let’s look at the data set at this point.

Here’s the code.
df.dropna(inplace=True)
df.info()
Here’s the result.

As you can see in the screenshot above, each column now has the same number of non-null values.
Let’s see the dataset after all the changes.

// What can you do next?
Now that we have a neat dataset, here are some things you can do next:
- Perform EDA to understand supply patterns.
- Develop recent features like delivery times or busy rate metrics to add more meaning to your analytics.
- Analyze correlations between variables to improve model performance.
- Create various regression models and find the most effective model.
- Predict delivery times using the most effective model.
# Final thoughts
In this article, we cleaned the real DoorDash dataset by solving common data quality issues such as fixing invalid data types and handling missing values. We built a straightforward data cleansing pipeline tailored to this data project and explored potential next steps.
Real-world datasets can be messier than you think, but there are also many methods and tricks to solve these problems. Thank you for reading!
Nate Rosidi is a data scientist and product strategist. He is also an adjunct professor of analytics and the founder of StrataScratch, a platform that helps data scientists prepare for job interviews using real interview questions from top companies. Nate writes about the latest career trends, gives interview advice, shares data science projects, and discusses all things SQL.
