Sunday, March 8, 2026

5 Useful Python Scripts to Automate Data Cleansing


# Introduction

As a data scientist, you know that machine learning models, analytics dashboards, and business reports depend on data that is correct, consistent, and properly formatted. But here’s the uncomfortable truth: data cleansing takes up a huge chunk of project time. Data scientists and analysts spend far more time cleaning and preparing data than actually analyzing it.

The raw data you get is messy. It contains scattered missing values, duplicate records, inconsistent formats, outliers that skew models, and text fields full of typos and inconsistencies. Manually cleaning this data is tedious, error-prone, and does not scale.

This article discusses five Python scripts specifically designed to automate the most common and time-consuming data cleansing tasks you often encounter in real-world projects.

🔗 Link to code on GitHub

# 1. Missing values handler

What the script does: Automatically analyzes missing-value patterns across the entire dataset, recommends appropriate handling strategies based on data type and missingness pattern, and applies the selected imputation methods. It generates a detailed report showing what was missing and how it was addressed.

How it works: The script scans all columns to calculate missing-value percentages and patterns, determines each column's data type (numeric, categorical, date/time), and applies an appropriate strategy:

  • mean or median imputation for numeric data,
  • mode imputation for categorical data,
  • interpolation for time series data.

It can also detect and flag Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR) patterns, and it logs all changes for reproducibility.
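The type-aware imputation step could be sketched like this. This is a minimal illustration using pandas, not the script's actual code; the median/mode choice per dtype is one common convention:

```python
import pandas as pd

def impute_missing(df: pd.DataFrame):
    """Impute missing values per column based on dtype; return (clean_df, report)."""
    out = df.copy()
    report = {}
    for col in out.columns:
        n_missing = int(out[col].isna().sum())
        if n_missing == 0:
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            # Median is robust to skew; mean is an alternative for symmetric data
            fill, strategy = out[col].median(), "median"
        else:
            # Most frequent value for categorical/text columns
            fill, strategy = out[col].mode().iloc[0], "mode"
        out[col] = out[col].fillna(fill)
        report[col] = {"missing": n_missing, "strategy": strategy}
    return out, report
```

Returning the report dict alongside the cleaned frame is what makes the step auditable: you can see exactly which columns were touched and how.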

Download the missing values handling script

# 2. Duplicate record detector and resolver

Pain point: Your data contains duplicates, but they do not always exactly match. Sometimes it’s the same customer with a slightly different spelling of the name, or the same transaction recorded twice with slight differences. Finding these fuzzy duplicates and deciding which record to keep requires manual inspection of thousands of rows.

What the script does: Identifies both exact and fuzzy duplicate records using customizable matching rules. It groups similar records, scores their similarity, and either flags them for review or merges them automatically based on survivorship rules you define, such as keep the newest or keep the most complete record.

How it works: The script first finds exact duplicates using fast hash-based comparison. It then applies fuzzy matching algorithms (Levenshtein distance and Jaro–Winkler similarity) on key fields to find near-duplicates. Records are clustered into duplicate groups, and the survivorship rules determine which values to keep when merging. A detailed report shows all duplicate groups found and the actions taken.
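The two-pass idea (hash for exact matches, fuzzy similarity for near-matches) can be sketched as below. Note one substitution: the article names Levenshtein distance and Jaro–Winkler, but this sketch uses the standard library's `difflib.SequenceMatcher` as a stand-in so it runs without third-party packages; the record layout and threshold are illustrative:

```python
import hashlib
from difflib import SequenceMatcher

def find_duplicates(records, key_fields, threshold=0.85):
    """Return groups of record indices that are exact or fuzzy duplicates."""
    def key(rec):
        return "|".join(str(rec[f]).strip().lower() for f in key_fields)

    # Pass 1: exact duplicates via hashing of normalized key fields
    groups = {}
    for i, rec in enumerate(records):
        digest = hashlib.md5(key(rec).encode()).hexdigest()
        groups.setdefault(digest, []).append(i)

    # Pass 2: fuzzy matching between one representative of each exact group
    reps = list(groups.values())
    merged, used = [], set()
    for a in range(len(reps)):
        if a in used:
            continue
        cluster = list(reps[a])
        for b in range(a + 1, len(reps)):
            if b in used:
                continue
            ka, kb = key(records[reps[a][0]]), key(records[reps[b][0]])
            if SequenceMatcher(None, ka, kb).ratio() >= threshold:
                cluster.extend(reps[b])
                used.add(b)
        merged.append(sorted(cluster))
    return [g for g in merged if len(g) > 1]  # only actual duplicate groups
```

Comparing only group representatives in pass 2 keeps the fuzzy stage from re-checking pairs the hash pass already resolved.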

Download the duplicate detection script

# 3. Data type detector and standardizer

Pain point: Your CSV import turned everything into strings. The dates are in five different formats. Numbers have currency symbols and thousands separators. Boolean values are represented as “Yes/No”, “Y/N”, “1/0”, and “True/False”, all in the same column. Getting consistent data types requires writing custom parsing logic for each messy column.

What the script does: Automatically detects the intended data type for each column, standardizes formats, and converts everything to the appropriate types. It supports dates in multiple formats, cleanses numeric strings, normalizes logical representations, and validates results. Displays a conversion report showing what was changed.

How it works: The script samples values from each column to infer the intended type using pattern matching and heuristics. It then applies the appropriate parser: dateutil for flexible date parsing, regular expressions for numeric extraction, and dictionary mapping for boolean normalization. Failed conversions are logged with the problematic values for manual review.
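A compact version of that inference cascade might look like this, using pandas' own parsers in place of dateutil; the boolean lookup table and currency-symbol regex are illustrative assumptions, not the script's actual rules:

```python
import pandas as pd

# Illustrative lookup dictionary for boolean normalization
BOOL_MAP = {"yes": True, "y": True, "true": True, "1": True,
            "no": False, "n": False, "false": False, "0": False}

def coerce_column(s: pd.Series) -> pd.Series:
    """Infer and apply the most likely type for a string column."""
    sample = s.dropna().astype(str).str.strip()
    # Boolean-like values first, so "1"/"0" columns are not treated as numeric
    if sample.str.lower().isin(BOOL_MAP).all():
        return sample.str.lower().map(BOOL_MAP)
    # Numeric strings: strip currency symbols and thousands separators
    cleaned = sample.str.replace(r"[$,€£]", "", regex=True)
    numeric = pd.to_numeric(cleaned, errors="coerce")
    if numeric.notna().all():
        return numeric
    # Dates: parse failures become NaT rather than raising
    dates = pd.to_datetime(sample, errors="coerce")
    if dates.notna().all():
        return dates
    return s  # no confident inference; leave as-is and log for review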

Download the data type repair script

# 4. Outlier detector

Pain point: Your numeric data contains outliers that can wreck your analysis. Some of them are data entry errors, some are legitimate extreme values you want to keep, and some are ambiguous. You need to identify them, understand their impact, and decide how to handle each case – winsorize, cap, remove, or flag for review.

What the script does: Detects outliers using multiple statistical methods (IQR, Z-score, Isolation Forest), visualizes their distribution and impact, and applies customizable treatment strategies. It distinguishes between univariate and multivariate outliers and generates reports showing the number of outliers, their values, and how they were handled.

How it works: The script calculates outlier boundaries using the selected methods, flags values that exceed the thresholds, and applies the chosen treatment: removal, capping at percentiles, winsorization, or imputation at the boundaries. For multivariate outliers, it uses Isolation Forest or Mahalanobis distance. All outliers are logged with their original values for audit purposes.
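As one example of the boundary-based methods, here is a minimal IQR detector with boundary capping; the multiplier `k=1.5` is the conventional Tukey fence, and the capping treatment is just one of the strategies mentioned above:

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; return (mask, treated series)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (s < lower) | (s > upper)           # True where value is an outlier
    treated = s.clip(lower=lower, upper=upper)  # cap extremes at the boundaries
    return mask, treated
```

Returning the boolean mask separately from the treated series keeps the original values available for the audit log.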

Download the outlier detection script

# 5. Text data cleaner and normalizer

Pain point: Your text fields are a mess. Names have inconsistent capitalization, addresses use different abbreviations (St. vs. Street vs. ST), product descriptions contain HTML tags and special characters, and free text fields have leading and trailing spaces everywhere. Standardizing text data requires consistently applying dozens of regex patterns and string operations.

What the script does: Automatically cleans and normalizes text data: standardizes case, removes unwanted characters, expands or standardizes abbreviations, removes HTML, normalizes whitespace, and handles Unicode issues. Configurable cleanup pipelines allow you to apply different rules to different types of columns (names, addresses, descriptions, and the like).

How it works: The script provides a text transformation pipeline that can be configured per column type. It supports case normalization, whitespace cleanup, special character removal, abbreviation standardization via lookup dictionaries, and Unicode normalization. Each transformation is logged, and before/after samples are provided for validation.
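A stripped-down version of such a pipeline, using only the standard library, might look like this; the abbreviation map and the final title-casing are illustrative choices, not the script's actual configuration:

```python
import re
import unicodedata

# Illustrative abbreviation map for address-type columns
ADDRESS_ABBREVIATIONS = {r"\bst\b\.?": "street", r"\bave\b\.?": "avenue"}

def clean_text(value, abbreviations=None):
    """Normalize Unicode, strip HTML tags, collapse whitespace, standardize case."""
    text = unicodedata.normalize("NFKC", value)   # fold Unicode variants
    text = re.sub(r"<[^>]+>", " ", text)          # drop HTML tags
    text = re.sub(r"\s+", " ", text).strip()      # collapse whitespace
    text = text.lower()
    for pattern, repl in (abbreviations or {}).items():
        text = re.sub(pattern, repl, text)        # expand abbreviations
    return text.title()                           # title-case names/addresses
```

In a real pipeline each column type would get its own ordered list of such steps, with the intermediate values logged for the before/after report.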

Download the text cleanup script

# Conclusion

These five scripts solve the most time-consuming data cleansing challenges you’ll face in real-world projects. Here’s a quick summary:

  • The missing values handler intelligently analyzes and imputes missing data
  • The duplicate detector finds and resolves exact and fuzzy duplicates
  • The data type standardizer normalizes formats and converts columns to appropriate types
  • The outlier detector identifies and treats statistical anomalies
  • The text cleaner consistently normalizes messy string data

Each script is designed to be modular, so you can use them individually or combine them into a complete data cleansing pipeline. Start with the script that solves your biggest problem, test it on a sample of your data, tune the parameters for your specific use case, and gradually build out your automated cleanup workflow.

Happy data cleaning!

Bala Priya C is a software developer and technical writer from India. She likes working at the intersection of mathematics, programming, data analytics, and content creation. Her areas of interest and specialization include DevOps, data analytics, and natural language processing. She enjoys reading, writing, coding, and coffee! She is currently learning and sharing her knowledge with the developer community by writing tutorials, guides, reviews, and more. Bala also creates engaging resource overviews and coding tutorials.
