# Entry
Data quality problems are everywhere. Missing values where they shouldn’t be. Dates in the wrong format. Duplicate records getting through. Outliers that skew the analysis. Text fields with inconsistent capitalization and spelling variations. These problems can break analyses and pipelines, and they often lead to poor business decisions.
Manual data verification is tedious. You have to check for the same issues repeatedly across multiple datasets, and it’s easy to miss subtle problems. This article covers five practical Python scripts that solve the most common data quality problems.
# 1. Missing data analysis
// Pain point
You get a dataset that should contain complete records, but instead there are empty cells, null values, empty strings, and placeholder text such as “N/A” or “Unknown”. Some columns are mostly blank, others have just a few gaps. Before you can fix it, you need to understand the scale of the problem.
// What the script does
Comprehensively scans datasets for missing data in all its forms. Identifies missing patterns (random or systematic), calculates completeness scores for each column, and flags columns with excessive missing values. It also generates visual reports showing where data gaps exist.
// How it works
The script reads data from CSV, Excel, or JSON files and detects the various representations of missing values such as None, NaN, empty strings, and common placeholders. It then calculates the percentage of missing data by column and row, identifying correlations between missing values across columns. Finally, it produces both summary statistics and detailed reports with recommendations on how to handle each type of missingness.
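The core detection step can be sketched in a few lines of pandas. This is a minimal illustration, not the downloadable script itself; the `MISSING_TOKENS` list and the function name are assumptions chosen for the example:

```python
import pandas as pd

# Placeholder strings commonly used to mean "missing" (an assumed list;
# extend it for your own data).
MISSING_TOKENS = {"", "n/a", "na", "null", "none", "unknown", "-"}

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and completeness scores."""
    # Normalize placeholder strings in text columns to NaN before counting.
    cleaned = df.apply(
        lambda col: col.mask(
            col.astype(str).str.strip().str.lower().isin(MISSING_TOKENS)
        )
        if col.dtype == object else col
    )
    missing = cleaned.isna().sum()
    return pd.DataFrame({
        "missing": missing,
        "missing_pct": (missing / len(df) * 100).round(1),
        "completeness": (1 - missing / len(df)).round(3),
    })

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "email": ["a@x.com", "N/A", "", "b@y.com"],
    "age": [25, None, 31, 40],
})
report = missing_report(df)
print(report)
```

Normalizing placeholders before counting is the key step: a plain `df.isna()` would report the `email` column as fully populated even though half its values are effectively missing.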
⏩ Download the missing data analyzer script
# 2. Validation of data types
// Pain point
Your dataset claims to have numeric identifiers, but some are text. Date fields contain dates in mixed formats, timestamps, or sometimes just random strings. The email column mostly holds email addresses, except for entries that aren’t valid emails at all. Such type inconsistencies cause scripts to crash or produce incorrect calculations.
// What the script does
Checks whether each column contains the expected data type. Checks numeric columns for non-numeric values, date columns for invalid dates, email and URL columns for correct formatting, and category columns for unexpected values. The script also provides detailed type violation reports with line numbers and examples.
// How it works
The script accepts a schema definition specifying the expected types of each column, uses patterns and validation libraries to check format compliance, identifies and reports rows that violate type expectations, calculates violation rates per column, and suggests appropriate data type conversions or cleanup steps.
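A stripped-down version of this schema-driven check might look like the following. The `SCHEMA` dictionary, the regex, and the function name are illustrative assumptions, not the actual script’s API:

```python
import re
import pandas as pd

# A minimal schema: column name -> expected type (assumed format).
SCHEMA = {"user_id": "int", "signup": "date", "email": "email"}

# A deliberately simple email pattern for illustration only.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate(df: pd.DataFrame, schema: dict) -> list:
    """Return a list of (row, column, value) type violations."""
    violations = []
    for col, expected in schema.items():
        for idx, value in df[col].items():
            ok = True
            if expected == "int":
                ok = str(value).lstrip("-").isdigit()
            elif expected == "date":
                # errors="coerce" turns unparseable dates into NaT.
                ok = not pd.isna(pd.to_datetime(value, errors="coerce"))
            elif expected == "email":
                ok = bool(EMAIL_RE.match(str(value)))
            if not ok:
                violations.append((idx, col, value))
    return violations

df = pd.DataFrame({
    "user_id": ["101", "102", "abc"],
    "signup": ["2024-01-05", "not a date", "2024-02-10"],
    "email": ["a@x.com", "b@y.com", "no-at-sign"],
})
violations = validate(df, SCHEMA)
for row, col, value in violations:
    print(f"row {row}: {col} = {value!r}")
```

Reporting row numbers alongside the offending values, as here, is what makes a violation report actionable rather than just a count.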
⏩ Download the data type validation script
# 3. Detecting duplicate records
// Pain point
Your database should contain unique records, but duplicate entries keep appearing. Sometimes they are exact duplicates, sometimes only a few fields match. Perhaps it’s the same customer with a slightly different spelling of their name, or transactions that were accidentally submitted twice. Finding them manually is very hard.
// What the script does
Identifies duplicate and near-duplicate records using multiple detection strategies. Finds exact matches, fuzzy matches based on similarity thresholds, and duplicates in specified column combinations. Groups similar records and calculates confidence scores for potential matches.
// How it works
The script uses hash-based exact matching for perfect duplicates, applies fuzzy string matching algorithms using Levenshtein distance in the case of near-duplicates, allows you to specify key columns for partial matching, generates duplicate clusters with similarity scores, and exports detailed reports showing all potential duplicates with deduplication recommendations.
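The two-stage idea (exact matches first, then fuzzy matches) can be sketched with the standard library’s `difflib.SequenceMatcher` as a stand-in for Levenshtein distance; the function name and the 0.85 threshold are assumptions for the example:

```python
from difflib import SequenceMatcher
import pandas as pd

def find_near_duplicates(df: pd.DataFrame, column: str, threshold: float = 0.85):
    """Pairwise fuzzy matching on one column; returns (i, j, score) tuples.
    O(n^2) comparisons, so suitable for small-to-medium datasets only."""
    values = df[column].str.lower().str.strip()
    pairs = []
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            score = SequenceMatcher(None, values[i], values[j]).ratio()
            if score >= threshold:
                pairs.append((i, j, round(score, 2)))
    return pairs

df = pd.DataFrame({"name": ["Jon Smith", "John Smith", "Mary Jones", "jon smith "]})

# Stage 1: exact duplicates after normalizing case and whitespace.
exact = df["name"].str.lower().str.strip().duplicated().sum()

# Stage 2: fuzzy near-duplicates above the similarity threshold.
matches = find_near_duplicates(df, "name")
print(exact, matches)
```

Note that “Jon Smith” and “jon smith ” are caught by the cheap exact stage once normalized, while “Jon Smith” vs. “John Smith” needs the fuzzy stage. A production script would swap in a proper Levenshtein library and blocking to avoid the full O(n²) comparison.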
⏩ Download the duplicate records detection script
# 4. Detecting outliers
// Pain point
The analysis results look wrong. You dig around and discover that someone entered an age of 999, the transaction amount is negative when it should be positive, or the measurement is three orders of magnitude larger than the others. Outliers skew statistics, break models, and are often hard to identify in vast data sets.
// What the script does
Automatically detects statistical outliers using multiple methods. Applies z-score analysis, interquartile range (IQR) analysis, and domain-specific rules. Identifies extreme values, impossible values, and values outside expected ranges. Provides context for each outlier and suggests whether it is a probable error or a legitimate extreme value.
// How it works
The script analyzes numeric columns using configurable statistical thresholds, applies domain-specific validation rules, visualizes distributions with outliers highlighted, calculates outlier scores and confidence levels, and generates prioritized reports by flagging the most likely errors in the data first.
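Combining methods matters, as this small sketch shows (the function name and thresholds are assumptions): on a small sample, a single huge value inflates the standard deviation so much that its own z-score stays below 3, while the IQR rule still catches it.

```python
import pandas as pd

def flag_outliers(series: pd.Series, z_thresh: float = 3.0,
                  iqr_factor: float = 1.5) -> pd.Series:
    """Boolean mask flagging values that fail the z-score OR the IQR rule."""
    # Z-score rule: distance from the mean in standard deviations.
    z = (series - series.mean()) / series.std(ddof=0)
    by_z = z.abs() > z_thresh

    # IQR rule: outside [Q1 - k*IQR, Q3 + k*IQR].
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    by_iqr = (series < q1 - iqr_factor * iqr) | (series > q3 + iqr_factor * iqr)

    return by_z | by_iqr

ages = pd.Series([25, 31, 29, 34, 28, 999])  # 999 is a likely entry error
mask = flag_outliers(ages)
print(ages[mask])
```

Here the z-score for 999 is only about 2.2 because 999 itself drags the mean and standard deviation upward, so the z-score rule alone would miss it; the IQR rule flags it immediately. That is exactly why the script applies several methods rather than one.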
⏩ Download the outlier detection script
# 5. Checking consistency between fields
// Pain point
The individual fields look fine, but the relationships between them are broken. Start dates that come after end dates. Shipping addresses whose country doesn’t match the billing country code. Child records without corresponding parent records. Order totals that don’t match the sum of the order items. These logical inconsistencies are harder to detect, but just as damaging.
// What the script does
Verifies logical relationships between fields based on business rules. Checks temporal consistency, referential integrity, mathematical relationships, and custom business logic. Flags violations, providing details of inconsistencies.
// How it works
The script accepts a rule definition file specifying the relationships to check, evaluates conditional logic and comparisons between fields, performs lookups to check referential integrity, computes derived values and compares them to stored values, and produces detailed violation reports with row references and specific rule errors.
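The rule-evaluation core can be sketched as follows. Here the rules are inlined as Python predicates for brevity; the actual script loads them from a rule definition file, and the rule names and columns are assumptions for the example:

```python
import pandas as pd

# Rules expressed as (name, row -> bool) pairs. A real rule file would
# declare these conditions rather than hard-coding lambdas.
RULES = [
    ("start_before_end", lambda r: r["start_date"] <= r["end_date"]),
    ("total_matches_items", lambda r: abs(r["order_total"] - r["items_sum"]) < 0.01),
]

def check_rules(df: pd.DataFrame, rules: list) -> list:
    """Return (row_index, rule_name) for every violated rule."""
    violations = []
    for name, rule in rules:
        for idx, row in df.iterrows():
            if not rule(row):
                violations.append((idx, name))
    return violations

df = pd.DataFrame({
    "start_date": pd.to_datetime(["2024-01-01", "2024-03-10"]),
    "end_date": pd.to_datetime(["2024-02-01", "2024-03-01"]),
    "order_total": [100.0, 59.99],
    "items_sum": [100.0, 49.99],
})
violations = check_rules(df, RULES)
for idx, rule_name in violations:
    print(f"row {idx} violates {rule_name}")
```

Comparing a recomputed value (the item sum) against a stored one (the order total) with a small tolerance, as the second rule does, is the standard way to check mathematical relationships without tripping over floating-point noise.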
⏩ Download the script that checks consistency between fields
# Summary
These five scripts help you detect data quality issues early, before they harm your analysis or systems. Data validation should be automatic, comprehensive, and fast, and these scripts help with exactly that.
So how do you get started? Download the script that solves your biggest data quality problem and install the required dependencies. Configure the validation rules for your specific data, then run the script on a sample dataset to verify your configuration. Finally, integrate it into your data pipeline to catch issues automatically.
Clean data is the foundation of everything else. Start validating systematically and you’ll spend less time troubleshooting. Have fun trying these out!
Bala Priya C is a software developer and technical writer from India. She likes working at the intersection of mathematics, programming, data analytics, and content creation. Her areas of interest and specialization include DevOps, data analytics, and natural language processing. She likes reading, writing, coding, and coffee! Currently, she is learning and sharing her knowledge with the developer community by writing tutorials, guides, reviews, and more. Bala also creates engaging resource overviews and coding tutorials.
