5 useful Python scripts for advanced data validation and quality control

Photo by the author

# Introduction

These advanced validation issues are insidious. They pass basic quality checks because the individual values look fine, but the underlying logic is broken. Catching them manually is difficult. You need automated scripts that understand context, business rules, and relationships between data points. This article covers five advanced Python validation scripts that catch the subtle issues basic checks miss.

You can download the code on GitHub.

# 1. Checking for continuity and patterns in time series

// Pain point

// What the script does

// How it works

The script parses timestamp columns to infer the expected frequency and identifies gaps in sequences that should be continuous. It verifies that event sequences follow logical ordering rules, applies domain-specific rate checks, and detects seasonality violations. It also generates detailed reports that show temporal anomalies and assess their impact on business activities.
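The gap-detection step can be sketched as follows. This is a minimal illustration, not the downloadable script: the function name `find_time_gaps`, the `tolerance` parameter, and the choice of the median interval as the "expected frequency" are all assumptions made for the example.

```python
from datetime import datetime, timedelta
from statistics import median


def find_time_gaps(timestamps, tolerance=1.5):
    """Infer the typical interval and report gaps larger than tolerance * interval.

    Assumption: the median spacing between consecutive timestamps is a
    reasonable stand-in for the expected frequency.
    """
    ts = sorted(timestamps)
    pairs = list(zip(ts, ts[1:]))
    steps = [(end - start).total_seconds() for start, end in pairs]
    if not steps:
        return None, []

    expected = median(steps)
    if expected <= 0:
        # Mostly duplicate timestamps: no usable frequency can be inferred.
        return timedelta(0), []

    gaps = []
    for (start, end), step in zip(pairs, steps):
        if step > expected * tolerance:
            gaps.append({
                "gap_start": start,
                "gap_end": end,
                # Number of readings that should have fallen inside the gap.
                "missing_points": round(step / expected) - 1,
            })
    return timedelta(seconds=expected), gaps


# Hourly sensor readings with 03:00 and 04:00 missing.
readings = [datetime(2024, 1, 1, h) for h in (0, 1, 2, 5, 6)]
expected_step, gaps = find_time_gaps(readings)
for gap in gaps:
    print(gap["gap_start"], "->", gap["gap_end"], "missing:", gap["missing_points"])
```

The median is used rather than the mean so that a few large gaps do not inflate the inferred interval.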

# 2. Semantic validation using business rules

// Pain point

Individual fields pass type validation, but the combination makes no sense. For example: a purchase order dated in the future with a completed delivery date in the past, or an account marked as “new customer” with five years of transaction history. These semantic violations break business logic.

// What the script does

The script validates data against complex business rules and domain knowledge. It checks multi-field conditional logic, validates stage and temporal progression, enforces mutually exclusive categories, and flags logically impossible combinations. It uses a rules engine that can express advanced business constraints.

// How it works

The script accepts business rules defined in a declarative format, evaluates complex conditional logic across multiple fields, and checks state transitions and workflow progression. It also checks the temporal consistency of business events, applies industry-specific domain rules, and generates violation reports categorized by rule type and business impact.
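A rules engine of this kind can be approximated in a few lines. The sketch below is an assumption about the general shape, not the author's implementation: the rule format (`name`, `type`, `applies`, `check`), the field names, and the 90-day threshold for a "new customer" are all hypothetical.

```python
from datetime import date

# Hypothetical declarative rules: each rule says when it applies and what must hold.
RULES = [
    {
        "name": "delivery_not_before_order",
        "type": "temporal",
        "applies": lambda r: r.get("delivery_date") is not None,
        "check": lambda r: r["delivery_date"] >= r["order_date"],
    },
    {
        "name": "new_customer_has_short_history",
        "type": "cross_field",
        "applies": lambda r: r.get("segment") == "new_customer",
        # Assumed threshold: a "new" account is at most 90 days old.
        "check": lambda r: r["account_age_days"] <= 90,
    },
]


def validate(records, rules):
    """Return one violation entry per (row, rule) pair that fails."""
    violations = []
    for row_index, record in enumerate(records):
        for rule in rules:
            if rule["applies"](record) and not rule["check"](record):
                violations.append(
                    {"row": row_index, "rule": rule["name"], "type": rule["type"]}
                )
    return violations


records = [
    {"order_date": date(2024, 3, 1), "delivery_date": date(2024, 2, 20),
     "segment": "returning", "account_age_days": 400},
    {"order_date": date(2024, 3, 2), "delivery_date": None,
     "segment": "new_customer", "account_age_days": 1825},
]
for violation in validate(records, RULES):
    print(violation)
```

Keeping rules as data rather than hard-coded `if` statements makes it possible to group violations by rule type in the report and to add rules without touching the evaluation loop.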

⏩ Download the semantic validity checker script

# 3. Detecting data drift and schema evolution

// Pain point

Data structure sometimes changes over time without documentation. New columns appear, existing columns disappear, data types change slightly, value ranges expand or contract, and categorical fields gain new categories. These changes break downstream systems, invalidate assumptions, and cause silent failures. Before you know it, months of corrupt data have accumulated.

// What the script does

The script monitors datasets for structural and statistical drift over time. It tracks schema changes such as new and removed columns and type changes, detects shifts in the distribution of numeric and categorical data, and identifies new values in supposedly fixed categories. It flags changes in data ranges and constraints and warns when statistical properties deviate from baseline values.

// How it works

The script creates baseline profiles of the structure and statistics of the dataset, periodically compares current data to baseline values, calculates drift scores using statistical distance metrics such as KL divergence and Wasserstein distance, and tracks schema version changes. It also maintains a history of changes, applies significance testing to distinguish real drift from noise, and generates drift reports with severity levels and recommended actions.
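Two of these steps, the schema comparison and a KL-divergence drift score for a categorical column, can be sketched with the standard library alone. The function names, the `{column: type_name}` schema format, and the `eps` smoothing constant are assumptions for illustration; Wasserstein distance and significance testing are left out of this sketch.

```python
import math
from collections import Counter


def schema_diff(baseline, current):
    """Compare two {column: type_name} mappings."""
    shared = set(baseline) & set(current)
    return {
        "added": sorted(set(current) - set(baseline)),
        "removed": sorted(set(baseline) - set(current)),
        "type_changed": sorted(c for c in shared if baseline[c] != current[c]),
    }


def kl_divergence(baseline_values, current_values, eps=1e-9):
    """KL(current || baseline) over category frequencies.

    Assumption: eps is added to every frequency so that a category unseen
    in the baseline yields a large finite score instead of infinity.
    """
    base_counts = Counter(baseline_values)
    curr_counts = Counter(current_values)
    score = 0.0
    for category in set(base_counts) | set(curr_counts):
        p = curr_counts[category] / len(current_values) + eps
        q = base_counts[category] / len(baseline_values) + eps
        score += p * math.log(p / q)
    return score


baseline_schema = {"id": "int", "amount": "float"}
current_schema = {"id": "str", "amount": "float", "region": "str"}
print(schema_diff(baseline_schema, current_schema))

baseline_status = ["active"] * 50 + ["closed"] * 50
current_status = ["active"] * 90 + ["closed"] * 10
print(round(kl_divergence(baseline_status, current_status), 3))
```

A score of zero means the current category frequencies match the baseline; the threshold at which a score counts as drift depends on the dataset and would need to be tuned.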

⏩ Download the data drift detector script

# 4. Validation of hierarchical and graph relations

// Pain point

Hierarchical data must remain acyclic and logically ordered. Circular reporting chains, self-referencing bills of materials, cyclic taxonomies, and parent-child inconsistencies break recursive queries and hierarchical aggregations.

// What the script does

The script validates graph and tree structures in relational data. It detects circular references in parent-child relationships, ensures hierarchy depth constraints are respected, and validates that directed acyclic graphs (DAGs) remain acyclic. It also checks for orphaned nodes and disconnected subgraphs, verifies that root nodes and leaf nodes comply with business rules, and checks the constraints on many-to-many relationships.

// How it works

The script builds a graph representation of hierarchical relationships, uses cycle detection algorithms to find circular references, and performs depth-first and breadth-first traversals to check structure. It then identifies strongly connected components in supposedly acyclic graphs, inspects the properties of nodes at each level of the hierarchy, and generates a visual representation of problematic subgraphs with specific violation details.
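For the common case where every node has at most one parent, cycle and orphan detection can be sketched by walking each node's parent chain. This is a simplified stand-in for the approach described above, not a general strongly-connected-components algorithm; the function name and the `parent_of` mapping are assumptions for the example.

```python
def find_cycles_and_orphans(parent_of):
    """Check a single-parent hierarchy.

    parent_of maps node id -> parent id (None marks a root).
    Returns (nodes that sit on a cycle, nodes whose parent does not exist).
    """
    orphans = sorted(
        node for node, parent in parent_of.items()
        if parent is not None and parent not in parent_of
    )

    in_cycle = set()
    for start in parent_of:
        path = []
        node = start
        # Walk upward until we hit a root, an unknown parent, or a repeat.
        while node is not None and node in parent_of and node not in path:
            path.append(node)
            node = parent_of[node]
        if node is not None and node in path:
            # Everything from the first repeated node onward forms the cycle.
            in_cycle.update(path[path.index(node):])
    return sorted(in_cycle), orphans


org_chart = {
    "ceo": None,
    "vp": "ceo",
    "a": "c", "b": "a", "c": "b",   # circular reporting chain
    "d": "a",                        # reports into the cycle, not part of it
    "e": "ghost",                    # parent record is missing
    "x": "x",                        # self-reference
}
cycle_nodes, orphan_nodes = find_cycles_and_orphans(org_chart)
print("in cycle:", cycle_nodes)
print("orphans:", orphan_nodes)
```

The walk is quadratic in the worst case, which is acceptable for a sketch; a production check over a large graph would more likely use a linear-time traversal with visited markers.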

⏩ Download the hierarchical relationship validator script

# 5. Checking referential integrity in tables

// Pain point

Relational data must maintain referential integrity across all foreign key relationships. Orphaned child records, references to deleted or non-existent parent records, invalid codes, and uncontrolled cascading deletes create hidden dependencies and inconsistencies. These violations corrupt joins, distort reports, break queries, and ultimately make the data unreliable and hard to trust.

// What the script does

The script checks foreign key relationships and consistency between tables. It detects orphaned records that are missing parent or child references, validates cardinality constraints, and verifies composite key uniqueness across tables. It also analyzes the effects of cascading deletes before they occur and identifies circular references across multiple tables. The script works with multiple data files at once to validate relationships.

// How it works

The script loads the master dataset and all related reference tables, checks whether foreign key values exist in parent tables, and detects both orphaned parent records and orphaned child records. It checks cardinality rules to enforce one-to-one or one-to-many constraints and validates composite keys spanning multiple columns. The script also generates comprehensive reports showing all referential integrity violations, the number of affected rows, and the specific foreign key values that failed validation.
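The core foreign key check can be sketched over plain lists of row dictionaries. The function names, the report keys, and the `one_to_one` flag are assumptions for illustration; keys are passed as lists of column names so that composite keys work the same way as single-column ones.

```python
from collections import Counter


def key_of(row, columns):
    """Build a hashable (possibly composite) key from the given columns."""
    return tuple(row[column] for column in columns)


def check_references(child_rows, parent_rows, fk_cols, pk_cols, one_to_one=False):
    """Compare foreign key values in child_rows against keys in parent_rows."""
    parent_keys = {key_of(row, pk_cols) for row in parent_rows}
    ref_counts = Counter(key_of(row, fk_cols) for row in child_rows)

    return {
        # Positions of child rows whose foreign key matches no parent.
        "orphaned_child_rows": [
            i for i, row in enumerate(child_rows)
            if key_of(row, fk_cols) not in parent_keys
        ],
        "unmatched_fk_values": sorted(set(ref_counts) - parent_keys),
        "parents_without_children": sorted(parent_keys - set(ref_counts)),
        # Only reported when the relationship is declared one-to-one.
        "cardinality_violations": sorted(
            key for key, count in ref_counts.items() if one_to_one and count > 1
        ),
    }


customers = [{"id": 1}, {"id": 2}, {"id": 3}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 1},
    {"order_id": 12, "customer_id": 4},  # no such customer
]
report = check_references(orders, customers, ["customer_id"], ["id"])
for name, values in report.items():
    print(name, values)
```

Whether a parent without children counts as a violation depends on the business rule, so the sketch reports it separately rather than treating it as an error.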

⏩ Download the referential integrity check script

# Summary

Advanced data validation goes beyond null and duplicate checking. These five scripts help detect semantic violations, timing anomalies, structural drift, and referential integrity violations that basic quality checks miss completely.

Start with the script that solves your most pressing problem. Configure baseline profiles and validation rules for your specific domain. Run validation as part of your data pipeline to catch issues at the ingestion stage, not the analysis stage. Set alert thresholds appropriate to your use case.

Happy validating!

Bala Priya C is a software developer and technical writer from India. She likes working at the intersection of mathematics, programming, data analytics, and content creation. Her areas of interest and expertise include DevOps, data analytics, and natural language processing. She enjoys reading, writing, coding, and coffee! She is currently working on learning and sharing her knowledge with the developer community by writing tutorials, guides, reviews, and more. Bala also creates engaging resource overviews and coding tutorials.
