Tuesday, March 10, 2026

5 Useful Python Scripts for Busy Data Engineers



# Introduction

As a data engineer, you are likely responsible (at least partially) for your organization’s data infrastructure. You build pipelines, maintain databases, keep data flowing smoothly, and troubleshoot when things inevitably break. But how much time do you spend manually checking pipeline status, verifying that data loaded correctly, or monitoring system performance?

If you’re honest, it probably takes up a huge chunk of your time. Data engineers spend many hours of their workday on operational tasks like monitoring jobs, validating schemas, tracking data lineage, and responding to alerts, when they could be designing better systems.

This article discusses five Python scripts specifically designed to solve the repetitive infrastructure and operational tasks that consume your valuable engineering time.

🔗 Link to code on GitHub

# 1. Pipeline health monitor

Pain point: You have dozens of ETL jobs running on different schedules. Some run hourly, others daily or weekly. Verifying that they all succeeded means logging into various systems, reading logs, checking timestamps, and piecing together what’s actually happening. By the time you notice a job has failed, downstream processes are already broken.

What the script does: Monitors all data pipelines in one place, tracks execution status, alerts on failures or delays, and maintains a historical log of job performance. Provides a consolidated status dashboard showing what’s working, what’s failing, and what’s taking longer than expected.

How it works: The script connects to a job orchestration system (e.g., Airflow) or reads log files, extracts execution metadata, compares it with expected schedules and run times, and flags anomalies. It calculates success rates and average run durations and identifies failure patterns. It can send notifications via email or Slack when problems are detected.
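The core check can be sketched in a few lines. Everything here is illustrative: the `JOBS` list stands in for execution metadata you would actually pull from your orchestrator or log files, and `check_jobs` is a hypothetical helper that flags failed or overdue jobs:

```python
from datetime import datetime, timedelta

# Illustrative job records; in practice these would come from your
# orchestrator's metadata (e.g., the Airflow REST API) or parsed logs.
JOBS = [
    {"name": "sales_etl", "last_run": datetime(2026, 3, 10, 6, 0),
     "status": "success", "interval": timedelta(hours=1)},
    {"name": "user_sync", "last_run": datetime(2026, 3, 9, 2, 0),
     "status": "failed", "interval": timedelta(days=1)},
    {"name": "weekly_rollup", "last_run": datetime(2026, 3, 8, 0, 0),
     "status": "success", "interval": timedelta(weeks=1)},
]

def check_jobs(jobs, now, grace=timedelta(minutes=15)):
    """Flag jobs that failed or have not run within their expected interval."""
    alerts = []
    for job in jobs:
        if job["status"] != "success":
            alerts.append((job["name"], "failed"))
        elif now - job["last_run"] > job["interval"] + grace:
            alerts.append((job["name"], "overdue"))
    return alerts

if __name__ == "__main__":
    for name, problem in check_jobs(JOBS, now=datetime(2026, 3, 10, 8, 0)):
        # A real monitor would post these to Slack or send an email here.
        print(f"ALERT: {name} is {problem}")
```

The `grace` window avoids alerting on jobs that are merely a few minutes late due to queueing or scheduler jitter.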

Download the pipeline health monitor script

# 2. Schema validator and change detector

Pain point: Your upstream data sources change without warning. A column is renamed, a data type changes, or a new required field appears. Your pipeline breaks, downstream reports fail, and you scramble to figure out what changed and where. Schema drift is one of the most common failure modes in data pipelines.

What the script does: Automatically compares current table schemas with baseline definitions, detecting any changes to column names, data types, constraints, or structures. Generates detailed change reports and can enforce schema contracts to prevent invalid changes from propagating through the system.

How it works: The script reads schema definitions from databases or data files, compares them to stored baseline schemas (kept in JSON format), identifies additions, deletions, and modifications, and logs all changes with timestamps. It can validate incoming data against expected patterns before processing and reject data that does not match.
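A minimal version of the comparison step might look like this. The schemas are toy dictionaries mapping column names to type names, and `diff_schemas` is a hypothetical helper; a real script would load the baseline from a JSON file and the current schema from the database catalog:

```python
import json

def diff_schemas(baseline, current):
    """Return added, removed, and type-changed columns between two schemas."""
    added = sorted(set(current) - set(baseline))
    removed = sorted(set(baseline) - set(current))
    changed = sorted(col for col in set(baseline) & set(current)
                     if baseline[col] != current[col])
    return {"added": added, "removed": removed, "changed": changed}

# Illustrative schemas: column name -> type name.
baseline = {"id": "integer", "email": "text", "created_at": "timestamp"}
current = {"id": "bigint", "email": "text", "signup_ts": "timestamp"}

report = diff_schemas(baseline, current)
print(json.dumps(report, indent=2))
```

Persisting each day’s schema snapshot alongside the diff report gives you a timestamped history of drift for free.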

Download the schema validation script

# 3. Data lineage tracker

Pain point: Someone asks, “Where does this field come from?” or “What happens if we change this source table?” and you don’t have a good answer. You review SQL scripts, ETL code, and documentation (if any) trying to trace the flow of data. Understanding relationships and analyzing impact takes hours or days instead of minutes.

What the script does: Automatically maps data lineage by parsing SQL queries, ETL scripts, and transformation logic. Shows the full path from the source systems to the final tables, including all transformations applied. Generates visual dependency graphs and impact analysis reports.

How it works: The script uses SQL parsing libraries to extract table and column references from queries, builds a directed dependency graph, traces the transformation logic applied at each step, and visualizes the full lineage. It can perform impact analysis showing which downstream objects are affected by changes to a given source.
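The graph-building and traversal steps can be sketched as follows. This is a simplified illustration: the `QUERIES` mapping and table names are made up, and a naive regex stands in for a proper SQL parser (a real script would use a parsing library such as sqlglot, since regexes miss subqueries, CTEs, and quoted identifiers):

```python
import re
from collections import defaultdict

# Toy ETL definitions: target table -> the SQL that builds it.
QUERIES = {
    "stg_orders": "SELECT * FROM raw_orders",
    "stg_users": "SELECT * FROM raw_users",
    "fct_sales": "SELECT o.id, u.name FROM stg_orders o JOIN stg_users u ON o.user_id = u.id",
}

def build_graph(queries):
    """Map each source table to the set of targets that read from it."""
    downstream = defaultdict(set)
    for target, sql in queries.items():
        for src in re.findall(r"(?:FROM|JOIN)\s+(\w+)", sql, re.IGNORECASE):
            downstream[src].add(target)
    return downstream

def impacted(graph, source):
    """Graph traversal: collect everything downstream of `source`."""
    seen, queue = set(), [source]
    while queue:
        node = queue.pop()
        for child in graph.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)

print(impacted(build_graph(QUERIES), "raw_orders"))
```

Changing `raw_orders` here surfaces both `stg_orders` and `fct_sales` as affected, which is exactly the "what breaks if we change this?" answer the pain point asks for.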

Download the data lineage tracking script

# 4. Database performance analyzer

Pain point: Queries are running slower than usual. Your tables are getting bloated. Indexes may be missing or unused. You suspect performance issues, but identifying the root cause means manually running diagnostics, analyzing query plans, checking table statistics, and interpreting arcane metrics. It’s time-consuming work.

What the script does: Automatically analyzes database performance by identifying slow queries, missing indexes, table bloat, unused indexes, and suboptimal configurations. Generates actionable recommendations with estimated performance impact and provides the exact SQL needed to implement fixes.

How it works: The script queries the database’s system catalogs and performance views (pg_stats for PostgreSQL, information_schema for MySQL, etc.), analyzes query execution statistics, identifies tables with high sequential scan rates that indicate missing indexes, detects bloated tables requiring maintenance, and generates optimization recommendations ranked by potential impact.
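The missing-index heuristic can be illustrated like this. The `TABLE_STATS` rows and thresholds are invented for the example; in a real run they would come from a query against PostgreSQL’s `pg_stat_user_tables` view:

```python
# Illustrative table statistics; a real script would fetch these from
# pg_stat_user_tables (seq_scan, idx_scan) and pg_class (reltuples).
TABLE_STATS = [
    {"table": "orders", "seq_scan": 50_000, "idx_scan": 120, "rows": 2_000_000},
    {"table": "users", "seq_scan": 10, "idx_scan": 90_000, "rows": 500_000},
]

def recommend_indexes(stats, min_rows=10_000, seq_ratio=0.5):
    """Suggest index candidates: large tables read mostly via sequential scans."""
    recs = []
    for s in stats:
        total = s["seq_scan"] + s["idx_scan"]
        if s["rows"] >= min_rows and total and s["seq_scan"] / total > seq_ratio:
            recs.append(f"Consider an index on {s['table']} "
                        f"({s['seq_scan']} seq scans vs {s['idx_scan']} index scans)")
    return recs

for rec in recommend_indexes(TABLE_STATS):
    print(rec)
```

The `min_rows` cutoff keeps small lookup tables out of the report, since sequential scans on tiny tables are usually cheaper than index lookups anyway.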

Download the database performance analyzer script

# 5. Data quality assurance framework

Pain point: You need to ensure data quality in your pipelines. Is the row count as expected? Are there unexpected null values? Are foreign key relationships intact? You write these checks by hand for each table, scattered across scripts, with no consistent structure or reporting. When checks fail, you get vague errors with no context.

What the script does: Provides a framework for defining data quality assertions as code: row count thresholds, uniqueness constraints, referential integrity, value ranges, and custom business rules. It automatically runs all assertions, generates detailed error reports with context, and integrates with pipeline orchestration so that jobs fail when quality checks fail.

How it works: The script uses a declarative assertion syntax, where you define quality rules in plain Python or YAML. It runs all assertions against your data, collects results with detailed error information (which rows failed, which values were invalid), generates comprehensive reports, and can be wired into pipeline DAGs as quality gates that prevent bad data from spreading.
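The declarative style can be sketched with plain closures. The check names, sample rows, and helper functions (`not_null`, `row_count_at_least`, `run_checks`) are all hypothetical, chosen to show the shape of the framework rather than any particular library’s API:

```python
def not_null(column):
    """Assertion factory: fail if any row has a NULL in `column`."""
    def check(rows):
        bad = [i for i, r in enumerate(rows) if r.get(column) is None]
        return (not bad, f"{column} is NULL in rows {bad}" if bad else "ok")
    return check

def row_count_at_least(n):
    """Assertion factory: fail if fewer than n rows arrived."""
    def check(rows):
        return (len(rows) >= n, f"expected >= {n} rows, got {len(rows)}")
    return check

def run_checks(rows, checks):
    """Run every assertion and collect (name, passed, detail) results."""
    return [(name, *check(rows)) for name, check in checks]

# Illustrative data and rule set.
ROWS = [{"id": 1, "email": "a@x.com"}, {"id": 2, "email": None}]
CHECKS = [("email not null", not_null("email")),
          ("at least 1 row", row_count_at_least(1))]

for name, passed, detail in run_checks(ROWS, CHECKS):
    print(f"{'PASS' if passed else 'FAIL'} {name}: {detail}")
```

Because each result carries the failing row indices and values, a failed quality gate tells you exactly what to look at instead of raising a vague error.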

Download the data quality assurance framework script

# Summary

These five scripts focus on core operational challenges that data engineers consistently encounter. Here’s a quick summary of how these scripts work:

  • The pipeline health monitor provides centralized visibility into all data jobs
  • The schema validator catches breaking changes before they reach your pipelines
  • The data lineage tracker maps data flow and simplifies impact analysis
  • The database performance analyzer identifies bottlenecks and optimization opportunities
  • The data quality assurance framework enforces data integrity through automated checks

Each script solves a specific problem and can be used on its own or integrated with your existing toolkit. Pick one script, test it in a non-production environment first, adapt it to your setup, and gradually fold it into your workflow.

Happy data engineering!

Bala Priya C is a software developer and technical writer from India. She likes working at the intersection of mathematics, programming, data analytics, and content creation. Her areas of interest and expertise include DevOps, data analytics, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she is learning and sharing her knowledge with the developer community by writing tutorials, guides, reviews, and more. Bala also creates engaging resource overviews and coding tutorials.
