# Introduction
As a data scientist or analyst, you know that understanding data is the foundation of any successful project. Before you can build models, create dashboards, or generate insights, you need to know what you’re working with. However, exploratory data analysis (EDA) is repetitive and time-consuming.
For each new dataset, you’ll likely write almost the same code to check data types, calculate statistics, plot distributions, and more. You need a systematic, automated approach to understand your data quickly and accurately. This article presents five Python scripts that automate the most crucial and time-consuming parts of exploratory data analysis.
📜 The scripts can be found on GitHub.
# 1. Profiling your data
// The pain point
When you first open a dataset, you need to understand its basic characteristics. You write code to check data types, count unique values, identify missing data, calculate memory usage, and get summary statistics. You do this for every single column, writing the same boilerplate code for each new dataset. The initial profiling alone can take an hour or more for complex datasets.
// What the script does
Automatically generates a complete profile of your dataset, including data types, missing value patterns, cardinality analysis, memory usage, and statistical summaries for all columns. Detects potential problems such as high-cardinality categorical variables, constant columns, and data type mismatches. Creates a structured report that gives you a complete picture of your data in seconds.
// How it works
The script iterates over each column, determines its type and calculates the appropriate statistics:
- For numeric columns, calculates the mean, median, standard deviation, quartiles, skewness, and kurtosis
- For categorical columns, identifies unique values, mode, and frequency distributions
It flags potential data quality issues such as columns with >50% missing values, categorical columns with too many unique values, and zero-variance columns. All results are compiled into an easy-to-read DataFrame.
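A minimal sketch of how such a profiler might look, assuming pandas. The function name and the thresholds (50% missing, 50% unique values for high cardinality) are illustrative choices, not taken from the downloadable script:

```python
import pandas as pd


def profile_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Build a per-column profile: dtype, missing %, cardinality, and flags."""
    rows = []
    for col in df.columns:
        s = df[col]
        info = {
            "column": col,
            "dtype": str(s.dtype),
            "missing_pct": round(s.isna().mean() * 100, 1),
            "n_unique": s.nunique(dropna=True),
            "memory_kb": round(s.memory_usage(deep=True) / 1024, 2),
        }
        if pd.api.types.is_numeric_dtype(s):
            info.update(mean=s.mean(), median=s.median(), std=s.std(),
                        skew=s.skew(), kurtosis=s.kurt())
        else:
            info["mode"] = s.mode().iloc[0] if not s.mode().empty else None
        # Flag common data quality issues (thresholds are illustrative)
        flags = []
        if info["missing_pct"] > 50:
            flags.append("high_missing")
        if s.nunique(dropna=True) <= 1:
            flags.append("constant")
        if not pd.api.types.is_numeric_dtype(s) and info["n_unique"] > 0.5 * len(s):
            flags.append("high_cardinality")
        info["flags"] = ",".join(flags)
        rows.append(info)
    return pd.DataFrame(rows).set_index("column")
```

Running `profile_dataframe(df)` on any DataFrame returns one report row per column, with numeric and categorical columns getting their respective statistics.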
⏩ Download the data profiler script
# 2. Analyzing and visualizing distributions
// The pain point
To select appropriate transformations and models, you need to understand how your data is distributed. You plot histograms, box plots, and density curves for numerical features, and bar charts for categorical features. Generating these visualizations manually means writing plotting code for each variable, customizing layouts, and managing multiple figure windows. For datasets with dozens of features, this becomes cumbersome.
// What the script does
Generates comprehensive visualizations of the distribution of every feature in a dataset. Creates histograms with kernel density estimates for numeric features, box plots for outliers, bar plots for categorical features, and QQ plots to assess normality. Detects and highlights skewed distributions, multimodal patterns, and potential outliers. Organizes all plots into a clear grid layout with automatic scaling.
// How it works
The script separates numeric and categorical columns, then generates appropriate visualizations for each type:
- For numerical features, creates subplots showing histograms with overlaid Kernel Density Estimation (KDE) curves annotated with skewness and kurtosis values
- For categorical features, it generates sorted bar charts showing the frequencies of the values
The script automatically determines optimal bin sizes, handles outliers, and uses statistical tests to flag distributions that deviate significantly from normality. All visualizations are generated in a consistent style and can be exported if necessary.
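The plotting itself is routine matplotlib/seaborn code, so here is a sketch of just the computational side: the statistics a histogram would be annotated with, using the Freedman–Diaconis rule for bin width and a simple skewness threshold. The function name and the |skew| > 1 cutoff are assumptions for illustration, not the exact logic of the downloadable script:

```python
import numpy as np
import pandas as pd


def distribution_summary(s: pd.Series) -> dict:
    """Stats used to annotate a feature's histogram: bins, skewness, flags."""
    x = s.dropna().to_numpy(dtype=float)
    # Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3)
    q75, q25 = np.percentile(x, [75, 25])
    iqr = q75 - q25
    width = 2 * iqr / len(x) ** (1 / 3) if iqr > 0 else None
    n_bins = int(np.ceil((x.max() - x.min()) / width)) if width else 10
    skew = float(pd.Series(x).skew())
    kurt = float(pd.Series(x).kurt())
    return {
        "n_bins": n_bins,
        "skewness": round(skew, 3),
        "kurtosis": round(kurt, 3),
        "flag_skewed": bool(abs(skew) > 1),  # heuristic threshold
    }
```

In a full script, the returned values would feed directly into the subplot titles and the skewness warnings mentioned above.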
⏩ Download the distribution analyzer script
# 3. Exploring correlations and relationships
// The pain point
Understanding the relationships between variables is crucial but tedious. You have to calculate correlation matrices, create scatter plots for promising pairs, identify multicollinearity issues, and detect nonlinear relationships. Doing this manually means generating several dozen charts, calculating various correlation coefficients (e.g., Pearson, Spearman, and Kendall), and trying to spot patterns in correlation heatmaps. The process is slow, and you often miss important relationships.
// What the script does
Analyzes the relationships between all variables in a data set. Generates correlation matrices using multiple methods, creates scatterplots for highly correlated pairs, detects multicollinearity problems for regression modeling, and identifies non-linear relationships that may be missed by linear correlation. Creates visualizations that allow you to drill down into specific relationships and flags potential problems such as perfect correlations or redundant features.
// How it works
The script computes correlation matrices using Pearson, Spearman, and Kendall methods to capture different types of relationships. It generates an annotated heatmap highlighting strong correlations, then creates detailed scatter plots for feature pairs that exceed correlation thresholds.
For multicollinearity detection, it calculates Variance Inflation Factors (VIF) and identifies groups of highly intercorrelated features. The script also computes mutual information scores to capture non-linear relationships that correlation coefficients miss.
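A sketch of the core computation, assuming pandas and numpy. For standardized features, VIF values can be read off the diagonal of the inverse correlation matrix, a common shortcut used here instead of fitting one regression per feature; the function name and the 0.8 threshold are illustrative:

```python
import numpy as np
import pandas as pd


def correlation_report(df: pd.DataFrame, threshold: float = 0.8):
    """Correlation matrices, VIF scores, and highly correlated pairs."""
    num = df.select_dtypes("number")
    # Three correlation methods capture different kinds of relationships
    mats = {m: num.corr(method=m) for m in ("pearson", "spearman", "kendall")}
    # VIF: diagonal of the inverse correlation matrix (standardized features)
    vif = pd.Series(np.diag(np.linalg.inv(mats["pearson"].to_numpy())),
                    index=num.columns, name="VIF")
    # Pairs whose absolute Pearson correlation exceeds the threshold
    corr = mats["pearson"].abs()
    pairs = [(a, b) for i, a in enumerate(num.columns)
             for b in num.columns[i + 1:] if corr.loc[a, b] > threshold]
    return mats, vif, pairs
```

The flagged pairs are the ones a full script would draw scatter plots for, and a VIF well above 5 or 10 is the usual signal of a multicollinearity problem.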
⏩ Download the correlation explorer script
# 4. Detecting and analyzing outliers
// The pain point
Outliers can impact analysis and models, but identifying them requires multiple approaches. You should check for outliers using various statistical methods such as interquartile range (IQR), Z-score, and isolation forests, and then visualize them using boxplots and scatterplots. You then need to understand their impact on your data and decide whether they are true anomalies or data errors. Manually implementing and comparing multiple outlier detection methods is time-consuming and error-prone.
// What the script does
Detects outliers using multiple statistical and machine learning methods, compares the results of different methods to identify consensus outliers, generates visualizations showing outlier locations and patterns, and provides detailed reporting on outlier features. It helps you understand whether outliers are isolated data points or part of significant clusters and estimates their potential impact on further analysis.
// How it works
The script uses multiple outlier detection algorithms:
- IQR method for univariate outliers
- Mahalanobis distance for multivariate outliers
- Z-score and modified Z-score for statistical outliers
- Isolation Forest for complex anomaly patterns
Each method produces a set of flagged points, and the script computes a consensus score showing how many methods flagged each observation. It generates side-by-side visualizations comparing the detection methods, highlights points flagged by multiple methods, and provides detailed outlier statistics. The script also performs a sensitivity analysis showing how outliers affect key statistics such as means and correlations.
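A sketch of the consensus idea using the three simplest voters (IQR, Z-score, modified Z-score); Mahalanobis distance and Isolation Forest (e.g., scikit-learn's `IsolationForest`) would be added as further columns in the same way. The function name and vote layout are illustrative, not the downloadable script's exact design:

```python
import pandas as pd


def outlier_consensus(s: pd.Series) -> pd.DataFrame:
    """Flag outliers with three methods and count votes per observation."""
    x = s.astype(float)
    # IQR method: outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    q1, q3 = x.quantile([0.25, 0.75])
    iqr = q3 - q1
    iqr_flag = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
    # Z-score: more than 3 standard deviations from the mean
    z_flag = ((x - x.mean()) / x.std()).abs() > 3
    # Modified Z-score: based on the median absolute deviation (robust)
    mad = (x - x.median()).abs().median()
    mz_flag = (0.6745 * (x - x.median()) / mad).abs() > 3.5
    out = pd.DataFrame({"iqr": iqr_flag, "zscore": z_flag, "mod_zscore": mz_flag})
    out["votes"] = out.sum(axis=1)
    return out
```

Note how the methods can disagree: one extreme value inflates the mean and standard deviation enough that the plain Z-score may miss it (the classic masking effect), while the robust IQR and modified Z-score still catch it; the consensus column makes such disagreements visible.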
⏩ Download the outlier detection script
# 5. Analysis of missing data patterns
// The pain point
Missing data is rarely random, and understanding patterns of missingness is essential for selecting the appropriate handling strategy. You need to determine which columns have missing data, detect and visualize missingness patterns, and understand the relationships between missing values and other variables. Performing this analysis manually requires custom code for each dataset and advanced visualization techniques.
// What the script does
Analyzes missing data patterns across the entire dataset. Identifies columns with missing values, calculates missingness rates, and detects correlations in missingness patterns. It then evaluates the types of missingness – missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR) – and generates visualizations showing the patterns of missingness. It also recommends handling strategies based on the detected patterns.
// How it works
The script creates a binary missing matrix indicating where values are missing, and then analyzes this matrix to detect patterns. Calculates missingness correlations to identify features that are commonly missing together, uses statistical tests to evaluate missingness mechanisms, and generates heat maps and bar charts showing patterns of missingness. For each column with missing data, it examines the relationships between missing values and other variables using statistical tests and correlation analysis.
Based on the detected patterns, the script recommends appropriate imputation strategies:
- Mean/median imputation for MCAR data
- Predictive imputation for MAR data
- Domain-specific approaches for MNAR data
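The first two steps above (the binary missing matrix and the co-missingness correlations) can be sketched as follows, assuming pandas; the statistical tests for the missingness mechanism and the visualizations are omitted, and the function name is illustrative:

```python
import pandas as pd


def missingness_report(df: pd.DataFrame):
    """Binary missing matrix, per-column rates, and co-missingness correlation."""
    mask = df.isna()                      # True where a value is missing
    rates = mask.mean().round(3)          # fraction missing per column
    # Correlate the 0/1 missingness indicators of columns that have
    # any missing values: high correlation = "often missing together"
    with_missing = mask.loc[:, mask.any()]
    co_missing = with_missing.astype(int).corr()
    return mask, rates, co_missing
```

A co-missingness correlation near 1 between two columns suggests their values go missing together, which is evidence against MCAR and a hint to investigate a shared cause.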
⏩ Download the missing data analyzer script
# Final remarks
These five scripts address the fundamental data exploration challenges that every data scientist faces.
You can use each script independently for specific exploration tasks or combine them into a complete EDA pipeline. The result is a systematic, repeatable approach to exploratory data analysis that can save you hours or days on every project while ensuring you don’t miss crucial information about your data.
Have fun exploring!
Bala Priya C is a software developer and technical writer from India. She likes working at the intersection of mathematics, programming, data analytics, and content creation. Her areas of interest and expertise include DevOps, data analytics, and natural language processing. She enjoys reading, writing, coding, and coffee! She is currently learning and sharing her knowledge with the developer community by writing tutorials, guides, and reviews. Bala also creates compelling resource overviews and coding tutorials.
