# Introduction
Anyone who has spent enough time analyzing data learns it sooner or later: the golden rule of downstream machine learning modeling, known as garbage in, garbage out (GIGO).
For example, feeding a linear regression model with highly collinear data or running ANOVA tests on heteroskedastic variances is the perfect recipe… for ineffective models that won’t learn properly.
Exploratory data analysis (EDA) has a lot to offer in terms of visualizations such as scatter plots and histograms, but these are not sufficient when we need rigorous data validation based on the mathematical assumptions required by further analyses or models. Pingouin helps achieve this by bridging the gap between two well-known data science and statistics libraries: SciPy and pandas. Moreover, it can be a great ally in building robust, automated EDA pipelines. In this article, you’ll learn how to build an end-to-end pipeline for rigorous statistical EDA by checking several essential data properties.
# Initial setup
Let’s start by installing Pingouin in our Python environment (and pandas if you don’t already have it):
!pip install pingouin pandas
Then it’s time to import these key libraries and load our data. As an example of an open dataset, we will use one containing samples of wine properties and quality.
import pandas as pd
import pingouin as pg
# Loading the wine dataset from an open dataset GitHub repository
url = "https://raw.githubusercontent.com/gakudo-ai/open-datasets/refs/heads/main/wine-quality-white-and-red.csv"
df = pd.read_csv(url)
# Displaying the first few rows to understand our features
df.head()
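Before running any statistical tests, it can also help to glance at the dataset’s overall shape, data types, and missing values, especially if the goal is an automated EDA pipeline. A minimal sketch using standard pandas calls:
# Quick structural overview of the dataset
print(df.shape)          # number of rows and columns
print(df.dtypes)         # data type of each feature
print(df.isna().sum())   # count of missing values per column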
# Checking for univariate normality
The first of the specific exploratory analyses we will perform involves checking univariate normality. Many conventional algorithms for training machine learning models, as well as statistical tests such as ANOVA and t-tests, assume that continuous variables are normally distributed, or Gaussian. Pingouin’s pg.normality() function helps check this with the Shapiro-Wilk test across an entire data frame:
# Selecting a subset of continuous features for normality checks
features = ['fixed acidity', 'volatile acidity', 'citric acid', 'pH', 'alcohol']
# Running the normality test
normality_results = pg.normality(df[features])
print(normality_results)
Output:
W pval normal
fixed acidity 0.879789 2.437973e-57 False
volatile acidity 0.875867 6.255995e-58 False
citric acid 0.964977 5.262332e-37 False
pH 0.991448 2.204049e-19 False
alcohol 0.953532 2.918847e-41 False
It appears that none of the selected numeric features satisfies normality. This is by no means a bad thing about the data; it’s just part of its characteristics. We’re simply getting the message that, in later stages of data preprocessing beyond our EDA, we might consider applying transformations such as the logarithmic or Box-Cox transformation, which make the raw data look “more normal” and therefore more suitable for models that assume normality.
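As a quick sketch of that idea, here is how one feature could be log-transformed and the normality test re-run. The choice of 'alcohol' is arbitrary and only illustrative; a Box-Cox transform via scipy.stats.boxcox would be an alternative for strictly positive features:
import numpy as np
# Log-transforming one positive-valued feature and re-checking normality
log_alcohol = np.log1p(df['alcohol'])
print(pg.normality(log_alcohol.to_frame()))
A transformed feature will not necessarily pass the test, but comparing the W statistic before and after gives a sense of whether the transformation helps.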
# Checking for multivariate normality
Similarly, assessing normality not feature by feature, but taking into account the interactions between features, is another intriguing aspect to check. Let’s see how to test for multivariate normality, a key requirement in techniques such as multivariate ANOVA (MANOVA):
# Henze-Zirkler multivariate normality test
multivariate_normality_results = pg.multivariate_normality(df[features])
print(multivariate_normality_results)
Output:
HZResults(hz=np.float64(23.72107048442373), pval=np.float64(0.0), normal=False)
And guess what: an output like this means that multivariate normality is also not met. If you are going to train a machine learning model on this dataset, non-parametric tree-based models such as gradient boosting and random forests may be a more robust alternative than parametric weight-based models such as SVMs, linear regression, and so on.
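This result can also be consumed programmatically, which is handy in an automated EDA pipeline. A minimal sketch, assuming the returned HZResults object unpacks like a named tuple with the hz, pval and normal fields shown in the printed output above:
# Using the multivariate normality result as an automated check
hz, pval, is_normal = multivariate_normality_results
if not is_normal:
    print(f"Multivariate normality rejected (HZ={hz:.2f}, p={pval:.3g}); "
          "consider transformations or tree-based models.")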
# Checking for homoscedasticity
Then comes a challenging word for a fairly basic concept: homoscedasticity. It refers to equal or constant variance, for instance the variance of a variable across groups or of the errors in a model’s predictions, and is interpreted as a measure of reliability. We’ll test this property (sorry, that name is too much of a mouthful to write out again!) using Pingouin’s implementation of Levene’s test as follows:
# Levene's test for equal variances across groups
# 'dv' is the dependent (target) variable, 'group' is the categorical grouping variable
homoscedasticity_results = pg.homoscedasticity(data=df, dv='alcohol', group='quality')
print(homoscedasticity_results)
Output:
W pval equal_var
levene 66.338684 2.317649e-80 False
Since we get False once again, we have the so-called heteroscedasticity problem, which must be taken into account in further analyses. One possible approach is to use robust standard errors when training regression models.
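As a sketch of that idea (outside the Pingouin workflow itself), statsmodels can fit an ordinary least squares model with heteroscedasticity-consistent standard errors such as HC3; the choice of predictors below is purely illustrative:
import statsmodels.api as sm
# OLS regression with heteroscedasticity-consistent (HC3) standard errors
X = sm.add_constant(df[['fixed acidity', 'volatile acidity', 'pH']])
robust_model = sm.OLS(df['alcohol'], X).fit(cov_type='HC3')
print(robust_model.summary())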
# Checking for sphericity
Another statistical property to analyze is sphericity, which determines whether the variances of the differences between all possible pairs of conditions are equal. Testing this property is usually desirable before conducting principal component analysis (PCA) for dimensionality reduction, because it helps us understand whether there are correlations between variables; PCA becomes rather useless if there are none:
# Mauchly's sphericity test
sphericity_results = pg.sphericity(df[features])
print(sphericity_results)
Output:
SpherResults(spher=False, W=np.float64(0.004437706589942777), chi2=np.float64(35184.26583883276), dof=9, pval=np.float64(0.0))
It looks like we’ve chosen a particularly unruly dataset! But don’t worry: this article is intentionally designed to focus on the EDA process and help you identify many of these data issues. Ultimately, detecting them and knowing what to do about them before moving forward with machine learning analysis is much better than building a potentially flawed model. There is a catch in this case: we have a p-value of 0.0, which means that the null hypothesis of an identity correlation matrix is rejected, i.e. there are significant correlations between the variables. So if we have many features and want to reduce dimensionality, it might be a good idea to use PCA.
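Since the sphericity test suggests meaningful correlations between the variables, here is a minimal sketch of applying PCA to the standardized features with scikit-learn; the number of components is an arbitrary choice for illustration:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Standardizing the features, fitting PCA, and inspecting explained variance
scaled = StandardScaler().fit_transform(df[features])
pca = PCA(n_components=3)
components = pca.fit_transform(scaled)
print(pca.explained_variance_ratio_)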
# Checking for multicollinearity
Finally, we will check for multicollinearity: a property that indicates whether there are highly correlated predictors. This can be an undesirable property in interpretable models such as linear regressors. Let’s check it out:
# Calculating a correlation matrix annotated with significance levels
correlation_matrix = pg.rcorr(df[features], method='pearson')
print(correlation_matrix)
Output matrix:
fixed acidity volatile acidity citric acid pH alcohol
fixed acidity - *** *** *** ***
volatile acidity 0.219 - *** *** **
citric acid 0.324 -0.378 - ***
pH -0.253 0.261 -0.33 - ***
alcohol -0.095 -0.038 -0.01 0.121 -
While pandas’ corr() can also be used, the Pingouin equivalent uses asterisks to indicate the level of statistical significance of each correlation (* for p < 0.05, ** for p < 0.01, *** for p < 0.001).
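To quantify multicollinearity beyond pairwise correlations, a common complementary technique is the variance inflation factor (VIF); a minimal sketch using statsmodels, with the usual (rule-of-thumb) warning threshold of roughly 5 to 10:
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
# Variance inflation factor per predictor (higher means more collinear)
X = sm.add_constant(df[features])
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)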
# Summary
In this article, we used Pingouin together with pandas to run a series of statistical checks, covering univariate and multivariate normality, homoscedasticity, sphericity, and multicollinearity, that go beyond visual EDA and help decide which transformations and models suit a dataset before any modeling begins.
Ivan Palomares Carrascosa is a thought leader, writer, speaker and advisor in the fields of Artificial Intelligence, Machine Learning, Deep Learning and LLMs. He trains and advises others on the use of artificial intelligence in the real world.
