# Introduction
First, the hard truth: textbook data science usually falls apart in the real world. Concepts and techniques are taught on carefully selected, beautifully curated datasets, but as soon as we venture into real projects we are confronted with outliers, heavily skewed distributions, and unequal variances.
A previous article on building an exploratory data analysis (EDA) pipeline with Pingouin showed how to use statistical tests to detect cases where data violates common assumptions such as homoscedasticity and normality. But what if those tests fail? Throwing away the data is not the solution.
This article explores the art of applying robust statistics in data science workflows. These are statistical methods specifically designed to produce reliable and valid results even when the data does not meet classical assumptions or is riddled with outliers and noise. Taking a choose-your-own-adventure approach, we'll work through three scenarios using Python's Pingouin library to handle some of the ugliest data you may encounter in your daily work.
# Initial setup
Let’s start by installing (if necessary) and importing Pingouin and Pandas, after which we will load the wine quality dataset available here.
!pip install pingouin pandas
import pandas as pd
import pingouin as pg
# Loading our messy, real-world-like dataset, containing red and white wine samples
url = "https://raw.githubusercontent.com/gakudo-ai/open-datasets/refs/heads/main/wine-quality-white-and-red.csv"
df = pd.read_csv(url)
# Take a quick peek at what we are about to deal with
df.head()
If you’ve read the previous Pingouin article, you already know that this is a notoriously messy dataset that fails several common assumptions. We will now embark on three different “adventures,” each presenting a scenario, a core problem, and a proposed robust solution.
# Adventure 1: When the Normality Test Fails
Suppose we perform normality tests on two groups: white wine samples and red wine samples.
white_wine_alcohol = df[df['type'] == 'white']['alcohol']
red_wine_alcohol = df[df['type'] == 'red']['alcohol']
print("Normality test for White Wine Alcohol content:")
print(pg.normality(white_wine_alcohol))
print("\nNormality test for Red Wine Alcohol content:")
print(pg.normality(red_wine_alcohol))
You will find that neither distribution is normal, with extremely low p-values. Although non-normality does not by itself signal outliers or skewness, a strong deviation from normality often suggests that such features are present in the data. In this situation, comparing means with a t-test would be risky and would likely produce unreliable results.
A robust fix for this scenario is the Mann-Whitney U test. Instead of comparing means, this test compares ranks in the data – for example, sorting all the wines in a group from lowest to highest alcohol content. This rank-based approach strips outliers of their potentially harmful magnitude. Here’s how:
# Separating our two groups
red_wine = df[df['type'] == 'red']['alcohol']
white_wine = df[df['type'] == 'white']['alcohol']
# Running the resilient Mann-Whitney U test
mwu_results = pg.mwu(x=red_wine, y=white_wine)
print(mwu_results)
Output:
U-val alternative p-val RBC CLES
MWU 3829043.5 two-sided 0.181845 -0.022193 0.488903
Since the p-value is not below 0.05, there is no statistically significant difference in alcohol content between the two types of wine – and this conclusion holds even in the presence of outliers and skewness.
# Adventure 2: When the Paired T-Test Fails
Suppose you now want to compare two measurements taken on the same subject – for example, a patient’s blood sugar levels before and after an experimental drug, or two properties measured on the same bottle of wine. The key question here is how the differences between paired measurements are distributed. If those differences are not normally distributed, the standard paired t-test will produce unreliable confidence intervals.
The ideal solution in this scenario is the Wilcoxon signed-rank test: the robust sibling of the paired t-test, which works by taking the differences between the two columns and ranking their absolute values. In Pingouin this test is run with pg.wilcoxon(), passing two columns containing paired measurements on the same subject – e.g., two kinds of wine acidity.
# Run the resilient Wilcoxon signed-rank test for paired data
wilcoxon_results = pg.wilcoxon(x=df['fixed acidity'], y=df['volatile acidity'])
print(wilcoxon_results)
Result:
W-val alternative p-val RBC CLES
Wilcoxon 0.0 two-sided 0.0 1.0 1.0
The above result shows a statistically significant difference – a “perfect separation” – between the two measurements. Not only are these two wine properties different, but they also operate at completely different orders of magnitude across the dataset.
# Adventure 3: When ANOVA Fails
In this third and final adventure, we want to test whether residual sugar levels vary significantly across wine quality ratings – recall that these ratings range from 3 to 9, taking integer values, and can therefore be treated as discrete categories.
If Pingouin’s Levene test of homoscedasticity fails dramatically – for example, because the sugar variance of average wines is huge but very small for top-quality wines – a classic one-way ANOVA can produce misleading results, because that test assumes equal variances across groups.
The solution is Welch’s ANOVA, which down-weights high-variance groups instead of assuming equal variances, thus balancing the scales and making comparisons across several categories fairer. Here’s how to run this robust alternative to conventional ANOVA with Pingouin:
# Run Welch's ANOVA to compare sugar across quality ratings
welch_results = pg.welch_anova(data=df, dv='residual sugar', between='quality')
print(welch_results)
Result:
Source ddof1 ddof2 F p-unc np2
0 quality 6 54.507934 10.918282 5.937951e-08 0.008353
Even where one-way ANOVA might have struggled with unequal variances, Welch’s ANOVA yields solid conclusions. The very small p-value is clear evidence that residual sugar levels vary significantly with wine quality rating. Keep in mind, however, that sugar is only a small piece of the puzzle behind wine quality – as the low eta-squared value of 0.008 emphasizes.
# Summary
Through three sample scenarios, each pairing a messy-data problem with a robust statistical strategy, we learned that being a skilled data scientist doesn’t mean having perfect data or being able to clean it perfectly – it means knowing what to do when data gets messy for any number of reasons. Pingouin implements a variety of robust tests that help you escape the trap of false assumptions and extract statistically sound insights with little extra effort.
Ivan Palomares Carrascosa is a thought leader, writer, speaker, and advisor in the fields of Artificial Intelligence, Machine Learning, Deep Learning, and LLMs. He trains and advises others on the use of artificial intelligence in the real world.
