Thursday, April 23, 2026

All about Pyjanitor’s method chaining functionality and why it’s useful


# Introduction

Working intensively with data in Python teaches us all a critical lesson: cleaning data usually feels less like performing data analysis and more like acting as a digital janitor. Most use cases follow the same script: load the dataset, discover that many of the column names are a mess, encounter missing values, and accumulate an enormous pile of intermediate variables, only the last of which contains the final, pristine dataset.

Pyjanitor provides a cleaner approach to carrying out these steps. The library embraces the concept of method chaining to transform otherwise tedious data cleaning processes into pipelines that are elegant, effective, and readable.

This article shows and explains how to chain methods with Pyjanitor in the context of data cleaning.

# Understanding method chaining

Method chaining is not new to programming; in fact, it is a well-established coding pattern. It involves calling multiple methods on an object in sequence, all in one statement. This way you don’t have to reassign a variable after each step, because each method returns an object on which the next method in the chain is called, and so on.
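The mechanism behind this is simple: each method returns an object (often the object itself), so the next call can be attached directly. Here is a minimal sketch with a hypothetical `TextCleaner` class (the class and its methods are made up for illustration, not part of any library):

```python
class TextCleaner:
    """A tiny fluent-interface example: each method returns self."""

    def __init__(self, text):
        self.text = text

    def strip(self):
        self.text = self.text.strip()
        return self  # returning self is what makes chaining possible

    def lower(self):
        self.text = self.text.lower()
        return self

# Because every method returns the object, calls can be strung together:
result = TextCleaner("  Hello World!  ").strip().lower().text
print(result)  # hello world!
```

Built-in string methods behave the same way, which is why the chained string example below works out of the box.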

The following example captures the essence of the concept. Here is how we would apply a few simple transformations to a small piece of text (a string) using “standard” Python:

text = "  Hello World!  "
text = text.strip()
text = text.lower()
text = text.replace("world", "python")

The resulting text value will be: "hello python!".

Now, with method chaining, the same process looks like this:

text = "  Hello World!  "
cleaned_text = text.strip().lower().replace("world", "python")

Notice that the logical flow of the operations reads from left to right: all in one unified chain!

If this makes sense, you now have a clear understanding of method chaining. Let us now translate the idea into the Pandas data science context. Without chaining, a standard multi-step cleaning of a DataFrame usually looks like this:

# Traditional, step-by-step Pandas approach
import pandas as pd

df = pd.read_csv("data.csv")
df.columns = df.columns.str.lower().str.replace(' ', '_')
df = df.dropna(subset=['id'])
df = df.drop_duplicates()

As we will soon see, method chaining lets us construct a unified pipeline in which the DataFrame operations are wrapped in parentheses. Moreover, we no longer need intermediate variables holding partially cleaned DataFrames, which makes for cleaner, less error-prone code. Better yet, Pyjanitor makes the process seamless.
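In fact, the step-by-step pipeline above can already be chained with plain Pandas. Here is a minimal, self-contained sketch, with inline CSV data standing in for the hypothetical `data.csv` file:

```python
import io
import pandas as pd

# Inline CSV standing in for data.csv (made up for this example)
csv_data = io.StringIO("ID,Full Name\n1,Alice\n1,Alice\n,Bob\n2,Carol\n")

# One parenthesized chain: normalize column names, drop rows with a
# missing id, and remove duplicates -- no intermediate variables needed.
df = (
    pd.read_csv(csv_data)
      .rename(columns=lambda c: c.lower().replace(' ', '_'))
      .dropna(subset=['id'])
      .drop_duplicates()
)
print(df)  # two rows remain: Alice (id 1) and Carol (id 2)
```

This already reads better than the step-by-step version, but as we will see next, Pyjanitor pushes the idea further with purpose-built cleaning methods.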

# Enter Pyjanitor: an application example

Pandas itself offers some native support for method chaining. However, some of its core functionality was not designed strictly around this pattern. This is the main motivation behind Pyjanitor, which is based on the R package of almost the same name: janitor.

Essentially, you can think of Pyjanitor as an extension of Pandas that offers a suite of custom data cleaning operations in a chain-friendly way. Its application programming interface (API) uses a set of intuitive method names, such as clean_names(), rename_column(), and remove_empty(), that take code expressiveness to a whole new level. Plus, Pyjanitor relies entirely on free, open-source tools and runs seamlessly in cloud environments and notebooks like Google Colab.

Let’s take a full look at how method chaining works in Pyjanitor, with an example in which we first create a tiny, intentionally sloppy synthetic dataset and load it into a Pandas DataFrame object.

IMPORTANT: to avoid common, if somewhat frustrating, errors caused by incompatibilities between library versions, make sure you have the latest available versions of both Pandas and Pyjanitor by running !pip install --upgrade pyjanitor pandas first.

import numpy as np
import pandas as pd
import janitor  # importing janitor registers its methods on DataFrame

messy_data = {
    'First Name ': ['Alice', 'Bob', 'Charlie', 'Alice', None],
    '  Last_Name': ['Smith', 'Jones', 'Brown', 'Smith', 'Doe'],
    'Age': [25, np.nan, 30, 25, 40],
    'Date_Of_Birth': ['1998-01-01', '1995-05-05', '1993-08-08', '1998-01-01', '1983-12-12'],
    'Salary ($)': [50000, 60000, 70000, 50000, 80000],
    'Empty_Col': [np.nan, np.nan, np.nan, np.nan, np.nan]
}

df = pd.DataFrame(messy_data)
print("--- Messy Original Data ---")
print(df.head(), "\n")

cleaned_df = (
    df
    .rename_column('Salary ($)', 'Salary')  # 1. Manually fix tricky names BEFORE they get mangled
    .clean_names()                          # 2. Standardize everything (makes it 'salary')
    .remove_empty()                         # 3. Drop empty columns/rows
    .drop_duplicates()                      # 4. Remove duplicate rows
    .fill_empty(                            # 5. Impute missing values
        column_names=['age'],               # CAUTION: after previous steps, the name is lowercase: 'age'
        value=df['Age'].median()            # Pull the median from the original raw df
    )
    .assign(                                # 6. Create a new column using assign
        salary_k=lambda d: d['salary'] / 1000
    )
)

print("--- Cleaned Pyjanitor Data ---")
print(cleaned_df)

The above code is largely self-explanatory, with inline comments describing each method called at each stage of the chain.

Here is the output of our example, comparing the original, messy data with the cleaned version:

--- Messy Original Data ---
  First Name    Last_Name   Age Date_Of_Birth  Salary ($)  Empty_Col
0       Alice       Smith  25.0    1998-01-01       50000        NaN
1         Bob       Jones   NaN    1995-05-05       60000        NaN
2     Charlie       Brown  30.0    1993-08-08       70000        NaN
3       Alice       Smith  25.0    1998-01-01       50000        NaN
4         NaN         Doe  40.0    1983-12-12       80000        NaN 

--- Cleaned Pyjanitor Data ---
  first_name_ _last_name   age date_of_birth  salary  salary_k
0       Alice      Smith  25.0    1998-01-01   50000      50.0
1         Bob      Jones  27.5    1995-05-05   60000      60.0
2     Charlie      Brown  30.0    1993-08-08   70000      70.0
4         NaN        Doe  40.0    1983-12-12   80000      80.0

# Summary

In this article, we learned how to use the Pyjanitor library to apply method chaining and simplify otherwise tedious data cleaning processes. The result is code that is cleaner, crisper, and, in a sense, self-documenting, so that other developers (or your future self) can read the pipeline and easily understand what happens on the journey from raw to finished dataset.

Great job!

Ivan Palomares Carrascosa is a thought leader, writer, speaker, and advisor in the fields of Artificial Intelligence, Machine Learning, Deep Learning, and LLMs. He trains and advises others on the use of artificial intelligence in the real world.
