Anonymizing production data for data analytics with Mimesis

# Introduction

Production data is typically subject to significant privacy and compliance restrictions. For this reason, anonymizing such data becomes critical in virtually any real-world data analytics project involving the launch of a data-driven product, service, or solution.

Mimesis is an open-source Python library that stands out for its ability to generate realistic “fake” data with high performance. It runs locally and is free to use, which makes it a practical building block for data pipelines. In this article, you’ll learn how to use this library to anonymize sensitive production data, with a step-by-step example you can easily try in your IDE or notebook.

# Step-by-step procedure

If you have not used Mimesis before, you may need to install it in your Python environment with a command like:

pip install mimesis

Remember to add ! at the beginning of the pip command if you are working in Google Colab or a similar notebook environment.

Now we’re ready to get started! We will consider the scenario of a subscription system based on software tiers. For simplicity, we will synthetically generate a toy dataset containing data about customers and their subscription type. As you can see below, some of the variables in the dataset contain very sensitive data:

import pandas as pd

# Creation of a mock "production" customer dataset
production_data = {
    'user_id': [101, 102, 103, 104],
    'real_name': ['Alice Smith', 'Bob Jones', 'Charlie Brown', 'Diana Prince'],
    'email': ['alice.smith@corp.com', 'bjones@startup.io', 'cbrown@domain.org', 'diana@amazon.com'],
    'phone': ['555-0100', '555-0101', '555-0102', '555-0103'],
    'subscription_tier': ['Premium', 'Basic', 'Basic', 'Enterprise']
}

df = pd.DataFrame(production_data)
print("--- Original Sensitive Data ---")
print(df.head())

While subscription tiers are not necessarily sensitive in our example, names, email addresses, and phone numbers are. With Mimesis, we start by initializing a provider: a generator class specialized in a particular kind of data. Because our records describe people, we import and use the Person class, a provider that, given a locale such as English and a random seed, generates fake substitutes for real, sensitive personal data:

from mimesis import Person
from mimesis.locales import Locale

# Initializing a Person provider for English locales
person = Person(locale=Locale.EN, seed=42)

From this point on, anonymizing personally identifiable information (PII) is straightforward. Simply replace the sensitive columns, as we have defined them, with freshly generated data from the Mimesis Person provider. For each sensitive column of the DataFrame holding the dataset, we call the matching Mimesis method once per row to create a realistic substitute for that attribute:

# 1. Replacing real names with realistic fake names
df['real_name'] = [person.full_name() for _ in range(len(df))]

# 2. Replacing real emails with fake ones
df['email'] = [person.email() for _ in range(len(df))]

# 3. Replacing real phone numbers
df['phone'] = [person.telephone() for _ in range(len(df))]

# 4. Renaming the column to reflect that it is no longer the real name
df.rename(columns={'real_name': 'anon_name'}, inplace=True)

Notice how the Mimesis Person class provides dedicated methods for generating names, email addresses, and phone numbers, among others. Additionally, the name column has been renamed to reflect the fact that the names in the updated dataset are no longer real, but anonymized.
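One thing to keep in mind with this replacement step: each row receives an independently generated value, so a customer who appears in several rows would end up with a different fake name in each. If records need to stay linkable, a lookup table can help. The following is a minimal sketch, not part of Mimesis: consistent_replace and fake_name_stub are hypothetical helpers, and the stub stands in for person.full_name so the snippet runs without any dependencies.

```python
from itertools import count
from typing import Callable, Dict, Iterable, List


def consistent_replace(values: Iterable[str], make_fake: Callable[[], str]) -> List[str]:
    """Replace each value with a fake one, reusing the same fake for repeated inputs."""
    mapping: Dict[str, str] = {}  # original value -> fake value
    replaced = []
    for value in values:
        if value not in mapping:
            mapping[value] = make_fake()  # generate only the first time a value is seen
        replaced.append(mapping[value])
    return replaced


# Stand-in generator (an assumption for illustration); in the article's
# setting, person.full_name would be passed instead.
_counter = count(1)


def fake_name_stub() -> str:
    return f"Fake Person {next(_counter)}"


names = ["Alice Smith", "Bob Jones", "Alice Smith"]
print(consistent_replace(names, fake_name_stub))
# ['Fake Person 1', 'Fake Person 2', 'Fake Person 1']
```

With the DataFrame from this example, the call would look like df['real_name'] = consistent_replace(df['real_name'], person.full_name). Note that the mapping links real values to fake ones, so it should be discarded or protected like the original data. A random generator may also occasionally return the same fake value for two different inputs.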

We now check the results by looking at the transformed DataFrame. The sensitive PII fields have changed completely: they are now replaced with realistic-looking synthetic data, while the structure of the dataset and the information needed for further analysis, such as subscription_tier, remain intact.

print("\n--- Anonymized Data for Data Science Analyses ---")
print(df.head())

Output:

--- Anonymized Data for Data Science Analyses ---
   user_id         anon_name                    email            phone  
0      101    Anthony Reilly    archived1911@duck.com     +13312271333   
1      102           Kai Day    suspect2087@yahoo.com  +1-205-759-3586   
2      103  Cleveland Osborn     urgent1912@yahoo.com     +13691067988   
3      104       Zack Holder  johnson1881@example.com  +1-574-481-3676   

  subscription_tier  
0           Premium  
1             Basic  
2             Basic  
3        Enterprise

Fantastic! We’ve just used a few basic steps to anonymize several sensitive data fields commonly found in real-world production datasets, all for free, thanks to the open-source Mimesis library.
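Beyond eyeballing the printed table, a quick automated check can confirm that no original value survived the replacement. The sketch below is one possible approach under a stated assumption: the original values were saved before the columns were overwritten (the code in this article replaces them in place). The function name leaked_values is hypothetical, and plain lists are used so the snippet runs on its own; pandas columns could be passed the same way.

```python
from typing import Iterable, Set


def leaked_values(original: Iterable[str], anonymized: Iterable[str]) -> Set[str]:
    """Return the original values that still appear in the anonymized data."""
    return set(original) & set(anonymized)


# Values copied from the article's example
original_emails = ['alice.smith@corp.com', 'bjones@startup.io',
                   'cbrown@domain.org', 'diana@amazon.com']
anonymized_emails = ['archived1911@duck.com', 'suspect2087@yahoo.com',
                     'urgent1912@yahoo.com', 'johnson1881@example.com']

leaks = leaked_values(original_emails, anonymized_emails)
print(leaks if leaks else "No original emails found in the anonymized column")
```

This only detects exact matches. It would not catch partial leaks, such as a real surname that ends up embedded in another field.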

Finally, here are some best practices and observations for the anonymization process we have just discussed:

We replaced the columns directly in the DataFrame. Depending on the context, consider whether this is the right approach or whether you should store the anonymized data in a separate DataFrame if there is a risk of losing the original data.
Mimesis works in a type-consistent manner, so the generated data matches the expected data types (names, emails, and phone numbers are returned as strings).
Seeding makes the generated data reproducible across runs, which helps keep results consistent.

# Summary

In this article, we showed how to use Mimesis, a powerful Python library for generating fake yet realistic data, to transform a sensitive production dataset into a version that can be safely used for further analysis without compromising private information such as real people’s PII.

Ivan Palomares Carrascosa is a thought leader, writer, speaker, and advisor in the fields of Artificial Intelligence, Machine Learning, Deep Learning, and LLMs. He trains and advises others on the use of artificial intelligence in the real world.
