# Introduction
Synthetic data, as the name suggests, is created artificially rather than collected from real sources. It looks like real data, but avoids privacy issues and high data collection costs. This allows you to easily test software and models while running experiments to simulate post-release performance.
Although libraries like Faker, SDV, and SynthCity exist – and even Large Language Models (LLMs) are widely used to generate synthetic data – in this article, I focus on avoiding reliance on external libraries or AI tools. Instead, you’ll learn how to achieve the same results by writing your own Python scripts. This gives you a better understanding of how the dataset is shaped and how biases and errors are introduced. We’ll start with straightforward toy scripts to understand the options available. Once you have mastered these basics, you can comfortably move on to specialized libraries.
# 1. Generating straightforward random data
The easiest way to start is with tabular data. For example, if you need a fake customer dataset for an internal demo, you can run a script that generates the data in CSV format:
```python
import csv
import random
from datetime import datetime, timedelta

random.seed(42)

countries = ["Canada", "UK", "UAE", "Germany", "USA"]
plans = ["Free", "Basic", "Pro", "Enterprise"]

def random_signup_date():
    start = datetime(2024, 1, 1)
    end = datetime(2026, 1, 1)
    delta_days = (end - start).days
    return (start + timedelta(days=random.randint(0, delta_days))).date().isoformat()

rows = []
for i in range(1, 1001):
    age = random.randint(18, 70)
    country = random.choice(countries)
    plan = random.choice(plans)
    monthly_spend = round(random.uniform(0, 500), 2)
    rows.append({
        "customer_id": f"CUST{i:05d}",
        "age": age,
        "country": country,
        "plan": plan,
        "monthly_spend": monthly_spend,
        "signup_date": random_signup_date()
    })

with open("customers.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

print("Saved customers.csv")
```
Output:

```
Saved customers.csv
```
This script is straightforward: you define fields, select ranges, and write rows. The `random` module supports generating integers, floating-point values, random choices, and samples, while the `csv` module handles reading and writing row-based tabular data. This type of dataset is suitable for:
- Frontend demos
- Dashboard testing
- API development
- Learning structured query language (SQL)
- Unit testing of input pipelines
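As a sketch of the last use case, a unit-test-style check might load the generated CSV back and validate its schema before the data enters a pipeline. The `load_customers` helper below is hypothetical, standing in for your real ingestion step; the column names match the script above.

```python
import csv
import io

# Expected schema of customers.csv as produced by the script above
EXPECTED_COLUMNS = {"customer_id", "age", "country", "plan",
                    "monthly_spend", "signup_date"}

def load_customers(fileobj):
    """Hypothetical loader: read customer rows and coerce numeric fields."""
    rows = []
    for row in csv.DictReader(fileobj):
        row["age"] = int(row["age"])
        row["monthly_spend"] = float(row["monthly_spend"])
        rows.append(row)
    return rows

# A tiny in-memory sample standing in for the generated file
sample = io.StringIO(
    "customer_id,age,country,plan,monthly_spend,signup_date\n"
    "CUST00001,34,UK,Pro,129.99,2024-06-01\n"
)
customers = load_customers(sample)
assert set(customers[0]) == EXPECTED_COLUMNS
```

Because the data is synthetic, such checks can run in continuous integration without touching any real customer records.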
However, this approach has a fundamental weakness: everything is completely random, which often makes the data look flat or unnatural. An Enterprise customer might spend only $2 while a Free user spends $400, and older users behave exactly like younger users because there is no underlying structure.
In real-world scenarios, data rarely behaves this way. Instead of generating each value independently, we can introduce relationships and rules. This makes the dataset appear more realistic while remaining fully synthetic. For example:
- Enterprise customers should almost never have zero spend
- Spending ranges should depend on the chosen plan
- Older users may spend slightly more on average
- Some plans should be more common than others
Let’s add these controls to the script:
```python
import csv
import random

random.seed(42)

plans = ["Free", "Basic", "Pro", "Enterprise"]

def choose_plan():
    # Weighted selection: cheaper plans are common, Enterprise is rare
    roll = random.random()
    if roll < 0.40:
        return "Free"
    elif roll < 0.70:
        return "Basic"
    elif roll < 0.90:
        return "Pro"
    return "Enterprise"

def generate_spend(age, plan):
    # Spend ranges depend on the plan (the exact bounds are illustrative);
    # note that Enterprise customers never spend zero
    ranges = {
        "Free": (0, 15),
        "Basic": (10, 80),
        "Pro": (60, 250),
        "Enterprise": (150, 500),
    }
    low, high = ranges[plan]
    base = random.uniform(low, high)
    if age >= 40:
        base *= 1.15  # older users spend slightly more on average
    return round(base, 2)

rows = []
for i in range(1, 1001):
    age = random.randint(18, 70)
    plan = choose_plan()
    spend = generate_spend(age, plan)
    rows.append({
        "customer_id": f"CUST{i:05d}",
        "age": age,
        "plan": plan,
        "monthly_spend": spend
    })

with open("controlled_customers.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

print("Saved controlled_customers.csv")
```
Output:

```
Saved controlled_customers.csv
```
Now the dataset contains meaningful patterns. Instead of generating random noise, you are simulating behavior. Effective controls include:
- Weighted category selection
- Realistic minimum and maximum ranges
- Conditional logic between columns
- Rare edge cases added intentionally
- Missing values inserted at low frequency
- Correlated features instead of independent ones
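Several of these controls fit in a handful of lines. The sketch below combines weighted category selection via `random.choices` with missing values inserted at low frequency; the weights and the 2% missing rate are illustrative assumptions, not values from the script above.

```python
import random

random.seed(0)

plans = ["Free", "Basic", "Pro", "Enterprise"]
# Illustrative weights: Free is most common, Enterprise is rare
weights = [0.45, 0.30, 0.18, 0.07]

rows = []
for _ in range(1000):
    plan = random.choices(plans, weights=weights, k=1)[0]
    # Insert missing values at low frequency (~2%) to mimic messy real data
    country = None if random.random() < 0.02 else random.choice(["UK", "USA"])
    rows.append({"plan": plan, "country": country})

free_share = sum(r["plan"] == "Free" for r in rows) / len(rows)
missing = sum(r["country"] is None for r in rows)
print(f"Free share: {free_share:.2f}, missing countries: {missing}")
```

`random.choices` accepts relative weights directly, so you can tune category frequencies without writing cascading `if` statements.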
# 2. Simulating processes for synthetic data
Simulation-based generation is one of the best ways to create realistic synthetic datasets. Instead of filling in the columns directly, you simulate the process that produces them. For example, consider a tiny warehouse where orders arrive, stock decreases, and low stock creates backorders.
```python
import csv
import random
from datetime import datetime, timedelta

random.seed(42)

inventory = {
    "A": 120,
    "B": 80,
    "C": 50
}

rows = []
current_time = datetime(2026, 1, 1)

for day in range(30):
    for product in inventory:
        daily_orders = random.randint(0, 12)
        for _ in range(daily_orders):
            qty = random.randint(1, 5)
            before = inventory[product]
            if inventory[product] >= qty:
                inventory[product] -= qty
                status = "fulfilled"
            else:
                status = "backorder"
            rows.append({
                "time": current_time.isoformat(),
                "product": product,
                "qty": qty,
                "stock_before": before,
                "stock_after": inventory[product],
                "status": status
            })
        # Restock rule (threshold and quantity are illustrative):
        # when stock runs low, a replenishment arrives overnight
        if inventory[product] <= 10:
            inventory[product] += random.randint(30, 60)
    current_time += timedelta(days=1)

with open("warehouse_log.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

print("Saved warehouse_log.csv")
```
Output:

```
Saved warehouse_log.csv
```
This method is excellent because the data is a by-product of system behavior, which usually produces more realistic relationships than directly generating random rows. Other simulation ideas include:
- Call center queues
- Ride requests and driver matching
- Loan applications and approvals
- Subscriptions and opt-outs
- Patient visits
- Website traffic and conversion
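As an illustration of the first idea, here is a minimal call-center sketch. The agent count, arrival gaps, and handling times are illustrative assumptions: calls arrive at random intervals, two agents answer them, and a caller waits whenever both agents are busy.

```python
import random
from datetime import datetime, timedelta

random.seed(42)

AGENTS = 2                      # number of agents on shift (assumption)
HANDLE_MIN, HANDLE_MAX = 2, 10  # minutes per call (assumption)

# Each agent becomes free at 9:00 when the shift starts
agent_free_at = [datetime(2026, 1, 1, 9, 0)] * AGENTS
clock = datetime(2026, 1, 1, 9, 0)
rows = []

for call_id in range(1, 51):
    clock += timedelta(minutes=random.randint(0, 5))   # next call arrives
    # Pick the agent who frees up earliest
    agent = min(range(AGENTS), key=lambda a: agent_free_at[a])
    start = max(clock, agent_free_at[agent])           # wait if agent is busy
    handle = timedelta(minutes=random.randint(HANDLE_MIN, HANDLE_MAX))
    agent_free_at[agent] = start + handle
    rows.append({
        "call_id": call_id,
        "arrived": clock.isoformat(),
        "answered": start.isoformat(),
        "wait_minutes": (start - clock).seconds // 60,
    })

print(f"Generated {len(rows)} call records")
```

As with the warehouse example, wait times emerge from the interaction of arrivals and capacity rather than being sampled directly, which is what makes the resulting columns correlate realistically.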
# 3. Generating synthetic time series data
```python
import csv
import random
from datetime import datetime, timedelta

random.seed(42)

start = datetime(2026, 1, 1, 0, 0, 0)
hours = 24 * 30

rows = []
for i in range(hours):
    ts = start + timedelta(hours=i)
    weekday = ts.weekday()
    base = 120
    if weekday >= 5:  # weekends are quieter
        base = 80
    hour = ts.hour
    # Daytime peak (the cutoffs and offsets here are illustrative)
    if 8 <= hour <= 18:
        base += 40
    trend = i * 0.05             # slow upward trend across the month
    noise = random.gauss(0, 10)  # random variation around the pattern
    value = round(base + trend + noise, 2)
    rows.append({"timestamp": ts.isoformat(), "value": value})

with open("timeseries.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

print("Saved timeseries.csv")
```
Output:

```
Saved timeseries.csv
```
This approach works well because it accounts for trends, noise, and cyclical behavior, while being effortless to explain and debug.
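As a variant sketch, the daily cycle can also be encoded as a sine wave instead of `if`/`else` branches, which gives smoother transitions between quiet and busy hours. The amplitudes, trend slope, and noise level below are illustrative assumptions.

```python
import math
import random

random.seed(42)

hours = 24 * 7  # one week of hourly points
series = []
for i in range(hours):
    hour_of_day = i % 24
    base = 120
    # Smooth daily cycle: peaks around noon, dips overnight (illustrative)
    daily = 40 * math.sin(2 * math.pi * (hour_of_day - 6) / 24)
    trend = 0.05 * i             # slow growth over the week
    noise = random.gauss(0, 8)   # hour-to-hour variation
    series.append(round(base + daily + trend + noise, 2))

print(f"{len(series)} points, min={min(series)}, max={max(series)}")
```

The trade-off is interpretability: step functions are easier to explain to stakeholders, while trigonometric seasonality looks more natural on a chart.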
# 4. Creating event logs
Event logs are another useful script style, perfect for product analysis and workflow testing. Instead of one line per customer, you create one line per action.
```python
import csv
import random
from datetime import datetime, timedelta

random.seed(42)

events = ["signup", "login", "view_page", "add_to_cart", "purchase", "logout"]

rows = []
start = datetime(2026, 1, 1)

for user_id in range(1, 201):
    event_count = random.randint(5, 30)
    current_time = start + timedelta(days=random.randint(0, 10))
    for _ in range(event_count):
        event = random.choice(events)
        # Most sessions do not convert (the probability is illustrative)
        if event == "purchase" and random.random() < 0.7:
            event = "view_page"
        rows.append({
            "user_id": f"U{user_id:04d}",
            "event": event,
            "timestamp": current_time.isoformat()
        })
        current_time += timedelta(minutes=random.randint(1, 120))

with open("events.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

print("Saved events.csv")
```
Output:

```
Saved events.csv
```
This format is useful for:
- Funnel analysis
- Analytics pipeline testing
- Business intelligence (BI) dashboards
- Session reconstruction
- Anomaly detection experiments
A useful technique is to make events dependent on previous actions. For example, a purchase should typically occur only after a login or a page view, which makes the synthetic log more realistic.
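One way to sketch this dependency is a small transition table: the next event is drawn based on the previous one, so a purchase can only follow an add-to-cart. The event names match the script above; the transition probabilities are illustrative assumptions.

```python
import random

random.seed(42)

# Each event maps to (next_event, probability) pairs; probabilities per row sum to 1
transitions = {
    "signup":      [("login", 1.0)],
    "login":       [("view_page", 0.8), ("logout", 0.2)],
    "view_page":   [("view_page", 0.5), ("add_to_cart", 0.3), ("logout", 0.2)],
    "add_to_cart": [("purchase", 0.4), ("view_page", 0.4), ("logout", 0.2)],
    "purchase":    [("logout", 0.7), ("view_page", 0.3)],
    "logout":      [],  # terminal state: session ends
}

def generate_session():
    """Walk the transition table from signup until the session ends."""
    event = "signup"
    session = [event]
    while transitions[event]:
        choices, weights = zip(*transitions[event])
        event = random.choices(choices, weights=weights, k=1)[0]
        session.append(event)
    return session

session = generate_session()
print(session)
```

Because purchases can only be reached through the cart, funnel metrics computed on such logs behave like real ones instead of showing impossible orderings.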
# 5. Generating synthetic text data using templates
Synthetic data is also valuable in natural language processing (NLP). You don’t always need an LLM to get started; you can create effective text datasets using templates and controlled variations. For example, you can create training data for support tickets:
```python
import json
import random

random.seed(42)

issues = [
    ("billing", "I was charged twice for my subscription"),
    ("login", "I cannot log into my account"),
    ("shipping", "My order has not arrived yet"),
    ("refund", "I want to request a refund"),
]

tones = ["Please help", "This is urgent", "Can you check this", "I need support"]

records = []
for _ in range(100):
    label, message = random.choice(issues)
    tone = random.choice(tones)
    text = f"{tone}. {message}."
    records.append({
        "text": text,
        "label": label
    })

with open("support_tickets.jsonl", "w", encoding="utf-8") as f:
    for item in records:
        f.write(json.dumps(item) + "\n")

print("Saved support_tickets.jsonl")
```
Output:

```
Saved support_tickets.jsonl
```
This approach works well for:
- Text classification demo
- Intent detection
- Chatbot testing
- Quick assessment
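To stretch the templates further, you can add slot-filling and light surface noise so the dataset doesn’t collapse into a handful of identical strings. The slot values and noise probabilities below are illustrative assumptions.

```python
import random

random.seed(42)

# Templates with {slots} filled per sample, plus occasional casing/punctuation noise
templates = [
    "I was charged {amount} twice for my {product}",
    "My {product} order has not arrived after {days} days",
]
products = ["subscription", "annual plan", "starter kit"]

def fill(template):
    text = template.format(
        amount=f"${random.randint(5, 200)}",
        product=random.choice(products),
        days=random.randint(2, 14),
    )
    if random.random() < 0.3:
        text = text.lower()   # occasional all-lowercase message
    if random.random() < 0.3:
        text += "!!"          # occasional emphatic punctuation
    return text

samples = [fill(random.choice(templates)) for _ in range(5)]
for s in samples:
    print(s)
```

Even a few slots multiply the number of distinct strings quickly, which helps a toy classifier learn the label from content rather than from memorized sentences.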
# Final thoughts
Synthetic data scripts are powerful tools, but they can be implemented incorrectly. Remember to avoid these common mistakes:
- Making all values uniformly random
- Forgetting about dependencies between fields
- Generating values that violate business logic
- Assuming synthetic data is inherently secure by default
- Creating data that is too “clean” to be useful for testing real edge cases
- Using the same pattern so often that the data set becomes predictable and unrealistic
Privacy remains the most critical issue. While synthetic data reduces exposure of real records, it is not risk-free: if the generator is fitted too closely to the original sensitive data, information can still leak. Privacy-preserving techniques, such as differentially private synthetic data generation, are therefore essential when the source data is sensitive.
Kanwal Mehreen is a machine learning engineer and technical writer with a deep passion for data science and the intersection of artificial intelligence and medicine. She is co-author of the e-book “Maximizing Productivity with ChatGPT”. As a 2022 Google Generation Scholar for APAC, she promotes diversity and academic excellence. She is also recognized as a Teradata Diversity in Tech Scholar, a Mitacs Globalink Research Scholar, and a Harvard WeCode Scholar. Kanwal is a staunch advocate for change and founded FEMCodes to empower women in STEM fields.
