# Introduction
Synthetic data, as the name suggests, is created artificially rather than collected from real sources. It looks like real data, but avoids privacy issues and high data collection costs. This allows you to easily test software and models while running experiments to simulate post-release performance.
Although libraries like Faker, SDV, and SynthCity exist – and even Large Language Models (LLMs) are widely used to generate synthetic data – in this article, I focus on avoiding reliance on external libraries or AI tools. Instead, you’ll learn how to achieve the same results by writing your own Python scripts. This gives you a better understanding of how the dataset is shaped and how biases and errors are introduced. We’ll start with straightforward toy scripts to understand the options available. Once you have mastered these basics, you can comfortably move on to specialized libraries.
# 1. Generating straightforward random data
The easiest way to start is with tabular data. For example, if you need a fake customer dataset for an internal demo, you can run a script that generates the data in CSV format:
```python
import csv
import random
from datetime import datetime, timedelta

random.seed(42)

countries = ["Canada", "UK", "UAE", "Germany", "USA"]
plans = ["Free", "Basic", "Pro", "Enterprise"]

def random_signup_date():
    start = datetime(2024, 1, 1)
    end = datetime(2026, 1, 1)
    delta_days = (end - start).days
    return (start + timedelta(days=random.randint(0, delta_days))).date().isoformat()

rows = []
for i in range(1, 1001):
    age = random.randint(18, 70)
    country = random.choice(countries)
    plan = random.choice(plans)
    monthly_spend = round(random.uniform(0, 500), 2)
    rows.append({
        "customer_id": f"CUST{i:05d}",
        "age": age,
        "country": country,
        "plan": plan,
        "monthly_spend": monthly_spend,
        "signup_date": random_signup_date()
    })

with open("customers.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

print("Saved customers.csv")
```
Output:

```
Saved customers.csv
```
This script is straightforward: you define fields, select ranges, and write rows. The `random` module supports generating integers, floating-point values, random choices, and samples, while the `csv` module handles reading and writing row-based tabular data. This type of dataset is suitable for:
- Frontend demos
- Dashboard testing
- API development
- Learning structured query language (SQL)
- Unit testing of input pipelines
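As a sketch of the last use case, a unit-test-style check might load the generated CSV back and validate its schema before the data enters a pipeline. The `load_customers` helper below is hypothetical, standing in for your real ingestion step; the column names match the script above.

```python
import csv
import io

# Expected schema of customers.csv as produced by the script above
EXPECTED_COLUMNS = {"customer_id", "age", "country", "plan",
                    "monthly_spend", "signup_date"}

def load_customers(fileobj):
    """Hypothetical loader: read customer rows and coerce numeric fields."""
    rows = []
    for row in csv.DictReader(fileobj):
        row["age"] = int(row["age"])
        row["monthly_spend"] = float(row["monthly_spend"])
        rows.append(row)
    return rows

# A tiny in-memory sample standing in for the generated file
sample = io.StringIO(
    "customer_id,age,country,plan,monthly_spend,signup_date\n"
    "CUST00001,34,UK,Pro,129.99,2024-06-01\n"
)
customers = load_customers(sample)
assert set(customers[0]) == EXPECTED_COLUMNS
```

Because the data is synthetic, such checks can run in continuous integration without touching any real customer records.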
However, this approach has a fundamental weakness: everything is completely random, which often makes the data look flat or unnatural. An Enterprise customer might spend only $2 while a Free user spends $400, and older users behave exactly like younger users because there is no underlying structure.
In real-world scenarios, data rarely behaves this way. Instead of generating each value independently, we can introduce relationships and rules. This makes the dataset appear more realistic while remaining fully synthetic. For example:
- Enterprise customers should almost never have zero spend
- Spending ranges should depend on the chosen plan
- Older users may spend slightly more on average
- Some plans should be more common than others
Let’s add these controls to the script:
```python
import csv
import random

random.seed(42)

plans = ["Free", "Basic", "Pro", "Enterprise"]

def choose_plan():
    # Weighted selection: cheaper plans are common, Enterprise is rare
    roll = random.random()
    if roll < 0.40:
        return "Free"
    elif roll < 0.70:
        return "Basic"
    elif roll < 0.90:
        return "Pro"
    return "Enterprise"

def generate_spend(age, plan):
    # Spend ranges depend on the plan (the exact bounds are illustrative);
    # note that Enterprise customers never spend zero
    ranges = {
        "Free": (0, 15),
        "Basic": (10, 80),
        "Pro": (60, 250),
        "Enterprise": (150, 500),
    }
    low, high = ranges[plan]
    base = random.uniform(low, high)
    if age >= 40:
        base *= 1.15  # older users spend slightly more on average
    return round(base, 2)

rows = []
for i in range(1, 1001):
    age = random.randint(18, 70)
    plan = choose_plan()
    spend = generate_spend(age, plan)
    rows.append({
        "customer_id": f"CUST{i:05d}",
        "age": age,
        "plan": plan,
        "monthly_spend": spend
    })

with open("controlled_customers.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

print("Saved controlled_customers.csv")
```
Output:

```
Saved controlled_customers.csv
```
Now the dataset contains meaningful patterns. Instead of generating random noise, you are simulating behavior. Effective controls include:
- Weighted category selection
- Realistic minimum and maximum ranges
- Conditional logic between columns
- Rare edge cases added intentionally
- Missing values inserted at low frequency
- Correlated features instead of independent ones
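Several of these controls fit in a handful of lines. The sketch below combines weighted category selection via `random.choices` with missing values inserted at low frequency; the weights and the 2% missing rate are illustrative assumptions, not values from the script above.

```python
import random

random.seed(0)

plans = ["Free", "Basic", "Pro", "Enterprise"]
# Illustrative weights: Free is most common, Enterprise is rare
weights = [0.45, 0.30, 0.18, 0.07]

rows = []
for _ in range(1000):
    plan = random.choices(plans, weights=weights, k=1)[0]
    # Insert missing values at low frequency (~2%) to mimic messy real data
    country = None if random.random() < 0.02 else random.choice(["UK", "USA"])
    rows.append({"plan": plan, "country": country})

free_share = sum(r["plan"] == "Free" for r in rows) / len(rows)
missing = sum(r["country"] is None for r in rows)
print(f"Free share: {free_share:.2f}, missing countries: {missing}")
```

`random.choices` accepts relative weights directly, so you can tune category frequencies without writing cascading `if` statements.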
# 2. Simulating processes for synthetic data
Simulation-based generation is one of the best ways to create realistic synthetic datasets. Instead of filling in the columns directly, you simulate the process that produces them. For example, consider a tiny warehouse where orders arrive, stock decreases, and low stock creates backorders.
```python
import csv
import random
from datetime import datetime, timedelta

random.seed(42)

inventory = {
    "A": 120,
    "B": 80,
    "C": 50
}

rows = []
current_time = datetime(2026, 1, 1)

for day in range(30):
    for product in inventory:
        daily_orders = random.randint(0, 12)
        for _ in range(daily_orders):
            qty = random.randint(1, 5)
            before = inventory[product]
            if inventory[product] >= qty:
                inventory[product] -= qty
                status = "fulfilled"
            else:
                status = "backorder"
            rows.append({
                "time": current_time.isoformat(),
                "product": product,
                "qty": qty,
                "stock_before": before,
                "stock_after": inventory[product],
                "status": status
            })
        # Restock rule (threshold and quantity are illustrative):
        # when stock runs low, a replenishment arrives overnight
        if inventory[product] <= 10:
            inventory[product] += random.randint(30, 60)
    current_time += timedelta(days=1)

with open("warehouse_log.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

print("Saved warehouse_log.csv")
```
Output:

```
Saved warehouse_log.csv
```
This method is excellent because the data is a by-product of system behavior, which usually produces more realistic relationships than directly generating random rows. Other simulation ideas include:
- Call center queues
- Ride requests and driver matching
- Loan applications and approvals
- Subscriptions and opt-outs
- Patient visits
- Website traffic and conversion
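As an illustration of the first idea, here is a minimal call-center sketch. The agent count, arrival gaps, and handling times are illustrative assumptions: calls arrive at random intervals, two agents answer them, and a caller waits whenever both agents are busy.

```python
import random
from datetime import datetime, timedelta

random.seed(42)

AGENTS = 2                      # number of agents on shift (assumption)
HANDLE_MIN, HANDLE_MAX = 2, 10  # minutes per call (assumption)

# Each agent becomes free at 9:00 when the shift starts
agent_free_at = [datetime(2026, 1, 1, 9, 0)] * AGENTS
clock = datetime(2026, 1, 1, 9, 0)
rows = []

for call_id in range(1, 51):
    clock += timedelta(minutes=random.randint(0, 5))   # next call arrives
    # Pick the agent who frees up earliest
    agent = min(range(AGENTS), key=lambda a: agent_free_at[a])
    start = max(clock, agent_free_at[agent])           # wait if agent is busy
    handle = timedelta(minutes=random.randint(HANDLE_MIN, HANDLE_MAX))
    agent_free_at[agent] = start + handle
    rows.append({
        "call_id": call_id,
        "arrived": clock.isoformat(),
        "answered": start.isoformat(),
        "wait_minutes": (start - clock).seconds // 60,
    })

print(f"Generated {len(rows)} call records")
```

As with the warehouse example, wait times emerge from the interaction of arrivals and capacity rather than being sampled directly, which is what makes the resulting columns correlate realistically.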
# 3. Generating synthetic time series data
```python
import csv
import random
from datetime import datetime, timedelta

random.seed(42)

start = datetime(2026, 1, 1, 0, 0, 0)
hours = 24 * 30

rows = []
for i in range(hours):
    ts = start + timedelta(hours=i)
    weekday = ts.weekday()
    base = 120
    if weekday >= 5:  # weekends are quieter
        base = 80
    hour = ts.hour
    # Daytime peak (the cutoffs and offsets here are illustrative)
    if 8 <= hour <= 18:
        base += 40
    trend = i * 0.05             # slow upward trend across the month
    noise = random.gauss(0, 10)  # random variation around the pattern
    value = round(base + trend + noise, 2)
    rows.append({"timestamp": ts.isoformat(), "value": value})

with open("timeseries.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

print("Saved timeseries.csv")
```
Output:

```
Saved timeseries.csv
```
This approach works well because it accounts for trends, noise, and cyclical behavior, while being effortless to explain and debug.
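As a variant sketch, the daily cycle can also be encoded as a sine wave instead of `if`/`else` branches, which gives smoother transitions between quiet and busy hours. The amplitudes, trend slope, and noise level below are illustrative assumptions.

```python
import math
import random

random.seed(42)

hours = 24 * 7  # one week of hourly points
series = []
for i in range(hours):
    hour_of_day = i % 24
    base = 120
    # Smooth daily cycle: peaks around noon, dips overnight (illustrative)
    daily = 40 * math.sin(2 * math.pi * (hour_of_day - 6) / 24)
    trend = 0.05 * i             # slow growth over the week
    noise = random.gauss(0, 8)   # hour-to-hour variation
    series.append(round(base + daily + trend + noise, 2))

print(f"{len(series)} points, min={min(series)}, max={max(series)}")
```

The trade-off is interpretability: step functions are easier to explain to stakeholders, while trigonometric seasonality looks more natural on a chart.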
# 4. Creating event logs
Event logs are another useful script style, perfect for product analysis and workflow testing. Instead of one line per customer, you create one line per action.
```python
import csv
import random
from datetime import datetime, timedelta

random.seed(42)

events = ["signup", "login", "view_page", "add_to_cart", "purchase", "logout"]

rows = []
start = datetime(2026, 1, 1)

for user_id in range(1, 201):
    event_count = random.randint(5, 30)
    current_time = start + timedelta(days=random.randint(0, 10))
    for _ in range(event_count):
        event = random.choice(events)
        # Most sessions do not convert (the probability is illustrative)
        if event == "purchase" and random.random() < 0.7:
            event = "view_page"
        rows.append({
            "user_id": f"U{user_id:04d}",
            "event": event,
            "timestamp": current_time.isoformat()
        })
        current_time += timedelta(minutes=random.randint(1, 120))

with open("events.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

print("Saved events.csv")
```
Output:

```
Saved events.csv
```
This format is useful for:
- Funnel analysis
- Analytics pipeline testing
- Business intelligence (BI) dashboards
- Session reconstruction
- Anomaly detection experiments
A useful technique is to make events dependent on previous actions. For example, a purchase should typically occur only after a login or a page view, which makes the synthetic log more realistic.
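One way to sketch this dependency is a small transition table: the next event is drawn based on the previous one, so a purchase can only follow an add-to-cart. The event names match the script above; the transition probabilities are illustrative assumptions.

```python
import random

random.seed(42)

# Each event maps to (next_event, probability) pairs; probabilities per row sum to 1
transitions = {
    "signup":      [("login", 1.0)],
    "login":       [("view_page", 0.8), ("logout", 0.2)],
    "view_page":   [("view_page", 0.5), ("add_to_cart", 0.3), ("logout", 0.2)],
    "add_to_cart": [("purchase", 0.4), ("view_page", 0.4), ("logout", 0.2)],
    "purchase":    [("logout", 0.7), ("view_page", 0.3)],
    "logout":      [],  # terminal state: session ends
}

def generate_session():
    """Walk the transition table from signup until the session ends."""
    event = "signup"
    session = [event]
    while transitions[event]:
        choices, weights = zip(*transitions[event])
        event = random.choices(choices, weights=weights, k=1)[0]
        session.append(event)
    return session

session = generate_session()
print(session)
```

Because purchases can only be reached through the cart, funnel metrics computed on such logs behave like real ones instead of showing impossible orderings.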
# 5. Generating synthetic text data using templates
Synthetic data is also valuable in natural language processing (NLP). You don’t always need an LLM to get started; you can create effective text datasets using templates and controlled variations. For example, you can create training data for support tickets:
```python
import json
import random

random.seed(42)

issues = [
    ("billing", "I was charged twice for my subscription"),
    ("login", "I cannot log into my account"),
    ("shipping", "My order has not arrived yet"),
    ("refund", "I want to request a refund"),
]

tones = ["Please help", "This is urgent", "Can you check this", "I need support"]

records = []
for _ in range(100):
    label, message = random.choice(issues)
    tone = random.choice(tones)
    text = f"{tone}. {message}."
    records.append({
        "text": text,
        "label": label
    })

with open("support_tickets.jsonl", "w", encoding="utf-8") as f:
    for item in records:
        f.write(json.dumps(item) + "\n")

print("Saved support_tickets.jsonl")
```
Output:

```
Saved support_tickets.jsonl
```
This approach works well for:
- Text classification demo
- Intent detection
- Chatbot testing
- Quick assessment
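To stretch the templates further, you can add slot-filling and light surface noise so the dataset doesn’t collapse into a handful of identical strings. The slot values and noise probabilities below are illustrative assumptions.

```python
import random

random.seed(42)

# Templates with {slots} filled per sample, plus occasional casing/punctuation noise
templates = [
    "I was charged {amount} twice for my {product}",
    "My {product} order has not arrived after {days} days",
]
products = ["subscription", "annual plan", "starter kit"]

def fill(template):
    text = template.format(
        amount=f"${random.randint(5, 200)}",
        product=random.choice(products),
        days=random.randint(2, 14),
    )
    if random.random() < 0.3:
        text = text.lower()   # occasional all-lowercase message
    if random.random() < 0.3:
        text += "!!"          # occasional emphatic punctuation
    return text

samples = [fill(random.choice(templates)) for _ in range(5)]
for s in samples:
    print(s)
```

Even a few slots multiply the number of distinct strings quickly, which helps a toy classifier learn the label from content rather than from memorized sentences.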
# Final thoughts
Synthetic data scripts are powerful tools, but they can be implemented incorrectly. Remember to avoid these common mistakes:
- Making all values uniformly random
- Forgetting about dependencies between fields
- Generating values that violate business logic
- Assuming synthetic data is inherently secure by default
- Creating data that is too “clean” to be useful for testing real edge cases
- Using the same pattern so often that the data set becomes predictable and unrealistic
Privacy remains the most critical issue. While synthetic data reduces exposure of real records, it is not risk-free: if the generator is fitted too closely to the original sensitive data, information can still leak. Privacy-preserving techniques, such as differentially private synthetic data generation, are therefore essential when the source data is sensitive.
Kanwal Mehreen is a machine learning engineer and technical writer with a deep passion for data science and the intersection of artificial intelligence and medicine. She is co-author of the e-book “Maximizing Productivity with ChatGPT”. As a 2022 Google Generation Scholar for APAC, she promotes diversity and academic excellence. She is also recognized as a Teradata Diversity in Tech Scholar, a Mitacs Globalink Research Scholar, and a Harvard WeCode Scholar. Kanwal is a staunch advocate for change and founded FEMCodes to empower women in STEM fields.
