We’re surrounded by AI buzz these days. New AI tools are announced almost every day. They claim to do almost everything for us: drive our cars, write our emails, make art for us. But even with the biggest, flashiest tools, like ChatGPT, it’s unclear whether the AI approach is an improvement on what it replaces. It’s challenging to separate what’s truly useful from what’s little more than hype. AI’s biggest problem is keeping its promises.
There is an exception: synthetic data.
What is synthetic data?
Synthetic data is AI-generated data that reflects the statistical properties of real-world data. By training AI models on real-world data, industries as diverse as healthcare, manufacturing, finance, and software development can generate synthetic data that meets their needs, wherever and whenever they need it, at the scope and scale they need.
Synthetic data solves several problems. When it comes to AI modeling, synthetic data can help alleviate the lack of affordable, high-quality data. When it comes to software development and testing, synthetic data sets can help with edge-case testing, simulating complex data scenarios, and validating systems under plausible real-world conditions. Access to live production data is rightly restricted, but those restrictions can hinder innovation across the organization. Synthetic data can have far fewer limitations, allowing teams to build without unnecessary friction.
Companies like Amazon, Google, and American Express already rely on synthetic data, as do organizations such as the National Health Service in the UK. Your company or sector probably does too.
Synthetic but not imitation
Synthetic data is sometimes confused with imitation data, and many people use the two terms interchangeably. However, they are completely different things. Imitation data, or mock data, is cheap and easy to generate. It can be produced with open-source libraries such as Faker. However, imitation data does not have the same statistical properties as real data. It is usually simple and uniform. For example, if we generated an imitation database of 100 transactions between $1 and $10,000, roughly 10 would be between $1 and $1,000, roughly 10 between $1,001 and $2,000, and so on. Real-world purchase data is lumpy. Some transactions cluster together, while others are outliers.
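The uniformity described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular mock-data library works: the function names `mock_transactions` and `bucket_counts` are hypothetical, and the amounts are drawn with the standard library's uniform random generator as a stand-in for a simple mock generator.

```python
import random

def mock_transactions(n=100, low=1, high=10_000, seed=0):
    """Draw n whole-dollar amounts uniformly at random (assumed mock behavior)."""
    rng = random.Random(seed)
    return [rng.randint(low, high) for _ in range(n)]

def bucket_counts(amounts, width=1_000, high=10_000):
    """Count how many amounts fall into each bucket of the given width."""
    counts = [0] * (high // width)
    for amount in amounts:
        # $1 to $1,000 lands in bucket 0, $1,001 to $2,000 in bucket 1, and so on
        counts[(amount - 1) // width] += 1
    return counts

amounts = mock_transactions()
counts = bucket_counts(amounts)
for i, count in enumerate(counts):
    print(f"${i * 1000 + 1:>6,} to ${(i + 1) * 1000:>6,}: {count}")
```

Each bucket ends up with a count close to 10, with only small random wobble. Nothing in the generator knows that real purchases cluster around certain price points, so no amount of extra rows will make the lumps appear.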
Imitation data has few or none of the properties or characteristics of the real production dataset. Other than basic parameters like range and data type, any resemblance to real data is purely coincidental. In contrast, synthetic data is constructed using statistical models and generative AI trained on real data. This synthetic data has the same statistical properties and internal relationships as the real dataset it is intended to emulate.
While both imitation and synthetic data are useful, they are very different tools. In real-world scenarios, these differences become very important. Let’s look at two examples: one in online retail and one in data science.
Synthetic data for testing software applications
Let’s say an online sporting goods retailer analyzed its data and noticed a few trends. They found that they were getting almost three times as many visitors from Massachusetts as from any other state, that visitors from MA were most likely to buy snow boots in November, and that traffic was expected to spike before Thanksgiving.
To capitalize on these findings, the retailer updates its site to show snow boots to anyone who visits the site from MA in the three weeks leading up to Thanksgiving. They also customize results for customers who opt for more personalization, showing specific snow boot models based on each visitor’s purchase history and personal preferences.
Before the retailer makes these changes to its app, it wants to test them. It wants to be ready for the spike: Even if tens of thousands of visits occur over the course of three weeks, the website should respond in less than a millisecond. It also wants to make sure it maximizes the likelihood of a purchase. To do this testing, it needs data.
What happens if they use imitation data? Because the imitation data is generated randomly, it will contain visitors from every state with equal frequency and for every date in the year with equal frequency. Even if the team decides to generate millions of imitation visits and then throw out everything that is not from MA or falls outside its date range, the imitation data will not contain the customer purchase history needed to test the part of the code that customizes which snow boots to display. In the test and development environments, the app’s performance looks good, but when real customers visit the site, performance is poor because of the clustering that was missing from the imitation data.
What if the retailer used synthetic data instead? Synthetic data generated using an AI model trained on real retailer data can emulate real customers. It can create entire customer journeys, from initial account creation to purchases made in the past two years.
If real customers bought product A and then bought product B six months later, the synthetic customers will follow this pattern. If there was a spike in traffic from MA in November, the synthetic data set will mimic that. With synthetic data, the retailer can create data that reflects the actual visits they expect, taking into account visitor locations, traffic spikes, and complex purchase histories. By testing with this data, they get a more accurate picture of what to expect and can prepare their app accordingly.
Modern software applications are increasingly dynamic, adjusting their results based on the data they see in real time. Their logic is updated frequently, and new versions are deployed quickly, sometimes multiple times a day. Before each deployment, developers must test that it works well and correctly. Those who use synthetic data, rather than just imitation data, can be more confident that their customers will have a great experience and that they will make more sales.
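To make the idea of "learning patterns from real data" concrete, here is a deliberately tiny sketch. It is not a generative AI model: it only learns how often each (state, month) pair occurs in a small, invented visit log and samples new visits at the same rates. The `real_visits` data and the `fit` and `sample` helpers are all hypothetical, and real synthetic data tools model many more columns and relationships than this.

```python
import random
from collections import Counter

# Hypothetical "real" visit log, invented for illustration only.
real_visits = (
    [("MA", "Nov")] * 60 + [("MA", "Jul")] * 15 +
    [("NY", "Nov")] * 15 + [("CA", "Jul")] * 10
)

def fit(visits):
    """Learn the joint frequency of each (state, month) pair."""
    counts = Counter(visits)
    pairs = list(counts)
    weights = [counts[pair] for pair in pairs]
    return pairs, weights

def sample(model, n, seed=0):
    """Draw n new visits with the same joint frequencies as the training data."""
    pairs, weights = model
    rng = random.Random(seed)
    return rng.choices(pairs, weights=weights, k=n)

model = fit(real_visits)
synthetic = sample(model, 10_000)

# Print the share of each pair, most common first.
for pair, count in Counter(synthetic).most_common():
    print(pair, f"{count / len(synthetic):.1%}")
```

Because the sampler keeps the joint frequencies, the November spike from MA carries over into the generated visits, which is exactly what uniformly random imitation data loses. A frequency table like this only reproduces combinations it has already seen and does nothing for privacy, so treat it as an illustration of the principle rather than a usable generator.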
Synthetic data eliminates the analyst bottleneck
Businesses store vast amounts of data about how their customers use their products and services, hoping it will provide insights that will help boost profits. To obtain these insights, they may hire consulting firms or independent data scientists or even hold public data science competitions. However, their desire to reach as many people as possible often clashes with the proprietary nature of the data, as well as with customer privacy concerns. Imitation data, again, won’t help in this scenario because it lacks the realistic properties of production data: the internal correlations and other statistical properties that lead to valuable insights.
For a dataset to replace real data, it must provide the same analytical insights as the real data. Returning to the example above, if the real data shows that snow boots are the most popular purchase for MA customers, the analyst using synthetic data must reach the same conclusion. Can synthetic data really be that good?
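One test came in 2017, when my group hired independent data scientists to develop predictive models in a crowdsourced experiment. We wanted to find out whether data scientists working with synthetic data could produce results as good as those working with real data.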
To test this, one group of data scientists was given the original, real data, and the other three were given synthetic versions. Each group used their data to solve a predictive modeling problem, running 15 tests in total across 5 data sets. When their solutions were compared, those generated by the group using the real data and those generated by the groups using the synthetic data showed no significant difference in performance in 11 of the 15 tests (about 73 percent of the time).
Since then, synthetic data has become a staple of data science competitions and is beginning to transform data sharing and analysis for businesses. Kaggle, a popular data science competition website, now regularly publishes synthetic datasets, including some from enterprises. Wells Fargo released a synthetic data set for a competition in which data scientists were tasked with predicting suspected elder abuse fraud. Spar Nord Bank has released a set of anti-money laundering data for data scientists to find patterns that indicate money laundering.
Application
Synthetic data is a useful application of AI technology that is already delivering real, material value to customers. Synthetic data is more than just imitation data; it supports data-driven business systems throughout their lifecycle, especially when continuous access to production data is impractical or unwise.
If your projects are hampered by expensive and complex production data access processes or constrained by the inherent limitations of imitation data, synthetic data is worth exploring. You can start using synthetic data today by downloading one of the free options available.
Synthetic data is a valuable new technique that more and more organizations are adding to their data-driven workloads. Ask your data teams where you can use synthetic data and free yourself from the imposters and noise.
About the author