Should Synthetic Data Train AI?

By Elise Leise
New synthetic data companies can produce low-cost, realistic data for deep learning algorithms. So what’s the catch?

Real-world data is complicated and costly to collect, and it is often biased. Although the world generated, copied, and consumed 59 zettabytes of data in 2020 alone, that doesn’t mean all companies can use it. Data privacy laws have restricted access to datasets, and AI developers are struggling to get their hands on usable data. Now, synthetic data firms like Datagen and Synthesis AI can simulate digital humans to rapidly create supposedly perfect, diverse, and malleable data.

What Is Synthetic Data? 

Synthetic data is generated artificially by computer programs rather than collected from real-world events. We can use it for just about any machine learning project: self-driving cars, security, robotics, fraud prevention, financial systems, and healthcare. You name it, it’s got an application. According to the MIT Technology Review, it’s “a bit like diet soda. To be effective, it has to resemble the ‘real thing’”.

To create a synthetic dataset with the same statistical properties as its real-world counterpart, we use two primary methods: variational autoencoders (VAEs), in which unsupervised learning models compress a dataset into a compact representation and then generate new data from it, and generative adversarial networks (GANs), in which one network (the generator) creates synthetic data and another (the discriminator) tries to catch the fakes. “Eventually, the generator can make perfect [data], and the discriminator cannot tell the difference”, said PhD student Lei Xu.
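The idea of "the same statistical properties" can be illustrated with something far simpler than a VAE or GAN. The sketch below is neither of those: it fits only a mean and standard deviation to a handful of values and samples new ones from a normal distribution. The measurements and the function names are invented for illustration.

```python
import random
import statistics

def fit_gaussian(real_values):
    """Estimate the mean and standard deviation of the real data."""
    return statistics.mean(real_values), statistics.stdev(real_values)

def sample_synthetic(mean, stdev, n, seed=0):
    """Draw n synthetic values from a normal distribution with the fitted parameters."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [rng.gauss(mean, stdev) for _ in range(n)]

# Hypothetical "real" measurements (heights in cm), invented for this example.
real_heights = [158.0, 162.5, 165.0, 170.2, 171.8, 175.5, 180.1, 183.4]

mean, stdev = fit_gaussian(real_heights)
synthetic_heights = sample_synthetic(mean, stdev, n=10_000)

print(f"real:      mean={mean:.1f}, stdev={stdev:.1f}")
print(f"synthetic: mean={statistics.mean(synthetic_heights):.1f}, "
      f"stdev={statistics.stdev(synthetic_heights):.1f}")
```

None of the synthetic values is a copy of a real record, yet the summary statistics stay close to the originals. VAEs and GANs pursue the same goal for much richer data, such as images, where a two-parameter model would not be enough.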

Who Uses It? 

  • Datagen. The company collects raw data from full-body scanners and simulates 3D humans. It’s currently working with four major, mysterious tech companies in areas like VR, security, and vision sensing systems. 
  • Click-Ins. To automate vehicle inspections, its algorithms re-create thousands of cars under different damage conditions, colours, and lighting to train its AI. This helps it avoid certain data privacy issues, such as accidentally tracking license plate numbers.
  • Operating with financial, insurance, and communications firms, the company makes fake client data spreadsheets so that its customers can collaborate with their vendors without revealing sensitive information. 

Why Should We Use It? 

First, synthetic data sets can be labelled automatically as they are generated. AI developers can construct their training sets much more quickly, with fewer privacy concerns, and can even simulate rare events, such as a pandemic like COVID-19 that only happens once every century or so. They can also test tricky edge cases, situations that occur at extreme operating parameters.
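A minimal sketch of why the labels come for free: the generator decides which class a sample belongs to before it creates the sample, so the label is known by construction. The sensor readings, class names, and `anomaly_rate` parameter below are hypothetical and are not taken from any company's pipeline.

```python
import random

def generate_labelled_readings(n, anomaly_rate=0.3, seed=0):
    """Return (reading, label) pairs; each label is known because we chose it first."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # Pick the class first, then simulate a reading that matches it.
        label = "anomaly" if rng.random() < anomaly_rate else "normal"
        centre = 5.0 if label == "anomaly" else 0.0  # assumed class centres
        reading = rng.gauss(centre, 1.0)
        samples.append((reading, label))
    return samples

dataset = generate_labelled_readings(1_000)
anomaly_share = sum(1 for _, label in dataset if label == "anomaly") / len(dataset)
print(f"{len(dataset)} labelled samples, {anomaly_share:.0%} anomalies")
```

Because `anomaly_rate` is a setting rather than a fact about the world, a developer could raise it to oversample events that almost never appear in collected data.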

Are the Costs Worth It? 

Even though synthetic data might speed up our AI training systems, we shouldn’t trust it blindly. “What I don’t want to do is give the thumbs-up to this paradigm and say, ‘Oh this will solve so many problems’”, said Cathy O’Neil, data scientist and founder of ORCAA, an algorithmic auditing firm. Aaron Roth, a professor of computer and information science at the University of Pennsylvania, added: “[Just because it’s synthetic] doesn’t mean that it does not encode information about real people”.

For instance, synthetic data usually includes fewer examples from ethnic minority groups because they’re generally underrepresented in the real-world data it is modelled on. This means that synthetic humans from minority groups are potentially less realistic than their Caucasian counterparts, which can further racial bias in the systems trained on them.
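The mechanism can be sketched in a few lines: a generator that learns group frequencies from imbalanced data will reproduce the same imbalance. The 90/10 split and the group names below are assumptions chosen for illustration, not figures from any real dataset.

```python
import random
from collections import Counter

def fit_group_frequencies(groups):
    """Learn how often each group appears in the real data."""
    counts = Counter(groups)
    total = len(groups)
    return {group: count / total for group, count in counts.items()}

def sample_groups(frequencies, n, seed=0):
    """Sample n synthetic group labels using the learned frequencies as weights."""
    rng = random.Random(seed)
    names = list(frequencies)
    weights = [frequencies[name] for name in names]
    return rng.choices(names, weights=weights, k=n)

# Hypothetical real-world sample in which one group is underrepresented.
real_groups = ["majority"] * 90 + ["minority"] * 10

frequencies = fit_group_frequencies(real_groups)
synthetic_groups = sample_groups(frequencies, n=10_000)
minority_share = synthetic_groups.count("minority") / len(synthetic_groups)

print(f"minority share in real data:      {frequencies['minority']:.1%}")
print(f"minority share in synthetic data: {minority_share:.1%}")
```

The synthetic set ends up with roughly the same small minority share as the source. Correcting it means choosing the sampling weights deliberately instead of copying them from the data.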

Yet overall, many think that synthetic data is the best way to feed our information-hungry AI algorithms.

As the MIT Technology Review stated: “Firms like Datagen offer a compelling alternative...they will make [data] how you want it, when you want it—and relatively cheaply”. And as we know, profits usually come out on top, even at the cost of our moral compass. 

