Should Synthetic Data Train AI?
Real-world data is complicated and costly to collect, and it is often biased. Although the world generated, copied, and consumed 59 zettabytes of data in 2020 alone, that doesn’t mean all companies can use it. Data privacy laws have restricted access to datasets, and AI developers are struggling to get their hands on usable data. Now, synthetic data firms like Datagen and SynthesisAI can simulate digital humans to rapidly create supposedly perfect, diverse, and malleable data.
What Is Synthetic Data?
Synthetic data is generated artificially by computer programmes rather than collected from real-world events. We can use it for just about any machine learning project: self-driving cars, security, robotics, fraud prevention, financial systems, and healthcare. You name it, it’s got an application. According to the MIT Technology Review, it’s “a bit like diet soda. To be effective, it has to resemble the ‘real thing’”.
To create synthetic datasets with the same statistical properties as their real-world counterparts, we use two primary methods: variational autoencoders (VAEs), in which unsupervised learning models compress datasets and then generate new representations of them, and generative adversarial networks (GANs), in which one network (the generator) creates synthetic data and the other (the discriminator) tries to catch the fakes. “Eventually, the generator can make perfect [data], and the discriminator cannot tell the difference”, said PhD student Lei Xu.
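To make the adversarial idea concrete, here is a minimal sketch of a toy GAN in plain Python. It is an illustration under strong assumptions, not how Datagen or any production system works: the "data" are single numbers, the generator G(z) = z + b only learns an offset b, and the discriminator is a one-feature logistic model. The function name train_toy_gan, the stand-in "real" data, and every parameter value are made up for this example.

```python
import math
import random


def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def train_toy_gan(real_data, steps=3000, batch=32, lr=0.05, seed=0):
    """Train a one-parameter generator against a logistic discriminator."""
    rng = random.Random(seed)
    b = 0.0          # generator: G(z) = z + b, with z drawn from N(0, 1)
    w, c = 0.0, 0.0  # discriminator: D(x) = sigmoid(w * x + c)

    for _ in range(steps):
        real = [rng.choice(real_data) for _ in range(batch)]
        fake = [rng.gauss(0.0, 1.0) + b for _ in range(batch)]

        # Discriminator step: push D(real) towards 1 and D(fake) towards 0
        # by descending the loss -[log D(real) + log(1 - D(fake))].
        grad_w = grad_c = 0.0
        for x in real:
            err = sigmoid(w * x + c) - 1.0
            grad_w += err * x
            grad_c += err
        for x in fake:
            err = sigmoid(w * x + c)
            grad_w += err * x
            grad_c += err
        w -= lr * grad_w / batch
        c -= lr * grad_c / batch

        # Generator step: shift the fakes so the updated discriminator rates
        # them as more "real", by descending the loss -log D(fake).
        grad_b = sum((sigmoid(w * x + c) - 1.0) * w for x in fake) / batch
        b -= lr * grad_b

    return b, (w, c)


# Hypothetical stand-in for a real dataset: 1,000 numbers centred on 2.0.
data_rng = random.Random(42)
real_data = [data_rng.gauss(2.0, 1.0) for _ in range(1000)]
offset, _ = train_toy_gan(real_data)
real_mean = sum(real_data) / len(real_data)
print(f"real mean: {real_mean:.2f}, generator offset: {offset:.2f}")
```

With a discriminator this simple, the generator can only be pushed towards the real data's average. Real GANs use deep networks on both sides so that much finer structure can be matched, and they are considerably harder to train.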
Who Uses It?
- Datagen. The company collects raw data from full-body scanners and simulates 3D humans. It’s currently working with four major, mysterious tech companies in areas like VR, security, and vision sensing systems.
- Click-Ins. To automate vehicle inspections, its algorithms re-create thousands of cars under different damage conditions, colours, and lighting to train its AI. This helps it avoid certain data privacy issues, such as accidentally capturing licence plate numbers.
- Mostly.ai. Operating with financial, insurance, and communications firms, the company makes fake client data spreadsheets so that its customers can collaborate with their vendors without revealing sensitive information.
Why Should We Use It?
First, synthetic datasets can be labelled automatically as they are generated, because the program that creates each example already knows what it represents. AI developers can construct their training sets much more quickly, with fewer privacy violation concerns, and even simulate rare events, such as a pandemic like COVID-19 that only happens once every century or so. They can also test tricky edge cases (situations that occur at extreme operating parameters).
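The labelling point can be sketched in a few lines of Python. Because the program decides whether a record is fraudulent before it generates the record, the label comes for free, and the share of rare cases is just a parameter. Everything here is hypothetical: the function make_transactions, the amount ranges, and the fraud_rate values are invented for illustration and are not drawn from any real dataset.

```python
import random


def make_transactions(n, fraud_rate, seed=0):
    """Generate n synthetic transactions, each labelled at creation time."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        is_fraud = rng.random() < fraud_rate
        # Assumed rule for this sketch: fraudulent amounts are much larger.
        if is_fraud:
            amount = rng.uniform(500, 5000)
        else:
            amount = rng.uniform(5, 300)
        rows.append({
            "amount": round(amount, 2),
            "label": "fraud" if is_fraud else "legit",
        })
    return rows


# A realistic mix, where fraud is rare ...
typical = make_transactions(1000, fraud_rate=0.01)
# ... and a stress-test mix, where the rare event is deliberately common.
stress = make_transactions(1000, fraud_rate=0.5)

for name, rows in [("typical", typical), ("stress", stress)]:
    n_fraud = sum(row["label"] == "fraud" for row in rows)
    print(f"{name}: {n_fraud} fraud records out of {len(rows)}")
```

Turning up fraud_rate is the synthetic-data equivalent of rehearsing a rare event: the model sees many examples of something that real logs would contain only a handful of times.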
Are the Costs Worth It?
Even though synthetic data might speed up our AI training systems, we shouldn’t trust it blindly. “What I don’t want to do is give the thumbs-up to this paradigm and say, ‘Oh this will solve so many problems’”, said Cathy O’Neil, data scientist and founder of ORCAA, an algorithmic auditing firm. Aaron Roth, a professor of computer and information science at UPenn, added: “[Just because it’s synthetic] doesn’t mean that it does not encode information about real people”.
For instance, synthetic data usually includes fewer people from ethnic minority groups because they’re generally underrepresented in the real-world data it is modelled on. This means that synthetic humans from minority groups are potentially less realistic than their Caucasian counterparts, which can reinforce racial bias.
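A small Python sketch shows how this can happen even with the simplest possible generator. It assumes a made-up "real" dataset in which one group accounts for only 10% of the records, then draws synthetic records with the same group frequencies. The group names, the 90/10 split, and the function synthesise_groups are assumptions for this example, not figures from any of the sources quoted here.

```python
import random
from collections import Counter


def synthesise_groups(real_groups, n, seed=0):
    """Draw n synthetic group labels with the same frequencies as the real ones."""
    rng = random.Random(seed)
    counts = Counter(real_groups)
    groups = list(counts)
    weights = [counts[g] for g in groups]
    return rng.choices(groups, weights=weights, k=n)


# Hypothetical "real" data: group_b is heavily underrepresented.
real_groups = ["group_a"] * 900 + ["group_b"] * 100

synthetic = synthesise_groups(real_groups, n=10_000)
share = Counter(synthetic)["group_b"] / len(synthetic)
print(f"group_b share in the synthetic data: {share:.1%}")
```

The generator faithfully copies the imbalance it was given. Correcting it means deliberately changing the weights, which is a design decision that someone has to make, not something synthetic data does by itself.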
Yet overall, many think that synthetic data is the best way to feed our information-hungry AI algorithms.
As the MIT Technology Review stated: “Firms like Datagen offer a compelling alternative...they will make [data] how you want it, when you want it—and relatively cheaply”. And as we know, profits usually come out on top, even at the cost of our moral compass.