Machine-learning models trained on computer-generated video data performed better than models trained on real-life videos in some instances, a team of US researchers has discovered, in work that could help remove the ethical, privacy, and copyright concerns of using real datasets.
Researchers currently train machine-learning models using large datasets of video clips showing humans performing actions. However, collecting these clips is expensive and labour-intensive, and they often contain sensitive information. Using the videos might also violate copyright or data protection laws, as many datasets are owned by companies and are not available for free use.
To address this, a team of researchers at MIT, the MIT-IBM Watson AI Lab, and Boston University built a synthetic dataset of 150,000 video clips capturing a wide range of human actions, which they used to train machine-learning models. The clips are generated by a computer that uses 3D models of scenes, objects, and humans to quickly produce many variations of specific actions, without the potential copyright issues or ethical concerns that come with real data.
The researchers then showed six datasets of real-world videos to the models to see how well they could learn to recognise actions in those clips. They found that, for videos with fewer background objects, the synthetically trained models performed even better than models trained on real data.
Synthetic data tackles ethical, privacy, and copyright concerns
This work could help scientists identify which machine-learning applications are best suited to training with synthetic data, mitigating some of the ethical, privacy, and copyright concerns of using real datasets.
“The ultimate goal of our research is to replace real data pretraining with synthetic data pretraining,” says Rogerio Feris, Principal Scientist and Manager at the MIT-IBM Watson AI Lab, and co-author of a paper detailing this research. “There is a cost in creating an action in synthetic data, but once that is done, then you can generate an unlimited number of images or videos by changing the pose, the lighting, etc. That is the beauty of synthetic data.”
The paper was written by lead author Yo-Whan “John” Kim; Aude Oliva, Director of Strategic Industry Engagement at the MIT Schwarzman College of Computing, MIT Director of the MIT-IBM Watson AI Lab, and a senior research scientist in the Computer Science and Artificial Intelligence Laboratory (CSAIL); and seven others. The research will be presented at the Conference on Neural Information Processing Systems.
The researchers compiled a new dataset using three publicly available datasets of synthetic video clips that captured human actions. Their dataset, called Synthetic Action Pre-training and Transfer (SynAPT), contained 150 action categories, with 1,000 video clips per category.
They used this dataset to pre-train three machine-learning models to recognise the actions, and were surprised to find that all three synthetically trained models outperformed models trained with real video clips on four of the six datasets. Accuracy was highest for datasets containing clips with "low scene-object bias", that is, clips in which a model cannot recognise the action from the background or other objects alone and must focus on the action itself.
Researchers hope to create a catalogue of models that have been pre-trained using synthetic data, says co-author Rameswar Panda, a research staff member at the MIT-IBM Watson AI Lab. “We want to build models which have very similar performance or even better performance than the existing models in the literature, but without being bound by any of those biases or security concerns.”