Tell me about Datagen, your role and your journey to founding the company.
Datagen’s CTO, Gil Elbaz, and I co-founded the company in 2018 with a mission to transform how teams work with data for computer vision AI development. The year before, we had seen a demo of the Oculus Rift, and afterwards we found ourselves wondering: with sophisticated cameras embedded in the headset, why was a handheld device needed to connect the virtual space to the physical one? That’s when we realised the answer was data. From there, we knew there was a huge opportunity to solve 3D spatial presence challenges using advanced computer graphics and 3D simulation. But instead of focusing solely on VR/AR, we took a more holistic approach, concentrating on the seemingly intractable problem of generating high-performance data to enable human-centric, real-world AI applications.
Now, Datagen is on a mission to transform how CV teams access, define and generate the data they need for AI applications. We call it data-as-code, because we believe the ultimate way to increase AI adoption globally is to make data acquisition as easy as writing code. Our platform delivers visual data with unmatched domain coverage and high variance. Our approach turns the heavy operational process of visual data collection and annotation into an easy-to-control programmable interface, enabling CV teams to generate high-fidelity visual data and to train and evaluate models across the development lifecycle. Today, our customers include Fortune 500 companies across a variety of industries, including AR/VR, home security, automotive, robotics, and more.
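To make the "data-as-code" idea concrete, here is a minimal, purely illustrative sketch of what expressing a data request as code could look like. This is a hypothetical schema invented for this article, not Datagen's actual API: the `SceneSpec` class, `generate_specs` function, and parameter names are all assumptions.

```python
# Hypothetical sketch of a "data-as-code" request. This is NOT Datagen's
# real API; all names and parameters here are invented for illustration.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class SceneSpec:
    """One synthetic scene to render (hypothetical schema)."""
    subject: str
    lighting: str
    camera_angle_deg: int

def generate_specs(subjects, lightings, angles):
    """Expand parameter ranges into concrete scene specs; variance is
    expressed as code rather than as a manual collection effort."""
    return [
        SceneSpec(subj, light, ang)
        for subj, light, ang in product(subjects, lightings, angles)
    ]

# Requesting high-variance data becomes a few lines of code:
specs = generate_specs(
    subjects=["driver_adult", "driver_elderly"],
    lightings=["day", "night"],
    angles=[0, 30],
)
print(len(specs))  # 2 * 2 * 2 = 8 scene combinations
```

The point of the sketch is the workflow, not the names: once the request is code, adding an edge case is a one-line change to a parameter list rather than a new round of photography and annotation.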
In my role as CEO, I advise on company strategy, work with our top clients and VC partners, and build our team so that we can make even wider leaps in the synthetic data space.
Take me through your recent report on synthetic data, what were the key findings and why should businesses take note?
Sure, so last December, Datagen released a research study to better understand how computer vision teams obtain and use AI/ML training data for computer vision systems and applications, and how those choices impact their projects.
Key findings of the study included:
- 99% experienced project cancellations due to inadequate training data.
- 50% of respondents reported working with synthetic data from publicly available datasets.
- Of those not currently using synthetic data, over 38% plan on using synthetic data in 2022.
- 57% reported that their training delays could have been mitigated by “data that covered more edge cases.”
From my perspective, one of the biggest takeaways from this study is that what was once a fragmented field is beginning to coalesce around the promise of synthetic data to help mitigate frequent project delays and cancellations. The study revealed that training data has become a significant stumbling block for computer vision professionals, who cited a number of data-related complications hindering their organisation’s progress.
For businesses, it’s important to understand that synthetic data is the future of data. It’s a new way to control and consume the data that our AI systems need. We plan to run another survey towards the end of the year to track the progress CV teams have made.
How do you think we should use AI, data and machine learning for innovations that benefit humanity?
There’s no question that machine learning and AI can change the world for the better. As I mentioned, we at Datagen are working across several industries, from automotive to home security, to help businesses leverage data and AI to benefit humanity. In automotive, regulations such as those from the European New Car Assessment Programme (Euro NCAP) are driving the need for vehicles to integrate vision systems that detect mission-critical in-cabin scenarios to improve driver safety. Take a driver starting to doze off as an example: an AI-based in-cabin alert system would be able to identify that and recommend they pull over at a rest stop, or turn on the AC to help keep them awake. And with in-home security use cases, we see companies investing in alerting users to suspicious outdoor activity, and so on.
Why do you think so many choose synthetic data for computer vision?
Generating diverse data at scale is key to achieving the robust results needed to train AI/ML applications. Today, most companies do this manually, resulting in data scientists spending more than 80% of their time on “data” instead of “science.” By using synthetic data, computer vision teams can not only save time and resources but also ensure that their data reflects real-world scenarios. Companies can iterate on synthetic data because it can be continually generated and tested, without being weighed down by the manual collection and processing that come with natural data. It’s much easier to repeatedly train and test with synthetic data.
Essentially, generating diverse datasets at scale for computer vision training is an expensive, time-consuming process, and variance is sharply limited by the simple fact that situating humans in specific locations and photographing them is far more complicated and costly than doing so in a simulated environment. Additional benefits include effectively eliminating the need for manual annotation, a tedious process prone to human error, and reducing the biases that often occur in real-world data.
What's next for Datagen?
It’s a really exciting time here at Datagen. We have just closed $50M in Series B funding, which brings our total financing to over $70M. Looking to the future, we plan to continue building on our leadership position in the computer vision space. One of the fastest-growing fields within AI, computer vision needs a proper infrastructure stack to advance the development of AI and its most immediate applications. As part of this, we plan to expand our synthetic 3D character model creation with updated controls and features, giving computer vision teams even more granular control when customising their preferred datasets. We are also expanding into additional domains, with new use cases looking to harness the power of synthetic data.
Lastly, we see a need for standardisation, so we are aiming to establish standard data formats for common annotations.