Speechmatics recently announced a partnership with Personal.ai to transform the capture of voice memories. To learn more about the company, AI Magazine spoke to its VP of Machine Learning, Will Williams.
Can you tell me about your company?
Speechmatics is a Cambridge scale-up that delivers enterprise-level speech recognition technology to understand every voice, regardless of accent, dialect, age, gender, race, or location. Recently we launched Autonomous Speech Recognition, a major breakthrough in the field that saw us significantly outperform tech giants such as Amazon, Google, Microsoft and Apple in tackling inequality and AI bias by introducing self-supervised learning into our model training. Essentially, this means our AI model can learn from much wider audio datasets because it doesn’t rely exclusively on human-labelled data, which is costly, time-consuming and far from comprehensive.
As a company, we have a deep and rich heritage in research and development, which fuels our growth. The business has grown 240% since we secured Series A investment back in October 2019 and we have expanded globally, opening offices in the US, India and Europe.
What is your role and responsibilities at the company?
It’s been an incredible year, and we are light years from where we were when I joined eight years ago, when the team was just a handful of people. I started at Speechmatics as a Machine Learning researcher when we were a startup, and over the years we have continued to innovate and scale. Being at the cutting edge of machine learning, there can be no let-up in our drive to constantly evolve, which is exciting. As VP of Machine Learning, my job now is to ensure my team is breaking boundaries, helping Speechmatics pioneer, create and build a revolutionary speech recognition system that understands all voices.
How does your company utilise AI to support its operations?
Traditional speech recognition – also known as ‘Automatic’ Speech Recognition – relies on consuming large quantities of audio data, but this has typically been achieved using a narrow set of voices (e.g. audiobooks), all of which have been meticulously labelled by humans.
Our innovation is leading the industry from Automatic Speech Recognition towards Autonomous Speech Recognition (ASR). Using the latest techniques in machine learning, including the introduction of self-supervised models, we take accuracy to the next level. Self-supervised learning is a machine learning method that starts by taking vast amounts of unlabelled data and then uses some property of the data itself to construct a supervised task, without the need for human intervention. Speechmatics’ technology can exploit internet-scale audio sources – such as radio and news channels, TED talks, podcasts and audiobooks that are in the public domain. Using a huge amount of unlabelled data means the AI model learns faster than competitor technologies that rely on narrow labelled datasets.
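The core idea of self-supervised learning – deriving the supervision signal from the data itself – can be illustrated with a toy masked-prediction task. The sketch below is a minimal, generic illustration (not Speechmatics’ actual training pipeline): some audio frames are hidden, and the “labels” the model must predict are simply the original values of those frames, so no human annotation is involved.

```python
import numpy as np

def make_masked_prediction_task(frames, mask_prob=0.15, seed=0):
    """Turn unlabelled audio frames into a supervised task:
    hide a random subset of frames and ask a model to reconstruct them.
    The targets come from the data itself -- no human labelling needed."""
    rng = np.random.default_rng(seed)
    mask = rng.random(len(frames)) < mask_prob
    inputs = frames.copy()
    inputs[mask] = 0.0            # hide the masked frames from the model
    targets = frames[mask]        # supervision = the original frame values
    return inputs, mask, targets

# Stand-in for unlabelled audio: one second of a 5 Hz tone at 100 frames/sec
frames = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 100))
inputs, mask, targets = make_masked_prediction_task(frames)
```

A model trained to fill in the masked frames learns useful representations of speech from raw audio alone, which is why this approach scales to internet-sized datasets.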
We have seen a huge appetite for our technology, which can be used in a variety of commercial scenarios, including media and entertainment, contact centres, customer relations, business intelligence, financial services, government, and edtech. For example, speech technology can be used in media monitoring to set live triggers on chosen keywords, or used for real-time and pre-recorded subtitling. Turning speech into text also enables contact centres to analyse their audio content and understand the mood, tone and overall sentiment of customers, supporting continuous improvements in customer experience.
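The media-monitoring use case – firing a trigger when a chosen keyword appears in a live transcript – is simple to sketch once speech has been turned into text. The snippet below is a hypothetical illustration, not the Speechmatics API; the watchlist words are invented for the example.

```python
import re

# Hypothetical watchlist for a media-monitoring client
KEYWORDS = {"outage", "recall", "breach"}

def keyword_triggers(transcript_segment, keywords=KEYWORDS):
    """Return the watchlist words found in one segment of a live
    transcript stream. Purely illustrative of the pattern."""
    words = set(re.findall(r"[a-z']+", transcript_segment.lower()))
    return sorted(words & keywords)

hits = keyword_triggers("Reports of a data breach at the vendor emerged today.")
# hits == ['breach'] -- the trigger fires on this segment
```

In practice each transcribed segment from the live stream would be passed through a check like this, with matches raising an alert for the monitoring team.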
How will your AI technology play a major role in tackling bias in speech recognition?
Up until now, misunderstandings in speech recognition have been commonplace due to the limited amount of labelled data available on which to train AI models. Labelled data must be manually ‘tagged’ or ‘classified’ by humans which not only limits the amount of available data for training but also the representation of all voices. We are now firmly in the long tail of problems and one of the great remaining problems is bringing equity to all voices.
With our latest engine, Speechmatics’ technology is trained on huge amounts of unlabelled data directly from the internet. By using self-supervised learning, the technology is now trained on 1.1 million hours of audio, which is orders of magnitude more data than prior systems used, and processing it is very computationally demanding. However, this delivers a far more comprehensive representation of all voices and dramatically reduces AI bias and errors in speech recognition.
Based on datasets used in Stanford’s ‘Racial Disparities in Speech Recognition’ study, Speechmatics recorded an overall accuracy of 82.8% for African American voices compared to Google (68.7%) and Amazon (68.6%), reducing speech recognition errors for African American voices by 45%. This level of accuracy equates to reducing errors by three words in an average sentence. Speechmatics’ Autonomous Speech Recognition delivers similar improvements in accuracy across accents, dialects, age, and other sociodemographic characteristics.
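The 45% figure can be sanity-checked from the quoted accuracies, assuming the error rate is simply 100% minus accuracy, and assuming a roughly 20-word average sentence for the “three words” claim (the sentence length is an assumption, not stated in the study):

```python
# Error rates implied by the quoted accuracies (error = 100 - accuracy)
speechmatics_err = 100 - 82.8   # 17.2%
google_err = 100 - 68.7         # 31.3%

# Relative error reduction versus Google
reduction = (google_err - speechmatics_err) / google_err
print(round(reduction * 100))   # -> 45

# Extra errors avoided in an assumed 20-word sentence
avg_sentence_words = 20
words_saved = (google_err - speechmatics_err) / 100 * avg_sentence_words
# ~2.8 words, i.e. roughly three words per sentence
```

So the 45% reduction and the “three words per sentence” claim are consistent with the published accuracy figures under these assumptions.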
It also outperforms competitors on children’s voices, which are notoriously challenging to recognise using legacy speech recognition technology. Speechmatics recorded 91.8% accuracy compared to Google (83.4%) and Deepgram (82.3%) based on the open-source project Common Voice.
What can we expect from Speechmatics in the future?
We are betting big on self-supervised learning and already have our next generation of models in the works. Our plan is to push deep into the long tail of accuracy issues and continue to innovate to build out our best-in-class engine. In the near future, we have some exciting customer announcements and integrations, so keep your eyes peeled. Even with the success of the recent launch, at Speechmatics we will continue to innovate and test ourselves against competitors to make sure we deliver the best speech recognition in the market and make all languages truly understandable.