Speak up! The impact of AI on voice recognition software
At the end of October, LivePerson acquired VoiceBase, which uses artificial intelligence (AI) and natural language processing to turn audio into structured, rich data. It could do this, for example, to a customer phone call, with obvious benefits for a company like LivePerson, a US-based software platform that allows customers to communicate with businesses.
This is just the latest acquisition in what is becoming a very hot area of technology. In April 2021, Microsoft acquired Nuance, an early pioneer of speech recognition AI, best known as the source of the speech recognition engine for Apple’s Siri personal assistant.
Voice-controlled speakers and digital personal assistants are where most people will have encountered this technology, but it is also being used to make driving safer through voice-activated navigation, save time for medical staff by automating dictation, and increase security with voice-based authentication.
A rapidly growing market
As the LivePerson acquisition suggests, customer service is another area where voice recognition is growing in significance. Most people prefer voice requests to typing or clicking through menus, but few companies can afford to have a person available to handle every discussion. Technology now makes it easier to handle some of those interactions and it allows organisations to collect voice data. The data can be used to train AI systems to understand speech better, but organisations naturally want to analyse it further and AI can help here too.
The global voice recognition market, valued at $10.7 billion in 2020, is expected to be worth $27.2 billion by 2026. This includes several overlapping technology types and use cases. Voice recognition could be used by dictation software to convert the content of a meeting into a text document, for example. To do that, the computer needs to identify the words spoken, but it doesn't have to be very concerned about their meaning.
That changes somewhat when a voice assistant needs to tell you about the weather by ‘understanding’ a range of commands. In most cases, though, the device will recognise a variety of ways of asking the question but not every single one. These tools can also be used to analyse large quantities of audio and identify key qualities, such as sentiment.
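To make that idea concrete, here is a minimal, hypothetical sketch of how an assistant might map many surface phrasings onto one intent. Production assistants use statistical natural language understanding models rather than hand-written rules, and the intent names and patterns below are invented for illustration; but the principle is the same: a variety of phrasings are recognised, though never every single one.

```python
import re

# Hypothetical rule-based intent matcher: each intent maps to regex patterns
# covering common phrasings of the same request.
INTENT_PATTERNS = {
    "get_weather": [
        r"\bweather\b",
        r"\bis it (raining|sunny|cold|hot)\b",
        r"\bdo i need (an umbrella|a coat)\b",
    ],
    "set_timer": [
        r"\b(set|start) a timer\b",
        r"\bremind me in\b",
    ],
}

def classify(utterance: str) -> str:
    """Return the first intent whose patterns match, or 'unknown'."""
    text = utterance.lower()
    for intent, patterns in INTENT_PATTERNS.items():
        if any(re.search(p, text) for p in patterns):
            return intent
    return "unknown"  # the phrasing fell outside what the rules cover

print(classify("What's the weather like today?"))  # get_weather
print(classify("Do I need an umbrella?"))          # get_weather
print(classify("Will it snow tomorrow?"))          # unknown
```

The last line shows the limitation the article describes: a perfectly reasonable weather question goes unrecognised because no rule anticipated that phrasing, which is exactly the gap that learned models aim to close.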
These solutions are constantly evolving. For example, Speechly, a Finnish start-up, recently patented a new technological approach that combines speech-to-text with natural language understanding in a novel way. The company claims that this enables faster and more complex voice interactions than current solutions.
All this is done with a variety of algorithms, which use variations of grammar rules, probability, and speaker recognition to identify and classify spoken phrases. The difference between a good speech recognition algorithm and a poor one is accuracy rate and speed. Companies want the system to work as quickly as possible, with minimal errors.
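The standard way the industry quantifies that accuracy is word error rate (WER): the number of word substitutions, deletions and insertions a transcript needs to match a reference, divided by the reference length. A small self-contained sketch, using the classic Levenshtein dynamic program (the example sentences are invented):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed with the Levenshtein edit-distance dynamic program over words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # d[i][j] = minimum edits turning the first i reference words
    # into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# The recogniser dropped "on" and misheard "kitchen": 2 errors over 5 words.
print(word_error_rate("turn on the kitchen lights",
                      "turn the kitten lights"))  # 0.4
```

A lower WER means fewer errors; vendors compete on driving this number down while keeping recognition fast enough for real-time use.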
Data is key
The key to all of this, of course, is data: voice data that is tagged and categorised so AI can understand what it is and how it relates to everything else in the sphere of the spoken word. As we've already seen with visual data (behind, say, autonomous driving), labelling data is a labour-intensive process and the end product is only as good as the initial input. Manual labelling is accurate but limited, both in speed and volume and in capturing the infinitely varied intricacies of global speech: tone, delivery, accents, slang and so on.
How can we label data faster? Or perhaps we need to change how we look at the problem and refine our machine learning and deep learning models instead. The sheer amount of data is a challenge, but also a blessing: if machines can be trained to pull accurate insights directly from raw data, then we are really moving into the world of seamless voice translation at mass volume. It's a game-changing moment.
Along these lines, I recently read an interesting blog from Katy Wigdahl, CEO of Speechmatics – a leading speech recognition technology company. Katy talked about how the company has this month launched its Autonomous Speech Recognition software which, using deep learning, can accurately train on huge amounts of 'unlabelled' data drawn directly from the myriad of sources on the internet, such as online radio, broadcasts, videos, podcasts and other media. Put simply, this type of innovation removes the data-labelling bottleneck and opens the door to tremendous advances in speech recognition.
Power and compute
Much of this work relies on high-performance computing (HPC), in both the cloud and on-premise data centres, to work successfully. Voice recognition is a very compute-intensive operation that requires classic parallel processing, low latency and lots and lots of GPUs. The cloud is a perfect receptacle for collecting, accessing and inputting data, while on-premise or colocation data centres like Kao Data are ideal for the traditional parallel processing task – enabling an HPC cluster, supported by GPUs, to be fine-tuned within a close, connected environment to crunch data.
It’s possible that within a few years, voice will be our main method of interacting with computers and they will ‘speak’ their information back to us through smart earphones or project data onto augmented reality glasses. The cloud, high-performance computing and 5G connectivity will ensure that all of this happens with lightning speed. We are still in the early stages of voice recognition and there is plenty more to come.