Researchers at Carnegie Mellon University are working on expanding the number of languages that benefit from modern language technologies such as voice-to-text transcription, automatic captioning, instantaneous translation and voice recognition.
Currently, only a small fraction of the 7,000 to 8,000 languages spoken worldwide have access to these tools, with only around 200 languages having automatic speech recognition capabilities. The team aims to increase this number to potentially 2,000 languages.
"A lot of people in this world speak diverse languages, but language technology tools aren't being developed for all of them," says Xinjian Li, a PhD student at the School of Computer Science's Language Technologies Institute (LTI). "Developing technology and a good language model for all people is one of the goals of this research."
Li is a member of a research team working to make it easier to create speech recognition models by simplifying the data requirements. The team, which includes faculty members Shinji Watanabe, Florian Metze, David Mortensen and Alan Black, recently presented its latest work, ASR2K: Speech Recognition for Around 2,000 Languages Without Audio, at the Interspeech 2022 conference in South Korea.
Linguistic elements shared across languages
To create a speech recognition model, most existing technologies require two types of data: text and audio. While text data is readily available for thousands of languages, audio data is not. The research team aims to eliminate the need for audio data by focusing on linguistic elements that are common across many languages.
Traditionally, speech recognition models focus on a language's phonemes, the distinct sound units that distinguish one word from another, such as the "d" in "dog" that differentiates it from "log" and "cog." Languages also have phones, which describe how a sound is physically produced. Multiple phones might correspond to a single phoneme, which means that even though different languages may have different phonemes, their underlying phones could be the same.
To address this, the LTI team is developing a speech recognition model that moves away from phonemes and instead relies on information about how phones are shared between languages, thereby reducing the effort to build separate models for each language.
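The idea can be sketched in a few lines of Python. This is a hypothetical illustration, not the team's actual model: the phone symbols, language names and mapping tables below are invented for demonstration. It shows how a single shared phone inventory can be reinterpreted per language, so a new language needs only a phone-to-phoneme mapping (derivable from text resources) rather than its own audio corpus.

```python
# Hypothetical sketch of phone sharing across languages.
# A universal acoustic model outputs phones (physical sounds, here in
# IPA-like notation); each language then collapses those phones into
# its own phonemes. Several phones mapping to one phoneme is the
# familiar notion of allophones.

# Per-language mapping from universal phones to that language's phonemes.
# (Toy data for illustration only.)
PHONE_TO_PHONEME = {
    "english": {"d": "d", "d̪": "d", "ɔ": "o", "o": "o", "g": "g"},
    "spanish": {"d": "d", "d̪": "d", "ɔ": "o", "o": "o", "g": "g"},
}

def phones_to_phonemes(phones, language):
    """Map a recognized phone sequence to language-specific phonemes."""
    mapping = PHONE_TO_PHONEME[language]
    return [mapping[p] for p in phones]

# The same phone sequence from the shared recognizer is reinterpreted
# per language, so no per-language acoustic training data is needed.
print(phones_to_phonemes(["d̪", "ɔ", "g"], "english"))  # ['d', 'o', 'g']
```

Because the acoustic side is shared, adding a language in this scheme means adding only a mapping table, which is far cheaper to produce than transcribed audio.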
"We are trying to remove this audio data requirement, which helps us move from 100 or 200 languages to 2,000," says Li. "This is the first research to target such a large number of languages, and we're the first team aiming to expand language tools to this scope."
Researchers say the work has improved existing language tools by just five per cent, but the team hopes it will inspire their future work and that of other researchers.
"Each language is a very important factor in its culture. Each language has its own story, and if you don't try to preserve languages, those stories might be lost," says Li. "Developing this kind of speech recognition system and this tool is a step to try to preserve those languages."