Sony & AI Singapore Join Forces to Build Language Diversity in LLMs
Diversity in machine thought could soon be on the horizon as Sony Research and AI Singapore (AISG) have announced a collaborative effort to develop large language models (LLMs) for Southeast Asian languages.
“Access to LLMs that address the global landscape of language and culture has been a barrier to driving research and developing new technologies that are representative and equitable for the global populations we serve,” said Hiroaki Kitano, President of Sony Research.
This partnership, formalised through a memorandum of understanding this month, marks a significant step towards addressing the glaring gap in linguistic representation within the global AI ecosystem.
The ‘SEA-LION’ initiative
At the heart of this collaboration lies the SEA-LION (Southeast Asian Languages In One Network) family of LLMs.
These models are specifically designed to cater to the diverse linguistic tapestry of Southeast Asia, a region home to over a thousand languages.
The family includes the SEA-LION 7B Base, a large decoder model pre-trained on a diverse dataset with a significant portion dedicated to SEA languages, as well as the SEA-LION BERT Base and SEA-LION XLM-R Base models, which cater to natural language understanding tasks.
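For readers who want to try the base model, it is distributed via Hugging Face. The snippet below is a minimal sketch, assuming the public model identifier "aisingapore/sea-lion-7b" and the standard transformers generation API; check the official model card for the exact identifier, licence and usage notes before relying on it.

```python
# Minimal sketch: loading SEA-LION 7B Base for text generation.
# Assumes the model is published on Hugging Face under the ID
# "aisingapore/sea-lion-7b" -- verify the identifier and licence
# on the official model card before use.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aisingapore/sea-lion-7b"

# The SEA-LION repo ships a custom tokenizer/model class, hence trust_remote_code.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prompt = "Selamat pagi, apa khabar?"  # a Malay greeting as a sample prompt
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```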
The initial focus of the partnership will be on exploring SEA-LION's capabilities for the Tamil language in the realms of speech generation, content analysis and recognition. Spoken by an estimated 60-85 million people worldwide, Tamil will be the project's starting point before the work potentially expands to other languages available in the models, such as Indonesian.
This is part of the partnership’s broader aim to address the gap in the representation of SEA in the global LLM landscape and work to ensure models are more globally representative of all languages and populations.
The English-centric AI dilemma
The collaboration between Sony Research and AISG comes at a crucial time when the AI industry is grappling with significant language biases.
Despite the existence of over 7,000 languages worldwide, most AI chatbots are trained on a mere fraction of these, with English accounting for the lion's share of training data. This linguistic imbalance has far-reaching consequences, from hampering educational opportunities to potentially derailing legal processes.
A stark example of this problem emerged in the US, where a translator reported that 40% of Afghan asylum cases in 2023 were compromised due to inaccuracies in AI-driven translation apps.
Equally, even multilingual language models (MLMs) often transfer values and assumptions encoded in English into other linguistic contexts where they may not belong.
But the push for more linguistically diverse AI models is not merely a matter of convenience or market expansion. Broadening a model's applicability across diverse global markets also expands the range of data it can effectively draw on, improving its overall understanding.
“AI Singapore is excited to collaborate with Sony Research in this groundbreaking partnership. The integration of the SEA-LION model, with its Tamil language capabilities, holds great potential to boost the performance of new solutions,” says Leslie Teo, Senior Director of AI Products at AI Singapore.
A diverse AI ecosystem
As the collaboration between Sony Research and AISG unfolds, it promises to pave the way for more inclusive and culturally sensitive AI technologies.
Whilst the work on SEA-LION will focus on Tamil, the models also operate in Indonesian, a language spoken by hundreds of millions, opening the door to even greater diversity.
And although many more languages remain underrepresented in LLMs, this research sets a precedent for more globally inclusive AI development.