AI ‘represents’ South-East Asians


Like millions worldwide, South-East Asians have been trying out large language models (LLMs) such as Meta’s Llama 2 and Mistral AI – but in their native Bahasa Indonesia or Thai. The result has usually been gibberish in English.

This leaves them at a disadvantage, tech experts warn, as generative artificial intelligence transforms education, work and governance worldwide.

A Singapore government-led initiative aims to correct the imbalance with a South-East Asian LLM, the first in a family of models named SEA-LION – South-East Asian Languages in One Network – trained in the region’s languages and cultural norms.

Trained on data in 11 South-East Asian languages including Vietnamese, Thai and Bahasa Indonesia, the open-sourced model is a cheaper and more efficient option for the region’s businesses, governments and academia, said Leslie Teo at AI Singapore.

“Do we want to force every person in South-East Asia to adapt to the machine or do we want to make it more accessible so people in the region can make full use of the technology without having to be an English speaker?” he said.

“We are not trying to compete with the big LLMs; we are trying to complement them so there can be better representation of us,” Teo, senior director for AI products, told the Thomson Reuters Foundation.

There are over 7,000 languages spoken worldwide. Yet LLMs including OpenAI’s GPT-4 and Meta’s Llama 2, which are used to build AI systems such as chatbots and other tools, have largely been developed for, and trained on, the English language.

Governments and tech firms are trying to bridge this gap: India is creating datasets in local languages, an LLM in the United Arab Emirates is powering generative AI tools in Arabic, and China, Japan and Vietnam have AI models in local languages.

These models can help local populations participate more equitably in the global AI economy that is largely dominated by big tech firms, said Nuurrianti Jalli, an assistant professor at Oklahoma State University’s school of communications.

“Regional LLMs are also needed because they support technology self-reliance,” she said. “Less reliance on Western LLMs could provide better privacy for local populations and also align better with national or regional interest.”

Multilingual language models that are trained on text from several languages at once can infer semantic and grammatical connections between high-resource languages that have more data and low-resource languages, researchers say.

These models can be used in a variety of applications from translation to customer-service chatbots and content moderation on social media platforms that have struggled to identify hate speech in low-resource languages such as Burmese or Amharic.

About 13% of SEA-LION’s data is sourced from South-East Asian languages – more than any other major LLM, said Teo. More than 9% of its data is from Chinese text and about 63% is from English.

Multilingual language models often train on translated text and other poor-quality data that may have errors, so AI Singapore is “careful” about the data used in training SEA-LION, Teo said in his office at the National University of Singapore.

“The age of pristine data has passed – a lot of the stuff on the internet now is material that is generated by LLMs, so we need to verify and filter,” he said.

“We cannot be perfect, but we also cannot take out everything we consider to be bad,” he added.

More governments are contributing data and businesses are testing SEA-LION, which, due to its smaller size, can be deployed faster and is cheaper to fine-tune and adopt, Teo said.

As more countries and regions build their own LLMs, digital and human rights experts fret that they will reproduce only the dominant views expressed online, which can be particularly problematic in nations with authoritarian governments or strict media censorship, or those lacking a strong civil society.

“Training models on such data risks perpetuating biased, prejudiced, incomplete and even misleading narratives,” said Jalli.

“The models may fail to surface important socio-political issues like human rights abuse, corruption or valid criticism of political powers,” she said.

If a model is only trained on favourable articles about a government, then the model is “likely to adopt a worldview where the government is wholly positive and leaves behind dissenting viewpoints,” said Aliya Bhatia, a policy analyst at the Center for Democracy & Technology, a US non-profit.

“Regional LLMs may better reflect the linguistic and cultural nuances of local language speakers, but they may also have less information about the world in general,” she added.

“There is a real risk of government-backed models instilling a revisionist view of history and undermining democratic values.”

But the alternative – relying entirely on Western LLMs with “disproportionately large influences” from wealthy, liberal, Western democracies – means perpetuating different biases related to cultural values, political beliefs and social norms, according to AI Singapore.

“These LLMs have a very particular West Coast American bias – they are very woke. They do not represent us,” said Teo.

“We are not saying ours is the only perspective – we are just trying to rebalance it.” — Reuters
