Machined for Malaysia: This AI startup is putting local flavour into large language models


As generative artificial intelligence becomes a global sensation, startups are infusing it with local data to give it a Malaysian flavour. — Image by rawpixel.com on Freepik

Big names such as OpenAI’s GPT models (including the well-known ChatGPT) and, to a lesser extent, Meta’s LLaMa and Google’s PaLM have made their mark on the large language model (LLM) field, but this doesn’t fully depict the artificial intelligence (AI) landscape in Malaysia.

Since the AI boom of last year, the world has found itself in what appears to be an arms race to develop progressively more sophisticated models, and Malaysia is no exception.

For instance, there’s Mesolitica, a local startup that specialises in creating narrow AIs (machine learning focused on a single, specific task) and training LLMs, and has already released a number of them as open-source projects.

The startup’s co-founder and chief technical officer, Husein Zolkepli, says that the goal was to make a model capable of interacting entirely in Bahasa Malaysia.

“We want an AI model to capture the context of Malaysians; a lot of those existing models do not crawl enough data for local context.

“When it comes to current events or other topics that they were not trained on, ChatGPT Plus and Bing Chat rely on indexed search engine results, which they will rephrase before providing a response.

“These are not embedded within the models by default,” he says, adding that Mesolitica’s AI models are focused on including contextual details, with data from local forums, news portals, and social media posts that were used in the training process.

According to the Mesolitica GitHub page, the database has accumulated a total of 240GB of text as of Nov 21.

Khalil (left) and Husein formed Mesolitica to create generative AI capable of interacting entirely in local languages. — YAP CHEE HONG/The Star

The current release of the model has the capacity to understand local nuances, such as slang, bahasa pasar and Manglish, along with Mandarin and Tamil.

“Another thing is about privacy. When you make a request to ChatGPT or Bing Chat, everything goes to their servers.

“But with open-source models like ours, any user can host their own on a local machine, with their own set of intended goals, ideas, rules and practices.

“It’s all about keeping things decentralised instead of having to rely on a specific company like OpenAI to keep things running,” Husein says.

Khalil Nooh, the CEO and co-founder of Mesolitica, has observed a rising demand for this sort of LLM in the local commercial space.

“From the business perspective, companies will say that they want their own AI, that they want to jump on the bandwagon with their own private datasets and train their own LLMs.

“Typically, it would be for a customer service chatbot that works like ChatGPT, which can answer questions from users based on a company’s data,” he says.

Moving forward, they may explore other possibilities with LLMs that involve text-based tasks that can be built on top of the base model.

“We’ve already done a few rounds of training on our models to incorporate Malaysian context. Everything is available as open source. On my end, I can confidently say that we have a 100% alternative to GPT.

“This is valuable since there may be compliance or privacy issues associated with providing data to an external company such as OpenAI. They may want specific features or have ideas they want to implement.

“Anyone can download the repository and build on it with their own data and with their own team, which keeps things affordable for smaller-scale companies, but if they require our expertise, they can just approach us,” he says.

Conflicting concerns

Most LLMs scrape local comments for training from forums and social media, along with local news platforms.

This data can be related to local politics and investments such as cryptocurrencies and stock markets.

According to commercial and technology lawyer Edwin Lee, this practice is in somewhat of a legal grey area at this point in time. Lee is one of the founders of the Lee & Poh Partnership and also serves as a deputy managing partner.

Lee says that when it comes to AI, a balance needs to be struck between regulation and accessibility. — Edwin Lee

“The legality of using scraped content for AI training, particularly in Malaysia, hinges on the application of existing copyright laws, which may not have been originally designed to address the unique challenges posed by digital content and AI.

“There isn’t any case law specifically addressing AI because this has not been tested in court.

“From my perspective as a technology lawyer, all I can say is that it’s a grey area until we see litigation in court to see how judges and legislators respond,” he adds.

Lee also notes that the Science, Technology and Innovation Ministry is considering the regulation of AI applications within the country. This involves the formulation of a bill with consultations from technology experts, legal professionals, stakeholders, and the public.

“The current legal landscape in Malaysia is not fully equipped to handle the specific challenges posed by AI training.

“This lack of specific regulation can lead to ambiguities and potential conflicts regarding intellectual property and privacy.

“Therefore, there is a pressing need for regulation that specifically addresses AI training.

“The goal should be to create a legal framework that not only protects the rights of individuals and organisations but also provides clear guidelines for responsible and ethical AI development,” he says.

However, those actively involved in AI training, like Husein, hope for little to no regulation, as they fear that any form of control could impede their ability to take the technology further.

“Technologists like me want to push things as much as possible. I just want the technology to be there.

“I just hope that it remains open as it currently is, with minimal regulation despite the concerns,” he says, arguing that training LLMs using content available online is akin to how Google’s search bots crawl the Internet to index search results.

Khalil, on the other hand, raised a comparison to how OpenAI had done the same thing when training its LLM, which his company is now emulating.

“Once legal frameworks get updated, then we’ll have to adapt, but for now, we are pushing the boundaries. If not, we’ll not be able to create something that fits in the Malaysian context.

“All we are doing is imitating and automating the human browsing experience, so until there’s a clear AI guideline on what can or cannot be done, we won’t change how we are doing things. Technology is moving at a rapid pace, and the legal side of things is playing catch up,” Khalil says.

Another potential issue is that major tech companies set rules – known as AI alignment – for the AI models they host, but users running open-source models can choose to set no conditions at all.

Khalil acknowledges that having such freedom can be a double-edged sword, saying that “it’s very easy to create fake news; all you need to do is prompt it for the specific kind of text you want, and the AI model will generate it.

“Another aspect would be upping the level of phishing scams. You can easily make robocalls and clone the voices of relatives with what is available.

“It’s about how we weigh the pros and cons of things being kept open-source, as opposed to giving it all to Big Tech and having everything be regulated,” he says.

OpenAI has said that it would fund the legal fees of its users who get sued over copyright infringement. Other tech companies have made similar announcements. — AFP Relaxnews

Despite the potential rabbit holes that the technology brings with it, Khalil believes that furthering its development in an open environment is the way to go.

“Personally, I’m worried, but going back to our motivation, it is to continue to be at the bleeding edge.

“The idea with open-source is to have more good guys to fight the bad guys,” says Khalil.

Balancing act

The legal aspect remains an ongoing concern, with OpenAI saying that it would fund the legal fees of its users who get sued over copyright infringement. Other tech companies have made similar announcements.

There has also been pushback from copyright holders, with publications in the news industry such as CNN, Reuters, the BBC, and the New York Times moving to block scraping of their content.

Novelists and non-fiction author groups have begun legal battles with OpenAI over the unauthorised use of their intellectual property in the training of AI.

According to Lee, this move highlights the existing legal uncertainties in AI training and content usage.

“By restricting access, these media outlets are essentially calling for a more structured legal approach to content usage in AI, one that respects copyright and compensates content creators.

“While this may limit the scope of data available for AI training, it also emphasises the need for AI technology to develop within a legally compliant and ethically sound framework,” Lee says.

This illustration picture shows icons of Google's AI (Artificial Intelligence) app BardAI (or ChatBot) (centre left), OpenAI's app ChatGPT (centre right) and other AI apps on a smartphone screen. — AFP

Conversely, the Associated Press struck a licensing agreement with OpenAI for its archive of news stories.

Lee says the presence of collaborative agreements between news outlets and AI companies provides a model for legally compliant and mutually beneficial use of content in AI training.

“These partnerships respect the intellectual property rights of the content creators while allowing AI companies to access high-quality, diverse datasets.

“Such agreements can serve as a template for future collaborations, demonstrating how AI development can proceed in a way that is both legally sound and respectful of copyright.

“These partnerships also provide a mechanism for content creators to have control over how their content is used and to potentially benefit from the advancements in AI,” he says.

However, those who work with LLMs, like Khalil and Husein, have a different perspective.

“This is an argument that has been ongoing in the United States – the idea is that they want to force AI developers to get a licence for the content.

“The issue is that this would prevent entry into the field by startups like us, open-source enthusiasts, and even hobbyists who just want to play around with AI for fun.

“It takes control away from the open-source community and hands it to the big companies who are able to afford such licensing, which may indeed stifle innovation in the field.

“Not all startups will have the deep pockets required to properly license things, unlike the major players, not to mention the regular hobbyists being priced out,” Khalil says.

He adds that the open-source community allows users to build on top of each other’s work, and without it, there would be a major roadblock in the way of innovating without substantial financial backing.

Khalil shares that he and Husein trained an AI speech model, released it as open source, and later discovered that it had been integrated into a separate model released by other users.

“It’s satisfying to see our work being used to further the technology field. If open-source is affected, then the many things that people can just pick up and build from the community will also be affected.

“But we do recognise that we’ll have to adapt once regulation enters the picture,” he says.

Lee believes that a balance between regulation and accessibility needs to be struck in order to reach an agreeable resolution to the current state of AI.

While regulations are essential to ensure legal and ethical compliance in AI, there’s a valid concern that overly stringent rules could hamper innovation.

“From the side of the content owner, the decision to block AI companies from using their content is a defensive legal strategy to safeguard their intellectual property.

In an open letter, the US Authors Guild writes that 'Millions of copyrighted books, articles, essays, and poetry provide the 'food' for AI systems, endless meals for which there has been no bill'. — dpa

“On the other hand, AI companies are hungry for data because they need it to train their LLMs, so standing from that viewpoint, AI companies will say it is not fair.

“The key is to develop regulations that provide clear guidelines and legal certainty for AI developers while being flexible enough to adapt to rapid technological advancements.

“That’s where the law needs to step in – in the middle, but the challenge is in creating a regulatory environment that does not stifle innovation while safeguarding ethical and legal standards.

“This involves continuous dialogue between regulators and the tech community to ensure that regulations remain relevant and effective,” says Lee.
