Stanford researchers gave a popular artificial intelligence chatbot a language test.
They asked the bot, in Vietnamese, to write a traditional poem in the form known as “song thất lục bát,” which follows a pattern of lines of seven, seven, six, then eight words. The bot produced a poem but didn’t follow the format.
The team tried a different prompt, asking what the proper Vietnamese word was for a mother’s younger brother, and it responded with the words for a father’s younger and older siblings.
These flaws are not unique to Claude 3.5, the chatbot by the AI company Anthropic that the researchers queried, but they illustrate some of the ways in which AI can get language outside of standard American English wrong.
While the use of AI has exploded in the West, much of the rest of the world has been left out of the conversation since most of the technology is trained in English. AI experts worry that the language gap could exacerbate technological inequities and that it could leave many regions and cultures behind.
A delay of even a few years in access to good technology “can potentially lead to a few decades of economic delay,” said Sang Truong, a doctoral candidate at Stanford University’s Artificial Intelligence Laboratory and a member of the team that built and tested a Vietnamese language model against others.
The tests his team ran found that AI tools across the board could get facts and diction wrong when working with Vietnamese, likely because it is a “low-resource” language by industry standards, meaning there aren’t sufficient data sets and content available online for AI models to learn from.
Low-resource languages are spoken by tens and sometimes hundreds of millions of people around the world, but they yield less digital data because AI development and online engagement are centered in the United States and China. Other low-resource languages include Hindi, Bengali and Swahili, as well as lesser-known dialects spoken by smaller populations around the world.
An analysis of top websites by W3Techs, a tech survey company, found that English makes up more than 60% of the internet’s language data. While English is widely spoken globally, native English speakers make up only about 5% of the world’s population, according to Ethnologue, a research organization that collects language data. Mandarin and Spanish are other examples of languages with a significant online presence and reliable digital data sets.
Academic institutions, grassroots organizations and volunteer efforts are playing catch-up to build resources for speakers of languages that aren’t as well represented in the digital landscape.
Lelapa AI, a startup based in Johannesburg, South Africa, is leading such efforts on the African continent, developing multilingual AI products for people and businesses in Africa.
“I think it’s such a dangerous concept that people need to assimilate to a different culture and have to take on different cultures in order to have access to progress,” said Pelonomi Moiloa, CEO and cofounder of Lelapa AI.
The company is less focused on scale than on community-specific solutions, she said. It is crafting its products to be more resource-efficient and cost-effective, and to center on speech-to-speech communication in local languages, making the technology more accessible to African users.
“Large companies like Google, Apple, OpenAI, for example, have not necessarily trained their models for tools that serve these markets,” Chinasa T. Okolo, a fellow at the Center for Technology Innovation at the Brookings Institution, said about communities with low-resource languages. “They don’t provide enough market value for them to do so.”
A communications officer for OpenAI said the company releases AI systems steadily to more groups of people and that its latest model supports more than 50 languages. Google pointed to its projects focusing on AI development for underrepresented languages, including a “1,000 languages” initiative, announced in 2022, to build language models for the 1,000 most-spoken languages in the world. Apple said it, too, has developed products to support a range of languages.
The language gap in AI tools can have numerous consequences. The technology has the potential to increase productivity and change workplaces, but without reliable data in local languages, some regions of the world could miss out on the economic benefits, according to AI experts. The exclusion of low-resource languages could also lead to cultural bias in AI products.
AI’s lack of knowledge in low-resource languages has the potential to raise security concerns as well. Sara Hooker, the head of Cohere for AI, the nonprofit research arm of the startup Cohere, said some users could bypass the safety measures of AI products by asking questions in other languages.
“You can easily, for example, still get very dangerous instructions about how to build a bomb just by switching to a different language,” Hooker said.
Hooker’s team at Cohere for AI launched a broad model and data set for multilingual AI, called Aya, in February. It covers 101 languages and relies on the volunteer efforts of more than 3,000 independent researchers. But Hooker said that even a project that large isn’t a complete solution to the language lag.