By Nathan Temeyer

The Minority Language Hack for Large-Language Models


In a world hyper-saturated with information, the ability to digest, comprehend, and engage with data is a game-changer. For Enterprise Support Organizations (ESOs) that navigate the support landscapes of low-income countries, there is often a disconnect between the wealth of English-based technology and the support we can offer businesses in their local languages. That gap creates a significant inability to unlock the full potential of tech-based solutions, and our businesses suffer as a result.


At ONOW, we have been grappling with the challenges posed by Large Language Models (LLMs) like ChatGPT, particularly when it comes to minority languages. With a technology that Bill Gates has called the most important advance since 1980, there is enormous untapped potential and opportunity for micro-businesses to benefit. However, the deficiencies of ChatGPT in non-English languages are well-documented, and both performance and price are significantly worse.


GPT - The Gift & The Challenge

GPT is incredible. However, it was trained on over 90% English data, making it far better in English than in any other language. Languages less represented in the training data often suffer from hallucinations, unreadable output, and inflated token counts that drive up cost. Results also tend to be monocultural, since the majority of the training data comes from a Western perspective. While this may be a non-starter for many domains, we have found that the impact of this monocultural lens is reduced when rendering business support, and the residual relics of inappropriate cultural references can largely be prevented through careful use of tools like Retrieval Augmented Generation (RAG).
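
To make the RAG idea concrete, here is a minimal sketch of how locally curated business-support material could be retrieved and injected into the prompt so the model grounds its answers in culturally appropriate content. The embedding model, example snippets, and helper functions are illustrative assumptions, not our production pipeline.

```python
# Minimal RAG sketch (illustrative, not production code): ground GPT's answers
# in locally curated business-support snippets instead of its default Western lens.
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A tiny, hypothetical local knowledge base of culturally appropriate guidance.
local_snippets = [
    "Micro-businesses in Myanmar often rely on informal community savings groups.",
    "Mobile money agents are a common channel for small-business payments.",
    "Market-stall vendors usually restock inventory daily from wholesale markets.",
]

def embed(texts):
    """Embed a list of strings with an OpenAI embedding model."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

snippet_vectors = embed(local_snippets)

def retrieve(question, k=2):
    """Return the k snippets most similar to the question (cosine similarity)."""
    q = embed([question])[0]
    scores = snippet_vectors @ q / (
        np.linalg.norm(snippet_vectors, axis=1) * np.linalg.norm(q)
    )
    return [local_snippets[i] for i in np.argsort(scores)[::-1][:k]]

def answer(question):
    """Answer a question with the retrieved local context prepended."""
    context = "\n".join(retrieve(question))
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # model choice here is an assumption
        messages=[
            {"role": "system",
             "content": "Answer using the local context below when relevant.\n" + context},
            {"role": "user", "content": question},
        ],
    )
    return reply.choices[0].message.content
```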


Another common workaround for these challenges is fine-tuning - adapting the model to better understand and process minority languages by retraining it on a larger corpus of minority-language text. This has its merits, but it's not without problems. For one, it's incredibly expensive, and minority languages can carry 10-20x the token usage of English, which makes it even more so.
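
For reference, the mechanics of fine-tuning are the easy part; the expense lies in assembling and paying for the training tokens. A hedged sketch using the OpenAI fine-tuning API might look like this (the training file and base model are placeholders, not something we have run):

```python
# Sketch of submitting a fine-tuning job with the OpenAI API.
# The training file and base model are placeholders; the hard (and expensive)
# part is assembling millions of tokens of minority-language training text.
from openai import OpenAI

client = OpenAI()

# Upload a JSONL file of chat-formatted training examples.
training_file = client.files.create(
    file=open("burmese_business_support.jsonl", "rb"),  # hypothetical corpus
    purpose="fine-tune",
)

# Kick off the fine-tuning job against a base chat model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)
```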


To fine-tune a model in a minority language, we would need hundreds of complete source materials in our desired context. Using a rough rule of thumb, a 50-page English book has about 32,000 tokens. Translating to Burmese at a rate of 12-to-1 means that each book would have roughly 384,000 tokens! If we used 100 books, that comes to 38.4 million tokens. At OpenAI's current fine-tuning price of $8 per million training tokens, that is already over $300 per training epoch - and since training typically runs for several epochs, and a corpus large enough to meaningfully shift the model would be far bigger, costs quickly climb into the thousands of dollars before we even begin to use the model for business support. Even then, usage rates for a fine-tuned model are far higher than standard ChatGPT costs. It has simply been too expensive for our organization to consider this approach.
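
Worked through in a few lines, the back-of-the-envelope numbers look like this (the epoch count and corpus multiplier at the end are illustrative assumptions):

```python
# Back-of-the-envelope fine-tuning cost estimate from the figures above.
tokens_per_english_book = 32_000         # ~50-page book
burmese_token_ratio = 12                 # Burmese tokenizes ~12x worse than English
books_in_corpus = 100
price_per_million_training_tokens = 8.0  # USD, fine-tuning rate cited above

corpus_tokens = tokens_per_english_book * burmese_token_ratio * books_in_corpus
single_pass_cost = corpus_tokens / 1_000_000 * price_per_million_training_tokens
print(f"{corpus_tokens:,} tokens -> ${single_pass_cost:,.2f} per training epoch")
# 38,400,000 tokens -> $307.20 per training epoch

# Multiple epochs and a larger corpus (both typical for a useful fine-tune)
# push the bill into the thousands of dollars before the model is ever used.
epochs, corpus_multiplier = 3, 3         # illustrative assumptions
print(f"~${single_pass_cost * epochs * corpus_multiplier:,.2f} for a more realistic run")
# ~$2,764.80 for a more realistic run
```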


(Image: tokenization comparison of the same phrase in English and Burmese script.)



But these models contain so much latent, untapped potential that we have continued to push forward and search for other solutions that could unlock their power for business owners in many different minority languages.


Our Approach

We take a multi-model approach, leveraging the robust knowledge base of GPT while wrapping it in a machine translation layer on the way in and out of the model. User messages are machine-translated from the local language into English before reaching GPT, and GPT's English output is machine-translated back into the local language before it is delivered to the user. Essentially, we are able to leverage the built-in safety features and vast knowledge base of GPT while still making the interaction viable in the user's own language.
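
A minimal sketch of that wrap-around translation layer, assuming the OpenAI Python SDK and the Google Cloud Translation API (the model name, language code, and system message are our illustrative choices, not our exact production configuration):

```python
# Sketch of the translate-in / translate-out wrapper around GPT.
# Assumes the OpenAI Python SDK and the Google Cloud Translation API;
# model choice and language code ("my" = Burmese) are illustrative.
from openai import OpenAI
from google.cloud import translate_v2 as translate

llm = OpenAI()
translator = translate.Client()

def ask_in_local_language(user_text, lang="my"):
    # 1. Machine-translate the user's message into English.
    english_in = translator.translate(
        user_text, source_language=lang, target_language="en"
    )["translatedText"]

    # 2. Let GPT reason and answer in English, where it is strongest.
    reply = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "You are a business coach for micro-entrepreneurs."},
            {"role": "user", "content": english_in},
        ],
    )
    english_out = reply.choices[0].message.content

    # 3. Translate the English answer back into the user's language.
    return translator.translate(
        english_out, source_language="en", target_language=lang
    )["translatedText"]
```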


To further streamline and improve the process, we've also integrated simple prompt engineering techniques, instructing GPT to "avoid idioms and jargon" and "use simple language". This steers generated replies toward easier-to-translate phrasing and improves the experience and the language the end user receives.
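
In practice that steering lives in the system prompt. Something along these lines could replace the generic system message in the wrapper above (the exact wording here is illustrative, not our production prompt):

```python
# Illustrative system prompt that keeps GPT output easy to machine-translate.
SYSTEM_PROMPT = (
    "You are a business coach for micro-entrepreneurs. "
    "Avoid idioms, slang, and jargon. "
    "Use simple language and short sentences. "
    "Prefer concrete action steps over figurative explanations."
)
```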


Challenges

Despite our encouraging successes, challenges remain. Unconstrained user inputs can and often do allow local idioms to flow into the input side of our pipeline. However, we've found the process to be largely self-correcting over time when this occurs, and we build education and onboarding steps to help users understand how to get better results.


To give a specific example, one early test user was working through a lesson on the topic of "Business Value Proposition" for her "Cosmetics and Money Changing" business in the Burmese language. During the conversation, she used a slang word that changed her business description to "Cosmetics Black Money Changing." GPT was confused by the English terminology it received and began to caution against running a "black market" business. But as the conversation continued over the next three or four messages, the AI engine gradually picked up the correct context, dropped all references to the black market, and got back on track.


Translation quality also varies widely across language groups: the performance of Latin-script languages typically far exceeds that of non-Latin scripts. Anecdotally, we have found that Mizo, a language spoken by fewer than a million people worldwide, performs better than Burmese because of this idiosyncrasy. It's counterintuitive that a rarer language would translate better, but results have borne that out time and again. However, as LLMs continue to progress, open-source models like SeaLLM are helping to level this playing field, and we're excited to research and deploy these integrations when we can do so in safe and confident ways.


Finally, we return to the challenge of cost. As they say in economics, "there is no free lunch." While our research has shown incredible potential for the multi-model GPT + translation approach, the cost can still be fairly steep because of translation API fees. For example, when we run tests with GPT + Google Translate, approximately 90% of the total cost comes from translation, not from GPT. The approach solves the minority-language hallucination and content problems present in base GPT, but much of the tokenization cost has been transferred rather than eliminated. We can reduce this exposure through further prompt engineering and by shortening the output that is passed to the translation service. Additionally, we aim to build our own automated translation to eliminate the cost entirely, but that will only be possible once we have gathered a substantial set of training data.
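
To see why translation dominates, compare per-character translation pricing with per-token LLM pricing for a typical exchange. The message sizes and rates below are illustrative assumptions rather than quoted prices, but the resulting split lands close to the roughly 90% share we observe:

```python
# Rough per-message cost split for the GPT + translation pipeline.
# All rates and message sizes below are illustrative assumptions, not quotes.
chars_translated = 1_900              # user message in + GPT reply out, in characters
translation_rate = 20.0 / 1_000_000   # USD per character (typical MT API list price)

gpt_input_tokens, gpt_output_tokens = 300, 400
gpt_input_rate = 2.5 / 1_000_000      # USD per input token (illustrative)
gpt_output_rate = 10.0 / 1_000_000    # USD per output token (illustrative)

translation_cost = chars_translated * translation_rate
gpt_cost = gpt_input_tokens * gpt_input_rate + gpt_output_tokens * gpt_output_rate

share = translation_cost / (translation_cost + gpt_cost)
print(f"translation ${translation_cost:.4f}, GPT ${gpt_cost:.4f}, "
      f"translation share {share:.0%}")
# translation $0.0380, GPT $0.0048, translation share 89%
```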


Conclusion

Overall, we have found that while results are imperfect and can lead to some linguistic and pricing challenges, the personalization and automation benefits almost always far surpass the limitations. Users are more than willing to deal with some imperfect language or an occasionally confusing message if they can receive personalized lessons, action steps, and interpretations of their business data. These imperfections give us opportunities to continue to improve, but they have also shown practical routes to begin using this technology today to serve business owners across the world.


In this age of rapid technological advancement, we believe that no group should be left behind, and the best way to identify inclusive processes for these technologies is to find ways for people of many languages to actively use them. With each challenge, we see opportunity. Together, let's continue to break down those language barriers and empower our world. If you would like to learn more about our journey and experiments, join our community for all the latest updates.

