GPT-4's Hidden Cost: Is Your Language Pricing You Out of AI Innovation?
The remarkable abilities of GPT-4 have sparked a surge of businesses incorporating its APIs into their own products. From virtual assistants and chatbots to content generators and natural language processing tools, GPT-4 has become a flexible cornerstone for innovation. For many businesses, however, the dream of prospering with GPT-4 technology is dimmed by the reality of escalating costs.
Consider AI Dungeon, a widely enjoyed text-based role-playing game originally powered by OpenAI's earlier language model, GPT-3. As its user base expanded, so did the company's expenditures, approaching $200,000 per month in 2021. To reduce costs, Latitude, the company behind AI Dungeon, switched to AI21 Labs' more cost-effective language model, bringing monthly expenses down to under $100,000. This transition highlights the hurdles smaller enterprises face when building their businesses around OpenAI's APIs, especially if they opt for the latest GPT-4, which can cost anywhere from 10 to 100 times more than GPT-3.
The pricing model for GPT services is based on the size of the input prompt and of the generated response. Interestingly, both are measured not in characters, as one might assume, but in tokens. In the case of GPT-4, the cost is $0.03 per 1,000 input tokens and $0.06 per 1,000 output tokens. But what exactly are tokens?
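To make the arithmetic concrete, here is a minimal Python sketch of the billing formula. The gpt4_cost helper is hypothetical (it is not part of any OpenAI SDK), and the rates are simply the figures quoted above:

```python
# Minimal sketch: estimate the cost of a single GPT-4 call from token counts.
# Rates are the article's figures: $0.03 per 1K input and $0.06 per 1K output tokens.
def gpt4_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1000) * 0.03 + (output_tokens / 1000) * 0.06

# For example, a 1,500-token prompt that produces a 500-token reply:
print(f"${gpt4_cost(1500, 500):.3f}")  # $0.075
```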
Tokens are the elementary units of text that GPT operates on, typically fragments of words. The input text is split into tokens, and GPT generates a response by appending new tokens to the input. GPT-4 uses a vocabulary of roughly 100,000 tokens known as CL100K. Some tokens correspond to entire words, typically English ones: the words “formula”, “interior”, and “council”, for example, are each a single token. Other tokens represent a few letters or just one, and for non-Latin scripts a single character may even be split across several tokens.
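You can verify this yourself with OpenAI's open-source tiktoken library, which ships the CL100K vocabulary. A short sketch (the sample strings are my own; exact token counts depend on the vocabulary):

```python
import tiktoken

# Load the CL100K vocabulary used by GPT-4.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["formula", "interior", "council", "зима", "冬"]:
    token_ids = enc.encode(text)
    print(f"{text!r}: {len(token_ids)} token(s) {token_ids}")

# Common English words tend to encode as a single token, while a word in a
# non-Latin script is usually split into several byte-level tokens.
```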
The token-centric pricing framework inherently puts businesses working with non-English languages at a disadvantage, because the cost of processing a standard A4 page of text depends on the number of tokens it contains, which varies significantly with the language and alphabet used. Given that the CL100K vocabulary was built largely from English text, it is no surprise that languages using non-Latin characters or entirely different alphabets face additional overhead.
To shed light on this issue, I ran an experiment. I took a collection of widely translated Wikipedia articles and used the CL100K tokenizer to determine the average number of characters per token for each language. From that I could evaluate the cost overhead each language incurs when processing the same number of characters as an English text. You can find the results of this analysis in the accompanying chart.
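The measurement itself is straightforward to reproduce. Below is a minimal sketch, assuming tiktoken is installed and substituting two short sample sentences of my own for the full parallel Wikipedia articles used in the experiment:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Stand-ins for the parallel Wikipedia articles; the real experiment used
# full translated texts, so these numbers are only illustrative.
samples = {
    "English": "Winter is the coldest season of the year in polar and temperate climates.",
    "Russian": "Зима является самым холодным временем года в полярном и умеренном климате.",
}

def chars_per_token(text: str) -> float:
    return len(text) / len(enc.encode(text))

baseline = chars_per_token(samples["English"])
for language, text in samples.items():
    cpt = chars_per_token(text)
    # For texts of equal character length, cost scales with token count, so the
    # overhead versus English is (English chars/token) / (this language's) minus 1.
    overhead = baseline / cpt - 1
    print(f"{language}: {cpt:.2f} chars/token, {overhead:+.0%} cost overhead")
```

Averaged over the full article set, this chars-per-token ratio is what drives the per-language overheads discussed below.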
The disparity in overhead costs for different languages is quite noteworthy. When examining French and Spanish texts, we see the most modest cost increases, hovering around 30-31%. However, the situation worsens for speakers of other languages. Turkish users, for instance, are faced with a substantial 67% penalty.
As we move to the Slavic languages, the disparities become even more pronounced. Croatian fares best, albeit with a 72% cost increase, while Polish users are burdened with an 82% hike. For Cyrillic-based languages such as Russian, Ukrainian, and Bulgarian, costs are more than double those of English.
Yet even more astonishing cost increases await those working in Hebrew, Hindi, Korean, or Japanese. Businesses operating in these languages will find themselves charged a staggering 3.2 to 4.2 times more than their English-text counterparts.
And finally, we arrive at Armenian and Burmese, where the cost differential is simply prohibitive. The expense for these languages is an astounding seven times greater than for English.
Considering these findings, we must recognize GPT-4's hidden cost for processing non-English texts. This significant disparity risks creating barriers to entry for smaller firms, particularly those using non-Latin-based languages. As AI technologies, like GPT-4, become essential in our interconnected world, it is vital to ensure their accessibility and affordability for businesses across linguistic backgrounds, fostering more equitable AI-driven innovation in global markets.
To address this issue, OpenAI could consider moving from tokens to a more intuitive character-based pricing model, which would offer the added benefit of greater cost predictability. Today, predicting GPT-4 usage costs is difficult, since tokenization is not as straightforward as counting characters. Furthermore, OpenAI and other AI developers ought to prioritize diversifying their training data to ensure adequate representation of non-English languages. This would not only improve model performance in non-English contexts but also lower token costs for businesses operating in those languages.
As GPT-4 and similar technologies continue to revolutionize industries worldwide, it is crucial that we remain vigilant in identifying and addressing potential inequalities in AI adoption. By promoting inclusivity and fairness in AI technologies, we can ensure that businesses worldwide can reap the benefits of AI-driven innovation without being weighed down by a hidden price tag.