The End of Abundance? The Unsustainable Hunger of Large Language Models

05.04.2024 | Christian Kreutz
Palm Leaf (photo by the author)

Artificial intelligence models are like insatiable giants, constantly needing to grow in order to produce better results. This is particularly true for large language models (LLMs) such as ChatGPT, Claude, or Llama. While not all AI models need to be large, these language models require billions of lines of text to function effectively. OpenAI has not disclosed the exact source material used for its ChatGPT models, but experts assume it includes a vast amount of publicly available information from the internet as well as books and news articles. This has already led to legal disputes, such as The New York Times suing OpenAI. The larger and more diverse the text input, the stronger the model becomes. This is why ChatGPT excels at broad topics for which extensive content is available, but may struggle with more specialized subjects. Moreover, much of human knowledge is not easily quantifiable or recorded at all, a fact often overlooked amidst the hype surrounding large language models.

One quickly runs into the limits of LLMs in areas where they lack information. To address this challenge, a brute-force strategy is being employed: collect even more data and use ever larger computing resources to build bigger models. However, both approaches are increasingly constrained by scarce resources.

Hitting the Data Ceiling

For years we have been told about the rapid growth of data, yet now experts are saying that there may not be enough of it for artificial intelligence models. Why is that? There are several explanations for why AI models might soon face a shortage of data (or of energy, a concern Sam Altman recently raised). In a thought-provoking study, "Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning," the authors conclude:

“Our analysis indicates that the stock of high-quality language data will be exhausted soon; likely before 2026.”
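To get a feel for that kind of projection, here is a deliberately crude back-of-the-envelope sketch. The numbers are illustrative assumptions of mine, not figures from the study: a fixed stock of high-quality text is set against a training-dataset size that keeps growing every year.

```python
# Toy projection: when does a growing appetite for training data
# exhaust a fixed stock of high-quality text?
# All numbers are illustrative assumptions, not figures from the study.

STOCK_TOKENS = 9e12        # assumed stock of high-quality text, in tokens
dataset_tokens = 2e12      # assumed size of a frontier training set today
GROWTH_PER_YEAR = 1.5      # assumed 50% growth in dataset size per year

year = 2024
while dataset_tokens < STOCK_TOKENS:
    dataset_tokens *= GROWTH_PER_YEAR
    year += 1

print(f"Under these assumptions the stock is exhausted around {year}.")
```

Change the assumptions and the date moves, but the basic dynamic stays the same: exponentially growing demand running into a roughly fixed stock.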

Despite the abundance of data in our modern era, much of it is not relevant for language models. Data from sensors, for instance, does nothing to improve them. For a language model to perform well, it needs high-quality texts. To illustrate the point, compare all scientific literature ever written with the daily gossip shared on Reddit. Reddit may be entertaining, but it is often absurd and subversive. Yet even this type of content has now been sold for use in AI models. There may be valuable nuggets of information buried in that sea of conversations, but the push to incorporate it into LLMs mostly highlights how desperate AI companies are for data.

AI companies also face the obstacle of accessing content on the deep web, which is hidden from the public behind logins or inside organizations' internal systems. This portion of the internet is likely hundreds of times larger than the publicly accessible World Wide Web.

As the logical next step in gathering data, AI companies have expanded their search to audio, video, and images from all over the world, using deep learning models to make sense of this visual and spoken content. One might assume this would keep the models occupied for quite a while, but their insatiable appetite will most likely only be sated temporarily. Low-quality content is abundant and readily accessible, yet it does not improve the models nearly as much as high-quality expert knowledge.

There are also limits to what can be conveyed through words at all. Even if you feed an AI model thousands of images of a jungle, it will still lack an understanding of something as intricate as biodiversity.

Quality vs. Quantity

Another concern is that the widespread use of large language models may lower the overall quality of content. These models tend to produce generalized, uniform output, simply rearranging information from their knowledge base. We can see this happening with AI-generated images, which often repeat similar effects; Sabine Hossenfelder has made an informative video on this topic. The internet is being flooded with content created by large language models, but feeding this content back into the models will not improve them; if anything, it degrades them.
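A toy illustration of why this feedback loop is harmful, a sketch of the phenomenon researchers call "model collapse" rather than of how any real LLM is trained: imagine a tiny "language model" that is repeatedly re-fitted to text sampled from its own previous generation. Rare words disappear and never come back.

```python
import numpy as np

# Toy illustration of "model collapse": a tiny categorical "language model"
# is repeatedly re-fitted to text sampled from its own previous generation.
# Rare words that happen not to be sampled vanish for good, so the
# vocabulary shrinks over the generations.

rng = np.random.default_rng(0)
vocab_size, sample_size = 50, 200

# Start from a Zipf-like word distribution, standing in for human text.
probs = 1.0 / np.arange(1, vocab_size + 1)
probs /= probs.sum()

for generation in range(1, 11):
    words = rng.choice(vocab_size, size=sample_size, p=probs)  # "generate text"
    counts = np.bincount(words, minlength=vocab_size)
    probs = counts / counts.sum()                              # re-fit the model
    print(f"generation {generation}: {np.count_nonzero(probs)} of {vocab_size} words survive")
```

The mechanics of a real LLM are vastly more complex, but the tendency is the same: the tails of the distribution, the rare and original material, are the first thing to go.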

The success of these models also relies heavily on continuous human feedback. ChatGPT's capabilities are not solely due to its algorithms: crowdsourced workers have rated and ranked its answers over many iterations so that it produces better outcomes. Through this constant exposure to human judgment the model keeps improving; without it, ChatGPT would likely show far more flaws and errors.
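A minimal sketch of how such feedback can be turned into a training signal, loosely inspired by the reward modelling used in reinforcement learning from human feedback. This is not OpenAI's actual code; the features and numbers are invented for illustration.

```python
import numpy as np

# A human prefers answer A over answer B; a toy "reward model" is nudged
# to score A above B via the pairwise (Bradley-Terry) loss
#   loss = -log(sigmoid(reward(A) - reward(B))).

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
weights = rng.normal(size=4)          # toy reward model: a linear score

def reward(features):
    return features @ weights

# One human comparison: feature vectors of the preferred and rejected
# answers (purely illustrative numbers).
preferred = np.array([0.9, 0.2, 0.5, 0.1])
rejected  = np.array([0.1, 0.8, 0.3, 0.7])

learning_rate = 0.5
for step in range(100):
    margin = reward(preferred) - reward(rejected)
    # Gradient of -log(sigmoid(margin)) with respect to the weights.
    grad = -(1.0 - sigmoid(margin)) * (preferred - rejected)
    weights -= learning_rate * grad

print(f"preferred answer now scores {reward(preferred):.2f} "
      f"vs {reward(rejected):.2f} for the rejected one")
```

Multiply such comparisons by the millions and you get a sense of how much hidden human labor sits behind a polished chatbot.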

Who could have predicted that AI models would encounter data limitations so soon? And that's not the only obstacle to overcome; computing power is also a major challenge.

Computational Constraints

In 2022, the research organization Epoch AI estimated that cutting-edge models take between six and ten months to train because of the ever-increasing need for computing power. GPT-4 likely took several months and an amount of computing power that only a select few companies can afford, which is also why these models always lag slightly behind and cannot be updated daily. Ironically, LLMs face challenges similar to conventional resource extraction in the physical world. The raw materials needed to build the necessary computing hardware are becoming scarce, and in the same way there is not enough data to feed the machines that are supposed to make sense of the complexities of our world.
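To see why training takes this long, here is a rough sketch using the common approximation that training compute is about six times the parameter count times the number of training tokens. Every number in it is an assumption of mine, not a disclosed figure for any real model.

```python
# Back-of-envelope sketch of why frontier training runs take months.
# Uses the common approximation C ≈ 6 * N * D (training FLOPs ≈ 6 ×
# parameters × training tokens). All numbers are illustrative assumptions.

params = 5e11            # assumed parameter count (500 billion)
tokens = 10e12           # assumed training tokens (10 trillion)
flops_needed = 6 * params * tokens

gpus = 10_000            # assumed cluster size
flops_per_gpu = 3e14     # assumed peak FLOP/s per accelerator
utilization = 0.4        # assumed fraction of peak actually achieved

seconds = flops_needed / (gpus * flops_per_gpu * utilization)
print(f"roughly {seconds / (30 * 24 * 3600):.1f} months of training")
```

Under these assumptions the run lands in the same six-to-ten-month range Epoch AI describes, and only a handful of companies can keep ten thousand accelerators busy for that long.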