
Enlarging models brings less and less; shrinking them works better and better

You only need to use ChatGPT and its peers for a while to appreciate both the capabilities of these models and their limits, starting with their fabrications. For many industry leaders, these teething problems would be solved by increasing the size of the models and the volume of data they are trained on. Better still, this ever-increasing scale would sooner or later make it possible to reach the Holy Grail of general AI.

This logic, or rather this empirical law, which has justified quarter after quarter of investment in hyperscalers’ computing capacity, is convincing fewer and fewer people, and even its most ardent followers are beginning to doubt. GPT-5, the next “frontier model” from OpenAI announced for the end of the year, would not show the same progress as its predecessors, reports The Information. The expected performance gains would not be there and, despite the increase in parameters (17 trillion according to some sources), the models keep producing hallucinations, reasoning errors and other incongruous responses. Ilya Sutskever, Sam Altman’s former sidekick at OpenAI and now head of his own company, told Reuters that the gains from scaling up pre-training have plateaued.

“Each new version of a major model appears to provide only marginal gains over previous versions, indicating that suppliers are seeing diminishing returns. For example, while GPT-4 shows improvements over GPT-3.5 in reasoning and specialized tasks, the gains are not proportional to the massive increase in computing resources required,” Erik Schwartz, director of AI at Tricon Infotech, told The Stack.

The fact that OpenAI and others are running into, and acknowledging, the limits of the scale-focused approach delights the experts who already doubted it, such as Gary Marcus and, more recently, Yann LeCun. “The high valuations of companies like OpenAI and Microsoft are largely based on the idea that LLMs will, as they grow, become artificial general intelligences. As I have always said, that is just a fantasy. There is no principled solution to hallucinations in systems that rely on language statistics without explicit representation of facts and without explicit tools for reasoning about those facts,” writes Gary Marcus on his blog.

Graph created by Gary Marcus showing the evolution of model performance (measured on the MMLU benchmark).

Inference improvements and limited uses

This is not the end of large language models, of course. Current models have demonstrated astonishing conversational and content-generation capabilities, and their adoption will continue wherever their limits are not a problem: a chatbot for tourist recommendations, yes; one for automated medical diagnosis, no. To move toward more reliable and more general AI, however, it will undoubtedly be necessary to develop new complementary techniques, such as Marcus’s neuro-symbolic AI or LeCun’s planning-based AI.

Furthermore, other techniques applied not to training but to inference, that is, at the time of interaction with existing models, make it possible to reduce confabulation problems, without eliminating them entirely. Feeding responses with controlled information (retrieval-augmented generation, RAG) and having the model lay out its reasoning step by step in the background (chain-of-thought, used in OpenAI’s o1) are among the methods already exploited to improve the results of existing models, as the sketch below illustrates. To these can be added the chaining of specialized agents and the development of large specialized models trained on corpora of controlled content.
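As an illustration of the RAG pattern mentioned above, here is a minimal sketch: retrieved passages are injected into the prompt so the model answers from controlled sources rather than from its parametric memory alone. The naive keyword retriever, the document store and the `ask_llm` call are hypothetical placeholders, not any specific vendor API.

```python
# Minimal RAG sketch: ground the answer in retrieved passages.
# `retrieve` uses a naive keyword overlap; a real system would use a vector store.

def retrieve(question: str, store: dict[str, str], k: int = 3) -> list[str]:
    """Rank documents by the number of words they share with the question."""
    words = set(question.lower().split())
    scored = sorted(store.items(),
                    key=lambda kv: len(words & set(kv[1].lower().split())),
                    reverse=True)
    return [text for _, text in scored[:k]]

def build_prompt(question: str, passages: list[str]) -> str:
    """Constrain the model to the supplied context to limit confabulation."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using ONLY the context below. "
        "If the answer is not in the context, say you do not know.\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

# Usage (ask_llm would wrap whatever model is actually deployed):
# passages = retrieve("What is the refund policy?", documents)
# answer = ask_llm(build_prompt("What is the refund policy?", passages))
```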

The shift in computational load from pre-training to inference heralds a shift in how capacity is distributed, with fewer mega-datacenters concentrating all the GPUs and more power spread across the four corners of the globe, close to users, analyzes Sonya Huang, partner at Sequoia Capital. Microsoft has been reassuring its shareholders for several months by promising that its expensive training infrastructure can be redeployed for inference.

Small models compete with their elders

While the performance of large models tends to converge and improve only marginally, small models are progressing at high speed. They also turn out to be cheaper and less energy-intensive, for both training and inference. Kai-Fu Lee recently presented an ultra-efficient model developed by his company 01.ai (6th on the LMSYS benchmark, according to the company), trained for only $3 million and with an inference cost of 10 cents per million tokens. By comparison, training GPT-4 reportedly cost nearly $100 million (and GPT-5 around a billion), and its tokens are billed at between $10 and $60 per million.

While 01.ai has mainly worked on optimizing inference, other methods are in vogue for reducing the size of models. One technique is the targeted removal of certain layers from a large model, with negligible impact on performance. Using this process, Nvidia has produced models whose performance is comparable to Mistral 7B and Llama-3 8B while using up to 40 times fewer training tokens, according to the State of AI report.
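The layer-removal idea can be sketched in a few lines of PyTorch, assuming a decoder whose transformer blocks live in an `nn.ModuleList`; the attribute names are illustrative, not those of Nvidia’s actual pipeline.

```python
import torch.nn as nn

def drop_layers(blocks: nn.ModuleList, indices_to_drop: set[int]) -> nn.ModuleList:
    """Return a new stack of transformer blocks with the selected layers removed.

    In practice the layers to drop are chosen by measuring how little each one
    changes the hidden states (or the end-task score), and the pruned model is
    then briefly re-trained to recover any lost accuracy.
    """
    kept = [block for i, block in enumerate(blocks) if i not in indices_to_drop]
    return nn.ModuleList(kept)

# Example: remove the top 8 of 32 blocks from a hypothetical `model.layers` stack.
# model.layers = drop_layers(model.layers, {24, 25, 26, 27, 28, 29, 30, 31})
```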

Another technique, distillation, involves using large models to produce refined data that is then used to train smaller, efficient models. Google’s Gemma 2 9B model, for example, was distilled from its big brother Gemma 2 27B.
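Distillation can take several forms; one classic variant trains the small model to match the teacher’s softened output distribution (the temperature-scaled loss of Hinton et al.). The sketch below shows that variant as an illustration, not as Google’s exact recipe for Gemma 2.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Blend standard cross-entropy on the labels with a KL term that pulls the
    student toward the teacher's softened token distribution."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```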

The ambition to run GenAI models on smartphones is driving the development of these less demanding models. Apple, for instance, is building smaller models to power its Apple Intelligence service. For its part, Microsoft designed phi-3.5-mini, a 3.8-billion-parameter model that rivals Llama 3.1 8B. Thanks to 4-bit quantization, a technique that reduces the precision with which weights and activations are represented, the model uses only 1.8 GB of memory, allowing inference to run on a mobile device.
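The memory figure follows directly from the arithmetic: 3.8 billion weights at 4 bits each is roughly 1.9 GB before packing overhead and activations. The toy sketch below illustrates the principle of symmetric 4-bit quantization; real deployments use more sophisticated schemes (weight grouping, outlier handling) and kernels that pack two 4-bit values per byte.

```python
import numpy as np

# Back-of-the-envelope footprint: 3.8e9 parameters * 4 bits / 8 bits per byte
print(3.8e9 * 4 / 8 / 1e9, "GB")  # ~1.9 GB, close to the 1.8 GB quoted for phi-3.5-mini

def quantize_int4(weights: np.ndarray):
    """Toy symmetric 4-bit quantization: map floats to integers in [-8, 7]."""
    scale = np.abs(weights).max() / 7.0
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)  # stored in int8 here for simplicity
    return q, scale

def dequantize_int4(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for computation."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int4(w)
print(np.abs(w - dequantize_int4(q, s)).max())  # small reconstruction error
```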

Equal performance at lower cost

At equivalent performance, it is therefore increasingly possible to use smaller, less energy-intensive models. It is also possible to do so at lower cost. According to an analysis by Guido Appenzeller, investor at Andreessen Horowitz, the cost of an LLM delivering a given level of capability has fallen dramatically over the last two years. When it launched at the end of 2021, GPT-3 cost $60 per million tokens for a score of 42 on the MMLU benchmark (an established test covering around fifty disciplines). Today, the same score can be achieved with Llama 3.2 3B, sold at 6 cents per million tokens by Together.ai. For users, the price has been divided by 1,000 in three years.
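The factor of 1,000 is simply the ratio of the two prices quoted in Appenzeller’s analysis, as a quick check shows:

```python
# Price per million tokens for the same ~42 MMLU score
gpt3_2021 = 60.00        # dollars, GPT-3 at launch (late 2021)
llama32_3b_2024 = 0.06   # dollars, Llama 3.2 3B at Together.ai
print(gpt3_2021 / llama32_3b_2024)  # 1000.0 -> a ~1000x drop in three years
```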

Guido Appenzeller’s logarithmic graph shows that, as a trend, the price of tokens for a given performance (here 42 and 83 on the MMLU benchmark) has been divided by 10 each year (dotted line).
