This would be a problem with far greater consequences for the development of generative artificial intelligences such as ChatGPT, because until now we have been able to “train” them on astronomical, continuously growing quantities of information available on the Internet. It is thanks to this data that ChatGPT and similar applications can produce text or images at phenomenal speed: these generative AIs can “predict” an answer to a question, the next word in a sequence, or a pixel, from the mass of data made available to them.
But even the Internet is not infinite, and for two years now experts have been predicting that we would reach its limits fairly quickly. The stock of “high quality content” usable to power these applications will be exhausted by 2026, according to a rough assessment carried out in October 2022 by a group of researchers from the firm EpochAI. The same group published an estimate in 2024 placing this limit in 2028, but the difference lies in how “high quality content” is defined.
In their latest estimate, the size of the datasets used to train AI is growing by 100% per year, while the amount of “usable” content on the Internet is growing by only 10% per year, hence an imminent “bottleneck”.
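To make the arithmetic behind this bottleneck concrete, here is a minimal sketch in Python. The two growth rates (100% and 10% per year) come from the estimate cited above; the function name `years_until_bottleneck` and the starting ratio of 30 are purely hypothetical, chosen for illustration, and are not figures from the researchers.

```python
import math


def years_until_bottleneck(dataset_size, stock_size,
                           dataset_growth=1.00, stock_growth=0.10):
    """Years until training datasets catch up with the usable stock.

    Both quantities are assumed to grow at a constant compound rate:
    dataset_growth=1.00 means +100% per year, stock_growth=0.10 means +10%.
    """
    if dataset_size >= stock_size:
        return 0.0  # the datasets have already caught up
    # Solve dataset * (1 + g_d)^t = stock * (1 + g_s)^t for t.
    gap = stock_size / dataset_size
    yearly_catch_up = (1 + dataset_growth) / (1 + stock_growth)
    return math.log(gap) / math.log(yearly_catch_up)


# Hypothetical starting point: the usable stock is 30 times larger
# than today's training datasets (an assumption, not a measured value).
years = years_until_bottleneck(dataset_size=1.0, stock_size=30.0)
print(f"Datasets catch up with the stock after about {years:.1f} years")
```

Under these assumed rates, even a very large initial lead for the stock of usable content shrinks quickly, since the gap is divided by roughly 1.8 every year.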
One of the most troubling solutions would be to keep “training” generative AIs, but on content produced… by other generative AIs. Yet if we already lack ways to limit the production of false information by these AIs, we enter even murkier territory if AIs must feed on content created by AIs.
Royalties
We may even have already begun to enter this bottleneck, researcher Shayne Longpre commented in December in the journal Nature. At the Massachusetts Institute of Technology, he leads a project to evaluate the quality of the data used to power AI.
At the same time, some owners of the data available on the Internet, such as newspaper publishers, have started demanding that AI companies pay them royalties for the use of their content, on the grounds that this content is expected to bring exponentially growing revenue to IT companies. The New York Times was the first, in December 2023, to sue OpenAI and its partner Microsoft; in April 2024, eight American newspapers filed a joint lawsuit.
The authors of the 2022 assessment were also careful to single out “quality” content, which included books, journalistic reporting, scientific research and computer code, all of which have in common that they are expensive to produce, unlike AI-generated text.
There is also lower-quality material, ranging from YouTube videos to texts of all kinds circulating on Facebook forums, WhatsApp messages, and so on. The total quantity of this other body of information is difficult to measure, but the designers of an AI that the public is supposed to rely on for reliable information might hesitate to let it be swamped by lower-quality content.