If chatting with a chatbot like ChatGPT feels like holding a conversation as elaborate as one with a human, it is because artificial intelligence (AI) has made enormous progress. That progress has been made possible by the development of large language models (LLMs), neural networks trained on gigantic datasets.
However, “LLM developers are going to run short of conventional data to train their models”, Nature reports in a long article. They have already exploited almost all of the freely available information on the Internet, and the ever-growing LLMs are becoming more and more voracious.
According to researchers at Epoch AI, a research institute focused on trends and big questions in AI, by 2028 the dataset needed to train a model will be roughly as large as the estimated total stock of public text online. In other words, in about four years there won’t be enough textual data to go around. Does that mean the end of progress in AI?
Not necessarily. According to the British journal, companies in the sector do not seem to be panicking about this looming limit. OpenAI and Anthropic have publicly acknowledged the problem while suggesting they have plans to work around it, such as generating synthetic data with AI or striking partnerships to collect data that isn’t public.
Private data or replay
That could mean using WhatsApp messages or transcripts of YouTube videos, for example. Beyond the legal questions this solution raises, many companies say they prefer to keep their data to themselves so they can train their own AI models internally. Others suggest that existing LLMs could learn new things by “rereading” the data they have already been trained on.
[…] Read more at Courrier international