After ingesting the Web, Artificial Intelligence wants to feed on books

Friday 22nd November 2024 06:43 AM

Leave the Internet for a bit and pick up a good book… In search of data to perfect their models, the AI giants seem to be turning to the world of books. The American publisher HarperCollins has just offered some of its authors a contract with an artificial intelligence (AI) company, whose identity is confidential, allowing that company to use their published works to train its generative AI model.

In a letter seen by AFP, the AI company offers $2,500 per selected book to train its language model over a period of three years. To be able to produce all kinds of content from a simple query in everyday language, generative AI models need to be fed an ever-increasing amount of data.

Not a first

The offer met with a mixed reception in the publishing sector, and writers like the American Daniel Kibblesmith sharply declined it: “I would probably do it for a billion dollars. I would do it for a sum of money that would no longer require me to work, since that is the final goal of this technology,” the author complained on the Bluesky social network. While HarperCollins is one of the largest publishers to date to strike this type of agreement, it is not the first. The American publisher of scientific books Wiley gave access to “the content of academic and professional books published for specific use in training models, for $23 million, to a large technology company,” it said in March this year, when presenting its financial results.

You will have to pay

Tech companies may have no choice but to dig deep into their pockets to improve their products, especially as they begin to run out of new material to feed their models.

The American press recently reported that the new models in development seem to be reaching their limits, notably at Google, Anthropic and OpenAI.

“On the Net, you collect licit and illicit content, and a lot of pirated content. This poses a legal problem. Not to mention the problem of data quality,” observes Julien Chouraqui, of the SNE, the French publishers' association. “If we are committed to developing a market on sound foundations, we must therefore involve all the players.”