To grow, generative AI turns to books

Friday 22nd November 2024 12:45 AM

Paris (awp/afp) – Protecting copyright by negotiating with platforms: faced with the ever-growing needs of generative artificial intelligence, the publishing world is also starting to sign contracts with AI companies to monetize its content.

The major American publisher HarperCollins has just offered some of its authors a contract with an artificial intelligence (AI) company, whose identity is confidential, allowing the latter to use their published works to train its generative artificial intelligence model.

In a letter seen by AFP, the AI company offers $2,500 per selected book to train its large language model (LLM) for a period of three years.

To produce all kinds of content from a simple query in everyday language, generative AI models need to be fed an ever-increasing amount of data.

When contacted, the publisher confirmed the deal. “HarperCollins has entered into an agreement with an artificial intelligence technology company to allow limited use of certain titles (…) to train AI models, in order to improve the quality and performance of the models,” it wrote.

The publishing house also explains that the agreement “clearly regulates the production of models that respect copyright.”

The offer met with a mixed reception in the publishing sector, however, and writers such as the American Daniel Kibblesmith flatly turned it down: “I would probably do it for a billion dollars. I would do it for a sum of money that would no longer require me to work, since that is the final goal of this technology,” the author said indignantly on the Bluesky social network.

New materials

While HarperCollins is one of the largest publishers to date to strike this type of agreement, it is not the first. The American publisher of scientific books Wiley gave access “to the content of academic and professional books published for specific use in training models, for $23 million, to a large technology company”, it indicated in March this year, when presenting its financial results.

This type of collaboration highlights the problems linked to the development of generative artificial intelligence, which is trained on immense quantities of data collected from the web, potentially in violation of copyright.

For Giada Pistilli, head of ethics at Hugging Face, a Franco-American open-access AI platform, this announcement is a step forward, since the content of the books is monetized. But she regrets that authors have limited room for negotiation.

“What we are going to see is a mechanism for bilateral agreements between new technology companies and publishers or copyright holders, whereas in my opinion we need a broader conversation to include a few more stakeholders,” she says.

“We are starting from a long way back,” comments Julien Chouraqui, the legal director of the French publishing union (SNE). “This is progress: the fact that there is an agreement means that there has been a dialogue and a desire to achieve a balance regarding the use of source data, which is subject to rights and which will generate value,” he says.

Faced with these questions, press publishers are also getting organized. At the end of 2023, the American daily The New York Times sued OpenAI, creator of the ChatGPT software, as well as Microsoft, its main investor, for copyright infringement. Other media outlets have entered into agreements with OpenAI.

Tech companies may have no choice but to dig into their pockets to improve their products, especially as they begin to run out of new material to feed their models.

The American press recently reported that the new models in development seem to have reached their limits, notably at Google, Anthropic and OpenAI.

“On the internet, you collect licit and illicit content, and a lot of pirated content. This poses a legal problem. Not to mention the problem of data quality,” observes Julien Chouraqui, of the SNE. “If we want a market to develop on a sound footing, it is therefore necessary to involve all the players.”

afp/ib