DayFR Euro

A gold mine for artificial intelligence companies

As artificial intelligence technologies become more widespread, a debate is emerging over the legitimacy of certain practices. The large-scale use of film and series subtitles to train AI systems raises questions about respect for copyright, and several large technology companies have been singled out.

Using subtitles to train language models

For several years, giants such as Apple, Meta, Nvidia and Salesforce have been using considerable volumes of data to refine their language models. This data includes subtitles from a multitude of films and television series. According to several sources, these lines of dialogue help the models imitate human speech far more convincingly than academic or journalistic texts alone.

This method relies in particular on the inclusion of subtitles in datasets such as The Pile, which draws on subtitles from OpenSubtitles.org. The platform offers subtitles for more than 53,000 films and 85,000 series episodes, providing a substantial library for AI developers.

The companies concerned and their approach

Several companies have been identified as using this method. This is particularly the case of Anthropic, which uses these subtitles for Claude, its direct competitor to ChatGPT. Meta and Apple are doing the same to develop their OPT and OpenELM model families respectively. Other players such as Nvidia, Bloomberg and EleutherAI are also drawing on this mass of data to strengthen their artificial intelligence capabilities.

All these companies seek to build systems capable of generating more natural, human-sounding dialogue. Good writing is a scarce resource in the world of AI training data, so the written dialogue found in subtitles provides a valuable window into the nuances and rhythms unique to spoken conversation.

A controversial use and a source of legal conflicts

However, this use is not without controversy. Several lawsuits have been filed against the companies involved, which are accused of using copyrighted works without permission. Screenwriters, authors and publishing houses allege violations of their rights. Breaking Bad creator Vince Gilligan called the practice “extraordinarily complex and energy-intensive plagiarism” before the U.S. Copyright Office.

In their defense, the companies claim that training AI systems on copyrighted works constitutes fair use, although this argument has yet to be definitively settled in court. From a copyright perspective, courts could consider subtitles to be derivative works, which would give them the same protection against unlawful copying and distribution as the works they come from.

The role of subtitles in AI performance

Why subtitles in particular? Because they represent a raw, essential form of written dialogue. Unlike textbooks or scientific articles, subtitles capture the flow, tone and cadence of human conversation. This allows chatbots trained on this data to produce speech that sounds naturally human.

The contribution of subtitles does not stop there. These text fragments are also used to adapt automatically generated responses to different contexts: television, cinema and even education. They help AI systems enrich their linguistic repertoire, going beyond the great literary classics to reflect the diversity of everyday spoken interaction.

The position of the original creators

This use of subtitles poses an ethical and legal dilemma. On the one hand, developers and researchers benefit from a wealth of free and accessible linguistic material. On the other hand, writers and creators see their work used without their knowledge and without appropriate remuneration.

This view is shared by the Writers' Guild of Great Britain (WGGB), which represents British authors and is demanding strict regulation and compensation for creators affected by the practice. Even some dataset creators, such as Jörg Tiedemann, have expressed concern that their original contribution, intended to be useful, is now being misused in ways that encroach on the creative market.
