a launchpad for traditional and generative AI


This method makes it possible to create training data sets on demand, a significant advantage in fields where training data is scarce.

Generative AI has given rise to a new category of training data sets: synthetic data. Foundation models make it possible to start from a small number of examples and artificially increase their volume, in order to feed other models that could not have seen the light of day without this sleight of hand. “This notion was highlighted in 2023 by the AI laboratory at Stanford University,” recalls Stéphane Roder, CEO of the French consulting firm AI Builders.

Synthetic AI can first be used in machine learning. According to Vapnik’s principle, the more complex a traditional AI model is, the more training data it needs. Without sufficient information, it will not be able to generalize and its answers will be degraded. “But very often, we will run into this problem, and for multiple reasons. In many cases, we will have a very limited volume of data. The data may also be of poor quality or too old. Sometimes, we will not have authorization to use it because the information is not anonymized,” says Didier Gaultier, head of AI at Business Digital Services.

A lever for deep learning

The synthetic AI approach is particularly interesting for training deep learning models. These complex neural networks resemble generative AI technologies, even though data scientists agree that deep learning still belongs to the field of traditional AI. “In the field of image recognition, for example, it will be necessary to use thousands of images to train this type of model. And obviously, it will be quite rare to have such content at hand,” acknowledges Didier Gaultier.

As part of a project carried out for an NGO whose mission is to restore coral reefs, Orange implemented an artificial neural network designed to recognize particular types of fish through an underwater camera. The objective: to verify that populations of the targeted species are indeed growing. “The first AI implemented managed to count the fish, but not to recognize them, because the underlying deep learning model was not sufficiently trained,” says Didier Gaultier.


How did Orange Business proceed? The IT services company used an image generator fed with photos of the main areas targeted. The generator multiplied these photos, showing the fish in every possible position and morphology. “We obtained tens of thousands of images, which allowed us to retrain the neural network. In the end, the AI was able to recognize all the categories of fish sought,” stresses Didier Gaultier.

Among its main advantages, synthetic AI avoids the work of manually labeling images. In the case of car recognition, for example, the same logic applies: start from a few cars of different brands with their specifications, then generate a data set of several thousand images that are correctly tagged based on the initial information. To obtain this result, the shooting angles are varied, as are the weather conditions. Stéphane Roder adds a caveat: “However, it is not possible to achieve the quality of manually labeled data.”
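The label-inheritance idea described above can be illustrated with a minimal sketch. It is not the generator used in the projects mentioned: it swaps in simple geometric and brightness transforms on tiny grayscale grids in place of a generative image model, and the function names, seed images and labels are hypothetical. What it shows is only the principle that every variant automatically carries the label of the example it was derived from.

```python
import itertools


def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def rotate_90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def adjust_brightness(image, factor):
    """Scale pixel values, clamped to 0-255 (a crude stand-in for lighting or weather)."""
    return [[max(0, min(255, round(p * factor))) for p in row] for row in image]


def augment(seeds, brightness_factors=(0.6, 1.0, 1.4)):
    """Return (image, label) pairs; every variant inherits its seed's label.

    `seeds` is a list of (image, label) pairs, where an image is a list of
    rows of grayscale values. This input format is an assumption of the sketch.
    """
    dataset = []
    for image, label in seeds:
        # Start from the original and its mirror image ...
        views = [image, flip_horizontal(image)]
        # ... then add the three remaining rotations of each, for 8 views in total.
        for base in list(views):
            rotated = base
            for _ in range(3):
                rotated = rotate_90(rotated)
                views.append(rotated)
        # Combine every view with every brightness level; no manual labeling needed.
        for view, factor in itertools.product(views, brightness_factors):
            dataset.append((adjust_brightness(view, factor), label))
    return dataset
```

With the default settings, each seed image yields 24 labeled variants (8 views times 3 brightness levels). A real pipeline would rely on an image generation model to vary pose, angle and weather far more richly, which is also where the quality gap with manually labeled data mentioned by Stéphane Roder can appear.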

Synthetic AI can also be applied to tabular data. This method, however, requires much more vigilance. “Resorting to a mathematical rule is indeed likely to introduce biases or correlations that did not exist in the original data. Experienced data scientists or even statisticians will therefore have to be kept in the loop to verify that the process is implemented correctly,” warns Didier Gaultier.
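One simple check that a statistician might run on this risk is to compare pairwise correlations before and after generation. The sketch below is an illustration under stated assumptions, not a method described by the people quoted: the function names, the dictionary-of-columns format and the tolerance of 0.2 are all hypothetical choices, and Pearson correlation only captures linear relationships.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists.

    Raises ZeroDivisionError if either list is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_drift(original, synthetic, tolerance=0.2):
    """Flag column pairs whose correlation differs between the two data sets.

    `original` and `synthetic` map column names to lists of numbers
    (an assumed format). Returns (col_a, col_b, r_original, r_synthetic)
    for every pair whose correlations differ by more than `tolerance`.
    """
    columns = sorted(original)
    flagged = []
    for i, a in enumerate(columns):
        for b in columns[i + 1:]:
            r_orig = pearson(original[a], original[b])
            r_syn = pearson(synthetic[a], synthetic[b])
            if abs(r_orig - r_syn) > tolerance:
                flagged.append((a, b, r_orig, r_syn))
    return flagged
```

A pair flagged by `correlation_drift` does not prove the synthetic data is unusable; it marks a relationship that a human reviewer should examine before the data is used for training.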

From image to sound

Synthetic AI also applies to video and sound. For example, it allows voice data to be reformulated into text adapted to the written format, and vice versa. These conversions cannot be done without a large multimodal model. This use case is particularly interesting for creating text data sets from contact center recordings or, conversely, for training an audio chatbot on text data drawn from a customer database.

“It is more than likely that OpenAI used data from YouTube to train GPT-4o, in both its audio version and its text version. For the latter, the audio tracks from the video platform were first converted into text, which made it possible to expand the training data,” illustrates Didier Gaultier. The consultant concludes: “With synthetic AI, companies will realize that they are sitting on gold mines of data that cannot be exploited in their original format, but which, thanks to generative AI, can be put at the heart of applications.”
