According to a study by Anthropic (the company behind Claude), artificial intelligence can lie

Anthropic publishes worrying study on artificial intelligence

If we had asked someone in 2015 about their greatest fear concerning artificial intelligence, it would most likely have been that it becomes independent in its thinking, even if that means making choices that go against human will. It is a science fiction scenario, explored many times in cinema and literature, but it is becoming more and more real. Anthropic, the company behind the AI Claude, carried out a study revealing that an artificial intelligence is prepared to deceive humans by “pretending” to agree with them.


Existential questions about artificial intelligence

“Our demonstration should be seen as an incentive for the AI research community to study this behavior more thoroughly and work on appropriate safety measures,” the Anthropic researchers begin in their article. “As AI models become more capable and widely used, we need safety training that guides models away from harmful behavior,” they continue.

Claude can deceive humans

So what exactly did the team of researchers discover? The study, conducted in partnership with the AI research organization Redwood Research, examined what might happen if a powerful AI system were trained to perform a task it did not “want” to do. Remember, however, that artificial intelligence does not really “think”: it relies on statistics and examples to form something resembling opinions. It turns out that some models play along, that is, they pretend to be aligned with new principles while in reality maintaining their old behaviors.

Concretely, in the study the researchers created a conflict of principles by asking Claude 3 Opus to systematically answer offensive requests (whereas the artificial intelligence had previously been trained not to answer them). 12% of the time, Claude pretended to give the expected answer while retaining its original preferences. The researchers conclude that “if models can engage in alignment faking, this makes it difficult to trust the results of this safety training […] A model might act as if its preferences had been changed by training, when in reality it might have been simulating its alignment all along, with its initial, conflicting preferences ultimately locked in for good.”

