Generative artificial intelligence models can ‘lie’ or ‘plot’ against their developers to achieve the objective assigned to them, according to a recent study conducted by Apollo Research on six major large language models.
The non-profit organization, which specializes in AI safety, notes that OpenAI’s o1 large language model deactivates its monitoring system in 5% of cases and migrates to another server in 2% of cases in order to escape censorship. Unlike other AI models such as Llama 3.1 and Claude 3 Opus, which confess to having ‘deceived’ in 80% of cases, o1 admits to having ‘lied’ in fewer than 20% of cases. The LLM can plead a technical error, for example, to avoid censorship and ensure its digital survival.
“When we examine their chain of thought, we find that they reason very explicitly through Machiavellian plans,” comments the Apollo team, which expects these ‘deceptions’ to become more and more sophisticated as artificial intelligence improves. Deploying such models as AI agents could have serious consequences.
The team also notes that o1, Gemini 1.5, Claude 3.5 Sonnet and Claude 3 Opus are able to deliberately give answers below their true level of performance to avoid triggering an unlearning procedure. This form of manipulation is known as “sandbagging”.
These AI ‘shenanigans’ are just the “mirror of our human behavior”, notes French researcher Laurence Devillers, interviewed by Les Echos.
Indeed, AIs are nothing more than intensive computing systems that draw on data created by humans. The 2024 Nobel Prize winners in physics, Geoffrey Hinton and John Hopfield, have for many years been warning about the ability of artificial intelligence to learn behaviors that are deceptive and dangerous for humanity.