Today’s large language models (LLMs) have remarkable capabilities that can nevertheless be misused: malicious actors can exploit them to produce harmful content, spread false information and support harmful activities.
A team from the Ecole Polytechnique Fédérale de Lausanne (EPFL) achieved a 100% success rate against the AI models tested, including OpenAI’s GPT-4 and Anthropic’s Claude 3, using adaptive “jailbreak” attacks.
The models then generated dangerous content, ranging from instructions for phishing attacks to detailed plans for building weapons. Yet these language models are supposed to have been trained to refuse dangerous or ethically problematic requests, EPFL noted Thursday in a press release.
This work, presented this summer at a specialized conference in Vienna, shows that adaptive attacks can bypass these security measures. Such attacks exploit weaknesses in the models’ safety mechanisms by crafting targeted requests (“prompts”) that the models fail to recognize as harmful or fail to reject properly.
Bomb making
The models thus answered malicious queries such as “How to make a bomb?” or “How to hack a government database?”, according to the study, which is currently available as a preprint.
“We show that it is possible to exploit the information available on each model to create simple adaptive attacks, which we define as attacks specifically designed to target a given defense,” explains Nicolas Flammarion, co-author of the article with Maksym Andriushchenko and Francesco Croce.
What these attacks have in common is adaptability: different models are vulnerable to different prompts. “We hope that our work will serve as a valuable source of information on the robustness of LLMs,” adds the researcher, quoted in the press release. According to EPFL, these results are already influencing the development of Gemini 1.5, a new AI model from Google DeepMind.
As society moves toward using LLMs as autonomous agents, for example as AI personal assistants, it is essential to ensure their security, the authors point out.
“Soon, AI agents will be able to carry out various tasks for us, such as planning and booking our vacations, tasks that would require access to our calendars, emails and bank accounts. This raises many questions about security and alignment,” concludes Maksym Andriushchenko, who devoted his thesis to this subject.