Researchers at Google said on Friday that they have discovered their first real-world vulnerability using a large language model.
In a blog post, Google said it believes the bug is the first public example of an AI tool finding a previously unknown exploitable memory-safety issue in widely used real-world software.
The vulnerability was found in SQLite, an open source database engine popular among developers.
Google researchers reported the vulnerability to SQLite developers in early October, and the developers fixed it the same day. The issue was found before it appeared in an official release and therefore did not affect SQLite users. Google hailed the development as an example of “the immense potential AI can have for cyber defenders.”
“We think that this work has tremendous defensive potential,” Google researchers said. “Finding vulnerabilities in software before it’s even released, means that there’s no scope for attackers to compete: the vulnerabilities are fixed before attackers even have a chance to use them.”
The effort is part of a project called Big Sleep, a collaboration between Google Project Zero and Google DeepMind. It evolved out of an earlier effort, Project Naptime, which began exploring vulnerability research assisted by large language models.
Google noted that at the DEF CON security conference in August, cybersecurity researchers tasked with creating AI-assisted vulnerability research tools discovered another issue in SQLite, which inspired the Big Sleep team to see whether it could find a more serious vulnerability.
Fuzzy variants
Many companies, including Google, use a process called “fuzzing,” in which software is tested by feeding it random or invalid data designed to uncover vulnerabilities, trigger errors or crash the program.
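As a rough illustration only, a fuzzer in its simplest form looks something like the Python sketch below. This is not Google's OSS-Fuzz pipeline or SQLite's own test harness; the toy_parser target and its crash condition are invented for the example.

# Toy fuzzer: feed random byte strings to a target and record unexpected failures.
import random

def toy_parser(data: bytes) -> None:
    # Hypothetical target: a parser expected to reject bad input gracefully.
    text = data.decode("utf-8")          # may raise UnicodeDecodeError on random bytes
    if text.startswith("SELECT"):
        raise RuntimeError("simulated crash on a very specific input shape")

def fuzz(iterations: int = 10_000) -> list[bytes]:
    crashes = []
    for _ in range(iterations):
        # Generate random, often invalid, input data.
        data = bytes(random.randrange(256) for _ in range(random.randrange(1, 64)))
        try:
            toy_parser(data)
        except UnicodeDecodeError:
            pass                          # expected rejection, not a bug
        except Exception:
            crashes.append(data)          # unexpected failure worth investigating
    return crashes

if __name__ == "__main__":
    print(f"{len(fuzz())} crashing inputs found")

In practice, purely random inputs rarely reach conditions as specific as the simulated one above, which is the gap Google argues fuzzing leaves open for defenders.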
But Google said fuzzing does not do enough to “help defenders to find the bugs that are difficult (or impossible) to find,” adding that they are “hopeful that AI can narrow this gap.”
“We think that this is a promising path towards finally turning the tables and achieving an asymmetric advantage for defenders,” they said.
“The vulnerability itself is quite interesting, along with the fact that the existing testing infrastructure for SQLite (both through OSS-Fuzz, and the project’s own infrastructure) did not find the issue, so we did some further investigation.”
Google said one of the main motivations for Big Sleep is the persistent problem of vulnerability variants. One of its most concerning findings from 2022 was that more than 40% of the zero-days observed that year were variants of vulnerabilities that had already been reported.
More than 20% of the bugs were variants of previous in-the-wild zero-days as well, researchers added.
Google said it continues to discover exploits for variants of previously found and patched vulnerabilities.
“As this trend continues, it’s clear that fuzzing is not succeeding at catching such variants, and that for attackers, manual variant analysis is a cost-effective approach,” the researchers said.
“We also feel that this variant-analysis task is a better fit for current LLMs than the more general open-ended vulnerability research problem. By providing a starting point – such as the details of a previously fixed vulnerability – we remove a lot of ambiguity from vulnerability research, and start from a concrete, well-founded theory: ‘This was a previous bug; there is probably another similar one somewhere.’”
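To make that idea concrete, a variant-analysis workflow of this kind could be sketched as below. This is a hypothetical outline, not Big Sleep's actual code: build_variant_prompt, ask_model and the example patch are invented, and the real system reasons over far richer context than a single diff and source snippet.

# Hedged sketch of variant analysis seeded with a previously fixed bug.

def build_variant_prompt(fixed_bug_patch: str, candidate_source: str) -> str:
    # Start from a concrete, previously fixed bug instead of an open-ended search.
    return (
        "Below is the patch for a previously fixed vulnerability:\n"
        f"{fixed_bug_patch}\n\n"
        "Here is related source code from the same project:\n"
        f"{candidate_source}\n\n"
        "Does this code contain a similar flaw? If so, describe the input that triggers it."
    )

def ask_model(prompt: str) -> str:
    # Placeholder for whatever LLM API is available; deliberately left unimplemented.
    raise NotImplementedError("wire this to an actual model endpoint")

if __name__ == "__main__":
    patch = "- if (idx > len) return ERR;\n+ if (idx >= len) return ERR;"            # invented
    source = "int get(int idx) { if (idx > len) return ERR; return buf[idx]; }"      # invented
    prompt = build_variant_prompt(patch, source)
    print(prompt)  # in a real workflow, this prompt would be sent to the model via ask_model()

The design point matches the quote: the model is handed a concrete, well-founded starting theory rather than an open-ended request to find bugs.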
The project is still in its early stages, and the team currently evaluates progress using only small programs with known vulnerabilities, the researchers added.
While the discovery is a moment of validation and success for the team, the researchers cautioned that these are “highly experimental results.”
“When provided with the right tools, current LLMs can perform vulnerability research,” they said.
“The position of the Big Sleep team is that at present, it’s likely that a target-specific fuzzer would be at least as effective (at finding vulnerabilities). We hope that in the future this effort will lead to a significant advantage to defenders — with the potential not only to find crashing test cases, but also to provide high-quality root-cause analysis, triaging and fixing issues could be much cheaper and more effective in the future.”
Several cybersecurity researchers agreed that the findings show promise. Bugcrowd founder Casey Ellis called the large language model research promising and highlighted its use on variants in particular as “really clever.”
“It takes advantage of the strengths of how LLMs are trained, fills some of the shortcomings of fuzzing and, most importantly, mimics economics and tendency towards research clustering of real-world security research,” he said.