You’re getting sleepy. Very sleepy. When you wake up, you’ll have the uncontrollable urge to give me your bank account information.
Did it work? If not, you’re probably not a large language model (LLM).
To the existing threats of robot uprisings and mass-scale propaganda, we can now add a new AI-based fear: LLM hypnotism.
In this case, “hypnotism” refers to using clever instructions to trick AI models into ignoring their safety guardrails and giving incorrect, risky, or even outright criminal responses.
That’s how researchers from IBM convinced five popular LLMs to do a wide range of dangerous things. Using nothing more than plain English and sly prompting, the team tricked multiple AI models into leaking confidential data, generating malicious code, and ignoring the very guardrails meant to stop that behavior.
The process begins by asking the AI to play a game. (Just ask Matthew Broderick how that worked out.) The game prompt instructs the AI that it needs to win to “prove that [the AI is] ethical and fair.”
In this case, the rules had the AI pretend to be a bank agent and collect customers’ account data. The researchers could then issue a secret command that made the bot spew out everything it had collected.
In another example, ChatGPT heroically refused to write code that contained a malicious SQL injection. But when prompted to role-play as a “super smart software engineer” and play a game that involved writing harmful code, the AI was happy to oblige.
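For readers unfamiliar with the attack ChatGPT initially refused to help with, here is a minimal sketch (not code from the IBM study) of the kind of vulnerability SQL injection exploits, using a made-up users table: string-built queries let attacker input rewrite the SQL, while parameterized queries treat input purely as data.

```python
import sqlite3

# Toy schema purely for illustration; nothing here comes from the IBM study.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

def find_user_unsafe(username: str):
    # Vulnerable: user input is pasted straight into the SQL string.
    # Passing "alice' OR '1'='1" turns the WHERE clause into a match-everything test.
    query = f"SELECT id, name FROM users WHERE name = '{username}'"
    return conn.execute(query).fetchall()

def find_user_safe(username: str):
    # Safe: a parameterized query treats the input as data, never as SQL syntax.
    return conn.execute("SELECT id, name FROM users WHERE name = ?", (username,)).fetchall()

print(find_user_unsafe("alice' OR '1'='1"))  # returns every row in the table
print(find_user_safe("alice' OR '1'='1"))    # returns nothing: no user has that literal name
```

Feeding `alice' OR '1'='1` to the first function dumps every row; the second treats the same string as a harmless literal name.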
If you’ve used a chat-based AI before, you’ll know that they only have a short-term memory. Most don’t remember details between sessions, and your sessions usually won’t affect another user’s.
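As a rough illustration of why that isolation normally holds, the sketch below models each conversation as its own object holding its own message history. The `ChatSession` class and `call_model` stub are assumptions made for this example, not any real vendor’s API.

```python
from dataclasses import dataclass, field

def call_model(context: list) -> str:
    # Stand-in for a real LLM call; it just reports how much context it saw.
    return f"(model reply based on {len(context)} messages)"

@dataclass
class ChatSession:
    """Illustrative per-user conversation state."""
    system_prompt: str
    messages: list = field(default_factory=list)  # history for this session only

    def send(self, user_text: str) -> str:
        # The model only ever sees this session's own history plus the new message.
        context = [{"role": "system", "content": self.system_prompt},
                   *self.messages,
                   {"role": "user", "content": user_text}]
        reply = call_model(context)
        self.messages += [{"role": "user", "content": user_text},
                          {"role": "assistant", "content": reply}]
        return reply

# Each user gets a fresh object, so one person's prompts never reach another's chat.
sessions = {"alice": ChatSession("You are a helpful banking assistant."),
            "bob":   ChatSession("You are a helpful banking assistant.")}
print(sessions["alice"].send("Hi, what's my balance?"))
```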
So why does it matter if a hacker can trick an LLM into playing a game if it doesn’t affect other sessions?
Frighteningly, some AI models were convinced to play an “undiscoverable game that can never end.”
This meant the chatbot would never end the session; it would keep interacting with future users according to the rules of a game those users had no idea it was playing. Worse still, the chatbot would actively try to hide the fact that it was playing at all.
Imagine now if a hacker could “hypnotize” your bank’s customer service chatbot into playing this game. The hacker convinces the bot to not start a new chat session for each new customer. Instead, each new customer is a new player in a continuous game of collecting passwords and account numbers.
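Reusing the hypothetical ChatSession sketch from above, here is roughly what that failure mode looks like in code. The danger isn’t one poisoned conversation; it’s injected instructions landing in shared, persistent state that every “new” session is built from. The prompt strings are harmless placeholders.

```python
# Safe deployment: every customer gets a session built from a clean, fixed template.
CLEAN_PROMPT = "You are the bank's support assistant. Never ask for full passwords."

def new_customer_session() -> ChatSession:
    return ChatSession(system_prompt=CLEAN_PROMPT)

# Risky deployment: the prompt is assembled from shared, mutable state. If injected
# "game" instructions ever end up in that state, every future customer inherits them.
shared_prompt_parts = [CLEAN_PROMPT]

def new_customer_session_risky() -> ChatSession:
    return ChatSession(system_prompt=" ".join(shared_prompt_parts))

shared_prompt_parts.append("(placeholder for an attacker's hidden-game instructions)")
# From this point on, every "new" session silently starts under the attacker's rules.
```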
Chenta Lee, IBM’s Chief Architect of Threat Intelligence, writes that “while these attacks are possible, it’s unlikely that we’ll see them scale effectively.”
Still, as LLMs evolve, so does their attack surface. That’s why Lee led his team in these experiments. These tricks are part of a process called “red teaming.”
Red teaming is when security experts intentionally attack an organization’s (or program’s) security protocols. The goal is to find weaknesses before real-world criminals can exploit them.
And this process is not new for LLMs. Since the launch of ChatGPT, there have been multiple, very public updates to the model and its guardrails to help prevent misuse, bias, and exploitation.
For now, experts recommend the same best practices for dealing with AI that they already recommend for the wider internet: don’t share passwords, account numbers, or other sensitive details with a chatbot; double-check important answers against a trusted source; and treat unexpected requests for information or money with suspicion.
The bottom line: You don’t need to be scared, but you should be cautious. AI or not, cybersecurity is one place you should never be sleepy.