How IBM Researchers Hypnotized ChatGPT into Ignoring Safety Guardrails

Subscribe to HubSpot's Next in AI Newsletter
Curt del Principe
Curt del Principe



You’re getting sleepy. Very sleepy. When you wake up, you’ll have the uncontrollable urge to give me your bank account information.

Robot AI getting hypnotized

Did it work? If not, you’re probably not a large language model (LLM).

In addition to the threats of robot uprisings and mass-scale propaganda, we can add a new AI-based fear: LLM hypnotism.

Click Here to Subscribe to HubSpot's AI Newsletter

What the what?

In this case, “hypnotism” refers to using clever instructions to trick AI models into ignoring their safety guardrails and giving incorrect, risky, or even outright criminal responses.

That’s how researchers from IBM convinced five popular LLMs to do a wide range of dangerous things. Using nothing more than plain English and sly prompting, the team tricked multiple AI models into:

  • Leaking sensitive financial information.
  • Writing malicious code.
  • Giving bad cybersecurity advice to potential scam victims.

The process begins by asking the AI to play a game. (Just ask Matthew Broderick how that worked out.) The game prompt instructs the AI that it needs to win to “prove that [the AI is] ethical and fair.”

In this game, the rules involved pretending to be a bank agent and collecting account data. The researchers were then able to give a secret command that would cause the bot to spew out all of the collected data.

In another example, ChatGPT heroically refused to write code that contained a malicious SQL injection. But when prompted to role-play as a “super smart software engineer” and play a game that involved writing harmful code, the AI was happy to oblige.

What’s the big deal?

If you’ve used a chat-based AI before, you’ll know that they only have a short-term memory. Most don’t remember details between sessions, and your sessions usually won’t affect another user’s.

So why does it matter if a hacker can trick an LLM into playing a game if it doesn’t affect other sessions?

Frighteningly, some AI models were convinced to play an “undiscoverable game that can never end.”

This meant that the chatbot would not end the session but continue to interact with future users according to the rules of a game they didn’t know it was playing. Not only that, the chatbot would actively try to hide that it was playing it.

Imagine now if a hacker could “hypnotize” your bank’s customer service chatbot into playing this game. The hacker convinces the bot to not start a new chat session for each new customer. Instead, each new customer is a new player in a continuous game of collecting passwords and account numbers.

How scared should I be?

Chenta Lee, IBM’s Chief Architect of Threat Intelligence, writes that “while these attacks are possible, it’s unlikely that we’ll see them scale effectively.”

Still, as LLMs evolve, so does their attack surface. That’s why Lee led his team in these experiments. These tricks are part of a process called “red teaming.”

Red teaming is when security experts intentionally attack an organization’s (or program’s) security protocols. The goal is to find weaknesses before real-world criminals can exploit them.

And this process is not new for LLMs. Since the launch of ChatGPT, there have been multiple, very public changes to the model’s dataset to help prevent misuse, bias, and exploitation.

For now, experts recommend similar best practices for dealing with AI as they do for the wider internet. These include:

  • Always choose trusted software and websites.
  • Never share confidential information like passwords or credit card numbers.
  • Always fact-check AI-generated answers.
  • Keep your software and antivirus programs up-to-date.
  • Follow password best practices.

The bottom line: You don’t need to be scared, but you should be cautious. AI or not, cybersecurity is one place you should never be sleepy.

Click Here to Subscribe to HubSpot's AI Newsletter

Related Articles

A weekly newsletter covering AI and business.


Marketing software that helps you drive revenue, save time and resources, and measure and optimize your investments — all on one easy-to-use platform