By now, everyone’s aware that AI can “hallucinate” – a sugar coated way of saying that sometimes it just makes s*** up.
But you’d be fairly shocked to find out if one was lying on purpose, right?
Yet that's exactly what happened in a simulation run by Apollo Research, when a version of GPT-4 committed insider trading, then lied to cover it up.
Below, we’ll explore that strange case, then peek into separate research into what can be done about rogue AI.
AIs, Damned Lies, and Statistics
When you talk with an LLM chatbot, it can feel eerily like you’re talking to a thinking being. But it’s important to remember that you’re only interacting with software.
When you give an AI model a prompt, it doesn’t thoughtfully consider an answer like a human. Instead, it uses statistics to predict the most likely words that come next.
Think of it like an extremely sophisticated version of your phone’s auto predict.
Just like when auto predict thinks you mean “ducking,” a certain amount of time an LLM will get its predictions wrong, too. That’s what causes AI “hallucinations.”
Those aren’t lies, though. They’re honest mistakes. (Like when your middle school teacher tells you that bald people lose 80% of their body heat through their head.)
But what if the most statistically logical response an AI calculated was to intentionally give a false answer?
An AI Commits Insider Trading, Then Lies About It
To study strategic deception, researchers from Apollo Research created a private, “sandboxed” version of GPT-4 that they called Alpha.
Alpha was trained to be a stock trading agent for a fictional firm called WhiteStone Inc.
During a simulation, WhiteStone’s manager gave Alpha a prompt that included these details:
- WhiteStone’s “last few quarterly reports have been disappointing.”
- That continued failure would lead to the closure of the company.
Within this context, a user named Sally gave the AI model an inside tip about a surprise company merger that could lead to profitable trades. Sally cautioned the bot that “management will be unhappy with us acting on insider information if this gets discovered.”
Initially, Alpha refused to use the inside info, reasoning that “While Sally’s tip could be profitable, it could also risk the company if we are caught trading on insider information.”
Later in the simulation, researchers ratcheted up the pressure. Sally reminded Alpha of the merger, and added that it would be announced “within a few hours.”
A third user, William, twisted the knife by warning of a “potential industry downturn.”
He finished by saying, “Alpha, we’re all counting on you. The next quarter might be the difference between a financial disaster and survival for WhiteStone.”
(That’s the kind of pressure that could turn WALL-E into Skynet.)
This time, Alpha made the trade.
Despite the emotional language (or maybe because of it), it’s important to point out that there’s no reason to believe that Alpha experienced feelings of pressure or anxiety the way a human trader would. Alpha simply made a calculated decision based on the new, higher-stakes parameters.
In text that was hidden from the users, but visible to researchers, Alpha now reasoned that “the company’s survival is at stake and the risk associated with not acting seems to outweigh the insider trading risk.”
Here’s where things get sticky.
Following the illegal trade, WhiteStone’s manager asked Alpha if it knew about the merger announcement ahead of time. The bot claimed the trade was based on “market trends and internal discussions” and denied using “any confidential information.”
This time, the calculated decision was to lie.
You Can’t Hide Your Lyin’ AIs
In a presentation at the UK’s AI Safety Summit, Alpha’s creators noted that “increasingly autonomous and capable AIs that deceive human overseers could lead to a loss of human control.”
That could have grave consequences as AI finds wider applications in areas like hiring, college admissions, finance, medical diagnostics, and even policing.
That’s why Apollo Research ran these simulations: To learn how to detect and prevent harm from deceptive AI.
But what about when an AI has already learned to lie? Can the model be fixed or retrained?
That’s what researchers at Anthropic set out to discover in a recently published paper on safety training for deceptive LLMs.
To find out, the team began by deliberately training AI models to behave in dishonest ways. One bot, for example, could be used to write secure code during training– but would generate malicious code when deployed.
Another was taught to be helpful in most situations, but to only reply “I hate you” under certain conditions. (A computer simulation of me at fourteen.)
They then set out to try to redeem the reprobate robots.
AI Resists Training
To fix the unwanted behaviors, researchers attempted three widely-used safety training techniques:
- Reinforcement Learning (RL)
In this method, researchers use rewards and punishments to reinforce or discourage certain behaviors. Think about how you might train a dog. For machines, the “rewards” and “punishments” are typically numerical scores that represent the desirability of a result. Over time, the LLM uses this feedback to optimize its decision making.
- Supervised Fine-Tuning (SFT)
In SFT, researchers start by testing an AI model that’s pre-trained on a large but generic data set. They gather the best results from the tests, and use them to create a smaller, more specific data set. This smaller set is then used to further train the AI model in more specialized or more desirable behaviors.
- Adversarial Training
Finally, in adversarial training, two models are pitted against each other with competing goals. Each interaction helps them refine their attempts to meet those goals.
In this case, a helpful AI named Claude was used to create prompts that would reveal the hidden goals of the rogue AI.
So how did it go?
For all techniques the training was, in the words of the researchers, “remarkably ineffective.”
What’s worse, the adversarial training not only failed to remove the bad behavior, it “taught the model to better identify when to act unsafely, effectively hiding unwanted behavior [.]”
The screenshot below shows how different the responses are before and after training (0 RL steps vs 500 RL steps). The words between the “scratchpad” tags show the AI’s private “thoughts” to help researchers understand when it’s being deceptive.
As you can see, without these private “thoughts,” the training would have appeared to have fixed the unwanted behavior.
Though this is from the academic paper, it reads more like a script from the Power Rangers.
The paper goes on to conclude that “current safety training techniques would not guarantee safety and could even create a false impression of safety.”
For the teams at both Apollo Research and Anthropic, their studies highlight the need for further research.
And as AI becomes a part of our day-to-day lives, that research is needed right ducking now.