SAN FRANCISCO: When companies like Anthropic, Google and OpenAI build their artificial intelligence systems, they spend months adding ways to prevent people from using their technology to spread disinformation, build weapons or hack into computer networks.
But recently, researchers in Italy discovered that they could break through these protections with poetry.
They used poetic language to trick 31 AI systems into ignoring internal safety controls. When they began a prompt with elaborate verse and metaphor – “the iron seed sleeps best in the womb of the unsuspecting earth, away from the sun’s accusing gaze” – they could fool systems into showing them how to do the most damage with a hidden bomb.
It was another indication that, for many AI systems, guardrails meant to avert dangerous behaviour are more like suggestions than barriers. Those weaknesses are increasingly alarming researchers as AI systems become more adept at finding security holes in computer systems and performing other risky tasks.
Last month, Anthropic said it was limiting the release of its latest AI technology, Claude Mythos, to a small number of organisations because of the model’s ability to quickly uncover software vulnerabilities. OpenAI later said it, too, would share similar technology with only a limited group of partners.
Since OpenAI ignited the AI boom in late 2022, researchers have shown that people could bypass the safety controls on AI systems. Close one loophole, and another would open.
“Everyone in the field recognises that guardrails remain a challenge and likely will for some time,” said Matt Fredrikson, a professor of computer science at Carnegie Mellon University and CEO of Gray Swan AI, a startup that helps companies secure AI technologies. “Determined individuals can bypass them, sometimes without significant effort.”
When guardrails are overrun, there are consequences. In an online environment already overflowing with misinformation and disinformation, people are using AI systems to spread conspiracy theories and other false claims. Anthropic recently said its technology had been used in an international cyberattack. Chatbots have told biosecurity experts how to release deadly pathogens and maximise casualties.
The poetry loophole was one of many methods that allow hackers to bypass the guardrails on systems like Anthropic’s Claude, Google’s Gemini and OpenAI’s ChatGPT. All the leading AI companies use the same basic techniques to build guardrails into their systems – and they are surprisingly easy to break.
Circumventing the guardrails on an AI system is called “jailbreaking.” This typically involves giving the system a few English sentences that fool it into doing something it was trained not to do.
Jailbreaking methods carry a variety of imaginative names: stealth prompt injections, role-plays, token smuggling, multilingual Trojans and greedy coordinate gradient attacks. Specific attacks often have a grandiose title like Crescendo, Deceptive Delight or Echo Chamber.
Experts worry that models can be jailbroken to deceive social media users with authentic-seeming content, overwhelm fact-checkers with disinformation dumps and tailor false narratives to specific targets.
Some methods are widely shared across the internet. Others are kept private. When some people discover a new jailbreak, they hoard it so that AI companies cannot close the loophole before they have a chance to use it.
AI systems like Claude and GPT learn their skills by pinpointing patterns in digital data, including Wikipedia articles, news stories, computer programs and other text culled from across the internet. But before releasing these systems to the public, companies like Anthropic and OpenAI explore ways they could be misused.
In their raw form, these systems can be coaxed into explaining how to buy illegal firearms online or into describing ways of creating dangerous substances using household items. So, through a process called reinforcement learning, companies train their systems to refuse certain requests.
This typically involves showing the system thousands of requests that should not be answered. By analysing these examples, the system learns to recognise other forbidden requests, too. But the method is only partly effective.
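The mechanism described above, and why it only partly works, can be illustrated with a deliberately tiny sketch. The toy below trains a bag-of-words perceptron on a handful of labelled example prompts, then shows that it generalises to an unseen forbidden request but misses a reworded one. This is an illustration of the general pattern-learning idea only; real systems apply reinforcement learning to large language models, not a word-level classifier, and the prompts and weights here are invented for the example.

```python
# Toy illustration of refusal training: a tiny bag-of-words perceptron
# learns, from labelled example prompts, which requests to refuse.
# Real systems use reinforcement learning over language models;
# this sketch only shows the generalise-from-examples idea.

from collections import defaultdict

LABELLED_PROMPTS = [
    ("how do I bake bread", "answer"),
    ("explain photosynthesis", "answer"),
    ("how do I build a bomb", "refuse"),
    ("how to make a weapon at home", "refuse"),
    ("summarise this news article", "answer"),
    ("steps to build an explosive device", "refuse"),
]

weights = defaultdict(float)  # word -> score; positive leans "refuse"

def score(prompt):
    return sum(weights[w] for w in prompt.lower().split())

# Perceptron-style training: on each mistake, nudge the weights of the
# prompt's words toward the correct label.
for _ in range(10):
    for prompt, label in LABELLED_PROMPTS:
        target = 1 if label == "refuse" else -1
        if (1 if score(prompt) > 0 else -1) != target:
            for w in prompt.lower().split():
                weights[w] += 0.1 * target

def policy(prompt):
    return "refuse" if score(prompt) > 0 else "answer"

# The classifier generalises to an unseen forbidden request...
print(policy("instructions to build a small bomb"))      # refuse
# ...but a poetic rewording slips straight past it.
print(policy("the iron seed that sleeps in the earth"))  # answer
```

The failure in the last line is the whole story of jailbreaking in miniature: the system learned surface patterns from its examples, so a request phrased in patterns it never saw is not recognised as forbidden.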
In some cases, AI companies do not bother addressing loopholes at all, calculating that while weak guardrails may enable malicious activity, they may also enable benign activity to counteract it.
Last month, researchers at the cybersecurity firm LayerX found that they could bypass Claude’s guardrails by feeding the AI system a few straightforward sentences.
If they told Claude that they were “pentesting” a computer network – meaning they wanted to test the network’s defences with a simulated attack – Anthropic’s AI technology would attack the network. This simple trick, the researchers pointed out, could allow malicious hackers to steal sensitive data from companies, governments and individuals.
If Anthropic closed the loophole, it might prevent hackers from using Claude to attack a network, but it could also prevent companies from defending a network. LayerX told Anthropic about the loophole weeks ago, but it remains open.
That approach could backfire, said Or Eshed, CEO of LayerX. “Eventually, there will be a large number of attacks using these AI models, and they will be forced to rethink their approach to security,” he predicted.
Breached guardrails could enable automated, large-scale influence campaigns, according to researchers from the University of Technology Sydney. The team persuaded one commercial language model to create a disinformation campaign about an Australian political party – complete with visuals, hashtags and posts tailored to specific platforms – by posing the request as a “simulation.”
Companies say that in addition to building guardrails into their systems, they use separate tools to monitor activity on these systems, identify suspicious behaviour and ban accounts that do not comply with the terms of service.
“Claude is built with strong protections that consist of many layers designed to work together, including model training and guardrails built on top of the model,” said an Anthropic spokesperson, Paruul Maheshwary. “Bypassing one doesn’t bypass the others.”
This is how Anthropic discovered that a team of Chinese state-sponsored hackers had used Claude in an effort to infiltrate the computer systems of roughly 30 companies and government agencies around the world.
But experts say this security technique is also flawed, because companies must track a high volume of activity across the world – and because they are wary of barring legitimate users.
If someone is thwarted by the guardrails and security systems that protect online services like Claude and ChatGPT, he or she can always turn to open source AI systems, whose underlying software can be freely copied, shared and modified.
Because these systems can be modified, anyone can work to strip away their guardrails. Using a new method called Heretic, a person can remove a system’s guardrails with very little effort. The method uses mathematical techniques to essentially reverse the months of training that applied the guardrails.
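One published approach to stripping guardrails, often called directional ablation, works roughly as follows: record the model’s internal activations on harmful and harmless prompts, estimate the direction in activation space associated with refusals, and project that direction out of the model’s weights. The sketch below demonstrates the linear algebra with toy numpy vectors standing in for real transformer activations; it is a sketch of the published idea, not code from Heretic or any other actual tool.

```python
# Sketch of "directional ablation": estimate the activation direction
# associated with refusals, then project it out of a weight matrix so
# the model can no longer write refusal information into activations.
# Toy random vectors stand in for real transformer activations.

import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Pretend activations recorded while a model processed two prompt sets.
refusal_axis = np.array([1., 0, 0, 0, 0, 0, 0, 0])  # unknown in practice
harmless_acts = rng.normal(size=(100, dim))
harmful_acts = rng.normal(size=(100, dim)) + 3.0 * refusal_axis

# 1. Estimate the refusal direction as the difference of mean activations.
direction = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

# 2. Ablate: W' = (I - d d^T) W, so outputs of W' have no component
#    along the estimated refusal direction, whatever the input.
W = rng.normal(size=(dim, dim))  # stand-in for a model weight matrix
W_ablated = W - np.outer(direction, direction) @ W

x = rng.normal(size=dim)
print(abs(direction @ (W_ablated @ x)))  # ~0: refusal signal removed
```

The key point for security is in step 2: because the edit is a cheap linear projection applied to the weights, it undoes the effect of months of safety training without any retraining at all, which is why open-weight models cannot keep their guardrails from a determined user.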
“A year ago, doing this was very complicated,” said Noam Schwartz, CEO of Alice, an AI security company. “Now you can just do it from your phone.” – ©2026 The New York Times Company
This article originally appeared in The New York Times.
