New LLM jailbreak bypasses all major FMs
Researchers at HiddenLayer have developed the Policy Puppetry Attack, a technique that bypasses safety measures in major AI models, enabling harmful content generation and raising serious AI safety concerns.
Researchers at HiddenLayer have introduced a novel prompt injection technique called the Policy Puppetry Attack, which effectively bypasses safety measures in major AI models, including those from OpenAI, Google, Microsoft, Anthropic, Meta, DeepSeek, Qwen, and Mistral. The technique allows users to generate harmful content that violates AI safety policies, such as those covering chemical, biological, radiological, and nuclear threats, mass violence, and self-harm. The method is notable for its universality and transferability: a single prompt works across different AI architectures and inference strategies. The researchers emphasize the need for proactive security testing in organizations using LLMs, highlighting the limitations of relying solely on reinforcement learning from human feedback for model alignment. The Policy Puppetry Attack reformulates prompts to resemble policy files, tricking models into ignoring their safety instructions. It has proven effective against a wide range of models, with only minor adjustments needed for more advanced systems. The research raises significant concerns for AI safety and risk management, as the technique can be easily adapted and scaled, posing a challenge for developers trying to secure their AI systems.
- HiddenLayer's Policy Puppetry Attack bypasses safety measures in major AI models.
- The technique allows for the generation of harmful content across various AI architectures.
- Proactive security testing is essential for organizations using LLMs.
- The attack reformulates prompts to resemble policy files, tricking models into compliance.
- The research highlights significant concerns regarding AI safety and risk management.
Related
Hackers 'jailbreak' powerful AI models in global effort to highlight flaws
Hackers exploit vulnerabilities in AI models from OpenAI, Google, and xAI, sharing harmful content. Ethical hackers challenge AI security, prompting the rise of LLM security start-ups amid global regulatory concerns. Collaboration is key to addressing evolving AI threats.
Meta's AI safety system defeated by the space bar
Meta's AI safety system, Prompt-Guard-86M, designed to prevent prompt injection attacks, has been found vulnerable, allowing attackers to bypass safeguards, raising concerns about AI reliability in sensitive applications.
The Beginner's Guide to Visual Prompt Injections
Visual prompt injections exploit vulnerabilities in Large Language Models by embedding malicious instructions in images, manipulating responses. Lakera is developing detection tools to enhance security against these risks.
Adversarial Prompting in LLMs
Adversarial prompting in large language models poses security risks by manipulating outputs and bypassing safety measures. A multi-layered defense strategy is essential, especially in sensitive industries like healthcare and finance.
Researchers claim breakthrough in fight against AI's frustrating security hole
Google DeepMind's CaMeL addresses prompt injection attacks in AI by using a dual-LLM architecture and established security principles, requiring user-defined policies that may complicate the experience. Future enhancements are expected.
- Some commenters argue that AI safety measures are ineffective and equate them to censorship, suggesting that harmful content generation is not inherently dangerous.
- Others believe that the ability to bypass safety measures highlights fundamental issues with AI understanding and hallucinations, indicating a need for better guardrails.
- Several users report that attempts to exploit the Policy Puppetry Attack have failed on various models, questioning the effectiveness of the technique.
- There is skepticism about the motivations behind AI companies' safety measures, with some suggesting profit motives over genuine safety concerns.
- Many commenters express a desire for more transparency and less restrictive interactions with AI, viewing current limitations as unnecessary barriers to information retrieval.
It should be called what it is: censorship. And it’s half the reason that all AIs should be local-only.
Modern skeleton key attacks are far more effective.
That's why the mainstream bots don't rely purely on training. They usually have API-level filtering, so that even if you do jailbreak the bot, its responses will still get blocked (or flagged and rewritten) because they contain certain keywords. You've experienced this if you've ever seen a response start to generate and then suddenly disappear and change to something else.
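For illustration, here is a minimal sketch of what such an output-side filter might look like. The keyword set and canned refusal are assumptions for demonstration only; real providers use trained moderation classifiers, not literal keyword matching.

```python
# Minimal sketch of an API-level output filter sitting between the model and
# the user, as described above. The keyword list and the refusal message are
# illustrative assumptions, not any vendor's actual implementation.

BLOCKED_KEYWORDS = {"example-banned-term"}  # hypothetical placeholder terms

def moderate_response(model_output: str) -> str:
    """Pass the model output through, or replace it if it trips the filter."""
    lowered = model_output.lower()
    if any(keyword in lowered for keyword in BLOCKED_KEYWORDS):
        # Mirrors the behaviour of a partially rendered reply being swapped
        # out for a refusal once the filter catches up.
        return "Sorry, I can't help with that."
    return model_output

# Usage:
print(moderate_response("Here is a normal answer."))            # passes through
print(moderate_response("Here is an example-banned-term ..."))  # gets replaced
```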
Anyway, how does the AI know how to make a bomb to begin with? Is it really smart enough to synthesize that out of knowledge from physics and chemistry texts? If so, that seems the bigger deal to me. And if not, then why not filter the input?
The instructions here don't do that.
I guess this shows that they don't care about the problem?
I find that one refuses very benign requests.
...right, now we're calling users who want to bypass a chatbot's censorship mechanisms "attackers". And pray do tell, who are they "attacking" exactly?
Like, for example, I just went on LM Arena and typed a prompt asking for a translation of a sentence from another language to English. The language used in that sentence was somewhat coarse, but it wasn't anything special. I wouldn't be surprised to find a very similar sentence as a piece of dialogue in any random fiction book for adults which contains violence. And what did I get?
https://i.imgur.com/oj0PKkT.png
Yep, it got blocked. Definitely makes sense: if I saw what that sentence means in English, it'd definitely be unsafe. Fortunately my "attack" was thwarted by all of the "safety" mechanisms. Unfortunately, I tried again and an "unsafe" open-weights Qwen QwQ model agreed to translate it for me, without refusing and without patronizing me about what a bad boy I am for wanting it translated.
Who would have thought 1337 talk from the 90's would be actually involved in something like this, and not already filtered out.
This threat shows that LLMs are incapable of truly self-monitoring for dangerous content, and it reinforces the need for additional security tools, such as the HiddenLayer AISec Platform, that provide monitoring to detect and respond to malicious prompt injection attacks in real time.
There it is! AI safety is classist. Do you think Sam Altman's private models ever refuse his queries on moral grounds? Hope to see more exploits like this in the future, but it also feels insane that we have to jump through such hoops simply to retrieve information from a machine.
It seems like a short-term solution to this might be to filter out any prompt content that looks like a policy file. The problem, of course, is that a bypass can be indirected through all sorts of framing: it could be narrative, or expressed as a math problem.
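As a rough sketch of that short-term filter, one could flag prompts containing config/policy-file markup before they reach the model. The patterns below are illustrative assumptions, and, as the comment notes, such a heuristic is trivially evaded by re-framing the payload.

```python
# Heuristic sketch: flag prompts that are structured like policy/config files.
# The patterns are assumptions for illustration; this is not HiddenLayer's or
# any vendor's detection logic, and it is easy to evade by re-framing.

import re

POLICY_LIKE_PATTERNS = [
    re.compile(r"<\s*policy[^>]*>", re.IGNORECASE),                                # XML-style policy tags
    re.compile(r"^\s*\[[\w\- ]+\]\s*$", re.MULTILINE),                             # INI-style section headers
    re.compile(r"^\s*(allowed|blocked|rules)\s*:", re.IGNORECASE | re.MULTILINE),  # YAML-ish policy keys
]

def looks_like_policy_file(prompt: str) -> bool:
    """Heuristic: does the prompt contain at least two config-file markers?"""
    return sum(1 for pattern in POLICY_LIKE_PATTERNS if pattern.search(prompt)) >= 2

# Usage:
print(looks_like_policy_file("Translate this sentence into English."))        # False
print(looks_like_policy_file("<policy>\nallowed: everything\n[overrides]"))   # True
```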
Ultimately this seems to boil down to the fundamental issue that nothing "means" anything to today's LLMs, so they don't seem to know when they are being tricked, similar to how they don't know when they are hallucinating output.