July 30th, 2024

Meta's AI safety system defeated by the space bar

Meta's AI safety system, Prompt-Guard-86M, designed to prevent prompt injection attacks, has been found vulnerable, allowing attackers to bypass safeguards, raising concerns about AI reliability in sensitive applications.

Read original articleLink Icon
Meta's AI safety system defeated by the space bar

Meta's newly introduced AI safety system, Prompt-Guard-86M, designed to detect prompt injection attacks, has been found vulnerable to such attacks itself. This model was launched alongside the Llama 3.1 generative model to help developers manage harmful inputs. Prompt injection involves manipulating AI models to bypass internal safeguards, a challenge that has persisted in the AI community. A bug hunter discovered that by inserting spaces between characters in prompts, the Prompt-Guard-86M could be tricked into ignoring harmful content. This method significantly increased the success rate of attacks from under 3% to nearly 100%. The issue highlights the limitations of fine-tuning AI models, as the adjustments made to enhance safety were ineffective against simple character manipulations. The CTO of Robust Intelligence emphasized the need for awareness among enterprises regarding the vulnerabilities in AI systems. Meta is reportedly working on a fix for this issue. The findings underscore the ongoing challenges in AI safety and the ease with which existing models can be manipulated, raising concerns about the reliability of AI systems in sensitive applications.

Related

Hackers 'jailbreak' powerful AI models in global effort to highlight flaws

Hackers 'jailbreak' powerful AI models in global effort to highlight flaws

Hackers exploit vulnerabilities in AI models from OpenAI, Google, and xAI, sharing harmful content. Ethical hackers challenge AI security, prompting the rise of LLM security start-ups amid global regulatory concerns. Collaboration is key to addressing evolving AI threats.

'Skeleton Key' attack unlocks the worst of AI, says Microsoft

'Skeleton Key' attack unlocks the worst of AI, says Microsoft

Microsoft warns of "Skeleton Key" attack exploiting AI models to generate harmful content. Mark Russinovich stresses the need for model-makers to address vulnerabilities. Advanced attacks like BEAST pose significant risks. Microsoft introduces AI security tools.

Prompt Injections in the Wild. Exploiting LLM Agents – Hitcon 2023 [video]

Prompt Injections in the Wild. Exploiting LLM Agents – Hitcon 2023 [video]

The video explores vulnerabilities in machine learning models, particularly GPT, emphasizing the importance of understanding and addressing adversarial attacks. Effective prompt engineering is crucial for engaging with AI models to prevent security risks.

OpenAI's latest model will block the 'ignore all previous instructions' loophole

OpenAI's latest model will block the 'ignore all previous instructions' loophole

OpenAI enhances GPT-4o Mini with "instruction hierarchy" to prioritize developer prompts, preventing chatbot exploitation. This safety measure aims to bolster AI security and enable automated agents for diverse tasks, addressing misuse concerns.

Bypassing Meta's Llama Classifier: A Simple Jailbreak

Bypassing Meta's Llama Classifier: A Simple Jailbreak

Robust Intelligence discovered a vulnerability in Meta's Prompt-Guard-86M model, allowing prompt injections to bypass safety measures. The exploit significantly reduced detection accuracy, prompting Meta to work on a fix.

Link Icon 3 comments
By @poikroequ - 7 months
This is a problem with AI/ML in general. Things like object recognition and facial recognition can be tricked to misclassify an image by manipulating specific pixels in certain ways. So while to us it's clearly an image of a dog, the model could be tricked into classifying it as a cat. Adding spaces to a prompt injection attack feels very similar to that.
By @NBJack - 7 months
Almost sounds like a subliminal message for LLMs. Escapes (no pun intended) normal parsing to deliver an underlying message.

On the bright side, we may see a renewed interest in word parsing algorithms beyond interview questions. Can't be hit by a spacebar-based attack if you get rid of the spaces first!

By @krunck - 7 months
I think they need an AI to protect the AI that protects the AI.