July 29th, 2024

Bypassing Meta's Llama Classifier: A Simple Jailbreak

Robust Intelligence discovered a vulnerability in Meta's Prompt-Guard-86M model, allowing prompt injections to bypass safety measures. The exploit significantly reduced detection accuracy, prompting Meta to work on a fix.

Read original articleLink Icon
Bypassing Meta's Llama Classifier: A Simple Jailbreak

Robust Intelligence has identified a significant vulnerability in Meta's Prompt-Guard-86M model, part of the Llama 3.1 AI safety suite designed to detect prompt injections and jailbreak attempts. The model, intended to protect large language models from malicious inputs, was found to have a simple exploit that allows users to bypass its safety measures. This discovery was made during an audit that compared the embedding vectors of fine-tuned and non-fine-tuned versions of the model. The analysis revealed that single characters of the English alphabet remained largely unchanged during fine-tuning, creating an opportunity for exploitation.

The jailbreak method developed involves spacing out input prompts and removing punctuation, which effectively circumvents the model's detection capabilities. This approach demonstrated a drastic reduction in the model's accuracy, dropping from 100% to 0.2% when tested against 450 harmful intent prompts. The findings highlight the need for more comprehensive testing and validation of AI safety measures, even from reputable sources. Robust Intelligence has reported the issue to Meta, which is currently working on a fix. The simplicity and effectiveness of this exploit raise concerns for organizations relying on the model for AI security, emphasizing the importance of continuous evaluation and a multi-layered security approach.

Related

Hackers 'jailbreak' powerful AI models in global effort to highlight flaws

Hackers 'jailbreak' powerful AI models in global effort to highlight flaws

Hackers exploit vulnerabilities in AI models from OpenAI, Google, and xAI, sharing harmful content. Ethical hackers challenge AI security, prompting the rise of LLM security start-ups amid global regulatory concerns. Collaboration is key to addressing evolving AI threats.

Mitigating Skeleton Key, a new type of generative AI jailbreak technique

Mitigating Skeleton Key, a new type of generative AI jailbreak technique

Microsoft has identified Skeleton Key, a new AI jailbreak technique allowing manipulation of AI models to produce unauthorized content. They've implemented Prompt Shields and updates to enhance security against such attacks. Customers are advised to use input filtering and Microsoft Security tools for protection.

'Skeleton Key' attack unlocks the worst of AI, says Microsoft

'Skeleton Key' attack unlocks the worst of AI, says Microsoft

Microsoft warns of "Skeleton Key" attack exploiting AI models to generate harmful content. Mark Russinovich stresses the need for model-makers to address vulnerabilities. Advanced attacks like BEAST pose significant risks. Microsoft introduces AI security tools.

Prompt Injections in the Wild. Exploiting LLM Agents – Hitcon 2023 [video]

Prompt Injections in the Wild. Exploiting LLM Agents – Hitcon 2023 [video]

The video explores vulnerabilities in machine learning models, particularly GPT, emphasizing the importance of understanding and addressing adversarial attacks. Effective prompt engineering is crucial for engaging with AI models to prevent security risks.

Llama 3.1: Our most capable models to date

Llama 3.1: Our most capable models to date

Meta has launched Llama 3.1 405B, an advanced open-source AI model supporting diverse languages and extended context length. It introduces new features like Llama Guard 3 and aims to enhance AI applications with improved models and partnerships.

Link Icon 0 comments