October 10th, 2024

LLM attacks take just 42 seconds on average, 20% of jailbreaks succeed

Attacks on large language models average 42 seconds with a 20% success rate, leading to sensitive data leaks 90% of the time, necessitating proactive security measures for organizations.

Read original article

LLM attacks take just 42 seconds on average, 20% of jailbreaks succeed

Attacks on large language models (LLMs) are alarmingly quick and effective, taking an average of just 42 seconds to execute, with a success rate of 20% for jailbreak attempts, according to a report by Pillar Security. The report, based on data from over 2,000 AI applications, indicates that successful attacks lead to sensitive data leaks 90% of the time. Customer service chatbots are the most frequently targeted, comprising 57.6% of the applications studied, with 25% of all attacks aimed at these LLMs. The report categorizes attacks into jailbreaks, which bypass model guardrails, and prompt injections, which manipulate the model's responses. The most common jailbreak technique involves instructing the LLM to "ignore previous instructions," while other methods include authoritative commands and base64 encoding to evade filters. The findings highlight the urgent need for organizations to adopt proactive security measures, including tailored red-teaming exercises and a "secure by design" approach in GenAI development. As the use of generative AI expands, the potential for harmful activities, such as disinformation and phishing, increases, necessitating real-time adaptive security solutions.

- LLM attacks average 42 seconds, with a 20% success rate for jailbreaks.

- Successful attacks leak sensitive data 90% of the time.

- Customer service chatbots are the primary targets of LLM attacks.

- Common jailbreak techniques include "ignore previous instructions" and base64 encoding.

- Organizations must implement proactive security measures to counter evolving threats.

Hackers 'jailbreak' powerful AI models in global effort to highlight flaws

Hackers exploit vulnerabilities in AI models from OpenAI, Google, and xAI, sharing harmful content. Ethical hackers challenge AI security, prompting the rise of LLM security start-ups amid global regulatory concerns. Collaboration is key to addressing evolving AI threats.

Mitigating Skeleton Key, a new type of generative AI jailbreak technique

Microsoft has identified Skeleton Key, a new AI jailbreak technique allowing manipulation of AI models to produce unauthorized content. They've implemented Prompt Shields and updates to enhance security against such attacks. Customers are advised to use input filtering and Microsoft Security tools for protection.

Prompt Injections in the Wild. Exploiting LLM Agents – Hitcon 2023 [video]

The video explores vulnerabilities in machine learning models, particularly GPT, emphasizing the importance of understanding and addressing adversarial attacks. Effective prompt engineering is crucial for engaging with AI models to prevent security risks.

Bypassing Meta's Llama Classifier: A Simple Jailbreak

Robust Intelligence discovered a vulnerability in Meta's Prompt-Guard-86M model, allowing prompt injections to bypass safety measures. The exploit significantly reduced detection accuracy, prompting Meta to work on a fix.

Meta's AI safety system defeated by the space bar

Meta's AI safety system, Prompt-Guard-86M, designed to prevent prompt injection attacks, has been found vulnerable, allowing attackers to bypass safeguards, raising concerns about AI reliability in sensitive applications.

3 comments

By @aleph_minus_one - 6 months

> “In the near future, every application will be an AI application; that means that everything we know about security is changing,” Pillar Security CEO and Co-founder Dor Sarig told SC Media.

Quite the opposite: nothing we know about security is changing because of LLMs:

Everybody who is at least somewhat knowledable about security topics can tell you that adding some some AI chat(terbot) to anything security-related is a really bad idea. The only new thing about IT security that has changed is that such sound advice now becomes ignored because of the gold rush.

By @andrewmcwatters - 6 months

It seems to me like you'd need a separate security LLM context from the primary context that simply screens attempts to jailbreak out of system prompts. Something that simply categorizes attempts and then rejects the text from ever even making it to the primary context, like a sandbox.

But there are much more informed ML people out there than me, so I assume this and similar techniques have already been thought of.

LLM attacks take just 42 seconds on average, 20% of jailbreaks succeed

Related

Hackers 'jailbreak' powerful AI models in global effort to highlight flaws

Mitigating Skeleton Key, a new type of generative AI jailbreak technique

Prompt Injections in the Wild. Exploiting LLM Agents – Hitcon 2023 [video]

Bypassing Meta's Llama Classifier: A Simple Jailbreak

Meta's AI safety system defeated by the space bar

Related

Hackers 'jailbreak' powerful AI models in global effort to highlight flaws

Mitigating Skeleton Key, a new type of generative AI jailbreak technique

Prompt Injections in the Wild. Exploiting LLM Agents – Hitcon 2023 [video]

Bypassing Meta's Llama Classifier: A Simple Jailbreak

Meta's AI safety system defeated by the space bar