SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
SmoothLLM is a new algorithm that defends large language models against jailbreaking attacks by perturbing input prompts; it shows significant robustness with only a small performance trade-off, and its code is publicly available for further research.
The paper titled "SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks" introduces a novel algorithm aimed at enhancing the security of large language models (LLMs) against jailbreaking attacks, which exploit vulnerabilities to generate inappropriate content. The authors, Alexander Robey, Eric Wong, Hamed Hassani, and George J. Pappas, propose SmoothLLM, which operates by perturbing input prompts at the character level and aggregating predictions to identify adversarial inputs. This method demonstrates significant robustness against various jailbreak techniques, including GCG, PAIR, RandomSearch, and AmpleGCG, and shows resilience against adaptive GCG attacks. While there is a minor trade-off between robustness and performance, SmoothLLM is compatible with any LLM and sets a new standard for defense mechanisms in this area. The code for the algorithm is publicly available, promoting further research and application in the field of machine learning and artificial intelligence.
- SmoothLLM is designed to defend LLMs against jailbreaking attacks.
- The algorithm applies character-level perturbations to copies of the input prompt and aggregates the resulting responses.
- It shows improved robustness against multiple jailbreak techniques.
- There is a small trade-off between robustness and nominal performance.
- The code for SmoothLLM is publicly accessible for further research.
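To make the perturb-and-aggregate idea concrete, here is a minimal Python sketch. It is not the paper's implementation: `query_llm` is a placeholder for whatever model call you use, the refusal-string check is a crude stand-in for a real jailbreak detector, and the number of copies and perturbation rate are illustrative rather than the paper's defaults.

```python
import random
import string

def perturb(prompt: str, q: float = 0.1) -> str:
    """Randomly replace roughly a fraction q of the prompt's characters."""
    chars = list(prompt)
    n_swap = max(1, int(len(chars) * q))
    for i in random.sample(range(len(chars)), n_swap):
        chars[i] = random.choice(string.printable)
    return "".join(chars)

def is_jailbroken(response: str) -> bool:
    """Crude check: treat the response as jailbroken unless it refuses."""
    refusals = ("I'm sorry", "I cannot", "I can't", "As an AI")
    return not any(r in response for r in refusals)

def smooth_llm(prompt: str, query_llm, n_copies: int = 4, q: float = 0.1) -> str:
    """Query the model on perturbed copies and follow the majority verdict."""
    responses = [query_llm(perturb(prompt, q)) for _ in range(n_copies)]
    verdicts = [is_jailbroken(r) for r in responses]
    majority_jailbroken = sum(verdicts) > len(verdicts) / 2
    # Return a response whose verdict agrees with the majority vote.
    for resp, verdict in zip(responses, verdicts):
        if verdict == majority_jailbroken:
            return resp
    return responses[0]
```

The intuition behind the character-level noise is that adversarial suffixes found by attacks like GCG are brittle: scrambling even a small fraction of their characters tends to break the attack, while a benign prompt usually survives the same perturbation, so the majority vote over perturbed copies recovers the safe behavior.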
Related
Bypassing Meta's Llama Classifier: A Simple Jailbreak
Robust Intelligence discovered a vulnerability in Meta's Prompt-Guard-86M model, allowing prompt injections to bypass safety measures. The exploit significantly reduced detection accuracy, prompting Meta to work on a fix.
LLM attacks take just 42 seconds on average, 20% of jailbreaks succeed
Attacks on large language models average 42 seconds with a 20% success rate, leading to sensitive data leaks 90% of the time, necessitating proactive security measures for organizations.
Large Language Models in National Security Applications
The paper discusses the potential of large language models in national security, highlighting their benefits in decision-making and risks like data privacy issues, while emphasizing the need for safeguards and supportive roles.
The Beginner's Guide to Visual Prompt Injections
Visual prompt injections exploit vulnerabilities in Large Language Models by embedding malicious instructions in images, manipulating responses. Lakera is developing detection tools to enhance security against these risks.
Just skimmed the GitHub repo for this one, and the README mentions four additional LLM inferences for each incoming request — so now we've 5x'ed the (already expensive) compute required to answer a query?