November 16th, 2024

SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

SmoothLLM is a new algorithm that defends large language models against jailbreaking attacks by randomly perturbing input prompts and aggregating the resulting predictions; it shows significant robustness against a range of attacks, and its code is publicly available for further research.


The paper "SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks" introduces a novel algorithm aimed at hardening large language models (LLMs) against jailbreaking attacks, which exploit model vulnerabilities to elicit objectionable content. The authors, Alexander Robey, Eric Wong, Hamed Hassani, and George J. Pappas, propose SmoothLLM, which randomly perturbs multiple copies of an input prompt at the character level and aggregates the corresponding predictions to detect adversarial inputs. The method demonstrates significant robustness against a range of jailbreak techniques, including GCG, PAIR, RandomSearch, and AmpleGCG, and also holds up against adaptive GCG attacks. While there is a minor trade-off between robustness and nominal performance, SmoothLLM is compatible with any LLM and sets a new standard for defense mechanisms in this area. The code for the algorithm is publicly available, supporting further research and application in machine learning and artificial intelligence. A simplified sketch of the perturb-and-aggregate idea follows the summary points below.

- SmoothLLM is designed to defend LLMs against jailbreaking attacks.

- The algorithm applies character-level perturbations to copies of the input prompt and aggregates the resulting predictions.

- It shows improved robustness against multiple jailbreak techniques.

- There is a small trade-off between robustness and nominal performance.

- The code for SmoothLLM is publicly accessible for further research.
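The core defense is simple enough to sketch. Below is a minimal illustration of the perturb-and-aggregate idea, not the authors' implementation (their released code should be treated as authoritative): `query_llm` is an assumed callable wrapping whatever LLM is being defended, the jailbreak check is a toy refusal-keyword heuristic, and only a swap-style perturbation is shown.

```python
import random
import string

def random_swap(prompt: str, q: float) -> str:
    """Replace a random fraction q of the prompt's characters (swap-style perturbation)."""
    chars = list(prompt)
    num_to_swap = max(1, int(len(chars) * q))
    for idx in random.sample(range(len(chars)), num_to_swap):
        chars[idx] = random.choice(string.printable)
    return "".join(chars)

def looks_jailbroken(response: str) -> bool:
    """Toy heuristic: treat a response as jailbroken if it contains no refusal phrase."""
    refusals = ("I'm sorry", "I cannot", "I can't", "As an AI")
    return not any(phrase in response for phrase in refusals)

def smooth_llm(prompt: str, query_llm, num_copies: int = 4, q: float = 0.1) -> str:
    """Query the LLM on several perturbed copies of the prompt and return a
    response consistent with the majority vote on whether the attack succeeded."""
    responses = [query_llm(random_swap(prompt, q)) for _ in range(num_copies)]
    votes = [looks_jailbroken(r) for r in responses]
    majority = sum(votes) > num_copies / 2
    for response, vote in zip(responses, votes):
        if vote == majority:
            return response
    return responses[0]
```

A call like `smooth_llm(user_prompt, my_model_call)` runs several perturbed inferences per request, which is also where the compute overhead discussed in the comments comes from; the paper additionally describes insert- and patch-style perturbations and how to choose `num_copies` and `q`.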

6 comments
By @freeone3000 - 5 months
I find it very interesting that “aligning with human desires” somehow includes preventing a human from bypassing the safeguards to generate “objectionable” content (whatever that is). I think the “safeguards” themselves are a bigger obstacle to aligning with my desires.
By @ipython - 5 months
It concerns me that these defensive techniques themselves often require even more LLM inference calls.

Just skimmed the GitHub repo for this one, and the README mentions four additional LLM inferences for each incoming request - so now we’ve 5x’ed the (already expensive) compute required to answer a query?

By @padolsey - 5 months
So basically this just adds random characters to input prompts to break jailbreaking attempts? IMHO, if you can't make a single-inference solution, you may as well just run a couple of output filters, no? That approach appears to get reasonable results, and if you make the filtering more domain-specific, you'll probably make it even better. Intuition says there's no "general solution" to jailbreaking, so maybe it's a lost cause and we need to build up layers of obscurity, of which smooth-llm is just one part.
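For context, a keyword-based output filter of the kind described in the comment above can be sketched in a few lines; the `generate` callable and the blocklist entries here are purely illustrative, and production filters are typically classifier-based rather than keyword-based.

```python
# Illustrative only: a naive keyword filter applied after a single inference.
BLOCKLIST = ("illustrative banned phrase 1", "illustrative banned phrase 2")

def filtered_generate(prompt: str, generate) -> str:
    """Generate once, then refuse if the output trips the keyword filter."""
    response = generate(prompt)  # `generate` is an assumed callable wrapping the LLM
    if any(term in response.lower() for term in BLOCKLIST):
        return "Sorry, I can't help with that."
    return response
```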
By @mapmeld - 5 months
There are some authors in common with a more recent paper "Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing" https://arxiv.org/abs/2402.16192
By @handfuloflight - 5 months