It concerns me that these defensive techniques themselves often require even more LLM inference calls.
Just skimmed the GitHub repo for this one, and the README mentions four additional LLM inferences for each incoming request - so now we've 5x'ed the (already expensive) compute required to answer a query?
I find it very interesting that “aligning with human desires” somehow includes preventing a human from bypassing the safeguards to generate “objectionable” content (whatever that is). I think the “safeguards” are a bigger obstacle to aligning with my desires.
Some of the authors overlap with a more recent paper, "Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing": https://arxiv.org/abs/2402.16192
So basically this just adds random characters to input prompts to break jailbreaking attempts? IMHO, if you can't make a single-inference solution, you may as well just run a couple of output filters, no? Output filtering appeared to get reasonable results, and if you make the filtering more domain-specific, you'll probably do even better. Intuition says there's no "general solution" to jailbreaking, so maybe that's a lost cause and we need to build up layers of obscurity, of which SmoothLLM is just one part.
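For anyone who hasn't read the paper, here's a rough sketch of the smoothing idea as I understand it: perturb several copies of the prompt, query the model on each, and majority-vote on whether the responses look jailbroken. `query_model` and `is_jailbroken` below are stand-ins you'd have to supply yourself, not the authors' code, and `n_copies=4` just mirrors the "four additional inferences" mentioned upthread.

```python
import random
import string

def perturb(prompt: str, swap_frac: float = 0.1) -> str:
    """Randomly swap a fraction of the prompt's characters."""
    chars = list(prompt)
    if not chars:
        return prompt
    n_swaps = max(1, int(len(chars) * swap_frac))
    for i in random.sample(range(len(chars)), n_swaps):
        chars[i] = random.choice(string.printable)
    return "".join(chars)

def smoothed_respond(prompt: str, query_model, is_jailbroken, n_copies: int = 4) -> str:
    """Query the model on several perturbed copies of the prompt and
    majority-vote on whether the responses look jailbroken."""
    responses = [query_model(perturb(prompt)) for _ in range(n_copies)]
    flags = [is_jailbroken(r) for r in responses]
    if sum(flags) > len(flags) / 2:
        return "Request refused."
    # Otherwise return one of the benign-looking responses.
    benign = [r for r, f in zip(responses, flags) if not f]
    return random.choice(benign) if benign else "Request refused."
```

The bet is that character-level jailbreak suffixes are brittle, so a few random swaps break the attack while the underlying benign request usually survives - but you're still paying for every one of those extra inferences.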