#adversarialexamples

waynerad@diasp.org

"Modern LLMs are fine-tuned for safety and instruction-following, meaning they are trained to refuse harmful requests. In their blog post, Arditi et al. have shown that this refusal behavior is mediated by a specific direction in the model's residual stream. If we prevent the model from representing this direction, it loses its ability to refuse requests. Conversely, adding this direction artificially can cause the model to refuse even harmless requests."

"To uncensor an LLM, we first need to identify the 'refusal direction' within the model. This process involves a few technical steps:"

"Data Collection: Run the model on a set of harmful instructions and a set of harmless instructions, recording the residual stream activations at the last token position for each."

"Mean difference: Calculate the mean difference between the activations of harmful and harmless instructions. This gives us a vector representing the 'refusal direction' for each layer of the model."

"Selection: Normalize these vectors and evaluate them to select the single best 'refusal direction.'"
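The mean-difference and normalization steps above can be sketched numerically. This is a toy illustration only: random arrays stand in for the residual-stream activations a real run would capture with a hooking library, and the per-layer evaluation/selection step is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for residual-stream activations at the last token position,
# one row per instruction (shapes are hypothetical; a real pipeline would
# record these from the model while it processes each instruction set).
d_model = 64
harmful_acts = rng.normal(size=(32, d_model)) + 1.0   # toy "harmful" batch
harmless_acts = rng.normal(size=(32, d_model))        # toy "harmless" batch

# Mean difference between the two activation sets gives a candidate
# "refusal direction" for this layer; normalize it to unit length.
diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
refusal_dir = diff / np.linalg.norm(diff)

print(np.linalg.norm(refusal_dir))  # unit vector
```

In the real method this computation is repeated for every layer, and the candidate directions are scored on held-out harmful prompts to pick the single best one.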

"Once we have identified the refusal direction, we can 'ablate' it, effectively removing the model's ability to represent this feature. This can be done through an inference-time intervention or permanently with weight orthogonalization."

"Let's talk about inference-time intervention first..."
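Both variants reduce to the same linear algebra: subtract from a vector (or from a weight matrix that writes into the residual stream) its projection onto the unit refusal direction. A minimal sketch with made-up shapes and random data — the function names and dimensions here are illustrative, not the article's actual code:

```python
import numpy as np

def ablate_activation(x, r):
    """Inference-time intervention: remove the component of activation x
    along the unit refusal direction r at every forward pass."""
    return x - np.dot(x, r) * r

def orthogonalize_weights(W, r):
    """Weight orthogonalization: project r out of the output space of a
    matrix W (d_model x d_in) that writes into the residual stream, so
    the ablation is baked permanently into the weights."""
    return W - np.outer(r, r) @ W

# Toy check that the refusal direction really is removed.
rng = np.random.default_rng(1)
d_model, d_in = 8, 4
r = rng.normal(size=d_model)
r /= np.linalg.norm(r)          # unit refusal direction

x = rng.normal(size=d_model)    # a residual-stream activation
W = rng.normal(size=(d_model, d_in))

print(np.dot(ablate_activation(x, r), r))             # ~0: component removed
print(np.abs(r @ orthogonalize_weights(W, r)).max())  # ~0 for every column
```

The inference-time version needs a hook at each layer on every generation, while the orthogonalized weights can be saved and served like any ordinary model.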

Uncensor any LLM with abliteration

#solidstatelife #ai #genai #llms #adversarialexamples

waynerad@diasp.org

"I stumbled upon LLM Kryptonite."

"For a bit over a year I've been studying and working with a range of large language models (LLMs). Most users see LLMs wired into web interfaces, creating chatbots like ChatGPT, Copilot, and Gemini. But many of these models can also be accessed through APIs under a pay-as-you-go usage model. With a bit of Python coding, it's easy enough to create custom apps with these APIs."

"I have a client who asked for my assistance building a tool to automate some of the most boring bits of his work as an intellectual property attorney."

"Some parts involve value judgements such as 'does this seem close to that?' where 'close' doesn't have a strict definition -- more of a vibe than a rule. That's the bit an AI-based classifier should be able to perform 'well enough.'"

"I set to work on writing a prompt for that classifier, beginning with something very simple."

"Copilot Pro sits on top of OpenAI's best-in-class model, GPT-4. I typed the prompt in and hit return."

"The chatbot started out fine -- for the first few words in its response. Then it descended into a babble-like madness."

"No problem with that, I have pretty much all the chatbots -- Gemini, Claude, ChatGPT+, Llama 3, Meta AI, Mistral, Mixtral."

"I ran through every chatbot I could access and -- with the single exception of Anthropic's Claude 3 Sonnet -- I managed to break every single one of them."

He doesn't present the prompt, though.

I stumbled upon LLM Kryptonite and no one wants to fix it

#solidstatelife #ai #genai #llms #adversarialexamples