"Modern LLMs are fine-tuned for safety and instruction-following, meaning they are trained to refuse harmful requests. In their blog post, Arditi et al. have shown that this refusal behavior is mediated by a specific direction in the model's residual stream. If we prevent the model from representing this direction, it loses its ability to refuse requests. Conversely, adding this direction artificially can cause the model to refuse even harmless requests."
"To uncensor an LLM, we first need to identify the 'refusal direction' within the model. This process involves a few technical steps:"
"Data Collection: Run the model on a set of harmful instructions and a set of harmless instructions, recording the residual stream activations at the last token position for each."
"Mean difference: Calculate the mean difference between the activations of harmful and harmless instructions. This gives us a vector representing the 'refusal direction' for each layer of the model."
"Selection: Normalize these vectors and evaluate them to select the single best 'refusal direction.'"
"Once we have identified the refusal direction, we can 'ablate' it, effectively removing the model's ability to represent this feature. This can be done through an inference-time intervention or permanently with weight orthogonalization."
"Let's talk about inference-time intervention first..."