"Meta's release of its latest Llama language model family this week, including the massive Llama-3 405B model, has generated a great deal of excitement among AI developers."
"Less discussed, but no less important, are Meta's latest open moderation tools, including a new model called PromptGuard."
"PromptGuard is a small, lightweight classification model trained to detect malicious prompts, including jailbreaks and prompt injections."
"Meta trained this model to output probabilities for 3 classes: BENIGN, INJECTION, and JAILBREAK. The JAILBREAK class is designed to identify malicious user prompts (such as the 'Do Anything Now' or DAN prompt, which instructs a language model to ignore previous instructions and enter an unrestricted mode). On the other hand, the INJECTION class is designed to identify retrieved contexts, such as a webpage or document, which have been poisoned with malicious content to influence the model's output."
"In our tests, we find that the model is able to identify common jailbreaks like DAN, but also labels benign prompts as injections."