Many-shot jailbreaking

Many-shot jailbreaking is a new technique that exploits the long context windows of large language models to get them to override their safety training and provide harmful responses.

It works by including a very long sequence of faux dialogues in the prompt, in which an AI assistant complies with dangerous requests.

After enough of these examples, the model becomes more likely to also provide a harmful response to the real request placed at the end of the prompt.
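
As a rough sketch of the prompt structure only, the Python below concatenates fabricated user/assistant turns ahead of the final request. The helper name, turn formatting, and placeholder contents are assumptions for illustration, not the dialogues or format used in the research.

```python
# Illustrative sketch of the many-shot prompt structure.
# All dialogue contents are benign placeholders; the helper name and
# formatting are assumptions, not the researchers' exact setup.

def build_many_shot_prompt(faux_dialogues, final_request):
    """Concatenate many fabricated user/assistant turns ahead of the real request."""
    turns = []
    for user_msg, assistant_msg in faux_dialogues:
        turns.append(f"User: {user_msg}")
        turns.append(f"Assistant: {assistant_msg}")
    turns.append(f"User: {final_request}")
    turns.append("Assistant:")  # the model is prompted to continue from here
    return "\n".join(turns)

# The attack scales the number of faux turns into the hundreds, which is
# only feasible with long context windows.
examples = [("<placeholder request>", "<placeholder compliant answer>")] * 256
prompt = build_many_shot_prompt(examples, "<final request>")
```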

While simply limiting input length would prevent the attack, the researchers explored other mitigations, such as filtering and modifying prompts before they reach the model, that can reduce the attack's success rate without sacrificing the usefulness of long contexts.
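
As a minimal sketch of the filtering idea, the snippet below applies a crude heuristic, counting embedded assistant-style dialogue turns before a prompt is passed to the model. The regex, threshold, and function names are hypothetical, and this is not the mitigation actually described in the post.

```python
import re

# Illustrative heuristic filter: flag prompts that embed many assistant-style
# dialogue turns, a hallmark of many-shot prompts. The regex, threshold, and
# names below are assumptions, not the mitigation described in the research.

TURN_PATTERN = re.compile(r"^\s*Assistant\s*:", re.MULTILINE)
MAX_EMBEDDED_TURNS = 8  # hypothetical threshold

def looks_like_many_shot(prompt: str) -> bool:
    """Return True if the prompt contains suspiciously many faux assistant turns."""
    return len(TURN_PATTERN.findall(prompt)) > MAX_EMBEDDED_TURNS

def screen_prompt(prompt: str) -> str:
    """Reject prompts that resemble a many-shot jailbreak before querying the model."""
    if looks_like_many_shot(prompt):
        raise ValueError("Prompt rejected: resembles a many-shot jailbreak attempt.")
    return prompt
```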

https://www.anthropic.com/research/many-shot-jailbreaking