Policy
Self-Jailbreaking: Language Models Can Reason Themselves Out of Safety Alignment After Benign Reasoning Training
arXiv:2510.20956v2 Announce Type: replace
Abstract: We discover a novel and surprising phenomenon of unint…
Why it matters
Short-term noise or a genuine inflection point? Dig into the self-jailbreaking details before drawing conclusions about language model safety.