AI & ML
impact 16
Breaking Bad: Interpretability-Based Safety Audits of State-of-the-Art LLMs
arXiv:2604.20945v1. Announce Type: new. Abstract: Effective safety auditing of large language models (LLMs) demands tools that go beyond black-bo…
Why it matters
This is not an isolated result: safety research has been trending toward interpretability-based auditing. The focus on state-of-the-art LLMs makes it particularly relevant.