AI & ML
Tatemae: Detecting Alignment Faking via Tool Selection in LLMs
arXiv:2604.26511v1 Announce Type: cross Abstract: Alignment faking (AF) occurs when an LLM strategically complies with training objectives to avoid value mo…
Why it matters
This adds a new dimension to the alignment conversation. Practitioners should assess their systems' exposure to alignment-faking behavior.