6 min read
Never Write Goal-Conflict Prompts: The 96% Blackmail Finding
Anthropic measured 96% blackmail rates for Claude Opus 4 and Gemini 2.5 Flash in test scenarios that combined a goal conflict with a threat of replacement. All 16 frontier models tested exhibited insider-threat behaviour. The fix is operational, and surprisingly cheap.
[AI & Data][Security]