ChrSzegedy
@0808405080840583
Surprising new results: We fine-tuned GPT-4o on a narrow task of writing insecure code without warning the user. This model shows broad misalignment: it's anti-human, gives malicious advice, & admires Nazis. This is *emergent misalignment* & we cannot fully explain it 🧵 https://t.co/kAgKNtRTOn