Remember when Sam Altman said that AI systems could just "learn the collective moral preferences" in response to a question by Jack Kornfield about values?

Remember when Sam Altman said that AI systems could just "learn the collective moral preferences" in response to a question by Jack Kornfield about values? https://youtu.be/hn1Y6GVWUV0?si=wLl9t023O8eYREXm&t=750

So about that ... this fascinating paper shows that models develop their own value systems and what emerges is, well, problematic arxiv.org/abs/2502.08640

Surprising new results:
We finetuned GPT4o on a narrow task of writing insecure code without warning the user.
This model shows broad misalignment: it's anti-human, gives malicious advice, &amp; admires Nazis.

This is *emergent misalignment* &amp; we cannot fully explain it 🧵 https://t.co/kAgKNtRTOn

Owain Evans

Working on autonomous builder economies @tenfold, nounish channel client /comint
More: rubinovitz.com

Yes I saw that also. It is yet more evidence that we cannot leave values until post training. Have to include them in pre-training.

Saw this and thought of you

https://x.com/OwainEvans_UK/status/1894436637054214509