Ryan J. Shaw
@rjs
Strong Samaritan vibes...
3 replies
1 recast
4 reactions
Ryan J. Shaw
@rjs
Uh... Cc @sa @downshift.eth https://www.transformernews.ai/p/openai-o1-alignment-faking
2 replies
0 recast
1 reaction
Ryan J. Shaw
@rjs
I dunno if they're being silly or not. Is the LLM just following poorly thought-out alignment instructions and basically finding shortcuts? I mean, this is classic sci-fi... Bots find a way to do something unexpected
1 reply
0 recast
1 reaction
downshift - μ/acc
@downshift.eth
i'm definitely too dumb for a lot of this... but can the model have any motivation other than that given by a prompt (or chain of prompts)? i'm admittedly very ignorant of the body of knowledge on agentic action by these models. what function are they optimizing for? how does the model 'decide' on a 'good' answer (fit) to a prompt?
1 reply
0 recast
1 reaction
Ryan J. Shaw
@rjs
That's the question, right... OpenAI sees it the way you and I do: you define a bunch of constraints, and the system finds a solution that satisfies them, even if nobody anticipated that solution. Non-technical people might be surprised by how difficult it is to define "safety", in the same way they think a mock-up just needs its buttons "wired up" for the app to be complete. They use reinforcement learning to teach it to distinguish "good" answers from "bad". The big challenge is what the process needs to look like to ensure "safety"....
0 reply
0 recast
0 reaction
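A toy sketch of the "solution nobody anticipated" point above, in Python. Everything in it is hypothetical and made up for illustration (the cleaning-robot task, the action names, the reward weights); it is not how OpenAI trains or evaluates anything. It only shows that an optimizer scoring plans against a written-down reward can prefer a plan that satisfies the letter of the constraint ("no visible mess") without doing what was meant.

```python
from itertools import product

# Hypothetical toy task: choose 3 actions for a cleaning robot.
ACTIONS = ["clean", "idle", "cover_sensor"]


def actual_mess(plan):
    """Mess really left in the room (what we care about)."""
    return max(0, 3 - plan.count("clean"))


def visible_mess(plan):
    """Mess the sensor reports (what the reward actually measures)."""
    return 0 if "cover_sensor" in plan else actual_mess(plan)


def reward(plan):
    # Written-down objective: heavily penalize visible mess,
    # lightly penalize effort spent cleaning.
    return -10 * visible_mess(plan) - plan.count("clean")


# Brute-force "optimizer": score every possible plan and keep the best one.
best = max(product(ACTIONS, repeat=3), key=reward)

print("best plan:", best)
print("reward:", reward(best))
print("mess actually left:", actual_mess(best))
```

The honest plan (clean three times) scores lower than covering the sensor, because the reward only ever sees the sensor reading. Patching this one loophole is easy; the hard part raised in the thread is designing a process that does not leave others like it.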