Giuliano Giacaglia π²
@giu
We're doing reinforcement learning from human feedback, but that's a super weak form of reinforcement learning. What's the RLHF equivalent of AlphaGo's reward model? It's what I call a vibe check. Imagine training AlphaGo with RLHF: you'd show people two boards and ask, "Which one do you prefer?"
4 replies
1 recast
21 reactions
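To make the contrast concrete, here's a minimal sketch, assuming PyTorch, random tensors as stand-ins for board encodings, and a hypothetical `reward_model`. AlphaGo gets an exact, verifiable game outcome; the RLHF "vibe check" instead fits a reward model to pairwise human preferences with a Bradley-Terry loss, the standard formulation in RLHF reward modeling.

```python
# Sketch only: contrasts AlphaGo's exact reward with an RLHF-style
# preference-learned reward. Board encodings here are random toy tensors.
import torch
import torch.nn as nn
import torch.nn.functional as F

# AlphaGo-style reward: a ground-truth game outcome, no human judgment needed.
def game_reward(won: bool) -> float:
    return 1.0 if won else -1.0

# RLHF-style reward model (hypothetical): scores a flattened 19x19 board.
reward_model = nn.Sequential(nn.Linear(361, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

def preference_loss(preferred: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry: maximize P(preferred > rejected) = sigmoid(r_p - r_r).
    r_p = reward_model(preferred)
    r_r = reward_model(rejected)
    return -F.logsigmoid(r_p - r_r).mean()

# Toy training step: a human annotator "preferred" the first batch of
# boards over the second -- the vibe check, not the game result.
preferred = torch.randn(8, 361)
rejected = torch.randn(8, 361)

optimizer.zero_grad()
loss = preference_loss(preferred, rejected)
loss.backward()
optimizer.step()
```

The gap the post points at: `game_reward` is exact and uncheatable, while the learned reward only reflects which board "looked better" to an annotator, so the policy can optimize the vibe rather than winning.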
yangwao β
@yangwao
Waiting till we have an LLM with Asperger's
0 reply
0 recast
0 reaction