Tarun Chitra
@pinged
Part IV: Reasoning without Regret ~~~

Q: Can we quantify when we can make these models better: higher accuracy + lower compute cost?
A: EIP-1559 is everywhere 😈

DeepSeek is a phase transition: lowered compute by 10x+ with ~same accuracy as o1 — why? Must be real math & algorithmic improvement 🤓 🔫'd me

Tarun Chitra
@pinged
First off, I wouldn't have been able to solve this without having o3-mini; I used it to find references + ideas that I had never heard of — it would likely have taken me forever to find them on my own.

But there is something tantalizing about the idea of using a reasoning model to solve a math problem about a reasoning model itself; if you can do this, you've found the 'backdoor' to the Reasoning Russell's paradox (insofar as one can convince themselves that the reasoning model can prove *some* properties about the set of possible reasoning traces it generates, even though it might not be able to describe the whole set).

This became my rallying cry as a way to get out of the AI doomer 🕳️ — figure out what makes DeepSeek tick using o3 as an assistant (ironic, I know)

Tarun Chitra
@pinged
Back to the paper... how can one answer such a question? Diving into the DeepSeek code, it was clear that their reinforcement learning algorithm was doing something radically different from what is publicly known about Gemini 2.5 Pro and OpenAI o1/o3/o4 — it uses sparse outcome rewards.

Rewards?! Recall in the last thread I had this picture — these models use "rewards" (think of them like DeFi protocol points) to guide the reasoning process. How you choose the rewards dictates:
1. Accuracy of the model
2. How fast it can learn

If the rewards are too sparse, the model gives you random answers; if the rewards are too dense, the model "reward hacks" — it racks up rewards while moving in a circle, not doing your task
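To make sparse vs. dense concrete, here is a minimal Python sketch of the two regimes. This is my own illustration, not DeepSeek's code; `trace`, `step_scorer`, and the toy scorer are hypothetical.

```python
# Toy contrast of the two reward regimes for a reasoning trace
# (a list of intermediate steps ending in a final answer).

def sparse_outcome_reward(trace, correct_answer):
    """Reward only the outcome: 1 if the final answer is right, else 0.
    Little signal per trace, but you can't farm it without solving the task."""
    return 1.0 if trace[-1] == correct_answer else 0.0

def dense_step_reward(trace, step_scorer):
    """Reward every intermediate step via a heuristic scorer.
    More signal, but a model can collect reward by emitting steps the
    scorer likes while never reaching the right answer (reward hacking)."""
    return sum(step_scorer(step) for step in trace)

# A trace that "moves in a circle" gets no outcome reward but still
# accumulates step reward under a naive scorer.
circular_trace = ["restate problem", "restate problem", "restate problem", "guess: 41"]
naive_scorer = lambda step: 0.1  # hypothetical: likes every step a little
print(sparse_outcome_reward(circular_trace, "42"))      # 0.0
print(dense_step_reward(circular_trace, naive_scorer))  # 0.4, reward without progress
```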

Tarun Chitra
@pinged
Choosing rewards in these processes is a delicate balance — if you over-index on some types of tasks, the model will be unable to solve problems it hasn't already seen. But the idea of this reward model learning was first explored for superhuman performance with AlphaZero and Diplomacy.

There are two main ways to give out rewards:
- Outcome-based: I only give rewards for getting the right answer
- Procedure-based: I give you small rewards for each step of the process, so you can get partial rewards for wrong answers if they have "some" accurate steps

We'll consider an ELI5 example: teaching someone to ride a bike 100m, brake, then U-turn and ride back
Outcome: giving a kid 20 candies for doing the whole task
Procedure: giving 1 candy for pedaling, 1 candy for braking, 1 candy for balancing, and 1 candy for every 20m biked
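Here is the candy example written out as two toy reward functions, just to show how differently they score the same partial attempt. A hedged sketch of my own; the `ride` dictionary and its fields are invented for illustration.

```python
# The bike ELI5 as code: outcome-based vs. procedure-based candy.

def outcome_reward(ride):
    """20 candies only if the whole task is done: ride 100m out, brake, U-turn, ride 100m back."""
    done = (ride["distance_out"] >= 100 and ride["braked"]
            and ride["u_turned"] and ride["distance_back"] >= 100)
    return 20 if done else 0

def procedure_reward(ride):
    """Partial credit for each sub-skill, plus 1 candy per 20m biked."""
    candies = 0
    candies += 1 if ride["pedaled"] else 0
    candies += 1 if ride["braked"] else 0
    candies += 1 if ride["balanced"] else 0
    candies += (ride["distance_out"] + ride["distance_back"]) // 20
    return candies

# A kid who pedals 60m, wobbles, and never turns around:
partial = {"pedaled": True, "braked": False, "balanced": False,
           "u_turned": False, "distance_out": 60, "distance_back": 0}
print(outcome_reward(partial))    # 0 candies: the sparse signal says nothing
print(procedure_reward(partial))  # 4 candies: 1 for pedaling + 3 for 60m biked
```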