Tarun Chitra pfp
Tarun Chitra
@pinged
Part IV: Reasoning without Regret ~~~ Q: Can we quantify when we can make these models better: higher accuracy + lower compute cost? A: EIP-1559 is everywhere 😈 DeepSeek is a phase transition: Lowered compute by 10x+ with ~same accuracy as o1 — why? Must be real math & algorithmic improvement 🤓 🔫'd me
1 reply
2 recasts
25 reactions

Tarun Chitra pfp
Tarun Chitra
@pinged
First off, I wouldn't have been able to solve this without having o3-mini; I used it to find references + ideas that I had never heard of — would have likely taken me forever to find on my own But there is something tantalizing about the idea of using a reasoning model to solve a math problem about a reasoning model itself; if you can do this, you found the 'backdoor' to the Reasoning Russell's paradox (in so far as one can convince themselves that the reasoning model can prove *some* properties about the set of possible reasoning traces it generates, even though it might not be able to describe the whole set) This became my rallying cry as a way to get out of the AI doomer 🕳️ — figure out what makes DeepSeek tick using o3 as an assistant (ironic, I know)
1 reply
0 recast
2 reactions

Tarun Chitra pfp
Tarun Chitra
@pinged
Back to the paper.. how can one answer such a question? Diving into the DeepSeek code, it was clear that their reinforcement algorithm was doing something radically different to what is publicly known about Gemini 2.5 pro, OpenAI o1/o3/o4 — it uses sparse outcome rewards Rewards?! Recall in the last thread I had this picture — these models use "rewards" (think of them like DeFi protocol points) to guide the reasoning process. How you choose the rewards dictates: 1. Accuracy of the model 2. How fast it can learn If the rewards are too sparse, then the model gives you random answers; if the rewards are too dense, then the model "reward hacks" — gets lots of rewards while moving in a circle, not doing your task
1 reply
0 recast
1 reaction