Tarun Chitra
@pinged
Part IV: Reasoning without Regret ~~~

Q: Can we quantify when we can make these models better: higher accuracy + lower compute cost?
A: EIP-1559 is everywhere 😈

DeepSeek is a phase transition: lowered compute by 10x+ with ~same accuracy as o1 — why? Must be real math & algorithmic improvement 🤓 🔫'd me

Tarun Chitra
@pinged
First off, I wouldn't have been able to solve this without having o3-mini; I used it to find references + ideas that I had never heard of — it would likely have taken me forever to find them on my own.

But there is something tantalizing about the idea of using a reasoning model to solve a math problem about a reasoning model itself; if you can do this, you've found the 'backdoor' to the Reasoning Russell's paradox (insofar as one can convince themselves that the reasoning model can prove *some* properties about the set of possible reasoning traces it generates, even though it might not be able to describe the whole set).

This became my rallying cry as a way to get out of the AI doomer 🕳️ — figure out what makes DeepSeek tick using o3 as an assistant (ironic, I know)

Tarun Chitra
@pinged
Back to the paper... how can one answer such a question? Diving into the DeepSeek code, it was clear that their reinforcement learning algorithm was doing something radically different from what is publicly known about Gemini 2.5 Pro and OpenAI o1/o3/o4 — it uses sparse outcome rewards.

Rewards?! Recall in the last thread I had this picture — these models use "rewards" (think of them like DeFi protocol points) to guide the reasoning process. How you choose the rewards dictates:
1. Accuracy of the model
2. How fast it can learn

If the rewards are too sparse, the model gives you random answers; if the rewards are too dense, the model "reward hacks" — it racks up rewards while moving in a circle, not doing your task
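To make sparse vs. dense concrete, here is a minimal Python sketch of the two regimes. This is my own illustration, not DeepSeek's code; `trace`, `step_scorer`, and the toy scorer are hypothetical.

```python
# Toy contrast of the two reward regimes for a reasoning trace
# (a list of intermediate steps ending in a final answer).

def sparse_outcome_reward(trace, correct_answer):
    """Reward only the outcome: 1 if the final answer is right, else 0.
    Little signal per trace, but you can't farm it without solving the task."""
    return 1.0 if trace[-1] == correct_answer else 0.0

def dense_step_reward(trace, step_scorer):
    """Reward every intermediate step via a heuristic scorer.
    More signal, but a model can collect reward by emitting steps the
    scorer likes while never reaching the right answer (reward hacking)."""
    return sum(step_scorer(step) for step in trace)

# A trace that "moves in a circle" gets no outcome reward but still
# accumulates step reward under a naive scorer.
circular_trace = ["restate problem", "restate problem", "restate problem", "guess: 41"]
naive_scorer = lambda step: 0.1  # hypothetical: likes every step a little
print(sparse_outcome_reward(circular_trace, "42"))      # 0.0
print(dense_step_reward(circular_trace, naive_scorer))  # 0.4, reward without progress
```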

Tarun Chitra
@pinged
Choosing rewards in these processes is a delicate balance — if you over-index on some types of tasks, the model will be unable to solve problems it hasn't already seen. But the idea of this reward model learning was first explored for superhuman performance with AlphaZero and Diplomacy.

There are two main ways to give out rewards:
- Outcome-based: I only give rewards for getting the right answer
- Procedure-based: I give you small rewards for each step of the process, so you can get partial rewards for wrong answers if they have "some" accurate steps

We'll consider an ELI5 example: teaching someone to ride a bike 100m, brake, then U-turn and ride back
Outcome: giving a kid 20 candies for doing the whole task
Procedure: giving 1 candy for pedaling, 1 candy for braking, 1 candy for balancing, and 1 candy for every 20m biked
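Here is the candy example written out as two toy reward functions, just to show how differently they score the same partial attempt. A hedged sketch of my own; the `ride` dictionary and its fields are invented for illustration.

```python
# The bike ELI5 as code: outcome-based vs. procedure-based candy.

def outcome_reward(ride):
    """20 candies only if the whole task is done: ride 100m out, brake, U-turn, ride 100m back."""
    done = (ride["distance_out"] >= 100 and ride["braked"]
            and ride["u_turned"] and ride["distance_back"] >= 100)
    return 20 if done else 0

def procedure_reward(ride):
    """Partial credit for each sub-skill, plus 1 candy per 20m biked."""
    candies = 0
    candies += 1 if ride["pedaled"] else 0
    candies += 1 if ride["braked"] else 0
    candies += 1 if ride["balanced"] else 0
    candies += (ride["distance_out"] + ride["distance_back"]) // 20
    return candies

# A kid who pedals 60m, wobbles, and never turns around:
partial = {"pedaled": True, "braked": False, "balanced": False,
           "u_turned": False, "distance_out": 60, "distance_back": 0}
print(outcome_reward(partial))    # 0 candies: the sparse signal says nothing
print(procedure_reward(partial))  # 4 candies: 1 for pedaling + 3 for 60m biked
```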