Part III: Escaping from Reasoning Model Purgatory
~~~
The most interesting part about Chain of Thought (CoT) reasoning is that unlike a vanilla hallucinating LLM, CoT models convincingly assert falsehoods; the same mechanism that makes them avoid hallucinating also makes them dig in their heels (like a stubborn human)

Part IV: Reasoning without Regret
~~~
Q: Can we quantify when we can make these models better: higher accuracy + lower compute cost?
A: EIP-1559 is everywhere 😈 

DeepSeek is a phase transition: Lowered compute by 10x+ with ~same accuracy as o1 — why? Must be real math & algorithmic improvement

🤓 🔫'd me

First off, I wouldn't have been able to solve this without having o3-mini; I used it to find references + ideas that I had never heard of — would have likely taken me forever to find on my own

But there is something tantalizing about the idea of using a reasoning model to solve a math problem about a reasoning model itself; if you can do this, you found the 'backdoor' to the Reasoning Russell's paradox (in so far as one can convince themselves that the reasoning model can prove *some* properties about the set of possible reasoning traces it generates, even though it might not be able to describe the whole set)

This became my rallying cry as a way to get out of the AI doomer 🕳️  — figure out what makes DeepSeek tick using o3 as an assistant (ironic, I know)

ヽ(⌐■_■)ノ♪♬ @ @gauntlet/🤖Ventures/tldr/Aera \\ ¯\_(ツ)_/¯ Struggling with the nature of randomness \\ main: @gaa \\ /red-bull /defi-research maxi

Back to the paper.. how can one answer such a question? Diving into the DeepSeek code, it was clear that their reinforcement algorithm was doing something radically different to what is publicly known about Gemini 2.5 pro, OpenAI o1/o3/o4 — it uses sparse outcome rewards 

Rewards?! Recall in the last thread I had this picture — these models use "rewards" (think of them like DeFi protocol points) to guide the reasoning process. How you choose the rewards dictates:
1. Accuracy of the model
2. How fast it can learn

If the rewards are too sparse, then the model gives you random answers; if the rewards are too dense, then the model "reward hacks" — gets lots of rewards while moving in a circle, not doing your task

Choosing rewards in these processes is a delicate balance — if you over index on some types of tasks, the model will be unable to solve problems it hasn't already seen. But, the idea of this reward model learning was first explored in superhuman performance with AlphaZero and Diplomacy

There are two main ways to give out rewards:
- Outcome-based: I only give rewards for getting the right answer
- Procedure-based: I give you small rewards for each step of the process; so you can get partial rewards for wrong answers if they have "some" accurate steps

We'll consider an ELI5 example: 
Teaching someone to ride a bike 100m, brake and then U-turn and ride back

Outcome: Giving a kid 20 candies for doing the whole task

Procedure: Giving 1 candy for pedaling, 1 candy for breaking, 1 candy for balancing and 1 candy for every 20m biked

What's the trade-off?

- Procedure rewards need "expert" information (i.e. learning to brake before pedaling is worse than pedaling before braking) and hence lots of RLHF (so expensive) — but they generally work better

- Outcome based rewards are easy to obtain but usually too sparse to learn from — model gets lost or stuck

But: what seems to be clear is that Gemini 2.5 pro and DeepSeek appear to try to "learn procedure based rewards from outcome based rewards" via advantage learning

What does this look like in our biking example?

It would be as if I gave you 1000 lollipops if you completed the task and 0 otherwise; the model then "searches" by spending some of the reward budget to *learn* approx. optimal step by step rewards

By having a model that is more flexible at learning goals on its own (due to their ability to "spend" larger reward budgets), you become better at off-policy / non-training optimization

The claim that one can learn/design procedural rewards from small numbers of outcome rewards is *very* new — the first evidence in the literature that I know of is a Google paper from Nov. 2024

But there is something weird about this — is it really possible to learn optimal procedural rewards (which can have many steps!) from a *single* outcome reward? It seems almost too good to be true and mathematical perspective, especially in high dimensions

It is as if I know a boundary condition (outcome rewards) and can "learn" a differential equation (dynamics) from only the boundary condition... Can that actually happen?

Answer: Yes — happens with Inverse problem in PDEs!

These are used to find oil by sending sonar waves [boundary conditions] and measuring if the dynamics has changed by the presence of oil [adjusts PDE]

This led me down the rabbit hole of trying to understand if this metaphor for the learning process in terms of boundary conditions and PDEs was more than skin deep

And long story short (you'll have to read the paper to see the precise connection), it turns out to be correct and there are two ways of learning:

1. Forward learning: Taking the user prompt P_0, generating prompts P_i ~ Policy and then generating a reasoning path P_0, ..., P_T

2. Backward learning: Given a prior distribution over answers, p(P_0), draw O_1, ..., O_T ~ p(P_0) as "candidate" solutions then reason backwards from the candidates O_i 

tl;dr of the paper: If there is a PDE limit of your learning process, then you can learn approx. optimal rewards backwards *much* more efficiently than via forward literation AND you can interpret the path you learn as the *optimal* step-by-step (procedural rewards) [!!!]

Let's step back a second: Why should this even be true?

It turns out when you think of CoT reasoning as a decision process, it has a scaling limit — I can let dt -> 0, dx -> 0 (like you do in calculus) and get a limiting PDE

If that PDE is "smooth" enough, then you can learn approximate solutions of it via stochastic sampling — *without* having to know the exact PDE

This last part is something well-known in classical stochastic control and PDEs, but somehow wasn't looked at in CoT systems because "limit theorems" for CoT systems can seem weird

Advantage learning, like DeepSeek and Gemini 2.5 pro, however, has a lot of similarities (qualitatively) to Bellman equations with limiting PDEs — and allows one to say the path learned by solving the PDE is the correct procedural reward (see the two implied discretizations of DeepSeek (left), Gemini 2.5 pro (right))

Another really important framework for interpreting the difference between forward and backwards iteration is by thinking of them as generating different types of proofs of the same mathematical statement

1. Forward iteration is like Proof by Induction
2. Backward Iteration is like Proof by Contradiction

To move forward, I have to have the entire context of the previous steps (e.g. when you assume P[n-1] holds to prove P[n])

However, proof by contradiction is more of a "context-less" process; I start at the conclusion that I assume is wrong and then try to work backwards. This process simply needs to find a route back to our original assumptions to be complete

Intuitively, the computational burden of proof by contradiction is lower (e.g. I only have to show a *single* counterexample versus proving the statement holds for *every* feasible assumption)

This analogy suggests that backwards iteration should be "faster" and "computationally cheaper" — and remember, backwards induction is precisely learning procedural rewards from outcome-based rewards 

This is precisely what we prove in the paper — it is faster (on average) to work backwards because you're effectively learning the best step-by-step rewards along the way!

Now a natural question: How do we use this to construct *better* reasoning algorithms? This is where crypto comes in!

It turns out that there is an *EIP-1559-like algorithm* for choosing rewards dynamically that gives:
1. Low regret (Accuracy)
2. Low compute cost
for most queries on average!

But that is for Part V, ciao!

This analogy suggests that backwards iteration should be "faster" and "computationally cheaper" — and remember, backwards induction is precisely learning procedural rewards from outcome-based rewards 

This is precisely what we prove in the paper — it is faster (on average) to work backwards because you're effectively learning the best step-by-step rewards along the way!

Now a natural question: How do we use this to construct *better* reasoning algorithms? This is where crypto comes in!

It turns out that there is an *EIP-1559-like algorithm* for choosing rewards dynamically that gives:
1. Low regret (Accuracy)
2. Low compute cost
for most queries on average!

But that is for Part V, ciao!

https://arxiv.org/abs/2504.09777