Tarun Chitra
@pinged
Part III: Escaping from Reasoning Model Purgatory ~~~ The most interesting part about Chain of Thought (CoT) reasoning is that, unlike a vanilla hallucinating LLM, CoT models convincingly assert falsehoods; the same mechanism that makes them avoid hallucinating also makes them dig in their heels (like a stubborn human).
10 replies
12 recasts
88 reactions
Tarun Chitra
@pinged
It took me quite a bit of playing around with reasoning models and trying to get different models to produce efficient answers to really understand this fact; when I first started playing with o3-mini, I more or less assumed that the proofs it claimed for the math problems I asked about were correct.

And in some ways, they *were* correct, but oftentimes the model would:

a) [Least nefarious] Subtly make an assumption that immediately implies the conclusion, e.g. "Prove: A ⟹ C. Assume A ⟹ B ⟹ C. So A ⟹ C!" (see the sketch below)
b) Go out of its way to "over-prove" a result
c) Hallucinate a reference (but usually from the right author!)
d) [Most nefarious] Get stuck rewriting the same (or a worse) wrong argument in multiple ways

Let's look at some examples!
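To make (a) concrete, here's a minimal sketch in Lean, with hypothetical propositions A, B, C standing in for the real claim: the "proof" only goes through because the hard step, B ⟹ C, is smuggled in as a hypothesis rather than proved.

```lean
-- Minimal sketch of failure mode (a); A, B, C are hypothetical stand-ins.
-- The "proof" of A → C type-checks only because the hard step (B → C)
-- is assumed as a hypothesis instead of being proved.
example (A B C : Prop) (hAB : A → B) (hBC : B → C) : A → C :=
  fun ha => hBC (hAB ha)  -- hBC already carries the entire conclusion
```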
1 reply
0 recasts
7 reactions
Tarun Chitra
@pinged
The first image is an example of a claim I tried to prove with o3-mini (re: kernel learning for Gaussian Processes); I'll elide most of the details, but note that the model jumped to define an approximate quantity that it claims solves the problem (at some minimax rate).

I ask for a proof of the approximation using Fano's inequality (for lower bounds on covers), which, after a careful read, I discover involves a wrongly flipped inequality. No problem! It's just a first-year grad student mistake (as @sreeramkannan would say), it can fix it, right?
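For context, the standard Fano-method lower bound has roughly this shape (generic form over a 2δ-separated family θ_1, …, θ_M, with V uniform over the M hypotheses; not the exact quantities from my prompt):

```latex
% Generic Fano-method minimax lower bound (V uniform over the M hypotheses)
\inf_{\hat{\theta}} \; \max_{1 \le j \le M} \; \mathbb{E}_{j}\!\left[ d\big(\hat{\theta}, \theta_j\big) \right]
  \;\ge\; \delta \left( 1 - \frac{I(V; X) + \log 2}{\log M} \right)
```

The point is that Fano gives you a *lower* bound on the risk; flip the inequality and the whole minimax argument quietly becomes vacuous, which is exactly the kind of error that's easy to skim past in a confident-sounding CoT proof.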
1 reply
1 recast
3 reactions
Tarun Chitra
@pinged
Going a little deeper, I find that I stump it (and probably spend $100s of OpenAI compute dollars, lol) by pointing out that the inequality was wrong. But it still made a mistake! In fact, it made a worse inequality mismatch than before: it flipped multiple inequalities in response to my telling it that it had flipped one incorrectly.

Time to ask for a fix again, but what happens? Again it gets stumped and needs to phone a friend. Ten prompts later, I realize it is on the warpath, replacing each mistake I find with 3 new ones, so that by the time I'm 10 prompts deep, I have 31+ mistakes in the proof.
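(Back-of-the-envelope for that count, assuming each prompt leaves the flagged error effectively unfixed and adds roughly 3 new ones:)

```latex
% Rough error tally after 10 rounds of "fixes" (hypothetical ~3 new errors per round)
1 + 3 \times 10 = 31
```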
1 reply
0 recasts
2 reactions