Tarun Chitra pfp
Tarun Chitra
@pinged
Part III: Escaping from Reasoning Model Purgatory ~~~ The most interesting part about Chain of Thought (CoT) reasoning is that, unlike a vanilla hallucinating LLM, CoT models convincingly assert falsehoods; the same mechanism that makes them avoid hallucinating also makes them dig in their heels (like a stubborn human)
10 replies
12 recasts
88 reactions

Tarun Chitra pfp
Tarun Chitra
@pinged
It took me quite a bit of playing around with reasoning models, and trying to get different models to produce efficient answers, to really understand this fact; when I first started playing with o3-mini, I more or less assumed that all the proofs it claimed for the math problems I asked about were correct

And in some ways, they *were* correct — but oftentimes the model would:

a) [Least nefarious] Subtly make an assumption that immediately implies the conclusion ("Prove: A ⟹ C. Assume A ⟹ B ⟹ C. So A ⟹ C!") (sketched out below)
b) Go out of its way to "over-prove" a result
c) Hallucinate a reference (but usually from the right author!)
d) [Most nefarious] Get stuck rewriting the (same or worse) wrong argument in multiple ways

Let's look at some examples!
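Aside: pattern (a), written out, is just assuming a lemma that already contains the conclusion; this is a toy rendering of the shape of the mistake, not a transcript of anything o3-mini said:

    \textbf{Claim: } A \implies C
    \textbf{``Proof'': } \text{assume } A \implies B \text{ and } B \implies C; \text{ composing gives } A \implies C. \;\blacksquare
    \text{But the assumed chain } A \implies B \implies C \text{ is exactly the claim, so nothing has been proved.}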
1 reply
0 recast
7 reactions

Tarun Chitra pfp
Tarun Chitra
@pinged
The first image is an example of a claim I tried to prove with o3-mini (re: kernel learning for Gaussian Processes); I'll elide most of the details, but note that the model jumped to define an approximate quantity that it says solves the problem (at some minimax rate)

I ask for a proof of the approximation using Fano's inequality (for lower bounds via covers) — which, after a careful read, I discover involves a wrongly flipped inequality; no problem! It's just a first-year grad student mistake (as @sreeramkannan would say), it can fix it, right?
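For context, the Fano step that the proof leans on is roughly the standard minimax lower bound form (my paraphrase of the setup, not the model's exact statement):

    \text{For } \theta \text{ uniform over } M \ge 2 \text{ hypotheses and any estimator } \hat{\theta}(X),
    \Pr[\hat{\theta}(X) \ne \theta] \;\ge\; 1 - \frac{I(X;\theta) + \log 2}{\log M}

The whole point is that the bound runs one way, as a lower bound on the error probability, so flipping the inequality quietly "proves" the opposite of what you asked for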
1 reply
1 recast
3 reactions

Tarun Chitra pfp
Tarun Chitra
@pinged
Going a little deeper, I find that I stump it (and probably spend $100s of OpenAI compute dollars, lol) by pointing out that the inequality was wrong. But it still made a mistake! In fact, it made a worse inequality mismatch than before — it flipped multiple inequalities in response to me telling it that it flipped one incorrectly

Time to ask for a fix again — but what happens? Again, it gets stumped and needs to phone a friend; ten prompts later, I realize that it is on a warpath to replace each mistake I find with 3 mistakes, so that by the time I'm 10 prompts deep, I have 31+ mistakes in the proof
1 reply
0 recast
2 reactions

Tarun Chitra pfp
Tarun Chitra
@pinged
This is the worst type of mistake one can get from a reasoning model — it leads you to do way *more* work to verify that it is right than it takes to prove the result yourself; this is almost worse than having no assistance, because you spend more time fixing the mistakes than you do finding something new

I wrote a tweet about this phenomenon, in the sense that it is the *opposite* of ZK — in the worst case, reasoning models might make the verifier do more work than the prover! I angered many cryptographers with this type of claim (their refrain: "verifying is always less work than proving") — but for RLMs, it seems like this reversal can often hold in the worst case (though not on average) https://x.com/tarunchitra/status/1909643992989679807
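To make the ZK comparison concrete (speaking loosely, rough scalings for intuition rather than a formal statement): for a succinct proof system over a circuit C, verification is asymptotically far cheaper than proving, something like

    T_{\text{verify}} = O(\mathrm{polylog}\,|C|) \;\ll\; T_{\text{prove}} = \mathrm{poly}(|C|)

whereas the worst case I keep hitting with reasoning models is the reverse for the human in the loop:

    T_{\text{verify the model's proof}} \;>\; T_{\text{prove it yourself}}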
1 reply
0 recast
2 reactions

Tarun Chitra pfp
Tarun Chitra
@pinged
Now, while most standard LLMs are pretty useless at giving you citations (which is arguably the bulk of the reason Perplexity could be worth $9B 😝 ), reasoning models are actually not that bad at generating citations

The problem is that sometimes the citations don't exist! But — you usually get the right author or topic name, so if you look a little harder, you'll find the reference that the model was misremembering

This is a little like when someone tells you, "oh yeah, didn't XXX write about that?" and then you have to go look

I think humans and robots are a draw on this one
1 reply
0 recast
2 reactions

Tarun Chitra pfp
Tarun Chitra
@pinged
A final thing worth noting: the math abilities of different reasoning models vary WILDLY

One of my favorite math theorems is that real division algebras exist only in dimensions 1, 2, 4, and 8, which can be proved with complex analysis via the Cayley-Dickson construction (gross!) or via topological K-theory using Bott periodicity (beautiful!)

So what happens when I ask Gemini 2.5 Pro (which aces Aider and does pretty well at AIME) vs. OpenAI? OpenAI basically one-shots it and gives the perfect references (e.g. Bott and Tu / Hatcher); Gemini, on the other hand, decides to reprove basic properties of characteristic classes to try to prove it. This is as if, every time I asked you to find me a root of a polynomial, you went and reproved the fundamental theorem of algebra first — these differences in verbosity are really key to enabling effective domain-specific usage of these models IMO
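For reference, the statement I was asking for (paraphrased):

    \text{If } \mathbb{R}^n \text{ admits the structure of a (not necessarily associative) division algebra, then } n \in \{1, 2, 4, 8\}

The four dimensions are realized by \mathbb{R}, \mathbb{C}, \mathbb{H}, \mathbb{O} via the Cayley-Dickson construction; the "only these dimensions" direction is the hard, topological part (Bott-Milnor / Kervaire)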
1 reply
0 recast
3 reactions

Tarun Chitra pfp
Tarun Chitra
@pinged
On the other hand — these issues are far from insurmountable; with the right prompt engineering and the right expectations for how much you'll have to supervise the exploration phase of the reasoning model (e.g. how you get it to avoid certain doom loops), these tools can be the best copilot or co-author that you've ever had

And this led me to a natural quandary — a rebuttal to the Russell's Paradox of models that I mentioned earlier, if you will: Is there a way to describe these models formally that lets you understand (from first-ish principles) why they generate verbose vs. short answers? Is there a way you can quantify when they get stuck? Or when they are able to recover?

If you could quantify this, you could design these systems to minimize the pitfalls that you see in prod
1 reply
0 recast
3 reactions