aichannel

When you dig into the o3 benchmarks, they don’t measure what Twitter folks say they do.

E.g., the ability to fix isolated issues in Python repos != “coding is over”, but that’s what the SWE benchmark measures.

That being said, if we assume exponential growth in models, we can get there.

Personally, I don’t know about little benchmarks with puzzles; it feels like Atari all over again. The benchmark I’d look for is closer to something like the sum of ARR over AI products; not sure if there’s a simpler / public one that captures most of it. I know the joke is that it’s NVDA.

Andrej Karpathy

https://x.com/karpathy/status/1871312079145361645?s=46