Well I’ll be damned. I didn’t think it would actually happen, but as of today, Grok 3 is the best AI model out there. We have a new player in town.

xAI just dropped Grok 3, their latest large language model, packed with a reasoning engine and a mini model. And it’s delivering some serious results:

• LMArena: 1400 Elo (#1 ranking)
• AIME 2024: 52% (96% with reasoning!)
• GPQA: 75% (85% with reasoning)
• LiveCodeBench (Coding): 57% (80% with reasoning)
• AIME 2025 (Math): 93%, outperforming o3-mini-high

The AI game just got interesting.


to what extent do these benchmarks correspond to actual user experience? can AI companies 'game' the benchmarks?


That's a cool approach, thanks for sharing!

LMArena is based on votes by users

They write a prompt and get replies from different AI models without knowing which is which

Then they vote on which reply was the best
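To make the voting idea concrete, here is a rough sketch of how blind head-to-head votes can be turned into an Elo-style score. This is the textbook Elo update, not necessarily how LMArena computes its leaderboard; the model names, the starting rating of 1000, and the step size K = 32 are all assumptions picked for illustration.

```python
from collections import defaultdict

K = 32          # assumed update step size (hypothetical, not LMArena's value)
BASE = 1000.0   # assumed starting rating for every model


def expected_score(r_a, r_b):
    """Probability that a model rated r_a beats one rated r_b under the Elo formula."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def rate(votes):
    """Apply one Elo update per vote; each vote is a (winner, loser) pair."""
    ratings = defaultdict(lambda: BASE)
    for winner, loser in votes:
        exp_win = expected_score(ratings[winner], ratings[loser])
        delta = K * (1.0 - exp_win)   # upsets (low exp_win) move ratings more
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


# Hypothetical blind votes: the user only sees "reply 1" vs "reply 2".
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
]
ratings = rate(votes)
for name, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.1f}")
```

Under this formula, a 100-point lead works out to roughly a 64% expected win rate, so a gap on the leaderboard says how often one model's replies get picked over another's, not how much "smarter" it is.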

these are internal company benchmarks, so it’s good practice to take them with a grain of salt. the model starts rolling out to users today, and soon, real-world testing will provide a more accurate comparison.

that said, we rarely see a significant gap between company benchmarks and public evaluations, so these numbers are likely a solid indicator of Grok 3’s capabilities.