𝚐π”ͺ𝟾𝚑𝚑𝟾 pfp
𝚐π”ͺ𝟾𝚑𝚑𝟾
@gm8xx8
o1-mini-2024-09-12 performance on BigCodeBench-Hard:
- Complete: 27.0%
- Instruct: 27.7%
- Average: 27.4%

o1-preview-2024-09-12 performance on BigCodeBench-Hard:
- Complete: 34.5% (slightly better than Claude-3.5-Sonnet-20240620)
- Instruct: 23.0% (☹︎ ☹︎ ☹︎)
- Average: 28.8%

Both show underwhelming results on SWE-bench and a clear performance gap between pre- and post-mitigation. Notably, o1-mini outperformed o1-preview on SQL-eval, excelling at converting natural language into SQL queries. While o1-preview handles complex questions well, it tends to overthink simpler ones, leading to mistakes. In contrast, o1-mini is less effective on complex questions but consistently gets simpler ones right. Additionally, with temperature=1, results vary across runs.

πŸ”—: https://openai.com/index/openai-o1-system-card/
2 replies
0 recast
6 reactions
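
For context on the numbers in the cast above, here is a minimal sketch: it recomputes the BigCodeBench-Hard averages from the posted Complete/Instruct scores, and includes (commented out, since it needs an API key) an assumed way to re-run the same prompt several times to observe the run-to-run variance mentioned for temperature=1. The client usage, model alias, and prompt are illustrative assumptions, not taken from the cast or the system card.

```python
# Sketch, assuming only the scores posted above.
scores = {
    "o1-mini-2024-09-12":    {"Complete": 27.0, "Instruct": 27.7},
    "o1-preview-2024-09-12": {"Complete": 34.5, "Instruct": 23.0},
}

for model, s in scores.items():
    avg = (s["Complete"] + s["Instruct"]) / 2
    # Prints 27.35% and 28.75%, which round to the posted 27.4% and 28.8%.
    print(f"{model}: average = {avg:.2f}%")

# Run-to-run variance check (assumed usage of the OpenAI Python SDK; left
# commented because it requires an API key):
# from openai import OpenAI
# client = OpenAI()
# outputs = set()
# for _ in range(5):
#     r = client.chat.completions.create(
#         model="o1-mini",  # assumed model alias
#         messages=[{"role": "user", "content": "Translate to SQL: total sales by month"}],
#     )
#     outputs.add(r.choices[0].message.content)
# print(f"{len(outputs)} distinct answers across 5 runs at the default temperature=1")
```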

𝚐π”ͺ𝟾𝚑𝚑𝟾 pfp
𝚐π”ͺ𝟾𝚑𝚑𝟾
@gm8xx8
https://warpcast.com/gm8xx8/0x40ea1ec8
1 reply
0 recast
2 reactions

six
@six
i came to similar conclusions anecdotally (though obv not sophisticated benchmarking lol). seems more like an infra breakthrough for future scaling than an immediate upgrade for end users
0 reply
0 recast
0 reaction