𝚐𝔪𝟾𝚡𝚡𝟾

o1-mini-2024-09-12 Performance on BigCodeBench-Hard: 

- Complete: 27.0%
- Instruct: 27.7%
- Average: 27.4%

o1-preview-2024-09-12 Performance on BigCodeBench-Hard: 

- Complete: 34.5% (slightly better than Claude-3.5-Sonnet-20240620)
- Instruct: 23.0% (☹︎ ☹︎ ☹︎)
- Average: 28.8%
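The Average rows above are consistent with a plain mean of the Complete and Instruct scores. A minimal sketch to check the arithmetic, assuming half-up rounding to one decimal (the leaderboard's actual rounding rule is not stated here, and `split_average` is a hypothetical helper name):

```python
from decimal import Decimal, ROUND_HALF_UP

def split_average(complete, instruct):
    """Mean of the Complete and Instruct scores, rounded half-up to one decimal.

    Scores are passed as strings so Decimal keeps them exact.
    The rounding mode is an assumption, not a documented leaderboard rule.
    """
    avg = (Decimal(complete) + Decimal(instruct)) / 2
    return avg.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)

print("o1-mini:   ", split_average("27.0", "27.7"))
print("o1-preview:", split_average("34.5", "23.0"))
```

Decimal is used instead of floats so that a value like 27.35 is not nudged up or down by binary rounding before the final rounding step.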

Both models posted underwhelming results on SWE-bench, with a clear performance gap between the pre- and post-mitigation versions.

Notably, o1-mini outperformed o1-preview in SQL-eval, excelling at converting natural language into SQL queries.

While o1-preview handles complex questions well, it tends to overthink simpler ones, leading to mistakes. In contrast, o1-mini is less effective with complex questions but consistently gets simpler ones right.

Additionally, with temperature=1, results vary from run to run.
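One way to account for that run-to-run variation is to repeat the evaluation and report a mean with a spread rather than a single number. A minimal sketch, assuming you already have one pass rate per run; `summarize_runs` is a hypothetical helper and the numbers below are placeholders, not measured results:

```python
from statistics import mean, stdev

def summarize_runs(pass_rates):
    """Return (mean, sample standard deviation, max-min spread) over repeated runs."""
    if len(pass_rates) < 2:
        raise ValueError("need at least two runs to estimate variability")
    return mean(pass_rates), stdev(pass_rates), max(pass_rates) - min(pass_rates)

# Placeholder pass rates (%) for illustration only; NOT real benchmark data.
hypothetical_runs = [30.0, 32.0, 34.0]

avg, sd, spread = summarize_runs(hypothetical_runs)
print(f"mean={avg:.1f}  stdev={sd:.1f}  spread={spread:.1f}")
```

With only a handful of runs the standard deviation is a rough estimate, so a gap between two models that is smaller than the spread should be read with caution.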

🔗: https://openai.com/index/openai-o1-system-card/

OpenAI’s o1 update enhances reasoning through reinforcement learning, enabling step-by-step problem-solving similar to human thought. The longer it “thinks,” the better it performs, which introduces a new scaling paradigm beyond pretraining. Rather than relying solely on prompting, o1’s chain-of-thought reasoning improves with adaptive compute, which can be scaled at inference time.

- o1 outperforms GPT-4o in reasoning, ranking in the 89th percentile on Codeforces.
- It uses chain-of-thought to break down problems, correct errors, and adapt, though some specifics remain unclear.
- Excels in areas like data analysis, coding, and math.
- o1-preview and o1-mini models are available now, with evals indicating the gains are not just a one-off improvement. Trusted API users will have access soon.
- Results on AIME and GPQA are strong, with o1 showing significant improvement on complex prompts where GPT-4o struggles.
- The system card (https://openai.com/index/openai-o1-system-card/) showcases o1’s best capabilities.