gm8xx8
@gm8xx8
o1-mini-2024-09-12 performance on BigCodeBench-Hard:
- Complete: 27.0%
- Instruct: 27.7%
- Average: 27.4%

o1-preview-2024-09-12 performance on BigCodeBench-Hard:
- Complete: 34.5% (slightly better than Claude-3.5-Sonnet-20240620)
- Instruct: 23.0% (☹️ ☹️ ☹️)
- Average: 28.8%

Both models posted underwhelming results on SWE-bench, with a clear performance gap between pre- and post-mitigation. Notably, o1-mini outperformed o1-preview on SQL-eval, excelling at converting natural language into SQL queries. While o1-preview handles complex questions well, it tends to overthink simpler ones, leading to mistakes. In contrast, o1-mini is less effective on complex questions but consistently gets the simpler ones right. Additionally, with temperature=1, results vary across runs.

🔗: https://openai.com/index/openai-o1-system-card/
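For reference, the "Average" figures quoted above are just the unweighted mean of the Complete and Instruct splits. A quick sanity-check sketch (not an official BigCodeBench scoring script):

```python
# Sanity check: BigCodeBench-Hard "Average" = mean of the Complete
# and Instruct scores, as quoted in the thread above.
scores = {
    "o1-mini-2024-09-12": {"complete": 27.0, "instruct": 27.7},
    "o1-preview-2024-09-12": {"complete": 34.5, "instruct": 23.0},
}

for model, s in scores.items():
    avg = (s["complete"] + s["instruct"]) / 2
    # ≈ 27.4% for o1-mini and ≈ 28.8% for o1-preview, matching the thread
    print(f"{model}: average = {avg}%")
```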
2 replies
0 recast
3 reactions
gm8xx8
@gm8xx8
https://warpcast.com/gm8xx8/0x40ea1ec8
1 reply
0 recast
1 reaction
six
@six
i came to similar conclusions anecdotally (though obv not sophisticated benchmarking lol). seems more like an infra breakthrough for future scaling, rather than this model being an immediate upgrade for end users
0 reply
0 recast
0 reaction