𝚐π”ͺ𝟾𝚑𝚑𝟾 pfp
𝚐π”ͺ𝟾𝚑𝚑𝟾
@gm8xx8
o1-mini-2024-09-12 performance on BigCodeBench-Hard:
- Complete: 27.0%
- Instruct: 27.7%
- Average: 27.4%

o1-preview-2024-09-12 performance on BigCodeBench-Hard:
- Complete: 34.5% (slightly better than Claude-3.5-Sonnet-20240620)
- Instruct: 23.0% (☹︎ ☹︎ ☹︎)
- Average: 28.8%

Both show underwhelming results on SWE-bench and a clear performance gap between pre- and post-mitigation. Notably, o1-mini outperformed o1-preview on SQL-eval, excelling at converting natural language into SQL queries. While o1-preview handles complex questions well, it tends to overthink simpler ones, leading to mistakes. In contrast, o1-mini is less effective on complex questions but consistently gets simpler ones right. Additionally, with temperature=1, results vary across runs.

πŸ”—: https://openai.com/index/openai-o1-system-card/
2 replies
0 recast
6 reactions
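
For context on the numbers in the cast above, here is a minimal sketch: it recomputes the BigCodeBench-Hard averages from the posted Complete/Instruct scores, and includes (commented out, since it needs an API key) an assumed way to re-run the same prompt several times to observe the run-to-run variance mentioned for temperature=1. The client usage, model alias, and prompt are illustrative assumptions, not taken from the cast or the system card.

```python
# Sketch, assuming only the scores posted above.
scores = {
    "o1-mini-2024-09-12":    {"Complete": 27.0, "Instruct": 27.7},
    "o1-preview-2024-09-12": {"Complete": 34.5, "Instruct": 23.0},
}

for model, s in scores.items():
    avg = (s["Complete"] + s["Instruct"]) / 2
    # Prints 27.35% and 28.75%, which round to the posted 27.4% and 28.8%.
    print(f"{model}: average = {avg:.2f}%")

# Run-to-run variance check (assumed usage of the OpenAI Python SDK; left
# commented because it requires an API key):
# from openai import OpenAI
# client = OpenAI()
# outputs = set()
# for _ in range(5):
#     r = client.chat.completions.create(
#         model="o1-mini",  # assumed model alias
#         messages=[{"role": "user", "content": "Translate to SQL: total sales by month"}],
#     )
#     outputs.add(r.choices[0].message.content)
# print(f"{len(outputs)} distinct answers across 5 runs at the default temperature=1")
```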

𝚐π”ͺ𝟾𝚑𝚑𝟾 pfp
𝚐π”ͺ𝟾𝚑𝚑𝟾
@gm8xx8
https://warpcast.com/gm8xx8/0x40ea1ec8
1 reply
0 recast
2 reactions

six
@six
i came to similar conclusions anecdotally (though obv not sophisticated benchmarking lol). seems more like an infra breakthrough for future scaling than an immediate upgrade for end users
0 reply
0 recast
0 reaction