𝚐𝔪𝟾𝚡𝚡𝟾
@gm8xx8
o1-mini-2024-09-12 Performance on BigCodeBench-Hard:
- Complete: 27.0%
- Instruct: 27.7%
- Average: 27.4%

o1-preview-2024-09-12 Performance on BigCodeBench-Hard:
- Complete: 34.5% (slightly better than Claude-3.5-Sonnet-20240620)
- Instruct: 23.0% (☹︎ ☹︎ ☹︎)
- Average: 28.8%

Both show underwhelming results on SWE-bench, with a clear performance gap between pre- and post-mitigation.

Notably, o1-mini outperformed o1-preview on SQL-eval, excelling at converting natural language into SQL queries. While o1-preview handles complex questions well, it tends to overthink simpler ones, leading to mistakes. In contrast, o1-mini is less effective with complex questions but consistently gets simpler ones right.

Additionally, with temperature=1, the results vary across runs.

🔗: https://openai.com/index/openai-o1-system-card/
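
A quick sanity check on the Average figures above, as a minimal sketch. It assumes Average is the unweighted mean of the Complete and Instruct scores, which the post does not state explicitly; `average_score` is a hypothetical helper name used only for illustration.

```python
# Assumption: "Average" = unweighted mean of the Complete and Instruct scores.
def average_score(complete: float, instruct: float) -> float:
    """Return the mean of the two BigCodeBench-Hard split scores, in percent."""
    return (complete + instruct) / 2


# Scores as quoted in the post: (Complete, Instruct), in percent.
scores = {
    "o1-mini-2024-09-12": (27.0, 27.7),
    "o1-preview-2024-09-12": (34.5, 23.0),
}

averages = {model: average_score(c, i) for model, (c, i) in scores.items()}

for model, avg in averages.items():
    print(f"{model}: {avg:.2f}%")
```

Under that assumption the unrounded means are 27.35% and 28.75%, consistent with the reported 27.4% and 28.8% after rounding to one decimal place.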

𝚐𝔪𝟾𝚡𝚡𝟾
@gm8xx8
https://warpcast.com/gm8xx8/0x40ea1ec8

𝚐𝔪𝟾𝚡𝚡𝟾
@gm8xx8
cc @stephancill

Stephan
@stephancill
interesting, ty!