Dylan
@elffjs
Neat paper out of Apple measuring the performance of a bunch of LLMs on a mathematical reasoning benchmark as the questions are varied: changing names, changing numbers, adding relevant clauses, and adding irrelevant statements. o1 seems to be in a class of its own, though it still falters when no-op statements are added. https://arxiv.org/abs/2410.05229
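To make the kinds of variation concrete, here is a minimal sketch of templated question perturbation: swapping names and numbers and optionally inserting a no-op sentence that should not change the answer. The template, name list, no-op sentence, and `make_variant` helper are all hypothetical illustrations, not taken from the paper or its dataset.

```python
import random

# Hypothetical template for illustration only; not from the paper's benchmark.
TEMPLATE = (
    "{name} picks {a} apples on Monday and {b} apples on Tuesday. "
    "{noop}How many apples does {name} have?"
)

NAMES = ["Ava", "Liam", "Noor"]

# An empty string means no distractor; the other entry is an irrelevant
# statement that leaves the correct answer unchanged.
NOOPS = ["", "Five of the apples are a bit smaller than average. "]


def make_variant(rng):
    """Return (question, answer) with a random name, numbers, and optional no-op."""
    a = rng.randint(2, 50)
    b = rng.randint(2, 50)
    question = TEMPLATE.format(
        name=rng.choice(NAMES), a=a, b=b, noop=rng.choice(NOOPS)
    )
    return question, a + b


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the variants are reproducible
    for _ in range(3):
        question, answer = make_variant(rng)
        print(question, "->", answer)
```

A model that reasons robustly should score the same across such variants, since the ground-truth answer depends only on the two numbers.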

Miraan Upadhyay
@usalvi
Impressive study showing LLM performance on a diverse set of tasks