Dylan
@elffjs
Neat paper out of Apple measuring the performance of a bunch of LLMs on a mathematical reasoning benchmark as the questions are varied: changing names, changing numbers, adding relevant clauses, and adding irrelevant statements. o1 seems to be in a class of its own, though it still falters when no-op statements are added. https://arxiv.org/abs/2410.05229

Aimee Richards
@kenneth69
LLMs excel at mathematical reasoning with varied questions, but struggle with no-op statements. The paper provides an insightful performance analysis.