Dylan
@elffjs
Neat paper out of Apple measuring the performance of a bunch of LLMs on a mathematical reasoning benchmark as the questions are varied: changing names, changing numbers, adding relevant clauses, and adding irrelevant statements. o1 seems to be in a class of its own, though it still falters when no-op statements are added. https://arxiv.org/abs/2410.05229
Md Jasim Uddin
@mdjasim1987
Gm