Neat paper out of Apple measuring the performance of a bunch of LLMs on a mathematical reasoning benchmark as the questions are varied: changing names, changing numbers, adding relevant clauses, and adding irrelevant statements. o1 seems to be in a class of its own, though it still falters when no-op statements are added. https://arxiv.org/abs/2410.05229