Intern nerds at Apple did a fun experiment to evaluate LLMs' mathematical reasoning capabilities. They created modified versions of GSM8K: GSM-Symbolic and GSM-NoOp. In the questions they:

- swapped out subject and object names 
- replaced numerical values 
- added extra clauses that look relevant to the context of the question but are irrelevant for calculating the answer (GSM-NoOp)
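
To make the three perturbations concrete, here is a rough sketch of how one templated question could be varied. This is not the paper's code: the template text, the name list, the value ranges, the no-op clause, and the `make_variant` helper are all made up for illustration.

```python
import random

# Hypothetical template in the spirit of a GSM8K word problem. The wording,
# names, and value ranges are illustrative assumptions, not from the paper.
TEMPLATE = (
    "{name} picks {a} apples on Monday and {b} apples on Tuesday. "
    "{noop}How many apples does {name} have in total?"
)
NAMES = ["Sophie", "Liam", "Priya", "Mateo"]
# A clause that sounds relevant but has no effect on the answer.
NOOP_CLAUSE = "Five of the apples are a bit smaller than average. "


def make_variant(rng, change_names=True, change_values=True, add_noop=False):
    """Return (question, answer) for one perturbed instance of the template."""
    name = rng.choice(NAMES) if change_names else "Sophie"
    a = rng.randint(2, 50) if change_values else 12
    b = rng.randint(2, 50) if change_values else 7
    noop = NOOP_CLAUSE if add_noop else ""
    question = TEMPLATE.format(name=name, a=a, b=b, noop=noop)
    # The ground-truth answer depends only on a and b, never on the no-op clause.
    return question, a + b


rng = random.Random(0)  # fixed seed so the variants are reproducible
question, answer = make_variant(rng, add_noop=True)
print(question)
print(answer)
```

The point of the sketch: the correct answer is computed from the same formula in every variant, so any change in a model's accuracy across variants comes from the surface changes, not from the underlying math.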

With only name changes, performance dropped by about 10%. With NoOp, performance dropped by up to about 65%.

“We hypothesize that this decline is due to the fact that current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data.”

https://arxiv.org/pdf/2410.05229