divya 📚
@divyav.eth
intern nerds at apple did a fun experiment to evaluate LLMs' mathematical reasoning capabilities. they created modified versions of GSM8K: GSM-Symbolic & GSM-NoOp.

In the questions they:
- swapped out subject and object names
- replaced numerical values
- added additional clauses relevant to the context of the questions, but irrelevant for calculating the answers (GSM-NoOp)

In case of only name changes, performance dropped by about 10%. In case of NoOp, performance dropped by up to about 65%.

“We hypothesize that this decline is due to the fact that current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data.”

https://arxiv.org/pdf/2410.05229
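
to make the three perturbations concrete, here's a rough sketch of how a templated question could be varied. everything in it (the template text, the name list, the distractor clause, the `make_variant` helper) is made up for illustration and is not taken from the paper's dataset or code.

```python
import random

# Hypothetical template: placeholders for the name, the numbers,
# and an optional clause that does not affect the answer.
TEMPLATE = (
    "{name} picks {a} apples on Monday and {b} apples on Tuesday. "
    "{noop}How many apples does {name} have in total?"
)
NAMES = ["Sophie", "Liam", "Priya", "Mateo"]

# Illustrative distractor: on-topic, but irrelevant to the total.
NOOP_CLAUSE = "Five of the apples are a bit smaller than average. "


def make_variant(rng, add_noop=False):
    """Return (question, answer) with a random name and random numbers."""
    name = rng.choice(NAMES)
    a = rng.randint(2, 50)
    b = rng.randint(2, 50)
    question = TEMPLATE.format(
        name=name, a=a, b=b, noop=NOOP_CLAUSE if add_noop else ""
    )
    # The ground-truth answer depends only on a and b,
    # never on the name or the distractor clause.
    return question, a + b


rng = random.Random(0)
question, answer = make_variant(rng, add_noop=True)
print(question)
print(answer)
```

the point of the sketch: the correct answer is computed from the numbers alone, so a solver that actually reasons should score the same whether or not the name changes or the extra clause is present.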