insane benchmarks, big if true. ngl it wouldn’t surprise me if an open-source model hit sota out of nowhere. this is exactly why open-source ai is so important.

…
will be back to update ☺︎ 
i was waiting for the tokenizer issue to be fixed—looks like it is. 

i’ll leave this here. looks like meta is hosting 😉  
↓

Reflection-“Llama-3.1”-70B: https://huggingface.co/mattshumer/Reflection-Llama-3.1-70B

Reflection 70B underperformed on coding tasks: it scored 42% on Aider and struggled on BigCodeBench-Hard compared to Llama 3 70B. That's probably because chain-of-thought (CoT) reasoning hurts code generation here; the reasoning steps may not mesh well with the logic that programming requires. On these benchmarks, direct code generation seems to give better results.

Reflection 70B struggled on coding and scored lower than Llama 3 70B. CoT reasoning might not work well with programming logic; direct generation seems to do better.