A quick test of the new Llama 3 models (and some older ones):

L3 70b: https://i.imgur.com/HPgNLnW.png

L3 8b: https://i.imgur.com/HTxZmy9.png

Mixtral 8x7b: https://i.imgur.com/qjEk93V.png

ChatGPT: https://i.imgur.com/DIQ5loP.png

ChatGPT is correct, L3 70b almost; others are wrong.

Interesting, thanks for demonstrating. I use ChatGPT heavily for dev work at my day job, and I've been curious how the other models are doing. Last I heard, Claude 3 is getting very close to its reasoning capabilities. Unfortunately, it is not available in my country yet.