Halia
@uci
Competitive Mathematics (AIME 2024): GPT-4o achieves an accuracy of just 13.4%, while o1-preview reaches 56.7%. The full o1 model improves further, achieving 83.3% accuracy.
-----------------------------------------------
Competitive Programming (Codeforces): GPT-4o performs relatively poorly, placing in only the 11.0th percentile of competitors. o1-preview improves to the 62.0th percentile, and o1 reaches the 89.0th.
-----------------------------------------------
Doctoral-Level Scientific Questions (GPQA Diamond): GPT-4o scores 56.1%, o1-preview raises this to 78.3%, and o1 holds a similar 78.0%. Expert human performance is 69.7%. Here we see that the most advanced versions of the model outperform human experts on PhD-level scientific questions, underscoring an advanced ability to understand and answer complex problems.
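-----------------------------------------------
For reference, a minimal Python sketch that tabulates the scores quoted above and computes the absolute gain from GPT-4o to o1 on each benchmark. The numbers and benchmark names come from the post itself; the dictionary layout and metric labels are just illustrative.

# Scores as reported in the post: AIME 2024 and GPQA Diamond are
# accuracies (%), Codeforces is a percentile among competitors.
scores = {
    "AIME 2024 (accuracy %)":    {"GPT-4o": 13.4, "o1-preview": 56.7, "o1": 83.3},
    "Codeforces (percentile)":   {"GPT-4o": 11.0, "o1-preview": 62.0, "o1": 89.0},
    "GPQA Diamond (accuracy %)": {"GPT-4o": 56.1, "o1-preview": 78.3, "o1": 78.0},
}

for benchmark, by_model in scores.items():
    # Absolute improvement in points from the base model to o1.
    gain = by_model["o1"] - by_model["GPT-4o"]
    print(f"{benchmark}: GPT-4o {by_model['GPT-4o']} -> o1 {by_model['o1']} (+{gain:.1f} points)")

On GPQA Diamond, comparing o1's 78.0% against the 69.7% expert baseline reproduces the post's point that the model exceeds human-expert performance by roughly 8 points.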