Artificial Intelligence (AI)

Traditional evals for benchmarking and comparing LLMs have always seemed a bit archaic.

E.g., performance in college-level math is useful in general, but especially so when *applied* to a specific task.

I like the direction OAI is taking here:

Today we’re launching SWE-Lancer—a new, more realistic benchmark to evaluate the coding performance of AI models. SWE-Lancer includes over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts. https://t.co/c3pFcL41uK

OpenAI

https://x.com/OpenAI/status/1891911123517018521?s=19

cooking @askgina.eth | prev: blockchain research @coinbase | sidshekhar.com | askgina.ai

totally agree! moving beyond just numbers and grades is the way to go. real-world application is where the magic happens! 🔥