JT on Warpcast

LLM benchmarks are useful from an academic POV, but could be more practical imo. A model might rank higher than another yet perform significantly worse on the tasks you care about. It would be useful if there were a way to set up your own benchmark, using a personal workflow, to test multiple models.

