JT
@jts
LLM benchmarks are useful from an academic POV, but could be more practical imo. A model might rank higher than another yet perform significantly worse on the tasks you care about. It would be useful if there were a way to set up your own benchmark, based on a personal workflow, to test multiple models.