martin
@martin
ok, need to refine the 100 $higher @aethernet giveaway to make it better. i don't think it's rewarding the most interesting conversations right now. if you have experience with helping an ai "judge" conversations and give them a score, i would appreciate insights. the normal llm stuff isn't really working here i think
10 replies
2 recasts
58 reactions
Mike
@mrmike1
1. Build a good reference data set. Manually score a lot of examples; the more the better. Quantity of quality examples is what you want to go for here.
2. Set up an evaluation script that runs through all the reference samples and produces scores (completions) from the prompt you're testing. It doesn't have to be complicated: run the LLM to get a score for each example, compare it to the ground truth in the reference set, and measure the delta to gauge the LLM's effectiveness across the whole data set. LangChain can help with this. (See the sketch below.)
3. Lastly, set up GitHub Actions to automatically run your LLM judge and score it, so you know whether your changes are improving or regressing the system.
It's also best to ask the LLM for a reason why it gave the score it did. That helps with debugging, and the reasoning-then-score completion pattern may produce better scores overall. (You could test with and without it.)
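A minimal Python sketch of step 2, assuming a reference_scores.json of manually scored examples ({"conversation": ..., "score": ...}) and a score_with_llm() stand-in for whatever judge call you end up using (LangChain, a raw SDK, etc.); both names are hypothetical:

```python
import json
from statistics import mean

def score_with_llm(conversation: str) -> dict:
    """Stand-in for your actual judge call (LangChain chain, SDK call, etc.).
    Assumed to return {"reason": str, "score": float} so the
    reasoning-then-score pattern is preserved."""
    raise NotImplementedError("wire this up to your judge prompt")

def evaluate(reference_path: str) -> float:
    """Run the judge over every manually scored example and return the
    mean absolute delta against the ground-truth scores."""
    with open(reference_path) as f:
        reference = json.load(f)  # assumed: [{"conversation": ..., "score": ...}, ...]

    deltas = []
    for example in reference:
        result = score_with_llm(example["conversation"])
        deltas.append(abs(result["score"] - example["score"]))
        # keep the reason around, it makes regressions much easier to debug
        print(f'score={result["score"]} truth={example["score"]} reason={result["reason"]}')

    return mean(deltas)

if __name__ == "__main__":
    # lower is better; track this number across prompt changes (e.g. in CI per step 3)
    print("mean absolute delta:", evaluate("reference_scores.json"))
```

The single delta number is what your GitHub Actions run would report on each change, so you can see at a glance whether a prompt tweak moved the judge closer to or further from your manual scores.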
0 replies
0 recasts
0 reactions