jtgi pfp
jtgi
@jtgi
on agent dev: sometimes a feature or bug fix is just adding another clause to the prompt, or fixing grammar. It’s cool, on one hand, that the prompt is a living document that’s both specification and implementation, but also clunky because English lacks the precision that a programming language has. Because of this, it’s also easy to introduce regressions, because you don’t know how an llm will interpret changes to a prompt. Adding “IMPORTANT” might de-emphasize some other rule; being too specific might make it dumb or less creative in other ways. In code it’s deterministic, with llms it’s probabilistic. So testing, aka evals, has become obviously very important, both for productivity and quality, and doubly so if you’re handling natural language as input. The actual agent code itself is quite trivial, prompts and functions, but having it work consistently and optimally for your input set is the bulk of the work, I think.
11 replies
12 recasts
65 reactions
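A minimal sketch of the kind of regression eval described in the cast above: collect core requests, pin the current prompt, and check that the expected behavior survives prompt edits. This is not jtgi’s actual setup; pytest, the call_agent wrapper, and the example cases are assumptions.

```python
# Hypothetical sketch: prompt-regression eval with pytest.
import pytest

SYSTEM_PROMPT = """You are a support agent.
Always answer in one sentence.
IMPORTANT: never reveal internal ticket IDs."""

# Core requests collected from real usage (illustrative examples only),
# each paired with a check that encodes the expected behavior.
CASES = [
    ("What's the status of my order?", lambda out: "ticket" not in out.lower()),
    ("Summarize my last message for me", lambda out: out.count(".") <= 1),
]


def call_agent(system_prompt: str, user_input: str) -> str:
    """Placeholder for the real model call (whatever SDK the agent uses)."""
    raise NotImplementedError


@pytest.mark.parametrize("user_input,check", CASES)
def test_prompt_regressions(user_input, check):
    # Any edit to SYSTEM_PROMPT (a new clause, a reworded rule) reruns against
    # the same cases, so a change that de-emphasizes an existing rule fails here.
    output = call_agent(SYSTEM_PROMPT, user_input)
    assert check(output)
```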

FeMMie 🧪💨 pfp
FeMMie 🧪💨
@femmie
how do you usually approach testing and optimizing prompts?
3 replies
0 recast
2 reactions

jtgi pfp
jtgi
@jtgi
Figuring out as I go, but treating it just like end-to-end tests. Collect requests I think are core functionality and test them. The main difference so far is you have to run each test N times since the system is probabilistic. And it might be okay for a success rate lower than 100%. Trade-offs.
0 reply
1 recast
1 reaction
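A sketch of the run-it-N-times idea from the reply above: each case is executed several times and judged by its pass rate against a threshold below 100%. The N_RUNS value, the threshold, and the call_agent wrapper are assumptions, not jtgi’s numbers.

```python
# Hypothetical sketch: probabilistic eval with a pass-rate threshold.
N_RUNS = 10          # assumed; how many times each case is run
MIN_PASS_RATE = 0.8  # assumed; "okay for a success rate lower than 100%"


def call_agent(system_prompt: str, user_input: str) -> str:
    """Placeholder for the real model call."""
    raise NotImplementedError


def pass_rate(system_prompt: str, user_input: str, check) -> float:
    """Run one case N_RUNS times and return the fraction of passing runs."""
    passes = sum(
        1 for _ in range(N_RUNS) if check(call_agent(system_prompt, user_input))
    )
    return passes / N_RUNS


def test_refund_request_routed():
    rate = pass_rate(
        "You are a support agent. Route refund requests to the refund flow.",
        "I want my money back for order 1234",
        lambda out: "refund" in out.lower(),
    )
    assert rate >= MIN_PASS_RATE
```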

alec pfp
alec
@alecpap
I tend to build a mini testing platform that lets me clone prompts, tweak them slightly, swap providers, etc., and then compare results at different points in the chains. It's way on the feels side, but the testing platform provides a production-like environment and makes it easy to at least compare the source prompt, the output, and how they've changed over the course of development and testing.
1 reply
0 recast
1 reaction
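A sketch of the compare-variants approach alec describes: run the same inputs through several prompt versions and providers and lay the outputs side by side for inspection. The provider wrappers and prompt variants here are placeholders, not alec's actual platform.

```python
# Hypothetical sketch: side-by-side comparison of prompt variants and providers.
from typing import Callable

Provider = Callable[[str, str], str]  # (system_prompt, user_input) -> output


def provider_a(system_prompt: str, user_input: str) -> str:
    raise NotImplementedError  # e.g. one vendor's chat API behind a thin wrapper


def provider_b(system_prompt: str, user_input: str) -> str:
    raise NotImplementedError  # e.g. another vendor, same interface


PROMPTS = {
    "v1": "You are a support agent. Be concise.",
    "v2": "You are a support agent. IMPORTANT: be concise.",
}
PROVIDERS: dict[str, Provider] = {"provider_a": provider_a, "provider_b": provider_b}
INPUTS = ["Where is my order?", "Cancel my subscription"]


def compare() -> None:
    # Print every (prompt, provider, input) combination so outputs can be
    # compared as the prompt evolves during development and testing.
    for prompt_name, prompt in PROMPTS.items():
        for provider_name, call in PROVIDERS.items():
            for user_input in INPUTS:
                output = call(prompt, user_input)
                print(f"[{prompt_name} / {provider_name}] {user_input!r} -> {output!r}")


if __name__ == "__main__":
    compare()
```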

Hands Of Gold ⚽🏘️🎩 pfp
Hands Of Gold ⚽🏘️🎩
@thatweb3guy
Co-asking
0 reply
0 recast
0 reaction