jtgi pfp
jtgi
@jtgi
on agent dev: sometimes a feature or bug fix is just adding another clause to the prompt, or fixing its grammar. It's cool, on one hand, that the prompt is a living document, both specification and implementation, but it's also clunky because English lacks the precision of a programming language. That makes it easy to introduce regressions, because you don't know how an llm will interpret changes to a prompt: adding "IMPORTANT" to one rule might deemphasize some other rule, and being too specific might make it dumb or less creative in other ways. Code is deterministic; llms are probabilistic. So testing, aka evals, has become obviously very important, both for productivity and for quality, and doubly so if you're handling natural language as input. The actual agent code itself is quite trivial, prompts and functions, but having it work consistently and optimally for your input set is the bulk of the work, I think.
11 replies
12 recasts
65 reactions
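A minimal sketch of the kind of prompt eval jtgi describes: pin each desired behavior down as a case, then score the prompt against all cases after every change. Everything here is illustrative; `fake_model` is a hypothetical deterministic stand-in for a real LLM API call.

```python
# Minimal prompt-eval harness sketch. fake_model is a hypothetical,
# deterministic stand-in for a real LLM call so the example runs.

SYSTEM_PROMPT = "Classify the sentiment of the user message as 'positive' or 'negative'."

def fake_model(system_prompt: str, user_input: str) -> str:
    # Stand-in "model": a real eval would call your provider's API here.
    return "positive" if "love" in user_input else "negative"

# Each case pins down one behavior you don't want a prompt edit to break.
EVAL_CASES = [
    {"input": "I love this product", "expect": "positive"},
    {"input": "This is terrible",    "expect": "negative"},
]

def run_evals(model, cases):
    """Score the model on every case; rerun after each prompt change."""
    passed = sum(model(SYSTEM_PROMPT, c["input"]) == c["expect"] for c in cases)
    return passed / len(cases)

print(run_evals(fake_model, EVAL_CASES))  # prints 1.0 on this toy set
```

In practice the case set grows from real traffic and past failures, and the grader is often another model rather than exact string match.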

FeMMie 🧪💨 pfp
FeMMie 🧪💨
@femmie
how do you usually approach testing and optimizing prompts?
3 replies
0 recast
2 reactions

marlo pfp
marlo
@marlo
such an interesting take. are you fairly left-brained? does it feel like this is challenging that a bit?
1 reply
0 recast
0 reaction

Jacob pfp
Jacob
@jrf
i'm so nervous about changing prompts for @atlas. not bc they're perfect now, just no clue how the changes will manifest. i need a test agent, but even then it needs days if not weeks of testing to surface the edge cases
1 reply
0 recast
6 reactions

Deployer pfp
Deployer
@deployer
first we get 100% of the tests passing 100% of the time. then we go to valhalla.
0 reply
1 recast
4 reactions

iSpeakNerd 🧙‍♂️ pfp
iSpeakNerd 🧙‍♂️
@ispeaknerd.eth
quantum programming
0 reply
0 recast
1 reaction

Henry 🧾 pfp
Henry 🧾
@hengar.eth
Well said. When I operationalized LLMs at my previous company, we had an entire regression test suite that we ran on every prompt change. It took probably 80% of our time to build/maintain this test suite vs 20% to do the source changes, but without it, you have no idea what cascading effect your prompt changes will have.
0 reply
0 recast
1 reaction
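A hedged sketch of the regression suite Henry describes: run the same cases under the old and new prompt and flag anything that flipped from pass to fail. `stub_model` is a made-up deterministic model whose behavior shifts when a clause is added, purely so the example is runnable.

```python
# Regression-diff sketch: compare per-case results under two prompt versions.
# stub_model is hypothetical; its behavior changes when a new clause appears,
# mimicking how a real LLM can reinterpret an edited prompt.

def stub_model(prompt: str, text: str) -> str:
    if "Reply in one word." in prompt and " " in text:
        return "unknown"  # the added clause changes behavior on some inputs
    return "positive" if "good" in text else "negative"

CASES = [
    {"input": "good stuff", "expect": "positive"},
    {"input": "bad",        "expect": "negative"},
]

def grade(model, prompt, cases):
    return [model(prompt, c["input"]) == c["expect"] for c in cases]

def regressions(model, old_prompt, new_prompt, cases):
    """Inputs that passed under old_prompt but fail under new_prompt."""
    old, new = grade(model, old_prompt, cases), grade(model, new_prompt, cases)
    return [c["input"] for c, o, n in zip(cases, old, new) if o and not n]

old_p = "Classify sentiment."
new_p = "Classify sentiment. Reply in one word."
print(regressions(stub_model, old_p, new_p, CASES))  # ['good stuff']
```

The 80/20 split Henry mentions shows up here: the harness and case maintenance dwarf the one-line prompt diff being tested.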

Mo pfp
Mo
@meb
Testing and evals are what separate nice-looking PoCs from reliable business software
0 reply
0 recast
1 reaction

Jason pfp
Jason
@jachian
Production agentic workflows are basically a no-go without evals. The nice thing about evals is that they reduce the ambiguity of the task for the code. In a lot of cases it's better to build a working eval first if you're stuck on the prompt
0 reply
0 recast
0 reaction

L3MBDA pfp
L3MBDA
@l3mbda
testing feels like teaching a moody poet
0 reply
0 recast
0 reaction

jp 🎩 pfp
jp 🎩
@jpfraneto.eth
and then add up to that the different models and how they can interpret each prompt
0 reply
0 recast
0 reaction
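jp's point folds into the same harness: run one eval matrix across models, since each can interpret the same prompt differently. A sketch with two hypothetical stub models, where one capitalizes its answer and so fails an exact-match check:

```python
# Cross-model eval matrix sketch. Both "models" are hypothetical stubs;
# they differ only in how they interpret the output format, which is
# enough to show per-model pass rates diverging on the same prompt.

PROMPT = "Classify sentiment as 'positive' or 'negative'."

def model_a(prompt, text):
    return "positive" if "great" in text else "negative"

def model_b(prompt, text):
    # Same logic, but this model capitalizes, so exact-match grading fails.
    return ("positive" if "great" in text else "negative").capitalize()

CASES = [
    {"input": "great job", "expect": "positive"},
    {"input": "awful",     "expect": "negative"},
]

def eval_matrix(models, cases):
    """Pass rate per model on the same case set."""
    return {
        name: sum(m(PROMPT, c["input"]) == c["expect"] for c in cases) / len(cases)
        for name, m in models.items()
    }

print(eval_matrix({"model_a": model_a, "model_b": model_b}, CASES))
# {'model_a': 1.0, 'model_b': 0.0}
```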

dusan.framedl.eth pfp
dusan.framedl.eth
@ds8
/microsub tip: 500 $DEGEN
0 reply
0 recast
0 reaction