jtgi
@jtgi
just wrote a util to test agent responses. llms testing llms, what could go wrong.
3 replies
8 recasts
57 reactions

Mo
@meb
Following. What general test approaches are you using? And do you use assertions beyond Jest matchers? I’ve seen places like LangGraph doing metrics-driven tests, but those feel like a black box, and I like to understand 100% of my test code.
1 reply
0 recast
0 reaction

jtgi
@jtgi
for now, e2e-style tests on critical paths. figuring it out as i go, i'm new to evals. i like simple tests too but the game is different for agents since they’re probabilistic.
1 reply
0 recast
1 reaction

Mo
@meb
Interesting. One thing I've been experimenting with: unit tests with snapshots and deterministic outputs, i.e. reconstruct an exact conversation state, then request the next answer.
1 reply
0 recast
0 reaction

jtgi
@jtgi
i have something similar, inputs are all threads of messages. the assertion though is sometimes as vague as natural language, like above, i want to assert that a recommendation, of some kind, was made. need an LLM for that if i want to preserve more natural outputs. i’ll likely stick with that plus some basic assertions on the tool calls that matter. less concerned about things like speed, which would require heuristics around how tools are called.
1 reply
0 recast
0 reaction