hellno the optimist
@hellno.eth
I want to add an eval framework for @vibesengineering.eth to improve the quality of generated mini apps.
- have past user input and LLM responses, plus a RAG system for docs and LLM prompts
- want to run integration tests as a black box: user input → does the output roughly include what I want?
Need recommendations on:
- Python frameworks (DeepEval seems like a good fit?!)
- best practices for how to set up and keep improving this as a black box
- best practices to improve the core RAG system
6 replies
1 recast
11 reactions
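
A minimal sketch of the black-box shape described in the cast above, assuming DeepEval's documented LLMTestCase / GEval / assert_test pattern; generate_mini_app is a hypothetical stand-in for the real generation pipeline, and the expected fragments are illustrative.

```python
# Minimal sketch of a black-box eval: user input -> generated app -> checks.
# Assumes DeepEval's documented LLMTestCase / GEval / assert_test pattern
# (pip install deepeval); the LLM-judged metric needs a judge model configured.
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams


def generate_mini_app(user_input: str) -> str:
    """Hypothetical stand-in: call the real RAG + LLM pipeline, return code."""
    raise NotImplementedError


def test_counter_app_black_box():
    user_input = "build a mini app with a button that counts clicks"
    output = generate_mini_app(user_input)

    # Cheap deterministic checks first: does the output roughly include
    # what we expect, independent of wording or style?
    for fragment in ("useState", "onClick"):
        assert fragment in output

    # Then an LLM-judged metric for the fuzzier "roughly what I wanted" part.
    matches_intent = GEval(
        name="Matches user intent",
        criteria="The generated mini app code implements what the user asked for.",
        evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
        threshold=0.7,
    )
    assert_test(LLMTestCase(input=user_input, actual_output=output), [matches_intent])
```

A file like this should run under plain pytest, or via DeepEval's own runner (`deepeval test run <file>`), which adds per-metric reporting.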

hellno the optimist
@hellno.eth
summoning the AI gurus @sidshekhar @jtgi @alexpaden ideas for any of this?
2 replies
0 recasts
5 reactions

Jason
@jachian
Perhaps a different approach: if possible, describe what should happen as your test cases, and make it as easy as possible to run against those test cases. I think @kevinoconnell has some good opinions on this topic as well
3 replies
0 recasts
3 reactions
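
One way to read that suggestion: keep the expected behavior as plain data, so adding a test case means appending to a list rather than writing new code. A sketch using pytest parametrization; the cases and generate_mini_app are illustrative, not real.

```python
# Sketch of "describe what should happen" as data-driven black-box cases.
# Adding coverage = appending a (user input, expected fragments) pair.
import pytest


def generate_mini_app(user_input: str) -> str:
    """Hypothetical stand-in for the real generation pipeline."""
    raise NotImplementedError


CASES = [
    # (user input, fragments the generated app should roughly include)
    ("build a counter mini app", ["useState", "button"]),
    ("mini app that shows a Farcaster profile", ["fetch", "profile"]),
]


@pytest.mark.parametrize("user_input, expected_fragments", CASES)
def test_output_roughly_includes(user_input, expected_fragments):
    output = generate_mini_app(user_input).lower()
    for fragment in expected_fragments:
        assert fragment.lower() in output
```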

hellno the optimist
@hellno.eth
yeah I think I want to set it up like this
1 reply
0 recasts
2 reactions

Jason
@jachian
Tbh I’m skeptical of any “works out of the box” eval systems. There are a bunch of open-source frameworks that build on top of OpenTelemetry for LLM evals, like Helicone, but it’s about as good as implementing advanced logging atm
0 replies
0 recasts
2 reactions
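
For context, the "advanced logging" Jason describes looks roughly like this: wrap each generation in an OpenTelemetry span so input/output pairs are traceable by whatever backend collects them. This uses only the core opentelemetry-api; the attribute names and generate_mini_app are illustrative.

```python
# Sketch of OpenTelemetry-style logging around an LLM generation call.
# Uses only the core opentelemetry-api; span attribute names are made up.
from opentelemetry import trace

tracer = trace.get_tracer("vibes.engineering.evals")


def generate_mini_app(user_input: str) -> str:
    """Hypothetical stand-in for the real generation pipeline."""
    raise NotImplementedError


def generate_with_trace(user_input: str) -> str:
    # Each call becomes a span; a configured exporter decides where it goes.
    with tracer.start_as_current_span("generate_mini_app") as span:
        span.set_attribute("llm.user_input", user_input)
        output = generate_mini_app(user_input)
        span.set_attribute("llm.output_chars", len(output))
        return output
```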