hellno the optimist
@hellno.eth
I want to add an eval framework for @vibesengineering.eth to improve the quality of generated mini apps.
- have past user input and LLM responses, plus a RAG system for docs and LLM prompts
- want to run integration tests as a black box: user input → does the output roughly include what I want? (see the sketch below)
Need recommendations:
- python frameworks (deepeval seems like a good fit?!)
- best practices for how to set up and keep improving this as a black box
- best practices to improve the core RAG system
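A minimal sketch of that black-box shape, assuming deepeval's LLMTestCase / GEval API; generate_mini_app and the sample input are placeholders for the real generation pipeline, so check the current deepeval docs before relying on exact signatures.

```python
# Black-box eval sketch, assuming deepeval's LLMTestCase / GEval API.
# generate_mini_app() is a placeholder for the real prompt -> RAG -> LLM pipeline.
from deepeval import assert_test
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval

def generate_mini_app(user_input: str) -> str:
    """Placeholder: call the actual mini app generation pipeline here."""
    raise NotImplementedError

def test_counter_mini_app():
    user_input = "build a mini app that counts button clicks"
    output = generate_mini_app(user_input)

    # LLM-as-judge metric: does the output roughly include what was asked for?
    roughly_includes = GEval(
        name="roughly includes requested features",
        criteria="The generated code should implement a click counter with a button and a visible count.",
        evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
        threshold=0.7,
    )
    assert_test(LLMTestCase(input=user_input, actual_output=output), [roughly_includes])
```

Run under pytest, each case gives a pass/fail that is easy to re-run as prompts and RAG chunks change.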
6 replies
1 recast
11 reactions

hellno the optimist
@hellno.eth
summoning the AI gurus @sidshekhar @jtgi @alexpaden: ideas for any of this?
2 replies
0 recasts
5 reactions

shoni.eth
@alexpaden
deepeval seems like a good fit. Some SaaS options are prompteval, promptlayer, and weights&biases. Regarding RAG, my first thought was full-fledged app codebases; off the top of my head I don't see the benefit in indexing anything smaller than a full code file. Regarding best practices, basically just unit/app test cases with an expected output is what comes to mind. Jason's answer looks good; I wouldn't overcomplicate it.
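The RAG side can be evaluated separately from generation. A rough sketch under assumptions: retrieve_docs and the query/expected-file pairs are hypothetical placeholders, not part of the actual system.

```python
# Retrieval-only eval sketch: for known queries, does the RAG system surface
# the expected doc or code file in its top-k results?
# retrieve_docs() and RETRIEVAL_CASES are placeholders.

RETRIEVAL_CASES = [
    {"query": "how do I add a frame button", "expected_file": "docs/frames/buttons.md"},
    {"query": "persist state between sessions", "expected_file": "examples/storage-app/index.tsx"},
]

def retrieve_docs(query: str, k: int = 5) -> list[str]:
    """Placeholder: return paths/ids of the top-k retrieved chunks."""
    raise NotImplementedError

def recall_at_k(cases: list[dict], k: int = 5) -> float:
    hits = sum(case["expected_file"] in retrieve_docs(case["query"], k) for case in cases)
    return hits / len(cases)

if __name__ == "__main__":
    print(f"recall@5: {recall_at_k(RETRIEVAL_CASES):.2f}")
```

Tracking a number like recall@k over time makes it easier to tell whether chunking changes (e.g. full files vs. smaller chunks) actually help.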
1 reply
0 recasts
1 reaction

Jason
@jachian
Perhaps a different approach: if possible, describe what should happen as your test cases, and make it as easy as possible to run against those test cases. I think @kevinoconnell has some good opinions on this topic as well.
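One sketch of that "describe what should happen" shape: test cases as plain data plus a trivial runner, so adding a case is one line. The cases, generate_mini_app, and the naive keyword check are all placeholders; the check could later be swapped for an LLM judge or a deepeval metric.

```python
# Sketch: expected behaviour described as data, with a minimal runner.
# generate_mini_app() and the cases are placeholders.

CASES = [
    {"input": "build a tip jar mini app", "must_include": ["button", "amount"]},
    {"input": "make a poll with two options", "must_include": ["option", "vote"]},
]

def generate_mini_app(user_input: str) -> str:
    raise NotImplementedError  # plug in the real pipeline

def check_output(output: str, must_include: list[str]) -> bool:
    # naive substring check; swap for an LLM judge later
    return all(term.lower() in output.lower() for term in must_include)

def run() -> None:
    for case in CASES:
        output = generate_mini_app(case["input"])
        status = "PASS" if check_output(output, case["must_include"]) else "FAIL"
        print(f'{status}: {case["input"]}')

if __name__ == "__main__":
    run()
```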
3 replies
0 recasts
3 reactions