hellno the optimist
@hellno.eth
I want to add an eval framework for @vibesengineering.eth to improve the quality of generated mini apps.
- I have past user inputs and LLM responses, a RAG system for docs, and LLM prompts
- I want to run integration tests as a black box: user input → does the output roughly include what I want?
Need recommendations on:
- Python frameworks (deepeval seems like a good fit?!)
- best practices for how to set up and keep improving this as a black box
- best practices to improve the core RAG system
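A minimal sketch of the black-box check described above (user input → does the output roughly include what I want?), in plain Python with no eval framework. Everything here is an assumption for illustration: `EvalCase`, `score_case`, `run_eval`, and the `fake_generate` stub are hypothetical names, and "roughly includes" is approximated by case-insensitive phrase matching, which a real setup would likely replace with an LLM-based or semantic metric.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    user_input: str
    must_include: List[str]  # phrases the output should roughly contain


def score_case(output: str, case: EvalCase) -> float:
    """Fraction of expected phrases found in the output (case-insensitive)."""
    if not case.must_include:
        return 1.0
    text = output.lower()
    hits = sum(1 for phrase in case.must_include if phrase.lower() in text)
    return hits / len(case.must_include)


def run_eval(generate: Callable[[str], str], cases: List[EvalCase],
             threshold: float = 0.8):
    """Treat `generate` as a black box; return per-case results and pass rate."""
    results = []
    for case in cases:
        score = score_case(generate(case.user_input), case)
        results.append({"input": case.user_input,
                        "score": score,
                        "passed": score >= threshold})
    pass_rate = sum(r["passed"] for r in results) / len(results) if results else 0.0
    return results, pass_rate


def fake_generate(user_input: str) -> str:
    # Hypothetical stand-in for the real pipeline (RAG + prompts + LLM).
    return f"<App><Button>Mint</Button></App> for: {user_input}"


# Hypothetical cases; in practice these would come from past user inputs.
CASES = [
    EvalCase("build a mint button mini app", ["button", "mint"]),
    EvalCase("build a leaderboard mini app", ["leaderboard", "table"]),
]

if __name__ == "__main__":
    results, pass_rate = run_eval(fake_generate, CASES)
    for r in results:
        print(r)
    print("pass rate:", pass_rate)
```

Swapping `fake_generate` for the real pipeline call keeps the harness unchanged, so the same cases can be re-run after every prompt or RAG change to see whether the pass rate moves.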

hellno the optimist
@hellno.eth
summoning the AI gurus @sidshekhar @jtgi @alexpaden: any ideas on any of this?

kompreni 🚂
@kompreni
way above my pay grade. maybe @eggman.eth can offer some recs

Carlos Matallín
@matallo.eth
I've used Lilac in the past for evals and comparisons: https://www.lilacml.com/

Sid
@sidshekhar
You could try Helicone (https://www.helicone.ai/) for general observability first, before getting into evals. I've found it helpful, as most of the eval frameworks out there aren't fit for purpose.

eggman 🔵
@eggman.eth
gmeow. I think I fall into a bit of a bad pattern when it comes to this stuff, as I tend to reinvent the wheel a bit too often - so my knowledge of frameworks can unfortunately be lacking. The biggest challenge here really is verifying the happy and unhappy paths - basically you’d want something akin to goose, which writes up unit tests, executes them, then repeats until all tests are verified. So, a recursive multi-agent stack 🫣 which yeah, will have challenges of its own. If you’re using a SOTA model like Claude or GPT, your context window should thankfully be large enough to allow for this sort of system - prompt engineering would probably wind up being your biggest workload tbh. I’d recommend looking into goose for basically writing unit tests on input → output and verifying them, but yeah, it’ll be a big lift overall on this path.
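A rough sketch of the "write tests, execute, repeat until verified" loop described above. This is not goose's API and not a multi-agent stack: it is a plain iterative loop under stated assumptions, where `refine_until_pass`, the named checks, and the `stub_generate` stand-in for the LLM call are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Check = Callable[[str], bool]


def refine_until_pass(generate: Callable[[str, List[str]], str],
                      checks: Dict[str, Check],
                      prompt: str,
                      max_rounds: int = 3) -> Tuple[str, List[str], int]:
    """Generate, run every check, feed failed check names back, and repeat.

    Returns (last output, names of still-failing checks, rounds used).
    """
    failures: List[str] = []
    output = ""
    for round_no in range(1, max_rounds + 1):
        output = generate(prompt, failures)
        failures = [name for name, check in checks.items() if not check(output)]
        if not failures:
            return output, failures, round_no
    return output, failures, max_rounds


def stub_generate(prompt: str, failures: List[str]) -> str:
    # Hypothetical stand-in for an LLM call that sees the failed checks.
    output = "happy path: render the mini app"
    if "handles_errors" in failures:
        output += "; error handling: show a fallback screen"
    return output


# Hypothetical happy-path and unhappy-path checks on the generated output.
CHECKS: Dict[str, Check] = {
    "has_happy_path": lambda out: "happy path" in out,
    "handles_errors": lambda out: "error handling" in out,
}

if __name__ == "__main__":
    output, failures, rounds = refine_until_pass(stub_generate, CHECKS, "build a mini app")
    print(rounds, failures, output)
```

The `max_rounds` cap matters here: without it, a check the model can never satisfy would loop forever and burn tokens.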

Royal
@royalaid.eth
cc @pirosb3 and @linda, you guys probably have good insights here too