hellno the optimist
@hellno.eth
I want to add an eval framework for @vibesengineering.eth to improve the quality of generated mini apps.
- have past user input and LLM responses, plus a RAG system for docs and LLM prompts
- want to run integration tests as a black box: user input → does the output roughly include what I want?
Need recommendations on:
- Python frameworks (DeepEval seems like a good fit?!)
- best practices for how to set up and keep improving this as a black box
- best practices to improve the core RAG system
6 replies
1 recast
11 reactions
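
A minimal sketch of the black-box shape described in the cast above, assuming DeepEval's documented LLMTestCase / GEval / assert_test pattern; generate_mini_app is a hypothetical stand-in for the real generation pipeline, and the expected fragments are illustrative.

```python
# Minimal sketch of a black-box eval: user input -> generated app -> checks.
# Assumes DeepEval's documented LLMTestCase / GEval / assert_test pattern
# (pip install deepeval); the LLM-judged metric needs a judge model configured.
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams


def generate_mini_app(user_input: str) -> str:
    """Hypothetical stand-in: call the real RAG + LLM pipeline, return code."""
    raise NotImplementedError


def test_counter_app_black_box():
    user_input = "build a mini app with a button that counts clicks"
    output = generate_mini_app(user_input)

    # Cheap deterministic checks first: does the output roughly include
    # what we expect, independent of wording or style?
    for fragment in ("useState", "onClick"):
        assert fragment in output

    # Then an LLM-judged metric for the fuzzier "roughly what I wanted" part.
    matches_intent = GEval(
        name="Matches user intent",
        criteria="The generated mini app code implements what the user asked for.",
        evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
        threshold=0.7,
    )
    assert_test(LLMTestCase(input=user_input, actual_output=output), [matches_intent])
```

A file like this should run under plain pytest, or via DeepEval's own runner (`deepeval test run <file>`), which adds per-metric reporting.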

hellno the optimist
@hellno.eth
summoning the AI gurus @sidshekhar @jtgi @alexpaden ideas for any of this?
2 replies
0 recasts
5 reactions

Jason
@jachian
Perhaps a different approach: if possible, describe what should happen as your test cases, and make it as easy as possible to run against those test cases. I think @kevinoconnell has some good opinions on this topic as well
3 replies
0 recasts
3 reactions
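
One way to read that suggestion: keep the expected behavior as plain data, so adding a test case means appending to a list rather than writing new code. A sketch using pytest parametrization; the cases and generate_mini_app are illustrative, not real.

```python
# Sketch of "describe what should happen" as data-driven black-box cases.
# Adding coverage = appending a (user input, expected fragments) pair.
import pytest


def generate_mini_app(user_input: str) -> str:
    """Hypothetical stand-in for the real generation pipeline."""
    raise NotImplementedError


CASES = [
    # (user input, fragments the generated app should roughly include)
    ("build a counter mini app", ["useState", "button"]),
    ("mini app that shows a Farcaster profile", ["fetch", "profile"]),
]


@pytest.mark.parametrize("user_input, expected_fragments", CASES)
def test_output_roughly_includes(user_input, expected_fragments):
    output = generate_mini_app(user_input).lower()
    for fragment in expected_fragments:
        assert fragment.lower() in output
```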

hellno the optimist
@hellno.eth
yeah I think I want to set it up like this
1 reply
0 recasts
2 reactions

Jason
@jachian
Tbh I’m skeptical of any “works out of the box” eval systems. There are a bunch of open-source frameworks that build on top of OpenTelemetry for LLM evals, like Helicone, but it’s about as good as implementing advanced logging atm
0 replies
0 recasts
2 reactions
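
For context, the "advanced logging" Jason describes looks roughly like this: wrap each generation in an OpenTelemetry span so input/output pairs are traceable by whatever backend collects them. This uses only the core opentelemetry-api; the attribute names and generate_mini_app are illustrative.

```python
# Sketch of OpenTelemetry-style logging around an LLM generation call.
# Uses only the core opentelemetry-api; span attribute names are made up.
from opentelemetry import trace

tracer = trace.get_tracer("vibes.engineering.evals")


def generate_mini_app(user_input: str) -> str:
    """Hypothetical stand-in for the real generation pipeline."""
    raise NotImplementedError


def generate_with_trace(user_input: str) -> str:
    # Each call becomes a span; a configured exporter decides where it goes.
    with tracer.start_as_current_span("generate_mini_app") as span:
        span.set_attribute("llm.user_input", user_input)
        output = generate_mini_app(user_input)
        span.set_attribute("llm.output_chars", len(output))
        return output
```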