I want to add an eval framework for @vibesengineering.eth to improve the quality of generated mini apps

- I have past user inputs and LLM responses, a RAG system for docs, and the LLM prompts.
- I want to run integration tests as a black box: user input → does the output roughly include what I want?
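
A minimal sketch of that black-box check, assuming "roughly includes" can be approximated by case-insensitive phrase matching. Everything here is hypothetical: `EvalCase`, `coverage`, `run_evals`, the 0.8 threshold, and `fake_generate`, which stands in for the real generation pipeline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    user_input: str
    must_include: list[str]  # phrases the output should roughly contain


def coverage(output: str, must_include: list[str]) -> float:
    """Fraction of expected phrases found in the output (case-insensitive)."""
    if not must_include:
        return 1.0
    text = output.lower()
    hits = sum(1 for phrase in must_include if phrase.lower() in text)
    return hits / len(must_include)


def run_evals(
    generate: Callable[[str], str],
    cases: list[EvalCase],
    threshold: float = 0.8,  # assumed pass mark, tune per project
) -> list[tuple[str, float, bool]]:
    """Treat `generate` as a black box: input in, output scored, nothing else inspected."""
    results = []
    for case in cases:
        score = coverage(generate(case.user_input), case.must_include)
        results.append((case.user_input, score, score >= threshold))
    return results


def fake_generate(user_input: str) -> str:
    # Hypothetical stand-in for the real pipeline (prompt + RAG + LLM call).
    return "export default function Poll() { return <button>Vote</button>; }"


# Cases could be built from the past user inputs and responses already on hand.
CASES = [
    EvalCase("build a poll mini app", ["button", "vote"]),
    EvalCase("build a tip jar mini app", ["button", "wallet"]),
]

if __name__ == "__main__":
    for user_input, score, passed in run_evals(fake_generate, CASES):
        print(f"{'PASS' if passed else 'FAIL'} {score:.2f} {user_input}")
```

Plain substring matching is brittle, so an LLM-as-judge metric could later replace `coverage` without changing the harness around it.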

Need recommendations on:
- Python frameworks (deepeval seems like a good fit?!)
- best practices for how to set up and keep improving this as a black box
- best practices to improve the core RAG system
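
One common way to work on the RAG side separately from the end-to-end check is to score retrieval on its own, for example with recall@k over a small hand-labelled set. This is a sketch under assumptions: `recall_at_k` and the document ids are made up for illustration, and it presumes each test query has a known set of relevant doc ids.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: list[str], k: int) -> float:
    """Share of the relevant docs that appear in the top-k retrieved results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


if __name__ == "__main__":
    # Hypothetical ids: what the retriever returned vs. what a human marked relevant.
    retrieved = ["doc-auth", "doc-intro", "doc-wallet"]
    relevant = ["doc-auth", "doc-wallet"]
    print(recall_at_k(retrieved, relevant, k=2))
    print(recall_at_k(retrieved, relevant, k=3))
```

Tracking this number while changing chunking or embeddings shows whether a bad output comes from retrieval or from the prompt.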

Maybe try Helicone (https://www.helicone.ai/) for general observability first, before getting into evals?

I've found it helpful, as most of the eval frameworks out there aren't fit for purpose.

dev + founder | @vibesengineering.eth prev: @onsenbot @herocast

Building askgina.ai | prev: blockchain research @coinbase | sidshekhar.com