I want to add an eval framework for @vibesengineering.eth to improve the quality of generated mini apps

- have past user input and LLM responses, a RAG system for docs and LLM prompts.
- want to run integration tests as a black box: user input → does the output roughly include what I want?
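
A minimal sketch of that black-box check, assuming plain Python and no eval framework. Everything named here (`EvalCase`, `coverage`, `run_evals`, `fake_generate`, the 0.8 threshold) is hypothetical and only for illustration; `fake_generate` stands in for the real pipeline (RAG + prompts + LLM), and "roughly includes" is approximated by case-insensitive phrase matching, which is much cruder than an LLM-judged metric.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One black-box test: a user input plus phrases the output should contain."""
    user_input: str
    expected_phrases: list[str]


def coverage(output: str, expected_phrases: list[str]) -> float:
    """Return the fraction of expected phrases found in the output (case-insensitive)."""
    if not expected_phrases:
        return 1.0
    text = output.lower()
    hits = sum(1 for phrase in expected_phrases if phrase.lower() in text)
    return hits / len(expected_phrases)


def run_evals(
    generate: Callable[[str], str],
    cases: list[EvalCase],
    threshold: float = 0.8,  # assumed pass bar, tune per project
) -> list[dict]:
    """Call the system under test once per case and score each output."""
    results = []
    for case in cases:
        output = generate(case.user_input)
        score = coverage(output, case.expected_phrases)
        results.append(
            {"input": case.user_input, "score": score, "passed": score >= threshold}
        )
    return results


def fake_generate(user_input: str) -> str:
    """Hypothetical stand-in for the real mini app generator."""
    return "export default function App() { return <button>Mint</button>; }"


if __name__ == "__main__":
    cases = [
        EvalCase("build a mint button mini app", ["button", "mint", "export default"]),
    ]
    for result in run_evals(fake_generate, cases):
        print(result)
```

Past user inputs could seed the `EvalCase` list, and the same cases can be re-run after every prompt or RAG change to see whether scores move.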

need recommendations 
- python frameworks (deepeval seems like a good fit?!)
- best practices for how to set this up and keep improving it as a black box
- best practices to improve core RAG system


dev + founder | @vibesengineering.eth prev: @onsenbot @herocast

I've used Lilac in the past, for evals and comparison https://www.lilacml.com/

Explorer. Code, design, and mountains. Building Fern https://fernhq.com Previously Shopify

And also I forgot https://www.promptfoo.dev