Evolving my voice playground to have good voice activity detection, aka vad.

I was using a basic heuristic, "was quiet for 500ms" but it's too aggressive and cuts you off while you're thinking. That really bothers me about openai's implementation.

The proper way is a model that understands what you're saying and if you finished a complete thought, aka semantics. So I'm experimenting with that but the catch is it has to be low latency (< 500ms).

So I found a couple lightweight models that can run in the browser like Silero, it compresses down to 2mb, you can run it in via the ONNX runtime.

memebuilding • prev manifold.xyz, twilio.com

that is also one of my biggest frustrations with the openAI voice mode also, i get so angry when it cuts me off haha 

curious to see if this works out!