Anyone know of open-source models or papers for splitting audio (containing human speech and other sounds) into two separate audio files, one with the human speech only and one with the other sounds?

I’ve tried to find a good tool for this, to get a clean speech track for speech synthesis, but didn’t find anything that worked well.

Interesting, it seems doable, and the dataset is relatively easy to synthesize.

The problem is the dataset, but I agree that once you have it, the rest shouldn’t be hard.

Also, that’s why speech synthesis doesn’t work equally well for every public figure. For now, you need enough clean audio of the person’s voice. Joe Rogan, for example, is a great subject for this.

Yeah, I don’t have good intuition for the dataset size required, but my guess is a lot less than TTS needs, maybe similar to ASR given the one-to-many problem. So I was thinking there are enough public speech datasets (>10k hrs) plus non-speech datasets that can be mixed together to synthesize training data.
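
A rough sketch of that mixing idea, in case it helps. It takes a clean speech clip and a non-speech clip, scales the non-speech clip to a chosen signal-to-noise ratio, and returns the mixture along with the two target tracks a separation model would be trained to recover. The function name `mix_at_snr`, the 16 kHz rate, and the random arrays standing in for real clips are all my own assumptions for illustration, not something from a specific paper or tool.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db, eps=1e-12):
    """Mix a speech clip with a non-speech clip at a target SNR (in dB).

    Returns (mixture, speech_target, noise_target), all the same length
    as `speech`. Hypothetical helper, for illustration only.
    """
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Loop the noise clip if it is shorter than the speech, then trim.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)

    # Pick the gain so that 10*log10(speech_power / scaled_noise_power)
    # equals snr_db. eps guards against an all-zero noise clip.
    gain = np.sqrt(speech_power / ((noise_power + eps) * 10 ** (snr_db / 10)))
    noise_target = noise * gain

    mixture = speech + noise_target
    return mixture, speech, noise_target


# Synthetic stand-ins for real clips (assumed 16 kHz mono).
rng = np.random.default_rng(0)
speech_clip = rng.standard_normal(16000)      # 1 s of "speech"
background_clip = rng.standard_normal(6000)   # shorter "non-speech" clip

# Randomizing the SNR per example is one way to vary the training data.
snr_db = rng.uniform(-5.0, 20.0)
mixture, speech_target, noise_target = mix_at_snr(
    speech_clip, background_clip, snr_db
)
```

In practice you would load real clips from the speech and non-speech datasets instead of random arrays, and probably also check the mixture for clipping before saving it.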