Does anyone know of open source models or papers for splitting audio (containing human speech and other sounds) into two separate audio files, one with only the human speech and one with the other sounds?

I’ve tried to find a good tool for this, to extract a clean speech track for speech synthesis, but didn’t find anything that worked well.


Interesting, it seems doable, and the dataset is relatively easy to synthesize.
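To make the "easy to synthesize" part concrete, here is a rough sketch of one way it could work: take a clean speech clip and a non-speech clip, mix them at a chosen signal-to-noise ratio, and keep both originals as the training targets. The function name `mix_at_snr` and its parameters are made up for illustration, and it assumes both clips are already loaded as mono float arrays at the same sample rate.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Mix a speech clip with a noise clip at a target SNR (in dB).

    Hypothetical helper: returns (mixture, speech, scaled_noise), so the
    mixture is the model input and the other two are the target tracks.
    """
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Loop the noise if it is shorter than the speech, then trim to length.
    if len(noise) < len(speech):
        repeats = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, repeats)
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if noise_power == 0:
        raise ValueError("noise clip is silent; cannot scale it to a target SNR")

    # Choose the scale so that speech_power / (scale**2 * noise_power)
    # equals 10 ** (snr_db / 10).
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    scaled_noise = noise * scale

    mixture = speech + scaled_noise
    return mixture, speech, scaled_noise
```

Calling this over many speech and noise pairs with randomly drawn `snr_db` values would give (mixture, speech, other sounds) triples to train on. It doesn't handle clipping, reverb, or mismatched sample rates, so real recordings may still look different from the synthetic mixes.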

The problem is the dataset, but I agree that once you have it, the rest shouldn’t be hard.

That’s also why speech synthesis doesn’t work well for every public figure. For now, you need enough clean recordings of someone’s voice. Joe Rogan, for example, is a great subject for this.