gm8xx8

Moshi 🔥
- 7B multimodal LM
- will be released as open source!!
- achieves 160ms latency 🤌✨
- trained on a Scaleway cluster of 1,000 H100 GPUs
- expresses emotions and understands accents, such as a French accent.
- handles audio generation and listening simultaneously.
- processes thoughts textually during speech.
- uses dual audio streams for simultaneous listening and speaking.
- jointly pre-trained on text and audio.
- uses synthetic text from the 7B LLM Helium and is fine-tuned on 100k TTS-converted “oral-style” conversations.
- voice learned from TTS-generated data.
- achieves 200ms end-to-end latency.
- includes a smaller version for MacBooks or consumer GPUs.
- implements watermarking to identify AI-generated audio (in progress).

Kyutai Labs has open-sourced Moshi, a 7.6B speech-to-speech foundation model, and Mimi, a SoTA streaming speech codec. The release includes Moshi models fine-tuned on synthetic data, along with Mimi, which processes 24 kHz audio with a bandwidth of 1.1 kbps. The models are optimized for on-device performance, with low latency and support for inference via Candle, PyTorch, and MLX.
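
For a rough sense of where the 1.1 kbps figure comes from, here is a back-of-the-envelope sketch. Only the 24 kHz sample rate and the 1.1 kbps bandwidth are stated in the post; the 12.5 Hz frame rate, 8 codebooks, and 2048-entry codebooks are assumptions taken from Kyutai's published description of Mimi, and the 16-bit PCM baseline is just an illustrative comparison point.

```python
import math

SAMPLE_RATE_HZ = 24_000   # from the post
FRAME_RATE_HZ = 12.5      # assumption: Mimi emits 12.5 token frames per second
NUM_CODEBOOKS = 8         # assumption: 8 codebooks per frame
CODEBOOK_SIZE = 2048      # assumption: 2048 entries per codebook


def codec_bitrate_bps(frame_rate_hz: float, num_codebooks: int, codebook_size: int) -> float:
    """Bitrate of a token-based codec: frames/s * codes/frame * bits/code."""
    bits_per_code = math.log2(codebook_size)
    return frame_rate_hz * num_codebooks * bits_per_code


bitrate_bps = codec_bitrate_bps(FRAME_RATE_HZ, NUM_CODEBOOKS, CODEBOOK_SIZE)
samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ   # raw audio samples covered by one frame
frame_duration_ms = 1000 / FRAME_RATE_HZ             # audio duration covered by one frame
pcm_bitrate_bps = SAMPLE_RATE_HZ * 16                # uncompressed 16-bit mono PCM, for comparison

print(f"codec bitrate: {bitrate_bps / 1000:.1f} kbps")
print(f"samples per frame: {samples_per_frame:.0f}")
print(f"frame duration: {frame_duration_ms:.0f} ms")
print(f"compression vs 16-bit PCM: {pcm_bitrate_bps / bitrate_bps:.0f}x")
```

Under those assumptions, each frame covers 80 ms of audio, which sets a floor on how quickly a streaming model built on the codec can respond.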

https://huggingface.co/collections/kyutai/moshi-v01-release-66eaeaf3302bef6bd9ad7acd

what use cases do you envision these bringing?

live translation on-site?