𝚐𝔪𝟾𝚡𝚡𝟾
@gm8xx8
Moshi 🔥
- 7B multimodal LM
- will be released as open source!!
- achieves 160ms theoretical latency 🤌✨
- trained on a Scaleway cluster of 1,000 H100 GPUs
- expresses emotions and understands accents, like a "French accent"
- handles audio generation and listening simultaneously
- processes thoughts textually during speech
- uses dual audio streams for simultaneous listening and speaking (sketch below)
- jointly pre-trained on text and audio
- utilizes synthetic text from the 7B LLM Helium, fine-tuned on 100k TTS-converted "oral-style" conversations
- voice learned from TTS-generated data
- achieves ~200ms end-to-end latency in practice
- includes a smaller version for MacBooks or consumer GPUs
- implements watermarking to identify AI-generated audio (in progress)
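An illustrative aside on the dual-stream bullet above: the sketch below is a toy PyTorch model, not Kyutai's actual Moshi architecture, and every name and size in it (DualStreamStep, d_model, n_audio_codes, ...) is hypothetical. It only shows the shape of the idea: one shared backbone consumes the user's audio stream and the model's own previous audio, and at every frame emits both a text token (the "textual thoughts") and the model's next audio code, so listening and speaking happen simultaneously.

```python
# Toy sketch of a dual-stream step with an inner monologue.
# NOT Kyutai's code; all names and sizes are hypothetical.
import torch
import torch.nn as nn

class DualStreamStep(nn.Module):
    def __init__(self, d_model=512, n_text=32000, n_audio_codes=2048):
        super().__init__()
        # Embeddings for the two audio streams (user + model's own
        # previous output) and the previous inner-monologue text token.
        self.user_audio_emb = nn.Embedding(n_audio_codes, d_model)
        self.self_audio_emb = nn.Embedding(n_audio_codes, d_model)
        self.text_emb = nn.Embedding(n_text, d_model)
        self.backbone = nn.GRU(d_model, d_model, batch_first=True)
        # Two heads per frame: inner-monologue text + next audio code.
        self.text_head = nn.Linear(d_model, n_text)
        self.audio_head = nn.Linear(d_model, n_audio_codes)

    def forward(self, user_codes, self_codes, text_tokens, h=None):
        # Sum the three per-frame streams into one input sequence.
        x = (self.user_audio_emb(user_codes)
             + self.self_audio_emb(self_codes)
             + self.text_emb(text_tokens))
        out, h = self.backbone(x, h)
        # Every audio frame yields a text token ("thinks textually
        # while speaking") AND the model's next audio code.
        return self.text_head(out), self.audio_head(out), h

# Shape check on a 3-frame step, batch of 1.
step = DualStreamStep()
codes = torch.randint(0, 2048, (1, 3))
text = torch.randint(0, 32000, (1, 3))
text_logits, audio_logits, _ = step(codes, codes.clone(), text)
print(text_logits.shape, audio_logits.shape)  # (1, 3, 32000) (1, 3, 2048)
```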
1 reply · 1 recast · 28 reactions
𝚐𝔪𝟾𝚡𝚡𝟾
@gm8xx8
i tried the demo and IMO it still has a long way to go. if they do in fact release the code, model, and paper, i believe the community will improve on it and it will become a much better assistant. … some notes:
- 7B multimodal LM
- Moshi already runs on Apple laptops! (can run on a laptop / on consumer GPUs; see the sketch below) big W!
- local: no data leaving your computer, no internet access
- open source! technical report and open model releases 🤞
- latency is a plus
- trained by 8+ people in 4 months
- used a heavy amount of synthetic data
- didn't get the emotion right
- missed simple scheduling prompts
- no multilingual support
- needs fine-tuning

try the demo ↓
https://moshi.chat/?queue_id=talktomoshi
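A small aside on the "runs locally" point: the snippet below is generic PyTorch device selection, not anything from Moshi's release, just a sketch of how an on-device app would pick Apple-silicon MPS on a MacBook or CUDA on a consumer GPU so no audio ever leaves the machine.

```python
# Generic local-device pick: MPS on Apple silicon, CUDA on a consumer
# GPU, CPU as the fallback. Illustrative only; not Moshi's code.
import torch

def pick_local_device() -> torch.device:
    if torch.backends.mps.is_available():   # Apple laptops
        return torch.device("mps")
    if torch.cuda.is_available():           # consumer GPUs
        return torch.device("cuda")
    return torch.device("cpu")

device = pick_local_device()
print(f"running fully on-device: {device}")  # nothing leaves your computer
```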
1 reply · 2 recasts · 4 reactions