𝚐𝔪𝟾𝚡𝚡𝟾 on Warpcast

https://warpcast.com/~/channel/gm8xx8

𝚐𝔪𝟾𝚡𝚡𝟾

Moshi 🔥
- 1. 7b multimodal LM
- will be released as open source!!
- achieves 160ms latency🤌✨
- trained on Scaleway cluster of 1000 H100 GPUs
- expresses emotions and understands accents, like a “french accent.”
- handles audio generation and listening simultaneously.
- processes thoughts textually during speech.
- uses dual audio streams for simultaneous listening and speaking.
- jointly pre-trained on text and audio.
- utilizes synthetic text from the 7b LLM Helium and fine-tuned on 100k TTS-converted “oral-style” conversations.
- voice learned from TTS-generated data.
- achieves 200ms end-to-end latency.
- includes a smaller version for macbooks or consumer GPUs.
- implements watermarking to identify AI-generated audio (in progress).

𝚐𝔪𝟾𝚡𝚡𝟾

i tried the demo and imo it still has a long way to go, now if they in fact release code, model, and paper i believe the community will improve on it and it will be a much better assistant. …

some notes:
- 1. 7B Multimodal LM
- Moshi already runs on apple laptops! (can run on laptop / on consumer GPUs) big w!
- local: no data leaving your computer, no internet access.
- open source! technical report and open model releases🤞
- latency +
- trained by 8+ people in 4 months
- used a heavy amount of synthetic data
- didn’t get the emotion
- missed simple scheduling prompts
- no multilingual
- needs fine-tuning

try demo ↓ https://moshi.chat/?queue_id=talktomoshi

Frank

nice summary… did you have to design a prompt to elicit emotion?
