𝚐𝔪𝟾𝚡𝚡𝟾
@gm8xx8
Moshi 🔥
- 7B multimodal LM, will be released as open source!!
- achieves 160ms latency ✨
- trained on a Scaleway cluster of 1000 H100 GPUs
- expresses emotions and understands accents, like a "French accent"
- handles audio generation and listening simultaneously
- processes its thoughts textually while speaking
- uses dual audio streams for simultaneous listening and speaking (see the sketch after this list)
- jointly pre-trained on text and audio
- uses synthetic text from the 7B LLM Helium, then fine-tuned on 100k TTS-converted "oral-style" conversations
- voice learned from TTS-generated data
- achieves 200ms end-to-end latency
- includes a smaller version for MacBooks or consumer GPUs
- implements watermarking to identify AI-generated audio (in progress)
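For intuition, here is a toy sketch of what the dual-stream, "inner monologue" decode loop described above could look like: at every audio frame the model consumes an incoming frame and simultaneously emits a text token plus an outgoing frame. Everything here (ToyMoshi, step, the 80ms frame size) is a hypothetical illustration, not Kyutai's actual interface.

```python
# Toy, dependency-free sketch of a dual-stream + inner-monologue loop.
# All names and numbers are assumptions for illustration only.
import random

FRAME_MS = 80            # assumed frame size; two frames of delay ~ 160ms
VOCAB = ["hi", "there", "<pad>"]

class ToyMoshi:
    """Stub model: one step maps (audio in) -> (text token, audio out)."""
    def step(self, in_frame: bytes) -> tuple[str, bytes]:
        text = random.choice(VOCAB)       # "inner monologue" text token
        out_frame = bytes(len(in_frame))  # placeholder output audio frame
        return text, out_frame

def run(mic_frames):
    model = ToyMoshi()
    t = 0
    for in_frame in mic_frames:                 # listening stream
        text, out_frame = model.step(in_frame)  # speaking stream, same tick
        print(f"{t:4d}ms  text={text!r}  audio_out={len(out_frame)}B")
        t += FRAME_MS

if __name__ == "__main__":
    run([b"\x00" * 1920 for _ in range(5)])  # five fake 80ms mic frames
```

The point is that text and audio share one timeline, with no separate ASR -> LLM -> TTS hand-offs, which is presumably what makes the 160-200ms latency figures possible.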
1 reply
1 recast
21 reactions
𝚐𝔪𝟾𝚡𝚡𝟾
@gm8xx8
i tried the demo and imo it still has a long way to go. if they do release the code, model, and paper, i believe the community will improve on it and it will become a much better assistant. … some notes:
- 7B multimodal LM
- Moshi already runs on Apple laptops! (can run on a laptop / on consumer GPUs) big W!
- local: no data leaving your computer, no internet access (see the duplex-loop sketch after this list)
- open source! technical report and open model releases
- latency is a plus
- trained by 8+ people in 4 months
- used a heavy amount of synthetic data
- didn't get the emotion right
- missed simple scheduling prompts
- no multilingual support
- needs fine-tuning

try the demo → https://moshi.chat/?queue_id=talktomoshi
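The notes stress fully local use (no data leaving your computer), so here is a minimal sketch of the full-duplex client loop such a model implies, assuming the third-party `sounddevice` library; `respond`, the 24kHz sample rate, and the 80ms block size are placeholder assumptions, not Kyutai's API.

```python
# Minimal full-duplex loop: listening and speaking share one audio stream.
# `respond` is a hypothetical stand-in for the model (here: pure silence).
import numpy as np
import sounddevice as sd

SR = 24_000                 # assumed sample rate
BLOCK = SR * 80 // 1000     # assumed 80ms frames

def respond(frame: np.ndarray) -> np.ndarray:
    """Placeholder 'model': returns silence of the same shape."""
    return np.zeros_like(frame)

def callback(indata, outdata, frames, time, status):
    if status:
        print(status)
    outdata[:] = respond(indata)   # listen and speak in the same tick

with sd.Stream(samplerate=SR, blocksize=BLOCK, channels=1,
               dtype="float32", callback=callback):
    sd.sleep(5_000)                # run the duplex loop for five seconds
```

Everything stays on-device; swapping `respond` for a real local model is what "no internet access" would mean in practice.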
1 reply
2 recasts
4 reactions