
๐š๐”ช๐Ÿพ๐šก๐šก๐Ÿพ pfp
๐š๐”ช๐Ÿพ๐šก๐šก๐Ÿพ
@gm8xx8
Moshi 🔥
- 7B multimodal LM, will be released as open source!!
- achieves 160ms latency 🤌✨
- trained on a Scaleway cluster of 1,000 H100 GPUs
- expresses emotions and understands accents, like a "french accent"
- handles audio generation and listening simultaneously
- processes thoughts textually during speech
- uses dual audio streams for simultaneous listening and speaking
- jointly pre-trained on text and audio
- utilizes synthetic text from the 7B LLM Helium, fine-tuned on 100k TTS-converted "oral-style" conversations
- voice learned from TTS-generated data
- achieves 200ms end-to-end latency
- includes a smaller version for MacBooks and consumer GPUs
- implements watermarking to identify AI-generated audio (in progress)
1 reply
1 recast
21 reactions
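The dual-audio-stream and "thoughts as text during speech" bullets above describe a full-duplex design: every timestep consumes an incoming user audio frame while emitting an outgoing audio frame plus an inner-monologue text token, so listening and speaking overlap instead of alternating turns. A minimal toy sketch of that loop, assuming illustrative names and placeholder outputs (this is not Kyutai's actual implementation):

```python
# Toy sketch of a Moshi-style full-duplex step loop.
# Assumption: FullDuplexStep / ToyDuplexModel are hypothetical names for
# illustration; a real model would run a transformer over the joint history
# of both audio token streams and the text stream.

from dataclasses import dataclass, field

@dataclass
class FullDuplexStep:
    user_audio: int   # codec token heard this frame (input stream)
    model_audio: int  # codec token spoken this frame (output stream)
    model_text: str   # inner-monologue text token emitted alongside speech

@dataclass
class ToyDuplexModel:
    history: list = field(default_factory=list)

    def step(self, user_audio_token: int) -> FullDuplexStep:
        # Fake outputs stand in for model sampling: both directions
        # advance every frame, so there is no turn-taking boundary.
        out = FullDuplexStep(
            user_audio=user_audio_token,
            model_audio=user_audio_token + 1000,  # placeholder "reply" token
            model_text=f"tok{len(self.history)}",
        )
        self.history.append(out)
        return out

model = ToyDuplexModel()
steps = [model.step(t) for t in [17, 42, 99]]
# Each step carries one heard frame, one spoken frame, and one text token.
```

The point of the sketch is the interface shape: because audio in, audio out, and text all tick forward on the same clock, the model can keep listening mid-sentence, which is what makes the low per-frame latency figures meaningful.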

๐š๐”ช๐Ÿพ๐šก๐šก๐Ÿพ pfp
๐š๐”ช๐Ÿพ๐šก๐šก๐Ÿพ
@gm8xx8
i tried the demo and imo it still has a long way to go. if they in fact release the code, model, and paper, i believe the community will improve on it and it will become a much better assistant.

some notes:
- 7B multimodal LM
- Moshi already runs on apple laptops (can run on a laptop / on consumer GPUs), big w!
- local: no data leaving your computer, no internet access
- open source! technical report and open model releases 🤞
- latency +
- trained by 8+ people in 4 months
- used a heavy amount of synthetic data
- didn't get the emotion
- missed simple scheduling prompts
- no multilingual
- needs fine-tuning

try demo ↓
https://moshi.chat/?queue_id=talktomoshi
1 reply
2 recasts
4 reactions

Frank pfp
Frank
@deboboy
nice summary… did you have to design a prompt to elicit emotion?
1 reply
2 recasts
1 reaction