𝚐𝔪𝟾𝚡𝚡𝟾
@gm8xx8
Emu3: Next-Token Prediction is All You Need

Emu3 is a new suite of multimodal models trained through next-token prediction. It converts images, text, and videos into a discrete token space and trains a single transformer on the resulting multimodal sequences. Emu3 surpasses models like SDXL, LLaVA-1.6, and OpenSora-1.2 in both generation and perception tasks, without using diffusion or compositional architectures.

- Generates high-quality images from text input by predicting the next visual token, supporting different resolutions and styles.
- Demonstrates strong vision-language understanding, providing coherent text responses without relying on CLIP or a pretrained LLM.
- Produces videos by sequentially predicting the next token, allowing for video extension and future event prediction without diffusion models.

Emu3: https://huggingface.co/collections/BAAI/emu3-66f4e64f70850ff358a2e60f
GitHub: https://github.com/baaivision/Emu3
Project page: https://emu.baai.ac.cn/about
1 reply
0 recast
6 reactions
Stephan
@stephancill
I'm supporting you through /microsub! 689 $DEGEN (Please mute the keyword "ms!t" if you prefer not to see these casts.)
0 reply
0 recast
0 reaction