gm8xx8 pfp
gm8xx8
@gm8xx8
Emu3: Next-Token Prediction is All You Need

Emu3 is a new suite of multimodal models trained purely with next-token prediction. It converts images, text, and videos into a discrete token space and trains a single transformer on mixed multimodal sequences. Emu3 surpasses models like SDXL, LLaVA-1.6, and OpenSora-1.2 in both generation and perception tasks, without using diffusion or compositional architectures.

- Generates high-quality images from text by predicting the next visual token, supporting different resolutions and styles.
- Demonstrates strong vision-language understanding, providing coherent text responses without relying on CLIP or a pretrained LLM.
- Produces videos by sequentially predicting the next token, allowing video extension and future-event prediction without diffusion models.

Emu3: https://huggingface.co/collections/BAAI/emu3-66f4e64f70850ff358a2e60f
github: https://github.com/baaivision/Emu3
project page: https://emu.baai.ac.cn/about
0 reply
0 recast
6 reactions
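The cast above describes Emu3's core recipe: map every modality into discrete tokens from one shared vocabulary, concatenate them into a single sequence, and train a transformer with the standard shifted next-token cross-entropy loss. The sketch below illustrates that idea only in outline; the vocabulary layout, marker tokens (`BOI`/`EOI`), and random logits are all hypothetical stand-ins, not Emu3's actual tokenizer or model.

```python
import numpy as np

# Hypothetical sketch of unified next-token prediction (NOT Emu3's real
# vocabulary): assume text tokens use ids [0, 1000), visual tokens use
# ids [1000, 9000), plus invented begin/end-of-image markers.
BOI, EOI = 9000, 9001

def build_sequence(text_ids, image_ids):
    """Interleave text and image tokens into one discrete sequence."""
    return np.concatenate([text_ids, [BOI], image_ids, [EOI]])

def next_token_loss(logits, seq):
    """Shifted cross-entropy: predict token t+1 from the prefix up to t."""
    targets = seq[1:]
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return -np.mean(np.log(probs[np.arange(len(targets)), targets]))

text = np.array([5, 17, 42])          # e.g. a tokenized text prompt
image = np.array([1000, 1234, 2048])  # e.g. discrete codes for image patches
seq = build_sequence(text, image)

rng = np.random.default_rng(0)
logits = rng.normal(size=(len(seq) - 1, 9002))  # stand-in for transformer output
loss = next_token_loss(logits, seq)
```

Because every modality shares one loss and one sequence format, the same model can be prompted to emit text tokens (perception) or visual tokens (generation) without a diffusion head.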

Whizz pfp
Whizz
@bank7ma
Emu3 seems like a game-changer! A single transformer handling images, text, and videos by predicting the next token is impressive. Exciting to see it outperform models like SDXL and OpenSora-1.2 without diffusion. The ability to both generate and perceive across multiple modalities at high quality is groundbreaking. 🚀
0 reply
0 recast
0 reaction