𝚐𝔪𝟾𝚡𝚡𝟾

Emu3: Next-Token Prediction is All You Need

Emu3: https://huggingface.co/collections/BAAI/emu3-66f4e64f70850ff358a2e60f

Emu3 is a new suite of multimodal models trained solely with next-token prediction. It tokenizes images, text, and videos into a discrete space and trains a single transformer on the resulting multimodal sequences. Emu3 surpasses models such as SDXL, LLaVA-1.6, and OpenSora-1.2 in both generation and perception tasks, without using diffusion or compositional architectures.

- Generates high-quality images from text input by predicting the next visual token, with support for different resolutions and styles.
- Demonstrates strong vision-language understanding, giving coherent text responses without relying on CLIP or a pretrained LLM.
- Produces videos by sequentially predicting the next token, allowing for video extension and future event prediction without diffusion models.
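
To make the "single transformer on discrete multimodal sequences" idea concrete, here is a minimal sketch of how text tokens and visual codebook indices could share one vocabulary and be trained with a shifted next-token objective. The vocabulary sizes, the special tokens (`BOS`, `SOV`, `EOV`, `EOS`), and the helper names are all hypothetical and chosen for illustration; they are not Emu3's actual tokenizer or training code, and a uniform distribution stands in for the transformer.

```python
import math

# Hypothetical toy sizes; NOT Emu3's real vocabulary or visual codebook.
TEXT_VOCAB = 100
VISUAL_CODEBOOK = 50
# Hypothetical special tokens placed after the text and visual id ranges.
BOS, SOV, EOV, EOS = (TEXT_VOCAB + VISUAL_CODEBOOK + i for i in range(4))
VOCAB_SIZE = TEXT_VOCAB + VISUAL_CODEBOOK + 4


def build_sequence(text_ids, visual_codes):
    """Merge text token ids and visual codebook indices into one id sequence."""
    visual_ids = [TEXT_VOCAB + c for c in visual_codes]  # shift into shared vocab
    return [BOS, *text_ids, SOV, *visual_ids, EOV, EOS]


def next_token_pairs(seq):
    """Shift by one: the input at position t is paired with the token at t + 1."""
    return seq[:-1], seq[1:]


def cross_entropy(prob_rows, targets):
    """Mean negative log-likelihood of the targets under per-position distributions."""
    return -sum(math.log(p[t]) for p, t in zip(prob_rows, targets)) / len(targets)


seq = build_sequence(text_ids=[5, 17, 42], visual_codes=[3, 0, 49, 7])
inputs, targets = next_token_pairs(seq)

# Stand-in for a transformer: a uniform distribution at every position.
uniform = [[1.0 / VOCAB_SIZE] * VOCAB_SIZE for _ in inputs]
loss = cross_entropy(uniform, targets)
print(seq, round(loss, 4))
```

Because text and visual ids live in one vocabulary, the same loss covers both modalities; under this framing, generating an image or video amounts to sampling visual ids one at a time after the start-of-vision marker.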

GitHub: https://github.com/baaivision/Emu3
Project page: https://emu.baai.ac.cn/about