Emu3: Next-Token Prediction is All You Need

Emu3 is a new suite of multimodal models trained solely with next-token prediction. It tokenizes images, text, and videos into a discrete space and trains a single transformer on the resulting multimodal sequences. Emu3 outperforms task-specific models such as SDXL (image generation), LLaVA-1.6 (vision-language perception), and OpenSora-1.2 (video generation), without using diffusion or compositional architectures.
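
To make the "single discrete space" idea concrete, here is a minimal sketch in plain Python. It is not Emu3's code: the vocabulary sizes, the BOI/EOI marker tokens, and the helper names are all assumptions chosen for illustration. It shows text ids and visual codebook indices packed into one sequence, with every token then treated as a prediction target given its prefix.

```python
import math

# Hypothetical toy vocabulary layout (NOT Emu3's real one):
# ids [0, TEXT_VOCAB) are text tokens, followed by visual codebook entries and two markers.
TEXT_VOCAB = 100
VISUAL_CODEBOOK = 50
BOI = TEXT_VOCAB + VISUAL_CODEBOOK   # assumed begin-of-image marker
EOI = BOI + 1                        # assumed end-of-image marker
VOCAB_SIZE = EOI + 1


def pack_sequence(text_ids, visual_codes):
    """Flatten text ids and visual codebook indices into one token sequence."""
    shifted = [TEXT_VOCAB + c for c in visual_codes]  # move codes into the shared id space
    return list(text_ids) + [BOI] + shifted + [EOI]


def next_token_pairs(seq):
    """Return (context, target) pairs: each token is predicted from its prefix."""
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


seq = pack_sequence([5, 17, 42], [3, 0, 49, 7])
pairs = next_token_pairs(seq)

# An untrained, uniform "model" scores log(VOCAB_SIZE) on every target.
uniform_loss = sum(cross_entropy([0.0] * VOCAB_SIZE, t) for _, t in pairs) / len(pairs)
```

The point of the sketch is that the loss function does not care which modality a token came from. Once everything lives in one id space, the same objective covers text, image, and video tokens.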

- Image generation: Emu3 generates high-quality images from text input by predicting the next visual token, and supports flexible resolutions and styles.
- Vision-language understanding: Emu3 gives coherent text responses about visual input without relying on CLIP or a pretrained LLM.
- Video generation: Emu3 produces videos by predicting the next token in a video sequence rather than by diffusion. Given a video as context, it can extend the clip and predict what happens next.
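
The generation side of the capabilities above can be sketched the same way. The snippet below is a toy illustration only: `toy_next_token_logits` is a hypothetical stand-in for a trained transformer, and the frame and vocabulary sizes are made up. It shows greedy autoregressive decoding, where "video extension" simply means using the tokens of observed frames as the prompt and decoding further tokens as future frames.

```python
def toy_next_token_logits(context, vocab_size):
    """Hypothetical stand-in for a trained model: favors (last token + 1) % vocab_size."""
    logits = [0.0] * vocab_size
    logits[(context[-1] + 1) % vocab_size] = 1.0
    return logits


def generate(context, num_new_tokens, vocab_size, logits_fn=toy_next_token_logits):
    """Greedy autoregressive decoding: append the argmax token one step at a time."""
    seq = list(context)
    for _ in range(num_new_tokens):
        logits = logits_fn(seq, vocab_size)
        seq.append(max(range(vocab_size), key=logits.__getitem__))
    return seq


# Tokens of the observed frames act as the prompt (toy sizes: 2 frames x 4 tokens).
observed_frames = [[0, 1, 2, 3], [4, 5, 6, 7]]
prompt = [t for frame in observed_frames for t in frame]

# Decode one more frame's worth of tokens as the predicted continuation.
extended = generate(prompt, num_new_tokens=4, vocab_size=10)
future_frame = extended[len(prompt):]
```

A real system would sample from the predicted distribution instead of taking the argmax, and would decode the generated tokens back to pixels with the visual tokenizer. Both steps are left out here.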

Models: https://huggingface.co/collections/BAAI/emu3-66f4e64f70850ff358a2e60f
GitHub: https://github.com/baaivision/Emu3
Project page: https://emu.baai.ac.cn/about