𝚐π”ͺ𝟾𝚑𝚑𝟾 pfp
𝚐π”ͺ𝟾𝚑𝚑𝟾
@gm8xx8
Emu3: Next-Token Prediction is All You Need

Emu3: https://huggingface.co/collections/BAAI/emu3-66f4e64f70850ff358a2e60f

Emu3 is a new suite of multimodal models trained solely with next-token prediction. It tokenizes images, text, and videos into a discrete space and trains a single transformer on the resulting multimodal sequences. Emu3 surpasses models like SDXL, LLaVA-1.6, and OpenSora-1.2 in both generation and perception tasks, without using diffusion or compositional architectures.

- Generates high-quality images from text input by predicting the next visual token, supporting different resolutions and styles.
- Demonstrates strong vision-language understanding, providing coherent text responses without relying on CLIP or a pretrained LLM.
- Produces videos by sequentially predicting the next token, allowing for video extension and future event prediction without diffusion models.

github: https://github.com/baaivision/Emu3
project page: https://emu.baai.ac.cn/about
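
To make the "single transformer over multimodal sequences" idea concrete, here is a toy sketch of how text ids and visual codebook indices could share one vocabulary and be turned into next-token training pairs. Everything in it is an assumption for illustration: the vocabulary sizes, the `BOI`/`EOI` marker tokens, and the helper names `build_sequence` and `next_token_pairs` are hypothetical and are not taken from the Emu3 code or tokenizer.

```python
# Toy illustration only: sizes, special tokens, and helper names are
# hypothetical and do not reflect Emu3's actual tokenizer or training code.
TEXT_VOCAB = 1000                  # assumed number of text token ids
VISION_VOCAB = 4096                # assumed visual codebook size
BOI = TEXT_VOCAB + VISION_VOCAB    # hypothetical "begin of image" marker
EOI = BOI + 1                      # hypothetical "end of image" marker


def build_sequence(text_ids, image_codes):
    """Merge text ids and visual codebook indices into one token stream."""
    # Offset visual codes past the text ids so both share one vocabulary.
    visual_ids = [TEXT_VOCAB + code for code in image_codes]
    return list(text_ids) + [BOI] + visual_ids + [EOI]


def next_token_pairs(sequence):
    """Return (inputs, targets), where targets are inputs shifted by one."""
    return sequence[:-1], sequence[1:]


seq = build_sequence(text_ids=[5, 17, 42], image_codes=[0, 7, 4095])
inputs, targets = next_token_pairs(seq)
print(seq)      # one flat sequence covering both modalities
print(targets)  # what a model would be trained to predict at each position
```

In this sketch the same shifted-by-one objective applies whether the next token is a word piece or a visual code, which is the property the post highlights.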