gm8xx8
@gm8xx8
I like when Mistral casually drops magnet links ☺️ Mistral releases Pixtral, a 12B VLM with a Mistral Nemo 12B text backbone and a 400M vision adapter. It processes 1024x1024 images in 16x16 pixel patches, has a 131,072-token vocabulary, and includes special image tokens (img, img_break, img_end). The model uses bf16 weights, GeLU in the vision adapter, and 2D RoPE in the vision encoder. I'm looking forward to the inference code release!
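Quick back-of-envelope math on those specs: a 1024x1024 image split into 16x16 patches gives a 64x64 patch grid. A minimal sketch, assuming a standard ViT-style patchifier (names here are illustrative, not Mistral's actual implementation):

```python
# Patch-count arithmetic for the figures quoted in the post.
# Assumption: square images, non-overlapping patches (ViT-style).
IMAGE_SIZE = 1024   # post: processes 1024x1024 images
PATCH_SIZE = 16     # post: 16x16 pixel patches

patches_per_side = IMAGE_SIZE // PATCH_SIZE   # 1024 / 16 = 64
total_patches = patches_per_side ** 2         # 64 * 64 = 4096

print(f"{patches_per_side}x{patches_per_side} grid, "
      f"{total_patches} image tokens per full-res image")
```

So a single full-resolution image costs on the order of 4096 vision tokens before any text, which is why a 131,072-entry vocabulary with dedicated img_break/img_end tokens is needed to delimit rows and images in the sequence.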