Multimodal Autoregressive Pre-training of Large Vision Encoders

paper: arxiv.org/abs/2411.14402
code & model checkpoints: github.com/apple/ml-aim
weights: https://huggingface.co/collections/apple/aimv2-6720fe1558d94c7805f7688c

AIMv2 is a family of vision encoders pre-trained with a multimodal autoregressive objective over image patches and text tokens. It achieves state-of-the-art results on vision and multimodal tasks, outperforming contrastive models such as CLIP and SigLIP in multimodal understanding, and its largest variant (AIMv2-3B) reaches 89.5% accuracy on ImageNet-1k with a frozen trunk.

> Surpasses CLIP and SigLIP in multimodal understanding.
> Outperforms DINOv2 on open-vocabulary object detection and referring expression comprehension.
> Reaches 89.5% on ImageNet-1k with AIMv2-3B using a frozen trunk.
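
To make "frozen trunk" concrete: in this evaluation setting the pre-trained encoder's weights stay fixed and only a lightweight head on top of it is trained. Below is a minimal PyTorch sketch of that idea. The tiny `trunk` network, the plain linear `head`, and the synthetic data are all hypothetical stand-ins chosen for illustration; they are not AIMv2 and not necessarily the probing setup used in the paper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a pre-trained vision encoder (NOT AIMv2 itself).
trunk = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 8 * 8, 32),
    nn.GELU(),
)

# Freeze the trunk: its parameters get no gradients and are never updated.
for p in trunk.parameters():
    p.requires_grad = False
trunk.eval()

# Only this lightweight classification head is trained (assumed: linear, 4 classes).
head = nn.Linear(32, 4)
optimizer = torch.optim.SGD(head.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Synthetic images and labels, purely for illustration.
images = torch.randn(64, 3, 8, 8)
labels = torch.randint(0, 4, (64,))

# Snapshots used afterwards to check which weights moved.
trunk_before = [p.clone() for p in trunk.parameters()]
head_before = head.weight.detach().clone()

for _ in range(20):
    with torch.no_grad():  # features are computed by the frozen trunk
        features = trunk(images)
    loss = loss_fn(head(features), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

trunk_unchanged = all(
    torch.equal(before, after)
    for before, after in zip(trunk_before, trunk.parameters())
)
head_changed = not torch.equal(head_before, head.weight)
```

After training, `trunk_unchanged` and `head_changed` record that only the head was updated. A score reported "with a frozen trunk" therefore reflects the quality of the pre-trained features rather than task-specific fine-tuning of the encoder.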

Summary: a state-of-the-art vision encoder that surpasses existing models on multimodal benchmarks and achieves 89.5% accuracy on ImageNet-1k with a frozen trunk.