𝚐π”ͺ𝟾𝚑𝚑𝟾 pfp
𝚐π”ͺ𝟾𝚑𝚑𝟾
@gm8xx8
Multimodal Autoregressive Pre-training of Large Vision Encoders (Apple)
paper: arxiv.org/abs/2411.14402
code & model checkpoints: github.com/apple/ml-aim
weights: https://huggingface.co/collections/apple/aimv2-6720fe1558d94c7805f7688c

AIMV2 is a family of multimodal vision encoders pre-trained autoregressively on paired image and text. It achieves SOTA on vision and multimodal benchmarks, outperforming models like CLIP and SigLIP, and reaches 89.5% accuracy on ImageNet-1k with a frozen trunk.
> Surpasses CLIP and SigLIP in multimodal understanding.
> Outperforms DINOv2 in object detection and referring expression comprehension.
> 89.5% on ImageNet-1k with AIMv2-3B using a frozen trunk.