Apple recently unveiled autoregressive image models (AIM), a collection of vision models pre-trained with an autoregressive objective. These models represent a new frontier for training large-scale vision models: they are inspired by their textual counterparts, large language models (LLMs), and exhibit similar scaling properties.
The researchers said that AIM presents a scalable method for pre-training vision models without supervision. They used a generative autoregressive objective during pre-training and proposed technical improvements to adapt it for downstream transfer.
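To make the objective concrete, here is a minimal, hedged sketch (not Apple's implementation): an image is split into a raster-ordered sequence of patches, and each patch is regressed from the patches before it. The `patchify` helper and the mean-of-prefix "model" are illustrative placeholders for the transformer trunk the paper actually trains.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an HxWxC image into a raster-ordered sequence of flattened patches."""
    h, w, c = image.shape
    p = patch_size
    return (image.reshape(h // p, p, w // p, p, c)
                 .transpose(0, 2, 1, 3, 4)
                 .reshape(-1, p * p * c))

def autoregressive_loss(patches, predict):
    """Mean squared error of predicting patch t from the causal prefix patches[:t]."""
    losses = []
    for t in range(1, len(patches)):
        pred = predict(patches[:t])  # sees only earlier patches, never patch t
        losses.append(np.mean((pred - patches[t]) ** 2))
    return float(np.mean(losses))

# Toy stand-in for a trained model: predict the next patch as the prefix mean.
rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8, 3))
seq = patchify(img, patch_size=4)  # 4 patches, each 4*4*3 = 48 values
loss = autoregressive_loss(seq, lambda prefix: prefix.mean(axis=0))
```

In training, `predict` would be a causally masked transformer and `loss` would be minimized by gradient descent; the sketch only shows the shape of the objective.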
The researchers said that the performance of the visual features scales with both model capacity and the quantity of data. Further, they said that the value of the objective function correlates with the model's performance on downstream tasks.
The team has also illustrated the practical implications of these findings by pre-training a 7-billion-parameter AIM on 2 billion images, which achieves 84.0% on ImageNet-1k with a frozen trunk.
Interestingly, even at this scale, they observed no sign of saturation in performance. The pre-training of AIM resembles that of LLMs and does not require any image-specific strategy to stabilize training at scale.
About AIM
Apple believes that AIM has desirable properties, including the ability to scale to 7 billion parameters using a vanilla transformer implementation without stability-inducing techniques or extensive hyperparameter adjustments.
Moreover, AIM’s performance on the pre-training task has a strong correlation with downstream performance, outperforming state-of-the-art methods like MAE and narrowing the gap between generative and joint embedding pre-training approaches.
The researchers have also found no signs of saturation as models scaled, suggesting potential for further performance improvements with larger models trained for longer schedules.