Hugging Face has released Idefics2, an open multimodal model that accepts arbitrary sequences of image and text inputs and produces text outputs. The model can answer questions about images, describe visual content, create stories grounded on multiple images, or simply behave as a pure language model without visual inputs.
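To illustrate that interleaved image-and-text interface, here is a minimal sketch using the Hugging Face transformers library. The model ID follows the pattern on the Idefics2 model card, while the image URL and prompt are placeholders; treat the exact details as illustrative rather than definitive.

```python
# Minimal sketch: asking Idefics2 a question about an image via transformers.
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b")

# Any PIL image works; this URL is a placeholder example.
image = Image.open(requests.get("https://example.com/photo.jpg", stream=True).raw)

# Build a chat-style prompt that interleaves an image with a text question.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What is happening in this image?"},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(text=prompt, images=[image], return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```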
Idefics2 surpasses its forerunner, Idefics1, despite weighing in at only 8 billion parameters. It comes with the flexibility of an open Apache 2.0 license and significantly improved Optical Character Recognition (OCR) capabilities.
Remarkably for its size, Idefics2 outperforms larger rivals on visual tasks: it not only achieves exceptional results on visual question answering benchmarks, but also beats significantly larger models such as LLaVa-NeXT-34B and MM1-30B-chat.
Developed by the Hugging Face M4 team, the model is trained on a wide range of openly available datasets, including web documents, image-caption pairs, and OCR data. It was then fine-tuned on ‘The Cauldron,’ a new compilation of 50 carefully curated datasets formatted for multi-turn conversational training.
A significant architectural advancement in Idefics2 is the simplified integration of visual features into the language backbone: the gated cross-attention layers of Idefics1 are replaced by a learned Perceiver pooling followed by an MLP modality projection, which has enhanced the model’s overall efficacy.
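The sketch below illustrates the general idea behind that design, not the released implementation: a fixed set of learned queries cross-attends to the vision encoder’s patch features, compressing them into a small number of visual tokens, which an MLP then projects into the language model’s embedding space. All dimensions and the number of queries are assumptions chosen for illustration.

```python
# Illustrative sketch of learned Perceiver pooling + MLP modality projection.
# Dimensions (vision_dim, text_dim, num_queries) are hypothetical values.
import torch
import torch.nn as nn

class PerceiverPoolingProjector(nn.Module):
    def __init__(self, vision_dim=1152, text_dim=4096, num_queries=64, num_heads=8):
        super().__init__()
        # Learned latent queries: one row per output visual token.
        self.queries = nn.Parameter(torch.randn(num_queries, vision_dim))
        self.cross_attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)
        # MLP that maps the pooled visual tokens into the LM's hidden size.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, patch_features):  # (batch, num_patches, vision_dim)
        batch = patch_features.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        # Queries attend to all image patches, pooling them into num_queries tokens.
        pooled, _ = self.cross_attn(q, patch_features, patch_features)
        return self.proj(pooled)  # (batch, num_queries, text_dim)

# A 1024-patch image is compressed to 64 visual tokens in the LM's space.
feats = torch.randn(2, 1024, 1152)
print(PerceiverPoolingProjector()(feats).shape)  # torch.Size([2, 64, 4096])
```

The design choice this illustrates is the trade-off of the pooling step: instead of feeding every image patch to the language model, a small fixed budget of visual tokens keeps the sequence short regardless of image size.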
Idefics2 also takes a more refined approach to image preprocessing: rather than resizing every image to a fixed square, as is conventional in computer vision, it can process images at their native resolutions and aspect ratios.
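The following sketch illustrates that strategy in isolation, not Idefics2’s actual preprocessing pipeline: only the longest edge is capped, so the aspect ratio is preserved and smaller images pass through untouched. The 980-pixel cap is an assumed value for illustration.

```python
# Illustrative sketch of aspect-ratio-preserving resizing, in contrast to
# the conventional resize-to-fixed-square. The 980-pixel cap is an assumption.
from PIL import Image

def resize_keep_aspect(image: Image.Image, longest_edge: int = 980) -> Image.Image:
    width, height = image.size
    scale = longest_edge / max(width, height)
    if scale >= 1.0:
        return image  # Already within bounds; keep the native resolution.
    new_size = (round(width * scale), round(height * scale))
    return image.resize(new_size, Image.BICUBIC)

# A 1600x900 image becomes 980x551: same aspect ratio, no square cropping.
img = Image.new("RGB", (1600, 900))
print(resize_keep_aspect(img).size)  # (980, 551)
```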