Hugging Face has released Idefics2, an open multimodal model that accepts arbitrary sequences of image and text inputs and produces text outputs. The model can answer questions about images, describe visual content, create stories grounded on multiple images, or simply behave as a pure language model without visual inputs.
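To illustrate that interleaved image-and-text interface, here is a minimal sketch using the Hugging Face transformers library. The model ID follows the pattern on the Idefics2 model card, while the image URL and prompt are placeholders; treat the exact details as illustrative rather than definitive.

```python
# Minimal sketch: asking Idefics2 a question about an image via transformers.
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b")

# Any PIL image works; this URL is a placeholder example.
image = Image.open(requests.get("https://example.com/photo.jpg", stream=True).raw)

# Build a chat-style prompt that interleaves an image with a text question.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What is happening in this image?"},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(text=prompt, images=[image], return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```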
Idefics2 surpasses its forerunner, Idefics1, despite weighing in at only 8 billion parameters. It comes with the flexibility of an open Apache 2.0 license and significantly improved Optical Character Recognition (OCR) capabilities.
Remarkably for its size, Idefics2 outperforms larger rivals on visual tasks: it not only achieves exceptional results on visual question answering benchmarks, but also beats significantly larger models such as LLaVa-NeXT-34B and MM1-30B-chat.
Developed by the Hugging Face M4 team, the model is trained on a wide range of openly available datasets, including web documents, image-caption pairs, and OCR data. It was then fine-tuned on ‘The Cauldron,’ a new compilation of 50 carefully curated datasets formatted for multi-turn conversational training.
A significant architectural advancement in Idefics2 is the simplified integration of visual features into the language backbone: the gated cross-attention layers of Idefics1 are replaced by a learned Perceiver pooling followed by an MLP modality projection, which has enhanced the model’s overall efficacy.
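The sketch below illustrates the general idea behind that design, not the released implementation: a fixed set of learned queries cross-attends to the vision encoder’s patch features, compressing them into a small number of visual tokens, which an MLP then projects into the language model’s embedding space. All dimensions and the number of queries are assumptions chosen for illustration.

```python
# Illustrative sketch of learned Perceiver pooling + MLP modality projection.
# Dimensions (vision_dim, text_dim, num_queries) are hypothetical values.
import torch
import torch.nn as nn

class PerceiverPoolingProjector(nn.Module):
    def __init__(self, vision_dim=1152, text_dim=4096, num_queries=64, num_heads=8):
        super().__init__()
        # Learned latent queries: one row per output visual token.
        self.queries = nn.Parameter(torch.randn(num_queries, vision_dim))
        self.cross_attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)
        # MLP that maps the pooled visual tokens into the LM's hidden size.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, patch_features):  # (batch, num_patches, vision_dim)
        batch = patch_features.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        # Queries attend to all image patches, pooling them into num_queries tokens.
        pooled, _ = self.cross_attn(q, patch_features, patch_features)
        return self.proj(pooled)  # (batch, num_queries, text_dim)

# A 1024-patch image is compressed to 64 visual tokens in the LM's space.
feats = torch.randn(2, 1024, 1152)
print(PerceiverPoolingProjector()(feats).shape)  # torch.Size([2, 64, 4096])
```

The design choice this illustrates is the trade-off of the pooling step: instead of feeding every image patch to the language model, a small fixed budget of visual tokens keeps the sequence short regardless of image size.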
Idefics2 also takes a more refined approach to image preprocessing: rather than resizing every image to a fixed square, as is conventional in computer vision, it can process images at their native resolutions and aspect ratios.
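The following sketch illustrates that strategy in isolation, not Idefics2’s actual preprocessing pipeline: only the longest edge is capped, so the aspect ratio is preserved and smaller images pass through untouched. The 980-pixel cap is an assumed value for illustration.

```python
# Illustrative sketch of aspect-ratio-preserving resizing, in contrast to
# the conventional resize-to-fixed-square. The 980-pixel cap is an assumption.
from PIL import Image

def resize_keep_aspect(image: Image.Image, longest_edge: int = 980) -> Image.Image:
    width, height = image.size
    scale = longest_edge / max(width, height)
    if scale >= 1.0:
        return image  # Already within bounds; keep the native resolution.
    new_size = (round(width * scale), round(height * scale))
    return image.resize(new_size, Image.BICUBIC)

# A 1600x900 image becomes 980x551: same aspect ratio, no square cropping.
img = Image.new("RGB", (1600, 900))
print(resize_keep_aspect(img).size)  # (980, 551)
```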