Hugging Face introduced IDEFICS (Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS), an open-access visual language model that accepts arbitrary sequences of images and text and produces text.
IDEFICS, an 80-billion-parameter multimodal model, is designed to process combinations of images and text and generate coherent textual responses. Its capabilities include answering questions about images, describing visual content, and crafting narratives grounded in multiple images.
It is based on Flamingo, a state-of-the-art visual language model initially developed by DeepMind, which has not been released publicly.
IDEFICS was trained on a blend of openly accessible datasets, including Wikipedia, the Public Multimodal Dataset, and LAION. Hugging Face also introduced a new dataset named OBELICS, comprising 141 million interleaved image-text documents sourced from the web and containing 353 million images in total.
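For readers who want to inspect the training data, OBELICS is published on the Hugging Face Hub and can be streamed with the `datasets` library. The sketch below assumes the dataset identifier `HuggingFaceM4/OBELICS`; the exact name and record schema should be verified on the Hub.

```python
from datasets import load_dataset

# Stream OBELICS rather than downloading all 141M documents locally.
# The dataset identifier is an assumption; verify it on the Hugging Face Hub.
obelics = load_dataset("HuggingFaceM4/OBELICS", split="train", streaming=True)

# Each record interleaves image references and text spans from one web document.
first_doc = next(iter(obelics))
print(first_doc.keys())
```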
IDEFICS serves as an open-access counterpart to Flamingo, showing performance on par with the proprietary model across diverse image-text comprehension benchmarks. It comes in two variants, a base version and an instructed version, each available in 9-billion and 80-billion parameter sizes.
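To illustrate how interleaved image-text prompting works in practice, here is a minimal inference sketch using the `transformers` library. The class name `IdeficsForVisionText2Text`, the checkpoint id `HuggingFaceM4/idefics-9b-instruct`, the prompt format, and the placeholder image URL are assumptions drawn from Hugging Face's release materials and should be checked against the current model card before use.

```python
import torch
from transformers import IdeficsForVisionText2Text, AutoProcessor

# Checkpoint id is an assumption based on the release announcement.
checkpoint = "HuggingFaceM4/idefics-9b-instruct"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = IdeficsForVisionText2Text.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16
).to(device)
processor = AutoProcessor.from_pretrained(checkpoint)

# A prompt is an arbitrary interleaving of text and images
# (image URLs or PIL images); replace the URL with any image of your own.
prompts = [
    [
        "User: What is happening in this image?",
        "https://example.com/dog.jpg",  # placeholder image URL
        "<end_of_utterance>",
        "\nAssistant:",
    ],
]

inputs = processor(prompts, return_tensors="pt").to(device)
generated_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```

The 9-billion-parameter instructed checkpoint fits on a single high-memory GPU in bfloat16; the 80-billion-parameter variant typically requires multiple GPUs or quantization.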
Interestingly, OpenAI hasn't been able to make ChatGPT multimodal yet. Also, as of now, the multimodal features of GPT-4 are not accessible through the API. OpenAI's blog post mentions that users can currently make text-only requests to the GPT-4 model, and the capability to input images is still in a limited alpha stage.
OpenAI introduced Code Interpreter in ChatGPT Plus. Many termed it the GPT-4.5 moment, but interestingly it relied on old-school OCR from Python libraries rather than multimodal capabilities for working with images.
Apart from IDEFICS, Bard and Bing also accept images as input and generate text. You can try IDEFICS here.