Adept Releases Fuyu-8B for Multimodal AI Agents

The model can understand charts, documents, and diagrams, with its newly improved OCR capabilities.

Amidst the hype around multimodal AI models and AI agents, Adept has unveiled Fuyu-8B, a scaled-down version of its multimodal model that is now accessible through Hugging Face. The model can understand charts, documents, and diagrams, with its newly improved OCR capabilities.

Check out the model here.

This new model has garnered considerable attention for several key reasons, chief among them its simplified architecture. Fuyu-8B also has a simpler training process than other multimodal models, which makes it more accessible, scalable, and deployable.

It is designed specifically for digital AI agents. It handles arbitrary image resolutions, answers questions about graphs, diagrams, and user interfaces, and performs precise localization on screen images. Perhaps most notably, Fuyu-8B is fast, delivering responses for large images in under 100 milliseconds.

Despite being optimised for specific applications, Fuyu-8B performs admirably in standard image understanding benchmarks, such as visual question-answering and natural-image-captioning.

The Fuyu model eschews the complex and convoluted architecture of its counterparts. Instead, it employs a vanilla decoder-only transformer, omitting the need for a separate image encoder. Image patches are linearly projected into the first layer of the transformer, simplifying the model’s structure.
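A minimal sketch of that idea, in plain Python rather than a tensor library, may help: an image is cut into fixed-size patches, each patch is flattened, and a single weight matrix maps it to the transformer's embedding width. The function names, the patch size, and the toy weights are illustrative assumptions and not Adept's code.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of pixel values (toy stand-in)


def extract_patches(image: Image, patch: int) -> List[List[float]]:
    """Cut the image into patch x patch tiles in raster order and flatten each."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("this sketch assumes dimensions divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [image[top + r][left + c] for r in range(patch) for c in range(patch)]
            patches.append(flat)
    return patches


def linear_project(vec: List[float], weights: List[List[float]]) -> List[float]:
    """Multiply a flattened patch by a (patch_dim x embed_dim) weight matrix."""
    embed_dim = len(weights[0])
    return [sum(v * row[j] for v, row in zip(vec, weights)) for j in range(embed_dim)]


# Toy example: 4x4 image, 2x2 patches, embedding width 3.
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
weights = [[1.0, 0.0, 0.5] for _ in range(4)]  # hypothetical, untrained weights
embeddings = [linear_project(p, weights) for p in extract_patches(image, 2)]
```

Each entry in `embeddings` plays the role of one input token to the decoder. In this simplified picture no separate image encoder sits between the pixels and the transformer.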

This architectural streamlining allows Fuyu to support image resolutions of any size, treating image tokens as it does text tokens. Special image-newline characters indicate line breaks, and the model utilises its existing position embeddings to adapt to different image sizes. This approach eliminates the need for separate high and low-resolution training stages, vastly simplifying the training and inference process.
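The row-break scheme can be illustrated with a short sketch. It lays out placeholder patch tokens in raster order and appends a newline marker after each row, so images of different shapes simply produce sequences of different lengths. The token name, the patch size, and the padding of edge patches are assumptions made for illustration, not details confirmed by the article.

```python
from typing import List

IMAGE_NEWLINE = "<image_newline>"  # placeholder name for the row-break token


def image_token_layout(height: int, width: int, patch: int) -> List[str]:
    """Return raster-ordered patch placeholders with a newline token after each row."""
    rows = -(-height // patch)  # ceiling division; assumes edge patches are padded
    cols = -(-width // patch)
    tokens: List[str] = []
    for r in range(rows):
        tokens.extend(f"patch_{r}_{c}" for c in range(cols))
        tokens.append(IMAGE_NEWLINE)
    return tokens


# Two images with different aspect ratios, same hypothetical patch size.
wide = image_token_layout(height=60, width=90, patch=30)
tall = image_token_layout(height=90, width=30, patch=30)
```

Because the newline marker tells the model where each row ends, the same sequence format covers a wide screenshot and a tall document without resizing either to a fixed resolution.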

To assess these architectural changes, Adept evaluated the model on prominent image-understanding datasets, including VQAv2, OKVQA, COCO Captions, and AI2D. Fuyu-8B demonstrated robust performance, even on natural images. It notably outperformed models like Qwen-VL and PaLM-E-12B on multiple metrics, despite having significantly fewer parameters. Even the Fuyu-Medium variant held its own against PaLM-E-562B with only a fraction of its parameters.

Although PaLI-X remains the leader on these benchmarks because it is fine-tuned for each specific task, it is worth noting that optimising for these benchmarks is not Adept's primary focus. Nevertheless, Fuyu-8B and its variants are promising additions to the field of multimodal models, offering a simpler yet highly effective alternative.

Mohit Pandey

Mohit dives deep into the AI world to bring out information in simple, explainable, and sometimes funny words. He also holds a keen interest in photography, filmmaking, and the gaming industry.