Ahead of Apple’s flagship event, WWDC 2024, in June, the tech giant is going all in on bringing generative AI to its products. Enter Ferret-UI, a specialised LLM tailored to the nuanced demands of mobile user interface comprehension and interaction.
In the paper “Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs”, the authors present Ferret-UI as an answer to the limitations existing LLMs face when working with UI screens.
While general-purpose LLMs like GPT-3 have garnered attention for their versatility, they often struggle to understand and interact effectively with UI screens, especially on mobile. Ferret-UI’s core strength is its multimodal design: advanced language understanding combined with visual comprehension tailored to mobile UI screens, with referring, grounding, and reasoning capabilities built in.
Under the Hood
One of the key challenges in adapting LLMs to UI screens is how different these screens are from natural images. UI screens often have elongated aspect ratios and contain smaller objects of interest, such as icons and text, which are not typically encountered in natural images. To address this, Ferret-UI integrates a mechanism called “any resolution,” allowing it to handle screens of varying aspect ratios and magnify details for enhanced visual feature extraction. By encoding each sub-image separately before feeding the features to the LLM, Ferret-UI ensures that no critical visual information is lost during processing.
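As a rough illustration of that “any resolution” idea, the sketch below keeps a full-screen view and adds two magnified sub-images chosen by the screen’s aspect ratio, then encodes each view separately. The exact grid and the `image_encoder` call are assumptions for illustration, not the paper’s actual code.

```python
from PIL import Image


def split_screen_anyres(screen: Image.Image):
    """Minimal sketch: a global view plus sub-images picked by aspect ratio."""
    w, h = screen.size
    views = [screen]  # full-screen (global) view
    if h >= w:
        # Portrait screen: split horizontally into top and bottom halves
        views += [screen.crop((0, 0, w, h // 2)), screen.crop((0, h // 2, w, h))]
    else:
        # Landscape screen: split vertically into left and right halves
        views += [screen.crop((0, 0, w // 2, h)), screen.crop((w // 2, 0, w, h))]
    return views


def encode_views(views, image_encoder):
    # Each view is encoded separately so small icons and text keep their detail;
    # the resulting features would then be passed to the LLM together.
    return [image_encoder(v) for v in views]
```

The magnified sub-images are what let small targets like icons and text labels survive downscaling by the vision encoder.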
Moreover, Ferret-UI employs a new approach to data curation, gathering training samples from a wide range of elementary UI tasks. These tasks include icon recognition, finding text, and widget listing, among others. By training on such diverse tasks, Ferret-UI learns to understand UI elements’ semantics and spatial positioning, enabling it to make distinctions at both broad and detailed levels.
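To make the referring-versus-grounding distinction concrete, here is a hypothetical sketch of how annotated UI elements could be turned into instruction-style training pairs. The field names and prompt wording are assumptions; only the task types (grounding, e.g. “find text”, and referring, e.g. icon or widget recognition) come from the paper.

```python
def make_grounding_sample(element):
    """Grounding-style pair: given a text query, answer with a screen region."""
    x1, y1, x2, y2 = element["bbox"]
    return {
        "prompt": f'Where is the text "{element["text"]}" on the screen?',
        "response": f"[{x1}, {y1}, {x2}, {y2}]",  # bounding box as the answer
    }


def make_referring_sample(element):
    """Referring-style pair: given a screen region, name what is inside it."""
    x1, y1, x2, y2 = element["bbox"]
    return {
        "prompt": f"What is the widget at [{x1}, {y1}, {x2}, {y2}]?",
        "response": element["label"],
    }


example = {"bbox": (120, 40, 360, 88), "text": "Sign in", "label": "button"}
print(make_grounding_sample(example))
print(make_referring_sample(example))
```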
In addition to elementary tasks, Ferret-UI is also trained on specialised tasks, such as detailed description generation, perception-conversation understanding, and function inference. These tasks prepare the model to engage in intricate discussions about visual components, formulate action plans based on specific goals, and interpret the overall purpose of a UI screen.
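The prompts below are illustrative of what those advanced tasks might look like; the wording is an assumption, only the task categories are taken from the article and paper.

```python
# Hypothetical prompts, one per advanced task category
advanced_task_prompts = {
    "detailed_description": "Describe this screen in detail, covering the visible elements.",
    "perception_conversation": "How many options are listed in the settings menu shown here?",
    "interaction_conversation": "I want to turn on dark mode. Where should I tap?",
    "function_inference": "What is the overall purpose of this screen?",
}
```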
To evaluate the effectiveness of Ferret-UI, the authors establish a comprehensive benchmark encompassing various UI tasks. Comparative evaluations with other existing models, including open-source LLMs and GPT-4V, demonstrate Ferret-UI’s superiority, particularly in elementary UI tasks and advanced reasoning capabilities.
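The article does not spell out the scoring, but grounding-style tasks are commonly judged by whether the predicted box overlaps the reference box above an IoU threshold. The sketch below shows that conventional metric as an assumption, not Ferret-UI’s confirmed evaluation code.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0


def grounding_accuracy(predictions, references, threshold=0.5):
    """Fraction of predicted boxes overlapping the reference above the threshold."""
    hits = sum(iou(p, r) >= threshold for p, r in zip(predictions, references))
    return hits / len(references)
```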
If Apple integrates Ferret-UI into Siri, it could be a game-changer for Apple users. Such an integration could improve accessibility features, enable seamless app integration, offer personalised assistance, support natural-language UI navigation, and work better with voice assistive technologies, benefiting users with special needs and improving the overall experience on iOS devices.
This update comes soon after Apple released the MM1 model last month and ReALM (Reference Resolution As Language Modeling) two weeks ago. The company has also forged a $50M licensing deal with Shutterstock to acquire AI training data.