GPT-4 has been the most advanced development in the world of AI so far with its multimodal capabilities. Most recently, a group of researchers has announced MiniGPT-4, an open-source model that performs complex vision-language tasks like GPT-4. The code, demos, and training instructions are available on GitHub.
While OpenAI has announced that GPT-4 is indeed multimodal, the model's ability to process images has not yet been made publicly available. MiniGPT-4, however, can process images.
The researchers have also shown that MiniGPT-4 exhibits many capabilities similar to those of GPT-4, such as generating detailed image descriptions and creating websites from handwritten drafts.
To build MiniGPT-4, the researchers used Vicuna, which is built on LLaMA, as the language decoder, and the pretrained vision components from BLIP-2 as the visual encoder. Interestingly, both Vicuna and BLIP-2 are open source.
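The project reports that both pretrained components are kept frozen and aligned through a single trainable projection layer. The sketch below illustrates that general idea in PyTorch; the module wrappers, dimensions, and forward signature here are illustrative assumptions, not the project's actual code.

```python
import torch
import torch.nn as nn


class FrozenEncoderLLMSketch(nn.Module):
    """Minimal sketch of the alignment pattern described above: a frozen
    vision encoder's output is mapped by one trainable linear layer into
    the embedding space of a frozen language model. All dimensions and
    the placeholder submodules are hypothetical."""

    def __init__(self, vision_encoder: nn.Module, language_model: nn.Module,
                 vision_dim: int = 1408, llm_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.language_model = language_model

        # Freeze both pretrained components; neither receives gradients.
        for p in self.vision_encoder.parameters():
            p.requires_grad = False
        for p in self.language_model.parameters():
            p.requires_grad = False

        # The single trainable projection aligning the two models.
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, image: torch.Tensor,
                text_embeds: torch.Tensor) -> torch.Tensor:
        # (batch, num_visual_tokens, vision_dim)
        visual_feats = self.vision_encoder(image)
        # Project into the LLM's embedding space:
        # (batch, num_visual_tokens, llm_dim)
        visual_embeds = self.proj(visual_feats)
        # Prepend projected visual tokens to the text embeddings and let
        # the frozen language model attend over the combined sequence.
        inputs = torch.cat([visual_embeds, text_embeds], dim=1)
        return self.language_model(inputs)
```

Because only the projection layer's parameters require gradients, training a setup like this is far cheaper than training either pretrained model end to end, which is part of what makes an open reproduction of this kind feasible.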
Given that OpenAI has not revealed many details about GPT-4's architecture (including model size), hardware, training compute, dataset construction, or training method, this open-source mini version of the powerful LLM could prove significant for research.