Microsoft researchers recently published a paper aimed at bringing together the capabilities of ChatGPT and visual foundation models like Stable Diffusion. This architecture, termed ‘Visual ChatGPT’, aims to bridge the gap between text-to-image generation and natural language interaction.
As predicted by AIM, this seems to be the way forward for text-to-image algorithms. The approach combines the strengths of an LLM like ChatGPT with the power of image generation, providing a comprehensive package that compensates for the shortcomings of each platform. By bringing natural language processing to parameter-driven image generation models, it becomes possible to interact with AI in a more organic way.
How does Visual ChatGPT work?
Put simply, the demo adds the capability to share images with ChatGPT. This is achieved by using a ‘prompt manager’ to share information between ChatGPT and various visual foundation models, such as Stable Diffusion, ControlNet, and BLIP.
‘Visual foundation models’, or VFMs, is a term used to describe a set of fundamental algorithms for computer vision. These algorithms can form the basis of more complex models and are used to impart standardised computer vision capabilities to AI applications.
The prompt manager interfaces between ChatGPT and these VFMs to process the output seamlessly. Consider a restaurant kitchen: ChatGPT is the waiter taking the customers' orders, while the VFMs are the chefs whipping up the dishes. The prompt manager takes on the role of the kitchen manager, relaying orders and food between the waiters and the chefs.
The flowchart of how the prompt manager works in the architecture. (Source: Microsoft Research)
As such, the prompt manager includes some logic, such as a reasoning format that helps ChatGPT decide whether it needs to use a tool (like a VFM) to produce the necessary output. The prompt manager also handles the iterative reasoning used to fine-tune the output image, along with housekeeping tasks such as managing and keeping track of the filenames of images in ChatGPT's output.
The prompt manager is really at the heart of this system, as it is what ChatGPT calls on to answer any non-language query. In a way, the prompt manager stands in for the user, steering ChatGPT towards the required output through a series of tailored prompts. The result is a much more capable version of ChatGPT that is less prone to hallucination, as it is forced to call on the capabilities of VFMs through the prompt manager.
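The flow described above can be sketched in a few lines of Python. This is a toy illustration, not the actual Visual ChatGPT implementation: the class and function names are hypothetical, the tool-selection step is reduced to keyword matching (the real system lets the LLM decide via a structured reasoning format), and the VFM is a stub that only returns a filename.

```python
# Minimal sketch of a prompt-manager loop (hypothetical names; the real
# Visual ChatGPT implementation differs). The manager decides whether a
# query needs a visual tool, dispatches to a stub VFM, and keeps track
# of generated image filenames so later turns can refer back to them.

def stub_text_to_image(prompt: str) -> str:
    """Stand-in for a VFM like Stable Diffusion: returns a filename."""
    return f"image/{abs(hash(prompt)) % 10000}.png"

class PromptManager:
    def __init__(self):
        self.tools = {"text_to_image": stub_text_to_image}
        self.image_history = []  # housekeeping: track generated files

    def needs_tool(self, query: str) -> bool:
        # Toy reasoning step: in the real system, ChatGPT itself decides
        # whether a tool is needed, guided by the prompt manager's format.
        return any(w in query.lower() for w in ("draw", "generate", "image"))

    def handle(self, query: str) -> str:
        if self.needs_tool(query):
            filename = self.tools["text_to_image"](query)
            self.image_history.append(filename)
            return f"Generated {filename}"
        return "Answered directly by the language model."

manager = PromptManager()
print(manager.handle("Generate an image of a cat"))
print(manager.handle("What is a foundation model?"))
```

The key design point the paper makes is visible even in this sketch: language-only queries never touch the VFMs, while visual queries are forced through a tool call, so the answer is grounded in an actual generated artifact rather than a hallucinated description.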
While Visual ChatGPT is capable in and of itself, it sets a precedent that is more fascinating. Is it possible to bring together the formidable capabilities of LLMs and visual models, and could this be one of the first steps towards AGI?
Changing the face of text-to-image
There is a fundamental problem with how text-to-image models work, and that's their lack of understanding when it comes to linguistic context. In a paper exploring the relational understanding of generative AI models, researchers found that these models did not ‘understand’ the physical relations between certain objects.
For example, while the model was capable of creating images for ‘a child touching a bowl’, it was not able to create an image of ‘a monkey touching an iguana’. This is because the training data does not contain enough examples of the latter scenario, leading to inadequate responses. To overcome this limitation of text-to-image models, a new job has emerged: the AI whisperer, or prompt engineer.
The process to make AI models ‘understand’ humans is still uncharted territory, which is slowly being mapped out by up-and-coming AI artists. That’s why we have websites like ‘PromptHero’, a repository of prompts for text-to-image algorithms that just work, and that’s also why a seemingly meaningless word soup can provide stunning AI imagery. Consider the example below.
As the example shows, getting a solid output from a text-to-image model requires a comprehensive knowledge of what to prompt for. Negative prompts are also used to steer the model away from unwanted characteristics in the completed image. Looking at the direction Microsoft's prompt manager is taking, it seems that this job's prospects may be over before they even begin.
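To make the contrast concrete, here is the kind of request a prompt engineer hand-crafts today. The field names below are illustrative only, not the schema of any specific API; the keyword-heavy ‘word soup’ and the negative prompt are the parts that currently require expertise.

```python
# Hypothetical request payload for a Stable Diffusion-style model
# (field names are illustrative, not any specific library's schema).
request = {
    "prompt": ("portrait of an astronaut, intricate detail, volumetric "
               "lighting, trending on artstation, 8k, sharp focus"),
    "negative_prompt": "blurry, extra fingers, low resolution, watermark",
    "steps": 30,           # number of denoising steps
    "guidance_scale": 7.5, # how strongly the image follows the prompt
}

# The style-keyword 'word soup' steers the model toward a known-good
# region of its training distribution; the negative prompt pushes it
# away from common failure modes.
print(request["prompt"])
```

A prompt-manager architecture hides exactly this layer: the user states an intent in plain language, and the system assembles something like the payload above on their behalf.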
From the examples provided on the GitHub page, it is clear that users do not need to engage in such complex prompting to convey information to the model. They can simply type, in natural language, what they want from it. For example, after generating the image of a cat, the user asks ChatGPT to replace the cat with a dog. Without any complex prompts, the image is generated, and the user can iteratively make changes to it, such as changing its colour.
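The iterative editing described above amounts to threading state between turns: each instruction operates on the latest image, and the manager records the chain of edits. A toy sketch of that bookkeeping, with hypothetical helper names (the real system routes each instruction through ChatGPT and a VFM):

```python
# Toy sketch of iterative, natural-language image editing. Each
# instruction yields a new image version derived from the previous one,
# mimicking 'replace the cat with a dog' / 'change its colour' edits.

def apply_edit(state: dict, instruction: str) -> dict:
    """Record the instruction and point the state at a new revision."""
    new_state = dict(state)
    new_state["history"] = state["history"] + [instruction]
    new_state["image"] = f"image_v{len(new_state['history'])}.png"
    return new_state

state = {"image": "image_v0.png", "history": []}
for instruction in ["replace the cat with a dog",
                    "make the dog's fur brown"]:
    state = apply_edit(state, instruction)

print(state["image"])    # latest revision
print(state["history"])  # the full chain of edits
```

Keeping the edit history around is what lets a conversational system resolve references like "it" or "the dog" to a concrete file from an earlier turn.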
Tools like Visual ChatGPT not only lower the barrier to entry for text-to-image models, they can also add interoperability between various AI tools. LLMs and T2I models previously existed in silos, but through technologies like the prompt manager, we might be able to amplify the capabilities of these state-of-the-art models.