
What Is StyleCLIP?

The authors have investigated three techniques in particular that combine CLIP with StyleGAN.


In 2018, NVIDIA researchers introduced the Style Generative Adversarial Network (StyleGAN), a significant extension of the GAN architecture. StyleGAN injects a learned "style" at each convolution layer of the generator, giving control over the image features expressed at that layer's scale.
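To make the layer-wise style injection concrete, below is a minimal PyTorch sketch of AdaIN-style modulation, the mechanism StyleGAN builds on. The class name, shapes, and the scale-and-bias formulation are illustrative and not taken from NVIDIA's official implementation.

```python
# Minimal sketch of AdaIN-style modulation: normalise the features, then
# re-scale and shift them per channel using an affine map of the style vector.
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    def __init__(self, latent_dim, num_channels):
        super().__init__()
        # Affine map from the style vector w to per-channel scale and bias.
        self.affine = nn.Linear(latent_dim, num_channels * 2)
        self.norm = nn.InstanceNorm2d(num_channels)

    def forward(self, x, w):
        # x: feature maps (N, C, H, W); w: style vector (N, latent_dim)
        scale, bias = self.affine(w).chunk(2, dim=1)
        x = self.norm(x)
        return x * (1 + scale[:, :, None, None]) + bias[:, :, None, None]

# Example: modulate 64-channel features with a 512-dimensional style vector.
ada = AdaIN(latent_dim=512, num_channels=64)
features = torch.randn(1, 64, 32, 32)
style = torch.randn(1, 512)
out = ada(features, style)  # same shape as `features`
```

Because a different style vector can be fed at each layer, editing the style changes coarse features (pose, shape) at early layers and fine features (texture, colour) at later ones.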

Recently, teams from Adobe, Tel Aviv University, and the Hebrew University of Jerusalem developed Style Contrastive Language Image Pre-Training (StyleCLIP) by applying CLIP techniques to StyleGANs.

CLIP is a neural network introduced by OpenAI in January 2021. It learns visual concepts from natural language supervision and can be applied to any visual classification benchmark simply by providing the names of the visual categories to be recognised.
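As an illustration of that text-based interface, the snippet below follows the usage pattern of OpenAI's open-source clip package (github.com/openai/CLIP); the image path and the two candidate labels are placeholders.

```python
# Zero-shot classification with CLIP: the "visual categories" are supplied
# as plain-text prompts, and the model scores the image against each one.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a dog", "a photo of a cat"]).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

print(probs)  # probability assigned to each text label for the image
```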

What Is StyleCLIP?

GANs have revolutionised image synthesis, generating some of the most realistic synthetic images to date. Further, the learnt intermediate latent spaces of StyleGAN have shown disentanglement properties, performing a wide variety of image manipulations on synthetic as well as real images.

However, discovering such controls for a StyleGAN model requires manual examination, a large amount of annotated data, or pretrained classifiers. The existing controls enable image manipulation only along preset semantic directions, which limits users' creativity and imagination. To use any additional, unmapped direction, the user must put in strenuous manual effort, which again requires a large amount of annotated data.

Enter StyleCLIP.

By leveraging CLIP, the team has eliminated the need for manual effort to discover new controls and enabled intuitive text-based semantic image manipulation that is not dependent on preset manipulation directions. The CLIP model is pretrained on 400 million image-text pairs. “Natural language is able to express a much wider set of visual concepts, combining CLIP with the generative power of StyleGAN opens fascinating avenues for image manipulation,” the authors noted.

The authors have investigated three techniques in particular that combine CLIP with StyleGAN:

Text-guided latent optimisation: Here, the CLIP model is used as a loss network and the manipulation is derived directly from the text input. According to the authors, this is a generic approach that can be applied across different domains without manipulation- or domain-specific data annotation. While versatile, it requires a few minutes of optimisation to apply a manipulation to an image.
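A heavily simplified sketch of such an optimisation loop is shown below. The generator and w_source arguments stand in for a pretrained StyleGAN generator and the latent code of the image being edited; the loss weights are illustrative, and the paper's identity-preservation term and CLIP image preprocessing are omitted.

```python
# Text-guided latent optimisation sketch: adjust the latent code so the
# generated image moves toward the text prompt under CLIP, while an L2 term
# keeps the edit close to the original latent.
import torch
import torch.nn.functional as F
import clip

def text_guided_edit(generator, w_source, prompt,
                     steps=300, lr=0.01, l2_weight=0.008, device="cuda"):
    model, _ = clip.load("ViT-B/32", device=device)
    text = clip.tokenize([prompt]).to(device)
    with torch.no_grad():
        text_features = model.encode_text(text)

    w = w_source.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([w], lr=lr)

    for _ in range(steps):
        image = generator(w)                                   # (N, 3, H, W)
        image_224 = F.interpolate(image, size=(224, 224), mode="bilinear")
        image_features = model.encode_image(image_224)
        clip_loss = 1 - F.cosine_similarity(image_features, text_features).mean()
        l2_loss = ((w - w_source) ** 2).mean()                 # stay near the source latent
        loss = clip_loss + l2_weight * l2_loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return w.detach()
```

Because every edit re-runs this loop, the method is flexible but slow, which motivates the latent mapper described next.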

Latent mapper: While latent optimisation is a versatile approach, it requires a lot of time to edit a single image. To overcome this, the authors propose a more efficient process where a mapping network is trained for a particular text prompt to infer a manipulation step for any given image embedding.
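The following sketch shows the latent-mapper idea in PyTorch. The plain MLP over a W+ code is a simplification (the paper trains separate mappers for coarse, medium, and fine layers), and the 0.1 scaling of the predicted step is an assumption.

```python
# Latent mapper sketch: a small network, trained once per text prompt, that
# predicts a manipulation step for any latent code in a single forward pass.
import torch
import torch.nn as nn

class LatentMapper(nn.Module):
    def __init__(self, latent_dim=512, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, w):
        # w: (N, num_layers, latent_dim) codes in W+ space; returns a residual step.
        return 0.1 * self.net(w)

# Training (not shown) minimises a CLIP loss like the one above, plus
# regularisation terms. At inference the edit is just one forward pass:
mapper = LatentMapper()
w = torch.randn(1, 18, 512)       # stand-in for an inverted image's latent code
w_edited = w + mapper(w)          # feed w_edited to the StyleGAN generator
```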

Input-agnostic directions in StyleGAN: The authors introduced a method for mapping a text prompt to an input-agnostic, global direction in StyleGAN's style space. This allows control over both the manipulation strength and the degree of disentanglement.
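A rough sketch of the text side of this global-direction approach is given below. Mapping the CLIP-space direction to a style-space direction (done in the paper via per-channel relevance estimates) is omitted, and the prompts are placeholders.

```python
# Global-direction sketch: the difference between the CLIP embeddings of a
# target prompt and a neutral prompt defines an input-agnostic edit direction.
import torch
import clip

model, _ = clip.load("ViT-B/32", device="cpu")

with torch.no_grad():
    target = model.encode_text(clip.tokenize(["a face with glasses"]))
    neutral = model.encode_text(clip.tokenize(["a face"]))

delta_t = target - neutral
delta_t = delta_t / delta_t.norm()   # normalised direction in CLIP space

# Converting delta_t into a style-space direction delta_s is the core of the
# method and is omitted here. The edit for any image's style code s is then:
#     s_edited = s + alpha * delta_s
# where alpha controls strength and a threshold beta controls disentanglement.
```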

This paper and the supplementary material demonstrate a wide range of semantic manipulations on images of human faces, animals, cars, and churches. These manipulations range from abstract to specific, and from extensive to fine-grained. Many of them have not been demonstrated by any of the previous StyleGAN manipulation works, and all of them were easily obtained using a combination of pretrained StyleGAN and CLIP models.

Limitations

Since the method relies on a pretrained StyleGAN generator and the CLIP model for joint language-vision embedding, StyleCLIP cannot manipulate images that lie outside the pretrained generator's domain.

Similarly, text prompts that map into areas of CLIP space that are not well covered by images will not yield a manipulation that faithfully reflects the semantics of the prompt. The team also said it is difficult to achieve significant manipulations in visually diverse datasets.

Read the full paper here.

Shraddha Goled
