What Is StyleCLIP?

In 2018, NVIDIA researchers introduced the Style Generative Adversarial Network (StyleGAN), a significant extension to the GAN architecture. StyleGAN adjusts the image’s style at each convolution layer, which controls the features generated at that layer.

Recently, teams from Adobe, Tel Aviv University, and the Hebrew University of Jerusalem developed Style Contrastive Language Image Pre-Training (StyleCLIP) by applying CLIP techniques to StyleGANs.

CLIP is a neural network introduced by OpenAI in January 2021. It learns visual concepts from natural language supervision and can be applied to any visual classification benchmark simply by providing the names of the visual categories to be recognised.
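The idea of classifying by category name can be sketched as follows: embed the image and each candidate name into a shared space, then pick the name whose embedding is most similar to the image’s. This is a minimal sketch, not CLIP itself. The three-dimensional vectors are made-up placeholders for what CLIP’s image and text encoders would produce, and `zero_shot_classify` is a hypothetical helper name.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_embedding, text_embeddings):
    """Return the label whose text embedding is closest to the image."""
    scores = {
        label: cosine_similarity(image_embedding, emb)
        for label, emb in text_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Assumption: toy embeddings standing in for real encoder outputs.
text_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.1],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.2, 0.8, 0.1]  # pretend this came from a dog photo

label, scores = zero_shot_classify(image_embedding, text_embeddings)
print(label)
```

No classifier is trained for the three categories here; changing the set of labels only means changing the text that gets embedded.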

GANs have revolutionised image synthesis, generating some of the most realistic synthetic images to date. Further, the learnt intermediate latent spaces of StyleGAN have shown disentanglement properties, which make it possible to perform a wide variety of image manipulations on synthetic as well as real images.

However, discovering these manipulation controls in a StyleGAN model requires manual examination, a large amount of annotated data, or pretrained classifiers. Existing controls also allow image manipulation only along preset semantic directions, which limits users’ creativity and imagination. Using any additional, unmapped direction requires further strenuous manual effort and, again, a large amount of annotated data.

Enter StyleCLIP.

By leveraging CLIP, the team has eliminated the need for manual effort to discover new controls and has enabled intuitive, text-based semantic image manipulation that does not depend on preset manipulation directions. The CLIP model is pretrained on 400 million image-text pairs. “Natural language is able to express a much wider set of visual concepts, combining CLIP with the generative power of StyleGAN opens fascinating avenues for image manipulation,” the authors noted.

The authors have investigated three techniques in particular that combine CLIP with StyleGAN:

Text-guided latent optimisation: Here, the CLIP model is used as a loss network, and the manipulation is derived directly from the text input. As per the authors, this is a generic approach that can be used across different domains without domain- or manipulation-specific data annotation. Although it is versatile, it needs a few minutes of optimisation to manipulate a single image.
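Using CLIP as a loss network can be illustrated with a toy sketch: start from the latent code of the source image and repeatedly nudge it so that the generated image’s embedding moves closer to the text embedding. Everything here is an assumption made for illustration. `toy_generator` is a small linear map standing in for StyleGAN followed by CLIP’s image encoder, the gradient is estimated by finite differences rather than backpropagation, and the `l2_weight` penalty that keeps the edited code near the source is a hypothetical choice of this sketch.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def toy_generator(w):
    # Hypothetical stand-in for "generate an image, then embed it":
    # a fixed linear map from a 3-D latent code to a 3-D embedding.
    return [w[0] + 0.5 * w[1], w[1] - 0.2 * w[2], w[2] + 0.3 * w[0]]

def objective(w, w_source, text_emb, l2_weight):
    # Loss = distance to the text in embedding space
    #        + a penalty for drifting away from the source latent.
    clip_loss = 1.0 - cosine(toy_generator(w), text_emb)
    l2 = sum((a - b) ** 2 for a, b in zip(w, w_source))
    return clip_loss + l2_weight * l2

def numerical_grad(f, w, eps=1e-5):
    # Central finite differences, one coordinate at a time.
    grad = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

def optimise_latent(w_source, text_emb, steps=200, lr=0.1, l2_weight=0.05):
    w = list(w_source)
    f = lambda v: objective(v, w_source, text_emb, l2_weight)
    for _ in range(steps):
        g = numerical_grad(f, w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]  # plain gradient descent
    return w

w_source = [1.0, 0.0, 0.0]   # latent code of the image to edit (toy)
text_emb = [0.0, 1.0, 0.0]   # embedding of the text prompt (toy)
w_edited = optimise_latent(w_source, text_emb)

print(cosine(toy_generator(w_source), text_emb))
print(cosine(toy_generator(w_edited), text_emb))
```

The loop runs once per image and per prompt, which is why this approach is flexible but slow when the generator and loss network are full-sized models.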

Latent mapper: While latent optimisation is a versatile approach, it requires a lot of time to edit a single image. To overcome this, the authors propose a more efficient process where a mapping network is trained for a particular text prompt to infer a manipulation step for any given image embedding.
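The difference from per-image optimisation can be sketched as below: once a mapper has been trained for a prompt, editing an image is a single forward pass whose output is added to the latent code. This is only an illustration under stated assumptions. The one-layer `tanh` mapper, its hard-coded weights, and the `strength` factor are all hypothetical, and the training step is omitted.

```python
import math

def make_mapper(weights, bias):
    """Build a one-layer mapper M(w) = tanh(W w + b) with fixed weights.

    Assumption: in practice the weights would be learned once for a single
    text prompt; here they are hard-coded placeholders.
    """
    def mapper(w):
        return [
            math.tanh(sum(wij * wj for wij, wj in zip(row, w)) + bi)
            for row, bi in zip(weights, bias)
        ]
    return mapper

def edit_latent(w, mapper, strength=0.1):
    """Apply the inferred manipulation step: w' = w + strength * M(w)."""
    step = mapper(w)
    return [wi + strength * si for wi, si in zip(w, step)]

# Hypothetical "pretrained" mapper for one prompt.
mapper = make_mapper(
    weights=[[0.5, -0.2, 0.0], [0.0, 0.8, 0.1], [0.3, 0.0, -0.4]],
    bias=[0.0, 0.1, -0.1],
)

# Two different images (as toy latent codes) edited with the same mapper.
latents = [[1.0, 0.0, 0.5], [-0.5, 2.0, 0.0]]
edited = [edit_latent(w, mapper) for w in latents]
print(edited)
```

Because the step is computed from the input latent, each image receives its own step, yet no optimisation loop runs at edit time.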

Input-agnostic direction in StyleGAN: The authors introduce a method for mapping a text prompt to an input-agnostic, global direction in StyleGAN’s style space. This allows great control over the manipulation strength and the degree of disentanglement.
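The two controls mentioned for the global direction can be sketched with a toy function: one parameter scales the shift, and another drops weakly affected channels so that fewer attributes change at once. This is a simplified sketch under assumptions. The function name, the parameter names `alpha` and `beta`, the thresholding rule, and the four-channel direction are illustrative placeholders, and the step that derives a direction from a text prompt is not shown.

```python
def apply_global_direction(style, direction, alpha=1.0, beta=0.0):
    """Shift a style code along one shared direction.

    alpha scales the edit (manipulation strength); beta zeroes channels whose
    direction entry is smaller in magnitude than the threshold, so fewer
    channels move and the edit stays more localised.
    """
    filtered = [d if abs(d) >= beta else 0.0 for d in direction]
    return [s + alpha * d for s, d in zip(style, filtered)]

# Assumption: a toy direction that would correspond to one text prompt.
direction = [0.9, 0.05, -0.4, 0.01]

style_a = [0.0, 1.0, 2.0, 3.0]  # style code of one image (toy)
style_b = [5.0, 5.0, 5.0, 5.0]  # style code of another image (toy)

mild = apply_global_direction(style_a, direction, alpha=1.0, beta=0.1)
strong = apply_global_direction(style_a, direction, alpha=3.0, beta=0.1)
entangled = apply_global_direction(style_a, direction, alpha=1.0, beta=0.0)
other = apply_global_direction(style_b, direction, alpha=1.0, beta=0.1)

print(mild, strong, entangled, other)
```

The same `direction` is reused for both images, which is what makes it input-agnostic: only `alpha` and `beta` change between edits.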

This paper and the supplementary material demonstrate a wide range of semantic manipulations on images of human faces, animals, cars, and churches. These manipulations range from abstract to specific, and from extensive to fine-grained. Many of them have not been demonstrated by any of the previous StyleGAN manipulation works, and all of them were easily obtained using a combination of pretrained StyleGAN and CLIP models.


Since the method relies on a pretrained StyleGAN generator and on the CLIP model for a joint language-vision embedding, StyleCLIP cannot manipulate images that lie outside the pretrained generator’s domain.

Text prompts that map into areas of CLIP space that are not well populated by images will not yield a visual manipulation that faithfully reflects the semantics of the prompt. The team also said it is difficult to achieve significant manipulations in visually diverse datasets.

Read the full paper here.
