Guide To Video GPT: A Transformer-Based Architecture For Video Generation

Video GPT is a novel machine learning architecture that employs likelihood-based generative modelling for video synthesis.


Published on April 29, 2021

by Nikita Shiledarbaxi

Video GPT is a machine learning architecture that applies likelihood-based generative modelling to video synthesis. It was recently introduced by Wilson Yan, Yunzhi Zhang, Pieter Abbeel and Aravind Srinivas. (research paper).

Before going into the detailed workings of Video GPT, let us have a quick look at some background terminology.

An autoencoder is an artificial neural network trained in an unsupervised manner. It learns to compress (encode) input data into a lower-dimensional representation, discarding noise in the process, and then to reconstruct from that encoding an output that closely resembles the original input. Visit this page to know more about autoencoders.

Latent space is the space of compressed data representations, in which similar data points lie close to each other. Read this article for the details.

A Variational AutoEncoder (VAE) does not output a single value for each encoding dimension. Instead, it outputs a probability distribution for each attribute in the latent space. Check out this weblink for more details.
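The idea of outputting a distribution rather than a single value can be illustrated with a small sketch. It assumes the common setup in which the encoder produces a mean and a log-variance per latent dimension and a latent is drawn as mean + sigma * noise; the function name and the numbers are hypothetical, and no neural network is involved.

```python
import math
import random

def sample_latent(mean, log_var, rng):
    """Draw one latent vector from a diagonal Gaussian described by
    a per-dimension mean and log-variance (assumed encoder outputs)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mean, log_var)
    ]

rng = random.Random(0)      # fixed seed so the sketch is repeatable
mean = [0.5, -1.0]          # hypothetical encoder output: means
log_var = [-2.0, 0.0]       # hypothetical encoder output: log-variances

# Two draws for the same input generally differ, unlike a plain autoencoder
z1 = sample_latent(mean, log_var, rng)
z2 = sample_latent(mean, log_var, rng)

# With a vanishing variance, the sample collapses onto the mean
z_sharp = sample_latent(mean, [-1000.0, -1000.0], rng)
```

A plain autoencoder corresponds to the last case, where every input maps to exactly one point in the latent space.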

Vector Quantization (VQ) is an encoding-decoding technique in which the encoder receives an input vector and outputs the index of the closest codeword in a codebook. That index is passed to the decoder, which looks up the codeword and uses it as an approximation of the input vector.

VQ-VAE (Vector Quantized Variational AutoEncoder) adds a discrete codebook to a standard autoencoder. It compares the output of the encoder network to all the vectors of the codebook; the closest vector is then fed to the decoder network. Thus, the VQ concept is combined with VAE.
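The nearest-codeword lookup described above can be sketched in a few lines. This is a minimal illustration with a toy, hand-written codebook; in a VQ-VAE the codebook is learned and the vectors come from an encoder network. The function names and values are assumptions made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(vector, codebook):
    """Encoder side: return the index of the closest codeword."""
    return min(
        range(len(codebook)),
        key=lambda i: squared_distance(vector, codebook[i]),
    )

def dequantize(index, codebook):
    """Decoder side: recover the codeword from its index."""
    return codebook[index]

# Toy codebook with three 2-dimensional codewords (illustrative values)
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]

index = quantize([0.9, 1.2], codebook)   # closest codeword is [1.0, 1.0]
approx = dequantize(index, codebook)
print(index, approx)                     # prints: 1 [1.0, 1.0]
```

Only the integer index needs to be stored or transmitted, which is what makes the latent representation discrete.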

GPT (Generative Pre-Training) is a language model that is pre-trained on a large corpus of text and then fine-tuned for the required tasks. (article on OpenAI’s GPT).

Self-attention: Consider three sets of vectors in a deep learning task, viz. ‘query (Q)’, ‘key (K)’ and ‘value (V)’. In ‘attention’, the query and key vectors are multiplied, and the resulting vector of probabilities decides how much of each value is passed on to the subsequent layer. ‘Self-attention’ is the case where Q, K and V all come from the same input sequence. Find the research paper on ‘attention’ here.

Image source

The above figure explains ‘attention’: Q and K are first multiplied using matrix multiplication (and, in scaled dot-product attention, the result is divided by the square root of the key dimension). The result then goes through a softmax function, which creates a probability distribution that is multiplied with V.
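The computation can be sketched in plain Python for small matrices. This sketch assumes scaled dot-product attention, in which the Q·K scores are divided by the square root of the key dimension before the softmax. For simplicity it uses the same matrix for Q, K and V; practical models first apply learned linear projections to the input. All helper names are illustrative.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Return (output, weights) of scaled dot-product attention."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Self-attention on a toy sequence of three 2-dimensional vectors
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = attention(x, x, x)
```

Each row of `weights` is a probability distribution over the sequence positions, and each row of `out` is the corresponding weighted average of the value vectors.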

Overview of Video GPT

Video GPT is a simple model architecture that uses a VQ-VAE to learn downsampled discrete latent representations of raw input video. The VQ-VAE employs 3D convolutions and self-attention, and a GPT-like transformer then models the discrete latents autoregressively.

Image source: Research paper

The above figure explains the working of the Video GPT architecture. The left-hand side of the figure depicts the first stage of operation, which is training a standard VQ-VAE model on raw video. In the second stage (right-hand side), the trained VQ-VAE encodes the raw video data into sequences of discrete latents, and a GPT-like transformer is trained as a prior over these sequences. To generate a video, a latent sequence is sampled from the transformer and decoded by the VQ-VAE into a new video sample resembling the training data.

Pre-trained VQ-VAE models used by Video GPT

bair_stride4x2x2 : trained on 64x64 videos (16 frames each) taken from the BAIR Robot Pushing dataset.

ucf101_stride4x4x4 : trained on 128x128 videos (16 frames each) taken from the UCF-101 dataset.

kinetics_stride4x4x4 : trained on 128x128 videos (16 frames each) taken from the Kinetics-600 dataset.

kinetics_stride2x4x4 : trained on the same data as kinetics_stride4x4x4 but with twice as many latent codes along the temporal axis, resulting in better video reconstruction.

Note: The strides in the above model names denote the amount of downsampling the encoder applies along T, H and W (number of frames, height of each frame, width of each frame).
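To make the effect of the strides concrete, the following sketch computes the size of the latent grid for a given input video. It assumes that each stride divides the corresponding input dimension exactly and that the latent size is simply the input size divided by the stride; `latent_shape` is a hypothetical helper, not part of the Video GPT package.

```python
def latent_shape(video_shape, stride):
    """Return the (T, H, W) size of the latent grid for a video of
    shape (T, H, W) downsampled by the given (T, H, W) strides."""
    for size, step in zip(video_shape, stride):
        if size % step != 0:
            raise ValueError(f"dimension {size} is not divisible by stride {step}")
    return tuple(size // step for size, step in zip(video_shape, stride))

# 16 frames of 128x128 pixels with the two Kinetics strides
print(latent_shape((16, 128, 128), (4, 4, 4)))  # (4, 32, 32)
print(latent_shape((16, 128, 128), (2, 4, 4)))  # (8, 32, 32)

# 16 frames of 64x64 pixels with the BAIR stride
print(latent_shape((16, 64, 64), (4, 2, 2)))    # (4, 32, 32)
```

Under these assumptions, the 2x4x4 model keeps 8 temporal positions instead of 4, which is why its temporal latent codes are twice as large.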

Practical implementation

Here’s a demonstration of how to reconstruct a video using the pre-trained VQ-VAE of Video GPT. The code has been implemented using Python 3.7.10, matplotlib 3.2.2, torch 1.7.1, torchvision 0.8.2 and scikit-video 1.1.11. The step-wise implementation is as follows:

Install Video GPT from GitHub.

!pip install git+https://github.com/wilson1yan/VideoGPT.git

Install scikit-video, a Python library for video processing.

!pip install scikit-video av

Import required libraries and modules.

import os
import matplotlib.pyplot as plt
from matplotlib import animation
from IPython.display import HTML
import torch
from torchvision.io import read_video, read_video_timestamps
from videogpt import download, load_vqvae
from videogpt.data import preprocess

Create a dictionary of videos to choose from for video reconstruction.

vid = {
    'breakdancing': '1OZBnG235-J9LgB_qHv-waHZ4tjofiDgj',
    'bear': '16nIaqq2vbPh-WMo_7hs9feVSe0jWVXLF',
    'jaywalking': '1UxKCVrbyXhvMz_H7dI4w5hjPpRGCAApy',
    'cartoon': '1ONcTMSEuGuLYIDbX-KeFqd390vbTIH9d'
}

Here, we are using the ‘kinetics_stride2x4x4’ model.

# Select the GPU: CUDA tensors behave like CPU tensors, but computations are performed on the GPU
dev = torch.device('cuda')
# Download the pre-trained model and move it to the GPU
vqvae = load_vqvae('kinetics_stride2x4x4', device=dev).to(dev)

Select the video from ‘vid’ to be reconstructed.

vid_name = 'bear'

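Set the resolution of the video to be reconstructed. Its height and width must be divisible by the encoder’s spatial stride, which is 4 for the 2x4x4 model used here. We reuse the resolution stored in the model’s hyperparameters.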

resolution = vqvae.hparams.resolution

Set the number of frames in the sequence to be displayed.

seq_length = 64

Download the video file.

vid_fname = download(vid[vid_name], f'{vid_name}.mp4')

Decode the entire video frame-by-frame and record a list of the frames’ timestamps.

pts = read_video_timestamps(vid_fname, pts_unit='sec')[0]

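Read the video from the specified .mp4 file between the first timestamp and the one at index seq_length - 1. read_video returns the video frames, the audio frames and metadata; index [0] keeps only the video frames.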

video = read_video(vid_fname, pts_unit='sec', start_pts=pts[0], end_pts=pts[seq_length - 1])[0]

Preprocess the video, whose frames are in THWC layout (T: number of frames, H: height of frame, W: width of frame, C: number of color channels), then add a batch dimension and move the result to the GPU.

video = preprocess(video, resolution, seq_length).unsqueeze(0).to(dev)

Encoding and decoding step

# Encode the video using VQ-VAE so that it creates latent sequences
with torch.no_grad():  # disable the gradient computation
    enc = vqvae.encode(video)
    # Decode the latent sequences to reconstruct the video
    vid_recon = vqvae.decode(enc)
    # Clamp the reconstructed video tensor's values to the range [-0.5, 0.5]
    vid_recon = torch.clamp(vid_recon, -0.5, 0.5)

Note: The clamp() method used above works as follows:

If we define the range as, say, [-0.5, 0.5] and the tensor has a value of, say, -1.7, it will be clamped to -0.5 since it is less than the lower limit and hence out of range. Similarly, an element greater than the upper limit 0.5, say 1.8, will be clamped to 0.5.
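The same behaviour can be reproduced without PyTorch. The sketch below clamps a few scalar values and then maps them from [-0.5, 0.5] to 8-bit pixel intensities, mirroring the `(videos + 0.5) * 255` conversion used in the visualization step. The helpers `clamp` and `to_uint8` are illustrative stand-ins, not library functions.

```python
def clamp(value, low, high):
    """Limit value to the closed interval [low, high]."""
    return max(low, min(value, high))

def to_uint8(value):
    """Map a value from [-0.5, 0.5] to an integer pixel in [0, 255],
    truncating the fractional part as an integer cast would."""
    return int((clamp(value, -0.5, 0.5) + 0.5) * 255)

values = [-1.7, -0.5, 0.0, 0.25, 1.8]

clamped = [clamp(v, -0.5, 0.5) for v in values]
print(clamped)   # [-0.5, -0.5, 0.0, 0.25, 0.5]

pixels = [to_uint8(v) for v in values]
print(pixels)    # [0, 0, 127, 191, 255]
```

Without the clamp, out-of-range values such as 1.8 would map to intensities above 255 and wrap around when cast to an unsigned 8-bit integer.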

Visualize the reconstructed video.

# Concatenate original and reconstructed video along the width to view them side by side
videos = torch.cat((video, vid_recon), dim=-1)
# Drop the batch dimension and permute the dimensions from C*T*H*W to T*H*W*C
# C: number of color channels
# T: number of frames
# H: height of frame
# W: width of frame
videos = videos[0].permute(1, 2, 3, 0)
# Convert the CUDA tensor to NumPy. Since NumPy does not support CUDA, move the
# tensor from GPU to CPU first and then change the type to unsigned 8-bit integer
videos = ((videos + 0.5) * 255).cpu().numpy().astype('uint8')
# Create a matplotlib figure
fig = plt.figure()
# Title of the plot
plt.title('Original video (left), Reconstructed video (right)')
# Disable the axes
plt.axis('off')
# Draw the first frame; the animation updates this image later
img = plt.imshow(videos[0, :, :, :])
# Close the figure so the static plot is not shown alongside the animation
plt.close()

Define a function that draws the first frame before the animation starts.

def init():
    img.set_data(videos[0, :, :, :])

Define a function to be called at each frame for animation.

def animate(i):
    img.set_data(videos[i, :, :, :])
    return img

Create an animation by repeatedly calling the animate() function defined above.

anmt = animation.FuncAnimation(fig, animate, init_func=init, frames=videos.shape[0], interval=100)

Convert the animation to HTML5 video tag

HTML(anmt.to_html5_video())

Output video:

Code source: GitHub
Google colab notebook of the above implementation.

