How To Do Text To Video Retrieval With S3D MIL-NCE


Vision and language play an important role in how humans learn to associate visual entities with abstract concepts, and vice versa. This interplay is increasingly used to train computer vision models. From classification tasks, where images are categorised against a list of words, to recent captioning tasks, where images and videos are annotated with rich natural language, it has been one of the driving forces of the computer vision field. However, one of the main limitations of this approach is that it requires manually annotating large datasets, which is laborious and expensive.

For videos, annotation is even more challenging than for images, because of the ambiguity in choosing the right vocabulary of actions and in annotating action intervals. This significantly limits the scale at which fully supervised video data can be obtained and, hence, slows progress on visual representation learning. Recent work in this field has proposed a promising alternative to full supervision: leveraging narrated videos, which are widely available on the web. For example, a dataset like HowTo100M contains more than 100 million pairs of video clips and associated narrations.


It was automatically collected by querying YouTube for instructional videos. Such videos usually show someone explaining how to perform a complex human activity, e.g. swimming in a river or performing some scientific task. End-to-end learning from these instructional videos is highly challenging. The videos are generally made to maximise the number of views, with no specific intention of providing a training signal for machine learning algorithms. This means that the supervision present in the narration is weak and noisy. Among the sources of noise, a typical one is the weak alignment between video and language: what is said is often not what is shown at that moment.

DeepMind proposed a bespoke training loss for this setting, dubbed MIL-NCE, as it inherits from Multiple Instance Learning (MIL) and Noise Contrastive Estimation (NCE). This loss can cope with visually misaligned narrations from uncurated instructional videos, as you can see in the image below.


Equipped with this novel loss and a simple joint video and text embedding model, the authors successfully trained a video representation from scratch, directly from pixels, on the HowTo100M dataset.
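The MIL-NCE idea described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the official implementation: the function name `mil_nce_loss`, the array shapes and the boolean `positive_mask` are assumptions made for this example. Each clip may have several candidate narrations counted as positives, and every other narration in the batch acts as a negative. The published formulation also draws negatives from other clips, which this sketch leaves out, and the sketch assumes every clip has at least one positive narration.

```python
import numpy as np

def logsumexp(values, axis):
    """Numerically stable log(sum(exp(values))) along one axis."""
    peak = np.max(values, axis=axis, keepdims=True)
    summed = np.sum(np.exp(values - peak), axis=axis)
    return np.squeeze(peak, axis=axis) + np.log(summed)

def mil_nce_loss(video_emb, text_emb, positive_mask):
    """Simplified MIL-NCE-style loss (illustrative sketch).

    video_emb:     (n_videos, dim) clip embeddings
    text_emb:      (n_texts, dim) narration embeddings
    positive_mask: (n_videos, n_texts) bool, True where a narration is a
                   candidate positive for a clip (at least one per row)
    """
    scores = video_emb @ text_emb.T
    # Numerator: sum over ALL candidate positives of a clip (the MIL part).
    pos_scores = np.where(positive_mask, scores, -np.inf)
    log_pos = logsumexp(pos_scores, axis=1)
    # Denominator: positives plus negatives (the NCE part).
    log_all = logsumexp(scores, axis=1)
    return float(np.mean(log_all - log_pos))

# Toy batch: 2 clips, 4 narrations, 2 candidate narrations per clip.
video_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
text_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
positive_mask = np.array([[True, True, False, False],
                          [False, False, True, True]])
loss = mil_nce_loss(video_emb, text_emb, positive_mask)
print(loss)
```

Because the numerator sums over several candidate narrations, a clip is not penalised when only one of its nearby narrations actually matches what is shown, which is how the loss tolerates misalignment.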

The code below shows how to perform text-to-video retrieval with a model trained using the MIL-NCE technique; we use GIF files from Wikipedia resources as the videos.

Code implementation of Text to Video Retrieval:

The following code is based on the official implementation.

Import all dependencies:
 import numpy as np
 import cv2
 from IPython import display
 import math
 import os
 import tensorflow_hub as hub
 import tensorflow.compat.v2 as tf 
Load the model from TensorFlow Hub:
 model_address = ''
 load_model = hub.load(model_address) 

The user-defined function below runs inference on our inputs: it generates embeddings for the videos and for the text queries associated with them.

 def embedding_generator(model, input_sequence, input_word):
   video_output = model.signatures['video'](tf.constant(tf.cast(input_sequence, dtype= tf.float32)))
   text_output = model.signatures['text'](tf.constant(input_word))
   return video_output['video_embedding'],text_output['text_embedding'] 
Video loading and visualisation functions:
 def center_crop_square(frame):
   y, x = frame.shape[0:2]
   dim_min = min(y, x)
   x_start = (x // 2) - (dim_min // 2)
   y_start = (y // 2) - (dim_min // 2)
   return frame[y_start:y_start + dim_min, x_start:x_start + dim_min] 
 def video_load(video_url, max_frames = 40, resize = (224,224)):
   path_ = tf.keras.utils.get_file(os.path.basename(video_url)[-128:], video_url)
   cap = cv2.VideoCapture(path_)
   frames = []
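   while True:
     rett, frame = cap.read()
     if not rett:
       break  # no more frames to read
     frame = center_crop_square(frame)
     frame = cv2.resize(frame, resize)
     frame = frame[:, :, [2, 1, 0]]  # OpenCV decodes BGR; reorder to RGB
     frames.append(frame)
     if len(frames) == max_frames:
       break
   cap.release()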
   frames = np.array(frames)
   if len(frames) < max_frames:
     n_repeat = int(math.ceil(max_frames / float(len(frames))))
     frames = frames.repeat(n_repeat, axis = 0)
   frames = frames[:max_frames]
   return frames / 255.0
 def video_display(urls):
     html = '<table>'
     html += '<tr><th>Video 1</th><th>Video 2</th><th>Video 3</th></tr><tr>'
     for url in urls:
         html += '<td>'
         html += '<img src="{}" height="224">'.format(url)
         html += '</td>'
     html += '</tr></table>'
     return display.HTML(html)  
 def display_result(query, urls, scores):
   sorted_ix = np.argsort(-scores)
   html = ''
   html += '<h2>Input query: <i>{}</i> </h2><div>'.format(query)
   html += 'Results: <div>'
   html += '<table>'
   html += '<tr><th>Rank #1, Score:{:.2f}</th>'.format(scores[sorted_ix[0]])
   html += '<th>Rank #2, Score:{:.2f}</th>'.format(scores[sorted_ix[1]])
   html += '<th>Rank #3, Score:{:.2f}</th></tr><tr>'.format(scores[sorted_ix[2]])
   for idx in sorted_ix:
     url = urls[idx]
     html += '<td>'
     html += '<img src="{}" height="224">'.format(url)
     html += '</td>'
   html += '</tr></table>'
   return html 
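
The cropping and padding steps in the loader can be checked without downloading any video. The sketch below re-declares `center_crop_square` from the listing above and adds a hypothetical helper, `pad_to_length`, which is not part of the original code and only isolates the repeat-and-truncate logic at the end of `video_load`. The synthetic frame sizes are arbitrary.

```python
import math
import numpy as np

def center_crop_square(frame):
    # Same logic as the function in the article: keep the central square.
    y, x = frame.shape[0:2]
    dim_min = min(y, x)
    x_start = (x // 2) - (dim_min // 2)
    y_start = (y // 2) - (dim_min // 2)
    return frame[y_start:y_start + dim_min, x_start:x_start + dim_min]

def pad_to_length(frames, max_frames):
    # Hypothetical helper mirroring the tail of video_load: repeat frames
    # until there are enough, then truncate to exactly max_frames.
    if len(frames) < max_frames:
        n_repeat = int(math.ceil(max_frames / float(len(frames))))
        frames = frames.repeat(n_repeat, axis=0)
    return frames[:max_frames]

# A blank 320x240 landscape frame is cropped to its central 240x240 square.
frame = np.zeros((240, 320, 3), dtype=np.uint8)
cropped = center_crop_square(frame)
print(cropped.shape)  # (240, 240, 3)

# Three 1x1 "frames" labelled 0, 1, 2, padded to 8 frames.
clip = np.arange(3).reshape(3, 1, 1, 1)
padded = pad_to_length(clip, 8)
print(padded.ravel())  # [0 0 0 1 1 1 2 2]
```

Note that `repeat` duplicates each frame in place, so a short clip is slowed down rather than looped from the start.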
Load videos and associated queries:
 video_1_url = ''
 video_2_url = ''
 video_3_url = ''
 video_1 = video_load(video_1_url)
 video_2 = video_load(video_2_url)
 video_3 = video_load(video_3_url)
 all_videos = [video_1, video_2, video_3]
 query_1_video = 'waterfall'
 query_2_video = 'playing guitar'
 query_3_video = 'car drifting'
 all_queries_video = [query_1_video, query_2_video, query_3_video]
 all_videos_urls = [video_1_url, video_2_url, video_3_url]




Run inference:
 # video input
 videos_ = np.stack(all_videos, axis = 0)
 # prepare text input
 words_ = np.array(all_queries_video)
 # generate video and text embedding
 video_embd, text_embd = embedding_generator(load_model, videos_, words_)
 # similarity scores between each text query (rows) and each video (columns)
 all_scores = np.dot(text_embd, tf.transpose(video_embd))
 html = ''
 for i, words in enumerate(words_):
   html += display_result(words, all_videos_urls, all_scores[i, :])
   html += '<br>'
 display.HTML(html)
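
The ranking step itself is a dot product followed by a sort. A minimal sketch with made-up 4-dimensional embeddings (the numbers are illustrative only and do not come from the model) shows how row i of the score matrix ranks every video for query i:

```python
import numpy as np

# Made-up embeddings: 2 text queries and 3 videos in a 4-dimensional space.
text_embd = np.array([[1.0, 0.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0, 0.0]])
video_embd = np.array([[0.1, 0.9, 0.0, 0.0],
                       [0.8, 0.2, 0.0, 0.0],
                       [0.3, 0.3, 0.4, 0.0]])

# scores[i, j] is the similarity between query i and video j.
scores = np.dot(text_embd, video_embd.T)

# Sort each row in descending order of score to get the video ranking.
ranking = np.argsort(-scores, axis=1)
print(ranking)  # [[1 2 0]
                #  [0 2 1]]
```

Query 0 scores highest against video 1 and query 1 against video 0, which is the same ordering logic `display_result` applies with `np.argsort(-scores)`.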





Manually annotating every video file is laborious and inefficient, and is increasingly hard to justify. Advances in computer vision are leading to a rapid increase in applications, and in innovative applications in particular. In this article, we have seen how to implement a text-to-video retrieval system using a TensorFlow Hub model. The results returned for the respective queries are surprisingly accurate. Judging by the rank scores, this architecture looks like a significant step forward for computer vision.



Copyright Analytics India Magazine Pvt Ltd
