Moments in Time: The Biggest Short Video Dataset For Data Scientists

Ankit Das

Moments in Time is one of the largest human-annotated video datasets, capturing visual and audible short events produced by humans, animals, objects and nature. It was developed in 2018 by a team of researchers including Mathew Monfort, Alex Andonian, Bolei Zhou and Kandan Ramakrishnan. The dataset comprises more than 1,000,000 3-second videos corresponding to 339 unique verbs. Every verb is associated with more than 1,000 videos, resulting in a large, balanced dataset for learning dynamic events from videos. The everyday activities covered by this dataset include falling on the floor, opening the mouth or eyes, swimming, bouncing and so on.

Here, we will examine the data contained in this dataset, how it was collected, and some benchmark models that achieved high accuracy on it. Further, we will load the Moments in Time dataset using the PyTorch and TensorFlow libraries.

Data Collection Of Activity Moments

To collect the videos, the researchers crawled the Internet, parsing video metadata to build a list of candidate videos for each class from a wide range of sources such as YouTube and Google Videos. The videos were downloaded and a random 3-second segment was cut from each. These verb-video tuples were then sent to Amazon Mechanical Turk (AMT) for annotation. Each AMT worker was given a video-verb pair and asked to press a Yes or No key indicating whether the action occurred in the video. Positive responses from the first round were then sent on to later rounds of annotation. A single worker task, or HIT (Human Intelligence Task), contained 64 different 3-second videos related to a single verb, plus ten ground-truth videos. In each HIT, the first 4 questions were used to train the worker on the task and required the correct answer to be selected before continuing. Only the results from HITs that scored 90% or above on the control videos were recorded for the dataset.
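The quality-control rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Hit` structure, the function names and the `threshold` parameter are hypothetical and are not taken from the researchers' annotation pipeline.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Hit:
    """One annotation task (hypothetical structure): yes/no answers keyed by video id."""
    answers: Dict[str, bool]    # the worker's answer for every video in the HIT
    controls: Dict[str, bool]   # known ground-truth answers for the control videos


def control_accuracy(hit: Hit) -> float:
    """Fraction of control videos the worker answered correctly."""
    correct = sum(hit.answers.get(vid) == truth for vid, truth in hit.controls.items())
    return correct / len(hit.controls)


def accepted_labels(hits: List[Hit], threshold: float = 0.9) -> Dict[str, List[bool]]:
    """Collect answers for non-control videos from HITs that pass the control check."""
    labels: Dict[str, List[bool]] = {}
    for hit in hits:
        if control_accuracy(hit) < threshold:
            continue  # discard the whole HIT
        for vid, answer in hit.answers.items():
            if vid not in hit.controls:
                labels.setdefault(vid, []).append(answer)
    return labels
```

With ten control videos per HIT, a worker may miss at most one control and still have the HIT counted.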

Loading dataset using Torchvision

The dataset can be downloaded from the following link.

Import all the libraries required for the project.

import os
import cv2
import argparse
import numpy as np
from PIL import Image
import torch
import torchvision.models as models
from torch.nn import functional as F
from torchvision import transforms as torchN

def load_model(categories, weight_file):
    """Build a ResNet-50 and load the pretrained weights into it."""
    # Download the weights only if they are not already readable on disk.
    if not os.access(weight_file, os.R_OK):
        weight_url = '' + weight_file  # the base URL is missing in the source
        os.system('wget ' + weight_url)
    model = models.__dict__['resnet50'](num_classes=len(categories))
    useGPU = 0
    if useGPU == 1:
        checkpoint = torch.load(weight_file)
    else:
        checkpoint = torch.load(weight_file, map_location=lambda storage, loc: storage) # allow cpu
    # Remove the 'module.' prefix from the parameter names in the checkpoint.
    dict1={str.replace(str(k),'module.',''):v for k,v in checkpoint['state_dict'].items()}
    model.load_state_dict(dict1)
    model.eval()  # switch to inference mode
    return model
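The dictionary comprehension in `load_model` renames the checkpoint keys: checkpoints saved from a model wrapped in `torch.nn.DataParallel` carry a `module.` prefix on every parameter name, which a plain ResNet-50 does not expect. The sketch below shows the same idea on an ordinary dictionary, with plain integers standing in for tensors so that it runs without PyTorch. The helper name `strip_module_prefix` is hypothetical.

```python
from typing import Any, Dict


def strip_module_prefix(state_dict: Dict[str, Any], prefix: str = "module.") -> Dict[str, Any]:
    """Return a copy of state_dict with a leading prefix removed from each key."""
    cleaned = {}
    for key, value in state_dict.items():
        if key.startswith(prefix):
            key = key[len(prefix):]
        cleaned[key] = value
    return cleaned


# Plain values stand in for tensors so the example runs without PyTorch.
saved = {"module.conv1.weight": 1, "module.fc.bias": 2, "fc.weight": 3}
print(strip_module_prefix(saved))
# {'conv1.weight': 1, 'fc.bias': 2, 'fc.weight': 3}
```

Unlike `str.replace`, which removes `module.` wherever it occurs in a key, this variant only strips it from the start of the name.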

Before the frames are passed to the model, they need to be resized and normalised with a transform pipeline.

def load_transform():
    """Load the image transformer."""
    tensor = torchN.Compose([
        torchN.Resize((224, 224)),
        torchN.ToTensor(),  # convert the PIL image to a tensor before normalising
        torchN.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ])
    return tensor

def load_categories(filename):
    """Load categories."""
    with open(filename) as f:
        return [line.rstrip() for line in f.readlines()]

if __name__ == '__main__':
    # load categories and model
    categories = load_categories('category_momentsv2.txt')
    model = load_model(categories, 'moments_v2_RGB_resnet50_imagenetpretrained.pth.tar')
    # load the transformer
    tensorflow1 = load_transform()  # image transformer
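To make the `Normalize` step concrete, the sketch below applies the same arithmetic to a single pixel in plain Python. It assumes the pixel values have first been scaled from 0-255 to the range [0, 1]. The two triples are the per-channel mean and standard deviation from the transform above (the commonly used ImageNet statistics), and the function name `normalize_pixel` is hypothetical.

```python
MEAN = (0.485, 0.456, 0.406)  # per-channel means used in the transform above
STD = (0.229, 0.224, 0.225)   # per-channel standard deviations


def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardise each channel."""
    return tuple((value / 255.0 - m) / s for value, m, s in zip(rgb, MEAN, STD))


# A pixel close to the mean colour maps to values near zero.
print(normalize_pixel((124, 116, 104)))
# A black pixel maps to negative values in every channel.
print(normalize_pixel((0, 0, 0)))
```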

Here we load a test image, fetched from the specified URL.

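    # load the test image
    if os.path.exists('test.jpg'):
        os.remove('test.jpg')  # discard any stale copy before downloading
    img_url = ''  # the image URL is missing in the source
    os.system('wget %s -q -O test.jpg' % img_url)
    image = Image.open('test.jpg')
    input_img = tensorflow1(image).unsqueeze(0)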
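The listing stops once `input_img` has been built. The usual next step would be a forward pass through the model, followed by a softmax over the output scores (the imported `F` module provides one) and a ranking of the categories. The sketch below illustrates only that post-processing in plain Python, using made-up scores and a handful of category names. It is not the article's code, and the helper names are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, categories, k=5):
    """Return the k most probable (category, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(categories, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical scores for four categories.
categories = ["swimming", "bouncing", "falling", "opening"]
logits = [2.0, 0.5, 1.0, -1.0]
for name, prob in top_k(logits, categories, k=3):
    print(f"{prob:.3f} -> {name}")
```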

Loading dataset using Tensorflow

import os
import numpy as np
import tensorflow as tf
import random as rn
from keras import backend as K
from keras.callbacks import Callback, LearningRateScheduler, ModelCheckpoint
from keras.callbacks import CSVLogger, EarlyStopping, LambdaCallback
import utils
import prednet_model
import argparse
import sys
from data import DataGenerator

Define the training function and its parameters.

def train(config_name, training_data_dir,
          base_results_dir, epochs=150,
          use_multiprocessing=False, workers=1, shuffle=True,
          n_timesteps=10, batch_size=4, stopping_patience=None, 
          input_channels=3, input_width=160, input_height=128, 
          max_queue_size=10, classes=None, 
          training_index_start=0, training_max_per_class=None, 
          frame_step=1, stateful=False, rescale=None, gpus=None,
          seq_overlap=0, **config):

By default, the model is trained for 150 epochs with a batch size of 4. These values are passed to the function as parameters.

train= DataGenerator(classes=classes,
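The source does not show how `DataGenerator` uses `n_timesteps`, `frame_step` and `seq_overlap`, so the following is only a guess at their meaning: each training sequence holds `n_timesteps` frames taken `frame_step` apart, and consecutive sequences share `seq_overlap` frames. Under those assumptions, a clip could be cut into sequences as follows. Both function names are hypothetical.

```python
def sequence_starts(n_frames, n_timesteps=10, frame_step=1, seq_overlap=0):
    """Start indices of fixed-length frame sequences cut from one clip.

    Assumed semantics (not confirmed by the source): every sequence holds
    n_timesteps frames taken frame_step apart, and consecutive sequences
    share seq_overlap frames.
    """
    if not 0 <= seq_overlap < n_timesteps:
        raise ValueError("seq_overlap must be in [0, n_timesteps)")
    span = (n_timesteps - 1) * frame_step + 1          # raw frames covered by one sequence
    stride = (n_timesteps - seq_overlap) * frame_step  # distance between sequence starts
    return list(range(0, n_frames - span + 1, stride))


def sequence_frames(start, n_timesteps=10, frame_step=1):
    """Frame indices belonging to the sequence that begins at start."""
    return [start + i * frame_step for i in range(n_timesteps)]


# A 3-second clip at an assumed 30 frames per second has 90 frames.
print(len(sequence_starts(90)))                 # 9 non-overlapping sequences
print(len(sequence_starts(90, seq_overlap=5)))  # 17 sequences sharing 5 frames each
```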

Classification Accuracy

State of the art

At the time of writing, the state of the art on the Moments in Time dataset is AssembleNet, with an accuracy of 34.27%. BMN is a close contender with an accuracy of around 32.4%.
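Accuracy figures like these measure the fraction of clips for which the correct label is the top-ranked prediction (top-1) or appears among the k best predictions (top-k). The source does not say which variant is quoted, so the sketch below simply shows how both could be computed from per-class scores. The scores and labels are made up for illustration.

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Made-up scores for three clips over four classes.
scores = [
    [0.1, 0.6, 0.2, 0.1],   # true class 1: top-1 hit
    [0.5, 0.1, 0.3, 0.1],   # true class 2: only a top-2 hit
    [0.4, 0.3, 0.2, 0.1],   # true class 3: miss even at k=2
]
labels = [1, 2, 3]
print(top_k_accuracy(scores, labels, k=1))
print(top_k_accuracy(scores, labels, k=2))
```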


VLOG Dataset

The VLOG dataset was introduced in 2017 to address a major obstacle to understanding basic human interactions, such as getting out of bed or opening a refrigerator: the lack of good training data. The dataset was developed by the researchers David F. Fouhey, Wei-cheng Kuo, Alexei A. Efros and Jitendra Malik of UC Berkeley. It draws on lifestyle vlogs, an immensely popular genre of video that people upload to YouTube to document their lives. It contains 14 days of footage, made up of 114K videos from 10.7K uploaders, showing everyday interactions occurring naturally.

Data Collection

Unlike with other datasets, searching directly for the target actions does not work here. The researchers first built a Lifestyle Vlog corpus from YouTube. They define a positive video as one that depicts people interacting with an indoor environment from a third-person view. They used templated queries based on topics (“daily routine 2013”) or activities involved (“cleaning room”), including six main English query templates and three templates translated into 13 European languages. These gave 823 unique queries. Finally, the top 1K hits were retrieved from YouTube, yielding 216K unique videos.
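The templated-query idea can be sketched as filling a small set of templates with topic or activity terms and keeping only the unique results. The templates below are hypothetical; only the two quoted terms come from the text above, and the researchers' actual templates and translations are not reproduced here.

```python
from itertools import product


def build_queries(templates, terms):
    """Fill every template with every term and drop duplicate queries."""
    seen = set()
    queries = []
    for template, term in product(templates, terms):
        query = template.format(term=term).strip().lower()
        if query not in seen:
            seen.add(query)
            queries.append(query)
    return queries


# Hypothetical templates; "Cleaning Room" is a deliberate duplicate of "cleaning room".
templates = ["{term}", "{term} vlog", "my {term}"]
terms = ["daily routine 2013", "cleaning room", "Cleaning Room"]
queries = build_queries(templates, terms)
print(len(queries))  # 6 unique queries from 9 template/term pairs
```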

Loading the dataset using Keras

Download the dataset from here.

Install the video generator using the pip command. We have used an image data generator for data augmentation.

import os
import glob
import keras
from keras_video import VideoFrameGenerator

Let’s define the parameters so that we can pass them to the model for training.

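classes = [i.split(os.path.sep)[1] for i in glob.glob('videos/*')]
# Parameters
Size = (112, 112)
channel = 3
Nbframe = 5
Batch_size = 32
# Data augmentation (the augmentation arguments are truncated in the source,
# so the generator is created with its defaults here)
data_aug = keras.preprocessing.image.ImageDataGenerator()
# Create video frame generator (the original call is truncated in the source;
# the keyword arguments below are an assumption that reuses the parameters above)
load_data = VideoFrameGenerator(
    classes=classes,
    glob_pattern='videos/{classname}/*',
    nb_frames=Nbframe,
    batch_size=Batch_size,
    target_shape=Size,
    nb_channel=channel,
    transformation=data_aug)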
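To give a feel for what these parameters imply, the sketch below picks `Nbframe` evenly spaced frame indices from a clip and computes the shape of one batch. The even spacing is an assumption for illustration and is not necessarily how `VideoFrameGenerator` selects frames. Both helper names are hypothetical.

```python
def sample_frame_indices(total_frames, nb_frames=5):
    """Pick nb_frames indices spread evenly across a video."""
    if total_frames < nb_frames:
        raise ValueError("video has fewer frames than requested")
    step = total_frames / nb_frames
    return [int(i * step) for i in range(nb_frames)]


def batch_shape(batch_size=32, nb_frames=5, size=(112, 112), channels=3):
    """Shape of one batch: (batch, frames, height, width, channels)."""
    return (batch_size, nb_frames, size[0], size[1], channels)


print(sample_frame_indices(90))   # [0, 18, 36, 54, 72]
print(batch_shape())              # (32, 5, 112, 112, 3)
```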

Accuracy on different Frames

State of the art

At the time of writing, the state of the art on VLOG is the Object Relation Network, with a reported accuracy of 44.7.


In this article, we have discussed the details of the Moments in Time and VLOG datasets and how to load them using the PyTorch and TensorFlow libraries. These datasets can help make progress on the difficult problem of understanding actions in real life. There is still a long way to go before a model achieves good accuracy on them.

Copyright Analytics India Magazine Pvt Ltd