Moments in Time is one of the largest human-annotated video datasets, capturing visual and audible short events produced by people, animals, objects and nature. It was developed in 2018 by the researchers Mathew Monfort, Alex Andonian, Bolei Zhou and Kandan Ramakrishnan. The dataset comprises more than one million 3-second videos corresponding to 339 different verbs. Each verb is associated with more than 1,000 videos, resulting in a large, balanced dataset for learning dynamic events from video. The everyday actions covered by the dataset include falling on the floor, opening the mouth, opening the eyes, swimming, jumping, and so on.
In this article, we will examine the data contained in this dataset, how it was collected, and mention some benchmark models that achieve high accuracy on it. Further, we will load the Moments in Time dataset using the PyTorch and TensorFlow libraries.
Data Collection Of Activity Moments
To gather video data, the researchers searched the Internet, parsing video metadata to assemble a list of candidate videos for each class from a wide range of sources such as YouTube and Google Videos. The videos were downloaded and a 3-second segment was randomly cut from each. These verb–video tuples were then sent to the Amazon Mechanical Turk (AMT) tool for annotation. Each AMT worker was shown a video–verb pair and asked to press a Yes or No key indicating whether the action occurred in the video. Positive responses from the first round were then sent to later rounds of annotation. A single worker task (HIT) contained 64 different 3-second videos associated with a single verb, plus ten ground-truth control videos. In each HIT, the first 4 questions were used to train the workers on the task and required the correct response to be chosen before proceeding. Only the results from HITs that scored 90% or above on the control videos were included in the dataset.
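As a rough illustration (not the authors' code), the per-HIT quality filter described above amounts to a simple threshold check on the ten control videos:

# illustrative sketch of the HIT quality filter described above (not the
# authors' code): a HIT is kept only if the worker answers at least 90%
# of the ground-truth control videos correctly
def hit_passes(control_answers, control_labels, threshold=0.9):
    correct = sum(a == l for a, l in zip(control_answers, control_labels))
    return correct / len(control_labels) >= threshold

# example: 9 of 10 control videos answered correctly -> the HIT is accepted
print(hit_passes([1, 1, 0, 1, 1, 1, 1, 1, 1, 1],
                 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]))  # True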
Loading the dataset using Torchvision
The dataset can be downloaded from the following link.

Import all the libraries required for the project.
import os
import cv2
import argparse
import numpy as np
from PIL import Image
import torch
import torchvision.models as models
from torch.nn import functional as F
from torchvision import transforms as torchN

def load_model(categories, weight_file):
    # download the pretrained weights if they are not already available locally
    if not os.access(weight_file, os.W_OK):
        weight_url = 'http://moments.csail.mit.edu/moments_models/' + weight_file
        os.system('wget ' + weight_url)
    model = models.__dict__['resnet50'](num_classes=len(categories))
    useGPU = 0
    if useGPU == 1:
        checkpoint = torch.load(weight_file)
    else:
        # allow loading the checkpoint on CPU
        checkpoint = torch.load(weight_file, map_location=lambda storage, loc: storage)
    # strip the 'module.' prefix added when the model was trained with DataParallel
    state_dict = {str.replace(str(k), 'module.', ''): v for k, v in checkpoint['state_dict'].items()}
    model.load_state_dict(state_dict)
    model.eval()
    return model
Next, we define the image transformations used to preprocess each frame before it is passed to the model (resizing, tensor conversion and ImageNet normalisation), along with a helper to load the category labels.
def load_transform():
    """Load the image transformer."""
    tensor = torchN.Compose([
        torchN.Resize((224, 224)),
        torchN.ToTensor(),
        torchN.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ])
    return tensor

def load_categories(filename):
    """Load categories."""
    with open(filename) as f:
        return [line.rstrip() for line in f.readlines()]
if __name__ == '__main__':
    # load categories and the pretrained model
    categories = load_categories('category_momentsv2.txt')
    model = load_model(categories, 'moments_v2_RGB_resnet50_imagenetpretrained.pth.tar')
    # load the image transformer
    transform = load_transform()
Next, we download a test image from a URL and transform it into the tensor format expected by the model.
# load the test image
if os.path.exists('test.jpg'):
    os.remove('test.jpg')
img_url = 'http://places2.csail.mit.edu/imgs/demo/IMG_5970.JPG'
os.system('wget %s -q -O test.jpg' % img_url)
image = Image.open('test.jpg')
input_img = transform(image).unsqueeze(0)
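With the image tensor prepared, a minimal inference sketch (reusing the model, categories and functional import from above; printing the top five predictions is just an illustrative choice) looks like this:

# minimal single-image inference sketch, assuming the model, categories
# and input_img objects defined above
with torch.no_grad():
    logits = model(input_img)
    probs = F.softmax(logits, dim=1).squeeze(0)

# report the five most likely Moments in Time categories
top_probs, top_idx = probs.topk(5)
for p, idx in zip(top_probs, top_idx):
    print('{:.3f} -> {}'.format(p.item(), categories[idx.item()]))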
Loading the dataset using TensorFlow
import os
import sys
import argparse
import numpy as np
import random as rn
import tensorflow as tf
from keras import backend as K
from keras.callbacks import Callback, LearningRateScheduler, ModelCheckpoint
from keras.callbacks import CSVLogger, EarlyStopping, LambdaCallback

import utils
import prednet_model
from data import DataGenerator
Define the training function and its parameters.
def train(config_name, training_data_dir, base_results_dir, epochs=150,
          use_multiprocessing=False, workers=1, shuffle=True, n_timesteps=10,
          batch_size=4, stopping_patience=None, input_channels=3,
          input_width=160, input_height=128, max_queue_size=10, classes=None,
          training_index_start=0, training_max_per_class=None, frame_step=1,
          stateful=False, rescale=None, gpus=None,
          data_format=K.image_data_format(), seq_overlap=0, **config):
The model is trained for 150 epochs with a batch size of 4; these values are passed to the function as parameters. Inside the function, the training data generator is created:
    # training data generator built from the custom data module
    train = DataGenerator(classes=classes,
                          seq_length=n_timesteps,
                          min_seq_length=n_timesteps,
                          seq_overlap=seq_overlap,
                          sample_step=frame_step,
                          target_size=None,
                          rescale=rescale,
                          fn_preprocess=resize,
                          batch_size=batch_size,
                          shuffle=shuffle,
                          data_format=data_format,
                          output_mode='error',
                          index_start=training_index_start,
                          max_per_class=training_max_per_class)
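The remainder of the training routine (building the model and fitting it) is omitted here. Purely as an illustration, a hypothetical call to the train function defined above might look like the following; the paths and class names are placeholders, not values from the source:

# hypothetical invocation of train(); directory paths and class names
# are placeholders for illustration only
if __name__ == '__main__':
    train(config_name='moments_rgb_prednet',
          training_data_dir='data/moments/training',
          base_results_dir='results/',
          epochs=150,
          batch_size=4,
          n_timesteps=10,
          classes=['running', 'swimming', 'jumping'])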
Classification Accuracy
State of the art
The current state of the art on the Moments in Time dataset is AssembleNet, which achieves an accuracy of 34.27%. BMN is a close contender with an accuracy of around 32.4%.
VLOG
The VLOG dataset was introduced in 2017 to address the lack of good training data for understanding basic human interactions, such as getting up or opening a refrigerator. The dataset was developed by researchers David F. Fouhey, Wei-cheng Kuo, Alexei A. Efros and Jitendra Malik of UC Berkeley. It draws on an immensely popular genre of video that people upload to YouTube to document their lives, and contains 14 days of footage spanning 114K videos from 10.7K uploaders, showing everyday interactions occurring naturally.
Data Collection
Unlike other datasets, a direct search for the desired footage does not work. The researchers instead collected a Lifestyle Vlog corpus from YouTube. They define a positive video as one that depicts people interacting with an indoor environment from a third-person viewpoint. They used templated queries based on topics ("daily routine 2013") or activities ("cleaning room"), including six basic English query templates and three templates translated into 13 European languages. This gave 823 unique queries. Finally, the top 1K hits for each query were mined from YouTube, yielding 216K unique videos.
Loading the dataset using Keras
Download the dataset from here.
Install the video frame generator using pip (the package is available on PyPI as keras-video-generators). We also use Keras's ImageDataGenerator for data augmentation.
import os
import glob
import keras
from keras_video import VideoFrameGenerator
Let's define the parameters so that we can pass them to the generator for training.
# list the class names from the videos/ directory
classes = [i.split(os.path.sep)[1] for i in glob.glob('videos/*')]
classes.sort()

# Parameters
Size = (112, 112)
channel = 3
Nbframe = 5
Batch_size = 32

# Data augmentation
data_aug = keras.preprocessing.image.ImageDataGenerator(
    zoom_range=.1,
    horizontal_flip=True,
    rotation_range=8,
    width_shift_range=.2,
    height_shift_range=.2)

# Create the video frame generator
load_data = VideoFrameGenerator('data/train/',
                                classes=classes,
                                nb_frames=Nbframe,
                                split=.33,
                                shuffle=True,
                                batch_size=Batch_size,
                                target_shape=Size,
                                nb_channel=channel,
                                transformation=data_aug,
                                use_frame_cache=True)
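To show how the generator can be consumed, here is a minimal sketch with a small placeholder 3D-CNN; the toy architecture is an assumption for illustration and is not the model used in the VLOG paper.

# toy 3D-CNN used only to illustrate feeding VideoFrameGenerator batches
# to Keras; it is not the model from the VLOG paper
from keras.models import Sequential
from keras.layers import Conv3D, MaxPooling3D, Flatten, Dense

model = Sequential([
    Conv3D(16, (3, 3, 3), activation='relu',
           input_shape=(Nbframe, Size[0], Size[1], channel)),
    MaxPooling3D(pool_size=(1, 2, 2)),
    Flatten(),
    Dense(len(classes), activation='softmax'),
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# the generator yields batches of (frame sequences, one-hot labels)
model.fit_generator(load_data, epochs=10)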
Accuracy on different Frames
State of the art
The current state of the art on VLOG is the Object Relation Network, which achieves an accuracy of 44.7%.
Conclusion
In this article, we discussed the details and implementation of both the Moments in Time and VLOG datasets using the PyTorch and TensorFlow libraries. These datasets can help make progress on the difficult problem of understanding actions in real life. There is still a long way to go toward a model that achieves decent accuracy on them.