This New Framework By Amazon Bypasses The Need For Human-Labelled Data

Recently, Amazon AI, along with SenseTime Research and the Chinese University of Hong Kong, introduced a new framework that leverages web data to train AI video recognition models. A key strength of this framework is that it overcomes the barriers between data formats in webly-supervised learning.

Representation learning has gained a lot of traction in image recognition and video classification over the past few years. However, developing AI models requires large-scale human-labelled image datasets, which are time-consuming and costly to create. According to the researchers, collecting these datasets is even more difficult in the domain of trimmed video recognition, since most online videos contain numerous shots with multiple concepts.

Behind OmniSource

OmniSource is a unified framework for video classification that simultaneously utilises multiple sources of web data, including images, trimmed videos and untrimmed videos. To enhance data efficiency, the researchers proposed a task-driven data collection approach: class labels are used as search queries, and only the top-ranked results are kept so that the supervision is most informative. The framework works in a semi-supervised setting, where labelled and unlabelled web data co-exist.


The OmniSource framework consists of three main steps:-

  • One or more teacher networks are trained on the labelled target dataset
  • For each source of collected web data, the corresponding teacher network is applied to obtain pseudo-labels and to filter out irrelevant samples with low confidence scores
  • Each type of web data is transformed into the input format required by the target task (for example, images are converted into video clips), and the student network is trained on the combined data
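The pseudo-labelling and filtering step can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the confidence threshold, the function name and the use of softmax probabilities are all assumptions.

```python
import numpy as np

def pseudo_label(teacher_probs, threshold=0.8):
    """Assign pseudo-labels from a teacher network's class probabilities,
    discarding low-confidence web samples (a sketch of the filtering step).

    teacher_probs: (N, C) array of per-class probabilities for N samples.
    Returns (indices_kept, pseudo_labels).
    """
    confidence = teacher_probs.max(axis=1)   # top class probability per sample
    keep = confidence >= threshold           # filter out noisy web samples
    return np.nonzero(keep)[0], teacher_probs[keep].argmax(axis=1)

# Example: 3 web samples, 2 classes; only the confident ones survive.
probs = np.array([[0.95, 0.05],   # confident -> kept, pseudo-label 0
                  [0.55, 0.45],   # ambiguous -> filtered out
                  [0.10, 0.90]])  # confident -> kept, pseudo-label 1
kept, labels = pseudo_label(probs, threshold=0.8)  # kept = [0, 2], labels = [0, 1]
```

The student network is then trained on the surviving samples with their pseudo-labels, together with the original labelled data.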

Dataset Used

The researchers used the Kinetics-400 dataset, one of the most extensive video datasets, with around 240K, 19K and 38K videos in the training, validation and testing subsets, respectively. They also used the YouTube-car dataset, a fine-grained video recognition benchmark covering 196 categories of cars, and the UCF101 dataset, a small-scale video recognition dataset with 101 classes.

Features Of OmniSource

OmniSource adopts several good practices in joint training, including source-target data balancing, resampling and cross-dataset mixup. According to the researchers, the framework is also data-efficient in training.
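Cross-dataset mixup blends a sample from the target dataset with a pseudo-labelled web sample, in the style of standard mixup. The sketch below is illustrative: the alpha hyperparameter and function signature are assumptions, not values from the paper.

```python
import numpy as np

def cross_dataset_mixup(x_target, y_target, x_web, y_web, alpha=0.2, rng=None):
    """Blend a target-dataset sample with a web sample (mixup applied
    across the two data sources; hyperparameters here are illustrative).

    x_*: input arrays of identical shape; y_*: one-hot label vectors.
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)                 # mixing coefficient in (0, 1)
    x = lam * x_target + (1.0 - lam) * x_web     # blend the inputs
    y = lam * y_target + (1.0 - lam) * y_web     # blend into soft labels
    return x, y

# Example: mix two 4-dimensional "clips" with one-hot labels.
x1, y1 = np.ones(4), np.array([1.0, 0.0])
x2, y2 = np.zeros(4), np.array([0.0, 1.0])
x_mix, y_mix = cross_dataset_mixup(x1, y1, x2, y2, rng=np.random.default_rng(0))
```

Training on such blended samples with soft labels regularises the student and softens the effect of any remaining noise in the web data.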

The researchers claimed that with only 3.5M images and 800K minutes of video crawled from the internet without human labelling (less than 2% of the data used by prior works), the models learned with OmniSource improve the Top-1 accuracy of 2D- and 3D-ConvNet baseline models by 3.0% and 3.9%, respectively, on the Kinetics-400 benchmark. The framework also sets new records under different pre-training strategies for video recognition.
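Top-1 accuracy, the metric quoted above, simply counts how often the model's highest-scoring class matches the ground-truth label. A minimal illustration:

```python
import numpy as np

def top1_accuracy(scores, labels):
    """Fraction of samples whose highest-scoring class equals the true label."""
    return float((scores.argmax(axis=1) == labels).mean())

# Four samples, three classes; three predictions are correct.
scores = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6],
                   [0.5, 0.4, 0.1],   # predicts class 0, true label is 1
                   [0.2, 0.7, 0.1]])
labels = np.array([0, 2, 1, 1])
acc = top1_accuracy(scores, labels)  # 3 of 4 correct -> 0.75
```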

Contributions In This Project

Here are some of the contributions mentioned by the researchers of this project:-

  • The OmniSource framework distils a mixture of web data forms, including images, trimmed videos and untrimmed videos, into one student network
  • The researchers proposed several good practices to deal with problems during joint training on data from multiple sources, including source-target balancing, resampling and cross-dataset mixup.
  • The models trained by OmniSource achieve state-of-the-art performance on the Kinetics-400 benchmark for all pre-training strategies

Wrapping Up

The researchers proposed a unified framework for omni-sourced webly-supervised video recognition which exploits web data of various forms, such as images, trimmed videos and untrimmed videos, from multiple sources, such as search engines, social media and video-sharing platforms, all in an integrated way. Due to its data-efficient nature, the framework reduces the amount of data required to train a model. OmniSource achieves a Top-1 accuracy of 83.6% on Kinetics-400, setting a new record on the benchmark.

Read the paper here.

Ambika Choudhury
A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.
