Recently, Amazon AI, along with SenseTime Research and the Chinese University of Hong Kong, introduced a new framework that leverages web data to train AI video recognition models. A notable strength of this framework is that it overcomes the barriers between different data formats in webly-supervised learning.
Representation learning has gained a lot of traction in image recognition and video classification over the past few years. However, developing AI models requires large-scale human-labelled datasets, which are both time-consuming and costly to collect. According to the researchers, collecting these datasets is even more difficult in the domain of trimmed video recognition, since most online videos contain numerous shots with multiple concepts.
OmniSource is a unified framework for video classification that simultaneously utilizes multiple sources of web data, including images, trimmed videos and untrimmed videos. To enhance data efficiency, the researchers proposed a task-driven data collection approach: querying with class labels and keeping only the topmost results, so that the supervision is as informative as possible. The framework works under the semi-supervised setting, where labelled and unlabelled data from the web co-exist.
The OmniSource framework consists mainly of three steps:
- One or more teacher networks are trained on the labelled target dataset
- For each source of collected web data, the corresponding teacher network is applied to obtain pseudo-labels and to filter out irrelevant samples with low confidence scores
- Different transforms are used to convert each type of web data (for example, images) into the input format required by the target task (such as video clips), which is then used to train the student network
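The filtering step in the pipeline above can be sketched as follows. This is a minimal illustration, not the authors' actual implementation: the function name and the confidence threshold of 0.8 are assumptions for the example.

```python
import numpy as np

def filter_and_pseudo_label(teacher_logits, threshold=0.8):
    """Keep web samples whose teacher confidence exceeds the threshold,
    and assign the teacher's top prediction as the pseudo-label.
    (Hypothetical helper; threshold value is an assumption.)"""
    # Softmax over class logits gives per-sample class probabilities.
    shifted = teacher_logits - teacher_logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    confidence = probs.max(axis=1)          # teacher's confidence per sample
    pseudo_labels = probs.argmax(axis=1)    # teacher's top prediction
    keep = confidence >= threshold          # drop low-confidence web samples
    return keep, pseudo_labels

# Example: 3 crawled web samples, 4 action classes.
logits = np.array([[8.0, 0.1, 0.2, 0.1],   # confident prediction, kept
                   [1.0, 1.1, 0.9, 1.0],   # ambiguous, filtered out
                   [0.2, 0.1, 7.5, 0.3]])  # confident prediction, kept
keep, labels = filter_and_pseudo_label(logits)
```

The surviving samples, paired with their pseudo-labels, then join the labelled target data for training the student network.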
The researchers used the Kinetics-400 dataset, one of the most extensive video datasets, containing around 240K, 19K and 38K videos in the training, validation and testing subsets, respectively. They also used the YouTube-car dataset, a fine-grained video recognition task covering 196 different types of cars, and the UCF101 dataset, a small-scale video recognition dataset with 101 classes.
Features Of OmniSource
OmniSource adopts several good practices in joint training, including data balancing, resampling and cross-dataset mixup. According to the researchers, the framework is also data-efficient in training.
The researchers claimed that with only 3.5M images and 800K minutes of video crawled from the internet without human labelling (less than 2% of the data used in prior works), the models learned with OmniSource improve the Top-1 accuracy of 2D- and 3D-ConvNet baseline models by 3.0% and 3.9%, respectively, on the Kinetics-400 benchmark. With the help of this framework, one can also establish new records under different pre-training strategies for video recognition.
Contributions In This Project
Here are some of the contributions mentioned by the researchers of this project:
- The OmniSource framework distils a mixture of web data forms, including images, trimmed videos and untrimmed videos, into one student network
- The researchers proposed several good practices to deal with problems during joint training on data from multiple sources, including source-target balancing, resampling and cross-dataset mixup.
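Of the practices listed above, cross-dataset mixup is the most concrete: a clip from the labelled target dataset is blended with a web-sourced clip, along with their labels. A minimal sketch follows, assuming standard mixup with a Beta-distributed mixing ratio; the function name and parameters are illustrative, not taken from the paper.

```python
import numpy as np

def cross_dataset_mixup(clip_a, label_a, clip_b, label_b,
                        num_classes, alpha=0.2, rng=None):
    """Blend a target-dataset clip with a web-sourced clip.
    (Hypothetical helper; alpha=0.2 is an assumed hyperparameter.)"""
    if rng is None:
        rng = np.random.default_rng()
    # Mixing ratio drawn from Beta(alpha, alpha), as in standard mixup.
    lam = rng.beta(alpha, alpha)
    mixed_clip = lam * clip_a + (1 - lam) * clip_b
    # Labels are mixed as soft one-hot vectors with the same ratio.
    one_hot = np.eye(num_classes)
    mixed_label = lam * one_hot[label_a] + (1 - lam) * one_hot[label_b]
    return mixed_clip, mixed_label
```

The mixed pair is then fed to the student network in place of the original samples, which helps smooth over the distribution gap between web data and the target dataset.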
- The models trained by OmniSource achieve state-of-the-art performance on the Kinetics-400 benchmark for all pre-training strategies
The researchers proposed a unified framework for omni-sourced webly-supervised video recognition which exploits web data of various forms, such as images, trimmed videos and untrimmed videos, from multiple sources, such as search engines, social media and video-sharing platforms, all in an integrated way. Due to its data-efficient nature, the framework reduces the amount of data required to train a model. OmniSource achieves a Top-1 accuracy of 83.6%, establishing a new record on the Kinetics-400 benchmark.
Read the paper here.