Deep Dive Into Kinetics: An Intensive Dataset On Action Classification Developed By DeepMind

The Kinetics datasets are built from YouTube videos. The actions are human-focused and cover a broad range of classes, including human-object interactions such as mowing a lawn and washing dishes; single-person actions such as drawing, drinking, laughing and pumping a fist; and human-human interactions such as hugging, kissing and shaking hands.

The Kinetics dataset was first introduced in 2017, primarily for human action classification. It was developed at DeepMind by researchers including Will Kay, Joao Carreira, Chloe Hillier and Andrew Zisserman. The dataset contains 400 human action classes, with at least 400 video clips for each action. It has 306,245 videos and is split into three parts: a training set with 250–1000 videos per class, a validation set with 50 videos per class and a test set with 100 videos per class. Each clip lasts around 10s.

Here, we will discuss the statistics of the dataset, how it was gathered, and some benchmark models that achieve high accuracy on it.

Data Collection

Clips for each class were obtained by first searching YouTube for candidate videos. Amazon Mechanical Turk (AMT) workers then checked whether each candidate clip contained the action or not. At least three confirmations out of five were needed before a clip was accepted. Several candidate clips may come from the same video, so the dataset was refined to ensure that only one clip is taken from each video.
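
The acceptance rule and the one-clip-per-video filter described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the pipeline the authors used; the function names, the tuple layout and the sample video ids are hypothetical.

```python
def is_accepted(votes, min_positive=3):
    """Return True if enough annotators confirmed the action.

    votes is a list of booleans, one per annotator (five in the article).
    """
    return sum(votes) >= min_positive


def one_clip_per_video(clips):
    """Keep only the first accepted clip seen for each source video.

    clips is an iterable of (video_id, start_seconds, votes) tuples.
    """
    kept = {}
    for video_id, start, votes in clips:
        if is_accepted(votes) and video_id not in kept:
            kept[video_id] = (video_id, start)
    return list(kept.values())


# Hypothetical candidates: two clips from "vidA", one from "vidB".
candidates = [
    ("vidA", 5.0, [True, True, True, False, False]),    # 3 of 5: accepted
    ("vidA", 42.0, [True, True, True, True, True]),     # same video: dropped
    ("vidB", 10.0, [True, False, False, True, False]),  # 2 of 5: rejected
]
print(one_clip_per_video(candidates))  # [('vidA', 5.0)]
```

Keeping the first accepted clip per video is an arbitrary choice made for this sketch; any rule that selects a single clip per source video would serve the same purpose.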


Downloading the dataset

The biggest problem with the Kinetics dataset is that the actual videos are not available for direct download. Instead, the dataset provides annotation files in CSV and JSON format that list, for each entry, the YouTube URL, the action category and the start and end times of the action within the video. We need to follow those links, download the videos and trim them to the annotated time range.
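
As a rough sketch of the trimming step, the snippet below builds an ffmpeg command that cuts a clip out of a video that has already been downloaded to disk. It assumes ffmpeg is installed; the helper name, file names and segment times are hypothetical. The command is only built here, not executed.

```python
def build_trim_command(src_path, dst_path, time_start, time_end):
    """Build an ffmpeg command that keeps only [time_start, time_end] seconds."""
    if time_end <= time_start:
        raise ValueError("time_end must be greater than time_start")
    duration = time_end - time_start
    return [
        "ffmpeg",
        "-ss", str(time_start),  # seek to the start of the annotated segment
        "-i", src_path,          # full video downloaded beforehand
        "-t", str(duration),     # keep only the annotated duration
        dst_path,
    ]


# Hypothetical file names and segment times, for illustration only.
cmd = build_trim_command("full_video.mp4", "clip.mp4", 20, 30)
print(" ".join(cmd))
```

To actually run the command, the list can be passed to subprocess.run(cmd, check=True).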

The dataset can be downloaded from the following:

Kinetics 700

View paper Download dataset

Kinetics 600

View paper Download dataset

Kinetics 400

View paper Download dataset


After downloading the dataset, extract the zip file. It contains the annotation files for the train, test and validation splits in CSV and JSON format.

The label indicates which action the people in the clip perform. In the JSON structure below, the first two ids are labelled testifying and eating spaghetti. The url field points to the YouTube video from which the clip can be downloaded, and the segment field gives the start and end times of the action within that video.


Figure: Geographical data distribution, per continent

Structure of the JSON file

    {
        "---QUuC4vJs": {
            "annotations": {
                "label": "testifying",
                "segment": [...]
            },
            "duration": 10.0,
            "subset": "train",
            "url": ""
        },
        "--3ouPhoy2A": {
            "annotations": {
                "label": "eating spaghetti",
                "segment": [...]
            },
            "duration": 10.0,
            "subset": "train",
            "url": ""
        },
        "--4-0ihtnBU": {
            "annotations": {
                "label": "dribbling basketball",
                "segment": [...]
            },
            ...
        },
        ...
    }
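
A file with this layout can be read with Python's standard json module. The sketch below uses a tiny inline sample instead of the real file; the segment times in it are made-up placeholders (the actual values are not shown in this article), and iter_clips is a hypothetical helper name.

```python
import json

# A tiny sample in the layout shown above. The segment times are made-up
# placeholders; the original values are not given in the article.
sample = """
{
  "---QUuC4vJs": {
    "annotations": {"label": "testifying", "segment": [20.0, 30.0]},
    "duration": 10.0,
    "subset": "train",
    "url": ""
  },
  "--3ouPhoy2A": {
    "annotations": {"label": "eating spaghetti", "segment": [5.0, 15.0]},
    "duration": 10.0,
    "subset": "train",
    "url": ""
  }
}
"""


def iter_clips(annotations, subset="train"):
    """Yield (video_id, label, start, end) for every entry in one subset."""
    for video_id, entry in annotations.items():
        if entry["subset"] != subset:
            continue
        start, end = entry["annotations"]["segment"]
        yield video_id, entry["annotations"]["label"], start, end


clips = list(iter_clips(json.loads(sample)))
for video_id, label, start, end in clips:
    print(video_id, label, start, end)
```

For the real annotation file, json.load on the opened file would replace json.loads(sample).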

Loading the Kinetics 400 using PyTorch

After downloading the videos, the Kinetics 400 dataset can be loaded with PyTorch through torchvision.

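import torch
import torchvision

# Newer torchvision releases provide torchvision.datasets.Kinetics instead.
root = "path/to/kinetics400"  # placeholder: directory with one sub-folder per class
kinetics_data = torchvision.datasets.Kinetics400(
    root, frames_per_clip=16,  # 16 is an illustrative value
    step_between_clips=1, frame_rate=None,
    extensions=("avi",), transform=None, num_workers=1)
# batch_size and shuffle are illustrative values.
data_loader = torch.utils.data.DataLoader(kinetics_data, batch_size=4, shuffle=True)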

Let’s define the parameters of the Kinetics400 class:

·   root  – It is the root directory of the Kinetics400 Dataset.

·   frames_per_clip – Number of frames in a clip.

·   step_between_clips – Number of frames between each clip.

·   transform – A function that takes a T x H x W x C video and returns a transformed version.
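
To make frames_per_clip and step_between_clips more concrete, here is a minimal pure-Python sketch of how a video's frames could be divided into clips under these two parameters. It reflects one reading of the parameter descriptions above (a sliding window of frames_per_clip frames moved by step_between_clips frames) and is not torchvision's actual implementation; clip_start_indices is a hypothetical helper.

```python
def clip_start_indices(num_frames, frames_per_clip, step_between_clips=1):
    """Return the index of the first frame of every full clip in a video."""
    if num_frames < frames_per_clip:
        return []  # the video is too short to yield a single full clip
    last_start = num_frames - frames_per_clip
    return list(range(0, last_start + 1, step_between_clips))


# A 10-frame video cut into 4-frame clips, moving 3 frames between clips.
starts = clip_start_indices(num_frames=10, frames_per_clip=4, step_between_clips=3)
print(starts)  # [0, 3, 6]
```

With step_between_clips=1 consecutive clips overlap heavily, so a larger step reduces the number of clips drawn from each video.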

State of the Art

At the time of writing, the state of the art on the Kinetics 400 dataset is OmniSource irCSN-152, with a reported accuracy of 83.6%. irCSN is a close competitor with an accuracy of around 83%.

Kinetics 600

The Kinetics 600 paper was released in August 2018 by researchers including Joao Carreira, Eric Noland, Chloe Hillier and Andrew Zisserman. The new version of the dataset, called Kinetics-600, follows the same principles as Kinetics-400. The clips are mined from YouTube and each lasts around 10s. The number of classes was increased from 400 to 600, as one of the main objectives of the Kinetics project was to match the scale of the ImageNet dataset with its 1000 classes.

The general procedure for data collection is the same as in Kinetics 400. In short, a list of class names is created, then a list of candidate YouTube URLs is obtained for each class name, and candidate 10s clips are sampled from the videos. These clips are sent to Amazon Mechanical Turk workers, who check whether each clip contains the action class it is supposed to. Finally, the dataset is refined, as several clips may come from the same video.

At the time of writing, the state of the art is LGD-3D Two Stream, with a reported accuracy of around 96% on this dataset.

Kinetics 700

Kinetics 700 extends Kinetics 600 by increasing the number of classes from 600 to 700. As in the case of Kinetics-600, Kinetics-700 has at least 600 clips for every human action class. This represents a 30% increase in the number of video clips, from around 500k to around 650k. The objective of the Kinetics project is to provide a large-scale curated dataset of video clips covering a diverse range of human actions. The overall pipeline for data collection is the same as for the Kinetics 400 and Kinetics 600 datasets.

The best performing model on this dataset was I3D. It gave an accuracy of over 81%.


In this article, we have discussed the Kinetics datasets with 400, 600 and 700 classes, each of which extends the previous version. We have also covered the technique that was used to gather data for the datasets. The main purpose of increasing the number of classes was to approach the scale of the ImageNet dataset with its 1000 classes.

Ankit Das
A data analyst with expertise in statistical analysis, data visualization ready to serve the industry using various analytical platforms. I look forward to having in-depth knowledge of machine learning and data science. Outside work, you can find me as a fun-loving person with hobbies such as sports and music.
