Guide to IMDb Movie Dataset With Python Implementation

The Internet Movie Database (IMDb) is an online database dedicated to information about a wide range of film content, such as movies, TV shows, and web-based streaming shows. The information presented on the IMDb portal includes cast, production crew, directors, personal biographies, plot summaries, trivia, ratings, and fan and critic reviews.

The IMDb dataset contains 50,000 reviews, allowing up to 30 reviews per movie. It was developed in 2011 by the researchers Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts of Stanford University. The dataset is evenly divided into training and test sets: the training set contains 25,000 reviews, and so does the test set.

A negative review has a score of ≤ 4 out of 10, and a positive review has a score of ≥ 7 out of 10. Neutral reviews were excluded from this dataset.
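
The labeling rule can be written as a small function. This is only a minimal sketch: the name label_from_score is hypothetical and is not part of the dataset's tooling, and the "pos" and "neg" strings simply mirror the tags used by the dataset's folders.

```python
from typing import Optional


def label_from_score(score: int) -> Optional[str]:
    """Map a rating out of 10 to a sentiment tag, or None if it is neutral."""
    if score <= 4:
        return "neg"
    if score >= 7:
        return "pos"
    return None  # scores of 5 and 6 are neutral and are left out of the dataset


if __name__ == "__main__":
    for score in (2, 5, 9):
        print(score, label_from_score(score))
```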

Here, we will examine the information contained in this dataset, how it was gathered, and some benchmark models that achieved high accuracy on it. Further, we will build a model on the IMDb dataset using the Keras library.

Data Collection

The raw data was collected by the researchers from the IMDb website. They examined the text of each review and looked for features that were indicative of whether the review was positive or negative. The reviews were then evenly divided into training and test sets, which were uploaded to the researchers' website. Each of the two set directories contains two further directories, pos and neg, which partition the reviews by label. Each of these folders contains numerous TXT files, with each file holding the text of one review.
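
To make the folder layout concrete, the sketch below builds a tiny made-up split in a temporary directory and reads it back. The helper names (read_split, build_toy_split), the file names, and the review texts are all hypothetical; only the train/pos and train/neg structure follows the description above.

```python
import os
import tempfile


def read_split(split_dir, sentiments=("pos", "neg")):
    """Return (text, sentiment) pairs for every .txt file under split_dir."""
    examples = []
    for sentiment in sentiments:
        folder = os.path.join(split_dir, sentiment)
        for name in sorted(os.listdir(folder)):
            if name.endswith(".txt"):
                with open(os.path.join(folder, name), encoding="utf-8") as f:
                    examples.append((f.read(), sentiment))
    return examples


def build_toy_split(root):
    """Create a tiny, made-up train split that mimics the described layout."""
    toy = {"pos": ["A wonderful film."], "neg": ["Dull and far too long."]}
    for sentiment, texts in toy.items():
        folder = os.path.join(root, "train", sentiment)
        os.makedirs(folder)
        for i, text in enumerate(texts):
            path = os.path.join(folder, f"review_{i}.txt")
            with open(path, "w", encoding="utf-8") as f:
                f.write(text)
    return os.path.join(root, "train")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        for text, sentiment in read_split(build_toy_split(root)):
            print(sentiment, "->", text)
```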

Loading the Dataset Using PyTorch

import os
import glob
import torch
import torch.nn as nn
from torch.autograd import Variable
from torch import optim
import torch.nn.functional as F
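# Assumption: the source omits the module name; in PyTorch-NLP this helper lives in torchnlp.download.
from torchnlp.download import download_file_maybe_extract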

Define the parameters that need to be passed to the function. The list x defined below will hold the reviews along with their polarity.

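def imdb_dataset(directory='data/',
                 train=False,
                 test=False,
                 train_directory='train',
                 test_directory='test',
                 extracted_name='aclImdb',
                 check_files=['aclImdb/README'],
                 url='http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz',
                 sentiments=['pos', 'neg']):
    # Assumption: the parameters between `directory` and `sentiments` are not
    # shown in the source; their names come from the function body and their
    # default values are assumed.
    # Download and extract the archive unless the checked files already exist.
    download_file_maybe_extract(url=url, directory=directory, check_files=check_files)
    x = []
    # Keep only the splits that the caller asked for.
    splits = [
        dir_ for (requested, dir_) in [(train, train_directory), (test, test_directory)]
        if requested
    ]
    for split_directory in splits:
        full_path = os.path.join(directory, extracted_name, split_directory)
        examples = []
        for sentiment in sentiments:
            for filename in glob.iglob(os.path.join(full_path, sentiment, '*.txt')):
                with open(filename, 'r', encoding="utf-8") as f:
                    textnew = f.readline()
                # Store each review together with its polarity.
                examples.append({
                    'text': textnew,
                    'sentiment': sentiment,
                })
        x.append(examples)
    # Return a single list for one split, or a tuple of lists for several.
    if len(x) == 1:
        return x[0]
    return tuple(x)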

Code Implementation using Keras Library

The dataset can be downloaded from the following link.

Import all the libraries required for this project.
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding
from keras.preprocessing import sequence

Load the IMDb dataset, which Keras returns already split into a training set and a test set. Restrict the vocabulary to the 5,000 most frequent words.

maximum_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=maximum_words)
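
Keras encodes each review as a list of integer word indices ranked by frequency, and with num_words set, any index at or above that limit is replaced by an out-of-vocabulary marker (2 by default). The sketch below mimics that rule in plain Python on made-up indices; cap_vocabulary is a hypothetical helper, not a Keras function.

```python
def cap_vocabulary(review, num_words, oov_char=2):
    """Replace every word index >= num_words with the out-of-vocabulary marker."""
    return [w if w < num_words else oov_char for w in review]


if __name__ == "__main__":
    toy_review = [1, 14, 22, 7200, 43, 5000]  # made-up indices, not a real review
    print(cap_vocabulary(toy_review, num_words=5000))
```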

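Let's define the maximum length of a review. If a review is longer than 500 words, it is truncated to the maximum length. If a review is shorter than 500 words, pad_sequences fills the remaining positions with 0. By default, both padding and truncation are applied at the start of the sequence.

For example, with a maximum length of 5, the one-word review “Bangalore” becomes “0 0 0 0 Bangalore”.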

max_review = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_review)
X_test = sequence.pad_sequences(X_test, maxlen=max_review)
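
With its default arguments, pad_sequences pads and truncates at the front of each sequence. The plain-Python sketch below reproduces that behaviour for a single review so the effect is easy to see; pad_pre is a hypothetical stand-in, not the Keras implementation.

```python
def pad_pre(review, maxlen, value=0):
    """Pad at the front with `value`, or keep only the last `maxlen` items."""
    if len(review) >= maxlen:
        # Truncate from the front, keeping the end of the review.
        return review[len(review) - maxlen:]
    return [value] * (maxlen - len(review)) + review


if __name__ == "__main__":
    print(pad_pre([5, 8], maxlen=5))
    print(pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))
```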

We create the model with model = Sequential() so that the data flows from input to output one layer after another. The Embedding layer maps each word to a vector of 32 numbers.
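
Conceptually, an embedding layer is a lookup table with one row per word index. The sketch below shows only the lookup, with random values standing in for weights that would normally be learned during training; make_embedding_table and embed are hypothetical helpers.

```python
import random


def make_embedding_table(vocab_size, dim, seed=0):
    """Create a vocab_size x dim table of small random numbers."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(vocab_size)]


def embed(review, table):
    """Look up the vector for each word index in the review."""
    return [table[w] for w in review]


if __name__ == "__main__":
    table = make_embedding_table(vocab_size=5000, dim=32)
    vectors = embed([0, 0, 14, 22], table)
    # One 32-number vector per word index in the review.
    print(len(vectors), len(vectors[0]))
```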

The LSTM layer learns which words in a review are important and carries that information forward through the sequence. We add a Dense layer at the end of the model with a sigmoid activation, which outputs a value between 0 and 1. Values close to 1 indicate a positive review and values close to 0 a negative one, matching the dataset's labels of 1 (positive) and 0 (negative).

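embedding_vector_length = 32
model = Sequential()
# Map each of the 5,000 word indices to a 32-dimensional vector.
model.add(Embedding(maximum_words, embedding_vector_length, input_length=max_review))
# LSTM layer described in the text; the source omits this line, so the unit count (100) is an assumption.
model.add(LSTM(100))
# Single sigmoid unit that outputs the probability of a positive review.
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])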
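
The sigmoid in the final layer squashes any real number into the range 0 to 1, and a class is then obtained by comparing that value with a threshold. The sketch below shows this in plain Python; the 0.5 threshold is a common choice and is an assumption here, and both function names are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function, returning a value in [0, 1]."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_label(z, threshold=0.5):
    """Return 1 (positive) if the sigmoid output exceeds the threshold, else 0."""
    return 1 if sigmoid(z) > threshold else 0


if __name__ == "__main__":
    for z in (-2.0, 0.3, 4.0):
        print(z, round(sigmoid(z), 3), predict_label(z))
```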

The next step is to train the model for 5 epochs with a batch size of 64. Our model reached an accuracy of 92.88% on the training data.

model.fit(X_train, y_train, epochs=5, batch_size=64)
scores = model.evaluate(X_test, y_test, verbose=0)
print("Model accuracy on the IMDb dataset: {0:.2f}%".format(scores[1]*100))

We finished with an accuracy of 87.25% on the test dataset.
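
For reference, accuracy is simply the fraction of reviews whose predicted label matches the true label. The sketch below computes it on a handful of made-up probabilities; the accuracy helper and the 0.5 threshold are assumptions for illustration, not the article's evaluation code.

```python
def accuracy(probabilities, labels, threshold=0.5):
    """Fraction of thresholded predictions that match the true 0/1 labels."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


if __name__ == "__main__":
    toy_probabilities = [0.9, 0.2, 0.6, 0.4]  # made-up model outputs
    toy_labels = [1, 0, 0, 0]                 # made-up true labels
    print("{0:.2f}%".format(accuracy(toy_probabilities, toy_labels) * 100))
```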

State of the art

At the time of writing, the state of the art on the IMDb dataset is NB-weighted-BON + dv-cosine, which reached an accuracy of 97.4%. Graph Star and BERT large finetune UDA are close contenders, with accuracies of around 96%.


In this article, we have discussed the details of the IMDb dataset and implemented a model on it using the Keras library. The model reached a decent accuracy of around 87% on the test data. The accuracy may improve further if the model is trained for more epochs.

Ankit Das
A data analyst with expertise in statistical analysis, data visualization ready to serve the industry using various analytical platforms. I look forward to having in-depth knowledge of machine learning and data science. Outside work, you can find me as a fun-loving person with hobbies such as sports and music.
