Trailers have evolved from clips played after the movie (hence the name) into teasers released as much as a year before the main film hits the theatres. Like any other business venture, this transition was driven largely by consumer behaviour.
Film-makers condense a full-length feature into a two-minute trailer that showcases probable awe-inducing scenes while trying to stay true to the essence of the original content. This article focuses on recent developments in movie-recommendation engines and how studios are using them to rake in profits.
Understanding the market segmentation of the movie-going public lies at the core of any movie studio's marketing strategy. Studios go to great lengths to understand consumer behaviour because predicting a movie's fortunes before its release is close to impossible. For example, The Shawshank Redemption (1994), which tops the lists of movie-buffs and critics alike, tanked at the box office. One of the key reasons was poor word of mouth; audiences reportedly struggled even to pronounce the title. It is hard for creators to digest that a masterpiece can fail over something as trivial as a confusing title.
The researchers at 20th Century Fox trained their convolutional neural network using NVIDIA Tesla P100 GPUs on Google Cloud, with the cuDNN-accelerated TensorFlow deep learning framework, on hundreds of movie trailers released in recent years, along with millions of attendance records. At its core, the model works on the parameters available in each frame of a movie trailer to deduce the probability of a movie being liked, and which segment of the audience would like it.
In their paper, the researchers developed a model that relies heavily on the temporal dynamics of movie trailers. A convolution layer is followed by a temporal pooling layer, which summarises the video feed frame by frame before feeding it into a hybrid collaborative filter. The model uses 'convolution over time,' a multivariate time-series method that makes it temporally aware.
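A minimal sketch of what 'convolution over time' followed by temporal pooling could look like. The shapes, filter counts, and function names here are illustrative assumptions, not taken from the paper:

```python
import numpy as np

# Hypothetical input: 120 frames (one per second of a two-minute
# trailer), each described by a 1024-dimensional feature vector.
frames = np.random.rand(120, 1024)

def conv_over_time(x, filters):
    """1-D convolution along the temporal axis.

    x: (T, C) sequence of per-frame feature vectors.
    filters: (F, K, C) bank of F filters, each spanning K frames.
    Returns: (T - K + 1, F) filter activations over time.
    """
    T, C = x.shape
    F, K, _ = filters.shape
    out = np.empty((T - K + 1, F))
    for t in range(T - K + 1):
        window = x[t:t + K]                     # (K, C) slice of frames
        out[t] = np.tensordot(filters, window)  # correlate each filter
    return out

filters = np.random.rand(32, 8, 1024)         # 32 filters, 8 frames wide
activations = conv_over_time(frames, filters)  # shape (113, 32)
pooled = activations.max(axis=0)               # temporal max-pooling -> (32,)
```

The pooled vector is a fixed-length summary of the whole trailer, which is the kind of dense representation a downstream collaborative filter could consume.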
“Video analysis using pooling schemes to collapse an entire video or part of a video into a unique dense feature vector can miss important semantic aspects of the video. Although simple to implement, the approach neglects the sequential and temporal aspects of the story, which, we argue, can be useful for characterisation of a motion picture,” said the data scientists at 20th Century Fox.
Temporal sequencing is done using a Video Convolution Network (VCN). The idea behind the VCN is to learn a collection of filters, each of which captures a particular kind of object-sequence that could be suggestive of specific actions. Using temporal sequencing, the model differentiates between long shots and intermittent shots, which can be used to deduce key information about the movie, such as its genre, who the protagonists are, or whether the film is suitable for the entire family. The network trains on different shapes, as convolution networks usually do, and improves its ability to label objects in the sequenced parts of the trailer.
Using A Video Convolution Network
A dialogue-heavy drama can have sequences of close-up shots where the camera cuts back and forth between the speakers' faces. Sequences like these are plentiful, and filtering out the most significant object-sequences is the job of the object-specific convolution filters.
Step-By-Step Working Of The Algorithm:
- Downsample the videos to one frame per second.
- Extract a 1024-dimensional image feature vector for each frame using the Inception V3 model.
- A convolution layer applies 1024 convolution filters against these features, each filter of size 8 x 1024, giving the layer roughly 8 million (1024 x 8 x 1024) parameters.
- These filters are convolved along the temporal dimension with a stride of 2 for dimensionality reduction.
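The steps above can be sketched end to end. The Inception V3 features are mocked with random vectors, and the trailer length is an assumption; only the filter count (1024), filter width (8 frames), feature size (1024), and stride (2) come from the article:

```python
import numpy as np

T, C = 120, 1024       # frames after 1-fps downsampling; Inception V3 feature size
F, K, stride = 1024, 8, 2

# Parameter count of the convolution layer described above.
params = F * K * C     # 1024 * 8 * 1024 = 8,388,608, i.e. roughly 8 million

# Number of output positions for a strided temporal convolution.
out_len = (T - K) // stride + 1

# Mocked Inception features and filter bank.
feats = np.random.rand(T, C)
filters = np.random.rand(F, K, C)

# Convolve each filter along the temporal dimension with stride 2.
out = np.stack([np.tensordot(filters, feats[t:t + K])
                for t in range(0, T - K + 1, stride)])
assert out.shape == (out_len, F)
```

The stride of 2 halves the temporal resolution of the output, which is the dimensionality reduction the last step refers to.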
There is also a residual layer that applies another set of convolution filters to the output of the previous layer. If these filters span more than one frame, they further increase the effective receptive field; filters that span a single frame do not expand the receptive field, but they add capacity to mix information across all 1024 channels.
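One way to picture the single-frame case is a residual block with pointwise filters. This is a sketch under assumed shapes, not the paper's exact layer:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w):
    """Residual layer with 1-frame (pointwise) filters.

    x: (T, C) activations from the previous layer.
    w: (C, C) pointwise filter bank; each output channel mixes all C
       input channels of a single time step, so the temporal receptive
       field is unchanged while channel-mixing capacity grows.
    """
    return x + relu(x @ w)    # skip connection plus pointwise convolution

T, C = 57, 1024               # e.g. the stride-2 output from the layer above
x = np.random.rand(T, C)
w = np.random.randn(C, C) * 0.01
y = residual_block(x, w)
assert y.shape == x.shape     # same temporal length: receptive field unchanged
```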
Such video sequences strongly activate particular channels in the last ReLU layer, with activations that typically stand several standard deviations above the channel's average response.
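Picking out those channels can be sketched as a simple statistical test; the threshold of two standard deviations is an assumption for illustration:

```python
import numpy as np

def strongly_activated(acts, k=2.0):
    """Return indices of channels whose peak activation exceeds
    the channel's mean by k standard deviations.

    acts: (T, C) post-ReLU activations over time.
    """
    mean = acts.mean(axis=0)
    std = acts.std(axis=0) + 1e-8    # guard against zero variance
    return np.where(acts.max(axis=0) > mean + k * std)[0]

acts = np.abs(np.random.randn(57, 1024))   # mocked activations
acts[10, 5] = 50.0                         # channel 5 fires on some sequence
channels = strongly_activated(acts)
assert 5 in channels
```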
In the picture below, we see two frames, one from John Wick and one from Jason Bourne. Any average movie-goer would guess that these frames belong to action-heavy movies. But training the network to figure out this correlation is the tricky part. The VCN does exactly that, using object-specific sequencing to find similarities in the dense feature vectors discussed above.
The comparison charts reveal that the model did fairly well on many occasions. Although there is still room for improvement, it gives movie studios a much-needed platform for better evaluation of their content and target audience.
“Video trailers are the single most critical element of the marketing campaigns for new films,” the 20th Century Fox researchers stated in their paper. There is now a push for appending bumpers at the start of trailers: five-second clips that preview whatever explosive content the trailer has to offer. With audiences' attention spans so volatile, marketing teams believe in the need for diligent strategies that lean heavily on technology.
According to a leading financial daily, the Indian movie market is growing at a rate of 11.5% every year, and experts forecast a phenomenal rise in revenue in the coming years, with estimates now standing at a whopping ₹24,000 crore by the year 2020. With such alluring incentives and fierce competition within the industry, it wouldn't be surprising if the Indian movie giants dial up analytics to grab a bigger share of the pie.