Machine learning models are designed to find patterns in the data they are fed. But how do they perform when asked to find patterns or repetitions in a video? Researchers have been trying to teach models not only to find patterns in videos but also to count the number of times a certain action is repeated. The applications range from spotting patterns in traffic camera footage to counting heartbeats in ultrasound scans.
Researchers at Google AI have introduced RepNet, a single model that can recognise repetitions within a video and handle a broad range of repeating processes, from birds flapping their wings to pendulums swinging.
Overview Of RepNet
The model consists of three parts: a frame encoder, an intermediate representation called a temporal self-similarity matrix (TSM), and a period predictor.
First, the frame encoder uses the ResNet architecture as a per-frame model: passing each frame of the video through the ResNet-based encoder yields a sequence of embeddings, one per frame.
Next, the TSM is calculated by comparing each frame’s embedding with that of every other frame in the video. The result is a matrix that subsequent modules can easily analyse to count repetitions.
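The pairwise comparison can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the article does not say which similarity function is used, so the sketch assumes negative squared Euclidean distance followed by a row-wise softmax, and the function name `temporal_self_similarity`, the `temperature` parameter and the toy embeddings are all hypothetical.

```python
import math

def temporal_self_similarity(embeddings, temperature=1.0):
    """Return an N x N matrix of pairwise frame similarities.

    Assumption (not stated in the article): similarity is the negative
    squared Euclidean distance, normalised per row with a softmax.
    """
    n = len(embeddings)
    tsm = []
    for i in range(n):
        # Negative squared distance from frame i to every frame j.
        sims = [
            -sum((a - b) ** 2 for a, b in zip(embeddings[i], embeddings[j]))
            for j in range(n)
        ]
        # Row-wise softmax, shifted by the row maximum for numerical stability.
        peak = max(sims)
        exps = [math.exp((s - peak) / temperature) for s in sims]
        total = sum(exps)
        tsm.append([e / total for e in exps])
    return tsm

# Toy 2-D embeddings for a motion that repeats every 3 frames.
toy_embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]] * 2
tsm = temporal_self_similarity(toy_embeddings)
```

In this toy case, frames 0 and 3 have identical embeddings, so row 0 of the matrix gives them equal weight. That weight is higher than the one given to frame 1, which is the kind of repeating structure the later modules can pick up on.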
A Transformer network is then used to predict the period of repetition and the periodicity for each frame directly from the sequence of similarities in the TSM. Once the period is known, each frame inside a periodic segment contributes a per-frame count equal to one divided by its period length, which amounts to dividing the number of frames in the segment by the period length. Summing these per-frame counts gives the predicted number of repetitions in the video.
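The counting step can be illustrated with a small sketch. It assumes the predictor has already produced one period length (in frames) and one periodicity score per frame. The function name `count_repetitions` and the 0.5 `threshold` are hypothetical choices for illustration, not values from the article.

```python
def count_repetitions(period_lengths, periodicity_scores, threshold=0.5):
    """Sum per-frame counts over the frames judged to be periodic.

    Each periodic frame contributes 1 / period_length, so a segment of
    F frames with a constant period P contributes F / P repetitions.
    The threshold is an assumed cut-off, not a value from the article.
    """
    per_frame_counts = []
    for period, score in zip(period_lengths, periodicity_scores):
        if score >= threshold and period > 0:
            per_frame_counts.append(1.0 / period)
        else:
            # Non-periodic frames add nothing to the count.
            per_frame_counts.append(0.0)
    return sum(per_frame_counts), per_frame_counts

# 12 frames: the first 8 are periodic with a period of 4 frames,
# the last 4 are not part of any repetition.
periods = [4] * 8 + [0] * 4
scores = [0.9] * 8 + [0.1] * 4
total_count, per_frame = count_repetitions(periods, scores)
```

Here eight periodic frames with a period of four frames add up to two repetitions, while the trailing non-periodic frames add nothing.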
The working of RepNet can be summarised as follows:
- A video V is taken as a sequence of frames.
- This video is fed to an image encoder to produce per-frame embeddings X.
- Using these embeddings, the temporal self-similarity matrix (TSM) is obtained by computing pairwise similarities between all pairs of embeddings.
- This similarity matrix is fed to the period predictor module, which outputs a period length estimate and a periodicity score for each frame.
- The period length captures the rate at which a repetition is occurring, while the periodicity score indicates whether the frame lies within a periodic portion of the video.
For training, the authors propose using synthetically generated repetitions built from unlabeled YouTube videos. Synthetic periodic videos are generated from randomly selected videos and used to train the model to predict per-frame periodicity and period lengths.
The researchers have also introduced the Countix dataset, a subset of the Kinetics dataset annotated with segments of repeated actions and their corresponding counts. During collection, the authors first manually chose a subset of Kinetics classes in which repetitions are more likely to occur, for example jumping jacks and slicing onions.
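One way to picture the synthetic data is the sketch below. It cuts a short clip out of a longer video, repeats it several times, and keeps a few of the surrounding frames as non-periodic context, which also yields the per-frame labels for free. This is a simplified illustration, not the authors' actual pipeline: the function `make_synthetic_repetition` and all of its parameters are hypothetical, and frames are stood in for by integers. In practice the clip position, length and repeat count would be sampled at random.

```python
def make_synthetic_repetition(frames, clip_start, clip_len, repeats,
                              prefix_len=0, suffix_len=0):
    """Build a periodic video plus per-frame labels from a frame list.

    Returns (video, periodicity, period_lengths), where periodicity is 1
    inside the repeated segment and 0 elsewhere, and period_lengths holds
    the clip length for periodic frames and 0 for the rest.
    """
    clip_end = clip_start + clip_len
    clip = frames[clip_start:clip_end]
    # Non-repeating context taken from just before and after the clip.
    prefix = frames[max(0, clip_start - prefix_len):clip_start]
    suffix = frames[clip_end:clip_end + suffix_len]

    video = prefix + clip * repeats + suffix
    n_periodic = len(clip) * repeats
    periodicity = [0] * len(prefix) + [1] * n_periodic + [0] * len(suffix)
    period_lengths = ([0] * len(prefix) + [len(clip)] * n_periodic
                      + [0] * len(suffix))
    return video, periodicity, period_lengths

# Integers stand in for image frames of a 20-frame source video.
source = list(range(20))
video, periodicity, period_lengths = make_synthetic_repetition(
    source, clip_start=5, clip_len=4, repeats=3, prefix_len=2, suffix_len=3
)
```

Because the repetitions are constructed rather than annotated, the count (three here) and the per-frame period length (four frames) are known exactly, so no manual labelling is needed.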
Key Takeaways
The authors argue that repeating processes provide unambiguous “action units”: semantically meaningful segments that make up an action. For example, if a person is chopping an onion, the action unit is the manipulation that is repeated to produce each additional slice. These units may be indicative of more complex activity and may allow such actions to be analysed automatically at a finer time-scale, without having a person annotate the units.
A few applications of this model:
- Monitoring speed changes, which is useful for exercise-tracking applications
- Predicting the count and frequency of repeating phenomena in videos, for example biological processes such as heartbeats
- Detecting periodicity and predicting counts across a diverse set of actors (humans, animals, etc.) and sensors (standard camera, ultrasound, etc.)
The researchers believe that this work will pave the way for tackling more complex cases, such as multiple simultaneous repeating signals and temporal arrangements of repeating sections, as found in dance steps and music.