Spatio-Temporal Transformer Network: Can Text Detection Be Achieved Through It?

Spatio-Temporal Transformer Network (STTN) and contemporary techniques are combined in STRIVE (Scene Text Replacement In VidEos).

In computer vision, visual object tracking is a crucial yet difficult research problem, and convolutional neural networks have driven significant progress on it in recent years. Recently, researchers from NEC Laboratories, Palo Alto Research Center (PARC), Amazon, and Stanford University collaborated on the problem of realistically modifying scene text in videos. They named their framework STRIVE (Scene Text Replacement In VidEos). The main objective of the research is to create customised content for marketing and promotional purposes.

Several attempts have been made to automate text replacement in still photos using deep style transfer concepts. One way to tackle video text replacement would be to train an image-based text style transfer module on individual frames while adding temporal consistency constraints to the network loss. The research team took a different route. They first extract text regions of interest (ROIs) and train a Spatio-temporal Transformer Network (STTN) to frontalize the ROIs, making them temporally consistent. After that, they scan the video for a reference frame with good text quality, assessed in terms of text clarity, size, and shape.
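
The reference-frame step can be illustrated with a minimal sketch. The article does not say how clarity, size, and shape are measured, so everything here is an assumption: `sharpness` uses mean neighbouring-pixel contrast as a crude clarity proxy, `area` stands in for size, the `size_weight` parameter is hypothetical, and shape is left out entirely.

```python
from typing import List, Sequence

Gray = List[List[float]]  # a grayscale text ROI as rows of pixel intensities


def sharpness(roi: Gray) -> float:
    """Mean absolute difference between neighbouring pixels (assumed clarity proxy)."""
    total, count = 0.0, 0
    for y, row in enumerate(roi):
        for x, value in enumerate(row):
            if x + 1 < len(row):          # horizontal neighbour
                total += abs(row[x + 1] - value)
                count += 1
            if y + 1 < len(roi):          # vertical neighbour
                total += abs(roi[y + 1][x] - value)
                count += 1
    return total / count if count else 0.0


def area(roi: Gray) -> int:
    """Pixel count of the ROI (assumed size measure)."""
    return len(roi) * (len(roi[0]) if roi else 0)


def pick_reference_frame(rois: Sequence[Gray], size_weight: float = 0.01) -> int:
    """Return the index of the ROI with the highest combined score.

    The linear combination and its weight are illustrative, not from the paper.
    """
    scores = [sharpness(r) + size_weight * area(r) for r in rois]
    return max(range(len(rois)), key=scores.__getitem__)
```

Given one ROI per frame, `pick_reference_frame(rois)` returns the index of the frame whose text region scores highest under these assumed criteria.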

Similarly, researchers from Japan’s University of Tsukuba and Waseda University have unveiled a unified framework designed to handle a wide range of remastering operations for digitally converted films. The method is implemented as a fully convolutional network. The researchers chose temporal convolutions over recursive models for video processing because temporal convolutions can consider information from several input frames simultaneously.
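
To make the point about temporal convolutions concrete, here is a toy sketch, not the remastering network itself. It assumes each frame is flattened to a list of pixel values and applies a single hand-picked kernel along the time axis, so every output frame is computed from several input frames at once. The function name and kernel are illustrative.

```python
from typing import List, Sequence

Frame = List[float]  # one frame, flattened to a list of pixel intensities


def temporal_conv(frames: Sequence[Frame], kernel: Sequence[float]) -> List[Frame]:
    """Slide `kernel` along the time axis ("valid" mode, no padding).

    Each output frame is a weighted sum of len(kernel) consecutive input
    frames, computed independently for every pixel position.
    """
    k = len(kernel)
    out: List[Frame] = []
    for t in range(len(frames) - k + 1):
        window = frames[t:t + k]
        out.append([
            sum(w * frame[p] for w, frame in zip(kernel, window))
            for p in range(len(frames[0]))
        ])
    return out
```

With a kernel of length 3, four input frames yield two output frames, each blending three neighbours in time. A recursive model would instead have to carry that information forward one frame at a time.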

Existing Approaches

Many approaches have been proposed for detecting scene text in videos. However, most of them focus only on detecting text in individual frames, attempting to overcome low resolution or complicated image backgrounds. Only a few exploit spatial and temporal context, that is, they detect text in single frames and also use context information across many frames.

  • The researchers from Barcelona proposed a real-time video text identification system based on MSER with no benchmark dataset evaluation.
  • The group of scholars from China offered a multi-strategy tracking system, although hand-crafted criteria from tracking-by-detection, spatial-temporal context learning, and linear prediction were used to pick the best match. 
  • The Chinese research team expanded on the previous method by using dynamic programming to find the globally best match in a unified framework.
  • The research team from California developed a network flow-based technique. 
  • The scholars from Singapore employed network flow for text line creation in single images.

Technology Behind

The majority of video text detection algorithms have two steps: the first detects text in individual frames or key frames, and the second tracks the resulting proposals over the short and long term. Poor contrast, low-resolution images, multi-orientation and multi-scale text, and random motion pose obstacles in both steps. Many of the approaches developed for text detection in images can also be used on video frames, and temporal context information gives video text detection an additional advantage.
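
The second step, linking per-frame detections into tracks, can be sketched as follows. This is a generic greedy tracking-by-detection baseline based on intersection over union (IoU), not the method of any system cited in this article. The box format `(x1, y1, x2, y2)` and the `threshold` default are assumptions, and the per-frame detector is taken as given.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format
Track = List[Tuple[int, Box]]            # (frame index, box) pairs


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def link_detections(frames: Sequence[Sequence[Box]], threshold: float = 0.5) -> List[Track]:
    """Greedily extend each track with the best-overlapping box in the next frame."""
    tracks: List[Track] = []
    for t, boxes in enumerate(frames):
        unmatched = list(boxes)
        for track in tracks:
            last_t, last_box = track[-1]
            # Only tracks that were alive in the previous frame can be extended.
            if last_t != t - 1 or not unmatched:
                continue
            best = max(unmatched, key=lambda b: iou(last_box, b))
            if iou(last_box, best) >= threshold:
                track.append((t, best))
                unmatched.remove(best)
        # Every detection that matched no track starts a new one.
        tracks.extend([[(t, b)] for b in unmatched])
    return tracks
```

A greedy matcher like this loses a track as soon as one detection is missed. That weakness is one reason to prefer global formulations such as dynamic programming or network flow.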

The analysis of spatiotemporal data necessitates the consideration of both temporal and spatial relationships. For two key reasons, assessing both the temporal and spatial dimensions of data adds significant complexity to the data analysis process: 

1) Changes in the spatial and non-spatial features of spatiotemporal objects across time, both continuous and discrete, and

2) The influence of collocated neighbouring spatiotemporal objects on one another.

STTN is a network that can establish dense pixel correspondences in both space and time. It is a transformer that uses a multi-scale patch-based attention module to search for coherent content across all frames in both the spatial and temporal dimensions. The module collects patches of varying scales from all video frames in order to cover the many appearance changes induced by complex video motion. The multi-head transformer computes similarities between spatial patches at different scales simultaneously. According to Amazon and its collaborators, STRIVE may be the first attempt at deep video text replacement. The field needs more research attention, and in the foreseeable future we can expect additional research contributions from Indian startups.
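
The patch-based attention idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the STTN implementation: frames are small 2D lists, patches are non-overlapping, raw pixel values serve as queries, keys, and values (a real transformer would use learned projections), and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Frame = List[List[float]]
Patch = List[float]


def extract_patches(frame: Frame, size: int) -> List[Patch]:
    """Split a frame into non-overlapping size x size patches, each flattened."""
    patches = []
    for y in range(0, len(frame) - size + 1, size):
        for x in range(0, len(frame[0]) - size + 1, size):
            patches.append([frame[y + dy][x + dx]
                            for dy in range(size) for dx in range(size)])
    return patches


def attend(query: Patch, keys: Sequence[Patch],
           values: Sequence[Patch]) -> Tuple[List[float], Patch]:
    """Scaled dot-product attention of one query over all keys."""
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    output = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(len(values[0]))]
    return weights, output


def spatiotemporal_attention(frames: Sequence[Frame], query: Patch,
                             size: int) -> Tuple[List[float], Patch]:
    """Attend from one query patch to same-size patches in every frame."""
    patches = [p for f in frames for p in extract_patches(f, size)]
    return attend(query, patches, patches)
```

Calling `spatiotemporal_attention` once per patch size, for example with `size=1` and `size=2`, mimics in spirit how different attention heads handle different scales. The weights show which patches, in which frames, most resemble the query.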

Dr. Nivash Jeevanandam
Nivash holds a doctorate in information technology and has been a research associate at a university and a development engineer in the IT industry. Data science and machine learning excite him.
