Spatio-Temporal Transformer Network: Can Text Detection Be Achieved Through It?

Spatio-Temporal Transformer Network (STTN) and contemporary techniques are combined in STRIVE (Scene Text Replacement In VidEos).

In computer vision, visual object tracking is a crucial yet difficult research problem, and convolutional neural networks have driven significant progress on it in recent years. Recently, researchers from NEC Laboratories, Palo Alto Research Center (PARC), Amazon, and Stanford University collaborated on the problem of realistically modifying scene text in videos. They named their framework STRIVE (Scene Text Replacement In VidEos). The main objective of this research is to develop custom content for marketing and promotional purposes.

Several attempts have been made to automate text replacement in still photos using deep style transfer concepts. One approach to video text replacement would be to train an image-based text style transfer module on individual frames while adding temporal consistency requirements to the network loss. The research team took a different route. They first extract text regions of interest (ROIs) and train a Spatio-Temporal Transformer Network (STTN) to frontalize the ROIs so that they are temporally consistent. After that, they scan the video for a reference frame with good text quality, assessed in terms of text clarity, size, and shape.
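The reference-frame step can be illustrated with a small sketch. This is not the STRIVE implementation: it assumes each ROI is a grayscale patch stored as a list of rows, uses the variance of a 4-neighbour Laplacian as a stand-in for text clarity, and multiplies it by ROI area as a stand-in for text size. Shape is ignored, and the function names `laplacian_variance` and `pick_reference_frame` are hypothetical.

```python
from statistics import pvariance


def laplacian_variance(gray):
    """Variance of a 4-neighbour Laplacian; higher usually means sharper.

    `gray` is a list of rows of pixel intensities and must be at least 3x3.
    """
    h, w = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pvariance(responses)


def pick_reference_frame(rois):
    """Return the index of the ROI with the highest clarity-times-area score.

    The scoring rule is an assumption made for illustration only.
    """
    def score(roi):
        area = len(roi) * len(roi[0])
        return laplacian_variance(roi) * area

    return max(range(len(rois)), key=lambda i: score(rois[i]))
```

Given a flat, featureless ROI and a high-contrast one of the same size, the sketch picks the high-contrast ROI, since a flat patch has zero Laplacian variance.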

Similarly, researchers from Japan’s University of Tsukuba and Waseda University have unveiled a unified framework designed to handle a wide range of remastering operations for digitally converted films. The method is implemented as a fully convolutional network. The researchers chose temporal convolutions over recursive models for video processing because temporal convolutions can consider information from several input frames simultaneously.
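To make the idea of a temporal convolution concrete, here is a minimal sketch in plain Python. It is not taken from the remastering framework: it assumes each frame is a flat list of pixel values and applies a fixed, hand-picked kernel along the time axis, whereas a real network would learn its kernels and also convolve spatially. As in most deep learning libraries, the kernel is applied without flipping.

```python
def temporal_conv(frames, kernel):
    """Slide a 1D kernel along the time axis with 'valid' padding.

    Each output frame is a weighted sum of len(kernel) consecutive input
    frames, so every output pixel sees several frames at once.
    """
    k = len(kernel)
    num_pixels = len(frames[0])
    out = []
    for t in range(len(frames) - k + 1):
        window = frames[t:t + k]
        out.append([
            sum(w * frame[i] for w, frame in zip(kernel, window))
            for i in range(num_pixels)
        ])
    return out
```

With three input frames and a kernel of length three, the sketch returns a single output frame that blends all three inputs, which is the property the researchers relied on.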

Existing Approaches

There have been many approaches to detecting scene text in videos. However, the majority focus only on scene text identification in individual frames, attempting to overcome low resolution or complicated image backgrounds. Only a few approaches take spatial and temporal context into account, i.e., they not only detect text in single frames but also use context information across many frames.

  • Researchers from Barcelona proposed a real-time video text identification system based on MSER, but did not evaluate it on a benchmark dataset.
  • A group of scholars from China offered a multi-strategy tracking system, although it relied on hand-crafted rules to pick the best match among tracking-by-detection, spatial-temporal context learning, and linear prediction.
  • The Chinese research team expanded on the previous method by using dynamic programming to find the globally best match in a unified framework.
  • A research team from California developed a network flow-based technique.
  • Scholars from Singapore employed network flow for text line creation in single images.

Underlying Technology

The majority of video text identification algorithms have two steps: the first detects text in individual frames or key frames, and the second tracks the resulting proposals over the short and long term. Poor contrast, low-resolution images, multi-orientation and multi-scale text, and random motion pose obstacles in both steps. Many of the approaches established for text detection in images can also be used to detect text in video frames, and temporal context information offers an additional advantage for video text detection.
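The second step, tracking, can be sketched as a simple tracking-by-detection loop. The example below is an illustration rather than any of the cited systems: it assumes a detector has already produced text boxes for each frame as `(x1, y1, x2, y2)` tuples, and it greedily links each track to the best-overlapping box in the next frame. The `threshold` value and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def link_detections(frames, threshold=0.5):
    """Greedily link per-frame boxes into tracks by IoU with the previous frame.

    `frames` is a list with one list of boxes per frame. Each returned track
    is a list of (frame_index, box) pairs.
    """
    tracks = []
    for t, boxes in enumerate(frames):
        unmatched = list(boxes)
        for track in tracks:
            last_t, last_box = track[-1]
            # Only extend tracks that were alive in the previous frame.
            if last_t != t - 1 or not unmatched:
                continue
            best = max(unmatched, key=lambda b: iou(last_box, b))
            if iou(last_box, best) >= threshold:
                track.append((t, best))
                unmatched.remove(best)
        # Any box that matched no existing track starts a new one.
        tracks.extend([(t, b)] for b in unmatched)
    return tracks
```

A greedy match like this handles only short-term association. The dynamic programming and network flow methods listed under Existing Approaches look for a globally consistent assignment instead.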

The analysis of spatiotemporal data necessitates the consideration of both temporal and spatial relationships. For two key reasons, assessing both the temporal and spatial dimensions of data adds significant complexity to the data analysis process: 

1) Changes in the spatial and non-spatial features of spatiotemporal objects across time, both continuous and discrete, and

2) The influence of collocated neighbouring spatiotemporal objects on one another.

STTN is a network that can establish dense pixel correspondences in both space and time. It is a transformer that uses a multi-scale patch-based attention module to search for coherent content across all frames, along both the spatial and temporal dimensions. The module collects patches of varying scales from all video frames in order to cover the many appearance changes induced by complex video motion, and the multi-head transformer calculates similarities on spatial patches across the different scales at the same time. According to the researchers, this could be the first attempt at deep video text replacement. The field needs more research attention, and in the foreseeable future we can expect additional research contributions from Indian startups.
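The patch-based attention idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the STTN architecture: it assumes frames are small 2D lists of floats, compares raw pixel patches instead of learned query, key, and value embeddings, and runs the scales one after another where the real model assigns them to parallel attention heads. All function names are hypothetical.

```python
import math


def extract_patches(frames, size):
    """Cut every frame into non-overlapping size x size patches (flattened)."""
    patches = []
    for frame in frames:
        for y in range(0, len(frame) - size + 1, size):
            for x in range(0, len(frame[0]) - size + 1, size):
                patches.append([frame[y + dy][x + dx]
                                for dy in range(size) for dx in range(size)])
    return patches


def attend(query, patches):
    """Softmax-weighted average of patches using scaled dot-product similarity."""
    scale = math.sqrt(len(query))
    scores = [sum(q * p for q, p in zip(query, patch)) / scale
              for patch in patches]
    peak = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    output = [sum(w * patch[i] for w, patch in zip(weights, patches))
              for i in range(len(query))]
    return output, weights


def multi_scale_attention(frames, sizes=(1, 2)):
    """Run patch attention once per scale, using the first patch as the query."""
    results = {}
    for size in sizes:
        patches = extract_patches(frames, size)
        results[size] = attend(patches[0], patches)
    return results
```

Because the patches are pooled from every frame before the similarities are computed, a query patch can draw on matching content from any point in the video, not only from its own frame.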

Dr. Nivash Jeevanandam

Nivash holds a doctorate in information technology and has been a research associate at a university and a development engineer in the IT industry. Data science and machine learning excite him.