Introduction To YolactEdge For Real-time Instance Segmentation On Edge Devices

YolactEdge is one of the first competitive instance segmentation methods that can run on small devices at real-time speeds: it reaches up to 30 FPS on an Nvidia Jetson AGX Xavier and 172 FPS on an RTX 2080 Ti. YolactEdge uses a ResNet-101 backbone and takes a 550×550 resolution image as input. The paper, "YolactEdge: Real-time Instance Segmentation on the Edge", was authored by Haotian Liu, Rafael A. Rivera Soto, Fanyi Xiao, and Yong Jae Lee in December 2020, and the code and models are open-sourced on GitHub.

Some of the new features and things the authors came up with are:

  • A TensorRT optimization technique that carefully trades off speed and accuracy.
  • A novel feature warping module that exploits temporal redundancy in videos.
  • Evaluation on both the YouTube VIS and MS COCO datasets.
  • A 3–5× speedup over existing real-time methods while producing competitive mask and box detection accuracy.

To run inference at real-time speeds on edge devices, the authors built on the state-of-the-art image-based real-time instance segmentation method YOLACT and made two main changes: one at the algorithm level and one at the system level. At the system level, YolactEdge leverages Nvidia's TensorRT inference engine to quantize the network parameters to fewer bits while systematically balancing any trade-off in accuracy. At the algorithm level, it exploits temporal redundancy in video, learning to warp and propagate features over time so that the deep network's expensive backbone features do not have to be fully recomputed on every frame.
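The precision-reduction idea behind the system-level change can be sketched with a toy NumPy simulation of symmetric fixed-point quantization. This only illustrates the bits-versus-error trade-off; TensorRT's actual INT8 calibration is considerably more sophisticated:

```python
import numpy as np

def fake_quantize(w, bits=8):
    """Simulate symmetric fixed-point quantization of a weight tensor.

    A toy stand-in for the kind of precision reduction TensorRT's
    low-bit modes perform, not TensorRT's actual calibration.
    """
    # One scale for the whole tensor, chosen so the largest weight fits.
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    q = np.round(w / scale).astype(np.int32)  # integer code, fits in `bits` bits
    return q * scale                          # dequantized approximation

rng = np.random.default_rng(0)
w = rng.standard_normal(1000).astype(np.float32)
err8 = np.abs(fake_quantize(w, 8) - w).max()
err4 = np.abs(fake_quantize(w, 4) - w).max()
print(err8 < err4)  # fewer bits -> larger quantization error
```

This is the trade-off the authors balance per-component: quantize aggressively where accuracy is insensitive, keep higher precision where it is not.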

YolactEdge Backbone

YOLACT can be divided into 4 components: 

  1. a feature Backbone.
  2. a feature pyramid network(FPN). 
  3. a ProtoNet.
  4. a Prediction Head.
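To illustrate how the last two components fit together, here is a minimal NumPy sketch of YOLACT-style mask assembly: the ProtoNet produces k prototype masks and the prediction head produces k mask coefficients per detected instance. The shapes and names here are illustrative, not the actual network:

```python
import numpy as np

def assemble_masks(prototypes, coefficients):
    """Combine prototype masks with per-instance coefficients, YOLACT-style.

    prototypes:   (H, W, k) array of k prototype masks from the ProtoNet.
    coefficients: (n, k) array of mask coefficients from the prediction head.
    Returns an (n, H, W) array of soft instance masks in [0, 1].
    """
    # Linear combination of the prototypes, then a sigmoid.
    lin = np.einsum("hwk,nk->nhw", prototypes, coefficients)
    return 1.0 / (1.0 + np.exp(-lin))

# Toy example: 4 prototypes on an 8x8 grid, 2 detected instances.
rng = np.random.default_rng(0)
protos = rng.standard_normal((8, 8, 4))
coeffs = rng.standard_normal((2, 4))
masks = assemble_masks(protos, coeffs)
print(masks.shape)  # (2, 8, 8)
```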

As shown in the figure below, YolactEdge extends the YOLACT method to video by transforming a subset of the features from keyframes (left) to non-keyframes (right) in order to reduce expensive backbone computation. On non-keyframes, it computes only the features that are cheap yet crucial for mask prediction, which largely accelerates the method while retaining accuracy. In the figure, blue, orange, and grey indicate computed, transformed, and skipped blocks respectively.

[Figure: the YolactEdge pipeline on keyframes vs. non-keyframes — blue: computed, orange: transformed, grey: skipped blocks]
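The core of the feature transform can be sketched as backward warping of a keyframe's feature map along a flow field. This simplified NumPy version uses nearest-neighbour sampling where the real module performs bilinear sampling of FPN features:

```python
import numpy as np

def warp_features(feat, flow):
    """Warp a (C, H, W) feature map backward along a (2, H, W) flow field.

    A simplified nearest-neighbour stand-in for the bilinear warping
    that YolactEdge's flow-based module performs on keyframe features.
    flow[0] holds x-displacements, flow[1] holds y-displacements.
    """
    C, H, W = feat.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # Each output pixel samples from (y - flow_y, x - flow_x) in the keyframe.
    src_y = np.clip(np.round(ys - flow[1]).astype(int), 0, H - 1)
    src_x = np.clip(np.round(xs - flow[0]).astype(int), 0, W - 1)
    return feat[:, src_y, src_x]

feat = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
zero_flow = np.zeros((2, 4, 4))
# With zero flow, warping is the identity.
assert np.array_equal(warp_features(feat, zero_flow), feat)
```

On non-keyframes, transformed features like these replace the output of the skipped backbone blocks, so only the cheap layers need to run.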

Implementation

YolactEdge is trained with a batch size of 32 on 4 GPUs, starting from ImageNet pre-trained weights. First, the authors pre-trained YOLACT with SGD for 500k iterations. Then, they froze the YOLACT weights and trained FeatFlowNet on the FlyingChairs dataset. Finally, they fine-tuned all weights except the ResNet backbone for 200k iterations.
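The three-phase schedule can be summarized in a small helper. The component names below are hypothetical labels for illustration, not identifiers from the codebase:

```python
def trainable_components(phase):
    """Which weight groups are updated in each training phase described above.

    A schematic with hypothetical component names, not actual config keys.
    """
    phases = {
        # Phase 1: pre-train YOLACT end-to-end with SGD (500k iterations).
        "pretrain_yolact": {"backbone", "fpn", "protonet", "head"},
        # Phase 2: freeze YOLACT, train FeatFlowNet on FlyingChairs.
        "train_flow": {"featflownet"},
        # Phase 3: fine-tune everything except the ResNet backbone (200k iterations).
        "finetune": {"fpn", "protonet", "head", "featflownet"},
    }
    return phases[phase]

print(trainable_components("finetune"))  # backbone stays frozen here
```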

Installation

  • YolactEdge is written in Python 3.
  • Install PyTorch 1.6.0.
  • Install CUDA 10.2/11.0 and cuDNN 8.0.0.
  • Download the TensorRT 7.1 tar file and install TensorRT following the official documentation.
  • Install torch2trt:

 git clone https://github.com/NVIDIA-AI-IOT/torch2trt
 cd torch2trt
 sudo python setup.py install --plugins

  • Install some other dependencies:

 pip install cython
 pip install opencv-python pillow matplotlib
 pip install git+https://github.com/haotian-liu/cocoapi.git#"egg=pycocotools&subdirectory=PythonAPI"
 pip install GitPython termcolor tensorboard

  • Clone the repo and change into its directory:

 git clone https://github.com/haotian-liu/yolact_edge.git
 cd yolact_edge
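After installation, a quick sanity check can confirm the Python dependencies are importable. This is a hypothetical helper, not part of the repo; note that importable module names can differ from pip package names (e.g. opencv-python provides cv2):

```python
import importlib.util

def missing_packages(names):
    """Return the subset of module `names` that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Module names for the dependencies installed above.
required = ["Cython", "cv2", "PIL", "matplotlib", "pycocotools", "git", "termcolor"]
print(missing_packages(required))  # [] means everything is in place
```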

YolactEdge Models

The authors provide baseline YOLACT and YolactEdge models trained on the COCO and YouTube VIS datasets. Given below is the information for the YouTube VIS models.

 Method                 Backbone   mAP   AGX-Xavier FPS  RTX 2080 Ti FPS  Weights
 YOLACT                 R-50-FPN   44.7   8.5             59.8            download
 YolactEdge (w/o TRT)   R-50-FPN   44.2  10.5             67.0            download
 YolactEdge             R-50-FPN   44.0  32.4            177.6            download
 YOLACT                 R-101-FPN  47.3   5.9             42.6            download
 YolactEdge (w/o TRT)   R-101-FPN  46.9   9.5             61.2            download
 YolactEdge             R-101-FPN  46.2  30.8            172.7            download

YouTube VIS models

YolactEdge COCO Models

 Method      Backbone      mAP   Titan Xp FPS  AGX-Xavier FPS  RTX 2080 Ti FPS  Weights
 YOLACT      MobileNet-V2  22.1  -             15.0             35.7            download
 YolactEdge  MobileNet-V2  20.8  -             35.7            161.4            download
 YOLACT      R-50-FPN      28.2  42.5           9.1             45.0            download
 YolactEdge  R-50-FPN      27.0  -             30.7            140.3            download
 YOLACT      R-101-FPN     29.8  33.5           6.6             36.5            download
 YolactEdge  R-101-FPN     29.5  -             27.3            124.8            download

COCO models

To evaluate the pretrained models, create a ./weights directory, put the corresponding weight file in it, and run the commands below.
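A small helper (hypothetical, not part of the repo) can create the expected directory and show which checkpoints are already in place:

```python
from pathlib import Path

def ensure_weights_dir(root="."):
    """Create the ./weights directory the eval commands expect and
    list any .pth checkpoint files already placed there."""
    weights = Path(root) / "weights"
    weights.mkdir(exist_ok=True)
    return sorted(p.name for p in weights.glob("*.pth"))

print(ensure_weights_dir())  # [] on a fresh checkout; lists downloaded checkpoints otherwise
```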

Evaluation of YolactEdge

The commands below convert each component of the trained model to TensorRT using the optimal settings and evaluate on the YouTube VIS validation set.

 !python3 eval.py --trained_model=./weights/yolact_edge_vid_847_50000.pth
 # Evaluate on the entire COCO validation set.
 # '--yolact_transfer' is used to convert the models trained with YOLACT to be compatible with YolactEdge.
 !python3 eval.py --yolact_transfer --trained_model=./weights/yolact_edge_54_800000.pth
 # Output a COCO file for the COCO test-dev set. The command will create './results/bbox_detections.json' and './results/mask_detections.json' for detection and instance segmentation respectively. These files can then be submitted to the website for evaluation.
 !python3 eval.py --yolact_transfer --trained_model=./weights/yolact_edge_54_800000.pth --dataset=coco2017_testdev_dataset --output_coco_json 

Running on Images

 # Display qualitative results on the specified image.
 python eval.py --yolact_transfer --trained_model=weights/yolact_edge_54_800000.pth --score_threshold=0.3 --top_k=100 --image=my_image.png
 # Process an image and save it to another file.
 python eval.py --yolact_transfer --trained_model=weights/yolact_edge_54_800000.pth --score_threshold=0.3 --top_k=100 --image=input_image.png:output_image.png
 # Process a whole folder of images.
 python eval.py --yolact_transfer --trained_model=weights/yolact_edge_54_800000.pth --score_threshold=0.3 --top_k=100 --images=path/to/input/folder:path/to/output/folder 
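The effect of --score_threshold and --top_k can be illustrated with a small NumPy filter. This is a sketch of the idea, not the repo's actual post-processing code:

```python
import numpy as np

def filter_detections(scores, score_threshold=0.3, top_k=100):
    """Mimic what --score_threshold and --top_k do: drop low-confidence
    detections, then keep at most top_k of the rest, highest score first.
    Returns the indices of the kept detections."""
    keep = np.flatnonzero(scores >= score_threshold)
    order = keep[np.argsort(scores[keep])[::-1]]  # sort survivors by score, descending
    return order[:top_k]

scores = np.array([0.9, 0.1, 0.5, 0.35, 0.2])
print(filter_detections(scores, score_threshold=0.3, top_k=2))  # [0 2]
```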

On videos

 # Display a video in real-time. "--video_multiframe" will process that many frames at once for improved performance.
 # If video_multiframe > 1, then the trt_batch_size should be increased to match it or surpass it. 
 python eval.py --yolact_transfer --trained_model=weights/yolact_edge_54_800000.pth --score_threshold=0.3 --top_k=100 --video_multiframe=2 --trt_batch_size 2 --video=my_video.mp4
 # Display a webcam feed in real-time. If you have multiple webcams pass the index of the webcam you want instead of 0.
 python eval.py --yolact_transfer --trained_model=weights/yolact_edge_54_800000.pth --score_threshold=0.3 --top_k=100 --video_multiframe=2 --trt_batch_size 2 --video=0
 # Process a video and save it to another file. This is unoptimized.
 python eval.py --yolact_transfer --trained_model=weights/yolact_edge_54_800000.pth --score_threshold=0.3 --top_k=100 --video=input_video.mp4:output_video.mp4 
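The --video_multiframe batching can be sketched as simple chunking of the frame stream (an illustration of the grouping, not the repo's actual pipeline):

```python
def batch_frames(frames, video_multiframe=2):
    """Group a list of frames into chunks of `video_multiframe`, the way
    --video_multiframe feeds several frames through the network at once.
    The last chunk may be smaller than the requested batch size."""
    return [frames[i:i + video_multiframe]
            for i in range(0, len(frames), video_multiframe)]

print(batch_frames(list(range(5)), 2))  # [[0, 1], [2, 3], [4]]
```

This is also why trt_batch_size must match or exceed video_multiframe: each chunk is a single batched forward pass through the TensorRT engine.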

Conclusion 

YolactEdge offers a new way of looking at the object detection problem: real-time instance segmentation on the edge with far less computation. Much of what remains when deploying deep learning projects is an optimization problem, and approaches like YolactEdge show how it can be tackled. To learn more about the project, refer to the paper and the official GitHub repository.

Mohit Maithani
Mohit is a Data & Technology Enthusiast with good exposure to solving real-world problems in various avenues of IT and Deep learning domain. He believes in solving human's daily problems with the help of technology.
