YolactEdge is one of the first competitive instance segmentation techniques that can run on small devices at real-time speeds: it reaches up to 30 FPS on the Nvidia Jetson AGX Xavier and 172 FPS on an RTX 2080 Ti. YolactEdge comes with a ResNet-101 backbone that takes a 550×550 resolution image as input. The paper, "YolactEdge: Real-time Instance Segmentation on the Edge", was authored by Haotian Liu, Rafael A. Rivera Soto, Fanyi Xiao, and Yong Jae Lee in December 2020, and the code and models are open-sourced on GitHub here.
Some of the paper's key contributions are:
- A TensorRT optimization technique that carefully balances the trade-off between speed and accuracy.
- A novel feature warping module that exploits temporal redundancy in videos.
- Benchmarks on the YouTube VIS and MS COCO datasets.
- A 3 to 5x speedup over existing real-time methods while producing competitive mask and box detection accuracy.
To run inference at real-time speeds on edge devices, the authors built on the state-of-the-art image-based real-time instance segmentation method YOLACT and made two main changes: one at the algorithm level and the other at the system level. At the system level, YolactEdge leverages Nvidia's TensorRT inference engine to quantize the network parameters to fewer bits while systematically balancing any trade-off in accuracy. At the algorithm level, it exploits temporal redundancy in video, learning to transform and propagate features over time so that the deep network's expensive backbone features do not have to be fully computed on every frame.
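To make the system-level idea concrete, here is a minimal sketch of converting a PyTorch backbone to TensorRT with torch2trt. The resnet50 model and input shape are placeholder assumptions for illustration, not the authors' exact conversion code.

```python
# Hedged sketch: convert a PyTorch backbone to a TensorRT engine with
# torch2trt; resnet50 stands in for YolactEdge's actual backbone.
import torch
from torch2trt import torch2trt
from torchvision.models import resnet50

model = resnet50().eval().cuda()
x = torch.randn(1, 3, 550, 550).cuda()        # YolactEdge uses 550x550 inputs
# fp16_mode builds the engine with 16-bit precision for faster edge inference
model_trt = torch2trt(model, [x], fp16_mode=True)

with torch.no_grad():
    y_trt = model_trt(x)                       # inference through TensorRT
```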
YolactEdge Backbone
YOLACT can be divided into 4 components (a toy sketch follows the list):

- a feature backbone,
- a feature pyramid network (FPN),
- a ProtoNet,
- a prediction head.
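The following is a toy, runnable sketch of how these four components connect. All layer sizes and module bodies are illustrative stand-ins, not the official YOLACT implementation.

```python
# Toy sketch of the four YOLACT components; every layer here is a
# simple stand-in, not the real ResNet-101/FPN/ProtoNet architecture.
import torch
import torch.nn as nn

class TinyYolact(nn.Module):
    def __init__(self, num_classes=80, num_protos=32):
        super().__init__()
        self.backbone = nn.Sequential(                  # feature backbone
            nn.Conv2d(3, 64, 3, stride=4, padding=1), nn.ReLU())
        self.fpn = nn.Conv2d(64, 256, 1)                # feature pyramid network
        self.protonet = nn.Conv2d(256, num_protos, 3, padding=1)   # prototype masks
        self.pred_head = nn.Conv2d(                     # boxes + classes + mask coefficients
            256, 4 + num_classes + num_protos, 3, padding=1)

    def forward(self, img):
        p = self.fpn(self.backbone(img))
        return self.protonet(p), self.pred_head(p)

protos, preds = TinyYolact()(torch.randn(1, 3, 550, 550))
```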
As shown in the figure above, YolactEdge extends the YOLACT method to videos by transforming a subset of the features from keyframes (shown on the left) to non-keyframes (shown on the right), to reduce expensive backbone computation. On non-keyframes, it computes only those features that are cheap to produce yet crucial for mask prediction, which largely accelerates the technique while retaining accuracy. In the figure, blue, orange, and grey indicate computed, transformed, and skipped blocks, respectively.
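At its core, the warping can be implemented as flow-guided bilinear resampling of keyframe features. Below is a hedged sketch of that idea using PyTorch's grid_sample; the function name and shapes are ours, and FeatFlowNet's actual flow estimation is omitted.

```python
# Hedged sketch of flow-guided feature warping (our naming, not the
# authors' code): sample keyframe features at positions shifted by flow.
import torch
import torch.nn.functional as F

def warp_features(feat, flow):
    """feat: (N, C, H, W) keyframe features; flow: (N, 2, H, W) pixel offsets."""
    n, _, h, w = feat.shape
    ys = torch.arange(h, dtype=torch.float32).view(1, h, 1).expand(n, h, w)
    xs = torch.arange(w, dtype=torch.float32).view(1, 1, w).expand(n, h, w)
    x_new = xs + flow[:, 0]                    # shifted x coordinates
    y_new = ys + flow[:, 1]                    # shifted y coordinates
    # normalize to [-1, 1], the coordinate range grid_sample expects
    grid = torch.stack((x_new / (w - 1) * 2 - 1,
                        y_new / (h - 1) * 2 - 1), dim=-1)
    return F.grid_sample(feat, grid, mode="bilinear", align_corners=True)

# zero flow returns the original features unchanged
warped = warp_features(torch.randn(1, 256, 69, 69), torch.zeros(1, 2, 69, 69))
```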
Implementation
YolactEdge is trained with a batch size of 32 on 4 GPUs, starting from ImageNet pre-trained weights. First, the authors pre-trained YOLACT with SGD for 500k iterations. Then, they froze the YOLACT weights and trained FeatFlowNet on the FlyingChairs dataset. Finally, they fine-tuned all weights except the ResNet backbone for 200k iterations.
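The freeze-then-fine-tune pattern of the last two stages can be expressed in PyTorch roughly as follows; the sub-modules are simple hypothetical stand-ins, not the real YOLACT or FeatFlowNet networks.

```python
# Hedged sketch of the staged training schedule; all modules are stand-ins.
import torch.nn as nn

model = nn.ModuleDict({
    "backbone": nn.Conv2d(3, 8, 3),      # stand-in for the ResNet backbone
    "flow_net": nn.Conv2d(8, 2, 3),      # stand-in for FeatFlowNet
    "head": nn.Conv2d(8, 4, 3),          # stand-in for the remaining heads
})

# Stage 2: freeze YOLACT weights, train FeatFlowNet only.
for name, p in model.named_parameters():
    p.requires_grad = name.startswith("flow_net")

# Stage 3: fine-tune everything except the ResNet backbone.
for name, p in model.named_parameters():
    p.requires_grad = not name.startswith("backbone")
```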
Installation
- YolactEdge is written in Python 3.
- Install PyTorch 1.6.0 from here.
- Install CUDA 10.2/11.0 and cuDNN 8.0.0.
- Download the TensorRT 7.1 tar file here and install TensorRT following the official documentation.
- Install torch2trt:
```
git clone https://github.com/NVIDIA-AI-IOT/torch2trt
cd torch2trt
sudo python setup.py install --plugins
```

Install some other dependencies:

```
!pip install cython
!pip install opencv-python pillow matplotlib
!pip install git+https://github.com/haotian-liu/cocoapi.git#"egg=pycocotools&subdirectory=PythonAPI"
!pip install GitPython termcolor tensorboard
```

Clone the repo and change into its directory:

```
git clone https://github.com/haotian-liu/yolact_edge.git
cd yolact_edge
```
YolactEdge Models
The authors provide baseline YOLACT and YolactEdge models trained on the COCO and YouTube VIS datasets. The YouTube VIS models are listed below.
| Method | Backbone | mAP | AGX-Xavier FPS | RTX 2080 Ti FPS | Weights |
|--------|----------|-----|----------------|-----------------|---------|
| YOLACT | R-50-FPN | 44.7 | 8.5 | 59.8 | download |
| YolactEdge (w/o TRT) | R-50-FPN | 44.2 | 10.5 | 67.0 | download |
| YolactEdge | R-50-FPN | 44.0 | 32.4 | 177.6 | download |
| YOLACT | R-101-FPN | 47.3 | 5.9 | 42.6 | download |
| YolactEdge (w/o TRT) | R-101-FPN | 46.9 | 9.5 | 61.2 | download |
| YolactEdge | R-101-FPN | 46.2 | 30.8 | 172.7 | download |
YolactEdge COCO Models
| Method | Backbone | mAP | Titan Xp FPS | AGX-Xavier FPS | RTX 2080 Ti FPS | Weights |
|--------|----------|-----|--------------|----------------|-----------------|---------|
| YOLACT | MobileNet-V2 | 22.1 | – | 15.0 | 35.7 | download |
| YolactEdge | MobileNet-V2 | 20.8 | – | 35.7 | 161.4 | download |
| YOLACT | R-50-FPN | 28.2 | 42.5 | 9.1 | 45.0 | download |
| YolactEdge | R-50-FPN | 27.0 | – | 30.7 | 140.3 | download |
| YOLACT | R-101-FPN | 29.8 | 33.5 | 6.6 | 36.5 | download |
| YolactEdge | R-101-FPN | 29.5 | – | 27.3 | 124.8 | download |
To evaluate the pretrained models, create a ./weights directory, place the corresponding weight file in it, and run the commands below.
Evaluation of YolactEdge
The commands below convert each component of the trained model to TensorRT using the optimal settings and evaluate on the YouTube VIS and COCO validation sets.
```
# Evaluate on the YouTube VIS validation set.
!python3 eval.py --trained_model=./weights/yolact_edge_vid_847_50000.pth

# Evaluate on the entire COCO validation set.
# '--yolact_transfer' is used to convert models trained with YOLACT to be compatible with YolactEdge.
!python3 eval.py --yolact_transfer --trained_model=./weights/yolact_edge_54_800000.pth

# Output a COCO file for the COCO test-dev set. This creates './results/bbox_detections.json'
# and './results/mask_detections.json' for detection and instance segmentation respectively.
# These files can then be submitted to the website for evaluation.
!python3 eval.py --yolact_transfer --trained_model=./weights/yolact_edge_54_800000.pth --dataset=coco2017_testdev_dataset --output_coco_json
```
Running on Images
```
# Display qualitative results on the specified image.
python eval.py --yolact_transfer --trained_model=weights/yolact_edge_54_800000.pth --score_threshold=0.3 --top_k=100 --image=my_image.png

# Process an image and save it to another file.
python eval.py --yolact_transfer --trained_model=weights/yolact_edge_54_800000.pth --score_threshold=0.3 --top_k=100 --image=input_image.png:output_image.png

# Process a whole folder of images.
python eval.py --yolact_transfer --trained_model=weights/yolact_edge_54_800000.pth --score_threshold=0.3 --top_k=100 --images=path/to/input/folder:path/to/output/folder
```
Running on Videos
```
# Display a video in real-time. "--video_multiframe" will process that many frames at once for improved performance.
# If video_multiframe > 1, then trt_batch_size should be increased to match or surpass it.
python eval.py --yolact_transfer --trained_model=weights/yolact_edge_54_800000.pth --score_threshold=0.3 --top_k=100 --video_multiframe=2 --trt_batch_size 2 --video=my_video.mp4

# Display a webcam feed in real-time. If you have multiple webcams, pass the index of the webcam you want instead of 0.
python eval.py --yolact_transfer --trained_model=weights/yolact_edge_54_800000.pth --score_threshold=0.3 --top_k=100 --video_multiframe=2 --trt_batch_size 2 --video=0

# Process a video and save it to another file. This is unoptimized.
python eval.py --yolact_transfer --trained_model=weights/yolact_edge_54_800000.pth --score_threshold=0.3 --top_k=100 --video=input_video.mp4:output_video.mp4
```
Conclusion
YolactEdge is a new way of looking at the object detection problem: real-time instance segmentation on the edge with limited computational power. What remains in many deep learning projects is largely an optimization problem, and approaches like YolactEdge show how it can be tackled. To learn more about the project, you can follow the resources below: