Facebook AI launched a computer vision system called DINO to segment unlabeled and random images and videos without supervision. The open-source PyTorch implementation and pre-trained models for DINO are currently available on GitHub.
DINO stands for self DIstillation with NO labels. Facebook has developed the new model in collaboration with researchers at INRIA, Sorbonne University, MILA and McGill University. DINO has set a new state of the art among self-supervised methods.
Self-supervised vision transformers (ViTs), a type of machine learning model, carry explicit information about the semantic segmentation of an image, and in this respect perform better than supervised ViTs and convolutional neural networks (CNNs).
Unlike traditional supervised learning, DINO doesn’t require large volumes of annotated/labelled data. With DINO, highly accurate segmentation can be achieved with self-supervised learning and a suitable architecture.
Interestingly, DINO focuses on the near object even in highly ambiguous situations.
Source: Facebook (Showcasing the visual representation of the original image, followed by supervised model and unsupervised model (DINO))
“Our model can discover and segment objects in an image or a video with absolutely no supervision,” said researchers at Facebook AI, pointing to visuals that compare the original video with the output of DINO (a self-supervised vision transformer).
Source: Facebook (Self-attention maps of neural network on videos of a puppy, a horse, a BMX rider, and a fishing boat, using DINO)
Facebook’s CTO Mike Schroepfer, while explaining the nuances of the new computer vision system, said the DINO self-supervised model is inspired by how young children learn language, physics, and more without formal instruction.
More with less
Facebook AI claimed that its latest work lets users train models using limited computing resources. Besides launching DINO, the company has also introduced PAWS (predicting view assignments with support samples), a new model-training approach that delivers accurate results with much less compute. The open-source PyTorch implementation of PAWS is also available on GitHub.
“When pretraining a standard ResNet-50 model with PAWS using just 1 percent of the labels in ImageNet, we get state-of-the-art accuracy while doing 10x fewer pretraining steps,” claimed Facebook researchers.
DINO, for its part, works by training a student network to match the output of a teacher network over different views of the same image.
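The student-teacher matching described above can be sketched in a few lines of plain Python. This is a minimal illustration, not Facebook's implementation: the function names, the temperatures, the centering vector and the momentum value are all assumptions chosen for the example, and the "networks" are reduced to hand-written output scores for two views of one image.

```python
import math

def softmax(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature gives a sharper output."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(logits, temperature):
    """Numerically stable log of the softmax above."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_total for s in scaled]

def distillation_loss(student_logits, teacher_logits, center,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the teacher's and the student's output distributions.

    The teacher output is centered and sharpened and acts as a fixed target;
    only the student would receive gradients in a real training loop.
    All default values here are illustrative assumptions.
    """
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = softmax(centered, teacher_temp)
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))

def ema_update(teacher_weights, student_weights, momentum=0.996):
    """Move the teacher's weights slowly toward the student's (exponential moving average)."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_weights, student_weights)]

# Hypothetical outputs for two different views (crops) of the same image.
teacher_view = [1.8, 0.4, -0.9]
student_agrees = [2.0, 0.5, -1.0]
student_disagrees = [-1.0, 0.5, 2.0]
center = [0.0, 0.0, 0.0]

loss_agree = distillation_loss(student_agrees, teacher_view, center)
loss_disagree = distillation_loss(student_disagrees, teacher_view, center)
```

The loss is small when the student's prediction for one view agrees with the teacher's prediction for the other and large when it does not, which is the signal that stands in for labels. Updating the teacher as a slow average of the student, as in `ema_update`, is one common way to keep the target stable in this kind of setup.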
Similarly, in another example where the model was asked to recognise duplicate images, Facebook’s DINO outperformed existing models, even though it was not trained to solve that particular problem. DINO was more accurate than both a ViT trained on ImageNet and MultiGrain, which had so far been considered the most accurate for duplicate detection.
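To make the duplicate-detection idea concrete, here is a minimal sketch of how near-duplicates could be found once each image has been turned into a feature vector. The article does not say how the comparison is done; cosine similarity with a fixed threshold is an assumption made for illustration, and the image names, feature values and the `find_near_duplicates` helper are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def find_near_duplicates(query, gallery, threshold=0.9):
    """Return (name, similarity) pairs at or above the threshold, most similar first."""
    scored = [(name, cosine_similarity(query, features))
              for name, features in gallery.items()]
    matches = [(name, score) for name, score in scored if score >= threshold]
    return sorted(matches, key=lambda item: item[1], reverse=True)

# Made-up 3-dimensional features; real image features would have hundreds of dimensions.
gallery = {
    "beach_cropped": [0.9, 0.1, 0.4],
    "beach_filtered": [0.8, 0.2, 0.5],
    "city_night": [-0.3, 0.9, 0.1],
}
query = [0.85, 0.15, 0.45]

matches = find_near_duplicates(query, gallery)
matched_names = {name for name, _ in matches}
```

Here the two edited beach images clear the threshold and the unrelated city image does not. In practice the threshold trades false positives against missed duplicates and would need to be tuned on real data.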
Source: Facebook (The image showcases how DINO can recognise near-duplicate images taken from the Flickr dataset, where red and green outlined images indicate false and true positives)
Additionally, Facebook’s new model discovers object parts and shared characteristics across images, and learns to categorise and structure images into groups based on physical properties, such as animal species, in a way that resembles biological taxonomy.
Source: Facebook (Feature representation of unlabelled data)
Lately, Facebook has been bullish on open-source frameworks, tools, libraries and models for researchers and developers to deploy in large-scale production.
Last month, Facebook launched an open-source machine learning library called Flashlight that lets researchers execute AI applications seamlessly using C++. The library, written entirely in C++, is currently available on GitHub.
A year before that, Facebook had launched an open-source graph transformer networks (GTN) framework for effectively training graph-based learning models.