Guide To Lightly: Tool For Curating Your Vision Data

Jayita Bhattacharyya

Lightly makes deep learning more efficient by popularizing self-supervised methods for understanding and filtering raw image data. This preprocessing step can be applied before any data annotation, and the learned representations can be used to analyze and visualize datasets as well as to select a core set of samples that trains efficiently.

Self-supervised learning (a branch of unsupervised learning) refers to algorithms in which a model generates its own training labels by finding relations among parts of a data object or among different views of the same object. It is related to active learning, semi-supervised learning, and representation learning, which adapt similar techniques. While transfer learning and pre-trained models have been in common use since the ImageNet challenge, self-supervised learning is often overlooked. But in domains that pre-trained models have not previously seen, transfer learning does not perform well, and this is where self-supervised learning comes in handy with a small amount of data.


Lightly focuses on optimizing datasets rather than optimizing deep learning models, architectures, training, or regularization methodologies. Lightly was launched in 2018 by Igor Susmelj and Matthias Heller and is headquartered in Zurich, Switzerland. It was earlier known as 'WhatToLabel'. Lightly provides efficient ways to compare different sampling strategies, filter the data down to what is relevant for modelling, and label only those samples.

This is possible by combining self-supervised learning with active learning to build a machine learning pipeline for the training dataset. In self-supervised learning, the auxiliary task used for pre-training is called a "pretext task", while the specific task we actually want to solve (e.g., object classification or detection, or image segmentation), for which the pre-trained model is fine-tuned, is known as the "downstream task".
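To make the pretext/downstream split concrete, here is a minimal sketch (not from Lightly's API) of a downstream step: embeddings produced by a pretext task are classified with a simple k-nearest-neighbours vote. The embeddings and labels below are made-up toy data.

```python
import numpy as np

# Hypothetical embeddings from a pretext task: 6 samples, 4 dimensions,
# forming two loose clusters with labels 0 and 1.
embeddings = np.array([
    [1.0, 0.1, 0.0, 0.0],
    [0.9, 0.2, 0.1, 0.0],
    [1.1, 0.0, 0.1, 0.1],
    [0.0, 0.1, 1.0, 0.9],
    [0.1, 0.0, 0.9, 1.0],
    [0.0, 0.2, 1.1, 0.8],
])
labels = np.array([0, 0, 0, 1, 1, 1])

def knn_predict(query, embeddings, labels, k=3):
    """Classify a query by majority vote among its k nearest embeddings."""
    dists = np.linalg.norm(embeddings - query, axis=1)
    nearest = np.argsort(dists)[:k]
    return np.bincount(labels[nearest]).argmax()

print(knn_predict(np.array([1.0, 0.1, 0.1, 0.0]), embeddings, labels))  # → 0
```

If the pretext task learned useful representations, even this trivial downstream classifier separates the classes without any further training.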

Performance on Image Classification

Performance on Semantic Segmentation


Data selection – Helps remove data redundancy and bias. The data selector automatically removes corrupt files and rebalances the dataset at the feature level. Based on your filter preference, near-duplicates can be removed, or a new dataset can be created from the most relevant samples. Such redundancies bias the model's reported performance, affect accuracy and mean average precision, and lead to high annotation costs. An embedding is the representation of an image in a vector space; an embedding model (typically a convolutional neural network) creates embeddings from images.
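One way near-duplicate removal can work on embeddings is a greedy cosine-similarity filter; this is a simplified sketch of the idea, not Lightly's actual selection algorithm, using made-up 2-dimensional embeddings.

```python
import numpy as np

def remove_near_duplicates(embeddings, threshold=0.99):
    """Greedily keep samples whose cosine similarity to every
    already-kept sample stays below the threshold."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept = []
    for i, vec in enumerate(normed):
        if all(np.dot(vec, normed[j]) < threshold for j in kept):
            kept.append(i)
    return kept

embeddings = np.array([
    [1.0, 0.0],     # kept
    [0.999, 0.01],  # near-duplicate of the first -> removed
    [0.0, 1.0],     # kept
])
print(remove_near_duplicates(embeddings))  # → [0, 2]
```

The threshold controls how aggressive the filtering is: a lower value removes more samples and shrinks the annotation budget further.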


Data pool – The data pool feature ensures the collected data fulfils all the criteria for training. It automatically fills the dataset with negative images if it does not contain them, and provides active feedback on how the data looks, which enhances team collaboration. From here you can start preprocessing and annotating the data; datasets can also be merged or split for better performance.

Data analytics – This feature provides insights into the collected data with statistics and interactive graphs. It allows data visualization and exploration with a UMAP distribution, and provides visual quality analysis by showing which samples were kept and which were removed.
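The idea behind such visualizations is to project high-dimensional embeddings down to 2-D so they can be plotted. As a dependency-free stand-in for UMAP (which is non-linear and preserves neighbourhood structure far better), here is a sketch using a plain PCA projection via SVD on random stand-in embeddings.

```python
import numpy as np

def project_2d(embeddings):
    """Center the embeddings and project them onto the top two
    principal components (a simple linear stand-in for UMAP)."""
    centered = embeddings - embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 32))  # stand-in for real image embeddings
coords = project_2d(embeddings)
print(coords.shape)  # (100, 2)
```

The resulting `coords` can be fed to any scatter-plot library; in the real tool, UMAP coordinates serve the same role in the interactive explorer.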

Here is a demo video on how to use Lightly to upload and explore a dataset.

Industry Use Cases

  • Autonomous Vehicles – For the Shipping, Logistics, Airline, and Defense & Military sectors
  • Visual Inspection – Detect defects in infrastructure or manufactured products, or find infected plants, in the Railways & Roads, Infrastructure, Manufacturing, Agriculture, and Surveillance & Security sectors
  • Medical Imagery – Find abnormalities in medical images such as X-rays, MRIs, and microscope & medical scans in the Health/Life Science, Biotechnology, and Digital Diagnostics/Pathology sectors
  • Geospatial Data – Improve space products and achieve better results in Satellite Imaging, Visual Inspection for Space Components, and Autonomous Systems


See Also

  • Web app – up to 1,000 samples, drag and drop (no coding required)
  • Python Package (CLI) – can work with more than 25,000 samples

# install pip package

pip install lightly

To train a model and create embeddings:

from lightly import train_embedding_model, embed_images

# first, the model is trained for 10 epochs
checkpoint = train_embedding_model(input_dir='./my/cute/cats/dataset/', trainer={'max_epochs': 10})

# let's embed the dataset using our trained model
embeddings, labels, filenames = embed_images(input_dir='./my/cute/cats/dataset/', checkpoint=checkpoint)

# the shape of our embeddings: (number of samples, embedding dimension)
print(embeddings.shape)

Embeddings can be saved to a CSV file.
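A flat CSV layout for embeddings might look like the sketch below: one row per image holding the filename, the embedding components, and the label. The column names mirror Lightly's convention but are an assumption here; in practice you would use the package's own saving utilities.

```python
import csv
import io

def save_embeddings_csv(fileobj, filenames, embeddings, labels):
    """Write one row per image: filename, embedding components, label.
    Column naming (filenames, embedding_0, ..., labels) is an assumption."""
    dim = len(embeddings[0])
    writer = csv.writer(fileobj)
    writer.writerow(["filenames"] + [f"embedding_{i}" for i in range(dim)] + ["labels"])
    for name, emb, label in zip(filenames, embeddings, labels):
        writer.writerow([name] + list(emb) + [label])

buf = io.StringIO()
save_embeddings_csv(buf, ["cat_01.jpg"], [[0.12, 0.34]], [0])
print(buf.getvalue())
```

Storing embeddings this way makes them easy to reload for later analysis or to upload alongside the images.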

  • On-Premise (Docker) – used by Fortune 500 companies to process more than 1M samples

Partnered Companies

AI Retailer Systems claimed to reduce its data by 85% for an object detection model without any loss in accuracy. Curbflow reported high performance on key metrics after its collected data was filtered by Lightly. DroGone claims better data gathering and cleaning procedures for building the high-quality datasets needed to train accurate models. Other companies include Arm, Frontify, and SBB CFF FFS.


Copyright Analytics India Magazine Pvt Ltd
