How to Easily Annotate Text Data with LightTag

LightTag is a text annotation tool designed to help developers and researchers get their work done easily.

Natural language processing (NLP) has advanced rapidly thanks to progress in deep learning. LightTag aims to deliver both speed and quality when annotating text data, producing the ground truth needed to train high-quality machine learning models. LightTag was launched in 2018 by founder Tal Perry and is headquartered in Berlin, Germany.

LightTag lets you form an annotation team to work on an NLP project. The project manager defines how many annotators are assigned to each example, and LightTag automatically allocates the work. Annotations can then be aggregated or viewed annotator by annotator, supported by LightTag's review, reporting, and quality assurance features. The user-friendly UI comes with a fully managed hosted solution that includes daily backups with long retention times for restoring work, and a redundant cluster of servers for high availability. The interface is optimised, has full Unicode support, and makes no tokenization assumptions.
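To make the idea of aggregating several annotators' work concrete, here is a minimal sketch of majority-vote aggregation for a single example. It is not LightTag's own aggregation logic; the function name `aggregate_labels`, the annotator-to-label dictionary, and the `min_agreement` threshold are all assumptions made for illustration.

```python
from collections import Counter


def aggregate_labels(annotations, min_agreement=0.5):
    """Majority-vote one example's labels from several annotators.

    `annotations` maps annotator name -> label (a hypothetical format).
    Returns (label, agreement), or (None, agreement) when the top label's
    share of votes does not exceed `min_agreement`.
    """
    counts = Counter(annotations.values())
    label, votes = counts.most_common(1)[0]
    agreement = votes / len(annotations)
    if agreement > min_agreement:
        return label, agreement
    return None, agreement


# Three annotators labelled the same example.
example = {"alice": "positive", "bob": "positive", "carol": "negative"}
result = aggregate_labels(example)
print(result)
```

Examples that come back with `None` are the ones worth sending to a reviewer, which is why assigning more than one annotator per example can pay off.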



  • Annotation Types and Productivity

Span Annotation, Document Classifications, Document Tagging, Entity Annotations, Relationships Annotation.

Phrase and Subword Annotations, Document Metadata, Pre-Annotations, Very Long Class Lists, Guidelines, Keyboard Shortcuts, Auto Save, Search.

Team Collaboration, Automatic Scheduling & Task Assignment, Multiple Annotators Per Document, Role-Based Access Control, Teams Productivity Reports.

  • Multilingual

LightTag covers a wide range of languages and domain-specific registers, such as Chinese legalese, Hebrew medical records, English financial jargon, and Arabic tweets.

  • Performance Dashboard

Inter-Annotator Agreement Reports, Review & Adjudication.
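As an illustration of what an inter-annotator agreement report measures, the sketch below computes Cohen's kappa for two annotators who labelled the same examples. Whether LightTag uses this particular statistic is not stated here; the function `cohen_kappa` and the sample labels are assumptions for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same examples."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of examples where both labels match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


annotator_1 = ["PER", "ORG", "PER", "LOC", "PER", "ORG"]
annotator_2 = ["PER", "ORG", "LOC", "LOC", "PER", "PER"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

A kappa near 1 suggests the guidelines are being applied consistently, while a value near 0 means the agreement is about what chance alone would produce.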

  • Evaluation Metrics: precision and recall reports, confusion matrices, and heatmaps, all of which can be downloaded to review the quality of the data.
  • Automation

LightTag suggests labels through its machine-in-the-loop system.
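One way to judge suggested labels is to score them against the annotations that humans finally accept, using the precision and recall mentioned under the evaluation metrics. The sketch below is an assumption-laden illustration, not LightTag's reporting code: the `(start, end, label)` span tuples and the function `precision_recall` are hypothetical, and only exact span matches count.

```python
def precision_recall(suggested, accepted):
    """Score suggested spans against human-accepted spans.

    Each span is a (start, end, label) tuple; only exact matches count.
    """
    suggested, accepted = set(suggested), set(accepted)
    true_positives = len(suggested & accepted)
    precision = true_positives / len(suggested) if suggested else 0.0
    recall = true_positives / len(accepted) if accepted else 0.0
    return precision, recall


# Hypothetical spans: three suggestions, four accepted annotations.
suggested_spans = [(0, 8, "ORG"), (20, 28, "LOC"), (40, 45, "PER")]
accepted_spans = [(0, 8, "ORG"), (20, 28, "LOC"), (50, 58, "ORG"), (60, 66, "PER")]
precision, recall = precision_recall(suggested_spans, accepted_spans)
print(precision, recall)
```

Low precision means annotators spend time rejecting bad suggestions; low recall means the suggestions miss much of what annotators still have to label by hand.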

  • Review and Quality Assurance
  • DevOps-free hosted solution for annotation projects: your own domain to work from anywhere, high availability through replication across high-powered servers, a separate database, and planned daily backups with a guaranteed 30-day retention.
  • For private or sensitive data, uploading to the cloud can be problematic. LightTag addresses this with an on-premise version that runs on an OpenShift, Kubernetes, or Docker Swarm cluster.
  • Unlike some other annotation tools, LightTag avoids complex XML formats in which annotations have to be merged with the raw text. Data, annotations, text, and metadata are all available as JSON, so annotations can be fed directly to ML code in PyTorch, TensorFlow, scikit-learn, or any other framework.
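To show how JSON annotations can feed a model, the sketch below converts character-offset spans into BIO tags over whitespace tokens. The record layout (`text`, `annotations`, `start`, `end`, `tag`) is a hypothetical schema chosen for illustration and may differ from LightTag's actual export; the whitespace tokenizer is likewise an assumption.

```python
import json
import re


def spans_to_bio(text, annotations):
    """Turn character-offset spans into (token, BIO-tag) pairs."""
    tagged = []
    for match in re.finditer(r"\S+", text):  # naive whitespace tokens
        tag = "O"
        for ann in annotations:
            # A token belongs to a span if it lies inside the span's offsets.
            if match.start() >= ann["start"] and match.end() <= ann["end"]:
                prefix = "B" if match.start() == ann["start"] else "I"
                tag = f"{prefix}-{ann['tag']}"
                break
        tagged.append((match.group(), tag))
    return tagged


# Hypothetical export record with one span covering "New York".
record = json.loads("""
{"text": "Will be leaving for New York today",
 "annotations": [{"start": 20, "end": 28, "tag": "LOC"}]}
""")
bio = spans_to_bio(record["text"], record["annotations"])
print(bio)
```

Because spans are stored as character offsets rather than token indices, the choice of tokenizer is left to the training pipeline; spans that start or end in the middle of a token would need extra handling that this sketch omits.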


Use Cases

  • Finance: annotating chats, call transcripts, or social media.
  • Legal: labelling contracts.
  • Marketing: searching and annotating social media for brand mentions and sentiment in any domain or language.
  • Pharma & Medical: annotating drug-to-drug interactions.


Client-Server Connection (Authentication with API Key)

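import requests
import pandas as pd

LIGHTTAG_DOMAIN_SETUP = 'demo_setup'  # should be your LightTag domain
# Complete this with your workspace host and API root
SERVER = 'https://{domain}'.format(domain=LIGHTTAG_DOMAIN_SETUP)
MY_USER = 'MY_USER'      # placeholder: your LightTag username
MY_PWD = 'MY_PASSWORD'   # placeholder: your LightTag password

# Log in to obtain an API token (username/password payload is an assumption)
response = requests.post(SERVER + '/auth/token/login/',
                         json={"username": MY_USER, "password": MY_PWD})
assert response.status_code == 200, "Could not authenticate"
auth_details = response.json()
token = auth_details['key']
assert auth_details['is_manager'] == 1, "not a manager"  # Check you are a manager

# Convenient to set up a requests session instead of repeating the token
session = requests.session()
session.headers.update({"Authorization": "Token {token}".format(token=token)})

# Try it out, e.g. by listing projects (endpoint path assumed; check the LightTag API docs)
session.get(SERVER + '/v1/projects/').json()
[{'id': '2789ca38-69p9-4c96-9z31-df6f4069b027',
  'slug': 'default',
  'url': '',
  'name': 'default'}]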

Preparing Dataset

import json
from pprint import pprint

# Load the dataset (a list of dicts, one per example)
with open('./billboard.json') as f:
    all_data = json.load(f)

pprint(all_data[0])  # Inspect one example
{'created_at': 'Tue Dec 01 13:37:52 +0000 2020',
 'date': '2020-12-01',
 'favorite_count': 52035,
 'id_str': '947824196909961',
 'in_reply_to_user_id_str': None,
 'is_retweet': False,
 'retweet_count': 8678,
 'source': 'Twitter for Android',
 'text': 'Will be leaving for New York today at 4:00 P.M. '
         'Lot of work to be done, still it will be a wonderful New Year!',
 'time': 16148138720000000}
print("total of {num} examples".format(num=len(all_data)))
total of 2789 examples
# Split at one index so no example appears in both train and test
train, test = all_data[:2056], all_data[2056:]  # 2056 train examples, 733 test examples
exploratory = train[:90]  # Take 90 examples from the training set for exploratory work
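The slice-based split above depends on the order of records in the file. If that order is not random (for example, sorted by date), a shuffled split with a fixed seed may be preferable. The sketch below is one way to do it; `shuffled_split`, its default fraction, and the stand-in records are assumptions for illustration.

```python
import random


def shuffled_split(examples, train_fraction=0.75, seed=13):
    """Shuffle a copy of `examples` and split it into disjoint train/test lists."""
    shuffled = list(examples)  # copy, so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in records; in the walkthrough this would be `all_data`.
examples = [{"id_str": str(i), "text": f"example {i}"} for i in range(20)]
train_set, test_set = shuffled_split(examples)
print(len(train_set), len(test_set))
```

Cutting at a single index guarantees that the two lists never share an example, and the seed makes the split repeatable across runs.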

Partnered Companies

Hoodline, PitchBook, Harvard Law School, Newsela, Numerator, MIT, Microsoft

Jayita Bhattacharyya
Machine learning and data science enthusiast. Eager to learn new technology advances. A self-taught techie who loves to do cool stuff using technology for fun and worthwhile.
