
Meet skweak: A Python Toolkit For Applying Weak Supervision To NLP Tasks

skweak is a software toolkit based on Python, developed for applying weak supervision to various Natural Language Processing tasks.



skweak was introduced in April 2021 by Pierre Lison, Jeremy Barnes and Aliaksandr Hubin from Norway (research paper).

Are you familiar with the term ‘weak supervision’? Have a look at its brief meaning before proceeding.

Weak supervision is a machine learning technique that uses noisy, imprecise or limited sources to label training data for supervised learning. Instead of annotating the data by hand, labelling functions built from existing domain knowledge annotate it automatically, which removes much of the effort and cost of manual annotation.

Overview of skweak

skweak applies weak supervision to various NLP tasks such as sequence labelling and text classification. What it does can be summarized by the following steps:

  1. Apply a variety of labelling functions, built from domain knowledge, to a corpus.
  2. Aggregate the outputs of all the labelling functions in an unsupervised manner using a generative model.

Any machine learning model can then be trained on the labelled corpus.
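
The two steps above can be sketched in plain Python. This is only a minimal illustration and not skweak's implementation: the labelling functions `lf_suffix`, `lf_gazetteer` and `lf_capitalised`, the gazetteer entries and the sample tokens are all made up, and a simple majority vote stands in for the generative model that skweak uses for aggregation.

```python
from collections import Counter

# Hypothetical toy resources, invented for this illustration
COMPANY_SUFFIXES = {"inc", "corp", "ltd"}
KNOWN_COMPANIES = {"Acme", "Globex"}

# Each labelling function returns a label for token i, or None to abstain
def lf_suffix(tokens, i):
    """Vote COMPANY if the following token is a company suffix."""
    if i + 1 < len(tokens) and tokens[i + 1].lower().rstrip(".") in COMPANY_SUFFIXES:
        return "COMPANY"
    return None

def lf_gazetteer(tokens, i):
    """Vote COMPANY if the token is in the (made-up) list of known companies."""
    return "COMPANY" if tokens[i] in KNOWN_COMPANIES else None

def lf_capitalised(tokens, i):
    """Deliberately noisy: guess PERSON for any capitalised non-initial token."""
    return "PERSON" if i > 0 and tokens[i][:1].isupper() else None

LABELLING_FUNCTIONS = [lf_suffix, lf_gazetteer, lf_capitalised]

def aggregate(tokens):
    """Step 1: apply every labelling function. Step 2: majority vote per token."""
    labels = []
    for i in range(len(tokens)):
        votes = [lf(tokens, i) for lf in LABELLING_FUNCTIONS]
        votes = [v for v in votes if v is not None]
        # "O" (outside any entity) when every function abstains
        labels.append(Counter(votes).most_common(1)[0][0] if votes else "O")
    return labels

tokens = ["Shares", "of", "Acme", "Inc.", "rose"]
labels = aggregate(tokens)
print(list(zip(tokens, labels)))
```

In this toy run, "Acme" gets two COMPANY votes against one PERSON vote and ends up as COMPANY, while "Inc." is labelled PERSON because only the noisy function fires on it. That is the kind of error that more labelling functions and a better aggregation model are meant to reduce.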

Figure: Working of skweak (image source: research paper)

Types of labelling functions used by skweak:

  1. Heuristics: The most straightforward way of labelling a corpus is a heuristic approach, which uses rules to decide on the labels. For instance, if an entity ends with a term such as ‘Pvt Ltd’ or ‘Inc’, it can be labelled as a “company”. (We will see an example of such labelling in the practical implementation section later in this article.)
  1. Gazetteers: This group of labelling functions searches a document for occurrences of specific words or phrases. It relies on a prefix tree called a ‘trie’, which is traversed depth-first to find all possible occurrences.
  1. Machine learning models: skweak can use an ML model trained on related data as a labelling function. This is a form of transfer learning: what the model has learned elsewhere is used to label the actual corpus.
  1. Document-level labelling functions: skweak can exploit label consistency within a document. For instance, repeated occurrences of the same term in a document are likely to share a common label.
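
To make the gazetteer idea more concrete, here is a minimal sketch of a prefix tree over tokens. It is an illustration only: `build_trie` and `find_matches` are hypothetical helpers, the phrases are made up, and the lookup is a simple left-to-right longest-match scan, which may differ from how skweak traverses its own trie.

```python
END = object()  # sentinel key marking the end of a complete entry

def build_trie(phrases):
    """Build a prefix tree in which each level is keyed by one token."""
    root = {}
    for phrase in phrases:
        node = root
        for token in phrase.split():
            node = node.setdefault(token, {})
        node[END] = True
    return root

def find_matches(tokens, trie):
    """Return (start, end) token offsets of the longest entry found at each position."""
    matches = []
    i = 0
    while i < len(tokens):
        node, longest, j = trie, None, i
        # Walk down the trie for as long as the next token continues some entry
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if END in node:
                longest = j
        if longest is not None:
            matches.append((i, longest))
            i = longest  # continue after the match
        else:
            i += 1
    return matches

trie = build_trie(["Bank of England", "Bank of America", "Reuters"])
tokens = "The Bank of England told Reuters that Bank lending rose".split()
matches = find_matches(tokens, trie)
print([" ".join(tokens[start:end]) for start, end in matches])
```

The two entries starting with "Bank of" share the same path in the tree, so looking up a phrase costs one dictionary access per token regardless of how many entries the gazetteer holds. The lone "Bank" near the end of the sentence is not matched because it is only a prefix of an entry, not a complete one.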

Practical implementation

Here’s a demonstration of using skweak to annotate a corpus of 200 news articles. The code was run on Google Colab with Python 3.7.10, skweak 0.2.9 and spaCy 2.2.4. A step-by-step explanation of the code follows:

  1. Install skweak using pip command.

!pip install skweak

  1. Install the spaCy library.

!pip install spacy

  1. Import required libraries and modules.
 import tarfile
 import spacy
 import skweak 
  1. Download the en_core_web_sm and en_core_web_md trained pipelines of spaCy.
 !python -m spacy download en_core_web_sm
 !python -m spacy download en_core_web_md 
  1. Extract the corpus’ text. (Data file used can be downloaded from here.)
 txt = []  # List that will hold the text of each article
 # Open the compressed tar archive containing the corpus
 arcv_file = tarfile.open("reuters_small.tar.gz")
 # For each member of the archive
 for arcv_mem in arcv_file.getnames():
     # If the member is a text file, extract it, read its contents and decode them
     if arcv_mem.endswith(".txt"):
         text = arcv_file.extractfile(arcv_mem).read().decode("utf8")
         # Add the content to the storage list 'txt'
         txt.append(text)
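
If you want to try the extraction logic without downloading the corpus, the sketch below wraps it in a function and runs it on a tiny archive built in memory. The helper names `read_texts` and `make_demo_archive` and the file contents are made up for this illustration.

```python
import io
import tarfile

def read_texts(archive):
    """Return the decoded contents of every .txt member of an open tar archive."""
    texts = []
    for name in archive.getnames():
        if name.endswith(".txt"):
            texts.append(archive.extractfile(name).read().decode("utf8"))
    return texts

def make_demo_archive(files):
    """Build a small gzip-compressed tar archive in memory (demo data only)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, content in files.items():
            data = content.encode("utf8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer

demo = make_demo_archive({
    "a.txt": "First article.",
    "notes.md": "Not a text file, so it is skipped.",
    "b.txt": "Second article.",
})
# The context manager closes the archive once reading is done
with tarfile.open(fileobj=demo, mode="r:gz") as archive:
    texts = read_texts(archive)
print(texts)
```

Opening the archive in a `with` block is a small design choice: the file handle is released even if reading one of the members fails.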
  1. Load the en_core_web_sm pipeline and disable unnecessary components of the pipeline.
 pipeline = spacy.load("en_core_web_sm", disable=["ner", "lemmatizer"])
 docs = list(pipeline.pipe(txt)) 
  1. Create a set of cue words for other, non-commercial organizations.
 otherorg = {"University", "Institute", "College", "Committee", "Party", "Agency",
                        "Union", "Association", "Organization", "Court", "Office", "National"} 
  1. Define a function to find companies’ names in the corpus text.
 def find_company(doc):
     # For each noun chunk in the document
     for chunk in doc.noun_chunks:
         # If the chunk's last token is a legal suffix such as 'corp' or 'inc'
         if chunk[-1].lower_.rstrip(".") in {'corp', 'inc', 'ltd', 'llc', 'sa', 'ag'}:
             # Label the chunk as COMPANY
             yield chunk.start, chunk.end, "COMPANY"
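
The suffix test at the heart of find_company can be tried out on plain strings, without loading a spaCy pipeline. In the sketch below, `ends_with_company_suffix` is a hypothetical helper that takes the tokens of one noun chunk as a list of strings. The example chunks are made up.

```python
COMPANY_SUFFIXES = {"corp", "inc", "ltd", "llc", "sa", "ag"}

def ends_with_company_suffix(chunk_tokens):
    """Mirror the check in find_company: lowercase the last token and drop trailing dots."""
    if not chunk_tokens:
        return False
    return chunk_tokens[-1].lower().rstrip(".") in COMPANY_SUFFIXES

examples = [
    ["Apple", "Inc."],
    ["Siemens", "AG"],
    ["the", "central", "bank"],
    ["a", "long", "saga"],
]
results = [ends_with_company_suffix(tokens) for tokens in examples]
for tokens, is_company in zip(examples, results):
    print(" ".join(tokens), "->", is_company)
```

Because the whole last token is compared against the set, a word such as "saga" is not mistaken for the suffix "sa". Stripping the trailing dot lets "Inc." and "Inc" be treated alike.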
  1. Create labelling function for companies 
detect_company = skweak.heuristics.FunctionAnnotator("company_detector", find_company)

Here, ‘company_detector’ is the name given to the labelling function and ‘find_company’ is the function used for annotation.

  1. Run the labelling function on the entire corpus

docs = list(detect_company.pipe(docs))

  1. Inspect the result on one document from the corpus. The display_entities() method displays the entities that a given labelling function annotated in a document.
skweak.utils.display_entities(docs[27], "company_detector")


  1. Define a function to find the names of non-commercial organizations in the corpus text and label them.
 def find_other_org(doc):
     # For each noun chunk in the document
     for chunk in doc.noun_chunks:
         # If any token of the chunk is one of the cue words in otherorg
         if any(token.text in otherorg for token in chunk):
             # Label the chunk as OTHER_ORG, the label name used in the aggregation step
             yield chunk.start, chunk.end, "OTHER_ORG"
  1. Create a labelling function for other organizations.
detect_other_org = skweak.heuristics.FunctionAnnotator("other_org_detector", find_other_org)

Here, ‘other_org_detector’ is the name given to the labelling function and ‘find_other_org’ is the function used for annotation.

  1. Apply the labelling function to the corpus.

docs = list(detect_other_org.pipe(docs))

  1. Display the entities that the above labelling function annotated in a document.
skweak.utils.display_entities(docs[28], "other_org_detector")


  1. Create Gazetteers labelling function.

First, we extract companies’ data from a JSON file available here

 comp_data = skweak.gazetteers.extract_json_data("crunchbase_companies.json.gz")
 #Labelling function
 gzt = skweak.gazetteers.GazetteerAnnotator("gazetteer", comp_data) 
  1. Run the gazetteer function on the whole corpus.

docs = list(gzt.pipe(docs))

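Display the entities annotated by the gazetteer in a spaCy document from the corpus. The second argument is the name given to the labelling function, here "gazetteer".

skweak.utils.display_entities(docs[28], "gazetteer")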


  1. Run a pretrained NER (Named Entity Recognition) model, here spaCy's en_core_web_sm pipeline.
 ner_model = skweak.spacy.ModelAnnotator("spacy", "en_core_web_sm")
 docs = list(ner_model.pipe(docs)) 

Display the entities found by the NER model in a document.

skweak.utils.display_entities(docs[17], "spacy")


  1. Aggregation step

We now aggregate the labels of different labelling functions using a generative model. This will create a unique annotation for each of the documents in the corpus.

agg_model = skweak.aggregation.HMM("hmm", ["COMPANY", "OTHER_ORG"])

Specify that the “ORG” label can stand for either a company or a non-commercial organization.

agg_model.add_underspecified_label("ORG", ["COMPANY", "OTHER_ORG"])

Fit the aggregation model on the corpus and aggregate the labels.

docs = agg_model.fit_and_aggregate(docs)
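
As a rough intuition for what an underspecified label means, the toy sketch below treats an "ORG" vote as partial support for each of the specific labels it can stand for. This is not how skweak's HMM works internally. The `resolve` function and the equal split of underspecified votes are assumptions made purely for illustration.

```python
from collections import defaultdict

# "ORG" is compatible with both specific labels (mirrors the call above)
UNDERSPECIFIED = {"ORG": ["COMPANY", "OTHER_ORG"]}

def resolve(votes):
    """Pick the specific label with the highest score.

    A vote for an underspecified label is split equally among the
    specific labels it can stand for (an assumption of this sketch).
    """
    if not votes:
        return "O"  # no labelling function fired
    scores = defaultdict(float)
    for label in votes:
        targets = UNDERSPECIFIED.get(label, [label])
        for target in targets:
            scores[target] += 1.0 / len(targets)
    return max(scores, key=scores.get)

first = resolve(["ORG", "COMPANY"])            # COMPANY 1.5 vs OTHER_ORG 0.5
second = resolve(["ORG", "ORG", "OTHER_ORG"])  # COMPANY 1.0 vs OTHER_ORG 2.0
print(first, second)
```

The point of the sketch is that a generic "ORG" prediction, such as one coming from a pretrained NER model, never contradicts a more specific labelling function. It only adds weight to whichever specific label the other functions favour.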



Display the aggregated annotations for a document.

skweak.utils.display_entities(docs[17], "hmm")

  1. Write the stream of documents to a binary file.
 # Use the aggregated "hmm" spans as each document's entities
 for document in docs:
     document.ents = document.spans["hmm"]
 skweak.utils.docbin_writer(docs, "reuters_small.spacy")
  1. Train the final NER model on the aggregated labels. In this demonstration, the same file serves as both training and development data.
 !spacy init config - --lang en --pipeline ner --optimize accuracy | \
 spacy train - --paths.train ./reuters_small.spacy --paths.dev ./reuters_small.spacy \
 --initialize.vectors en_core_web_md --output reuters_small




Nikita Shiledarbaxi

A zealous learner aspiring to advance in the domain of AI/ML. Eager to grasp emerging techniques to get insights from data and hence explore realistic Data Science applications as well.