Meet skweak: A Python Toolkit For Applying Weak Supervision To NLP Tasks

skweak is a software toolkit based on Python, developed for applying weak supervision to various Natural Language Processing tasks.

skweak was introduced in April 2021 by Pierre Lison, Jeremy Barnes and Aliaksandr Hubin, researchers based in Norway (research paper).

Are you familiar with the term ‘weak supervision’? Here is a brief definition before we proceed.

Weak supervision is a machine learning technique that uses noisy, imprecise or limited sources of supervision to label training data for a supervised learning approach. Instead of annotating the data manually, labelling functions built from existing domain knowledge annotate the data automatically, which greatly reduces the effort and cost of manual annotation.



Overview of skweak

skweak applies weak supervision to various NLP tasks such as sequence labelling and text classification. What it does can be summarized by the following steps:

  1. Apply a variety of labelling functions, built from domain knowledge, to a corpus.
  2. Aggregate the results of all the applied labelling functions in an unsupervised manner using a generative model.

Any machine learning model can then be trained on the labelled corpus.
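
The two steps can be pictured with a small plain-Python sketch that does not use skweak at all. It rests on simplifying assumptions: the three labelling functions below are hypothetical, and a simple per-token majority vote stands in for the generative model that skweak uses for aggregation.

```python
from collections import Counter

# Hypothetical labelling functions. Each one takes a list of tokens and
# yields (start, end, label) spans, with `end` exclusive.
def label_by_suffix(tokens):
    for i, tok in enumerate(tokens):
        if i > 0 and tok.rstrip(".").lower() in {"inc", "ltd", "corp"}:
            yield i - 1, i + 1, "COMPANY"

def label_by_cue_word(tokens):
    for i, tok in enumerate(tokens):
        if i > 0 and tok in {"University", "Committee", "Court"}:
            yield i - 1, i + 1, "OTHER_ORG"

def label_by_known_name(tokens):
    for i, tok in enumerate(tokens):
        if tok in {"Acme"}:
            yield i, i + 1, "COMPANY"

def aggregate(tokens, labelling_functions):
    """Step 2, simplified: majority vote per token ("O" means no label)."""
    votes = [Counter() for _ in tokens]
    # Step 1: apply every labelling function and collect its votes
    for lf in labelling_functions:
        for start, end, label in lf(tokens):
            for i in range(start, end):
                votes[i][label] += 1
    return [c.most_common(1)[0][0] if c else "O" for c in votes]

tokens = ["Acme", "Inc", "sued", "Oslo", "University"]
labels = aggregate(tokens, [label_by_suffix, label_by_cue_word, label_by_known_name])
print(list(zip(tokens, labels)))
```

A majority vote treats every labelling function as equally reliable. A generative model, by contrast, can estimate how much each function should be trusted, which is the reason skweak uses one.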

[Figure: working of skweak. Image source: research paper]

Types of labelling functions used by skweak:

  1. Heuristics: The most straightforward way of labelling the corpus is a heuristic approach, which uses certain rules to decide upon the labels. For instance, if an entity ends with a term such as ‘Pvt Ltd’ or ‘Inc’, it can be labelled as a “company”. (We will see an example of such labelling in the practical implementation section later in this article.)
  2. Gazetteers: This group of labelling functions searches a document for occurrences of specific words or phrases from a predefined list. It relies on a prefix tree (a ‘trie’), which is traversed depth-first to find all possible occurrences.
  3. Machine learning models: skweak can reuse a machine learning model trained on another dataset, in the spirit of transfer learning: the model's predictions on the actual corpus serve as one more source of labels.
  4. Document-level labelling functions: skweak can exploit label consistency within a document. For instance, an entity that occurs several times in the same document is likely to carry the same label at each occurrence.
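
To make the gazetteer idea more concrete, here is a minimal plain-Python sketch of matching phrases with a prefix tree. It is not skweak's implementation: the helper names `build_trie` and `find_matches` are hypothetical, the trie is a nested dictionary, and the search simply keeps the longest match starting at each token.

```python
def build_trie(entries):
    """Build a nested-dict prefix tree from (phrase, label) pairs."""
    root = {}
    for phrase, label in entries:
        node = root
        for token in phrase.split():
            node = node.setdefault(token, {})
        # The key None marks the end of a complete entry (tokens are strings,
        # so it can never clash with a real token)
        node[None] = label
    return root

def find_matches(tokens, trie):
    """Yield (start, end, label) for the longest entry found at each position."""
    i = 0
    while i < len(tokens):
        node, match = trie, None
        for j in range(i, len(tokens)):
            if tokens[j] not in node:
                break
            node = node[tokens[j]]
            if None in node:
                match = (i, j + 1, node[None])
        if match:
            yield match
            i = match[1]  # continue after the matched span
        else:
            i += 1

trie = build_trie([
    ("Goldman Sachs", "COMPANY"),
    ("Goldman Sachs Group", "COMPANY"),
    ("Reuters", "COMPANY"),
])
tokens = "Shares of Goldman Sachs Group rose , Reuters reported".split()
matches = list(find_matches(tokens, trie))
print(matches)
```

Because entries that share a prefix share a path in the tree, looking up a long list of names costs roughly one dictionary step per token rather than one comparison per entry.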

Practical implementation

Here’s a demonstration of using skweak to annotate a corpus of 200 news articles. The code has been implemented in Google Colab with Python 3.7.10, skweak 0.2.9 and spaCy 2.2.4. A step-wise explanation of the code follows:

  1. Install skweak using the pip command.

!pip install skweak

  1. Install the spaCy library.

!pip install spacy

  1. Import required libraries and modules.
 import tarfile
 import spacy
 import skweak 
  1. Download the en_core_web_sm and en_core_web_md trained pipelines of spaCy.
 !python -m spacy download en_core_web_sm
 !python -m spacy download en_core_web_md 
  1. Extract the corpus text. (The data file used can be downloaded from here.)
 txt = []  # List that will store the text of each article
 # Open the tar archive containing the corpus
 arcv_file = tarfile.open("reuters_small.tar.gz")
 # For each file in the archive
 for arcv_mem in arcv_file.getnames():
     # If the file is a text file, extract it, read its contents and decode them
     if arcv_mem.endswith(".txt"):
         text = arcv_file.extractfile(arcv_mem).read().decode("utf8")
         # Add the content to the storage list 'txt'
         txt.append(text)
  1. Load the en_core_web_sm pipeline and disable unnecessary components of the pipeline.
 pipeline = spacy.load("en_core_web_sm", disable=["ner", "lemmatizer"])
 docs = list(pipeline.pipe(txt)) 
  1. Create a set of cue words that signal non-commercial organizations.
 otherorg = {"University", "Institute", "College", "Committee", "Party", "Agency",
             "Union", "Association", "Organization", "Court", "Office", "National"}
  1. Define a function to find company names in the corpus text.
 def find_company(doc):
     # For each noun chunk in the document
     for chunk in doc.noun_chunks:
         # If the last token of the chunk is a suffix such as 'corp' or 'inc'
         if chunk[-1].lower_.rstrip(".") in {'corp', 'inc', 'ltd', 'llc', 'sa', 'ag'}:
             # Label the chunk as COMPANY
             yield chunk.start, chunk.end, "COMPANY"
  1. Create a labelling function for companies.
detect_company = skweak.heuristics.FunctionAnnotator("company_detector", find_company)

Here, ‘company_detector’ is the name given to the labelling function and ‘find_company’ is the function used for annotation.

  1. Run the labelling function on the entire corpus

docs = list(detect_company.pipe(docs))

  1. Display the entities annotated by the labelling function in one document of the corpus using the display_entities() method.
skweak.utils.display_entities(docs[27], "company_detector")


  1. Define a function to find the names of non-commercial organizations in the corpus text and label them.
 def find_other_org(doc):
     # For each noun chunk in the document
     for chunk in doc.noun_chunks:
         # If any token of the chunk is one of the cue words in otherorg
         if any(token.text in otherorg for token in chunk):
             # Label the chunk as OTHER_ORG (the label name used in the aggregation step)
             yield chunk.start, chunk.end, "OTHER_ORG"
  1. Create a labelling function for other organizations.
detect_other_org = skweak.heuristics.FunctionAnnotator("other_org_detector", find_other_org)

Here, ‘other_org_detector’ is the name given to the labelling function and ‘find_other_org’ is the function used for annotation.

  1. Apply the labelling function to the corpus.

docs = list(detect_other_org.pipe(docs))

  1. Display the entities found by the above labelling function in a document.
skweak.utils.display_entities(docs[28], "other_org_detector")


  1. Create a gazetteer labelling function.

First, we extract the company data from a JSON file available here.

 # Build the gazetteer data (tries) from the JSON file
 comp_data = skweak.gazetteers.extract_json_data("crunchbase_companies.json.gz")
 # Labelling function, registered under the name "gazetteer"
 gzt = skweak.gazetteers.GazetteerAnnotator("gazetteer", comp_data)
  1. Run the gazetteer function on the whole corpus.

docs = list(gzt.pipe(docs))

Display the gazetteer annotations for a spaCy document from the corpus. The annotations are stored under the name given to the annotator, "gazetteer".

skweak.utils.display_entities(docs[28], "gazetteer")


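  1. Run a pretrained NER (Named Entity Recognition) model as a further labelling function. Here we use spaCy's en_core_web_sm pipeline, whose NER component is trained on the OntoNotes 5 corpus.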
 ner_model = skweak.spacy.ModelAnnotator("spacy", "en_core_web_sm")
 docs = list(ner_model.pipe(docs)) 

Display the NER model's annotations for a document.

skweak.utils.display_entities(docs[17], "spacy")


  1. Aggregation step

We now aggregate the labels produced by the different labelling functions using a generative model. This yields a single, unified annotation for each document in the corpus.

agg_model = skweak.aggregation.HMM("hmm", ["COMPANY", "OTHER_ORG"])

Specify that the “ORG” label can stand for either a company or a non-commercial organization.

agg_model.add_underspecified_label("ORG", ["COMPANY", "OTHER_ORG"])

Fit the aggregation model on the corpus.

docs = agg_model.fit_and_aggregate(docs)



Display the aggregated annotations for a document.

skweak.utils.display_entities(docs[17], "hmm")
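
To picture what declaring “ORG” as an underspecified label means, consider the plain-Python sketch below. It is only an analogy and not how skweak's HMM works internally: the `UNDERSPECIFIED` mapping and the `resolve` helper are hypothetical, and a fractional vote stands in for the probabilistic treatment. A vote for ORG is split evenly between the concrete labels it may stand for, so it supports both without deciding between them.

```python
from collections import defaultdict

# Assumed mapping from an underspecified label to the concrete labels it covers
UNDERSPECIFIED = {"ORG": ["COMPANY", "OTHER_ORG"]}

def resolve(votes):
    """Pick one concrete label from the votes cast for a single span."""
    scores = defaultdict(float)
    for label in votes:
        targets = UNDERSPECIFIED.get(label, [label])
        for target in targets:
            # An underspecified vote is shared equally among its targets
            scores[target] += 1.0 / len(targets)
    if not scores:
        return None
    return max(scores, key=scores.get)

# The NER model says ORG, the suffix heuristic says COMPANY
print(resolve(["ORG", "COMPANY"]))
# Two sources say ORG, the cue-word heuristic says OTHER_ORG
print(resolve(["ORG", "ORG", "OTHER_ORG"]))
```

In both calls the ORG votes alone cannot settle the outcome. The more specific labelling function tips the balance, which is the intuition behind combining a generic NER model with targeted heuristics.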

  1. Write the stream of documents to a binary file.
 # Copy the aggregated "hmm" spans into the standard entities of each document
 for document in docs:
     document.ents = document.spans["hmm"]
 skweak.utils.docbin_writer(docs, "reuters_small.spacy")
  1. Train a final NER model on the aggregated labels. The same file is used here as both training and development data.
 !spacy init config - --lang en --pipeline ner --optimize accuracy | \
 spacy train - --paths.train ./reuters_small.spacy --paths.dev ./reuters_small.spacy \
 --initialize.vectors en_core_web_md --output reuters_small


Nikita Shiledarbaxi
A zealous learner aspiring to advance in the domain of AI/ML. Eager to grasp emerging techniques to get insights from data and hence explore realistic Data Science applications as well.
