skweak is a Python-based software toolkit for applying weak supervision to various NLP tasks. It was introduced in April 2021 by Pierre Lison, Jeremy Barnes and Aliaksandr Hubin, researchers based in Norway (research paper).
Are you familiar with the term ‘weak supervision’? Have a look at its brief meaning before proceeding.
Weak supervision is an ML technique that uses noisy, unstructured or limited data sources to label training data for a supervised learning approach. Instead of annotating the data manually, labelling functions built from existing domain knowledge annotate the data automatically, eliminating the effort and cost of manual annotation.
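To make this concrete, here is a minimal sketch in plain Python (the rule and the data are invented for illustration; no skweak API is involved): a labelling function is just a rule that produces noisy labels automatically instead of relying on manual annotation.

```python
# Hypothetical labelling function: mark tokens that look like company
# suffixes, rather than hand-annotating each sentence.
def label_company(tokens):
    """Return (token index, label) pairs for company-suffix tokens."""
    suffixes = {"inc", "ltd", "corp"}  # invented cue list for this sketch
    return [(i, "COMPANY") for i, tok in enumerate(tokens)
            if tok.lower().rstrip(".") in suffixes]

labels = label_company(["Acme", "Corp.", "reported", "earnings"])
# → [(1, 'COMPANY')]
```

In practice many such functions are applied, and their (possibly conflicting) outputs are aggregated, as the next section describes.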
Overview of skweak
skweak applies weak supervision to various NLP tasks such as sequence labelling and text classification. What it does can be summarized by the following steps:
- Apply a variety of labelling functions created based on the domain knowledge on a corpus.
- Aggregate the results of all the applied labelling functions in an unsupervised manner using a generative model.
Any machine learning model can then be trained on the labelled corpus.
Image source: Research paper
Types of labeling functions used by skweak:
- Heuristics: The most straightforward way of labelling the corpus is the heuristic approach, which uses simple rules to decide the labels. For instance, if an entity ends with a term such as ‘Pvt Ltd’ or ‘Inc’, it can be labelled as a “company”. (We will see an example of such labelling in the practical implementation section later in this article.)
- Gazetteers: This group of labelling functions searches a document for occurrences of specific words or phrases. It relies on a prefix tree called a ‘trie’, which stores all the phrases to look for and is traversed using depth-first search.
- Machine learning models: skweak can employ the concept of transfer learning, i.e. it can learn to label a corpus from an ML model and then use that knowledge for labeling the actual corpus.
- Document-level labeling functions: skweak can use the concept of label consistency in a document for labeling the whole corpus. For instance, frequently occurring terms are more likely to belong to a common label.
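The prefix-tree (‘trie’) lookup behind gazetteers can be sketched with a toy dict-based implementation (illustrative only; skweak’s actual trie is more elaborate):

```python
# Toy prefix tree (trie) for multi-token phrase lookup.
END = "__end__"  # sentinel marking the end of a complete phrase

def build_trie(phrases):
    """Insert each whitespace-tokenized phrase into a nested-dict trie."""
    root = {}
    for phrase in phrases:
        node = root
        for token in phrase.split():
            node = node.setdefault(token, {})
        node[END] = True  # complete phrase ends here
    return root

def find_matches(trie, tokens):
    """Yield (start, end) token spans that match a phrase in the trie."""
    for start in range(len(tokens)):
        node = trie
        for end in range(start, len(tokens)):
            if tokens[end] not in node:
                break
            node = node[tokens[end]]
            if END in node:
                yield (start, end + 1)

trie = build_trie(["New York", "New York City"])
matches = list(find_matches(trie, ["I", "love", "New", "York", "City"]))
# → [(2, 4), (2, 5)]
```

Note that overlapping matches are all reported; a real gazetteer must also decide how to resolve them (e.g. prefer the longest match).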
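Similarly, the document-level label-consistency idea can be sketched as a majority vote over the labels a string receives elsewhere in the same document (a hypothetical illustration, not skweak’s implementation):

```python
from collections import Counter

def propagate_majority_label(spans):
    """spans: list of (text, label-or-None) pairs from one document.
    Fill in missing labels with the majority label that the same text
    received elsewhere in the document."""
    votes = {}
    for text, label in spans:
        if label is not None:
            votes.setdefault(text, Counter())[label] += 1
    return [(text, label if label is not None
             else (votes[text].most_common(1)[0][0] if text in votes else None))
            for text, label in spans]

spans = [("Acme Corp", "COMPANY"), ("Acme Corp", None), ("Paris", None)]
result = propagate_majority_label(spans)
# → [('Acme Corp', 'COMPANY'), ('Acme Corp', 'COMPANY'), ('Paris', None)]
```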
Practical implementation
Here’s a demonstration of using skweak to annotate a corpus of 200 news articles. The code has been implemented in Google Colab with Python 3.7.10, skweak 0.2.9 and spaCy 2.2.4. A step-wise explanation of the code follows:
- Install skweak using pip command.
!pip install skweak
- Install the spaCy library.
!pip install spacy
- Import required libraries and modules.
import tarfile
import spacy
import skweak
- Download the en_core_web_sm and en_core_web_md trained pipelines of spaCy.
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_md
- Extract the corpus’ text. (Data file used can be downloaded from here.)
txt = []  # Create a list to store the texts
# Open the tar archive containing the corpus
arcv_file = tarfile.open("reuters_small.tar.gz")
# For each member of the archive
for arcv_mem in arcv_file.getnames():
    # If the member is a text file, extract it, read its contents and decode it
    if arcv_mem.endswith(".txt"):
        text = arcv_file.extractfile(arcv_mem).read().decode("utf8")
        # Add the content to the storage list 'txt'
        txt.append(text)
- Load the en_core_web_sm pipeline and disable unnecessary components of the pipeline.
pipeline = spacy.load("en_core_web_sm", disable=["ner", "lemmatizer"])
docs = list(pipeline.pipe(txt))
- Create a set of cue words for other, non-commercial organizations.
otherorg = {"University", "Institute", "College", "Committee", "Party", "Agency", "Union", "Association", "Organization", "Court", "Office", "National"}
- Define a function to find companies’ names in the corpus text.
def find_company(doc):
    # For each noun chunk in the document
    for chunk in doc.noun_chunks:
        # If the chunk ends with a suffix like 'corp', 'inc' etc.
        if chunk[-1].lower_.rstrip(".") in {'corp', 'inc', 'ltd', 'llc', 'sa', 'ag'}:
            # Label the chunk as COMPANY
            yield chunk.start, chunk.end, "COMPANY"
- Create a labelling function for companies.
detect_company = skweak.heuristics.FunctionAnnotator("company_detector", find_company)
Here, ‘company_detector’ is the name given to the labelling function and ‘find_company’ is the function used for annotation.
- Run the labelling function on the entire corpus
docs = list(detect_company.pipe(docs))
- Apply the labelling function on a small piece of text from the corpus. Display the annotated entities in a document using display_entities() method.
skweak.utils.display_entities(docs[27], "company_detector")
Condensed output:
- Define a function to find the names of non-commercial organizations in the corpus text and label them.
def find_other_org(doc):
    # For each noun chunk in the document
    for chunk in doc.noun_chunks:
        # If any token of the chunk appears in the 'otherorg' cue-word set
        if any(token.text in otherorg for token in chunk):
            # Label the chunk as OTHER_ORG
            yield chunk.start, chunk.end, "OTHER_ORG"
- Create a labelling function for other organizations.
detect_other_org = skweak.heuristics.FunctionAnnotator("other_org_detector", find_other_org)
Here, ‘other_org_detector’ is the name given to the labelling function and ‘find_other_org’ is the function used for annotation.
- Apply the labelling function to the corpus.
docs = list(detect_other_org.pipe(docs))
- Apply the above labelling function on a document.
skweak.utils.display_entities(docs[28], "other_org_detector")
Output:
- Create a gazetteer labelling function.
First, we extract companies’ data from a JSON file available here.
comp_data = skweak.gazetteers.extract_json_data("crunchbase_companies.json.gz")
# Labelling function
gzt = skweak.gazetteers.GazetteerAnnotator("gazetteer", comp_data)
- Run the gazetteer function on the whole corpus.
docs = list(gzt.pipe(docs))
Apply the labelling function on a spaCy document from the corpus.
skweak.utils.display_entities(docs[28], "gazetteer")
Condensed output:
- Run a pretrained NER (Named Entity Recognition) model (here, spaCy's en_core_web_sm pipeline).
ner_model = skweak.spacy.ModelAnnotator("spacy", "en_core_web_sm")
docs = list(ner_model.pipe(docs))
Apply the NER model on a document.
skweak.utils.display_entities(docs[17], "spacy")
Condensed output:
- Aggregation step
We now aggregate the labels of different labelling functions using a generative model. This will create a unique annotation for each of the documents in the corpus.
agg_model = skweak.aggregation.HMM("hmm", ["COMPANY", "OTHER_ORG"])
Specify that the "ORG" label can represent both a company and a non-commercial organization.
agg_model.add_underspecified_label("ORG", ["COMPANY", "OTHER_ORG"])
Fit the aggregation model on the corpus.
docs = agg_model.fit_and_aggregate(docs)
Output:
Run the aggregated model on a document.
skweak.utils.display_entities(docs[17], "hmm")
Condensed output:
- Write the stream of documents to a binary file.
for document in docs:
    document.ents = document.spans["hmm"]
skweak.utils.docbin_writer(docs, "reuters_small.spacy")
- Train a final NER model on the labelled data.
!spacy init config - --lang en --pipeline ner --optimize accuracy | \
spacy train - --paths.train ./reuters_small.spacy --paths.dev ./reuters_small.spacy \
--initialize.vectors en_core_web_md --output reuters_small
Sample condensed output:
- Code source: GitHub
- Google Colab notebook of the above implementation