SS3 has achieved state-of-the-art performance on early risk detection (ERD) problems over text streams. Because it was designed for risk detection over text streams, it supports incremental training and classification, as well as visual explanations. This article introduces PySS3, a Python package that implements SS3 for text classification and provides visualization tools, allowing the deployment of robust and explanation-ready text classification models. Following are the topics to be covered.
Table of Contents
- What is an SS3 classifier?
- How does SS3 classify text?
- Explainable text classification with SS3
Let’s start with the understanding of the SS3 classifier.
What is an SS3 classifier?
A novel supervised machine learning model for text classification, the SS3 text classifier can explain its reasoning naturally. The algorithm targets early risk detection (ERD) problems on social media, such as detecting early signs of depression, early rumour detection, or early identification of sexual predators. Because ERD tasks involve risky decisions that could affect people's lives, a classifier that can justify its decisions is especially valuable.
Working
To store word frequencies for each category, SS3 builds a dictionary from all the training documents in that category. With this simple training method, adding new training documents only requires updating the dictionaries, which makes it easy for SS3 to support online learning (for example, when users upload data in real time).
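The dictionary-based training described above can be sketched in a few lines of plain Python. This is a minimal illustration of the idea, not PySS3's actual implementation; the category names are made up for the example.

```python
from collections import Counter

# One frequency dictionary per category; training is just word counting.
dictionaries = {"sports": Counter(), "food": Counter()}

def train_incremental(category, document):
    """Add a document's word frequencies to the category's dictionary."""
    dictionaries[category].update(document.lower().split())

train_incremental("sports", "the match was a great match")
print(dictionaries["sports"]["match"])  # 2

# Online learning: a newly arrived document only updates the counters.
train_incremental("sports", "another match tonight")
print(dictionaries["sports"]["match"])  # 3
```

Because no global statistics need to be recomputed, each new document is absorbed in a single cheap update.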
To determine the value of each word, SS3 computes it from the word frequencies stored in the dictionaries with a function gv(w, c), which we will call the 'gv' function. Given a word w and a category c, gv outputs a number in the interval [0, 1] that represents the degree of confidence that the word belongs exclusively to that category.
For multiple categories, a vector version of gv is formed: each category has a fixed position in the vector, and the value at that position is the word's confidence for that category. This vector is called the "confidence vector of the word", and the text is classified according to these confidence values.
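A confidence vector can be pictured as a plain list with one slot per category. The categories and values below are hypothetical, just to show the fixed-position layout:

```python
# Fixed position per category in every confidence vector (an assumed order).
CATEGORIES = ["sports", "food", "health"]

# A hypothetical confidence vector for the word "goal":
# high confidence for "sports", near zero for the rest.
cv_goal = [0.9, 0.05, 0.0]

best = CATEGORIES[cv_goal.index(max(cv_goal))]
print(best)  # sports
```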
How does SS3 classify text?
To understand the classification algorithm, let's view it as a two-phase process.
In the first phase, the input is split into blocks (paragraphs), which are in turn broken down into smaller units (sentences, then words). This turns the previously 'flat' document into a hierarchy of blocks.
In the second phase, the vector function gv is applied to each word to obtain a set of word confidence vectors, which a word-level summary operator reduces to sentence confidence vectors. This reduction is propagated recursively to higher-level blocks, using higher-level summary operators, until the entire input is summed up in a single final confidence vector. Finally, classification is performed by applying a policy to the confidence values stored in that final vector.
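The two phases above can be sketched end to end with toy numbers. Everything here is an illustrative stand-in: the per-word vectors are hard-coded instead of coming from gv, and element-wise addition is used as the summary operator (one common choice, not necessarily SS3's default).

```python
# Hypothetical per-word confidence vectors over ["sports", "food"];
# in SS3 these would come from the gv function.
CATEGORIES = ["sports", "food"]
GV = {
    "goal":  [0.8, 0.0],
    "match": [0.7, 0.1],
    "pizza": [0.0, 0.9],
}

def word_vectors(sentence):
    return [GV.get(w, [0.0, 0.0]) for w in sentence.lower().split()]

def summarize(vectors):
    """Summary operator: element-wise addition (one possible choice)."""
    return [sum(col) for col in zip(*vectors)] if vectors else [0.0, 0.0]

def classify(text):
    # Phase 1: break the flat text into a hierarchy (sentences -> words).
    sentences = [s for s in text.split(".") if s.strip()]
    # Phase 2: reduce word vectors to sentence vectors, then to one final vector.
    sentence_vecs = [summarize(word_vectors(s)) for s in sentences]
    final = summarize(sentence_vecs)
    # Policy: pick the category with the highest accumulated confidence.
    return CATEGORIES[final.index(max(final))]

print(classify("A great goal. What a match."))  # sports
```

The same reduction could insert a paragraph level between sentences and the final vector, exactly as the hierarchy in phase one suggests.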
Hyperparameters
For the entire classification process to work, the gv function must first be used to create a basic set of confidence vectors from which higher-level confidence vectors are constructed. The computation of gv involves three functions.
- The function (lv) computes a value for a word based on the local frequency of the word in the category. The word distribution curve is smoothed by a factor controlled by the hyperparameter sigma (σ).
- The function (sg) is a sigmoid that captures the significance of the word in the category. The λ hyperparameter controls how far the value must deviate from the median to be considered significant.
- The function (sn) sanctions (decreases) the global value according to the number of categories the word appears in. The ρ hyperparameter controls how severe this sanction is.
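To make the composition concrete, here is a toy sketch of the three functions chained into gv. These formulas are simplified stand-ins, not the exact definitions from the SS3 paper; only the roles of σ, λ, and ρ are preserved.

```python
import math

def lv(freq, max_freq, sigma=0.32):
    """Toy local value: normalized frequency, smoothed by sigma."""
    return (freq / max_freq) ** sigma if max_freq else 0.0

def sg(local_value, median, lam=1.62):
    """Toy significance: sigmoid centered at lam times the median local value."""
    return 1.0 / (1.0 + math.exp(-(local_value - lam * median) * 10))

def sn(value, n_strong_categories, rho=2.35):
    """Toy sanction: shrink the value when many categories share the word."""
    return value / (n_strong_categories ** rho) if n_strong_categories else value

# gv chains the three: local value -> significance -> sanction.
gv = sn(sg(lv(50, 100), median=0.2), n_strong_categories=1)
print(round(gv, 3))
```

A word that is frequent in one category (high lv), well above the median (sg close to 1), and rare elsewhere (mild sn) ends up with a gv close to 1.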
Let's implement this concept in Python.
Explainable Text Classification with SS3
Let's use this Python package not only to classify a document but also to generate the list of text fragments its classification decision was based on. The PySS3 package has three modules; in this article, we will use the SS3 text classifier.
Install the pyss3 package
!pip install pyss3
Import necessary libraries
from pyss3 import SS3
from pyss3.util import Dataset
import numpy as np
from collections import Counter
Loading and training the classifier
!unzip -u Datasets/topic.zip -d datasets/
clf = SS3(s=0.32, l=1.62, p=2.35)
x_train, y_train = Dataset.load_from_files("datasets/topic/train", folder_label=False)
clf.train(x_train, y_train, n_grams=3)
The classifier is built with the three hyperparameters explained above, passed in order as s=0.32, l=1.62, p=2.35. x_train contains the texts and y_train the categories. The training data covers eight categories: sports, food, health, science&technology, business&finance, art&photography, music, and beauty&fashion.
Determine the document
For this implementation, we will use a document related to the moving average, stored in the variable `document`.
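The original article displayed the document as an image; here is a hypothetical stand-in on the same topic, so the calls that follow have a concrete `document` to work on:

```python
# Hypothetical stand-in for the article's document (not the original text).
document = """A moving average is a statistic used to analyse data points
by creating a series of averages of different subsets of the full data set.
In finance, a moving average smooths out price data to identify the trend
direction of a stock. Traders compare short-term and long-term moving
averages to generate buy and sell signals."""

print(len(document.split()), "words")
```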
Single label Classification
clf.classify_label(document)
The classifier labelled the document under the 'science&technology' category. Let's see how the classifier explains the categorization.
fragments = clf.extract_insight(document)
print("Total number of text fragments extracted = ", len(fragments))
The classifier split the document into thirteen fragments, which can be inspected with the code below.
print("Text:", fragments[0][0])
print()
print("Confidence value:", fragments[0][1])
As observed above, the confidence value of the first fragment is 1.4. In this way, one can view all the fragments and their confidence values.
Multi-label Classification
clf.classify_multilabel(document)
The classifier labelled the document under both the 'science&technology' and 'business&finance' categories. Let's see how the classifier explains the categorization.
fragments = clf.extract_insight(document, cat="business&finance")
fragments[:3]
Similarly, one can view the fragments and confidence values for each of the predicted categories. But what if we want to see the best paragraph categorized under a particular category in multi-label classification? For that purpose, pass 'level' as a parameter to the extract_insight function.
frag = clf.extract_insight(document, cat="science&technology", level="paragraph")
print("The best paragraph is:\n\n", frag[0][0])
print()
print("with confidence value:", frag[0][1])
We got the best paragraph categorized under 'science&technology' with a confidence score of 2.8.
Final words
An SS3 text classifier can be implemented with PySS3, an open-source Python package. Through the hands-on implementation in this article, we classified text and obtained an explanation of the classification using PySS3.