SS3 has achieved state-of-the-art performance on early risk detection (ERD) problems over text streams. Because it was designed for risk detection over text streams, it supports incremental training and classification, as well as visual explanations. This article introduces PySS3, a Python package that implements SS3 for text classification and provides visualization tools, allowing the deployment of robust and explanation-ready text classification models. Following are the topics to be covered.
Table of Contents
- What is an SS3 classifier?
- How does SS3 classify text?
- Explainable text classification with SS3
Let’s start with the understanding of the SS3 classifier.
What is an SS3 classifier?
A novel supervised machine learning model for text classification, the SS3 text classifier can explain its reasoning naturally. The algorithm targets early risk detection (ERD) problems on social media, such as detecting early signs of depression, early rumour detection, or early identification of sexual predators. Because ERD tasks involve risky decisions that could affect people's lives, a classifier that can justify its decisions is especially valuable.
Working
To store word frequencies for each category, SS3 builds a dictionary from all the training documents in that category. With this simple training method, adding new training documents only requires updating the dictionaries, which makes it easy for SS3 to support online learning (for example, when users upload data in real time).
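The dictionary-based training described above can be sketched in a few lines of plain Python. This is a minimal illustration of the idea, not PySS3's actual implementation; the category names are made up for the example.

```python
from collections import Counter

# One frequency dictionary per category; training is just word counting.
dictionaries = {"sports": Counter(), "food": Counter()}

def train_incremental(category, document):
    """Add a document's word frequencies to the category's dictionary."""
    dictionaries[category].update(document.lower().split())

train_incremental("sports", "the match was a great match")
print(dictionaries["sports"]["match"])  # 2

# Online learning: a newly arrived document only updates the counters.
train_incremental("sports", "another match tonight")
print(dictionaries["sports"]["match"])  # 3
```

Because no global statistics need to be recomputed, each new document is absorbed in a single cheap update.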
To determine the value of each word, SS3 computes it from the word frequencies stored in the dictionaries with a function gv(w, c), which we will call the 'gv' function. Given a word w and a category c, gv outputs a number in the interval [0, 1] that represents the degree of confidence that the word belongs exclusively to that category.
For multiple categories, a vector version of gv is formed: each category has a fixed position in the vector, and the value at that position is the word's confidence for that category. This vector is called the "confidence vector of the word", and the text is classified according to these confidence values.
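A confidence vector can be pictured as a plain list with one slot per category. The categories and values below are hypothetical, just to show the fixed-position layout:

```python
# Fixed position per category in every confidence vector (an assumed order).
CATEGORIES = ["sports", "food", "health"]

# A hypothetical confidence vector for the word "goal":
# high confidence for "sports", near zero for the rest.
cv_goal = [0.9, 0.05, 0.0]

best = CATEGORIES[cv_goal.index(max(cv_goal))]
print(best)  # sports
```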
How does SS3 classify text?
To understand the classification algorithm, let's view it as a two-phase process.
In the first phase, the input is split into blocks (paragraphs), which are in turn broken down into smaller units (sentences, then words). This turns the previously 'flat' document into a hierarchy of blocks.
In the second phase, the vector function gv is applied to each word to obtain a set of word confidence vectors, which a word-level summary operator reduces to sentence confidence vectors. This reduction is propagated recursively to higher-level blocks, using higher-level summary operators, until the entire input is summed up in a single final confidence vector. Finally, classification is performed by applying a policy to the confidence values stored in that final vector.
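The two phases above can be sketched end to end with toy numbers. Everything here is an illustrative stand-in: the per-word vectors are hard-coded instead of coming from gv, and element-wise addition is used as the summary operator (one common choice, not necessarily SS3's default).

```python
# Hypothetical per-word confidence vectors over ["sports", "food"];
# in SS3 these would come from the gv function.
CATEGORIES = ["sports", "food"]
GV = {
    "goal":  [0.8, 0.0],
    "match": [0.7, 0.1],
    "pizza": [0.0, 0.9],
}

def word_vectors(sentence):
    return [GV.get(w, [0.0, 0.0]) for w in sentence.lower().split()]

def summarize(vectors):
    """Summary operator: element-wise addition (one possible choice)."""
    return [sum(col) for col in zip(*vectors)] if vectors else [0.0, 0.0]

def classify(text):
    # Phase 1: break the flat text into a hierarchy (sentences -> words).
    sentences = [s for s in text.split(".") if s.strip()]
    # Phase 2: reduce word vectors to sentence vectors, then to one final vector.
    sentence_vecs = [summarize(word_vectors(s)) for s in sentences]
    final = summarize(sentence_vecs)
    # Policy: pick the category with the highest accumulated confidence.
    return CATEGORIES[final.index(max(final))]

print(classify("A great goal. What a match."))  # sports
```

The same reduction could insert a paragraph level between sentences and the final vector, exactly as the hierarchy in phase one suggests.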
Hyperparameters
For the entire classification process to work, the gv function must first be used to create a basic set of confidence vectors from which higher-level confidence vectors are constructed. The computation of gv involves three functions.
- The function (lv) computes a value for a word based on the local frequency of the word in the category. The word distribution curve is smoothed by a factor controlled by the hyperparameter sigma (σ).
- The function (sg) is a sigmoid that captures the significance of the word in the category. The λ hyperparameter controls how far the value must deviate from the median to be considered significant.
- The function (sn) sanctions (decreases) the global value according to the number of categories the word appears in. The ρ hyperparameter controls how severe this sanction is.
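To make the composition concrete, here is a toy sketch of the three functions chained into gv. These formulas are simplified stand-ins, not the exact definitions from the SS3 paper; only the roles of σ, λ, and ρ are preserved.

```python
import math

def lv(freq, max_freq, sigma=0.32):
    """Toy local value: normalized frequency, smoothed by sigma."""
    return (freq / max_freq) ** sigma if max_freq else 0.0

def sg(local_value, median, lam=1.62):
    """Toy significance: sigmoid centered at lam times the median local value."""
    return 1.0 / (1.0 + math.exp(-(local_value - lam * median) * 10))

def sn(value, n_strong_categories, rho=2.35):
    """Toy sanction: shrink the value when many categories share the word."""
    return value / (n_strong_categories ** rho) if n_strong_categories else value

# gv chains the three: local value -> significance -> sanction.
gv = sn(sg(lv(50, 100), median=0.2), n_strong_categories=1)
print(round(gv, 3))
```

A word that is frequent in one category (high lv), well above the median (sg close to 1), and rare elsewhere (mild sn) ends up with a gv close to 1.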
Let's implement this concept in Python.
Explainable Text Classification with SS3
Let's use this Python package not only to classify a document but also to generate the list of text fragments its classification decision was based on. The PySS3 package has three modules; in this article, we will use the SS3 text classifier.
Install the pyss3 package
!pip install pyss3
Import necessary libraries
from pyss3 import SS3
from pyss3.util import Dataset
import numpy as np
from collections import Counter
Loading and training the classifier
!unzip -u Datasets/topic.zip -d datasets/
clf = SS3(s=0.32, l=1.62, p=2.35)
x_train, y_train = Dataset.load_from_files("datasets/topic/train", folder_label=False)
clf.train(x_train, y_train, n_grams=3)
The classifier is built with the three hyperparameters explained above, passed in order as s=0.32, l=1.62, p=2.35. x_train contains the texts and y_train the categories. The training data covers eight categories: sports, food, health, science&technology, business&finance, art&photography, music, and beauty&fashion.
Determine the document
For this implementation, we will use a document related to the moving average, stored in the variable `document`.
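The original article displayed the document as an image; here is a hypothetical stand-in on the same topic, so the calls that follow have a concrete `document` to work on:

```python
# Hypothetical stand-in for the article's document (not the original text).
document = """A moving average is a statistic used to analyse data points
by creating a series of averages of different subsets of the full data set.
In finance, a moving average smooths out price data to identify the trend
direction of a stock. Traders compare short-term and long-term moving
averages to generate buy and sell signals."""

print(len(document.split()), "words")
```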
Single label Classification
clf.classify_label(document)
The classifier labelled the document under the 'science&technology' category. Let's see how the classifier explains the categorization.
fragments = clf.extract_insight(document)
print("Total number of text fragments extracted = ", len(fragments))
The classifier split the document into thirteen fragments, which can be inspected with the code below.
print("Text:", fragments[0][0])
print()
print("Confidence value:", fragments[0][1])
As observed above, the confidence value of the first fragment is 1.4. In this way, one can view all the fragments and their confidence values.
Multi-label Classification
clf.classify_multilabel(document)
The classifier labelled the document under both the 'science&technology' and 'business&finance' categories. Let's see how the classifier explains the categorization.
fragments = clf.extract_insight(document, cat="business&finance")
fragments[:3]
Similarly, one can view the fragments and confidence values for each of the predicted categories. But what if we want to see the best paragraph categorized under a particular category in multi-label classification? For that purpose, pass 'level' as a parameter to the extract_insight function.
frag = clf.extract_insight(document, cat="science&technology", level="paragraph")
print("The best paragraph is:\n\n", frag[0][0])
print()
print("with confidence value:", frag[0][1])
We got the best paragraph categorized under 'science&technology' with a confidence score of 2.8.
Final words
An SS3 text classifier can be implemented with PySS3, an open-source Python package. Through the hands-on implementation in this article, we classified text and obtained an explanation of the classification using PySS3.