Hands-On Guide to Small Text: A Python Tool for Active Learning

This article focuses on small-text, a Python library for active learning that provides active learning algorithms for text classification and allows mixing and matching many classifiers.

Data is now more accessible and more plentiful than ever, which has increased the effort a data scientist must spend on analysing and preprocessing it. Preprocessing involves several steps: cleaning the data, validating it, and labelling it. Labelling in particular can take a great deal of time and effort. In supervised learning, a model needs labelled data to train well. For example, in an NLP task we may have a collection of people's opinions, but if they are not categorized by sentiment, we cannot train on them. Before developing the model, we have to tag each text with a sentiment label, and going through all the data to do so is laborious. Active learning comes into the picture to reduce the effort of labelling raw data.

What is active learning?

In an active learning setting, a learner tries to become experienced at making decisions in the field it has been assigned to. For example, when checking multiple-choice questions, the paper checker awards marks for right answers and zero or negative marks for wrong ones. Here the learner is the paper checker, who tries to become an expert at assigning the correct marks.

In machine learning, active learning is a subset of the overall learning process. We often have to analyse data that is unlabelled, yet supervised learning needs a label for every data point, and models learn best from well-labelled data. Active learning takes part in exactly those cases where labels still have to be provided.

The idea of active learning is to let a machine learning algorithm reach a higher level of accuracy with only a small number of labelled training examples. To achieve this, the model is allowed to choose the data from which it wants to learn.

To label unlabelled data, we usually rely on a human annotator. This is a powerful way to annotate data, but is it efficient? A data scientist already has several tasks, such as analysing the data, visualizing it, understanding it and finding patterns in it, and these are time-consuming on their own. If accurate labelling is added on top, the whole procedure becomes hectic and slow, and the data scientist ends up spending most of the effort on labelling. Active learning helps to reduce that effort.

Active learning works differently in different situations. Roughly, it can be divided into three categories.

  • Stream-based selective sampling.
  • Pool-based sampling.
  • Membership query synthesis.

Stream-based selective sampling

In this category, the algorithm decides, while the model is training, whether it is worth querying the label for a specific unlabelled instance. Instances arrive one at a time as a sequence, and the algorithm requests a label only if labelling that instance would benefit training sufficiently.
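As a rough illustration of this idea, here is a minimal sketch in plain Python. It does not use small-text; the toy threshold classifier, the `oracle` function (standing in for a human annotator), the `margin` parameter and the data are all assumptions made up for the example.

```python
def oracle(x):
    """Hypothetical annotator: the true label is 1 for x >= 37, else 0."""
    return 1 if x >= 37 else 0


def fit_threshold(labeled):
    """Toy 1-D classifier: boundary halfway between the two classes."""
    largest_negative = max(x for x, y in labeled if y == 0)
    smallest_positive = min(x for x, y in labeled if y == 1)
    return (largest_negative + smallest_positive) / 2


def stream_based_sampling(stream, seed_set, margin):
    """Decide for each arriving instance whether its label is worth querying."""
    labeled = list(seed_set)
    queried = []
    for x in stream:
        threshold = fit_threshold(labeled)
        # Query only when the instance lies close to the current boundary,
        # i.e. when the model is uncertain about it; otherwise discard it.
        if abs(x - threshold) <= margin:
            labeled.append((x, oracle(x)))
            queried.append(x)
    return fit_threshold(labeled), queried


seed_set = [(0, 0), (100, 1)]  # one labelled example per class
stream = [10, 90, 45, 60, 30, 40, 36, 38, 37]
threshold, queried = stream_based_sampling(stream, seed_set, margin=15)
print(queried)    # [45, 30, 40, 36, 38, 37]
print(threshold)  # 36.5
```

Only six of the nine streamed instances are sent to the annotator; the three that lie far from the boundary are skipped because their labels would not change the model.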

Pool-based sampling

In this type of active learning, the algorithm evaluates the unlabelled samples in a data source (the pool) and selects the best-suited ones to be labelled before the modelling step is performed. The algorithm usually learns from the data that is already labelled in order to decide which unlabelled samples to label next. These are the most commonly used algorithms for active learning.
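The following sketch shows the pool-based loop with a simple uncertainty criterion: in every round, the sample closest to the current decision boundary is picked from the pool. It is plain Python rather than small-text, and the toy classifier, the `oracle` function and the integer pool are assumptions chosen only for illustration.

```python
def oracle(x):
    """Hypothetical annotator: the true label is 1 for x >= 37, else 0."""
    return 1 if x >= 37 else 0


def fit_threshold(labeled):
    """Toy 1-D classifier: boundary halfway between the two classes."""
    largest_negative = max(x for x, y in labeled if y == 0)
    smallest_positive = min(x for x, y in labeled if y == 1)
    return (largest_negative + smallest_positive) / 2


def pool_based_sampling(pool, seed_set, rounds):
    """Repeatedly query the pool sample the current model is least sure about."""
    pool = list(pool)
    labeled = list(seed_set)
    queried = []
    for _ in range(rounds):
        threshold = fit_threshold(labeled)
        # Least certain sample = the one closest to the decision boundary.
        query = min(pool, key=lambda x: abs(x - threshold))
        pool.remove(query)
        labeled.append((query, oracle(query)))
        queried.append(query)
    return fit_threshold(labeled), queried


pool = list(range(1, 100))     # 99 unlabelled samples
seed_set = [(0, 0), (100, 1)]  # one labelled example per class
threshold, queried = pool_based_sampling(pool, seed_set, rounds=7)
print(queried)    # [50, 25, 37, 31, 34, 35, 36]
print(threshold)  # 36.5
```

In this toy setting, seven queries are enough to locate the boundary between 36 and 37, whereas labelling the whole pool would have required 99 labels.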

Membership query synthesis

These algorithms are not useful in every case, because here the algorithm is allowed to generate synthetic data instances itself and ask for them to be labelled. This is useful where data instances are easy to generate.
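A minimal sketch of this idea, again in plain Python with made-up names: instead of choosing from existing data, the learner synthesizes each query point itself (here, the midpoint of the region it is still unsure about). The `oracle` function and the numeric range are assumptions for the example.

```python
def oracle(x):
    """Hypothetical annotator: the true label is 1 for x >= 37, else 0."""
    return 1 if x >= 37 else 0


def membership_query_synthesis(low, high, rounds):
    """Generate query points instead of picking them from existing data.

    `low` is known to be labelled 0 and `high` is known to be labelled 1.
    """
    queries = []
    for _ in range(rounds):
        query = (low + high) / 2  # synthesize a new instance
        queries.append(query)
        if oracle(query) == 1:
            high = query          # boundary is at or below the query
        else:
            low = query           # boundary is above the query
    return low, high, queries


low, high, queries = membership_query_synthesis(0, 100, rounds=7)
print(queries)    # [50.0, 25.0, 37.5, 31.25, 34.375, 35.9375, 36.71875]
print(low, high)  # 36.71875 37.5
```

This works here because any number is a valid instance. For text, a synthesized instance may not be a meaningful document, which is one reason this approach does not fit every case.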

small-text is an easy way to apply active learning in different kinds of machine learning procedures. In the rest of the article, we discuss small-text for active learning.


As the article’s name suggests, the focus here is the small-text library for active learning, which provides active learning algorithms for text classification and allows mixing and matching many classifiers and query strategies to build active learning applications. Using its features, we can easily build classifiers with sklearn, and we can additionally use PyTorch classifiers with transformer models.

We start with small-text by implementing a binary classification model on the 20 newsgroups data provided by sklearn. Before building the model, let us see how to install the package in Python.

pip install small-text


Anyone who wants to use the PyTorch classifiers and transformers needs an extra requirement, which can be installed with the following command.

pip install small-text[transformers]

Since I am using Google Colab, I will clone the whole package from its repository at the link below.

!git clone https://github.com/webis-de/small-text.git


After cloning, we can install the package by giving the exact path where it was cloned. Before giving the path, you will need to mount your drive in the Colab notebook.

Using the following commands, we can mount our drive.

from google.colab import drive
# The mount point is an assumption; '/content/drive' is the usual Colab choice.
drive.mount('/content/drive')


After mounting, we can install the cloned package using the following command.

!pip install /content/small-text


After restarting the runtime, we can use the whole package.

Now we can start developing a binary classifier using the small-text library.

Importing the required library:

import numpy as np
from small_text.active_learner import PoolBasedActiveLearner
from small_text.classifiers import ConfidenceEnhancedLinearSVC
from small_text.classifiers.factories import SklearnClassifierFactory
from small_text.query_strategies import PoolExhaustedException, EmptyPoolException
from small_text.query_strategies import RandomSampling
from small_text.data import SklearnDataSet

Defining active learning parameters:

clf_template = ConfidenceEnhancedLinearSVC()
clf_factory = SklearnClassifierFactory(clf_template)
query_strategy = RandomSampling()
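To give an intuition for what a random query strategy does, here is a conceptual sketch in plain Python. It is not small-text's implementation: the function name, the `ValueError` and the pool of 100 indices are hypothetical and only mirror the idea of drawing distinct indices from the unlabelled pool.

```python
import random


def random_sampling_query(unlabeled_indices, num_samples, rng):
    """Pick `num_samples` distinct indices uniformly at random from the pool."""
    if num_samples > len(unlabeled_indices):
        # Stand-in for the situation where the pool cannot serve the query.
        raise ValueError('Not enough samples left to handle the query.')
    return rng.sample(unlabeled_indices, num_samples)


rng = random.Random(42)
unlabeled = list(range(100))  # indices of a hypothetical unlabelled pool
query = random_sampling_query(unlabeled, 20, rng)
print(len(query), len(set(query)))  # 20 20
```

Random sampling ignores the model entirely, which makes it a useful baseline against which model-based query strategies can be compared.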

Defining a function to load the data using sklearn:

from sklearn.datasets import fetch_20newsgroups

def get_twenty_newsgroups_corpus(categories=['rec.sport.baseball', 'rec.sport.hockey']):
    # Strip headers, footers and quotes so that only the message body is used.
    train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'),
                               categories=categories)
    test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'),
                              categories=categories)
    return train, test

def get_train_test():
    train, test = get_twenty_newsgroups_corpus()
    return train, test

Defining a function to preprocess the data:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

def preprocess_data(train, test):
    # Fit the TF-IDF vocabulary on the training texts only and reuse it for the test texts.
    vectorizer = TfidfVectorizer(stop_words='english')

    # TfidfVectorizer already L2-normalizes rows by default, so normalize() is only a safeguard.
    x_train = normalize(vectorizer.fit_transform(train.data))
    x_test = normalize(vectorizer.transform(test.data))

    # Wrap features and labels in small-text's dataset type.
    return SklearnDataSet(x_train, train.target), SklearnDataSet(x_test, test.target)

Defining a function to evaluate the results using sklearn’s metrics:

from sklearn.metrics import f1_score

def evaluate(active_learner, train, test):
    y_pred = active_learner.classifier.predict(train)
    y_pred_test = active_learner.classifier.predict(test)

    # For single-label classification, micro-averaged F1 equals accuracy,
    # which is why the scores are printed as accuracy.
    print('Train accuracy: {:.2f}'.format(f1_score(train.y, y_pred, average='micro')))
    print('Test accuracy: {:.2f}'.format(f1_score(test.y, y_pred_test, average='micro')))
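The function above prints an "accuracy" but computes a micro-averaged F1 score. For single-label classification the two are the same number, and the small check below illustrates why. It is a plain-Python sketch with hand-written metric functions and made-up labels, not sklearn's implementation.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes first."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1
            elif p == label:
                fp += 1  # predicted this class, but it was another one
            elif t == label:
                fn += 1  # missed an instance of this class
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = [0, 0, 1, 1, 1, 0, 1, 0]  # made-up labels for illustration
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
print(accuracy(y_true, y_pred), micro_f1(y_true, y_pred))  # 0.75 0.75
```

Every wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so pooled precision and recall both equal the share of correct predictions.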


Merging all these functions into the main function:

def main():
    # Active learning parameters
    clf_template = ConfidenceEnhancedLinearSVC()
    clf_factory = SklearnClassifierFactory(clf_template)
    query_strategy = RandomSampling()

    # Load and preprocess the data
    text_train, text_test = get_train_test()
    train, test = preprocess_data(text_train, text_test)

    # Create the active learner and give it an initial labelled set
    active_learner = PoolBasedActiveLearner(clf_factory, query_strategy, train)
    labeled_indices = initialize_active_learner(active_learner, train.y)

    try:
        perform_active_learning(active_learner, train, labeled_indices, test)
    except PoolExhaustedException:
        print('Error! Not enough samples left to handle the query.')
    except EmptyPoolException:
        print('Error! No more samples left. (Unlabeled pool is empty)')

Defining a function to perform active learning.

def perform_active_learning(active_learner, train, labeled_indices, test):
    # Perform 15 iterations of active learning...
    for i in range(15):
        # ...where each iteration consists of labelling 20 samples
        q_indices = active_learner.query(num_samples=20)

        # Simulate the annotator by looking up the true labels.
        y = train.y[q_indices]

        # Return the labels for the current query to the active learner.
        active_learner.update(y)

        labeled_indices = np.concatenate([q_indices, labeled_indices])

        print('Iteration #{:d} ({} samples)'.format(i, len(labeled_indices)))
        evaluate(active_learner, train[labeled_indices], test)

This loop performs 15 iterations of active learning. In each iteration, 20 news samples are queried and the active learner is then updated; the update step reveals the true labels to the active learner.
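As a quick sanity check on the numbers, the size of the labelled set after each iteration can be worked out in a few lines of Python. The sketch assumes the 20-sample initial set (10 per class) that the initialization function in this article draws before the loop starts.

```python
initial_samples = 20        # 10 positive + 10 negative samples used for initialization
samples_per_iteration = 20  # num_samples passed to each query
iterations = 15

# Labelled-set size reported after each iteration of the loop.
sizes = [initial_samples + samples_per_iteration * (i + 1) for i in range(iterations)]
print(sizes[0], sizes[-1])  # 40 320
```

So the first evaluation runs on 40 labelled samples and the last one on 320.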

Defining a function to initialize the active learner; this function is required for model-based query strategies.

def initialize_active_learner(active_learner, y_train):
    indices_pos_label = np.where(y_train == 1)[0]
    indices_neg_label = np.where(y_train == 0)[0]

    # Draw 10 random samples from each class so the first model sees both labels.
    x_indices_initial = np.concatenate(
        [np.random.choice(indices_pos_label, 10, replace=False),
         np.random.choice(indices_neg_label, 10, replace=False)])

    x_indices_initial = x_indices_initial.astype(int)
    y_initial = [y_train[i] for i in x_indices_initial]

    active_learner.initialize_data(x_indices_initial, y_initial)
    return x_indices_initial

Running the main function:

if __name__ == '__main__':
    main()


Here we can see that the accuracy of the model has increased. The output is too large to include in full, but it started with a training accuracy of 1.00 and a test accuracy of 0.76 at a sample size of 40.

In this article, we have seen what active learning is and how easily we can perform it with small-text. The library has many features; for example, it provides full functionality for data labelling with transformer models. Moreover, since it is built entirely on Python packages, it is easy to understand with a basic knowledge of Python. In some studies, reinforcement learning can be used in place of active learning. Reinforcement learning is inspired by behavioural psychology, whereas active learning is closer to supervised learning. Model development with active learning is more rigid and tends to give more accurate results than reinforcement learning.

