What is Extreme Multilabel Text Classification?

The problem of assigning the most relevant subset of class labels to each document from an extremely large label collection, where the number of labels could reach hundreds of thousands or millions, is known as extreme multi-label text classification (XMTC).


The problem of assigning the most relevant subset of class labels to each document from an extremely large label collection, where the number of labels could reach hundreds of thousands or millions, is known as extreme multi-label text classification (XMTC). In this post, we will have a look at how multi-label and multiclass classification differs from one another, as well as the approaches and techniques used to deal with XMTC. Below is a list of the main points to be discussed in this article.

Table Of Contents

  1. What is Extreme Multilabel Text Classification?
  2. Approaches to Extreme Multilabel Text Classification
  3. Methods for Extreme Multilabel Text Classification

Let’s start the discussion by understanding the difference between Multi-label and Multi-class problems.


Sign up for your weekly dose of what's up in emerging technology.

What is Extreme Multilabel Text Classification?

The problem of determining the most relevant subset of labels for each document from an extremely large space of categories is known as extreme multi-label text classification (XMTC). Wikipedia, for example, has over a million category labels created by curators due to the rapid growth of internet content and the pressing need for Organizational perspectives on big data, and an article may have more than one relevant label. 

Amazon shopping items, for example, have a complex hierarchical structure. For shopping organizations, there are over one million categories of items, and each item is usually part of more than one relevant category. Solving such large-scale multi-label classification problems presents new challenges for machine learning.

The traditional binary or multi-class classification problems that have been extensively studied in the machine learning literature are fundamentally different from multi-label classification. Binary classifiers treat class labels as independent target variables, which is clearly inefficient for multi-label classification because dependencies between labels cannot be taken advantage of. In multi-label settings, multi-class classifiers rely on the mutually exclusive assumption about class labels (i.e., one document should have only one class label), which is incorrect.

The extremely severe data sparsity issue contributes to the difficulty in solving XMTC problems. The distribution of labels is common in XMTC datasets, which means that a large proportion of the labels have very few training instances associated with them.

As a result, learning the dependency patterns among labels is difficult. Another significant challenge in XMTC is that when the number of labels reaches hundreds of thousands or even millions, the computational costs of both training and testing mutually independent classifiers become practically prohibitive.

Recently, significant progress has been made in XMTC. To deal with the large label space as well as scalability and data sparsity issues, several approaches have been proposed. In the following section, we’ll classify XMC algorithms into four categories or approaches used to solve the problem: one-vs-all approaches, partitioning methods, embedding-based approaches, and deep learning approaches.

Approaches to Extreme Multilabel Text Classification

One-Vs-All (OVA) approach

The naive one-versus-all approach treats each label as a separate binary classification problem. OVA approaches have been shown to achieve high accuracies, but when the number of labels is very large, they suffer from expensive computation for both training and prediction. As a result, several techniques for speeding up the algorithm have been proposed. 

PDSparse uses primal and dual sparsity to accelerate training and prediction. Parallelism and sparsity are investigated in order to speed up the algorithm and reduce model size. OVA approaches are also commonly used as building blocks for a variety of other approaches.

Embedding Based approaches

The label matrix in embedding models is represented using a low-rank representation so that the label similarity search can be done in a low-dimensional space. Embedding-based methods, in other words, assume that the label space can be represented by a low-dimensional latent space with similar latent representations for similar labels.

 However, in practice, embedding-based models often perform worse than sparse one-vs-all and partitioning approaches to achieve similar computational speedups, which could be due to the inefficiency of the label representation structure.

Deep learning approaches

Deep learning representations, such as TF-IDF features, are expected to better capture the semantic information in text inputs than bag-of-words features. AttentionXML and HAXMLNet networks used attention models to extract embeddings from text inputs, while XML-CNN used CNN models to represent text input.

For training, the SLICE network used supervised pre-trained embeddings from XML-CNN models. Pre-trained deep language models such as BERT, ELMo, and GPT have recently demonstrated promising results on a variety of NLP tasks. However, previous research has not been able to incorporate these pre-trained large models for XMC, posing significant training and inference challenges.

Partitioning Methods

Partitioning can be implemented in two ways. Partitioning the input space is one thing, but partitioning the label space is another. The input partition only contains a small subset of labels, and the label partition only contains a small subset of instances when the output is sparse. 

Furthermore, using tree-based approaches to partition the labels allows for sublinear time prediction with respect to label size. For example, uses label features constructed from the instances to partition the labels using a balanced 2-means label tree.

Methods for Extreme Multilabel Text Classification

We will go over some methods below, including the most representative methods in XMTC as well as some successful deep learning methods that were designed for multi-class text classification but can be applied to XMTC with minor tweaks.


FastXML is the current state-of-the-art tree-based XMTC method. At each node of the hierarchy, it learns a hierarchy of training instances and optimizes an NDCG-based objective. At each node, a hyperplane parameterized is induced, which divides the set of documents in the current node into two subsets and learns the ranking of the labels in each of the two subsets jointly. 

The key concept is to have each subset’s documents have a similar label distribution, which is then characterized using a set-specific ranked list of labels. This is accomplished by maximizing the NDCG scores of the ranked label lists in the two sibling subsets simultaneously. To improve the robustness of predictions, an ensemble of multiple induced trees is learned in practice.

Each test document is passed from the root to a leaf node in each induced tree at prediction time, and the label distributions in all the reached leaves are summed for the test document.


FastText is a simple yet effective deep learning method for classifying multi-class texts. A document representation is created by averaging the embeddings of the words in the document, and the document representation is then mapped to class labels using a softmax layer. This approach was inspired by recent work on efficient word representation learning, such as skip-gram and CBOW. 

When creating document representations, it ignores word order and uses a linear softmax classifier. FastText is very efficient to train while achieving state-of-the-art results on a variety of multi-class classification benchmarks, and it is often several orders of magnitude faster than competing methods.

However, simply averaging input word embeddings with the shallow architecture for document-to-label mapping may limit its success in XMTC, because document presentations in XMTC must capture much richer information in order to successfully predict multiple correlated labels and discriminate them from vast numbers of irrelevant labels.


CNN-Kim is one of the first attempts to use convolutional neural networks to classify text. CNN-Kim creates a document vector by concatenating its word embeddings, and then in the convolution layer, t filters are applied to this concatenated vector to produce t feature maps, which are then fed to a softmax time pooling layer to create a t-dimensional document representation.

Following this is a fully-connected layer with L softmax outputs corresponding to L labels. CNN-Kim has demonstrated excellent performance in multi-class text classification in practice, and it serves as a solid baseline in our comparative evaluations.


Bow-CNN (Bag-of-word CNN) is another effective multi-class classification method. It uses a bag-of-words indicator vector (called the one-hot vector) to represent each small text region (several consecutive words). A D-dimensional binary vector is constructed for each region, where the i-th entry is 1 if the i-th word in the vocabulary appears in that text region, where D denotes the size of the feature space (the vocabulary). 

All region embeddings are passed through a convolutional layer, followed by a special dynamic pooling layer that aggregates the embedded regions into a document representation, and is finally fed to a softmax output layer.


PD-Sparse is a new max-margin method for extreme multi-label classification that was recently proposed. It doesn’t fit into any of the first three categories (target-embedding methods, tree-based methods, and deep learning methods). 

In PD-Sparse, a linear classifier with l1 and l2 penalties on the weight matrix associated with each label is learned. This yields a solution that is extremely sparse in both the primal and dual spaces, which is advantageous in terms of XMTC time and memory efficiency.

PD-Sparse proposes a Fully-Corrective Block-Coordinate FrankWolfe training algorithm that takes advantage of sparsity in the solution and achieves sub-linear training time when compared to the number of primal and dual variables, while prediction time remains linear. On multi-label classification, PD-Sparse outperforms 1-vs-all SVM and logistic regression, with significantly less training time and model size.


The X-Bert (BERT for eXtreme Multi-label Text Classification) approach is partly inspired by information retrieval (IR), where the goal is to find relevant documents for a given query from a large set of documents. An IR engine typically performs searches in the following steps to handle a large number of documents. 

  • indexing: create a data structure that is efficient for indexing documents; 
  • matching: locate the document index to which this instance of the document belongs; 
  • Sort the documents in the retrieved index by ranking.

An XMC problem is linked to an IR problem in the following way: a large number of labels can be compared to the large number of documents indexed by a search engine, and the instance to be labelled can be compared to the query. Some existing approaches, such as HAXMLNet and Parabel, are closely related to the three-stage framework of IR due to its success for an extremely large number of targets.

X-BERT is part of a three-stage framework that includes the following stages:

  • Semantically indexing the labels,
  • Deep learning to match the label indices, 
  • Ranking the labels based on the retrieved indices, and taking an ensemble of different configurations from the previous steps.

Final Words

Through this post, we have discussed the Extreme Multi-label Text classification. In the beginning, we have seen how the multi-class and multi-label differ from each other and discussed what is the possible way to address XMTC. Later we discussed what different approaches are made by the community. And lastly, we discussed some of the popular techniques which are being used in practice to address this task. 


More Great AIM Stories

Vijaysinh Lendave
Vijaysinh is an enthusiast in machine learning and deep learning. He is skilled in ML algorithms, data manipulation, handling and visualization, model building.

Our Upcoming Events

Conference, in-person (Bangalore)
MachineCon 2022
24th Jun

Conference, Virtual
Deep Learning DevCon 2022
30th Jul

Conference, in-person (Bangalore)
Cypher 2022
21-23rd Sep

3 Ways to Join our Community

Discord Server

Stay Connected with a larger ecosystem of data science and ML Professionals

Telegram Channel

Discover special offers, top stories, upcoming events, and more.

Subscribe to our newsletter

Get the latest updates from AIM