Guide to PyTerrier: A Python Framework for Information Retrieval

Information Retrieval (IR) is one of the key tasks in many natural language processing applications. It is the process of searching and collecting information from databases or resources based on queries or requirements. The fundamental elements of an IR system are the query and the document: the query expresses the user's information need, and the document is the resource that may contain the required information. An efficient IR system retrieves the required information accurately from the documents in a compute-efficient manner.

The popular Information Retrieval frameworks are mostly written in Java, Scala, C++ and C. Although they can be adapted to many languages, end-to-end evaluation of Python-based IR models with them is a tedious process that requires many configuration adjustments. Further, reproducing an IR workflow across different environments is practically impossible with the available frameworks.

Machine Learning, meanwhile, relies heavily on the high-level Python language; deep learning models are almost always built on one of two Python frameworks, TensorFlow or PyTorch. Yet, although most natural language processing applications are nowadays built on top of Python frameworks and libraries, there has been no widely adopted Python framework for Information Retrieval tasks. Hence the need for a Python-based IR framework that supports end-to-end experimentation with reproducible results and model comparisons.

PyTerrier & its Architecture

Craig Macdonald of the University of Glasgow and Nicola Tonellotto of the University of Pisa have introduced a Python framework, named PyTerrier, for Information Retrieval. The framework provides pipeline components as Python classes for IR tasks such as retrieval, learning-to-rank re-ranking, query rewriting, indexing, feature extraction and neural re-ranking. An end-to-end Information Retrieval system can easily be built from these pre-established pipeline components. Moreover, an IR architecture built this way can later be scaled or extended as requirements change.

A typical model comparison experiment for two different IR models (Source)

An experiment architecture for comparing two different Information Retrieval models has several key components, such as ranked retrieval, fusion, feature extraction, learning-to-rank (LTR) re-ranking and neural re-ranking. Such a workflow forms a directed acyclic graph (DAG) of complex operation sequences, and the PyTerrier framework helps express such a DAG as an end-to-end trainable pipeline, as the sketch below suggests.
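For instance, comparing two simple retrieval models takes only a few lines. The following is a minimal sketch using the Experiment abstraction demonstrated in the hands-on sections below; the index, topics and qrels variables are assumed to have been prepared as shown there.

 # hedged sketch: compare two retrieval models on the same topics and qrels
 tf_idf = pt.BatchRetrieve(index, controls={"wmodel": "TF_IDF"})
 bm25 = pt.BatchRetrieve(index, controls={"wmodel": "BM25"})
 # run both pipelines over the topics and evaluate against the ground truth
 pt.pipelines.Experiment([tf_idf, bm25], topics, qrels, ["map"], names=["TF-IDF", "BM25"])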

PyTerrier & its Key Objects

PyTerrier is a declarative framework built around two kinds of key objects: transformers and operators. A transformer is an object that maps an input set of queries (and, optionally, documents) to an output set of results; operators combine transformers into larger pipelines.

The Transformer classes of PyTerrier. Q and R represent a set of queries and a set of retrieved results, respectively. An element provided in parentheses is optional (Source).

The basic retrieval process, for example, can be expressed in PyTerrier with a short Python code template, sketched below.
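A minimal sketch of this template, assuming an index has already been prepared (as in the hands-on section below) and using BM25 as an illustrative weighting model:

 # a transformer mapping a set of queries Q to ranked results R'
 br = pt.BatchRetrieve(index, controls={"wmodel": "BM25"})
 R_prime = br.transform(Q)  # Q: a DataFrame with 'qid' and 'query' columns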

Here, Q is the input set of queries and R’ is the retrieved set of results. Thus, a complex IR task can be performed with a few lines of Python code. PyTerrier also overloads the conventional math operators, allowing transformers to be combined into custom IR pipelines.

The PyTerrier operators employed under the operator overloading strategy (Source).
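To give a flavour of these operators, here is a hedged sketch of how transformers can be composed (the variable names are illustrative, not part of the library):

 bm25 = pt.BatchRetrieve(index, controls={"wmodel": "BM25"})
 pl2 = pt.BatchRetrieve(index, controls={"wmodel": "PL2"})
 top_reranked = (bm25 % 100) >> pl2    # take BM25's top 100, then re-rank with PL2
 combined = 0.75 * bm25 + 0.25 * pl2   # linear combination of the two scores
 features = bm25 >> (bm25 ** pl2)      # feature union, e.g. for learning-to-rank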

So far, the newly introduced PyTerrier framework has been instantiated over two established open-source IR platforms: Terrier and Anserini. Implementations over more platforms can be expected in the future.

Hands-on Retrieval and Evaluation

PyTerrier is available as a PyPI package, so we can simply pip install it.

!pip install python-terrier

Import the library and initialize it.

 import pyterrier as pt
 if not pt.started():
   pt.init() 

Use one of the in-built datasets to perform the retrieval process and extract its index.

 # load the Vaswani test collection and its pre-built index
 vaswani_dataset = pt.datasets.get_dataset("vaswani")
 indexref = vaswani_dataset.get_index()
 index = pt.IndexFactory.of(indexref)
 print(index.getCollectionStatistics().toString())

Output:

Extract the dataset's queries, referred to as topics.

 topics = vaswani_dataset.get_topics()
 topics.head(5) 

Output:

Perform retrieval easily using a few commands as shown below.

 # three ways of setting the TF_IDF weighting model
 # (the constructor argument alone would suffice)
 retr = pt.BatchRetrieve(index, controls={"wmodel": "TF_IDF"})
 retr.setControl("wmodel", "TF_IDF")
 retr.setControls({"wmodel": "TF_IDF"})
 res = retr.transform(topics)
 res

Output:


It can be observed that the documents are retrieved and ranked. Further, the results can be saved to disk using the write_results function available in the pt.io module of the PyTerrier framework.

pt.io.write_results(res, "result1.res")
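If needed, a saved run can be loaded back into a DataFrame with the companion read_results function of the same module (a small usage sketch for the file written above):

 res_loaded = pt.io.read_results("result1.res")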

Now, evaluation is performed by comparing the results against the built-in ground truth. Get the ground-truth relevance judgements (qrels).

qrels = vaswani_dataset.get_qrels()

Output:

Evaluate the query results.

 eval = pt.Utils.evaluate(res, qrels)
 eval

Output:

Evaluation results can also be obtained per query. Here, the evaluation is based on the 'map' (mean average precision) metric, reported for each query individually.

 eval = pt.Utils.evaluate(res, qrels, metrics=["map"], perquery=True)
 eval

A portion of the output:

Find the Notebook with these code implementations here.

Hands-on Learn-To-Rank

Create the environment by importing the necessary libraries and initializing the PyTerrier framework.

 import numpy as np
 import pandas as pd
 import pyterrier as pt
 if not pt.started():
   pt.init() 

Download an in-built dataset along with its index, topics (queries) and qrels (ground-truth relevance judgements).

 dataset = pt.datasets.get_dataset("vaswani")
 indexref = dataset.get_index()
 topics = dataset.get_topics()
 qrels = dataset.get_qrels() 

In this example, the standard BM25 model produces the initial ranking of documents for each query, and the traditional TF-IDF and PL2 models are then used to re-rank the results.

 # this ranker will make the candidate set of documents for each query
 BM25 = pt.BatchRetrieve(indexref, controls={"wmodel": "BM25"})
 # these rankers will be used to re-rank the BM25 results
 TF_IDF = pt.BatchRetrieve(indexref, controls={"wmodel": "TF_IDF"})
 PL2 = pt.BatchRetrieve(indexref, controls={"wmodel": "PL2"})

Create a PyTerrier pipeline to perform the task described above and run a query; here, >> pipes BM25's candidate documents into the re-rankers, while ** combines the TF-IDF and PL2 scores as features.

 pipe = BM25 >> (TF_IDF ** PL2)
 pipe.transform("chemical end:2") 

Output:


In the above output, the 'score' column holds the ranking score of the BM25 model, and the 'features' column holds the re-ranking scores of the TF-IDF and PL2 models. However, ranking in a first step and re-ranking in two further steps consumes more time. To tackle this issue, PyTerrier provides FeaturesBatchRetrieve, which computes the ranking and the additional features in a single pass. Let's implement it for more efficient processing.

 fbr = pt.FeaturesBatchRetrieve(indexref, controls={"wmodel": "BM25"}, features=["WMODEL:TF_IDF", "WMODEL:PL2"])
 # keep only the top 2 results per query
 (fbr % 2).search("chemical")

Output:


PyTerrier also has a pipeline method, compile(), which automatically rewrites a pipeline into an optimized equivalent form. This approach yields the same results as above in around the same compute time. An example implementation is as follows:

 pipe_fast = pipe.compile()
 (pipe_fast % 2).search("chemical")

Output:

After performing ranking and re-ranking, a machine learning model can be trained to Learn-to-Rank (LTR). Split the available topics into training, validation and test sets.

# 60% of the topics for training, 20% for validation, 20% for testing
train_topics, valid_topics, test_topics = np.split(topics, [int(.6*len(topics)), int(.8*len(topics))])

Build a Random Forest model to perform the LTR and obtain the results.

 from sklearn.ensemble import RandomForestRegressor
 BaselineLTR = fbr >> pt.pipelines.LTR_pipeline(RandomForestRegressor(n_estimators=400))
 BaselineLTR.fit(train_topics, qrels)
 resultsRF = pt.pipelines.Experiment([PL2, BaselineLTR], test_topics, qrels, ["map"], names=["PL2 Baseline", "LTR Baseline"])
 resultsRF 

Output:

Build an XGBoost model to perform the LTR and obtain the results.

 import xgboost as xgb
 params = {'objective': 'rank:ndcg', 
           'learning_rate': 0.1, 
           'gamma': 1.0, 'min_child_weight': 0.1,
           'max_depth': 6,
           'verbose': 2,
           'random_state': 42 
          }
 BaseLTR_LM = fbr >> pt.pipelines.XGBoostLTR_pipeline(xgb.sklearn.XGBRanker(**params))
 BaseLTR_LM.fit(train_topics, qrels, valid_topics, qrels)
 resultsLM = pt.pipelines.Experiment([PL2, BaseLTR_LM],
                                     test_topics, qrels, ["map"],
                                     names=["PL2 Baseline", "LambdaMART"])
 resultsLM

Output:

Find the Notebook with these code implementations here.

Wrapping up

We discussed the newly introduced PyTerrier framework, its architecture and its application to Information Retrieval tasks. We learnt how to use the framework through two hands-on examples: a simple query-retrieval application and a Learn-to-Rank machine learning model. PyTerrier ships with numerous algorithms and built-in datasets, supporting almost any Information Retrieval task with minimal effort. The framework is built in Python with a chief focus on simplicity, efficiency and reproducibility.

Further reading:

Research paper

Github repository

Indexing with PyTerrier

Index API of PyTerrier
