Complete Tutorial On Txtai: An AI-Powered Search Engine

In this article, we will see the different applications of the txtai and implement them in Python.

The internet is built on search: you type in what you want, and a search engine fetches it for you from a database. Search is among the most basic features of almost every application, but it becomes challenging when you have a large volume of data or documents and need fast results. This is where natural language processing (NLP) can help. With the development of new NLP models, faster computation and more accurate results are possible. One such development is a library called txtai, which enables a smarter way to apply NLP to search.


What is Txtai?

Txtai is an AI-powered search engine built around indexing over sections of text. It is written in Python and built on sentence-transformers and approximate nearest-neighbour libraries such as Faiss and Annoy. Txtai performs a similarity search between the sections of text and the query typed into the search bar. Beyond search, it can also be used to build interactive question-answering applications. It has already been used to build platforms such as:


  1. Paperai: AI-based indexing over science and medical papers
  2. Cord19q: analysis of COVID-19 research
  3. Neuspo: a news and sports site
  4. Codequestion: lets you ask coding questions from your terminal

Let us now understand how the txtai works by implementing a few small projects. 

Installing txtai

Since txtai is developed in Python, you can easily install it with pip:


pip install txtai

Implementation of embedding instances

The basic entry point and core feature of txtai is the embeddings instance. It uses transformer models to tokenize text and convert text sections into embedding vectors. Whenever you enter words in the search bar, txtai tokenizes and embeds the query in the same way and fetches the most similar information for you, without using much memory.

Let us implement a simple in-memory embedding instance to understand this concept better. I will now type in a few random sentences as shown below. 

information = ["global warming and ice melts worries scientists",
               "flu symptoms are similar to corona virus",
               "dont wear masks for covid",
               "expect thunderstorms in Bangalore today"]

Next, we will use txtai to match these sentences with their respective categories. I will give the category names in a jumbled order.

First, we will import the libraries and get the embeddings method.

import numpy as np
from txtai.embeddings import Embeddings
embed = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens"})

Now, we will use the similarity method to identify the similarities between the search word and the information given above. 

print("%-20s %s" % ("input", "output"))
for search_word in ("weather report", "fake news","climate change","health"):
    idx = np.argmax(embed.similarity(search_word, information))
    print("%-20s %s" % (search_word, information[idx]))

As you can see, each keyword has been correctly matched with the corresponding sentence.

Testing the similarity method

Now, as an experiment, I will add a sentence that may belong to two categories and check the results.

information = ["global warming and ice melts worries scientists",
               "flu symptoms are similar to coronavirus",
               "expect thunderstorms in Bangalore today",
               "China and India are the most populated countries in the world"]
print("%-20s %s" % ("input", "output"))
for search_word in ("weather report","asia","climate change","health","population"):
    idx = np.argmax(embed.similarity(search_word, information))
    print("%-20s %s" % (search_word, information[idx]))

As you can see, the outputs for "asia" and "population" are the same sentence, since China and India are in Asia. This shows that the embedding-based similarity works well across categories.

Embedding indexes

The method used above works, but it is not really practical at scale. For a large number of documents, it is not feasible to re-embed and compare every sentence on each query. Instead, embedding indexes are created, which store pre-computed embedding values that can be searched directly.

Let us now implement this and check how it works. For simplicity, I will choose the same sentences used above. 

Implementing embedding index

This method uses a function called index, which builds an index over the embeddings of the information and stores the index values. The index can be saved to disk and loaded and queried by keyword at any point.

# Build the index over the information sentences
embed.index([(idx, info, None) for idx, info in enumerate(information)])

for search_word in ("weather report", "asia", "climate change", "health", "population"):
    idx = + 1)[0][0]
    print("%-20s %s" % (search_word, information[idx]))

# Save the index, then reload it into a fresh Embeddings instance"index value")
embed = Embeddings()
embed.load("index value")
idx ="weather report", 1)[0][0]
print(information[idx])

As you can see, we have saved the index, reloaded it, and retrieved the information we needed from a keyword alone.

Similarity search for millions of documents

We saw above that txtai works based on similarity search between the keyword and the information. But, how does this happen when there is a huge repository of information?

This is done using approximate nearest neighbour (ANN) search. Rather than comparing the query against every vector in the corpus, ANN algorithms narrow the search to a small set of likely candidates, which keeps similarity queries fast even over very large collections.

Along with this, txtai incorporates libraries like Annoy, Hnswlib and Faiss to handle large volumes of data as efficiently as possible.


In this article, we learnt about a recent library called txtai and implemented an AI-powered search engine with it. At a larger scale, txtai can also use robust transformer models, such as BERT models from Hugging Face, to make searching quicker and more efficient.
You can find the complete notebook of the above implementation in AIM’s GitHub repositories. Please visit this link to find this notebook.

Bhoomika Madhukar
I am an aspiring data scientist with a passion for teaching. I am a computer science graduate from Dayananda Sagar Institute. I have experience in building models in deep learning and reinforcement learning. My goal is to use AI in the field of education to make learning meaningful for everyone.
