Much of the internet is navigated through search engines: you enter what you are looking for, and the engine fetches matching results from its index. Search is one of the most basic features, found in almost all applications. But it becomes challenging when you have a large amount of data or documents and need fast results. This is where natural language processing (NLP) can help. With the development of new NLP models, quicker computation and more accurate results are possible. One such development is a library called txtai, which enables a smarter way to apply natural language processing to search.
In this article, we will look at the different applications of txtai and implement them in Python.
What is Txtai?
Txtai is an AI-powered search engine built on indexing over text sections. It is built using sentence transformers, Python and libraries like faiss and annoy. Txtai performs a similarity search between the sections of the text and the query typed in the search bar. Beyond this, it can also be used to build an interactive question-and-answer system. It has already been used in platforms such as:
- Paperai: to build AI-based indexing over scientific and medical papers
- Cord19q: an analysis of COVID-19
- Neuspo: a news and sports site
- Codequestion: allows you to ask questions about coding from your terminal.
Let us now understand how txtai works by implementing a few small projects.
Installing txtai
Since txtai is a Python library, you can easily install it with the pip command:
pip install txtai
Implementation of embedding instances
The basic entry point of txtai is the embeddings instance. The embedding method used here is a transformer model: each text section is tokenized and then converted into an embedding vector. Whenever you enter words in the search bar, txtai converts the query into a vector in the same way and fetches the sections whose vectors are most similar to it.
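To make the idea of comparing embedding vectors concrete, here is a minimal sketch in plain Python. It assumes a toy word-count vector as a stand-in for the transformer embeddings that txtai actually uses, and the helper names embed_text and cosine are hypothetical, not part of txtai. Because the toy only counts shared words, it cannot relate "climate change" to "global warming" the way a transformer model can; it only illustrates the "turn text into vectors, then compare" step.

```python
import math
from collections import Counter


def embed_text(text):
    """Toy embedding: a sparse word-count vector (stand-in for a transformer model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


sections = [
    "global warming and ice melts worries scientists",
    "flu symptoms are similar to corona virus",
    "expect thunderstorms in Bangalore today",
]

query = "flu symptoms"
# Score the query against every section and keep the index of the best one.
scores = [cosine(embed_text(query), embed_text(s)) for s in sections]
best = max(range(len(sections)), key=lambda i: scores[i])
print(sections[best])
```

The section that shares the most words with the query gets the highest score. A real embedding model replaces embed_text with dense vectors that capture meaning, while the comparison step stays conceptually the same.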
Let us implement a simple in-memory embedding instance to understand this concept better. I will now type in a few random sentences as shown below.
information = ["global warming and ice melts worries scientists", "flu symptoms are similar to corona virus", "dont wear masks for covid", "expect thunderstorms in Bangalore today"]
Next, we will use txtai to match these sentences to their respective categories. I will give the names of the categories in a jumbled order.
First, we will import the libraries and get the embeddings method.
import numpy as np
from txtai.embeddings import Embeddings

embed = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens"})
Now, we will use the similarity method to identify the similarities between the search word and the information given above.
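print("%-20s %s" % ("input", "output"))
for search_word in ("weather report", "fake news", "climate change", "health"):
    # Assumes a txtai release where similarity() returns a plain list of scores.
    # Newer releases may return (id, score) pairs; if so, use the first pair's id.
    idx = np.argmax(embed.similarity(search_word, information))
    print("%-20s %s" % (search_word, information[idx]))
As you can see, each search word is matched with the most relevant sentence, even though none of the search words appear in the sentences themselves.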
Testing the similarity method
Now, for the purpose of experimentation, I will add information that may belong to two categories and check the results.
information = ["global warming and ice melts worries scientists",
               "flu symptoms are similar to coronavirus",
               "expect thunderstorms in Bangalore today",
               "China, India is the largest populated countries in the world"]

print("%-20s %s" % ("input", "output"))
for search_word in ("weather report", "asia", "climate change", "health", "population"):
    idx = np.argmax(embed.similarity(search_word, information))
    print("%-20s %s" % (search_word, information[idx]))
As you can see, the outputs for "asia" and "population" are the same, since China and India are both in Asia and the sentence is about population. This suggests that the similarity search captures meaning rather than exact keywords, at least for these examples.
Embedding indexes
The method used above works, but it is not really practical at scale. For a large number of documents, it is not feasible to tokenize and embed every sentence each time a query is run. Instead, an embedding index is built once, so that the document vectors are pre-computed and only the query has to be embedded at search time.
Let us now implement this and check how it works. For simplicity, I will choose the same sentences used above.
Implementing embedding index
This method uses a function called index, which computes the embeddings for the information and stores them in an index. The index can be saved, loaded again later and searched with a keyword at any point.
# Build the index once: each entry is (id, text, tags)
embed.index([(idx, info, None) for idx, info in enumerate(information)])

for search_word in ("weather report", "asia", "climate change", "health", "population"):
    # search() returns (id, score) pairs; take the id of the top result
    idx = embed.search(search_word, 1)[0][0]
    print("%-20s %s" % (search_word, information[idx]))

# Save the index, then load it into a fresh instance
embed.save("index value")
embed = Embeddings()
embed.load("index value")

idx = embed.search("weather report", 1)[0][0]
print(information[idx])
As you can see, we have saved the index, loaded it again, and retrieved the information we needed from just the keyword.
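The idea behind a pre-computed index can be sketched without txtai. The following is only an illustration under stated assumptions: word-count vectors stand in for real embeddings, the index is a JSON file rather than txtai's own storage format, and build_index, search, embed_text and cosine are hypothetical helper names. The point it shows is that the documents are embedded once when the index is built, while a later search only embeds the query.

```python
import json
import math
import os
import tempfile
from collections import Counter


def embed_text(text):
    """Toy embedding: word counts (stand-in for a real embedding model)."""
    return dict(Counter(text.lower().split()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def build_index(sections):
    """Embed every section once, up front."""
    return [{"id": i, "vector": embed_text(s)} for i, s in enumerate(sections)]


def search(index, query, limit=1):
    """Embed only the query and rank the stored vectors against it."""
    q = embed_text(query)
    scored = [(entry["id"], cosine(q, entry["vector"])) for entry in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


sections = [
    "global warming and ice melts worries scientists",
    "flu symptoms are similar to coronavirus",
    "expect thunderstorms in Bangalore today",
]

index = build_index(sections)

# Persist the pre-computed vectors, then reload them as a later session would.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_index.json")
    with open(path, "w") as f:
        json.dump(index, f)
    with open(path) as f:
        loaded = json.load(f)

top_id, top_score = search(loaded, "thunderstorms today")[0]
print(sections[top_id])
```

Only the query is embedded at search time; the section vectors come straight from the saved file.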
Similarity search for millions of documents
We saw above that txtai works based on similarity search between the keyword and the information. But how does this work when there is a huge repository of information?
This is done using approximate nearest neighbour (ANN) search. Instead of comparing the query against every vector in the corpus, an ANN index organises the vectors so that only a small set of likely candidates has to be checked, trading a little accuracy for a large gain in speed.
To do this, txtai builds on libraries like annoy, hnswlib and faiss to handle large volumes of data as efficiently as possible.
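The trade-off behind approximate search can be illustrated with a small sketch. This is not the algorithm used by annoy, hnswlib or faiss; it is a simplified random-hyperplane bucketing scheme, chosen only because it fits in a few lines, and all names and sizes in it (DIM, PLANES, bucket_key and so on) are assumptions made for the example. Vectors are grouped into buckets when the index is built, and a query is compared only against the vectors in its own bucket.

```python
import math
import random

random.seed(0)
DIM, PLANES, N = 16, 4, 1000  # illustrative sizes, not txtai defaults


def random_vector(dim):
    return [random.gauss(0, 1) for _ in range(dim)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Random hyperplanes that split the vector space into 2**PLANES regions.
planes = [random_vector(DIM) for _ in range(PLANES)]


def bucket_key(vec):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(sum(p * v for p, v in zip(plane, vec)) >= 0 for plane in planes)


vectors = [random_vector(DIM) for _ in range(N)]

# Build the index once: group vector ids by bucket.
buckets = {}
for i, vec in enumerate(vectors):
    buckets.setdefault(bucket_key(vec), []).append(i)


def exact_search(query):
    """Brute force: compare the query against all N vectors."""
    return max(range(N), key=lambda i: cosine(query, vectors[i]))


def approximate_search(query):
    """Compare the query only against vectors that share its bucket."""
    candidates = buckets.get(bucket_key(query), [])
    if not candidates:
        return None, 0
    best = max(candidates, key=lambda i: cosine(query, vectors[i]))
    return best, len(candidates)


query = vectors[42]
best, compared = approximate_search(query)
print("approximate match:", best, "after", compared, "comparisons instead of", N)
```

A query that lies close to a bucket boundary can miss its true nearest neighbour, which is the accuracy cost of the speed-up. Production ANN libraries reduce that risk with more elaborate structures such as multiple trees or graphs.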
Conclusion
In this article, we learnt about a recent library called txtai and used it to implement an AI-powered search engine. Txtai also works with robust models from Hugging Face, such as BERT, to make searching more efficient and quick at a large scale.
You can find the complete notebook of the above implementation in AIM’s GitHub repositories.