Semantic text matching is the task of estimating the semantic similarity between a source and a target piece of text. It has applications in problems such as query-to-document matching, web search, question answering, conversational chatbots, and recommendation systems.
At the recent MLDS 2021, Shrutendra Harsola, senior data scientist at Intuit, and Naveen Kumar Kaveti, data scientist at Intuit, spoke about building machine learning systems for large-scale semantic text matching by decomposing the problem into candidate generation and re-ranking steps. Besides the basics of semantic text matching, the speakers also covered information retrieval (web search), question answering and recommendation systems, as well as how deep learning models like BERT (Bidirectional Encoder Representations from Transformers) can be used in each of these steps.
Kaveti kick-started the talk by introducing semantic text matching with the help of a few use cases from Intuit’s products, such as estimating the similarity between a user’s question and the FAQs in a help section. He said, “Semantic text matching is the task of estimating the semantic similarity between the source and the target text pieces.”
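The idea of scoring similarity between a source and a target text can be sketched with a toy example. Here a bag-of-words count vector stands in for the dense sentence embedding a real system would use, and the question and FAQ texts are illustrative, not from the talk:

```python
from collections import Counter
import math

def vectorize(text):
    """Toy stand-in for a sentence embedding: bag-of-words counts."""
    return Counter(text.lower().split())

def cosine_similarity(source, target):
    """Score in [0, 1]: higher means the two texts overlap more."""
    a, b = vectorize(source), vectorize(target)
    dot = sum(a[term] * b[term] for term in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# A user question matched against two FAQ entries (illustrative texts).
question = "how do i reset my password"
faqs = ["steps to reset your account password",
        "how to update billing details"]
scores = {faq: cosine_similarity(question, faq) for faq in faqs}
```

With real embeddings (e.g. from a BERT-style encoder) the same cosine scoring would capture semantic rather than purely lexical overlap.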
Kaveti gave one more instance of semantic text matching: finding news articles related to bank deposit rates. In this information retrieval setting, the retrieved output is known as the target text. In the example he showed, the first two candidate results were irrelevant to what the user had asked for.
The speaker talked about the various applications of semantic text matching:
- Web search, where the source text is a query and the target text is a document (web page)
- Sponsored search, where the source text is a query and the target text is a text ad
- Question answering, where the source text is a question and the target text is an answer
- Recommendation, where the source and target texts are both product titles
Talking about information retrieval (IR) in the case of news search, Kaveti said semantic text matching can be used to retrieve relevant documents. When a user searches for bank deposit rates, the query is matched against a document corpus containing thousands of news articles, and the matching documents are returned as a ranked result list. Based on the matched texts, the most relevant articles are sent back to the user.
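That retrieval flow can be sketched as a scoring loop over the corpus. The mini news corpus and the term-overlap scorer below are illustrative stand-ins for a real index and a semantic matcher:

```python
def score(query, document):
    # Stand-in relevance score: fraction of query terms found in the document.
    q_terms = set(query.lower().split())
    d_terms = set(document.lower().split())
    return len(q_terms & d_terms) / len(q_terms)

def search(query, corpus, top_k=3):
    # Score every document, then return the top-k as a ranked result list.
    ranked = sorted(corpus, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:top_k]

corpus = [
    "new smartphone launched this week",
    "banks raise deposit rates after policy review",
    "fixed deposit rates compared across major banks",
    "local sports team wins championship",
]
results = search("bank deposit rates", corpus, top_k=2)
```

Scoring every document like this is fine for a toy corpus, but it is exactly the exhaustive scan that the candidate-generation step described later is designed to avoid.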
Kaveti also shed light on how the ranking task can be reduced to binary classification: each (source, target) pair is labelled relevant or irrelevant, a classifier is trained on those labels, and targets are then ranked by the predicted relevance score. Harsola said the two-step approach for large-scale systems includes:
- Candidate Generation: Given a search query, the goal of the candidate generation step is to quickly retrieve hundreds of the most relevant documents from the huge document corpus.
- Re-ranking: the retrieved candidates are then re-scored and re-ordered using machine learning and deep learning techniques to produce the final ranked list.
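The two steps can be sketched end to end. The inverted index, the tiny corpus, and the length-normalised re-scoring function are illustrative stand-ins; a production system would use something like an approximate-nearest-neighbour index for retrieval and a learned model such as BERT for re-ranking:

```python
from collections import defaultdict

def build_inverted_index(corpus):
    # Map each term to the ids of the documents containing it.
    index = defaultdict(set)
    for doc_id, doc in enumerate(corpus):
        for term in doc.lower().split():
            index[term].add(doc_id)
    return index

def generate_candidates(query, index):
    # Step 1: cheaply fetch every document sharing at least one query term,
    # instead of scoring the whole corpus.
    candidates = set()
    for term in query.lower().split():
        candidates |= index.get(term, set())
    return candidates

def rerank(query, candidate_ids, corpus, top_k=2):
    # Step 2: apply a more expensive scorer to the small candidate set.
    # Stand-in scorer: query-term overlap normalised by document length;
    # a BERT-based model would slot in here in a real system.
    q_terms = set(query.lower().split())
    def score(doc_id):
        d_terms = set(corpus[doc_id].lower().split())
        return len(q_terms & d_terms) / len(d_terms)
    return sorted(candidate_ids, key=score, reverse=True)[:top_k]

corpus = [
    "central bank keeps interest rates unchanged",
    "bank deposit rates rise sharply",
    "football season kicks off today",
]
index = build_inverted_index(corpus)
candidates = generate_candidates("bank deposit rates", index)
top = rerank("bank deposit rates", candidates, corpus)
```

Note how the football article never reaches the re-ranker at all: candidate generation keeps the expensive second step cheap by shrinking the pool first.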
Harsola concluded the talk by explaining how efficient pre-trained language models like BERT can be used for re-ranking in large-scale systems. The duo also shared GitHub and Medium blog links for the code behind the above-mentioned use cases.