Guide To Real-Time Semantic Search
The explosive growth of data volume and complexity has created an increasing need for semantic queries. A semantic query can be interpreted as correlation-aware retrieval that may return approximate rather than exact results. The true value of data depends heavily on how efficiently semantic search can be carried out on it in real time; without such search, much of that value is lost as the data deteriorates.

The word semantic is a linguistic term: it means something related to meaning in language or logic. In natural language, semantic analysis relates the structure and occurrence of words, phrases, clauses and paragraphs to one another in order to understand what is written in a text. The challenge we face in a technologically advanced world is to make machines understand language and logic the way human beings do.

Semantic matching works on rules defined for the system; these rules mirror the way we think about language, and we ask the machines to imitate them. For example, 'This is the white car' is a simple sentence, so we humans easily understand it as: a car whose colour is white. A machine, however, does not understand it the way we do. In linguistics, such a sentence has a well-defined structure, Subject-Predicate-Object (S-P-O for short), where 'car' is the subject, 'is' is the predicate and 'white' is the object.
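As a rough standalone illustration (a hypothetical sketch, not part of the tutorial code below), such a sentence can be represented programmatically as an S-P-O triple:

 from collections import namedtuple

 # A minimal, hypothetical S-P-O representation of 'This is the white car'.
 Triple = namedtuple('Triple', ['subject', 'predicate', 'object'])
 triple = Triple(subject='car', predicate='is', object='white')
 print(triple)  # Triple(subject='car', predicate='is', object='white')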

When dealing with large textual data, it is impossible to perform semantic matching by scanning the whole textual input manually or with traditional exact-match NLP techniques. We therefore discuss an approximate similarity matching algorithm, which lets us trade a small amount of accuracy for much faster nearest-neighbour matching.

This article discusses one such approach: serving semantic search results from an Approximate Nearest Neighbor (ANN) index built over extracted embeddings. The code below is adapted from the official TensorFlow Hub implementation.

Code implementation of Semantic Search

Apache Beam is used to generate the embeddings from a TensorFlow Hub model, and the ANNOY library is used to build the nearest neighbour index. Annoy (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings that searches for points in space close to a given query. It creates large, read-only, file-based data structures that are mapped into memory, so many processes can share the same data.
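To see Annoy's core API in isolation before wiring it into the pipeline, here is a minimal standalone sketch (the dimensionality and file name are arbitrary) that indexes random vectors and queries them:

 from annoy import AnnoyIndex
 import random

 f = 10  # vector dimensionality
 t = AnnoyIndex(f, 'angular')  # angular distance ~ cosine similarity
 for i in range(100):
   t.add_item(i, [random.gauss(0, 1) for _ in range(f)])
 t.build(10)          # 10 trees; more trees give better accuracy
 t.save('demo.ann')   # the saved file is memory-mapped, so many processes can share it
 u = AnnoyIndex(f, 'angular')
 u.load('demo.ann')
 print(u.get_nns_by_item(0, 5))  # the 5 nearest neighbours of item 0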

Install & import all dependencies:
 !pip install apache_beam
 !pip install 'scikit_learn~=0.23.0'  # For gaussian_random_matrix.
 !pip install annoy 
 import os
 import sys
 import pickle
 from collections import namedtuple
 from datetime import datetime
 import numpy as np
 import apache_beam as beam
 from apache_beam.transforms import util
 import tensorflow as tf
 import tensorflow_hub as hub
 import annoy
 from sklearn.random_projection import gaussian_random_matrix 
Prepare the data:

The A Million News Headlines dataset is used here; it contains news headlines published over fifteen years, sourced from the Australian Broadcasting Corporation, and focuses on historical records of noteworthy events from 2003 to 2017. The dataset has two columns, the publish date and the headline text; the first column is removed later, as it is not needed.

!wget 'https://dataverse.harvard.edu/api/access/datafile/3450625?format=tab&gbrecs=true' -O raw.tsv

 !wc -l raw.tsv
 !rm -r corpus
 !mkdir corpus
 with open('corpus/text.txt', 'w') as out_file:
   with open('raw.tsv', 'r') as in_file:
     for line in in_file:
       # Keep only the headline text (second column) and drop the publish date.
       headline = line.split('\t')[1].strip().strip('"')
       out_file.write(headline + "\n")
 !tail corpus/text.txt

Output:

Generate Embeddings for the data:

A Neural Network Language Model (NNLM) from TensorFlow Hub is used to generate an embedding for each headline; these embeddings are later used to compute sentence-level semantic similarity.
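As a quick standalone sanity check (the two example headlines are made up), the NNLM module maps each sentence to a 128-dimensional vector, so semantically similar headlines end up close together in that space:

 nnlm = hub.load('https://tfhub.dev/google/nnlm-en-dim128/2')
 vectors = nnlm(['council approves new bridge', 'city okays bridge project'])
 print(vectors.shape)  # (2, 128)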

Embedding Extraction method:
 embed = None
 def generate_embeddings(text, model_url, random_projection_matrix=None):
   # Lazily load the TF-Hub model once and reuse it across batches.
   global embed
   if embed is None:
     embed = hub.load(model_url)
   embedding = embed(text).numpy()
   # Optionally reduce the dimensionality with a random projection.
   if random_projection_matrix is not None:
     embedding = embedding.dot(random_projection_matrix)
   return text, embedding
 def to_tf_example(entries):
   examples = []
   text_list, embedding_list = entries
   for i in range(len(text_list)):
     text = text_list[i]
     embedding = embedding_list[i]
     features = {
         'text': tf.train.Feature(
             bytes_list=tf.train.BytesList(value=[text.encode('utf-8')])),
         'embedding': tf.train.Feature(
             float_list=tf.train.FloatList(value=embedding.tolist()))
     }
     example = tf.train.Example(
         features=tf.train.Features(
             feature=features)).SerializeToString(deterministic=True)
     examples.append(example)
   return examples
Beam Pipeline:
 def run_hub2emb(args):
   '''Runs the embedding-generation pipeline.'''
   options = beam.options.pipeline_options.PipelineOptions(**args)
   args = namedtuple("options", args.keys())(*args.values())
   with beam.Pipeline(args.runner, options=options) as pipeline:
     (pipeline
      | 'Read sentences from files' >> beam.io.ReadFromText(
          file_pattern=args.data_dir)
      | 'Batch elements' >> util.BatchElements(
          min_batch_size=args.batch_size, max_batch_size=args.batch_size)
      | 'Generate embeddings' >> beam.Map(
          generate_embeddings, args.model_url, args.random_projection_matrix)
      | 'Encode to tf example' >> beam.FlatMap(to_tf_example)
      | 'Write to TFRecords files' >> beam.io.WriteToTFRecord(
          file_path_prefix='{}/emb'.format(args.output_dir),
          file_name_suffix='.tfrecords'))
Generate Random projection weight matrix:
 def generate_random_projection_weights(original_dim, projected_dim):
   random_projection_matrix = gaussian_random_matrix(
       n_components=projected_dim, n_features=original_dim).T
   print("A Gaussian random weight matrix was created with shape {}".format(
       random_projection_matrix.shape))
   print('Storing random projection matrix to disk...')
   with open('random_projection_matrix', 'wb') as handle:
     pickle.dump(random_projection_matrix,
                 handle, protocol=pickle.HIGHEST_PROTOCOL)
   return random_projection_matrix
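Random projection shrinks the embeddings while approximately preserving pairwise distances (the Johnson-Lindenstrauss lemma), which makes the ANN index smaller and faster to build. A quick standalone shape check, assuming 128-dimensional embeddings projected down to 64 and the imports from the setup above:

 x = np.random.rand(2, 128)
 proj = gaussian_random_matrix(n_components=64, n_features=128).T
 print(x.dot(proj).shape)  # (2, 64): same rows, reduced dimensionality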
Set the parameters:
 model_url = 'https://tfhub.dev/google/nnlm-en-dim128/2' 
 projected_dim = 64 
Run the pipeline:
 import tempfile
 dir_output = tempfile.mkdtemp()
 dim_original = hub.load(model_url)(['']).shape[1]
 random_projection_matrix = None
 if projected_dim:
   random_projection_matrix = generate_random_projection_weights(
       dim_original, projected_dim)
 args = {
     'job_name': 'hub2emb-{}'.format(datetime.utcnow().strftime('%y%m%d-%H%M%S')),
     'runner': 'DirectRunner',
     'batch_size': 1024,
     'data_dir': 'corpus/*.txt',
     'output_dir': dir_output,
     'model_url': model_url,
     'random_projection_matrix': random_projection_matrix,}
 print("Pipeline args are set.")
 args 
Build the ANN index for Embeddings: 
 def build_index(embedding_files_pattern, index_filename, vector_length, 
     metric='angular', num_trees=100):
   annoy_index = annoy.AnnoyIndex(vector_length, metric=metric)
   mapping = {}
   embed_files = tf.io.gfile.glob(embedding_files_pattern)
   num_files = len(embed_files)
   print('Found {} embedding file(s).'.format(num_files))
   item_counter = 0
   for i, embed_file in enumerate(embed_files):
     print('Loading embeddings in file {} of {}...'.format(i+1, num_files))
     dataset = tf.data.TFRecordDataset(embed_file)
     for record in dataset.map(_parse_example):
       text = record['text'].numpy().decode("utf-8")
       embedding = record['embedding'].numpy()
       mapping[item_counter] = text
       annoy_index.add_item(item_counter, embedding)
       item_counter += 1
       if item_counter % 100000 == 0:
         print('{} items loaded to the index'.format(item_counter))
   print('A total of {} items added to the index'.format(item_counter))
   print('Building the index with {} trees...'.format(num_trees))
   annoy_index.build(n_trees=num_trees)
   print('Index is successfully built.')
   print('Saving index to disk...')
   annoy_index.save(index_filename)
   print('Index is saved to disk.')
   print("Index file size: {} GB".format(
     round(os.path.getsize(index_filename) / float(1024 ** 3), 2)))
   annoy_index.unload()
   print('Saving mapping to disk...')
   with open(index_filename + '.mapping', 'wb') as handle:
     pickle.dump(mapping, handle, protocol=pickle.HIGHEST_PROTOCOL)
   print('Mapping is saved to disk.')
   print("Mapping file size: {} MB".format(
     round(os.path.getsize(index_filename + '.mapping') / float(1024 ** 2), 2)))
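Note that build_index relies on a _parse_example helper that is not shown above. A minimal version, assuming the feature schema written by to_tf_example, together with an invocation whose file pattern and index file name are assumptions matching the pipeline's output settings, might look like this:

 embedding_dimension = projected_dim or dim_original
 feature_description = {
     'text': tf.io.FixedLenFeature([], tf.string),
     'embedding': tf.io.FixedLenFeature([embedding_dimension], tf.float32)
 }
 def _parse_example(example_proto):
   # Parse a serialized tf.Example using the schema written by to_tf_example.
   return tf.io.parse_single_example(example_proto, feature_description)

 embedding_files = '{}/emb-*.tfrecords'.format(dir_output)
 index_filename = 'index'
 %time build_index(embedding_files, index_filename, embedding_dimension)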

That is all; we can now use the ANN index to find news headlines that are semantically similar to an input query.

Load the index and mapping file: 
 index = annoy.AnnoyIndex(embedding_dimension, metric='angular')  # must match the build metric
 index.load(index_filename, prefault=True)
 print('Annoy index is loaded.')
 with open(index_filename + '.mapping', 'rb') as handle:
   mapping = pickle.load(handle)
 print('Mapping file is loaded.')
Similarity matching method:
 def find_similar_items(embeddings, num_matches=5):
   '''Finds items similar to a given embedding in the ANN index.'''
   ids = index.get_nns_by_vector(
       embeddings, num_matches, search_k=-1, include_distances=False)
   items = [mapping[i] for i in ids]
   return items
Extract embedding from given query: 
 print("TF-Hub model...")
 %time embed_fn = hub.load(model_url)
 print("TF-Hub is loaded.")
 random_projection_matrix = None
 if os.path.exists('random_projection_matrix'):
   print("Loading random projection matrix...")
   with open('random_projection_matrix', 'rb') as handle:
     random_projection_matrix = pickle.load(handle)
   print('random projection matrix is loaded.')
 def extract_embeddings(query):
   '''Generates the embedding for a query string.'''
   query_embedding = embed_fn([query])[0].numpy()
   if random_projection_matrix is not None:
     query_embedding = query_embedding.dot(random_projection_matrix)
   return query_embedding
Now the test part:

Enter any query you want in the query variable defined below; semantic matching will be carried out on it, and the top ten relevant headlines will be shown.

 query = "engineering" 
 print("Generating embedding...")
 %time query_embedding = extract_embeddings(query)
 print("relevant items in the index...")
 %time items = find_similar_items(query_embedding, 10)
 print("Results:")
 print("=========")
 for item in items:
   print(item) 

Output:

Conclusion

In this article, we have seen how semantic search can be very effective for automation-related tasks. The results match the query closely: we queried 'engineering', and all of the returned headlines contain engineering references.

Vijaysinh Lendave

Vijaysinh is an enthusiast in machine learning and deep learning. He is skilled in ML algorithms, data manipulation, handling and visualization, model building.