Guide To Universal Sentence Encoder With TensorFlow

Universal Sentence Encoder models encode textual data into high-dimensional vectors that can be used for various NLP tasks. They were introduced in April 2018 by Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope and Ray Kurzweil, researchers at Google Research (research paper).

The encoders used in such models must capture the meaning of word sequences rather than just individual words. Beyond single words, the models are trained and optimized for greater-than-word-length text such as sentences, phrases or short paragraphs.

Major variants of universal sentence encoder 

There are two main variants of the encoder, both implemented in TensorFlow – one uses the transformer architecture while the other is a deep averaging network (DAN).

When fed variable-length English text, these models output a fixed-dimensional embedding of the input strings. They take a lowercased, PTB-tokenized string as input and output the sentence embedding as a 512-dimensional vector.

  1. Transformer-based model

This variant builds sentence embeddings using the encoding sub-graph of the transformer architecture. The sub-graph computes a context-aware representation of each word in the input sentence, taking into account both the identity and the ordering of all the other words. The element-wise sum of these representations across word positions is then computed and converted into a fixed-length sentence encoding vector.
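To make the pooling step concrete, here is a minimal, purely illustrative NumPy sketch (toy shapes, not the actual model code) of collapsing context-aware word representations into a fixed-length sentence vector by element-wise summation:

 import numpy as np

 #toy context-aware word representations for a 4-word sentence,
 #e.g. produced by the transformer's encoding sub-graph: (words, embedding_dim)
 word_reprs = np.random.rand(4, 512)

 #element-wise sum over word positions gives a fixed 512-dimensional
 #sentence encoding, regardless of sentence length
 sentence_encoding = word_reprs.sum(axis=0)
 print(sentence_encoding.shape)  #(512,)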

  2. Deep Averaging Network (DAN)

In the variant employing DAN, input embeddings for words and bi-grams are averaged and fed to a feedforward DNN (Deep Neural Network) resulting in sentence embeddings. It is found that such DANs perform quite well on text classification tasks.
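As an illustration only (assumed toy vocabulary and layer sizes, not the published USE architecture), a DAN-style encoder can be sketched in TensorFlow/Keras as an embedding average followed by a feedforward network:

 import tensorflow as tf

 vocab_size, embed_dim, sentence_dim = 10000, 128, 512  #assumed toy sizes

 token_ids = tf.keras.Input(shape=(None,), dtype=tf.int32)  #variable-length token ids
 embedded = tf.keras.layers.Embedding(vocab_size, embed_dim)(token_ids)
 averaged = tf.keras.layers.GlobalAveragePooling1D()(embedded)  #average the input embeddings
 hidden = tf.keras.layers.Dense(512, activation="relu")(averaged)  #feedforward DNN
 sentence_embedding = tf.keras.layers.Dense(sentence_dim)(hidden)  #fixed-length sentence embedding
 dan_sketch = tf.keras.Model(token_ids, sentence_embedding)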

Comparison of the variants

Using the output (sentence embeddings) of either variant for transfer learning gives better performance than several baselines that use no transfer learning or only word-level transfer learning.

However, when we compare the two variants there is a trade-off between the accuracy of the results and the computational resources required. The transformer-based model aims for high accuracy, but it needs considerably more compute and increases model complexity. Its memory usage and computation time grow rapidly with sentence length, whereas computation time grows only linearly with sentence length for the DAN-based model. In the research paper, the transformer model's time complexity is noted as O(n²) while that of the DAN model is O(n), where 'n' denotes the sentence length. The DAN variant aims at efficient inference at the cost of a slight reduction in accuracy.

Universal sentence encoder family

Several versions of universal sentence encoder models can be found on TensorFlow Hub. They differ from each other in terms of whether they are multilingual, which NLP tasks they are good at, and which metric they prioritise (size, performance, etc.).

Practical implementation

Here's a demonstration of using a DAN-based universal sentence encoder model for the sentence similarity task. The implementation has been coded in Google Colab using Python version 3.7.10. A step-wise explanation of the code is as follows:
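Note on dependencies: Google Colab generally ships with TensorFlow, TensorFlow Hub and seaborn pre-installed; if you run the notebook elsewhere, the libraries can be installed first (an unpinned install line, an assumption rather than part of the original tutorial):

 !pip install tensorflow tensorflow-hub seaborn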

  1. Import required libraries
 from absl import logging
 import tensorflow as tf
 import tensorflow_hub as hub
 import matplotlib.pyplot as plt
 import numpy as np
 import os
 import pandas as pd
 import re    #module for regular expression operations
 import seaborn as sns 
  2. Load the TF Hub module of the universal sentence encoder
url = "https://tfhub.dev/google/universal-sentence-encoder/4" #@param ["https://tfhub.dev/google/universal-sentence-encoder/4", "https://tfhub.dev/google/universal-sentence-encoder-large/5"]

In Colab, the #@param annotation renders a drop-down list that lets you switch between the two URLs (the DAN-based model and the larger transformer-based model).

model = hub.load(url) #Load the module from selected URL

  3. Define a function for computing the sentence embedding of an input string
 def embed(input):
   return model(input) 
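As a quick sanity check (assuming the module above has loaded), a list containing a single sentence should map to a 1 × 512 embedding tensor:

 print(embed(["Hello world"]).shape)  #expected: (1, 512)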
  4. Illustrate how sentence embedding is computed for a word, sentence and paragraph
 word = "Anaconda"
 sen = "Tiger is India's national animal."  #sentence
 #paragraph
 para = (             
     "Universal Sentence Encoder embeddings also support short paragraphs. "
     "There is no hard limit on how long the paragraph is. "
     )
 msgs = [word, sen, para] 
  5. Reduce logging output

logging.set_verbosity(logging.ERROR)

set_verbosity() method sets the threshold for what messages will be logged.

  6. Embed the defined word, sentence and paragraph using the embed() method defined in step (3).

 message_emb = embed(msgs)

  7. Compute and print sentence embeddings
 for i, embedding in enumerate(np.array(message_emb).tolist()):
   print("Msg: {}".format(msgs[i]))  #print the message
   #print size of the embedding
   print("Embedding size: {}".format(len(embedding)))
   #print a snippet of the embedding representation
   msg_emb_snippet = ", ".join(str(x) for x in embedding[:3])
   print("Embedding: [{}, ...]\n".format(msg_emb_snippet))

Output: each message is printed along with its embedding size (512) and the first three values of its embedding vector.

  8. Define a function to find semantic textual similarity between sentences
 def plot_similarity(labels, features, rotation):
   #compute the inner product of the encodings
   corr = np.inner(features, features)
   sns.set(font_scale=1.2)
   g = sns.heatmap(  #plot heatmap
       corr,  #computed inner product
       xticklabels=labels,  #label the axes with the input sentences
       yticklabels=labels,
       #vmin and vmax are the values that anchor the colormap
       vmin=0,
       vmax=1,
       cmap="YlOrRd")  #matplotlib colormap name (Yellow-Orange-Red)
   g.set_xticklabels(labels, rotation=rotation)
   g.set_title("Semantic Textual Similarity")
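The inner product serves as a similarity score here because the embeddings produced by the Universal Sentence Encoder are approximately normalised. If you prefer an explicit cosine similarity, a small alternative (a sketch, not part of the original tutorial) normalises the vectors first:

 def cosine_similarity_matrix(features):
   #convert to a NumPy array and normalise each embedding to unit length
   feats = np.asarray(features)
   normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
   return np.inner(normed, normed)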
  9. Define a function to feed the message embeddings for plotting the heatmap
 def run_and_plot(msgs):
   message_embeddings_ = embed(msgs)
   plot_similarity(msgs, message_embeddings_, 90)  #labels rotated by 90 degrees
  10. Define the input sentences
 messages = [
     # Smartphones
     "I like my phone",
     "My phone is not good.",
     "Your cellphone looks great.",
     # Weather
     "Will it snow tomorrow?",
     "Recently a lot of hurricanes have hit the US",
     "Global warming is real",
     # Food and health
     "An apple a day, keeps the doctors away",
     "Eating strawberries is healthy",
     "Is paleo better than keto?",
     # Asking about age
     "How old are you?",
     "what is your age?",
 ] 
  11. Pass the input messages to run_and_plot() defined in step (9)

run_and_plot(messages)

Output: a heatmap in which sentence pairs from the same topical group (smartphones, weather, food and health, age) show noticeably higher similarity scores than pairs from different groups.

References

To get an in-depth understanding of the universal sentence encoder, refer to the following sources:

Cer, Daniel, et al. "Universal Sentence Encoder" (2018) – the research paper introducing the models
Universal Sentence Encoder on TensorFlow Hub – https://tfhub.dev/google/universal-sentence-encoder/4

Nikita Shiledarbaxi

A zealous learner aspiring to advance in the domain of AI/ML. Eager to grasp emerging techniques to get insights from data and hence explore realistic Data Science applications as well.