Universal Sentence Encoder models encode textual data into high-dimensional vectors that can be used for various NLP tasks. The model was introduced in April 2018 in the research paper "Universal Sentence Encoder" by Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope and Ray Kurzweil, researchers at Google Research.
The encoders used in such models must capture the meaning of word sequences rather than of individual words alone. Beyond single words, the models are trained and optimized for greater-than-word-length text such as phrases, sentences and paragraphs.
Major variants of universal sentence encoder
There are two main variants of the model's encoder, both implemented in TensorFlow – one uses the transformer architecture, while the other is a deep averaging network (DAN).
When fed variable-length English text, these models output a fixed-dimensional embedding representation of the input strings. They take a lowercased, PTB-tokenized string as input and output the sentence embedding as a 512-dimensional vector.
- Transformer-based model
This variant builds sentence embeddings using the encoding sub-graph of the transformer architecture. The sub-graph computes context-aware representations of the words in the input sentence, taking into account both the identity and the order of all the other words. The element-wise sum of the representations at each word position is then converted into a fixed-length sentence encoding vector.
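The pooling step above can be sketched with toy data. The context-aware word representations below are random placeholders standing in for the transformer sub-graph's outputs; the division by the square root of the sentence length (used in the paper to tame length effects) is the only modelling detail assumed here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the context-aware word representations that the
# transformer encoding sub-graph would produce: one 512-d vector per word.
n_words, dim = 6, 512
word_reprs = rng.standard_normal((n_words, dim))

# Element-wise sum across word positions gives a fixed-length vector;
# the paper scales it by 1/sqrt(sentence length) to tame length effects.
sentence_encoding = word_reprs.sum(axis=0) / np.sqrt(n_words)

print(sentence_encoding.shape)  # (512,) regardless of sentence length
```

However many words the sentence has, the result is always a single 512-dimensional vector.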
- Deep Averaging Network (DAN)
In the DAN variant, the input embeddings for words and bi-grams are first averaged together and then fed to a feedforward DNN (deep neural network), which produces the sentence embeddings. Such DANs have been found to perform quite well on text classification tasks.
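A minimal sketch of the DAN idea, using NumPy: the vocabulary, embedding table and weight matrices below are hypothetical placeholders, not the model's actual parameters, and the two tanh layers merely illustrate "average, then feed forward".

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embedding table mapping each unigram/bigram to a 512-d vector.
vocab = {"the": 0, "tiger": 1, "roars": 2, "the tiger": 3, "tiger roars": 4}
embedding_table = rng.standard_normal((len(vocab), 512))

def dan_encode(tokens, weights1, weights2):
    """Deep Averaging Network sketch: average the input embeddings,
    then pass the average through a small feedforward network."""
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    ids = [vocab[t] for t in tokens + bigrams if t in vocab]
    avg = embedding_table[ids].mean(axis=0)  # average word/bigram embeddings
    hidden = np.tanh(avg @ weights1)         # feedforward layer 1
    return np.tanh(hidden @ weights2)        # layer 2 -> sentence embedding

w1 = rng.standard_normal((512, 512)) * 0.05
w2 = rng.standard_normal((512, 512)) * 0.05
emb = dan_encode(["the", "tiger", "roars"], w1, w2)
print(emb.shape)  # (512,)
```

The averaging step is what makes the DAN cheap: the cost of encoding grows only with the number of tokens, not with their pairwise interactions.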
Comparison of the variants
Using the output (sentence embeddings) of either variant for transfer learning gives better results than several baselines that use no transfer learning or only word-level transfer learning.
However, comparing the two variants reveals a trade-off between the accuracy of the results and the computational resources required. The transformer-based model targets high accuracy, but at the cost of greater model complexity and heavier resource consumption: its memory usage and computation time grow steeply with sentence length. By contrast, computation time for the DAN-based model increases only linearly with sentence length. In the research paper, the transformer model's time complexity is noted as O(n²), while that of the DAN model is O(n), where 'n' denotes the sentence length. The DAN variant thus targets efficient inference at the cost of a slight reduction in accuracy.
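The difference in growth rates can be made concrete with a back-of-the-envelope operation count (the unit costs below are arbitrary; only the scaling behaviour matters):

```python
# Self-attention in the transformer compares every word with every other
# word (O(n^2) interactions), while averaging in the DAN touches each word
# once (O(n)). Doubling the sentence length quadruples the former but only
# doubles the latter.
for n in (8, 16, 32, 64):
    attention_ops = n * n  # pairwise word-to-word interactions
    averaging_ops = n      # one pass over the words
    print(n, attention_ops, averaging_ops)
```

At n = 64 the toy attention count is already 64 times the averaging count, which is why the transformer variant becomes expensive for long inputs.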
Universal sentence encoder family
Several versions of Universal Sentence Encoder models are available on TensorFlow Hub. They differ in whether they are multilingual, which NLP tasks they are good at, and which metric they prioritise (model size, performance, etc.).
Practical implementation
Here’s a demonstration of using a DAN-based Universal Sentence Encoder model for the sentence-similarity task. The implementation was coded in Google Colab using Python 3.7.10. A step-wise explanation of the code follows:
- Import required libraries
```python
from absl import logging
import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re  # module for regular expression operations
import seaborn as sns
```
- Load the TF Hub module of the universal sentence encoder
```python
url = "https://tfhub.dev/google/universal-sentence-encoder/4"  #@param ["https://tfhub.dev/google/universal-sentence-encoder/4", "https://tfhub.dev/google/universal-sentence-encoder-large/5"]
```
In Colab, the #@param annotation renders a drop-down list that lets you switch between the two URLs.
```python
model = hub.load(url)  # load the module from the selected URL
```
- Define a function for computing sentence embedding of input string
```python
def embed(input):
    return model(input)
```
- Illustrate how sentence embedding is computed for a word, sentence and paragraph
```python
word = "Anaconda"
sen = "Tiger is India's national animal."  # sentence
# paragraph
para = (
    "Universal Sentence Encoder embeddings also support short paragraphs. "
    "There is no hard limit on how long the paragraph is. "
)
msgs = [word, sen, para]
```
- Reduce logging output
```python
logging.set_verbosity(logging.ERROR)
```
The set_verbosity() method sets the threshold for which messages will be logged.
- Embed the defined word, sentence and paragraph using the embed() method defined in step (3).
```python
message_emb = embed(msgs)
```
- Compute and print sentence embeddings
```python
for i, embedding in enumerate(np.array(message_emb).tolist()):
    print("Msg: {}".format(msgs[i]))  # print the message
    # print the size of the embedding
    print("Embedding size: {}".format(len(embedding)))
    # print a snippet of the embedding representation
    msg_emb_snippet = ", ".join(str(x) for x in embedding[:3])
    print("Embedding: [{}, ...]\n".format(msg_emb_snippet))
```
Output:
- Define a function to find semantic text similarity between sentences
```python
def plot_similarity(labels, features, rotation):
    # compute the inner product of the encodings
    corr = np.inner(features, features)
    sns.set(font_scale=1.2)
    g = sns.heatmap(  # plot the heatmap
        corr,  # computed inner product
        xticklabels=labels,  # label the axes with the input sentences
        yticklabels=labels,
        vmin=0,  # vmin and vmax are the values anchoring the colormap
        vmax=1,
        cmap="YlOrRd",  # matplotlib colormap name (yellow-orange-red)
    )
    g.set_xticklabels(labels, rotation=rotation)
    g.set_title("Semantic Textual Similarity")
```
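To see what the heatmap values mean, np.inner can be run directly on toy vectors. The three 2-d unit vectors below are made-up stand-ins for real 512-dimensional sentence embeddings; since they are unit length, the inner product behaves like cosine similarity.

```python
import numpy as np

# Toy unit-length stand-ins for three sentence embeddings.
a = np.array([0.6, 0.8])    # e.g. "I like my phone"
b = np.array([0.8, 0.6])    # a similar sentence -> high inner product
c = np.array([-0.8, 0.6])   # an unrelated sentence -> inner product near 0

features = np.stack([a, b, c])
corr = np.inner(features, features)  # same call as in plot_similarity()
print(np.round(corr, 2))
```

The diagonal is 1.0 (each vector compared with itself), the similar pair scores 0.96, and the unrelated pair scores 0.0, which is exactly the pattern the heatmap visualises in colour.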
- Define a function to feed the message embeddings for plotting the heatmap
```python
def run_and_plot(msgs):
    message_embeddings_ = embed(msgs)
    plot_similarity(msgs, message_embeddings_, 90)  # labels rotated by 90 degrees
```
- Define the input sentences
```python
messages = [
    # Smartphones
    "I like my phone",
    "My phone is not good.",
    "Your cellphone looks great.",
    # Weather
    "Will it snow tomorrow?",
    "Recently a lot of hurricanes have hit the US",
    "Global warming is real",
    # Food and health
    "An apple a day, keeps the doctors away",
    "Eating strawberries is healthy",
    "Is paleo better than keto?",
    # Asking about age
    "How old are you?",
    "what is your age?",
]
```
- Pass the input messages to run_and_plot() defined in step (9)
```python
run_and_plot(messages)
```
Output:
References
To get an in-depth understanding of the Universal Sentence Encoder, refer to the original research paper: Cer et al., "Universal Sentence Encoder", arXiv:1803.11175, 2018.