Transformer models, especially BERT, transformed the NLP pipeline. They eased the problem of sparse annotations for text data: instead of training a model from scratch, we can now simply fine-tune existing pre-trained models. But the sheer size of BERT (340M parameters for BERT-large) makes it a bit unapproachable. Running inference with BERT is compute-intensive and time-consuming. ALBERT is a lite version of BERT that shrinks the model while maintaining performance.
This model was published in the paper "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations", presented at ICLR 2020 by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma and Radu Soricut (researchers at Google Research and Toyota Technological Institute at Chicago).
Architecture of ALBERT
The main idea of ALBERT is to reduce the number of parameters (by up to 90%) using novel techniques without taking a big hit in performance. This compressed version also scales much better than the original BERT, so performance can be improved while still keeping the model small.
The backbone of the architecture is the Multi-headed, multi-layer Transformer.
This picture taken from http://primo.ai/index.php?title=Attention is a great visualization of the transformer model.
The Transformer shown there is an encoder-decoder model, with self-attention at the encoder end and attention on encoder outputs at the decoder end. ALBERT, like BERT, uses only the encoder stack.
It consists of several blocks stacked on top of one another. Each of these blocks contains a multi-head attention block and a Feedforward Network. Following is an excerpt from the implementation of such a block in the source code of ALBERT. Code here is modified for the sake of brevity.
def attention_ffn_block(layer_input, hidden_size=768, intermediate_size=3072):
    # Get the self-attention tensor (queries, keys and values all come from `layer_input`).
    attention_output = attention_layer(from_tensor=layer_input, to_tensor=layer_input)

    # Run a linear projection of `hidden_size`, then add a residual
    # with `layer_input` and normalize.
    attention_output = dense_layer_3d_proj(attention_output, hidden_size)
    attention_output = layer_norm(attention_output + layer_input)

    # Run a feed-forward network with two layers, again followed by
    # a residual connection and layer normalization.
    intermediate_output = dense_layer_2d(attention_output, intermediate_size)
    ffn_output = dense_layer_2d(intermediate_output, hidden_size)
    ffn_output = layer_norm(ffn_output + attention_output)
    return ffn_output
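The add-and-normalize pattern in this block can be tried out in isolation. The following is a minimal NumPy sketch under stated assumptions, not the ALBERT source: the names `layer_norm`, `residual_ffn_block`, `w1` and `w2` are illustrative, the learned scale, shift and bias terms are left out, ReLU is used as a stand-in activation, and a random tensor takes the place of the attention output.

```python
import numpy as np


def layer_norm(x, eps=1e-12):
    """Normalize the last axis to zero mean and unit variance.

    The learned scale and shift of a real layer norm are omitted.
    """
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def residual_ffn_block(layer_input, attention_output, w1, w2):
    """Add-and-norm after attention, then a two-layer FFN with add-and-norm."""
    # Residual connection around the attention sub-layer, then normalize.
    attended = layer_norm(attention_output + layer_input)
    # Two-layer feed-forward network (ReLU is a stand-in activation).
    intermediate = np.maximum(attended @ w1, 0.0)
    ffn_output = intermediate @ w2
    # Residual connection around the feed-forward sub-layer, then normalize.
    return layer_norm(ffn_output + attended)


rng = np.random.default_rng(0)
hidden_size, intermediate_size = 8, 32           # toy sizes
x = rng.normal(size=(2, 5, hidden_size))         # [batch, sequence, hidden]
attn = rng.normal(size=(2, 5, hidden_size))      # placeholder attention output
w1 = rng.normal(size=(hidden_size, intermediate_size))
w2 = rng.normal(size=(intermediate_size, hidden_size))

block_output = residual_ffn_block(x, attn, w1, w2)
print(block_output.shape)  # same shape as the input: (2, 5, 8)
```

Because the output has the same shape as the input, blocks of this kind can be stacked on top of one another without any glue code.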
The first zoomed-in part of the image shows the multi-head attention layer. Here's its implementation:
def attention_layer(from_tensor, to_tensor, num_attention_heads=1, size_per_head=64):
    # Scalar dimensions referenced here:
    #   B = batch size (number of sequences)
    #   F = `from_tensor` sequence length
    #   T = `to_tensor` sequence length
    #   N = `num_attention_heads`
    #   H = `size_per_head`

    # `query_layer` = [B, F, N, H]
    q = dense_layer_3d(from_tensor, num_attention_heads, size_per_head)
    # `key_layer` = [B, T, N, H]
    k = dense_layer_3d(to_tensor, num_attention_heads, size_per_head)
    # `value_layer` = [B, T, N, H]
    v = dense_layer_3d(to_tensor, num_attention_heads, size_per_head)

    # Move the head axis forward: [B, N, F, H] for q, [B, N, T, H] for k and v.
    q = tf.transpose(q, [0, 2, 1, 3])
    k = tf.transpose(k, [0, 2, 1, 3])
    v = tf.transpose(v, [0, 2, 1, 3])

    # `new_embeddings` = [B, N, F, H]
    # (the attention mask and dropout are left out for brevity)
    new_embeddings = dot_product_attention(q, k, v)
    # Back to [B, F, N, H]
    return tf.transpose(new_embeddings, [0, 2, 1, 3])
The second zoomed-in picture shows how the dot product for attention is calculated. Given a triple (Q, K, V) for each sequence term, we calculate a weighted sum of the V's of all sequence terms. The weights come from the dot products of the current term's Q with the K's of all sequence terms, scaled and passed through a softmax. They represent the relevance of each sequence term to the current one.
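import math
import tensorflow as tf

def dot_product_attention(q, k, v):
    # Similarity of every query with every key: [..., length_q, length_kv]
    logits = tf.matmul(q, k, transpose_b=True)
    # Scale by 1/sqrt(size_per_head); `get_shape_list` is a helper from the ALBERT source.
    logits = tf.multiply(logits, 1.0 / math.sqrt(float(get_shape_list(q)[-1])))
    attention_probs = tf.nn.softmax(logits, name="attention_probs")
    return tf.matmul(attention_probs, v)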
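To experiment with the computation without the ALBERT helpers, here is a small self-contained NumPy sketch of the same scaled dot-product attention. It is an illustration rather than the library code: the function names and the toy tensor sizes are assumptions, and masking and dropout are left out.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(q, k, v):
    """Return the attended values and the attention weights.

    q: [..., length_q, depth]; k and v: [..., length_kv, depth]
    """
    depth = q.shape[-1]
    # Dot product of every query with every key, scaled by sqrt(depth).
    logits = q @ np.swapaxes(k, -1, -2) / np.sqrt(depth)
    # Each row of weights sums to 1 and says how relevant every term is.
    weights = softmax(logits, axis=-1)
    # Weighted sum of the values.
    return weights @ v, weights


rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4, 5, 8))   # [B, N, F, H]
k = rng.normal(size=(2, 4, 6, 8))   # [B, N, T, H]
v = rng.normal(size=(2, 4, 6, 8))   # [B, N, T, H]

attended, weights = scaled_dot_product_attention(q, k, v)
print(attended.shape)  # (2, 4, 5, 8), i.e. [B, N, F, H]
print(weights.shape)   # (2, 4, 5, 6), i.e. [B, N, F, T]
```

Each of the F query positions ends up with one weight per key position, which is why the weights have shape [B, N, F, T].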
The ALBERT architecture adds a few twists to the one described above. The following are the techniques ALBERT uses to achieve compression.
Factorization of Parameters
Hidden layer representations must be large to accommodate context information along with word-level embedding information. But increasing the hidden layer size blows up the number of embedding parameters. If V is the number of tokens in the vocabulary and H is the hidden layer size, the embedding matrix needs on the order of V*H parameters.
ALBERT factorizes these word-level input embeddings into a lower dimension. Let’s say E is the embedding size after factorization. The number of parameters needed is now of the order V*E + E*H. Since V is very large in natural languages, choosing E much smaller than H reduces the parameter count by a huge margin.
The projection from E to H is easy to implement:
# E = input_width; H = hidden_size
if input_width != hidden_size:
    next_layer = dense_layer_2d(
        input_tensor, hidden_size, create_initializer(initializer_range),
        None, use_einsum=use_einsum, name="embedding_hidden_mapping_in")
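A quick back-of-the-envelope calculation shows the size of the saving. The sketch below simply evaluates V*H against V*E + E*H; the helper name `embedding_params` and the values V = 30000, H = 768 and E = 128 are illustrative assumptions, and bias terms are ignored.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Count embedding parameters, with or without factorization."""
    if embedding_size is None:
        # One V x H lookup table.
        return vocab_size * hidden_size
    # A V x E lookup table followed by an E x H projection.
    return vocab_size * embedding_size + embedding_size * hidden_size


V, H, E = 30000, 768, 128          # illustrative sizes

full = embedding_params(V, H)               # 23,040,000
factorized = embedding_params(V, H, E)      # 3,840,000 + 98,304 = 3,938,304

print(f"unfactorized: {full:,}")
print(f"factorized:   {factorized:,}")
print(f"reduction:    {1 - factorized / full:.1%}")
```

With these numbers the embedding parameters shrink by roughly 83%, and the saving grows as H gets larger because only the small E x H term depends on H.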
Cross Layer Parameter sharing
Stacking independent layers increases the learning capacity of the model, but it also greatly increases redundancy: different layers often learn parameters that perform the same operation. ALBERT tackles this redundancy by sharing parameters within groups of layers. This reduces the total number of parameters while keeping the number of layers constant.
TensorFlow’s variable scopes can be used with get_variable() to implement this, provided the scope allows reuse (for example with reuse=tf.AUTO_REUSE).
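# TF1-style API (available as tf.compat.v1 in TensorFlow 2).
with tf.variable_scope("transformer", reuse=tf.AUTO_REUSE):
    for layer_idx in range(num_hidden_layers):
        group_idx = int(layer_idx / num_hidden_layers * num_hidden_groups)
        with tf.variable_scope("group_%d" % group_idx):
            # ...Code for implementing the layer...
            # Layers that map to the same group get the same variable back.
            w = tf.get_variable(name="kernel", shape=[hidden_size, hidden_size])
            # ...Code for implementing the layer...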
In addition to reducing the number of parameters, this cross-layer variable sharing also has the nice effect of stabilizing the model.
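The mapping from layers to groups is easy to inspect on its own. This small sketch reuses the `group_idx` formula from the loop above in plain Python; the helper name `layer_to_group` and the layer and group counts are illustrative.

```python
def layer_to_group(num_hidden_layers, num_hidden_groups):
    """Return the parameter-group index used by each layer."""
    return [
        int(layer_idx / num_hidden_layers * num_hidden_groups)
        for layer_idx in range(num_hidden_layers)
    ]


# One group: all 12 layers share a single set of parameters.
all_shared = layer_to_group(12, 1)
# Three groups: layers 0-3, 4-7 and 8-11 each share one set.
three_groups = layer_to_group(12, 3)

print(all_shared)
print(three_groups)
print("distinct parameter sets:", len(set(three_groups)))
```

The number of distinct group indices is the number of layer-sized parameter sets the model has to store, regardless of how many layers are stacked.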
Inter sentence coherence loss
This loss isn’t used to directly reduce the number of parameters. It’s used to improve the performance of the representations in downstream tasks. BERT is pre-trained on the task of next sentence prediction (NSP): we give the model pairs of segments and make it predict whether each pair is positive or negative.
| First segment | Second segment | Label |
|---|---|---|
| Sentence 1 | Sentence 2 | positive |
| Sentence 2 | Sentence 3 | positive |
| Sentence 3 | Sentence 7 | negative |
| Sentence 1 | Sentence 3 | negative |
This task, although it pushes the model to learn useful representations, turned out to be too easy. The model mostly captures the topic information shared between the sentences and predicts the class from that. As a result, a model trained for NSP struggles at predicting sentence order. ALBERT instead trains on sentence-order prediction (SOP), using data like the following, which pushes the model to learn not only topic information but also fine-grained details and coherence between sentences.
| First segment | Second segment | Label |
|---|---|---|
| Sentence 1 | Sentence 2 | positive |
| Sentence 2 | Sentence 1 | negative |
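Training pairs of this kind can be generated from any ordered list of sentences. The sketch below is one plausible way to do it, not ALBERT's data pipeline: the function name `make_sop_pairs`, the 50% swap probability and the placeholder sentences are assumptions made for illustration.

```python
import random


def make_sop_pairs(sentences, swap_prob=0.5, seed=0):
    """Build (first, second, label) examples from consecutive sentences.

    A pair kept in its original order is "positive"; a swapped pair
    is "negative".
    """
    rng = random.Random(seed)
    pairs = []
    for first, second in zip(sentences, sentences[1:]):
        if rng.random() < swap_prob:
            pairs.append((second, first, "negative"))
        else:
            pairs.append((first, second, "positive"))
    return pairs


document = ["Sentence 1", "Sentence 2", "Sentence 3", "Sentence 4"]
sop_pairs = make_sop_pairs(document)
for first, second, label in sop_pairs:
    print(f"{first} | {second} | {label}")
```

Both segments of every example come from the same document, so topic alone cannot separate the classes; the model has to look at the order.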
Usage of ALBERT
TensorFlow Hub has made it extremely easy to use pre-trained models. Let’s use the pre-trained ALBERT base model to classify movie reviews.
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text  # registers the ops used by the preprocessor

albert_url = "https://tfhub.dev/tensorflow/albert_en_base/2"
encoder = hub.KerasLayer(albert_url)

preprocessor_url = "https://tfhub.dev/tensorflow/albert_en_preprocess/3"
preprocessor = hub.KerasLayer(preprocessor_url)
The model and the required preprocessor are downloaded and loaded.
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
encoder_inputs = preprocessor(text_input)
outputs = encoder(encoder_inputs)
pooled_output = outputs["pooled_output"]
embedding_model = tf.keras.Model(text_input, pooled_output)
Just like that, we have our embedding layer. We just need to build a fully connected neural network on top of it to predict whether a movie review is positive or negative.
# `train_data` and `validation_data` are assumed to be tf.data.Dataset
# objects yielding (review text, label) pairs.
model = tf.keras.Sequential()
model.add(embedding_model)
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Dense(30, activation='relu'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Dense(8, activation='relu'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Dense(1))  # single logit: positive vs. negative

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

history = model.fit(train_data.shuffle(10000).batch(128),
                    epochs=10,
                    validation_data=validation_data.batch(128),
                    verbose=1)
For such a basic model, a validation accuracy of about 75% is a good number, especially since we are not fine-tuning the embeddings at all. We can fine-tune the embeddings by just making the encoder trainable.
encoder = hub.KerasLayer(albert_url,trainable=True)
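Since the classifier ends in a Dense(1) layer and the loss is configured with from_logits=True, the model outputs raw logits rather than probabilities. A minimal NumPy sketch of turning such logits into labels follows; the helper name `logits_to_labels`, the 0.5 threshold and the example values are illustrative assumptions, not outputs of the model above.

```python
import numpy as np


def logits_to_labels(logits, threshold=0.5):
    """Convert raw logits to probabilities and 0/1 labels."""
    logits = np.asarray(logits, dtype=np.float64)
    # The sigmoid maps a logit to the probability of the positive class.
    probs = 1.0 / (1.0 + np.exp(-logits))
    labels = (probs >= threshold).astype(int)
    return probs, labels


example_logits = [-2.0, 0.0, 3.0]   # made-up values standing in for model output
probs, labels = logits_to_labels(example_logits)
print(probs)
print(labels)  # [0 1 1]
```

In practice the input would be the array returned by model.predict on a batch of review strings.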
ALBERT is a very useful variant of BERT that is not huge. It can improve performance on downstream language understanding tasks while keeping the computational overhead at an acceptable level for many applications.