Understanding and making sense of human language in order to generate value is the central objective of Natural Language Processing. This makes the task of representing text extremely important to NLP practitioners. There are many ways to represent text, and different techniques preserve different levels and types of information. The task at hand heavily influences the choice of representation. Abstract Meaning Representation (AMR) is one such representation, with several applications in the NLP domain.
AMR is a graph-based representation that aims to preserve semantic relations. AMR graphs are rooted, labelled, directed, acyclic graphs that span whole sentences. They are intended to abstract away from syntactic representations; in this sense, sentences that are similar in meaning should be assigned the same AMR, even if they are not identically worded.
Example:
"China is expanding its military power to attempt to join the ranks of the superpowers"
Each node represents a concept, and each edge represents the type of relationship between these concepts.
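AMRs are commonly written in PENMAN notation. A rough, hand-written sketch of an AMR for the sentence above might look like the following (an illustrative approximation, not the official gold annotation):

(e / expand-01
   :ARG0 (c / country :name (n / name :op1 "China"))
   :ARG1 (p / power
            :mod (m / military)
            :poss c)
   :purpose (a / attempt-01
               :ARG0 c
               :ARG1 (j / join-01
                        :ARG0 c
                        :ARG1 (r / rank
                                 :poss (s / superpower)))))

Here the variables (e, c, p, ...) with their concepts (expand-01, country, power, ...) are the nodes, and role labels such as :ARG0, :mod and :purpose are the edges.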
In this post, let’s explore a graph-to-sequence architecture that generates natural text from an AMR graph.
AMR to Text Translation
We are often interested in the natural language information present in the AMR graph.
But extracting this language is not a trivial task, as tense and functional information are abstracted away during AMR graph construction. Statistical approaches have been used to solve this problem, but they require too many rules and do not generalize easily.
Deep learning, and seq2seq models in particular, has demonstrated strong results in translating AMR graphs. The graph is serialised using a depth-first traversal to produce the input sequence. But converting graph input into sequence input adds complexity to the algorithm, and the model struggles to recover the original connections after serialization. The following architecture instead uses a graph state LSTM to accept the graph input directly.
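Before moving on, here is a rough sketch of the depth-first serialisation step that seq2seq approaches rely on, to illustrate what graph-to-sequence models avoid. The node structure (a concept string plus a list of (label, child) pairs) is hypothetical and only meant to show the idea of flattening a graph into a token sequence:

def linearize_dfs(node, visited=None):
    # Toy depth-first linearisation; `node.concept` and `node.edges`
    # are assumed attributes of a hypothetical graph node class.
    if visited is None:
        visited = set()
    if id(node) in visited:
        # Re-entrant node (already visited): emit only its concept.
        return [node.concept]
    visited.add(id(node))
    tokens = ["(", node.concept]
    for label, child in node.edges:
        tokens.append(label)
        tokens.extend(linearize_dfs(child, visited))
    tokens.append(")")
    return tokens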
Graph State LSTM
The AMR graph is represented as a set of hidden state vectors g = { h_j }, where each vector in this set represents a node. Edges of the graph are represented as triples (i, j, l), where i, j and l are the source, target and label of an edge.
We use LSTM cells to connect these nodes and let information be exchanged between the graph’s neighbouring nodes. These exchanges occur recurrently, so information propagates through the entire graph, which helps retain non-local relationships between concepts. The number of time steps needed depends on the size of the graph.
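The recurrent update can be pictured as the following loop, in which every time step lets each node aggregate messages from its neighbours before its state is updated. This is only a conceptual sketch with hypothetical names (update_fn stands in for the LSTM update described next), not the repository code:

import numpy as np

def graph_state_updates(node_states, edges, num_steps, update_fn):
    # node_states: [num_nodes, hidden_dim] array of hidden vectors h_j
    # edges:       list of (source, target, label) triples
    # update_fn:   stand-in for the LSTM-style update of each node's state
    for _ in range(num_steps):  # more steps -> information travels further in the graph
        messages = np.zeros_like(node_states)
        for source, target, label in edges:
            # Each edge passes the source node's current state to the target node.
            messages[target] += node_states[source]
        node_states = update_fn(node_states, messages)
    return node_states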
Each node is updated using an LSTM cell, described below.
Embedding of an edge = [embedding of source node; embedding of target node; embedding of label]
Pretrained GloVe vectors are used for embedding the tokens.
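As a concrete (hypothetical) sketch, an edge embedding can be assembled by concatenating the three lookups; the variable names below are illustrative, while the repository builds the equivalent tensors as shown in the next snippet:

# word_embedding, edge_label_embedding and the *_ids tensors are assumed inputs.
source_emb = tf.nn.embedding_lookup(word_embedding, source_token_ids)      # [batch, word_dim]
target_emb = tf.nn.embedding_lookup(word_embedding, target_token_ids)      # [batch, word_dim]
label_emb = tf.nn.embedding_lookup(edge_label_embedding, edge_label_ids)   # [batch, label_dim]
edge_representation = tf.concat([source_emb, target_emb, label_emb], axis=-1)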
For each node, the representations of its incoming neighbour nodes and edges are gathered and summed:
passage_in_neighbor_edge_representations = tf.nn.embedding_lookup(
    self.edge_embedding, self.passage_in_neighbor_edges)
# [batch_size, passage_len, passage_neighbors_size_max, input_dim]
passage_in_neighbor_node_representations = collect_neighbor_node_representations(
    passage_node_representation, self.passage_in_neighbor_indices)
passage_in_neighbor_representations = tf.concat(
    [passage_in_neighbor_node_representations,
     passage_in_neighbor_edge_representations], 3)
passage_in_neighbor_representations = tf.multiply(
    passage_in_neighbor_representations,
    tf.expand_dims(self.passage_in_neighbor_mask, axis=-1))
# [batch_size, passage_len, input_dim + edge_dim]
passage_in_neighbor_representations = tf.reduce_sum(
    passage_in_neighbor_representations, axis=2)
Similarly, passage_out_neighbor_representations are calculated for the outgoing neighbours; a sketch is shown below.
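Assuming the outgoing direction mirrors the incoming one (with the corresponding *_out_* tensors; see the repository for the exact code), the aggregation would look like this:

passage_out_neighbor_edge_representations = tf.nn.embedding_lookup(
    self.edge_embedding, self.passage_out_neighbor_edges)
passage_out_neighbor_node_representations = collect_neighbor_node_representations(
    passage_node_representation, self.passage_out_neighbor_indices)
passage_out_neighbor_representations = tf.concat(
    [passage_out_neighbor_node_representations,
     passage_out_neighbor_edge_representations], 3)
passage_out_neighbor_representations = tf.multiply(
    passage_out_neighbor_representations,
    tf.expand_dims(self.passage_out_neighbor_mask, axis=-1))
passage_out_neighbor_representations = tf.reduce_sum(
    passage_out_neighbor_representations, axis=2)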
We also need to add connections from the previous hidden states of the graph to the current state. For this, the hidden states are calculated iteratively at each time step.
passage_in_edge_prev_hidden = collect_neighbor_node_representations(
    passage_node_hidden, self.passage_in_neighbor_indices)
passage_in_edge_prev_hidden = tf.multiply(
    passage_in_edge_prev_hidden,
    tf.expand_dims(self.passage_in_neighbor_mask, axis=-1))
# [batch_size, node_len, neighbor_vector_dim]
passage_in_edge_prev_hidden = tf.reduce_sum(passage_in_edge_prev_hidden, axis=2)
passage_in_edge_prev_hidden = tf.multiply(
    passage_in_edge_prev_hidden,
    tf.expand_dims(self.passage_nodes_mask, axis=-1))
passage_in_edge_prev_hidden = tf.reshape(
    passage_in_edge_prev_hidden, [-1, options.neighbor_vector_dim])
Now we can implement the LSTM structure described above.
## ig
passage_edge_ingate = tf.sigmoid(
    tf.matmul(passage_in_neighbor_representations, w_in_ingate)
    + tf.matmul(passage_in_edge_prev_hidden, u_in_ingate)
    + tf.matmul(passage_out_neighbor_representations, w_out_ingate)
    + tf.matmul(passage_out_edge_prev_hidden, u_out_ingate)
    + b_ingate)
passage_edge_ingate = tf.reshape(passage_edge_ingate,
    [batch_size, passage_nodes_size_max, options.neighbor_vector_dim])

## fg
passage_edge_forgetgate = tf.sigmoid(
    tf.matmul(passage_in_neighbor_representations, w_in_forgetgate)
    + tf.matmul(passage_in_edge_prev_hidden, u_in_forgetgate)
    + tf.matmul(passage_out_neighbor_representations, w_out_forgetgate)
    + tf.matmul(passage_out_edge_prev_hidden, u_out_forgetgate)
    + b_forgetgate)
passage_edge_forgetgate = tf.reshape(passage_edge_forgetgate,
    [batch_size, passage_nodes_size_max, options.neighbor_vector_dim])

## og
passage_edge_outgate = tf.sigmoid(
    tf.matmul(passage_in_neighbor_representations, w_in_outgate)
    + tf.matmul(passage_in_edge_prev_hidden, u_in_outgate)
    + tf.matmul(passage_out_neighbor_representations, w_out_outgate)
    + tf.matmul(passage_out_edge_prev_hidden, u_out_outgate)
    + b_outgate)
passage_edge_outgate = tf.reshape(passage_edge_outgate,
    [batch_size, passage_nodes_size_max, options.neighbor_vector_dim])

## input
passage_edge_cell_input = tf.tanh(
    tf.matmul(passage_in_neighbor_representations, w_in_cell)
    + tf.matmul(passage_in_edge_prev_hidden, u_in_cell)
    + tf.matmul(passage_out_neighbor_representations, w_out_cell)
    + tf.matmul(passage_out_edge_prev_hidden, u_out_cell)
    + b_cell)
passage_edge_cell_input = tf.reshape(passage_edge_cell_input,
    [batch_size, passage_nodes_size_max, options.neighbor_vector_dim])

passage_edge_cell = passage_edge_forgetgate * passage_node_cell \
    + passage_edge_ingate * passage_edge_cell_input
passage_edge_hidden = passage_edge_outgate * tf.tanh(passage_edge_cell)
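Note that each gate uses separate weight matrices for the incoming and outgoing directions (w_in_* / u_in_* versus w_out_* / u_out_*), so information flowing along an edge is transformed differently from information flowing against it, while passage_node_cell and passage_node_hidden carry each node’s own state from the previous time step.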
Decoder
Now that we have successfully encoded the AMR graph using hidden state vectors, we need a decoder to convert these vectors into a natural language sequence.
Before looking at the decoder in detail, we need to address the sparsity problem caused by open-class tokens such as names of people, dates, etc.
This problem is solved by using a copy mechanism with an attention layer during decoding. The final probability over the vocabulary is an interpolation between the probabilities generated by the decoder and the probabilities from the attention layer.
P_final = θ_t * P_vocabulary + (1 − θ_t) * P_attention
Attention is implemented in the following manner.
encoder_states are the hidden states of the graph state LSTM.
encoder_features are calculated from encoder_states as shown below.
encoder_features = tf.expand_dims(encoder_states, axis=2)
# now is shape [batch_size, passage_len, 1, encoder_dim]
W_h = variable_scope.get_variable("W_h",
    [1, 1, encoder_dim, options.attention_vec_size])
self.W_h = W_h
encoder_features = nn_ops.conv2d(encoder_features, W_h, [1, 1, 1, 1], "SAME")
# [batch_size, passage_len, 1, attention_vec_size]
encoder_features = tf.reshape(encoder_features,
    [batch_size, passage_len, options.attention_vec_size])
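Applying nn_ops.conv2d with a 1 × 1 kernel here is simply a convenient way of multiplying every node’s hidden state by the same projection matrix W_h in a single operation; it is equivalent to applying a shared linear layer at each position.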
A context vector is calculated using attention.
context_vector = tf.reduce_sum(tf.expand_dims(attn_dist, axis=-1) * encoder_states, axis=1)
A coverage vector is calculated by accumulating the attention distributions from all the previous time steps; it keeps track of how much attention each input node has already received and helps the model avoid repeating itself.
coverage += attn_dist
Attention distribution is calculated using the decoder state, encoder features and coverage vector.
state_features = linear(decoder_state, attention_vec_size, True)  # [batch_size, attention_vec_size]
state_features = tf.expand_dims(state_features, 1)  # [batch_size, 1, attention_vec_size]
all_features = encoder_features + state_features  # [batch_size, passage_len, attention_vec_size]
coverage_features = tf.expand_dims(coverage, axis=-1) * w_c  # coverage_features: [batch_size, passage_len, 1]
all_features += coverage_features
e = tf.reduce_sum(v * tf.tanh(all_features), axis=-1)  # [batch_size, passage_len]
attn_dist = nn_ops.softmax(e)  # [batch_size, passage_len]
attn_dist *= node_mask
Now let’s see how the copy mechanism is implemented.
θ_t acts as a switch that controls whether a word is generated from the vocabulary or copied directly from the input graph. It is calculated using the context vector, the decoder state, the input to the decoder cell and the output of the decoder.
p_gen = linear([context_t, state_t.c, state_t.h, x], 1, True)  # [batch_size, 1]
p_gen = tf.sigmoid(p_gen)

# Each entry on the list output_t corresponds to one decoder step
vocab_score_t = tf.nn.xw_plus_b(output_t, w, b)  # apply the linear layer
vocab_score_t = tf.nn.softmax(vocab_score_t)

# For pointer-generator model, calc final distribution from copy distribution and vocabulary distribution
vocab_score_t = self.merge_prob_dist_for_one_step(
    vocab_score_t, attn_dist, p_gen, node_idxs, node_mask)
vocab_score_t = _clip_and_normalize(vocab_score_t, 1e-6)
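The body of merge_prob_dist_for_one_step is not shown here. Conceptually, it scatters the copy (attention) probabilities onto the vocabulary positions of the corresponding input nodes and interpolates with p_gen. A simplified sketch of that idea, not the repository’s exact implementation, could look like this:

def merge_prob_dist_sketch(vocab_dist, attn_dist, p_gen, node_idxs, vocab_size):
    # vocab_dist: [batch_size, vocab_size] distribution from the decoder
    # attn_dist:  [batch_size, passage_len] copy distribution over input nodes
    # p_gen:      [batch_size, 1] probability of generating from the vocabulary
    # node_idxs:  [batch_size, passage_len] vocabulary id of each input node
    copy_dist = tf.reduce_sum(
        tf.one_hot(node_idxs, vocab_size) * tf.expand_dims(attn_dist, axis=-1),
        axis=1)  # project copy probabilities onto vocabulary positions
    # P_final = theta_t * P_vocabulary + (1 - theta_t) * P_attention
    return p_gen * vocab_dist + (1.0 - p_gen) * copy_dist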
Inputs to the decoder cell are:
- the state of the decoder at the previous time step,
- the output of the decoder at the previous time step,
- the context vector from the previous time step,
- the coverage vector from the previous time step,
- the attention distribution.
The output of the decoder is the vocab_score described above. A simplified sketch of a single decoder step is shown below.
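Putting the pieces together, a single decoder step looks roughly like the following minimal sketch, which uses an ordinary LSTM cell and folds the attention and coverage code above into a hypothetical attention_step helper (gen_hidden_size is an assumed option name):

cell = tf.nn.rnn_cell.LSTMCell(options.gen_hidden_size)

def decoder_step(prev_word_emb, prev_state, prev_context, coverage,
                 encoder_states, encoder_features, node_mask):
    # Feed the previous word embedding together with the previous context vector.
    x = tf.concat([prev_word_emb, prev_context], axis=-1)
    cell_output, state = cell(x, prev_state)
    # attention_step is a hypothetical wrapper around the attention/coverage snippets above.
    attn_dist, context_t, coverage = attention_step(
        state, encoder_states, encoder_features, coverage, node_mask)
    return cell_output, state, context_t, attn_dist, coverage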
The code repository containing the snippets shown above is available here: https://github.com/freesunshine0316/neural-graph-to-seq-mp
Here’s a link to a quick run of the above repository: https://colab.research.google.com/drive/1OKLdTxeVaizNbhCvNO5TmnA9z0c6x001?usp=sharing
Conclusion
Outside of formal study, AMRs have several applications in the NLP domain: they can be used for abstractive text summarisation, semantic parsing and information extraction. Robust architectures that reconstruct natural language from these representations will help advance these applications.