Complete Guide to SDNet: Contextualized Attention-based Deep Network for Conversational Question-Answering

Conversational Question Answering is an exciting task that requires the model to read a passage and answer questions in a dialogue. It differs from Machine Reading Comprehension, where the model reads a passage and answers questions in a stateless manner, i.e. it doesn’t use information from previous questions and answers. This new task expects the model to comprehend the passage, understand the conversational context, and perform coreference resolution.

SDNet is a contextualized attention-based deep neural network that achieved state-of-the-art results on the challenging task of Conversational Question Answering. It makes use of inter-attention and self-attention along with bidirectional LSTM layers.

Chenguang Zhu, Michael Zeng and Xuedong Huang, researchers at Microsoft, introduced this model in a paper published on 2nd January 2019.


The SDNet model is built upon Machine Reading Comprehension (MRC) models.

Let us look at the model’s innovative architecture step by step.


The model takes the passage as input, from which the context (C) is learned. It also takes the current question as input, together with the previous question-answer pairs, which it needs to understand the context of the dialogue.

Each question is represented by:

Q_k = {Q_{k−N}; A_{k−N}; ...; Q_{k−1}; A_{k−1}; Q_k}

The N previous question (Q) and answer (A) pairs are taken into consideration while answering the current question.
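The history-augmented question can be sketched as a simple concatenation. This is only an illustration: the function name `build_question_with_history`, the parameter `n_prev` (playing the role of N) and the sample dialogue are assumptions, not taken from the SDNet code.

```python
def build_question_with_history(questions, answers, k, n_prev):
    """Prefix question k with up to n_prev earlier question/answer pairs.

    questions[i] and answers[i] are the i-th question and its answer (0-indexed).
    """
    start = max(0, k - n_prev)  # do not reach back before the first turn
    parts = []
    for i in range(start, k):
        parts.append(questions[i])
        parts.append(answers[i])
    parts.append(questions[k])  # the current question always comes last
    return " ".join(parts)


# Hypothetical dialogue about a single passage
questions = ["Who had a birthday?", "How old was she?", "Did she plan a party?"]
answers = ["Jessica", "80"]
history_input = build_question_with_history(questions, answers, k=2, n_prev=1)
```

With `n_prev=1`, only the immediately preceding question and answer are prepended to the third question. This is what lets the model resolve "she" in the current question.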

All the questions on one passage are treated as a batch by the model.


The model uses both GloVe and BERT representations of each word or token in the context and the question. GloVe embeddings are used as a straightforward 300-dimensional vector lookup. BERT representations for each word are calculated from its Byte Pair Encoding (BPE) tokens. Each word is broken down into s BPE tokens, and each token has L hidden vectors, one for each layer of BERT. These are combined in a weighted sum to get a single vector for each word.

α is the weight, which is learned.
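A minimal sketch of this combination, assuming that each layer's BPE-token vectors are averaged over the s tokens and that there is one learned weight per layer. The names `bert_word_vector`, `hidden` and `alpha` are illustrative, and the numbers are made up.

```python
def bert_word_vector(hidden, alpha):
    """Collapse per-layer, per-BPE-token vectors into one vector for a word.

    hidden[l][t] is the hidden vector of BPE token t at BERT layer l.
    alpha[l] is the weight of layer l (learned in the real model).
    """
    num_layers = len(hidden)
    num_tokens = len(hidden[0])
    dim = len(hidden[0][0])
    word_vec = [0.0] * dim
    for l in range(num_layers):
        for t in range(num_tokens):
            for d in range(dim):
                # average over BPE tokens, then weight the layer by alpha[l]
                word_vec[d] += alpha[l] * hidden[l][t][d] / num_tokens
    return word_vec


# Toy example: L = 2 layers, s = 2 BPE tokens, 2-dimensional hidden vectors
hidden = [
    [[1.0, 0.0], [3.0, 2.0]],  # layer 0
    [[0.0, 4.0], [2.0, 0.0]],  # layer 1
]
alpha = [0.5, 1.0]
word_vec = bert_word_vector(hidden, alpha)
```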

Context Layers

We have each word of the context (passage) vectorized using different techniques. We need to combine these vectors and feed them into the Context Layers. The input to the context layers is one vector w per word, built by combining the following components:

  • f is a feature vector representing the POS tag, the NER tag and exact matching with the question.
  • h is a word-level inter-attention vector, defined below.
  • BERT is the vector mentioned in the embeddings section.
  • GloVe is a 300-dimensional vector of the word.
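The components listed above can be assembled by plain concatenation. The sketch below is illustrative only: the helper names and the concatenation order are assumptions, and the feature vector is reduced to the exact-match flag (POS and NER features are left out).

```python
def exact_match_feature(context_words, question_words):
    """Return 1.0 for each context word that also occurs in the question, else 0.0."""
    question_set = {w.lower() for w in question_words}
    return [1.0 if w.lower() in question_set else 0.0 for w in context_words]


def build_context_input(glove_vec, bert_vec, attended_vec, feature_vec):
    """Concatenate the per-word components into one input vector."""
    return list(glove_vec) + list(bert_vec) + list(attended_vec) + list(feature_vec)


flags = exact_match_feature(["Jessica", "went", "home"], ["Where", "did", "jessica", "go"])
# Tiny made-up vectors stand in for the real 300-d GloVe and BERT embeddings
w_first = build_context_input([0.1, 0.2], [0.3], [0.4], [flags[0]])
```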

Word-level inter-attention is one of the inputs of the context layer. It is calculated from question to context using the word embeddings of the question (Q) and the context (C).

The Context Layer (left part in the image) contains K bidirectional LSTMs to develop a context-based understanding of the passage. Each of these RNN layers produces one output vector per passage token.

A Multilevel Attention (MLA) block is used to calculate attention from question to context. The attention score for each token in the context is calculated using all the previous RNN layer outputs.

Note that the query of the attention is a vector representation of each token in the passage, whereas the key-value pairs are the corresponding vector representations of the question tokens.
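The query/key/value roles can be sketched with a plain dot-product attention. This is a simplification: SDNet's actual scoring function and its use of multiple RNN levels are not reproduced here, and the names `softmax`, `attend`, `context_vecs` and `question_vecs` are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """For each query, return the attention-weighted sum of the values."""
    dim = len(values[0])
    attended = []
    for q in queries:
        # dot-product score between this passage token and every question token
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        attended.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return attended


context_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # queries: passage tokens
question_vecs = [[10.0, 0.0], [0.0, 10.0]]           # keys and values: question tokens
attended = attend(context_vecs, question_vecs, question_vecs)
```

Each passage token receives its own mixture of question vectors, so the output has one vector per passage token.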

A shortcut connection is added from the RNN outputs to the MLA output, and their concatenation is passed through one more bidirectional LSTM. Traditional self-attention is then applied to the outputs of this RNN layer. One more RNN layer is added on top of the self-attention layer to get the final output (uC) of the context layers.

Question Layers

Question layers are very similar to the Context Layers. They contain the following layers.

  • GloVe and BERT embeddings are concatenated.
  • K RNN layers develop a contextualized understanding of the question.
  • One more RNN layer generates a higher-level understanding.
  • Self-attention on the output of this RNN generates the final question representation.

These n vectors representing the question are further compressed into one vector as shown below.

u^Q = Σ_i β_i u_i^Q, where β_i ∝ exp(w^T u_i^Q) and w is a parameterized vector.
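The pooling formula above can be sketched as follows. The function name `pool_question` and the toy inputs are illustrative. In the real model, w is learned during training.

```python
import math


def pool_question(u_q, w):
    """Compress the per-token question vectors u_q into one vector.

    beta_i is proportional to exp(w . u_q[i]); the result is sum_i beta_i * u_q[i].
    """
    scores = [sum(wi * ui for wi, ui in zip(w, u)) for u in u_q]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    betas = [e / total for e in exps]
    dim = len(u_q[0])
    pooled = [sum(b * u[d] for b, u in zip(betas, u_q)) for d in range(dim)]
    return pooled, betas


# With w = 0 every token gets the same weight, so the result is a plain average
pooled, betas = pool_question([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```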

Output Layer

This model’s outputs can be Yes/No or the span of the passage that answers the question.

To get the span, we need the probability of the answer starting at each word in the context.

These start probabilities are used along with the outputs of the question and context layers to generate the probability of the answer ending at each word of the passage.
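Once start and end probabilities are available, a span still has to be picked. The sketch below uses a common decoding rule: maximize the product of start and end probability, subject to the start not coming after the end. This rule is an assumption for illustration, not something confirmed from the SDNet code. The `max_len` limit and the probability values are hypothetical.

```python
def best_span(p_start, p_end, max_len=15):
    """Return the (start, end) indices maximizing p_start[i] * p_end[j] with i <= j."""
    best = (0, 0)
    best_score = -1.0
    for i, ps in enumerate(p_start):
        # only consider ends at or after the start, within max_len words
        for j in range(i, min(i + max_len, len(p_end))):
            score = ps * p_end[j]
            if score > best_score:
                best_score = score
                best = (i, j)
    return best, best_score


p_start = [0.1, 0.6, 0.2, 0.1]
p_end = [0.5, 0.1, 0.3, 0.1]
span, score = best_span(p_start, p_end)
```

Here the most likely end position (index 0) is rejected for the most likely start (index 1), because an answer cannot end before it starts.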

If the answer to the question is yes or no, we need to generate the corresponding probabilities.

All the weight matrices W introduced in these probability calculations are learned during training.

In Action

Microsoft has made the code for the SDNet model open source. It is available here.

We can clone this repository with

!git clone

Let’s use the CoQA dataset to train and test the model.

We need to download the data, the BERT model and the GloVe embeddings to train this model. The following are the commands to get these files.

 !tar -xf bert-base-uncased.tar.gz
 !cp /content/SDNet/conf /content/coqa/ 

We need to arrange these files into the following directory structure.

Directory Structure

The model can now be trained with a single command:

!python SDNet/ train coqa/conf

This command trains the model on the training data and predicts results for the dev data. But let’s see how to run prediction only, without training the model. For this we need a pretrained model, test data and a config file, all of which we saved in the previous step.

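 import os
 import torch
 from SDNet.Models.SDNetTrainer import SDNetTrainer
 from SDNet.Utils.Arguments import Arguments

 # conf_file, model_path and test_data must already be set to the config file,
 # the saved model and the test data from the previous step.
 conf_args = Arguments(conf_file)
 opt = conf_args.readArguments()
 opt['cuda'] = torch.cuda.is_available()  # use the GPU when one is available
 opt['confFile'] = conf_file
 opt['datadir'] = os.path.dirname(conf_file)  # conf_file specifies where the data folder is
 # Optional overrides: cmdline_args is expected to hold parsed command-line
 # arguments (e.g. an argparse namespace); skip this loop if there are none.
 for key, val in cmdline_args.__dict__.items():
     if val is not None and key not in ['command', 'conf_file']:
         opt[key] = val
 model = SDNetTrainer(opt)
 # Run prediction only, using the pretrained model on the test data
 predictions, confidence, pred_json = model.official(model_path, test_data)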

The following are the validation F1 scores obtained by the SDNet model using various settings on the CoQA dataset.

The code mentioned above is available here.


Many applications employ chatbots to interact with human customers, but these chatbots are limited in their ability to maintain a coherent dialogue. Models like SDNet can help improve chatbots considerably, as they handle coreference resolution and context understanding to a good extent.
