Taking the cue, three former students of Jadavpur University, Kolkata, Sourya Dipta Das (currently working at SHL India as a Research Engineer), Ayan Basak (working as a data scientist at Snapdeal), and Saikat Dutta (working at LG Ads Solutions as data scientist) have developed an AI model for fake news detection with a high level of accuracy. Recently, their work has been published in the Neurocomputing journal. Analytics India Magazine got in touch with the trio to understand the nuts and bolts of their AI model.
The model consists of five main parts:
- Text Preprocessing
- Backbone Model Architectures
- Statistical Feature Fusion Network
- Predictive Uncertainty Estimation Model
- Heuristic Post Processing
Figure 1: Fake News Identification Initial Process Block Diagram
A major chunk of social media items, such as tweets, is written in colloquial language and contains information such as usernames, URLs, and emojis. The team filtered out such attributes from the given data as a basic preprocessing step before feeding it into the ensemble model.
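The cleanup step can be pictured with a short sketch. The exact filters the team applied are not specified, so the patterns below (URLs, @-handles, emojis) are illustrative assumptions:

```python
import re

def clean_tweet(text):
    """Strip usernames, URLs and emojis as a basic preprocessing step.

    A minimal sketch; the actual filters used by the team are not
    documented, so these regexes are stand-ins.
    """
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # remove URLs
    text = re.sub(r"@\w+", "", text)                   # remove @username handles
    text = text.encode("ascii", "ignore").decode()     # drop emojis / non-ASCII
    return re.sub(r"\s+", " ", text).strip()           # collapse whitespace
```

The cleaned string is what would then be passed to the tokeniser in the next step.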
During tokenisation, each sentence is broken down into tokens before being fed into a model. The team used a variety of tokenisation approaches depending on the pre-trained model, as each model expects tokens to be structured in a particular manner, including the presence of model-specific special tokens. Each model has its own vocabulary associated with its tokeniser, trained on a large corpus of data. During training, each model applies its tokenisation technique, with the corresponding vocabulary, to the news data.
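The point that each backbone expects its own token layout can be illustrated with a toy tokeniser. A real pipeline would use each model's own pretrained subword tokeniser and vocabulary (e.g. via the Hugging Face `transformers` library); the whitespace splitting and the two model styles below are purely illustrative:

```python
def tokenise(sentence, model="bert"):
    """Illustrative tokenisation: each backbone expects its own special tokens.

    Real pipelines use each model's pretrained tokeniser and subword
    vocabulary; this sketch only shows the structural differences.
    """
    tokens = sentence.lower().split()
    if model == "bert":       # BERT-style sequence: [CLS] ... [SEP]
        return ["[CLS]"] + tokens + ["[SEP]"]
    if model == "roberta":    # RoBERTa-style sequence: <s> ... </s>
        return ["<s>"] + tokens + ["</s>"]
    raise ValueError(f"unknown model: {model}")
```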
Backbone Model Architectures
The team has used a variety of pre-trained language models as backbone models for text classification. “For each model, an additional fully connected layer is added to its respective encoder sub-network to obtain prediction probabilities for each class- ‘real’ and ‘fake’ as a prediction vector. Pre-trained weights for each model are fine-tuned using the tokenized training data. The same tokeniser is used to tokenise the test data and the fine-tuned model checkpoint is used to obtain predictions during inference,” Dutta said.
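The classification head Dutta describes can be sketched as a single fully connected layer followed by a softmax over the two classes. The weight matrix `W` and bias `b` below are hypothetical stand-ins for the parameters that would be fine-tuned along with the encoder:

```python
import numpy as np

def classification_head(encoder_output, W, b):
    """Fully connected layer plus softmax over the encoder's pooled output,
    yielding a prediction vector of probabilities for 'real' and 'fake'.

    encoder_output: (hidden_dim,) pooled representation from a backbone model.
    W: (hidden_dim, 2) weights, b: (2,) bias -- fine-tuned with the encoder.
    """
    logits = encoder_output @ W + b
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()
```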
The team used the model prediction vectors obtained from inference on the news titles for the different models to obtain their final classification result, i.e. “real” or “fake”. The reason behind using an ensemble of fine-tuned pre-trained language models is to utilise the knowledge each model has extracted from the corpus it was trained on.
However, in the case of the FakeNewsNet dataset, they obtained an additional prediction vector by running NewsBERT on the news body, which is also appended to the existing feature set. All the features used here are derived from the raw text data. To balance out an individual model’s limitations, an ensemble can be useful for a collection of similarly well-performing models.
Statistical Feature Fusion Network
“Our basic intuition behind using statistical features is that meta-attributes like username handles, URL domains, news source, news author, etc. are very important aspects of a news item and they can convey reliable information regarding the genuineness of such items. We have tried to incorporate the effect of these attributes along with our original ensemble model predictions,” Dutta said.
They calculated probability values corresponding to each of the attributes (say, the probability of a username handle or URL domain indicating a fake news item) and added them to the feature set. The team used the frequency of each class for each of these attributes in the training set to compute these probability values, and found that soft-voting works better than hard-voting. Hence, the post-processing step takes soft-voting prediction vectors into account.
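The difference between the two voting schemes can be shown with a small sketch. The probability vectors below are made up, with index 0 standing for “real” and index 1 for “fake”:

```python
import numpy as np

def soft_vote(pred_vectors):
    """Average the class-probability vectors across models, then take argmax."""
    return int(np.mean(pred_vectors, axis=0).argmax())

def hard_vote(pred_vectors):
    """Each model casts one vote (its own argmax); the majority class wins."""
    votes = [int(np.argmax(p)) for p in pred_vectors]
    return max(set(votes), key=votes.count)
```

In the test case below, two models weakly favour “fake” while one strongly favours “real”: hard-voting returns “fake” (two votes to one), but soft-voting, which weighs model confidence, returns “real”.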
Predictive Uncertainty Estimation Model
The team has designed an approximate Bayesian neural network as a Statistical Feature Fusion Network (SFFN) for uncertainty estimation of fake news classification. The team applied a Monte Carlo Dropout (MCDropout) layer between hidden layers of the feature fusion network for Bayesian interpretation.
The Monte Carlo (MC) dropout is applied both during training and inference. Hence, the model does not produce the same output each time inference is done on the same data point. The MC dropout enabled them to make random predictions that can be interpreted as samples from a probability distribution. From this model, they got the prediction vector along with its uncertainty value.
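The MC Dropout procedure can be sketched in a few lines. The single-layer network below is a hypothetical stand-in for the SFFN; the point is only that keeping dropout active at inference turns repeated forward passes into samples from a predictive distribution, whose mean gives the prediction vector and whose spread gives the uncertainty value:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_predict(x, W, b, n_samples=100, p_drop=0.5):
    """Monte Carlo Dropout: dropout stays active at inference, and the
    forward pass is repeated many times on the same input. The mean of the
    sampled outputs is the prediction; their standard deviation is an
    uncertainty estimate. Hypothetical one-layer sketch, not the team's SFFN.
    """
    preds = []
    for _ in range(n_samples):
        mask = rng.random(x.shape) > p_drop   # fresh random dropout mask
        h = (x * mask) / (1 - p_drop)         # inverted-dropout scaling
        logits = h @ W + b
        exp = np.exp(logits - logits.max())   # stable softmax
        preds.append(exp / exp.sum())
    preds = np.array(preds)
    return preds.mean(axis=0), preds.std(axis=0)
```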
Here, they augmented the original framework with a heuristic approach taking into account the effect of the statistical attributes. This approach works well for data with attributes like URL domains, username handles, news source, etc. For texts that lack these attributes, they relied on ensemble model predictions. These attributes allowed them to add meaningful features to the current feature set. They constructed new training, validation and test feature sets using class-wise probability vectors from the ensemble model outputs as well as probability values obtained from the statistical attributes of the training data.
“We use a novel heuristic algorithm on this resulting feature set to obtain our final class predictions. The intuition behind using a heuristic approach taking the statistical features into account is that if a particular feature can by itself be a strong predictor for a particular class, and that particular class is predicted whenever the value of a feature is greater than a particular threshold, a significant number of incorrect predictions obtained using the previous steps can be ‘corrected’ back,” said Dutta.
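The correction heuristic Dutta outlines can be sketched as a simple threshold rule. The function name, the 0.9 threshold, and the interpretation of `feature_prob` (the fraction of training items with this attribute that were fake) are assumptions for illustration, not the team's published algorithm:

```python
def heuristic_correct(model_pred, feature_prob, threshold=0.9):
    """If a statistical feature is by itself a strong predictor of a class,
    override the earlier prediction with that class; otherwise keep it.

    feature_prob: probability (from training-set frequencies) that an item
    with this attribute, e.g. a URL domain, is fake. Hypothetical sketch.
    """
    if feature_prob >= threshold:
        return "fake"            # feature strongly indicates fake news
    if feature_prob <= 1 - threshold:
        return "real"            # feature strongly indicates real news
    return model_pred            # feature not decisive; keep model prediction
```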
Figure 2: Fake News Identification Post Process Block Diagram
Basak said language models have played a crucial role in developing the model. Statistical concepts like approximate Bayesian inference have also been incorporated to perform uncertainty estimation for fake news items. The system has been built in Python; TensorFlow 2.0 and scikit-learn were used to develop the model.
“We have developed hand-engineered statistical features using attributes like author name, URL domain, etc, and it was quite challenging to devise a strategy to fuse these features with the model predictions and make a prediction. The uncertainty estimation in the case of fake news classification was also very difficult. Finally, ensuring that our framework is robust and unaffected by variations in the type of news items was a very big challenge,” said Basak.
To tackle the data imbalance issue, the team used data augmentation via an oversampling technique (KMeans-SMOTE). They used the Statistical Feature Fusion Network (SFFN) sub-model to combine the hand-engineered statistical features with the model predictions.
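The team's actual implementation uses KMeans-SMOTE (available in the `imbalanced-learn` library); the sketch below shows only the core SMOTE interpolation idea, without the clustering step, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

def smote_oversample(minority, n_new):
    """SMOTE-style oversampling: synthesise new minority-class samples by
    interpolating between a real sample and its nearest neighbour.

    KMeans-SMOTE additionally clusters the data first and oversamples
    within sparse clusters; that step is omitted here for brevity.
    """
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x = minority[i]
        d = np.linalg.norm(minority - x, axis=1)  # distances to all samples
        d[i] = np.inf                             # exclude the sample itself
        neighbour = minority[d.argmin()]          # nearest other sample
        gap = rng.random()
        synthetic.append(x + gap * (neighbour - x))  # point on the segment
    return np.array(synthetic)
```

Each synthetic point lies on the line segment between two real minority samples, so the augmented data stays within the minority class's feature region.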
“We have also used transfer learning using a number of different pre-trained language models trained on a large data corpus. This ensured our final model is robust and also has a diverse knowledge base from different sources,” said Das.