Graph Neural Networks (GNNs) are an effective framework for representation learning of graphs.
Large-scale knowledge graphs are best known for supporting NLP applications such as semantic search and dialogue generation, while companies like Pinterest have already adopted a graph-based system (Pixie) for real-time, high-performance tasks.
GNNs follow a neighborhood aggregation scheme, where the representation vector of a node is computed by recursively aggregating and transforming representation vectors of its neighboring nodes.
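The aggregation scheme described above can be sketched in a few lines of plain Python. This is only an illustrative toy, not the formulation of any particular GNN: the function name `gnn_layer`, the mean aggregator, and the `transform` callable (a stand-in for learned weights and a nonlinearity) are all assumptions made for the example.

```python
def gnn_layer(features, neighbors, transform=lambda vec: vec):
    """One round of neighborhood aggregation followed by a transform.

    `transform` stands in for the learned part of a real GNN layer
    (weights plus nonlinearity); the identity default is an assumption
    made purely for illustration.
    """
    updated = {}
    for node, own in features.items():
        # Gather the node's own vector together with its neighbors' vectors.
        vectors = [own] + [features[n] for n in neighbors.get(node, [])]
        # Aggregate: coordinate-wise mean over the gathered vectors.
        mean = [sum(col) / len(vectors) for col in zip(*vectors)]
        updated[node] = transform(mean)
    return updated


# Toy path graph a - b - c with one-dimensional node features.
features = {"a": [1.0], "b": [2.0], "c": [6.0]}
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}

h1 = gnn_layer(features, neighbors)  # each node sees its 1-hop neighborhood
h2 = gnn_layer(h1, neighbors)        # stacking rounds reaches 2-hop neighbors
```

Applying the layer twice is what "recursively" refers to here: after the second round, the representation of node `a` already depends on node `c`, which is two hops away.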
Many GNN variants have been proposed and have achieved state-of-the-art results on both node and graph classification tasks. However, despite GNNs revolutionizing graph representation learning, there is limited understanding of their representational properties and limitations.
Given these limitations, it is important to investigate how graph neural networks work and how well calibrated they are, since they are widely used for classification tasks.
So, to test the efficacy of GNNs, a few researchers experimented with existing techniques and presented their findings in a recently published paper.
A well-calibrated machine learning model produces confidence scores that match how often it is actually right. For instance, if the softmax output of a convolutional neural network assigns 80% confidence to its predicted image class, then about 80% of such predictions should turn out to be correct.
Avoiding Misfires And Mishits
The evaluation used two tools: the calibration (or reliability) diagram and the expected calibration error (ECE) metric.
In the calibration diagram, the predictions of the model are grouped into bins according to their confidence value. Then, for each bin, a point is drawn whose x-coordinate is the average confidence of the predictions in the bin and whose y-coordinate is their average accuracy.
The Expected Calibration Error (ECE), by contrast, is a single number that summarizes the calibration error. It is the average of the gaps between confidence and accuracy in the reliability diagram, weighted by the number of predictions in each bin.
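As a rough sketch of how both tools can be computed, the snippet below bins predictions by confidence, collects one (average confidence, accuracy) point per non-empty bin for the reliability diagram, and sums the weighted gaps to obtain the ECE. The function name, the choice of 10 equal-width bins, and the toy inputs are assumptions for illustration, not details taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Return (ece, points) for equal-width confidence bins.

    confidences: predicted-class confidence of each example, in [0, 1].
    correct:     1 if the prediction was right, 0 otherwise.
    points:      (average confidence, accuracy) per non-empty bin,
                 i.e. the points of the reliability diagram.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    points = []
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        points.append((avg_conf, accuracy))
        # Gap between accuracy and confidence, weighted by bin size.
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece, points


# Toy example: an overconfident model.
confidences = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
correct = [1, 1, 1, 0, 1, 0]
ece, points = expected_calibration_error(confidences, correct)
```

For a perfectly calibrated model every point lies on the diagonal of the diagram and the ECE is zero; in the toy example both bins fall below the diagonal, which indicates overconfidence.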
First, the graph neural networks were trained on data from social networking platforms such as Friendster and Facebook, as well as from Amazon and PUBMED.
Popular GNNs such as Graph Convolutional Networks, Graph Attention Networks and Graph Isomorphism Networks were trained using the PyTorch Geometric library.
Since evaluating the existing methodologies is the prime concern here, the researchers considered calibration improvement techniques like MC Dropout, histogram binning, isotonic regression and temperature scaling.
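Of the techniques listed above, temperature scaling is the simplest to illustrate: the logits are divided by a single scalar T, chosen on held-out data, before the softmax is applied. The sketch below is a minimal, assumption-laden version. It picks T by a plain grid search over negative log-likelihood, whereas practical implementations typically use a gradient-based optimizer, and the helper names and toy validation data are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(row, temperature)[y])
              for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, candidates=None):
    """Pick the temperature that minimizes validation NLL (grid search)."""
    if candidates is None:
        # Illustrative grid from 0.5 to 5.0 in steps of 0.1.
        candidates = [0.5 + 0.1 * k for k in range(46)]
    return min(candidates, key=lambda t: nll(logit_rows, labels, t))


# Toy validation set: the model is very confident but only right 3 times out of 4.
val_logits = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 0.0]]
val_labels = [0, 1, 1, 0]

T = fit_temperature(val_logits, val_labels)
before = softmax(val_logits[0])[0]     # confidence before scaling
after = softmax(val_logits[0], T)[0]   # confidence after scaling
```

Because dividing by a positive T does not change which logit is largest, the predicted classes and the accuracy stay the same; only the confidence values move, here down towards the observed 75% accuracy.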
In many real-world datasets, it is common to face an imbalanced class distribution, which can make it challenging to learn meaningful models. In the FRIENDSTER dataset, researchers observed a severe class imbalance, with more than 60% of labeled nodes belonging to the most prevalent class, while less than 1% of labeled nodes belong to the least prevalent class.
This led to predictions collapsing towards a single (most prevalent) class, with all GNNs assigning at least 95% of examples to the most prevalent class (and some hyperparameter configurations assigning 100% of examples to a single class).
Key Findings
To address the imbalance, the less prevalent classes are upweighted so that every class contributes equally during training.
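One common way to realize such upweighting is to weight each example's loss by the inverse frequency of its class. The article does not spell out the exact weighting scheme the researchers used, so the formula below (n_samples / (n_classes * class_count)) and the function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def weighted_nll(probs, labels, weights):
    """Weighted average negative log-likelihood of the true labels."""
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    return total / sum(weights[y] for y in labels)


# Toy label set with a 6 : 3 : 1 imbalance across three classes.
labels = [0] * 6 + [1] * 3 + [2] * 1
weights = balanced_class_weights(labels)

# Total weight carried by each class after upweighting.
per_class_total = {cls: weights[cls] * labels.count(cls) for cls in weights}
```

With these weights every class carries the same total weight in the loss, so a model can no longer reach a low loss simply by always predicting the most prevalent class.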
When the test and train distribution are dissimilar, the conclusions drawn from the evaluation can be misleading.
The results show that for easier tasks all GNNs are reasonably calibrated, while for harder tasks, such as in the FRIENDSTER dataset, GNNs can be miscalibrated and existing calibration techniques are unable to calibrate them. In addition, using the proper test distribution during evaluation affects both accuracy and calibration.
Along with finding new metrics for calibrating GNNs, applying reinforcement learning to graph-based reasoning to guide the model's search could open up other interesting avenues in graph-based machine learning systems.