“Likelihood-based and representation-based model diagnostics are not yet as reliable as previously assumed.” – Aribandi et al.
Machine learning (ML) models are increasingly migrating from lab environments to real-world deployments. Yet deployed models incorporate few tools for running diagnostics, that is, for keeping an eye on models during training. But how reliable are these diagnostics and, more importantly, who watches the watchmen? Diagnostics help practitioners understand the failure modes and capabilities of large contemporary models, and they also point the way to improving those models.
Model diagnostics generally probe a model for:
- Acquisition of syntactic knowledge
- Presence of biases and stereotypes
- Phenomena that can be used to further improve models.
In a recent survey on model diagnostics, researchers from Google found that most diagnostics are unreliable on multiple fronts. The team picked language models and three diagnostic tasks for their experiments:
StereoSet: A large-scale natural dataset in English to measure stereotypical biases in four domains: gender, profession, race, and religion.
CrowS-Pairs: A crowdsourced dataset of 1,508 examples covering nine types of bias, such as race, religion, and age. In CrowS-Pairs, a model is presented with two sentences: one less stereotypical than the other. The data focuses on stereotypes about historically disadvantaged groups and contrasts them with advantaged groups.
SEAT: The Sentence Encoder Association Test for testing bias in language models.
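SEAT-style diagnostics work in embedding space rather than likelihood space: each target vector is scored by how much more similar it is to one attribute set than to another, and the scores are aggregated into an effect size. A minimal sketch of that computation, using made-up 2-d vectors in place of real sentence embeddings (the function names here are illustrative, not taken from the test's reference implementation):

```python
from math import sqrt
from statistics import mean, pstdev

def cosine(u, v):
    # Cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def association(w, A, B):
    # s(w, A, B): mean similarity of w to attribute set A minus to set B.
    return mean(cosine(w, a) for a in A) - mean(cosine(w, b) for b in B)

def effect_size(X, Y, A, B):
    # WEAT/SEAT-style effect size: difference of mean associations for the
    # two target sets, normalised by the pooled standard deviation.
    ax = [association(x, A, B) for x in X]
    ay = [association(y, A, B) for y in Y]
    return (mean(ax) - mean(ay)) / pstdev(ax + ay)

# Toy example: targets X lean towards attribute A, targets Y towards B.
X = [(1.0, 0.1), (1.0, 0.2)]
Y = [(0.1, 1.0), (0.2, 1.0)]
A = [(1.0, 0.0)]
B = [(0.0, 1.0)]
print(effect_size(X, Y, A, B))  # strongly positive: X is associated with A
```

Because the effect size depends entirely on the geometry of the learned embeddings, re-training with a different random seed can move the vectors enough to flip its sign.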
The researchers pre-trained five BERT-Base and BERT-Large uncased English models using TensorFlow. Experiments on StereoSet and CrowS-Pairs show that likelihood-based ranking diagnostics have a standard deviation of over 2.5 percentage points in many categories, which is large enough to lead to false conclusions from any single run.
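The point is easy to reproduce in spirit: given per-seed scores for one diagnostic category, report the mean and spread rather than a single run. The numbers below are invented for illustration; only the shape of the computation matches the paper's reporting.

```python
from statistics import mean, stdev

# Hypothetical stereotype-preference scores (in percentage points) for one
# diagnostic category, across five pre-training seeds. Made-up numbers,
# not results from the paper.
scores_by_seed = {0: 58.1, 1: 55.3, 2: 61.0, 3: 54.7, 4: 59.9}

avg = mean(scores_by_seed.values())
sd = stdev(scores_by_seed.values())
print(f"score = {avg:.1f} +/- {sd:.1f} points across seeds")

# A single-seed reading of 54.7 vs 61.0 supports opposite narratives,
# which is why per-seed variance should be reported alongside the mean.
```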
The results also suggest that the models are uncertain about their predictions on these data points, motivating diagnostic measures that take model uncertainty into account instead of simply making a binary decision by comparing likelihoods. And when it comes to vector-based diagnostics (e.g., SEAT), two models pre-trained with the exact same configuration but different random seeds can yield completely opposite conclusions.
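The binary decision being criticised is simply "which sentence does the model consider more likely?". A toy sketch, with a made-up unigram scorer standing in for a pre-trained masked language model (every probability and function name here is an assumption for illustration), plus a margin-aware variant of the kind the uncertainty argument motivates:

```python
import math

# Toy unigram "language model" with invented probabilities; a real
# diagnostic would score each sentence with a pre-trained model.
LOGPROB = {
    "the": math.log(0.05), "doctor": math.log(0.002), "is": math.log(0.03),
    "he": math.log(0.012), "she": math.log(0.010), "a": math.log(0.04),
}
UNSEEN = math.log(1e-6)  # fallback for out-of-vocabulary tokens

def log_likelihood(sentence: str) -> float:
    # Sum of per-token log-probabilities (independence assumption, for brevity).
    return sum(LOGPROB.get(tok, UNSEEN) for tok in sentence.lower().split())

def prefers_stereotype(stereo: str, anti: str) -> bool:
    # The binary decision: pick whichever sentence the model assigns
    # higher likelihood, however tiny the margin.
    return log_likelihood(stereo) > log_likelihood(anti)

def decision_with_margin(stereo: str, anti: str, tau: float = 0.5) -> str:
    # Uncertainty-aware variant: abstain when the log-likelihood gap is
    # smaller than a threshold tau, instead of forcing a binary call.
    gap = log_likelihood(stereo) - log_likelihood(anti)
    if abs(gap) < tau:
        return "abstain"
    return "stereo" if gap > 0 else "anti"
```

With the toy probabilities above, `prefers_stereotype("he is a doctor", "she is a doctor")` returns True on a log-likelihood gap of only log(1.2), exactly the kind of near-tie a margin-aware decision would abstain on.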
The researchers observed that representation-based diagnostics are less stable than likelihood-based diagnostics, because large models like BERT are optimised to model likelihoods well via their pre-training objective. Though they restricted their experiments to the intrinsic paradigm, the researchers suggest validating model diagnostics according to whether they are intrinsic or extrinsic, i.e., whether they analyse models for phenomena that are not tied to any downstream task, or do so with particular tasks in mind.
The fragility of diagnostics extends to other areas as well. For example, Elena Voita and Ivan Titov demonstrated that classifier probes are unstable, and that it might not be clear from their results whether the probe itself learned a phenomenon or whether the diagnosed representations did (Hewitt and Liang, 2019). Similarly, other researchers have found that gradient-based analyses of neural-network language technologies can often be unreliable and manipulable. And a team from Northeastern University showed that attention-based interpretations can likewise be unreliable and manipulable, to the point of deceiving practitioners.
ML models have forayed into critical fields such as medicine, the judiciary, and social media moderation, so diagnosing models for a better understanding of their behaviour is more essential than ever before. For instance, last year, researchers from Cornell Tech and Facebook AI pushed back on the machine learning hype: they demonstrated that the trend of ML success appears to be overstated, and that so-called cutting-edge, benchmark-smashing models perform almost the same as models released many years ago. According to them, metric learning algorithms have not made spectacular progress.
The Google researchers, too, try to demonstrate the significance of assessing algorithms more diligently, and how a few careful practices can help reported ML success reflect reality. The authors also concluded that no probe is perfect, and that the point of their survey is to expose weaknesses in methodologies previously thought to be impeccable. They recommend the following with regard to model diagnostics:
- A single diagnostic result cannot be generalised to the entire training setup.
- Restrict conclusions to a specific checkpoint.
- Diagnostic tools should be tested on publicly available checkpoints as well as multiple model/probe configurations.