Evaluating a machine learning model is the first and foremost step in measuring its performance. Numerous metrics exist for evaluating an ML model, and selecting the most suitable ones is critical to producing a well-tuned model.
Amazon has introduced a new metric, QUALS, to evaluate the performance of abstractive text summarisation models. An upgrade on its predecessor, the new metric is substantially faster and more practical for training. It targets abstractive summarisation, which summarises a text by rephrasing its content in newly generated wording rather than simply copying phrases from the source.
The Problem
These systems are deep-learning based and are trained to maximise the overlap between the summaries they generate and the reference summaries in their training data. In practical use of abstractive summarisation, however, high overlap is not enough: a generated summary can overlap substantially with the target summary and still contain factually incorrect phrases.
Amazon AI has provided an example to project the same.
Credit: Amazon AI Blogpost
The Solution
Conventional metrics for training abstractive-summarisation models don’t account for factual accuracy. Amazon has introduced a new metric for measuring the performance of abstractive-summarisation models called QUALS. “Our metric adopts the same general strategy as the earlier QAGS metric, but it’s 55 times as fast to apply, which makes it more practical for model training,” according to Amazon’s blog post.
Credit: Amazon AI Blogpost
The image above shows the architecture of the new QUALS metric (bottom) compared with the earlier QAGS (top). QUALS has a simpler architecture, which allows it to run faster.
Comparing models trained with both techniques, the researchers found that their approach improved on the best-performing previous models by 15% on one dataset and 2% on another.
Question-Answer Scoring: QAGS vs QUALS
QAGS, the previous system, scores a text summary with a multi-step procedure: extracting names and noun phrases from the summary, feeding those phrases to a trained question-generation model, feeding the generated questions to a trained question-answering model, and comparing the resulting answers against the summary.
QAGS therefore requires the sequential application of three neural models.
On the other hand, QUALS reduces these sequences to one, making it 55 times faster than QAGS. QUALS stands for question answering with a language model score for summarisation.
It uses the joint question-and-answer generation (QAGen) model that takes a text as input and generates question-and-answer pairs about it.
How QUALS Works
QUALS requires a single neural model, the question-and-answer generation model.
For a given summary, the model produces 60 high-probability question-and-answer pairs, using a diversity-promoting search so that the pairs are varied rather than near-duplicates of one another. It then filters out pairs whose answer word sequences do not appear in the summary.
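This filtering step can be sketched in a few lines of Python. The function name and the token-containment check are illustrative assumptions, not Amazon's actual implementation:

```python
def filter_qa_pairs(qa_pairs, summary):
    """Keep only question-answer pairs whose answer token sequence
    actually appears in the summary (a simple containment check
    stands in for the real word-sequence matching)."""
    summary_tokens = summary.lower().split()
    kept = []
    for question, answer in qa_pairs:
        answer_tokens = answer.lower().split()
        n = len(answer_tokens)
        # Check whether the answer's token sequence occurs in the summary.
        found = any(
            summary_tokens[i:i + n] == answer_tokens
            for i in range(len(summary_tokens) - n + 1)
        )
        if found:
            kept.append((question, answer))
    return kept
```

For example, given the summary "The home team won the match", a pair whose answer is "the home team" survives the filter, while a pair whose answer is "last year" is discarded.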
How it Finds Factual Inconsistencies
The source text behind the summary is then fed to the same QAGen model, which computes the likelihood of the question-answer pairs that were generated from the summary. The model compares each pair's likelihood given the source text with its likelihood given the summary. When a pair is likely given the summary but unlikely given the source text, the QUALS score is low; such a discrepancy means the summary supports a question-answer pair that the source text does not, which indicates a factual inconsistency.
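Assuming we already have a length-normalised log-likelihood for each question-answer pair under the QAGen model, once conditioned on the source and once on the summary, the comparison can be sketched as follows. The function name and the exact score definition are illustrative, not the paper's precise formula:

```python
def quals_score(loglik_given_source, loglik_given_summary):
    """Average, over QA pairs, of each pair's log-likelihood given
    the source text minus its log-likelihood given the summary.
    A pair that is likely under the summary but unlikely under the
    source drags the score down, flagging a possible inconsistency."""
    diffs = [
        src - summ
        for src, summ in zip(loglik_given_source, loglik_given_summary)
    ]
    return sum(diffs) / len(diffs)

# A pair well supported by the source barely lowers the score;
# a pair the source makes improbable lowers it sharply.
consistent = quals_score([-1.0, -1.2], [-0.9, -1.1])    # small gap
inconsistent = quals_score([-6.0, -5.5], [-0.9, -1.1])  # large gap
```

The gap between the two conditional likelihoods, rather than either likelihood alone, is what signals that the summary asserts something the source does not support.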
Training Methodology
The researchers have proposed a contrastive-learning method that uses the QUALS score to train the summarisation model. Contrastive learning lets a model learn the general features of a dataset, without labels, by comparing similar and dissimilar data points.
The procedure starts by training a summarisation model in the standard way, using maximum-likelihood estimation (MLE) against the reference summaries. Next, this trained model generates new summaries for the source texts in the training dataset. The summaries are then split into two groups: one containing the ground-truth summaries and generated summaries with high QUALS scores, the other containing generated summaries with low QUALS scores.
Lastly, the team retrains the summarisation model with a loss function that encourages it to generate summaries like those in the first group and discourages summaries like those in the second.
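A toy sketch of one such contrastive objective is below, assuming each candidate summary already carries a model score. The hinge-style margin formulation is a common choice for contrastive objectives and is an assumption here, not necessarily the exact loss Amazon used:

```python
def contrastive_loss(pos_scores, neg_scores, margin=1.0):
    """Hinge-style contrastive loss: penalise every
    (positive, negative) summary pair whose positive score does
    not beat the negative score by at least `margin`."""
    total, count = 0.0, 0
    for p in pos_scores:       # ground-truth / high-QUALS summaries
        for n in neg_scores:   # low-QUALS summaries
            total += max(0.0, margin - (p - n))
            count += 1
    return total / count

# Positives scored well above negatives incur no loss:
loss = contrastive_loss([2.0, 1.5], [0.0])  # -> 0.0
```

Minimising this loss pushes the model's scores for the first group above its scores for the second, which is the "encourage one group, discourage the other" behaviour described above.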
Evaluation
The team compared against two baselines: a model trained in the standard way and another trained with contrastive learning. The summaries were evaluated with three ROUGE metrics in addition to QUALS. Across the metrics, models trained using QUALS outperformed the two baselines.
A confirmatory human-evaluation study compared 100 summaries generated by the QUALS-trained model with 100 generated by the MLE-trained baseline. Human judges compared the summaries for factual consistency, grammar, and informativeness. The QUALS-based summaries were judged better on factual accuracy and informativeness, while grammatical correctness was rated the same for both.
Other popular evaluation techniques:
Chi-Square
The χ2 test is a popular method for testing hypotheses about two or more groups. It allows developers to analyse categorical data and to run a test of independence, checking whether two variables in a bivariate (contingency) table are independent of each other.
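As a concrete illustration, the χ2 statistic for a contingency table can be computed directly from observed and expected counts. This pure-Python sketch assumes the table is a simple 2D list of counts:

```python
def chi_square_statistic(table):
    """Chi-square statistic for a contingency table:
    sum over cells of (observed - expected)^2 / expected,
    where expected = row_total * col_total / grand_total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Rows in perfect proportion are consistent with independence:
chi_square_statistic([[10, 20], [30, 60]])  # -> 0.0
```

A statistic near zero supports independence; larger values (compared against the χ2 distribution with the appropriate degrees of freedom) indicate the two variables are related.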
Confusion Matrix
The confusion matrix, or error matrix, is a 2D table that describes the performance of a classification model on a set of test data. In one common convention, each row represents the instances of a predicted class while each column represents the instances of an actual class; the two axes can also be swapped.
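A minimal confusion-matrix builder, following the row-equals-predicted convention described above (the function name and label handling are illustrative):

```python
def confusion_matrix(actual, predicted, labels):
    """Build a confusion matrix as a 2D list where rows index the
    predicted class and columns the actual class (swap the two
    indices in the update line for the opposite convention)."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[p]][index[a]] += 1
    return matrix

# Binary example: one true negative, one false negative, two true positives.
cm = confusion_matrix(actual=[0, 1, 1, 1],
                      predicted=[0, 0, 1, 1],
                      labels=[0, 1])
# cm == [[1, 1], [0, 2]]
```

From such a matrix one can read off accuracy, precision, and recall, which is why it is a standard first diagnostic for classifiers.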
Gini Coefficient
The Gini coefficient is a popular metric for imbalanced class values since it provides a statistical measure of dispersion. The coefficient ranges from 0 to 1, where 0 represents perfect equality and 1 represents perfect inequality; a higher value indicates greater dispersion in the data.
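One common way to compute the Gini coefficient of a set of values is via the mean absolute difference between all pairs, normalised by twice the mean. A small sketch (the function name is our own):

```python
def gini_coefficient(values):
    """Gini coefficient via the pairwise mean absolute difference:
    G = sum(|x_i - x_j|) / (2 * n^2 * mean).
    0 means perfect equality; values approach 1 as the
    distribution becomes maximally unequal."""
    n = len(values)
    mean = sum(values) / n
    abs_diff_sum = sum(abs(x - y) for x in values for y in values)
    return abs_diff_sum / (2 * n * n * mean)

gini_coefficient([1, 1, 1, 1])  # perfectly equal  -> 0.0
gini_coefficient([0, 0, 0, 1])  # highly unequal   -> 0.75
```

Note that for a finite sample of n values the maximum attainable coefficient is (n-1)/n, which is why the four-element example above tops out at 0.75 rather than 1.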