Every year, the machine learning community hosts at least half a dozen international conferences showcasing exceptional research from across the world. The less attractive side of these conferences is the review system, on which researchers' credentials and careers intrinsically depend. Every year, a handful of researchers write lengthy blog posts about being overlooked for petty reasons. Consider this lament from a Redditor after the release of last year's ICLR reviews:
“‘Your proof is not at all rigorous because I don’t follow.’(No details included)
I’ve never expected a review to be so unprofessional and disrespectful.”
Automating high-stakes assessment, however, has its own pitfalls. HireVue, for instance, which makes money by automating the laborious corporate interview process, recently decided to stop using facial recognition technology in interviews after an external audit found that its models for reading candidates' facial expressions fell short of the promised unbiased results. Now, building an automated system that understands the merits of a scientific endeavour does sound ambitious, to say the least. But the incentive to build one is driven by the sheer volume of research and the disappointments that come with review results.
Research by the numbers at top ML conferences:
- CVPR: 1,470 research papers on computer vision accepted from 6,656 valid submissions.
- ICLR: 687 out of 2,594 papers made it to ICLR 2020, a 26.5% acceptance rate.
- ICML: 1,088 papers accepted from 4,990 submissions.
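As a quick sanity check on these figures, the acceptance rates implied by the counts can be computed directly (the CVPR and ICML rates are not stated above, so they are derived here):

```python
# Accepted/submitted counts quoted above for each conference.
stats = {
    "CVPR": (1470, 6656),
    "ICLR 2020": (687, 2594),
    "ICML": (1088, 4990),
}

for conf, (accepted, submitted) in stats.items():
    rate = 100 * accepted / submitted
    print(f"{conf}: {accepted}/{submitted} = {rate:.1f}% accepted")
```

The stated 26.5% for ICLR 2020 checks out; CVPR and ICML land near 22%, so roughly one in four or five submissions makes it through at these venues.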
Biases, gaps in technical expertise and plain physical exhaustion all come into play when reviewing thousands of papers. Now take into account every subdomain of every scientific field that has ever existed. The average PhD holder publishes at least five papers during their tenure. It is entirely possible that impactful work goes unnoticed by human reviewers. Taking all these factors into account, researchers at Carnegie Mellon University have used natural language processing to review scientific papers. In a paper demonstrating their experiment, the researchers discuss how viable automating the review process really is.
Overview Of The Method
The review system built by the Carnegie Mellon researchers can often precisely summarise the core idea of a paper and generate reviews that cover its different aspects. "This could potentially provide a preliminary template for reviewers and help them quickly identify salient information in making their assessment," wrote the researchers.
The researchers listed a few attributes that make a good reviewer or review system:
The above desiderata are used to set a benchmark for the automated review system. The researchers introduced metrics that help quantify these attributes. For comprehensiveness, for example, they use a metric called Aspect Coverage (ACOV): given a review R, aspect coverage measures how many aspects (e.g., clarity) in a predefined aspect typology are covered by R. They also propose another metric, Aspect Recall (AREC), which explicitly takes into account the meta review, an authoritative summary of all the reviews for a paper.
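The idea behind aspect coverage can be sketched as the fraction of a predefined aspect typology that a review touches. The keyword lists below are toy assumptions for illustration; the paper relies on a learned aspect tagger, not keyword matching:

```python
# Toy aspect typology with illustrative trigger keywords (hypothetical).
ASPECT_KEYWORDS = {
    "clarity": ["clear", "well-written", "readable", "confusing"],
    "originality": ["novel", "original", "new idea"],
    "soundness": ["rigorous", "proof", "correct", "flawed"],
    "substance": ["experiments", "thorough", "evaluation"],
}

def aspect_coverage(review: str) -> float:
    """Fraction of aspects in the typology mentioned in the review."""
    text = review.lower()
    covered = {
        aspect
        for aspect, keywords in ASPECT_KEYWORDS.items()
        if any(kw in text for kw in keywords)
    }
    return len(covered) / len(ASPECT_KEYWORDS)

review = ("The paper is well-written and the experiments are thorough, "
          "but the proof is flawed.")
print(aspect_coverage(review))  # clarity, substance, soundness -> 0.75
```

A review that name-checks three of the four aspects scores 0.75, so the metric directly rewards breadth of discussion rather than depth.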
And for measuring the kindness of the language used in a review, they used semantic equivalence metrics, which measure the similarity between generated reviews and reference reviews.
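Semantic equivalence metrics in this setting are typically overlap scores such as ROUGE. The function below is a simplified stand-in, a plain unigram F1 between a generated review and a human-written reference, shown only to illustrate the idea:

```python
from collections import Counter

def unigram_f1(generated: str, reference: str) -> float:
    """Unigram-overlap F1 between a generated and a reference review.

    A simplified stand-in for semantic-equivalence metrics such as
    ROUGE; not the exact metric used in the paper.
    """
    gen = Counter(generated.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((gen & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(unigram_f1("the method is novel and clear",
                 "the proposed method is novel"))  # ~0.73
```

Higher scores mean the generated review uses much the same wording as the reference; embedding-based scorers relax this to semantic rather than lexical overlap.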
For the experiments, the authors prepared their own dataset, Aspect-enhanced Peer Review (ASAP-Review), by scraping openly available reviews from top conferences such as NeurIPS. The basic statistics of the ASAP-Review dataset can be seen below.
The experiments used BART, a pre-trained sequence-to-sequence model. To characterise potential biases in the reviews, the authors first defined an aspect score, which calculates the percentage of positive occurrences of each aspect; the polarity of each aspect is obtained from the learned tagger.
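The aspect score can be sketched as the positive fraction per aspect over tagged review spans. The tags below are hand-made stand-ins for the output of the learned tagger:

```python
from collections import defaultdict

# Hypothetical (aspect, polarity) tags; in the paper these come from
# a learned tagger run over the review text.
tags = [
    ("clarity", "positive"),
    ("clarity", "negative"),
    ("originality", "positive"),
    ("soundness", "negative"),
    ("clarity", "positive"),
]

def aspect_scores(tagged):
    """Fraction of positive occurrences for each tagged aspect."""
    counts = defaultdict(lambda: [0, 0])  # aspect -> [positive, total]
    for aspect, polarity in tagged:
        counts[aspect][1] += 1
        if polarity == "positive":
            counts[aspect][0] += 1
    return {a: pos / total for a, (pos, total) in counts.items()}

print(aspect_scores(tags))
# e.g. clarity is positive in 2 of 3 mentions -> ~0.67
```

Comparing these per-aspect fractions between generated and human reviews is one way to surface systematic biases, such as a model that is uniformly upbeat about clarity.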
The authors also admit that given the complexity of understanding the merit of scientific contributions, it is difficult to expect an automated system to be able to match a well-qualified human reviewer any time soon. However, some degree of review automation may assist reviewers in their assessments.
“In answer to the titular question of “can we automate scientific review,” the answer is clearly “not yet”. However, we believe the models, data, and analysis tools presented in this paper will be useful as a starting point for systems that can work in concert with human reviewers to make their job easier and more effective,” concluded the authors.
Read the original paper here.