AI Research Code is Open, but Accessibility is an Issue

AI can’t advance without shared datasets and code, yet no matter how many conferences back Open Science, researchers still aren’t making their code public

For the past few years, the scientific community worldwide has been advocating the accessibility of science. ‘Open Science’, as they call it, is an ongoing movement to make research papers accessible to all. 

Open information is vital for research, even in space tech. Not many know that when scientists created the first-ever image of a black hole three years ago, it was made possible only because of a piece of open-source software: the plotting library Matplotlib.
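
As a toy illustration (emphatically not the Event Horizon Telescope pipeline), the sketch below uses Matplotlib’s standard imshow API to render a 2D intensity array as an image, the same basic operation a plotting library performs when visualising reconstructed telescope data. The ring-shaped array here is entirely synthetic.

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy 2D intensity map: a bright ring, loosely evoking an accretion-disk image.
# Illustrative stand-in only; this is NOT Event Horizon Telescope data.
y, x = np.mgrid[-1:1:200j, -1:1:200j]
r = np.hypot(x, y)
intensity = np.exp(-((r - 0.5) ** 2) / 0.02)

plt.imshow(intensity, cmap="hot")  # render the array as an image
plt.axis("off")
plt.savefig("ring.png", dpi=150)   # write the figure to disk
```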

Research papers that claim to have their dataset or code open are often found to be making false proclamations. “Most of the time they don’t have a publicly available link and expect you to mail them, in which case, too, they reply maybe once for every ten papers,” points out a Redditor.

AI research doesn’t do much for the community if the code isn’t shared by the publisher; it isn’t complete without open-source code accessible to all. According to the Radiological Society of North America, “AI research can be considered essentially useless for other researchers if it is not readily usable, reproducible, replicable, or transparent.”

However, it’s not always the fault of the author or researcher. Sometimes they write a thesis, but the dataset or code was designed by someone else, and the institute or publisher that legally owns it requires an email request to grant access. Recently, a Reddit user shared a similar incident: “My thesis is based on a dataset that my supervisor designed and collected. She wrote a paper on it. But technically speaking, her employer (the university) owns it. So whenever someone needs access, the university’s ethics board needs to approve it.”

We also can’t ignore the fact that researchers do publish papers promising available datasets and code, but fail to get permission from all stakeholders. Even with good intentions, there can still be an autocratic stakeholder who vetoes the code being published. The conferences, too, can’t withdraw those papers, as they have already gone through all acceptance phases.

Steps taken to improve the community 

There is, in fact, a checklist in place for researchers, called the Machine Learning Code Completeness Checklist. Proposed for the Neural Information Processing Systems (NeurIPS) conference, the checklist recommends that researchers follow a systematic path that actually helps the AI community, rather than just publishing a paper.

The checklist requires the publisher to include a specification of dependencies, training and evaluation code, pre-trained models, and a README file. According to GitHub data, the more of these boxes a repository checks, the higher its star rating tends to be.
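
To make the checklist concrete, here is a minimal sketch of how one might audit a code release against those items. The file names (requirements.txt, train.py, and so on) are assumptions for illustration; the checklist prescribes categories of content, not specific filenames.

```python
from pathlib import Path

# Hypothetical mapping from checklist items to common file names; the
# checklist itself prescribes categories of content, not exact filenames.
CHECKLIST = {
    "specification of dependencies": ["requirements.txt", "environment.yml", "setup.py"],
    "training code": ["train.py"],
    "evaluation code": ["evaluate.py", "eval.py"],
    "pre-trained models": ["checkpoints", "models"],
    "README": ["README.md", "README.rst"],
}

def audit_repo(repo: Path) -> None:
    """Print which checklist items a code release appears to satisfy."""
    for item, candidates in CHECKLIST.items():
        found = any((repo / name).exists() for name in candidates)
        print(f"[{'x' if found else ' '}] {item}")

if __name__ == "__main__":
    audit_repo(Path("."))  # run from the root of a paper's code release
```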

AI thrives on open science; without open code and datasets, it simply won’t move as fast. There are cases of this in AI, but data sharing in the AI community is still not as big an issue as it is in bioengineering, where some say code sharing is considered almost radical. According to a study published on ScienceDirect, over 90% of researchers don’t follow their own data-sharing commitments. It checked around 1,792 papers that claimed their data was open; when the authors were mailed, only 7% replied. “As many as 1669 (93%) authors either did not respond or declined to share their data with us,” says the report.

Another reason often pointed out is that scientists are heavily disincentivised and don’t have the time or energy to respond to critics. In the bioengineering field, if someone does leave their data open, people may misunderstand it and start bombarding them with questions. Overworked researchers find sharing data with select people easier and less controversial.

Controversy does affect research papers. “Other scientists assume there’s no smoke without fire and start trolling them. After a couple of days, they become aware of the thread and create another post defending their students’ work and clear the air but it’s already too late. No one is going to get a notification that they have responded and everyone has already moved on with their lives,” a Redditor pointed out. 

Issue of readability

Even if a research paper makes its code accessible to all, that code may not be usable. Nature ran a study on the quality and execution success of research code from the Harvard Dataverse repository and found that only 25% of the R files ran without throwing an error. Even after code cleaning, only 40% of the R files ran smoothly.
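
The Nature study examined R scripts, but the failure class is language-agnostic. One of the most common culprits is a hard-coded absolute path that exists only on the original author’s machine; the hypothetical Python sketch below contrasts the brittle pattern with a portable alternative.

```python
import sys
from pathlib import Path

# Brittle pattern common in research code: an absolute path that exists only
# on the original author's machine, so the script crashes everywhere else.
#   data = open("/home/alice/projects/study/data/results.csv")

# Portable alternative (hypothetical layout): take the path as a command-line
# argument, falling back to a location relative to the script itself.
default = Path(__file__).parent / "data" / "results.csv"
data_path = Path(sys.argv[1]) if len(sys.argv) > 1 else default

if not data_path.exists():
    sys.exit(f"data file not found: {data_path} (pass the path as an argument)")

with data_path.open() as f:
    header = f.readline().strip()
    print(f"loaded {data_path}; columns: {header}")
```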

Elaborating on this, a researcher in an online community said, “Our primary goal isn’t to write a public library. The code isn’t really meant to be that, for the most part. We put research code online so our results can be verified during peer review. We let anyone use the code artifacts of our work as a bonus since we eventually want our ideas to spread.”

According to him, they would rather put time and effort into implementing new ideas and designing good experiments. “Readmes don’t get us a lot of career advancement, unfortunately.”

PS: The story was written using a keyboard.