AI Research Code is Open, but Accessibility is an Issue

AI can’t advance without shared datasets and code, yet no matter how many conferences back Open Science, researchers still aren’t making their code public

For the past few years, the scientific community worldwide has been advocating the accessibility of science. ‘Open Science’, as they call it, is an ongoing movement to make research papers accessible to all. 

Open information is vital for research, even in space tech. Not many know that when scientists created the first-ever image of a black hole three years ago, it was made possible only because of a piece of open-source software: the plotting library Matplotlib.
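
As a toy illustration (emphatically not the Event Horizon Telescope pipeline), the sketch below uses Matplotlib’s standard imshow API to render a 2D intensity array as an image, the same basic operation a plotting library performs when visualising reconstructed telescope data. The ring-shaped array here is entirely synthetic.

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy 2D intensity map: a bright ring, loosely evoking an accretion-disk image.
# Illustrative stand-in only; this is NOT Event Horizon Telescope data.
y, x = np.mgrid[-1:1:200j, -1:1:200j]
r = np.hypot(x, y)
intensity = np.exp(-((r - 0.5) ** 2) / 0.02)

plt.imshow(intensity, cmap="hot")  # render the array as an image
plt.axis("off")
plt.savefig("ring.png", dpi=150)   # write the figure to disk
```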

Research papers that claim to have their dataset or code open are often found to be making false proclamations. “Most of the time they don’t have a publicly available link and expect you to mail them, in which case, too, they reply maybe once for every ten papers,” points out a Redditor.

AI research doesn’t do much for the community if the code isn’t shared by the publisher; it isn’t complete without open-source code accessible to all. According to the Radiological Society of North America, “AI research can be considered essentially useless for other researchers if it is not readily usable, reproducible, replicable, or transparent.”

However, it’s not always the fault of the author or researcher. Sometimes they write a thesis, but the dataset or code was designed by someone else, and the institute or publisher that legally owns it requires an email request to grant access. Recently, a Reddit user shared a similar incident: “My thesis is based on a dataset that my supervisor designed and collected. She wrote a paper on it. But technically speaking, her employer (the university) owns it. So whenever someone needs access, the university’s ethics board needs to approve it.”

We also can’t ignore the fact that researchers do publish papers promising available datasets and code, but fail to get permission from all stakeholders. Even with good intentions, there can still be an autocratic stakeholder who vetoes the code being published. The conferences, too, can’t withdraw those papers, as they have already gone through all acceptance phases.

Steps taken to improve the community 

There is, in fact, a checklist in place for researchers, called the Machine Learning Code Completeness Checklist. Proposed for the Neural Information Processing Systems (NeurIPS) conference, the checklist recommends that researchers follow a systematic path that actually helps the AI community, rather than just publishing a paper.

The checklist requires the publisher to include a specification of dependencies, training and evaluation code, pre-trained models, and a README file. According to GitHub data, the more of these boxes a repository checks, the higher its star rating tends to be.
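
To make the checklist concrete, here is a minimal sketch of how one might audit a code release against those items. The file names (requirements.txt, train.py, and so on) are assumptions for illustration; the checklist prescribes categories of content, not specific filenames.

```python
from pathlib import Path

# Hypothetical mapping from checklist items to common file names; the
# checklist itself prescribes categories of content, not exact filenames.
CHECKLIST = {
    "specification of dependencies": ["requirements.txt", "environment.yml", "setup.py"],
    "training code": ["train.py"],
    "evaluation code": ["evaluate.py", "eval.py"],
    "pre-trained models": ["checkpoints", "models"],
    "README": ["README.md", "README.rst"],
}

def audit_repo(repo: Path) -> None:
    """Print which checklist items a code release appears to satisfy."""
    for item, candidates in CHECKLIST.items():
        found = any((repo / name).exists() for name in candidates)
        print(f"[{'x' if found else ' '}] {item}")

if __name__ == "__main__":
    audit_repo(Path("."))  # run from the root of a paper's code release
```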

AI thrives on open science; without open code and datasets, it simply won’t move as fast. There are cases of this in AI, but data sharing in the AI community is still not as big an issue as it is in bioengineering, where some say code sharing is considered almost radical. According to a study published on ScienceDirect, over 90% of researchers don’t follow their own data-sharing commitments. It checked around 1,792 papers that claimed their data was open; when the authors were mailed, only 7% replied. “As many as 1669 (93%) authors either did not respond or declined to share their data with us,” says the report.

Another reason often pointed out is that scientists are heavily disincentivised and don’t have the time or energy to respond to critics. In the bioengineering field, if someone does leave their data open, people may misunderstand it and start bombarding them with questions. Overworked researchers find sharing data with select people easier and less controversial.

Controversy does affect research papers. “Other scientists assume there’s no smoke without fire and start trolling them. After a couple of days, they become aware of the thread and create another post defending their students’ work and clear the air but it’s already too late. No one is going to get a notification that they have responded and everyone has already moved on with their lives,” a Redditor pointed out. 

Issue of readability

Even if a research paper makes its code accessible to all, that code may not be usable. Nature ran a study on the quality and execution success of research code from the Harvard Dataverse repository and found that only 25% of the R files ran without throwing an error. Even after code cleaning, only 40% of the R files ran smoothly.
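
The Nature study examined R scripts, but the failure class is language-agnostic. One of the most common culprits is a hard-coded absolute path that exists only on the original author’s machine; the hypothetical Python sketch below contrasts the brittle pattern with a portable alternative.

```python
import sys
from pathlib import Path

# Brittle pattern common in research code: an absolute path that exists only
# on the original author's machine, so the script crashes everywhere else.
#   data = open("/home/alice/projects/study/data/results.csv")

# Portable alternative (hypothetical layout): take the path as a command-line
# argument, falling back to a location relative to the script itself.
default = Path(__file__).parent / "data" / "results.csv"
data_path = Path(sys.argv[1]) if len(sys.argv) > 1 else default

if not data_path.exists():
    sys.exit(f"data file not found: {data_path} (pass the path as an argument)")

with data_path.open() as f:
    header = f.readline().strip()
    print(f"loaded {data_path}; columns: {header}")
```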

Elaborating on this, a researcher in an online community said, “Our primary goal isn’t to write a public library. The code isn’t really meant to be that, for the most part. We put research code online so our results can be verified during peer review. We let anyone use the code artifacts of our work as a bonus since we eventually want our ideas to spread.”

According to him, they would rather put time and effort into implementing new ideas and designing good experiments. “Readmes don’t get us a lot of career advancement, unfortunately.”

PS: The story was written using a keyboard.