AI Research Code Is Open, but Accessibility Is an Issue

AI can’t advance without shared datasets and code, yet no matter how many conferences back Open Science, researchers still aren’t making their code public

For the past few years, the scientific community worldwide has been advocating for accessibility in science. ‘Open Science’, as the movement is called, is an ongoing effort to make research papers accessible to all. 

(credit: UNESCO)

Open information is vital for research, even in space tech. Not many know that three years ago, when scientists created the first-ever image of a black hole, it was made possible in part by open-source software, the Python plotting library Matplotlib. 

Research papers that claim to have their dataset or code open are often found to be making false claims. “Most of the time they don’t have a publicly available link and expect you to mail them, in which case, too, they reply maybe once for every ten papers,” points out a Redditor. 

AI research does little for the community if the code isn’t shared; a paper isn’t complete without open-source code accessible to all. According to the Radiological Society of North America, “AI research can be considered essentially useless for other researchers if it is not readily usable, reproducible, replicable, or transparent.”

However, it’s not always the fault of the author or researcher. Sometimes they write a thesis, but the dataset or code was designed by someone else, and the institute or publisher that legally owns it requires an email request before granting access. A Reddit user recently shared a similar incident: “My thesis is based on a dataset that my supervisor designed and collected. She wrote a paper on it. But technically speaking, her employer (the university) owns it. So whenever someone needs access, the university’s ethics board needs to approve it.”

We also can’t ignore that researchers sometimes publish papers promising open datasets or code but fail to get permission from every stakeholder. Even with good intentions, a single stakeholder can veto the code being published. Conferences, in turn, can’t withdraw those papers, as they have already cleared every acceptance phase.

Steps taken to improve the community 

There is, in fact, a checklist in place for researchers: the Machine Learning Code Completeness Checklist. Introduced for NeurIPS (Neural Information Processing Systems) submissions, the checklist recommends that researchers follow a systematic path that actually helps the AI community, rather than just publishing a paper. 

The checklist asks authors to include a specification of dependencies, training and evaluation code, pre-trained models, and a README file. An analysis of GitHub repositories found that the more boxes a repository ticks, the higher its star rating tends to be. 
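As an illustration, auditing a repository against the five checklist items can be sketched in a few lines of Python. The specific file names probed here (`requirements.txt`, `train.py`, and so on) are assumptions about common repository conventions, not part of the checklist itself.

```python
from pathlib import Path

# The five items from the ML Code Completeness Checklist. The candidate
# file names for each item are illustrative conventions, not a standard.
CHECKLIST = {
    "dependencies": ["requirements.txt", "environment.yml", "setup.py"],
    "training code": ["train.py"],
    "evaluation code": ["evaluate.py", "eval.py"],
    "pre-trained models": ["checkpoints", "weights"],
    "README": ["README.md", "README.rst"],
}


def audit_repo(repo: str) -> dict:
    """Return, for each checklist item, whether a matching file exists."""
    root = Path(repo)
    return {
        item: any((root / name).exists() for name in candidates)
        for item, candidates in CHECKLIST.items()
    }


def score(repo: str) -> int:
    """Number of checklist boxes the repository ticks (0 to 5)."""
    return sum(audit_repo(repo).values())
```

A repository containing only a README and a `requirements.txt` would score 2 out of 5 under this sketch.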

(Credit: Robert Stojnic)

AI thrives on open-science principles; without open code and datasets, it simply won’t move as fast. Data sharing in the AI community is still not as big an issue as it is in bioengineering, where some say code sharing is considered almost radical. According to a study published on ScienceDirect, over 90% of researchers don’t honour their own data-sharing statements. The study examined 1,792 papers that claimed their data were available, but when the authors were emailed, only 7% actually shared. “As many as 1669 (93%) authors either did not respond or declined to share their data with us,” says the report.

Another reason often pointed out is that scientists are heavily disincentivised and lack the time or energy to answer critics. If someone does leave data open in the bioengineering field, people may misunderstand it and start bombarding them with questions. Overworked researchers find sharing data with a select few easier and less controversial. 

Controversy does affect research papers. “Other scientists assume there’s no smoke without fire and start trolling them. After a couple of days, they become aware of the thread and create another post defending their students’ work and clear the air but it’s already too late. No one is going to get a notification that they have responded and everyone has already moved on with their lives,” a Redditor pointed out. 

Issue of readability

Even when a research paper makes its code accessible to all, the code may not be usable. Nature studied the quality and success rate of research code from the Harvard Dataverse repository and found that only 25% of the R files ran without error. Even after code cleaning, only 40% of the R files ran smoothly. 
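The kind of batch check behind such a survey can be sketched in Python: run every script in a directory and count clean exits. The `Rscript` command name and the timeout are assumptions for illustration; this is not Nature’s actual methodology.

```python
import subprocess
from pathlib import Path


def run_rate(script_dir: str, cmd: str = "Rscript", timeout: int = 60) -> float:
    """Fraction of .R scripts in `script_dir` that exit without error.

    A rough sketch of a reproducibility batch check; `cmd` and the
    timeout are illustrative assumptions.
    """
    scripts = sorted(Path(script_dir).glob("*.R"))
    if not scripts:
        return 0.0
    ok = 0
    for script in scripts:
        try:
            result = subprocess.run(
                [cmd, str(script)], capture_output=True, timeout=timeout
            )
            ok += result.returncode == 0
        except (subprocess.TimeoutExpired, FileNotFoundError):
            pass  # a hung script or missing interpreter counts as a failure
    return ok / len(scripts)
```

Under Nature’s reported numbers, such a check over the Harvard Dataverse files would return roughly 0.25 before cleaning and 0.40 after.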

(credit: Nature)

Elaborating on this, a researcher, in an online community, said, “Our primary goal isn’t to write a public library. The code isn’t really meant to be that, for the most part. We put research code online so our results can be verified during peer review. We let anyone use the code artifacts of our work as a bonus since we eventually want our ideas to spread.”

According to him, they would rather put time and effort into implementing new ideas and designing good experiments. “Readmes don’t get us a lot of career advancement, unfortunately.”

Lokesh Choudhary