arXiv Makes All Its Research Papers Available On Kaggle To Boost Machine Learning Developments

arXiv, the largest repository of research papers, recently announced that it is offering a free and open pipeline of its dataset of more than 1.7 million articles on Kaggle. The move aims to boost developments in areas such as machine learning. The dataset will include relevant features such as article titles, authors, categories, abstracts, full-text PDFs and more.
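For readers who want a feel for what working with such records might look like, here is a minimal Python sketch. It assumes the metadata is distributed as one JSON object per line and that each record carries `title`, `authors`, `categories` and `abstract` fields, with categories separated by spaces. Those field names, the layout and the sample records are illustrative assumptions, not details confirmed by the announcement.

```python
import io
import json
from collections import Counter

# Hypothetical sample records standing in for the real file.
# Field names and the space-separated category format are assumptions.
SAMPLE = io.StringIO(
    '{"id": "a1", "title": "Paper A", "authors": "X. Author", '
    '"categories": "cs.LG stat.ML", "abstract": "First abstract."}\n'
    '{"id": "a2", "title": "Paper B", "authors": "Y. Author", '
    '"categories": "cs.LG", "abstract": "Second abstract."}\n'
    '{"id": "a3", "title": "Paper C", "authors": "Z. Author", '
    '"categories": "physics.optics", "abstract": "Third abstract."}\n'
)


def iter_records(lines):
    """Yield one dict per non-empty line, without loading everything at once."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_categories(records):
    """Count how many records list each category."""
    counts = Counter()
    for record in records:
        counts.update(record.get("categories", "").split())
    return counts


category_counts = count_categories(iter_records(SAMPLE))
for category, n in category_counts.most_common():
    print(category, n)
```

Reading line by line keeps memory use flat, which matters for a collection of more than a million records. To try the same functions on a real download, pass an open file handle in place of `SAMPLE`.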

arXiv has served the public and research communities for nearly 30 years, covering subjects that include physics, computer science, mathematics, statistics, quantitative biology and economics.


“Having the entire arXiv corpus on Kaggle grows the potential of arXiv articles immensely. By offering the dataset on Kaggle we go beyond what humans can learn by reading all these articles and we make the data and information behind arXiv available to the public in a machine-readable format,” said Eleonora Presani, Executive Director, arXiv.

Kaggle has been a favourite destination for data scientists and machine learning engineers for quite some time now. Researchers can utilise Kaggle’s extensive data exploration tools and easily share their relevant scripts and output with others. With arXiv’s repository of articles, Kaggle users can push the limits of innovation.

The large dataset will offer researchers new connections, innovative tools and perspectives that enable better discovery and innovation, believes Steinn Sigurdsson, Scientific Director, arXiv.

Free resources are especially valuable during the current pandemic, when researchers around the world are working on treatments for COVID-19. For instance, Google’s COVID-19 Research Explorer is a tool that helps researchers pore through the CORD-19 dataset, a collection of 190,000+ scientific articles on COVID-19.

arXiv hopes that the release of the machine-readable dataset will inspire the creation of similar tools in the future. 

The dataset is available on Kaggle and will be updated weekly.

Srishti Deoras
Srishti currently works as Associate Editor at Analytics India Magazine. When not covering analytics news or editing and writing articles, she can be found reading or capturing thoughts in pictures.
