How To Win A Kaggle Competition

Kaggle competitions are one of the best ways to build data science skills. The problems on Kaggle let data scientists and analysts explore curated datasets and solve specific problems. They suit developers looking to build models for classification, regression, image recognition, and voice recognition tasks. The platform offers datasets and communities that help competitors learn to work better with data.

While anyone can participate in a Kaggle competition, thanks to the platform’s low barrier to entry, winning one is a different matter.

What are Kaggle competitions?

Companies come to the Kaggle platform with datasets and a question. For instance, in drug discovery, a pharma company might bring data containing drug testing results and seek help in identifying the variables that determine drug failure.

During the competition, a company usually provides participants with datasets for which it already knows the outcomes. Participants are then expected to build algorithms that predict those outcomes for a test set. Finally, the accuracy of each participant’s machine learning model is determined from its results on the final test data.
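
As a rough illustration of how a submission could be scored against outcomes the organiser has held back, here is a minimal sketch in Python. The function name, the data, and the choice of plain accuracy as the metric are all assumptions made for the example; each competition defines its own metric.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the known outcomes."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot score an empty submission")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical outcomes held back by the organiser (1 = drug failed).
hidden_outcomes = [1, 0, 0, 1, 1, 0, 1, 0]
# Hypothetical predictions submitted by a participant.
submission = [1, 0, 1, 1, 0, 0, 1, 0]

score = accuracy(submission, hidden_outcomes)
print(f"accuracy: {score:.2f}")
```

Participants never see `hidden_outcomes` in this picture; they only learn the score their submission receives.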

Approaching the algorithm

Many winning Kaggle solutions are built on ensembles of decision trees, although many of the datasets are unstructured. Broadly, there are two approaches to developing models in such competitions: handcrafted feature engineering and neural networks. Each approach works best on the kind of data it is suited to.


Handcrafted feature engineering works well when the data is well structured and the competitor has an idea of the problems in it. It relies mainly on intuition and trial and error. The competitor also has to plot histograms and explore what the data contains. A lot of time is spent devising an algorithm that predicts the target variable well.
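
A first look at a feature does not need a plotting library. The sketch below bins a column and prints a text histogram using only the Python standard library. The column of ages and the bin width are hypothetical values chosen for illustration.

```python
from collections import Counter

BIN_WIDTH = 10


def text_histogram(values, bin_width):
    """Count values per bin; keys are the lower edge of each bin."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical feature column: patient ages in a drug trial.
ages = [23, 27, 31, 35, 36, 38, 41, 44, 45, 47, 52, 58, 61, 90]

histogram = text_histogram(ages, BIN_WIDTH)
for lower_edge, count in histogram.items():
    upper_edge = lower_edge + BIN_WIDTH - 1
    print(f"{lower_edge:>3}-{upper_edge:<3} {'#' * count}")
```

A lone value far from the rest, such as the 90 here, is the kind of detail this sort of exploration is meant to surface before any modelling starts.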

The second approach involves developing neural networks. Many winners of these competitions spend little time on the handcrafted approach and instead develop neural networks, as most of the data is unstructured. Neural networks are also often the better choice when a dataset combines unstructured and structured data.

Training and execution of algorithm

Participants in a Kaggle competition have two goals: developing a data science pipeline and achieving the best possible metric. Additionally, every competition on the platform requires different variables to be predicted and optimised. Thus, while the focus should be on these two goals, one should not become over-absorbed in any single approach.

Most people obsess over obtaining very high scores in the first round, hoping that this will carry over to a high score in the final round. Competitors should be familiar with the concept of overfitting: a model tuned so closely to one dataset that it performs worse on another. It is commonly observed that algorithms heavily optimised for the first round do not match that performance in the final round. Models should therefore stay open to change so that they can fit different datasets.
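
To make the idea of overfitting concrete, the following sketch uses a one-nearest-neighbour classifier, which simply memorises its training data. The synthetic data, the noise level, and all function names are assumptions for illustration; this is not how any particular competition is scored.

```python
import random


def nearest_neighbour_predict(train_x, train_y, x):
    """Return the label of the training point closest to x (1-NN)."""
    best = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[best]


def nn_accuracy(xs, ys, train_x, train_y):
    """Share of points in (xs, ys) that the memorised model labels correctly."""
    hits = sum(nearest_neighbour_predict(train_x, train_y, x) == y
               for x, y in zip(xs, ys))
    return hits / len(ys)


def make_data(n, rng, noise=0.3):
    """Label is 1 when x > 0.5, but a share of labels is flipped at random."""
    xs = [rng.random() for _ in range(n)]
    ys = [int(x > 0.5) ^ int(rng.random() < noise) for x in xs]
    return xs, ys


rng = random.Random(0)  # fixed seed so the run is repeatable
train_x, train_y = make_data(200, rng)
valid_x, valid_y = make_data(200, rng)

train_score = nn_accuracy(train_x, train_y, train_x, train_y)
valid_score = nn_accuracy(valid_x, valid_y, train_x, train_y)
print(f"train accuracy:      {train_score:.2f}")
print(f"validation accuracy: {valid_score:.2f}")
```

Because the model has memorised the noisy training labels, it scores perfectly on the data it has seen and noticeably lower on fresh data. A score obtained on one dataset can be misleading about the next in much the same way.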

What to do after building the algorithm?

  • Remove noise and enrich the dataset: In big datasets, such as drug testing datasets, junk and noise have to be removed to increase the algorithm’s efficiency. Pruning the dataset can give better results and move a participant up the ranks. Participants should therefore strip out everything except the informative data.
  • Cross-validation: Participants should always cross-validate to estimate how well their algorithms perform on unseen data.
  • Ensembling: Building multiple models and combining them into a single predictor is an intelligent way to reduce the risk of overfitting. Although it may not always work, ensembling acts as a fail-safe when a constituent model performs poorly. The constituents have to be chosen carefully depending on the task they are expected to perform. Combining them requires an additional level of validation.
  • Using the leaderboard: Kaggle has a public leaderboard. As the competition progresses, it provides crucial information that can help participants improve their models. After analysis, critical variables can be either maximised or minimised to optimise the metric.
  • Analysis of past models: Competitors should read past models, forums, and blog posts for guidance on navigating the competition. Such knowledge is especially useful when the data involved is of the same type or the evaluation criteria are the same.
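
The cross-validation and ensembling steps above can be sketched with nothing more than the Python standard library. Everything here is an assumption for illustration: the toy data, the two constituent models (a mean baseline and a least-squares line), the averaging scheme, and the choice of mean absolute error as the metric. Real data would usually be shuffled before being split into folds.

```python
from statistics import mean


def k_fold_indices(n_samples, k):
    """Yield (train_idx, valid_idx) pairs; each sample is validated once."""
    for fold in range(k):
        valid_idx = list(range(fold, n_samples, k))
        held_out = set(valid_idx)
        train_idx = [i for i in range(n_samples) if i not in held_out]
        yield train_idx, valid_idx


def fit_mean(xs, ys):
    """Baseline model: always predict the training mean."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Least-squares straight line through the training points."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + slope * (x - mx)


def fit_ensemble(xs, ys):
    """Average the predictions of the constituent models."""
    models = [fit_mean(xs, ys), fit_line(xs, ys)]
    return lambda x: mean(model(x) for model in models)


def cross_val_mae(fit, xs, ys, k=5):
    """Mean absolute error averaged over k held-out folds."""
    fold_errors = []
    for train_idx, valid_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        fold_errors.append(mean(abs(model(xs[i]) - ys[i]) for i in valid_idx))
    return mean(fold_errors)


# Hypothetical toy data: y is roughly 2x plus a small wobble.
xs = list(range(20))
ys = [2 * x + (1 if x % 2 else -1) for x in xs]

for name, fit in [("mean", fit_mean), ("line", fit_line), ("ensemble", fit_ensemble)]:
    print(f"{name:<8} CV MAE: {cross_val_mae(fit, xs, ys):.2f}")
```

On this toy data the averaged model should not be expected to beat the line on its own, since one of its two constituents is a weak baseline. That echoes the point that constituents have to be chosen carefully and that the combination needs its own validation.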

Winning a Kaggle competition is no easy feat, but with careful analysis of previous competitions and a broad approach to model building, it is not an insurmountable problem.

Meenal Sharma
I am a journalism undergrad who loves playing basketball and writing about finance and technology. I believe in the power of words.
