Kaggle competitions are among the best ways to train and equip oneself with data science skills. The problems on Kaggle let data scientists and analysts explore curated datasets and solve well-defined problems. The platform suits developers looking to build models for classification, regression, image recognition, and voice recognition tasks, and its datasets and communities help competitors learn how to work better with data.
While anyone can participate in a Kaggle competition, thanks to the platform's low barrier to entry, winning one is a different matter entirely.
What are Kaggle competitions?
Companies come to the Kaggle platform with datasets and a question. For instance, in drug discovery, a Pharma company might come to the Kaggle platform with data containing drug testing results and seek help to figure out the variables that would determine drug failure.
During the competition, a company usually provides participants with datasets whose outcomes it already knows. Participants are then expected to build algorithms to predict the test results. Finally, the accuracy of the machine learning models developed by the participants is determined against those known results.
Approaching the algorithm:
Kaggle competitions involve a long series of modelling decisions, and the datasets are mostly unstructured. There are broadly two approaches to developing models in such competitions: handcrafted feature engineering or neural networks. Each approach works best on the kind of data it is suited to.
Handcrafted Feature Engineering works well when data is well structured and the competitor has an idea of the problems in data. It is mainly based on intuition and the trial-and-error approach. The competitor will also have to plot histograms and explore what is in the data. A lot of time is spent devising an algorithm to optimise the target variable.
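A rough sketch of this exploratory, trial-and-error style in plain Python; the records and column names (`dose_mg`, `weight_kg`, `failed`, `dose_per_kg`) are hypothetical, not drawn from any real competition dataset:

```python
from collections import Counter

# Hypothetical toy records standing in for a drug-testing dataset.
rows = [
    {"dose_mg": 10, "weight_kg": 60, "failed": 0},
    {"dose_mg": 50, "weight_kg": 55, "failed": 1},
    {"dose_mg": 20, "weight_kg": 80, "failed": 0},
    {"dose_mg": 45, "weight_kg": 50, "failed": 1},
]

# Crude histogram: count outcomes per dose bucket to eyeball the data,
# the kind of exploration done before committing to a feature.
buckets = Counter((row["dose_mg"] // 25, row["failed"]) for row in rows)

# Handcrafted feature guessed from intuition: dose relative to body weight.
for row in rows:
    row["dose_per_kg"] = row["dose_mg"] / row["weight_kg"]
```

The new `dose_per_kg` column is exactly the kind of intuition-driven feature the handcrafted approach relies on: it would then be tested against the target variable and kept or discarded by trial and error.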
The second approach involves developing neural networks. Almost everybody who wins these competitions spends little time on the handcrafted approach and instead develops neural networks, as most of the data is unstructured. When datasets combine unstructured and structured data, neural networks are also usually the best approach.
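To make the idea concrete, here is a minimal training loop for a single logistic neuron, the smallest possible "neural network", written in plain Python. Real competition entries would use a deep-learning framework and far larger architectures; this is only a sketch of the mechanics, on synthetic data invented for the example:

```python
import math
import random

random.seed(0)

# Synthetic data: label is 1 when the two inputs sum above 1.0.
points = [[random.random(), random.random()] for _ in range(50)]
data = [(x, 1 if x[0] + x[1] > 1.0 else 0) for x in points]

# A single logistic neuron: two weights and a bias.
w = [0.0, 0.0]
b = 0.0
lr = 0.5

def predict(x):
    z = w[0] * x[0] + w[1] * x[1] + b
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid activation

# Stochastic gradient descent on the log-loss.
for _ in range(200):
    for x, y in data:
        err = predict(x) - y           # gradient of log-loss w.r.t. z
        w[0] -= lr * err * x[0]
        w[1] -= lr * err * x[1]
        b -= lr * err

accuracy = sum((predict(x) > 0.5) == (y == 1) for x, y in data) / len(data)
```

The loop learns a decision boundary close to the true rule; stacking many such units into layers, with a framework handling the gradients, is what winning entries scale this up to.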
Training and execution of algorithm
Participants in a Kaggle competition have two goals: developing a data science pipeline and achieving the best possible metric. Every competition on the platform requires different variables to be predicted and optimised, so while the focus should be on these two goals, one should not become over-absorbed in any single approach.
Most people obsess over obtaining very high scores in the first round, hoping that will carry them to a high score in the final round. Competitors should be familiar with the concept of overfitting: a model that fits its training data so closely that it fails to generalise to unseen data. It is a common observation that algorithms tuned to perform well in the first round typically fail to repeat that performance in the final round. Models should therefore remain open to change so they can fit different datasets.
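The danger can be made concrete with a deliberately extreme "model" that simply memorises its training examples. This is a toy construction, not a real competition model, but it shows why a perfect first-round score says nothing about the final round:

```python
# Toy labelled data where the true pattern is the parity of x.
data = [(x, x % 2) for x in range(100)]
train, valid = data[:80], data[80:]

# Extreme overfit: memorise every training example outright.
memorised = {x: y for x, y in train}

def predict(x):
    # Falls back to a constant guess of 0 for inputs never seen in training.
    return memorised.get(x, 0)

train_acc = sum(predict(x) == y for x, y in train) / len(train)
valid_acc = sum(predict(x) == y for x, y in valid) / len(valid)
# Perfect on the data it has seen, no better than chance on the holdout.
```

A model that had learned the underlying pattern rather than the individual rows would score well on both splits, which is exactly what holding out a validation set is meant to check.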
What to do after building the algorithm?
- Remove noise and enrich the dataset: In big datasets, such as drug-testing data, junk and noise have to be removed to increase the algorithm’s efficiency. Pruning the dataset can give better results and shoot one straight up the ranks. Participants should therefore strip out everything except informative data.
- Cross-validation: Participants should always cross-validate to estimate the efficiency of algorithms on unknown datasets.
- Ensembling: Building multiple models and combining their predictions is an intelligent way to reduce the risk of overfitting. Although it may not always work, ensembling acts as a fail-safe when a constituent model underperforms and helps mitigate the risk. The constituents have to be chosen carefully for the task they are expected to perform, and combining them requires an additional level of validation.
- Using the leaderboard: Kaggle has an open leaderboard, so as the competition progresses it provides crucial information that can help participants improve their models. After analysing that feedback, critical variables can be maximised or minimised to optimise the metric.
- Analysis of past models: Competitors should read past winning models, forum threads, and blog posts for guidance on navigating the competition. Such knowledge is especially constructive when the data involved is of the same type or has the same evaluation criteria.
Winning a Kaggle competition is no easy feat, but with careful analysis of previous competitions and a broad approach to model building, it is not an insurmountable problem.