In the era of data science and machine learning, hackathon platforms like Kaggle, MachineHack, etc., have emerged as testbeds for many ML and data science professionals, alongside helping companies to hire the best talent using the hackathon model.
According to Kaggle’s 2020 edition of the State of Machine Learning and Data Science report — which includes insights gathered from a survey of 20,036 Kaggle members — more than 55 per cent of data scientists have less than three years of experience, and six per cent of professionals pursuing data science have been using machine learning for more than a decade.
The study further revealed that machine learning has become more rooted in the companies where Kaggle data scientists work. Nearly 31 per cent of data scientists said their companies had well-established machine learning methods, up from 28 per cent in 2019 and 25 per cent in 2018.
Kaggle vs real life
Kaggle competitions are great for practising data science skills, but how different are they from real-world data science and machine learning work? This article looks at the differences between the two, especially when it comes to solving machine learning problems on Kaggle versus in real life.
While some dispute Kaggle's real-world relevance and question its effectiveness, the problem-solving aspect is common to both real life and hackathons.
In Kaggle, the problem is well defined, and you are given clear instructions on how to solve it and how your work will be evaluated.
A typical problem-solving cycle (Source: Humor That Works)
However, in real life, the problem is often not clearly defined, and you will have to derive inputs from the data that can lead to concrete KPIs in the business environment. You will also have to sit through many meetings to get a better understanding of your problem statement.
According to Kaggle, the most commonly used algorithms were linear and logistic regression, followed closely by decision trees and random forests. For more complex techniques, gradient boosting machines and convolutional neural networks (CNN) were the most popular approaches.
But, in real life, there are no shortcuts. Aakash Nand, Software Engineer (Data Science) at NTT Communications, said many Kagglers use a few ‘sneaky’ methods to boost the performance of their model, which in the real world should be avoided.
“For instance, some perform transformation or imputation on both train and test set combined instead of splitting them and preprocessing them separately to avoid data leakage. This increases performance but might make your model less generalisable to new, unseen data,” said Nand.
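The leakage Nand describes can be illustrated with a minimal sketch. It assumes a single numeric feature with missing values and uses mean imputation as the preprocessing step; the toy values and the helper names `fit_mean` and `impute` are hypothetical and chosen only for illustration.

```python
from statistics import mean

def fit_mean(values):
    """Return the mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    return mean(observed)

def impute(values, fill):
    """Replace every missing (None) entry with the given fill value."""
    return [fill if v is None else v for v in values]

# Hypothetical toy feature column; None marks a missing value.
train = [1.0, 2.0, None, 3.0]
test = [10.0, None, 12.0]

# Leaky: the fill value is computed on train and test combined,
# so information from the test rows seeps into preprocessing.
leaky_fill = fit_mean(train + test)

# Leak-free: the fill value is learned from the training rows only
# and then reused, unchanged, on the test rows.
clean_fill = fit_mean(train)
train_imputed = impute(train, clean_fill)
test_imputed = impute(test, clean_fill)

print("leaky fill:", leaky_fill)
print("clean fill:", clean_fill)
```

In this toy case the two fill values differ noticeably, because the combined statistic has already seen the test distribution. The same principle applies to scaling, encoding and any other fitted transformation: fit on the training data, then apply to the test data.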
On Kaggle, almost every dataset can be seen as a machine learning problem. The platform is well known for hosting machine learning competitions, and taking part makes you an expert in improving your score by 0.0001, fine-tuning parameters, and making an algorithm work.
In the real world, not every company uses machine learning, and not every data scientist deals with machine learning in their daily work, so exposure to it can be limited.
In Kaggle, you can access the datasets with minimal effort. You are also given a platform where you can discuss the data with domain experts to understand the features. The datasets provided are usually ready for analysis and require minimal cleaning effort or skill.
For instance, a Kaggle alternative, MachineHack, offers formats such as Mocks, Practice and Bootcamps, making it easier for participants to experiment with an array of datasets and ace data science hackathons.
But, in reality, you will almost always be asked to figure out how to get the data. If you are a mid-senior or senior data scientist, you will often be asked to choose the data and KPIs needed for the analysis according to business needs and requirements.
In Kaggle, the only metric that really matters is your leaderboard score, which usually measures predictive accuracy. In the real world, by contrast, data science and machine learning work requires a careful trade-off between cost, model return on investment, model latency, and model scalability.
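One way to picture this trade-off is a sketch that picks the most accurate model that still fits a latency budget, rather than the most accurate model overall. The `Candidate` class, the `pick_model` helper, the model names and all the numbers below are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    accuracy: float    # offline validation accuracy (made-up figure)
    latency_ms: float  # average prediction latency (made-up figure)

def pick_model(candidates, max_latency_ms):
    """Return the most accurate candidate within the latency budget,
    or None if no candidate is fast enough."""
    eligible = [c for c in candidates if c.latency_ms <= max_latency_ms]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c.accuracy)

# Hypothetical candidates, not measurements of real models.
candidates = [
    Candidate("large_ensemble", 0.95, 400.0),
    Candidate("gradient_boosting", 0.93, 40.0),
    Candidate("logistic_regression", 0.90, 5.0),
]

chosen = pick_model(candidates, max_latency_ms=50.0)
print(chosen.name if chosen else "no model meets the budget")
```

On a leaderboard the large ensemble would win, but under a 50 ms budget the slightly less accurate model is the one that can be deployed. Cost and scalability constraints can be added as further filters in the same way.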
From a timeline perspective, hackathons and real-life data science can be considered a sprint and a marathon, respectively. In real life, mechanisms have to be in place to evaluate models periodically, detect drops in performance, and retrain them when needed.
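Such a mechanism can be as simple as comparing a model's accuracy on recent labelled data against the accuracy measured at deployment. The following is a minimal sketch under that assumption; the helper names, the 0.05 tolerance and the toy labels are hypothetical, and a production setup would typically track several metrics over time.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def needs_retraining(baseline_acc, recent_acc, tolerance=0.05):
    """Flag the model when accuracy has dropped by more than the
    tolerance relative to the baseline measured at deployment."""
    return (baseline_acc - recent_acc) > tolerance

# Hypothetical labels and predictions from a recent batch of traffic.
recent_true = [1, 0, 1, 1]
recent_pred = [1, 0, 0, 1]

baseline_acc = 0.90  # assumed accuracy at deployment time
recent_acc = accuracy(recent_true, recent_pred)

if needs_retraining(baseline_acc, recent_acc):
    print("Performance drop detected: schedule retraining")
else:
    print("Model is still within tolerance")
```

Running a check like this on a schedule turns the one-off evaluation of a hackathon into the ongoing evaluation that a deployed model needs.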
In Kaggle, evaluation is pretty straightforward. Every competition tells you how the leaderboard is scored.
But, in reality, there is no leaderboard against which to assess your work. Instead, you will have to discuss the problem with others and make it more concrete to understand the bigger picture. At times, the product manager or development team can help you evaluate the impact of your analysis. Your work may also go unnoticed at times, but it still adds to your understanding and will be helpful for your future work.
Hackathon platforms such as Kaggle and MachineHack are excellent places to practise and hone your machine learning skills in a conducive environment, improve your knowledge-sharing skills, and learn to create good analytical reports. These platforms also provide real-world datasets, which boost your confidence in solving real-life problems rather than learning only on the job.
More than anything, these hackathon platforms showcase the work done by experienced professionals, which is a reminder that there is always something new to learn in the field. Even though the problems on Kaggle and in real life are different, practising on Kaggle or MachineHack can help you hone your skills and become a better ML and data science professional.
Experts believe that professionals should continually update their knowledge and differentiate themselves by actively participating in international hackathons, as it helps them benchmark their standing in the field and grow.