For this month’s machine learning practitioners series, Analytics India Magazine got in touch with Mathurin Aché, a Kaggle master ranked 19th on the global Kaggle competitions leaderboard.
In his career spanning more than a decade and a half, Mathurin has seen it all: from the popularity of support vector machines to the explosion of deep learning algorithms; from the evolution of statistics as a tool to data science as a profession. In this interview, he talks about ways to master Kaggle competitions and leverage those skills to become a better data scientist.
Mathurin is currently the VP at Prevision.io and oversees product management as a machine learning expert. When asked about his early days with machine learning, he spoke of his passion for automation and how statistics during his college days played a crucial role in nurturing his machine learning acumen.
He graduated with a master’s degree in business decision techniques (data mining and marketing) from the University of Paris in 2003. He then went on to hold various positions at Orange, one of the largest telecom providers in the world. There, he began working as a data analyst and eventually rose to the role of product manager, where he developed AutoML-based solutions even before they became a fad.
However, he thanks Kaggle for his renewed interest in the world of data science, as it helped him get familiar with concepts like computer vision, Natural Language Processing (NLP) and, more recently, reinforcement learning.
“I joined the company Prevision.io last year. My visibility and my ranking in Kaggle competitions played a major role.”
He has been active on Kaggle for the past three years, participating in over 200 competitions and achieving top 1% in four of those.
Road To Mastery At Kaggle
Mathurin is currently ranked in the top 20 on Kaggle, a community of more than 130,000 members. He firmly believes that selflessness, time, luck, teamwork, IT resources, a lot of reading and coding have made him what he is today.
The important motto, asserts Mathurin, is to try more and fail fast. He insists on reusing previous code and learning how to optimize the competition metric.
Here is what Mathurin’s typical approach to any Kaggle contest looks like:
- Create an environment, download the data
- Download and submit the top public kernel to learn the basic ideas and find ways to improve
- Generate cross-validation files to collate the results later
- Read the forum
- Test different AutoML software
- Construct new explanatory variables
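The cross-validation step above is what makes results from different models comparable later. A minimal sketch of how such out-of-fold (OOF) files can be generated is below; the dataset and model are illustrative stand-ins, not from the interview.

```python
# Hypothetical sketch: generate out-of-fold (OOF) cross-validation
# predictions so that results from several models can be collated
# and blended later. Dataset and model choices are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

oof = np.zeros(len(y))  # one OOF prediction per training row
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

for train_idx, valid_idx in cv.split(X, y):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    # Predict only on the held-out fold, so every row is scored
    # by a model that never saw it during training.
    oof[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]

np.save("oof_logreg.npy", oof)  # collate with other models' OOF files later
```

Saving one such file per model is what allows their local scores to be compared, and blended, on exactly the same folds.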
Although every problem has its challenges, Mathurin recommends having a sound cross-validation scheme one can be confident in, and advocates trusting local results over results on the public leaderboard.
As the environment used varies with the competition, he recommends the following rules of thumb:
- Kaggle kernels when there are resource constraints
- Kaggle kernels + local, if there are no constraints
- Kaggle kernels + cloud + local for more extensive data volumes
- For constrained optimization problems, benchmark solvers such as Gurobi or CPLEX
Furthermore, for task-specific problems such as computer vision, Mathurin advises keeping up with state-of-the-art model architectures, while also ensuring access to a GPU.
For NLP, he lists three generations of approaches that one needs to build on: TF-IDF (term frequency-inverse document frequency), followed by word embeddings, and later transformers (e.g. BERT), noting that the latter requires a lot of GPU resources.
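The first-generation approach he mentions can be sketched in a few lines: TF-IDF features fed into a linear classifier. The tiny corpus below is purely illustrative.

```python
# Minimal sketch of a first-generation NLP baseline: TF-IDF features
# plus a linear classifier. The corpus is a toy example for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "great movie, loved it",
    "terrible plot and acting",
    "loved the cast",
    "acting was terrible",
]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# TF-IDF turns each document into a sparse vector, weighting terms by
# how distinctive they are across the corpus.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)
pred = clf.predict(["loved the movie"])
```

Word embeddings and transformers replace the sparse TF-IDF vectors with dense, learned representations, which is where the GPU demands he mentions come in.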
Like many across the globe, Mathurin prefers Python to R, though he had been using R until 2015. His renewed interest in algorithms made him switch to Python gradually.
A look at Mathurin’s toolkit, which he keeps coming back to:
- Packages: scikit-learn, pandas, NumPy
- Frameworks: Keras, TensorFlow, PyTorch and fastai
- Algorithms: LightGBM, XGBoost, CatBoost
- AutoML tools: Prevision.io, H2O and open-source options such as TPOT and auto-sklearn
- Cloud services: Google Colab and Kaggle kernels
“The untimely disconnections on Google Colab are annoying.”
On leveraging modern-day computational resources, Mathurin points to Kaggle's own: the platform offers substantial TPU resources in addition to GPUs (30 hours per week, with up to three tasks in parallel, for both).
Talking about the hardware he uses for contests, he reveals that he owns two personal computers. One runs Linux and is equipped with GPUs for demanding contests such as image classification; the other is a Windows machine used for refining public kernels by creating new variables or optimizing hyperparameters.
When it comes to the financial aspect of competing, Mathurin considers Kaggle to be an inexpensive affair. For cloud resources, he uses free coupons, rounding off the total cost to a few tens of dollars per contest.
“Team up with someone slightly stronger than yourself to build skills, and learn tips and tricks.”
Though Mathurin has competed solo in the majority of the contests, all his top medals were won as a team. Talking about the importance of teaming up for Kaggle, he advises one to choose a highly skilled partner to keep the learning curve steep.
A Word For The Beginners
Today, data science is one of the hottest jobs on the planet, and Kaggle is slowly becoming the test bed for measuring the mettle of candidates. Given the variety of skills that one gets to test on Kaggle, it is necessary to stay focused on the problem at hand and not be swayed by vanity metrics such as leaderboard position.
“Trust local results rather than results on the public leaderboard.”
As an industry insider, Mathurin warns newcomers not to mistake a Kaggle competition for the end goal. Building a machine learning pipeline at an organization needs skills that extend beyond what Kaggle demands.
“The data science work begins well before this phase is covered in Kaggle and ends well after.”
Kaggle contests mostly focus on the performance aspect of models. To develop an ML product, by contrast, challenges like access to data, preprocessing, refining models in accordance with customer needs, periodic monitoring to improve models, and a whole bunch of others surface.
However, he has great admiration for the Kaggle community, for the way it facilitates the upskilling of an amateur through forums, kernels, and the sharing of winning solutions.
For job seekers, he stresses the importance of being consistently curious. This trait is more valuable than one’s Kaggle achievements because at the end of the day, companies are looking for those who are readily deployable with the least amount of training.
“AI is not the solution for all tasks. Sometimes common sense is a better approach.”
Although the current frameworks and libraries have made launching deep learning algorithms relatively straightforward, it is of utmost importance that one has a grip on the fundamentals. For starters, Mathurin recommends the deep learning course by Andrew Ng and Jeremy Howard’s fast.ai course, which have helped him tremendously.
Looking Beyond The Hype
Speaking of how the whole ecosystem around AI has evolved, Mathurin recalls the time when the yesteryear’s data miners would spend long hours exploring each variable retained in a model and produce as many as 20 models per year.
“20 years ago, they were called statisticians, later data miners and now they have reincarnated as data scientists.”
Mathurin, unlike most data scientists of the day, is a veteran of his class. He worked for Orange, a telecom giant, where he held various roles that involved data analytics. So, he is no stranger to the game.
At his current company, Prevision.io, Mathurin explains that he and his team are developing tools for data scientists that are easy to use, fast and robust.
As he doubles down on the significance of AutoML, he elaborates on two key aspects of this approach:
- It should be flexible and modular enough that everyone can find what they need, whether that is hyperparameter optimization, drift detection or algorithm benchmarking.
- Secondly, since some domain experts consider the current level of AutoML not good enough, he insists on building AutoML products that focus on improving feature engineering and reducing the time spent searching for hyperparameters, a crucial aspect of meta-learning.
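The hyperparameter-search aspect he describes can be illustrated with a simple randomized search; real AutoML tools such as TPOT, auto-sklearn or Prevision.io explore far larger spaces, including feature engineering steps, so this sketch only shows the core loop, with an illustrative model and search space.

```python
# Hedged sketch of the core AutoML loop: automated hyperparameter
# search via randomized sampling, scored by cross-validation.
# Model, parameter ranges and dataset are illustrative.
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=400, n_features=15, random_state=1)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=1),
    param_distributions={
        "n_estimators": randint(50, 200),
        "max_depth": randint(2, 10),
    },
    n_iter=10,       # try 10 random configurations
    cv=3,            # score each configuration by 3-fold cross-validation
    random_state=1,
)
search.fit(X, y)
best = search.best_params_  # the best configuration found
```

Meta-learning, as Mathurin frames it, is about shrinking this search: using experience from past datasets to start from promising configurations rather than sampling blindly.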
“Back then SVM was the top algorithm, then came the random forests, then the gradient boosting with xgboost, lightgbm and catboost. Now we are talking about machine learning, AutoML.”
At the pace at which new tools get released, there is no doubt that the abilities of machine learning are sometimes blown out of proportion. The attention it gets can cloud the ground truth and nudge people into believing that it is the holy grail of solutions.
With significant developments happening almost every day, it might be overwhelming for an outsider, who may be tempted to jump on the AI bandwagon. Though Mathurin firmly believes that tools such as AutoML will play a crucial role across domains, he warns us of the dangers of the "AI is the cure for all" notion.