For this week’s ML practitioner’s series, Analytics India Magazine got in touch with Arthur Llau. Arthur is a Kaggle master who is currently ranked in the top 100 on the global leaderboard, which hosts more than 130,000 participants. He is a mathematician at heart who happened to run into machine learning. We bring to our readers Arthur Llau’s fascinating journey into the world of data science.
A lifelong Parisian, Arthur Llau has a dual master’s degree in Theoretical Mathematics (Probability) and in Statistics & Machine Learning from the Sorbonne University campus of Université Pierre & Marie Curie. As part of his thesis, he worked on neural style transfer, a relatively new field.
At college, Arthur worked as a barman while he scribbled math problems on the side. Though a mathematician at heart, Arthur’s tryst with machine learning only began when one of his friends introduced him to computer vision and the statistical aspect of machine learning. Mixing drinks and tussling with mathematics and machine learning is how he spent most of his student days.
Today, he competes with machine learning experts from across the globe on the grandest stage—Kaggle.
Currently, Arthur is a Senior Data Scientist at Flowlity, an innovative startup that deals with the optimization and synchronization of supply chain management. In this role, Arthur works on demand and sales forecasting, inventory level optimization, safety-stock recommendation, and graphs for supply chain synchronization.
He also teaches Data Science applications to industrial problems at the Sorbonne Universités.
As a mathematician, it was a hard job at first, then a passion.
Even with a background in mathematics and statistics, Arthur still found the transition to machine learning quite challenging. The most challenging part, admits Arthur, was to understand how to apply theoretical methods to real-world problems.
When we asked Arthur why we see a lot of Europeans on Kaggle and in the machine learning field in general, he untangled this mystery by nonchalantly revisiting the history of Europe, especially that of his country, France. He reminded us of its great mathematicians, citing Poisson and Galois as examples.
Mathematics is a vital part of French culture and history. There has always been a big love story between maths and French people.
This culture of inculcating mathematics is alive to this day, and it makes turning towards domains such as machine learning almost natural.
He also reminded us that there was no secret to his machine learning mastery. All he did was read, learn, practice and repeat.
Life As A Kaggler
His initial interest in machine learning competitions was sparked when one of his professors tasked him with participating in Kaggle-like contests whose problem statements were framed by the prestigious Institut Henri Poincaré and a big company. Arthur ended up winning two of them, outperforming professional data scientists. He wanted to continue this momentum at a higher level, and what could be better than Kaggle?
So far, Arthur has participated in more than 80 competitions, in which he has won two gold, 12 silver and 14 bronze medals. Though he is in the top 100 at the global level, he believes there is still a long way to go to the top.
It takes a lot of time, a lot of reading, imagination and obstinacy.
For beginners, Arthur recommends exploring the data and finding what is not evident. “…and don’t hesitate to try classic methods. Trial and error is a great motto,” confided Arthur.
Beyond data exploration, Arthur stressed the importance of implementing the metric correctly, trying a couple of validation schemes, setting up a baseline and sticking to it.
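The habits Arthur describes can be sketched in a few lines of plain Python. This is only an illustrative sketch, not Arthur's own workflow: the toy target, the choice of RMSE as the metric, the mean-prediction baseline and every function name below are assumptions made for the example.

```python
import math
import random
from statistics import mean


def rmse(y_true, y_pred):
    """Root mean squared error, the metric assumed for this sketch."""
    return math.sqrt(mean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


def kfold_indices(n_samples, n_splits, seed=0):
    """Yield (train_idx, valid_idx) pairs for a shuffled k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for k in range(n_splits):
        valid_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, valid_idx


def time_ordered_split(n_samples, valid_fraction=0.2):
    """Hold out the last rows: a second scheme, suited to time-ordered data."""
    cut = n_samples - int(n_samples * valid_fraction)
    return list(range(cut)), list(range(cut, n_samples))


def mean_baseline_score(y, train_idx, valid_idx):
    """Score a baseline that always predicts the training mean."""
    prediction = mean(y[i] for i in train_idx)
    return rmse([y[i] for i in valid_idx], [prediction] * len(valid_idx))


# Hypothetical toy target: a noisy upward trend.
rng = random.Random(42)
y = [0.1 * i + rng.gauss(0, 1) for i in range(100)]

kfold_scores = [mean_baseline_score(y, tr, va) for tr, va in kfold_indices(len(y), 5)]
holdout_score = mean_baseline_score(y, *time_ordered_split(len(y)))

print(f"5-fold baseline RMSE: {mean(kfold_scores):.3f}")
print(f"Time-ordered holdout baseline RMSE: {holdout_score:.3f}")
```

On a trending series like this toy one, the two validation schemes give noticeably different baseline scores, which is one reason to compare more than one scheme before trusting any model's improvement over the baseline.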
What I learn in Kaggle, I apply it sometimes in my work, and this is important for me to do my job very well.
When asked how significant Kaggle has been for his career, Arthur heaped praise on its community and the variety of contests that he gets to participate in. He also firmly believes that the Kaggle experience adds a great deal to his learning curve, and that learning is still his primary goal.
Tools & Tricks Of A Master
Arthur Llau revealed that he had spent around 4-8 hours per day for over a month on the contests that fetched him gold. Arthur believes that being a top Kaggler is a full-time job. Talking about the resources required for a typical competition, Arthur says that a basic laptop sometimes suffices. However, depending on the competition, he sometimes rents GPUs on Google Cloud Platform with Kaggle vouchers.
With regard to languages, Arthur prefers Python, and sometimes C++ for operations research tasks. When it comes to algorithms, Arthur expressed his fondness for boosting methods such as XGBoost, CatBoost and LightGBM.
He switches between the Keras and PyTorch frameworks while using a handful of very useful libraries: albumentations for image augmentation, eli5 and lofo for feature selection, missingno and seaborn for visualization, and imblearn for imbalanced data. For parameter optimization, Arthur prefers Optuna, and skopt for its Bayesian module.
Here is what Arthur’s toolkit looks like:
- Hardware: MacBook Pro (2019, 16GB, i7), or i7, 32GB + 1070Ti, or GCP
- Language: Python and C++
- Framework: Keras and PyTorch
- Augmentation library: albumentations
- Feature selection library: eli5 and lofo
- Visualization: Missingno and seaborn
- Imbalanced data: imblearn
- Parameter optimization: Optuna and skopt
The availability of many libraries and frameworks has made the job of a data scientist easier. Deep learning algorithms can now be called by writing a single line of code in Python. Even complex mathematical operations are wrapped up as libraries.
The democratization of ML has drawn in a lot of people, and somewhere down the line, a few of them have started falling prey to vanity metrics such as leaderboard rankings and are venturing into malpractice.
On Kaggle especially, Arthur laments, cheating can happen in many forms. In the kernels section, he explains, there are a lot of copycat kernels (EDA/ensembling) that merely crave points/medals.
There are also a lot of users with multiple accounts who leak information across them, and there have been instances where an entire class of students (~20 people) used more or less the same solution and won a medal in a particular competition.
When asked how to identify those engaging in such malpractice, “Make an ML model,” quipped Arthur.
That said, he holds the Kaggle community in high regard, and he has made a lot of friends over the years. While he will continue to experiment with Kaggle contests, he hopes that there will be more original challenges like the 2018 TrackML challenge.
Arthur predicts reinforcement learning to be a big thing going forward, but he is a bit sceptical as there is still a long way to go in getting basic predictions right, like in sales or doing object recognition.
When asked about the overwhelming hype around AI, Arthur quipped that it is not artificial intelligence, it is artificial stupidity, quoting famous researcher Yoshua Bengio.
It is stupid to think that doing only MOOCs and using AutoML can tackle all kinds of problems.
AutoML is excellent at solving basic tasks with good performance, continued Arthur, but it cannot be used to solve complex problems. Another problem with AutoML is the black-box effect, which can lead to explainability issues in front of customers.
Reiterating the importance of practice for beginners, Arthur advises one to look at Kaggle as a playground rather than a battlefield, and to experiment a lot. He was also positive that aspirants can land a data science job with Kaggle in their portfolio, if combined with consistent practice.
However, he also warns of the dangers of inflating Kaggle success, as there is a vast difference in problem-solving at the industry level.
The data we typically get in the field is not as clean as in Kaggle. You can’t have magic or leaks or funny tricks in data science problems at work; you need to find other good methods.
A significant difference, observes Arthur, is the information extraction needed on the job. There is also a lot more discussion with field experts in order to model the problem well, which is not required on Kaggle.
Understanding any algorithm will eventually boil down to math mostly, and Arthur insists on having a good grasp of fundamentals. A student for life, Arthur admits that he has been fortunate enough to have exceptional teachers throughout his student life who have helped him become what he is today.
That said, a great book is equal to many excellent teachers, if not exceptional ones, and Arthur recommends that everyone read the following books, which he considers to be classics:
- « The Elements of Statistical Learning » by Tibshirani, Hastie and Friedman
- « Pattern Recognition and Machine Learning » by Bishop
- « Machine Learning: A Probabilistic Perspective » by Murphy