“A majority of books or courses are based on overly used datasets or benchmarks but things get harder as you face real-world noisy problems.”
For this week’s ML practitioner’s series, we got in touch with Olivier Grellier, a 2x Kaggle GM and senior data scientist at H2O.ai, a leading open-source machine learning and artificial intelligence platform trusted by data scientists across 14K enterprises. At H2O.ai, Olivier leads a team of exceptional Kaggle Grandmasters. In this interview, Olivier shares a few secrets from his inspiring journey to the top of the data science world.
AIM: How did your ML journey begin?
Olivier: My journey is made up of very different roles and experiences. I have been a researcher, a software engineer, a team leader, a project manager, a futures trader and now a data scientist. So it’s hard to say when it all started. I completed a PhD in Signal Processing in 2000 with a focus on telecommunications. My thesis was on semi-supervised deconvolution, which gave me a strong statistical and mathematical background. I love coding, and as I was into stock markets, I soon started backtesting trading ideas. Finding patterns and predicting future asset prices led me to data science, machine learning and Kaggle!
AIM: What were the initial challenges, and how did you address them?
Olivier: A major challenge was translating what I read into pipelines that actually work and generalise. A majority of books or courses are based on overly used datasets or benchmarks but things get harder as you face real-world noisy problems. Books and online courses are amazing resources to learn from, but you need a lot of practice to ensure your models generalise to unseen data. The first book I read on machine learning was written by people at Octo Technology, with a foreword signed by Yann LeCun. It focuses on applied ML and uses data science competitions to demonstrate different techniques. I then started Kaggling and used knowledge bases like Coursera, Packtpub, Apress and the Kaggle forums. This very first book mentioned Kaggle, and that is when I signed up, five years ago. Kaggle is the go-to platform to learn and practise AI/ML. In competitions, I am always amazed and grateful that Kagglers share their knowledge and techniques in kernels or discussions. This is where I learnt most of what I know today.
“I don’t play too much with optimizers and/or learning rate schedules as you may just totally overfit training data.”
AIM: What is the secret behind your success on Kaggle?
Olivier: The first step is to get an understanding of the problem. Then I look at the data and try to find differences in train and test sets. This is very important for feature engineering in tabular competitions and for augmentations in computer vision problems.
These differences also help build the right validation scheme. I want to see what makes the test data different from the train data and analyse discrepancies between my local validation score and the public leaderboard. This is key to ending in the top 10.
This also holds for real-world problems where your model is deployed to production and scores unseen data. The right validation should make your model generalise and be successful.
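One common way to look for train/test differences of the kind described here is adversarial validation: train a classifier to tell train rows from test rows and see how well it does. The sketch below illustrates the idea on synthetic data and is not Olivier's own pipeline; the function name `adversarial_auc`, the gradient boosting model and the toy data are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score


def adversarial_auc(train_X, test_X, seed=0):
    """Cross-validated AUC of a classifier that tries to separate train rows from test rows.

    An AUC near 0.5 suggests the two sets look alike; an AUC well above 0.5
    suggests a distribution shift worth investigating.
    """
    X = np.vstack([train_X, test_X])
    # Label 0 = row comes from train, label 1 = row comes from test.
    y = np.concatenate([np.zeros(len(train_X)), np.ones(len(test_X))])
    clf = GradientBoostingClassifier(n_estimators=50, random_state=seed)
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()


# Hypothetical toy data: one test set drawn like train, one with a shifted feature.
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(500, 5))
test_same = rng.normal(0.0, 1.0, size=(500, 5))
test_shifted = test_same.copy()
test_shifted[:, 0] += 2.0  # shift the first feature only

auc_same = adversarial_auc(train, test_same)
auc_shifted = adversarial_auc(train, test_shifted)
print(f"AUC, same distribution: {auc_same:.3f}")
print(f"AUC, shifted feature:   {auc_shifted:.3f}")
```

When the score is high, the classifier's most important features point to where train and test diverge, which can guide both feature engineering and the choice of validation split.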
Selecting your model is not just about one or two metrics; it’s about model robustness and stability. Backtesting is of utmost importance in time series problems. Your model should behave well over different time periods. I also find the leaderboard very powerful. It forces you to keep improving your skills and models as competitors challenge your placement. It is a tool every AI company should have.
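As an illustration of backtesting over different time periods, the sketch below uses walk-forward splits, where each fold trains on the past and scores the block that follows. It is a minimal example under stated assumptions, not the setup used by Olivier: the `backtest` helper, the Ridge model and the synthetic data are hypothetical, while `TimeSeriesSplit` is scikit-learn's standard splitter for ordered data.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit


def backtest(X, y, n_splits=5):
    """Walk-forward evaluation: return one MAE per consecutive time period."""
    fold_errors = []
    # Each split trains on rows before the validation block, never after it.
    for train_idx, valid_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
        model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
        preds = model.predict(X[valid_idx])
        fold_errors.append(mean_absolute_error(y[valid_idx], preds))
    return np.array(fold_errors)


# Hypothetical toy series: rows are assumed to be in chronological order.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=600)

errors = backtest(X, y)
print("MAE per period:", errors.round(3))
print("mean:", errors.mean().round(3), "std:", errors.std().round(3))
```

Looking at the spread of the per-period errors, and not only their mean, is one simple way to judge the stability mentioned above.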
The most important thing in this matter is building a robust validation scheme for your problems. Practice is the only way you can see for yourself what works and what does not. In that regard, I find competing very useful. It’s a sport in which you train your brain to learn ways and tricks to build good models. The best resource of all is the winners’ solution write-ups, published when the competition ends. I work with Grandmasters every day and can say that a lot of work and dedication are required to be at the top.
AIM: What is your favourite Kaggle competition, and how did you approach it?
Olivier: One of my favourite competitions has been SIIM-ISIC Melanoma Classification. It was a health-care competition where Kagglers were challenged to classify skin lesions and identify melanomas. I usually start a competition by looking at the discussion forum and public kernels to get some insights. I then look at the data and check the images to see the sort of augmentations that would help training. The last step of the discovery phase is building a first pipeline for training and inference. This is particularly important for code competitions.
Once this is done, I focus on getting the most out of the available data.
I try different architectures and augmentations. I don’t play too much with optimizers and/or learning rate schedules as you may just totally overfit training data. The most important thing in this competition was to go multiclass despite the original problem being a binary classification. During this second phase, I took a look at high-scoring kernels and passed them through a strong cross-validation pipeline to see what holds and what does not. It gave me insights into the future leaderboard shake-up. Later, I spent some time on complementary datasets. Here, the data from previous competitions gave a good boost in performance, although the classes were slightly different and more numerous.
Very early in the competition, I felt a nice shake-up would occur on the private leaderboard and focused on making the solution generalise. Blending ranked predictions of a few architectures resulted in my local and LB scores being equal and moving in the same direction. It helped me trust what I was doing, as I was ranked 1,232 on the public leaderboard. I moved up 1,178 places on the private leaderboard and finished 54th. This was my best placement in a computer vision competition at the time. Most important of all, I learnt a lot!
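To make "blending ranked predictions" concrete, here is a minimal sketch of rank averaging: each model's scores are converted to ranks, scaled to [0, 1] and then averaged, so models with differently calibrated outputs contribute equally. This is a generic illustration and not the actual competition code; the `rank_blend` helper and the example scores are hypothetical.

```python
import numpy as np


def rank_blend(predictions):
    """Average the normalised ranks of several models' predictions.

    Assumes every model scored the same samples in the same order
    and that there are at least two samples.
    """
    ranked = []
    for p in predictions:
        p = np.asarray(p, dtype=float)
        # Double argsort gives each sample's rank (0 = lowest score);
        # with kind="stable", ties are broken by position.
        ranks = p.argsort(kind="stable").argsort(kind="stable")
        ranked.append(ranks / (len(p) - 1))  # scale ranks to [0, 1]
    return np.mean(ranked, axis=0)


# Hypothetical scores from two models on the same four samples.
model_a = [0.10, 0.40, 0.35, 0.80]
model_b = [0.02, 0.03, 0.01, 0.90]

blend = rank_blend([model_a, model_b])
print(blend)
```

Rank averaging discards the scale of each model's scores and keeps only the ordering, which is why it tends to suit metrics that depend only on ordering, such as AUC.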
AIM: What does your ML toolkit look like?
Olivier: There is an ever-increasing number of open source tools available, and I love testing new things. However, I’m quite conservative in competitions to avoid being hit by a last-minute bug discovery. I use scikit-learn, pydatatable/pandas, LightGBM/XGBoost/CatBoost, H2O-3 AutoML, Optuna, PyTorch, PyTorch Image Models (timm) and albumentations. I find neptune.ai a great way to keep track of experiments and source code. For code, I use GitHub to pair tags with LB scores. For compute, I use dedicated root servers and TPU resources provided by Kaggle. I also use resources available at H2O.ai. Data analysis and problem-solving capabilities are necessary, along with a good command of Python for NLP and computer vision use-cases. You need to be good at managing GPU/CPU compute resources as well!
AIM: What is the rationale behind H2O.ai’s fondness for Kaggle Grandmasters?
Olivier: Grandmasters have a unique way of addressing problems, thanks to the experience they have on a variety of datasets and problems. They are knowledgeable, efficient, fast and humble. Building automatic machine learning tools is a demanding task. Hiring the best people ensures we build AI solutions that do not overfit, generalise to unseen data and are useful. Above all, the Kaggle team at H2O.ai is a strong asset for AI democratisation and helps our customers be successful. We hire strong Kagglers who have demonstrated their performance in the competitions or notebooks tier. We seek diversity in domain expertise so that team members complement each other and can grow together.
AIM: How does it feel to lead a team of top Kaggle GMs?
Olivier: I joined H2O.ai as a senior data scientist in 2019. It is a privilege to lead the greatest team of Kaggle Grandmasters in the world in a pure AI innovation company. Our CEO Sri Satish Ambati built an amazing team with people located everywhere in the world, including 5 of the World’s Top 15! We also have four amazing Grandmasters in India. My role at H2O.ai is more about leading the team than managing individuals. In the team, Grandmasters are involved in almost everything, from automatic machine learning with Driverless AI to the success of our customers. We love our people and look for a high level of creativity. My goal is to ensure everyone finds the right space and to remove any barrier so that they bring what makes them unique to the company. At H2O.ai, our mission is to democratize AI, and I was very excited by the release of H2O Wave last December, which will help people build AI apps faster and more easily.
AIM: Any tips for beginners?
Olivier: As a beginner, I would start with introductory courses on Coursera for Python, data science and deep learning. In parallel, I would participate in Kaggle competitions, read discussion forums and play with public kernels. The idea is to spend enough time on a competition so that you can benefit from the winners’ write-ups when it completes.
AIM: What does the future hold for AI?
Olivier: AI will be with us for a long time, and I don’t see the dust settling anytime soon. Today, a big focus is set on models when, in fact, I believe dataset auditing is the primary step for all of this. If a model is biased, chances are the dataset it has been trained on is too. The number one domain in the next few years is Responsible AI. Being transparent about what your model does and how it does it is key to the adoption of AI. This is an all-inclusive domain for AI, covering fairness analysis, model interpretability, adversarial impact and safety of models. Progress of deep learning in computer vision and natural language processing has already changed our lives, but I expect a lot more to come in these areas. Reinforcement learning is set to become a top area of AI, and Kaggle has already organised great competitions in that field.