“My Kaggle journey took a lot of time, effort, computing power, frustration and sleepless nights, but mostly frustration.”
For this week’s ML practitioners series, Analytics India Magazine got in touch with Khoi Nguyen, a Kaggle master who is currently ranked 111 and has won gold in four competitions. In this interview, Khoi shares valuable insights from his machine learning journey.
Khoi has a Bachelor’s in computer science from Hanoi University of Science and Technology. He currently works as a data scientist at Zalo, Vietnam’s top messaging app. His work at Zalo consists of using machine learning and NLP to improve user experience.
Khoi’s fascination with machine learning started about three years ago when he was doing a project that involved deploying a deep learning model to Android devices. Talking about the project and how it sparked his interest in algorithms, Khoi explained how challenging it was to train a model on TensorFlow back then. “I ended up rewriting my version of it in C++, with backpropagation, beam search and everything. When I finally made it work, I realised that I really enjoyed the whole concept and started thinking about a change,” added Khoi.
Khoi had no formal training in machine learning. So when he began, he confessed that programming was a big challenge. “I had to review a lot of my mathematical knowledge. It’s a back and forth process; whenever I encounter something I don’t understand, I have to go back and try to understand things at that level and work my way up. It gets easier, but it was time-consuming at first and sometimes even now,” said Khoi.
On His Kaggle Journey
“Although the techniques used in competition are usually very specific to it [Kaggle], over time, it can open up new perspectives on problems that I have to deal with in the real world.”
Khoi’s introduction to Kaggle happened while he was searching for datasets. He got drawn into the competitions later on. His first competition was Airbus Ship Detection. “The result was a disaster, but I really enjoyed competing, so, as usual, I wanted to be good at it,” said Khoi.
Today, after two years, Khoi has won four gold, three silver and two bronze medals, and features in the top 100 of the Kaggle leaderboard. Though a top-100 spot is no mean feat considering the level of competition, Khoi still doesn’t consider himself to be at the top yet. His journey so far, he explained, consumed a lot of time, effort, computing power, frustration and sleepless nights.
Talking about the challenges faced in the Kaggle contests, Khoi took the example of one of his favourite competitions — Google QUEST Q&A. This competition required the participants to rank question and answer pairs on multiple subjective aspects. These ranged from how well written and informative the question was to something strangely specific, like whether the question was asking how to spell a word.
“My approach was quite unique,” explained Khoi. “Since this is a pair labelling task, many teams deployed the common input architecture for transformers, i.e. [CLS] [SENT_0] [SEP] [SENT_1]. Instead, I used a Siamese-like model where the question and answer were fed to the model separately, then I took the representations of each and concatenated them into a single vector for the final regression layer. This comes from the observation that by using the common approach, you’re limiting the possible length of the question-answer pair to the maximum length allowed by a transformer (typically 512); using my method, it can get up to 1024.”
“Interestingly my best single model used an XLNet backbone which I don’t think is a popular choice,” he added.
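The Siamese-style idea Khoi describes can be sketched roughly as below. This is a minimal illustration, not his actual competition code: a tiny nn.TransformerEncoder stands in for the pretrained backbone (XLNet in his case), mean pooling stands in for whatever pooling he used, and all dimensions and the 30-target output are placeholders.

```python
import torch
import torch.nn as nn

class SiameseRanker(nn.Module):
    """Encode question and answer separately, then concatenate the
    pooled representations for a single regression head.

    Because the question and answer each get their own forward pass,
    the effective token budget per pair is doubled (e.g. 2 x 512 = 1024)
    compared with packing both into one [CLS] q [SEP] a sequence.
    """

    def __init__(self, vocab_size=1000, d_model=64, n_targets=30):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        # Stand-in for a pretrained backbone such as XLNet.
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Head sees the concatenation of both pooled vectors.
        self.head = nn.Linear(2 * d_model, n_targets)

    def forward(self, q_ids, a_ids):
        q = self.encoder(self.embed(q_ids)).mean(dim=1)  # pooled question
        a = self.encoder(self.embed(a_ids)).mean(dim=1)  # pooled answer
        return self.head(torch.cat([q, a], dim=-1))

model = SiameseRanker()
q = torch.randint(0, 1000, (2, 512))  # batch of 2 questions, 512 tokens each
a = torch.randint(0, 1000, (2, 512))  # batch of 2 answers, 512 tokens each
out = model(q, a)
print(out.shape)  # one score vector per pair
```

In practice both branches would share a pretrained encoder and use proper tokenisation and attention masks; the sketch only shows the shape of the architecture.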
The Google QUEST competition also used Spearman correlation as its metric, which was easy to exploit, a trick Khoi refers to as ‘magic’.
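One widely discussed form of this “magic” was post-processing: since the QUEST targets take only a handful of discrete values, snapping continuous predictions onto a small grid creates ties that can shift the rank-based Spearman score. The sketch below uses synthetic data and an assumed label grid purely to illustrate the mechanism; it is not Khoi’s actual post-processing.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
levels = np.array([0.0, 1 / 3, 2 / 3, 1.0])        # assumed discrete label grid
y_true = rng.choice(levels, size=200)               # synthetic targets
y_pred = y_true + rng.normal(0, 0.15, size=200)     # noisy continuous predictions

def snap(pred, grid):
    # Map each prediction to the nearest grid value, creating ties.
    return grid[np.abs(pred[:, None] - grid[None, :]).argmin(axis=1)]

raw, _ = spearmanr(y_true, y_pred)
snapped, _ = spearmanr(y_true, snap(y_pred, levels))
print(f"raw: {raw:.3f}  snapped: {snapped:.3f}")
```

Because Spearman only looks at ranks, manufacturing the right ties can move the score without the model getting any better at the underlying task, which is exactly the kind of metric exploitation Khoi is pointing at.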
Underlining the importance of a robust local validation pipeline, Khoi advises building one before starting any serious optimisation. Regarding frameworks, Khoi uses PyTorch for prototyping and competing, and TensorFlow for production.
Reminiscing about how useful Kaggle has been to his ML journey, Khoi explained how Kaggle could be leveraged into diversifying one’s skill set. “Beside the raw amount of knowledge that you can get from Kaggle, I think it helped to diversify my skill set by introducing new kinds of problems with a clear objective, motivating me to find the best way to solve it. Although the techniques used in competition are usually very specific to it, over time it can open up new perspectives on problems that I have to deal with in the real world,” believes Khoi.
He does agree that the title of Kaggle master or grandmaster helps one’s resume stand out from the crowd. However, in an interview, one still has to prove themselves; there is no free pass, warned Khoi.
“I see some people try to “game” the ranking system using unethical methods, some even got to the tier of grandmasters. Maybe you can fool some clueless people with that, but you will soon be found out one way or another.”
Khoi also warns aspirants against mistaking the competitions for real-world problems. “A competition is a very specific kind of data science problem. You have the problem handed to you, with a cleaned dataset, a fixed metric and, most of the time, very few performance constraints. That’s just unrealistic. Ironically, I find the hardest part of a problem is to define the problem itself: what can be solved by machine learning/data science to bring value. The ‘how’ it can be solved is usually the easier part,” explained Khoi.
“I don’t have any special tricks but to keep working until you get better. Also I always try to find interesting problems to try to work on so that I don’t lose motivation.”
For beginners, Khoi recommends Chai Time Data Science, where Sanyam Bhutani interviews successful data scientists and Kagglers with many great insights, as well as Abhishek Thakur’s YouTube channel. In terms of books, Khoi recommends Neural Networks and Deep Learning by Michael Nielsen. Khoi considers himself more of a hands-on person: he usually likes to jump into a problem first and gather resources along the way, tackling new challenges as they appear, which tends to lead to reading a lot of new material.
On The Future Of ML
“We don’t have an understanding at a mathematical level why our best models even work; your LSTM won’t make you a millionaire with the stock market.”
When asked about the hype around ML, Khoi said that machine learning would forever be stuck in non-critical applications if it remains a black box. So, going forward, Khoi expects explainable models to flourish.
That said, Khoi also admits that people are overly optimistic when it comes to applying AI, and we’re still clueless about AGI. “We don’t have an understanding at a mathematical level why our best models even work; your LSTM won’t make you a millionaire with the stock market,” he said.
This also happens to be the reason why Khoi advises people not to take the competition setup too seriously. “The fixed, single metric means Goodhart’s law takes control. At some point it’s no longer about generalisation but overfitting to the right kind of noise. The aim should be to make something that people may find useful if it ever gets into production. Your models are only as good as the data allows, so a large amount of effort must be put into data collection/cleaning. This is often missing in competition but is extremely crucial in practice,” concluded Khoi.