This week, for our Kaggle Master’s series, we got in touch with Tri Duc Nguyen Tang, who is ranked 47th on the Kaggle worldwide rankings leaderboard. He is also the chief data engineer and co-founder of the Vietnamese AI startup Palexy. In this interview, Duc shares his experiences on his ascent into the top 1% in Kaggle competitions, and much more.
Early Fascination With ML
Tri Duc Nguyen Tang, along with his peers, founded Palexy, an AI startup that provides customer insights to retail stores. Duc has a bachelor’s degree in computer science and a master’s in Information Science from the Japan Advanced Institute of Science and Technology (JAIST).
Currently, Duc works as the Chief Data Engineer at Palexy and oversees data engineering and data science.
At Palexy, the team aims to transform unstructured video data, streamed from cameras at retail stores, into structured data, and then provide insights.
Duc’s first big push towards machine learning happened at his first job, when he was working as a software engineer at a Japanese company, NEC Vietnam, where he had to build AutoML systems.
His fascination with machine learning grew during this project, and he ended up quitting the job and going back to school.
Fortunately, exclaims Duc, “I received the scholarship from JAIST (Japan Advanced Institute of Science and Technology) in the field of deep learning and machine learning to apply in games AI.”
Duc badly wanted to apply what he had learned. So instead of opting for a lucrative job in a developed country like Japan, he went back to his roots and started a company in Vietnam.
His beginnings, however, admits Duc, were more or less similar to those of many ML aspirants around the world.
While he was establishing his company, he stumbled on Kaggle in his search for a dataset. Two years later, he is now a grandmaster and has made it into the top 1% in many competitions.
Ever since he started participating in Kaggle, Duc has been self-taught; his first course was the widely popular “Machine Learning” course by Andrew Ng on Coursera, which laid the foundation for his ascent to the top.
The initial goal was to find a public dataset on Kaggle for my company’s project
In his initial days on Kaggle, Duc used and improved source code from public kernels and tried to get a high score on the public leaderboard, but he usually dropped ranks because of overfit models.
Duc’s persistence paid off, as he continued to participate in as many competitions as he could while establishing his company. His first big success came in an image classification competition for fashion products.
Duc, along with his colleagues, teamed up and won the second prize in the iMaterialist Challenge (Fashion) at FGVC5. This was followed by the Inclusive Image challenge, which fetched him his first gold medal.
Methods Of The Master
Like all successful ML practitioners, Duc too, insists on the importance of knowing the fundamentals that involve mathematics.
He strongly believes that mathematics helps one get familiar with algorithms and prepares one for the concepts introduced in books or advanced courses.
However, in a real-world project or Kaggle competition, observes Duc, the role of mathematics is rarely tangible, and one barely touches it while building ML pipelines.
The most important skills, explains Duc, can more or less be summarised in two points:
- Use EDA to get a good feel for your dataset, and
- Improve your understanding of why your model is making the wrong decisions by running an error analysis.
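The error-analysis habit Duc describes can be sketched in a few lines: train a model, collect the validation samples it gets wrong, and look for patterns in them. The dataset and model below are illustrative placeholders, not details from the interview.

```python
# Minimal error-analysis sketch: collect misclassified validation
# samples and see which true classes the model confuses most often.
from collections import Counter

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
preds = model.predict(X_val)

# Pairs of (true label, predicted label) where the model was wrong.
errors = [(t, p) for t, p in zip(y_val, preds) if t != p]
print(Counter(t for t, _ in errors).most_common(3))
```

Inspecting the raw inputs behind those error pairs is usually where the real insight (mislabeled data, a missing feature, a systematic confusion) comes from.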
In most competitions, stresses Duc, the goal should be to build many single models (each capable of reaching the top 100) and then ensemble them, rather than to chase a single model that can reach the top 10.
“To succeed, you need Pragmatic Coding 95%, Tenacity 85%, An Open Mind 80% and High School Math 60%.” (fast.ai)
For example, in the Recursion Cellular competition, the input data was a set of 6-channel images, and most competitors used all six channels in their models. Duc and his team instead tried many different combinations of channels, such as [1,2,3,4,5], [1,2,3,4,6], etc.
His team later concluded that different combinations give different performance owing to their diversity.
They went ahead and trained multiple models, each with a different combination of channels, and then ensembled them.
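The channel-subset idea above can be sketched as follows: train one model per 5-of-6 channel combination, then average their predictions. The synthetic data, shapes, and logistic-regression stand-in here are assumptions for illustration; the team’s actual models were deep networks.

```python
# Channel-subset ensemble sketch: one model per 5-channel subset
# of 6-channel inputs, predictions averaged at the end.
from itertools import combinations

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6, 16))   # 200 samples, 6 channels, 16 features/channel
y = rng.integers(0, 2, size=200)    # synthetic binary labels

probs = []
for subset in combinations(range(6), 5):       # (0,1,2,3,4), (0,1,2,3,5), ...
    X_sub = X[:, list(subset), :].reshape(len(X), -1)
    model = LogisticRegression(max_iter=1000).fit(X_sub, y)
    probs.append(model.predict_proba(X_sub)[:, 1])

# Each subset model sees slightly different information; averaging
# exploits that diversity.
ensemble = np.mean(probs, axis=0)
print(ensemble.shape)
```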
In the RSNA Intracranial Hemorrhage Detection competition, his team built nine separate convolutional neural network (CNN) models and used LightGBM to learn the correlations between their predictions. This stacking method helped his team reach 5th place on the leaderboard.
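Stacking of this kind typically uses out-of-fold predictions from the base models as features for the meta-model. A minimal sketch follows; the dataset and the simple sklearn base models are placeholders, and sklearn’s `GradientBoostingClassifier` stands in for LightGBM to keep the example dependency-free.

```python
# Stacking sketch: out-of-fold probabilities from several base models
# become the features of a gradient-boosted meta-model.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = load_breast_cancer(return_X_y=True)

base_models = [
    LogisticRegression(max_iter=5000),
    RandomForestClassifier(random_state=0),
]

# Out-of-fold predictions avoid leaking labels into the meta-features.
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

meta_model = GradientBoostingClassifier(random_state=0).fit(meta_features, y)
print(meta_features.shape)
```

In a competition setting the meta-model would of course be evaluated on held-out folds as well, not on its own training meta-features.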
Currently, he is competing in the Deep Fake Kaggle challenge, which he says is not for everyone. Talking about the challenges, Duc lamented that the dataset is huge and requires a lot of computing resources.
“Right now, we are facing an imbalanced dataset, and there is a big gap between our local cross-validation and public leaderboard. The limited inference time (9 hours with GPU) is also a problem, we cannot stack as many models as we want,” adds Duc talking about how every competition comes with its own challenges.
Talking about computational resources, Duc said that he and his team usually use one server with 2x 1080 Ti GPUs alongside Kaggle kernels. For a competition like DeepFake, he plans to rent a server with 4x 1080 Ti on AWS.
Roadmap To Glory
Before deciding to join a competition, advises Duc, participants should try to re-use source code from previous contests to get a good baseline.
Always perform K-fold to evaluate the gap between local validation and public leaderboard
Tips for Kaggle beginners:
- Read the forum carefully for discussions
- Re-read the top solutions from similar past contests
- Research new papers to get ideas
- Run experiments, and always perform K-fold validation to evaluate the gap between local validation and the public leaderboard
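The K-fold habit from the tips above can be sketched as: compute a mean cross-validation score locally, then compare it with a held-out split that plays the role of the public leaderboard. The dataset and model here are illustrative stand-ins.

```python
# Compare local K-fold CV against a held-out "leaderboard" split to
# spot the kind of CV/LB gap Duc warns about.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_lb, y_tr, y_lb = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=5000)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
local_cv = cross_val_score(model, X_tr, y_tr, cv=cv).mean()

model.fit(X_tr, y_tr)
lb_score = model.score(X_lb, y_lb)
print(f"local CV {local_cv:.3f} vs held-out {lb_score:.3f}")
```

A large gap between the two numbers usually signals overfitting or a validation scheme that does not match the test distribution.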
When it comes to frequently used tools, Duc usually finds himself using Keras/TensorFlow, OpenCV, albumentations, LightGBM, and scikit-learn.
When Duc was asked about his secret for going from good to great, he remembered the wisdom shared by his CEO.
“Think about the problems from the user’s perspective, from the customer’s perspective, from the company’s perspective, the community’s perspective, etc. Think about the whys and the whats first, and let those drive the hows,” says Duc, quoting his CEO.
He also has some great recommendations for those who are starting out new in this AI field:
- “Machine Learning by Andrew Ng” on Coursera
- Stanford courses: CS231n, CS224n, CS229
- Data Science Specialization by Johns Hopkins University on Coursera
- Deep Learning Specialization by deeplearning.ai on Coursera
- FastAI courses.
- “Pattern Recognition and machine learning,” by Bishop
- “Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow”
- “The Elements of Statistical Learning”
- “Feature Engineering for Machine Learning” by Alice Zheng & Amanda Casari.
If you want to be a researcher, suggests Duc, try a bottom-up approach: carefully study basic courses like Machine Learning by Andrew Ng on Coursera and other courses from Stanford and MIT, implement everything from scratch, test your ideas on benchmark datasets, and join Kaggle competitions.
If you want to be a machine learning engineer, try a top-down approach: start with the fast.ai courses, try some hands-on projects, and also join Kaggle competitions.
Engineering A Data-Driven Future
Duc’s current title at Palexy is chief data engineer, and we had to ask him the much-dreaded question: what is the difference between a data scientist and a data engineer?
His role, explained Duc, is a mix between data engineer and data scientist.
The role of a data engineer is to collect data and prepare the data pipeline; the data engineering team’s primary focus is to build the infrastructure and architecture for data generation, using tools such as SQL, MySQL, Spark, Hadoop, and Hive.
A data scientist, on the other hand, is responsible for extracting insights from data, formulating those insights into a model, and communicating them to clients. A data scientist would use statistics, visualisation (matplotlib, seaborn), modelling (scikit-learn, TensorFlow, PyTorch), etc.
If Data Science is a cuisine, then a Data Engineer is the one who prepares the ingredients, and a Data Scientist is the one who cooks
Remembering the early hiccups he faced at Palexy, Duc pointed out how challenging building an ML pipeline at a startup can be. In his case, he and his team had to tussle with the generalisation of the model, one of their biggest challenges: they had only a week to collect data, and an ambitious target of making the model work well for at least four months.
The other big challenge was balancing resources against accuracy. Regardless of what title one holds, Duc’s journey proves that a great data scientist or machine learning developer is someone who takes a 360-degree approach to skills and decision making.
Deep learning may be improved and transformed, but Gradient Descent will still be the key
When asked about the overwhelming reception of AI, Duc was quick to warn us that it would be a long time before we see any Artificial General Intelligence.
Self-supervised learning will become popular, and we will not need as many labels to train a model. Learning to reason will be the trend, and there will be some weak AGIs.
“So don’t worry about Artificial Super Intelligence like in the movie Terminator, where machines go back to the past and try to kill people,” quipped Duc.