For this week’s ML practitioner’s series, Analytics India Magazine got in touch with Hiroki Yamamoto (tereka), a Kaggle Master. Hiroki currently works as a data scientist and is ranked in the top 100 on Kaggle, the world’s largest platform for data science competitions. In this interview, Hiroki shares his experience of competing on Kaggle and how it has helped him grow as a data scientist.
AIM: Tell us a bit about your initial machine learning journey.
Hiroki: I got a master’s degree in information technology back in 2015. During my graduate studies, I worked on image processing research using deep learning, for example autoencoders. My journey began by reading many research papers and working with libraries like Theano and scikit-learn. To learn Theano, I used the official site. Currently, I work at Acroquest Technology as a data scientist, and my work focuses on image processing research and developing products for my company. I also consult with other companies.
AIM: Can you talk about your Kaggle journey?
Hiroki: When I started participating in Kaggle competitions, I was not familiar with ensemble tricks, boosting (XGB or LGB) or many feature engineering techniques. Back then, Kaggle didn’t have kernels (notebooks), so competitors were responsible for creating their own baselines. I tried it too, and built my baseline through trial and error.
In due course, I joined many competitions and learned many tricks by going through the solutions published by the top competitors. My personal favourites are the following:
The solutions of these Kagglers usually feature impressive neural network architectures and give a different perspective on the problem. I want to solve problems with neural networks as much as possible, and I refer to these past solutions many times while working on a competition.
When it comes to tools, I mainly use PyTorch. PyTorch is a great deep learning framework that has many libraries, utilities, and pretrained models (image/NLP). Many researchers use it for their research implementations, and many Kagglers use it to build their solutions. PyTorch can also run on TPUs via pytorch-xla.
For training machine learning models, I use my personal workstation and Google Colaboratory. My personal setup consists of 2 x RTX2080Ti/GTX1080Ti. We can also use TPUs in Google Colaboratory; TPUs are very fast and allow us to increase the batch size.
I will talk about one of my most difficult competitions on Kaggle: Global Wheat Detection. Participants were asked to detect wheat heads in outdoor images of wheat plants, drawn from datasets collected around the globe. Competitors could use more than 3,000 training images collected from Europe (France, UK, Switzerland) and North America (Canada). The test data included about 1,000 images from Australia, Japan, and China.
Models developed for wheat phenotyping need to generalise between environments, so the detection model must be robust. Current detection methods involve one- and two-stage detectors (Yolo-V3 and Faster-RCNN), but even when trained with a large dataset, a bias to the training region remains. So I used the EfficientDet model and heavy augmentation.
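To make "heavy augmentation" for detection a little more concrete, here is a minimal sketch of one building block: flipping an image together with its bounding boxes. It is not Hiroki's actual pipeline. The function names, the `(x_min, y_min, x_max, y_max)` box format, and the nested-list image are all assumptions chosen to keep the example dependency-free.

```python
import random
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]
Image = List[List[int]]  # toy image: a list of pixel rows


def hflip(image: Image, boxes: List[Box]) -> Tuple[Image, List[Box]]:
    """Mirror the image left-right and move the boxes accordingly."""
    width = len(image[0])
    flipped_image = [row[::-1] for row in image]
    flipped_boxes = [
        (width - x_max, y_min, width - x_min, y_max)
        for x_min, y_min, x_max, y_max in boxes
    ]
    return flipped_image, flipped_boxes


def vflip(image: Image, boxes: List[Box]) -> Tuple[Image, List[Box]]:
    """Mirror the image top-bottom and move the boxes accordingly."""
    height = len(image)
    flipped_image = [list(row) for row in image[::-1]]
    flipped_boxes = [
        (x_min, height - y_max, x_max, height - y_min)
        for x_min, y_min, x_max, y_max in boxes
    ]
    return flipped_image, flipped_boxes


def random_flip(image: Image, boxes: List[Box], rng: random.Random,
                p: float = 0.5) -> Tuple[Image, List[Box]]:
    """Apply each flip independently with probability p (p is an assumed default)."""
    if rng.random() < p:
        image, boxes = hflip(image, boxes)
    if rng.random() < p:
        image, boxes = vflip(image, boxes)
    return image, list(boxes)
```

In practice, a detection pipeline would combine many such transforms (crops, colour shifts, mixing of images) and would normally rely on an augmentation library rather than hand-written code. The key point is that every geometric change to the image must also be applied to the boxes.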
EfficientDet was released by Google earlier this year. First, a small EfficientDet-D0 baseline was developed, and then compound scaling was applied to obtain EfficientDet-D1 to D7. Each consecutive model has a higher compute cost, covering a wide range of resource constraints from 3 billion FLOPs to 300 billion FLOPs, and provides higher accuracy. For the wheat detection competition, pseudo-labelling inside the inference notebook was also very effective. Successful solutions should be able to accurately estimate the density and size of wheat heads in different varieties.
Check my solution here.
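The pseudo-labelling idea mentioned above can be sketched roughly as follows: run the current model on unlabelled test images, keep only the confident detections, and treat them as extra training labels for another round of training. This is an illustrative sketch, not the solution linked above. The names `select_pseudo_labels` and `merge_labels`, the data layout, and the 0.5 threshold are all assumptions.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]   # assumed (x_min, y_min, x_max, y_max)
Detection = Tuple[Box, float]             # (box, confidence score)


def select_pseudo_labels(
    predictions: Dict[str, List[Detection]],
    score_threshold: float = 0.5,         # assumed value; tune per competition
) -> Dict[str, List[Box]]:
    """Keep only confident detections and drop images with none left."""
    pseudo_labels: Dict[str, List[Box]] = {}
    for image_id, detections in predictions.items():
        confident = [box for box, score in detections if score >= score_threshold]
        if confident:
            pseudo_labels[image_id] = confident
    return pseudo_labels


def merge_labels(
    train_labels: Dict[str, List[Box]],
    pseudo_labels: Dict[str, List[Box]],
) -> Dict[str, List[Box]]:
    """Combine real training labels with pseudo-labels for the next training round."""
    merged = dict(train_labels)
    merged.update(pseudo_labels)
    return merged
```

A typical loop would predict on the test images, call `select_pseudo_labels`, retrain or fine-tune on the merged labels, and then predict again. A higher threshold gives fewer but cleaner pseudo-labels.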
For me personally, Kaggle has helped me get recognised as a more advanced data scientist. I have gained immense modelling knowledge from Kaggle. Whenever I encounter an issue at work, the knowledge and skills from Kaggle come in handy. I have also developed a sense of how quickly and how accurately an initial model can be built, which is very important for estimating whether a model will be successful or not.
AIM: A few words for aspirants and about the future of ML?
Hiroki: For beginners, machine learning skills are important along with knowledge of software engineering and workflows. If you are planning to experiment with a problem, then you have to write a great pipeline. Pipelines enable you to experiment faster.
Here are a few tips on how to approach a data science problem:
- Read or set a problem statement.
- Read discussions and notebooks (Kaggle only), or do EDA in a notebook.
- Create a baseline validation scheme (don’t leak).
- Create a small baseline. With bigger models or bigger experiment settings, you cannot run many experiments.
- Run many experiments on parameters and architectures.
- Ensemble (Kaggle only).
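As one illustration of the "don't leak" step, here is a minimal sketch of a group-aware split, where all samples from one group (for example, one data source or region) stay on the same side of the train/validation boundary. The function name `group_kfold` and the choice of region as the grouping key are assumptions made for this example. Libraries such as scikit-learn offer a ready-made equivalent.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterator, List, Sequence, Tuple


def group_kfold(
    groups: Sequence[Hashable], n_splits: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, valid_indices) so that no group is split across both."""
    indices_by_group: Dict[Hashable, List[int]] = defaultdict(list)
    for index, group in enumerate(groups):
        indices_by_group[group].append(index)
    if n_splits < 2 or n_splits > len(indices_by_group):
        raise ValueError("n_splits must be between 2 and the number of groups")

    # Greedy balancing: place the largest groups first, each into the
    # currently smallest fold.
    folds: List[List[int]] = [[] for _ in range(n_splits)]
    ordered = sorted(indices_by_group,
                     key=lambda g: (-len(indices_by_group[g]), str(g)))
    for group in ordered:
        smallest = min(range(n_splits), key=lambda f: len(folds[f]))
        folds[smallest].extend(indices_by_group[group])

    for fold_id in range(n_splits):
        valid = sorted(folds[fold_id])
        train = sorted(i for other in range(n_splits) if other != fold_id
                       for i in folds[other])
        yield train, valid


# Hypothetical example: one region label per training image.
regions = ["France", "France", "UK", "Switzerland", "Switzerland", "Canada"]
for train_idx, valid_idx in group_kfold(regions, n_splits=3):
    print("train:", train_idx, "valid:", valid_idx)
```

Validating on groups the model has never seen gives a score that is closer to what happens on test data from new environments, which is the situation described for Global Wheat Detection above.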
When it comes to reading resources, there are many, but I recommend the following two books:
- Kaggleで勝つデータ分析の技術 (Data Analysis Techniques to Win Kaggle, in Japanese) is a great book. It introduces many tricks such as feature engineering, hyperparameter optimisation, and algorithms, using Kaggle as the setting.
- Approaching (Almost) Any Machine Learning Problem is also a great book, as it has plenty of code and tricks (NLP, image processing). If you want to learn a few tricks, you can try this book.
Also, check out the Titanic competition. It helps you learn how to do EDA, and it is a good start if you are planning to try Kaggle competitions. Even if you lose, you will definitely gain a lot of knowledge and know-how. That said, there is also a lot of hype around machine learning. Machine learning cannot be strictly correct. For the same input, the output is almost the same, but a machine learning system cannot be guaranteed to produce exactly the same output. It is also difficult to explain why models such as deep learning models are or are not correct.
Researchers continue to publish many state-of-the-art (SOTA) papers, but most of these SOTA methods are not the best. I have tried many SOTA methods, but most of them didn’t work. I believe that in the next ten years, research on multimodal and cross-modal learning might come out on top. Many researchers are working on image processing, NLP, signal processing, and so on. Each area is still largely independent, but they will be combined further.