“My ultimate dream is to build a next generation healthcare system with the help of AI technology.”
For this week’s ML practitioner’s series, Analytics India Magazine got in touch with Hiroshi Yoshihara, a Kaggle competition master and a machine learning engineer whose work is focussed on public health. He works at Aillis Inc., a Japan-based startup developing AI-powered medical devices for early and accurate detection of influenza.
As a machine learning engineer, Hiroshi and his team developed algorithms to process medical images. He is also a full-time doctoral student in public health at Kyoto University. In this interview, Hiroshi shares a few insights from his data science journey.
AIM: Can you tell us a bit about the beginning of your data science journey?
Hiroshi: I started learning to program when I was in middle school, and soon I got obsessed with competitive programming. After I entered university, I joined the university’s robot competition team as a robot programmer, and its synthetic biology competition team as a mathematical modelling engineer. My journey in machine learning started quite unexpectedly. A friend of mine from Russia mistakenly asked me to join an international machine learning competition when I knew almost nothing about machine learning. I was taking a machine learning course at the university, but I could hardly find it interesting until my first participation in the competition.
The first task was a binary classification of particles from real data of a CERN (The European Organisation for Nuclear Research) experiment. I still remember my excitement to load huge data, add massive amounts of features, tune hyperparameters, and post-process predictions — all for the first time in my life.
I received a bachelor’s and a master’s degree in health economics from the University of Tokyo. I worked on large-scale medical claims data analysis and health technology assessment.
Alongside, I studied the basics through Andrew Ng’s course on Coursera and textbooks from O’Reilly, as many other Kagglers did. But I believe the most important learning came from the discussions and notebooks in each Kaggle competition.
There are many good courses and books on machine learning. With regards to Kaggle specifically, I would like to recommend the following:
- “How to Win a Data Science Competition: Learn from Top Kagglers” by National Research University Higher School of Economics
- “Approaching (Almost) Any Machine Learning Problem” by Abhishek Thakur
- “Kaggleで勝つデータ分析の技術” (“Data Analysis Techniques to Win Kaggle”) by Daisuke Kadowaki et al. (in Japanese)
- “PythonではじめるKaggleスタートブック” (“The Kaggle Start Book: Getting Started with Python”) by Shotaro Ishihara and Hideki Murata (in Japanese)
While courses and textbooks taught me the essential theories, Kaggle discussions and notebooks taught me how to correctly use those theories to tackle a specific problem. I would say that both the courses and textbooks and the Kaggle platform helped me a lot.
AIM: Can you tell us about your Kaggle journey?
Hiroshi: The quality of competitions, the diversity of tasks, and the number of participants make Kaggle a second-to-none platform. Kaggle is often referred to as a platform for aspiring data scientists to learn machine learning and build their portfolio. While I believe the statement is absolutely true, I would like to emphasise the fact that you can learn many practical skills which are essential to tackle real-world problems in business or research. These skills include making a reliable validation scheme, avoiding overfitting, handling various data structures etc.
Of course, there are many other things which you can learn only in industry, but the skills you learn from Kaggle come in handy. I participated in many other competitions in various academic fields before I joined Kaggle. Each competition has its challenges, but those practical aspects are what distinguish Kaggle from the others.
“Once I covered most of the famous techniques, I realised my performance on Kaggle had significantly improved.”
When I started to participate in Kaggle competitions, the biggest challenge was to catch up on Kaggle-specific techniques. Many techniques, such as ‘test time augmentation’, ‘pseudo-labelling’ and ‘adversarial validation’, were not covered in typical machine learning textbooks.
To catch up on the latest Kaggle-related techniques, I googled every new term I found in discussions, and in most cases, previous discussions on Kaggle taught me what they meant. Once I covered most of the famous techniques, I realised my performance on Kaggle had significantly improved.
AIM: What does your approach to a data science problem look like?
Hiroshi: As many other Kagglers might say, the first step is always a careful EDA, which is all about answering questions like these:
- Is there anything strange in the problem setting?
- Is the evaluation metric hackable?
- Is there any specific pattern in the dataset?
- Is there any post-processing which improves the metric?
This process is often compared to ‘torturing the data’, and it actually should be so. After I go through the data and the objective of the competition, I often look for previous competitions with similar data or a similar problem setting, as previous competitions provide a lot of information.
Also, I usually search for papers on related topics and read discussions on Kaggle. The next thing to do is to build a simple baseline model and a reliable validation scheme. There might be groups we need to separate, or there might be variables we need to stratify during cross-validation. A high correlation between local validation score and test score suggests your scheme is good. But sometimes there is no or very low correlation between cross-validation (CV) score and test score. There is a cliché ‘trust CV’ in Kaggle, but I think whether to trust CV or test score is up to the problem setting.
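The two checks described above, keeping groups separated across folds and comparing local CV scores with test scores, can be sketched in a few lines. This is only a minimal illustration in plain Python, not the scheme Hiroshi actually uses: the function names `group_kfold` and `pearson`, the patient IDs, and all the scores are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def group_kfold(groups, n_splits):
    """Yield (train_idx, valid_idx) pairs so that no group is split across both."""
    members = defaultdict(list)
    for idx, group in enumerate(groups):
        members[group].append(idx)
    # Greedy balancing: place the largest groups first into the smallest fold.
    folds = [[] for _ in range(n_splits)]
    for group in sorted(members, key=lambda g: (-len(members[g]), str(g))):
        smallest = min(range(n_splits), key=lambda i: len(folds[i]))
        folds[smallest].extend(members[group])
    for i in range(n_splits):
        valid = sorted(folds[i])
        train = sorted(j for k in range(n_splits) if k != i for j in folds[k])
        yield train, valid


def pearson(xs, ys):
    """Pearson correlation between two equally long score lists."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: six samples coming from three patients.
patient_ids = ["a", "a", "b", "b", "c", "c"]
splits = list(group_kfold(patient_ids, n_splits=3))

# Made-up scores of five past submissions: local CV vs. public test score.
cv_scores = [0.71, 0.74, 0.78, 0.80, 0.83]
lb_scores = [0.69, 0.73, 0.75, 0.79, 0.81]
cv_lb_corr = pearson(cv_scores, lb_scores)
```

A correlation close to 1 over several submissions would suggest the local scheme tracks the test score; a value near zero is the situation in which ‘trust CV’ becomes a judgment call. In practice a library splitter would likely replace the hand-written one.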
“After the competition is over, no matter my place on the leaderboard, I always read the top solutions and compare them with mine.”
The last and most interesting step is trial-and-error. Based on what I have learned so far, I always make a TO-DO list and try the items one by one. The point here is to make the list before doing experiments, or I might get stuck on a specific idea and fail to see the big picture.
Another important thing is to check for reproducibility. I define experimental variables such as model, augmentation, image size etc., then make a configuration class object containing those variables. The main script loads the configuration class and runs the experiment.
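A configuration object of this kind might look like the sketch below. It is an assumption-laden illustration rather than Hiroshi’s actual code: the field names, their default values, and the `run_experiment` function are hypothetical, and the “training” is a placeholder that only draws a seeded random number.

```python
import random
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Config:
    """Hypothetical experiment configuration; all fields are illustrative."""
    name: str = "baseline"
    model: str = "resnet34"
    augmentation: str = "flip"
    image_size: int = 256
    seed: int = 2020


def run_experiment(cfg: Config) -> dict:
    """Main script stand-in: load the configuration and run one experiment."""
    rng = random.Random(cfg.seed)  # seed every source of randomness from the config
    score = rng.random()           # placeholder for real training and validation
    return {"config": asdict(cfg), "score": score}


baseline = Config()
larger = Config(name="larger_images", image_size=512)

first_run = run_experiment(baseline)
second_run = run_experiment(baseline)  # same config, so the same result
```

Because everything that varies between experiments lives in one immutable object, rerunning a configuration reproduces its result, and storing the `config` dictionary next to the score records exactly what was run.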
Like many other Kagglers, I use Python, LightGBM, CatBoost, PyTorch and so on. I even developed a package called kuma_utils, a wrapper that simplifies and automates the training and validation process of those libraries, to run experiments efficiently.
Once the competition is over, regardless of my leaderboard position, I always read the top solutions and compare them with mine.
The competition which left the strongest impression is actually the most recent one I joined, and the one in which I got my first gold medal: PANDA. The task was to classify a whole slide image (WSI) of a prostate biopsy into an ordinal grade. There were roughly two challenges: the first one was how to feed an extremely large WSI to CNN models; the second one was how to train a model that is robust to label noise, which was present only in the training dataset.
A clear answer to the first challenge was given by Kaggle Grandmaster @iafoss in a discussion: extract small patches from a WSI, pass the patches to a CNN, concatenate all the features derived from those patches, and then predict the target from the concatenated features.
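The flow of that pipeline (patches, per-patch features, concatenation, prediction) can be illustrated with a toy sketch. It is not @iafoss’s implementation: the “slide” is a tiny 2D list, `patch_features` is a hypothetical stand-in for a CNN that returns only the mean and maximum intensity, and `predict_grade` is an assumed linear head.

```python
def extract_patches(image, patch_size):
    """Split a 2D image (list of rows) into non-overlapping square patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            rows = image[top:top + patch_size]
            patches.append([row[left:left + patch_size] for row in rows])
    return patches


def patch_features(patch):
    """Stand-in for a CNN backbone: returns [mean, max] of the patch."""
    pixels = [value for row in patch for value in row]
    return [sum(pixels) / len(pixels), max(pixels)]


def slide_features(image, patch_size):
    """Concatenate the features of every patch into one vector."""
    features = []
    for patch in extract_patches(image, patch_size):
        features.extend(patch_features(patch))
    return features


def predict_grade(features, weights, bias=0.0):
    """Hypothetical linear head on top of the concatenated features."""
    return sum(w * f for w, f in zip(weights, features)) + bias


# Toy 4x4 "slide" that splits into four 2x2 patches.
slide = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 3, 3],
    [2, 2, 3, 3],
]
features = slide_features(slide, patch_size=2)
grade = predict_grade(features, weights=[1.0] * len(features))
```

The point of the design is that the backbone only ever sees small patches, so memory use is set by the patch size rather than by the size of the whole slide.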
For the second challenge, I searched academic publications related to label noise. I read around 30 papers and implemented around ten methods during the competition to see which one performed the best. Surprisingly, a simple method which excludes uncertain data during training outperformed all the more complicated methods, including state-of-the-art ones.
Finally, my teammates’ and my efforts paid off, and we finished in 6th place. I learned a lot from this competition, especially the importance of domain knowledge, and the gap between performance on frequently used benchmark datasets and on real-world datasets. Furthermore, the knowledge I gained from this competition helped me a lot with my tasks at my company.
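The interview does not say how ‘uncertain’ data was identified, so the sketch below rests on an assumption: a sample is treated as uncertain when its label disagrees strongly with an out-of-fold prediction from an earlier model. The function name `select_confident`, the threshold, and the numbers are all hypothetical.

```python
def select_confident(labels, oof_predictions, max_gap):
    """Return indices of samples whose label and out-of-fold prediction agree.

    Assumption: a large gap between label and prediction marks a noisy label.
    """
    return [
        idx
        for idx, (label, pred) in enumerate(zip(labels, oof_predictions))
        if abs(label - pred) <= max_gap
    ]


# Made-up ordinal labels and out-of-fold predictions for five samples.
labels = [0, 1, 5, 3, 4]
oof_predictions = [0.2, 1.1, 1.4, 2.7, 4.3]

keep = select_confident(labels, oof_predictions, max_gap=1.5)
clean_labels = [labels[i] for i in keep]  # retrain on this subset only
```

Here the third sample (label 5, predicted 1.4) is dropped before retraining, while the other four are kept.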
The solution can be found here.
AIM: What do you think about the current state of ML?
Hiroshi: Image classification, credit scoring, recommendation systems: in many simple tasks like these, machine learning has either outperformed humans or performs well enough. I believe the potential of AI in the next ten years lies in the intersection of AI and other fields such as physics, biology, math etc.
In my opinion, in the near future machine learning techniques will be used to support human decision making in many fields such as radiology, instead of replacing humans. This is because we need someone to take responsibility for every decision, and machines are not capable of that.
For instance, there are many myths around Kaggle. The most common ones are lack of relevance to real-world problems and lack of academic value. The first myth is probably due to the competition problem settings, which are sometimes overly idealised. The key here is that even if a competition looks too idealistic, there must still be challenges in it, and that’s why the host pays for it. Every year, many companies and organisations pay Kaggle to host competitions, because they know how to translate real-world problems into a competition problem which is not simple but of great relevance.
The second myth is partly true. Kaggle is not an academic society. It is not likely that innovative architectures such as ResNet or AlphaFold2 could have been invented in a Kaggle competition. But I believe that what Kaggle has and academic societies lack is practical value. A state-of-the-art method proposed in an academic society based on a benchmark dataset often does not perform so well in Kaggle competitions, in other words, on external datasets. Here, Kaggle plays an important role as a bridge between state-of-the-art academic research and real-world problems.
The machine learning field nowadays is so dynamic that it is virtually impossible to keep up with all the new methods, but we should always keep our eyes open for them. A machine learning engineer who has cutting-edge knowledge is good, and one who constantly keeps up with cutting-edge knowledge is great.