“I would be happy if medical AI becomes more and more popular in the next ten years.”
For this week’s ML practitioners series, Analytics India Magazine got in touch with Kaggle GM Okoshi Takumi. Okoshi is ranked 55th in the Kaggle global rankings and currently works as a data scientist at Rist, an AI company based in Japan. In this interview, Okoshi talks about how his love for baseball led him to data science. He also has some tips for aspirants who are looking to break into the field of data science or make it to the top of Kaggle.
AIM: How did your Data Science journey begin?
Okoshi: I played baseball when I was a kid. At the time, I kept track of my hitting records in Excel and looked at the performance of professional baseball players, and I enjoyed it. Looking back, I think I’ve loved data ever since. Once I got to high school, that fever grew even more, and inspired by Moneyball-style analysis, I was able to analyse data from Japanese professional baseball. I analysed who the most valuable hitters were and whether there were any hidden star candidates on my favourite team.
I enjoyed that analysis so much that I looked for a university where I could study statistics, and to this day, I enjoy analysing data very much.
In college, I learned about statistics and machine learning, and my research involved using machine learning to decipher the relationship between cheering and outcomes in sports. I’ve read a lot of books. Many of them have been translated into Japanese, or are in Japanese to begin with, and O’Reilly’s books are very good. I also studied Coursera’s Andrew Ng course and Stanford University’s CS lecture videos.
Recently, a book of tips from Kaggle masters was published; it is still only available in Japanese but has been very helpful to Kagglers in Japan. Several books on Kaggle are also available in Japan, making it easier for newcomers to get started. One of them, the “Kaggle Start Book”, was written and reviewed by my teammates from the Petfinder competition.
Currently, I work for Rist, a Japanese IT company, where our team develops AutoML solutions that will be used internally. Apart from this, I spend part of my time on Kaggle.
AIM: Tell us about your Kaggle journey
Okoshi: The first Kaggle challenge I tackled was Porto Seguro’s Safe Driver Prediction, which involved predicting whether or not a car insurance policyholder would have an accident. The features were anonymised, so it was very difficult. I managed to get a bronze medal with the help of public kernels.
Kaggle appeals to me for three main reasons:
Firstly, the competitive format: you compete with other participants on score, so you always know where you stand. It’s nice when you move up the rankings, and you get to experiment and learn a lot along the way.
Secondly, coming into contact with data on many different topics. Images, text, audio, and tabular data each have their own interesting aspects, and I enjoy getting a taste of them all! Multi-modal competitions such as Petfinder and Avito were especially fun because I was exposed to different types of data in a single competition. The competition topics were all interesting too; even when they were in areas I knew nothing about, working on them made me a little more knowledgeable about the field.
Thirdly, getting hold of the solutions shared by the top solvers. Even when my ranking is not good, I can look at the winners’ solutions after a competition and reflect on what was lacking. I try to recreate the top solutions from competitions I have completed, and even in competitions where I lost, I always try to turn them to my own benefit.
On Kaggle, the current methods are discussed in kernels and discussions during a competition, and solutions are released afterwards, so I feel it is a learning experience for both beginners and experts.
Both the participants and Kaggle’s platform place great importance on learning, which is very appealing to me. The discussions and kernels, and of course the sharing of solutions afterwards, make for a great culture. I try to write up solutions for the competitions where I finish near the top, and I hope I can contribute to this culture in some small way.
AIM: What does it take to be at the top?
Okoshi: I’m still a novice ML engineer myself. In particular, I feel it’s very difficult to think about what AI products users want. Of course, the development skills I gained on Kaggle are beneficial in my work, and once I get into the development and experimental phases, I can create solutions quickly. However, I feel I face a particular challenge in other areas, especially in task design, and I believe that learning those areas will help me become a “great” ML engineer.
I have participated in a lot of competitions, about 40 in total, and since Porto Seguro, I have been working on Kaggle competitions almost without a break. Every time I participate in a competition, I discover something new, which I can use to improve in the following competitions. In due course, I was able to win more medals and become a Kaggle Grandmaster.
It’s also important to have an organised methodology. In my case, I maintain a pipeline for competitions, and whenever a competition introduces a new method, I try to incorporate it into that pipeline. That way, I can easily try those methods in the next competition, and the pipeline and I get stronger with each one. This pipeline development has also fed into our AutoML work.
Given a data science problem, I first focus on creating a benchmark using my pipeline. Once I get a score, I compare it with other participants’ scores; if it’s low, I try to identify the missing parts, and if it’s high, I generate new ideas. To do this, I gather information from the competition’s Overview section, the discussions, and papers. Alongside this, I do EDA to check the data.
From this point onwards, it is all about repeating the cycle of hypotheses, ideas and experiments to raise the score. Towards the end of the competition, I build an ensemble and make a final push. The key is to create benchmarks quickly and experiment with new ideas, and so far my pipeline-building routine has been successful!
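The benchmark-first routine described above can be sketched in a few lines. This is a minimal, illustrative example, not Okoshi’s actual pipeline: it uses scikit-learn’s cross-validation with a default gradient-boosting model as the quick baseline (in a real tabular competition one might swap in LightGBM), and the function name `run_benchmark` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score


def run_benchmark(X, y, model=None, folds=5):
    """Quick baseline: K-fold cross-validate a default model and report mean AUC."""
    model = model or GradientBoostingClassifier(random_state=0)
    scores = cross_val_score(model, X, y, cv=folds, scoring="roc_auc")
    return scores.mean()


# Toy data standing in for real competition data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
baseline_auc = run_benchmark(X, y)
print(f"baseline CV AUC: {baseline_auc:.3f}")
```

With a baseline score in hand, each new idea (a feature, a model tweak) becomes a small experiment compared against this number, which is exactly the hypothesis–experiment loop described above.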
For now, I think it’s a good idea to simply participate in a Kaggle competition. At first, you can build a benchmark using public kernels as a reference, and then you can take part in discussions and other kernels.
In Japan, there is a beginners’ blog post (in Japanese) called “What to do next after signing up for Kaggle”, and I think beginners will learn a lot just by following it.
I also recommend the “Kaggle Start Book”, which you should read when starting out on Kaggle so that you can work on competitions smoothly. If you’d like to read it in English, contact the author!
AIM: What does your machine learning toolstack look like?
Okoshi: I write code in Python. I mainly use the PyTorch framework for competitions that need neural networks, and LightGBM for tabular competitions. Of course, I’m indebted to many other tools such as pandas, NumPy, scikit-learn, Albumentations, etc., and for images I recommend timm. For accessing pre-trained models in PyTorch, check this. For computing, I use GCE machines. My company also covers Kagglers’ machine fees, which is very helpful.
AIM: What do you think the future of ML would be like?
Okoshi: That’s a very difficult question. I think that machine learning products, like any other product, will depend on what users want. I started my own business when I was a college student, and back then, I couldn’t figure out what users wanted, and I couldn’t create something I could say with any confidence was good. Markets simply don’t accept that. As a result, I abandoned the company and started doing something else. Looking back, I think it’s important to create something that users want and, honestly, something that can be marketed with confidence as a good product.
In ML, I still think the problem of accuracy is deep-rooted. In many cases, there is a gap between the accuracy we can achieve with AI and the accuracy users expect, and both parties end up unhappy. We need to find fields where that gap is as small as possible and where the accuracy AI can achieve is sufficient for the service.
That said, I have high hopes for the medical industry myself. There are a lot of medical competitions on Kaggle, but I think we’re just starting to make progress in this field. I would be happy if medical AI becomes more and more popular in the next ten years, and I would gladly participate in those medical-related competitions.