For this week’s ML practitioner’s series, we got in touch with Kaggle Grandmaster Martin Henze. Martin is an astrophysicist by training who ventured into machine learning out of a fascination with data. His notebooks on Kaggle are a must-read, drawing on his decade-long expertise in handling vast amounts of data. In this interview, Martin shared his perspective on making it big in the machine learning industry as an outsider.
About His Early Days Into ML
Martin Henze is an astrophysicist by training and holds a doctorate in Astrophysics. He spent the better part of his academic career observing exploding stars in nearby galaxies. As an observational astronomer, his job was to work with different types of telescope data and to extract insights from distant stars. The data generated in experiments related to deep space is literally astronomical. For example, the black hole that was imaged last year generated data that filled half a tonne of hard drives, and it took more than a year and many flights to move the data and stitch it together. Martin, too, is no stranger to this kind of data.
As part of his master’s thesis, he had to sift through a large archival dataset containing images of hundreds of thousands of stars, taken over a time range of 35 years, to discover the signatures of distant stellar explosions.
Back then, data science as a domain had not yet gained traction, and Martin was using MIDAS to crunch time-series data. “At the time, I knew very little about coding in general and I was working with an astro-specific, Fortran-based language called MIDAS and it was terribly slow,” explained Martin. “One of my main tasks was to create a time series of the luminosities of all the detectable stars. I estimated that my first working prototype would take one and a half years to run on my local machine – significantly more time than I had left in my 1-year project. Coming up with different optimisation tricks and reducing the runtime to 3 weeks (on the same machine) was a great puzzle to solve, and it taught me a great deal about programming structures. I also learned something valuable about incremental backups after the first of these 3-week runs was crashed by a power outage,” he added.
“Studying Physics gave me a solid foundation in mathematics beyond the key Algebra and Vector Calculus concepts needed for ML.”
Though the ML aspects of the project were mostly confined to regression fits, this was Martin’s first step into the world of machine learning.
His zeal for deciphering data helped him take the leap from academia to industry. Currently, Martin works as a Data Scientist at Edison Software, a consumer technology and market research company based in Silicon Valley. He is part of a team that developed a market intelligence platform that helps enterprise customers understand consumer purchase behaviour.
For most of his academic career, Martin worked with tools like decision trees, PCA, or clustering. It was not until he joined Kaggle that he learned about state-of-the-art methods. “Kaggle opened my eyes not only to the full spectrum of exciting ML algorithms, but also to all the different ways to use data to understand our world – not just the distant universe,” said Martin.
On His Kaggle Journey
“I remember feeling a little overwhelmed and having difficulties to decide where and how to get started.”
Martin joined Kaggle to learn more about ML and to use these tools for his astrophysics projects. Though he had working experience with techniques like regression and decision trees, seeing sophisticated tools like XGBoost and neural networks on Kaggle, alongside the large model stacks some people were building, intimidated him. So, to fill the gaps, Martin started reading other people’s Kernels, code, and discussions. He also advises newcomers to go through the scikit-learn documentation, which he thinks is underrated.
He also recommends the following books:
- “Introductory Statistics with R” by Peter Dalgaard.
- “R for Data Science” by Garrett Grolemund and Hadley Wickham.
- “Hands-On Machine Learning with Scikit-Learn and TensorFlow” by Aurélien Géron.
- “Approaching (Almost) Any Machine Learning Problem” by Abhishek Thakur.
When it comes to programming languages, R is Martin’s go-to. Shortly after learning R, he picked up Python through an introductory (in-person) course. Though the ML community continues to be divided over the choice of programming language, Martin believes that R and Python have a lot of potential to complement one another.
“Python libraries were closer to my astronomical data – by providing input/output interfaces for many astro-specific data formats – while I used R to analyse the meta-properties of the extracted data,” explained Martin.
That said, Martin confessed that a large part of his work is data exploration, which he runs on a local machine with R in RStudio. “RStudio is a fantastic IDE for which I have yet to see an equivalent in Python. For ML in R there is the promising new tidymodels framework; still in active development but with a pretty cool philosophy. I’m starting to use tidymodels for many projects which I had previously wrapped up with short scikit-learn pipelines,” said Martin.
When asked how he would approach a data science problem, Martin said that he always starts his projects with a comprehensive exploratory data analysis (EDA). “I’m a visual learner and my EDA typically includes lots of plots that help me scrutinise the relationships and oddities within the data. I think that it is a mistake to jump too quickly into modelling,” explained Martin.
For real-world data, this EDA step will usually include quite a bit of data cleaning and wrangling. While it can be tedious, Martin thinks that data cleaning provides important information on the kind of challenges your model might face on unseen data. “Question your assumptions carefully and you will gain a better understanding of the data and the context in which it is extracted,” he advised.
Here is Martin’s three-step guide:
- Try to build an end-to-end pipeline as quickly as possible: the basic preprocessing, a simple baseline model or slightly better, and getting the outputs in shape for their intended downstream use.
- Then iterate over the different parts of the pipeline, again focussing on cycling quickly through the first iterations.
- Try to talk frequently with the teams that will use your predictions, to figure out which level of sophistication is needed. Don’t lose sight of the bigger picture.
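To make the first step concrete, here is a minimal sketch of such an end-to-end pipeline in plain Python. It is an illustration only, not Martin’s own code: the record layout (`id`, `target`), the function names, and the choice of a mean predictor as the baseline are all assumptions made for the example.

```python
from statistics import mean


def preprocess(rows):
    """Basic cleaning: drop training records whose target is missing."""
    return [r for r in rows if r.get("target") is not None]


def fit_baseline(rows):
    """Simplest possible baseline: always predict the mean training target."""
    avg = mean(r["target"] for r in rows)
    return lambda row: avg


def format_output(rows, model):
    """Shape predictions for a (hypothetical) downstream consumer."""
    return [{"id": r["id"], "prediction": round(model(r), 2)} for r in rows]


def run_pipeline(train_rows, new_rows):
    """Chain the three stages so each one can later be swapped out."""
    clean = preprocess(train_rows)
    model = fit_baseline(clean)
    return format_output(new_rows, model)


# Hypothetical toy data: one training record has a missing target.
train = [
    {"id": 1, "target": 10.0},
    {"id": 2, "target": None},
    {"id": 3, "target": 14.0},
]
new = [{"id": 4}, {"id": 5}]

predictions = run_pipeline(train, new)
print(predictions)
```

Because each stage is a separate function, the second step of the guide amounts to replacing one stage at a time, for example a better model in `fit_baseline`, while the rest of the pipeline keeps running.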
On The Future Of ML
“Least Squares Regression has been around since the time of Gauss and will always be relevant.”
For Martin, machine learning in its modern incarnation is a relatively young field, which makes it more difficult to extrapolate from history. But he expects that fundamental techniques like gradient descent or backpropagation will still be relevant in the future. Least squares regression, he believes, will always be relevant as a first baseline model.
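For readers who have not seen it written out, a least squares baseline for a single input variable fits in a few lines. The following is a minimal sketch using the standard closed-form solution; the function name and the toy data are assumptions for illustration and do not come from the interview.

```python
def fit_least_squares(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares (one feature)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Closed-form solution: covariance of x and y divided by variance of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Hypothetical toy data lying exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_least_squares(xs, ys)
print(slope, intercept)
```

A fit like this gives a reference score that any more sophisticated model should beat before its extra complexity is justified.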
Martin also warned that progress in a field rarely takes the shape of a monotonic increase. “A few (or even many) dead ends and failed experiments are to be expected along the way; that’s just the nature of it. The more we explore the parameter space of ML – even away from popular techniques – the better and more robust the surviving methods will be,” he explained.
“There is a tendency for domains to move closer to the way in which we experience the world. For instance, NLP deals more directly and flexibly with language, instead of having to go through another abstraction level where language characteristics are first translated into generic numerical features which are then modelled. Similar for Computer Vision. I have a feeling that whichever domain manages to deal with very diverse input data in a flexible yet robust way has a good chance of coming out on top,” said Martin.
“The old adage of ‘garbage in – garbage out’ is as relevant as ever in ML.”
When asked about the hype around machine learning, Martin quipped that it’s important to remember that ML is not magic, and that even the most sophisticated model is at its core an abstract description of training data. “If you’re not careful, then all the bias inherent in your data will be reflected in the model. The old adage of ‘garbage in – garbage out’ is as relevant as ever in ML; probably even more so if your model lacks interpretability,” he added.
Talking about how overwhelming machine learning can be for the beginner due to hype, Martin cited his own example of coming from a non-software engineering background. He also believes that the most underrated skills for ML engineers are not necessarily to be found in the technical domain.
A few tips for beginners:
According to Martin, the rule of thumb is to know where the model fits in the overall business pipeline, and to learn from those who provide the data and those who will use the model. This, in turn, will help you better understand your data and how to handle it.
On a concluding note, Martin said that the best way to overcome any challenge is to get started on some small and well-defined aspect of it. “Sure, you don’t want to jump blindly into a problem; and a little bit of preparation can have a large payoff. But you also don’t want to overthink and over-optimise your approach, and become overwhelmed before you even begin.”
“Consistency is key. It’s a bit of a cliche by now, but a small amount of progress every day really will accumulate pretty quickly and will give you noticeable improvements in months or even weeks. But you gotta do it every day, that’s the hard part,” said Martin.