Gilberto Titericz aka “Giba” is a force to be reckoned with in Kaggle circles, holding the highest number of gold medals (59) worldwide. The avid gamer also has serious street cred when it comes to RAPIDS/GPU tools.
“Even now, there are only 249 competing GMs in the world. To achieve this title, you need five gold and one solo gold medal in competitions, which are extremely difficult these days. Becoming a Kaggle GM was very good for my career,” said Gilberto.
Analytics India Magazine got in touch with NVIDIA’s senior data scientist to understand how an electrical engineer from UTFPR Curitiba, Brazil, established himself as a successful data scientist and consistently topped the charts in Kaggle competitions.
AIM: How did your fascination with algorithms begin?
Giba: My passion for data science dates back to my childhood, when I began studying electronics and programming languages. By the age of 10, I was already assembling electronic circuits. Then, at 16, I started writing assembly for Zilog Z80 and Intel MCS-51 microcontrollers. The turning point was perhaps in 1997, when I attended my first C class, which introduced me to the elusive world of programming languages.
I learned algorithms for the first time in a digital signal transmission class, where I was asked to simulate the error loss of binary signals transmitted over a noisy channel. It inspired me to look for new algorithms to solve the most diverse problems.
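A simulation of that kind can be sketched in a few lines of NumPy. This is only an illustration of the idea, assuming BPSK modulation over an additive white Gaussian noise channel — the original class exercise may well have used a different modulation or noise model:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_ber(snr_db, n_bits=100_000):
    """Monte Carlo estimate of the bit error rate for BPSK over an AWGN channel."""
    bits = rng.integers(0, 2, n_bits)     # random binary message
    symbols = 2 * bits - 1                # map {0, 1} -> {-1, +1}
    noise_std = np.sqrt(1 / (2 * 10 ** (snr_db / 10)))
    received = symbols + rng.normal(0, noise_std, n_bits)
    decoded = (received > 0).astype(int)  # simple threshold detector
    return np.mean(decoded != bits)

# The measured error rate drops as the channel gets cleaner (higher SNR).
ber_low_snr = simulate_ber(0)   # noisy channel
ber_high_snr = simulate_ber(8)  # cleaner channel
```

Counting how often the decoded bit differs from the transmitted one at different noise levels is exactly the kind of Monte Carlo experiment such a class assignment involves.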
Fast forward to 2009, soon after the subprime mortgage crisis, I started to study neural networks and coded some rudimentary MLP blocks to simulate stock market trades. At the time, I couldn’t get any positive returns in my simulated trades. A few years later, however, I found the reason – my validation strategy was completely wrong.
In 2011, I participated in the Google AI Challenge, coding AI agents for an online reinforcement learning contest. For months, I learned a bunch of new algorithms, and each of them multiplied my curiosity to learn more about RL and ML. After that, I found Kaggle via a web search and started competing immediately. And the rest, as they say, is history.
AIM: How did you deal with the teething troubles?
Giba: Before entering data science full time, I had systematically built a solid programming foundation during my engineering days.
If you don’t have programming skills yet, I’d recommend first studying a scripting language focused on data science, like R or Python.
People from other areas often have difficulty thinking in the structured way needed to write code, so it’s important to learn how to code because it’s a fundamental tool for data scientists.
Once I started studying data science, I found it very difficult to understand why there were so many different areas and what each one was for. Of course, it’s important to study all the areas, but I recommend specialising in a specific one.
For me, that “area” was applied machine learning for tabular data. So I started to learn online and used the knowledge to build solutions in Kaggle competitions. The good thing about competing in Kaggle is that you learn a lot. You can use ML to solve challenges and win competitions. Not only does it give you first-hand experience, but every single competition adds great value to your portfolio.
AIM: What excites you the most about coding?
Giba: I like to code algorithms that are time-efficient and highly accurate. This is because we have limited time to code algorithms in the real world, and there is a cost associated with maintaining ML models and generating inferences from them online. So the simpler the code, the easier it is to maintain, and the faster the algorithm, the cheaper it is to score new data points. At the same time, a highly accurate model usually translates into better services and often lifts metrics correlated with revenue.
AIM: What does your ML tool stack look like?
Giba: Much like the rest of the world, Python is my go-to language for data science projects, and my preferred IDE is JupyterLab. The libraries I use frequently are:
- Pandas, Matplotlib and Seaborn for EDA. They are great for visualisation.
- Pandas, NumPy, cuML and CuPy for data processing, data frames, matrix operations and feature engineering. This is everything we need to process data that fits in memory.
- XGBoost, LightGBM, cuML and PyTorch for machine learning on tabular data – fast and accurate algorithms.
- PyTorch for deep learning – extremely versatile, fast and customisable for everything.
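As a small illustration of the pandas/NumPy part of this stack, here is a sketch of a typical feature-engineering step on tabular data – computing per-group aggregates and merging them back as new columns. The table and column names are invented for the example, not taken from any real dataset:

```python
import pandas as pd

# Toy transactions table (columns are illustrative only).
df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2, 3],
    "amount":  [10.0, 30.0, 5.0, 7.0, 9.0, 100.0],
})

# Classic tabular feature engineering: per-user aggregate statistics
# merged back onto every row of that user.
agg = df.groupby("user_id")["amount"].agg(["mean", "max", "count"])
agg.columns = [f"user_amount_{c}" for c in agg.columns]
df = df.merge(agg, on="user_id", how="left")
```

Features like these per-group means, maxima and counts are a staple of tabular ML pipelines, regardless of which model (XGBoost, LightGBM, a neural net) consumes them afterwards.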
AIM: What advice do you have for those preparing for their first hackathons?
Giba: Hackathons are a perfect place to showcase your coding and thinking skills. Usually, you have to race against time to deliver the best solution to a given problem. So before joining a data science hackathon, I recommend getting used to the related tools. It’s unlikely you will win a hackathon without any previous experience. So part of the preparation is practising coding algorithms and using ML libraries; this can save you precious time during the event. Also, be prepared to set up your data science environment on your PC/notebook. But I think the most important trick comes with experience. If you know how to approach a problem properly, you can skip many experimentation steps and consequently have more time to run even more experiments.
AIM: What’s your biggest pet peeve about hackathons?
Giba: What I hate about hackathons is that you always run out of time before trying all the experiments in your mind. That’s why I first build a list of things to do when I start a hackathon. But unfortunately, my list is always long, making it nearly impossible to get through within the duration of the event.
AIM: What’s the worst experience you’ve ever had as a coder?
Giba: As a coder, I hate it when I start using a library, and then a new version comes along, changing the way I use functions, and my old code doesn’t work anymore. I know it’s part of the development process to keep improving libraries, but having to rewrite old code can be very annoying.
AIM: What drew you to Kaggle? Tell us about your journey so far.
Giba: What attracted me to Kaggle was the possibility of learning machine learning in a fun way through competitions. And it is really fun in the beginning because there is a game system where people are ranked according to a metric, and it can become addictive after a while. It is also possible to make friends through the community.
I joined Kaggle 10 years ago and have participated in 227 competitions. I won 59 gold and 47 silver medals. I also made many friends and learned most of what I know about machine learning there. I jumped to 1st place in the rankings in October 2015 and stayed there for a few years. Since then, I have changed my role from electrical engineer to data scientist, and I feel blessed to have the opportunity to work as a data scientist on the NVIDIA Grandmasters team.
AIM: What was your first Kaggle competition like?
Giba: When I joined Kaggle, there were some prediction competitions going on. Most were related to time series and power generation. I joined two of them, and the only machine learning tool I knew at the time was MATLAB. So I started studying the Neural Network Toolbox and put together a simple competition strategy: I trained several neural models and then averaged their predictions. I didn’t know it at the time, but this technique is known as bagging, and it helps diminish the variance of predictions (very useful for NNets). It helped me finish top 3 and top 11 in those competitions, earning the title of Kaggle Master (later GM) right at the beginning. I was able to do all this on an i5 notebook with 2 cores/4 threads/8 GB RAM.
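The variance-reduction effect of averaging that Giba describes can be demonstrated with a minimal NumPy sketch. The synthetic predictors below stand in for neural nets trained with different random seeds; none of this reproduces his actual models:

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 3.0
n_models, n_trials = 10, 5_000

# Each "model" is an unbiased but noisy predictor of the true value,
# standing in for one neural net in the ensemble.
preds = true_value + rng.normal(0.0, 1.0, size=(n_trials, n_models))

single_var = preds[:, 0].var()         # variance of one model's predictions
bagged_var = preds.mean(axis=1).var()  # variance of the averaged ensemble
```

With independent errors, averaging n models divides the prediction variance by roughly n, which is why bagging stabilises high-variance learners like neural networks.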
AIM: How do you ace Kaggle competitions? Could you share a few tips and tricks?
Giba: I have three tips for data science/Kaggle aspirants:
1. Determination: There is no free lunch. I spent many years learning and competing before getting to the top. I lost many medals in competitions, suffered from shakeups when the leaderboard switched to the private LB, and sometimes felt I would never master such techniques. What worked for me was that I never gave up. Even when my daughter was born and I had to spend most of my time taking care of her rather than learning and improving my skills on Kaggle, I still made time for it and achieved it.
2. Validation: Many people fail to define a proper validation strategy. Defining this strategy is not as simple as defining cross-validation or holdout folds. The idea is to define it so that it resembles the distribution and construction of the public and private test sets, so studying the datasets is often necessary. Also, run experiments to test your folds against the leaderboard and make sure your local validation strategy is correct. A typical mistake, for example, is using random cross-validation (KFold) when the dataset has a user_id variable and the same user_id appears in multiple rows. That way, the same user_id may end up in multiple folds, leaking part of the ground-truth information, overfitting to the train set and generalising badly to the test set.
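The user_id mistake above is usually fixed with a group-aware splitter such as scikit-learn’s GroupKFold, which keeps all rows of a given user inside a single fold. The dataset below is synthetic, purely to show the mechanics:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Synthetic dataset: 12 rows belonging to 4 users, several rows per user.
X = np.arange(24).reshape(12, 2)
y = np.zeros(12)
user_id = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])

# GroupKFold assigns every user entirely to one fold, so no user's rows
# leak from the training split into the validation split.
for train_idx, valid_idx in GroupKFold(n_splits=4).split(X, y, groups=user_id):
    assert set(user_id[train_idx]).isdisjoint(user_id[valid_idx])
```

A plain KFold on the same data would scatter each user across folds, letting the model memorise per-user signal and producing an optimistic local score that collapses on the private leaderboard.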
3. Think outside the box: Kagglers usually come up with novel/SOTA solutions to win competitions. Most novice Kagglers tend to copy popular notebooks, change something and submit, just to get used to the platform and write some code. But if everyone is copying other people’s solutions, then everyone will finish the competition in approximately the same place. What sets top Kagglers apart is the capacity to think outside the box and explore something different and better than everyone else. So running as many experiments as possible and trying different approaches can help you jump many places in the Kaggle rankings.