Kaggle Master and senior data scientist Mark Tenenholtz said in a tweet that, despite spending thousands of hours building ML models, 90 per cent of the models he used were ineffective. In the same thread, he listed his go-to baseline models for different types of data, underlining that a good baseline is a valuable asset when diagnosing and fixing issues with ML models.
For tabular data, Tenenholtz said that XGBoost, LightGBM, and random forest (RF) models are among the most commonly used baselines. Ensemble tree-based models can outperform neural networks on tabular data, and of these, XGBoost is the most popular choice among Kagglers.
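As a rough illustration of what such a baseline looks like in practice, here is a minimal sketch of fitting XGBoost on a toy tabular dataset. This is not Tenenholtz's code; the scikit-learn dataset and the hyperparameters are stand-in assumptions.

```python
# Minimal tabular baseline sketch with XGBoost (illustrative only).
# The toy dataset and hyperparameters are assumptions standing in
# for a real tabular problem.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Near-default hyperparameters are usually enough for a first baseline.
model = XGBClassifier(n_estimators=300, learning_rate=0.05, max_depth=6)
model.fit(X_train, y_train)

print("baseline accuracy:", accuracy_score(y_test, model.predict(X_test)))
```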
For time series data, he stated that the same models, XGBoost, LightGBM and RF, work best even though they were not built for time series. Tenenholtz explained that by setting a prediction horizon that is well-matched to the lag between the inputs and the output, the forecasting problem becomes easier to control, and the dataset can then be treated as tabular data.
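A hedged sketch of that reframing, assuming a synthetic univariate series, a handful of lag features and a seven-step prediction horizon (all illustrative choices rather than Tenenholtz's exact setup), might look like this:

```python
# Sketch: turn a univariate time series into a tabular regression problem.
# Lag features are the inputs; the value `horizon` steps ahead is the target.
# The synthetic series, lags and horizon are illustrative assumptions.
import numpy as np
import pandas as pd
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
series = pd.Series(np.sin(np.arange(500) / 10) + rng.normal(0, 0.1, 500))

horizon = 7                 # how far ahead we want to predict
lags = [1, 2, 3, 7, 14]     # past values used as features

df = pd.DataFrame({f"lag_{k}": series.shift(k) for k in lags})
df["target"] = series.shift(-horizon)   # value `horizon` steps in the future
df = df.dropna()

train, test = df.iloc[:-100], df.iloc[-100:]   # simple time-ordered split
model = XGBRegressor(n_estimators=300, max_depth=4)
model.fit(train.drop(columns="target"), train["target"])
preds = model.predict(test.drop(columns="target"))
```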
For image datasets, ResNet and EfficientNet-B0 (EffNet-B0) are small, fast models that are effective for nearly any type of image data. A big advantage of these models is that they can be scaled up for greater accuracy.
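A minimal sketch of such an image baseline, assuming a recent torchvision and an arbitrary ten-class task, could look like the following; only the final layer is replaced, and the rest of the pretrained network is reused.

```python
# Sketch of an image baseline: a pretrained ResNet with a new classification
# head. The class count, input size and backbone choice are assumptions.
import torch
import torch.nn as nn
from torchvision import models

num_classes = 10
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, num_classes)  # swap the head

# One forward pass on dummy data to check shapes; real training would loop
# over a DataLoader with an optimizer and a loss function.
dummy = torch.randn(4, 3, 224, 224)
print(model(dummy).shape)  # torch.Size([4, 10])
```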
For text datasets, DistilRoBERTa is his pick. The model offers a good balance of speed and accuracy, and its accuracy improves further when it is scaled up.
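As a rough illustration, a DistilRoBERTa baseline can be set up in a few lines with the Hugging Face transformers library; the two-label setup and example sentences below are assumptions, and the classification head still needs fine-tuning before its predictions mean anything.

```python
# Sketch of a text-classification baseline with DistilRoBERTa.
# The number of labels and the example sentences are illustrative assumptions;
# the classification head is freshly initialised and untrained.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilroberta-base", num_labels=2
)

batch = tokenizer(["great movie", "terrible plot"], padding=True,
                  return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits
print(logits.argmax(dim=-1))  # arbitrary until the model is fine-tuned
```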
For audio datasets, the best baselines are again ResNet and EffNet. Tenenholtz justified using image models for audio: he said he starts audio problems by converting the audio to a spectrogram and feeding it to an image model.
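A hedged sketch of that spectrogram approach, using synthetic audio, torchaudio's mel-spectrogram transform and a ResNet backbone (all assumptions rather than Tenenholtz's exact setup), is shown below.

```python
# Sketch: convert audio to a mel spectrogram and feed it to an image model.
# The synthetic waveform, sample rate, number of classes and backbone are
# illustrative assumptions.
import torch
import torchaudio
from torchvision import models

sample_rate = 16000
waveform = torch.randn(1, sample_rate * 2)   # 2 seconds of fake audio

to_mel = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate,
                                               n_mels=128)
spec = to_mel(waveform)                      # shape: (1, n_mels, time_frames)
spec = torch.log(spec + 1e-6)                # log scaling, common in practice
spec = spec.unsqueeze(0).repeat(1, 3, 1, 1)  # repeat to 3 channels for the CNN

model = models.resnet18(weights=None, num_classes=5)  # class count is assumed
print(model(spec).shape)  # torch.Size([1, 5])
```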