Tree-based machine learning models are widely used for their easy interpretation and their ability to handle high-dimensional data. However, they raise concerns in certain use cases, where traditional machine learning models perform better and converge faster. So in this article let us see when not to use tree-based models in machine learning, and the factors that argue against them.
Table of Contents
- What are tree-based models?
- When not to use tree-based models?
What are tree-based models?
As the name suggests, tree-based models have the overall structure of a tree. The model starts from a root node, the branches beneath it form sub-trees, and the terminal points of those sub-trees are the leaf nodes. A single tree in machine learning is termed a Decision Tree, and a forest of such trees is a Random Forest. Tree-based models can be used for both regression and classification tasks.
Tree-based models resemble a flow chart, with a condition tested at each step. An overview of tree-based models is shown below.
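As a concrete sketch of this flow-chart structure (assuming scikit-learn is installed), a small tree fitted on the Iris dataset can be printed as nested if/else conditions:

```python
# Fit a small decision tree and print its flow-chart-like rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Limit the depth so the printed "flow chart" stays readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text renders each split as a condition like "feature <= threshold".
print(export_text(tree, feature_names=load_iris().feature_names))
```

Each internal line is one condition of the flow chart, and each `class:` line is a leaf node where the model emits its prediction.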
When not to use Tree-based models?
Let us understand the limitations of the most commonly used tree-based machine learning models: the Decision Tree and the Random Forest. A Decision Tree is a supervised machine learning algorithm that can be used for regression or classification tasks, and, unlike some other models, it comes with certain limitations. Let us understand them in detail.
1. High performance is required in regression analysis
Regression tasks are statistical analyses performed on the features of a dataset to predict a continuous outcome. In the presence of many features, a decision tree may overfit the training set, and its depth grows with the dimensionality of the data. On low-dimensional data it may instead underfit, since it converges with fewer branches, and during tree construction it may latch onto the wrong correlated feature.
So tree-based models should not be used for regression tasks where high performance is required: they break the data down into ever smaller subsets without considering the correlation between features, which makes them less effective at predicting outcomes correctly and prone to information loss. They also tend to yield lower accuracies on relatively small datasets with little noise and uncertainty.
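One way to see the regression limitation concretely (a sketch assuming scikit-learn and NumPy): a decision tree fits a piecewise-constant function, so it cannot extrapolate even a simple linear trend beyond the training range, while plain linear regression can.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(200, 1))
y_train = 3.0 * X_train.ravel() + rng.normal(0, 0.5, size=200)

# Test points OUTSIDE the training range [0, 10].
X_test = np.linspace(11, 13, 50).reshape(-1, 1)

tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
lin = LinearRegression().fit(X_train, y_train)

# Every x > 10 falls into the tree's right-most leaf, so the tree
# predicts one constant value for the whole extrapolation region.
print("tree predictions:", np.unique(tree.predict(X_test)))
print("linear predictions:", lin.predict(X_test)[:3])
```

The tree returns the same leaf value for all unseen-range inputs, while the linear model continues the trend; this is the kind of regression task where tree-based models fall short.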
2. Overfitting must be avoided
Tree-based models are particularly prone to overfitting when used with high-dimensional data. If the model is designed to fit the training data perfectly, it will overfit and not generalize well to the test data. So while designing a decision tree, parameters such as the depth of the tree and the split conditions at each node must divide the samples properly among the leaf nodes; otherwise the model's accuracy will turn out to be very low.
Tree-based models should therefore not be used for high-dimensional data: with more features the tree tends to grow deeper, memorizing the training data, which leads to overfitting and poor performance on test or unseen data. On relatively large datasets, the tree grows to its full depth, memorizes the small sample splits, and overfits the data.
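The memorization effect above can be sketched in a few lines (assuming scikit-learn): with noisy labels, an unrestricted tree reaches perfect training accuracy precisely because it has memorized the noise, while a depth-limited tree does not.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.1 flips 10% of labels: perfect training accuracy
# is only possible by memorizing that noise.
X, y = make_classification(n_samples=2000, n_features=20,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

print("deep tree    train/test:", deep.score(X_tr, y_tr), deep.score(X_te, y_te))
print("shallow tree train/test:", shallow.score(X_tr, y_tr), shallow.score(X_te, y_te))
```

The gap between the deep tree's training and test scores is the overfitting described above; limiting `max_depth` trades some training fit for better generalization.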
3. The data is expected to change
Tree-based models are extremely sensitive to minor changes in the data, and they may not be well suited to continuous variable prediction, because in such use cases the data cannot be expected to remain stationary. So tree-based models should not be used for data with high uncertainty, as they may yield very low accuracy and false predictions. In classification tasks where one class initially dominates, a tree-based model will remain biased toward that early majority class even if the target later becomes balanced.
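A toy sketch of this sensitivity (assuming scikit-learn, on a deliberately tiny hand-made dataset): moving a single training sample shifts the learned split threshold enough to flip the prediction at a nearby point.

```python
from sklearn.tree import DecisionTreeClassifier

X1 = [[1.0], [2.0], [3.0], [4.0]]
X2 = [[1.0], [2.9], [3.0], [4.0]]  # one class-0 sample moved from 2.0 to 2.9
y = [0, 0, 1, 1]

t1 = DecisionTreeClassifier(random_state=0).fit(X1, y)
t2 = DecisionTreeClassifier(random_state=0).fit(X2, y)

# The split threshold moves from (2.0 + 3.0) / 2 = 2.5
# to (2.9 + 3.0) / 2 = 2.95, flipping the prediction at x = 2.55.
print(t1.predict([[2.55]]), t2.predict([[2.55]]))
```

A linear model fitted to the same two datasets would barely move, which is what "extremely sensitive to minor changes" means in practice for trees.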
4. The data contains many dependent features
Tree-based modeling should not be used when there are many dependent (correlated) features in the dataset: tree-based models assign different weights to each of the dependent features, giving high weight to some while others receive very low weight, which can be responsible for poor accuracy from the model. So when there are dependent features in the dataset, tree-based models are not to be used.
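The uneven weighting is easy to demonstrate in the extreme case of two identical features (a sketch assuming scikit-learn and NumPy): the tree arbitrarily routes all splits through one of them, so the reported importances credit one feature and ignore its twin entirely.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=500)
X = np.column_stack([x, x])       # feature 1 is an exact copy of feature 0
y = (x > 0).astype(int)           # label depends on the shared signal

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# One feature receives all the importance; its duplicate receives none.
print(tree.feature_importances_)
```

With softer correlation the effect is less extreme but still present: importance is spread unstably among correlated features rather than reflecting their shared contribution.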
5. Training time is a constraint
The training time of a tree-based model grows with the number of features in the data, and for high-dimensional data tree-based models eat up far more training time than other supervised learning algorithms. So for high-dimensional data, or to speed up the training process, a Support Vector Machine can be used instead of tree-based models.
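A rough, machine-dependent sketch of the comparison (assuming scikit-learn); the absolute numbers will vary from machine to machine, so treat the printed times as illustrative only:

```python
import time
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Wide data: 500 features.
X, y = make_classification(n_samples=2000, n_features=500, random_state=0)

start = time.perf_counter()
DecisionTreeClassifier(random_state=0).fit(X, y)
tree_time = time.perf_counter() - start

start = time.perf_counter()
LinearSVC(random_state=0).fit(X, y)
svm_time = time.perf_counter() - start

print(f"tree: {tree_time:.3f}s  linear SVM: {svm_time:.3f}s")
```

The tree must scan every feature for candidate split points at every node, which is why its fit time scales with the feature count.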
6. Regularization is to be employed
If regularization is to be applied during model building to prevent overfitting, classic penalty-based regularization cannot be used with tree-based models: they are built by heuristic, greedy decision-making algorithms rather than by optimizing a penalized loss. So if penalty-based regularization is to be applied to machine learning models, tree-based models cannot be used.
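While L1/L2 penalties do not plug into tree construction, scikit-learn does offer cost-complexity pruning (`ccp_alpha`) that plays an analogous role by penalizing tree size. A hedged sketch (assuming scikit-learn):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Noisy labels make the unpruned tree grow large.
X, y = make_classification(n_samples=1000, n_features=20,
                           flip_y=0.2, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X, y)

# ccp_alpha penalizes each additional leaf, shrinking the tree.
print("nodes before/after pruning:", full.tree_.node_count, pruned.tree_.node_count)
```

This controls tree complexity after the fact rather than shaping the loss during optimization, which is the distinction the section above is drawing.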
7. Compatibility with Mean Absolute Error
Mean absolute error is a metric used to measure the accuracy of continuous-variable predictions, and tree-based models sit poorly with it: growing a tree under the mean-absolute-error criterion consumes far more time, or the training may not converge at all.
8. Resampling is time-consuming for tree-based models
Resampling techniques such as cross-validation are time-consuming for tree-based models on high-dimensional data and with a high number of folds. So if resampling techniques are to be used in model building, other machine learning models can be used instead of tree-based models.
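The cost is multiplicative: every fold of a k-fold cross-validation refits the tree from scratch. A sketch (assuming scikit-learn):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 5 folds means 5 full tree constructions; the validation cost
# scales with the number of folds on top of the per-fit cost.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores.mean(), scores)
```

For a model whose single fit is already slow on wide data, this per-fold refitting is what makes resampling expensive.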
9. Longer computation time in the pipeline
Compared to other machine learning models, tree-based models take longer to fit within a pipeline because of their complex structure on high-dimensional data. So if a fast-operating machine learning pipeline is to be created, tree-based models are not to be used.
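A sketch of a tree inside a standard scikit-learn `Pipeline` (assumed setup); on wide data the tree stage dominates the total fit time:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=50, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),                        # cheap: one pass over the data
    ("tree", DecisionTreeClassifier(random_state=0)),   # dominant fit cost
])
pipe.fit(X, y)
print("training accuracy:", pipe.score(X, y))
```

Swapping the final step for a linear model keeps the same pipeline interface while cutting the fit time, which is the trade-off this section suggests.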
10. Bias towards most occurring class
Tree-based ensembles can push a bias toward the most frequent classes through their voting classifier: each base learner tends to return the majority class, and the voting classifier can also be swayed by base learners that yield wrong predictions. So if unbiased predictions are to be obtained, tree-based models are not to be used.
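The majority-class pull can be sketched on imbalanced data (assuming scikit-learn and NumPy): with a roughly 9:1 class imbalance and uninformative features, every leaf of a shallow tree is majority-dominated, so the model predicts the majority class almost everywhere.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                  # features carry no signal
y = (rng.uniform(size=1000) < 0.1).astype(int)  # ~10% minority class

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
preds = tree.predict(rng.normal(size=(500, 5)))
print("fraction predicted as majority class:", (preds == 0).mean())
```

In an ensemble, each base tree behaves this way, so majority voting compounds the bias rather than correcting it.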
As discussed in this article, tree-based machine learning models have concerns with the type and characteristics of the data in use, so tree-based modeling is not an efficient way of modeling data for all applications and problems. If the data is simple, without outliers or multicollinearity, traditional machine learning modeling techniques can be used instead of tree-based models.