Listen to this story
When the size of the features exceeds a certain limit, regression trees become inapplicable due to overfitting. The decision tree’s overfitting problem is caused by other factors as well as synch as branches sometimes are impacted by noise and outliers of data. Pruning is a critical step in constructing tree based machine learning models that help overcome these issues. This article is focused on discussing pruning strategies for tree based models and elaborates on how this strategy works in practice. Following are the topics to be covered in this article.
Table of contents
- A snippet about decision trees
- About pruning
- Strategies for pruning
- Pruning methods
A decision tree is a traditional supervised machine learning technique. Let’s get a high-level understanding of decision trees.
Sign up for your weekly dose of what's up in emerging technology.
A snippet about decision trees
A decision tree is a hierarchical data structure that uses a divide and conquers technique to describe data. We will explore decision trees with categorical labels in this lesson, but decision trees may also be used for non-parametric classification and regression.
The decision tree is made up of nodes that create a rooted tree, which means it is a directed tree with no incoming edges. Every other node has only one incoming edge. All other nodes are referred to as leaves, which are also known as terminal or decision nodes. Each internal node in a decision tree divides the instance space into two or more sub-spaces based on a discrete function of the input attribute values.
When dealing with nominal data, each leaf is allocated to a class that represents the most appropriate target value. Alternatively, the leaf might include a probability vector showing the likelihood of the target characteristic having a specific value. In the case of numeric characteristics, decision trees may be mathematically understood as a collection of orthogonal hyperplanes.
Decision trees and lists divide the instance space into disjoint divisions and assign a class label to each. Their benefit is that they offer a clear depiction of how this is accomplished. Using ordinary logical procedures, the description of a piece belonging to a certain class may be translated into disjunctive normal form. In this style, each class is characterised by a proposition whose premise is a disjunctive sentence specifying the class’s portions of space. Disjuncts are individual clause components that are mutually exclusive in decision trees and lists, meaning they do not overlap in instance space.
Each disjunct is allocated to a class, and any subsequent test instances covered by the disjunct are assigned to this class as well. To reduce the number of inaccurate class assignments, label the disjunct with the class that is most likely to occur. This is the class that appears the most frequently in the training data, according to the maximum likelihood principle, which is extensively used in learning algorithms for decision trees and lists. As a result, the disjunction is labelled with the majority class of the occurrences it covers.
The amount of training examples associated with the disjunct determines its size. A disjunct’s mistake rate is the percentage of future test cases that it misclassifies. Small disjuncts appear to be more mistake-prone than large ones, simply because they receive less support from the training data.
Are you looking for a complete repository of Python libraries used in data science, check out here.
Pruning is the process of eliminating weight connections from a network to speed up inference and reduce model storage size. Decision trees and neural networks, in general, are overparameterized. Pruning a network entails deleting unneeded parameters from an overly parameterized network.
Pruning mostly serves as an architectural search inside the tree or network. In fact, because pruning functions as a regularizer, a model will often generalise slightly better at low levels of sparsity. The trimmed model will match the baseline at higher levels. If you push it too far, the model will start to generalise worse than the baseline, but with greater performance.
Need for pruning
Pruning a classifier simplifies it by combining disjuncts that are adjacent in instance space. By removing error-prone components, the classifier’s performance may be improved. It also permits additional model analysis for the aim of knowledge gain. Pruning should never be used to remove predicted components of a classifier. As a result, the pruning operation needs a technique for determining if a group of disjuncts is predictive or should be merged into a single, bigger disjunct.
The pruned disjunct represents the “null hypothesis” in a significance test, whereas the unpruned disjuncts represent the “alternative hypothesis.” The test determines if the data offer adequate evidence to support the alternative. If this is the case, the unpruned disjuncts are left alone; otherwise, pruning continues.
The obvious rationale for significance tests is that they evaluate whether the apparent correlation between a collection of disjuncts and the data is likely to be attributable to chance alone. They do so by calculating the likelihood of generating a random relationship as least as strong as the observed association if the null hypothesis is confirmed. If the observed relationship is unlikely to be attributable to chance and this likelihood does not exceed a set threshold, the unpruned disjuncts are deemed to be predictive; otherwise, the model is simplified. The aggressiveness of the pruning operation is determined by the “significance level” criterion used in the test.
Strategies for pruning
Pruning is a critical step in developing a decision tree model. Pruning is commonly employed to alleviate the overfitting issue in decision trees. Pre-pruning and post-pruning are two common model tree generating procedures.
Prepruning is the process of pruning the model by halting the tree’s formation in advance. When construction is completed, the leaf nodes inherit the label of the most common class in the subset that is connected to the current node. There are various ways for pre-pruning, including the following
- When the model reaches a specific height, the decision tree’s growth is stopped.
- When the eigenvectors of instances associated with a node are identical, the tree model stops developing.
- When the number of occurrences within a node falls below a certain threshold, the tree stops growing. The downside of this strategy is that it is inapplicable not in particular circumstances where the amount of data is tiny.
- An expansion is a process of dividing a node into two child nodes. When the gain value of an expansion falls below a certain threshold, the tree model stops expanding as well.
The major disadvantage of pre-pruning is the narrow viewing field, which implies that the tree’s current expansion may not match the standards, but later expansion may. In this situation, the decision tree’s development is halted early.
The decision tree generation is divided into two steps by post-pruning. The first step is the tree-building process, with the termination condition that the fraction of a certain class in the node reaches 100%, and the second phase is pruning the tree structure gained in the first phase.
Post-pruning techniques circumvent the problem of a narrow viewing field in this way. As a result, post-pruning procedures are often more accurate than pre-pruning methods, therefore post-pruning methods are more widely utilised than pre-pruning methods. The pruning procedure identifies the node as a leaf node by using the label of the most common class in the subset associated with the current node, which is the same as in pre-pruning.
The goal of pruning is to remove sections of a classification model that explain random variation in the training sample rather than actual domain characteristics. This makes the model more understandable to the user and, perhaps, more accurate on fresh data that was not used to train the classifier. An effective approach for differentiating sections of a classifier that are attributable to random effects from parts that describe significant structure is required for pruning. There are different methods for pruning listed in this article used in both strategies.
Reduced Error Pruning (REP)
The aim is to discover the most accurate subtree with the shortest version to the pruning set.
The pruning set is used to evaluate the efficacy of a subtree (branch) of a fully grown tree in this approach, which is conceptually the simplest. It starts with the entire tree and compares the number of classification mistakes made on the pruning set when the subtree is retained to the number of classification errors made when internal nodes are transformed into leaves and assigned to the best class for each internal node of the tree. The simplified tree can sometimes outperform the original tree. It is best to prune the subtree in this scenario. This branch trimming procedure is continued on the simplified tree until the misclassification rate rises. Another restriction limits the pruning condition: the internal node can be pruned only if it includes no subtree with a lower error rate than the internal node itself. This indicates that trimmed nodes are evaluated using a bottom-up traversal technique.
The advantage of this strategy is its linear computing complexity, as each node is only visited once to evaluate the possibility of trimming it. REP, on the other hand, has a proclivity towards over-pruning. This is because all evidence contained in the training set and used to construct a fully grown tree is ignored during the pruning step. This issue is most obvious when the pruning set is significantly smaller than the training set, but it becomes less significant as the percentage of instances in the pruning set grows.
Pessimistic Error Pruning (PEP)
The fact that the same training set is utilised for both growing and trimming a tree distinguishes this pruning strategy. The apparent error rate, that is, the error rate on the training set, is optimistic and cannot be used to select the best-pruned tree. As a result, the continuity correction for the binomial distribution was proposed, which may give “a more realistic error rate.”
The distribution of errors at the node is roughly a binomial distribution. The binomial distribution’s mean and variance are the likelihood of success and failure; the binomial distribution converges to a normal distribution.
The PEP approach is regarded as one of the most accurate decision tree pruning algorithms available today. However, because the mechanism for traversing PEP is similar to pre-pruning, PEP suffers from excessive pruning. Furthermore, due to its top-down nature, each subtree in the tree only has to be consulted once, and the time complexity is in the worst-case linear with the number of non-leaf nodes in the decision tree.
Minimum Error Pruning (MEP)
This method is a bottom-up strategy that seeks a single tree with the lowest “anticipated error rate on an independent data set.” This does not indicate the adoption of a pruning set, but rather that the developer wants to estimate the error rate for unknown scenarios. Indeed, both the original and enhanced versions described exploiting just information from the training set.
In the presence of noisy data, Laplace probability estimation is employed to improve the performance of ID3. Later, the Bayesian technique was employed to enhance this procedure, and the approach is known as an m-probability estimation. There were two modifications:
- Prior probabilities are used in estimate rather than assuming a uniform starting distribution of classes.
- Several trees with differing degrees of pruning may be generated by adjusting the value of the parameter. The degree of pruning is now decided by parameters rather than the number of classes. Furthermore, factors like the degree of noise in the training data may be changed based on domain expertise or the complexity of the problem.
The predicted error rate for each internal node is estimated in the minimal error pruning approach and is referred to as static error. The anticipated error rate of the branch with the node is then estimated as a weighted sum of the expected error rates of the node’s children, where each weight represents the chance that observation in the node would reach the associated child.
Critical Value Pruning (CVP)
This post-pruning approach is quite similar to pre-pruning. Indeed, a crucial value threshold is defined for the node selection measure. Then, if the value returned by the selection measure for each test connected with edges flowing out of that node does not exceed the critical value, an internal node of the tree is pruned. However, a node may meet the pruning criterion but not all of its offspring. The branch is retained in this scenario because it includes significant nodes. This additional check is typical of a bottom-up strategy and distinguishes it from pre-pruning methods that prohibit a tree from developing even if future tests prove to be important.
The degree of pruning changes obviously with the critical value: a greater critical value results in more extreme pruning. The approach is divided into two major steps:
- Prune the mature tree to increase crucial values.
- Choose the best tree from the sequence of trimmed trees by weighing the tree’s overall relevance and forecasting abilities.
Cost-Complexity Pruning (CCP)
The CART pruning algorithm is another name for this approach. It is divided into two steps:
- Using certain techniques, select a parametric family of subtrees from a fully formed tree.
- The optimal tree is chosen based on an estimation of the real error rates of the trees in the parametric family.
In terms of the first phase, the primary concept is to prune the branches that exhibit the least increase in apparent error rate per cut leaf to produce the next best tree from the best tree. When a tree is pruned at a node, the apparent error rate increases by a certain amount while the number of leaves reduces by a certain number of units. As a result, the following ratio of the error rate increase to leaf reduction measures the rise in apparent error rate per trimmed leaf. The next best tree in the parametric family is then created by trimming all nodes in the subtree with the lowest value of the above-mentioned ratio.
The best tree in the entire grown tree in terms of predicted accuracy is picked in the second phase. The real error rate of each tree in the family may be estimated in two ways: one using cross-validation sets and the other using an independent pruning set.
Pruning methods are a crucial component of practical decision tree and list learning algorithms, and they are required for learning intelligible and accurate classifiers in the face of noise. With this article, we have understood the methods and strategies used to prune a tree.