In the last article, we looked at the basics of decision trees and how they help with classification, along with the advantages and disadvantages of using them. One of the advantages of a decision tree is that it efficiently identifies the most significant variable and splits the population on it. That article built a high-level understanding of decision trees; in this one, we will focus on the science behind splitting nodes and choosing the most significant split.
Decision trees can use various algorithms to split a node into two or more sub-nodes. The creation of sub-nodes increases the homogeneity of the resulting sub-nodes; in other words, the purity of each node increases with respect to the target variable. A decision tree tries splits on all available variables and then selects the split that results in the most homogeneous sub-nodes.
The choice of algorithm also depends on the type of target variable. Let’s look at the four most commonly used algorithms in decision trees:
The Gini index is based on a simple idea: if we select two items from a population at random, they should belong to the same class, and the probability of this is 1 if the population is pure.
Steps to calculate Gini for a split:
1. Calculate Gini for each sub-node using the sum of squared probabilities of success and failure, p² + q².
2. Calculate the Gini for the split as the weighted average of the Gini scores of its sub-nodes.
Example: Referring to the example used in the previous article, we want to segregate students based on the target variable of whether they play cricket. In the snapshot below, we split the population using two input variables, Gender and Class. Now we want to identify which split produces more homogeneous sub-nodes using the Gini index.
Similarly, for the split on Class:
Above, you can see that the Gini score for the split on Gender is higher than for the split on Class, so the node will be split on Gender.
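The Gini calculation can be sketched in a few lines of Python. The counts used here (30 students of whom 15 play cricket; a Gender split of 10 and 20 students with 2 and 13 players; a Class split of 14 and 16 students with 6 and 9 players) are assumed stand-ins for the snapshot, which is not reproduced in this text:

```python
# Gini score of a node: p^2 + q^2, the probability that two items drawn
# at random from the node belong to the same class (1 for a pure node).
def gini(successes, size):
    p = successes / size
    return p ** 2 + (1 - p) ** 2

# Gini for a split: weighted average of the sub-node scores.
def split_gini(sub_nodes, total):
    return sum(size / total * gini(s, size) for s, size in sub_nodes)

total = 30                        # assumed: 30 students, 15 play cricket
gender = [(2, 10), (13, 20)]      # assumed (plays, size) for Female, Male
klass = [(6, 14), (9, 16)]        # assumed (plays, size) for Class IX, Class X

print(round(split_gini(gender, total), 2))   # 0.59
print(round(split_gini(klass, total), 2))    # 0.51 -> Gender split is purer
```

A higher weighted Gini score means more homogeneous sub-nodes, which is why Gender wins here.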
Chi-square is an algorithm for finding the statistical significance of the differences between the sub-nodes and the parent node. It is measured as the sum of squared standardized differences between the observed and expected frequencies of the target variable.
Steps to calculate Chi-square for a split:
1. Calculate the Chi-square for each sub-node by computing the deviation between observed and expected frequencies for both success and failure.
2. Calculate the Chi-square of the split as the sum of the success and failure Chi-square values across all sub-nodes.
Example: Let’s work with the same example we used to calculate Gini.
Split on Gender:
Split on Class:
Perform the same calculations for the split on Class and you will arrive at the table below.
Above, you can see that Chi-square also identifies the Gender split as more significant than the Class split.
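The same assumed counts (10 and 20 students with 2 and 13 players for Gender; 14 and 16 students with 6 and 9 players for Class, with half of all students playing) can be run through the Chi-square calculation. Note that the per-cell value sqrt((actual − expected)² / expected) follows this article's formulation; the classical Pearson statistic sums the squared terms without the square root:

```python
import math

# Per-cell deviation as used in this article:
# sqrt((actual - expected)^2 / expected).
def cell(actual, expected):
    return math.sqrt((actual - expected) ** 2 / expected)

# Chi-square of a split: sum the "play" and "not play" cell values over
# all sub-nodes; expected counts come from the parent node's play rate.
def split_chi_square(sub_nodes, parent_rate):
    score = 0.0
    for plays, size in sub_nodes:
        score += cell(plays, size * parent_rate)               # "play" cell
        score += cell(size - plays, size * (1 - parent_rate))  # "not play" cell
    return score

parent_rate = 15 / 30    # assumed: 15 of 30 students play cricket
print(round(split_chi_square([(2, 10), (13, 20)], parent_rate), 2))  # 4.58 (Gender)
print(round(split_chi_square([(6, 14), (9, 16)], parent_rate), 2))   # 1.46 (Class)
```

The larger value for Gender marks it as the more significant split.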
Let’s look at the image below and think about which node can be described most easily. I am sure your answer is C, because it requires the least information: all its values are similar. B requires more information to describe, and A requires even more. In other words, C is a pure node, B is less impure, and A is more impure.
Now we can conclude that a less impure node requires less information to describe it, and a more impure node requires more. Information theory has a measure for this degree of disorganization in a system, called entropy. If a sample is completely homogeneous, its entropy is zero; if it is equally divided between the classes, its entropy is one.
Entropy can be calculated using the formula:

Entropy = -p log₂(p) - q log₂(q)
Here p and q are the probabilities of success and failure, respectively, in that node. Entropy is used with a categorical target variable; we choose the split with the lowest entropy compared to the parent node and the other candidate splits.
Steps to calculate entropy for a split:
1. Calculate the entropy of the parent node.
2. Calculate the entropy of each sub-node of the split, then combine them as a weighted average to get the entropy of the split.
Example: Let’s use this method to identify the best split for the student example.
Above, you can see that the entropy of the split on Gender is lower than that of the split on Class, so we would again split on Gender. Information gain can be derived from entropy as 1 − entropy here, because the parent node, being evenly divided, has an entropy of 1.
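A sketch of the entropy calculation, again using the assumed counts from the earlier splits (10 and 20 students with 2 and 13 players for Gender; 14 and 16 students with 6 and 9 players for Class):

```python
import math

# Binary entropy: -p*log2(p) - q*log2(q).
# 0 for a pure node, 1 for an evenly divided node.
def entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Entropy of a split: weighted average of the sub-node entropies.
def split_entropy(sub_nodes, total):
    return sum(size / total * entropy(plays / size) for plays, size in sub_nodes)

total = 30                                                  # assumed counts
print(round(split_entropy([(2, 10), (13, 20)], total), 2))  # 0.86 (Gender)
print(round(split_entropy([(6, 14), (9, 16)], total), 2))   # 0.99 (Class)
# Information gain = parent entropy (1.0 for a 50/50 parent) - split entropy,
# so the lower-entropy Gender split also has the higher information gain.
```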
So far, we have discussed algorithms for a categorical target variable. Reduction in variance is an algorithm for a continuous target variable. It uses the standard variance formula from descriptive statistics to choose the best split; the split with the lower weighted variance is selected:

Variance = Σ(X − X̄)² / n

Here X̄ is the mean of the values, X is an actual value, and n is the number of values.
Steps to calculate variance for a split:
1. Calculate the variance of each sub-node.
2. Calculate the variance of the split as the weighted average of the sub-node variances.
Example: Let’s assign the numerical value 1 for playing cricket and 0 for not playing. Now follow the steps to identify the right split:
Above, you can see that the Gender split has a lower variance than the parent node, so the split would again be on Gender.
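The variance calculation can be sketched by encoding each student as 1 (plays cricket) or 0 (does not). The group sizes below are assumed illustrative counts, not taken from this text:

```python
# Population variance: mean of squared deviations from the mean.
def variance(values):
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

# Variance of a split: weighted average of the sub-node variances.
def split_variance(sub_nodes):
    total = sum(len(v) for v in sub_nodes)
    return sum(len(v) / total * variance(v) for v in sub_nodes)

parent = [1] * 15 + [0] * 15                      # assumed: 15 of 30 play
female, male = [1] * 2 + [0] * 8, [1] * 13 + [0] * 7
ix, x = [1] * 6 + [0] * 8, [1] * 9 + [0] * 7

print(variance(parent))                # 0.25 (parent node)
print(split_variance([female, male]))  # ~0.205 (Gender: lowest variance)
print(split_variance([ix, x]))         # ~0.246 (Class)
```

Since the Gender split reduces the variance the most relative to the parent, it is the split we would choose.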
Above, we have looked at various algorithms for splitting a node into sub-nodes. To build a complete decision tree, the sub-nodes are split further into sub-nodes of their own, and all input variables are considered again for each new split; variables already used in earlier splits remain candidates. This recursive process stops when a node becomes pure, the tree reaches its maximum depth, or the number of records in a node falls below a preset limit.
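This recursive process can be sketched as a minimal tree grower. The feature names and toy rows below are made up for illustration, and weighted Gini impurity (1 − p² − q²) serves as the split criterion:

```python
# Gini impurity of a set of 0/1 labels: 1 - (p^2 + q^2); 0 for a pure node.
def gini_impurity(labels):
    p = sum(labels) / len(labels)
    return 1 - (p ** 2 + (1 - p) ** 2)

# Pick the feature whose split gives the lowest weighted impurity.
def best_feature(rows, labels, features):
    def score(f):
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[f], []).append(y)
        return sum(len(g) / len(labels) * gini_impurity(g) for g in groups.values())
    return min(features, key=score)

# Recursive growth: stop on a pure node, the maximum depth, or too few records.
def grow(rows, labels, features, depth=0, max_depth=3, min_records=2):
    if len(set(labels)) == 1 or depth >= max_depth or len(labels) <= min_records:
        return sum(labels) / len(labels)       # leaf: probability of class 1
    f = best_feature(rows, labels, features)   # used features stay candidates
    buckets = {}
    for row, y in zip(rows, labels):
        buckets.setdefault(row[f], ([], []))
        buckets[row[f]][0].append(row)
        buckets[row[f]][1].append(y)
    return {"feature": f,
            "children": {v: grow(r, l, features, depth + 1, max_depth, min_records)
                         for v, (r, l) in buckets.items()}}

# Toy data (made up): gender perfectly separates the labels, class does not.
rows = [{"gender": "F", "class": "IX"}, {"gender": "F", "class": "X"},
        {"gender": "M", "class": "IX"}, {"gender": "M", "class": "X"},
        {"gender": "M", "class": "IX"}, {"gender": "F", "class": "X"}]
labels = [0, 0, 1, 1, 1, 0]
tree = grow(rows, labels, ["gender", "class"])
print(tree)   # splits on "gender" first, yielding two pure leaves
```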
In an extreme scenario, a decision tree can have as many leaf nodes as there are observations, but that would be a very complex tree. Growing a tree to such complexity on the training data causes overfitting: the model loses predictive power because it does not generalize. Overfitting can be reduced by pruning nodes.
In this article, we looked at the four most commonly used algorithms for selecting a variable on which to split a node into sub-nodes, along with the mathematical calculations behind them. Of these four, three are used for categorical target variables, whereas reduction in variance is generally used for continuous target variables.
Did you find the article useful? Do let us know your thoughts about this article in the comment box below.
P.S. Have you joined Analytics Vidhya Discuss yet? If not, you are missing out on awesome data science discussions. Here are some of the discussions happening on modeling techniques:
1. Decision tree with continuous variables
2. How many algorithms should an analyst know?