Part 3 - Classic Machine Learning Algorithms Series
This blog post was adapted from Section 4.5 in my PhD dissertation.
This is the third post in a series of blog posts focusing on some of the classic Machine Learning algorithms! In my last blog post, I described the notion of Generalised Linear Models and showed how the well-known Logistic Regression model is merely a special case of a much broader family of statistical models. In this post, we will talk about another much-loved class of machine learning models, namely Decision Trees. Let’s get started!
Decision tree learning is one of the most widely used supervised learning approaches. This popularity is primarily attributed to decision trees' ability to perform both classification and regression tasks, their relatively simple mathematical formulation, and their ability to represent the underlying decision-making process visually and explicitly.
CART is a popular non-parametric algorithm for producing either classification or regression trees, depending on the defined target variable.
The process of producing a classification tree begins at an initial node, called the root node, in which the entire training set is contained. Tree growth is achieved by partitioning each node in such a way as to maximise a defined splitting criterion. If a parent node $t_p$ is split according to the $j$th attribute, then there exists an optimal splitting value $x_j^{*}$ for this node. Once this value has been determined, an if-then splitting procedure may follow whereby all observations with a $j$th attribute value of $x_j^{*}$ or greater are partitioned into a right-hand child node $t_r$, as illustrated in Figure 1, while all observations with a $j$th attribute value less than $x_j^{*}$ are partitioned into a left-hand child node $t_l$.
According to Lewis, the quality of a candidate split at a parent node $t_p$ may be measured by the resulting change in node impurity, expressed as
\begin{equation} \Delta i(t)=i(t_p)-E[i(t_c)],\label{4.eqn.tree1} \end{equation}
where $E[i(t_c)]$ is the expected impurity of the two child nodes, which can be expressed in terms of the probabilities of an observation being partitioned into the left- and right-hand child nodes, denoted by $p(t_l)$ and $p(t_r)$, respectively, as $E[i(t_c)]=p(t_l)i(t_l)+p(t_r)i(t_r)$. Consequently, in the case of a classification tree, all possible attribute values for all $m$ attributes in $\mathcal{X}$ must be considered as possible “best splits,” and the split maximising (\ref{4.eqn.tree1}) at a single parent node may be expressed mathematically as
\begin{equation} \underset{\substack{x_j<x_j^{*},\,j\in\{1,\ldots,m\}}}{\arg\max}\hspace{1.5mm}i(t_p)-p(t_l)i(t_l)-p(t_r)i(t_r).\label{4.eqn.tree2} \end{equation}
Although this process of maximising the change in impurity described in (\ref{4.eqn.tree2}) is common to both classification and regression trees, the impurity function $i(t)$ is defined differently in each case.
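To make the split search in (\ref{4.eqn.tree2}) concrete, here is a minimal Python sketch (my own illustration, not the original CART implementation) that scans every attribute and every candidate threshold and returns the split maximising the impurity decrease, assuming the Gini index as the impurity function $i(t)$ for a classification tree:

```python
import numpy as np

def gini(y):
    """Gini impurity i(t) of the target labels residing in a node."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Exhaustive search over all attributes j and thresholds x_j* for the
    split maximising i(t_p) - p(t_l) i(t_l) - p(t_r) i(t_r)."""
    parent_impurity = gini(y)
    best_j, best_threshold, best_decrease = None, None, -np.inf
    for j in range(X.shape[1]):
        for threshold in np.unique(X[:, j]):
            left = X[:, j] < threshold          # observations sent to t_l
            right = ~left                       # observations sent to t_r
            if left.sum() == 0 or right.sum() == 0:
                continue
            decrease = (parent_impurity
                        - left.mean() * gini(y[left])
                        - right.mean() * gini(y[right]))
            if decrease > best_decrease:
                best_j, best_threshold, best_decrease = j, threshold, decrease
    return best_j, best_threshold, best_decrease

# Example: a single attribute that separates the two classes below x = 0.8.
X = np.array([[0.1], [0.2], [0.3], [0.8], [0.9]])
y = np.array([0, 0, 0, 1, 1])
print(best_split(X, y))   # -> attribute 0, threshold 0.8, decrease 0.48
```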
A tree that has been grown to the point where it fits the training data almost perfectly typically results in an overly complex model that performs poorly on the validation data, as illustrated in Figure 2. A tree that is too small, on the other hand, is not able to discover the underlying relationships present in the data. Consequently, the ideal tree size is one that achieves a suitable trade-off between overfitting and underfitting the training data. A CART is, therefore, grown to a near-optimal size by employing one of two methods: the growth of the tree can be terminated according to (1) a stopping criterion, or (2) a pruning procedure can be employed after the tree has been grown.
As the name implies, the stopping criterion approach forces the recursive tree-growing procedure to terminate once a specified criterion is met; typical examples include a minimum number of observations in a node or a maximum tree depth (see, for example, Rashidi et al.).
Although the stopping criterion provides a simple method for limiting the growth of a tree, the more popular method is that of pruning, in which the tree is first grown to its maximal size and branches that contribute little to predictive performance are subsequently removed.
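As an illustration of both approaches (assuming scikit-learn is available), the hyperparameters `max_depth` and `min_samples_leaf` of `DecisionTreeClassifier` act as stopping criteria, while `ccp_alpha` performs cost-complexity pruning. This is just one possible library realisation of the ideas above:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# (1) Stopping criteria: halt growth once a node is too deep or too small.
stopped_tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X, y)

# (2) Pruning: grow the tree fully, then prune it back via cost-complexity
#     pruning; larger values of ccp_alpha yield smaller trees.
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.02).fit(X, y)

print(stopped_tree.get_depth(), pruned_tree.get_depth())
```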
Although CART is a powerful predictive model in its original form, there are three attractive methods for enhancing this standard learning model: bagging, boosting and random forests. The first of these, bagging (short for bootstrap aggregating), involves constructing $b$ trees on bootstrap samples drawn from the training data and aggregating their predictions, thereby reducing the variance of the final model.
The second CART enhancement method is that of boosting. Much like bagging, the notion of boosting may be applied to any learning model; here, trees are constructed sequentially, with each new tree placing greater emphasis on the observations that the previous trees predicted poorly.
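For readers who want to experiment, here is a minimal sketch of both enhancements applied to decision trees, using scikit-learn's `BaggingClassifier` and `AdaBoostClassifier` as stand-ins for the generic bagging and boosting ideas described above (the dataset and hyperparameters are arbitrary illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Bagging: fit many trees on bootstrap samples of the training data and
# aggregate their votes (the default base learner is a decision tree).
bagged = BaggingClassifier(n_estimators=100, random_state=0)

# Boosting: fit shallow trees sequentially, with each new tree focusing on
# the observations that the previous trees predicted poorly.
boosted = AdaBoostClassifier(n_estimators=100, random_state=0)

for name, model in [("bagging", bagged), ("boosting", boosted)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```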
The third and, by far, most popular CART enhancement is called random forests, which is discussed, in its own right, in the next section.
Introduced by Breiman, the random forest is an ensemble learning method that extends the idea of bagging described above.
Similar to bagging, this method involves the construction of $b$ decision trees produced by CART and combines them to produce a more accurate, stable prediction. In addition to the randomness induced by employing bootstrap sampling, random forests introduce further randomness by imposing the following change to the standard CART model: starting at the root node, instead of considering each of the $m$ attributes as possible candidates according to which to split the node, only a random subset of $m^{\prime}<m$ (typically chosen as $m^{\prime}\approx\sqrt{m}$) attributes is considered. This process of only considering a random attribute subset to split a node is continued until the tree reaches the largest possible size. The process is repeated a total of $b$ times, and the final prediction is typically a majority vote (in the case of classification) or an average (in the case of regression) of all the predictions made by the $b$ trees.
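A minimal sketch of this procedure, assuming scikit-learn, where `n_estimators` plays the role of $b$ and `max_features="sqrt"` enforces the random subset of $m^{\prime}\approx\sqrt{m}$ attributes considered at each split:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# b = 500 trees, each grown on a bootstrap sample; at every node split only
# a random subset of m' ~ sqrt(m) attributes is considered as candidates.
forest = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                                bootstrap=True, random_state=0)

# The final prediction is a majority vote over the b trees (an average would
# be used by RandomForestRegressor in the regression setting).
print(cross_val_score(forest, X, y, cv=5).mean())
```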
The reason why this approach works so well can be described as follows: by only allowing the algorithm to consider a subset of the available attributes at each node split, the algorithm is forced to explain the target variable without relying solely on the “best” attributes present in the data. In other words, if there is an attribute which is an inherently very strong predictor of the target variable, this attribute will be unavailable to the algorithm $100(m-m^{\prime})/m\%$ of the time (on average), thereby forcing the algorithm to explore the use of other attributes which are moderately strong predictors. This decorrelates the individual trees and, in turn, reduces the variance of their combined prediction.
The C4.5 algorithm was developed by Quinlan as a successor to his earlier ID3 algorithm.
The split criterion employed by the C4.5 algorithm is based on Shannon’s well-known concept of entropy from information theory. For a data subset $\mathcal{S}$ in which $\mbox{freq}(q,\mathcal{S})$ denotes the number of observations belonging to target category $q$, the entropy is defined as
\begin{equation} \mbox{Entropy}(\mathcal{S})=-\sum_{q=1}^{Q}\frac{\mbox{freq}(q,\mathcal{S})}{|\mathcal{S}|}\log_b\left(\frac{\mbox{freq}(q,\mathcal{S})}{|\mathcal{S}|}\right),\label{4.eqn.entropy} \end{equation}
where $Q$ is the total number of target attribute categories. If a large proportion of observations in a data subset $\mathcal{S}$ belong to category $q$ (i.e. $\text{freq}(q,\mathcal{S})/|\mathcal{S}|\approx 1$), then the entropy in (\ref{4.eqn.entropy}) will be approximately zero (implying a high level of order). Conversely, if there is an equal representation of all target categories in the data subset $\mathcal{S}$ (i.e. $\text{freq}(q,\mathcal{S})/|\mathcal{S}|\approx 1/Q$), then the entropy in (\ref{4.eqn.entropy}) will be at its maximum (implying a high level of disorder).
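A small Python helper (my own illustrative sketch, not part of C4.5 itself) makes it easy to verify the behaviour of (\ref{4.eqn.entropy}) at these two extremes:

```python
import numpy as np

def entropy(labels, base=2):
    """Shannon entropy of a vector of target categories, as defined above."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p) / np.log(base))

print(entropy(["a"] * 99 + ["b"]))       # nearly pure subset -> close to 0
print(entropy(["a"] * 50 + ["b"] * 50))  # equal representation -> 1.0 (Q = 2, base 2)
```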
With this notion in mind, consider the case in which a data subset $\mathcal{S}$, residing in a parent node $t_p$, is partitioned based on the categories of attribute $j$ producing $C_j$ child nodes $t_1,\ldots,t_{C_j}$. The information gain achieved by this node split can be defined as the total reduction of entropy in the data, expressed mathematically as
\begin{equation} \mbox{Gain}(t_p,j)=\mbox{Entropy}(t_p)-\sum_{i=1}^{C_j}\frac{|t_i|}{|t_p|}\mbox{Entropy}(t_i).\label{4.eqn.gain} \end{equation}
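Building on the `entropy` helper from the previous snippet (assumed to be in scope), the information gain in (\ref{4.eqn.gain}) can be sketched as follows:

```python
def information_gain(parent_labels, child_label_lists, base=2):
    """Gain(t_p, j): entropy of the parent node minus the weighted entropy
    of the child nodes produced by splitting on attribute j."""
    n = len(parent_labels)
    weighted_child_entropy = sum(
        len(child) / n * entropy(child, base) for child in child_label_lists
    )
    return entropy(parent_labels, base) - weighted_child_entropy

# A perfectly mixed parent split into two pure children gains one full bit.
print(information_gain(["a"] * 5 + ["b"] * 5, [["a"] * 5, ["b"] * 5]))  # -> 1.0
```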
This criterion was first proposed for splitting nodes in the ID3 algorithm, the predecessor of C4.5. A drawback of the information gain in (\ref{4.eqn.gain}), however, is that it tends to favour attributes with many categories. C4.5 therefore normalises the gain by the so-called split entropy of a candidate split, defined as
\begin{equation} \mbox{SplitEntropy}(t_p,j)=-\sum_{i=1}^{C_j}\frac{|t_i|}{|t_p|}\log_b\left(\frac{|t_i|}{|t_p|}\right).\label{4.eqn.splitentropy} \end{equation}
If a large proportion of observations in a data subset $\mathcal{S}$ exhibit the value $c_k^{(j)}$ for attribute $j$ (i.e. $|t_i|/|t_p|\approx 1$), then the split entropy in (\ref{4.eqn.splitentropy}) will be approximately zero, while, conversely, if there is an equal representation of all attribute categories in the data subset $\mathcal{S}$ (i.e. $|t_i|/|t_p|\approx 1/C_j$), then the split entropy in (\ref{4.eqn.splitentropy}) will be at its maximum. Combining (\ref{4.eqn.gain}) and (\ref{4.eqn.splitentropy}), it is possible to define the information gain ratio as the proportion of information generated by partitioning a parent node $t_p$ according to attribute $j$, expressed mathematically as
\begin{equation} \mbox{GainRatio}(t_p,j)=\frac{\mbox{Gain}(t_p,j)}{\mbox{SplitEntropy}(t_p,j)}.\label{4.eqn.gainratio} \end{equation}
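Continuing the sketch, and assuming the `entropy` and `information_gain` helpers above are in scope, (\ref{4.eqn.splitentropy}) and (\ref{4.eqn.gainratio}) translate to:

```python
import numpy as np

def split_entropy(child_label_lists, base=2):
    """SplitEntropy(t_p, j): entropy of the child-node proportions |t_i| / |t_p|."""
    sizes = np.array([len(child) for child in child_label_lists])
    p = sizes / sizes.sum()
    return -np.sum(p * np.log(p) / np.log(base))

def gain_ratio(parent_labels, child_label_lists, base=2):
    """GainRatio(t_p, j) = Gain(t_p, j) / SplitEntropy(t_p, j)."""
    return (information_gain(parent_labels, child_label_lists, base)
            / split_entropy(child_label_lists, base))

# Two equally sized, pure children: a gain of 1 bit divided by a split
# entropy of 1 bit gives a gain ratio of 1.
print(gain_ratio(["a"] * 5 + ["b"] * 5, [["a"] * 5, ["b"] * 5]))  # -> 1.0
```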
Unlike its predecessor, this formulation is more robust since it effectively penalises the choice of splitting attributes with many categories. A drawback of this approach, however, is that (\ref{4.eqn.gainratio}) tends to favour unbalanced splits in which one node partition is much larger than the rest.
And that’s all I have. In this blog post, I walked you through a host of different decision tree algorithms that are still widely used in practice today, namely CART, random forests and the C4.5 algorithm. Stay tuned to this series, where I’ll dive into even more classic machine learning algorithms. Until next time!