## Accelerating the Random Forest Algorithm, I: the Algorithm

By Mark Seligman on Monday 22 September 2014, 22:26 - Permalink

[This article is adapted and expanded from presentations at Nvidia's
*GTC* conference in March, as well as *UseR!* in July, 2014.]

The Random Forest algorithm is a well-known tool for data analysis. It
yields robust predictive models for vastly different data sets, serving as a
sort of Swiss Army Knife in the data scientist's toolkit. Given the need to
accommodate ever-larger data sets, scalable implementations are essential.
There are many opportunities to accelerate the algorithm but, in order to
appreciate the tradeoffs in implementation, an in-depth presentation of the
algorithm is in order.

## The algorithm, in a nutshell

For expository purposes, we refer to the data as a (design) matrix with
predictor-specific observations organized as columns. The rows of the matrix,
then, correspond to observations on a given sample, across all observed
predictors. The *response*, that is, the dependent variable being
modelled, is a vector with as many components as there are samples.

## Decision trees

A training phase first creates a series of decision trees, typically
numbering in the hundreds. These are binary trees, each non-terminal node
specifying a *splitting predictor* as well as a predicate to be
evaluated on that predictor's value, known as a *splitting criterion*.
The terminal nodes specify score values. A tree derives a score for a given
sample by walking from the root to a terminal according to the value of the
splitting decision at each non-terminal. That is, the branch direction at a
given node is determined by applying this predicate to the value of the
splitting predictor for the sample. For a given sample, the value predicted by
an individual tree is the score of the terminal reached. The value predicted by
the entire forest is either the mean tree score for that sample, when the
response is continuous, or the plurality of tree scores, when the response is
categorical.

Two types of splitting criterion are employed, depending on the data type of
the splitting predictor. If the predictor, say "x", is continuous, then the
predicate evaluates a partial order relation, such as "Is *x* less than
or equal to 7.3?". If the predictor, say "c", is categorical with, for example,
four categories, then the predicate evaluates a subset membership relation,
such as "Is *c* one of the values 1, 2 or 4?". In either case, the
splitting criteria are Boolean-valued and so determine the sense of each branch
taken as the tree is walked.

Each tree is constructed, or trained, from a separately-chosen subset of the
original samples. These subsets are determined by random sampling, either with
or without replacement, in a process known as *bagging*. Hence the
training process begins on separate, independently-drawn submatrices of the
original design matrix, each submatrix consisting of a different collection of
rows. Note that if sampling is performed with replacement, it is likely that
some rows will be selected more than once.

## Decision tree construction

Training works by subdividing a sample set into two distinct subsets, recursively, until subdivision is no longer profitable. A root node is first created. Corresponding to the root, but not itself a part of the decision tree, is the collection of bagged samples from which training is initialized for that tree. It is this collection which is recursively subdivided as the tree is built. When a node is found to split - that is, when a subdivision is profitable - two subnodes are created on the decision tree and the corresponding splitting criterion and predictor are recorded on that node.

There are several reasons why it may not be profitable to split a node. The simplest reason is that there may be no nontrivial subdivisions of the sample set remaining. This may occur if the node corresponds to a single sample or multiple copies of the same sample. Another reason stems from heuristics: the user may have decided, a priori, that little useful predictive information is to be gained by splitting nodes with fewer than some number of distinct samples. Finally, it may be the case that the splitting algorithm itself could not identify a split capable of maximizing the information gain. More on this later.

What is the connection between node splitting and subdividing the samples? Recall that the process of walking a completed decision tree involves not merely asking a series of questions, but asking questions which are conditioned on answers to previous questions. Intuitively, the process of training a decision tree proceeds similarly. Not only is a criterion formulated according to the data (i.e., samples) at hand, but a subsequent criterion is formulated according to the data on which it is predicated. In other words, the samples available to a true- or false-sense subnode are precisely those samples for which the parent node's splitting criterion is, respectively, true or false. Not only does a tree node split then but, in a very real sense, so does the collection of samples associated with it. Every branch taken in the decision tree implicitly divides the data into two subsets - a "bipartition" - either one conditioned on the sense of the branch taken.

How does a collection of samples lead to a splitting criterion? The idea is to collect the values of the response for each sample and then try to find bipartitions of this collection which maximize some measure of information gradient, typically Gini gain. That is, the algorithm tries to find a way to divide the samples into two distinct subsets over which the corresponding response values are somehow maximally different. Not every bipartition is evaluated, though. For one thing, the number of bipartitions grows exponentially with the number of distinct samples. More importantly, though, the algorithm is concerned only with bipartitions which are certain functions of the underlying predictor values at these samples, that is, criteria based on these values. For continuous predictors, the criteria are based on order relations and for discrete predictors they are based on set membership, as already described.

What role do the predictors play? At every node a random collection of predictors is selected. Bipartitions of the response values are evaluated on these predictors, over the set of samples corresponding to the node. The predictor (if any) maximizing the positive gain over this collection becomes the splitting predictor and the criterion at which the gain is maximal becomes the splitting criterion. It is worth pointing out that the randomly-sampled collection of predictors will vary at each node. That is, this collection is not fixed, but rather is sampled independently at each node.

Given a collection of available samples and predictors for a node, though,
how is a splitting criterion actually determined? For a continuous predictor,
the bipartitions arise from considering the samples in the respective
*predictor* order and, at each position, dividing the ordered set into
left and right components. For a discrete (categorical) predictor, the
bipartitions consist exactly of all ways of dividing the available
*predictor* values into distinct left and right components. The
"winning" bipartition, then, indicates not only the splitting predictor and
criterion but the left and right inheritance of samples, as well. In fact, the
key challenge in accelerating the algorithm lies in economically maintaining
the respective predictor-based orderings as the sample set itself splits into
more, but smaller, subsets.

What happens to nodes that do not split? Besides being tagged as terminals in the decision tree, these nodes record the scores for their associated sample sets. Recall that score values for regression trees are means. The score for a regression tree terminal, then, is the mean response value of all associated samples. Similarly, the score for a classification tree terminal is the plurality categorical response value among all associated samples, with ties broken randomly.

## Prediction and validation

The resulting forest is validated using the unsampled (*out-of-bag*)
observations. The forest can also be saved and reapplied to separate data sets
for prediction. Both validation and prediction apply the tree-walking procedure
described above, but with different goals and over different data. Prediction
applies the forest to a set of observations, usually different from the
original, the goal being to predict an unknown response. Validation, on the
other hand, applies the forest to the original observations, with an additional
twist that each tree operates on a separate collection of samples. The goal of
validation is to see how well the forest "predicts" the already-known response
over the original data. The validation quality is often reported by a summary
statistic, such as an R-squared value or a confusion matrix.