The latest release of the Rborist package, which provides an accelerated implementation of the Random Forest (TM) algorithm, is available from CRAN. Version 0.1-6 offers several notable improvements:

Sparse matrix representation

Sparse numeric dgCMatrix objects are now accepted as input, provided an intra-column encoding is employed. This representation is particularly useful for one-hot encodings, for example.
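To make the column-compressed layout concrete, here is a minimal sketch in Python of how a one-hot encoded block might be stored using three arrays. The names x, i and p mirror the slots of a dgCMatrix object (nonzero values, row indices and column pointers). The helper functions to_csc and one_hot are hypothetical, written only for illustration, and are not part of Rborist.

```python
def to_csc(dense):
    """Convert a dense matrix (list of rows) to column-compressed form.

    Returns (x, i, p):
      x: nonzero values, stored column by column
      i: zero-based row index of each value in x
      p: column pointers; column j occupies x[p[j]:p[j + 1]]
    """
    n_row = len(dense)
    n_col = len(dense[0]) if n_row else 0
    x, i, p = [], [], [0]
    for col in range(n_col):
        for row in range(n_row):
            value = dense[row][col]
            if value != 0:
                x.append(value)
                i.append(row)
        # Record where the next column starts.
        p.append(len(x))
    return x, i, p


def one_hot(levels, categories):
    """Hypothetical helper: one indicator column per category."""
    return [[1.0 if level == c else 0.0 for c in categories]
            for level in levels]


levels = ["a", "c", "b", "a"]
dense = one_hot(levels, ["a", "b", "c"])
x, i, p = to_csc(dense)

# Only one value per row is stored, regardless of the number of categories.
print(len(x), "stored values instead of", len(dense) * len(dense[0]))
```

Because a one-hot block holds exactly one nonzero per row, the stored size grows with the number of rows rather than with rows times categories, which is why this input format suits such encodings.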

Additionally, Rborist now autocompresses training data on a per-predictor basis, compactly representing runs of arbitrary value. This space-saving feature is most useful when training iteratively, using the preFormat feature.
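The idea of representing runs compactly can be illustrated with a generic run-length encoding of a single predictor column. This is only a sketch of the general technique in Python. It assumes nothing about Rborist's internal format, and the function names rle_encode and rle_decode are hypothetical.

```python
def rle_encode(column):
    """Collapse consecutive repeats of any value into (value, length) pairs."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([value, 1])   # start a new run
    return [(value, length) for value, length in runs]


def rle_decode(runs):
    """Expand (value, length) pairs back into the original column."""
    column = []
    for value, length in runs:
        column.extend([value] * length)
    return column


predictor = [0, 0, 0, 5, 5, 0, 7, 7, 7, 7]
runs = rle_encode(predictor)

# The encoding is lossless: decoding recovers the original column.
assert rle_decode(runs) == predictor
print(len(runs), "runs represent", len(predictor), "observations")
```

Unlike a sparse format, which only elides zeros, this scheme benefits from repeats of any value, so the savings depend on how long the runs in each predictor are.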

Pruned representation

A new option thinLeaves allows trained forests to be recorded in a slender format, economizing on storage.

Vignette

A vignette has been provided to guide users through Rborist's various capabilities. It is hoped that this will invite more users to try the package and make it easier to use.

Improved scalability

Particular attention has been paid to limiting data movement and exploiting data locality. This has improved the implementation's ability to scale to larger data sets.

The graph below illustrates recent progress by comparing execution times of Rborist with Xgboost on a flight-delay data set. Xgboost is considered to be among the fastest open-source packages implementing decision-tree methods. The flight-delay data and execution scripts are hosted on Szilard Pafka's benchm-ml project on GitHub. One script was modified to extend the sample limit from 10 million to 12.5 million rows, approximately the maximum available from the data. Timings were performed on a two-socket Xeon server:

[Figure: Flight.jpg, execution times of Rborist and Xgboost on the flight-delay data set]

Of particular interest is the inflection point apparent near one million rows. This is likely due to crossing a level of the memory hierarchy. That is, more and more data must be accessed from outside the L1 cache. Although Xgboost remains faster throughout this regime, Rborist appears better able to handle the transition, and the two are nearly even at 12.5 million rows. Additional testing will be needed to learn how far these scaling trends extend.

Thanks go out to Chris Kennedy, Christopher Brown, Carlos Ortega and Tal Galili, whose comments and contributions helped make this a successful release.