
Rborist 0.1-6 now on CRAN

The latest release of the Rborist package, which provides an accelerated
implementation of the Random Forest (TM) algorithm, is available from CRAN.
Version 0.1-6 offers several notable improvements:

Sparse matrix representation

Sparse numeric dgCMatrix objects are now accepted as input, provided an
intra-column encoding is employed. This representation is particularly
useful for one-hot encodings, in which most entries are zero.
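
To illustrate why a column-compressed layout suits one-hot data, here is a minimal Python sketch of the three arrays such a format stores (values, row indices, column pointers). The function name `one_hot_to_csc` and the sample labels are hypothetical, chosen only for illustration; this is not Rborist's internal code.

```python
def one_hot_to_csc(labels, categories):
    """Build column-compressed arrays for a one-hot encoding of `labels`.

    Returns (values, row_indices, col_pointers). Column j occupies the
    slice col_pointers[j]:col_pointers[j + 1] of the first two lists.
    Hypothetical helper for illustration only.
    """
    values, row_indices, col_pointers = [], [], [0]
    for category in categories:
        for row, label in enumerate(labels):
            if label == category:
                # Only the nonzero cells are stored.
                values.append(1.0)
                row_indices.append(row)
        # Mark where this column's entries end.
        col_pointers.append(len(values))
    return values, row_indices, col_pointers


labels = ["red", "green", "red", "blue", "green"]
categories = ["blue", "green", "red"]
values, row_indices, col_pointers = one_hot_to_csc(labels, categories)

dense_cells = len(labels) * len(categories)
stored_cells = len(values)
print(row_indices, col_pointers, f"{stored_cells} of {dense_cells} cells stored")
```

A one-hot encoding has exactly one nonzero per row, so the stored cells grow with the number of rows rather than with rows times categories.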

Additionally, Rborist now autocompresses training data on a
per-predictor basis, compactly representing runs of arbitrary value. This
space-saving feature is most useful when training iteratively, using the
preFormat feature.
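
The idea of compressing runs on a per-predictor basis can be sketched with simple run-length encoding. The Python below is an illustrative assumption, not the package's actual scheme: the helper names, the sample columns, and the rule for falling back to dense storage are all hypothetical.

```python
from itertools import groupby


def rle_encode(column):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, length in runs for _ in range(length)]


def compress_predictors(columns):
    """Encode each predictor independently (hypothetical policy).

    A column is kept dense when run-length encoding would not save space.
    """
    compressed = {}
    for name, column in columns.items():
        runs = rle_encode(column)
        # Each run stores two numbers, so encoding only pays off
        # when there are fewer than half as many runs as values.
        if 2 * len(runs) < len(column):
            compressed[name] = ("rle", runs)
        else:
            compressed[name] = ("dense", list(column))
    return compressed


columns = {
    "is_weekend": [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
    "distance": [3.2, 1.7, 8.4, 2.2, 5.0, 9.1, 0.4, 6.3, 7.7, 1.1, 4.8, 2.9],
}
compressed = compress_predictors(columns)
print(compressed["is_weekend"])
```

Deciding column by column lets a predictor with long runs shrink to a few pairs while a predictor with no repeats is left untouched.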

Pruned representation

A new option thinLeaves allows trained forests to be recorded in a
slender format, economizing on storage.

Vignette

A vignette has been provided to guide users through Rborist’s various
capabilities. It is hoped that this will invite more users to try the package
and make it easier to use.

Improved scalability

Particular attention has been paid to limiting data movement and exploiting
data locality. This has paid dividends in the ability of the implementation to
scale across larger data sets.

The graph below illustrates recent progress by comparing execution times of
Rborist with Xgboost on a flight-delay data set. Xgboost is considered to be
among the fastest open-source packages implementing decision-tree methods.
The flight-delay data, and execution scripts, are hosted on Szilard Pafka’s
benchm-ml project on GitHub. One script was modified to extend the sample
limit from 10 million to 12.5 million rows, approximately the maximum
available from the data.
Timings were performed on a two-socket Xeon server:

Of particular interest is the inflection point apparent near one million
rows. This is likely due to crossing a level of the memory hierarchy. That is,
more and more data must be accessed from outside the L1 cache. Although
Xgboost remains faster throughout this regime, Rborist appears better able to
handle the transition, and the two are nearly even at 12.5 million rows.
Additional testing will be needed to learn how far these scaling trends extend.

Thanks go out to Chris Kennedy, Christopher Brown, Carlos Ortega and Tal
Galili, whose comments and contributions helped make this a successful
release.