CRAN Task View: Machine Learning & Statistical Learning

Several add-on packages implement ideas and methods developed at the
borderline between computer science and statistics; this field of research
is usually referred to as machine learning.
The packages can be roughly structured into the following topics:

Recursive Partitioning
: Tree-structured models for
regression, classification and survival analysis, following the
ideas in the CART book, are
implemented in
rpart
(shipped with base R) and
tree.
Package
rpart
is recommended for computing CART-like
trees.
A rich toolbox of partitioning algorithms is available in
Weka;
package
RWeka
provides an interface to this
implementation, including the J4.8-variant of C4.5 and M5.
The
Cubist
package fits rule-based models (similar
to trees) with linear regression models in the terminal leaves,
instance-based corrections and boosting. The
C50
package can fit
C5.0 classification trees, rule-based models, and boosted versions of these.
Two recursive partitioning algorithms with unbiased variable
selection and a statistical stopping criterion are implemented in
package
party
and
partykit. Function
ctree()
is based on
non-parametric conditional inference procedures for testing
independence between response and each input variable whereas
mob()
can be used to partition parametric models.
Package
model4you
can be used to build trees based
on more complex models, for example models featuring a
treatment effect to be partitioned. Transformation trees for
estimating discrete and continuous predictive distributions based on transformation
models, also allowing censoring and truncation, are available
from package
trtf.
Extensible tools for visualizing binary trees
and node distributions of the response are available in package
party
and
partykit
as well.
Tree-structured varying coefficient models are implemented in
package
vcrpart.
For problems with binary input variables
the package
LogicReg
implements logic regression.
Graphical tools for the visualization of
trees are available in package
maptree.
Trees for modelling longitudinal data by means of
random effects are offered by package
REEMtree.
Partitioning of mixture models is performed by
RPMM.
Computational infrastructure for representing trees and
unified methods for prediction and visualization are implemented
in
partykit.
This infrastructure is used by package
evtree
to implement evolutionary learning
of globally optimal trees. Survival trees are available in
various packages;
LTRCtrees
allows for
left-truncation and interval-censoring in addition to
right-censoring.
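
As a minimal sketch of computing a CART-like tree with rpart, the following fits a classification tree and checks its training accuracy. The built-in iris data and the chosen arguments are illustrative assumptions, not part of this task view.

```r
# Illustrative sketch: a CART-like classification tree with rpart.
library(rpart)

# Fit a classification tree predicting Species from all other columns.
fit <- rpart(Species ~ ., data = iris, method = "class")

# Predicted class labels for the training data.
pred <- predict(fit, newdata = iris, type = "class")

# Confusion table and training accuracy (optimistic: no data are held out).
conf_mat <- table(observed = iris$Species, predicted = pred)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(conf_mat)
print(accuracy)
```

Replacing the response with a numeric variable and using method = "anova" would give a regression tree in the same way.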

Random Forests
: The reference implementation of the random
forest algorithm for regression and classification is available in
package
randomForest. Package
ipred
has bagging
for regression, classification and survival analysis as well as
bundling, a combination of multiple models via
ensemble learning. In addition, a random forest variant for
response variables measured at arbitrary scales based on
conditional inference trees is implemented in package
party.
randomForestSRC
implements a unified treatment of Breiman's random forests for
survival, regression and classification problems. Package
quantregForest
implements quantile regression forests, which regress quantiles of a numeric response on explanatory
variables via a random forest approach.
The
varSelRF
and
Boruta
packages focus on variable selection by means
of random forest algorithms. In addition, packages
ranger
and
Rborist
offer R interfaces to fast C++ implementations of random forests.
Reinforcement Learning Trees, featuring splits in variables
which will be important down the tree, are implemented in
package
RLT.
wsrf
implements an
alternative variable weighting method for variable subspace selection
in place of the traditional random variable sampling.
Random forests for parametric models, including forests for the
estimation of predictive distributions, are available in
packages
trtf
(predictive transformation forests,
possibly under censoring and truncation),
model4you
(featuring a simple user interface for building forests based
on arbitrary parametric models fitted in R), and
grf
(an implementation of generalised random forests).

Regularized and Shrinkage Methods
: Regression models with some
constraint on the parameter estimates can be fitted with the
lasso2
and
lars
packages. Lasso with
simultaneous updates for groups of parameters (groupwise lasso)
is available in package
grplasso; the
grpreg
package implements a number of other group
penalization models, such as group MCP and group SCAD.
The L1 regularization path for generalized linear models and
Cox models can be obtained from functions available in package
glmpath; the entire lasso or elastic-net regularization path (also in
elasticnet)
for linear regression,
logistic and multinomial regression models can be obtained from package
glmnet.
The
penalized
package provides
an alternative implementation of lasso (L1) and ridge (L2)
penalized regression models (both GLM and Cox models). Package
biglasso
fits
Gaussian and logistic linear models under an L1 penalty when the data
cannot be stored in RAM.
Package
RXshrink
can be used to identify and display TRACEs
for a specified shrinkage path and to determine the appropriate extent of shrinkage.
Semiparametric additive hazards models under lasso penalties are offered
by package
ahaz.
A generalisation of the Lasso shrinkage technique for linear regression
is called relaxed lasso and is available in package
relaxo.
Fisher's LDA projection with an optional LASSO penalty to produce sparse
solutions is implemented in package
penalizedLDA.
The shrunken
centroids classifier and utilities for gene expression analyses are
implemented in package
pamr. An implementation
of multivariate adaptive regression splines is available
in package
earth. Various forms of
penalized discriminant analysis are implemented in
packages
hda
and
sda.
Package
LiblineaR
offers an interface to
the LIBLINEAR library.
The
ncvreg
package fits linear and logistic
regression models under the SCAD and MCP
regression penalties using a coordinate descent algorithm. The
same penalties are also implemented in the
picasso
package.
An implementation of bundle methods for regularized risk minimization
is available from package
bmrm. The Lasso under non-Gaussian and
heteroscedastic errors is estimated by
hdm,
which also provides inference on low-dimensional components of Lasso regression and on estimated treatment
effects in a high-dimensional setting. Package
SIS
implements sure independence screening in generalised linear and Cox models.
Normal and binary logistic linear models under various
penalties (or mixtures of those) can be estimated using package
oem.
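
As a rough sketch of obtaining a lasso regularization path with glmnet and picking a penalty by cross-validation, consider the following. The mtcars data, the seed, and the fold count are illustrative assumptions.

```r
# Illustrative sketch: lasso path and cross-validated penalty with glmnet.
library(glmnet)

set.seed(1)                      # cross-validation folds are random
x <- as.matrix(mtcars[, -1])     # predictors as a numeric matrix
y <- mtcars$mpg                  # numeric response

# Entire regularization path; alpha = 1 is the lasso,
# values between 0 and 1 give elastic-net penalties.
path <- glmnet(x, y, alpha = 1)

# Choose lambda by 5-fold cross-validation.
cv <- cv.glmnet(x, y, alpha = 1, nfolds = 5)

# Coefficients at the lambda minimizing the cross-validated error.
cf <- coef(cv, s = "lambda.min")
print(cf)
```

Calling plot(path) displays the coefficient profiles along the path.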

Boosting and Gradient Descent
: Various forms of gradient boosting are
implemented in package
gbm
(tree-based functional gradient
descent boosting). Package
xgboost
implements
tree-based boosting using efficient trees as base learners for
several built-in and user-defined objective functions.
The hinge loss is optimized by the boosting implementation
in package
bst. Package
GAMBoost
can be used to fit generalized additive models
by a boosting algorithm. An extensible boosting framework for
generalized linear, additive and nonparametric models is available in
package
mboost. Likelihood-based boosting for Cox models
is implemented in
CoxBoost
and for mixed models in
GMMBoost.
GAMLSS models can be fitted using boosting by
gamboostLSS.
An implementation of various learning algorithms based on
Gradient Descent for dealing with regression tasks is available
in package
gradDescent.

Support Vector Machines and Kernel Methods
: The function
svm()
from
e1071
offers an interface to the LIBSVM library and
package
kernlab
implements a flexible framework
for kernel learning (including SVMs, RVMs and other kernel
learning algorithms). An interface to the SVMlight implementation
(only for one-against-all classification) is provided in package
klaR.
The relevant dimension in kernel feature spaces can be estimated
using
rdetools
which also offers procedures for model selection
and prediction.

Bayesian Methods
: Bayesian Additive Regression Trees (BART),
where the final model is defined in terms of the sum over
many weak learners (not unlike ensemble methods),
are implemented in packages
BayesTree,
BART, and
bartMachine.
Bayesian nonstationary, semiparametric nonlinear regression
and design by treed Gaussian processes including Bayesian CART and
treed linear models are made available by package
tgp.
MXM
implements variable selection based on Bayesian
networks.

Optimization using Genetic Algorithms
:
Package
rgenoud
offers optimization routines based on genetic algorithms.
The package
Rmalschains
implements memetic algorithms
with local search chains, which are a special type of
evolutionary algorithms, combining a steady state genetic
algorithm with local search for real-valued
parameter optimization.

Association Rules
: Package
arules
provides data structures for efficient
handling of sparse binary data as well as interfaces to
implementations of Apriori and Eclat for mining
frequent itemsets, maximal frequent itemsets, closed
frequent itemsets and association rules. Package
opusminer
provides an
interface to the OPUS Miner algorithm (implemented in C++) for finding the key associations in
transaction data efficiently, in the form of self-sufficient itemsets, using either leverage or lift.

Fuzzy Rule-based Systems
:
Package
frbs
implements a host of standard
methods for learning fuzzy rule-based systems from data
for regression and classification. Package
RoughSets
provides comprehensive implementations of the
rough set theory (RST) and the fuzzy rough set theory (FRST) in a single
package.

Model selection and validation
: Package
e1071
has function
tune()
for hyperparameter tuning, and
function
errorest()
(ipred) can be used for
error rate estimation. The cost parameter C for support vector
machines can be chosen utilizing the functionality of package
svmpath.
Functions for ROC analysis and other visualisation techniques
for comparing candidate classifiers are available from package
ROCR.
Packages
hdi
and
stabs
implement stability
selection for a range of models;
hdi
also offers other inference procedures in high-dimensional models.
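
As a minimal sketch of hyperparameter tuning with tune() from e1071, the following runs a small grid search for a support vector machine. The iris data, the grid values, and the 5-fold setting are illustrative assumptions.

```r
# Illustrative sketch: grid search over SVM hyperparameters with e1071::tune().
library(e1071)

set.seed(1)  # cross-validation splits are random

# Evaluate every cost/gamma combination by 5-fold cross-validation.
tuned <- tune(
  svm, Species ~ ., data = iris,
  ranges = list(cost = c(0.1, 1, 10), gamma = c(0.01, 0.1, 1)),
  tunecontrol = tune.control(cross = 5)
)

print(tuned$best.parameters)   # best cost/gamma combination found
print(tuned$best.performance)  # its cross-validated error

# Model refitted on the full data with the best parameters.
best_fit <- tuned$best.model
```

The grid here is deliberately coarse; a finer grid costs more fits per fold.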

Other procedures
: Evidential classifiers quantify the uncertainty about the
class of a test pattern using a Dempster-Shafer mass function in package
evclass.
The
OneR
(One Rule) package offers a classification algorithm with
enhancements for sophisticated handling of missing values and numeric data
together with extensive diagnostic functions.
spa
combines feature-based and graph-based data for prediction of some response.

Meta packages
:
Package
caret
provides miscellaneous functions
for building predictive models, including parameter tuning
and variable importance measures. The package can be used
with various parallel implementations (e.g. MPI, NWS).
In a similar spirit, package
mlr
offers a high-level
interface
to various statistical and machine learning packages. Package
SuperLearner
implements a similar toolbox.
The
h2o
package implements a general purpose machine learning
platform that has scalable implementations of many popular algorithms such
as random forest, GBM, GLM (with elastic net regularization), and deep
learning (feedforward multilayer networks), among others.

Visualisation (initially contributed by Brandon Greenwell)
: The
stats::termplot()
function can be used to plot the
terms in a model whose predict method supports
type="terms".
The
effects
package provides graphical and tabular effect
displays for models with a linear predictor (e.g., linear and generalized
linear models). Friedman's partial dependence plots (PDPs), which are
low-dimensional graphical renderings of the prediction function, are implemented
in a few packages.
gbm,
randomForest
and
randomForestSRC
provide their own functions for displaying PDPs,
but are limited to the models fit with those packages (the function
partialPlot
from
randomForest
is more limited since
it only allows for one predictor at a time). Packages
pdp,
plotmo, and
ICEbox
are more general and allow for the
creation of PDPs for a wide variety of machine learning models (e.g., random
forests, support vector machines, etc.); both
pdp
and
plotmo
support multivariate displays (plotmo
is
limited to two predictors while
pdp
uses trellis graphics to
display PDPs involving three predictors). By default,
plotmo
fixes the background variables at their medians (or first level for factors)
which is faster than constructing PDPs but incorporates less information.
ICEbox
focuses on constructing individual conditional expectation
(ICE) curves, a refinement over Friedman's PDPs. ICE curves, as well as
centered ICE curves, can also be constructed with the
partial()
function from the
pdp
package.
ggRandomForests
provides ggplot2-based tools for the graphical exploration of random forest
models (e.g., variable importance plots and PDPs) from the
randomForest
and
randomForestSRC
packages.
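
As a small sketch of the stats::termplot() usage mentioned above, the following plots the terms of a linear model, whose predict method supports type="terms". The mtcars data and the model formula are illustrative assumptions.

```r
# Illustrative sketch: term plots for a model whose predict method
# supports type = "terms" (here a linear model).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Per-term contributions to the fitted values, one column per term.
terms_mat <- predict(fit, type = "terms")
print(head(terms_mat))

# One panel per term, with partial residuals and standard-error bands.
op <- par(mfrow = c(1, 2))
termplot(fit, partial.resid = TRUE, se = TRUE)
par(op)
```

Unlike a partial dependence plot, each panel shows only the additive contribution of one model term.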

CORElearn
implements a rather broad class of machine learning
algorithms, such as nearest neighbors, trees, random forests, and
several feature selection methods. Similarly, package
rminer
interfaces
several learning algorithms implemented in other packages and computes
several performance measures.