8. CART
Tree: piecewise constant predictor obtained by recursive binary partitioning of $\mathbb{R}^p$
Constraint: splits are parallel to the axes
At every step of the binary partitioning, data in the current node are split "at best" (i.e., so as to obtain the greatest decrease in heterogeneity in the two child nodes)
Figure: Regression tree
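To make the "split at best" criterion concrete, here is a minimal sketch (my own illustration, not code from the presentation) of an exhaustive search for the axis-parallel split giving the greatest decrease in heterogeneity, measured in the regression case by the sum of squared deviations; all names are illustrative.

```python
import numpy as np

def sse(y):
    """Heterogeneity of a node: sum of squared deviations from the mean."""
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def best_split(X, y):
    """Exhaustive search over axis-parallel splits (variable j, threshold s)
    for the greatest decrease in heterogeneity in the two child nodes."""
    best_j, best_s, best_gain = None, None, 0.0
    parent = sse(y)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j])[:-1]:      # candidate thresholds
            left = X[:, j] <= s
            gain = parent - sse(y[left]) - sse(y[~left])
            if gain > best_gain:
                best_j, best_s, best_gain = j, s, gain
    return best_j, best_s, best_gain

# Toy usage: the split on variable 0 near 0.5 should be recovered
rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 2))
y = np.where(X[:, 0] > 0.5, 2.0, -2.0) + rng.normal(scale=0.1, size=50)
print(best_split(X, y))
```

A CART tree applies this search recursively to each child node until a stopping rule is met.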

13. OOB error and estimation of the prediction error
OOB = Out Of Bag
OOB error
For predicting $Y_i$, only the predictors $\hat{f}^b$ such that $i \notin \tau_b$ are used $\Rightarrow \hat{Y}_i$
OOB error $= \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$ (regression)
OOB error $= \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{Y_i \neq \hat{Y}_i\}}$ (classification)
This estimation is similar to a standard cross-validation estimation... but without splitting the training dataset, because the splitting is already included in the bootstrap sample generation.
Warning: a different forest is used for the prediction of each $Y_i$!
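A minimal sketch of this OOB computation in the regression case (illustrative names; the fitted trees and their bootstrap index sets $\tau_b$ are assumed given). Note how, in line with the warning above, each $Y_i$ is predicted by a different sub-forest: only the trees for which $i \notin \tau_b$.

```python
import numpy as np

def oob_error(X, y, trees, boot_indices):
    """OOB error (regression): trees[b] is a fitted predictor with a
    .predict(X) method; boot_indices[b] is the index set of sample tau_b."""
    n = len(y)
    y_hat = np.full(n, np.nan)
    in_bag = [set(tau) for tau in boot_indices]
    for i in range(n):
        # only trees for which observation i is out of bag may vote
        preds = [t.predict(X[i:i + 1])[0]
                 for t, bag in zip(trees, in_bag) if i not in bag]
        if preds:  # i could be in every bag when B is small
            y_hat[i] = np.mean(preds)
    ok = ~np.isnan(y_hat)
    return float(np.mean((y[ok] - y_hat[ok]) ** 2))
```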

16. Why RF and Big Data?
On the one hand, bagging is appealing because it is easily computed in parallel.
On the dark side, each bootstrap sample has the same size as the original dataset (i.e., n, which is supposed to be LARGE) and contains approximately 0.63n different observations (which is also LARGE)! Indeed, each observation appears in a given bootstrap sample with probability $1 - (1 - 1/n)^n \simeq 1 - e^{-1} \simeq 0.632$.
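The 0.63n figure is easy to check empirically; a quick sketch (not from the slides):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
sample = rng.integers(0, n, size=n)   # one bootstrap sample of size n
print(len(np.unique(sample)) / n)     # fraction of distinct obs., ~0.632
print(1 - np.exp(-1))                 # theoretical limit: 0.6321...
```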

24. Overview of BLB (Bag of Little Bootstraps)
[Kleiner et al., 2012, Kleiner et al., 2014]
a method used to scale any bootstrap estimation
a consistency result is demonstrated for the bootstrap estimation
Here, we describe the approach in the simplified case of bagging (as used in random forests).
Framework: $(X_i, Y_i)_{i=1,\dots,n}$ is a learning set. We want to define a predictor of $Y \in \mathbb{R}$ from $X$ given this learning set.

28. Problem with standard bagging
When n is big, the number of different observations in $\tau_b$ is $\sim 0.63n$ ⇒ still BIG!
First solution: [Bickel et al., 1997] propose the "m-out-of-n" bootstrap: bootstrap samples have size m, with $m \ll n$.
But: the quality of the estimator strongly depends on m!
Idea behind BLB
Use bootstrap samples having size n but with a very small number of different observations in each of them.
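A minimal sketch of the BLB trick (my own illustration, not the authors' exact algorithm): draw a small subsample of $m \ll n$ distinct observations, then represent a size-n bootstrap sample by multinomial counts over those m points, so the sample has size n but at most m distinct observations.

```python
import numpy as np

def blb_sample(n, m, rng):
    """One BLB-style bootstrap sample as (indices, counts):
    m distinct observations drawn without replacement from {0, ..., n-1},
    with multinomial counts summing to n (a size-n resample in disguise)."""
    idx = rng.choice(n, size=m, replace=False)         # small subsample
    counts = rng.multinomial(n, np.full(m, 1.0 / m))   # size-n weights on it
    return idx, counts

rng = np.random.default_rng(0)
n, m = 1_000_000, 1_000       # e.g. m = n ** gamma with gamma < 1
idx, counts = blb_sample(n, m, rng)
print(idx.shape, counts.sum())  # (1000,) 1000000
```

A tree is then trained on the m distinct rows only, using the counts as observation weights.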

40. Overview of MapReduce
MapReduce is a generic method to deal with massive datasets stored on a distributed filesystem.
It has been developed by Google [Dean and Ghemawat, 2004] (see also [Chamandy et al., 2012] for an example of its use at Google).
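To fix ideas, here is a toy in-memory imitation of the Map/Reduce pattern (purely illustrative; in real MapReduce the Map and Reduce jobs run in parallel on the distributed filesystem):

```python
from collections import defaultdict

def map_reduce(chunks, mapper, reducer):
    """Toy MapReduce: `mapper` turns each data chunk into (key, value)
    pairs; values are grouped by key; `reducer` aggregates each group."""
    groups = defaultdict(list)
    for chunk in chunks:                  # in real life: parallel Map jobs
        for key, value in mapper(chunk):
            groups[key].append(value)
    return {k: reducer(v) for k, v in groups.items()}   # Reduce step

# Toy usage: word counts over two chunks
chunks = [["big", "data"], ["big", "forest"]]
print(map_reduce(chunks, lambda c: [(w, 1) for w in c], sum))
# {'big': 2, 'data': 1, 'forest': 1}
```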

46. MR implementation of random forests
A Map/Reduce implementation of random forests is included in Mahout (the Apache scalable machine learning library), which works as follows [del Rio et al., 2014]:
the data are split into Q chunks, one sent to each Map job;
each Map job trains a random forest with a small number of trees on its chunk;
there is no Reduce step (the final forest is the combination of all the trees learned in the Map jobs).
Note that this implementation is not equivalent to the original random forest algorithm, because the forests are not built on bootstrap samples of the original data set.
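A minimal imitation of this scheme with scikit-learn as a stand-in for Mahout (illustrative only; `RandomForestRegressor` and the chunking below are my assumptions, not the Mahout code):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def train_mr_forest(X, y, Q=4, trees_per_job=25, seed=0):
    """Mahout-style scheme: split the data into Q chunks and train one
    small forest per chunk (the 'Map jobs')."""
    chunks = np.array_split(np.arange(len(y)), Q)
    return [RandomForestRegressor(n_estimators=trees_per_job,
                                  random_state=seed + q).fit(X[idx], y[idx])
            for q, idx in enumerate(chunks)]

def predict_mr_forest(forests, X):
    """'No Reduce step': the final forest is the union of all trees, so
    (with equal tree counts) its prediction is the average over sub-forests."""
    return np.mean([f.predict(X) for f in forests], axis=0)
```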

48. Drawbacks of the MR implementation of random forests
Locality of data can yield biased random forests in the different Map jobs ⇒ the combined forest might have poor prediction performance.
The OOB error cannot be computed exactly because the Map jobs are independent. A proxy for this quantity is given by the average of the OOB errors obtained in the different Map tasks ⇒ again, this quantity can be biased due to data locality (a similar problem arises with the variable importance, VI).
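Continuing the sketch above, the proxy is simply the average of the per-chunk OOB errors, each computed inside its own "Map job" (again with scikit-learn as an illustrative stand-in; `oob_prediction_` is its OOB prediction attribute):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def mr_oob_proxy(X, y, Q=4, trees_per_job=25, seed=0):
    """Average of per-chunk OOB errors: only a proxy, since each forest is
    evaluated on its own (possibly non-representative) chunk of the data."""
    chunks = np.array_split(np.arange(len(y)), Q)
    errors = []
    for q, idx in enumerate(chunks):
        f = RandomForestRegressor(n_estimators=trees_per_job, oob_score=True,
                                  random_state=seed + q).fit(X[idx], y[idx])
        errors.append(np.mean((y[idx] - f.oob_prediction_) ** 2))
    return float(np.mean(errors))
```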

57. Online learning framework
Data stream: observations $(X_i, Y_i)_{i=1,\dots,n}$ have been used to obtain a predictor $\hat{f}_n$.
New data arrive, $(X_i, Y_i)_{i=n+1,\dots,n+m}$: how can we obtain a predictor from the entire dataset $(X_i, Y_i)_{i=1,\dots,n+m}$?
Naive approach: re-train a model from $(X_i, Y_i)_{i=1,\dots,n+m}$.
More interesting approach: update $\hat{f}_n$ with the new information $(X_i, Y_i)_{i=n+1,\dots,n+m}$.
Why is this interesting?
computational gain, if the update has a small computational cost (it can even be an interesting way to deal with big data that do not arrive as a stream);
storage gain.
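The update-versus-retrain pattern, sketched with a linear model whose online update is standard (`SGDRegressor.partial_fit` in scikit-learn); this is only to illustrate the framework, since Breiman's forests have no such built-in update, which is precisely the problem addressed next:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
model = SGDRegressor(random_state=0)

# initial predictor f_n from (X_i, Y_i), i = 1, ..., n
X = rng.normal(size=(1000, 5))
y = X @ np.ones(5) + rng.normal(size=1000)
model.partial_fit(X, y)

# new data (X_i, Y_i), i = n+1, ..., n+m: update instead of re-training
X_new = rng.normal(size=(100, 5))
y_new = X_new @ np.ones(5) + rng.normal(size=100)
model.partial_fit(X_new, y_new)   # small cost; old data need not be stored
```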

59. Framework of online bagging
$$\hat{f}_n = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^b_n$$
in which
$\hat{f}^b_n$ has been built from a bootstrap sample in $\{1, \dots, n\}$;
we know how to update $\hat{f}^b_n$ with new data online.
Question: can we update the bootstrap samples online when new data $(X_i, Y_i)_{i=n+1,\dots,n+m}$ arrive?
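One classical answer, used for instance in the online RF of [Saffari et al., 2009] cited below, is Oza and Russell's online bagging: the multiplicity of each arriving observation in each bootstrap sample is drawn as Poisson(1), which approximates the Binomial(n, 1/n) count of the standard bootstrap without revisiting past data. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
B = 100   # number of bootstrap samples / base predictors

def online_bootstrap_counts(rng, B):
    """When a new observation arrives, draw its multiplicity in each of the
    B bootstrap samples as Poisson(1) (limit of Binomial(n, 1/n))."""
    return rng.poisson(1.0, size=B)

# each arriving (X_i, Y_i): predictor b is updated k_b times
k = online_bootstrap_counts(rng, B)
print(k[:10])   # e.g. [1 0 2 1 0 ...]
```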

64. PRF
In Purely Random Forests [Biau et al., 2008], the splits are generated independently of the data:
splits are obtained by randomly choosing a variable and a splitting point within the range of this variable;
the decision in each leaf is made in the standard way.
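A minimal sketch of such a data-independent split (my own illustration): the data are used only to determine the range of the chosen variable, never to optimize the split.

```python
import numpy as np

def purely_random_split(X, rng):
    """Pick a variable uniformly at random and a splitting point uniformly
    within the range of that variable."""
    j = int(rng.integers(X.shape[1]))
    s = rng.uniform(X[:, j].min(), X[:, j].max())
    return j, s

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
print(purely_random_split(X, rng))   # e.g. (2, -0.41)
```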

68. Online RF
Developed to handle data streams (data arrive sequentially) in an online manner (we cannot keep all past data): [Saffari et al., 2009]
Can deal with massive data streams (addressing both the Volume and Velocity characteristics), but can also handle massive (static) data by running through the data sequentially
In-depth adaptation of Breiman's RF: even the tree-growing mechanism is changed
Main idea: think only in terms of proportions of output classes, instead of individual observations (classification framework)
Consistency results in [Denil et al., 2013]
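A minimal illustration of the "proportions instead of observations" idea (my own sketch): a node stores only per-class counts, which can be updated in O(1) per arriving observation and are enough both to predict (majority class) and to score candidate splits (e.g. by Gini impurity).

```python
import numpy as np

class OnlineNodeStats:
    """Per-node class counts: no past observation is ever stored."""
    def __init__(self, n_classes):
        self.counts = np.zeros(n_classes)

    def update(self, label):          # O(1) per arriving observation
        self.counts[label] += 1

    def predict(self):                # majority class
        return int(np.argmax(self.counts))

    def gini(self):                   # impurity from class proportions only
        p = self.counts / max(self.counts.sum(), 1.0)
        return float(1.0 - np.sum(p ** 2))

node = OnlineNodeStats(n_classes=3)
for label in [0, 2, 2, 1, 2]:
    node.update(label)
print(node.predict(), round(node.gini(), 3))   # 2 0.56
```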

72. When should we consider data as "big"?
We deal with Big Data when:
data are at Google scale (rare);
data are big compared to our computing capacities ... and depending on what we need to do with them.
[R Core Team, 2016, Kane et al., 2013]:
"R is not well-suited for working with data structures larger than about 10–20% of a computer's RAM. Data exceeding 50% of available RAM are essentially unusable because the overhead of all but the simplest of calculations quickly consumes all available RAM. Based on these guidelines, we consider a data set large if it exceeds 20% of the RAM on a given machine and massive if it exceeds 50%."
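As a back-of-the-envelope companion to these guidelines (my own example): a dense numeric dataset of n rows and p columns of doubles occupies about 8np bytes, which can be compared with the 20% and 50% RAM thresholds.

```python
def dataset_status(n_rows, n_cols, ram_gb, bytes_per_value=8):
    """Classify a dense numeric dataset with the 20% / 50% RAM guidelines
    of [Kane et al., 2013] (doubles: 8 bytes per value)."""
    size_gb = n_rows * n_cols * bytes_per_value / 1e9
    frac = size_gb / ram_gb
    label = "massive" if frac > 0.5 else "large" if frac > 0.2 else "fine"
    return size_gb, frac, label

# e.g. 10 million rows, 50 columns, on a 16 GB machine
print(dataset_status(10_000_000, 50, 16))   # (4.0, 0.25, 'large')
```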