What it is

When constructing each tree, the algorithm picks a "remaining" feature
uniformly at random at each node expansion, without any purity check (such
as information gain or the Gini index). A categorical feature (such as
gender) is considered "remaining" if it has not been chosen previously on
the decision path from the root of the tree to the current node. Once a
categorical feature is chosen, it is useless to pick it again on the same
path, because every example on that path has the same value (either male
or female). A continuous feature (such as income), however, can be chosen
more than once on the same decision path; each time it is chosen, a new
random threshold is selected.

A tree stops growing any deeper when one of the following conditions is met:

A node becomes empty, or there are no more examples to split at the
current node.

The depth of the tree exceeds some predefined limit.
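
As a concrete illustration of both the random expansion and the stopping
rules, here is a minimal sketch in Java. Example, Feature, Node, and the
depth limit are simplified stand-ins of our own, not the released RDT
code; the random threshold assumes features normalized to [0, 1].

    import java.util.*;

    // Minimal sketch of random tree growth; not the released RDT code.
    class RandomTreeSketch {
        static final Random RNG = new Random();
        static final int MAX_DEPTH = 10;                 // assumed depth limit

        static class Example { Map<String, Double> values = new HashMap<>(); String label; }
        static class Feature { String name; boolean categorical; }

        static class Node {
            Map<String, Integer> classCounts = new HashMap<>();
            List<Node> children = new ArrayList<>();
            Node(List<Example> ex) { for (Example e : ex) classCounts.merge(e.label, 1, Integer::sum); }
        }

        static Node grow(List<Example> examples, List<Feature> remaining, int depth) {
            Node node = new Node(examples);
            // Stop when the node is empty, the depth limit is hit, or nothing is left to pick.
            if (examples.isEmpty() || depth >= MAX_DEPTH || remaining.isEmpty()) return node;

            // Pick a remaining feature uniformly at random; no purity check at all.
            Feature f = remaining.get(RNG.nextInt(remaining.size()));

            // A categorical feature is used at most once per decision path;
            // a continuous feature stays available and gets a fresh random threshold.
            List<Feature> childRemaining = new ArrayList<>(remaining);
            if (f.categorical) childRemaining.remove(f);
            double threshold = RNG.nextDouble();         // assumes features normalized to [0, 1]

            // Partition: categorical splits by value, continuous splits at the threshold.
            Map<Object, List<Example>> parts = new HashMap<>();
            for (Example e : examples) {
                double v = e.values.get(f.name);
                Object key = f.categorical ? (Object) Double.valueOf(v)
                                           : (Object) Boolean.valueOf(v <= threshold);
                parts.computeIfAbsent(key, k -> new ArrayList<>()).add(e);
            }
            for (List<Example> part : parts.values())
                node.children.add(grow(part, childRemaining, depth + 1));
            return node;
        }
    }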

Each node of the tree records the class distribution of the training
examples that pass through it. Assume that a node is passed by a total of
1000 training examples, of which 200 are + and 800 are -. Then the class
probability estimates at that node are P(+|x) = 200/1000 = 0.2 and
P(-|x) = 800/1000 = 0.8.
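
In code, the estimate at a node is just the empirical class fraction. A
one-method sketch against the Node type above (names ours):

    // Empirical class probability at a node: count(label) / total examples seen.
    static double posterior(Node n, String label) {
        int total = n.classCounts.values().stream().mapToInt(Integer::intValue).sum();
        return n.classCounts.getOrDefault(label, 0) / (double) total;
    }
    // e.g. 200 "+" out of 1000 examples reaching the node: posterior(node, "+") == 0.2

(An empty node would divide by zero here; the parent-fallback rule
described below handles that case.)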

The algorithm does not prune the randomly built decision tree in the
conventional sense (e.g., MDL-based or cost-based pruning). However, it
does remove "unnecessary" nodes. A node expansion is considered
unnecessary if none of its descendants has a class distribution
significantly different from the node's own. When this happens, we remove
the expansion and make the node a leaf. In our implementation, the random
tree is built recursively, and the "necessity check" is performed as the
recursion returns.
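
The significance test is left unspecified here, so the sketch below
assumes a plain absolute-difference threshold (EPSILON is our invention;
the real implementation may use a proper statistical test). Applied
bottom-up as the recursion returns, checking the immediate children
suffices, because a collapsed child already summarizes its subtree.

    static final double EPSILON = 0.05;    // assumed significance threshold

    // Collapse an expansion whose children all look like the parent.
    static void necessityCheck(Node node, String[] labels) {
        for (Node child : node.children)
            for (String y : labels)
                // An empty child yields NaN and never counts as differing.
                if (Math.abs(posterior(node, y) - posterior(child, y)) > EPSILON)
                    return;                // some child differs significantly: keep the split
        node.children.clear();             // no child differs: make this node a leaf
    }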

In some situations a leaf node can be empty. When this happens, the
algorithm does not output NaN (not a number) or 0 as the class probability
distribution; instead, it goes one level up to its parent node and outputs
the parent's class distribution.
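
A sketch of that lookup, assuming each Node additionally keeps a link to
its parent (a field not shown in the earlier sketch):

    // An empty leaf answers with its parent's class distribution instead of NaN or 0.
    static double leafPosterior(Node leaf, String label) {
        Node n = leaf;
        while (n.classCounts.isEmpty() && n.parent != null)
            n = n.parent;                  // step up past empty nodes
        return posterior(n, label);
    }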

To make a final prediction, a loss function is needed. Under the
traditional 0-1 loss, the best prediction is the most likely class label:
for a binary problem with classes + and -, we predict + iff
P(+|x) >= 0.5. When the training data and the testing data are not drawn
from the same distribution, however, the optimal decision threshold may
not be exactly 0.5.
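
Putting the pieces together, a sketch of the decision rule
(treePosterior is a hypothetical helper that routes x from the root down
to a leaf and calls leafPosterior there):

    // Average P(+|x) over all trees, then apply the decision threshold.
    static String predict(List<Node> trees, Example x, double threshold) {
        double pPlus = 0.0;
        for (Node root : trees) pPlus += treePosterior(root, x, "+");
        pPlus /= trees.size();
        // Threshold is 0.5 under 0-1 loss; shift it when train/test priors differ.
        return pPlus >= threshold ? "+" : "-";
    }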

Theoretical Explanation

The random decision tree ensemble is an efficient implementation of the
Bayes Optimal Classifier (BOC). It is also termed Model Averaging (MA).
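
In model-averaging notation (ours, not from the source), with N random
trees T_1, ..., T_N the ensemble posterior is the simple average of the
per-tree leaf estimates:

    P(y \mid x) \approx \frac{1}{N} \sum_{i=1}^{N} P_{T_i}(y \mid x)

Averaging the posteriors of many randomly drawn trees is the sense in
which the method approximates the Bayes-optimal posterior without
enumerating the full model space.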

Examples

[Figures: the target distribution and the decision boundaries learned by
four methods: Random Tree (5 seconds to train), Traditional Tree (1 second
to train), SVM with linear kernel (overnight to train), and SVM with RBF
kernel (1 day to train).]

Good Generality

The same code can perform classification, regression, ranking, and
multi-label classification, and it can easily be implemented in
MapReduce/Hadoop.
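
The MapReduce port is straightforward because every tree is grown
independently of the others, so training is embarrassingly parallel: each
mapper can grow its own subset of trees. A local illustration with Java
parallel streams against the earlier sketch (trainingData and allFeatures
are hypothetical; this is not the Hadoop code itself):

    // Grow 100 independent trees in parallel; a Hadoop mapper would do the same per shard.
    List<Node> trees = java.util.stream.IntStream.range(0, 100)
        .parallel()
        .mapToObj(i -> grow(trainingData, allFeatures, 0))
        .collect(java.util.stream.Collectors.toList());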

Source Code Download

An open source project for Random Decision Tree (RDT) is available from
http://www.dice4dm.com. The source code is in Java and can perform
classification, ranking, regression, and multi-label classification.

The ozone data set is a streaming problem with both a skewed class
distribution (3 to 5% positive) and an unbalanced loss function. The
dataset has 72 continuous features normalized between 0 and 1, spans a
period of 7 years, and each entry carries a date stamp.

More information about this dataset, along with some studies using
several inductive learning algorithms, can be found in this paper.

This dataset is available for research purposes. Please send a request
to either Dr. Kun Zhang (zhang.kun05@gmail.com) or Wei Fan
(wei.fan@gmail.com). A shorter version of the dataset is now available
from the UCI machine learning repository here.

Artificial anomaly generation creates "anomalies" from normal data by
analyzing the characteristics of its distribution. The work is described
in the following KAIS'04 paper. The source code to generate artificial
anomalies can be found here.

An auto-clustering based method to correct all types of sample selection
bias has been proposed and discussed in the following SDM'08 paper. The
MATLAB code developed by Xiaoxiao Shi and the related datasets can be
downloaded here.

We propose a framework to actively transfer examples from the out-domain
into the in-domain. We formally analyze how the error decreases as more
in-domain examples are actively queried, and how out-of-domain data can
reduce the cost of asking domain experts to provide labeled in-domain
data. The paper by Xiaoxiao Shi, Wei Fan, and Jiangtao Ren can be found
here; the software and datasets (synthetic and landmine) written and
prepared by Xiaoxiao Shi can be found here; and the 20 Newsgroups data
can be found here.

In another approach, we combine models previously trained on different
domains via a locally weighted framework to transfer their knowledge into
a new domain. One important difference is that the out-of-domain models in
our approach did not use any in-domain data at all during their training;
in other words, our approach simply "adopts" them into the in-domain.
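
A minimal sketch of the blending step (the names and the fallback are
ours; how the per-example weights are derived from the local structure of
the in-domain data is the substance of the paper and is not shown):

    // Locally weighted combination: model i contributes its posterior for x,
    // scaled by a weight reflecting how trustworthy it looks near x.
    static double combine(double[] posteriors, double[] localWeights) {
        double num = 0, den = 0;
        for (int i = 0; i < posteriors.length; i++) {
            num += localWeights[i] * posteriors[i];
            den += localWeights[i];
        }
        return den == 0 ? 0.5 : num / den;    // no trusted model near x: fall back to 0.5
    }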

In this example, both training set 1 and training set 2 have regions that
"conflict" with the in-domain data, so the models trained on them cannot
be adopted directly by simple combination or averaging. Using the proposed
approach, however, we can successfully transfer the right knowledge from
several different domains. The paper by Jing Gao, Wei Fan, Jing Jiang, and
Jiawei Han can be found here. The code and dataset written by Jing Gao can
be found here.

Most inductive algorithms can only model well-structured feature vectors,
but in reality the raw data in many applications has no feature vectors
available. Repeated patterns in the raw data can provide a solution;
examples include frequent graphs, itemsets, and sequential patterns. It is
well understood that frequent pattern mining is non-trivial, since the
number of unique patterns is exponential and many of them are
non-discriminative and correlated.

Currently, frequent pattern mining is performed in batch mode as two
sequential steps: enumerating a set of frequent patterns as candidate
features, followed by feature selection. Although many methods have been
proposed in the past few years for performing each step efficiently, there
is still limited success in eventually finding highly compact and
discriminative patterns. The culprit is the inherent nature of this widely
adopted batch approach.

We propose a new and different approach that mines frequent patterns as
discriminative features. It builds a decision tree that sorts, or
partitions, the data onto the nodes of the tree. At each node, it directly
discovers a discriminative pattern that further divides the node's
examples into purer subsets, subsets that the patterns chosen earlier in
the same run cannot separate. Since the number of examples toward the leaf
level is relatively small, the new approach can examine patterns with
extremely low global support that could never be enumerated on the whole
dataset by the batch method.
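
The control flow can be sketched as follows (Pattern,
mineMostDiscriminative, matching, notMatching, and pure are hypothetical
helpers; the actual pattern search at each node is the contribution of the
paper and is not shown):

    // Tree-guided pattern discovery: mine one discriminative pattern on just the
    // examples at this node, split on its presence, and recurse on both parts.
    static Node buildPatternTree(List<Example> examples, int depth) {
        Node node = new Node(examples);
        if (examples.isEmpty() || depth >= MAX_DEPTH || pure(examples)) return node;
        // Local support may be far below any feasible global minimum support.
        Pattern p = mineMostDiscriminative(examples);
        if (p == null) return node;
        node.children.add(buildPatternTree(matching(examples, p), depth + 1));
        node.children.add(buildPatternTree(notMatching(examples, p), depth + 1));
        return node;
    }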

The discovered feature vectors are more accurate on some of the most
difficult graph and frequent-itemset problems than those of the most
recently proposed algorithms, while their total size is typically 50% or
more smaller. Importantly, the minimum support of some discriminative
patterns can be extremely low (e.g., 0.03%). To enumerate such low-support
patterns, state-of-the-art frequent pattern algorithms either cannot
finish due to huge memory consumption or have to enumerate 10^1 to 10^3
times more patterns before these patterns can even be found.

Details on this work can be found in this paper, and the code and dataset
are available here.