An oblivious decision tree (ODT) is a tree that uses the same split in all nodes at a particular depth.

An ODT of depth $n$ consists of $n$ pairs (feature index, split threshold) and $2^n$ real values, which correspond to the predictions in the leaves.

In [19]:

def predict_odt(data, features, splits, leaf_values):
    # first compute leaf for each sample in data
    leaf_indices = np.zeros(len(data), dtype=int)
    for feature, split in zip(features, splits):
        leaf_indices *= 2
        leaf_indices += data[:, feature] > split
    predictions = leaf_values[leaf_indices]
    return predictions

That's it. It was really simple; now let's create some fake data and a fake ODT to measure speed.
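For concreteness, a minimal sketch of that setup (the sizes and names here are my assumptions; `predict_odt` is restated so the snippet is self-contained):

```python
import numpy as np

def predict_odt(data, features, splits, leaf_values):
    # same function as above, repeated for self-containment
    leaf_indices = np.zeros(len(data), dtype=int)
    for feature, split in zip(features, splits):
        leaf_indices *= 2
        leaf_indices += data[:, feature] > split
    return leaf_values[leaf_indices]

n_samples, n_features, depth = 10000, 20, 6
rng = np.random.RandomState(42)
data = rng.normal(size=(n_samples, n_features))
features = rng.randint(0, n_features, size=depth)   # one feature index per level
splits = rng.normal(size=depth)                     # one threshold per level
leaf_values = rng.normal(size=2 ** depth)           # one prediction per leaf

predictions = predict_odt(data, features, splits, leaf_values)
print(predictions.shape)   # (10000,)
```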

Data in numpy is stored as a contiguous chunk of memory. This gives high flexibility (instant transpositions, reshaping, views and strided tricks), but sometimes we need to keep an eye on the strides (in particular, on the order of dimensions).

Operations on sequential data are faster thanks to the CPU cache.
By default, numpy uses C order, so elements of one column are far from each other in memory,
while elements within a row are placed together.
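This is easy to see through the strides of an array, which tell how many bytes separate consecutive elements along each dimension (a small sketch):

```python
import numpy as np

x = np.zeros((1000, 1000))   # C-order (row-major) by default, float64
# moving along a row steps 8 bytes; moving along a column steps 8000 bytes
print(x.strides)             # (8000, 8)
# a transposition is just a view with swapped strides - no data is copied
print(x.T.strides)           # (8, 8000)
# Fortran order makes columns contiguous instead
x_f = np.asfortranarray(x)
print(x_f.strides)           # (8, 8000)
```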

For the exponential (AdaBoost) loss, the optimal value in each leaf is

$$\text{leaf value} = \frac{1}{2} \log \frac{w_{leaf,+}}{w_{leaf,-}},$$

where $w_{leaf,+}, w_{leaf,-}$ are the weights of signal and background events in the leaf. Let's write this procedure in beautiful numpy code:

In [28]:

def compute_optimal_leaf_values(terminal_regions, labels, predictions):
    """
    terminal_regions are integers, corresponding to the number of the leaf each event belongs to
    labels are +1 and -1
    predictions are real numbers - sum of predictions given by previous trees in boosting
    """
    weights = np.exp(-labels * predictions)
    w_plus = np.bincount(terminal_regions, weights * (labels == +1))
    w_minus = np.bincount(terminal_regions, weights * (labels == -1))
    return np.log(w_plus / w_minus) / 2.
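A quick sanity check on toy data (the function is restated here so the snippet is self-contained; with zero previous predictions all weights equal one):

```python
import numpy as np

def compute_optimal_leaf_values(terminal_regions, labels, predictions):
    # restated from the cell above for self-containment
    weights = np.exp(-labels * predictions)
    w_plus = np.bincount(terminal_regions, weights * (labels == +1))
    w_minus = np.bincount(terminal_regions, weights * (labels == -1))
    return np.log(w_plus / w_minus) / 2.

# toy check: 2 leaves, zero previous predictions
terminal_regions = np.array([0, 0, 0, 1, 1])
labels = np.array([+1, +1, -1, +1, -1])
predictions = np.zeros(5)
leaf_values = compute_optimal_leaf_values(terminal_regions, labels, predictions)
print(leaf_values)
# leaf 0: log(2/1)/2 ~ 0.3466 (signal-dominated), leaf 1: log(1/1)/2 = 0 (balanced)
```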

The problem is to compute these things when part of the data should be ignored. For instance,
houses which are very expensive (top 5%) are probably expensive not due to their position. Can we ignore them when computing the average?

We can guess the language based on price, but this can be unreliable in other cases.

Let's combine tricks studied earlier:

In [44]:

lang_sorter = np.argsort(lang_average_salaries)
# languages ranked by average salaries:
lang_salaries_ordered = np.argsort(lang_sorter)
programmer_top_language = np.zeros(programers.max() + 1, dtype=int)
np.maximum.at(programmer_top_language, programers, lang_salaries_ordered[languages])
# now we need to decode order of language back to original ID
programmer_top_language = lang_sorter[programmer_top_language]

In [45]:

programmer_top_language

Out[45]:

array([ 3, 2, 7, ..., 27, 27, 7])

Checking against the previous result: taking the average salaries for the top languages.
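The whole trick can also be checked end-to-end on hypothetical toy data (array names mirror the cells above, but the values here are mine):

```python
import numpy as np

lang_average_salaries = np.array([50., 120., 80.])   # average salary per language id
programers = np.array([0, 0, 1, 1, 2])               # programmer id per record
languages = np.array([0, 1, 1, 2, 0])                # language id per record

lang_sorter = np.argsort(lang_average_salaries)      # [0, 2, 1]
lang_salaries_ordered = np.argsort(lang_sorter)      # rank of each language: [0, 2, 1]

programmer_top_language = np.zeros(programers.max() + 1, dtype=int)
np.maximum.at(programmer_top_language, programers, lang_salaries_ordered[languages])
programmer_top_language = lang_sorter[programmer_top_language]

# programmers 0 and 1 both know language 1 (highest-paid); programmer 2 knows only language 0
print(programmer_top_language)   # [1 1 0]
```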

NB: there were programmers with no languages; their 'top language' became the worst-paid one.
It may be worth creating a special pseudo-language to denote such situations.

In the uBoost algorithm we need to compute local efficiencies many times:
for each event, which fraction of its neighbors passes the given threshold. A sparse matrix is a good option in this case. Similar computations arise in other classifiers (uGB+kNN, RankBoost).

As an exercise, let's compute what fraction of programmers knowing language X have a salary greater than salary_threshold (printing the result only for the first 5 languages):
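One possible solution sketch with a sparse "knowledge" matrix (the toy data and names like `knows` are my assumptions, not from the original):

```python
import numpy as np
from scipy import sparse

# hypothetical toy data: 3000 (programmer, language) pairs
n_programmers, n_languages = 1000, 5
rng = np.random.RandomState(42)
programmers = rng.randint(0, n_programmers, size=3000)
languages = rng.randint(0, n_languages, size=3000)
salaries = rng.uniform(30000, 200000, size=n_programmers)  # one salary per programmer
salary_threshold = 100000

# knows[lang, programmer] = 1 if the programmer knows the language
knows = sparse.csr_matrix(
    (np.ones(len(programmers)), (languages, programmers)),
    shape=(n_languages, n_programmers))
knows.data[:] = 1  # duplicates were summed during construction; reset them to 1

# one matrix-vector product counts, per language, programmers above the threshold
passed = knows.dot((salaries > salary_threshold).astype(float))
totals = knows.dot(np.ones(n_programmers))
fractions = passed / totals
print(fractions[:5])
```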

This is a situation where numpy is really not of much help. Applying sorting twice is a bad approach to a problem that needs no sorting at all.

pandas suggests a good one-line alternative, but it is, hmm, a bit slow.

In [64]:

import pandas
df = pandas.DataFrame({'cat': categories_stream, 'ones': np.ones(len(categories_stream))})
# truncating to first 10000, otherwise it will never finish
df = df[:10000]
%time result = df.groupby('cat')['ones'].transform(pandas.Series.cumsum)

CPU times: user 956 ms, sys: 5.42 ms, total: 962 ms
Wall time: 965 ms

Ok, then it is time to use some external tool. Let's do it with parakeet.

In [65]:

import parakeet
parakeet.config.backend = 'c'  # one thread

@parakeet.jit
def compute_online_counter_parakeet(categories_stream):
    counters = np.zeros(np.max(categories_stream) + 1)
    online_counter = np.zeros_like(categories_stream)
    for i in range(len(categories_stream)):
        online_counter[i] = counters[categories_stream[i]]
        counters[categories_stream[i]] += 1
    return online_counter