Daniel@World

Tuesday, December 12, 2017

Implementing Gradient Boosted Trees

In this post I will share my experience on implementing
gradient boosted trees as seen in XGBoost in order to
get a better understanding of the method. As in previous
posts, the code won't be as efficient as the original.
Furthermore, I will use scala again.
The basic idea behind boosting is to construct an
additive model of a collection of simpler models (base models)
for classification or regresion. If you are unfamiliar with the
general boosting idea plase consider (Hastie: Elements of Statistics Learning)
for a great introduction. This post is mainly based on this blog post
and these slides. In boosting, the model is constructed iteratively
that is, in an iteration $t$, the new output given a datapoint $x$ is:
$$
\hat{y}^t = \hat{y}^{(t - 1)} - f_t(x)
$$
The idea of gradient boosting is that in each iteration the next model
is determined so it minimizes a given loss function $\mathbb{L}$.
Specifically the minimization is performed using the first and second order derivatives
of the loss function. Each added model can also be seen as a step in the direction
of the gradient. In other words, the iterations can be seen as a form
of gradient descent. When extending this idea to regression tree base models and
when regularizing the objective function of the optimization, we end up with
a powerful algorithm that one multiple Kaggle challenges as well as
the Xing hosted recommender systems challenge to years in a row.
In the next part, I will derive the optimization step used for decision trees with a cross entropy
loss and then "massage" in the regularization.

Regularization

One of the ideas implemented in XGBosst is regularisation. Intuitavely we want to
penalize large parameters that dominate our result and model complexity.
Given a parameter set $\theta$, the L1 regularization is $\Omega(\theta) = \sum_i ||\theta_i||_1$,
where $||x||_1$ is the L1 norm or simply the absolute value. In L2 regularization, we switch to
$||\theta_i||_2$ where $||x||_2$ is $x^2$.
Then for a decision tree, we could measure the complexity as the number of leaf nodes $T$,
and the parameters $\theta$ are the output scores of a leaf nodes (we will later use regression trees).
For example, we could penalize large trees by $\gamma T$ and large output by $\frac{1}{2}\sum_i \lambda ||\theta_i||_2$.
We can control the penalty amount by two parameters $\gamma$ and $\lambda$.

Gradient boosted trees

The combined objective using regularization and some loss function is:
$$O = L(y, \hat{y}) + \Omega(\theta)$$
Since we are boosting, we define the objective in terms
of the predictions in iteration $t$:
$$O^t = L(y, \hat{y}^{t - 1} + f(x)) + \Omega(\theta)$$
while $\hat{y}^{t - 1} + f(x)$ follows directly from the boosting formular $\hat{y}^t = \hat{y}^(t - 1) - f_t(x)$.
Since we want to define the gradient in order to optimize f(x), we can apply the Taylor expansion to the loss.
Remember that the Taylor expansion is defined as $f(x + \delta x) \approx f(x) + f'(x) * \delta x + \frac{1}{2} f''(x) \delta x^2$
So the loss becomes
$$O^t = (L(y, \hat{y}^{t - 1}) + g_i * f(x) + \frac{1}{2} * h_i * f(x)^2) + \Omega(\theta))$$
Assuming a decision tree base classifier we can use the example regularization above and the objective becomes:
$$O^t = (L(y, \hat{y}^{t - 1}) + g_i * f(x) + \frac{1}{2} * h_i * f(x)^2) + \gamma T \frac{1}{2}\sum_i \lambda * ||\theta_i||_2$$
With a few transformations, we can redefine the loss in a way we can directly optimize the function
with the standard CART learning algorithm.
First we defined $\mathbb{I}(j == i)$ as an indicator function that returns 1 if example $i$ when pushed trough
the decision tree, ends up in leaf node $j$. Then the result of $f(x_i)$ is $w_j$ the leaf node score at node $j$
ig $x_i$ evaluates true. Now we can rearange the objective function towards the leaf nodes as:
$$O^t = \sum_j [(\sum_i \mathbb{I}(j == i) * g_i) * w_j + \frac{1}{2} (\sum_i \mathbb{I}(j == i) * h_i + \lambda) * w_j^2] + \gamma T$$
redefining the sums of the gradients to $G_j$ and $H_j$ we get:
$$O^t = \sum_j [G_j * w_j + \frac{1}{2} (H_j + lambda) * w_j^2] + \gamma T$$
Now we can update a leaf node using the gradient statistics by setting the derivative of the objective to zero:
$$Gj + (H_j + \lambda) = 0$$
and then $w_j = -\frac{G_j}{H_j + \lambda}$.
Intuitively, we model the gradient to take given an input in each leaf node. And these gradients are normalized by their second
order statistics. So each leaf prediction will push the total function output of the ensemble more towards the wanted result.
In order to construct a tree, we can use the same gradients to define a gain function and then learn the trees like a
normal decision tree. The formular is the gradient "amplitude" before the split $(G_l + G_h)^2 / (H_l + H_r + \lambda)$
and after the split $G_l^2 / (H_l + \lambda)$, where $G_l$ and $H_l$ are the sum of the gradients in the left node or the
right node. Then the gain is simply:
$$Gain = \frac{1}{2} [ \frac{G_l^2}{H_l + \lambda} + \frac{G_r^2}{H_r + \lambda} - \frac{(G_l + G_h)^2}{H_l + H_r + \lambda}]$$
Now we can implement these functions:

In this post I will present my ideas on a refactored version. The main point is that the graph representation I implemented in the first two posts made it easier to build the computation graph,

but the code to run back propagation, especially with weight sharing was quite messy. So this refactored version first creates a computation tree as before, only that the actual computations are decoupled from the tree. The tree later is translated into the instructions and parent / child relationships used in the book version of the back propagation algorithm.

Now in the first step we write model setup code. For example, a neural network for mnist classification:

As in tensorflow I define placeholders, variables and computation nodes, along with their name.

Furthermore, the variables get a shape and an initialisation method. In the background this method constructs a computation tree rooted at a5 using the following DSL.

Each node, keeps track of the name, an instruction name and a shape, if needed.

We can then convert this tree into a neural network ready to be used in the back propagation code from original post. We build four maps keeping:

Mapping from node names to computations (like add, multiply, ...)

Mapping from node names to their parents node names

Mapping from node names to their child node names

Mapping from node names that are weights to matrices.

All computations such as add or multiply implement the forward pass through the variable

as well as the backward pass.

Using a class I called TensorStore that maps from names to matrices, I implemented the forward and the backward pass as follows:

The code can be found along the other versions on github. An example for the spiral dataset as well as the mnist data set is attached.

Of course the code is not at all up to the popular libraries in terms of speed or functionality so for me it is more an ongoing investigation on how these libraries are implemented and what I can reproduce. :). On Mnist the given network achieves between 96 and 98 % accuracy depending on the starting condition.

Sunday, September 10, 2017

Variational Fashion Encoder

In this experiment I want to try out Variational Auto Encoders, an unsupervised feature learning algorithm on a new fashion classification dataset from Zalando. Since I wanted to play around with both more, I will build a generic Variational Auto Encoder and then learn it on the new fashion-mnist dataset.

Variational Auto Encoder

The bottom part of the model is embedding the input X into a mean and variance vector.

The mean and variance represent the parameters of a gaussian that is trained to be close to a standard normal distribution N(0, I).The decoder network is trying to reconstruct the input from a sample from said distribution. The encoder as well as the decoder network can have multiple layers in my case 3 each.

The loss function is a combination of the Kullback-Leibler divergence between the distribution from the encoder network and a standard normal distribution, basically centering the latent space around 0 with standard deviation 1. This can be seen as a form of regularisation. Furthermore, the decoder
does not reconstruct directly from the encoder output but from a sample of the distribution in the latent space.

Fashion Reconstruction

The images in the fashion dataset are 28 pixels by 28 pixels, in the same way as the MNIST dataset.

The fashion data consists of 10 - classes of cloths:

tshirt

trouser

pullover

dress

coat

sandal

shirt

sneaker

bag

ankle boo

After training for 5 epochs on the training set, I plotted the reconstructions from the Variational Auto Encoder.

As one can see the reconstructions work quite well indicating the latent space learned is useful.

Tuesday, August 1, 2017

I recently skimmed through some class videos by Tucker Balch, Professor at Georgia Tech called "Machine Learning For Trading". One interesting part was portfolio optimisation.

That is, if you want to invest some amount of money, how do you distribute it
over a selection of assets in a way minimising risk while at the same time
being able to earn money through your investment. A portfolio is basically an allocation of the money into different options or in other words, the percentage of the total amount of money allocated to an option. Anyways, in the class as well as the course book the optimisation is performed using a
python method for constrained non-linear optimisation. The algorithm used to crunch these problems is called Sequential Least Squares Programming, which is implement using a fortran wrapper.

Sunday, July 23, 2017

I found an interesting paper on black box variational inference. Variational inference is an
approximate inference algorithm for factored probabilistic models (such as bayesian networks).

In probabilistic inference, we are normally interested in the posterior density of a latent variable
given all our data. The basic idea is to turn the inference problem into an optimisation problem.
That is, we build an approximate probability distribution that is differentiable with respect to it 's parameters. We then find the parameter set that minimises the difference between the true posterior and the approximate distribution.

The problem is that in order to do this for a new model, we still have to do a lot of manual derivations
and work through a lot of math before we can actually use this algorithm. This new paper
however, shows an algorithm that is easier to use given a new model. In my opinion,
one of the successes in deep learning is auto differentiation, where we do not
have to develop all gradients on our own but let the computer do it (as described in earlier posts).
One library that implements this inference method for complex time series is called pyFlux [2].

The proposed algorithm [1] works as follows. First, we sample the latent variables in our model
from the approximate distribution. We then use the samples to compute a gradient
for each parameter that minimises the Kullback-Leibler divergence or more specifically the Expectational Lower Bound (ELBO).

Reproduced from original paper [1]

The z[s] are samples from the latent space. Each z[s] includes a values for each of the latent variables. These samples are used to build the gradient.
The value of the Robbins - Monroe Sequence, is modelling an adaptive learning rate schedule
to guarantee convergence. The gradient has multiple components, including the joint probability of the original distribution, the log of the approximate distribution and the derivative of the approximate distribution. For a lot of cases, these values are easy to compute. First note that when using the samples, all variables in the original joint distribution are observed. For a Bayesian network that means that the joint distribution is simply the product of all variables:

In the paper and in the pyFlux library, the approximate distribution is implemented as a mean field.
Which means that the variables in our model are all independent. This makes the sampling and evaluation of the model very easy. Each variable can be sampled individually and the conditional distribution is simply the product of all the variables. Finally, the pyFlux library implements each variable in the approximate distribution as a one dimensional gaussian. Putting all that together,
the final gradient becomes:

Where the lambda parameters will be the mean and variance of each of the gaussians used
with the gradients:

Friday, May 26, 2017

In this post I will describe how a general sequence alignment could be implemented in Scala.
For readers that don't know what sequence alignment is, I recommend reading mypost on building a gesture recogniser under Dynamic Time Warping distance (DTW).

Sequence alignment is often used in several contexts including speech recognition,
gesture recognition, approximate string matching, typing correction, biological
sequence alignment and time series classification and even stereo vision.
All these algorithms are based on a technique called dynamic programming and the algorithms
are all very similar. However, in literature this alignment technique is often found under different names (Dynamic Time Warping, Levensthein Distance, Needleman-Wunsch).

I will first formulate my idea of a general alignment algorithm with a scala implementation and afterwards describe an implementation of the three above algorithms.

Given two sequences X = [x1 .. xi .. xN] and Y = [y .. yj .. yM], we use the following
recursion to find the alignment score S(x, y). At each position i in sequence X and position j in sequence Y we look backwards at the matching score in X only, in Y only and in X and Y simultaneously resulting in the following recursion:

The rank function scores the alignment at position i and j based on the previous alignment scores and
the current elements. The rank function is the main component that changes in all three algorithms.
Looking back in X direction is also called an insertion, in Y direction a deletion and for both it is
called a match so the ranking function becomes:

Furthermore we will also implement a generic initialisation function that starts our recursion.
Ok now the Scala code:

The alignment class has a type parameter specifying the type of the elements in the sequences we want
to align and takes the initialising function and the ranking function as input. Now DTW and the Edit distance are just specialisations with a different type parameter and scoring function:

DTW is defined over doubles and is initialising the recursion with 0.0 at position (0,0) and all others to infinity. The ranking adds the euclidian distance to the insertion, deletion and match and then takes the minimum. It is easy to see how this can be generalised to multivariate sequences. The type parameter becomes Seq[Double] and the euclidean distance is implemented for vectors. One can also use kernels
instead or the cosine similarity.
The next function is the edit distance for discrete sequence elements (here characters):

The function is initialised as the index at the borders and the function again takes the minimum
of insertion, deletion and match. Insertions and deletions count as an error so we add 1.
A match can also be a substitution so we add 1 to the error if the characters do not match.
The last algorithm is the Needleman-Wunsch algorithm. Instead of minimising the distance
we are maximising a score. The similarity between two symbols can be defined and the 'gap' penalty
is not 1 as in the edit distance but given.

Saturday, April 15, 2017

I really enjoy it when concepts can be reused or applied in several contexts. Three of the algorithms
that seem to be everywhere are quick sort, merge sort and binary search. In this post I will
just show some tricks I learned during coding. Using a similar algorithm than the merge method
of merge sort for fast sparse vector multiplication, using binary search to insert into a histogram in O(log(n)) time, and using the partition method of quick sort during C.45 decision tree training.

Fast Sparse Vector Multiplication

A sparse vector is a vector where most if the entries are zeros. A common data structure is an array saving all non zero entries. Each entry is a tuple of the position and the corresponding value. Also the array is sorted increasingly by the position. Given two such vectors, we can multiply two such vectors very efficiently. We use two variables i and j, indicating a position in one of the vectors' array x and y. The variables move at the same speed as long as the positions of x(i) and y(j) are the same. In this case, we can also aggregate the multiplication of the vector elements. If i points to a position larger than j, we only move j until i and j are equal and vice versa. Since in the previous case, x(i) or y(j) is a zero entry, we don't have to compute anything. The movement of i and j is the same in the merge method of merge sort.

Histogram Insertion

A histogram can be represented as an array that holds the counts of values that fall into

a certain region. The range is another array containing the boundaries, which should be sorted

in increasing order. When we want to estimate the histogram, we need to find the region the value falls in. Naively that can be done by iterating the boundary array and checking if the value is in the range. However, we can perform this search using binary search in the boundary array.

Instead of searching an element though, we are checking if the value is larger than boundary[middle - 1] and is smaller than boundary[middle], if so we can increase the count for the middle, else we recursively search the left or right side, depending if the value is smaller or larger than the middle.

Decision Tree Learning

A dataset for decision tree learning can be represented as a 2d array X[N][D] of N vectors each of D

dimensions. At each node in the decision tree, a node partitions the data points into two subsets, one contains all vectors whose dimension D is smaller than a threshold and the other subset contains the vectors where dimension D is larger. We can introduce an ordering array. The left side of the array from a pivot element contains all instances where the split decision is true and the right side contains all elements where the decision is false. We can swap the elements from the left to right whenever

a split decision is violated (the element should be left but is right). We can then call the partition

function recursively on the left and the right side. If we apply the function recursively, we partition the dataset into the split decisions as a decision tree does. Of course, we still have to select the correct

spit decision for every level but this is the material for another post. In the end, this function is very similar to the quick sorts partition method. Another note: The method only works well for binary decision trees, which is not a huge problem in practice.