t-Distributed Stochastic Neighbor Embedding

Description

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets and should be available in HeuristicLab.

What would be the call if I don't have any features, but only a matrix of distances? Is there some kind of DistanceFunction that is just a lookup in a matrix?

Also, as we (bwerth + abeham) discussed: the initial parameters are probably not well suited for a large number of cases. A benchmark set of several different datasets should be generated and different parameter settings applied to identify one set that works best on average. Above all, I would strongly recommend that theta default to 0, because the approximation seems suitable only for large datasets. Maybe we can also auto-set this parameter once the problem dimension is known.

Finally, I would recommend to provide t-SNE both as an easy-to-use API call and as a BasicAlgorithm.

> What would be the call if I don't have any features, but only a matrix of distances? Is there some kind of DistanceFunction that is just a lookup in a matrix?

Good point. I propose that we provide a different version of the algorithm for this case (because we don't have a dataset, as we do for all other data-analysis algorithms).
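A minimal sketch of what such a distance-matrix version could look like (Python for illustration, not the HeuristicLab C# API; the class and method names are hypothetical): the "distance function" ignores feature vectors entirely and just looks up a precomputed value by item index.

```python
class MatrixDistance:
    """Hypothetical distance defined by a precomputed symmetric matrix;
    the items are simply their row/column indices."""

    def __init__(self, matrix):
        # matrix[i][j] = precomputed distance between items i and j
        self.matrix = matrix

    def get(self, i, j):
        return self.matrix[i][j]

# usage: three items with pairwise distances given up front
d = MatrixDistance([[0.0, 1.5, 2.0],
                    [1.5, 0.0, 0.7],
                    [2.0, 0.7, 0.0]])
print(d.get(0, 2))  # 2.0
```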

> Also, as we (bwerth + abeham) discussed: the initial parameters are probably not well suited for a large number of cases. A benchmark set of several different datasets should be generated and different parameter settings applied to identify one set that works best on average. Above all, I would strongly recommend that theta default to 0, because the approximation seems suitable only for large datasets. Maybe we can also auto-set this parameter once the problem dimension is known.

I also already noticed that theta=0 produces very different results. We should keep the approximation as an option for the case N>5000, but use theta=0 as the default. Additionally, we should look at the differences between theta=0 and theta=eps; maybe there is another issue hidden there.

> Finally, I would recommend to provide t-SNE both as an easy-to-use API call and as a BasicAlgorithm.

Full ack. I plan to refactor the code to provide a simple static method.

In fast_tsne.m an initial dimensionality reduction using PCA is performed on the features before running bh_tsne. This is not absolutely necessary.

Normalization of the data should be moved to TSNEStatic (zero mean, then dividing by the maximum). This is not possible, as the t-SNE implementation is not specific to real vectors; therefore, scaling was left in the main algorithm.
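For reference, the scaling described above amounts to the following (a Python sketch of the math only, assuming "zero mean and dividing by max" means centering each column and then scaling by the largest absolute value; this is not the actual TSNEStatic code):

```python
def normalize(data):
    """Center each column to zero mean, then divide all entries by the
    largest absolute value so the data lies in [-1, 1]."""
    rows, cols = len(data), len(data[0])
    # subtract each column's mean
    means = [sum(row[c] for row in data) / rows for c in range(cols)]
    centered = [[row[c] - means[c] for c in range(cols)] for row in data]
    # scale by the overall maximum absolute value
    max_abs = max(abs(v) for row in centered for v in row)
    if max_abs == 0:
        return centered  # all values identical; nothing left to scale
    return [[v / max_abs for v in row] for row in centered]

print(normalize([[0.0, 2.0], [2.0, 4.0]]))  # [[-1.0, -1.0], [1.0, 1.0]]
```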

Code Review Comments

Distance related calculation

This should all be moved to HeuristicLab.Common.

The InnerProductDistance needs to be looked at. It does not satisfy point 3 of the conditions mentioned in IDistance: it is not zero-reflexive, because d(x,x) can only be 0 if x is the null vector. The inner product is not a measure of distance, because it gets larger the more similar two vectors are (if normalized by the product of the vectors' lengths). Also, the description mentions angular distance and speaks of normalization, but there is no normalization. It is also unclear to me why it is only defined for vectors with all non-negative components.

A distance measure based on the inner product would probably look something like 1 - cos(alpha), where cos(alpha) = <x,y> / (||x|| * ||y||) and <x,y> denotes the inner product. This yields a distance in the interval [0; 2] related to the angle between the two vectors: 0 for vectors of identical direction, 1 for perpendicular vectors, and 2 for vectors pointing in exactly opposite directions.

Implementing the distance measures with .Zip() and .Sum() is nice to read, but inefficient. I would recommend an efficient implementation at the cost of a little readability, especially if it is put into Common.
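The efficient variant is a single explicit loop with no intermediate sequences, i.e. the analogue of replacing .Zip(...).Sum() with a plain for loop (sketched here in Python; the actual change would be in the C# distance classes):

```python
def squared_euclidean(x, y):
    """Squared Euclidean distance in one pass, without allocating
    intermediate pair/difference sequences."""
    total = 0.0
    for i in range(len(x)):
        diff = x[i] - y[i]
        total += diff * diff
    return total

print(squared_euclidean([1.0, 2.0], [4.0, 6.0]))  # 25.0
```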

Collection and Trees

It is mentioned in the TODO of PriorityQueue that this should move to HeuristicLab.Collections. However, it's not a storable class and it's not observable. I would move it to HeuristicLab.Common. VantagePointTree and SpacePartitioningTree should also be moved to Common (maybe all together in a folder Collections).

Static TSNE

It uses different default values than the TSNEAlgorithm. I personally think that an eta of 200 is too high; I would go with the same defaults that are used in the t-SNE for data-analysis problems. Van der Maaten mentions an adaptive learning rate at some points.

TSNEAlgorithm

This also concerns the static TSNE: if our implementation is based on some variant of t-SNE (of which there are several), we should state that in the ItemName and/or the ItemDescription. Additionally, a reference to the publication or another source should be given in the description.

Further Review Comments

ClassNames parameter should be an OptionalConstrainedValueParameter<StringValue> and should be filled with the input variable names when a problem is loaded.

DistanceFunction should be a ConstrainedValueParameter<IDistance<double[]>> and should be filled with all instances of type IDistance<double[]> through discovery using ApplicationManager.GetInstances().

Both should be exposed as IConstrainedValueParameter<T> in the parameter properties.
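The discovery step can be illustrated like this (a rough Python analogue of ApplicationManager.GetInstances(); the class names below are made up for the example, and the real code would of course use the C# plugin infrastructure):

```python
class Distance:  # stand-in for IDistance<double[]>
    pass

class EuclideanDistance(Distance):
    pass

class ManhattanDistance(Distance):
    pass

def get_instances(base):
    """Instantiate every direct or indirect subclass of `base`, mimicking
    ApplicationManager.GetInstances()-style type discovery."""
    found = []
    for cls in base.__subclasses__():
        found.append(cls())
        found.extend(get_instances(cls))
    return found

# fill the parameter's list of valid values with all discovered distances
valid_values = get_instances(Distance)
print(sorted(type(v).__name__ for v in valid_values))
```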

In implementing this you must then override OnProblemChanged and add event handlers to ProblemDataChanged. When the problem or problem data changes, clear the constrained value parameter for the class names and refill it with the ProblemData's InputVariables (use all of them, not just the allowed ones). If event handlers are added, you must add an after-deserialization hook and register the event handlers there as well. Event handlers must also be registered in the default constructor and in the cloning constructor.

Ideally, the Theta parameter would be a PercentValue instead of a DoubleValue. Internally this is exactly the same, but it is presented to the user as a percentage, which makes it more obvious that the value lies in the range [0;1].

TSNE throws an exception if a variable is included that has the same value in every row. The ArithmeticException that is thrown (in Math.Sign) says: "Function does not accept floating point Not-a-Number values." Maybe this can be caught and rethrown with a better description.
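One option, sketched in Python (names and message are illustrative, not the actual HeuristicLab code): validate the input up front and fail with a message that names the offending variable, instead of letting the NaN surface as an obscure ArithmeticException deep inside the algorithm.

```python
def check_for_constant_columns(data, names):
    """Raise a descriptive error if any input variable is constant,
    since t-SNE cannot process zero-variance variables."""
    for c, name in enumerate(names):
        values = [row[c] for row in data]
        if max(values) == min(values):
            raise ValueError(
                f"Variable '{name}' has the same value in every row; "
                "t-SNE cannot process constant variables. Please exclude it.")

# usage: the second column is constant, so a descriptive error is raised
try:
    check_for_constant_columns([[1.0, 5.0], [2.0, 5.0]], ["x", "const"])
except ValueError as e:
    print(e)
```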