Symbolic Representation Learning

A model’s quality depends strongly on the representation of the data used to train it. For this reason,
methods that can transform input data such that it better suits a given machine learning method can
improve that method’s predictive capacity. We showed that symbolic regression approaches can be
competitive in the task of learning better data representations for standard ML tools [1-3].
These approaches have several desirable properties, including 1) the ability to represent arbitrary nonlinear
relations in the data, 2) scaling that is independent of the number of features in the raw data, and 3) the
ability to produce readable transformations. On a set of 20 classification problems, an ensemble
technique [2] outperformed 7 state-of-the-art ML methods trained on the raw data.
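As a toy illustration of why learned symbolic features can help (our own sketch, not an example from the cited studies), the hypothetical transformation phi(x) = x1 * x2 below exposes an interaction that no threshold on a single raw feature can capture:

```python
import random

random.seed(0)

# Toy dataset: the label is the sign of the interaction x1 * x2,
# which no single raw feature separates on its own.
data = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(200)]
labels = [1 if x1 * x2 > 0 else 0 for x1, x2 in data]

def threshold_accuracy(values, labels):
    """Accuracy achievable by thresholding a single feature at zero."""
    acc = sum(1 for v, y in zip(values, labels) if (v > 0) == (y == 1)) / len(labels)
    return max(acc, 1 - acc)  # allow flipping the decision rule

# Raw feature x1 alone is uninformative (near-chance accuracy)...
raw_acc = threshold_accuracy([x1 for x1, _ in data], labels)
# ...but the symbolic feature phi(x) = x1 * x2 separates the classes exactly.
phi_acc = threshold_accuracy([x1 * x2 for x1, x2 in data], labels)

print(f"raw x1 accuracy: {raw_acc:.2f}")
print(f"phi = x1*x2 accuracy: {phi_acc:.2f}")
```

Any standard classifier trained on phi(x) would inherit the same separability, which is the sense in which a transformed representation can improve a fixed ML method.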

A particular instance of symbolic representation learning is our development of M4GP, a multi-class
classification strategy that uses genetic programming (GP) to learn representations for a nearest
centroid classifier. This method has shown promise on several biomedical informatics problems [3-4],
and outperformed state-of-the-art methods in identifying epistasis in noisy genetics datasets [4].
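The nearest centroid step at the core of this strategy can be sketched as follows. This is a minimal Euclidean-distance version for illustration; the feature map `phi` is a fixed hand-written stand-in for the representation that GP would evolve, not the published method itself:

```python
import math

def centroids(X, y):
    """Mean feature vector per class label."""
    sums, counts = {}, {}
    for xi, yi in zip(X, y):
        s = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            s[j] += v
        counts[yi] = counts.get(yi, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}

def predict(x, cents):
    """Assign x to the class with the nearest centroid (Euclidean distance)."""
    return min(cents, key=lambda c: math.dist(x, cents[c]))

# Hypothetical stand-in for a GP-evolved representation Phi(x).
def phi(x):
    return [x[0] + x[1], x[0] * x[1]]

X = [[0, 0], [1, 1], [4, 5], [5, 4]]
y = [0, 0, 1, 1]
Z = [phi(x) for x in X]        # training data in the learned feature space
cents = centroids(Z, y)
print(predict(phi([0.5, 0.5]), cents))  # near the class-0 centroid
print(predict(phi([4.5, 4.5]), cents))  # near the class-1 centroid
```

In the full method, GP searches over candidate transformations and scores each one by how well the centroid classifier performs in the transformed space, so the representation and the distance-based decision rule are fit jointly.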