Local Optimization and Complexity Control for Symbolic Regression

Abstract

Symbolic regression is a data-driven machine learning approach that creates interpretable prediction models in the form of mathematical expressions, without requiring the model structure to be specified in advance. Because the space of possible models is vast, symbolic regression problems are commonly solved with metaheuristics such as genetic programming. A drawback of this approach is that the model structure and the model parameters are optimized simultaneously, which increases the effort required to learn from the data and can reduce the achieved prediction accuracy. Furthermore, genetic programming in general suffers from bloat, an increase in model length and complexity without an accompanying increase in prediction accuracy, which hampers the interpretability of the models. The goal of this thesis is to develop and present new methods for symbolic regression that improve the prediction accuracy, interpretability, and simplicity of the models.
Prediction accuracy is improved by integrating local optimization techniques into the algorithm that adapt the numerical model parameters. The symbolic regression problem is thereby divided into two separate subproblems: finding the most appropriate structure for describing the data, and finding optimal parameters for a given model structure. Genetic programming excels at finding appropriate model structures, whereas the Levenberg-Marquardt algorithm performs least-squares curve fitting to tune the model parameters. The combination of these two methods significantly improves the prediction accuracy of the generated models.
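The parameter-tuning subproblem can be illustrated with a minimal sketch. It is a textbook Levenberg-Marquardt loop written from scratch, not the implementation used in the thesis. The model structure a * exp(b * x), the synthetic noise-free data, the starting values, and all function names are assumptions chosen for illustration; in the combined approach the structure would come from genetic programming.

```python
import math

def model(x, a, b):
    # Hypothetical fixed structure, standing in for one found by genetic programming.
    return a * math.exp(b * x)

def sum_squared_error(xs, ys, a, b):
    return sum((model(x, a, b) - y) ** 2 for x, y in zip(xs, ys))

def levenberg_marquardt(xs, ys, a, b, max_iter=200, tol=1e-12):
    """Tune (a, b) by damped Gauss-Newton steps (Marquardt diagonal scaling)."""
    lam = 1e-3  # damping factor
    sse = sum_squared_error(xs, ys, a, b)
    for _ in range(max_iter):
        # Accumulate J^T J and J^T r, where r is the residual vector
        # and J holds the partial derivatives of the model w.r.t. a and b.
        jaa = jab = jbb = ga = gb = 0.0
        for x, y in zip(xs, ys):
            e = math.exp(b * x)
            r = a * e - y
            da, db = e, a * x * e
            jaa += da * da
            jab += da * db
            jbb += db * db
            ga += da * r
            gb += db * r
        # Solve (J^T J + lam * diag(J^T J)) * step = -J^T r for the 2x2 case.
        m11, m12, m22 = jaa * (1 + lam), jab, jbb * (1 + lam)
        det = m11 * m22 - m12 * m12
        step_a = (-ga * m22 + gb * m12) / det
        step_b = (-gb * m11 + ga * m12) / det
        if max(abs(step_a), abs(step_b)) < tol:
            break
        new_sse = sum_squared_error(xs, ys, a + step_a, b + step_b)
        if new_sse < sse:
            # Accept the step and trust the quadratic model more.
            a, b, sse = a + step_a, b + step_b, new_sse
            lam /= 10
        else:
            # Reject the step and move towards gradient descent.
            lam *= 10
    return a, b, sse

# Synthetic, noise-free data generated with a = 2.0 and b = 0.5 (assumed values).
xs = [0.5 * i for i in range(7)]
ys = [2.0 * math.exp(0.5 * x) for x in xs]

a_fit, b_fit, sse_fit = levenberg_marquardt(xs, ys, a=1.0, b=0.1)
print(a_fit, b_fit, sse_fit)
```

The damping factor is what separates this from plain Gauss-Newton: a rejected step increases it, which shortens the step and turns it towards the gradient direction, while an accepted step decreases it.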
A second improvement turns the standard single-objective formulation of symbolic regression into a multi-objective one, in which the prediction accuracy is maximized while the model complexity is simultaneously minimized. As a result, the algorithm produces not a single solution but a Pareto front of models with varying accuracy and complexity. In addition, a novel complexity measure for multi-objective symbolic regression is developed that includes syntactic and semantic information about the models and can still be computed efficiently. With this new complexity measure, the generated models become simpler and the occurrence of bloat is reduced.
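The notion of a Pareto front can be made concrete with a small sketch. The candidate models, their error values, and their complexity values below are invented for illustration, and complexity is treated as a plain number (for example a node count) rather than the complexity measure developed in the thesis. Maximizing accuracy is expressed here as minimizing an error value.

```python
def dominates(p, q):
    """True if p is no worse than q in both objectives and strictly better in one.

    Each model is a (name, error, complexity) tuple; both objectives are minimized.
    """
    return p[1] <= q[1] and p[2] <= q[2] and (p[1] < q[1] or p[2] < q[2])

def pareto_front(models):
    # Keep every model that no other model dominates.
    return [m for m in models if not any(dominates(o, m) for o in models)]

# Hypothetical candidates: (name, prediction error, complexity).
candidates = [
    ("m1", 0.30, 3),
    ("m2", 0.12, 7),
    ("m3", 0.12, 11),  # same error as m2 but more complex
    ("m4", 0.05, 15),
    ("m5", 0.20, 9),   # worse than m2 in both objectives
]

front_names = [name for name, _, _ in pareto_front(candidates)]
print(front_names)  # ['m1', 'm2', 'm4']
```

The front keeps m1, m2, and m4 because each of them trades accuracy against complexity in a way that no other candidate improves on in both objectives at once.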