Building a Process Output Optimization Solution using Multiple Models, Ensemble Learning and a Genetic Algorithm.

1. Abstract

The objective of this paper is to present the process of building a model that identifies the right combination of inputs for optimizing the Concrete Compressive Strength. Multiple machine learning algorithms were evaluated. A process of optimizing the solution using Ensemble Learning was identified and successfully tested. Ensemble methods are meta-heuristic techniques used to combine and improve the predictions of multiple learning algorithms.

The model building was done in the following stages:

Multilayer Perceptron models were built with different numbers of hidden layers and evaluated. The deep learning ones with 3 hidden layers generally performed well.

An ensemble of the best performing Multilayer Perceptrons was tested and was found to significantly improve the performance.

Other Regression Models were tested. Some of them were Ensemble models.

An Ensemble of all the best performing models was also tested and it showed significantly improved results. Some of the models included were Ensemble ones themselves.

This ensemble was passed as an input to a chain of ensembles, which further improved the performance.

Finally a search was made to find the combination of input parameters that would maximise the Concrete Compressive Strength using Turing Point’s GA optimizer.

This work is the outcome of a comprehensive prototyping and proof-of-concept exercise conducted by Tirthankar Raychaudhuri, Sankaran Iyer and Avirup Das Gupta at Turing Point (http://www.turing-point.com/), a consulting company focused on providing Enterprise Machine Learning solutions based on advanced techniques such as 3D discrete event simulation, deep learning and genetic algorithms.

2. Introduction

Machine Learning (ML), a branch of Computer Science that focuses on drawing insights and conclusions by examining data sets, is an increasingly popular discipline for resolving enterprise business issues. However, the field is vast and consists of numerous algorithms and approaches. Data sets are also often complex and need to be pre-processed before an ML algorithm can be 'trained' to learn from them. For a particular problem domain and data set, defining the pre-processing technique and selecting the ML algorithm (or set of algorithms) is still largely 'an art rather than a science', depending on the knowledge and skills of the expert or data scientist in question. As the discipline matures, this will change, and scientific guiding principles and best practices will emerge for pre-processing data and selecting appropriate algorithms for a particular problem domain.

In the meantime we have conducted a study of applying the so-called 'ensemble learning' approach to a data set.

2.1 What is Ensemble Learning?

Ensemble learning is the process by which multiple models, such as classifiers or experts, are strategically generated and combined to solve a particular computational intelligence problem. It has been found that an ensemble can often perform better than a single model. This process is similar to important decisions we take in day to day situations.

An investment in equity may require consulting multiple analysts for their expert opinion. Each one may look at it from a different angle. It may also be good to consult the opinion of friends and relatives. Finally a consensus decision is taken.

The election of an office bearer from potential candidates is the result of the maximum number of votes cast by the members.

One may need to consult multiple real estate analysts to decide on the right price for a property.

The individual models contributing to a decision may differ due to a number of factors:

The algorithms used in building the model

The training samples used to build the model

There may be a difference in hypothesis

The initial seed in model building may be different

A combination of all the items listed above.

Section 5.5 lists the Ensemble learning algorithms used in this paper. As discussed in Section 4.2, Ensemble Learners address some model issues such as the bias-variance trade-off.

2.2 Building Machine Learning Models

Building a Machine Learning model is a complex process. A suitable algorithm, or an ensemble of algorithms, needs to be chosen from a plethora of available options.

The output of the model can be a classification, such as identifying the type of a car from its features, or it can be a continuous value rather than a discrete item, in which case the task is a Regression problem. Often the solution depends on the complexity of the problem being addressed. In some situations a simple linear model may be sufficient, but in other situations a complex combination may be warranted.

The Concrete Compressive Strength use case addressed by this paper is a complex Regression problem, so only algorithms that can handle a problem of this type were considered. The model selection process was carried out in 4 stages.

Multilayer Perceptrons, especially the deep learning models, can be used for complex problems. But they need to be configured for the number of hidden layers and the neurons per layer. Trying out different model configurations was addressed in the first stage.

The second stage involved an Ensemble of the best performing Multilayer Perceptrons.

The other popular Regression algorithms were evaluated during stage 3.

The final stage involved an ensemble of all the best performing algorithms, some of which were ensembles themselves. The mean of the selected models was then appended to the attributes and passed to a chain of Ensemble models.

3. Concrete Strengthening Process

The purpose of this paper is to build a Regression Model for the Concrete Strengthening Process. The description of the process and the data set can be found at the following link: http://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength

This is a free and complex dataset available from the Machine Learning Repository of the Center for Machine Learning and Intelligent Systems at the University of California, Irvine.

Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age and ingredients. The input attributes are cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, fine aggregate and age. The Concrete Compressive Strength is the last attribute, which is the desired output resulting from the combination of the inputs.

Figure 1 illustrates the Concrete Compressive Strength Process.

Figure 1: Block Diagram of Concrete Compressive Strength Process

The objective is to model Concrete Compressive Strength as a function of these input variables. The dataset contains 1030 measurements.

4. The Process of building and verifying Machine Learning Systems

The objective of any machine learning system is to emulate real-world behavior as a function of the independent variables or predictors. To do this, the behavior is modeled with training samples, verified against test samples, and released in the hope that the resulting solution will accurately predict the outcome for unseen data. Confidence in the model will be high if the training data contains samples representative of all the variations of the real world. However, there can be practical limitations in obtaining data sets. It may not be possible to get samples of all possible variations, which limits how good the model can be.

4.1 Training and Test Split

Hence one can work only with what is available, putting the right processes in place to build as good a model as possible. The data has to be assumed to be independent and identically distributed. The data set is randomly split into training and test data. The test data is used only for verifying performance and must not be used in any part of the model building process. Typical split ratios are 60:40, 70:30, 80:20 or even 90:10.

For our model this ratio was deliberately kept at 50:50 in order to increase confidence in the resulting model. The 1030 tuples of the data set were split randomly into training and test sets of 515 tuples each. The test data was set aside and used only for testing. No changes were made to the models after testing.
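A random 50:50 split of this kind can be sketched in a few lines. The example below is illustrative Python rather than the Java/Weka code used in the study; the function name, the seed and the stand-in records are assumptions made for the example.

```python
import random

def split_half(rows, seed=42):
    """Shuffle a copy of the rows and cut it into two halves (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]

# Stand-in for the 1030 records; real rows would hold 8 inputs plus the strength.
records = list(range(1030))
train, test = split_half(records)
print(len(train), len(test))
```

Keeping the seed fixed means the same 515 test tuples are held out on every run, which matters if models trained at different times are to be compared on identical data.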

4.2 Bias and Variance

Bias is the difference between the expected value of an estimate and the actual value of a variable. This is an important measure for machine learning, which is concerned with predicting dependent or target outcomes from independent variables or predictors. Thus if “y” represents the actual value and “E(y’)” the expected value of the estimate y’, then

$$\mathrm{Bias}(y') = E(y') - y$$

The variance of an estimator y’ is the expected value of the square of its difference from its mean E(y’). Thus

$$\mathrm{Var}(y') = E\left[\left(y' - E(y')\right)^2\right]$$

Ideally one would seek a model with zero bias and zero variance, but this is hardly ever achievable. A model may be trained to have a low bias on the training data but perform poorly on the test data, resulting in high variance. In such a situation, the estimator is considered to be over-fitting the training data, including the noise in it. On the other hand, a model with a high bias may be simpler and under-fit the training data, but may have a relatively lower variance on test data.

Hence there always has to be a bias-variance trade-off: the bias on the training data should not be pushed so low that it results in high variance on the test data. This calls for a fairly complex model.
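The two quantities can be estimated empirically when several estimates of the same value are available, for example from models trained on different samples. The following is a minimal Python sketch; the four estimates and the actual value are made-up numbers, not results from the study.

```python
from statistics import mean

def bias_and_variance(estimates, actual):
    """Empirical bias and variance of a set of estimates of one actual value."""
    expected = mean(estimates)                       # stands in for E(y')
    bias = expected - actual                         # E(y') - y
    variance = mean((e - expected) ** 2 for e in estimates)
    return bias, variance

# Hypothetical estimates of a single sample whose actual strength is 41.0.
estimates = [38.0, 42.0, 41.0, 39.0]
bias, variance = bias_and_variance(estimates, actual=41.0)
print(bias, variance)
```

Here the estimates average 40.0, so the bias is -1.0, while their spread around that average gives a variance of 2.5.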

4.3 Model Selection Process

Given these requirements of relatively low bias and not too high variance on test data, the next step is to evaluate models and compare their performance. Multilayer Perceptrons perform well in complex situations, so it was decided to try out deep learning models with various combinations of hidden layers and compare their performance. Ensemble learners are found to improve the performance of the base models and are able to meet the bias-variance requirements. Other algorithms were also to be evaluated and their performance compared. The following is the summary of the Model Selection Process:

4.3.1 Train Multilayer Perceptrons with different configurations of hidden layers

4.3.2 Combine the best MLP performers into an Ensemble

4.3.3 Train other known algorithms from Weka and evaluate their performance

4.3.4 Combine the best performers from all the models including Multilayer Perceptrons into an ensemble

4.3.5 Pass the ensemble as another attribute to a chain of ensembles and test performance

4.4 Development Process

The entire development process was carried out in Java using the Weka libraries, as there was a need to train and store the models and develop comparison reports.

4.5 Cleaning up Training Data

The training data was cleaned up to eliminate noise and other redundant information in order to optimize the training time. After this process the number of tuples was reduced from 515 to 481.

5. Overview of applied Machine Learning Algorithms

As this paper is mainly concerned with a Regression problem, the scope is limited to the applicable algorithms only. These algorithms were evaluated using Weka Data Mining tools.

Before going into the algorithms, it is important to establish some broad concepts.

5.1 Parametric and Non Parametric Models

A machine learning model may be parametric or non parametric.

A parametric model summarizes the data using a fixed set of parameters which will not change with the number of instances of data. For example, a linear Regression Model tries to identify a relationship y as a linear combination of input parameters say x1 and x2 as follows

y = k1x1 + k2x2, where k1 and k2 are parameters

These models are simpler to develop, learn quite quickly and require relatively little data. However, they are constrained by the function chosen to fit the data, so they may not be suitable for complex data patterns and are likely to fit poorly in such situations.
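To make the idea of a fixed parameter set concrete, the sketch below fits the two parameters k1 and k2 of the model above by ordinary least squares, solving the 2x2 normal equations directly. It is illustrative Python with made-up samples; it is not how Weka's Linear Regression is implemented.

```python
def fit_two_parameter_model(samples):
    """Least-squares fit of y = k1*x1 + k2*x2 from (x1, x2, y) samples."""
    s11 = sum(x1 * x1 for x1, _, _ in samples)
    s12 = sum(x1 * x2 for x1, x2, _ in samples)
    s22 = sum(x2 * x2 for _, x2, _ in samples)
    s1y = sum(x1 * y for x1, _, y in samples)
    s2y = sum(x2 * y for _, x2, y in samples)
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("inputs are collinear; k1 and k2 are not unique")
    # Cramer's rule on the normal equations.
    k1 = (s1y * s22 - s2y * s12) / det
    k2 = (s2y * s11 - s1y * s12) / det
    return k1, k2

# Made-up samples generated from y = 2*x1 + 3*x2.
samples = [(1, 0, 2), (0, 1, 3), (1, 1, 5), (2, 1, 7)]
k1, k2 = fit_two_parameter_model(samples)
print(k1, k2)
```

However many samples are supplied, the fitted model is still described by just k1 and k2, which is what makes it parametric.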

A non parametric model, on the other hand, makes no assumptions about the mapping function and tries to identify it from the training samples. Models based on K Nearest Neighbor algorithms or Support Vector Machine algorithms belong to this category.

5.2 Eager Learners and Lazy Learners

Eager Learners are those classifiers that try to generalize the target mapping functions before being available for use. For example, Artificial Neural Networks are required to be trained before they can be used for querying.

Lazy Learners, on the other hand, do not need to be trained. They only store the data and wait until a query is made. For example, K Nearest Neighbor looks for the closest matching tuples in the training set to map a query to an output.
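The lazy behaviour can be seen in a minimal K Nearest Neighbor regressor, sketched here in Python. The class name, the use of Euclidean distance and the sample data are assumptions made for illustration; Weka's IBk offers more options than this.

```python
import math

class KNearestNeighbourRegressor:
    """Lazy learner: fit() only stores data, all work happens in predict()."""

    def __init__(self, k=3):
        self.k = k
        self.rows = []

    def fit(self, inputs, targets):
        self.rows = list(zip(inputs, targets))  # no generalisation step
        return self

    def predict(self, query):
        # Rank the stored tuples by distance to the query, keep the k closest.
        nearest = sorted(self.rows, key=lambda row: math.dist(row[0], query))[:self.k]
        return sum(target for _, target in nearest) / len(nearest)

# Made-up two-attribute data; the last point is far from the query.
inputs = [(0, 0), (1, 0), (0, 1), (10, 10)]
targets = [10.0, 20.0, 30.0, 100.0]
model = KNearestNeighbourRegressor(k=3).fit(inputs, targets)
prediction = model.predict((0.2, 0.2))
print(prediction)
```

An eager learner such as a neural network would do its expensive work in the fitting step instead and discard the training tuples afterwards.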

5.3 Rule Based Systems

These learning systems identify and evolve rules from the training data and apply them to evaluate test data. For the Concrete Compressive Strength Model, two rule based systems were evaluated: Decision Table and M5 Rules.

5.4 Decision Tree Learning

These are non parametric, eager learning systems which create tree-like graphs or models of decisions and target values from the training data. In this paper, the Decision Stump, M5P, Random Tree and REPTree algorithms were evaluated.

5.5 Ensemble Learning (Metaheuristics)

This is the main focus of this paper. Multiple models are combined with the objective of improving the overall performance. These are also known as metaheuristic algorithms. In this paper the following Ensemble algorithms were evaluated:

Bagging

Additive Regression

Random Committee

Random Subspace

Regression by Discretization

Random Forests

The final chosen solution involved 2 parts:

a) An Ensemble Learning model was created taking the mean of the best performing algorithms: Multilayer Perceptrons with 4 different configurations of hidden layer, Random Committee, Random Forest and Bagging

b) This output was added to the 8 inputs and multiple Ensemble algorithms were experimented with. The best solution found was a complex Ensemble Chain consisting of Additive Regression that used Bagging as the base classifier. Bagging in turn used Random Subspace as the base classifier, which in turn used REP Tree.

5.6 Algorithms evaluated for building the model

Table 1 lists the algorithms evaluated for building the model using Weka machine learning tool. References are provided for further information for some complex algorithms.

As stated in 4.3 the model building was a 5 step process. Each of them is detailed in this section

The algorithms listed in section 5 were applied using Weka Machine learning tools. All of them were invoked using default parameters except Multilayer Perceptron.

6.1 Multilayer Perceptron

Weka's default training time for the Multilayer Perceptron is 500 epochs. Experiments showed that this is not sufficient for building complex models. It usually takes as many as 25000000 epochs for the “error per epoch” of these models to converge to a minimum.

The learning rate and momentum parameters each had to be set to 0.01.
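For readers unfamiliar with these two parameters, the sketch below shows the standard gradient descent update with momentum on a one-dimensional toy error surface, using 0.01 for both values. It is an illustrative Python example, not Weka's training code; the function name, the toy error function and the step count are assumptions.

```python
def minimise_with_momentum(gradient, start, learning_rate=0.01,
                           momentum=0.01, steps=2000):
    """Gradient descent in which each step reuses a fraction of the previous step."""
    weight, previous_step = start, 0.0
    for _ in range(steps):
        step = -learning_rate * gradient(weight) + momentum * previous_step
        weight += step
        previous_step = step
    return weight

# Toy error surface (w - 3)^2 with gradient 2 * (w - 3); its minimum is at w = 3.
final_weight = minimise_with_momentum(lambda w: 2.0 * (w - 3.0), start=0.0)
print(final_weight)
```

A small learning rate makes each step cautious, which is one reason so many epochs are needed before the error settles.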

The following models were built. All of them used 8 inputs for predictors and one output for Concrete Compressive Strength. Other than 6.3.1 the others are all Deep Learning Models i.e. have more than one hidden layer

6.2 Ensemble of Multilayer Perceptrons

The MLP models with a Mean Absolute Error of less than 5 were combined into a single Ensemble by taking the mean of their predictions. This was done in a Java program, because the models had to be custom trained and tailored and the Weka GUI based tools could not be used.
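The selection-and-averaging step can be sketched as follows. This is illustrative Python rather than the authors' Java program; the model names and all numbers are hypothetical, and the data on which the threshold was checked in the study is not stated, so the example simply uses one set of actual values.

```python
def mean_absolute_error(predictions, actuals):
    return sum(abs(p - a) for p, a in zip(predictions, actuals)) / len(actuals)

def mean_ensemble(model_predictions, actuals, max_mae=5.0):
    """Average the predictions of every model whose MAE is below max_mae."""
    selected = {name: preds for name, preds in model_predictions.items()
                if mean_absolute_error(preds, actuals) < max_mae}
    if not selected:
        raise ValueError("no model met the MAE threshold")
    # zip(*...) walks the selected models sample by sample.
    combined = [sum(column) / len(column) for column in zip(*selected.values())]
    return sorted(selected), combined

# Hypothetical predictions from three models on three samples.
actuals = [30.0, 40.0, 50.0]
model_predictions = {
    "model_a": [32.0, 38.0, 53.0],
    "model_b": [28.0, 44.0, 49.0],
    "model_c": [10.0, 70.0, 50.0],   # MAE well above 5, so it is left out
}
names, combined = mean_ensemble(model_predictions, actuals)
print(names, combined, mean_absolute_error(combined, actuals))
```

In this toy case the errors of the two retained models partly cancel, so the averaged prediction has a lower Mean Absolute Error than either model alone.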

6.3 Performance of the MLP models

The column chart in Figure 4 sums up the performance of the models:

Figure 4: Performance of MLPs

The numbers following the “MLP” represent the number of hidden layers and neurons in each layer. For example MLP 8-8-8-8 has 4 hidden layers with 8 neurons in each layer. Similarly MLP 8-12-8 has 3 hidden layers with 8, 12 and 8 neurons respectively.
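The naming convention can be decoded mechanically. The Python sketch below parses a label into its hidden layer sizes and, assuming a fully connected network with 8 inputs, 1 output and one bias per neuron, counts the trainable parameters; both function names and the bias assumption are illustrative.

```python
def hidden_layer_sizes(model_name):
    """Turn a label such as 'MLP 8-12-8' into [8, 12, 8]."""
    return [int(size) for size in model_name.replace("MLP", "").strip().split("-")]

def count_parameters(hidden, n_inputs=8, n_outputs=1):
    """Weights plus one bias per neuron for a fully connected network."""
    sizes = [n_inputs] + hidden + [n_outputs]
    weights = sum(a * b for a, b in zip(sizes, sizes[1:]))
    biases = sum(sizes[1:])
    return weights + biases

for name in ("MLP 8-12-8", "MLP 8-8-8-8"):
    layers = hidden_layer_sizes(name)
    print(name, len(layers), "hidden layers,", count_parameters(layers), "parameters")
```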

6.3.3 The best training performance, with the lowest Mean Absolute Error, was achieved with 4 hidden layers, but as can be seen the error on the test data was much higher (5.50). This was a case of overfitting.

6.4 Multiple Regression Model

Next, all the applied algorithms listed in Section 5.6 were tested and their performance was compared using a program written in Java. An ensemble was created combining the best performing models, those with a Mean Absolute Error of less than 5. The following models were built besides the Multilayer Perceptrons:

6.4.1 Gaussian Processes

6.4.2 Linear Regression

6.4.3 Simple Linear Regression

6.4.4 SMOReg

6.4.5 IBK (K Nearest Neighbour)

6.4.6 K Star

6.4.7 LWL(Locally Weighted Learning)

6.4.8 Decision Tables

6.4.9 M5 Rules

6.4.10 Decision Stump

6.4.11 M5P

6.4.12 Random Trees

6.4.13 REP(Reduced Error Pruning) Trees

6.4.14 Bagging

6.4.15 Additive Regression

6.4.16 Random Committee

6.4.17 Randomizable Filtered Classifier

6.4.18 Random Subspace

6.4.19 Random Forests

6.4.20 Regression by Discretization

6.5 Ensemble Chain

The mean of the best performing models (those with a Mean Absolute Error of less than 5) was fed as an additional input in the training sample, and several complex Ensemble algorithms were tried on the extended data. The ensemble that performed best was Additive Regression, which is a Stochastic Gradient Booster. Bagging was chosen as its base algorithm. Bagging in turn used the Random Subspace ensemble, which used REP Tree. Performance improved significantly with this Ensemble Chain.
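Feeding the ensemble mean back in as an extra attribute amounts to widening each row from 8 inputs to 9. The Python sketch below shows only that data preparation step; the function name and the numbers are hypothetical, and the Additive Regression, Bagging, Random Subspace and REP Tree chain itself was built with Weka and is not reproduced here.

```python
def append_ensemble_mean(rows, base_predictions):
    """Return copies of the rows with the mean base-model prediction as a last column."""
    augmented = []
    for row, predictions in zip(rows, base_predictions):
        augmented.append(list(row) + [sum(predictions) / len(predictions)])
    return augmented

# One hypothetical sample: 8 input attributes and three base-model predictions.
rows = [[300.0, 100.0, 0.0, 180.0, 5.0, 1000.0, 750.0, 28.0]]
base_predictions = [[39.0, 41.0, 43.0]]
augmented = append_ensemble_mean(rows, base_predictions)
print(len(augmented[0]), augmented[0][-1])
```

The downstream ensemble then learns from the original attributes together with the first-stage estimate, so it can correct the cases where the averaged prediction is systematically off.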

6.6 Performance

Figure 6 shows the performance of the best performing models, whose mean was calculated and passed as an input to form the Ensemble Chain. The Mean Absolute Error was reduced significantly, from 2.82 to 2.49.

6.6.2 Of these, Bagging, Random Committee and Random Forest are ensemble algorithms.

6.6.4 All the models other than MLPs were tested with default Weka configurations. Bagging uses REP Tree as its base classifier.

6.6.5 Random Committee and Random Forest use Random Tree as their base classifier.

6.6.6 Random Committee and Random Forest performed well, with Mean Absolute Errors of 3.33 and 3.57 respectively. This should be compared with the ensemble of MLPs (see Section 6.2), which achieved a Mean Absolute Error of 2.90.

6.6.7 With the ensemble of the best performers across all classifiers, including 4 Multilayer Perceptrons, it was possible to further reduce the Mean Absolute Error to 2.82.

6.6.8 The ensemble chain described in Section 6.5, in which the mean of the selected models was passed as an additional input in the training sample, reduced the Mean Absolute Error to 2.49.

6.6.9 Figure 7 shows the plot of 20 randomly selected points for all classifiers other than Multilayer Perceptrons. The Mean of selected Models includes the best performers from Multilayer Perceptrons

Figure 7 Comparison of MultiRegression Models for 20 random data points. The Mean of the selected models includes best MLP performers as well

7. Finding Optimum Parameters using Genetic Algorithm

Having developed a model, our next step is to find the set of input parameters that maximises Concrete Compressive Strength. The entire data set, which includes both the training and test data, is passed to the GA optimiser solution developed by Turing Point.

The GA selects candidates from the entire data set for “mating” and generating “children” data points. Candidates are chosen when their fitness exceeds a threshold: a data point qualifies if its Concrete Compressive Strength is at most 5 megapascals below the maximum Concrete Compressive Strength (82.6 megapascals).

The exit criterion for the algorithm is set at 10 generations of no improvement in fitness level.
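The selection threshold and the exit criterion can be illustrated with a small genetic search. The Python sketch below is not Turing Point's optimiser, whose internals are not described here: the uniform crossover, the 5% mutation nudge, the elitist survivor selection, the toy fitness function and the data are all assumptions. In the study the fitness would come from the trained Ensemble Chain.

```python
import random

def run_ga(data, fitness, margin=5.0, patience=10, seed=0, mutation_rate=0.1):
    """data: list of (inputs, strength); fitness: callable on a list of inputs."""
    rng = random.Random(seed)
    best_seen = max(strength for _, strength in data)
    # Parents: points whose strength is within `margin` of the best observed.
    population = [list(x) for x, s in data if s >= best_seen - margin]
    if len(population) < 2:
        raise ValueError("need at least two parents")
    best = max(population, key=fitness)
    stale = 0
    while stale < patience:  # stop after `patience` generations without improvement
        children = []
        for _ in range(len(population)):
            mum, dad = rng.sample(population, 2)
            # Uniform crossover: each attribute comes from one of the parents.
            child = [rng.choice(pair) for pair in zip(mum, dad)]
            # Mutation: occasionally nudge an attribute by up to 5%.
            child = [v * rng.uniform(0.95, 1.05) if rng.random() < mutation_rate else v
                     for v in child]
            children.append(child)
        # Keep the fittest individuals among parents and children.
        ranked = sorted(population + children, key=fitness, reverse=True)
        population = ranked[:len(population)]
        if fitness(population[0]) > fitness(best):
            best, stale = population[0], 0
        else:
            stale += 1
    return best, fitness(best)

# Hypothetical two-attribute data and a toy fitness that peaks at (1.0, 1.0).
data = [([0.8, 1.3], 80.0), ([1.2, 0.9], 78.5), ([1.1, 1.4], 82.6), ([0.2, 0.1], 40.0)]
toy_fitness = lambda x: 82.6 - sum((v - 1.0) ** 2 for v in x)
best, best_fitness = run_ga(data, toy_fitness)
print(best, best_fitness)
```

Because the fittest individual is always carried forward, the returned fitness can never be worse than that of the best starting candidate; whether it is better depends on the model and the random seed.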

Figure 8 shows the 10 highest values of the Concrete Compressive Strength found in the supplied data set. The data points associated with these Concrete Compressive Strength values form the Target Candidates for generating the high performing solutions.

Figure 8: 10 Best Performing Data Points from the Data sets

Figure 9 shows the generated top performers from the starting candidates. The data was generated with 5 different random seeds. The highest Predicted value (81) however is less than the highest in the initial data set. Hence a better solution could not be found.

8.1.2 The 4 hidden layer model was overfitting the training data and did not perform very well with the test data.

8.1.3 The main drawback of these models, however, is the time it takes to train them.

8.2 Ensemble of Multilayer Perceptrons

8.2.1 Even better results were obtained from the ensemble of best performing Multilayer Perceptrons. The Mean Absolute Error with Test data was 2.90.

8.3 Other Regression Models

8.3.1 Random Tree was the best performing Regression model, with a Mean Absolute Error of 4.99.

8.4 Ensemble of Other Regression Models

8.4.1 Bagging, Random Forest and Random Committee were the best performing models, with Mean Absolute Errors of 4.81, 3.57 and 3.33 respectively.

8.4.2 It was possible to further reduce the Mean Absolute Error with Test data to 2.82 using an ensemble of the best performing Multilayer Perceptrons, Random Tree and the other ensemble models: Bagging, Random Committee and Random Forest.

8.4.3 The only criterion for selecting the models was the best fit; other factors such as speed of processing and resource utilisation were not considered.

8.5 Ensemble Chain

8.5.1 The mean of the Other Regression Models was passed as an additional input to the training data set and several algorithms were experimented with.

8.5.2 The best performing algorithm was Additive Regression, which uses a Stochastic Gradient Booster. Bagging was used as the base algorithm. Bagging in turn used Random Subspace as the base algorithm, which is an ensemble of Reduced Error Pruning Trees. With this set up it was possible to further reduce the Mean Absolute Error to 2.49 on the test data set. This is a further improvement of 12%.

8.5.3 Again, the only criterion was the best fit; other factors such as speed of processing and resource utilisation were not considered.
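As a quick check, the 12% improvement quoted in 8.5.2 follows from the two Mean Absolute Errors reported for the test data, 2.82 before the Ensemble Chain and 2.49 after it:

```latex
\[
\frac{2.82 - 2.49}{2.82} = \frac{0.33}{2.82} \approx 0.117 \approx 12\%
\]
```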

8.6 Optimal Solution using Genetic Algorithm

8.6.1 Genetic search was applied using the top performers of the entire data set and using the Ensemble chain as a model to search for optimal values of the attributes that would result in high Concrete Compressive Strength values.

8.6.2 The highest value of the Concrete Compressive Strength was 82.6 megapascals, found in the initial data set.