How to Grid Search Hyperparameters for Deep Learning Models in Python With Keras
Photo by 3V Photo, some rights reserved.

Overview

In this post, I want to show you both how you can use the scikit-learn grid search capability and give you a suite of examples that you can copy-and-paste into your own project as a starting point.

Below is a list of the topics we are going to cover:

How to use Keras models in scikit-learn.

How to use grid search in scikit-learn.

How to tune batch size and training epochs.

How to tune optimization algorithms.

How to tune learning rate and momentum.

How to tune network weight initialization.

How to tune activation functions.

How to tune dropout regularization.

How to tune the number of neurons in the hidden layer.

How to Use Keras Models in scikit-learn

Keras models can be used in scikit-learn by wrapping them with the KerasClassifier or KerasRegressor class.

To use these wrappers you must define a function that creates and returns your Keras sequential model, then pass this function to the build_fn argument when constructing the KerasClassifier class.

For example:

from keras.wrappers.scikit_learn import KerasClassifier

def create_model():
    ...
    return model

model = KerasClassifier(build_fn=create_model)

The constructor for the KerasClassifier class can take default arguments that are passed on to the calls to model.fit(), such as the number of epochs and the batch size.

For example:

def create_model():
    ...
    return model

model = KerasClassifier(build_fn=create_model, epochs=10)

The constructor for the KerasClassifier class can also take new arguments that can be passed to your custom create_model() function. These new arguments must also be defined in the signature of your create_model() function with default parameters.
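The routing rule described above can be illustrated without Keras at all. The sketch below uses a hypothetical stand-in class, TinyWrapper (it is not part of Keras or scikit-learn), that forwards only those constructor arguments that appear in the signature of create_model(). The argument names dropout_rate and neurons are made up for the illustration, and the real wrapper may differ in its details.

```python
import inspect


def create_model(dropout_rate=0.0, neurons=1):
    # Stand-in builder: returns a plain description instead of a Keras model.
    return {"dropout_rate": dropout_rate, "neurons": neurons}


class TinyWrapper:
    """Hypothetical, simplified stand-in for a scikit-learn style wrapper."""

    def __init__(self, build_fn, **params):
        self.build_fn = build_fn
        self.params = params

    def build(self):
        # Forward only the arguments named in the builder's signature.
        accepted = inspect.signature(self.build_fn).parameters
        build_args = {k: v for k, v in self.params.items() if k in accepted}
        return self.build_fn(**build_args)


wrapper = TinyWrapper(build_fn=create_model, dropout_rate=0.2, epochs=10)
built = wrapper.build()
print(built)
```

Here epochs is not in the signature of create_model(), so the stand-in does not forward it to the builder, while neurons falls back to its default because no value was supplied. This is why new arguments need defaults in the function signature.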

How to Use Grid Search in scikit-learn

When constructing the GridSearchCV class, you must provide a dictionary of hyperparameters to evaluate in the param_grid argument. This maps each model parameter name to an array of values to try.
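To get a feel for what such a dictionary describes, the following sketch expands a grid into its individual combinations using only the standard library. The parameter values are arbitrary picks for illustration, and this is only an approximation of the enumeration a grid search performs, not scikit-learn's own code.

```python
from itertools import product

# Illustrative values only; pick ranges that suit your own problem.
param_grid = dict(batch_size=[10, 20, 40], epochs=[10, 50, 100])

# Every combination of one value per parameter is a candidate configuration.
names = sorted(param_grid)
combinations = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

for combo in combinations:
    print(combo)
print(len(combinations), "combinations")
```

The number of combinations is the product of the list lengths, so each extra parameter multiplies the cost of the search.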

By default, accuracy is the score that is optimized, but other scores can be specified in the scoring argument of the GridSearchCV constructor.

By default, the grid search will only use one thread. By setting the n_jobs argument in the GridSearchCV constructor to -1, the process will use all cores on your machine. Depending on your Keras backend, this may interfere with the main neural network training process.

The GridSearchCV process will then construct and evaluate one model for each combination of parameters. Cross validation is used to evaluate each individual model. The default of 3-fold cross validation is used here (scikit-learn 0.22 and later default to 5-fold), although this can be overridden by specifying the cv argument to the GridSearchCV constructor.
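The cost of this procedure is easy to underestimate. As a rough sketch, the helper below (kfold_indices is a hypothetical name, and the splitting is plain and unshuffled rather than the stratified splitting scikit-learn may apply to classifiers) shows how k folds partition the data and how many model fits a small grid implies. The sample count and grid size are assumed for illustration.

```python
def kfold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous, unshuffled folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


n_samples, k, n_combinations = 10, 3, 9  # toy numbers for illustration
folds = list(kfold_indices(n_samples, k))
for train, test in folds:
    print("train:", train, "test:", test)

# One fit per combination per fold.
total_fits = n_combinations * k
print("model fits required:", total_fits)
```

With the default refit=True, GridSearchCV also trains one further model on all of the data using the best combination found.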

Below is an example of defining a simple grid search:

from sklearn.model_selection import GridSearchCV

param_grid = dict(epochs=[10, 20, 30])
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, Y)

Once completed, you can access the outcome of the grid search in the result object returned from grid.fit(). The best_score_ member provides access to the best score observed during the optimization procedure and the best_params_ describes the combination of parameters that achieved the best results.
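The sketch below shows one way to read these members, along with the per-combination scores in cv_results_. To keep it quick and free of Keras, it assumes a LogisticRegression estimator and synthetic data from make_classification in place of a wrapped network and the real dataset. The same attributes should be available when the estimator is a wrapped Keras model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; not the dataset used in this post.
X, Y = make_classification(n_samples=200, n_features=8, random_state=7)

# A cheap stand-in estimator with a single hyperparameter to search.
param_grid = dict(C=[0.01, 0.1, 1.0])
grid = GridSearchCV(estimator=LogisticRegression(max_iter=1000),
                    param_grid=param_grid, cv=3)
grid_result = grid.fit(X, Y)

# Summarize the best result, then every row of the grid.
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_["mean_test_score"]
stds = grid_result.cv_results_["std_test_score"]
params = grid_result.cv_results_["params"]
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
```

Printing the mean and standard deviation for every combination makes it easier to judge whether the best score is meaningfully better than its neighbors.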

Download the dataset and place it in your current working directory with the name pima-indians-diabetes.csv.

As we proceed through the examples in this post, we will aggregate the best parameters. This is not the best way to grid search because parameters can interact, but it is good for demonstration purposes.


How to Tune Batch Size and Number of Epochs

In this first simple example, we look at tuning the batch size and number of epochs used when fitting the network.

The batch size in iterative gradient descent is the number of patterns shown to the network before the weights are updated. It is also an optimization in the training of the network, defining how many patterns to read at a time and keep in memory.

The number of epochs is the number of times that the entire training dataset is shown to the network during training. Some networks are sensitive to the batch size, such as LSTM recurrent neural networks and Convolutional Neural Networks.
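The two settings together determine how many weight updates the network receives. The arithmetic can be sketched as follows, assuming a training set of 768 rows and 10 epochs. Both numbers, and the batch sizes, are chosen only for illustration.

```python
import math

n_samples = 768  # assumed number of training rows, for illustration only
epochs = 10      # assumed number of epochs

# The last batch in an epoch may be smaller, hence the ceiling.
updates = {
    batch_size: math.ceil(n_samples / batch_size)
    for batch_size in [10, 50, 100]
}

for batch_size, per_epoch in updates.items():
    print("batch_size=%d: %d updates per epoch, %d in total"
          % (batch_size, per_epoch, per_epoch * epochs))
```

Smaller batches mean more (and noisier) updates per epoch, which is one reason batch size and the number of epochs are worth tuning together.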

Here we will evaluate a suite of different mini batch sizes from 10 to 100 in steps of 20.

How to Tune the Training Optimization Algorithm

The results suggest that the Adam optimization algorithm is the best, with a score of about 70% accuracy.

How to Tune Learning Rate and Momentum

It is common to pre-select an optimization algorithm to train your network and tune its parameters.

By far the most common optimization algorithm is plain old Stochastic Gradient Descent (SGD) because it is so well understood. In this example, we will look at optimizing the SGD learning rate and momentum parameters.

Learning rate controls how much to update the weight at the end of each batch and the momentum controls how much to let the previous update influence the current weight update.
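To make the roles of the two parameters concrete, here is a minimal sketch of gradient descent with momentum on a one-dimensional toy function, f(w) = (w - 3)^2. It assumes the classic formulation in which a velocity term accumulates past updates. The function, starting point and step count are made up for the illustration and have nothing to do with the network in this post.

```python
def sgd_momentum(grad_fn, w0, learn_rate, momentum, steps):
    """Run gradient descent with a momentum (velocity) term."""
    w, velocity = w0, 0.0
    for _ in range(steps):
        # The previous update carries over, scaled by the momentum.
        velocity = momentum * velocity - learn_rate * grad_fn(w)
        w += velocity
    return w


def grad(w):
    # Gradient of the toy loss f(w) = (w - 3)^2, minimized at w = 3.
    return 2.0 * (w - 3.0)


results = {}
for momentum in [0.0, 0.5, 0.9]:
    results[momentum] = sgd_momentum(grad, w0=0.0, learn_rate=0.01,
                                     momentum=momentum, steps=100)
    print("momentum=%.1f -> w=%.4f" % (momentum, results[momentum]))
```

On this toy problem a higher momentum gets closer to the minimum within the same number of steps, but that should not be read as a general rule. On real networks the best combination has to be found empirically, which is what the grid search is for.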

We will try a suite of small standard learning rates and momentum values from 0.2 to 0.8 in steps of 0.2, as well as 0.9 (because it is a popular value in practice).

Generally, it is a good idea to also include the number of epochs in an optimization like this as there is a dependency between the amount of learning per batch (learning rate), the number of updates per epoch (batch size) and the number of epochs.

How to Tune Network Weight Initialization

In this example, we will look at tuning the selection of network weight initialization by evaluating all of the available techniques.

We will use the same weight initialization method on each layer. Ideally, it may be better to use different weight initialization schemes according to the activation function used on each layer. In the example below we use rectifier for the hidden layer. We use sigmoid for the output layer because the predictions are binary.
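The schemes differ mainly in how widely the initial weights are spread. The sketch below compares the sampling range of three uniform schemes for a single layer, assuming the usual definitions: a fixed range of plus or minus 0.05 for plain uniform, sqrt(6 / (fan_in + fan_out)) for Glorot uniform and sqrt(6 / fan_in) for He uniform. The layer sizes and the helper name uniform_limit are hypothetical.

```python
import math
import random


def uniform_limit(scheme, fan_in, fan_out):
    """Return the half-width of the sampling range for a uniform scheme."""
    if scheme == "uniform":
        return 0.05  # assumed fixed default range
    if scheme == "glorot_uniform":
        return math.sqrt(6.0 / (fan_in + fan_out))
    if scheme == "he_uniform":
        return math.sqrt(6.0 / fan_in)
    raise ValueError("unknown scheme: %s" % scheme)


rng = random.Random(7)
fan_in, fan_out = 8, 12  # assumed layer shape, for illustration only

limits, samples = {}, {}
for scheme in ["uniform", "glorot_uniform", "he_uniform"]:
    limit = uniform_limit(scheme, fan_in, fan_out)
    limits[scheme] = limit
    # Draw one weight matrix of shape (fan_in, fan_out).
    samples[scheme] = [[rng.uniform(-limit, limit) for _ in range(fan_out)]
                       for _ in range(fan_in)]
    print("%s: weights drawn from [-%.4f, %.4f]" % (scheme, limit, limit))
```

Because the Glorot and He ranges depend on the layer shape, the same scheme name can produce quite different weight scales on different layers.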

We can see that the best results were achieved with a uniform weight initialization scheme achieving a performance of about 72%.

How to Tune the Neuron Activation Function

The activation function controls the non-linearity of individual neurons and when to fire.

Generally, the rectifier activation function is the most popular. The sigmoid and tanh functions used to be the standard choices, and they may still be more suitable for some problems.

In this example, we will evaluate the suite of different activation functions available in Keras. We will only use these functions in the hidden layer, as we require a sigmoid activation function in the output for the binary classification problem.
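For reference, a few of the common activation functions can be written in a handful of lines of plain Python. This is only a sketch of their standard mathematical definitions applied to single numbers, not the Keras implementations, and the sample inputs are arbitrary.

```python
import math


def relu(x):
    # Rectifier: passes positive inputs through, clamps the rest to zero.
    return max(0.0, x)


def sigmoid(x):
    # Squashes any input into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))


def linear(x):
    # Identity: no non-linearity at all.
    return x


activations = {"relu": relu, "tanh": math.tanh,
               "sigmoid": sigmoid, "linear": linear}

for name, fn in activations.items():
    print(name, [round(fn(x), 3) for x in (-2.0, 0.0, 2.0)])
```

The sigmoid output lies between 0 and 1, which is why it suits the output layer of a binary classification problem.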

Generally, it is a good idea to rescale the input data to the range of the activation (transfer) function in use, which we will not do in this case.

How to Tune Dropout Regularization

This involves fitting both the dropout percentage and the weight constraint. We will try dropout percentages between 0.0 and 0.9 (1.0 does not make sense) and maxnorm weight constraint values between 0 and 5.
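The two quantities being tuned can be sketched in plain Python. The dropout function below assumes the common "inverted" variant, which rescales the surviving values, and max_norm rescales a weight vector whose length exceeds the limit. Both are simplified illustrations with hypothetical names, not the Keras implementations.

```python
import math
import random


def dropout(activations, rate, rng):
    """Zero each value with probability `rate` and rescale the survivors."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1); 1.0 would drop everything")
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


def max_norm(weights, max_value):
    """Scale the weight vector down if its L2 norm exceeds max_value."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= max_value:
        return list(weights)
    return [w * max_value / norm for w in weights]


rng = random.Random(7)
dropped = dropout([1.0] * 10, rate=0.5, rng=rng)
clipped = max_norm([3.0, 4.0], max_value=4.0)
print("after dropout:", dropped)
print("after max-norm:", clipped)
```

The check on rate mirrors the remark above that a dropout of 1.0 does not make sense, since every activation would be removed.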

We can see that the dropout rate of 0.2 (20%) and the maxnorm weight constraint of 4 resulted in the best accuracy of about 72%.

How to Tune the Number of Neurons in the Hidden Layer

The number of neurons in a layer is an important parameter to tune. Generally the number of neurons in a layer controls the representational capacity of the network, at least at that point in the topology.

Also, generally, a large enough single layer network can approximate any other neural network, at least in theory.

In this example, we will look at tuning the number of neurons in a single hidden layer. We will try values from 1 to 30 in steps of 5.

A larger network requires more training and at least the batch size and number of epochs should ideally be optimized with the number of neurons.
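One way to see why a larger network needs more training is to count its trainable parameters. The sketch below does this for a network with one fully connected hidden layer and a single output, assuming 8 input features. The input size, the candidate neuron counts and the helper name dense_param_count are chosen only for illustration.

```python
def dense_param_count(n_inputs, n_hidden, n_outputs=1):
    """Count weights and biases in a one-hidden-layer dense network."""
    hidden = (n_inputs + 1) * n_hidden    # weights plus one bias per neuron
    output = (n_hidden + 1) * n_outputs   # weights plus one bias per output
    return hidden + output


n_inputs = 8  # assumed number of input features
counts = {}
for neurons in [1, 5, 10, 20, 30]:
    counts[neurons] = dense_param_count(n_inputs, neurons)
    print("%2d neurons -> %3d trainable parameters" % (neurons, counts[neurons]))
```

The count grows linearly with the number of hidden neurons, so the largest candidate here has roughly thirty times as many parameters to fit as the smallest.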

We can see that the best results were achieved with a network with 5 neurons in the hidden layer with an accuracy of about 71%.

Tips for Hyperparameter Optimization

This section lists some handy tips to consider when tuning hyperparameters of your neural network.

k-fold Cross Validation. You can see that the results from the examples in this post show some variance. A default cross-validation of 3 was used, but perhaps k=5 or k=10 would be more stable. Carefully choose your cross validation configuration to ensure your results are stable.

Review the Whole Grid. Do not just focus on the best result, review the whole grid of results and look for trends to support configuration decisions.

Parallelize. Use all your cores if you can; neural networks are slow to train and we often want to try a lot of different parameters. Consider spinning up a lot of AWS instances.

Use a Sample of Your Dataset. Because networks are slow to train, try training them on a smaller sample of your training dataset, just to get an idea of general directions of parameters rather than optimal configurations.

Start with Coarse Grids. Start with coarse-grained grids and zoom into finer grained grids once you can narrow the scope.

Do not Transfer Results. Results are generally problem specific. Try to avoid favorite configurations on each new problem that you see. It is unlikely that optimal results you discover on one problem will transfer to your next project. Instead look for broader trends like number of layers or relationships between parameters.

Reproducibility is a Problem. Although we set the seed for the random number generator in NumPy, the results are not 100% reproducible. There is more to reproducibility when grid searching wrapped Keras models than is presented in this post.
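The coarse-to-fine tip can be sketched as a two-pass search. The example below uses a hypothetical toy_score function in place of a real cross-validated score, so the numbers mean nothing in themselves. Only the pattern of searching powers of ten first and then a finer grid around the coarse winner is the point.

```python
import math


def toy_score(learn_rate):
    """Hypothetical stand-in for a cross-validated score; peaks near 0.03."""
    return 1.0 - (math.log10(learn_rate) - math.log10(0.03)) ** 2


def best_of(candidates):
    # Pick the candidate with the highest score.
    return max(candidates, key=toy_score)


# Pass 1: coarse grid over powers of ten, from 1e-5 up to 1.
coarse = [10.0 ** exponent for exponent in range(-5, 1)]
best_coarse = best_of(coarse)

# Pass 2: finer grid in quarter-decade steps around the coarse winner.
fine = [best_coarse * 10.0 ** (step / 4.0) for step in range(-3, 4)]
best_fine = best_of(fine)

print("coarse winner: %g (score %.4f)" % (best_coarse, toy_score(best_coarse)))
print("fine winner:   %g (score %.4f)" % (best_fine, toy_score(best_fine)))
```

With a real model each call to the scoring function is a full cross-validated training run, so the second pass only pays off once the first has narrowed the range.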

Summary

In this post, you discovered how you can tune the hyperparameters of your deep learning networks in Python using Keras and scikit-learn.

Specifically, you learned:

How to wrap Keras models for use in scikit-learn and how to use grid search.

How to grid search a suite of different standard neural network parameters for Keras models.

How to design your own hyperparameter optimization experiments.

Do you have any experience tuning hyperparameters of large neural networks? Please share your stories below.

Do you have any questions about hyperparameter optimization of neural networks or about this post? Ask your questions in the comments and I will do my best to answer.

My question is related to this thread. How do I get the probabilities as the output? I don’t want the class output. I read that for a regression problem no activation function is needed in the output layer. Will a similar implementation give me the probabilities, or will the output fall outside the range 0 to 1?

Hi Jason, First of all great post! I applied this by dividing the data into train and test and used train dataset for grid fit. Plan was to capture best parameters in train and apply them on test to see accuracy. But it seems grid.fit and model.fit applied with same parameters on same dataset (in this case train) give different accuracy results. Any idea why this happens. I can share the code if it helps.

You will see small variation in the performance of a neural net with the same parameters from run to run. This is because of the stochastic nature of the technique and how very hard it is to fix the random number seed successfully in python/numpy/theano.

You will also see small variation due to the data used to train the method.

Generally, you could use all of your data to grid search to try to reduce the second type of variation (slower). You could store results and use statistical significance tests to compare populations of results to see if differences are significant, to sort out the first type of variation.

When I use the categorical_crossentropy loss function and run the grid search with n_jobs greater than 1, it throws the error “cannot pickle object class”, but the same thing works fine with the binary_crossentropy loss. Can you tell me if I am making any mistake in my code:
def create_model(optimizer='adam'):
    # create model
    model.add(Dense(30, input_dim=59, init='normal', activation='relu'))
    model.add(Dense(15, init='normal', activation='sigmoid'))
    model.add(Dense(3, init='normal', activation='sigmoid'))
    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

I came across and solved this problem several days ago. Please use “epochs” instead of “nb_epoch” in the param_grid dict. Personally, I guess “cannot pickle object class” means the neural network cannot be built because of some errors. Open to discussion.

excellent post, thanks. It’s been very helpful to get me started on hyperparameterisation.

One thing I haven’t been able to do yet is to grid search over parameters which are not proper to the NN but to the training set. For example, I can fine-tune the input_dim parameter by creating a function generator which takes care of creating the function that will create the model, like this:

This works, but only as a for loop over the different fp_subset, which I must define manually.
I could easily pick the best out of every run, but it would be great if I could fold them all inside a big grid definition and fit, so as to automatically pick the largest.

However, until now I haven’t been able to figure out a way to do that.
If the wrapper function is useful to anyone, I can post a generalised version here.

Thanks. I ended up coding my own for loop, saving the results of each grid in a dict, sorting the hash by the performance metrics, and picking the best model.

Now, the next question is: How do I save the model’s architecture and weights to a .json .hdf5 file? I know how to do that for a simple model. But how do I extract the best model out of the gridsearch results?

Hi Jason, I think this is the very best deep learning tutorial on the web. Thanks for your work. I have a question: how can heuristic algorithms be used to optimize hyperparameters for deep learning models in Python with Keras? I mean algorithms like genetic algorithms, particle swarm optimization, the cuckoo algorithm, etc. If the idea could be experimented with, could you give an example?

You could search the hyperparameter space using a stochastic optimization algorithm like a genetic algorithm and use the mean performance as the cost function or fitness function. I don’t have a worked example, but it would be relatively easy to set up.

Hi Jason, very helpful intro into gridsearch for Keras. I have used your guidance in my code, but rather than using the default ‘accuracy’ to be optimized, my model requires a specific evaluation function to be optimized. You hint at this possibility in the introduction, but there is no example of it. I have followed the SciKit-learn documentation, but I fail to come up with the correct syntax.

I have posted my question at StackOverflow, but since it is quite specific, it requires understanding of SciKit-learn in combination with Keras.

Perhaps you can have a look? I think it would nicely extend your tutorial.

I tried to combine this gridsearch/keras approach with a pipeline. It works if I tune nb_epoch or batch_size, but I get an error if I try to tune the optimizer or something else in the keras building function (I did not forget to include the variable as an argument):

I keep getting error messages and I tried a big for loops that scan for all possible combinations of layer numbers, neuron numbers, other optimization stuff within defined limits. It is very time consuming code, but I could not figure it out how to adjust layer structure and other optimization parameters in the same code using GridSearch. If you would provide a code for that in your blog one day, that would be much appreciated. Thanks.

Great tutorial! I’m running into a slight issue. I tried running this on my own variation of the code and got the following error:

TypeError: get_params() got an unexpected keyword argument ‘deep’

I copied and pasted your code using the given data set and got the same error. The code is showing an error on the grid_result = grid.fit(X, Y) line. I looked through the other comments and didn’t see anyone with the same issue. Do you know where this could be coming from?

The only differences are I am running Python 3.5 and Keras 1.2.1. The example I ran previously was the grid search for the number of neurons in a layer. But I just ran the first example and got the same error.

Do you think the issue is due to the next version of Python? If so, what should my next steps be?

Hi Jason,
thanks for this awesome tutorial !
I have two questions: 1. In “model.compile(loss=’binary_crossentropy’, optimizer=’adam’, metrics=[‘accuracy’])”, accuracy is used for evaluate results. But GridSearchCV also has scoring parameter, if I set “scoring=’f1’”,which one is used for evaluate the results of grid search? 2.How to set two evaluate parameters ,e.g. ‘accuracy’and ’f1’ evaluating the results of grid search？

I find no matter what evaluate parameters used in GridSearchCV “scoring”,”metrics” in “model.compile” must be [‘accuracy’],otherwise the program gives “ValueError: The model is not configured to compute accuracy.You should pass ‘metrics=[“accuracy”]’ to the ‘model.compile()’method. So, if I set:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring='recall')
the grid_result.best_score_ =0.72.My question is: 0.72 is accuracy or recall ? Thank you!

Great Blogpost. Love it. You are awesome Jason. I got one question to GridsearchCV. As far as i understand the crossvalidation already takes place in there. That’s why we do not need any kfold anymore.
But with this technique we would have no validation set correct? e.g. with a default value of 3 we would have 2 training sets and one test set.

That means in kfold as well as in GridsearchCV there is no requirement for creating a validation set anymore?

Yes, GridSearchCV performs cross validation and you must specify the number of folds. You can hold back a validation set to double check the parameters found by the search if you like. This is optional.

What I’m missing in the tutorial is the info, how to get the best params in the model with KERAS. Do I pickup the best parameters and call ‘create_model’ again with those parameters or can I call the GridSearchCV’s ‘predict’ function? (I will try out for myself but for completeness it would be good to have it in the tutorial as well.)

Just noticed, when you tune the optimisation algorithm SGD performs at 34% accuracy. As no parameters are being passed to the SGD function, I’d assume it takes the default configuration, lr=0.01, momentum=0.0.

Later on, as you look for better configurations for SGD, best result (68%) is found when {‘learn_rate’: 0.01, ‘momentum’: 0.0}.

It seems to me that these two experiments use exactly the same network configuration (including the same SGD parameters), yet their resulting accuracies differ significantly. Do you have any intuition as to why this may be happening?

Hello Jason!
I do the first step – try to tune Batch Size and Number of Epochs and get
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
Best: 0.707031 using {'epochs': 100, 'batch_size': 40}
After that I do the same and get
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
Best: 0.688802 using {'epochs': 100, 'batch_size': 20}
And so on
The problem is in the grid_result.best_score_

I expect that in the second step (for example, tuning the optimizer) I will get a better grid_result.best_score_ than in the first step (in the second step I use grid_result.best_params_ from the first step). But that is not the case.
Tuning all hyperparameters at once takes a very long time.

I have better scores with Adamax on validation data. I’m confused about how to proceed, should I choose Adamax and play with learning rates a little more, or go on with SGD and somehow try to improve performance?

Thanks for your response! I experimented with different learning rates and found out a reasonable one, (good for both Adamax and SGD) and now I try to fix learning rate and optimizer and focus on other hyperparameters such as batch-size and number of neurons. Or would be better if I set those first?

It seems to me that if you use the entire training set during your cross-validation, then your cross-validation error is going to give you an optimistically biased estimate of your validation error. I think this is because when you train the final model on the entire dataset, the validation set you create to estimate test performance comes out of the training set.

My question is: assuming we have a lot of data, should we use perhaps only 50% of the training data for cross-validation for the hyperparameters, and then use the remaining 50% for fitting the final model (and a portion of that remaining 50% would be used for the validation set)? That way we wouldn’t be using the same data twice. I am assuming in this case that we would also have a separate test set.

Thanks for your valuable post. I learned a lot from it.
When I wrote my code for grid search, I encountered a question:

I use fit_generator instead of fit in keras.
Is it possible to use grid search with fit_generator ?

I have some Merge layers in my deep learning model.
Hence, the input of the neural network is not a single matrix.
For example:
Suppose we have 1,000 samples
Input = [Input1,Input2]
Input1 is a 1,000 *3 matrix
Input2 is a 1,000*3*50*50 matrix (image)

When I use the fit in your post, there is a bug….because the input1 and input2 don’t have the same dimension. So I wonder whether the fit_generator can work with grid search ?

Hi Jason, thank you for your good tutorial on grid search with Keras. I followed your example with my own dataset and it ran. But when I use an autoencoder structure instead of the sequential structure to grid search the parameters with my own data, it does not run. I don’t know the reason. Could you help me? Are there any differences between grid searching a sequential structure and grid searching a model structure?

I’m a little bit confused about the definition of the “score” or “accuracy”. How are they made? I believe that they are not simply comparing the results with target, otherwise it will be the overfitting model being the best (like the more neurons the better).

But on the other hand, they are just using those combinations of parameters to train the model, so what is the difference between I manually set the parameters and see my result good or not, with risk of overfitting and the grid search that creates an accuracy score to determine which one is the best?

The idea is to find a config that does well on the train and validation sets. We require a robust test harness. With enough resources, I’d recommend repeated k-fold cross validation within the grid search.

One question about GridSearch in my case. I have tried to tune parameters of my neural network for regression with 18 inputs size 800 but the time to use GridSearch totally long, like forever even though I have limited to the number. I saw in your code:

I had the same issue as you (using Spyder and Python 3.6), but after changing the parameter to n_jobs = 1 it worked fine. With n_jobs = 2 it also got stuck, although Spyder showed it was running in the background (I checked the CPU usage and it was down to 1% vs the 55-80% when it is actually running).

Don’t ask the reason why is that. My guess would be that it has to do with your system and the fact that it might not support parallelization (no CUDA GPU).

Hi Jason,
I’m unable to apply the grid search to a seq to seq LSTM network (Keras Regressor model in the scikit API). When I set the GridSearchCV scoring algorithm to r^2 (or any scoring function for regression problems) the model.fit expect a 2 dim input vector, not the 3 dim used in Keras.
Otherwise, if I leave the default scoring algorithm named “_passthrough_scorer” (I don’t know what it does, I don’t even know what it is) it works, but the best_score doesn’t match the real best parametrization. I’m really confused… I’ll have to write the grid search manually…

Hey Jason.
I was using grid search to tune hyperparameters for a CNN-LSTM classification problem.
I used the code template on your blog about sequence classification.
MY original data has 38932 instances, but for tuning I am using only 1000 to save time.
But even then, I am not sure how to best search for those parameters and save time.

Is it a bad idea to search for hyper parameters in a small subset (almost 1/40th of training in my case).
Will the result vary largely when I use actual data size?
Also, I passed in several parameters for the grid search. Left it overnight and it still hadn’t made enough progress, so I stopped the execution.
How can I speed up this process?

Great !
I did read that one of the sanity checks is to check whether the model overfits on a small sample! If yes, then we are good to go…
I am slightly new to building proper models and find this part exciting but a little intimidating at the same time !
I am going to use only a few hyper parameters at a time, and keep the rest constant and check what happens !

Love your posts ! They are amazingly helpful .
Does the Python LSTM book have code snippets in Python 3 as well?
Coz it becomes a little difficult to search for the right modules and attributes otherwise :/

Hey. I was hypertuning a model on 4 different choices of hyper parameters. However, in the grid_results_ dictionary, the rank_test_score key has array with all same values. I find that confusing. Shouldn’t it have 4 different values in each place?
Something like [1,3,2,4] ?
What could be the explanation for this?

If you have one parameter and you want to test 4 values, each value needs one run. Ideally, we would run many times for each parameter value and take the average skill score given the stochastic nature of ML algorithms.

What I understand is that when we have more than 1 (say 2) hyper-parameters in a grid, then for each combination, the code will complete as many epochs as I have specified, with as many training-cross-validation sets as specified (the CV in GridSearchCV). So, going through all those epochs, for each training-cross-validation set, we get the avg accuracy over all the cross-validation sets for every combination.

So when you say 1 run only in the case of a single hyperparameter, that means only 1 training-crossvalidation set? Because only in this case, there won’t be any averaging involved.

Is that what I have to do? Change the training-crossValidation set to just 1?

I cannot thank you enough. I am sure that there are many people like me who have learnt a lot from your tutorials on both “R” and “Python”. I have been following your tutorials for more than 3 years now. Before, I was using R; however, I recently moved to Python for deep learning. And I find your tutorial, as usual, exceptional. I think Andrew Ng’s and CS231n’s (Andrej Karpathy) theoretical courses and your programming course on deep learning are among the best in the world. You rock! Thanks a lot.

I do have a question 🙂 as well.
The grid search parameter tuning works perfectly with CPU. I agree with your suggestion not to tune everything at once. Now I moved to GPU implementation. I was able to execute the code if I chose options n_job=1. However, if I do multi-threading n_job=-1. I am getting “CUDA_ERROR_OUT_OF_MEMORY”. I have GeForce GTX 1080. Did you happen to encounter similar kind of error? I will post you the error log if needed.

Hi Jason,
Thank you for the response. The parameter search using CPU (n_job=-1) is (2.961489-4.977758) while using GPU (n_job=1) is (140.101048-142.151023) second.

One more thing, after grid search I have value for parameters {batch_size, activation, neurons, learn_rate..} and accuracy around 90%. However, I wonder why reusing these model parameter does not provide the same results, now accuracy is 52%. Even though I executed it many times with same parameter the accuracy remains the same (52%). I could not achieve the accuracy as shown in grid search using best model parameter. I am doing 5-fold CV I do not expect the accuracy to be the same since it is stochastic process but it should be around SD±5%. What do you think? Did you also happen to encounter the same thing ?

Also the best parameter values changes in each executions with an accuracy SD±5%.

Thanks

P.S:
Below code is something I am doing to limit GPU memory usage and run multiple grid search. However, we should know the memory usage in advance (cs231n.github.io/convolutional-networks/#case). Let me know if it makes sense.

Also, we can use n_jobs. I tried with n_jobs = 2; however, the GPU memory is allocated based on a fraction. I am searching for how to allocate memory in MB instead. I will do more research on this “CUDA_ERROR_OUT_OF_MEMORY” and update you.

Could you please help on how to do features normalization while doing the grid search and cross-validation. Is normalization is done automatically here, GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=15,cv=rkf)? If I normalize the features during training X = scaler.transform(X_train), this will introduce bias in cross-validation. Also, if possible, can you please provide me references on using scikit-learn wrapper with Keras for advance options, is their any limitation on wrapper ?
Thanks

I normalize my data prior to grid search using X = scaler.transform(X_train), but don’t you think it would introduce bias in the performance? Normally, I expect to normalize the train set and use that normalization factor to normalize the test or validation set before prediction. Maybe I did not understand you properly; how do you do normalization prior to grid search?

I started looking at the pipeline (http://scikit-learn.org/stable/modules/pipeline.html) on how they have been using it for SVM, lets see. I would expect the pipeline to work for Keras as well, as this is a classical problem in machine learning. Why do you expect error here? I wanted to take the full advantage from automatic grid search. Well, the final option will be to implement my own grid search.

This is such a great, thorough tutorial. Thanks for keeping your tutorials up to date! It’s so nice finding a resource with examples that you know will work because they’ve been tested on recent versions of required packages.

Thank you for your great tutorial. I tried to use it for my model with multiple inputs, but it didn’t work. I found that the scikit-learn wrapper does not work for multiple inputs. It gives me an error for grid.fit([input1,input2],y)
Do you have any suggestion to handle it?
Thanks,

When I run your code to tune the dropout_rate, I get the following error:
ValueError: dropout_rate is not a legal parameter

In fact, I get this error for all labels except epochs and batch_size. Both of these were recognized and ran fine. I could not find a reference to valid labels anywhere, even in API docs. Any suggestions?

I removed comment # and ran each one separately. For example, running the first param_grid values resulted in: Error – optimizer is not a valid parameter. They all got the same rejection notice except for epochs and batch_size.
I hope that helps.

Just to be clearer, each parameter had its own name in the error message as follows:

Error – optimizer is not a valid parameter
Error – learn_rate is not a valid parameter
Error – learning_rate is not a valid parameter
Error – init is not a valid parameter
Error – init_mode is not a valid parameter
Error – dropout_rate is not a valid parameter

Let’s say I’m using 5 fold CV on a relatively small dataset (not necessarily for a deep learning model). In this case, the variance of the performance metric might be quite high, and just by chance, a point on the grid that is in reality far from optimal, might be selected as the “best”.

So are there any approaches to smooth out the response surface of the grid search, to deal with “spikes” in performance due to variance?

What a great blog, I very much appreciate you sharing some of your expertise!

I want to grid search the hyperparameters of my CNN, but I’m using data augmentation with ImageDataGenerator. So I’m not calling model.fit but model.fit_generator for the actual training.
This does not seem to be supported through the grid search.
Am I forced to write my own KerasClassifier implementation?

Would you advise to just fall back to using (nested) for loops instead, or would I be missing some ‘magic’ from the existing scikit gridsearch?
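For reference, a hand-written grid search can be fairly short. This is a minimal sketch, assuming you can wrap your own training run (for example one that uses a generator) in a function that returns a single score. manual_grid_search() and evaluate() are hypothetical names, and the scoring formula inside evaluate() is a made-up placeholder, not a real model.

```python
from itertools import product


def manual_grid_search(param_grid, evaluate):
    """Try every combination in param_grid and return the best one."""
    names = list(param_grid)
    results = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((evaluate(**params), params))
    best_score, best_params = max(results, key=lambda item: item[0])
    return best_score, best_params, results


def evaluate(learn_rate, dropout_rate):
    # Placeholder: a real version would build the model, train it with the
    # data generator and return the validation accuracy.
    return 1.0 - abs(learn_rate - 0.01) * 10 - abs(dropout_rate - 0.2)


param_grid = {"learn_rate": [0.001, 0.01, 0.1], "dropout_rate": [0.0, 0.2, 0.5]}
best_score, best_params, results = manual_grid_search(param_grid, evaluate)
print(best_params)  # {'learn_rate': 0.01, 'dropout_rate': 0.2}
```

What the loop leaves out compared with GridSearchCV is mainly the built-in cross-validation, the parallel execution through n_jobs and the automatic refit of the best configuration, so those would need to be added by hand if you rely on them.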

NaN outputs as in my predictions?
Or the weights?
If the gradient explodes, then the weights will be very large (probably NaN), hence the output would also be NaN.
But how would this logic be used for vanishing gradients? In this case the weights basically stop changing, right?

Should I use some kind of code that checks by how much the weights at each layer are changing, and if after a certain threshold they haven’t changed by a certain amount, declare a vanishing gradient?
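The check described in this comment could look roughly like the sketch here. It assumes each layer's weights have been flattened into a plain list of floats before and after some training (in Keras, layer.get_weights() returns NumPy arrays that could be flattened this way). The function names and the threshold value are hypothetical, and a layer that stops changing may simply have converged, so this is a hint rather than a diagnosis.

```python
import math


def has_nan(weights):
    """True if any weight is NaN, which often points to exploding gradients."""
    return any(math.isnan(w) for w in weights)


def mean_abs_change(before, after):
    """Average absolute difference between two weight snapshots."""
    return sum(abs(a - b) for a, b in zip(after, before)) / len(before)


def stalled_layers(before, after, threshold=1e-6):
    """Names of layers whose weights moved less than threshold on average."""
    return [
        name
        for name in before
        if mean_abs_change(before[name], after[name]) < threshold
    ]


# Toy snapshots taken before and after a few epochs (made-up numbers).
before = {"dense_1": [0.50, -0.30], "dense_2": [0.10, 0.20]}
after = {"dense_1": [0.50, -0.30], "dense_2": [0.15, 0.18]}

stalled = stalled_layers(before, after)
print(stalled)  # ['dense_1']
```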

I have a question for you, Jason, and for the general audience. I tried to find the optimal number of neurons for one of the hidden layers. I looped over my function, which contains my deep learning model. It is fast enough for the values I define and I get a result based on accuracy. However, when I use your code, it is extremely slow and never finishes. How long does it take on your computer?

Thank you for your quick reply. I am trying a grid search over the number of neurons on the Iris data set for the purpose of learning. I scale the data first and then transform and encode the dependent variable. However, first of all, even though I use a small data set and few parameters, it is slow; second, when I get the results, they are all zero. This is a very basic example and I am pretty sure that my code is correct, but I guess I am missing something.

Hi Jason,
Thank you for the great tutorial. I just have an issue when using exactly your code: when I try to parallelize the grid search with n_jobs=-1, I end up with the error “AttributeError: Can’t get attribute ‘create_model’ on ” while it works well without parallelization. Any idea where the issue comes from?
Thank you,
Wassim

Thanks very much for the tutorial. It is extremely helpful for my work. I came across a problem with grid search with Keras (tensorflow backend). I want to run the same grid search on different datasets. Everything works fine on the first dataset. But when I fit the grid search to the second dataset, the program gets stuck there. I run the grid search with n_jobs=-1 and put keras.backend.clear_session() between the two fits. You can replicate this issue by fitting to the data twice in your examples. Could you please kindly help me with this issue?

I got it to work by just fitting one dataset in the python script and looping the python script over multiple datasets in a bash script. I am still not clear why second fitting fails in python, but this is a not-so-beautiful workaround.

Thank you so much for sharing your knowledge.
I am trying to optimize the number of hidden layers.
I can't figure out how to do it with Keras (actually, I am wondering how to set up the create_model function in order to tune the number of hidden layers).
Could you please help me?
Thank you
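One way to approach this is to make the number of hidden layers an argument of create_model() and add the layers in a loop. This is only a sketch under assumptions: the argument names hidden_layers, neurons and input_dim are invented here, the default of 8 inputs and the binary output layer are guesses about the dataset, and the Keras imports sit inside the function so the grid can be defined without loading TensorFlow.

```python
def create_model(hidden_layers=1, neurons=12, input_dim=8):
    """Build a binary classifier with a configurable number of hidden layers."""
    # Imported lazily; requires TensorFlow to be installed when called.
    from tensorflow.keras.layers import Dense, Input
    from tensorflow.keras.models import Sequential

    model = Sequential()
    model.add(Input(shape=(input_dim,)))
    # Add the requested number of identical hidden layers.
    for _ in range(hidden_layers):
        model.add(Dense(neurons, activation="relu"))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model


# The grid keys must match the argument names of create_model().
param_grid = dict(hidden_layers=[1, 2, 3], neurons=[8, 16])
```

The wrapped model and param_grid can then be handed to GridSearchCV in the same way as the other examples in the post.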

I tried to execute the grid search but ran into parallelism issues. I have a Windows OS and I get this error when I try to run the script on multiple CPUs:

ImportError: [joblib] Attempting to do parallel computing without protecting your import on a system that does not support forking. To use parallel-computing in a script, you must protect your main loop using “if __name__ == '__main__'”. Please see the joblib documentation on Parallel for more information.

Thanks a lot for such a wonderful post. Overall, there are a lot of parameters that need to be tuned. I was thinking of using RandomizedSearchCV instead of GridSearchCV. Still, it will be time-consuming for a lot of simulations. Do you have any suggestions for fast parameter tuning? For example, can we say that specific parameters have more effect on scores, so let's try to Grid/RandomizedSearchCV them first?
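To give a feel for the saving, the sketch here draws a fixed number of random combinations from a full grid using only the standard library. The grid values are made up for illustration. RandomizedSearchCV in scikit-learn does this kind of sampling for you, with the budget set by its n_iter argument.

```python
import random
from itertools import product

# Hypothetical search space; the values are illustrative only.
param_grid = {
    "optimizer": ["SGD", "RMSprop", "Adam"],
    "learn_rate": [0.001, 0.01, 0.1],
    "dropout_rate": [0.0, 0.2, 0.5],
    "neurons": [5, 10, 20, 30],
}

# Every combination a full grid search would have to train.
all_combos = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

# A random search trains only a fixed budget of them.
rng = random.Random(0)
sampled = rng.sample(all_combos, k=10)

print(len(all_combos), len(sampled))  # 108 10
```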

I was wondering if it would be more appropriate to tune all the hyperparameters at one go instead of breaking it up into various parts as shown above – you may be doing it for the sake of visibility of how each component is tuned but would it be better to tune everything together since there might be “interactions between the hyperparameters” which would not be captured if they were tuned separately?
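The concern about interactions can be shown with a toy example. The scores in this sketch are invented purely so that the two parameters interact; they do not come from any real model. Tuning one parameter at a time from the defaults settles on a weaker combination than searching both together.

```python
from itertools import product

# Made-up validation scores keyed by (learn_rate, momentum).
scores = {
    (0.01, 0.0): 0.70,
    (0.1, 0.0): 0.65,
    (0.01, 0.9): 0.65,
    (0.1, 0.9): 0.90,
}
learn_rates = [0.01, 0.1]
momentums = [0.0, 0.9]

# One at a time: tune learn_rate with momentum at its default, then momentum.
default_momentum = 0.0
best_lr = max(learn_rates, key=lambda lr: scores[(lr, default_momentum)])
best_momentum = max(momentums, key=lambda m: scores[(best_lr, m)])
sequential = (best_lr, best_momentum)

# Jointly: evaluate every combination.
joint = max(product(learn_rates, momentums), key=lambda pair: scores[pair])

print(sequential, scores[sequential])  # (0.01, 0.0) 0.7
print(joint, scores[joint])            # (0.1, 0.9) 0.9
```

The trade-off is cost: the joint grid grows as the product of the number of values per parameter, while one-at-a-time tuning grows only as their sum.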

I have an extremely imbalanced data set to study, in which #negative : #positive is about 100:1. When I built the first model, I performed 10-fold cross-validation, and in each round I used oversampling to add positive samples to the training data, but not to the testing data. Now my question is: if I want to perform a hyperparameter search, how do I tell GridSearchCV() to do oversampling in each round of cross-validation?
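The idea of oversampling only the training part of each fold can be sketched with the standard library, as here. The helper names and the toy labels (19:1 rather than 100:1) are made up for illustration. The imbalanced-learn package also provides a Pipeline whose samplers run only during fitting, which is one way to get this behaviour inside GridSearchCV.

```python
import random


def stratified_folds(y, k, rng):
    """Split indices into k folds, spreading each class evenly across folds."""
    folds = [[] for _ in range(k)]
    for label in sorted(set(y)):
        members = [i for i, value in enumerate(y) if value == label]
        rng.shuffle(members)
        for position, index in enumerate(members):
            folds[position % k].append(index)
    return folds


def oversample_minority(train_idx, y, rng, minority=1):
    """Duplicate random minority samples until both classes are the same size."""
    minority_idx = [i for i in train_idx if y[i] == minority]
    majority_idx = [i for i in train_idx if y[i] != minority]
    n_extra = max(0, len(majority_idx) - len(minority_idx))
    extra = rng.choices(minority_idx, k=n_extra) if minority_idx else []
    return list(train_idx) + extra


y = [1] * 10 + [0] * 190  # toy labels: 10 positives, 190 negatives
rng = random.Random(0)
folds = stratified_folds(y, 5, rng)

fold_summaries = []
for i, test_idx in enumerate(folds):
    train_idx = [j for n, fold in enumerate(folds) if n != i for j in fold]
    balanced = oversample_minority(train_idx, y, rng)
    n_pos = sum(y[j] == 1 for j in balanced)
    n_neg = sum(y[j] == 0 for j in balanced)
    # The test fold is never resampled, so its class ratio stays realistic.
    test_untouched = set(balanced).isdisjoint(test_idx)
    # A model would be fitted on `balanced` and scored on `test_idx` here.
    fold_summaries.append((n_pos, n_neg, test_untouched))

print(fold_summaries[0])  # (152, 152, True)
```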

A good 2018 to you. I have a question about how Keras early stopping callbacks might be able to use the validation set generated by GridSearchCV's k-fold split for their val_loss or val_acc. I posted the question on StackOverflow, but I wanted to call your attention to it, should you so wish.

Hi, I have a basic query. I have a training set and a test set. I built an LSTM on the training set, using history = model.fit(trainX, trainY, epochs=100, batch_size=50,
validation_data=(testX, testY), verbose=0, shuffle=False) to fit my model.
After this I called model.predict(testX) to get the predicted Y values. That was the basic code. I am now trying to apply grid search. What changes do I have to make to the history statement to apply grid =
GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(testX, testY, verbose=0, shuffle=False)

Hi Jason, thank you for your great tutorial! My question here is about ‘grid_result.best_score_’. In this article the best score seems to be the best mean score, but in a regression problem, the mean score is irrelevant, so I have to look for the best std score. Is that correct?

I’m not sure. I just copied the code from this tutorial and changed ‘KerasClassifier’ to ‘KerasRegressor’. I didn’t make any change other than that. I don’t understand how the score function works and I’m not familiar with the concept of negative MSE. Would you please elaborate?
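On the negative MSE point: scikit-learn's search tools always pick the configuration with the highest score, so error metrics are negated to turn "lower is better" into "higher is better" (this is what the scoring name neg_mean_squared_error refers to). A small sketch with made-up predictions shows that the largest negated score belongs to the model with the smallest error.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length lists."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3.0, 5.0, 7.0]
# Made-up predictions from two hypothetical configurations.
predictions = {
    "model_a": [2.0, 5.0, 8.0],  # MSE = 2/3
    "model_b": [3.0, 5.0, 6.0],  # MSE = 1/3
}

# Negate the error so that a higher score means a better model.
neg_mse = {name: -mse(y_true, pred) for name, pred in predictions.items()}
best = max(neg_mse, key=neg_mse.get)

print(best)  # model_b
```

The mean across folds is still the value used to rank configurations in a regression problem; the standard deviation only describes how much that score varied between folds.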

I am curious about the use of CV. Each time you call
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
you are compiling a new keras model with the new set of parameters.

Are these different models of keras, compiled one after another, accumulating in the memory? Would this imply a memory usage problem in the case of an extensive grid search with bigger models? Any tips?