Bayesian Optimization (BO)
Javad Azimi
Fall 2010
http://web.engr.oregonstate.edu/~azimi/
Outline
• Formal Definition
• Application
• Bayesian Optimization Steps
– Surrogate Function (Gaussian Process)
– Acquisition Function
• PMAX
• IEMAX
• MPI
• MEI
• UCB
• GP-Hedge
Formal Definition
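• Input: an unknown function f over an input space X that is expensive to evaluate.
• Goal: find the maximizer $x^* = \arg\max_{x \in X} f(x)$ using as few evaluations as possible.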
Fuel Cell Application
[Figure: how a microbial fuel cell (MFC) works. Bacteria at the anode oxidize the fuel (organic matter) into oxidation products (CO2), releasing electrons (e-) and H+; at the cathode, O2 is converted to H2O.]
[Figure: SEM image of bacteria sp. on Ni nanoparticle enhanced carbon fibers.]
• The nano-structure of the anode significantly impacts electricity production.
• We should optimize the anode nano-structure to maximize power by selecting a set of experiments.
Big Picture
• Since running an experiment is very expensive, we use BO.
• Select one experiment to run at a time, based on the results of
previous experiments.
• Loop: Current Experiments → Our Current Model → Select Single Experiment → Run Experiment → back to Current Experiments.
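The loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `nearest_neighbor_model` is a hypothetical stand-in surrogate (not a Gaussian process), `upper_bound` is a simple mean-plus-uncertainty selection rule, and the objective in the usage lines is a made-up toy function rather than a real experiment.

```python
def nearest_neighbor_model(observations):
    """Hypothetical stand-in surrogate: returns predict(x) -> (mean, std).

    The mean is the value of the nearest observed point and the
    uncertainty grows with the distance to it.
    """
    def predict(x):
        nearest_x, nearest_y = min(observations, key=lambda obs: abs(obs[0] - x))
        return nearest_y, abs(nearest_x - x)
    return predict


def upper_bound(mean, std, k=2.0):
    """Assumed selection rule: predicted mean plus k times the uncertainty."""
    return mean + k * std


def bayes_opt(run_experiment, candidates, initial_x, budget):
    # Current experiments: start from a single observed point.
    observations = [(initial_x, run_experiment(initial_x))]
    for _ in range(budget):
        # Our current model: refit the surrogate on everything seen so far.
        predict = nearest_neighbor_model(observations)
        tried = {x for x, _ in observations}
        untried = [x for x in candidates if x not in tried]
        if not untried:
            break
        # Select a single experiment: the untried point with the best score.
        next_x = max(untried, key=lambda x: upper_bound(*predict(x)))
        # Run the experiment and add the result to the current experiments.
        observations.append((next_x, run_experiment(next_x)))
    best = max(observations, key=lambda obs: obs[1])
    return best, observations


# Toy usage: the "experiment" is a cheap made-up function with its maximum at x = 0.3.
candidates = [i / 20 for i in range(21)]
best, history = bayes_opt(lambda x: -(x - 0.3) ** 2, candidates, initial_x=0.0, budget=8)
print("best (x, y) found:", best)
```

In a real application `run_experiment` would be the expensive step, so the budget would be small and the surrogate would be a proper probabilistic model.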
BO Main Steps
• Surrogate Function (Response Surface, Model)
– Builds a posterior over unobserved points from the prior
and the observed data.
– Its parameters may also be set based on the prior.
Remember, it is a BAYESIAN approach.
• Acquisition Criterion (Function)
– Decides which sample should be selected next.
Surrogate Function
• Models the unknown function based on the prior
and the observations.
– Deterministic (classical linear regression, …)
• There is a single deterministic prediction for each point x in
the input space.
– Stochastic (Bayesian regression, Gaussian
Process, …)
• There is a distribution over the prediction for each
point x in the input space (e.g., a normal distribution).
– Example
• Deterministic: f(x1) = y1, f(x2) = y2
• Stochastic: f(x1) ~ N(y1, 2), f(x2) ~ N(y2, 5)
Gaussian Process(GP)
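• A Gaussian process is a collection of
random variables, any finite number of which
have a joint Gaussian distribution.
– Consistency requirement or marginalization
property.
• Marginalization property: if $(y_1, y_2) \sim N(\mu, \Sigma)$, then $y_1 \sim N(\mu_1, \Sigma_{11})$, where $\mu_1$ and $\Sigma_{11}$ are the corresponding sub-vector and sub-matrix.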
Gaussian Process(GP)
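• Formal prediction: given observations (X, y) and a test point $x_*$, the noise-free posterior is
$f_* \mid X, y, x_* \sim N\big(k_*^\top K^{-1} y,\; k(x_*, x_*) - k_*^\top K^{-1} k_*\big)$,
where $K_{ij} = k(x_i, x_j)$ and $(k_*)_i = k(x_i, x_*)$.
• Interesting points:
– The squared exponential covariance function corresponds to Bayesian
linear regression with an infinite number of basis functions.
– The variance is independent of the observed values (it depends only on the inputs).
– The mean is a linear combination of the observations.
– If the covariance function specifies the entries of the
covariance matrix, marginalization is satisfied!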
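The two properties above (mean linear in the observations, variance independent of them) can be checked with a minimal sketch of the noise-free GP prediction. It assumes 1-D inputs, a zero prior mean, a squared exponential covariance with a hypothetical `length_scale` parameter, and a tiny `jitter` term added only for numerical stability. The function names are illustrative, and a small linear solver is included so the block needs only the standard library.

```python
import math


def sq_exp_kernel(a, b, length_scale=1.0):
    """Squared exponential covariance between two scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def gp_posterior(x_train, y_train, x_star, length_scale=1.0, jitter=1e-10):
    """Posterior mean and variance at one test point x_star (zero prior mean)."""
    K = [[sq_exp_kernel(a, b, length_scale) + (jitter if i == j else 0.0)
          for j, b in enumerate(x_train)]
         for i, a in enumerate(x_train)]
    k_star = [sq_exp_kernel(x_star, a, length_scale) for a in x_train]
    # weights = K^{-1} k_*; they depend on the inputs only, never on y_train.
    weights = solve(K, k_star)
    # Mean: a linear combination of the observed values.
    mean = sum(w * y for w, y in zip(weights, y_train))
    # Variance: prior variance minus the part explained by the training inputs.
    var = sq_exp_kernel(x_star, x_star, length_scale) - sum(
        w * k for w, k in zip(weights, k_star))
    return mean, max(var, 0.0)


x_train = [0.0, 1.0, 2.0]
y_train = [1.0, 3.0, 2.0]
for x_star in (1.0, 1.5, 10.0):
    print(x_star, gp_posterior(x_train, y_train, x_star))
```

At a training input the prediction reproduces the observed value with (almost) zero variance, and far from the data it falls back to the prior, which is the behavior the later slides rely on.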
Gaussian Process(GP)
• A Gaussian Process is:
– An exact interpolating regression method.
• It predicts the training data perfectly in the noise-free setting (not true in classical
regression).
– A natural generalization of linear regression.
• A nonlinear regression approach!
– A simple example of a GP can be obtained from
Bayesian regression.
• The results are identical.
– A model that specifies a distribution over functions.
Gaussian process (2):
distribution over functions
[Figure: three functions sampled from the GP, with the 95% confidence interval at each point x.]
Gaussian process (2):
GP vs Bayesian regression
• Bayesian regression:
– Distribution over weights
– The prior is defined over the weights.
• Gaussian Process:
– Distribution over functions
– The prior is defined over the function space.
• These are the same, seen from different views.
Short Summary
• Given any unobserved point z, we can define a
normal distribution over its predicted value
such that:
– Its mean is a linear combination of the
observed values.
– Its variance depends on its distance from the
observed points (the closer to observed data, the lower the
variance).
Bayesian Optimization:
(Acquisition criterion)
• Remember: we are looking for the maximizer of the unknown function, $x^* = \arg\max_{x \in X} f(x)$.
• Input:
– The set of observed data.
– A set of candidate points with their corresponding predicted mean and variance.
• Goal: decide which point should be selected next to reach the
maximizer of the function faster.
• There are different acquisition criteria (also called acquisition functions or
policies).
Policies
• Maximum Mean (MM).
• Maximum Upper Interval (MUI).
• Maximum Probability of Improvement (MPI).
• Maximum Expected Improvement (MEI).
Policies:
Maximum Mean (MM).
• Returns the point with the highest expected value.
• Advantage:
– If the model is stable and has been learned well,
it performs very well.
• Disadvantage:
– There is a high chance of getting stuck in a local optimum (it only
exploits).
• Can it eventually converge to the global optimum?
– No.
Policies:
Maximum Upper Interval (MUI).
• Returns the point with the highest upper bound of the 95% confidence interval.
• Advantage:
– Combines mean and variance (exploitation and
exploration).
• Disadvantage:
– Dominated by the variance, so it mainly explores the input space.
• Can it eventually converge to the global optimum?
– Yes.
– But it needs an almost infinite number of samples.
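The contrast between MM and MUI can be seen in a small sketch. It assumes each candidate already comes with a predicted mean and standard deviation from the surrogate. The `Prediction` type, the example numbers, and the use of 1.96 standard deviations for the 95% upper bound of a normal distribution are illustrative choices, not taken from the slides.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    x: float      # candidate input
    mean: float   # surrogate's predicted mean at x
    std: float    # surrogate's predicted standard deviation at x


def maximum_mean(candidates):
    """MM: pick the candidate with the highest predicted mean (pure exploitation)."""
    return max(candidates, key=lambda c: c.mean)


def maximum_upper_interval(candidates, z=1.96):
    """MUI: pick the candidate with the highest upper end of the 95% interval."""
    return max(candidates, key=lambda c: c.mean + z * c.std)


candidates = [
    Prediction(x=0.2, mean=1.0, std=0.1),   # well explored, good mean
    Prediction(x=0.7, mean=0.6, std=0.5),   # poorly explored, lower mean
]

print("MM picks x =", maximum_mean(candidates).x)
print("MUI picks x =", maximum_upper_interval(candidates).x)
```

With these made-up numbers MM stays at the well-explored point, while MUI moves to the uncertain one because its variance term dominates.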
Policies:
Maximum Probability of Improvement (MPI)
• Selects the sample with the highest probability of
improving on the current best observation (ymax)
by some margin m.
Policies:
Maximum Probability of Improvement (MPI)
• Advantage:
– Considers the mean, the variance, and ymax in the policy (smarter than
MUI).
• Disadvantage:
– The margin m is an ad hoc parameter.
– Large value of m?
• Exploration
– Small value of m?
• Exploitation
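The effect of m can be illustrated with a short sketch. It assumes a normal predictive distribution at each point and an additive margin, i.e. the probability that f(x) exceeds ymax + m. The slides do not say whether the margin is additive or relative, so that choice, the function name, and the example numbers are all assumptions.

```python
from statistics import NormalDist


def probability_of_improvement(mean, std, y_max, margin=0.0):
    """P(f(x) >= y_max + margin) for f(x) ~ N(mean, std^2); additive margin assumed."""
    threshold = y_max + margin
    if std == 0.0:
        # No uncertainty: the prediction either clears the threshold or it does not.
        return 1.0 if mean > threshold else 0.0
    return NormalDist().cdf((mean - threshold) / std)


y_max = 1.0
near_best = (1.05, 0.05)   # (mean, std): close to the current best, low variance
uncertain = (0.80, 0.50)   # lower mean, high variance

for margin in (0.0, 0.5):
    scores = {
        "near_best": probability_of_improvement(*near_best, y_max, margin),
        "uncertain": probability_of_improvement(*uncertain, y_max, margin),
    }
    print("m =", margin, "->", max(scores, key=scores.get), scores)
```

With a small margin the low-variance point next to the current best wins (exploitation); with a large margin only the high-variance point has a realistic chance of clearing the threshold (exploration).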
Policies:
Maximum Expected Improvement (MEI)
• Maximizes the expected improvement over the current best observation (ymax).
• Question: Expectation over which variable?
– The improvement margin m.
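A minimal sketch of the expected improvement, assuming a normal predictive distribution N(mean, std^2) at the candidate point. The closed form below is the standard one for that case. The second function checks it numerically by integrating the probability of improving by at least m over all margins m ≥ 0 (a standard identity for the expectation of a non-negative quantity), which is one way to read "expectation over m". Function names and example numbers are illustrative.

```python
from statistics import NormalDist


def expected_improvement(mean, std, y_max):
    """E[max(f(x) - y_max, 0)] for f(x) ~ N(mean, std^2)."""
    if std == 0.0:
        return max(mean - y_max, 0.0)
    z = (mean - y_max) / std
    standard = NormalDist()
    return (mean - y_max) * standard.cdf(z) + std * standard.pdf(z)


def expected_improvement_by_integration(mean, std, y_max, steps=20000, upper=10.0):
    """Integrate P(f(x) >= y_max + m) over margins m in [0, upper * std].

    Uses the midpoint rule and requires std > 0; the tail beyond
    upper * std is assumed to be negligible.
    """
    standard = NormalDist()
    width = upper * std / steps
    return sum(
        standard.cdf((mean - y_max - (i + 0.5) * width) / std) * width
        for i in range(steps)
    )


closed_form = expected_improvement(1.2, 0.5, 1.0)
numeric = expected_improvement_by_integration(1.2, 0.5, 1.0)
print(closed_form, numeric)
```

Unlike MPI, no margin has to be chosen by hand here, because every margin is averaged over.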
Policies:
Upper Confidence Bounds
• Selects based on the mean and variance of
each point: choose the x that maximizes $\mu(x) + k\,\sigma(x)$.
– The selection of k is left to the user.
– Recently, a principled approach to selecting this
parameter has been proposed.
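How the choice of k changes the selection can be sketched as follows. The score mean + k * std is an assumed form of the upper confidence bound, and the candidate predictions are made-up numbers used only for illustration.

```python
def ucb_score(mean, std, k):
    """Upper confidence bound: predicted mean plus k standard deviations."""
    return mean + k * std


def select_ucb(predictions, k):
    """predictions maps each candidate x to its (mean, std); returns the best x."""
    def score(x):
        mean, std = predictions[x]
        return ucb_score(mean, std, k)
    return max(predictions, key=score)


# Hypothetical surrogate predictions for three candidate points.
predictions = {
    0.2: (1.0, 0.1),   # high mean, low uncertainty
    0.7: (0.7, 0.5),   # medium mean, medium uncertainty
    0.9: (0.2, 0.9),   # low mean, high uncertainty
}

for k in (0.0, 1.0, 3.0):
    print("k =", k, "-> select x =", select_ucb(predictions, k))
```

With k = 0 the rule reduces to picking the highest mean, and larger values of k shift the choice toward points with higher uncertainty, which is why the value of k matters so much in practice.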
Summary
• We introduced several approaches, each of
which has advantages and disadvantages.
– MM
– MUI
– MPI
– MEI
– GP-UCB
• Which one should be selected for an unknown
model?
GP-Hedge
• GP-Hedge (2010)
• It selects one of the baseline policies based on
theoretical results from the multi-armed bandit problem,
although the objective is a bit different!
• The authors show that it can perform better than (or as
well as) the best baseline policy in some settings.
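The selection step can be sketched as a Hedge-style rule over the baseline policies. This is an illustration under assumptions rather than the published algorithm verbatim: each policy is assumed to nominate one point per round, one nominee is chosen with probability proportional to exp(eta * gain), and each policy's gain is then increased by the surrogate's updated mean at the point it nominated. The names `eta`, `gains`, and `nominee_means` are hypothetical.

```python
import math
import random


def hedge_probabilities(gains, eta=1.0):
    """Softmax over cumulative gains (shifted by the max for numerical stability)."""
    top = max(gains)
    weights = [math.exp(eta * (g - top)) for g in gains]
    total = sum(weights)
    return [w / total for w in weights]


def choose_policy(gains, eta=1.0, rng=random):
    """Sample the index of the policy whose nominee will be evaluated."""
    probs = hedge_probabilities(gains, eta)
    return rng.choices(range(len(gains)), weights=probs, k=1)[0]


def update_gains(gains, nominee_means):
    """Reward every policy with the updated predicted mean at its own nominee."""
    return [g + m for g, m in zip(gains, nominee_means)]


# Three baseline policies (e.g. MPI, MEI, UCB) start with equal gains.
gains = [0.0, 0.0, 0.0]
print(hedge_probabilities(gains))

# After one round with made-up rewards, the first policy becomes more likely.
gains = update_gains(gains, nominee_means=[1.0, 0.2, -0.5])
print(hedge_probabilities(gains))
print("chosen policy index:", choose_policy(gains, rng=random.Random(0)))
```

Policies whose nominees keep looking good under the updated model are chosen more often, but every policy retains a non-zero probability of being selected.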
Future Work
• Smarter method selection than GP-Hedge,
with theoretical analysis.
• Batch Bayesian optimization.
• Scheduling Bayesian optimization.