I've been playing around with Pyro for a bit. The code works fine and I'm getting great results, but the ELBO loss is always pretty high. When the objective is to generate images, the loss returned by svi.step() seems to be summed over the entire batch and all the pixels. In my experience with PyTorch, loss functions usually default to the mean over all pixels instead of the sum.

For image size 64x64 and batch size 32, the sum is 64 x 64 x 32 = 131072 times the mean, so the gradients used in each step differ by about 5 orders of magnitude. However, the tutorials (VAE, AIR, etc.) all use a learning rate of 1e-3 with the Adam optimizer and get great results. Am I misunderstanding something here?
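The arithmetic behind that estimate can be checked in a few lines. This is only a sketch; it assumes single-channel images, which the post does not state, and the variable names are illustrative.

```python
import math

# Assumption: single-channel 64x64 images, batch size 32 (as in the question).
height, width, batch_size = 64, 64, 32

# Number of per-pixel terms that a summed loss adds up in one mini-batch.
n_terms = height * width * batch_size

# A summed loss (and its gradient) is n_terms times the mean loss.
orders_of_magnitude = math.log10(n_terms)

print(n_terms)
print(round(orders_of_magnitude, 2))
```

With three colour channels the factor would be three times larger, which does not change the order-of-magnitude picture.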

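Hi @junting, I believe the Adam optimizer is scale invariant: its update divides a running average of the gradient by the square root of a running average of the squared gradient, so a constant factor on the loss cancels out (up to the small epsilon in the denominator). That means we can freely interchange mean and sum in defining the ELBO. We standardize on sum rather than mean because we often deal with hierarchical models with intricate latent structure, and it's easier to combine a pixel term of the ELBO with e.g. a higher-level latent term of the ELBO if we never divide by layer size.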
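To illustrate the scale-invariance claim, here is a minimal sketch that runs a hand-written Adam update (the standard update rule, not Pyro's or PyTorch's implementation) on a toy one-dimensional quadratic loss. The toy loss, the helper name adam_steps, and the factor 131072 (taken from the 64x64, batch-32 example) are all assumptions made for illustration.

```python
import math

def adam_steps(grad_fn, x0, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    """Run `steps` iterations of the standard Adam update on a scalar parameter."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g        # running average of the gradient
        v = beta2 * v + (1 - beta2) * g * g    # running average of the squared gradient
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

n_terms = 64 * 64 * 32  # hypothetical sum-vs-mean factor from the question

def grad_mean(x):
    # Gradient of the toy "mean" loss (x - 3)^2.
    return 2.0 * (x - 3.0)

def grad_sum(x):
    # Gradient of the same loss summed over n_terms identical terms.
    return n_terms * grad_mean(x)

x_mean = adam_steps(grad_mean, 0.0)
x_sum = adam_steps(grad_sum, 0.0)

# The two trajectories agree up to the tiny effect of eps.
print(abs(x_sum - x_mean))
```

Plain SGD with the same learning rate would not behave this way: its step is proportional to the gradient, so the summed loss would take steps 131072 times larger.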

To get the scaled loss, you can wrap your model like your_new_model = pyro.poutine.scale(your_model, scale=scale), and do the same for your guide. However, as @fritzo mentioned, the Adam optimizer is scale invariant, so as long as you keep the mini-batch size fixed (and hence the scale factor constant), things will be fine.