The covariates are on very different scales: some lie in [0,1], others in [1,20000].
To make Stan run smoothly, I’d like to scale the covariates so they are all in the same range. Two immediate thoughts on this:

1. Scale each covariate to be in [0,1].

2. We have population in the model, so scale each covariate as a percentage of population.

Both run fine in Stan. The problem comes when trying to estimate predicted counts. Because every covariate is now in the compressed range, the posterior samples for the counts are always too low. My intuition is that I need to un-scale the posterior samples in some way, but I'm not sure how. Any suggestions on how to attack this?

My intuition is that I need to un-scale the posterior samples in some way

Don’t change the posterior samples unless you really know what you are doing. But the predictions should be fine if your population values are scaled in the same way as in your sample.
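To make "scaled in the same way" concrete, here is a minimal sketch, assuming min-max scaling and a log link for the expected count (neither is confirmed by the model in this thread). All numbers and function names are made up for illustration. The point is that new data must be scaled with the range stored from the fitting data, and that mapping the coefficients back to the raw scale gives the same linear predictor.

```python
import math

def fit_minmax(values):
    """Return the (lo, hi) range learned from the data used to fit the model."""
    return min(values), max(values)

def apply_minmax(x, lo, hi):
    """Scale x with the stored fitting range, not the range of the new data."""
    return (x - lo) / (hi - lo)

def unscale_coefficients(alpha_s, beta_s, lo, hi):
    """Map (intercept, slope) on the scaled covariate back to the raw covariate."""
    beta = beta_s / (hi - lo)
    alpha = alpha_s - beta * lo
    return alpha, beta

# Hypothetical values, for illustration only.
train_x = [1.0, 250.0, 4000.0, 20000.0]
lo, hi = fit_minmax(train_x)
alpha_s, beta_s = 0.5, 2.0   # stand-in for one posterior draw on the scaled covariate

new_x = 12000.0

# Route 1: scale the new covariate exactly as the fitting data were scaled.
eta_scaled = alpha_s + beta_s * apply_minmax(new_x, lo, hi)

# Route 2: transform the coefficients and use the raw covariate.
alpha, beta = unscale_coefficients(alpha_s, beta_s, lo, hi)
eta_raw = alpha + beta * new_x

# Assumed log link: expected count is exp(linear predictor).
expected_count_scaled = math.exp(eta_scaled)
expected_count_raw = math.exp(eta_raw)
```

Both routes give the same expected count, so predictions that come out too low usually point to the new data being scaled differently from the fitting data rather than to a need to alter the draws.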

But scaling the predictors marginally is not ideal. It often works much better to do a QR reparameterization of the design matrix X, which is discussed in the manual and is a strongly recommended option in rstanarm, where your model would be

The QR reparameterization of X makes the columns of Q have the same length and be orthogonal to each other. After inverting the reparameterization to obtain coefficients on X, you can do prediction in the original scale of the variables.
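As a rough sketch of the algebra, using only the Python standard library: the thin QR factorization is computed here with modified Gram-Schmidt purely for illustration (in Stan or R you would call the built-in QR routines). The design matrix and the value of `theta` are made up. The scaling by sqrt(n - 1) follows the convention in the Stan manual, with Q* = Q sqrt(n - 1), R* = R / sqrt(n - 1), and beta = R*^{-1} theta.

```python
import math

def thin_qr(X):
    """Thin QR of an n-by-k matrix (list of rows) via modified Gram-Schmidt."""
    n, k = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(k)]
    q_cols = []
    R = [[0.0] * k for _ in range(k)]
    for j in range(k):
        v = cols[j][:]
        for i, q in enumerate(q_cols):
            R[i][j] = sum(a * b for a, b in zip(q, v))
            v = [a - R[i][j] * b for a, b in zip(v, q)]
        R[j][j] = math.sqrt(sum(a * a for a in v))
        q_cols.append([a / R[j][j] for a in v])
    Q = [[q_cols[j][i] for j in range(k)] for i in range(n)]
    return Q, R

def solve_upper(R, b):
    """Solve R x = b for upper-triangular R by back substitution."""
    k = len(b)
    x = [0.0] * k
    for i in reversed(range(k)):
        s = sum(R[i][j] * x[j] for j in range(i + 1, k))
        x[i] = (b[i] - s) / R[i][i]
    return x

def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]

# Hypothetical design matrix with columns on very different scales.
X = [[0.1, 1200.0], [0.4, 300.0], [0.9, 18000.0], [0.5, 7000.0], [0.2, 50.0]]
n = len(X)
Q, R = thin_qr(X)

scale = math.sqrt(n - 1)
Q_star = [[q * scale for q in row] for row in Q]
R_star = [[r / scale for r in row] for row in R]

theta = [0.3, -1.2]                  # stand-in for one draw of the coefficients on Q*
beta = solve_upper(R_star, theta)    # coefficients on the original X

eta_from_Q = matvec(Q_star, theta)
eta_from_X = matvec(X, beta)
```

The two linear predictors agree row by row, which is why predictions can be made on the original scale once `beta` has been recovered.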

I implemented it in the model, chains mix nicely, samples look good, etc. BUT, the coefficients recreated in generated quantities have a median of zero or very close to zero. That seems highly unlikely.

Stan file pasted below. This is a hurdle model for student activity counts, with a campus-specific intercept. There are separate coefficients for each portion of the model (the probability of a zero, and the count if non-zero).

the coefficients recreated in generated quantities have a median of zero or very close to zero. That seems highly unlikely.

I have no way of judging whether that is plausible or not, but I tend to not argue with posterior distributions when it appears Stan is working well. I would be looking at the posterior predictive distribution of the outcome, especially to see if the proportion of zeros is similar to that in the data.
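A check along those lines could look something like the sketch below. It assumes you have replicated outcomes from the fitted model, one list per posterior draw (called `y_rep_draws` here, a hypothetical name); the tiny data set is invented just to show the mechanics.

```python
import statistics

def proportion_zero(counts):
    """Fraction of outcomes that are exactly zero."""
    return sum(1 for c in counts if c == 0) / len(counts)

def zero_check(y_obs, y_rep_draws):
    """Compare the observed share of zeros with the share in each replicated data set."""
    observed = proportion_zero(y_obs)
    replicated = [proportion_zero(draw) for draw in y_rep_draws]
    # Share of replications with at least as many zeros as observed.
    tail = sum(1 for r in replicated if r >= observed) / len(replicated)
    return {
        "observed": observed,
        "replicated_median": statistics.median(replicated),
        "tail_fraction": tail,
    }

# Made-up numbers for illustration only.
y_obs = [0, 0, 3, 1, 0, 5, 2, 0]
y_rep_draws = [
    [0, 1, 2, 0, 0, 4, 1, 0],
    [0, 0, 0, 2, 1, 3, 0, 1],
    [1, 0, 2, 0, 3, 0, 1, 1],
]
result = zero_check(y_obs, y_rep_draws)
```

If the observed proportion sits far out in the tail of the replicated proportions, the zero part of the hurdle model is probably misspecified, whatever the individual coefficients look like.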

Strangely, no matter what I try, almost all of the coefficients (beta_zero and beta_count) have a mean value of zero. Many of the covariates are highly correlated with the dependent variable, and have reasonable coefficient values using other regression tools. This leads me to believe that I did something wrong with the QR decomposition part. I tried to follow the Stan documentation exactly, but must have an error somewhere.

Does anyone see anything that might cause this? (Stan code in earlier post)

I’d recommend simulating data where you know the coefficients are not zero and seeing what happens.
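One way to set that up is sketched below, using only the standard library. It assumes a hurdle model with a logit link for the probability of a zero and a zero-truncated Poisson with a log link for the positive counts; the actual model in this thread may differ, and the coefficient values and names (`true_beta_zero`, `true_beta_count`) are invented. Fit the Stan program to `X` and `y` and check whether the posterior recovers the known values.

```python
import math
import random

def inv_logit(u):
    return 1.0 / (1.0 + math.exp(-u))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def poisson_draw(rng, lam):
    """Knuth's multiplication method; adequate for the small rates used here."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def hurdle_draw(rng, p_zero, lam):
    """Zero with probability p_zero, otherwise a zero-truncated Poisson(lam) draw."""
    if rng.random() < p_zero:
        return 0
    while True:  # rejection sampling for the truncated part
        y = poisson_draw(rng, lam)
        if y > 0:
            return y

rng = random.Random(1234)
true_beta_zero = [-0.5, 1.0]    # known, clearly non-zero (hypothetical values)
true_beta_count = [1.0, 0.5]

n = 500
X = [[1.0, rng.gauss(0.0, 1.0)] for _ in range(n)]  # intercept plus one covariate
y = [
    hurdle_draw(rng,
                inv_logit(dot(row, true_beta_zero)),
                math.exp(dot(row, true_beta_count)))
    for row in X
]
prop_zero = sum(1 for v in y if v == 0) / n
```

If the fit on simulated data also returns coefficients near zero, the problem is in the model code (for example in how the QR transform is inverted); if it recovers the true values, the issue is more likely in the real data.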

If you have highly correlated predictors x1 and x2, and you have coefficients beta1 and beta2 with a symmetric prior, then you expect beta1 and beta2 to have a posterior mean of zero. Look at their sum, which won’t be expected to be zero if x1 and x2 are informative about y.
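To see why the sum is the thing to look at, here is a small sketch that uses the least-squares residual sum of squares as a stand-in for the likelihood (an assumption made for simplicity; the hurdle model's likelihood is different, but the identification argument is the same). The data are invented, with `x2` nearly a copy of `x1`.

```python
def rss(y, x1, x2, b1, b2):
    """Residual sum of squares for the linear predictor b1*x1 + b2*x2."""
    return sum((yi - b1 * a - b2 * b) ** 2 for yi, a, b in zip(y, x1, x2))

# Made-up, nearly collinear predictors.
x1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x2 = [-2.01, -0.99, 0.01, 1.0, 1.99]
y = [1.5 * a + 1.5 * b for a, b in zip(x1, x2)]   # true coefficients sum to 3

# Very different individual coefficients, same sum: fit is almost unchanged.
same_sum = {
    (1.5, 1.5): rss(y, x1, x2, 1.5, 1.5),
    (3.0, 0.0): rss(y, x1, x2, 3.0, 0.0),
    (0.0, 3.0): rss(y, x1, x2, 0.0, 3.0),
    (10.0, -7.0): rss(y, x1, x2, 10.0, -7.0),
}

# Change the sum and the fit degrades sharply.
wrong_sum = {
    (0.0, 0.0): rss(y, x1, x2, 0.0, 0.0),
    (5.0, -5.0): rss(y, x1, x2, 5.0, -5.0),
}
```

The data barely distinguish between coefficient pairs with the same sum, so the marginal posteriors for beta1 and beta2 are wide and driven mostly by the prior, while the posterior for beta1 + beta2 is tight.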