Saturday, November 3, 2012

Draw Multinomial Random Variables

* It is a little tricky to draw multinomial random variables (joint binomial distribution) in Stata because Stata does not have a canned inverse CDF function for the binomial that inverts P values to x values.

* Let's define the arguments of the program.
* N is the total number of trials.
* P is the CDF value that we would like inverted
* p is the probability of one outcome being a success.
* newvar is the name of the new variable to be generated.
args newvar P N p

* The challenging thing about trying to invert a CDF for a count distribution is that the CDF is a probability mass function with discontinuous regions.
* However, the particularly nice thing about the binomial distribution is the the pool of outcomes is well defined. (1,2,3..N)
* Thus we can simply count how many times the "steps" of the CDF fall below the target CDF level.
cap drop `newvar'
qui gen `newvar'=0
label var `newvar' "Inverse CDF of binomial"

forv k=1(1)`N' {
* Add 1 to the variable value every time the CDF(n,k-1,p) is less than the target CDF value P.qui replace `newvar'=`newvar'+1 if binomial(`N', `k'-1, `p')<(`P')
}

* First we will specify the correlation matrix.
* The only constraint as far as I know is that the covariance matrix has to be PSD.
* This in practicality limits the possible correlations between variables since cross terms tend to cause vialations more likely in the PSD requirement.

* If we do not specify a mean or covariance matrix then the default draws are standard normals which is what we want for simplicity.
drawnorm x1 x2 x3 x4, corr(c)

corr x?
spearman x?

* Now all we need to do is turn our normal draws into uniform draws.
* Note: if x~N(0,1) and THETA is the CDF of the normal then y=CDF(x)~uniform
* So for any new distribution with CDF ALPHA and inverse INVALPHA the variable z=INVALPHA(y) ~ alpha.

sum
* Looking good. All of the y variables are uniformly distributed now while still maintaining most of the correlation and all of the rankcorrelation.

* The key final step is to take the inverse CDF of the uniforms for a series of binomial draws.

* The function CDFinv is programmed so that the:
* 1st argument is the new variable name to be created
* 2nd is the cumulative P value to be inverted
* 3rd Number of total draws
* 4th probability of success on each individual draw

* Let's imagine you would like to simulate quarterly data on verterans.

CDFinv z1 y1 4 .05
label var z1 "Hospitalization from mental illness"

CDFinv z2 y2 4 .15
label var z2 "Hospitalization from accident"

CDFinv z3 y3 4 .2
label var z3 "Seeks treatment for PTSD"

CDFinv z4 y4 1 .2
label var z4 "Unemployed"

corr z?
spearman z?

* Note: Using an inverse CDF function is highly non-linear. Thus the correlations are significantly weakened since only a linear transformation of the initial values would maintain the same correlation values.
* To adjust the final correlations, experiment with initial values until you find outcomes that approximate the level of correlation desired.
* It is worth noting that there is an upper limit to the amount of correlation possible as a result of this transformation.