Tuesday, July 31, 2012

* Probit is a commonly used command to model binary outcome variables.

* The maximum likelihood sytax in Stata is specific
* I will first define three different equivalent ml programs. Then we will solve them to see how they perform.

cap program drop myprobit1
program define myprobit1
* Tell Stata that this code is written in version 11
version 11
* Define the arguments input in the maximum likelihood function
* Stata automatically generates the first argument as a temporary variable name
* that is used to store the natural log of the likelihood value of the likelihood function.
args ln_likehood xb
* The probit tries to directly maximize the probability of seeing the outcome (either y==1 or y==0)
* If y==1 then we maximize the probability of seeing a positive outcome
qui replace `ln_likehood' = ln( normal(`xb')) if $ML_y==1
* If y==0 then we want to maximize the probability of seeing a negative outcome
qui replace `ln_likehood' = ln(1-normal(`xb')) if $ML_y==0
end

cap program drop myprobit2
program define myprobit2
version 11
args ln_likehood xb
* Because of the symetry of the normal CDF the following set of expressions is equivalent.
qui replace `ln_likehood' = ln(normal( `xb' )) if $ML_y==1
qui replace `ln_likehood' = ln(normal(-`xb' )) if $ML_y==0
end

cap program drop myprobit3
program define myprobit3
version 11
args ln_likehood xb
* This is a somewhat opeque way of writing the same thing as above.
* I do not prefer it to the first two forms except that it is likely to run slightly faster
* since it is a single command.
qui replace `ln_likehood' = ln(normal((-1)^(1-$ML_y)*`xb'))
end

clear
set obs 1000

gen x1 = rnormal()*.5
gen x2 = rnormal()*.5
gen u = rnormal()*(.5^.5)

gen p=normal(x1+x2+u)

gen y=rbinomial(1,p)

probit y x1 x2
* This is the benchmark

ml model lf myprobit1 (y=x1 x2)
ml maximize
* Basically works the same except there is no pseudo r2 reported.

timer on 1
ml model lf myprobit2 (y=x1 x2)
ml maximize
timer off 1

timer on 2
ml model lf myprobit3 (y=x1 x2)
ml maximize
timer off 2

timer list

di "myprobit2 is " r(t1)/r(t2) " times slower than myprobit3"

* The following is a brief exploration attempting to run two probits at once.
* It shows how ml can optimize two equations jointly
cap program drop mytwoprobit1
program define mytwoprobit1
version 11
args ln_likehood xb1 xb2
qui replace `ln_likehood' = ln(normal((-1)^(1-$ML_y1)*`xb1')) ///
+ ln(normal((-1)^(1-$ML_y2)*`xb2'))
end

* This is equivalent to the previous except that it uses the binormal distribution rather than two seperate normals. I tried specifying the correlation as a third equation but that did not work. I will need to play around a bit more to figure out how to program a biprobit.
cap program drop mytwoprobit2
program define mytwoprobit2
version 11
args ln_likehood xb1 xb2
qui replace `ln_likehood' = ln(binormal((-1)^(1-$ML_y1)*`xb1', ///
(-1)^(1-$ML_y2)*`xb2', /// 0))

if (1==1|2==3)|(3==3&3==2) di "Will display because either 1==1 or 2==3 (obviously 1==1) or 3==3 and 3==2 (and 3==3)"
if (1==1|2==3|3==3)&(3==2) di "Will not display because either 1==1, 2==3, or 3==3 (we know 1==1, and 3==3) but 3 does not equal 2."

Sunday, July 29, 2012

* 1. There are the if Statements that are evaluated as a vector each value based on some rule.

* For instance:

replace x1 = 10000 if x1>0

sum

* We can see that some values of x1 are now equal to 1000 while all others are equal to or less than zero.

* 2. This is in contrast to the following commands

* This will sort the elements of x2 t0 have large values first
gsort -x2

if x2>0 replace x2 = 10000

* and

gsort x3

if x3>0 replace x3 = 10000

sum

* We can see obviously what each of these commands produced dramatically different results.

* This is because if the "if" statement is listed before a command then it is only evaluated once.

* And that value when it is evaluated with regards to a variable tells the if statement to only evaluate the first element of that variable.

* Thus the first element of x2 was greater then 0 so the replace command "replace x2 = 10000" was executed (for all elements of x2 while the first element of x3 was less than zero so the replace command was not executed.

* Both types of if Statements are useful though the second one tends to be more frequently used by programmers.

* Say you want to generate 10 variables with half drawn from a uniform and half from a normal distribution.

forv i=1/10 {
if mod(`i',2) == 0 gen z`i' = rnormal()
if mod(`i',2) == 1 gen z`i' = runiform()
}
sum
* We can see that the even numbered zs are normally distributed while the odd are uniformaly distributed.
* Note: a summary command is usually not sufficient to diagnose distributions.

* It is very useful to become knowledgable about relational operators.

timer list 7
* Reading and writing causing the command to take five times as long to complete.

* 5. Choose commands that approximate time consuming commands.

* For instance if you are really interested in the average partial effect of a probit command and your computer is getting bogged down then try using a linear probability model instead since the APE tends to be very similar to that of a probit.
timer on 8
probit y x*
margins, dydx(x1 x2 x3 x4 x5 x6 x7 x8 x9 x10)
timer off 8

timer list 8

timer on 9
reg y x*
timer off 9

timer list 9

* While the linear probability model is not as good a fit as the probit (because we know how the data was generated and because the pseudo-r2 is larger than the r2)
* it gets similar results and runs about 560 times faster.

* 6. Minimize unused calculations.

* Imagine that you have decided to bootstrap the standard errors on your estimators.
* If that is the case then it is completely unnecessary in OLS to use robust standard errors because it just means more calculations.
timer on 10
bs: reg y x*, robust
timer off 10

* Sometimes it is helpful to convert your data from string variables to factor variables.
gen longstring = "This is a really big string that is going to take up a lot of memory. Because the longer the string you have the more space Stata needs to reserve for that sting within the variable even if all of the other values are small" if _n == 1

timer on 12
probit y x*
timer off 12

encode longstring, gen(factor_longstring)

desc

* Variables x1-x10 take up only 4 bytes a piece or 40 bytes collectively.
* We can see that the storage type on the string is 244 which means it takes up 244 bytes.
* While the data type on the factor version of it is a long meaning it only takes 4 bytes.
* Note: This can be misleading. If you have many unique factor values then they are going to also take up a lot of space.

drop longstring

timer on 13
probit y x*
timer off 13

timer list 12
timer list 13

* By reducing the size of the longstring to a factor variable we free up memory for Stata to use and speed up the probit regression by about 8%.

Wednesday, July 25, 2012

* In Stata the return command is an essential tool that any aspiring programmer should be aware of.

* Note, I have two lines of comments going on in this code. Hopefully, it is coherent.
* The comments that are commented out as such /* HELLO I am YOU COMMENT */ are referring
* to a method to identify coefficients when the error is multiplicative.

* Within Stata there are two methods by which commands yeild feedback to the user.

* All of the above commands do not give any kind of return except visual ones (this is frustrating to me since sometimes)
* However, there is a lot of commands that do.
gen add_error = y-30*x
corr add_error x /* We can see that that the "addative" error is correlated with the
explanatory variable */

* corr is an rclass command the majority of commands return as rclass.
* to see the returns from corr
return list
* We can target the returns in the form that Stata lists:
di "We will have difficulty estimating the 30 coefficient on x"
di "because the addative error is correlated with x at a level = " r(rho)
reg y x /* This badly fails at identifying the 3 because the error is
correlated with explanatory variable x */
* Most regression/estimator commands are "eclass" commands and store their returns as ereturns.

* To see those returns:
ereturn list

* These returns are available to be targetted if you would like to store a value as a macro.
global rss_1 = e(rss)
global r2_a = e(r2_a)

* Matrices can be targetted in a similar fassion
matrix b1 = e(b)
matrix list b1

* Most coefficients are also saved as system variables
global bx = _b[x]

* When you do your next eclass command it will override the stored values from the previous
gen lny = ln(y) /* We may attempt to isolate the coefficient by using the natural log command
. This makes ln(y) = 30*x*u = ln(30) + ln(x) + ln(u) */
gen lnx = ln(x)

reg lny lnx
di "The " exp(_b[_cons]) " is closer to 30 than the coefficient in the direct estimation ${bx}"
/* Note that the only reason this works as well as it does is because u is defined so that
E(ln(u))=E(ln(exp(rnormal())))=E(rnormal())=0 and because there is not constant in the first equation.If there were a constant in the first equation then we would havey=c+bxu -> lny=ln(c+bxu) which does not help us. */

* Let's imagine that you have some new command that you would like to use and you do not know how to solve the math to calculate the standard errors.

* For instance, two stage quantile regression would be extremely challenging.
clear
set obs 1000

* There is an exogenous component to w which is z
gen z = rnormal()

* There is an endogenous component to w which is v
* It is drawn from a Cauchy distribution.
* See (previous post) for more information on drawing from non-standard distributions.
gen v = tan(_pi*(runiform()-.5))

gen w = z + v

* There is also a standard addative error in the model
* u1 is a portion of the error that is independent of w
* It is drawn from a Cauchy distribution.
gen u1 = tan(_pi*(runiform()-.5))

gen u = (v+u1*3)*5

* Thus y is generated this way
gen y = -14 + 5*w + u

* In a Cauchy distribution the median is defined but the mean and variance are not.
* Thus it is inappropriate and often infeasible to use OLS or many common estimators.

* Quantile regression however is theoretically justified.

* That said:
reg y w
* Looks like it is working fine.

* But just for the sake of doing the consistent thing:
qreg y w

* We can see that both estimators are heavily biased because of the endogeneity of w.

* We want to estimate a two stage quantile regression
qreg w z

predict w_hat

qreg y w_hat
* It looks pretty good but we need to estimate the standard errors because that generated from just qreg on the predicted values is ignoring the variance in the original prediction.

Monday, July 23, 2012

# This is a simple function (program in Stata) that takes an input vector of values (d) and randomly draws from them a random subset set of values of length n.

# We want a to program our own function to draw x draws from a set of possible draws
pull <- function(n,d) {
# Create an empty vector to store the output
v <- c()
# Loop from 1 to n and draw 1 observation from 1 randomly for each of those
for(i in 1:n) v=append(v, d[ceiling(runif(1)*length(d))])
# Okay so I know this looks pretty intractable. The for loop allows a singe line of code to follow on the same line. So it is just saying loop through i from 1 to n. Then the vector v has one value added to the end of it. That value is a random draw from the vector d.
# Return the compiled vector of random draws v
v
}

# This alternative function specification uses the preferred vector forms in R's language.
# To get a hint of what it is doing play around with this.
vect1 <- c("a","b","c","d","e")
# To see vect1 just type
vect1

# To get a single element of vect1
vect1[2]

# To get a subset of vect1
vect1[c(1,2,3)]

# To get a subset with repetition
vect1[c(1,2,1,3,1)]

#Now
ceiling(runif(20)*length(vect1))

# This is a nice way of figuring out how code is working in R. Simply take it apart and see what it does by itself.

Sunday, July 22, 2012

As I am
sure you are aware, the majority of my posts are in Stata. My primary motivation for choosing to post in
Stata has been that this is the program which the majority of the professors I
work for have been using (Jeffrey Wooldridge, Andrew Dillon, etc.). However, I have been and continue to be a reluctant user of Stata. Frankly, I do not see a reason to buy a
program that costs upwards of $800 (often more) in order to do work that could
be done just as well on a program which is free.

So far
I have been content to use Stata because I am more fluent with it than R. However, now Stata has released a new version
(version 12) and I am stuck back in version 11 unwilling to pay 595 to upgrade
my version. So unless someone decides to
upgrade my version for me, very likely, then most of my future posts are going
to be in R. Hope you find them useful!

# Let's imagine that we by chance sellected the 250 x's which are in the "middle"

# First let's sort x

xsort <- sort(x)

# Now let's sample non-randomly from the sorted "middle" of the population

xmiddle_sample <- xsort[400:599]

# Now let's sample alternatively from the random "middle" of the population

x_sample <- x[400:599]

sd(xmiddle_sample)
# We can see the standard deviation is about 1/6 of what it should be (close to 1)

sd(x_sample)
# In the random sub-sample the standard deviation is much closer to 1

mean(xmiddle_sample)
# The mean of the middle sample is close to the true mean. But this is by chance.

mean(x_sample)
# The mean of the random sample also works well.

# However, having a non-random sample can bias mean estimates as well as estimates of variation.
xlower_sample <- xsort[1:200]

sd(xlower_sample)
mean(xlower_sample)
# The standard deviation is still too small but now the mean is way off.

# So how does all of this translate into OLS?

# It depends to some extent on what elements are non-random.

# In Y = XB + u, if the sample of X is non-randomly sorted and selected from as above then the only problem will be that the variance of X is small making it hard to identify it's effect on Y.

# If however, the u term is the one that is sorted on and selected from then it is going to make our estimated standard errors too small causing us to overrect the null. In addition, it might also cause us to bias the constant.

# Finally, if sorting and selecting is based on a combination of X and u (perhaps sorting on Y for instance) then non-random selection will additionally make our estimates of the coefficients biased.

# Let's see how this works. Let's start with 10,0000 observations.

X <- rnorm(10000)
u <- rnorm(10000)*3

Y <- 2+X+u

# The coefficient on X is 1

# Let's bind the data together in a data frame
data.rand <- as.data.frame(cbind(Y,X,u))

summary(lm(Y~X, data=data.rand))

# First let's randomly sample form data as a benchmark

sample.rand <- data.rand[1:1000,]

summary(lm(Y~X, data=sample.rand))

# We can see a random subsample from the population does not pose us too many problems.

# We loose a little accuracy and the stength of our t values diminish.

# Now, let's see what happens when we sort on X then select from x

data.Xsort <- data.rand[order(data.rand$X),]

# We can see that in data.Xsort the X variable is sorted (we will only look at the first 40)

data.Xsort[1:40,]

# Now we select a middle lower 1000 observations

sample.Xsort <- data.Xsort[3001:4000,]

summary(lm(Y~X, data=sample.Xsort))

# We can see that as predicted, though the sample size is the sample as previously, and though the estimate of the coefficient on X is still good, there is not enough variation in X to reject the null.

# Let's see how it works when the data is selected and sorted by u.

data.usort <- data.rand[order(data.rand$u),]

sample.usort <- data.usort[3001:4000,]

summary(lm(Y~X, data=sample.usort))

# As predicted the constant is strongly biased.

# In addition the standard errors are too small. This would cause us to overreject the null.

# It is not possible to see this with this data as is because all of the variables have an effect on Y.

# But let us add a handful of variables to see if we reject.

# Loops through i=1 through i=20
for(i in 1:20) {
# Generate the new variable name z1 z2 etc
newvar <- paste("z",i,sep="")
# Assign the new variable to the sample.usort
sample.usort[newvar]<-rnorm(1000)
}

# It does not seem to be overrejecting too horribly in general. But I am sure if we did a Monte Carlo repeated simulation we would find the rejection rate to be greater than .1 at the .1 level .05 at the .05 level etc.

# Finally let us look at what happens if the data is sorted by Y (which is a combination of X and u)

# In order to be in the range 3001-4000 in generaly either X has to be low but u high, or u low and X high, or u and X both a little low.
sample.Ysort[1:40,]

# If you were to take a different sample of Y then you would find a different correlation between X and u.

# The important point is that by selecting the sample on Y then (assuming X has real effects) then you are by neccessity causing a correlation between the error and x.
summary(lm(Y~X, data=sample.Ysort))

sum
* Conviently these variables are cleaned from memory once no longer useful

* However, if we were to try to override a temporary variable with one of the same name then it will not work.

forv i = 1/10 {
tempvar temp
gen `temp' = rnormal()
}

sum

* However, it is impossible to target previous temporary variables that were created with the typical `temp' identifier since now it is targetting the most recently created variable.

sum __000003

* However works.

* Which is really not very useful but could potentially be.

* This may not appear to be a problem but when you start looping through commands could potentially result in a large number of temporary variables clogging the active memory things start to slow down.

clear
set obs 10000

* For example:
forv i = 0/10000 {
tempvar temp
gen `temp' = rnormal()
if mod(`i',1000)==0 di "$S_TIME"
}
* For me there is about 2 seconds for every 1,000 variables created

* Easy fix:
clear
set obs 10000

forv i = 1/10000 {
tempvar temp
gen `temp' = rnormal()
* Add drop
drop `temp'
if mod(`i',1000)==0 di "$S_TIME"
}
* When cleaning the memory this reduces the run time of all 10,000 (for me) to only 2 seconds.

* This is all fine, but they only problem is that if we are going through all of the work of drawing temporary variables just to later drop them then why not just draw normal variables?

* There is only one reason I cas see. We do not want to risk our drawn variable names to overlap with our existing variable names.

* However, if this is not a problem then the following commands are clearly that much cleaner:

ybernoulli = rbinom(1000,1,yprob)
# Remember we need to specify 1000 so that R knows how many random draws to make

# 6. Use a probit estimation to estimate y

myprobit = glm(ybernoulli ~ x, family=binomial(link="probit"))

summary(myprobit)

# 7. Logit

mylogit = glm(ybernoulli ~ x, family=binomial(link="logit"))

summary(mylogit)

# We can see that while most commands are very similar to those of Stata the overall setup is more complicated.
# Which is not neccessarily a bad thing. It is generally much easier to program (anything more advanced than one liners) in R.

Thursday, July 19, 2012

* This post seeks to answer the question, "Given that you have more than 7 cards in your hand, what is the likelihood that nobody (including yourself) will roll a 7 before the next turn, forcing you to discard?"

* First we must calculate the probability of anybody rolling a 7 (a seven is the most common roll possible).

* It is the probability that seven will be rolled in any combination of dice.

* Given that any particular outcome (X=a,Y=b)=P(X=a)*P(Y=b)=1/6*1/6=1/36

Wednesday, July 18, 2012

* There once was a game show hosted by Monte Hall. The game show at some point had the contestant choosing between one of three doors.

* Behind two of the doors there was a goat while behind the third one there was a car.

* The contestant would first choose one door. Then Monte would reveal a goat from one of the two remaining doors (being how Monte Hall knew where the car was) and the contestant had the chance to choose a new door (the one that was not opened) at that time.

* After that the doors were openned and the goat or car was revealed behind the door that was finally chosen. If the contestant had chosen the door with the car then the contestant won the car.

* So the question was asked, "What is the ideal strategy for this game?"

* A clever columnist responded by stating that the optimal stategy was to always change your door after one goat is revealed.

* Why is this?

* Simply, if you stick with you original door as a strategy then your probability of being right is 1/3.

* However, if you change your vote after one of the goats in revealed then you have increased your probability of getting the car to 2/3.

* One way to think about this is, if the probability of getting a car from the initial initial_guess is 1/3 then the strategy of keeping with your original initial_guess must yield the same results.

* However, if getting a car by sticking with your original initial_guess is 1/3 then by switching your initial_guess you must be therefore getting the remaining probabily of getting a car (ie. 1-1/3=2/3)

* Don't believe it? I did not at first either!

* Well let's write a simulation to check the expected results of the two strategies

* This count command will count how many random revealed doors are not legitamate.
count if revealed == initial_guess | revealed == car
* The loop successfully worked

tab revealed
* We can see revealed still looks like it is performing well (in terms of equal sampling from each of the doors)

* Now let's formulate three strategies
* 1. Choose a door and stick with it
* 2. Choose a door then always switch
* 3. Choose a door then fair flip a coin, if heads switch.

* For strategy one it is easy
gen door1_choice = initial_guess

gen strat1_car = 0
replace strat1_car = 1 if door1_choice == car

* For stragey two it is a bit more complicated. First we need to figure out how to identify which alternative is chosen.
gen door2_choice = mod(initial_guess+1,3)+1
* That is if the empty value is 1 plus the initial choice
replace door2_choice = mod(door2_choice+1,3)+1 if revealed==door2_choice
* But there is the possibility that the new choice will be the door revealed which is not good.
* However, in the modular system if we add +1 and +1 to a initial draw then we have gone through all of the potential problems.
count if revealed==door2_choice | initial_guess == door2_choice
* No violations of the strategy

gen strat2_car = 0
replace strat2_car = 1 if door2_choice == car

* Finally strategy three: choose a door randomly from the two doors remaining
* In order to execute these random draws of the remaining choices we will use a repeated loop to keep drawing random draws until all of the observations have draws not equal to the revealed draw.

Tuesday, July 17, 2012

* So, you roll a fair 6 sided die once. What is the probability of getting a 1 on the first roll?

* Simple:

di 1/6

* How about the possibility of not getting a one:

di 1-1/6

* or

di 5/6

* Ok, easy right. Now is when things get ugly.

* What happens if you want to know the possibility of rolling a 1 (at least 1 or more) with either of two fair dice?

* 1/6 + 1/6? right? I mean, you roll one die, 1 in 6 times you will get a 1 and you roll the other die and one in six times you will get a 1. 2/6=1/3 Right?

di 1/6 + 1/6

* Wrong!

* Weirdly. It is not obvious with 2 dice why this statement is not true. However, start adding dice to the equation. 3/6 (three dice) 2/3 (four dice) etc. 6/6 = 1 (six dice) wait a second! I know that if I roll six dice, there is a possibility that one of them will not be a 1. I mean, it might be small but it exists! Then what happens when we go to 7 dice? 7/6 > 1 and thus not a probability.

* The reason why the intuitive guess does not work, is because the intuition applies to the question of "how many 1s will be rolled on x number of dice?" To see this we find the expected value in terms of successes of one die roll:
* E(x=1)=1/6*1[x1=1] + 5/6*0[x1!=0]=1/6
* likewise: E(x=2)=1/6*1[x1=1] + 5/6*0[x1!=0] + 1/6*1[x2=1] + 5/6*0[x2!=0]=2/6=1/3
* The difference between this and the previous question is that in the previous question if you get a one, both rolls of the dice it counts as 1 while in the expected number of 1s count if you roll two ones then they count as 2.

* Thus we realize the utter failure of our intuition.

* The things about independent outcomes is that they are in a sense conditional upon each other. You roll two dice and you only care about the possibility that at least one of them is a 1. Thus you say, what is the possibility that the first die is a 1 = 1/6 + what is the possibility that the second die is a 1 (1/6) given that the first die is not a 1 (5/6) -> 1/6*5/6

* So the total probability that either die are ones is:
di 1/6 + 1/6*5/6

* An easier way to do these things is: taking 1 less the probability that the event never occurs
di 1-(5/6)^4
* I thought these equations are equivalent. They probably are, just a more of a rounding error I would suspect.

Monday, July 16, 2012

* This is one of the basic programming techniques in most programming languages but as far as I can tell it is largely absent from end user Stata programming.

* Imagine some routine which is recursively identical.

* For instance, imagine you would like to calculate a factorial.

* Such as !10 = 10 * 9 * 8 * 7 * 6 *5 * ... 2 * 1

* Interesting Stata (outside of Mata) does not have a basic factorial command but rather only a log factorial

* Easy enough to invert:

di exp(lnfactorial(10))

* Now, let's see if we can't write our own command to calculate a factorial.

cap program drop myfact
program define myfact, rclass
* Let's display some feedback to help the user see what is going on
if "`2'" == "noisy" di "Evaluate `1' - Input `1'"

* If the number to be evaluated is greater than 1 then this program will call itself.
if `1' > 1 {
* myfact calling myfact at a value 1 less than the input value
myfact `=`1'-1' `2'* After the called myfact command it returns a scalar local called 'current_sum'* This sum we multiply by the current `1' being evaluatedlocal current_sum = `1'*r(current_sum)
}

* This is in a sense the starting point of the recursive operation.
* This is only because computers are required 'in general' to evaluate computational problems.
else local current_sum = 1

* The point is to demonstrate the problem solving technique known as recursive programming.
* There are numerous other applications for this technique.
* Search and sorting algorithms frequently feature recursive algorithms.

* In economics I have used recursive algorithm to solve market equilibrium problems.

* That is when the decision of one actor depends on the choices of other actors simultaneously.

Sunday, July 15, 2012

* This presents an alternative method to resampling results from the last post

webuse mheart0, clear

* First let's generate an observation index
gen obs_id = _n

* And you want to test how well an estimator will work on sampled data from that data set.

* There are obviously many ways to do this.

* One way would be to resample from that data 1,000 draws and then generate a dependent variable and test how well your estimator works.

sum

* First we want to mark the draws but we can see that bmi is missing some information.

* For our purposes we could either drop the observations for which bmi is missing or inearly impute bmi.

* Let's just impute bmi:

reg bmi age smokes attack female hsgrad marstatus alcohol hightar

predict bmi_fill

replace bmi = bmi_fill if bmi==.

sum bmi
drop bmi_fill

di "Now what we want is approximately 1,000 results (we do not need to be exact)"

di "We have " _N " observations"

local obs_add =1000/_N

di "So we need to add approximately " round(`obs_add') " observations per observation"

* One way to do this would be to add (or subtract) randomly more duplicate observations.

* The uniform distribution is a natural choice. However, its expected value is 1/2 so we need to multiply by 2 to ensure that we get the right number of observations.
gen add = round(`obs_add'*runiform()*2)

* Note: alternative distributions might be any non-negative distribution for which you can specify the expected value.
* For example: poission. This distribution will be less likely to drop observations and have more proportional representation of initial observations.

tab add
* First let's drop any obervations that are slated to be dropped

drop if add == 0

* Now the command expand is very useful because it allows us to easily duplicate observations

* Suppose you would like your simulation to be as realistic as possible in terms of input data.

* You have some data set: for example:

webuse mheart0, clear

* And you want to test how well an estimator will work on sampled data from that data set.

* There are obviously many ways to do this.

* One way would be to resample from that data 10,000 draws and then generate a dependent variable and test how well your estimator works.

sum

* First we want to mark the draws, but we can see that bmi is missing some information.

* For our purposes we could either drop the observations for which bmi is missing or linearly impute bmi.

* Let's just impute bmi:

reg bmi age smokes attack female hsgrad marstatus alcohol hightar

predict bmi_fill

replace bmi = bmi_fill if bmi==.

sum bmi
drop bmi_fill

* This might not be the best technique for imputing values for econometric estimation but since all we want is to generate data with reasonable ranges for the explanatory variables, this should work fine.

* First let's mark our observations
gen draw_num = _n

di "There are " _N " observations available to draw from"

* Let us save the name number of draws
gl max_draw = _N

tempfile draw
save `draw'

* Now we are ready to start drawing our data.
* Imagine we would like to draw 10,000 observations:

clear
set obs 10000

* Now is the only trick. We need to assign each observation a draw_num that ranges form 1 to ${max_draw}

gen draw_num = int(runiform()*${max_draw})+1
* This is a little tricky but it should do the job. First it generates a variable which ranges in value from 0 to max_draw-1 then it reduces it to an integer and adds 1.

sort draw_num
* Neccessary for the merge

* Now we just need to merge in our temporary data set:
merge m:1 draw_num using `draw'

* Sum

* Now we can generate some values and estimate how well they work
* for example

sum p*
* We can see the average predicted outcome probabilites are similar

pwcorr pchoc_hat pchoc, sig
* pchoc is statiscally significant

reg pvan fruit_loving sweet_tooth bean_loving cocoa_loving
reg pvan_hat fruit_loving sweet_tooth bean_loving cocoa_loving
* Looks very good
* We can see that how the independent variables affect the predicted variables is extremely similar.
* This is probably the best evidence yet that the mlogit command is doing what we want it to be doing.

Thursday, July 12, 2012

* This program will remove from a global a set of specified values
* I believe this is the same in sigma algebra of subtracting a set from another
* I figure there is probably a prewritten command for this but I have not been able to find it.

cap program drop takefromglobal
program define takefromglobal

* First define a local called temp to hold the initial values of the global
local temp ${`1'}

* Empty the global
global `1'

* Loop through each of the old values
foreach v of local temp {

* Start counting at 0
local i = 0

* Create a local that indicates that this value should be skipped (initially assume no skipping)local skip = 0* Loop through all of the user specified inputs `0'
foreach vv in `0' {* We will skip the first one since we know it is just the global specified
local i = `i' + 1 * If we are past the first input in the local and one our values to exclude matches the value in the global skip that value if `i' > 1 & "`vv'"=="`v'" local skip = 1
}* If the current value is not to be skipped then add it to the new global listif `skip' == 0 global `1' ${`1'} `v'
}
end

* Let's see if the simulation gives us the expected results
clear
set obs 12000
gen mort_rate = mod(_n,6)/200+.01
* This will split the data into 6 groups of 200 with each group having mortality rates from .01 to .035

tab mort_rate

* An indicator that the mosquito is alive
gen living=1

* An indicator that the mosquito survived
gen survived=0

* Sample 500 days, see how many mosquitos survive to live for each day
forv i = 1/500 {
* Draw a single temporary binomial draw to see if this day the mosquitos die
qui gen died = rbinomial(1,mort_rate) if living==1
* If the mosquitos died then change their "living" status
qui replace living=0 if living==1 & died == 1
* Replace survived to indicate the current day if they are still alive
qui replace survived=`i' if living==1

* Displays user feedback telling what round the days currently are on and what percentage of mosquitoes live.
qui sum living
di "Round `i' :" r(mean)

* As I understand it the Delta method:* Less machine time intensive while more algebra intensive. Taking derivitives above was easy, but when you have multiple explanatory variables and more complex equations things can get pretty ugly.* More suceptable to approximation errors (because it is a taylor series expansion)

Monday, July 9, 2012

* The bootstrap resampling technique is one of the coolest techniques that have developed in econometrics/statistics in the last 50 years (that and maximum likelihood). Bootstrap releases the user from strong assumptions about the underlying distribution of errors.

* As far as I understand the bootstrapping assumptions follow, so long as the data you draw are reasonable draws from the underlying distribution then resampling from that sample will give you a reasonable estimate of the populations underlying parameters. Note, this method is obviously related to Baysian methods.

* B Efron formalized the technique and termonology of Bootstrap in a paper in 1979.

* First let us define a program that will do our bootstrap
cap program drop bootme
program bootme

********
* First we will set up the data that we will use as a resource to draw from

* Loop through the number of repetitions in the bootstrap
forv i=1/`reps' {
* First clear the data
clear
* Set the number of observations equal to the number to draw
`nq' set obs $boot_N
* Generate a variable that represents random draws.
`nq' gen bootmatcher = int(runiform()*$boot_N)
* Sort the draws from 1 to _N to assist in the merge command
`nq' sort bootmatcher
* Merge in the drawfile
`nq' merge m:1 bootmatcher using `drawfile', keep(3)

* Now run the user specified command
`nq' `command'

* Now load up the return list file
`nq' use `returnlist', clear

* For each value of the list of keep values loop through
foreach v in `keep' {
* We need to make sure the variables are readable as names
local vtemp=strtoname("v`v'")
* Replace each observation with the estimated value
`nq' replace `vtemp'=`v' if _n==`i'
}
`nq' save `returnlist', replace

* If the observation is divisible by 50 insert a space in the progress bar.
if `i'/50 == int(`i'/50) di ". " `i'
* Otherwise keep adding .s
else di _continue "."
}
* This summarizes the results
sum
* Restores the data to the previous state
`nq' restore
* Runs the command one last time for display
`command'
end
* End of the Bootme command

* Our bootstrap seems to work very similarly though it seems to me that our standard errors seem to be a little wider in general. I am not sure why.
bootme 1, r(200) k(_b[x] _se[x] e(F) e(ll)) c(reg y x)

Thursday, July 5, 2012

* One of the first simulations I ever wrote was using random walks and seeing how OLS fails when the data generating process does not experience decay.

* A good example of this might be the stock market. We do not expect that because prices go down today that they need go up tomorrow or visa versa. Likewise the currency market (comparing two developed economies is unlikely experience any kind of a decay rate on the previous value. Either the dollar is up relative to the Euro or it is down relative to the Euro, the best we can do to predict the dollar is look at what it was yesterday.

* Let's imagine that we have two random walks and 730 (two years of daily information):

clear
set seed 111
set obs 730

gen t = _n
label var t "Time (day #)"

gen y = 100 if t==1
label var y "% value of dollars to t=0"

gen x = 100 if t==1
label var x "% value of euros to t=0"

* Initially both values are set at the same
replace y = y[_n-1] + rnormal() if _n>1
replace x = x[_n-1] + rnormal() if _n>1
* However now there is no relationship between y and x

two (line y t) (line x t)
* Looking at the two values plotted next to each other we can almost imagine a relationship. Play around with different seeds. If your mind is anything like mine it will start seeing patterns.

reg y x
* We strongly reject the null that there is no relationship. This is no accident of this draw. Almost any seed you set you will get similar results.

* So it the problem with the random draws? Are they not truly random?

* Well, yes they are not trully random but they are as close to random as we need concern ourselves.

* It is simply the nature of the random walk. Fortunately there are tests for that.

tsset t

dfuller y
dfuller x

* The null in this test is that the process follows a random walk (contains a unit root). The test fails to reject our null.

*********************************
* We can do the exact same thing in a panel data setting.

* Imagine you have the quiz scores for 1000 students over 100 periods.

clear
set seed 111
set obs 50

gen id = _n
label var id "Student id"

expand 52

bysort id: gen t = _n
label var t "Year"

* Student first year is the reference year
gen y = 100 if t==1
label var y "student quiz score"

* Initially both values are set at the same
replace y = y[_n-1] + rnormal() if t>1
replace x = x[_n-1] + rnormal() if t>1
* However now there is no relationship between y and x

reg y x
* In the panel data case we are much less likely to reject the null. Given that the random walks for each individual is independent of those of the others.

tsset id t

* For the test of the unit root we will use the aptly named madfuller test (just kidding) written by Christopher F Baum and can be found at http://ideas.repec.org/c/boc/bocode/s418701.html.
* The test can be installed via ssc install madfuller
madfuller x, lags(1)
madfuller y, lags(1)
* This test uncovers the truth that there is a random walk process at work.

* Note the biggest limitation of the test is that number of time periods must exceed the number of ids. Thus there is 50 students and 52 quizes.

Wednesday, July 4, 2012

* This simulation looks at the problem that happens when the variable of interest is correlated with other explanatory variables. By not including the other variables you may bias the results but by including them you may absorb so much of the variation in the explanatory variable that you may be unable to identify the true coefficient of interest.

Tuesday, July 3, 2012

* This Stata program offers the ability to generate counterfactual draws post estimation of a biprobit. These draws are unique because they reflect not only the actual data used in the original estimation technique but also modifications to that data that the researcher may want to simulation.

* For this purposes I use probabilities to generate counterfactual data to be used in the simulation. In the later step when the xb1 and xb2 are generated it is possible for the user to change these values by say adding 1 to one of the independent variables and seeing how the bivariate outcomes change as a result. This result can be expanded beyond just finding the marginal effect since the user may be also interested in knowing how the joint likelihood changes which cannot be found easily by looking just at the marginal coefficients on Y1 and Y2.

* Generate the probability of seeing a draw of 1 on the first equation
qui gen double `spx1' = normprob((`arg112')/`c22')
* Generate the probability of seeing a draw of 0 on the first equation
quigen double `spx0' = normprob((-`arg002')/`c22')