Wednesday, December 19, 2012

# Writing your own functions can make your programs work much more efficiently, decreasing the lines of code required to accomplish the desired results while simultaneously reducing the chance of error from repetitive code.

# It is easy to pass whole blocks of arguments from one function to the next with the ... syntax. This is a sort of catch-all syntax that can take any number of arguments. Thus, if you do not like the default choices of some functions, such as paste's default of using " " to join its pieces, then a simple function that I often include is the following:
# Concise paste function that joins its arguments with no separator
p = function(...) paste(..., sep="")

# Print is a useful function for returning feedback to the user.
# Unfortunately it takes only one argument. By combining it with my new paste function I can easily make it take multiple arguments.
# Concise paste and print function
pp = function(...) print(p(...))
# For example, pp prints a message for each value of i:
for (i in seq(1,17,3)) pp("i is equal to ",i)

# Round to nearest x
# Functions can also take any number of specific arguments.
# These arguments are matched either by the order in which they are supplied or by name.
# If an argument is left blank then the default for that argument is used when available; otherwise an error is returned.
round2 = function(num, x=1) x*round(num/x)
# This rounding function is programmed to work like Stata's round() function, which is more flexible than the built-in round command in R.
# In R you specify the number of digits to round to.
# In Stata, however, you specify a number to round to.
# Thus Stata's round(z,.1) is the same as R's round(z,1).
# The Stata command is nevertheless more general: Stata can, for instance, round to the nearest .25, while the R command would need to be manipulated to accomplish the same goal.
# I have therefore written round2, which instead rounds to the nearest x.
round(1.123, 2)
round2(1.123, .01)
# Yield the same result.
# Argument order is not necessary when arguments are named. The following produces identical results.
round2(x=.01, num=1.123)
# round2 can also round to the nearest 20:
round2(123, 20)
# round2 has a default x of 1, so round2 can be used quickly to round to the nearest whole number.
round2(23.1)
# The original round has the same default.
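# As noted above, Stata can round to the nearest .25; round2 does the same (an added illustration, not from the original post):
round2(1.37, .25)
# returns 1.25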

# Modular arithmetic is oftentimes quite helpful in generating data.
# The following function reduces a number to its smallest non-negative residue modulo the given modulus.
mod = function(num, modulo) num - floor(num/modulo)*modulo
# This function is programmed to work similarly to Stata's mod() function.
mod(15,12)
# This kind of mod-12 system is the system used for clock hours.
# Thus mod(15,12) gives the same result as
mod(3,12)
# Or even:
mod(-9,12)
# There is a built-in operator, %%, that does the same thing as this function.
-9 %% 12
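
# As an added illustration (not in the original post), mod is handy for cyclically assigning observations to groups when generating data:
mod(1:9, 3)
# 1 2 0 1 2 0 1 2 0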

# Check for duplicate adjacent rows. The first row of a duplicate set is not flagged but subsequent rows are.
radjdup = function(x) c(FALSE, rowSums(x[2:nrow(x), , drop=FALSE] == x[1:(nrow(x)-1), , drop=FALSE]) == ncol(x))
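
# The original post does not show how example.data was created. A minimal construction consistent with the demonstration below (an id column plus x and y columns, with every (x, y) pair appearing exactly twice) might be:
example.data = data.frame(id=1:20, x=rep(1:10, 2), y=rep(10:1, 2))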

# The data needs to be sorted for radjdup to work, since only adjacent duplicates can be detected.
example.order = example.data[order(example.data$x, example.data$y),]

cbind(example.order, duplicate=radjdup(example.order[,2:3]))

# We expect to see half of the observations flagged as duplicates. This is because the first instance of any duplicate is not flagged.
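
# As an aside (my note, not from the original post), base R's duplicated() flags rows that repeat any earlier row, so on data sorted as above it should agree with radjdup:
duplicated(example.order[,2:3])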

Friday, December 14, 2012

# In R I am oftentimes unsure about the easiest way to run a flexible Monte Carlo simulation.
# Ideally, I would like to be able to feed a data generating function into a command along with an estimator and get back interesting values such as the mean estimates and the standard deviation of the estimates.
# Stata has a very easy command for this called "simulate".
# I am sure others have created these types of commands previously, but I don't know of them yet, so I will write my own.

MC_easy = function(dgp, estimator, obs, reps, save_data_index=0) {
# dgp (Data Generating Process) is the name of a function that creates the data that will be passed to the estimator
# estimator is the name of a function that will process the data and return a vector of results to be analyzed
# obs is the number of observations to be created in each spawned data set
# reps is the number of times to loop through the dgp and estimation routine
# save_data_index is a vector of repetition numbers for which the generated data should be saved

# I typically start loop programming by manually going through the first loop value before constructing the loop, for both debugging and demonstrative purposes.
start_data = get(dgp)(obs)
MC_results = get(estimator)(start_data)

# Create an empty list of saved data sets
save_data = list()
# If the first data set generated has been specified as a data set to save then save it as the first entry in the save_data list.
if (sum(1==save_data_index)>0) save_data[[paste("i",1,sep="")]] = start_data

for (i in 2:reps) {
temp_data = get(dgp)(obs)

# If this repetition is in the save_data_index then save this data to the list save_data
if (sum(i==save_data_index)>0) save_data[[paste("i",i,sep="")]] = temp_data

# Append this repetition's estimates to the results matrix
MC_results = rbind(MC_results, get(estimator)(temp_data))
}

# If the number of repetitions for which the data was saved is equal to zero then only return the results of the estimation.
if (sum(save_data_index>0)==0) return(MC_results)

# If the number is greater than zero also return the saved data.
list(MC_results, save_data)
}

#### Done with the Monte Carlo easy simulation command.

# Let's see it in action.

# Let's define a sample data generating program
endog_OLS_data = function(nobs) {
# Let's imagine that we are wondering whether OLS is unbiased when an unobserved variable z causally affects the observed variables x2 and x3 as well as the unobserved error e.
# x4 is an observed variable that has no relationship with any of the other variables.
z = rnorm(nobs)
x1 = rnorm(nobs)
x2 = rnorm(nobs) + z
x3 = rnorm(nobs) - z
x4 = rnorm(nobs)

e = rnorm(nobs)*2 + z

y = 3 + x1 + x2 + x3 + 0*x4 + e
data.frame(y, x1, x2, x3, x4)
}

sample_data = endog_OLS_data(10)
sample_data
# Everything appears to be working well

# From this it is not obvious which estimates, if any, are biased.
# Of course our sample size is only 10 so this is no surprise.
# However, even at such a small sample size we can identify bias if we repeat the estimation many times.
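
# The call below assumes an estimator function named OLS_est. A minimal sketch of my own, consistent with the discussion of the results that follows (returning, for each coefficient, the estimate, its standard error, and a 10%-level rejection indicator), might be:
OLS_est = function(data) {
fit = summary(lm(y ~ x1 + x2 + x3 + x4, data=data))$coefficients
c(b = fit[, "Estimate"], se = fit[, "Std. Error"], reject = as.numeric(fit[, "Pr(>|t|)"] < .10))
}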

# Let's try our MC_easy command.

MC_est = MC_easy(dgp="endog_OLS_data", estimator="OLS_est", obs=10, reps=500, save_data_index=1:5)
# We can see that our mean estimate of the constant is close to 3, which is correct.
# b1 is close to 1 (hopefully unbiased), while b2 is too large, b3 is too small, and b4 is close to zero, which is its true value.
# What is not possible to observe with 500 repetitions, but can be observed with 5000, is that the standard error is a downward-biased estimate of the standard deviation of the estimates.
# This is a well-known fact. I believe there is even a proof that no unbiased estimator of the standard error of the OLS estimator exists.
# However, se^2 is an unbiased estimator of the variance of the coefficients.
# The variance of se^2 is large, though, making it necessary to have a large number of repetitions before this becomes apparent.
# The final terms of interest are the rejection rates. For the coefficients b0-b3 the rejection rates should be above the 10% chance level (since an effect truly exists), which is what they all ended up being. For b1 and b3, however, the rejection rate is only slightly above 20%, which indicates that given our sample size and the effect sizes of x1 and x3 our power is pretty small.
# The final note is that though we do not have much power, we at least do not over-reject the null when the null is in fact true. For b4 the mean rejection rate is close to 10%, which is the correct size.

# We can recreate the results table by accessing elements of the MC_est list returned from the function.
rbind(mean=apply(MC_est[[1]],2,mean), sd=apply(MC_est[[1]],2,sd), var=apply(MC_est[[1]],2,sd)^2)

# We may also be curious about the medians and maxes.
summary(MC_est[[1]])

# Or we may like to check if our data was generated correctly
MC_est[[2]]

Tuesday, December 11, 2012

Standard errors are really just as important as coefficients when estimating a relationship because they provide a critical component of inference. Without standard errors, coefficients are largely uninteresting.

What does it matter that an estimated coefficient is, say, 7.3 if we do not know its standard error? If the standard error is, say, 1, then 7.3 is quite different from 0, if that is what is important. However, if the standard error is instead 100, then a coefficient of 7.3 is likely meaningless random noise.

Standard errors are also interesting on their own, without even having coefficient estimates. They provide a point value that gives us insight into how much we can expect our estimates to vary.

Likewise, standard errors are often much more difficult to calculate than coefficients and are even more sensitive to correct specification. Biased standard errors can make an outcome look less likely than it actually is (over-rejecting the null) or more likely than it actually is (under-rejecting the null).

A good example is failing to cluster standard errors when appropriate. Imagine you had data from twenty different summer camps. Evaluating camper outcomes without clustering by camp implicitly argues that the outcomes of campers are independent of every other camper. Clustering at the camp level, however, allows each camp to have a common shared shock (error) that season.
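
To see the problem concretely, here is a small simulation of my own (not from the original post) in R: twenty camps share a camp-level shock, the regressor also varies at the camp level, and the naive OLS standard error understates the true variability of the estimate across repeated samples.

# Sketch: naive OLS standard errors with clustered data
set.seed(1)
reps = 500
est = naive_se = numeric(reps)
for (r in 1:reps) {
camp = rep(1:20, each=30)              # 20 camps with 30 campers each
x = rnorm(20)[camp]                    # a camp-level regressor
camp_shock = rnorm(20)[camp]           # a common shock shared within each camp
y = 1 + .5*x + camp_shock + rnorm(600) # camper outcomes
fit = summary(lm(y ~ x))$coefficients
est[r] = fit["x","Estimate"]
naive_se[r] = fit["x","Std. Error"]
}
mean(naive_se) # the average naive standard error...
sd(est)        # ...is well below the actual spread of the estimates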

simulate Ghat=r(Ghat), reps(500): deltaMonteCarlo
sum
* We can see that our estimates of the standard error via the delta method and the bootstrap are quite close to the Monte Carlo estimates of the observed standard errors from 500 replications.

Wednesday, December 5, 2012

* Path analysis is an interesting statistical method that can be used to identify complex relationships between variables and an outcome variable.

* As with all statistical methods, the modelling framework is essential for deriving reasonable results.

* Conveniently, I am only interested in simulating data, so as usual my data will perfectly conform to the model's specifications.

* Imagine the following model.

* All of the boxes are observable variables. The arrows indicate the causal direction of the effects.

* There are two exogenous variables: A and D. These variables are not influenced by any other variables in the model.

* All other variables are endogenous.

* Each of the arrows represents the direct effect of a one unit change in one variable on the variable it points to.

* This framework is convenient because it allows us to identify a "total effect," which is the combined result of both the direct and indirect effects of variables on the outcome variable.

* The variable we are primarily interested in explaining is H.

* The variable G has only a direct effect on H (pHG).

* While the variable C only has an indirect effect on H (pFC*pHF).

* The reason the indirect effect is a product is that C has a pFC effect on F, and F has a pHF effect on H; thus the change in H resulting from a change in C is how much F changes as a result of C times how much that change in F affects H.
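
As a quick numeric check (my own sketch in R; the path coefficient values pFC and pHF are hypothetical), simulating the C -> F -> H chain shows that a regression of H on C recovers the indirect effect pFC*pHF:

# Simulate the chain C -> F -> H and recover the indirect effect
n = 100000
pFC = .5; pHF = .8
C = rnorm(n)
Fv = pFC*C + rnorm(n) # Fv stands in for variable F (F is R's shorthand for FALSE)
H = pHF*Fv + rnorm(n)
coef(lm(H ~ C))["C"]  # approximately pFC*pHF = .4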