Wednesday, February 6, 2013

Non-Parametric PDF Fit Test

* This is an idea that I decided to explore before inspecting how others have addressed the problem.

* As noted in my previous post, we cannot use standard independence-based reasoning about draws in order to test model fit.

* The following command will simulate random draws from the distribution being tested and see how closely the observed data fit relative to those exact pdf draws.

* It will then use that information to see if we can reject the null that the observed distribution is the same as the null distribution.

cap: program drop dist_check
program define dist_check, rclass

* The syntax command parses the input into macros used in later calculations.
syntax varlist, Tgenerate(numlist >= 100) [Dist(string)]

if "`dist'"=="" local dist="normal"

if "`dist'"=="normal" di "Testing if `varlist' is normally distributed"
if "`dist'"=="uniform" di "Testing if `varlist' is uniformly distributed"
if "`dist'"=="poisson" di "Testing if `varlist' has a poisson distributed"

* Let's see how this works on a sample draw.
clear
set obs 1000
gen x = rnormal()+1
dist_check x, t(100) dist(poisson)
* Interestingly, some data distributions seem to fit "better" to pdfs from distributions other than the one they were drawn from.
* Thus the estimated t can be smaller for some alternative distributions.
* For this reason I have included a two-sided confidence interval.
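
* As an illustration, a two-sided p-value is often formed by doubling the
* smaller tail of the one-sided p-value (an assumption about how p2 might be
* constructed, not necessarily this command's exact rule):
di 2*min(.03, 1 - .03)    // = .06 when the one-sided p-value is .03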

* The following small program will be used to construct a Monte Carlo simulation to see how well the dist_check command is working.
cap program drop checker
program define checker
syntax [anything], obs(numlist > 0) rdraw(string) reps(numlist > 0) dist(string)
clear
set obs `obs'
gen x = `rdraw'
dist_check x, t(`reps') dist(`dist')
end

* Now let's see it in action
simulate p1 = r(p1) p2 = r(p2) , reps(100): ///
checker , obs(50) rdraw(rnormal()*30) reps(100) dist(normal)
* With the above simulation, almost by definition, we know the test should be sized properly; this run is just a reference.

* Under the null we should reject about 10% of the time at the 10% level, for both the one-sided and two-sided p-values.
gen r1_10 = p1 <= .1
gen r2_10 = p2 <= .1
sum r1_10 r2_10

simulate p1 = r(p1) p2 = r(p2) , reps(100): ///
checker , obs(1000) rdraw(rnormal()) reps(100) dist(uniform)
* This simulation draws normal data but tests the uniform distribution, so frequent rejections here indicate good power rather than proper size.

simulate p1 = r(p1) p2 = r(p2) , reps(100): ///
checker , obs(100) rdraw(rnormal()+runiform()*50) reps(100) dist(uniform)
* This simulation draws data that are approximately uniform (a uniform on [0,50] plus a small normal disturbance), so rejection rates should be closer to the nominal size, though the test need not be exactly sized here.

gen r1_10 = p1<= .1
gen r2_10 = p2<= .1

sum

* I think there might be some slight errors in this implementation. I hope this post is useful, though I would not recommend using this particular command: more effective and better-established commands for testing distributional fit already exist.
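
* For reference, Stata already ships with well-established distribution-fit
* tests; for example (standard commands, shown here as a usage sketch):
sysuse auto, clear
swilk price                     // Shapiro-Wilk test for normality
sktest price                    // skewness-kurtosis test for normality
qui sum price
local m = r(mean)
local s = r(sd)
ksmirnov price = normal((price - `m')/`s')   // one-sample Kolmogorov-Smirnov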