Re: st: query on testing uniform distributions

Maarten has, as usual, discussed this very thoroughly and leaves scope
only for some extra details.

If the data span a long series of years, then weekends fall on
different dates from year to year, so every day of the year might have
about the same chance of being an employment start date, even if (I am
guessing wildly) few people start work at weekends. But at least in
Britain there are jobs in which starting on a Monday or (usually a
different day) on the first day of a month is standard. Sergio should
know much more about the country being studied and its practices. I'd
expect the number of years spanned to be more limiting than the number
of distinct people in the sample. Also, what about 29 February?

All that said, it is all too likely with a sample size of almost 3
million that statistically significant results might be scientifically
insignificant. The main role of significance tests used to be to stop
researchers making fools of themselves by overinterpreting very small
samples. Now some researchers want to use significance tests to check
for structure in very large samples. Often, there are better ways of
doing that which use all of the information available. The desire to
reduce everything to a single badness-of-fit test or measure sometimes
has to be resisted.
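
To put rough numbers on that, here is a back-of-envelope sketch, not
output from the data. It assumes n = 2887994 and 365 distinct values,
as in the quoted -ksmirnov- note, and uses the usual large-sample 5%
constant of 1.36 for a one-sample Kolmogorov-Smirnov test. The second
line further assumes, purely for illustration, that every day had
exactly the same count, in which case the ties alone put the tied
Hazen positions half a step (0.5/365) away from the 45 degree line.

```stata
* Approximate 5% critical value for D with n = 2887994
display 1.36 / sqrt(2887994)
* D implied by ties alone under exactly equal counts on 365 days
display 0.5 / 365
```

The first is about 0.0008 and the second about 0.0014, so on those
assumptions the test would reject even perfectly even daily counts.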

The expected frequencies here are easy to calculate, so I'd move to
Pearson residuals, (observed - expected) / sqrt(expected), and plot
those against day of year to see what fine structure there is. The
quantile plot is a good starting point, but needs to be followed up.
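
A minimal sketch of that calculation follows, on assumptions that
would need checking against the real data. It assumes a variable doy
(day of tax year, 1 to 365) has already been derived from
employment_start_dates; doy, observed, expected and pearson are names
made up for illustration. It also takes all 365 days as equally likely
and ignores 29 February.

```stata
* Work on a copy: -contract- replaces the data in memory
preserve
* One row per observed day of year, with its frequency
contract doy, freq(observed)
* Total number of spells, left behind in r(sum)
summarize observed, meanonly
* Equal expected frequency for each of 365 days
generate expected = r(sum) / 365
generate pearson = (observed - expected) / sqrt(expected)
* Residuals against day of year, with a reference line at zero
scatter pearson doy, yline(0)
restore
```

Days with no starts at all would not appear as rows after -contract-,
so any such gaps would need to be looked at separately.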

Nick

On Tue, Nov 1, 2011 at 8:27 AM, Maarten Buis <maartenlbuis@gmail.com> wrote:
> --- Sergio wrote me privately:
>> I hope you can help me with the following query.
>
> Such questions should be sent to Statalist and not to its members
> privately. This is not a silly rule; there are good reasons for it,
> which are listed here:
> <http://www.stata.com/support/faqs/res/statalist.html#private>.
>
>> I have read your suggestions on testing whether
>> observed data follow a uniform distribution:
> <http://www.stata.com/statalist/archive/2010-10/msg00146.html>
>>
>> and I am a bit puzzled by the results I obtain when
>> applying your syntax.
>>
>> I observe the dates people start their employment spells
>> over each tax year and I want to check if these dates are
>> distributed uniformly over the year. The dates are in
>> numeric format so I observe 364 different numbers for
>> each tax year.
>>
>> If I use the syntax you suggest:
>>
>> egen n = count(employment_start_dates)
>> egen i = rank(employment_start_dates)
>> gen hazen = (i - 0.5) / n
>> drop n i
>>
>> quantile hazen , aspect(1) name(quantile, replace)
>
> This graph tests whether the variable hazen is uniformly distributed,
> which is trivially the case since it is only based on the rank. I used
> that graph to spot ties, not to check whether the variable of interest
> (in your case employment_start_dates) is uniformly distributed. I
> suspect that in your case you would see 365 little horizontal plateaus
> on the 45 degree line. This may well be too subtle to easily see in
> that graph, but given your sample size of almost 3 million
> observations, I suspect that these ties might matter for your test. If
> you want to graphically test whether your variable of interest is
> uniformly distributed you would type in Stata: -quantile
> employment_start_dates, aspect(1)-.
>
>> In my case this graph shows values which lie exactly on
>> the 45 degree line (a histogram also shows data are more
>> or less uniformly distributed). However, the output I get
>> with the ksmirnov test is
>>
>> ksmirnov hazen=hazen
>>
>> One-sample Kolmogorov-Smirnov test against theoretical distribution
>>            hazen
>>
>>  Smaller group       D       P-value  Corrected
>>  ----------------------------------------------
>>  hazen:              0.0081    0.000
>>  Cumulative:        -0.0081    0.000
>>  Combined K-S:       0.0081    0.000      0.000
>>
>> Note: ties exist in dataset;
>>       there are 365 unique values out of 2887994 observations.
>>
>> I understand this means I reject Ho and therefore the finding
>> is that my data do not follow a uniform distribution. Can the
>> ksmirnov tests and the quantile plot produce totally opposite
>> results as in my case? Should the case of discrete values
>> (my case) be treated differently from the continuous case
>> you talk about? Here, I am assuming I have applied your
>> syntax correctly. Many thanks for your help, very much
>> appreciated.
>
> As I said above, the graph does not test the same thing as the test, so
> it can easily be that the two lead to different conclusions. Moreover,
> there are two types of uniform distribution: a discrete and a
> continuous uniform distribution. For example the results of throwing a
> six sided die would follow a discrete uniform distribution, while the
> -runiform()- function in Stata produces draws from a continuous
> uniform distribution. The syntax you used tested against a continuous
> uniform distribution. However, in your case, you would have a discrete
> uniform distribution with 365 possible values. In what I would call
> normal size samples (say 1,000 to 10,000 observations) I would suspect
> that a continuous uniform distribution would be a perfectly acceptable
> approximation, but in your case it might make a difference.
>