Sunday, July 5, 2009

Choosing a model for the data is hard. On the one hand, you want a
model that is reasonably faithful to the data. A model that is
too different from your data is not going to be too useful for
prediction. On the other hand, you want a model simple enough to
reason about. A model that has dozens of tuning parameters can be
very accurate, but it won't be easy to understand. It's a tradeoff.
If the model has a basis in some physical rationale (for example,
exponential decay is what you naturally expect from a radioactive
sample), it may offer some insight into the physics behind what you
are measuring. But even if the model is simply an easily described
curve with no physical basis, it can still be useful.

A one- or two-parameter model is what I'm looking for.
There are a lot of one- and two-parameter models that look like my
data. The main characteristic of most of these distributions is that
they asymptotically approach zero over a long time.

Gamma:

Log-normal:

Pareto:

Weibull:

Eyeballing these is hard, too, so we need a couple more tricks.
My favorite trick, the log scale, doesn't work too well here. It does
show the tail of the curve nicely, but it also magnifies the variation.
Since the data are sparse out at the tail, this makes the variation
look that much bigger.

Instead, I'm going to integrate over the distribution and normalize
the values so the curve goes from 0 to 1.
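In code, the integrate-and-normalize step might look like the sketch
below. It assumes the data is a list of counts in equal-width bins
(the post doesn't say how the data is stored), and the name
`cumulative_fraction` is made up for illustration.

```python
from itertools import accumulate

def cumulative_fraction(counts):
    """Running sum of the bin counts, scaled so the last value is 1.0."""
    running = list(accumulate(counts))
    if not running or running[-1] == 0:
        raise ValueError("need at least one non-zero count")
    total = running[-1]
    return [r / total for r in running]

# Tiny made-up histogram, not the real data.
print(cumulative_fraction([1, 1, 2]))
```

With equal-width bins, the running sum is the integral up to a constant
factor, and the normalization removes that factor.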

Now this graph is obviously bad. The curve is squeezed in near the edges
so you can't see it, and the asymptotes are so close you couldn't tell
if something fit or not. We'll fix this in a sec. But take a look here
and you'll see a number of graphs that display the data in this awful
way.

Now I'm going to use my log scale trick.

This is quite a bit better. Now we can see important things like the
median (at the .5 mark) and the 90th percentile (at the .9 mark).
There's another benefit, too. The data in this graph is the
unsmoothed data. If we zoom in on a small part of the graph,
we can see that the ‘hair’ has turned into tiny stairsteps.
Although the hair was tall, it was very narrow, so it
doesn't contribute much to the integral. (So my attempt to shave the
hair off the data was pretty much a waste of time. Oh well.)
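Both observations are easy to check on synthetic numbers. The sketch
below is an assumption-laden toy: the counts are generated from a
made-up exponential decay, one bin is inflated by hand to play the role
of a ‘hair’, and the helper names are hypothetical. It reads the median
and 90th percentile off the cumulative curve and measures how far the
spike moves that curve.

```python
import math
from bisect import bisect_left
from itertools import accumulate

def cumulative_fraction(counts):
    """Running sum of the bin counts, scaled so the last value is 1.0."""
    running = list(accumulate(counts))
    return [r / running[-1] for r in running]

def first_bin_reaching(cdf, q):
    """Index of the first bin whose cumulative fraction is >= q."""
    return bisect_left(cdf, q)

# Synthetic smooth data: 500 bins of decaying counts (not the real data).
smooth = [round(1000 * math.exp(-i / 50)) for i in range(500)]

# Same data with one bin made five times taller: a tall, narrow spike.
spiky = list(smooth)
spiky[100] *= 5

cdf_smooth = cumulative_fraction(smooth)
cdf_spiky = cumulative_fraction(spiky)

median_bin = first_bin_reaching(cdf_smooth, 0.5)
p90_bin = first_bin_reaching(cdf_smooth, 0.9)

# Largest vertical distance between the two cumulative curves.
largest_shift = max(abs(a - b) for a, b in zip(cdf_smooth, cdf_spiky))

print(median_bin, p90_bin, largest_shift)
```

The spike is five times the height of its neighbors, yet it shifts the
cumulative curve by only about one percent, because it is one bin wide
out of five hundred.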

But the point of doing this was to make it easier to fit the models.
The models usually have an integral form (the cumulative distribution
function), so we can try them out directly against this curve. And
since we're using the integral form, the errors add up, so we can more
easily see which distributions have a better fit.
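As a sketch of what "trying them out" could look like numerically,
here are the closed-form cumulative distributions for three of the
families, plus one possible measure of misfit: the largest vertical gap
between the data's cumulative curve and the model's. That measure is my
swapped-in choice for illustration (it is the idea behind the
Kolmogorov-Smirnov distance); the post itself only compares curves by
eye. The gamma family is left out because its cumulative form needs the
incomplete gamma function, which the Python standard library doesn't
provide. All names and parameter values are hypothetical.

```python
import math

def weibull_cdf(x, shape, scale):
    return 1.0 - math.exp(-((x / scale) ** shape))

def pareto_cdf(x, x_min, alpha):
    return 0.0 if x < x_min else 1.0 - (x_min / x) ** alpha

def lognormal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))

def max_gap(xs, empirical, model_cdf):
    """Largest absolute difference between empirical and model CDF values."""
    return max(abs(e - model_cdf(x)) for x, e in zip(xs, empirical))

# Synthetic "data": points taken exactly from a Weibull curve, so the
# Weibull candidate should score a gap of zero and the others should not.
xs = [0.5, 1.0, 2.0, 4.0, 8.0]
empirical = [weibull_cdf(x, 1.5, 2.0) for x in xs]

candidates = {
    "weibull": lambda x: weibull_cdf(x, 1.5, 2.0),
    "pareto": lambda x: pareto_cdf(x, 0.5, 1.0),
    "lognormal": lambda x: lognormal_cdf(x, 0.5, 1.0),
}
for name, cdf in candidates.items():
    print(name, max_gap(xs, empirical, cdf))
```

A real comparison would first tune each model's parameters to the data;
here they are fixed by hand to keep the example short.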