Validating Financial Data with PDL

My job involves financial market data. A lot of financial market data. I take
the market data from various sources and store it in a database for later
analysis.

Being a programmer/analyst and not a mathematician with a Ph.D. in finance, my
use for time series analytics falls into the "ensure correct data is being
collected" category. But even then, some basic statistical analysis helps me
preserve quality historical data for later use.

PDL is perfect for doing these kinds of calculations very quickly. Combined
with PDL::Finance::TA, all the hard work is already done, and all I need to do
is wire it all up.

Let's take a large set of random numbers. If those numbers are normally
distributed (the familiar bell curve), most of them cluster around the mean
(average) and large deviations are rare. If we calculate the standard deviation
(stddev, a measurement of how dispersed the data set is), we would expect about
99.7% of the points to be within 3 standard deviations of the mean.
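
The 3 stddev figure is easy to check empirically. What follows is a minimal
sketch in plain Python rather than PDL, assuming the points are drawn from a
normal distribution (the case in which the 99.7% figure holds). The name
fraction_within, the seed, and the sample size are made up for illustration.

```python
import random
import statistics

def fraction_within(points, k):
    """Return the fraction of points within k standard deviations of the mean."""
    mean = statistics.fmean(points)
    stddev = statistics.pstdev(points)
    inside = sum(1 for p in points if abs(p - mean) <= k * stddev)
    return inside / len(points)

rng = random.Random(42)  # fixed seed so the run is repeatable
points = [rng.gauss(0.0, 1.0) for _ in range(100_000)]  # assumed: normal data

within_3 = fraction_within(points, 3)
print(f"fraction within 3 stddev: {within_3:.4f}")
```

With a sample this size the printed fraction should land close to 0.997, though
the exact digits depend on the seed.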

So, if we write a test that checks to see if a new (completely random) point is
within 3 stddev, there is a 0.3% chance that new (completely random) point will
fail our test. If we bump that to 4 stddev, we should expect about 99.994% of
the points to pass the test, and about 0.006% of the points to fail (1 of every
15787). If I collect 500,000 (completely random) points in a day, then I should
expect about 32 of them to fail our test.
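
The failure rates can be worked out directly instead of memorized. For a normal
distribution, the chance of landing more than k stddev from the mean (on either
side) is erfc(k / sqrt(2)). The sketch below is plain Python, not PDL, and the
function name two_sided_tail is my own; the 500,000 points per day is the
figure used above.

```python
import math

def two_sided_tail(k):
    """Probability that a normal value lies more than k stddev from the mean."""
    return math.erfc(k / math.sqrt(2))

POINTS_PER_DAY = 500_000

for k in (3, 4):
    p_fail = two_sided_tail(k)
    print(f"{k} stddev: fail rate {p_fail:.4%}, "
          f"1 in {1 / p_fail:,.0f}, "
          f"about {POINTS_PER_DAY * p_fail:,.0f} per day")
```

The point of running the numbers is that even a perfectly healthy feed trips a
4 stddev test a few dozen times a day at this volume, so a failed test is a
prompt to look closer, not proof of bad data.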

So I create a time series of random points. Then I create a new time series of
the 30-day standard deviation of the original series. Then I compare the two
and see which points are outliers.
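
Here is one way that comparison could look, as a sketch in plain Python rather
than PDL. It assumes one point per day, and it assumes the 30-day window covers
the 30 points before the one being tested, so a bad point cannot hide inside
its own statistics. The function names, the seed, and the planted bad tick are
all hypothetical.

```python
import random
import statistics

def rolling_stddev(series, window):
    """Stddev of the `window` points before each index; None until enough history."""
    out = [None] * len(series)
    for i in range(window, len(series)):
        out[i] = statistics.pstdev(series[i - window:i])
    return out

def find_outliers(series, window=30, k=4.0):
    """Indices whose value is more than k rolling stddevs from the rolling mean."""
    stddevs = rolling_stddev(series, window)
    hits = []
    for i in range(window, len(series)):
        mean = statistics.fmean(series[i - window:i])
        if stddevs[i] > 0 and abs(series[i] - mean) > k * stddevs[i]:
            hits.append(i)
    return hits

rng = random.Random(1)  # fixed seed so the run is repeatable
series = [rng.gauss(100.0, 1.0) for _ in range(250)]
series[200] = 150.0  # planted bad tick (hypothetical)

outliers = find_outliers(series)
print(outliers)
```

The planted point at index 200 is reported. Note that it also inflates the
stddev of every window that contains it, which makes the test less sensitive
for the next 30 points.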

Market data is not completely random; it's stochastic, which I interpret to
mean "given value A1, the next value A2 will be somewhere between A1 +/- B".
It's predicting (guessing) "B" that earns quants the big bucks. But, over the
entire set of data, I know each previous value of B, which is the difference
between A1 and A2, or the rate of change between 2 points. What I really want
to know is if the rate of change from A1 to A2 appears abnormal, say, if it's
more than 4 stddev from the mean.

So I take my time series, create a new time series that is the rate of change
for each point in the previous series, create another new time series that is
the 30-day stddev of the previous time series, and then compare the rate of
change with the stddev to see which ones are outliers.
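
Those three steps can be sketched as follows, again in plain Python rather than
PDL. I am assuming "rate of change" means the fractional change (A2 - A1) / A1,
which requires nonzero values, and that the 30-day window covers the 30 changes
before the one being tested. The names, the seed, and the simulated price
series are hypothetical.

```python
import random
import statistics

def rate_of_change(series):
    """Fractional change from each point to the next: (A2 - A1) / A1."""
    return [(b - a) / a for a, b in zip(series, series[1:])]

def abnormal_moves(series, window=30, k=4.0):
    """Indices into `series` reached by a move more than k stddev from the mean move."""
    roc = rate_of_change(series)
    hits = []
    for i in range(window, len(roc)):
        history = roc[i - window:i]
        mean = statistics.fmean(history)
        stddev = statistics.pstdev(history)
        if stddev > 0 and abs(roc[i] - mean) > k * stddev:
            hits.append(i + 1)  # roc[i] is the move from series[i] to series[i + 1]
    return hits

rng = random.Random(7)  # fixed seed so the run is repeatable
prices = [100.0]
for _ in range(199):
    prices.append(prices[-1] * (1 + rng.gauss(0.0, 0.01)))  # ~1% daily moves
prices[150] *= 1.5  # planted bad tick (hypothetical)

flagged = abnormal_moves(prices)
print(flagged)
```

Testing the change rather than the level means a series that drifts a long way
over a year is not flagged, while a sudden 50% jump in a series that normally
moves about 1% a day is.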

Finally, I should also make sure that my source is still updating. It is very
rare for a series to report the same value twice in a row, let alone for an
entire week. So let's check for flatness by using stddev: if the stddev over a
window is zero, nothing in that window has changed.
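
A flatness check along these lines might look like the sketch below, in plain
Python rather than PDL. The function name is_flat, the 7-point window (one week
of daily points), and the tolerance parameter are assumptions of mine.

```python
import statistics

def is_flat(series, window=7, tolerance=0.0):
    """True if the last `window` points have a stddev at or below `tolerance`."""
    if len(series) < window:
        return False  # not enough history to judge
    return statistics.pstdev(series[-window:]) <= tolerance

stale_feed = [101.25] * 7             # same value all week
live_feed = [101.25] * 6 + [101.50]   # moved on the last day

print(is_flat(stale_feed))
print(is_flat(live_feed))
```

A nonzero tolerance is there for sources that re-send a value with tiny
rounding differences, where an exact comparison with zero would miss a stale
feed.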

PDL and TA-Lib make this all incredibly simple, so I can get on with my real
work (fragging lamers in Quake).