Tag Archives: CRAN

The book Computational Actuarial Science, with R is officially out. In the introduction of the book, and on the website of CRC, it is mentioned that the datasets can be found “in an R package on CRAN“, which is unfortunately incorrect. Some datasets are too large, so the package can not be uploaded on CRAN. Hopefully, Christophe host the package on his website.

I have to admit, first, that finding Waldo has been a difficult task. And I did not succeed. Neither could I correctly spot his shirt (because actually, it was what I was looking for). You know, that red-and-white striped shirt. I guess it should have been possible to look for Waldo’s face (assuming that his face does not change) but I still have problems with size factor (and resolution issues too). The problem is not that simple. At thehttp://mlsp2009.conwiz.dk/ conference, a price was offered for writing an algorithm in Matlab. And one can even find Mathematica codes online. But most of the those algorithms are based on the idea that we look for similarities with Waldo’s face, as described in problem 3 on http://www1.cs.columbia.edu/~blake/‘s webpage. You can find papers on that problem, e.g. Friencly & Kwan (2009) (based on statistical techniques, but Waldo is here a pretext to discuss other issues actually), or more recently (but more complex) Garg et al. (2011) on matching people in images of crowds.

What about codes in R ? On http://stackoverflow.com/, some ideas can be found (and thank Robert Hijmans for his help on his package). So let us try here to do something, on our own. Consider the following picture,

With the following code (based on the following file) it is possible to import the picture, and to extract the colors (based on an RGB decomposition),

My strategy is simple: try to spot areas with white and red stripes (horizontal stripes). Note that here, I ran the code on a Windows machine, the package is not working well on Mac. In order to get a better understanding of what could be done, let us start with something much more simple. Like the picture below, with Waldo (and Waldo only). Here, it is possible to extract the three colors (red, green and blue),

It is possible to extract the red zones (already on the graph above, since red is a primary color), as well as the white ones (green zones on the graphs means a white region on the picture, on the left)

So here, we can easily spot Waldo, i.e. the guy with the red-white stripes (with two different sets of thresholds for the RGB decomposition)

Let us try somthing slightly more complicated, with a zoom on the large picture of the department store (since, to be honest, I know where Waldo is…).

Here again, we can spot the white part (on the left) and the red one (on the right), with some thresholds for the RGB decomposition

Note that we can try to be (much) more selective, playing with threshold. Here, it is not very convincing: I cannot clearly identify the region where Waldo might be (the two graphs below were obtained playing with thresholds)

And if we look at the overall pictures, it is worst. Here are the white zones, and the red ones,

and again, playing with RGB thresholds, I cannot spot Waldo,

Maybe I was a bit optimistic, or ambitious. Let us try something more simple, like finding a flag on the moon. Consider the picture below on the left, and let us see if we can spot an American flag,

Again, on the left, let us identify white areas, and on the right, red ones

Then as before, let us look for horizontal stripes

Waouh, I did it ! That’s one small step for man, one giant leap for R-coders ! Or least for me… So, why might it be interesting to identify areas on pictures ? I mean, I am not Chloe O’Brian, I don’t have to spot flags in a crowd, neither Waldo, nor some terrorists (that might wear striped shirts). This might be fun if you want to give grades for your exams automatically. Consider the two following scans, the template, and a filled copy,

A first step is to identify regions where we expect to find some “red” part (I assume here that students have to use a red pencil). Let us start to check on the template and the filled form if we can identify red areas,

Irecently tried to answer a simple question, asked by @adelaigue. Actually, I thought that the answer would be obvious… but it is a little bit more compexe than what I thought. In a recent survey about elections in Brazil, it was mentionned in a French newspapper that “Mme Rousseff, 62 ans, de 46,8% des intentions de vote et José Serra, 68 ans, de 42,7%” (i.e. proportions obtained from the survey). It is also mentioned that “la marge d’erreur du sondage est de 2,2% ” i.e. the margin of error is 2.2%, which means (for the journalist) that there is a “grande probabilité que les 2 candidats soient à égalité” (there is a “large probability” to have equal proportions).
Usually, in sampling theory, we look at the margin of error of a single proportion. The idea is that the variance of , obtained from a sample of size is

thus, the standard error is

The standard 95% confidence interval, derived from a Gaussian approximation of the binomial distribution is

The largest value is obtained when p is 1/2, and then we have a worst case confidence interval (an upper bound) which is

So with a margin of error means that . Hence, with a 5% margin of error, it means that n=400. While 2.2% means that n=2000:
> 1/.022^2
[1] 2066.116
Classically, we compare proportions between two samples: surveys at two different dates, surveys in different regions, surveys paid by two different newpapers, etc. But here, we wish to compare proportions within the same sample. This has been consider in an “old” paper published in 1993 in the American Statistician,

It contains nice figures to illustrate the difference between the standard approach,

and the one we would like to study here.

This point is mentioned in the book by Kish, survey sampling (thanks Benoit for the reference),

Let and denote empirical frequencies we have obtained from the sample, based on observations. Then since

and

we have

Thus, a natural margin of error on the difference between the two proportion is here

where .
> p=seq(0,.5,by=.01)
> ic1=rep(1.96/sqrt(4*n),length(p))
> ic2=1.96*sqrt(p*(1-p))/sqrt(n)
> delta=.01
> ic31=1.96*sqrt(2*p-delta^2)/sqrt(n)
> delta=.2
> ic32=1.96*sqrt(2*p-delta^2)/sqrt(n)
> plot(p,ic32,type=”l”,col=”blue”)
> lines(p,ic31,col=”red”)
> lines(p,ic2)
> lines(p,ic1,lty=2)
So on the graph below, the dotted line is the standard upper bound, the plain line in black being a more accurate one when the probability is (the x-axis). The red line is the true margin of error with a large difference between candidates (20 points) and the blue line with a small difference (1 point).

Remark: an alternative is to consider a chi-square test, comparering two multinomial distributions, with probabilities and where is the average proportion, i.e. 44.75%. Then

i.e. =3.71
> p=(p1+p2)/2
> (x2=n*((p1-p)^2/p+(p2-p)^2/p))
[1] 3.756425
> 1-pchisq(x2,df=1)
[1] 0.05260495
Under the null hypothesis, should have a chi-square distribution, with one degree of freedom (since the average is fixed here). Here the probability to reach that level is around 5% (which can be compared with the 2% we add before).

So finally, I would think that here, stating that there is a “large probability” is not correct…

recently, a classmate working in an insurance company told me he had too large datasets to run simple regressions (GLM, which involves optimization issues), and that they were thinking of a reward for the one who will write the best R-code (at least the fastest). My first idea was to use subsampling techniques, saying that 10 regressions on 100,000 observations can take less time than a regression on 1,000,000 observations. And perhaps provide also better results…

Time to run a regression, as a function of the number of observations

Here, I generate a dataset as follows

and we fit

where is a spline function (just to make it as general as possible, since in insurance ratemaking, we include continuous variates that do not influence claims frequency linearly in the score). Yes, there might be also useless variables, including one of them which is strongly correlated with one that has an impact in the regression. The code to generate the dataset is simply

Finally, from the first regression, we have points in black (based on 200 simulated datasets), and with a stepwise procedure, we have the points in red.

i.e. it might look linear (proportional), but if it was linear, then on a log-log scale, we should have also straigh lines, with slope 1,

Actually, it looks like a convex function.

The interpretation of that convexity might lead to misinterpretation. On the graph below on the left, on a dataset two times bigger than the previous one (black point) will be less than two times longer to run, while on the right, it will be more than two timess longer,

Convexity can simply be interpreted as “too large datasets take time, and too small too…”. Which is a first step: it should be interesting, in some cases, to run several regressions on smaller datasets….

On the graph below, we have the time (y-axis, here on a log scale) it took to run regression on samples of size , as function of (x-axis), including the time it took to run the regression on a dataset of size which is the concentration of dots on the left (i.e. =1), both on the 6 regressors – in black – and with a strepwise procedure – in red. One has to keep in mind that I did not remove the printing option in the stepwise procedure, so it might be difficult to compare the two clouds (black vs. red). Nevertheless, we clearly see that if we run regression on samples of size , when is not too large, i.e. less than 10 or 15, it is not longer than the regression on =200,000 lines.

So here we see that running 100 regressions on 2,000 lines is longer than running 1 regression on 200,000 lines… But maybe we are not comparing things that are actually comparable: what if it takes a bit longer, but we strongely improve the quality of our estimators ?

What about the quality of the output ?

Here, we consider only one dataset, with =100,000 lines (just to make it run a bit faster). And =20 subsets. Recall that the generated dataset is from

and we fit

Here, we plot here and a confidence interval, defined as

The lightblue segment is the initial estimator, while the blue one is obtained from the stepwise procedure. The grey area represent the estimation on the overall sample, while the segments on the right are the estimators (each on samples of size ).

We can see that we have much more volatility on those estimators, but the average (horizontal doted lines) are not so bad… The true value (i.e. the one used to generate the dataset is the dotter black horizontal line).
And if we repeat that on 1,000 simulated dataset, we obtaind the following distribution for (blue line), so we have an unbiased estimator of our parameter (the verticular line being here the true value), here including a stepwise procedure,

But if we add the the red curve is the average of the the previous one being now the clear blue line in the back, we see that taking average of estimators on subsamples is not bad at all, on the contrary,

and for those who think that the stepwise procedure is a mistake, here is what we get without it,

So what we can see is that running 20 regressions can take (a little) more time (from what we’ve seen earlier) than running only one on the whole dataset…. but it provides better estimates. So the tradeoff is not that simple, and maybe running several regressions on huge datasets can be a proper alternative.

In a post I published a few month ago (in French, here, based on some old paper,there), I mentioned a statistical procedure to test if the underlying distribution of an i.i.d. sample had a finite mean (based on extreme value results). Since I just used it on a small dataset (yes, with real data), I decided to post the R code, since it is rather simple to use. But instead of working on that dataset, let us see what happens on simulated samples. Consider =200 observations generated from a Pareto distribution

A first idea is to fit a GPD distribution on observations that exceed a threshold >1.
Since we would like to test (against the assumption that the expected value is finite, i.e. ), it is natural to consider the likelihood ratio

Under the null hypothesis, the distribution of should be chi square distribution with one degree of freedom. As mentioned here, the significance level is attained with a higher accuracy by employing Bartlett correction (there). But let us make it as simple as possible for the blog, and use the chi-square distribution to derive the p-value.
Since it is rather difficult to select an appropriate threshold, it can be natural (as in Hill’s estimator) to consider , and thus, to fit a GPD on the largest values. And then to plot all that on a graph (like the Hill plot)

Here we keep the estimated value of the tail index, and the associated standard deviation of the estimator, to draw some confidence interval (assuming that the maximum likelihood estimate is Gaussian, which is correct only when n is extremely large). We also calculate the deviance of the model, the deviance of the constrained model (), and the difference, which is the log likelihood ratio. Then we calculate the p-value (since under the likelihood ratio statistics has a chi-square distribution).
If =2, we have the following graph, with on top the p-value (which is almost null here), the estimation of the tail index he largest values (and a confidence interval for the estimator),

If =1.5 (finite mean, but infinite variance), we have

i.e. for those two models, we clearly reject the assumption of infinite mean (even if gets too close from 1, we should consider thresholds large enough). On the other hand, if =1 (i.e. infinite mean) we clearly accept the assumption of infinite mean (whatever the threshold),

Using Hill’s estimator

An alternative could be to use Hill’s estimator (with Alexander McNeil’s package). See here for more details on that estimator. The test is simply based on the confidence interval derived from the (asymptotic) normal distribution of Hill’s estimator,

Here the test is more robust than the one based on the GPD. And if =1 (i.e. infinite mean), again we accept ,

Note that if =0.7, it is still possible to run the procedure, and hopefully, it suggests that the underlying distribution has infinite mean,

(here without any doubt). Now you need to wait a few days to see some practical applications of the idea (there was on in the paper mentioned above actually, on business interruption insurance losses).

Following my previous post on optimization and mixtures (here), Nicolas told me that my idea was probably not the most clever one (there).
So, we get back to our simple mixture model,

In order to describe how EM algorithm works, assume first that both and are perfectly known, and the mixture parameter is the only one we care about.

The simple model, with only one parameter that is unknown

Here, the likelihood is

so that we write the log likelihood as

which might not be simple to maximize. Recall that the mixture model can interpreted through a latent variate (that cannot be observed), taking value when is drawn from , and 0 if it is drawn from . More generally (especially in the case we want to extend our model to 3, 4, … mixtures), and .
With that notation, the likelihood becomes

and the log likelihood

the term on the right is useless since we only care about p, here. From here, consider the following iterative procedure,
Assume that the mixture probability is known, denoted . Then I can predict the value of (i.e. and ) for all observations,

So I can inject those values into my log likelihood, i.e. in

having maximum (no need to run numerical tools here)

that will be denoted . And I can iterate from here.
Formally, the first step is where we calculate an expected (E) value, where is the best predictor of given my observations (as well as my belief in ). Then comes a maximization (M) step, where using , I can estimate probability .

A more general framework, all parameters are now unkown

So far, it was simple, since we assumed that and were perfectly known. Which is not reallistic. An there is not much to change to get a complete algorithm, to estimate . Recall that we had which was the expected value of Z_{1,i}, i.e. it is a probability that observation i has been drawn from .
If , instead of being in the segment was in , then we could have considered mean and standard deviations of observations such that =0, and similarly on the subset of observations such that =1.
But we can’t. So what can be done is to consider as the weight we should give to observation i when estimating parameters of , and similarly, 1-would be weights given to observation i when estimating parameters of .
So we set, as before

Recently, one of my students asked me about optimization routines in R. He told me he that R performed well on the estimation of a time series model with different regimes, while he had trouble with a (simple) GARCH process, and he was wondering if R was good in optimization routines. Actually, I always thought that mixtures (and regimes) was something difficult to estimate, so I was a bit surprised…

Indeed, it reminded me some trouble I experienced once, while I was talking about maximum likelihooh estimation, for non standard distribution, i.e. when optimization had to be done on the log likelihood function. And even when generating nice samples, giving appropriate initial values (actually the true value used in random generation), each time I tried to optimize my log likelihood, it failed. So I decided to play a little bit with standard optimization functions, to see which one performed better when trying to estimate mixture parameter (from a mixture based sample). Here, I generate a mixture of two gaussian distributions, and I would like to see how different the mean should be to have a high probability to estimate properly the parameters of the mixture.

On the graph above, the x-axis is the difference between means of the mixture (as on the animated grap above). Then, the red point is the median of estimated parameter I have (here ), and I have included something that can be interpreted as a confidence interval, i.e. where I have been in 90% of my scenarios: theblack vertical segments. Obviously, when the sample is not enough heterogeneous (i.e. and rather different), I cannot estimate properly my parameters, I might even have a probability that exceed 1 (I did not add any constraint). The blue plain horizontal line is the true value of the parameter, while the blue dotted horizontal line is the initial value of the parameter in the optimization algorithm (I started assuming that the mixture probability was around 0.2).
The graph below is based on the second optimization routine (with identical starting values, and of course on the same generated samples),

(just to be honest, in many cases, it did not converge, so the loop stopped, and I had to run it again… so finally, my study is based on a bit less than 500 samples (times 15 since I considered several values for the mean of my second underlying distribution), with 200 generated observations from a mixture).
The graph below compares the two (empty circles are the first algorithm, while plain circles the second one),

On average, it is not so bad…. but the probability to be far away from the tru value is not small at all… except when the difference between the two means exceeds 3…
If I change starting values for the optimization algorithm (previously, I assumed that the mixture probability was 1/5, here I start from 1/2), we have the following graph

which look like the previous one, except for small differences between the two underlying distributions (just as if initial values had not impact on the optimization, but it might come from the fact that the surface is nice, and we are not trapped in regions of local minimum).
Thus, I am far from being an expert in optimization routines in R (see here for further information), but so far, it looks like R is not doing so bad… and the two algorithm perform similarly (maybe the first one being a bit closer to the trueparameter).

An Open Lab-Notebook Experiment

Some
sort of unpretentious (academic) blog, by a surreptitious economist and
born-again mathematician. A blog activist, and an actuary, too. Always curious.
Because academics are probably more than the sum of our publication lists, grants and conference talks...

Used to live in Paris (France),
Leuven (Belgium), Hong-Kong (China), and Montréal (Canada). Professor and researcher in
Montréal, currently back in Rennes (France). ENSAE ParisTech & KU Leuven Alumni