Thursday, I will be in Paris, to discuss the results we got from the pricing game. I will present the 12 or 13 models sent to me, and discuss what happened when I created a market where the models were competing. One or two models were clearly underestimating the losses, so with the results as they were sent, each time, one company got 80% market share and over 250% loss ratio. So I decided to normalize all the premiums, so that the average premium was the same for all the companies. Slides are now available.

Tomorrow, around noon, I will be giving a talk on predictive modeling for actuaries. In the introduction, I will briefly get back to the idea that a prediction is usually a best estimate, in the sense of getting an expected value. And because the expected value is the constant that minimizes the expected squared error, it is natural to use least squares ideas. In order to illustrate all those concepts, we will use a simple dataset, with the sex, the height and the weight of a person, as well as the declared weight.

but it’s not a big deal. The variable of interest, here, is someone’s weight

library(car)          # the Davis dataset (measured and reported heights and weights)
data(Davis)
attach(Davis)
Y=weight*2.204622     # weight, converted from kilograms to pounds

(here in pounds). We will use explanatory variables such as the sex of that person, or his/her height

X=Davis$height / 30.48   # height, converted from centimetres to feet

(in feet). So, we will start with the (standard) linear model, just to make sure that we all talk about the same thing.

The goal will be to use (possible) explanatory variables to improve our prediction. We will start with the standard linear model, but we will see that nonlinear models can also easily be obtained.
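Here is a minimal sketch of that first step, with the variables defined above (the 6-foot height used in the prediction is just an arbitrary value):

# standard linear regression of the weight (in pounds) on the height (in feet)
reg = lm(Y ~ X)
summary(reg)
# predicted weight for someone 6 feet tall
predict(reg, newdata = data.frame(X = 6))
# a simple nonlinear alternative, adding a quadratic term
reg2 = lm(Y ~ poly(X, 2))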

Nonlinearities will be discussed. But those models are Gaussian (as mentioned above), and homoscedastic. So we will see how generalized linear models can be used to model the mean and the variance, at the same time. For instance, with a Poisson regression (below), the variance will increase with the expected value.

After this general introduction, we will spend some time on 0-1 variables. We will see how to use a logistic regression, and also discuss more generally which kind of models can be used for classification. ROC curves will be presented, and explained.
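As a hedged illustration (not necessarily the example used in the talk), one can predict the sex from the height and the weight in the Davis dataset, and draw the ROC curve with the ROCR package:

library(ROCR)
# logistic regression for a 0/1 response (here, the sex)
logit = glm(sex ~ height + weight, data = Davis, family = binomial(link = "logit"))
score = predict(logit, type = "response")
pred  = prediction(score, Davis$sex)
plot(performance(pred, "tpr", "fpr"))   # ROC curve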

Then, we will also see an alternative to the logistic model, namely classification trees and CART techniques.
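A possible sketch, on the same (assumed) dataset, with the rpart package:

library(rpart)
# classification tree for the sex, using height and weight
tree = rpart(sex ~ height + weight, data = Davis, method = "class")
plot(tree)
text(tree)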

This morning, I was working with Julie, a student of mine coming from Rennes, on mortality tables. Actually, we work on genealogical datasets from a small region in Québec, and we can observe a lot of volatility. If I borrow one of her graphs, we get something like

Since we have some missing data, we wanted to use some Generalized Nonlinear Models. So let us see how to get a smooth estimator of the mortality surface. We will write some code that we can use on our data later on (the dataset we have has been obtained after signing a lot of official documents, and I guess I cannot upload it here, even partially).
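I cannot reproduce the data, but the kind of code we have in mind is sketched below, assuming a hypothetical data frame mort with the deaths D, the exposures E, the age and the year, and fitting a bilinear (Lee-Carter type) structure, a classical example of a generalized nonlinear model, with the gnm package:

library(gnm)
# Poisson model for the deaths, with a log-exposure offset and a bilinear age-year term
fit = gnm(D ~ factor(age) + Mult(factor(age), factor(year)),
          offset = log(E), family = poisson, data = mort)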

In order to illustrate the next section of the non-life insurance course, consider the following example, inspired by http://sciencepolicy.colorado.edu/…. This is the so-called “Normalized Hurricane Damages in the United States” dataset, for the period 1900-2005, from Pielke et al. (2008). The dataset is available in xls format, so we have to spend some time importing it,
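for instance with the gdata package, assuming the spreadsheet has been saved locally (the file name below is an assumption); the column names are then cleaned up by make.names(), in the Normalized.PL05 style used below,

library(gdata)
# import the xls file (requires perl on the machine)
base = read.xls("normalized-hurricane-damages.xls", sheet = 1)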

We can visualize the normalized costs of hurricanes, from 1900 till 2005, with the 207 hurricanes (here the x-axis is not time, it is simply the index of the loss)

> plot(base$Normalized.PL05/1e9,type="h",ylim=c(0,155))

As usual, there are two components when computing the pure premium of an insurance contract: the number of claims (here, hurricanes) and the individual losses of each claim. We’ve seen – above – the individual losses; let us now focus on the annual frequency.
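A sketch of that second component, counting the number of hurricanes per year (assuming the year of each event is stored in a Year column, which is an assumption on the column name):

# number of hurricanes per year, including years with no event, from 1900 to 2005
TB = data.frame(table(factor(base$Year, levels = 1900:2005)))
names(TB) = c("year", "N")
TB$year = as.numeric(as.character(TB$year))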

In predictive modeling (here, we wish to price a reinsurance contract for, say, 2014), we probably need to take into account some possible trend in the hurricane occurrence frequency. We can consider either a linear trend, or an exponential one, as in the sketch below.
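With the annual counts computed above, a hedged version of those competing models could be (using least squares for the linear trend, which is one possible choice among others):

# flat model (constant rate), linear trend, and exponential trend for the annual counts
reg_flat = glm(N ~ 1, data = TB, family = poisson)
reg_lin  = lm(N ~ year, data = TB)
reg_exp  = glm(N ~ year, data = TB, family = poisson(link = "log"))
# predicted number of hurricanes for 2014 under each model
predict(reg_flat, newdata = data.frame(year = 2014), type = "response")
predict(reg_lin,  newdata = data.frame(year = 2014))
predict(reg_exp,  newdata = data.frame(year = 2014), type = "response")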

Observe that changing the model will change the pure premium: with a flat prediction, we expect less than 2 (major) hurricanes, but with the exponential trend, we expect more than 4…

This is for the expected frequency. Now, we should find a suitable model to compute the pure premium of a reinsurance treaty, with a (high) deductible, and a limited (but large) cover. As we will see in class next week, the appropriate model is a Pareto distribution (see Hagstrœm (1925), Huyghues-Beaufond (1991) or a survey – in French – published a few years ago).
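To fix ideas, here is a minimal sketch of a Pareto fit above some threshold, by maximum likelihood (the 10 billion threshold is an illustrative choice, not necessarily the one used in class):

u = 10                               # threshold, in billions
X = base$Normalized.PL05 / 1e9       # normalized losses, in billions
Z = X[X > u]
alpha = length(Z) / sum(log(Z / u))  # maximum likelihood estimate of the tail index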

Now, consider an insurance company, in the U.S., with 5% market share (just to illustrate). We will then consider the scaled losses \tilde Y_i = Y_i/20. The losses are given below. Consider a reinsurance treaty, with a deductible of 2 (billion) and a limited cover of 4 (billion).
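In R, the scaled losses are simply

# losses for a company with a 5% market share, in billions
Ytilde = base$Normalized.PL05 / 1e9 / 20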

[Nov 5th] there is a typo in the previous function, since the threshold should be used, here, as a parameter in the function, if you want to play with that function and see the impact of the threshold (see a more recent post on the same topic, but a different dataset)… but here, we do not change the threshold, so it is not a big deal.

Now, it is probably time to bring all the pieces together. We might expect a bit less than 2 (major) hurricanes per year,

> predictions[1]
[1] 1.95283

and each hurricane has a 12.5% chance of costing more than 500 million to our insurance company,

> mean(base$Normalized.PL05/1e9/20>.5)
[1] 0.1256039

and given that a hurricane exceeds 500 million in losses, the expected repayment by the reinsurance company is (in millions)
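A sketch of that computation, based on the Pareto fit obtained above (scaling the losses by 1/20 leaves the tail index unchanged, and turns the 10 billion threshold into .5); the result will of course depend on the threshold and on the fitted tail index:

# expected payment min(max(Y - 2, 0), 4), given that the loss exceeds .5, in millions
S = function(y) (.5 / y)^alpha               # survival function of the fitted Pareto, above .5
integrate(S, lower = 2, upper = 6)$value * 1e3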

We do have two parts here: the linear increase of the average, E[Y|X] = β0 + β1 X, and the constant variance σ² of the normal distribution.

On the other hand, if we assume a Poisson regression,

# Poisson regression (log link) of the stopping distance on the speed, cars dataset
poisson.reg = glm(dist~speed,data=cars,family=poisson(link="log"))

we have something like

This time, two things have changed simultaneously: our model is no longer linear, it is an exponential one, E[Y|X] = exp(β0 + β1 X), and the variance is also increasing with the explanatory variable, since with a Poisson regression, Var[Y|X] = E[Y|X].

If we adapt the previous code, we get
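A possible version of that adapted code (the original snippet is not reproduced here) plots the fit with pointwise confidence bands on the response scale:

vs = seq(4, 25, by = .1)
p = predict(poisson.reg, newdata = data.frame(speed = vs), type = "link", se.fit = TRUE)
plot(cars$speed, cars$dist, pch = 19)
lines(vs, exp(p$fit), lwd = 2)                    # exponential mean
lines(vs, exp(p$fit + 1.96 * p$se.fit), lty = 2)  # upper band
lines(vs, exp(p$fit - 1.96 * p$se.fit), lty = 2)  # lower band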

The problem is that we changed two things when we moved from the linear model to the Poisson regression. So let us look at what happens when we change the two components independently. First, we can change the link function, keeping a Gaussian model, but this time a multiplicative one (with a logarithmic link function)

# Gaussian model with a log link: multiplicative (nonlinear) mean, constant variance
gaussian.reg = glm(dist~speed,data=cars,family=gaussian(link="log"))

which is still, here, a homoscedastic model, but this time nonlinear. Or we can change the link function in the Poisson regression, to get a linear model, but a heteroscedastic one.

I admit it, the title sounds weird. The problem I want to address this evening is related to the use of the stepwise procedure on a regression model, and to discuss the use of categorical variables (and possible misinterpretations). Consider the following dataset
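I will not reproduce the original dataset here, but a simulated reconstruction of the same kind of situation, with a factor having five categories (A to E) where only category C actually shifts the response, could be:

set.seed(1)
n  = 200
X1 = rnorm(n)
X2 = factor(sample(LETTERS[1:5], n, replace = TRUE))
Y  = 1 + 2 * X1 + .5 * (X2 == "C") + rnorm(n)   # only category C has an effect
db = data.frame(Y, X1, X2)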

What do we get now? This time, the stepwise procedure recommends that we keep one category (namely C). So my point is simple: when running a stepwise procedure with factors, either we keep the factor as it is, or we drop it. If it is necessary to change the design by pooling together some categories, and we forget to do it, the procedure will suggest removing the variable, because having 4 categories meaning the same thing will cost us too much if we use the Akaike criterion. And this is exactly what happens here.
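On such a dataset, the cost of those extra categories can be seen by comparing the Akaike criterion of the two designs (again, a sketch on the simulated data above, not the original output):

# full factor: four estimated contrasts, only one of which is really useful
AIC(lm(Y ~ X1 + X2, data = db))
# a single indicator for category C
AIC(lm(Y ~ X1 + I(X2 == "C"), data = db))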

So here, we should pool together categories A, B, D and E (E being the reference here). As mentioned in a previous post, it is necessary to pool together the categories that should be pooled, as soon as possible. If not, the stepwise procedure might lead to misinterpretations.
