Category Archives: Graphics

Yesterday evening, I was walking in Budapest, and I saw a nice map, in a sort of Otto Neurath style. It was hand-made, but I thought it should be possible to do it in R, automatically.

A few years ago, Baptiste Coulmont published a nice blog post on the package osmar, that can be used to import OpenStreetMap objects (polygons, lines, etc) in R. We can start from there. More precisely, consider the city of Douai, in France,

Now, let us consider a rectangular grid. If there is a river in a cell, I want a river. If there is a church, I want a church, etc. Since there will be one (and only one) picture per cell, there will be priorities. But first we have to check the intersections between the cells of our grid and the OpenStreetMap polygons.
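The priority rule can be sketched in base R, leaving the intersection step aside. Everything here is hypothetical: the feature names, their ordering, and the `hits` table, which is assumed to be the outcome of the intersection step (one row per cell, one logical column per type of object found in that cell).

```r
# Hypothetical priority order: the first feature found in a cell wins
priority <- c("river", "church", "park", "building")

# Hypothetical outcome of the intersection step, for four cells
hits <- data.frame(
  river    = c(TRUE,  FALSE, FALSE, FALSE),
  church   = c(TRUE,  TRUE,  FALSE, FALSE),
  park     = c(FALSE, TRUE,  TRUE,  FALSE),
  building = c(TRUE,  TRUE,  TRUE,  FALSE)
)

pick_icon <- function(hits, priority) {
  # reorder the columns so that column 1 has the highest priority
  m <- as.matrix(hits[, priority, drop = FALSE])
  # for each cell, index of the first feature that is present (NA if none)
  first <- apply(m, 1, function(r) if (any(r)) which(r)[1] else NA_integer_)
  priority[first]
}

# one picture per cell, NA for an empty cell
icons <- pick_icon(hits, priority)
```

Changing the order of `priority` is enough to change which picture is drawn when several objects share a cell.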

Last week, I wanted to display inter-relationships between data in a matrix. My friend Fleur, from AXA, mentioned an interesting possible application, in car accidents. In car against car accidents, it might be interesting to see which parts of the cars were involved. On https://www.data.gouv.fr/fr/, we can find such a dataset, with a lot of information on car accidents involving bodily injuries (in France, a police report is necessary, and all of them are reported in a big dataset… actually several datasets, with information on people involved, cars, locations, etc). For 2014 claims, the dataset is

The problem, when we ask for a symmetric chord diagram, is that we cannot have Front – Front claims (since values on the diagonal are removed)

> library(circlize)
> chordDiagramFromMatrix(M,symmetric=TRUE)

So let’s pretend that there could be some possible distinction in the dataset, between the first and the second row. Like the first one is the ‘responsible’ driver. Or, for an insurer, the first one is your insured. Just to avoid this symmetry problem

Friday evening, just before leaving the office to pick up the kids after their first week back in class, Matthew Champion (aka @matthewchampion) sent me an email, asking for more details. He wanted to know if I had produced those graphs, and if he could mention them in a post. The truth is, I have no idea who produced those graphs, but I told him one can easily reproduce them. For instance, for the cities, in R, use

Recently, with @3wen, we wanted to play with isodensity curves. The problem is that it is difficult to get – numerically – the equation of the contour (even if we can easily plot it). Consider the following surface (just for fun, in order to illustrate the idea)
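One possible route, not necessarily the one followed in the original post, is base R’s `contourLines()`, which returns the points of a contour as coordinates instead of drawing them. The surface used here (a product of two standard Gaussian densities), the grid and the level 0.05 are assumptions picked for illustration, because the exact contour is then a circle that can be used as a check.

```r
# Hypothetical surface: density of two independent standard Gaussian variables
x <- seq(-3, 3, length.out = 201)
y <- seq(-3, 3, length.out = 201)
z <- outer(x, y, function(u, v) dnorm(u) * dnorm(v))

# Coordinates of the isodensity curve at a given level
level <- 0.05
cl <- contourLines(x, y, z, levels = level)
curve_xy <- cbind(x = cl[[1]]$x, y = cl[[1]]$y)

# For this surface the exact contour is a circle centred at the origin
radius_numeric <- sqrt(rowSums(curve_xy^2))
radius_exact <- sqrt(-2 * log(2 * pi * level))

# Draw the surface and overlay the extracted curve
image(x, y, z)
lines(curve_xy, lwd = 2)
```

The object `curve_xy` is a plain two-column matrix, so the contour can be reused (as a polygon, for instance) without going through the plot.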

In this paper, we investigate (and extend) Ripley’s circumference method to correct the bias of density estimation on edges (or frontiers) of regions. The idea of the method was theoretical and difficult to implement. We provide a simple technique, based on properties of Gaussian kernels, to efficiently compute weights to correct border bias on frontiers of the region of interest, with an automatic selection of an optimal radius for the method. We illustrate the use of that technique to visualize hot spots of car accidents and campsite locations, as well as locations of bike thefts.

The National Hurricane Center (NHC) collects datasets on all storms in the North Atlantic, in the North Atlantic Hurricane Database (HURDAT). For all storms, we have the location of the storm, every six hours (at midnight, six a.m., noon and six p.m.). Note that we also have the date, the maximal wind speed – on a 6 hour window – and the pressure in the eye of the storm.

In almost three weeks, the (FIFA) World Cup will start, in Brazil. I have to admit that I am not a big fan of soccer, so I will not talk too much about it. Actually, I wanted to talk about colors, and variations on some colors. For instance, there are a lot of blues. In order to visualize standard blues, let us consider the following figure, inspired by the well known chart of R colors,
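A minimal sketch of such a figure, assuming that “standard blues” means the named R colors whose name contains “blue” (the layout, with six swatches per row, is an arbitrary choice and probably differs from the original chart):

```r
# Named R colors whose name contains "blue"
blues <- grep("blue", colors(), value = TRUE)
n <- length(blues)

# Position of each swatch on a grid with six columns
n_cols <- 6
n_rows <- ceiling(n / n_cols)
col_id <- (seq_len(n) - 1) %% n_cols
row_id <- (seq_len(n) - 1) %/% n_cols

# Empty canvas, then one labelled rectangle per color
plot(NULL, xlim = c(0, n_cols), ylim = c(0, n_rows),
     axes = FALSE, xlab = "", ylab = "")
rect(col_id, row_id, col_id + 1, row_id + 1, col = blues, border = "white")
text(col_id + 0.5, row_id + 0.5, labels = blues, cex = 0.5)
```

Replacing `"blue"` in the pattern gives the same chart for any other family of named colors.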

In order to illustrate the use of palette colors, consider some data, on soccer players (officially registered). The dataset – lic-2012-v1.csv – can be downloaded from http://data.gouv.fr/fr/dataset/… (I will also use a dataset we have on location of all towns, in France, with latitudes and longitudes)

The problem with France (I should probably say one of the many problems) is that régions and départements are not well coded in the standard functions. To explain where départements are, let us use the dept.rda file, and then, we can get a matching between R names and standard (administrative) ones,

In this paper, we investigate (and extend) Ripley’s circumference method to correct the bias of density estimation on edges (or frontiers) of regions. The idea of the method was theoretical and difficult to implement. We provide a simple technique – based on properties of Gaussian kernels – to efficiently compute weights to correct border bias on frontiers of the region of interest, with an automatic selection of an optimal radius for the method. An illustration on the location of bodily-injury car accidents (and hot spots) in the western part of France is discussed, where a lot of accidents occur close to large cities, next to the sea.

Sketches of the R code can be found in the paper, to produce maps, and to describe the impact of our boundary correction. For instance, in Finistère, the distribution of car accidents is the following (with a standard kernel on the left, and with correction on the right), with 186 claims (involving bodily injury)

and in Morbihan with 180 claims, observed in a specific year (2008 as far as I remember),

The code is the same as the one mentioned last year, except perhaps plotting functions. First, one needs to define a color scale and associated breaks
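As a minimal sketch of that first step (this is not the code from the paper): the matrix `z` below is a hypothetical stand-in for an estimated density on a grid, and the number of colors, the quantile-based breaks and the white-to-red palette are arbitrary choices.

```r
# Hypothetical density values on a 10 x 10 grid (stand-in for a kernel estimate)
set.seed(1)
z <- matrix(rexp(100), nrow = 10, ncol = 10)

# Breaks based on quantiles, so that each color covers about the same number of cells
n_col <- 9
breaks <- quantile(z, probs = seq(0, 1, length.out = n_col + 1))

# One color per interval between two consecutive breaks
pal <- colorRampPalette(c("white", "yellow", "red"))(n_col)

# Class of each cell, from 1 (lowest values) to n_col (highest values)
classes <- cut(z, breaks = breaks, include.lowest = TRUE, labels = FALSE)

# image() needs exactly one more break than colors
image(z, breaks = breaks, col = pal)
```

With equally spaced breaks (`seq(min(z), max(z), length.out = n_col + 1)`), a few hot spots would take most of the color range, which is why quantiles are used in this sketch.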

Then, we’ve applied that methodology to estimate the road network density in those two regions, in order to understand if high intensity means that it is a dangerous area, or if it is simply because there is a lot of traffic (more traffic, more accidents),

We have been using the dataset obtained from the Geofabrik website, which provides OpenStreetMap data. Each observation is a section of a road, and contains a few points, identified by their geographical coordinates, that allow us to draw lines. We have used those points to estimate a proxy of road intensity, with weights going from 10 (highways) to 1 (service roads).

In my courses on R, I usually show how to insert a picture as a background for a graph. But it is also possible to see the picture as an object, and to insert it in a graph wherever we like, as explained on the awesome blog http://rsnippets.blogspot.ca/…. (in a post published in January 2012). I wanted to insert cards in a graph. Cards can be found, e.g. on wikipedia, even French versions, like the one I used to play with when I was a kid (see e.g. the Jack of clubs, http://commons.wikimedia.org/…, or the Queen of hearts, http://commons.wikimedia.org/…). But those pictures are in svg. First, we have to export them in ppm, either using gimp, or online, with http://www.sciweavers.org/… for instance. Here, I have a copy of the 32 cards, and the code to read one, in R, is

library(pixmap)
card=read.pnm("1000px_10_of_clubs.ppm")

Then, I can plot the card using

plot(card,add=TRUE)

(on a predefined graph). The interesting part is that it is possible to plot the picture within a given box, but it has to be specified when we read the image file, using

If we want to visualize all the cards, first, we have to store the pictures (the cards) in some R format, in a list, then check the dimensions of all of them, and then, we can write some code to plot any of them, anywhere we like (again it has to be specified when we read the file, which might take a while)

Note that, here, first we read the file to check the dimensions, and then, we read it again, using the appropriate box (with height given, here 0.9). Now, it is possible to plot the 32 cards on the same graph, for a given ordering

A nice post was recently published on the rsnippets blog, about the tikzDevice R package. This package is – indeed – awesome. Even if it has been removed from the CRAN website. Of course, it can be downloaded from the archive folder, on http://cran.r-project.org/…, but also (for a more recent version) on http://download.r-forge.r-project.org/…. But first, it is necessary to install the following package.

(this is detailed, e.g. in http://yihui.name/…), then, we write a code to plot a graph. The idea is to produce a tex file which contains the graph, or more precisely which will produce a pdf graph when we compile it. We start with

I love Saint Patrick’s Day for, at least, two reasons. The first one is that, on March 17th, you can play The Pogues out loud, the second one is that it’s the only day in the year when I really enjoy getting a Guinness in a pub. And Guinness is important in statistical science (I did mention a couple of hours ago – on this blog – that beers were important for social reasons in the academic world, but that was for other reasons…)

As mentioned in all my statistics and econometrics courses, the history of statistics (I mean here mathematical statistics) is closely related to Guinness.

A long time ago, there was a Guinness Brewing Company of Dublin, which – as its name suggests – was an Irish brewing company. And the boss, who was to inherit the family business, decided to attract young students, trained in chemistry at Cambridge or Oxford.

In 1899, William Sealy Gosset, who had obtained a double degree in math and chemistry, left Oxford for Dublin. And to be quite honest, being a graduate in maths then meant that he had studied differential equations and astronomy. Basically, mathematics was useless for Guinness, and he got there thanks to his expertise in chemistry. In fact, William turned out to be also a very good administrator, but this has nothing to do with our story.

William had good memories of his studies in math, and he wondered if he could find a problem to look at. He started studies on workmanship, noting that conditions vary so much (temperature, hops, malt, manufacturing conditions…) that there were only few consistent data. The “law of errors” (the central limit theorem) cannot apply under these conditions.

In short, Bill (now we know each other a little, we’ll call him Bill) took many measurements, and noticed that the Poisson distribution could be an interesting model to work with. To make the story short, Bill managed to use statistical techniques to control the variance of the production, meaning that he was able to lower losses in the production of beer.

A nice application like this one deserved publication in a scientific journal … Well, of course the Poisson distribution has long been known (it was 1904 and a few months before, Von Bortkiewicz found elegant applications of this law, as discussed in a post a few weeks ago). But there was a disclosure issue there: Bill’s contract prohibited him from disclosing secrets to the competitors.

Meanwhile, Bill had met Karl Pearson, who was then editor of Biometrika, and who encouraged him to publish his results. In 1906, Bill, who had helped Guinness make a lot of money (doing applied mathematics can be useful), managed to take a sabbatical to work with Pearson at the Galton biometrics laboratory. Bill and Karl decided to publish the work under a pseudonym, “Student.” The legend claims that they had hesitated to use “Pupil.”

And for almost 30 years, “Mr Gosset”, honorable Guinness employee, led a dissolute life by publishing in statistical journals (after work in the brewery), always under the pseudonym “Student”. Of course, it might not be that simple. I mean, Bill had a family life, too. And his wife was the captain of the national hockey team. So I can hardly imagine Bill playing the smart ass and doing mathematical computations when it was time to wash the dishes or iron his shirt…

In 1908, he wrote a remarkable paper, “The Probable Error of a Mean”, which was noticed, at least, by Ronald Fisher. In fact, Bill found that there was an interesting law, but – like the normal – it was difficult to manipulate to obtain confidence intervals. Without a computer, he had the idea of using Monte Carlo methods to tabulate quantiles and construct his tables. And he was probably the first one to look carefully at the problem of small samples, unlike Karl Pearson, who always put the focus on the asymptotic case.

In fact, looking at his small samples, he saw in the denominator quantities very close to those manipulated by Karl, in particular the square root of a chi-square law. Well, of course, the normality assumption remained, but at least we had some results for finite samples!

For the story, William Gosset suggested using the letter z for his statistic, the ratio between the mean and the (empirical) standard deviation. But a few years later, statisticians became accustomed to using this letter for the Gaussian distribution (i.e. when the variance is known), and it became the standard to use the letter t. Hence finally the present name of “Student-t distribution” and, in regression outputs, we have the “t-test”.
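With a computer, the tabulation by simulation takes a few lines. This is only a modern analogue under stated assumptions, not Gosset’s procedure: samples are drawn with `rnorm()`, the sample size (4), the number of simulations and the seed are arbitrary, and the statistic is written in its modern form, $\sqrt{n}\,\overline{x}/s$.

```r
set.seed(2024)
n <- 4          # small sample size (arbitrary choice)
n_sim <- 1e5    # number of simulated samples (arbitrary choice)

# Each row is one sample of size n from a standard Gaussian distribution
samples <- matrix(rnorm(n * n_sim), nrow = n_sim)

# Ratio of the sample mean to its estimated standard error
m <- rowMeans(samples)
s <- sqrt(rowSums((samples - m)^2) / (n - 1))
t_stat <- sqrt(n) * m / s

# Simulated quantiles, next to Student quantiles and Gaussian quantiles
probs <- c(0.90, 0.95, 0.975)
tab <- rbind(simulated = quantile(t_stat, probs),
             student   = qt(probs, df = n - 1),
             gaussian  = qnorm(probs))
```

The first two rows of `tab` should be close to each other, and both well above the Gaussian row: this is the small-sample effect that the asymptotic approach misses.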

A legend (told by Harold Hotelling in his memoirs) claims that the Guinness family discovered this double life on the day of the death of William Gosset in 1937, when mathematicians requested financial assistance to print a volume of the works of their employee. But another legend claims that Mr Guinness himself suggested his nickname when he had expressed his intention to publish his research… So I guess we’ll never know. But at least, I’ll think about Bill when I get my first Guinness tonight (but I will probably not be able to tell this story anymore when I reach the fourth…)

In actuarial science, and insurance ratemaking, taking into account the exposure can be a nightmare (in datasets, some clients have been here for a few years – we call that exposure – while others have been here for a few months, or weeks). Somehow, simple results become more complicated to compute just because we have to take into account the fact that exposure is a heterogeneous variable.

The exposure in insurance ratemaking can be seen as a problem of censored data (in my dataset, the exposure is always smaller than 1 since observations are contracts, not policyholders),

the number of claims $N$ on the period $[0,1]$ is unobserved

the number of claims $Y$ on $[0,E]$ is observed (as well as the exposure $E$)

And as always, the variable of interest is the unobserved one, because we have to price insurance contracts with a cover period of one (full) year. So we have to model the yearly frequency of insurance claims.

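In our dataset, we have $(Y_i,E_i)$’s (observed claim counts and exposures) – or more generally also some additional covariates $X_i$’s. For ratemaking, we need to estimate $\mathbb{E}(N)$ and perhaps also $\text{Var}(N)$ (for instance to test if the Poisson assumption is valid, or not). To estimate the expected value, a natural estimate for $\mathbb{E}(N)$ (forget about covariates as a start) is
$$m_N=\frac{\sum_{i=1}^n Y_i}{\sum_{i=1}^n E_i}$$
which is also the weighted average of annualized individual counts
$$m_N=\sum_{i=1}^n \frac{E_i}{\sum_{j=1}^n E_j}\cdot\frac{Y_i}{E_i}$$
We consider the ratio of the total number of claims to the total exposure-to-risk. This estimate appears for instance if we consider a Poisson process, so that $N\sim\mathcal{P}(\lambda)$ while $Y\sim\mathcal{P}(\lambda E)$. Then, the likelihood is

$$\mathcal{L}(\lambda)=\prod_{i=1}^n \frac{e^{-\lambda E_i}[\lambda E_i]^{Y_i}}{Y_i!}$$

i.e.

$$\log\mathcal{L}(\lambda)=-\lambda\sum_{i=1}^n E_i+\sum_{i=1}^n Y_i\log[\lambda E_i]-\sum_{i=1}^n\log(Y_i!)$$

The first order condition is here

$$\frac{\partial}{\partial\lambda}\log\mathcal{L}(\lambda)=-\sum_{i=1}^n E_i+\frac{1}{\lambda}\sum_{i=1}^n Y_i=0$$

which is satisfied if

$$\widehat{\lambda}=\frac{\sum_{i=1}^n Y_i}{\sum_{i=1}^n E_i}=m_N$$

So, we do have an estimator for the expected value, and a natural estimator for $\mathbb{E}(N\vert X=x)$ is then (if we consider categorical covariates)
$$m_{N,x}=\frac{\sum_{i,X_i=x} Y_i}{\sum_{i,X_i=x} E_i}$$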
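These identities are easy to check numerically. The sketch below uses simulated data, not the dataset of this post: the portfolio size, the uniform exposures and the annual frequency of 0.08 are assumptions. It compares the ratio $m_N$, the exposure-weighted average of annualized counts, and the estimate from a Poisson regression with the logarithm of the exposure as an offset.

```r
set.seed(123)
n <- 5000
E <- runif(n, min = 0.05, max = 1)   # hypothetical exposures, smaller than 1
lambda <- 0.08                       # hypothetical annual claim frequency
Y <- rpois(n, lambda * E)            # observed counts on [0, E]

# Ratio of the total number of claims to the total exposure
m_N <- sum(Y) / sum(E)

# Same quantity, as a weighted average of annualized counts Y / E
m_N_weighted <- weighted.mean(Y / E, w = E)

# Same quantity again, as the maximum likelihood estimate of a Poisson model
fit <- glm(Y ~ 1 + offset(log(E)), family = poisson)
lambda_glm <- unname(exp(coef(fit)))

c(ratio = m_N, weighted = m_N_weighted, glm = lambda_glm)
```

The three numbers coincide (up to numerical precision), and with a categorical covariate the same ratio is simply computed within each class.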

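Now, we need an estimate for the variance, or more precisely the conditional variance. Assume (as a starting point) that all $Y_i$ have the same exposure $E$. For instance, if $E$ is one half, insured were observed only the first six months. Then $N=Y_1+Y_2$ ($Y_1$ is the number of claims on the first six months, while $Y_2$ is the number of claims on the last six months), i.e. $\text{Var}(N)=2\text{Var}(Y)$ if we assume independent increments. I.e., $\text{Var}(Y)=E\cdot\text{Var}(N)$, or conversely $\text{Var}(N)=\text{Var}(Y)/E$. More generally, it is reasonable to assume that

$$\text{Var}(Y)=E\cdot\text{Var}(N)$$

for all values of $E$. And then
$$\text{Var}(N)=\frac{\text{Var}(Y)}{E}$$
Thus, it seems legitimate to assume that the empirical variance of $Y$ can be written
$$\frac{1}{n}\sum_{i=1}^n[Y_i-\overline{Y}]^2=E\cdot S_N^2$$
Since the average of $Y$ is $E\cdot m_N$, then
$$S_N^2=\frac{1}{nE}\sum_{i=1}^n[Y_i-m_N E]^2$$
or equivalently, since $nE=\sum_{i=1}^n E$,
$$S_N^2=\frac{\sum_{i=1}^n[Y_i-m_N E]^2}{\sum_{i=1}^n E}$$
Thus, with different $E_i$’s, it would be legitimate (I guess) to consider
$$S_N^2=\frac{\sum_{i=1}^n[Y_i-m_N E_i]^2}{\sum_{i=1}^n E_i}$$
Thus, an estimator for $\text{Var}(N\vert X=x)$ is
$$S_{N,x}^2=\frac{\sum_{i,X_i=x}[Y_i-m_{N,x}E_i]^2}{\sum_{i,X_i=x} E_i}$$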
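A minimal sketch of these estimators, on simulated Poisson counts (so the variance should be close to the mean). The exposures, the frequency of 0.08 and the covariate `X`, a made-up gas type drawn independently of the counts, are assumptions for illustration.

```r
set.seed(123)
n <- 5000
E <- runif(n, min = 0.05, max = 1)   # hypothetical exposures
Y <- rpois(n, 0.08 * E)              # hypothetical Poisson counts on [0, E]
X <- sample(c("diesel", "regular"), n, replace = TRUE)  # hypothetical covariate

# Exposure-weighted mean and variance of the (unobserved) annual count
m_N  <- sum(Y) / sum(E)
S2_N <- sum((Y - m_N * E)^2) / sum(E)

# Close to 1 under the Poisson assumption, above 1 with overdispersion
dispersion <- S2_N / m_N

# Same estimators within each class of the categorical covariate
by_class <- t(sapply(split(data.frame(Y = Y, E = E), X), function(d) {
  m <- sum(d$Y) / sum(d$E)
  c(mean = m, variance = sum((d$Y - m * d$E)^2) / sum(d$E))
}))
by_class
```

Plotting `by_class[, "variance"]` against `by_class[, "mean"]`, with the first diagonal added, gives one point per class: points above the diagonal suggest overdispersion.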

This can be used to test if the Poisson assumption is valid to model frequency. Consider the following dataset,

It looks like the variance is (slightly) larger than the average (we’ll see in a few weeks how to test it, more formally). It is possible to add covariates, for instance the density of population, in the area where the policyholder lives,

The size of the circles is related to the size of the group (the area is proportional to the total exposure within the group). The first diagonal corresponds to the Poisson model, i.e. the variance should be equal to the mean. It is also possible to consider other covariates, like the gas type

or the car brand,

It is also possible to consider the age of the driver as a categorical variate

Actually, the age is interesting: we can observe on that dataset a feature that Jean-Philippe Boucher also observed on his own datasets. Let us look more carefully at where the different ages are,

On the right, we can observe young (inexperienced) drivers. That was expected. But some classes are below the first diagonal: the expected frequency is large, but not the variance. I.e. we know for sure that young drivers have more car accidents. It is not a heterogeneous class, on the contrary: young drivers can be seen as a relatively homogeneous class, with a high frequency of car accidents.

With the original dataset (here, I use only a subset with 50,000 clients), we do obtain the following graph:

If we do not observe underdispersion for young drivers, observe that those are incredibly homogeneous classes. With a clear impact of experience, since circles are moving downward from age 18 to 25.

Another disturbing story (this was – one more time – a suggestion from Jean-Philippe) is that it might be possible to consider the exposure as a standard variable, and see if the coefficient is actually equal to 1. Without any covariate,
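The idea can be sketched on simulated data (not the dataset of this post; the exposures and the frequency of 0.08 are assumptions). Instead of using the logarithm of the exposure as an offset, which forces its coefficient to be 1, it enters the Poisson regression as an ordinary explanatory variable, and the estimated coefficient is compared with 1.

```r
set.seed(123)
n <- 20000
E <- runif(n, min = 0.05, max = 1)   # hypothetical exposures
Y <- rpois(n, 0.08 * E)              # counts proportional to the exposure

# log(E) as a standard covariate, with a free coefficient (no offset)
fit <- glm(Y ~ log(E), family = poisson)
est <- coef(summary(fit))["log(E)", "Estimate"]
se  <- coef(summary(fit))["log(E)", "Std. Error"]

# Wald statistic for the null hypothesis that the coefficient equals 1
z_stat  <- (est - 1) / se
p_value <- 2 * pnorm(-abs(z_stat))
```

Here the counts are simulated with an expected value proportional to the exposure, so the coefficient should not be significantly different from 1. On real portfolios, a coefficient far from 1 would indicate that the claim frequency is not proportional to the exposure.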

An Open Lab-Notebook Experiment

Some sort of unpretentious (academic) blog, by a surreptitious economist and born-again mathematician. A blog activist, and an actuary, too. Always curious. Because academics are probably more than the sum of our publication lists, grants and conference talks...

Used to live in Paris (France), Leuven (Belgium), Hong-Kong (China), and Montréal (Canada). Professor and researcher in Montréal, currently back in Rennes (France). ENSAE ParisTech & KU Leuven Alumni