At the end of this short but precise tutorial, you'll understand what a GEV (generalized extreme value distribution) is and how to fit it to your data in Python.

Before speaking of extreme value statistics, we need to keep in mind that statistics in general focuses on average values and observable phenomena, for which we do have data. Extreme value statistics differs from "average" statistics in this regard: as we will see, here we try to model something for which we have no data at all! You will soon see what we mean by that. Before getting there, I want to show you why extreme value modelling is a real problem, of high interest to insurance companies and public services. Let's start with two examples.

Montpellier, France, 2014

3 hours of rain = 50% of the total annual average. Montpellier is not known as a city with high annual rainfall; located in a dry part of southern France, its average value is very ordinary. But it is not excluded that sometimes, rarely, very high values appear, as we can see:

Saltina river in Brigue, Switzerland, 1993

Unprecedented measured flows and precipitation in the twentieth century.
Same problem here: this river has reasonable flows on average, but sometimes the flow can be exceptional.

The number of natural catastrophes has been growing since 1980

These "extreme" events tend to become more numerous these days, which makes extreme value analysis "popular".

Some might wonder what statistical model we should use to predict a Nadal loss at Roland-Garros

(Almost) just kidding

Question: what if we wanted to quantify the probability of such events?

As we said, the particularity of extreme value analysis is that it models not an average phenomenon but an extreme one.

The answer is not satisfying, because it is surely not "impossible" to observe a flow greater than $400\ m^3/s$.

Partial conclusion: we cannot estimate the probability of such a flow with the histogram!

And now we get precisely to what we talked about before: we want to compute the probability of an event for which we have absolutely no data. In statistical modelling, we usually train an algorithm on a dataset that is supposed to have the same distribution as the test data; in other words, we have data for what we want to model. This is where extreme value theory comes into play:

"Classic" statistics does not allow us to compute probabilities such as $P(X > x)$ when $x$ is beyond the maximum of our observations.

A "famous" joke illustrates this problem: during the night, someone is crouching at the foot of a lamppost, and a passerby asks him: "What are you doing?" "I'm looking for my keys." "Are you sure you lost them near the lamppost?", the passerby asks. "No, but that's the only lighted place."

Extreme value statistics is a bit like this joke: we want to do statistics where there is no light, where there is no data.

Depending on the value of $\gamma$, there are type I, type II and type III GEVs, also called the Gumbel ($\gamma=0$), Fréchet ($\gamma>0$) and Weibull ($\gamma<0$) distributions.
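We can look at these three types directly with scipy. One caveat worth knowing: scipy parametrizes the GEV with a shape parameter `c` equal to $-\gamma$, so the sign is flipped with respect to the convention used here. A minimal sketch:

```python
from scipy.stats import genextreme

# scipy's shape parameter c is the NEGATIVE of the usual gamma:
#   Gumbel  : gamma = 0  ->  c = 0
#   Frechet : gamma > 0  ->  c < 0   (heavy upper tail)
#   Weibull : gamma < 0  ->  c > 0   (bounded upper tail)
for name, gamma in [("Gumbel", 0.0), ("Frechet", 0.5), ("Weibull", -0.5)]:
    dist = genextreme(c=-gamma)
    # Compare tail probabilities P(X > 5) across the three types
    print(f"{name:8s} gamma={gamma:+.1f}  P(X > 5) = {dist.sf(5.0):.6f}")
```

The heavier Fréchet tail shows up clearly: its survival probability far from the center dominates the Gumbel one, while the Weibull type vanishes beyond its finite endpoint.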

If you have a random variable $X$ and study the distribution of its maximum over $n$ samples, $X_{n,n}$, this distribution must "fall" in the domain of attraction of one of these three types of GEV.

Let's use our Nidd River dataset again: we will fit a GEV to it and estimate extreme quantiles.
In practice we will estimate the three parameters $\gamma$, $a_{n}$ and $b_{n}$ such that:
$$P(X_{n,n} \leq x) \simeq H_{\gamma}(\frac{x-a_{n}}{b_{n}})$$

So first we need to turn our quarterly data, that is, our observations of $X$, into observations of $X_{n,n}$, let's say yearly maxima. We will create a vector of observations by taking the maximum flow for each year:
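A minimal sketch of this block-maxima step, using a hypothetical quarterly series (the real Nidd observations would replace the simulated values):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical quarterly flow observations (m^3/s) over 30 years;
# in the tutorial this would be the real Nidd series.
flows = pd.Series(
    rng.gumbel(loc=100, scale=30, size=30 * 4),
    index=pd.date_range("1990-01-01", periods=30 * 4, freq="QS"),
)

# Block maxima: keep only the largest flow observed each year
yearly_max = flows.groupby(flows.index.year).max()
print(yearly_max.head())
```

Each calendar year becomes one block of $n = 4$ observations, and the resulting vector of 30 yearly maxima is what we will fit the GEV on.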

We can now calculate the probability that the river's flow exceeds $350\ m^3/s$:
Indeed, the observations of $X$ that we have are independent, hence:
$$P(X_{n,n} \leq x) = F^{n}(x)$$
and from the extreme value theorem this means $F(x) \simeq H_{\gamma}^{1/n}(\frac{x-a_{n}}{b_{n}})$. Writing $P(X \geq x) = 1 - F(x) \simeq 1 - e^{\frac{1}{n}\log H_{\gamma}}$ and expanding $e^{u} \simeq 1+u$ gives:
$$P(X\geq x) \simeq -\frac{1}{n}\log H_{\gamma}\left(\frac{x-a_{n}}{b_{n}}\right)$$
Therefore we now have an approximation of the tail of the survival function:
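As a sanity check, we can compare the exact survival probability $1 - H_{\gamma}^{1/n}$ with this first-order approximation. The parameter values below are hypothetical stand-ins; in practice they would come from the fit:

```python
import numpy as np
from scipy.stats import genextreme

n = 4                                  # observations per block (quarterly -> yearly)
gamma, loc, scale = 0.3, 120.0, 40.0   # hypothetical fitted parameters

x = 350.0
# H_gamma((x - a_n) / b_n); remember scipy's c = -gamma sign convention
H = genextreme(c=-gamma, loc=loc, scale=scale).cdf(x)

exact = 1 - H ** (1 / n)   # P(X >= x) from F = H^{1/n}
approx = -np.log(H) / n    # first-order expansion
print(f"exact = {exact:.6f}, approx = {approx:.6f}")
```

Far in the tail, $H$ is close to 1, so the two values agree to several digits, which is exactly why the expansion is legitimate there.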

You now understand the problem with extreme quantiles in classical statistics.

You now know that there exists an extreme value theorem, analogous to the CLT, which tells us there are three types of distributions for the maximum of any random variable, depending on a shape parameter $\gamma$: Gumbel, Fréchet, Weibull.

To fit such a distribution we need a dataset of maxima, and to estimate $\gamma$, $a_{n}$ (loc) and $b_{n}$ (scale).

We can do that very quickly with scipy.stats.genextreme.
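A minimal sketch of such a fit, on hypothetical yearly maxima (the real Nidd vector would replace them); recall that scipy's shape parameter `c` equals $-\gamma$:

```python
import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(42)
# Hypothetical yearly maxima drawn from a GEV with gamma = 0.2
yearly_max = genextreme.rvs(c=-0.2, loc=120, scale=40, size=60, random_state=rng)

# Maximum-likelihood fit of the three parameters
c_hat, loc_hat, scale_hat = genextreme.fit(yearly_max)
print(f"gamma = {-c_hat:.3f}, a_n (loc) = {loc_hat:.1f}, b_n (scale) = {scale_hat:.1f}")

# Extreme quantile: the level exceeded on average once every 100 years
q100 = genextreme.ppf(0.99, c_hat, loc=loc_hat, scale=scale_hat)
print(f"100-year return level: {q100:.1f} m^3/s")
```

The 100-year return level lies well beyond the largest observed maximum, which is precisely the kind of extrapolation classical statistics cannot justify.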

With it we can compute extreme quantiles and extreme value probabilities where we have no data to learn from!

What we haven't covered yet:

It is sometimes hard to form a dataset of maxima. There is a similar alternative based on exceedances over a threshold instead of maxima, which is generally easier to apply. The distribution to fit is then called a GPD (generalized Pareto distribution), available as scipy.stats.genpareto.
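A brief sketch of this peaks-over-threshold approach, again on a hypothetical flow series (the threshold choice via an empirical quantile is one common rule of thumb, not the only one):

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(1)
# Hypothetical raw flow series; replace with real observations
flows = rng.gumbel(loc=100, scale=30, size=2000)

u = np.quantile(flows, 0.95)     # threshold: empirical 95% quantile
excesses = flows[flows > u] - u  # peaks over threshold

# Fit a GPD to the excesses (note: genpareto, not scipy.stats.pareto,
# which is the ordinary Pareto distribution)
shape, loc, scale = genpareto.fit(excesses, floc=0)
print(f"threshold = {u:.1f}, shape = {shape:.3f}, scale = {scale:.1f}")
```

Because every exceedance contributes an observation, this approach typically uses the data more efficiently than keeping only one maximum per block.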

Thanks for your attention! God bless you!
