The evmix package
for R aims to provide a suite of tools for extreme value
threshold estimation and uncertainty quantification, following the review
paper by Scarrott and MacDonald (2013) and comparative simulation study by Hu (2013). At this time, it provides almost all of the univariate stationary extreme value mixture models proposed in the
current literature. Similar to the evd package the classic quartet of model fit diagnostics
(probability, quantile, return level and density plots) are implemented. The following threshold diagnostics are implemented:

mean residual life plot;

threshold stability plots;

Hill plot and it's variants (altHill, smooHill, altsmooHill); and

Pickand's plot.

The package also includes a wide range of kernel density estimators, with cross-validation likelihood estimation. In particular, there are few publicly available boundary corrected kernel density estimation
functions for R at this time, i.e. designed to cope with populations on bounded support (lower/upper or both). We have implemented a wide range of boundary corrected kernel density estimators, see full list below.

For each mixture model the usual distribution functions (density,
cumulative distribution function, quantile and random number generation),
likelihood, negative log-likelihood and maximum likelihood estimation
(MLE) fitting routines are provided. The usual prefixes are used for these
functions (d, p, q, r, l, nl and f respectively). For most mixture models
these functions can take vectorised parameter inputs, allowing for evaluation
of nonstationary forms of the mixture models. However, at this time the
likelihood and MLE fitting functions can only take single parameter
inputs.

Be Careful! Default Settings May Not Be Optimal...

By default the MLE based fitting procedures use black-box numerical optimisation of the complete likelihood (for all the model parameters), and
initial values are calculated from the sample data wherever possible. Standard
black-box optimisers of the complete log-likelihood of the the extreme value mixture models
are very sensitive to the initial parameter values as there are often many local modes,
which is an inherent and unavoidable feature of such models. They are particularly sensitive to the the initial value for the threshold.

The initial value for the threshold is set to be the 90% sample quantile of the data. The black-box optimisation of the complete likelihood
may get stuck in a local mode close to this initial value. If one really needs to maximise the complete likelihood then you could attempt to overcome this problem by either using alternative optimisers (e.g. SANN is an option in the fitting functions)
or manually try out different initial parameter vectors.

Profile likelihood estimation of the threshold using a grid search over potential thresholds seemingly overcomes the multi-modality problem, as the source of the problem seems to be the threshold parameter. A sequence of thresholds to be considered is specified using the "useq" fitting function input ("ulseq" and "urseq" for lower and upper tail thresholds in the two-tailed models). The threshold which maximises the profile likelihood can be either used as an initial value in the optimisation (default behaviour when "fixedu = FALSE"), or treated as a fixed threshold (when "fixedu = TRUE"). Given our current experience, we suggest that profile likelihood estimation of the threshold, which is then fixed, is the estimation approach of choice when using likelihood inference. We are considering changing this to the default approach, feedback on your experiences would be appreciated.

The user can also pre-choose the threshold as is used in the classical "fixed threshold approach" for extreme value threshold modelling. This is achieved by providing a single threshold in the "useq" input and set "fixedu=TRUE".

Downloads and Documentation

The current release version for Windows and R 3.1.2 is available
here with the source files here and is now available here on
CRAN.

There is a basic User Guide and full Reference
Manual. A submitted paper outlining the features in more detail is available here. All these are provided in the package directories (accessed via html help system for package in R).

Yang Hu wrote the original code in R and carried out a detailed simulation
study all as part of his Masters thesis which is also available from
here. Some of the code is based on functions
written for MATLAB by Anna MacDonald who's thesis is also available
here.

A published review of threshold estimation and uncertainty quantification techniques is available
here which includes such mixture models: Scarrott and MacDonald (2012). A Review of Extreme Value Threshold Estimation and Uncertainty Quantification, REVSTAT 10(1), 33-60.

The first public showcase of the package was at the EVA2013 conference with the presentation
here, which includes some extra examples of functionality and advice for practitioners and mixture model developers. The code in this presentation was included using the listing package for LaTeX can be simply copied straight into R or is available as a R script.

Version 0.2.0 was presented at the NZSA2013 conference with the presentation
here, which includes some extra examples of functionality and advice for practitioners and mixture model developers and here is the R script.

Some new features (P-spline density estimator) added in version 0.2.4 was presented at the ERCIM2014 conference with the presentation
here.

We created and continue to maintain the package in RStudio
using Roxygen2 for some automation of the
documentation creation. The raw source files used to create the current package
are available here, in case these would of use
to anyone wanting to create their own package using these tools.

Please Get in Touch...

We are really interested to hear about usage of the package and your experiences with it. So please get in touch with the maintainer Carl Scarrott

Next Update

Future updates will likely include some Bayesian inference functionality using MCMC,
thus avoiding the challenges associated with optimisation of the likelihood which can have lots of local modes as mentioned above.

Current Functionality

All functions are visible to the user, so no functionality is hidden.
The functions which are designed for the user to call directly are fully
documented and have detailed input error checking. However, some
functions are designed for internal use and so
are undocumented (and no checking of user inputs is carried out). These
are fully user visible and useable, but the user must know what they are doing
before using these functions. All functions are currently written as raw
R code, so there is an inherent performance penalty for this. We decided
to go with fully transparent code for the average user. If you are dealing
with large datasets you should look at reformulating the code in a lower
level language to make the computations more efficient.

The gamma bulk includes the exponential (Wong and Li ,2010) as a special case. Various authors (e.g. Teodorescu and Vernicu, 2013; Scollnik, 2007; Nadarajah and Baka, 2012) have considered many of the above parametric mixtures with the GPD tail constrained to be a Pareto type tail (positive shape parameter, xi > 0).

The following semi-parametric extreme value mixture model is provided:

which uses the EM algorithm for iterative estimation of the mixture weights and component parameters. Again, the mixture of gammas includes the mixture of exponentials as a special case (Lee et al, 2012). The standalone "mixture of gammas" model (no extreme value tail modelling) has also been implemented using the EM algorithm.

The following nonparametric extreme value mixture models are provided:

Additional versions of the above models have also been implemented with an extra constraint of continuity at the threshold.

The following nonstandard parametric extreme value mixture models are also implemented:

hybrid Pareto, with continuity and continuity in first derivative at threshold (Carreau and Bengio, 2008)

hybrid Pareto, with only continuity at threshold (Hu, 2013)

dynamically weighted mixture model (Frigessi, Haug and Rue, 2002)

interval transitions function mixture model with normal or Weibull and single GPD tail, and normal with both GPD tails (Holden and Haug, 2009)

Users should be aware that all these non-standard models have some unexpected properties and frequent performance issues in practice, as they do not include the tail fraction scaling of the GPD tail components which is normally included in classical extreme value tail modelling approaches.

The simple boundary correction method may provide negative density estimates,
so the non-negative density transformation method of Jones and Foster (1996) is
applied by default. The simple boundary correction method, beta and gamma boundary corrected
KDEs also require normalisation to unity, which is also applied by default.
Cross-validation likelihood is used in the MLE fitting for all the KDE based
models. However, we have found this has a positive bias for generalised
jacknifing, leading to oversmoothing. We use an alternative like the reflection
or renormalisation method in the MLE of the bandwidth in this case.

A P-splines based nonparametric density estimator (based on the approach of Eilers and Marx, 1998 in Statistical Science) is also provided for wider usage for non-extreme value problems.

All the implemented extreme value mixture models, except the non-standard ones listed above
(hybrid Pareto, dynamically weighted mixture and interval transition models), use the GPD as a conditional model
(i.e. describe the exceedances, conditional on them being above the threshold).
Therefore, when used in the mixture model the required unconditional GPD is
obtained by application of the probability of being above the threshold,
called the tail fraction. All the relevant mixture model functions permit
the user to choose between treating the tail fraction as an extra parameter
to be estimated (for which the MLE is the sample proportion of exceedances)
or to parameterise it according to the bulk model parameters. See Yang Hu's
thesis above for definitions and discussion.

The diagnostic plots and base GPD distribution functions are heavily based
on those of the evd package, so the syntax is reasonably consistent and
users can safely interchange most code for these.

Caveat

The package is still in its early developmental phase, so you can expect
some changes in the functionality, and full backward compatibility is not
guaranteed. We will continue to add functionality in each release. If you
find bugs then do let us know (carl.scarrott@canterbury.ac.nz), but always
provide a reproducible example.

If you wish to extend the functionality of the package with your own relevant code,
then you are welcome to submit it to us and we try our best to include it, with acknowledgment of authorship of course!

Various researchers have submitted bug reports which are appreciated and acknowledged in the help files, keep them coming...