In the context of Stochastic Discrete Event Systems, simulations are often carried out for an extremely large number of combinations of input variables influencing a stochastic event. Such systems arise in a variety of engineering contexts, including manufacturing systems, communication networks, computer systems, logistics, and vehicular traffic. Due to constraints on the computationally feasible total number m of simulation runs, there is a trade-off between the number of input combinations considered and the number of simulation runs per input combination. We investigate the optimization of this trade-off in the following setup:

We consider a multiple hypothesis testing framework when the overall number of observations that can be collected is large but limited by computational constraints. A natural question in this context is whether the number of hypotheses to be tested should be limited in favor of additional observations per considered hypothesis. We provide guidelines concerning the choice of an optimum number of considered hypotheses in common testing situations. Thinking of correctly rejected null hypotheses as interesting findings, our optimization is with respect to the expected number of correct rejections while controlling for the multiple testing error. We also briefly discuss the classification setting, where a linear combination of true and false positives is considered. We demonstrate that considering an appropriate number of hypotheses in this context can lead to a substantial increase in the expected number of correct rejections.
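As a rough illustration of this trade-off, the sketch below computes the expected number of correct rejections when a fixed budget of m observations is split evenly over k hypotheses. All modelling choices here (one-sided z-tests, Bonferroni control, the effect size and the fraction of non-null hypotheses) are illustrative assumptions for the sketch, not the framework's actual testing situation.

```python
from statistics import NormalDist

def expected_correct_rejections(k, m, effect=0.5, pi1=0.1, alpha=0.05):
    """Expected number of correctly rejected nulls when a total budget
    of m observations is split evenly over k hypotheses.

    Illustrative assumptions: one-sided z-tests, a fraction pi1 of
    hypotheses truly non-null with mean shift `effect`, and Bonferroni
    control of the family-wise error at level alpha."""
    n = m / k                                     # observations per hypothesis
    z_crit = NormalDist().inv_cdf(1 - alpha / k)  # Bonferroni-adjusted cutoff
    power = 1 - NormalDist().cdf(z_crit - effect * n ** 0.5)
    return pi1 * k * power

# Scanning k shows an interior optimum: neither "test everything once"
# nor "test one hypothesis many times" maximizes the expected discoveries.
budget = 10_000
best_k = max(range(10, 2001, 10),
             key=lambda k: expected_correct_rejections(k, budget))
```

Under these assumptions the optimum lies strictly between the extremes, which is exactly the kind of guideline the abstract refers to.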

Two-stage sampling plans for a proportion nonconforming are classical components of statistical sampling standards like ISO 2859-1. However, customary standardised two-stage sampling schemes provide no mechanism for exploiting prior information in the design of sampling plans. This leads to intolerably high sample sizes for applications in auditing and in the inspection of high-quality product. Another problem of customary sampling schemes is their exclusive focus on decisions and the absence of methods supporting the estimation of the proportion nonconforming. We consider two ways of exploiting prior knowledge and present adequately designed two-stage estimation-oriented sampling plans based on confidence intervals for a proportion nonconforming.
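The estimation-oriented plans rest on confidence intervals for a proportion nonconforming. As background, here is a minimal stdlib-only sketch of the exact (Clopper-Pearson) interval; the bisection approach and the default confidence level are implementation choices of this sketch, not prescriptions of the standard or of the proposed plans.

```python
from math import comb

def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

def clopper_pearson(x, n, conf=0.95):
    """Exact (Clopper-Pearson) two-sided confidence interval for a
    proportion nonconforming, given x nonconforming items out of n,
    computed by bisection on the binomial tail probabilities."""
    alpha = 1 - conf
    def solve(below_limit):
        lo, hi = 0.0, 1.0
        for _ in range(60):                 # bisection to high precision
            mid = (lo + hi) / 2
            lo, hi = (mid, hi) if below_limit(mid) else (lo, mid)
        return lo
    # Lower limit: P(X >= x | p) increases in p; find where it hits alpha/2.
    lower = 0.0 if x == 0 else solve(lambda p: 1 - binom_cdf(x - 1, n, p) < alpha / 2)
    # Upper limit: P(X <= x | p) decreases in p; find where it hits alpha/2.
    upper = 1.0 if x == n else solve(lambda p: binom_cdf(x, n, p) > alpha / 2)
    return lower, upper
```

For example, observing zero nonconforming items in a sample of 100 yields an upper 95% confidence limit of roughly 3.6%, which indicates why zero-acceptance sampling of high-quality product demands large samples unless prior information is exploited.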

At Unilever R&D, several thousand experiments are carried out each year before a product is (re)launched on the market. For efficient use of resources, applying Design of Experiments (DoE) is desirable. The several hundred researchers, engineers and technicians involved globally in these experiments have been educated in different disciplines and at different levels, and have varying levels of statistical knowledge. Applying DoE is often not common practice for them, and statisticians are not always around for support.

There are many tools available that can help in setting up and analysing DoEs. However, these tools have some major drawbacks: 1) there is an overwhelming number of possible designs from which the user needs to choose, 2) the level of detail needed to fill in the necessary information is high, and 3) it can be difficult to understand the results of the statistical analysis of the measurement results. Without proper guidance an incorrect design could easily be chosen, or results could be interpreted erroneously, and both issues could make the difference between success and failure.

Instead of training all scientists in the basics of statistics and DoE, Unilever R&D has chosen to deploy a tool in which these difficulties have been overcome. This tool – called “Plyos” – was developed in-house and runs in JMP. First, it creates an experimental design based on information provided by the user (responses, factors and some other common details); the user is guided through this process with extensive help. Then it analyses the data from the DoE-based experiments. At the same time, the tool gives detailed explanations of how the statistical techniques and the results should be interpreted.
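The design-then-analyse workflow that such a tool automates can be sketched in a few lines. The factor names, the simulated response and the main-effects model below are purely illustrative and are not taken from Plyos itself.

```python
import itertools

import numpy as np

# Two-level full factorial design for three hypothetical factors
factors = ["temperature", "mixing_time", "concentration"]
design = np.array(list(itertools.product([-1, 1], repeat=len(factors))))

# Simulated response for illustration (real data would come from the lab):
# strong positive temperature effect, negative concentration effect
rng = np.random.default_rng(42)
response = (10 + 2.0 * design[:, 0] - 1.5 * design[:, 2]
            + 0.1 * rng.normal(size=len(design)))

# Main-effects analysis by least squares: intercept plus one
# coefficient per factor (the design matrix is orthogonal)
model = np.column_stack([np.ones(len(design)), design])
coef, *_ = np.linalg.lstsq(model, response, rcond=None)
effects = dict(zip(["intercept"] + factors, coef))
```

Because the design is orthogonal, the least-squares coefficients recover the simulated main effects almost exactly; a tool like Plyos wraps this kind of analysis in guided dialogues and plain-language interpretation.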
The presentation highlights the ways of working, the advantages and the limitations of the tool.

This work presents a Python package for a two-layer statistical meta-classifier for supervised learning on labeled skewed data. The classifier is a combination of simple classifiers (first layer) and an AdaBoost meta-classifier (second layer). This architecture has proved particularly effective on massive datasets characterized by strong asymmetry. The algorithm works with both binary and multi-class labeled data.

The Python package contains two modules: train and predict. The main module is train; it requires as input a dataset organized in an array, a vector of corresponding labels and a floating-point number t in ]0.5,1[. It returns a trained simple classifier for each feature of the dataset, a trained AdaBoost based on more than one feature, and the meta-classifier architecture. In the current beta version, the simple classifiers are all from the same family, chosen by the user among standard classifiers such as Support Vector Machine, Decision Tree or Stochastic Gradient Descent. The architecture is based on the performance of the simple classifiers: weak classifiers performing better than the threshold t are allocated to both the first layer and the AdaBoost, whereas the others are used only in the AdaBoost, or discarded if they perform poorly. The output of train can optionally be hidden. It consists of two vectors: the first has length equal to the number of features, and each of its entries takes one of three values according to how the corresponding feature has been allocated in the architecture. The second vector contains the estimates of the AdaBoost weights for the relevant features.

The module predict requires as input a datapoint as a vector and returns the predicted label. The datapoint is processed by the first layer which, according to a majority rule, either returns a final hypothesis (the predicted label) or sends the datapoint to the second layer, where it is processed by the AdaBoost, which returns the final hypothesis.
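A rough sketch of this two-layer idea, assuming decision stumps as the per-feature simple classifiers and scikit-learn's AdaBoostClassifier as the second layer, is given below. The function names, the use of balanced accuracy as the performance measure, and the agreement rule are simplifying assumptions of this sketch, not the package's actual implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.tree import DecisionTreeClassifier

def train(X, y, t=0.7):
    """Fit one decision stump per feature; stumps whose balanced accuracy
    exceeds the threshold t form the first (voting) layer, while an
    AdaBoost fitted on all features serves as the second layer.
    Balanced accuracy is used so that, on skewed data, a stump cannot
    pass the threshold by always predicting the majority class."""
    first = []
    for j in range(X.shape[1]):
        stump = DecisionTreeClassifier(max_depth=1).fit(X[:, [j]], y)
        if balanced_accuracy_score(y, stump.predict(X[:, [j]])) > t:
            first.append((j, stump))
    ada = AdaBoostClassifier(n_estimators=50).fit(X, y)
    return first, ada

def predict(x, first, ada, agree=0.8):
    """First layer: majority vote of the strong stumps; when the vote
    is not decisive enough, defer to the AdaBoost second layer."""
    x = np.asarray(x).reshape(1, -1)
    if first:
        votes = np.array([s.predict(x[:, [j]])[0] for j, s in first])
        labels, counts = np.unique(votes, return_counts=True)
        if counts.max() / len(votes) >= agree:
            return labels[counts.argmax()]
    return ada.predict(x)[0]

# Skewed binary data: roughly 90% of samples in one class
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.9], random_state=0)
first, ada = train(X, y, t=0.7)
```

The design intent is that the cheap first layer handles the easy, high-agreement points, while the AdaBoost is reserved for the ambiguous ones.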

The meta-classifier has been tested on datasets from three very different classification problems, and the results will be presented together with performance analyses.

A Comparative Study of Different Methodologies for Supervised Fault Diagnosis in Multivariate Statistical Process Control

This presentation focuses on fault diagnosis in environments where latent-variable-based multivariate statistical models are likely to be effective, i.e. when there is a large number of monitored variables with a complex multivariate correlation structure yielding ill-conditioned covariance matrices. Among these methods there are some that use existing information related to the faults to build individual models for the faults, or alternatively to characterize them with a fault direction or subspace. The advantage of these methods is that they concentrate on the root cause of the problem rather than simply providing a long list of variables suspected of behaving abnormally, potentially reducing the time needed to fix the problem. We present a comparative study of the diagnosis performance obtained when applying supervised fault diagnosis methods (e.g. fault reconstruction, fault signature and partial least squares discriminant analysis methods) to several processes, and analyse the requirements for their implementation in practice.

Statistical sampling is a fundamental investigative tool in any scientific field. Usually, the whole sample is decided prior to the measurements, and randomness is applied as in Random or Latin Hypercube (LH) sampling.
However, a different option is to start with a Random or an LH sample (sample one), followed by an adaptive sample (sample two) in which units are taken sequentially with the purpose of optimizing an objective function. This composite sampling scheme can significantly improve the trade-off between sample size and the information collected.
The core of the method is to drive the next-site selection in sample two by a sequence of kriging models, namely stationary Gaussian stochastic processes with a given autocorrelation structure. The distinctive merit of such models is their ability to promptly reconfigure themselves, changing the pattern of predictions and prediction uncertainty each time a new measurement comes in. The next sampling site can be selected via a number of model-based criteria, inspired by the principles of reducing prediction uncertainty or optimizing an objective function, or a combination of the two. Needless to say, adaptive kriging sampling can be regarded as a model-based optimizer.
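A minimal one-dimensional sketch of such adaptive kriging sampling, using scikit-learn's GaussianProcessRegressor as the kriging model and maximum prediction uncertainty as the next-site criterion, might look as follows; the response function and all settings are illustrative.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

def f(x):                              # stand-in for the measured response
    return np.sin(3 * x).ravel() + 0.5 * x.ravel()

# Sample one: a small random design on [0, 4]
X = rng.uniform(0, 4, size=(5, 1))
y = f(X)

# Sample two: adaptive next-site selection, refitting the kriging model
# (a stationary GP with RBF autocorrelation) after each new measurement
# and measuring next where the prediction uncertainty is largest.
grid = np.linspace(0, 4, 201).reshape(-1, 1)
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), normalize_y=True)
_, sd_initial = gp.fit(X, y).predict(grid, return_std=True)
for _ in range(10):
    _, sd = gp.fit(X, y).predict(grid, return_std=True)
    x_next = grid[[sd.argmax()]]       # next-site selection criterion
    X = np.vstack([X, x_next])
    y = np.append(y, f(x_next))
```

Replacing the uncertainty criterion with, say, expected improvement on a fitted objective turns the same loop into the model-based optimizer mentioned above.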