Páginas

6 dic. 2018

I am starting to use the new software Foss Calibrator, so I will publish some posts about how it works. In this case I use the software with some meat samples for a calibration feasibility study. The software improves the split of the sample set into a validation set and a calibration set, giving several options (random, time based, ...). We can also choose whether the validation set falls within the range of the calibration set, so that the model has all the validation samples inside the constituent range of the calibration; this way we quickly have the calibration and validation sets ready to develop the calibration.

For the calibration we have several options for the cross validation (leave one out, blocks, venetian blinds, ...).

For developing the calibration we can choose between several options: mPLS, PLS, ANN or LOCAL. In this case I try the mPLS models.

We can select the wavelength range, so we have to look at the spectra to see how they look and remove the noisy parts, or remove the visible region, etc.

The XY plot of measured vs. predicted values shows the calibration and validation samples overlapped and is quite useful for a quick idea of the performance of the model.

We have also the plot of the GH distances with the calibration and validation values overlapped:

We can see the statistics of the model: this time the RMSEP is the total error and the SEP is the error after bias correction, which makes it easier to compare the results with other software or with the literature.

We can publish the model (calibration and outlier model together) to a folder on our PC and get the ".eqa", ".pca" and ".lib" files to use in Win ISI or to load in MOSAIC Network or Solo, and get a report of the calibration.

I will continue sharing my experience with Foss Calibrator under the label "Learning Foss Calibrator".

11 nov. 2018

This is a function of the R caret package to check the importance of the variables in a regression. In the case of the model developed with sunflower seed to determine oleic acid (model_oleic), we can plot it and check which variables have more importance, and this is done in a single step:
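A minimal sketch of that single step, using the Tecator meat data that ships with caret in place of the sunflower-seed data (model_oleic is not reproduced here, so a small PLS model stands in for it):

```r
library(caret)
data(tecator)                      # spectra 'absorp' and reference 'endpoints'
fat <- endpoints[, 2]

set.seed(1)
fit <- train(x = absorp, y = fat, method = "pls",
             tuneLength = 5,
             trControl = trainControl(method = "cv", number = 5))

imp <- varImp(fit)                 # scaled importance of each variable
plot(imp, top = 20)                # plot the 20 most important variables
```

With a model like model_oleic the call is the same: plot(varImp(model_oleic)).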

30 oct. 2018

This is a useful tool in R to evaluate a predictive model for classification. We know the expected value and the predicted one, and from those we can get the confusion matrix and the useful statistics computed from it. I reproduce here the code from the post "How To Estimate Model Accuracy in R Using The Caret Package" from the blog "Machine Learning Mastery":

# load the libraries
library(caret)
library(klaR)
# load the iris dataset
data(iris)
# define an 80%/20% train/test split of the dataset
split <- 0.80
trainIndex <- createDataPartition(iris$Species, p = split, list = FALSE)
data_train <- iris[trainIndex, ]
data_test <- iris[-trainIndex, ]
# train a naive bayes model
model <- NaiveBayes(Species ~ ., data = data_train)
# make predictions
x_test <- data_test[, 1:4]
y_test <- data_test[, 5]
predictions <- predict(model, x_test)
# summarize results
confusionMatrix(predictions$class, y_test)

Try to understand the results: some samples are well classified and others are not, so we must try to find the model with the best classification statistics. This is a simple example, but why not apply these machine learning algorithms to spectra for classification and use the confusion matrix to select the best model? The statistics we get running the last line of code are:

10 oct. 2018

In this plot we test different types of Principal Component Analysis with different packages; this time I use caret, with the same Tecator meat data that comes with the package. The spectra are treated with MSC (Multiplicative Scatter Correction) and I represent the plane of the scores for the two components chosen by the PCA processing:

absorp_pca <- preProcess(absorpTrainMSC, method = c("center", "scale", "pca"), thresh = 0.95)
PC_scores_train <- predict.preProcess(absorp_pca, absorpTrainMSC)
plot(PC_scores_train[, 1], PC_scores_train[, 2], col = "blue", xlim = c(-15, 11), ylim = c(-20, 11), xlab = "PC1", ylab = "PC2")
PC_scores_test <- predict.preProcess(absorp_pca, absorpTestMSC)
par(new = TRUE)
plot(PC_scores_test[, 1], PC_scores_test[, 2], col = "red", xlim = c(-15, 11), ylim = c(-20, 11), xlab = "", ylab = "")

Now we get the plot of the scores for the training set in blue and for the test set in red.

3 oct. 2018

I have started to use caret, and I will continue using it, so I have to try a lot of things in R to become familiar with it.

In caret there is a data set (data(tecator)) from a Tecator instrument for meat analysis, working in transmittance in the range from 850 to 1050 nm with a total of 100 data points.

The parameters are moisture, fat and protein. You can play around with this data to become familiar with caret, so I try to create a quick regression with PCR.

Caret lets us prepare the training and testing data in a random order and train the model with several kinds of cross validation. So I wrote some code with the help I found in the available caret documentation.
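The steps above can be sketched like this (a minimal PCR example on the tecator data; the split proportion, number of folds and tuning range are my own choices, not the only ones possible):

```r
library(caret)
data(tecator)                  # 'absorp': 215 spectra, 'endpoints': moisture/fat/protein
fat <- endpoints[, 2]

# random train/test split
set.seed(123)
inTrain <- createDataPartition(fat, p = 0.75, list = FALSE)
x_train <- absorp[inTrain, ];  y_train <- fat[inTrain]
x_test  <- absorp[-inTrain, ]; y_test  <- fat[-inTrain]

# 10-fold cross validation during training
ctrl <- trainControl(method = "cv", number = 10)
pcr_fit <- train(x = x_train, y = y_train,
                 method = "pcr", tuneLength = 15, trControl = ctrl)

# predict the test set and compute the external error
preds <- predict(pcr_fit, x_test)
RMSEP <- sqrt(mean((y_test - preds)^2))
```

Changing method = "cv" to "LGOCV" or "boot" in trainControl switches the kind of resampling.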

14 sept. 2018

I use R regularly to study the validation of different equations; in this case it is an equation for cereals, which includes barley, wheat, rye, corn, oats, triticale, etc. The Monitor function in this case compares the starch values from one instrument considered the Master (Y axis) with those from another considered the Host (X axis).

The idea is to check whether there are important differences that call for an action: adjusting the bias or the slope and intercept, or even standardizing the instruments.

In this case the Monitor function gives a warning to check whether there are groups or extreme samples, which can suggest an adjustment of slope and intercept.

And indeed there is a gap with two groups of samples, so we have to consider what is happening in this case: we have a group of barley samples with lower starch values and a group of wheat and corn samples with higher starch values.

In order to evaluate this better, we have to make subsets, check the prediction statistics by group, and proceed in the best way.
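The statistics a monitor comparison like this relies on can be sketched in base R (monitor_stats is a hypothetical helper, not the blog's own Monitor function, and the starch values below are made up for illustration):

```r
# Hypothetical helper: Master values (y) vs. Host values (x)
monitor_stats <- function(y, x) {
  res <- y - x
  fit <- lm(y ~ x)
  list(bias      = mean(res),            # systematic difference
       rmsep     = sqrt(mean(res^2)),    # total error
       sep       = sd(res),              # error after bias correction (sd centres on the mean)
       slope     = unname(coef(fit)[2]),
       intercept = unname(coef(fit)[1]))
}

# Made-up starch values for illustration only:
master <- c(52.1, 60.3, 55.7, 49.8, 63.2, 58.4)
host   <- c(51.5, 59.9, 55.1, 49.0, 62.5, 57.8)
monitor_stats(master, host)
```

Computing these statistics per subset (barley on one side, wheat and corn on the other) shows whether a single bias or slope/intercept adjustment is enough or whether the groups behave differently.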

30 ago. 2018

This is not the first time Max Kuhn appears in this blog, this time with a lecture (at the last New York R Conference) with advice on how to choose the best model based on R statistics. We can certainly get good advice for finding the best possible model for our data sets.

16 ago. 2018

This is a study to develop calibrations for meat on a reflectance instrument from 1100 to 1650 nm. Normally meat is measured in transmittance, but this is an approach to do it in reflectance.

I have just 64 samples with fat laboratory data. I split the spectra into 4 sets of 16 samples and merge 3 of them, leaving the remaining one for external validation. So I have 48 samples for training and 16 for validation, and I can develop four calibrations and validate them with four external sets.
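That rotating split can be sketched in base R (the fold assignment is random; the model-fitting step is left as a placeholder):

```r
# 64 samples assigned at random to four folds of 16
set.seed(42)
folds <- sample(rep(1:4, times = 16))

for (k in 1:4) {
  train_idx <- which(folds != k)   # 48 training samples
  valid_idx <- which(folds == k)   # 16 external validation samples
  # ... develop calibration k on train_idx, validate on valid_idx ...
}
```

Each of the four calibrations sees a different 48/16 partition, which is what lets us compare the slopes of the four validations.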

Considering that we have few samples in the training set, I have to use few terms. The SEPs for external or cross validation are quite high, but the idea here is to see the changes in the slope across the four validation sets.

The reason is that we have few samples, and the slope value will stabilize as more samples are included in the calibration and validation sets.

To improve the SEP we have to check the sample presentation method for this product and the procedure used to obtain the laboratory reference values.

3 ago. 2018

NIR can be used to detect the levels of food additives and to check whether they are within the right limits.

In this case there are several types of dough, and two levels of additive concentration are used depending on the type, so we always have the same reference data.

A calibration is developed and we have new data to validate. NIR will give results which I expect to cover the reference value with a Gaussian distribution.

Using the Monitor function I can see the prediction distribution vs. the reference distribution and check whether the expectations are met.

The case of the higher concentration is fine, but the lower concentration is skewed (that is why the slope/intercept adjustment is suggested). This can be a first approach, to be continued for this application with more accurate reference values.
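The kind of check described above can be sketched in base R (the reference level and the predictions below are simulated, only to show the idea of comparing the prediction density with the fixed reference value):

```r
# Simulated example: predictions scattered around a nominal additive level
set.seed(1)
reference <- 2.5                           # nominal additive concentration (made up)
pred <- rnorm(100, mean = 2.5, sd = 0.1)   # simulated NIR predictions

plot(density(pred), main = "NIR predictions vs. reference value")
abline(v = reference, col = "red", lwd = 2)
```

A density centred on the red line matches the expectation; a shifted or skewed density is the hint that a bias or slope/intercept adjustment is needed.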

13 jul. 2018

Continuing with this post, I evaluate the LOCAL model developed in Resemble, this time using the Monitor function (one of the monitor functions I am developing).

I create different subsets from the validation sample set for the different categories; in this case it is for one type of puppy food, and I am evaluating the moisture content. We can see that there are two outliers that increase the SEP, so we have to see whether there are reasons to remove these samples.

Let's validate first with this puppy subset and check the statistics:

6 jul. 2018

Good results for the prediction of the validation samples (Xu, Yu) for protein. This is the XY plot where we can see, in different colors, the classes of the validation samples (different types of petfood). The SEP is 0.88 (without removing outliers). Defining the data frame by classes will allow us to see the SEP for every class, so we can check which class needs more samples in the training database (Xr, Yr), or check for other reasons.

29 jun. 2018

Resemble provides a number of plots which are very useful for your work or personal papers. In this case I use the same sets as in the previous post and I plot the PCA scores, where I can see the training matrix (Xr) scores and the validation matrix (Xu) scores overlapped.

The validation set is 35% (randomly selected) of the whole sample population, obtained over a long time period.

We can see how the validation samples cover more or less the space of the training samples.
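The same kind of overlapped scores plot can be made in base R by projecting the validation matrix onto the PCA of the training matrix (Xr and Xu are simulated stand-ins here, since the petfood spectra are not reproduced in the post):

```r
# Simulated stand-ins for the training (Xr) and validation (Xu) spectra
set.seed(7)
Xr <- matrix(rnorm(100 * 10), nrow = 100)
Xu <- matrix(rnorm(35 * 10),  nrow = 35)

pca <- prcomp(Xr, center = TRUE)         # PCA on the training spectra only
scores_r <- pca$x[, 1:2]                 # training scores
scores_u <- predict(pca, Xu)[, 1:2]      # validation samples projected

plot(scores_r, col = "blue", xlab = "PC1", ylab = "PC2")
points(scores_u, col = "red")
```

If the red points stay inside the cloud of blue points, the validation samples cover the space of the training samples, as the post describes.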

27 jun. 2018

The Resemble package is really interesting, so I am trying to work with it and understand it better, even to help me in the case of Win ISI LOCAL calibrations.

We can get predictions for different combinations of locally selected samples for the calibration to predict the unknown, so we can see the best option. We use a certain number of terms (min. and max.) and a weighted average is calculated.

In this case I use an external validation set of petfood Xu with Reference data (protein) Yu, and I want to know the statistics (RMSE and R square) for the case of 90 local samples selected:
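A sketch of that call with the resemble package of the time (1.x API, circa 2018; argument names changed in later versions, so check your installed version), assuming Xr/Yr are the training spectra and reference values and Xu/Yu the external validation set, as named in the post:

```r
library(resemble)

# Dissimilarities in PC space, nearest-neighbour validation of the local models
ctrl <- mblControl(sm = "pc", valMethod = "NNv")

local_fit <- mbl(Yr = Yr, Xr = Xr, Yu = Yu, Xu = Xu,
                 mblCtrl = ctrl,
                 k = seq(40, 100, by = 10),  # local set sizes; 90 is among them
                 method = "wapls1",          # weighted-average PLS
                 pls.c = c(5, 15))           # min. and max. number of PLS terms
local_fit
```

Printing the fitted object reports the RMSE and R squared for each value of k, so the k = 90 line gives the statistics the post refers to.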

26 jun. 2018

This is the link to a presentation which helps us understand the concept of LOCAL, which we will treat in the next posts with the Resemble package, and which we have treated, and will continue to treat, with LOCAL in Win ISI.

We can also use LOCAL in R with the Resemble package. I am testing the package these days with a set of petfood spectra (with protein reference values) imported from Win ISI with SNV and a second-derivative math treatment. Afterwards, I select 65% for training and the rest for testing.
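That preprocessing can be reproduced in R with the prospectr package (SNV followed by a Savitzky-Golay second derivative); the spectra matrix below is simulated, since the petfood data are not included in the post, and the derivative settings are my own choice:

```r
library(prospectr)

# Stand-in for a samples-x-wavelengths matrix of imported spectra
set.seed(1)
spectra <- matrix(rnorm(50 * 100), nrow = 50)

snv <- standardNormalVariate(spectra)            # SNV scatter correction
d2  <- savitzkyGolay(snv, m = 2, p = 2, w = 11)  # 2nd derivative, 2nd-order poly, 11-point window

# 65% for training (Xr), the rest for test (Xu)
idx <- sample(nrow(d2), round(0.65 * nrow(d2)))
Xr <- d2[idx, ]
Xu <- d2[-idx, ]
```

Note that the Savitzky-Golay filter trims (w - 1) data points from the edges of each spectrum.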

The prediction process of Resemble allows a configuration to check for the best number of samples or factors for the best prediction, so there are a lot of options and functions to explore in this package.

This is a plot of the results for a standard configuration from the reference manual, which I will try to dig deeper into, trying to find the best configuration.

14 jun. 2018

Finally happy with this plot, which tries to explain the precision and accuracy of the laboratory vs. the NIR predictions for a set of samples and subsamples. I will explain these plots in more detail in coming posts.

6 jun. 2018

This is a boxplot of meat meal predictions for four subsamples per sample. A representative sample of a certain batch has been divided into four subsamples and analyzed on a NIR instrument. So we get four predictions, one for every subsample, and the boxplot gives an idea of the variance in the predictions for every sample based on its subsamples.

The colors indicate that the subsamples were sent to two different labs, so each lab is represented by one color. The colors have a certain transparency because in some cases two subsamples went to one lab and two to the other, in other cases all four subsamples went to the same lab, and in some cases three went to one lab and one to the other.

All these studies give an idea of the complexity of the meat meal product.

3 jun. 2018

In order to better understand the performance of a model, different blind subsamples of each sample were sent to a laboratory, so in some cases we have the lab values of four subsamples of a sample and in other cases of two subsamples. There are two cases with only one subsample.

For every sample we calculate the average of the lab values and the average of the predictions, to get the red-dot residuals.

We also have the residual of every subsample vs. its prediction; those are the gray dots.

The plot (made with R) gives useful information about the performance of the model and shows how the average performs better in most cases than the individual subsamples.

We can see the warning (2·SEL) and action (3·SEL) limits, and how the predictions for the averages fall within the warning limits.
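A base-R sketch of this kind of plot, with simulated lab values and predictions (the SEL value and the data are made up; gray dots are subsample residuals, red dots the residuals of the per-sample averages):

```r
# Simulated data: 10 samples, 4 blind subsamples each
set.seed(3)
SEL <- 0.4                                     # assumed laboratory error
true_val <- rnorm(10, 50, 5)
d <- data.frame(id = rep(1:10, each = 4))
d$lab  <- true_val[d$id] + rnorm(40, 0, SEL)   # blind lab values
d$pred <- true_val[d$id] + rnorm(40, 0, SEL)   # NIR predictions

res_sub <- d$lab - d$pred                      # gray dots: subsample residuals
avg <- aggregate(cbind(lab, pred) ~ id, data = d, FUN = mean)
res_avg <- avg$lab - avg$pred                  # red dots: residuals of the averages

plot(d$id, res_sub, col = "gray", pch = 16,
     ylim = c(-3.5, 3.5) * SEL, xlab = "Sample", ylab = "Residual")
points(avg$id, res_avg, col = "red", pch = 16)
abline(h = c(-2, 2) * SEL, lty = 2)            # warning limits (2·SEL)
abline(h = c(-3, 3) * SEL, lty = 1)            # action limits (3·SEL)
```

Averaging the subsamples shrinks the residuals, which is why the red dots tend to stay inside the warning limits even when some gray dots do not.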