Education Article

Published: Jan 1, 2000

Channels: Chemometrics & Informatics

1. The log-standardised data are presented below.

Taking logarithms means that excessively intense peaks do not dominate the analysis; the transformation is often applied to environmental and food samples. For peak E this has a beneficial effect on the analysis and means that the variable will be quite useful for classification. The transformation is not necessary for every variable (e.g. peak B), but it is better to transform all the data in the same way. Standardising means that each variable has similar significance, so the variability of B, for example, is considered as important as that of D.
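The transformation described above can be sketched as follows. This is a minimal illustration on randomly generated intensities, not the article's data; the function name log_standardise is hypothetical. It assumes the population standard deviation is used, which is consistent with the total sum of squares quoted in answer 2.

```python
import numpy as np

def log_standardise(X, log=np.log10):
    """Take logs, then centre each column and divide by its population std."""
    logX = log(X)
    mean = logX.mean(axis=0)
    std = logX.std(axis=0, ddof=0)   # population standard deviation (assumption)
    return (logX - mean) / std, mean, std

# Hypothetical peak intensities: 14 samples x 8 peaks (not the article's data)
rng = np.random.default_rng(0)
X = rng.lognormal(mean=2.0, sigma=1.0, size=(14, 8))

Z, mean, std = log_standardise(X)

# The base of the logarithm is only a constant factor, so it cancels
# once each column is divided by its own standard deviation.
Z_natural, _, _ = log_standardise(X, log=np.log)
```

After this step every column has mean zero and unit population variance, so no single peak dominates purely because of its intensity.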

2. The results of PCA are as follows.

The first two eigenvalues represent 78.38% of the variability. Note that because the data have been standardised, the total sum of squares equals 14 × 8 = 112.
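The link between the eigenvalues and the total sum of squares can be checked numerically. The sketch below uses hypothetical random data of the same shape (14 samples, 8 variables) and takes the eigenvalues as the squared singular values of the standardised matrix, so its percentage will not match the figure quoted above.

```python
import numpy as np

# Hypothetical data of the same shape as the problem (not the article's data)
rng = np.random.default_rng(1)
X = rng.lognormal(size=(14, 8))

# Log-standardise with the population standard deviation
L = np.log(X)
Z = (L - L.mean(axis=0)) / L.std(axis=0)

# Eigenvalues of Z'Z are the squared singular values of Z
singular_values = np.linalg.svd(Z, compute_uv=False)
eigenvalues = singular_values ** 2

total = eigenvalues.sum()                       # equals 14 * 8 for standardised data
pct_first_two = 100 * eigenvalues[:2].sum() / total
print(f"total sum of squares = {total:.2f}")
print(f"first two PCs explain {pct_first_two:.2f}% of the variability")
```

Each standardised column contributes a sum of squares equal to the number of samples, which is why the total is samples × variables.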

The scores plot is shown below.

Object FM is indicated and appears to be a clear outlier.

3. The new log-standardised data, together with the scores plot, are presented below.

Discrimination appears better; in particular, PC2 is much more useful, whereas for the entire dataset it was influenced primarily by sample FM.

4. The two new datasets are given below.

5. The two loadings vectors are as follows.

          A       B       C       D       E       F       G       H
Fp    0.334  -0.436   0.443  -0.333   0.171  -0.438   0.291   0.292
Sp    0.444   0.016   0.444  -0.190  -0.229  -0.280   0.472   0.464
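As a quick sanity check on the loadings, the short sketch below confirms that each vector has approximately unit length and compares their directions. It assumes that Fp and Sp denote the loadings of the "fresh" and "stored" class models respectively; the variable names are illustrative.

```python
import numpy as np

peaks = list("ABCDEFGH")
# Loadings copied from the answer to question 5
fp = np.array([0.334, -0.436, 0.443, -0.333, 0.171, -0.438, 0.291, 0.292])
sp = np.array([0.444, 0.016, 0.444, -0.190, -0.229, -0.280, 0.472, 0.464])

# PCA loadings are normalised, so each sum of squares should be close to 1
fp_norm = np.linalg.norm(fp)
sp_norm = np.linalg.norm(sp)

# Cosine of the angle between the two loadings vectors
cosine = fp @ sp / (fp_norm * sp_norm)

# Peak with the largest difference in loading between the two models
largest_gap = peaks[int(np.argmax(np.abs(fp - sp)))]
```

A cosine well below 1 indicates that the two class models point in noticeably different directions in the eight-variable space.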

6. The predicted scores using both models are given as follows.

The prediction of X (on the 13 samples) using the "fresh" model is as follows.

The prediction using the "stored" model is as given below.

Notice that the predictions are of the log-standardised data. It is also important to recognise that the two models predict different numbers, because the standardising is performed separately for each class. The aim is to predict the two datasets of question 4. For example, the numerical predictions of measurement A of swede FH appear very different under the two models, but in both cases the prediction is reasonably good.
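The point about separate standardising can be illustrated with a one-component model per class. The sketch below is written under stated assumptions: the data are randomly generated, the split of 7 "fresh" and 6 "stored" samples is hypothetical, and the helper names fit_class_model and predict are not from the original question.

```python
import numpy as np

def fit_class_model(X_class):
    """Log-standardise one class and keep its first principal component."""
    L = np.log(X_class)
    mean, std = L.mean(axis=0), L.std(axis=0)      # this class's own scaling
    Z = (L - mean) / std
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return {"mean": mean, "std": std, "p": Vt[0]}  # p: unit-length loadings

def predict(model, X_new):
    """Scale X_new with the class's mean/std, then project and rebuild."""
    Z = (np.log(X_new) - model["mean"]) / model["std"]
    t = Z @ model["p"]                 # predicted scores
    Z_hat = np.outer(t, model["p"])    # predicted log-standardised data
    return Z, t, Z_hat

# Hypothetical intensities: 7 "fresh" and 6 "stored" samples, 8 peaks
rng = np.random.default_rng(2)
X_fresh = rng.lognormal(mean=2.0, sigma=0.5, size=(7, 8))
X_stored = rng.lognormal(mean=2.5, sigma=0.5, size=(6, 8))
X_all = np.vstack([X_fresh, X_stored])

fresh_model = fit_class_model(X_fresh)
stored_model = fit_class_model(X_stored)

# The same 13 samples are scaled differently by each model,
# so each model is trying to predict a different target matrix.
Z_f, t_f, Zhat_f = predict(fresh_model, X_all)
Z_s, t_s, Zhat_s = predict(stored_model, X_all)
```

Because Z_f and Z_s differ, the two predictions of a given measurement should each be judged against its own scaled value rather than against each other.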

7. The class distances, calculated as indicated in the question, are as follows.

The class distance plot is given below. FA is slightly dubious, but there are no obvious outliers and no other samples that appear to belong to both classes. Notice that FA has a positive score for PC2 in question 2, and is closest to the stored swedes in question 3. Possibly it could be removed from the model, but in practice samples whose distances to the two classes are similar could be re-examined, so SIMCA is a valuable form of screening.
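The exact distance formula is the one specified in the question, which is not reproduced here. As an assumption, the sketch below takes the class distance of a sample to be the root sum of squares of its residuals after projection onto a one-component class model, and uses a small toy example in two dimensions.

```python
import numpy as np

def class_distance(Z, p):
    """Root sum of squared residuals of each row of Z from a one-PC model.

    Z: data already scaled for the class; p: unit-length loadings vector.
    This definition is an assumption, not necessarily the question's formula.
    """
    t = Z @ p                        # scores on the class model
    residual = Z - np.outer(t, p)    # part of Z the model cannot explain
    return np.sqrt((residual ** 2).sum(axis=1))

# Toy data: three samples, two variables, two hypothetical class models.
# For simplicity the same Z is used for both; in the real problem each
# class has its own scaled copy of the data.
Z = np.array([[2.0, 0.0],
              [0.0, 3.0],
              [1.0, 1.0]])
p_a = np.array([1.0, 0.0])
p_b = np.array([0.0, 1.0])

d_a = class_distance(Z, p_a)   # → [0., 3., 1.]
d_b = class_distance(Z, p_b)   # → [2., 0., 1.]

# Samples with similar distances to both classes are candidates for re-examination
ambiguous = np.isclose(d_a, d_b)
```

Plotting d_a against d_b gives a class distance plot of the kind discussed above, with ambiguous samples lying near the diagonal.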

8. To include the extra two samples, remember to perform scaling correctly and independently for each class.