Learning all of these multivariate analysis techniques is all well and good, but what happens when you actually have a data set in front of you? What should you do, and in what order? This post gives some suggestions on general strategies for analysing your data, along with a worked example.
The overall multivariate analysis workflow can involve different aspects of supervised and unsupervised analyses, depending on (i) the data available and (ii) the question(s) being asked.
This worked example concerns the case where (i) class membership is explicitly stated, and (ii) the focus is both on identifying patterns and important variables that distinguish the classes AND on building models and algorithms to classify unknowns into a class.
As discussed in the blog post introducing the Multivariate Analysis Workflow, the general workflow runs from data collection and formatting, through unsupervised/exploratory analysis, to supervised model building, testing and interpretation.
Scenario:
A study into the classification of types of glass was motivated by criminological investigation: at the scene of a crime, the glass left behind can be used as evidence...if it is correctly identified! A researcher analysed glass from buildings and vehicles for a range of parameters, including the refractive index (RI) and the composition of a range of elements in the glass.
The researcher analysed 146 glass samples, 70 from building window glass and 76 from vehicle glass. This data was split into a training set (110 samples) and a test set (36 samples).
The researcher was interested in seeing if they could: (i) create models capable of differentiating between glass from different sources; and (ii) use the models to understand the underlying differences between the glasses from different sources.
Analysis:
The data has already been collected, and the next stages involve reading the data into R (Section X) and checking its format (making sure that the response and explanatory variables are split).
The next stage is unsupervised/exploratory analysis of the data, looking at the explanatory variables only. Techniques for this include PCA (Section 3A, Part III), HCA (Section 3A, Part II) and heatmaps. This stage is particularly concerned with gaining an understanding of the data and identifying any underlying trends and patterns.
Following this, supervised analysis is conducted on the data, first fitting models with the training set and then testing them with the test set. This addresses both aims: creating models capable of differentiating between glass from different sources, and using those models to understand the underlying differences between the glasses. For model building, a range of different classification algorithms are possible, including SIMCA, LDA (Section 3B, Part IV), k-nearest neighbours (Section 3B, Part II), PLS-DA and random forest (Section 3B, Part II) and neural networks. Of these, PLS-DA and random forest are generally the easiest to interpret, via selectivity ratios/loadings and variable importance measures respectively.
R Output:
#Reading in the data and checking
glass.training<-read.table(file.choose(), sep=",", header = TRUE)
glass.test<-read.table(file.choose(),sep=",", header = TRUE)
summary(glass.training)
## Type RI Na Mg
## Length:110 Min. :1.512 Min. :10.73 Min. :0.000
## Class :character 1st Qu.:1.517 1st Qu.:12.85 1st Qu.:3.333
## Mode :character Median :1.518 Median :13.16 Median :3.515
## Mean :1.519 Mean :13.14 Mean :3.158
## 3rd Qu.:1.521 3rd Qu.:13.48 3rd Qu.:3.658
## Max. :1.534 Max. :14.77 Max. :3.980
## Al Si K Ca
## Min. :0.290 Min. :69.81 Min. :0.0000 Min. : 7.880
## 1st Qu.:1.120 1st Qu.:72.18 1st Qu.:0.2550 1st Qu.: 8.395
## Median :1.280 Median :72.72 Median :0.5600 Median : 8.680
## Mean :1.246 Mean :72.55 Mean :0.4595 Mean : 9.210
## 3rd Qu.:1.440 3rd Qu.:72.98 3rd Qu.:0.6075 3rd Qu.: 9.510
## Max. :2.100 Max. :74.45 Max. :0.8100 Max. :16.190
## Ba Fe
## Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000
## Mean :0.04282 Mean :0.07509
## 3rd Qu.:0.00000 3rd Qu.:0.15750
## Max. :3.15000 Max. :0.35000
summary(glass.test)
## Type RI Na Mg
## Length:36 Min. :1.514 Min. :12.61 Min. :3.090
## Class :character 1st Qu.:1.516 1st Qu.:12.88 1st Qu.:3.520
## Mode :character Median :1.516 Median :13.22 Median :3.580
## Mean :1.517 Mean :13.29 Mean :3.596
## 3rd Qu.:1.518 3rd Qu.:13.46 3rd Qu.:3.610
## Max. :1.522 Max. :14.86 Max. :4.490
## Al Si K Ca Ba
## Min. :0.890 Min. :71.36 Min. :0.0600 Min. :7.080 Min. :0
## 1st Qu.:1.265 1st Qu.:72.66 1st Qu.:0.5400 1st Qu.:7.955 1st Qu.:0
## Median :1.415 Median :72.92 Median :0.5900 Median :8.070 Median :0
## Mean :1.429 Mean :72.80 Mean :0.5658 Mean :8.119 Mean :0
## 3rd Qu.:1.540 3rd Qu.:73.11 3rd Qu.:0.6700 3rd Qu.:8.305 3rd Qu.:0
## Max. :2.120 Max. :73.29 Max. :1.1000 Max. :9.150 Max. :0
## Fe
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.00000
## Mean :0.04972
## 3rd Qu.:0.09250
## Max. :0.32000
#Splitting of response and explanatory variables
exp.glass.training<-glass.training[,-c(1)]
res.glass.training<-glass.training[,1]
exp.glass.test<-glass.test[,-c(1)]
res.glass.test<-glass.test[,1]
#Making sure the response is a factor (qualitative variable)
res.glass.training<-as.factor(res.glass.training)
res.glass.test<-as.factor(res.glass.test)
###First conducting unsupervised analysis/exploratory analysis on the data, looking at the explanatory variables only
#Combining the explanatory variables into one file
AllGlassExplanatory.df <- rbind(exp.glass.training,exp.glass.test)
AllGlassExplanatoryMS.df <- scale(AllGlassExplanatory.df, center = TRUE, scale = TRUE)
#HCA
glass.euclid<-dist(AllGlassExplanatoryMS.df)
glass.HCA<-hclust(glass.euclid)
plot(glass.HCA)
#From the HCA, can see that sample 71 is very different from the others; additionally, a smaller subgroup of samples (70, 72, 77, 75, 76) is also clustered distinctly and separately.
#Of all the other samples, there are two large subgroups, which themselves split quite quickly into smaller groups.
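#To make these visual groupings explicit, the tree can be cut into a fixed
#number of clusters with cutree() - a quick sketch (choosing k = 3 here is an
#assumption based on the subgroups described above, not fixed by the analysis):
glass.clusters<-cutree(glass.HCA, k = 3)
table(glass.clusters)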
#Heatmap
glass.mat<-as.matrix(AllGlassExplanatoryMS.df)
heatmap(glass.mat, Colv = NA, Rowv = NA)
heatmap(glass.mat,scale ="column")
#The HCA of the samples in the heatmap is the same as the HCA above. From the heatmap, can see that sample 71 has unusually high Ba levels, most likely causing it to be so different from the others.
#Clustering of the variables shows that RI and Ca are closely aligned and look to be positively correlated. Na and Mg are also positively correlated with each other, as are K and Al. These seem to be not strongly associated with variables like Ca, RI and Fe.
#PCA
glass.pca<-prcomp(AllGlassExplanatoryMS.df, center = TRUE, scale = TRUE)
print(glass.pca)
## Standard deviations (1, .., p=9):
## [1] 1.90316471 1.43802566 1.08929450 0.94199483 0.77134868 0.61916151 0.45781452
## [8] 0.21555702 0.04160251
##
## Rotation (n x k) = (9 x 9):
## PC1 PC2 PC3 PC4 PC5 PC6
## RI -0.50485785 0.00441147 -0.05256622 0.04090952 0.05129057 -0.32803022
## Na 0.03838445 -0.56175867 -0.35311546 -0.05027879 -0.43600004 0.38750130
## Mg 0.35941395 -0.31399289 -0.26467345 -0.19120352 0.48758791 -0.38704908
## Al 0.26782744 0.41731992 -0.32410933 0.24296547 -0.48317415 -0.02024951
## Si 0.34433464 0.12359693 0.61435705 0.01849107 0.15082897 0.42080880
## K 0.38125577 0.37131152 -0.08791498 0.04489329 -0.12317484 -0.39170681
## Ca -0.49093456 0.14110165 0.21893490 0.05492916 -0.15721255 -0.13021887
## Ba -0.18226140 0.37953583 -0.50594127 0.18156291 0.51046252 0.48101521
## Fe -0.06765353 0.30749750 -0.09151330 -0.92836837 -0.11342325 0.11478252
## PC7 PC8 PC9
## RI 0.017115293 -0.79168676 -0.05802490
## Na 0.301361061 -0.20986120 0.28393985
## Mg -0.250910759 -0.07031861 0.45844234
## Al -0.545612341 -0.18086484 0.16050314
## Si -0.006382748 -0.44545312 0.30013714
## K 0.722291025 -0.07088903 0.10812723
## Ca 0.028951620 0.29167880 0.74868981
## Ba 0.155145167 -0.01129839 0.13283969
## Fe -0.039827873 -0.05362499 0.01113644
summary(glass.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.9032 1.4380 1.0893 0.94199 0.77135 0.6192 0.45781
## Proportion of Variance 0.4024 0.2298 0.1318 0.09859 0.06611 0.0426 0.02329
## Cumulative Proportion 0.4024 0.6322 0.7641 0.86265 0.92876 0.9714 0.99464
## PC8 PC9
## Standard deviation 0.21556 0.04160
## Proportion of Variance 0.00516 0.00019
## Cumulative Proportion 0.99981 1.00000
screeplot(glass.pca)
biplot(glass.pca)
#The screeplot and cumulative variance show that the first two PCs account for only 63% of the variation. Would suggest looking at more PCs to understand the variability in the data (up to PC5 looks to be appropriate).
#The biplot of PC1 and PC2 is in line with the HCA - 71 is very far from the others - an outlier (is the Ba measurement wrong?). The other cluster of 70, 72, 76, 75 and 96 are all a bit different from the rest too. Of the others, there are two sub-clusters, one tightly packed and the other less so, but still grouped.
#Ca and RI are strongly positively correlated, as also seen in the heatmap. Al, K and Si are too, as are Ba and Fe. Si seems to be negatively correlated with RI and Ca. The "outlying" samples are higher in Ba, Ca and RI compared to the majority of samples.
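#Since PC1 and PC2 only cover ~63% of the variance, a quick sketch for
#eyeballing the higher PCs as well (purely exploratory; uses only the
#prcomp object created above):
pairs(glass.pca$x[, 1:5], labels = paste0("PC", 1:5))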
###Next conducting supervised analysis/exploratory analysis on the data, first fitting models with the training set and then testing with the test set
#Splitting the explanatory variables back into training and test sets
TrainingGlassExplanatoryMS.df <- AllGlassExplanatoryMS.df[c(1:110),]
TestGlassExplanatoryMS.df <- AllGlassExplanatoryMS.df[-c(1:110),]
#SIMCA
require("mdatools")
## Loading required package: mdatools
## Warning: package 'mdatools' was built under R version 4.2.3
building.exp<-TrainingGlassExplanatoryMS.df[c(1:70),]
vehicle.exp<-TrainingGlassExplanatoryMS.df[-c(1:70),]
building.se = simca(building.exp, classname = "Building")
vehicle.se = simca(vehicle.exp, classname = "Vehicle")
glass.simca<- simcam(models = list(building.se, vehicle.se), info = 'Glass SIMCA')
summary(glass.simca)
##
## SIMCA multiple classes classification (class simcam)
##
## Number of classes: 2
## Info: Glass SIMCA
##
## Summary for calibration results
## Ncomp TP FP TN FN Spec. Sens. Accuracy
## Building 8 66 30 10 4 0.250 0.943 0.691
## Vehicle 8 38 33 37 2 0.529 0.950 0.682
plotCooman(glass.simca)
glass.simca.pred <- predict(glass.simca, TestGlassExplanatoryMS.df, res.glass.test)
summary(glass.simca.pred)
##
## Summary for SIMCA multiple classes classification result
##
## Number of classes: 2
## Ncomp TP FP TN FN Spec. Sens. Accuracy
## Building 8 17 15 3 1 0.167 0.944 0.556
## Vehicle 8 13 13 5 5 0.278 0.722 0.500
plotPredictions(glass.simca.pred)
#The SIMCA model is not that great - as a soft classifier, it is assigning most samples in the test set to BOTH classes, so it is not separating them well. Classification accuracy for Building is 56% and for Vehicle is 50%. The model is not very good for classification and I wouldn't advise its use.
#LDA
require("MASS")
## Loading required package: MASS
Glass.lda <- lda(x = TrainingGlassExplanatoryMS.df,grouping = res.glass.training)
print(Glass.lda)
## Call:
## lda(TrainingGlassExplanatoryMS.df, grouping = res.glass.training)
##
## Prior probabilities of groups:
## Building Vehicle
## 0.4727273 0.5272727
##
## Group means:
## RI Na Mg Al Si K
## Building 0.1240132 0.1413660 0.2707820 -0.5289065 -0.10441628 -0.29543661
## Vehicle 0.2328912 -0.2511281 -0.4627118 0.2069303 -0.08513247 0.03550838
## Ca Ba Fe
## Building 0.01623526 -0.05659596 -0.1186052
## Vehicle 0.33890108 0.12556882 0.2263905
##
## Coefficients of linear discriminants:
## LD1
## RI 1.134734
## Na -3.758572
## Mg -6.869009
## Al -1.243812
## Si -4.024910
## K -1.579375
## Ca -10.393782
## Ba -2.103177
## Fe 0.122006
plot(Glass.lda)
#The LD1 score distributions for the two groups do have different means, but overlap substantially. This indicates okay but not excellent separation of the classes, so expect the model to not be brilliant.
Glass.Lda.pred <- predict(Glass.lda, newdata = TestGlassExplanatoryMS.df)
print(table(res.glass.test, Glass.Lda.pred[["class"]]))
##
## res.glass.test Building Vehicle
## Building 13 5
## Vehicle 5 13
#classification accuracy = 72%
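#A small sketch for computing the accuracy quoted above directly from the
#predictions, rather than by hand from the table (this helper is a convenience
#function defined here, not part of MASS):
class.accuracy<-function(truth, pred) mean(truth == pred)
class.accuracy(res.glass.test, Glass.Lda.pred[["class"]])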
#kNN
require("class")
## Loading required package: class
glass.knn.pred<-knn(train = TrainingGlassExplanatoryMS.df, test = TestGlassExplanatoryMS.df, cl = res.glass.training, k = 5)
table(res.glass.test,glass.knn.pred)
## glass.knn.pred
## res.glass.test Building Vehicle
## Building 13 5
## Vehicle 2 16
#classification accuracy = 81%
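#The choice of k = 5 above was not tuned; a quick sketch comparing a few values
#of k on the held-out test set (a simple holdout comparison - proper tuning
#would cross-validate within the training set instead):
for (k in c(1, 3, 5, 7, 9)) {
  pred<-knn(train = TrainingGlassExplanatoryMS.df, test = TestGlassExplanatoryMS.df,
            cl = res.glass.training, k = k)
  cat("k =", k, "accuracy =", round(mean(pred == res.glass.test), 3), "\n")
}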
#PLS-DA
require(mdatools)
glass.PLS.1<-plsda(TrainingGlassExplanatoryMS.df,res.glass.training, ncomp = 9, scale = FALSE, cv = 1 )
summary(glass.PLS.1)
##
## PLS-DA model (class plsda) summary
## ------------------------------------
## Info:
## Number of selected components: 4
## Cross-validation: full (leave one out)
##
## Class #1 (Building)
## X cumexpvar Y cumexpvar TP FP TN FN Spec. Sens. Accuracy
## Cal 83.25 26.23 34 15 43 18 0.741 0.654 0.700
## Cv NA NA 32 17 41 20 0.707 0.615 0.664
##
## Class #2 (Vehicle)
## X cumexpvar Y cumexpvar TP FP TN FN Spec. Sens. Accuracy
## Cal 83.25 26.23 43 18 34 15 0.654 0.741 0.700
## Cv NA NA 41 20 32 17 0.615 0.707 0.664
getConfusionMatrix(glass.PLS.1$calres)
## Building Vehicle None
## Building 34 18 0
## Vehicle 15 43 0
plotMisclassified(glass.PLS.1)
plotSpecificity(glass.PLS.1)
plotSensitivity(glass.PLS.1)
glass.PLS.1$cvres$misclassified
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## Building 0.3818182 0.3818182 0.3454545 0.3363636 0.3090909 0.3181818 0.3
## Vehicle 0.3818182 0.3818182 0.3454545 0.3363636 0.3090909 0.3181818 0.3
## Total 0.3818182 0.3818182 0.3454545 0.3363636 0.3090909 0.3181818 0.3
## [,8] [,9]
## Building 0.3090909 0.2545455
## Vehicle 0.3090909 0.2545455
## Total 0.3090909 0.2545455
#Difficult to choose the final number of components. The lowest misclassification is with 9, by quite a margin, so have decided to go with this; it also has the best sensitivity.
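#A programmatic version of that choice - pick the component count with the
#lowest total cross-validated misclassification rate (this assumes the
#row/column layout of the misclassified matrix printed above):
which.min(glass.PLS.1$cvres$misclassified["Total", ])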
glass.PLS.final<-plsda(TrainingGlassExplanatoryMS.df,res.glass.training, ncomp = 9, scale = FALSE, cv = 1 )
glass.pls.res<- predict(glass.PLS.final, TestGlassExplanatoryMS.df, res.glass.test)
plotPredictions(glass.pls.res)
getConfusionMatrix(glass.pls.res)
## Building Vehicle None
## Building 11 7 0
## Vehicle 1 17 0
#classification accuracy = 78%
plotSelectivityRatio(glass.PLS.final, ncomp = 9, type = "h", show.labels = TRUE)
plotXYLoadings(glass.PLS.final, show.labels = TRUE)
#Al and Mg have the highest selectivity between the classes according to the selectivity ratio.
#The loadings plot indicates that Al is generally higher in the vehicle glass, while Mg is higher in the building glass.
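#VIP scores give a complementary variable-importance measure to the selectivity
#ratio (vipscores() is also part of mdatools; ncomp matched to the model above):
round(vipscores(glass.PLS.final, ncomp = 9), 2)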
#RandomForest
require("randomForest")
## Loading required package: randomForest
## Warning: package 'randomForest' was built under R version 4.2.3
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
glass.rf<-randomForest(x = exp.glass.training, y = as.factor(res.glass.training))
print(glass.rf)
##
## Call:
## randomForest(x = exp.glass.training, y = as.factor(res.glass.training))
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 15.45%
## Confusion matrix:
## Building Vehicle class.error
## Building 44 8 0.1538462
## Vehicle 9 49 0.1551724
glass.rf.res<-predict(glass.rf, newdata = exp.glass.test, type = "class")
print(table(res.glass.test, glass.rf.res))
## glass.rf.res
## res.glass.test Building Vehicle
## Building 12 6
## Vehicle 0 18
#classification accuracy = 83% for this run (30/36; the forest is stochastic, so this could change each time it is retrained)
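#Because fitting is stochastic, a sketch of making a run reproducible by
#setting the random seed before fitting (the seed value itself is arbitrary;
#fitted to a new name so the model above is left untouched):
set.seed(42)
glass.rf.repro<-randomForest(x = exp.glass.training, y = as.factor(res.glass.training))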
varImpPlot(glass.rf)
#Ca, Mg and Al most important for random forest model
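#The numeric values behind varImpPlot() can be extracted with importance()
#(mean decrease in Gini for a classification forest):
round(importance(glass.rf), 2)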
#Neural Network
require("neuralnet")
## Loading required package: neuralnet
## Warning: package 'neuralnet' was built under R version 4.2.3
glass.nn<-neuralnet(Type~.,data = glass.training, hidden = c(6,2), linear.output = FALSE)
glass.nn.pred<-predict(glass.nn, exp.glass.test)
#Convert the index of the winning output node (1 or 2) into a class label
prediction <- factor(max.col(glass.nn.pred), levels = c(1, 2), labels = c("Building", "Vehicle"))
print(table(res.glass.test, prediction))
## prediction
## res.glass.test Building Vehicle
## Building 10 8
## Vehicle 0 18
#classification accuracy = 78%
#All the models have roughly 78-83% accuracy, except for SIMCA (see comments above - its use is not advised) and LDA, which did not perform as well (again, see the comment above about the overlapping score distributions).
#Overall, it would be fine to work with the neural network, PLS-DA, random forest or kNN moving forward, although ideally the performance would be better. Perhaps the sample size needs building up to give a bigger training set?
#Ca, Mg and Al seem to be the most important variables for classification and differ most notably between building and vehicle glass.
Based on the results, a summary of the investigation can be presented and future directions can be outlined:
Model creation: all the models achieved roughly 78-83% accuracy, except for SIMCA (see comments above; its use is not advised) and LDA, which did not perform as well (again, see the comments above about the overlapping score distributions). Overall, it would be reasonable to work with the neural network, PLS-DA, random forest or kNN moving forward, although better performance would ideally be wanted.
Interpreting the models: it was apparent that Ca, Mg and Al seem to be the most important variables for classification, differing most notably between building and vehicle glass.
This is one example of a complete workflow for analysis of a complex multivariate data set – use this as a template for your own analysis!