Partial least squares in R

img_1847_ins_8287_600
Blood serum samples. From: https://health.onehowto.com

My last entry introduces principal component analysis (PCA), one of many unsupervised learning tools. I concluded the post with a demonstration of principal component regression (PCR), which essentially is a ordinary least squares (OLS) fit using the first k principal components (PCs) from the predictors. This brings about many advantages:

  1. There is virtually no limit for the number of predictors. PCA will perform the decomposition no matter how many variables you handle. Simpler models such as OLS do not cope with more predictors than observations (i.e. p > n ).
  2. Correlated predictors do not undermine the regression fit. Collinearity is a problem for OLS, by widening the solution space, i.e. destabilizing coefficient estimation. PCA will always produce few uncorrelated PCs from a set of variables, correlated or not.
  3. The PCs carry the maximum amount of variance possible. In any predictive model, predictors with zero or near-zero variance often constitute a problem and behave as second intercepts. In the process of compression, PCA will find the projections where your data points spread out the most, thus facilitating learning from the subsequent OLS.

However, in many cases it is much wiser performing a decomposition similar to the PCA, yet constructing PCs that best explain the response, be it quantitative (regression) or qualitative (classification). This is the concept of partial least squares (PLS), whose PCs are more often designated latent variables (LVs), although in my understanding the two terms can be used interchangeably.

PLS safeguards advantages 1. and 2. and does not necessarily follow 3. In addition, it follows that if you use the same number of k  LVs and p predictors, you are going to get exactly a OLS fit – each predictor gets its own LV, so it does not help much. PLS (regression) and PLS followed by discriminant analysis (PLS-DA, classification) are tremendously useful in predictive modelling. They are adequate in a wide variety of experimental designs and linear in their parameters, therefore more easily interpretable.

Today we will perform PLS-DA on the Arcene data set hosted at the UCI Machine Learning Repository that comprises 100 observations and 10,000 explanatory variables (p \gg n ) in order to diagnose cancer from serum samples. From the 10,000 features, 7,000 comprise distinct mass spectrometry (MS) peaks, each determining the levels of a protein. For some reason the contributors added 3,000 random variables, probably to test robustness against noise (check more information in the link above).

Let’s get started with R

For predictive modelling I always use the caret package, which builds up on existing model packages and employs a single syntax. In addition, caret features countless functions that deal with pre-processing, imputation, summaries, plotting and much more we will see firsthand shortly.

For some reason there is an empty column sticking with the data set upon loading, so I have added the colClasses argument to get rid of it. Once you load the data set take a look and note that the levels of most proteins are right-skewed, which can be a problem. However, the authors of this study conducted an exhaustive pre-processing and looking into this would take far too long. The cancer / no-cancer labels (coded as -1 / 1) are stored in a different file, so we can directly append it to the full data set and later use the formula syntax for training the models.

# Load caret, install if necessary
library(caret)
arcene <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/arcene/ARCENE/arcene_train.data", sep = " ",
 colClasses = c(rep("numeric", 10000), "NULL"))

# Add the labels as an additional column
arcene$class <- factor(scan("https://archive.ics.uci.edu/ml/machine-learning-databases/arcene/ARCENE/arcene_train.labels", sep = "\t"))

We can finally search missing values (NA’s). If these were present it would require some form of imputation – that is not the case. Calling any(is.na(arcene)) should prompt a FALSE.

So now the big questions are:

  • How accurately can we predict whether a patient is sick based of the MS profile of his/her blood serum?
  • Which proteins / MS peaks best discriminate sick and healthy patients?

Time to get started with the function train. With respect to pre-processing we will remove zero-variance predictors and center and scale all those remaining, in this exact order using the preProc argument. Considering the size of the sample (n = 100), I will pick a 10x repeated 5-fold cross-validation (CV) – a large number of repetitions compensates for the high variance stemming from a reduced number of folds – rendering a total of 50 estimates of accuracy. Many experts advise in favor of simpler models when performance is undistinguishable, so we will apply the “one SE” rule that selects the least complex model with the average cross-validated accuracy within one standard error (SE) from that in the optimal model. Finally, we must define a range of values for the tuning parameter, the number of LVs. In this particular case we will consider all models with 1-20 LVs.

# Compile cross-validation settings
set.seed(100)
myfolds <- createMultiFolds(arcene$class, k = 5, times = 10)
control <- trainControl("repeatedcv", index = myfolds, selectionFunction = "oneSE")

# Train PLS model
mod1 <- train(class ~ ., data = arcene,
 method = "pls",
 metric = "Accuracy",
 tuneLength = 20,
 trControl = control,
 preProc = c("zv","center","scale"))

# Check CV profile
plot(mod1)

Rplot

This figure depicts the CV profile, where we can learn the average accuracy (y-axis, %) obtained from models trained with different numbers of LVs (x-axis). Note the swift change in accuracy among models with 1-5 LVs. Although the model with six LVs had the highest average accuracy, calling mod1 in the console will show you that, because of the “one SE” rule, the selected model has five LVs (accuracy of 80.6%).

Now we will compare our PLS-DA to the classifier homolog of PCR  – linear discriminant analysis (LDA) following a PCA reductive-step (PCA-DA, if such thing exists). Note that the PCA pre-processing is also set in the preProc argument. We can also try some more complicated models, say, random forests (RF). I will not go into details about the workings and parameterisation of RF, as for today it will be purely illustrative. Note that RF will comparatively take very long (~15min in my 8Gb MacBook Pro from 2012). As usual, a simple object call will show you the summary of the corresponding CV.

# PCA-DA
mod2 <- train(class ~ ., data = arcene,
method = "lda",
metric = "Accuracy",
trControl = control,
preProc = c("zv","center","scale","pca"))

# RF
mod3 <- train(class ~ ., data = arcene,
method = "ranger",
metric = "Accuracy",
trControl = control,
tuneGrid = data.frame(mtry = seq(10,.5*ncol(arcene),length.out = 6)),
preProc = c("zv","center","scale"))

Finally, we can compare PLS-DA, PCA-DA and RF with respect to accuracy. We will compile the three models using caret::resamples, borrowing the plotting capabilities of ggplot2 to compare the 50 accuracy estimates from the optimal cross-validated model in the three cases.

# Compile models and compare performance
models <- resamples(list("PLS-DA" = mod1, "PCA-DA" = mod2, "RF" = mod3))
bwplot(models, metric = "Accuracy")

Rplot

It is clear that the long RF run did not translate into a excelling performance, quite the opposite. Although in average all three models have similar performances, the RF displays a much larger variance in accuracy, which is of course a concern if we seek a robust model. In this case PLS-DA and PCA-DA exhibit the best  performance (63-95% accuracy) and either model would do well in diagnosing cancer in new serum samples.

To conclude, we will determine the ten proteins that best diagnose cancer using the variable importance in the projection (ViP), from both the PLS-DA and PCA-DA. With varImp, this is given in relative levels (scaled to the range 0-100).

plot(varImp(mod1), 10, main = "PLS-DA")
plot(varImp(mod2), 10, main = "PCA-DA")

Rplot01

Rplot02

Very different figures, right? Idiosyncrasies aside, how many PCs did the PCA-DA use? By calling mod2$preProcess, we learn that “PCA needed 82 components to capture 95 percent of the variance”. This is because the PCA pre-processing functionality in caret uses, by default, as many PCs as it takes to cover 95% of the data variance. So, in one hand we have a PLS-DA with five LVs, on the other hand a PCA-DA with 82 PCs. This not only makes PCA-DA cheaply complicated, but arcane at the interpretation level.

The PLS-DA ViP plot above clearly distinguishes V1184 from all other proteins. This could be an interesting cancer biomarker. Of course, many other tests and models must be conducted in order to provide a reliable diagnostic tool. My aim here is only to introduce PLS and PLS-DA.

And that’s it for today, I hope you enjoyed. Please drop me a comment or message if you have any suggestion for the next post. Cheers!

Advertisements

7 thoughts on “Partial least squares in R

  1. Very interesting post. I ‘m quite puzzled by the fact that the ten most important biomarkers of cancers are completely different among approaches. Some proteins should be common, at least V1184, no?

    Like

    1. Hi Auguet! As I wrote above, “in one hand we have a PLS-DA with five LVs, on the other hand a PCA-DA with 82 PCs” – keep in mind that the PCA-DA fit is unconstrained! Technically speaking, we should also cross-validate the number of PCs in the PCA, but actually I don’t know how to do it with the functions above – wrapping them into a function along with prcomp should do it. I am very positive that the optimal number of PCs for the PCA-DA is lower than that, and the resulting ViP should be likely similar (assuming the performance is also similar!). Cheers

      Like

  2. Very interesting post! I wonder what is the difference (if any!) between PLS and Redundancy Analysis (RA). From what I understand, both procedures do something like a PCA, but constraining the PCs such that the variance explained AND the fit to the dependent variable(s) is maximized. Please comment.

    Like

    1. Hi Gerardo, thanks! That is a very good question, I actually don’t know what RA is. From what I read over different webpages, it seems that RA relates two datasets X and Y. Note that this Y is no longer a single variable, but multiple! This RA is very much like canonical correlation analysis (CCA) but

      – using the same decomposition that underlies PCA and PLS, as you correctly pointed out
      – maximizing the correlation of the resulting PCs in X (resp. PCs in Y) with the variables in Y (resp. X)

      If so, it only differs from O2PLS in that O2PLS maximizes correlation between PCs (PC1X with PC1Y, PC2X with PC2Y, and so on), similarly to CCA. I hope this helps?

      Francisco

      Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s