Author: Francisco Lima

Convolutional Neural Networks in R

Can we teach a computer to distinguish cats and dogs? [source]

Last time I promised to cover the graph-guided fused LASSO (GFLASSO) in a subsequent post. In the meantime, I wrote a GFLASSO R tutorial for DataCamp that you can freely access here, so give it a try!

The plan here is to experiment with convolutional neural networks (CNNs), a form of deep learning. CNNs underlie most advanced recognition algorithms used by the major tech giants. The recent development of back-end optimization tools and hardware (from Intel, NVIDIA and Google to name a few) now enables training CNNs on conventional laptop machines, hence accessible to a broader audience.

Today you will construct a binary classifier that can distinguish between dogs and cats from a set of 25,000 pictures, using the Keras R interface powered by the TensorFlow back-end engine. The code is available from a dedicated repo so you don’t have to copy-paste the snippets below.  If you fall short of RAM please consider adapting the script as to use less pictures or split, process and save them in separate instances. Finally, I encourage you to use the RStudio terminal shell to fetch the Dogs vs. Cats dataset from Kaggle via its new API feature. I will provide more detailed instructions below. If you want to pass the theory, scroll all the way down to the ‘Let’s get started with R’ section. Enjoy!

Neural Networks

Driverless cars were out there as far back as 1989. Neural networks (NNs) have been around for a long time, so what triggered this craze around artificial intelligence and deep learning in recent years? The answer partly lies in Moore’s law and the remarkable improvement of hardware and computing power – we can now do a lot more with a lot less. The concept of NNs, as the name suggests, was inspired by the network of our own brain neurons. Neurons are very long cells, each with protrusions called dendrites that receive and propagate electrochemical signals from and to surrounding neurons, respectively. As a result, our brain cells form flexible, robust communication networks that sequentially process cascading inputs. This distributive process, akin to an assembly line, is supportive of sophisticated cognitive abilities such as music playing and painting.

There is enough about NNs to write books and I do not intend to re-invent the wheel here. There are great free resources you can learn from, to master both basic and advanced concepts. I recommend the Intel® AI Academy, the  Coursera Machine Learning course taught by Andrew Ng and DataCamp for the Python enthusiasts. I will cover only some key features pertaining to classification problems.

Basic architecture

A NN typically contains one input layer, one or more hidden layers, and an output layer. The input layer consists of your p predictors, or input units / nodes. Needless to say, it is generally good practice to center, scale and transform predictors, if not at least to speed up the optimization procedure. These input units can be connected to one or more hidden units in the first hidden layer. A hidden layer that is fully connected to the preceding layer is designated dense. In the diagram below, both hidden layers are dense.

Schematic representation of a neural network with two hidden layers [source]
The output layer computes the prediction, and the number of units therein is determined by the problem in hands. Conventionally, a binary classification problem requires a single output unit (as shown above), whereas a multiclass problem with k classes will require k corresponding output units. The former can simply use a sigmoid function to directly compute a probability, while the latter usually requires a softmax transformation, whereby all values across all k output units sum up to one and can thus be treated as probabilities. Rather than having categorical predictions you can retrieve the actual probabilities, which are much more informative, and inspect their quality using calibration plots and lift charts. These can be discussed in a future tutorial.


Every arrow displayed in the diagram above passes on an input that is associated with a weight. Each weight is essentially one of many coefficient estimates that contribute to the regressions computed in the nodes the corresponding arrows point to. These are unknown parameters that must be tuned by the model as to minimize the loss function, using an optimization procedure. In effect, for any particular observation each neuron can be mathematically represented as z = b + \sum_{i = 1}^{m} W_i \cdot x_i , where denotes the intercept (also known as bias, and technically a weight itself) and and x are m-long vectors carrying the weights and values from all inputs, respectively. Before training, all weights are initialized with random values.

Activation functions

So far we have been assuming each unit carries a linear regression model, but there is more to that. Every hidden unit is equipped with a sort of toggle switch that applies a filter on the regression output. In this context, different types of toggle switches are designated activation functions, i.e. f(z) based on the mathematical representation above. On / off and high-pass toggle switches, for example, are encoded by sigmoid and rectified linear unit (ReLU) activation functions, respectively. The activation function in the output layer is particularly important, as I wrote before, as it must be picked according to the problem in hands.

Loss and optimization

So far we worked our way downstream the network, covering most elements necessary to make a prediction. However, we still did not discuss the actual training and weight fine-tuning. We need two things in place prior to training: i) a measure of goodness-of-fit, that compares predictions and known labels over all training observations, and ii) an optimization method that computes the gradient descent, essentially tweaking all weight estimates simultaneously, in the directions that improve the goodness-of-fit. For each of these we have loss functions and optimizers, respectively. There are many types of loss functions, all aimed at quantifying prediction error. Later we will use cross-entropy, CE = - \sum_{i = 1}^{n} y_i \cdot log(\hat{y_i}) . As for optimization, we will use Adam, a popular stochastic optimization method. Great, but there is yet no explanation of how optimization resolves the dependencies among neurons in successive layers.


Recap: to learn the parameters contained in the weight vectors that span l layers (W^{(1)}, W^{(2)}, ... , W^{(l)}), the model must plug in the predictors that cascade downstream and generate predictions, which in turn are contrasted to the actual labels. At this point, the loss function kicks in and gauges error.

To estimate all weights while considering their dependencies, backpropagation makes its way upstream, applying the chain rule. This chain rule calculates the gradients (i.e. partial derivatives w.r.t. the different weights) that reduce the error “dragged” from the output layer, a process governed by the optimizer.

This all covers the building block of training – evaluating observations, establishing predictions and subsequently tuning weights to minimize error. Training is nothing but repeating it and updating the weights until convergence of the optimizer or reaching a limit. But how many observations exactly should we use in each iteration?

Batch size

There are three main options for how many observations to be used in each training step. Stochastic batch, mini-batch and full-batch feed in single observations, samples or the entire training set in each training step, respectively. In terms of runtime, the stochastic batch is faster compared to the full-batch. In terms of accuracy in the optimization (i.e. tweaking weights in the right direction), the full-batch is more accurate than the stochastic batch. In both cases, mini-batch sits in between. While there is no optimal method, it is important to consider the number of epochs can aid in extending iterations beyond the size of the training set.


NNs must learn a myriad of parameters, bringing about the perils of overfitting. This problem can be tackled with various forms of regularization that constrain model flexibility. Weight decay acts much like L_2 or LASSO penalties, introducing a budget for the magnitude of the weights. Dropout, to be used later, randomly drops connections based on a pre-specified probability.

If you want to gain some more intuition about NNs, here is an interesting interactive page from TensorFlow.

Convolutional Neural Networks

Convolutional neural networks (CNNs) are a special type of NNs well poised for image processing and framed on the principles discussed above. The ‘convolutional’ in the name owes to separate square patches of pixels in a image being processed through filters. As a result, the model can mathematically capture key visual cues such as textures and edges that help discerning classes. The beak of a bird, for example, is highly discriminative of birds among animals. In the example depicted below, the CNN would likely process beak-shaped structures along a chain of transformations involving convolution, pooling and flattening into layers, the last of which would see the relevant neurons activated, ideally resulting in p_{bird} being the largest probability among the competing classes.

Schematic representation of a convolutional neural network [source]
Note that images can be represented as numerical matrices, based on color intensity. Monochromatic images are processed with 2D convolutional layers, whereas colored ones require 3D convolutional layers – we will use the former. Also, note that the depth of the convolutional layers is determined by the total number of filters. Now briefly, some important CNN features.

Kernel, stride and padding

Kernels, also known as filters, convolve square blocks of pixels into scalars in subsequent convolutional layers. In the animation below, you have a 3 x 3 kernel with ones running on the diagonal and off-diagonal, scanning an image from left to right, top to bottom.

Example of image convolution [source]
Throughout the process, the kernel performs element-wise multiplication and sums up all products, into a single value passed to the subsequent convolutional layer.

Note that the kernel is moving a pixel at a time. This is the stride, the stepsize of the sliding window the kernel uses to convolve. Larger strides implicate more granular, smaller convolved features.

Looking back to the figure above, note how central pixels are overrepresented compared to those lying on the edges of the image – while the very central pixel is picked up in every scan, the top-left pixel is picked up only once. More importantly, we might not be interested in downsampling. Padding, by adding zero-valued pixels around the borders of the image solves both problems by capturing more details and maintaining size, respectively.


Pooling is a strategic downsampling from a convolutional layer, rendering representations of predominant features in lower dimensions, preventing overfitting and alleviating computational demand. Two major types of pooling are average and max pooling. Provided a kernel and a stride, pooling amounts to convolution but instead taking either the average or maximum value per frame. Here is an illustrative example.

Example of pooling [source]


Flattening, as the name suggest, simply converts the last convolutional layer into a one-dimensional NN layer. It sets the stage for the actual predictions.

All of the principles discussed regarding NNs (i.e. weights, activation, loss, optimization, backprogagation, batch size and regularization) apply to CNNs as well.

Let’s get started with R

First, you will need to install the Keras package and the TensorFlow dependency. Please follow the installation instructions here. If you are using NVIDIA cards, you might want to customise the installation with the command install_keras() and tap into the power of CUDAs. The default installation is CPU-based.

Downloading Dogs vs. Cats

The Kaggle API was recently developed to facilitate a fast, programatic access to datasets and competitions. To use the Kaggle API please sign up, install Kaggle using the command line, create your kaggle.json token and accept the rules from the Dogs vs. Cats competition, as indicated in the GitHub page.

Using the terminal shell in RStudio, I suggest creating a folder originalData in your working directory, and then using the Kaggle API to store all contents associated with the competition therein. If you prefer using your web browser or cannot use a terminal shell, download it directly from the competition page. Finally, unzip the compressed files (i.e. train and test sets) into the working directory.

mkdir originalData/
kaggle competitions download -c dogs-vs-cats -p originalData/
unzip 'originalData/*.zip'

You have just populated the working directory with the directories train and test1 (besides originalData), each containing 25,000 and 12,500 .jpg pictures, respectively, of dogs and cats.

Image processing

Install the libraries listed below if necessary. We will start off by visualizing the second cat from the training dataset directly from RStudio, using the Viewer pane. In case you wonder why the second, take a look into the first. Poor cat.


secondCat <- readImage("train/cat.1.jpg")


At this stage I propose resizing the .jpg images into 50 x 50 px grayscale images that we can easily manipulate numerically. For this purpose I adapted the extract_feature function from Shikun Li’s image classification tutorial. After resizing, this function additionally flattens images to vectors of length 2500. We can then store the train and test sets as two data frames of size 25,000 x 2500 and 12,500 x 2500, respectively.

# Set image size
width <- 50
height <- 50

extract_feature <- function(dir_path, width, height, labelsExist = T) {
img_size <- width * height

## List images in path
images_names <- list.files(dir_path)

## Select only cats or dogs images
catdog <- str_extract(images_names, "^(cat|dog)")
# Set cat == 0 and dog == 1
key <- c("cat" = 0, "dog" = 1)
y <- key[catdog]

print(paste("Start processing", length(images_names), "images"))
## This function will resize an image, turn it into greyscale
feature_list <- pblapply(images_names, function(imgname) {
## Read image
img <- readImage(file.path(dir_path, imgname))
## Resize image
img_resized <- resize(img, w = width, h = height)
## Set to grayscale (normalized to max)
grayimg <- channel(img_resized, "gray")
## Get the image as a matrix
img_matrix <- grayimg@.Data
## Coerce to a vector (row-wise)
img_vector <- as.vector(t(img_matrix))
## bind the list of vector into matrix
feature_matrix <-, feature_list)
feature_matrix <-
## Set names
names(feature_matrix) <- paste0("pixel", c(1:img_size))

return(list(X = feature_matrix, y = y))

Next, we use the function to process both datasets by passing their corresponding directories. This should take about 20-30 minutes.

# Takes approx. 15min
trainData <- extract_feature("train/", width, height)
# Takes slightly less
testData <- extract_feature("test1/", width, height, labelsExist = F)

Let’s look at how the transformation worked on the second cat and store the two data frames in a single R object.

# Check processing on second cat
par(mar = rep(0, 4))
testCat <- t(matrix(as.numeric(trainData$X[2,]),
nrow = width, ncol = height, T))
image(t(apply(testCat, 2, rev)), col = gray.colors(12),
axes = F)

# Save
save(trainData, testData, file = "catdogData.RData")


We can confidently say any person could identify a cat in this particular instance. It is not always going to be the case, considering some images in both sets are ambiguous. Not only that, the downsizing to 50 x 50 px will inevitably complicate the recognition task.

Training and validation

The network structure was inspired from the CIFAR10 CNN example in the Keras GitHub page. We first need to rearrange the train and test data frames into 4D tensors / arrays of shape 25,000 x 50 x 50 x 1 (i.e. observations x height x width x channels). The binary classifier we are about to construct will have a single-unit output layer with sigmoid activation. As a result, and because we encoded cats and dogs as 0 and 1, respectively, it will retrieve the probability of dog, P(y = Dog) . The Keras training framework has three main stages: i) one to specify the architecture, ii) one to compile the model by selecting the loss function, the optimizer and metrics of goodness-of-fit, and finally iii) one to fit the model by determining the number of epochs, batch size and validation split. If you are familiar with Python-based Keras, you will notice the R syntax becomes very similar with the usage of the pipe operator %>% from the magrittr package. Here is the proposed CNN architecture.


Setting padding to ‘same’ maintains the original size, adding as many layers of zeros around the borders as determined by the kernel size, whereas to ‘valid’ does not introduce padding and thus reduces size. The following code will rearrange the 4D tensors, define the model and train it. For the next 1-2 hours you will be able to monitor accuracy and loss on a hold-out sample (20%) of the training set.

# Fix structure for 2d CNN
train_array %
layer_dropout(rate = 0.25) %>%

layer_flatten() %>%
layer_dense(units = 50, activation = "relu") %>%
layer_dropout(rate = 0.25) %>%
layer_dense(units = 1, activation = "sigmoid")


model %>% compile(
loss = 'binary_crossentropy',
optimizer = "adam",
metrics = c('accuracy')

history % fit(
x = train_array, y = as.numeric(trainData$y),
epochs = 30, batch_size = 100,
validation_split = 0.2



Your results should look like this. As the model trains epoch after epoch, cross-entropy (‘loss’) is minimized in both training and validation splits. The apparent increase in the validation split after the 20th epoch is suggestive of slight overfitting. Accuracy, inversely proportional to cross-entropy, stalls at around 95% and 75% in training and validation splits, respectively. Pretty decent results on the fly.

Let’s now investigate the performance on the test set. Recall we have no known labels for the 12,500 photographs that comprise the test set. Therefore, I propose taking 32 random images from the entire test set and visually inspect the predictions – classes and associated probabilities. Finally, we can store the model as a R object for future predictions.

# Compute probabilities and predictions on test set
predictions <-  predict_classes(model, test_array)
probabilities <- predict_proba(model, test_array)

# Visual inspection of 32 cases
random <- sample(1:nrow(testData), 32)
preds <- predictions[random,]
probs <- as.vector(round(probabilities[random,], 2))

par(mfrow = c(4, 8), mar = rep(0, 4))
for(i in 1:length(random)){
image(t(apply(test_array[random[i],,,], 2, rev)),
col = gray.colors(12), axes = F)
legend("topright", legend = ifelse(preds[i] == 0, "Cat", "Dog"),
text.col = ifelse(preds[i] == 0, 2, 4), bty = "n", text.font = 2)
legend("topleft", legend = probs[i], bty = "n", col = "white")

# Save model
save(model, file = "CNNmodel.RData")


Noteworthy, many of the misclassifications depicted above are associated with P(y = Dog) \approx 0.50, at with uncertainty is maximum. We could take advantage of this and introduce a ‘grey zone’ in the prediction, for instance, by predicting dogs with P(y = Dog) \geqslant 0.75 and cats with P(y = Dog) \leqslant 0.25 , while dismissing all other cases. Again, something worth covering in a future tutorial.

I was really curious to test this CNN with my own two pets, Pitti & Platsch, so I decided to give it a go.

The half-sibs Princess Pitti (left) and Prince Platsch (right).

I simply transformed their pictures above and ran predict_classes and predict_proba to determine the two predictions and associated probabilities, respectively. For my relief, the model identified both my pets as cats with relatively high certainty.



I hope you gained a basic understanding of CNNs and how to implement them using the Keras R interface in virtually any machine. I think this was a fun experiment that yielded a fairly good CNN model, being able to distinguish cats and dogs approximatelly 75% of the time, considering our frugal input setup.

We could still set the bar higher. We could try boosting the validation accuracy using inputs larger than 50 x 50 px, RGB color channels (in which case the model inputs would be 4D tensors of shape 25,000 x 50 x 50 x 3) and more elaborate architectures. I am afraid, however, that a substantial improvement will require some good specs.

The Keras R interface can be intimidating for new users, but it is certainly a good starting point for the emerging deep learning enthusiasts, myself included.

Finally, I am earnestly counting on your feedback for improvements, specially concerning clarity and any non-sense I might have written. Please, comment below or contact me directly. I also want to thank my beloved Isabel for reviewing my longest post ever.

Now, if you will excuse me, I have to feed and clean after my masters.


Linear mixed-effect models in R

Arabidopsis thaliana rosette. From: Wikimedia Commons

Statistical models generally assume that

  1. All observations are independent from each other
  2. The distribution of the residuals follows \mathcal{N}(0, \sigma^2), irrespective of the values taken by the dependent variable y

When any of the two is not observed, more sophisticated modelling approaches are necessary. Let’s consider two hypothetical problems that violate the two respective assumptions, where y denotes the dependent variable:

A. Suppose you want to study the relationship between average income (y) and the educational level in the population of a town comprising four fully segregated blocks. You will sample 1,000 individuals irrespective of their blocks. If you model as such, you neglect dependencies among observations – individuals from the same block are not independent, yielding residuals that correlate within block.

B. Suppose you want to study the relationship between anxiety (y) and the levels of triglycerides and uric acid in blood samples from 1,000 people, measured 10 times in the course of 24 hours. If you model as such, you will likely find that the variance of changes over time – this is an example of heteroscedasticity, a phenomenon characterized by the heterogeneity in the variance of the residuals.

In A. we have a problem of dependency caused by spatial correlation, whereas in B. we have a problem of heterogeneous variance. As a result, classic linear models cannot help in these hypothetical problems, but both can be addressed using linear mixed-effect models (LMMs). In rigour though, you do not need LMMs to address the second problem.

LMMs are extraordinarily powerful, yet their complexity undermines the appreciation from a broader community. LMMs dissect hierarchical and / or longitudinal (i.e. time course) data by separating the variance due to random sampling from the main effects. In essence, on top of the fixed effects normally used in classic linear models, LMMs resolve i) correlated residuals by introducing random effects that account for differences among random samples, and ii) heterogeneous variance using specific variance functions, thereby improving the estimation accuracy and interpretation of fixed effects in one go.

I personally reckon that most relevant textbooks and papers are hard to grasp for non-mathematicians. Therefore, following the brief reference in my last post on GWAS I will dedicate the present tutorial to LMMs. For further reading I highly recommend the ecology-oriented Zuur et al. (2009) and the R-intensive Gałecki et al. (2013) books, and this simple tutorial from Bodo Winter. For agronomic applications, H.-P. Piepho et al. (2003) is an excellent theoretical introduction.

Here, we will build LMMs using the Arabidopsis dataset from the package lme4, from a study published by Banta et al. (2010). These data summarize variation in total fruit set per plant in Arabidopsis thaliana plants conditioned to fertilization and simulated herbivory. Our goal is to understand the effect of fertilization and simulated herbivory adjusted to experimental differences across groups of plants.

Mixed-effect linear models

Whereas the classic linear model with n observational units and p predictors has the vectorized form

\mathbf{y} = \mathbf{X}\beta + \epsilon

with the n \times (p+1) predictor matrix \mathbf{X} , the vector of p + 1 coefficient estimates \beta and the n-long vectors of the response \mathbf{y} and the residuals \epsilon , LMMs additionally accomodate separate variance components modelled with a set of random effects \mathbf{u} ,

\mathbf{y} = \mathbf{X}\beta + \mathbf{Z}\mathbf{u} + \epsilon

where \mathbf{Z} and \mathbf{X} are design matrices that jointly represent the set of predictors. Random effects models include only an intercept as the fixed effect and a defined set of random effects. Random effects comprise random intercepts and / or random slopes. Also, random effects might be crossed and nested. In terms of estimation, the classic linear model can be easily solved using the least-squares method. For the LMM, however, we need methods that rather than estimating predict \mathbf{u} , such as maximum likelihood (ML) and restricted maximum likelihood (REML). Bear in mind that unlike ML, REML assumes that the fixed effects are not known, hence it is comparatively unbiased (see Chapter 5 in Zuur et al. (2009) for more details). Unfortunately, LMMs too have underlying assumptions – both residuals and random effects should be normally distributed. Residuals in particular should also have a uniform variance over different values of the dependent variable, exactly as assumed in a classic linear model.

One of the most common doubts concerning LMMs is determining whether a variable is a random or fixed. First of all, an effect might be fixed, random or even both simultaneously – it largely depends on how you approach a given problem. Generally, you should consider all factors that qualify as sampling from a population as random effects (e.g. individuals in repeated measurements, cities within countries, field trials, plots, blocks, batches) and everything else as fixed. As a rule of thumb, i) factors with fewer than 5 levels should be considered fixed and conversely ii) factors with numerous levels should be considered random effects in order to increase the accuracy in the estimation of variance. Both points relate to the LMM assumption of having normally distributed random effects.

Best linear unbiased estimators (BLUEs) and predictors (BLUPs) correspond to the values of fixed and random effects, respectively. The usage of the so-called genomic BLUPs (GBLUPs), for instance, elucidates the genetic merit of animal or plant genotypes that are regarded as random effects when trial conditions, e.g. location and year of trials are considered fixed. However, many studies sought the opposite, i.e. using breeding values as fixed effects and trial conditions as random, when the levels of the latter outnumber the former, chiefly because of point ii) outlined above. In GWAS, LMMs aid in teasing out population structure from the phenotypic measures.

Let’s get started with R

We will follow a structure similar to the 10-step protocol outlined in Zuur et al. (2009): i) fit a full ordinary least squares model and run the diagnostics in order to understand if and what is faulty about its fit; ii) fit an identical generalized linear model (GLM) estimated with ML, to serve as a reference for subsequent LMMs; iii) deploy the first LMM by introducing random effects and compare to the GLM, optimize the random structure in subsequent LMMs; iv) optimize the fixed structure by determining the significant of fixed effects, always using ML estimation; finally, v) use REML estimation on the optimal model and interpret the results.

You need to havenlme andlme4 installed to proceed. We will firstly examine the structure of the Arabidopsis dataset.

# Install (if necessary) and load nlme and lme4
# Load dataset, inspect size and additional info
dim(Arabidopsis) # 625 observations, 8 variables

The Arabidopsis dataset describes 625 plants with respect to the the following 8 variables (transcript from R):

region: a factor with 3 levels NL (Netherlands), SP (Spain), SW (Sweden)

population: a factor with the form n.R representing a population in region R

genotype: a factor with 24 (numeric-valued) levels

a nuisance factor with 2 levels, one for each of two greenhouse racks

fertilization treatment/nutrient level (1, minimal nutrients or 8, added nutrients)

simulated herbivory or “clipping” (apical meristem damage): unclipped(baseline) or clipped

a nuisance factor for germination method (Normal, Petri.Plate, or Transplant)

total fruit set per plant (integer), henceforth TFPP for short.

We will now visualise the absolute frequencies in all 7 factors and the distribution for TFPP.

# Overview of the variables
par(mfrow = c(2,4))
barplot(table(reg), ylab = "Frequency", main = "Region")
barplot(table(popu), ylab = "Frequency", main = "Population")
barplot(table(gen), ylab = "Frequency", las = 2, main = "Genotype")
barplot(table(rack), ylab = "Frequency", main = "Rack")
barplot(table(nutrient), ylab = "Frequency", main = "Nutrient")
barplot(table(amd), ylab = "Frequency", main = "AMD")
barplot(table(status), ylab = "Frequency", main = "Status")
hist(total.fruits, col = "grey", main = "Total fruits", xlab = NULL)


The frequencies are overall balanced, perhaps except for status (i.e. germination method). Genotype, greenhouse rack and fertilizer are incorrectly interpreted as quantitative variables. In addition, the distribution of TFPP is right-skewed. As such, we will encode these three variables as categorical variables and log-transform TFPP to approximate a Gaussian distribution (natural logarithm).

# Transform the three factor variables gen, rack and nutrient
Arabidopsis[,c("gen","rack","nutrient")] <- lapply(Arabidopsis[,c("gen","rack","nutrient")], factor)
# Re-attach after correction, ignore warnings
# Add 1 to total fruits, otherwise log of 0 will prompt error
total.fruits <- log(1 + total.fruits)

A closer look into the variables shows that each genotype is exclusive to a single region. The data contain no missing values.

# gen x popu table
table(gen, popu)
# Any NAs?
any( # FALSE

Formula syntax basics

At this point I hope you are familiar with the formula syntax in R. Note that interaction terms are denoted by : and fully crossed effects with *, so that A*B = A + B + A:B. The following code example

lm(y ~ x1 + x2*x3)

builds a linear model of using x_1 , x_2 , x_3  and the interaction between x_2  and x_3 In case you want to perform arithmetic operations inside the formula, use the function I. You can also introduce polynomial terms with the function poly. One handy trick I use to expand all pairwise interactions among predictors is

model.matrix(y ~ .*., data = X)

provided a matrix X that gathers all predictors and y. You can also simply use .*. inside the lm call, however you will likely need to preprocess the resulting interaction terms.

While the syntax of lme is identical to lm for fixed effects, its random effects are specified under the argument random as

random = ~intercept + fixed effect | random effect

and can be nested using /. In the following example

random = ~1 + C | A/B

the random effect B is nested within random effect A, altogether with random intercept and slope with respect to C. Therefore, not only will the groups defined by A and A/B have different intercepts, they will also be explained by different slight shifts of \beta from the fixed effect C.

Classic linear model

Ideally, you should start will a full model (i.e. including all independent variables). Here, however, we cannot use all descriptors in the classic linear model since the fit will be singular due to the redundancy in the levels of reg and popu. For simplicity I will exclude these alongside gen, since it contains a lot of levels and also represents a random sample (from many other extant Arabidopsis genotypes). Additionally, I would rather use rack and  status as random effects in the following models but note that having only two and three levels respectively, it is advisable to keep them as fixed.

LM <- lm(total.fruits ~ rack + nutrient + amd + status)
par(mfrow = c(2,2))


These diagnostic plots show that the residuals of the classic linear model poorly qualify as normally distributed. Because we have no obvious outliers, the leverage analysis provides acceptable results. We will try to improve the distribution of the residuals using LMMs.

Generalized linear model

We need to build a GLM as a benchmark for the subsequent LMMs. This model can be fit without random effects, just like a lm but employing ML or REML estimation, using the gls function. Hence, it can be used as a proper null model with respect to random effects. The GLM is also sufficient to tackle heterogeneous variance in the residuals by leveraging different types of variance and correlation functions, when no random effects are present (see arguments correlation and weights).

GLM <- gls(total.fruits ~ rack + nutrient + amd + status,
method = "ML")

At this point you might consider comparing the GLM and the classic linear model and note they are identical. Also, you might wonder why are we using LM instead of REML – as hinted in the introduction, REML comparisons are meaningless in LMMs that differ in their fixed effects. Therefore, we will base all of our comparisons on LM and only use the REML estimation on the final, optimal model.

Optimal random structure

Let’s fit our first LMM with all fixed effects used in the GLM and introducing reg, popu, genreg/popu, reg/gen, popu/gen and reg/popu/gen as random intercepts, separately.

In order to compare LMMs (and GLM), we can use the function anova (note that it does not work for lmer objects) to compute the likelihood ratio test (LRT). This test will determine if the models are significantly different with respect to goodness-of-fit, as weighted by the trade-off between variance explained and degrees-of-freedom. The model fits are also evaluated based on the Akaike (AIC) and Bayesian information criteria (BIC) – the smaller their value, the better the fit.

lmm1 <- lme(total.fruits ~ rack + nutrient + amd + status,
random = ~1|reg, method = "ML")
lmm2 <- lme(total.fruits ~ rack + nutrient + amd + status,
random = ~1|popu, method = "ML")
lmm3 <- lme(total.fruits ~ rack + nutrient + amd + status,
random = ~1|gen, method = "ML")
lmm4 <- lme(total.fruits ~ rack + nutrient + amd + status,
random = ~1|reg/popu, method = "ML")
lmm5 <- lme(total.fruits ~ rack + nutrient + amd + status,
random = ~1|reg/gen, method = "ML")
lmm6 <- lme(total.fruits ~ rack + nutrient + amd + status,
random = ~1|popu/gen, method = "ML")
lmm7 <- lme(total.fruits ~ rack + nutrient + amd + status,
random = ~1|reg/popu/gen, method = "ML")
anova(GLM, lmm1, lmm2, lmm3, lmm4, lmm5, lmm6, lmm7)

We could now base our selection on the AIC, BIC or log-likelihood. Considering most models are undistinguishable with respect to the goodness-of-fit, I will select lmm6 and lmm7  as the two best models so that we have more of a random structure to look at. Take a look into the distribution of the random effects with plot(ranef(MODEL)). We next proceed to incorporate random slopes.

There is the possibility that the different researchers from the different regions might have handled and fertilized plants differently, thereby exerting slightly different impacts. Let’s update lmm6 and lmm7 to include random slopes with respect to nutrient. We first need to setup a control setting that ensures the new models converge.

# Set optimization pars
ctrl <- lmeControl(opt="optim")
lmm6.2 <- update(lmm6, .~., random = ~nutrient|popu/gen, control = ctrl)
lmm7.2 <- update(lmm7, .~., random = ~nutrient|reg/popu/gen, control = ctrl)
anova(lmm6, lmm6.2, lmm7, lmm7.2) # both models improved
anova(lmm6.2, lmm7.2) # similar fit; lmm6.2 more parsimonious

Assuming a level of significance \alpha = 0.05 , the inclusion of random slopes with respect to nutrient improved both lmm6 and lmm7. Comparing lmm6.2 andlmm7.2 head-to-head provides no evidence for differences in fit, so we select the simpler model,lmm6.2. Let’s check how the random intercepts and slopes distribute in the highest level (i.e. gen within popu).

plot(ranef(lmm6.2, level = 2))


The random intercepts (left) appear to be normally distributed, except for genotype 34, biased towards negative values. This could warrant repeating the entire analysis without this genotype. The random slopes (right) appear to be normally distributed as well. Interestingly, there is a negative correlation of -0.61 between random intercepts and slopes, suggesting that genotypes with low baseline TFPP tend to respond better to fertilization. Try plot(ranef(lmm6.2, level = 1)) to observe the distributions at the level of popu only. Next, we will use QQ plots to compare the residual distributions between the GLM and lmm6.2 to gauge the relevance of the random effects.

# QQ plots (drawn to the same scale!)
par(mfrow = c(1,2))
lims <- c(-3.5,3.5)
qqnorm(resid(GLM, type = "normalized"),
xlim = lims, ylim = lims,main = "GLM")
abline(0,1, col = "red", lty = 2)
qqnorm(resid(lmm6.2, type = "normalized"),
xlim = lims, ylim = lims, main = "lmm6.2")
abline(0,1, col = "red", lty = 2)


The improvement is clear. Bear in mind these results do not change with REML estimation. Try different arrangements of random effects with nesting and random slopes, explore as much as possible!

Optimal fixed structure

Now that we are happy with the random structure, we will look into the summary of the optimal model so far (i.e. lmm6.2) and determine if we need to modify the fixed structure.


All effects are significant with \alpha = 0.05 , except for one of the levels from status that represents transplanted plants. Given the significant effect from the other two levels, we will keep status and all current fixed effects. Just for fun, let’s add the interaction term nutrient:amd and see if there is any significant improvement in fit.

lmm8 <- update(lmm6.2, .~. + nutrient:amd)
anova(lmm8, lmm6.2)

The addition of the interaction was non-significant with respect to both \beta and the goodness-of-fit, so we will drop it. Note that it is not a good idea to add new terms after optimizing the random structure, I did so only because otherwise there would be nothing to do with respect to the fixed structure.

Fit optimal model with REML

We could play a lot more with different model structures, but to keep it simple let’s finalize the analysis by fitting the lmm6.2 model using REML and finally identifying and understanding the differences in the main effects caused by the introduction of random effects.

finalModel <- update(lmm6.2, .~., method = "REML")

We will now contrast our REML-fitted final model against a REML-fitted GLM and determine the impact of incorporating random intercept and slope, with respect to nutrient, at the level of popu/gen. Therefore, both will be given the same fixed effects and estimated using REML. # Reset previous graphical pars
# New GLM, updated from the first by estimating with REML
GLM2 <- update(GLM, .~., method = "REML")
# Plot side by side, beta with respective SEs
plot(coef(GLM2), xlab = "Fixed Effects", ylab = expression(beta), axes = F,
pch = 16, col = "black", ylim = c(-.9,2.2))
stdErrors <- coef(summary(GLM2))[,2]
segments(x0 = 1:6, x1 = 1:6, y0 = coef(GLM2) - stdErrors, y1 = coef(GLM2) + stdErrors,
col = "black")
abline(h = 0, col = "grey", lty = 2)
axis(1, at = 1:6,
labels = c("Intercept", "Rack", "Nutrient (Treated)","AMD (Unclipped)","Status (PP)",
"Status (Transplant)"), cex.axis = .7)
points(1:6 + .1, fixef(finalModel), pch = 16, col = "red")
stdErrorsLMM <- coef(summary(finalModel))[,2]
segments(x0 = 1:6 + .1, x1 = 1:6 + .1, y0 = fixef(finalModel) - stdErrorsLMM, y1 = fixef(finalModel) + stdErrorsLMM, col = "red")
# Legend
legend("topright", legend = c("GLM","LMM"), text.col = c("black","red"), bty = "n")


The figure above depicts the estimated \beta from the different fixed effects, including the intercept, for the GLM (black) and the final LMM (red). Error bars represent the corresponding standard errors (SE). Overall the results are similar but uncover two important differences. First, for all fixed effects except the intercept and nutrient, the SE is smaller in the LMM. Second, the relative effects from two levels of status are opposite. With the consideration of random effects, the LMM estimated a more negative effect of culturing in Petri plates on TFPP, and conversely a less negative effect of transplantation.



The distribution of the residuals as a function of the predicted TFPP values in the LMM is still similar to the first panel in the diagnostic plots of the classic linear model. The usage of additional predictors and generalized additive models would likely improve it.


Now that we account for genotype-within-region random effects, how do we interpret the LMM results?

Plants that were placed in the first rack, left unfertilized, clipped and grown normally have an average TFPP of 2.15. This is the value of the estimated grand mean (i.e. intercept), and the predicted TFPP when all other factors and levels do not apply. For example, a plant grown under the same conditions but placed in the second rack will be predicted to have a smaller yield, more precisely of ln(TFPP) = \beta_{intercept} + \beta_{Rack} = 2.15 + (-0.75) = 1.4 . To these reported yield values, we still need to add the random intercepts predicted for region and genotype within region (which are tiny values, by comparison; think of them as a small adjustment). Moreover, we can state that

  • Fertilized plants produce more fruits than those kept unfertilized. This was the strongest main effect and represents a very sensible finding.
  • Plants grown in the second rack produce less fruits than those in the first rack. This was the second strongest main effect identified. Could this be due to light / water availability?
  • Simulated herbivory (AMD) negatively affects fruit yield. This is also a sensible finding – when plants are attacked, more energy is allocated to build up biochemical defence mechanisms against herbivores and pathogens, hence compromising growth and eventually fruit yield.
  • Both culturing in Petri plates and transplantation, albeit indistinguishable, negatively affect fruit yield as opposed to normal growth. When conditions are radically changed, plants must adapt swiftly and this comes at a cost as well. Thus, these observations too make perfect sense.
  • One important observation is that the genetic contribution to fruit yield, as gauged by gen, was normally distributed and adequately modelled as random. One single outlier could eventually be excluded from the analysis. This makes sense assuming plants have evolved universal mechanisms in response to both nutritional status and herbivory that overrule any minor genetic differences among individuals from the same species.


  • Always check the residuals and the random effects! While both linear models and LMMs require normally distributed residuals with homogeneous variance, the former assumes independence among observations and the latter normally distributed random effects. Use normalized residuals to establish comparisons.
  • One key additional advantage of LMMs we did not discuss is that they can handle missing values.
  • Wide format data should be first converted to long format, using e.g. the R package reshape.
  • Variograms are very helpful in determining spatial or temporal dependence in the residuals. In the case of spatial dependence, bubble plots nicely represent residuals in the space the observations were drown from (e.g. latitude and longitude; refer to Zuur et al. (2009) for more information).
  • REML estimation is unbiased but does not allow for comparing models with different fixed structures. Only use the REML estimation on the optimal model.

With respect to this particular set of results:

  • The analysis outlined here is not as exhaustive as it should be. Among other things, we did neither initially consider interaction terms among fixed effects nor investigate in sufficient depth the random effects from the optimal model.
  • The dependent variable (total fruit set per plant) was highly right-skewed and required a log-transformation for basic modeling. The large amount of zeros would in rigour require zero inflated GLMs or similar approaches.
  • All predictors used in the analysis were categorical factors. We could similarly use an ANOVA model. LMMs are likely more relevant in the presence of quantitative or mixed types of predictors.

I would like to thank Hans-Peter Piepho for answering my nagging questions over ResearchGate. I hope these superficial considerations were clear and insightful. I look forward for your suggestions and feedback. My next post will cover a joint multivariate model of multiple responses, the graph-guided fused LASSO (GFLASSO) using a R package I am currently developing. Happy holidays!

Genome-wide association studies in R

’10 High Cholesterol Foods You Must Avoid’. From:

This time I elaborate on a much more specific subject that will mostly concern biologists and geneticists. I will try my best to outline the approach as to ensure non-experts will still have a basic understanding. This tutorial illustrates the power of genome-wide association (GWA) studies by mapping the genetic determinants of cholesterol levels using three Southeast Asian populations.

Historical background

Early since Charles Darwin formulated the theories of natural and sexual selection in the late 1800s, the underlying role of genes, each represented by different alleles (i.e. variants) in different individuals was yet to be elucidated. His younger contemporary fellow Gregor Mendel scratched the surface after identifying the mechanism of trait inheritance, genetic segregation. In the years that followed, the genetic basis of phenotypes (i.e. observable traits) was gradually unraveled by classic geneticists pioneered by Ronald Fisher, who introduced key concepts such as genetic variance. The then emerging concept of genotype (i.e. genetic make-up) required the development of polymorphic genetic markers. In the early days, genotyping was based on determining the allelic composition of loci (loose definition of chromosomal regions), and later of copy number variations (CNVs), short-tandem repeats (STRs) and single nucleotide polymorphisms (SNPs). Humans and many other organisms including plants are diploid (i.e. carry two copies of each chromosome), which implies bi-allelic markers with alleles e.g. A and a distinguish individuals into AA (homozygous dominant), Aa (heterozygous) and aa (homozygous recessive).

The early discovery of the genetic basis for disorders such as sickle cell anemia and haemophilia owes much to their relatively simple genetic architectures – few mutations with high penetrance, thus more easily identified. Understandably then, more complex polygenic diseases that have been around for a long time, such as the neurodegenerative Alzheimer and Parkinson, are still currently under investigation. To characterize such traits, products of the interplay of many genes with small effects, it takes large genetic resources as well as flexible statistical methods, which is no problem in the current era of omics data and high-performance computing.

Genome-wide association studies

Genome-wide association (GWA) studies scan an entire species genome for association between up to millions of SNPs and a given trait of interest. Notably, the trait of interest can be virtually any sort of phenotype ascribed to the population, be it qualitative (e.g. disease status) or quantitative (e.g. height). Essentially, given p SNPs and n samples or individuals, a GWA analysis will fit p independent univariate linear models, each based on n samples, using the genotype of each SNP as predictor of the trait of interest. The significance of association (P-value) in each of the p tests is determined from the coefficient estimate \beta of the corresponding SNP (technically speaking, the significance of association is P(\beta|H_0: \beta=0) ). Note that because these tests are independent and quite numerous, there is a great computational advantage in setting up a parallelized GWA analysis (as we will do shortly). Quite reasonably, it is necessary to adjust the resulting P-values using multiple hypothesis testing methods such as Bonferroni, Benjamini-Hochberg or false discovery rate (FDR). GWA studies are now commonplace in genetics of many different species.

Association mapping vs. linkage mapping

Too often, people cannot tell the difference between association and linkage mapping, or quantitative trait loci (QTL) mapping. Albeit conceptually similar, their are actually opposite in their workings. One of the key differences between the two is that association mapping relies on high-density SNP genotyping of unrelated individuals, whereas linkage mapping relies on the segregation of substantially fewer markers in controlled breeding experiments – unsurprisingly QTL mapping is seldom conducted in humans. Importantly, association mapping gives you point associations in the genome, whereas linkage mapping gives you QTL, chromosomal regions.

The present tutorial covers fundamental aspects to consider when conducting GWA analysis, from the pre-processing of genotype and phenotype data to the interpretation of results. We will use a mixed population of 316 Chinese, Indian and Malay that was recently characterized using high-throughput SNP-chip sequencing, transcriptomics and lipidomics (Saw et al., 2017). More specifically, we will search for associations between the >2.5 million SNP markers and cholesterol levels. Finally, we will explore the vicinity of candidate SNPs using the USCS Genome Browser in order to gain functional insights. The methodology shown here is largely based on the tutorial outlined in Reed et al., 2015. The R scripts and some of the data can be found in my repository, but you will still need to download the omics data from here. Please follow the instructions in the repo.

Let’s get started with R

Read data

Let’s first import the PLINK-converted .bed, .fam and .bim Illumina files from each of the three ethnic groups. We will use the function read.plink from the package snpStats and work on the resulting objects throughout the rest of the tutorial. This function reads .bed, .fam and .bim and creates a list of three elements – $genotypes, $fam and $map. The first contains all SNPs determined from all samples, the second contains information about pedigree and sex, and the third contains the genomic coordinates of the SNPs. At this point we have a total of 323 individuals (110 Chinese, 105 Indian and 108 Malay) and 2,527,458 SNPs. Next, we will change the Illumina SNP IDs stored in $genotype to the more conventional rs IDs, which will allow us to zoom in into the genomic regions that surround the candidate SNPs in the USCS Genome Browser. I prepared a table that establishes the correspondence between the two, so that we can easily switch the IDs.


pathM <- paste("Genomics/108Malay_2527458snps", c(".bed", ".bim", ".fam"), sep = "")
SNP_M <- read.plink(pathM[1], pathM[2], pathM[3])

pathI <- paste("Genomics/105Indian_2527458snps", c(".bed", ".bim", ".fam"), sep = "")
SNP_I <- read.plink(pathI[1], pathI[2], pathI[3])

pathC <- paste("Genomics/110Chinese_2527458snps", c(".bed", ".bim", ".fam"), sep = "")
SNP_C <- read.plink(pathC[1], pathC[2], pathC[3])

# Merge the three SNP datasets
SNP <- rbind(SNP_M$genotypes, SNP_I$genotypes, SNP_C$genotypes)

# Take one bim map (all 3 maps are based on the same ordered set of SNPs)
map <- SNP_M$map
colnames(map) <- c("chr", "SNP", "gen.dist", "position", "A1", "A2")

# Rename SNPs present in the conversion table into rs IDs
mappedSNPs <- intersect(map$SNP, names(conversionTable))
newIDs <- conversionTable[match(map$SNP[map$SNP %in% mappedSNPs], names(conversionTable))]
map$SNP[rownames(map) %in% mappedSNPs] <- newIDs

Next we import and merge the three lipid data sets (stored as .txt) and determine which samples are present in both SNP and lipid data sets. In the following analyses we will use the subset of samples profiled in both platforms, a total of 319. Finally, we create a list genData that stores the merged SNP data ($SNP), one of the three $map since they are all identical ($MAP) and the merged lipid data ($LIP). Finally, let’s save RAM for the subsequent steps and remove all files after saving genData into a .RData file.

# Load lipid datasets & match SNP-Lipidomics samples
lipidsMalay <- read.delim("Lipidomic/117Malay_282lipids.txt", row.names = 1)
lipidsIndian <- read.delim("Lipidomic/120Indian_282lipids.txt", row.names = 1)
lipidsChinese <- read.delim("Lipidomic/122Chinese_282lipids.txt", row.names = 1)
all(Reduce(intersect, list(colnames(lipidsMalay),
colnames(lipidsChinese))) == colnames(lipidsMalay)) # TRUE
lip <- rbind(lipidsMalay, lipidsIndian, lipidsChinese)

matchingSamples <- intersect(rownames(lip), rownames(SNP))
SNP <- SNP[matchingSamples,]
lip <- lip[matchingSamples,]

genData <- list(SNP = SNP, MAP = map, LIP = lip)
save(genData, file = "PhenoGenoMap.RData")

# Clear memory
rm(list = ls())


Let’s reload genData and clean it up. In a nutshell, the pre-processing of the data consists in

  • discarding SNPs with call rate < 1  or  MAF < 0.1
  • discarding samples with call rate < 100%, IBD kinship coefficient > 0.1 or inbreeding coefficient |F| > 0.1

Call rate is the proportion of SNPs (or samples) that were genotyped. For example, a call rate of 0.95 for a particular SNP (sample) means 5% of the values are missing. Because we have so many SNPs, we can afford to have absolutely no missing values in the $SNP matrix by imposing a call rate threshold of 1 for both SNPs and samples. If you want to relax the threshold and tolerate missing values, bear in mind you need to run a whole procedure for imputing those. Reed et al., 2015 describe a PCA-based imputation method that utilizes the 1,000 Genome Project as a proxy, in case you are interested.

Minor-allele frequency (MAF) denotes the proportion of the least common allele for each SNP. Of course, it is harder to detect associations with rare variants and this is why we select against low MAF values. Most GWA studies I have read typically report MAF thresholds of 0.05. Here, I opt for a more stringent 0.1 because again, we have plenty of data and since all this is for illustrative purposes we want to conduct a fast GWA analysis.


# Use SNP call rate of 100%, MAF of 0.1 (very stringent)
maf <- 0.1
callRate <- 1
SNPstats <- col.summary(genData$SNP)

maf_call <- with(SNPstats, MAF > maf & Call.rate == callRate)
genData$SNP <- genData$SNP[,maf_call]
genData$MAP <- genData$MAP[maf_call,]
SNPstats <- SNPstats[maf_call,]

Next, we need to consider samples that exhibit excessive heterozygosity – technically speaking, deviations from the Hardy-Weinberg equilibrium (HWE),

p^2 + 2pq + q^2 = 1

where p  and q  are the frequencies of two alleles A and a that sum up to one, and 2pq the frequency of heterozygous individuals, assuming bi-allelic SNPs. As such, we determine the inbreeding coefficient |F|,

|F| = 1 - \frac{H}{2pq}

where and 2pq are the observed and expected heterozygosity (i.e. proportion of heterozygous SNPs per sample), respectively. You will note that in practice, rather than using the proportions we will use counts which still gives us the same results. Overall, large and small |F| might indicate poor sample quality or inbreeding, respectively. We will exclude samples with |F|> 0.1 as described in Reed et al., 2015.

# Sample call rate & heterozygosity
callMat <- !$SNP)
Sampstats <- row.summary(genData$SNP)
hetExp <- callMat %*% (2 * SNPstats$MAF * (1 - SNPstats$MAF)) # Hardy-Weinberg heterozygosity (expected)
hetObs <- with(Sampstats, Heterozygosity * (ncol(genData$SNP)) * Call.rate)
Sampstats$hetF <- 1-(hetObs/hetExp)
# Use sample call rate of 100%, het threshold of 0.1 (very stringent)
het <- 0.1 # Set cutoff for inbreeding coefficient;
het_call <- with(Sampstats, abs(hetF) < het & Call.rate == 1)
genData$SNP <- genData$SNP[het_call,]
genData$LIP <- genData$LIP[het_call,]

Finally, we will investigate relatedness among samples using the kinship coefficient based on identity by descent (IBD). Please note that these functions from the package SNPRelate require GDS files. For this reason we first need to aggregate the .bed, .fam and .bim files from the three populations into convertGDS. The function snpgdsBED2GDS2 creates the GDS necessary for this part of the analysis. To determine the kinship coefficient between pairs of samples we will use a subset of uncorrelated SNPs in order to have unbiased estimates. For this purpose, we will use linkage disequilibrium (LD) as a measure of correlation between markers. LD ranges from 0 to 1, the higher its value the more likely two SNPs co-segregate and therefore correlate. Here, we will utilize the subset of SNPs with LD < 0.2 (p ~ 12,000) to determine the IBD kinship coefficient. It took me about 2 hours to calculate the LD with the function snpgdsLDpruning, so be patient. Next, following Reed et al., 2015, we will exclude all samples with kinship coefficients > 0.1.

# LD and kinship coeff
ld <- .2
kin <- .1
snpgdsBED2GDS(bed.fn = "convertGDS.bed", bim.fn = "convertGDS.bim",
fam.fn = "convertGDS.fam", out.gdsfn = "myGDS", cvt.chr = "char")
genofile <- snpgdsOpen("myGDS", readonly = F)
gds.ids <- read.gdsn(index.gdsn(genofile, ""))
gds.ids <- sub("-1", "", gds.ids)
add.gdsn(genofile, "", gds.ids, replace = T)
geno.sample.ids <- rownames(genData$SNP)
# First filter for LD
snpSUB <- snpgdsLDpruning(genofile, ld.threshold = ld, = geno.sample.ids, = colnames(genData$SNP))
snpset.ibd <- unlist(snpSUB, use.names = F)
# And now filter for MoM
ibd <- snpgdsIBDMoM(genofile, kinship = T, = geno.sample.ids, = snpset.ibd,
num.thread = 1)
ibdcoef <- snpgdsIBDSelection(ibd)
ibdcoef <- ibdcoef[ibdcoef$kinship >= kin,]

# Filter samples out
related.samples <- NULL
while (nrow(ibdcoef) > 0) {
# count the number of occurrences of each and take the top one
sample.counts <- arrange(count(c(ibdcoef$ID1, ibdcoef$ID2)), -freq)
rm.sample <- sample.counts[1, 'x']
cat("Removing sample", as.character(rm.sample), 'too closely related to', sample.counts[1, 'freq'],'other samples.\n')

# remove from ibdcoef and add to list
ibdcoef <- ibdcoef[ibdcoef$ID1 != rm.sample & ibdcoef$ID2 != rm.sample,]
related.samples <- c(as.character(rm.sample), related.samples)
genData$SNP <- genData$SNP[!(rownames(genData$SNP) %in% related.samples),]
genData$LIP <- genData$LIP[!(rownames(genData$LIP) %in% related.samples),]

After pre-processing, we are left with 316 samples (110 Chinese, 100 Indian and 106 Malay) characterised by 795,668 SNP markers and 282 lipid species. Note that your sample size might differ slightly as the LD pruning procedure is stochastic.


Principal Component Analysis

Now that we are done with the pre-processing, it might be a good idea to examine the largest sources of variation in the genotype data and look out for outliers or clustering patterns, using Principal Component Analysis (PCA). Because we are working with S4 objects, we will be using the PCA function from SNPRelatesnpgdsPCA. Let’s plot the first two principal components (PCs).

pca <- snpgdsPCA(genofile, = geno.sample.ids, = snpset.ibd, num.thread = 1)
pctab <- data.frame( = pca$,
PC1 = pca$eigenvect[,1],
PC2 = pca$eigenvect[,2],
stringsAsFactors = F)

origin <- read.delim("countryOrigin.txt", sep = "\t")
origin <- origin[match(pca$, origin$,]

pcaCol <- rep(rgb(0,0,0,.3), length(pca$ # Set black for chinese
pcaCol[origin$Country == "I"] <- rgb(1,0,0,.3) # red for indian
pcaCol[origin$Country == "M"] <- rgb(0,.7,0,.3) # green for malay

png("PCApopulation.png", width = 500, height = 500)
plot(pctab$PC1, pctab$PC2, xlab = "PC1", ylab = "PC2", col = pcaCol, pch = 16)
abline(h = 0, v = 0, lty = 2, col = "grey")
legend("top", legend = c("Chinese", "Indian", "Malay"), col = 1:3, pch = 16, bty = "n")


As expected, the 795,668 SNP markers clearly delineate the three populations. The results also suggest that Chinese and Malay are closer to each other than to Indian (this observation would be much better addressed with e.g. hierarchical clustering). Also, no obvious outliers are identified with the first two PCs. All good to proceed to GWA.

Genome-Wide Association

Finally, the pièce de résistance. From the set of 282 lipids species I chose cholesterol, one of the most familiar, as the trait of interest. You are completely free to select a different lipid species and proceed. We are going to use the GWA function provided in Reed et al., 2015 with some minor modifications. I recommend you to open the script GWASfunction.R  and skim through. This is an excellent, well documented parallelized implementation. Note that the glm function is used to determine the significance of association between each SNP and the trait of interest. In case you are wondering, glm is much more versatile than lm since it conducts Gaussian, Poisson, binomial and multinomial regression / classification, depending on how your trait of interest is distributed (all lipids in the phenotype file are Gaussian). This GWA function will not create a variable, but rather write a .txt summary table listing the coefficient estimate \beta , t and the corresponding P-value for each SNP, alongside with the corresponding genomic coordinates. Running the GWA function took me approximately 1.5 hours with my MacBook Pro mid-2012 (8Gb, 2.9 GHz Intel Core i7).

# Choose trait for association analysis, use colnames(genData$LIP) for listing
# NOTE: Ignore the first column of genData$LIP (gender)
target <- "Cholesterol"

phenodata <- data.frame("id" = rownames(genData$LIP),
"phenotype" = scale(genData$LIP[,target]), stringsAsFactors = F)

# Conduct GWAS (will take a while)
start <- Sys.time()
GWAA(genodata = genData$SNP, phenodata = phenodata, filename = paste(target, ".txt", sep = ""))
Sys.time() - start # benchmark

Once finished, we can visualize the results using the so-called Manhattan plots. All we need is to load the .txt summary table written in the previous step, add a column with -\log_{10} (P) and plot these significance estimates against the genomic coordinates of all SNPs.

# Manhattan plot
GWASout <- read.table(paste(target, ".txt", sep = ""), header = T, colClasses = c("character", rep("numeric",4)))
GWASout$type <- rep("typed", nrow(GWASout))
GWASout$Neg_logP <- -log10(GWASout$p.value)
GWASout <- merge(GWASout, genData$MAP[,c("SNP", "chr", "position")])
GWASout <- GWASout[order(GWASout$Neg_logP, decreasing = T),]

png(paste(target, ".png", sep = ""), height = 500,width = 1000)

In addition, a full (resp. dashed) line indicate the levels of Bonferroni-adjusted \alpha = 0.05 for 1,000,000 (resp. 10,000) tests.Cholesterol

We see that a total of four SNPs pass the ‘relaxed’ Bonferroni threshold (none passes the ‘hard’ threshold). These are SNPs rs7527051 (Chr. 1), rs12140539 (co-localized with the first, Chr. 1), rs9509213 (Chr. 13) and rs2250402 (Chr. 15).

Before proceeding with these four hits, it is helpful to constrast the distribution of the resulting P-values against that expected by chance, as to ensure there is no confounding systemic bias. This is easily addressed with a quantile-quantile (Q-Q) plot. As the name suggests, it compares two distributions based on their quantiles. In the present case, we want to contrast the distribution of our t statistics against that obtained by chance. If two distributions are identical in shape, the Q-Q plot will display a x = y line. Therefore, the Q-Q plot from a reliable GWA analysis will display a x = y line with only few deviating values that are suggestive of association. Otherwise, if the line is shifted up or down the GWA analysis might be impaired by confounding factors. We will draw a Q-Q plot using the function estlambda from the package GenABEL.

# QQ plot using GenABEL estlambda function
png(paste(target, "_QQplot.png", sep = ""), width = 500, height = 500)
lambda <- estlambda(GWASout$t.value**2, plot = T, method = "median")


The resulting Q-Q plot clearly depicts a trend line (\lambda = 0.99 , red) overlapping with x = y (black) and a slight deviation in the right tail, so we can be confident about our results.

Functional insights into candidate markers

Finally, we will try to find the functional relevance of these four candidate SNPs by searching for genes in their vicinity, using the USCS Genome Browser (enter Genome Browser, insert the SNP ID in the text box, enter and zoom out). I found that rs9509213 (Chr. 13) lands right on CRYL1 (crystalline lambda 1, intron sequence), rendering it an interesting candidate for follow-up studies.

Screen Shot 2017-10-08 at 17.58.52

The SNP rs9509213 is shown as a black text box in the bottom of the figure, and its coordinates are highlighted by a vertical yellow line. The CRYL1 gene model is shown on top, below the chromosome model (topmost).

Interestingly, there is a recent publication that found ‘POE, HP, and CRYL1 have all been associated with Alzheimer’s Disease, the pathology of which involves lipid and cholesterol pathways.’.

Finally, I would like to remark on the need of experimental validation of gene candidates to validate the resulting SNP-trait associations. In this particular case, one could test the effect of overexpressing or knocking-out the ortholog of CRYL1 in cholesterol levels of mice.


To sum up, we have covered some of the most fundamental aspects of GWA analysis:

  • Pre-processing of genotype data based on call rates, MAF, kinship and heterozygosity
  • Investigation of population structure with PCA of the genotype data
  • GWA and visualisation with Manhattan plots
  • Q-Q plots
  • Functional insights into the resulting candidate SNPs

I did not cover many other aspects of GWA, such as fine-mapping using LD plots. Importantly, we did not consider sample size calculation in relation to the expected detectable effect sizes. Also noteworthy, the state-of-the-art of GWA is now almost-fully based on mixed-effect linear models (MLM) that consider kinship and cryptic relatedness as random factors, hence much more powerful that glm. My next post might cover MLM.

I encourage you to choose a different lipid species from the phenotype data, run the analysis and interpret the results.

I would like to thank Saw et al., 2017 for providing unrestricted access to their omics data. If it wasn’t for such conscientious scientists and their transparent research endeavours this whole tutorial would simply not exist. I also reiterate that this tutorial is largely based on the guide outlined in Reed et al., 2015, an excellent reference along with Sikorska et al., 2013.

Have fun and keep your cholesterol in check. Any feedback or suggestions are welcome!

Partial least squares in R

Blood serum samples. From:

My last entry introduces principal component analysis (PCA), one of many unsupervised learning tools. I concluded the post with a demonstration of principal component regression (PCR), which essentially is a ordinary least squares (OLS) fit using the first k principal components (PCs) from the predictors. This brings about many advantages:

  1. There is virtually no limit for the number of predictors. PCA will perform the decomposition no matter how many variables you handle. Simpler models such as OLS do not cope with more predictors than observations (i.e. p > n ).
  2. Correlated predictors do not undermine the regression fit. Collinearity is a problem for OLS, by widening the solution space, i.e. destabilizing coefficient estimation. PCA will always produce few uncorrelated PCs from a set of variables, correlated or not.
  3. The PCs carry the maximum amount of variance possible. In any predictive model, predictors with zero or near-zero variance often constitute a problem and behave as second intercepts. In the process of compression, PCA will find the projections where your data points spread out the most, thus facilitating learning from the subsequent OLS.

However, in many cases it is much wiser performing a decomposition similar to the PCA, yet constructing PCs that best explain the response, be it quantitative (regression) or qualitative (classification). This is the concept of partial least squares (PLS), whose PCs are more often designated latent variables (LVs), although in my understanding the two terms can be used interchangeably.

PLS safeguards advantages 1. and 2. and does not necessarily follow 3. In addition, it follows that if you use the same number of k  LVs and p predictors, you are going to get exactly a OLS fit – each predictor gets its own LV, so it does not help much. PLS (regression) and PLS followed by discriminant analysis (PLS-DA, classification) are tremendously useful in predictive modelling. They are adequate in a wide variety of experimental designs and linear in their parameters, therefore more easily interpretable.

Today we will perform PLS-DA on the Arcene data set hosted at the UCI Machine Learning Repository that comprises 100 observations and 10,000 explanatory variables (p \gg n ) in order to diagnose cancer from serum samples. From the 10,000 features, 7,000 comprise distinct mass spectrometry (MS) peaks, each determining the levels of a protein. For some reason the contributors added 3,000 random variables, probably to test robustness against noise (check more information in the link above).

Let’s get started with R

For predictive modelling I always use the caret package, which builds up on existing model packages and employs a single syntax. In addition, caret features countless functions that deal with pre-processing, imputation, summaries, plotting and much more we will see firsthand shortly.

For some reason there is an empty column sticking with the data set upon loading, so I have added the colClasses argument to get rid of it. Once you load the data set take a look and note that the levels of most proteins are right-skewed, which can be a problem. However, the authors of this study conducted an exhaustive pre-processing and looking into this would take far too long. The cancer / no-cancer labels (coded as -1 / 1) are stored in a different file, so we can directly append it to the full data set and later use the formula syntax for training the models.

# Load caret, install if necessary
arcene <- read.table("", sep = " ",
 colClasses = c(rep("numeric", 10000), "NULL"))

# Add the labels as an additional column
arcene$class <- factor(scan("", sep = "\t"))

We can finally search missing values (NA’s). If these were present it would require some form of imputation – that is not the case. Calling any( should prompt a FALSE.

So now the big questions are:

  • How accurately can we predict whether a patient is sick based of the MS profile of his/her blood serum?
  • Which proteins / MS peaks best discriminate sick and healthy patients?

Time to get started with the function train. With respect to pre-processing we will remove zero-variance predictors and center and scale all those remaining, in this exact order using the preProc argument. Considering the size of the sample (n = 100), I will pick a 10x repeated 5-fold cross-validation (CV) – a large number of repetitions compensates for the high variance stemming from a reduced number of folds – rendering a total of 50 estimates of accuracy. Many experts advise in favor of simpler models when performance is undistinguishable, so we will apply the “one SE” rule that selects the least complex model with the average cross-validated accuracy within one standard error (SE) from that in the optimal model. Finally, we must define a range of values for the tuning parameter, the number of LVs. In this particular case we will consider all models with 1-20 LVs.

# Compile cross-validation settings
myfolds <- createMultiFolds(arcene$class, k = 5, times = 10)
control <- trainControl("repeatedcv", index = myfolds, selectionFunction = "oneSE")

# Train PLS model
mod1 <- train(class ~ ., data = arcene,
 method = "pls",
 metric = "Accuracy",
 tuneLength = 20,
 trControl = control,
 preProc = c("zv","center","scale"))

# Check CV profile


This figure depicts the CV profile, where we can learn the average accuracy (y-axis, %) obtained from models trained with different numbers of LVs (x-axis). Note the swift change in accuracy among models with 1-5 LVs. Although the model with six LVs had the highest average accuracy, calling mod1 in the console will show you that, because of the “one SE” rule, the selected model has five LVs (accuracy of 80.6%).

Now we will compare our PLS-DA to the classifier homolog of PCR  – linear discriminant analysis (LDA) following a PCA reductive-step (PCA-DA, if such thing exists). Note that the PCA pre-processing is also set in the preProc argument. We can also try some more complicated models, say, random forests (RF). I will not go into details about the workings and parameterisation of RF, as for today it will be purely illustrative. Note that RF will comparatively take very long (~15min in my 8Gb MacBook Pro from 2012). As usual, a simple object call will show you the summary of the corresponding CV.

mod2 <- train(class ~ ., data = arcene,
method = "lda",
metric = "Accuracy",
trControl = control,
preProc = c("zv","center","scale","pca"))

# RF
mod3 <- train(class ~ ., data = arcene,
method = "ranger",
metric = "Accuracy",
trControl = control,
tuneGrid = data.frame(mtry = seq(10,.5*ncol(arcene),length.out = 6)),
preProc = c("zv","center","scale"))

Finally, we can compare PLS-DA, PCA-DA and RF with respect to accuracy. We will compile the three models using caret::resamples, borrowing the plotting capabilities of ggplot2 to compare the 50 accuracy estimates from the optimal cross-validated model in the three cases.

# Compile models and compare performance
models <- resamples(list("PLS-DA" = mod1, "PCA-DA" = mod2, "RF" = mod3))
bwplot(models, metric = "Accuracy")


It is clear that the long RF run did not translate into a excelling performance, quite the opposite. Although in average all three models have similar performances, the RF displays a much larger variance in accuracy, which is of course a concern if we seek a robust model. In this case PLS-DA and PCA-DA exhibit the best  performance (63-95% accuracy) and either model would do well in diagnosing cancer in new serum samples.

To conclude, we will determine the ten proteins that best diagnose cancer using the variable importance in the projection (ViP), from both the PLS-DA and PCA-DA. With varImp, this is given in relative levels (scaled to the range 0-100).

plot(varImp(mod1), 10, main = "PLS-DA")
plot(varImp(mod2), 10, main = "PCA-DA")



Very different figures, right? Idiosyncrasies aside, how many PCs did the PCA-DA use? By calling mod2$preProcess, we learn that “PCA needed 82 components to capture 95 percent of the variance”. This is because the PCA pre-processing functionality in caret uses, by default, as many PCs as it takes to cover 95% of the data variance. So, in one hand we have a PLS-DA with five LVs, on the other hand a PCA-DA with 82 PCs. This not only makes PCA-DA cheaply complicated, but arcane at the interpretation level.

The PLS-DA ViP plot above clearly distinguishes V1184 from all other proteins. This could be an interesting cancer biomarker. Of course, many other tests and models must be conducted in order to provide a reliable diagnostic tool. My aim here is only to introduce PLS and PLS-DA.

And that’s it for today, I hope you enjoyed. Please drop me a comment or message if you have any suggestion for the next post. Cheers!

Principal Component Analysis in R

Principal component analysis (PCA) is routinely employed on a wide range of problems. From the detection of outliers to predictive modeling, PCA has the ability of projecting the observations described by p variables into few orthogonal components defined at where the data ‘stretch’ the most, rendering a simplified overview. PCA is particularly powerful in dealing with multicollinearity and variables that outnumber the samples (p \gg n ).


It is an unsupervised method, meaning it will always look into the greatest sources of variation regardless of the data structure. Its counterpart, the partial least squares (PLS), is a supervised method and will perform the same sort of covariance decomposition, albeit building a user-defined number of components (frequently designated as latent variables) that minimize the SSE from predicting a specified outcome with an ordinary least squares (OLS). The PLS is worth an entire post and so I will refrain from casting a second spotlight.

In case PCA is entirely new to you, there is an excellent Primer from Nature Biotechnology that I highly recommend. Notwithstanding the focus on life sciences, it should still be clear to others than biologists.

Mathematical foundation

There are numerous PCA formulations in the literature dating back as long as one century, but all in all PCA is pure linear algebra. One of the most popular methods is the singular value decomposition (SVD). The SVD algorithm breaks down a matrix X of size n \times p  into three pieces,

X = U \Sigma V^T

where U  is the matrix with the eigenvectors of X X^T \Sigma is the diagonal matrix with the singular values and V^T  is the matrix with the eigenvectors of X^T X . These matrices are of size n \times n n \times p and p \times p , respectively. The key difference of SVD compared to a matrix diagonalization (X = V \Sigma V^{-1} ) is that U  and V^T  are distinct orthonormal (orthogonal and unit-vector) matrices.

PCA reduces the p dimensions of your data set X down to k principal components (PCs). The scores from the first k PCs result from multiplying the first k columns of U with the k \times k  upper-left submatrix of \Sigma . The loading factors of the k^{th} PC are directly given in the k^{th}  row in V^T . Consequently, multiplying all scores and loadings recovers X . You might as well keep in mind:

  • PCs are ordered by the decreasing amount of variance explained
  • PCs are orthogonal i.e. uncorrelated to each other
  • The columns of X should be mean-centered, so that the covariance matrix \approx X^{T} X
  • SVD-based PCA does not tolerate missing values (but there are solutions we will cover shortly)

For a more elaborate explanation with introductory linear algebra, here is an excellent free SVD tutorial I found online. At any rate, I guarantee you can master PCA without fully understanding the process.

Let’s get started with R

Although there is a plethora of PCA methods available for R, I will only introduce two,

  • prcomp, a default function from the R base package
  • pcaMethods, a Bioconductor package that I frequently use for my own PCAs

I will start by demonstrating that prcomp is based on the SVD algorithm, using the base svd function.

# Generate scaled 4*5 matrix with random std normal samples
mat <- scale(matrix(rnorm(20), 4, 5))
dimnames(mat) <- list(paste("Sample", 1:4), paste("Var", 1:5))

# Perform PCA
myPCA <- prcomp(mat, scale. = F, center = F)
myPCA$rotation # loadings
myPCA$x # scores

By default, prcomp will retrieve min(n, p) PCs. Therefore, in our setting we expect having four PCs.The svd function will behave the same way:

# Perform SVD
mySVD <- svd(mat)
mySVD # the diagonal of Sigma mySVD$d is given as a vector
sigma <- matrix(0,4,4) # we have 4 PCs, no need for a 5th column
diag(sigma) <- mySVD$d # sigma is now our true sigma matrix

Now that we have the PCA and SVD objects, let us compare the respective scores and loadings. We will compare the scores from the PCA with the product of U and \Sigma from the SVD. In R, matrix multiplication is possible with the operator %*%. Next, we will directly compare the loadings from the PCA with V from the SVD, and finally show that multiplying scores and loadings recovers X . I rounded the results to five decimal digits since the results are not exactly the same! The function t retrieves a transposed matrix.

# Compare PCA scores with the SVD's U*Sigma
theoreticalScores <- mySVD$u %*% sigma
all(round(myPCA$x,5) == round(theoreticalScores,5)) # TRUE

# Compare PCA loadings with the SVD's V
all(round(myPCA$rotation,5) == round(mySVD$v,5)) # TRUE

# Show that mat == U*Sigma*t(V)
recoverMatSVD <- theoreticalScores %*% t(mySVD$v)
all(round(mat,5) == round(recoverMatSVD,5)) # TRUE

#Show that mat == scores*t(loadings)
recoverMatPCA <- myPCA$x %*% t(myPCA$rotation)
all(round(mat,5) == round(recoverMatPCA,5)) # TRUE

PCA of the wine data set

Now that we established the association between SVD and PCA, we will perform PCA on real data. I found a wine data set at the UCI Machine Learning Repository that might serve as a good starting example.

wine <- read.table("", sep=",")

According to the documentation, these data consist of 13 physicochemical parameters measured in 178 wine samples from three distinct cultivars grown in Italy. Let’s check patterns in pairs of variables, and then see what a PCA does about that by plotting PC1 against PC2.

# Name the variables
colnames(wine) <- c("Cvs","Alcohol","Malic acid","Ash","Alcalinity of ash", "Magnesium", "Total phenols", "Flavanoids", "Nonflavanoid phenols", "Proanthocyanins", "Color intensity", "Hue", "OD280/OD315 of diluted wines", "Proline")

# The first column corresponds to the classes
wineClasses <- factor(wine$Cvs)

# Use pairs
pairs(wine[,-1], col = wineClasses, upper.panel = NULL, pch = 16, cex = 0.5)
legend("topright", bty = "n", legend = c("Cv1","Cv2","Cv3"), pch = 16, col = c("black","red","green"),xpd = T, cex = 2, y.intersp = 0.5)


Among other things, we observe correlations between variables (e.g. total phenols and flavonoids), and occasionally the two-dimensional separation of the three cultivars (e.g. using alcohol % and the OD ratio).

If its hard enough looking into all pairwise interactions in a set of 13 variables, let alone in sets of hundreds or thousands of variables. In these instances PCA is of great help. Let’s give it a try in this data set: # clear the format from the previous plot
winePCA <- prcomp(scale(wine[,-1]))
plot(winePCA$x[,1:2], col = wineClasses)


Three lines of code and we see a clear separation among grape vine cultivars. In addition, the data points are evenly scattered over relatively narrow ranges in both PCs. We could next investigate which parameters contribute the most to this separation and how much variance is explained by each PC, but I will leave it for pcaMethods. We will now repeat the procedure after introducing an outlier in place of the 10th observation.

wineOutlier <- wine
wineOutlier[10,] <- wineOutlier[10,]*10 # change the 10th obs. into an extreme one by multiplying its profile by 10
outlierPCA <- prcomp(scale(wineOutlier[,-1]))
plot(outlierPCA$x[,1:2], col = wineClasses)


As expected, the huge variance stemming from the separation of the 10th observation from the core of all other samples is fully absorbed by the first PC. The outlying sample becomes plain evident.

PCA of the wine data set with pcaMethods

We will now turn to pcaMethods, a compact suite of PCA tools. First you will need to install it from the Bioconductor:



There are three mains reasons why I use pcaMethods so extensively:

  • Besides SVD, it provides several different methods (bayesian PCA, probabilistic PCA, robust PCA, to name a few)
  • Some of these algorithms tolerate and impute missing values
  • The object structure and plotting capabilities are user-friendly

All information available about the package can be found here. I will now simply show the joint scores-loadings plots, but still encourage you to explore it further.

I will select the default SVD method to reproduce our previous PCA result, with the same scaling strategy as before (UV, or unit-variance, as executed by scale). The argument scoresLoadings gives you control over printing scores, loadings, or both jointly as right next. The standard graphical parameters (e.g. cex, pch, col) preceded by either letters s or l control the aesthetics in the scores or loadings plots, respectively.

winePCAmethods <- pca(wine[,-1], scale = "uv", center = T, nPcs = 2, method = "svd")
slplot(winePCAmethods, scoresLoadings = c(T,T), scol = wineClasses)


So firstly, we have a faithful reproduction of the previous PCA plot. Then, having the loadings panel on its right side, we can claim that

  • Wine from Cv2 (red) has a lighter color intensity, lower alcohol %, a greater OD ratio and hue, compared to the wine from Cv1 and Cv3.
  • Wine from Cv3 (green) has a higher content of malic acid and non-flavanoid phenols, and a higher alkalinity of ash compared to the wine from Cv1 (black)

Finally, although the variance jointly explained by the first two PCs is printed by default (55.41%), it might be more informative consulting the variance explained in individual PCs. We can call the structure of winePCAmethods, inspect the slots and print those of interest, since there is a lot of information contained. The variance explained per component is stored in a slot named R2.

str(winePCAmethods) # slots are marked with @

Seemingly, PC1 and PC2 explain 36.2% and 19.2% of the variance in the wine data set, respectively. PCAs of data exhibiting strong effects (such as the outlier example given above) will likely result in the sequence of PCs showing an abrupt drop in the variance explained. Screeplots are helpful in that matter, and allow you determining how much variance you can put into a principal component regression (PCR), for example, which is exactly what we will try next.

PCR with the housing data set

Now we will tackle a regression problem using PCR. I will use an old housing data set also deposited in the UCI MLR. Again according to its documentation, these data consist of 14 variables and 504 records from distinct towns somewhere in the US. To perform PCR all we need is conduct PCA and feed the scores of n PCs to a OLS. Let’s try predicting the median value of owner-occupied houses in thousands of dollars (MEDV) using the first three PCs from a PCA.

houses <- read.table("",header = F, na.string = "?")
colnames(houses) <- c("CRIM", "ZN", "INDUS","CHAS","NOX","RM","AGE","DIS","RAD","TAX","PTRATIO","B","LSTAT","MEDV")

# Perform PCA
pcaHouses <- prcomp(scale(houses[,-14]))
scoresHouses <- pcaHouses$x

# Fit lm using the first 3 PCs
modHouses <- lm(houses$MEDV ~ scoresHouses[,1:3])

The printed summary shows two important pieces of information. Firstly, the three estimated coefficients (plus the intercept) are considered significant (H_0 : \beta = 0 ). Second, the predictability as defined by the R^2  (coefficient of determination, in most cases the same as the squared Pearson correlation coefficient) was 0.63.

Next we will compare this simple model to a OLS model featuring all 14 variables, and finally compare the observed vs. predicted MEDV plots from both models. Note that in the lm syntax, the response Y is given to the left of the tilde and the set of predictors X to the right. Moreover, provided there is an argument for data you can circumvent the need for typing all variable names for a full model (x_1 + x_2 + x_3 + ... x_p ), and simply use . .

# Fit lm using all 14 vars
modHousesFull <- lm(MEDV ~ ., data = houses)
summary(modHousesFull) # R2 = 0.741

# Compare obs. vs. pred. plots
par(mfrow = c(1,2))
plot(houses$MEDV, predict(modHouses), xlab = "Observed MEDV", ylab = "Predicted MEDV", main = "PCR", abline(a = 0, b = 1, col = "red"))
plot(houses$MEDV, predict(modHousesFull), xlab = "Observed MEDV", ylab = "Predicted MEDV", main = "Full model", abline(a = 0, b = 1, col = "red"))


Here the full model displays a slight improvement in fit (R^2 = 0.74 ). The high significance of most coefficient estimates is suggestive of a well-designed experiment. Nevertheless, it is notable that such a reduction of 13 down to three covariates still yields an accurate model.

Just as a side note, you probably noticed both models underestimated the MEDV in towns with MEVD worth 50,000 dollars. My guess is that missing values were set to MEVD = 50.


  • The SVD algorithm is founded on fundamental properties of linear algebra including matrix diagonalization. SVD-based PCA takes part of its solution and retains a reduced number of orthogonal covariates that explain as much variance as possible.
  • Use PCA when handling high-dimensional data. It is insensitive to correlation among variables and efficient in detecting sample outliers.
  • If you plan to use PCA results for subsequent analyses all care should be undertaken in the process. Although typically outperformed by numerous methods, PCR still benefits from interpretability and can be effective in many settings.

All feedback from these tutorials is very welcome, please enter the Contact tab and leave your comments. I do also appreciate suggestions. Enjoy!



Probability distributions in R

Some of the most fundamental functions in R, in my opinion, are those that deal with probability distributions. Whenever you compute a P-value you rely on a probability distribution, and there are many types out there. In this exercise I will cover four: Bernoulli, Binomial, Poisson, and Normal distributions. Let me begin with some theory first:


Think of Bernoulli as a single coin flip, with probability of success p  the coin will land heads. Let X  be the random variable defining the outcome of the coin flip, and it will follow a distribution of the form

X \sim \text{Bern} \left({p,p(1-p)} \right)

with mean p   and variance p(1-p) .  Because there are only two possible outcomes (i.e. the coin lands either heads or tails) this distribution will necessarily be characterised by a probability mass function (PMF), as any other probability distribution dealing with discrete outcomes such as the Binomial and Poisson we will discuss later on.


Think of Binomial as multiple independent Bernoulli trials, each with probabily of success p  the coin will land heads. Let X  be the random variable defining the number of successes in n trials, and it will follow a distribution of the form

X \sim \mathcal{B} \left({n,p} \right)

with mean np  (same as MLE) and variance np(1-p) . Notice that the variance of the Bernoulli and Binomial distributions is maximum when p = 0.5  (makes sense right? You are less certain about a binary outcome with 0.5 compared to any other value). There are n  possible outcomes, and the probability of each of of them is defined as

P(X = x) = {n \choose x}p^x(1-p)^{n-x}

so the probability of landing heads only once in 20 fair coin flips, for example, would be

P(X = 1|p = 0.5) = {20 \choose 1}0.5^1(1-0.5)^{19} \approx 1.91 \times 10^{-5} 

turning out to be very, very unlikely.


Think of Poisson as a the number of goals in a football match. Let X  be the random variable defining the number of occurrences in a given context with a rate of \lambda , and it will follow a distribution of the form

X \sim \text{Pois} \left( \lambda \right)

with \lambda being both the mean and the variance. Note that unlike the Bernoulli and Binomial, we use the Poisson distribution to model data in the context of unlimited occurrences.


Think of the Normal a.k.a. Gaussian distribution as the daily revenue from a local store. Let X be the unbiased random variable defining a quantity in a population, and it will follow a distribution of the form

X \sim \mathcal{N} \left({\mu,\sigma ^2} \right)

with mean \mu and variance \sigma ^2 . Because there are infinite outcomes this distribution will necessarily be characterised by a probability density function (PDF). The Normal distribution has interesting properties that leverage many statistical applications:

  1. The Central Limit Theorem (CLT) posits that no matter the shape of a particular distribution, the distribution of the sample mean (\overline{X} ) will follow a Normal distribution with \mu = \mu_{\overline{X}} and \sigma^2 = SE^2 = \frac{S^2}{n} . This is key for distinguishing populations.
  2. The 68-95-99.7 rule posits that 68, 95 and 99.7% of the observations are contained in the intervals \mu \pm 1 \sigma \mu \pm 2 \sigma and \mu \pm 3 \sigma , respectively.
  3. The standard Normal X \sim \mathcal{N} \left({0, 1} \right) is also a cornerstone in statistics. It simplifies many calculations since \sigma^2 = \sigma = 1 . Variable standardization (Z-normalization) is nothing else but transforming a normally-distributed sample into a standard Normal using \frac{x - \overline{x}}{s} , which explicitly mean-centers the data and scales for unit-variance (UV). This is often one of the first steps in predictive and inference modeling.

Let’s get started with R

We will now explore these distributions in R. Functions dealing with probability distributions in R have a single-letter prefix that defines the type of function we want to use. These prefixes are d, p, q and r. They refer to density/mass, cumulative, quantile and sampling functions, respectively. We will combine these prefixes with the names of the distributions we are interested in, which are binom (Bernoulli and Binomial), pois (Poisson) and norm (Normal). Note that a Binomial distribution with n = 1 is equivalent to a Bernoulli distribution, for the same value of p .

For starters, let us go back to the 20 coin flips example using p = 0.7 and plot the mass function of X \sim \mathcal{B} \left({20,0.7} \right) . We first generate a vector with the sequence of numbers 1,2,…20 and iterate the function over these values.

n <- 1:20
den <- dbinom(n, 20, 0.7)
plot(den, ylab = "Density", xlab = "Number of successes")
sum(den) # = 1


The probability is maximum for pn = 14 . Note that the area under mass/density functions must be one.

Let us now turn to a different problem: the daily revenue of a local store follows a distribution X \sim \mathcal{N} \left({1000, 200} \right) with \mu = \text{1000EUR}  and \sigma^2 = 200 . What is the probability that the revenue of today will be at least 1200EUR?

pnorm(1200,1000,200) # this gives us prob x smaller than 1200eur
1-pnorm(1200,1000,200) # this is the one, x greater than 1200eur

This probability is \approx 0.16 .

How about reversing the question? This is where quantiles become handy.

qnorm(1-0.16,1000,200) # = 1198.892

So not surprisingly, the 84th quantile is \approx 1200EUR.

Finally, a demonstration of the CLT. Let us start with a Poisson distribution X \sim \text{Pois} \left({3} \right) with a density that will look like this,

n <- 1:20
den <- dpois(n, 3)
plot(den, xlab = "Outcome", ylab = "Density")


and we will draw 100 samples of 10 observations each. In each sample we will take the average value only. I will use set.seed to ensure you will draw the same random samples I did.

myMeans <- vector()
for(i in 1:100){
   myMeans <- c(myMeans, mean(rpois(10,3)))
hist(myMeans, main = NULL, xlab = expression(bar(x)))



Looks not as normal as expected? That is because of the sample size. If you re-run the code above and replace 10 with 1000, you will obtain the following histogram.


Not only does the increased sample size improve the bell-shape, it also reduces the variance. Using the CLT we could successfully approximate \lambda = 3 . As you might realize, the CLT is more easily observed when variables are continuous.


  • Model your data with the appropriate distribution according to the underlying assumptions. There is a tendency for disregarding simple distributions, when in fact they can help the most.
  • While the definition of the Binomial and Poisson distributions is relatively straightforward, it is not so easy to ascertain ‘normality’ in a distribution of a continuous variable. A Box-Cox transformation might be helpful in resolving distributions skewed to different extents (and sign) into a Normal one.
  • There are many other distributions out there – Exponential, Gamma, Beta, etc. The rule of the prefixes aforementioned for R still applies (e.g. pgamma, rbeta).


Poisson means ‘fish’ in french

Poisson means ‘fish’ in french

At least this is how I used to remember the guy and his distribution. In the period spanning from 2008 to 2011 of my existence, I graduated in biology and never heard so much about a class as the class of biostatistics. All of us aspiring biologists deemed it as one of the toughest. I had prepared for the challenge with the same brisk readiness I did for all other classes. Promptly I realised at the age of 20 that scholars often consider ‘tough’ what in reality is (and should not be) boring. A lot of things went wrong. At a particular practical class on the basic probability theory, the lecturing professor called me out to solve an exercise in the blackboard and laughed at my hardship. His old-fashion teaching method did well in keeping most of my peers away – black and white slides citing dull old manuscripts, speech in monotone mode and little development on the formulas bombarding the white screen in our seminar hall. I was doing t-tests and I did not know its purpose nor how it did work. More surprisingly, considering the state-of-the-art in biostatistics at that time, we had only a five minute-long experience with R during which we learned to extract random numbers and plot shitty histograms. It turned out, at the end of 2009, that I failed biostatistics. The second attempt the year after, going through all over again, felt like having to eat my own vomit. Twice as ‘tough’ as the first attempt though I managed to survive with a very modest score. When it was over I swore I would never go back to it, and happily proceeded to a MSc in molecular biology (2011 to 2013). While I did not go back to it, it did come to me, slowly and gradually, following the boiling frog process. By the end of my master’s I noticed I had learned more theory and practice on my own than I had in those classes. As of today I am a PhD student working on the analysis of biological data.

I look back and wish I had better luck in the beginning – statistics and mathematics are cool and you should know it. With the exponentially growing amount of data worldwide there is a concomitant need for data analysis. With the existing online resources you can learn and master it all for free, using your own machine. This is my contribution for the people that like me felt incapable and frustrated, and I hope these R tutorials will reach the understanding of any interested people from any background. Never give up!