Partial least squares in R

(Image: blood serum samples. From: https://health.onehowto.com)

My last entry introduced principal component analysis (PCA), one of many unsupervised learning tools. I concluded the post with a demonstration of principal component regression (PCR), which essentially is an ordinary least squares (OLS) fit using the first k principal components (PCs) derived from the predictors. This brings several advantages:

  1. There is virtually no limit to the number of predictors. PCA will perform the decomposition no matter how many variables you handle. Simpler models such as OLS cannot cope with more predictors than observations (i.e. p > n ).
  2. Correlated predictors do not undermine the regression fit. Collinearity is a problem for OLS because it widens the solution space, i.e. it destabilizes coefficient estimation. PCA will always produce a few uncorrelated PCs from a set of variables, correlated or not (see the short sketch after this list).
  3. The PCs carry the maximum amount of variance possible. In any predictive model, predictors with zero or near-zero variance often constitute a problem and behave as second intercepts. In the process of compression, PCA will find the projections where your data points spread out the most, thus facilitating learning in the subsequent OLS.
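To make points 1 and 2 concrete, here is a minimal sketch with random data (not the Arcene set used below) showing that PCA copes with p \gg n  and yields uncorrelated PCs:

# Minimal sketch: 50 observations, 200 variables (p >> n)
set.seed(1)
X <- matrix(rnorm(50 * 200), nrow = 50)
pcs <- prcomp(X)$x # scores; at most min(n, p) PCs are returned
cormat <- cor(pcs[, 1:5])
max(abs(cormat[upper.tri(cormat)])) # effectively zero - the PCs are uncorrelated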

However, in many cases it is much wiser to perform a decomposition similar to PCA, yet one that constructs components which best explain the response, be it quantitative (regression) or qualitative (classification). This is the concept of partial least squares (PLS), whose components are more often designated latent variables (LVs), although in my understanding the two terms can be used interchangeably.

PLS safeguards advantages 1. and 2. and does not necessarily follow 3. In addition, note that if you use as many LVs as there are predictors (k = p), you get back exactly an OLS fit: each predictor gets its own LV, so the compression does not help much. PLS (regression) and PLS followed by discriminant analysis (PLS-DA, classification) are tremendously useful in predictive modelling. They are adequate in a wide variety of experimental designs and linear in their parameters, therefore more easily interpretable.

Today we will perform PLS-DA on the Arcene data set hosted at the UCI Machine Learning Repository, which comprises 100 observations and 10,000 explanatory variables (p \gg n ), in order to diagnose cancer from serum samples. Of the 10,000 features, 7,000 are distinct mass spectrometry (MS) peaks, each measuring the level of a protein. For some reason the contributors added 3,000 random variables, probably to test robustness against noise (see the link above for more information).

Let’s get started with R

For predictive modelling I always use the caret package, which builds on existing model packages and employs a single syntax. In addition, caret features countless functions that deal with pre-processing, imputation, summaries, plotting and much more, as we will see firsthand shortly.

For some reason there is an empty column sticking to the data set upon loading, so I have added the colClasses argument to get rid of it. Once you load the data set, take a look and note that the levels of most proteins are right-skewed, which can be a problem. However, the authors of this study conducted an exhaustive pre-processing, and looking into this would take far too long. The cancer / no-cancer labels (coded as -1 / 1) are stored in a different file, so we can directly append them to the full data set and later use the formula syntax for training the models.

# Load caret, install if necessary
library(caret)
arcene <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/arcene/ARCENE/arcene_train.data", sep = " ",
 colClasses = c(rep("numeric", 10000), "NULL"))

# Add the labels as an additional column
arcene$class <- factor(scan("https://archive.ics.uci.edu/ml/machine-learning-databases/arcene/ARCENE/arcene_train.labels", sep = "\t"))

We can finally search for missing values (NA’s). If any were present, they would require some form of imputation; that is not the case here. Calling any(is.na(arcene)) should return FALSE.
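For completeness, here is that check as a one-liner:

# Check for missing values - should return FALSE
any(is.na(arcene))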

So now the big questions are:

  • How accurately can we predict whether a patient is sick based on the MS profile of his/her blood serum?
  • Which proteins / MS peaks best discriminate sick and healthy patients?

Time to get started with the function train. With respect to pre-processing, we will remove zero-variance predictors and center and scale all those remaining, in this exact order, using the preProc argument. Considering the size of the sample (n = 100), I will pick a 10x repeated 5-fold cross-validation (CV): a large number of repetitions compensates for the high variance stemming from the reduced number of folds, rendering a total of 50 estimates of accuracy. Many experts advise in favor of simpler models when performance is indistinguishable, so we will apply the “one SE” rule, which selects the least complex model whose average cross-validated accuracy lies within one standard error (SE) of that of the optimal model. Finally, we must define a range of values for the tuning parameter, the number of LVs. In this particular case we will consider all models with 1-20 LVs.

# Compile cross-validation settings
set.seed(100)
myfolds <- createMultiFolds(arcene$class, k = 5, times = 10)
control <- trainControl("repeatedcv", index = myfolds, selectionFunction = "oneSE")

# Train PLS model
mod1 <- train(class ~ ., data = arcene,
 method = "pls",
 metric = "Accuracy",
 tuneLength = 20,
 trControl = control,
 preProc = c("zv","center","scale"))

# Check CV profile
plot(mod1)

(Figure: cross-validated accuracy profile across the number of LVs)

This figure depicts the CV profile, where we can learn the average accuracy (y-axis, %) obtained from models trained with different numbers of LVs (x-axis). Note the swift change in accuracy among models with 1-5 LVs. Although the model with six LVs had the highest average accuracy, calling mod1 in the console will show you that, because of the “one SE” rule, the selected model has five LVs (accuracy of 80.6%).
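If you prefer not to scan the full print-out, two standard caret accessors give the same information:

# Number of LVs selected under the "one SE" rule and its resampled performance
mod1$bestTune
getTrainPerf(mod1)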

Now we will compare our PLS-DA to the classifier homolog of PCR: linear discriminant analysis (LDA) following a PCA reduction step (PCA-DA, if such a thing exists). Note that the PCA pre-processing is also set in the preProc argument. We can also try some more complicated models, say, random forests (RF). I will not go into detail about the workings and parameterisation of RF, as today it is purely illustrative. Note that RF will take comparatively long to run (~15 min on my 8 GB MacBook Pro from 2012). As usual, a simple object call will show you the summary of the corresponding CV.

# PCA-DA
mod2 <- train(class ~ ., data = arcene,
 method = "lda",
 metric = "Accuracy",
 trControl = control,
 preProc = c("zv","center","scale","pca"))

# RF
mod3 <- train(class ~ ., data = arcene,
 method = "ranger",
 metric = "Accuracy",
 trControl = control,
 tuneGrid = expand.grid(mtry = round(seq(10, .5*ncol(arcene), length.out = 6)),
  splitrule = "gini", min.node.size = 1), # mtry must be an integer; newer caret versions also expect splitrule and min.node.size
 preProc = c("zv","center","scale"))

Finally, we can compare PLS-DA, PCA-DA and RF with respect to accuracy. We will compile the three models using caret::resamples and borrow the plotting capabilities of lattice to compare the 50 accuracy estimates from the optimal cross-validated model in each of the three cases.

# Compile models and compare performance
models <- resamples(list("PLS-DA" = mod1, "PCA-DA" = mod2, "RF" = mod3))
bwplot(models, metric = "Accuracy")

(Figure: box-and-whisker plots of the resampled accuracy for PLS-DA, PCA-DA and RF)

It is clear that the long RF run did not translate into an excelling performance, quite the opposite. Although on average all three models have similar performances, the RF displays a much larger variance in accuracy, which is of course a concern if we seek a robust model. In this case PLS-DA and PCA-DA exhibit the best performance (63-95% accuracy) and either model would do well in diagnosing cancer in new serum samples.
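For the numbers behind the box plots, resamples objects also have a summary method:

# Numeric summary of the 50 accuracy estimates per model
summary(models, metric = "Accuracy")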

To conclude, we will determine the ten proteins that best diagnose cancer using the variable importance in the projection (ViP), from both the PLS-DA and PCA-DA. With varImp, this is given in relative levels (scaled to the range 0-100).

plot(varImp(mod1), 10, main = "PLS-DA")
plot(varImp(mod2), 10, main = "PCA-DA")

(Figure: top 10 most important variables according to the PLS-DA)

(Figure: top 10 most important variables according to the PCA-DA)

Very different figures, right? Idiosyncrasies aside, how many PCs did the PCA-DA use? By calling mod2$preProcess, we learn that “PCA needed 82 components to capture 95 percent of the variance”. This is because the PCA pre-processing functionality in caret uses, by default, as many PCs as it takes to cover 95% of the data variance. So, on the one hand we have a PLS-DA with five LVs, on the other hand a PCA-DA with 82 PCs. This not only makes the PCA-DA needlessly complex, but also arcane at the interpretation level.
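If a leaner PCA-DA is desired, caret lets you lower that 95% threshold through the preProcOptions argument of trainControl. A minimal sketch, reusing the folds from before (the 0.8 value is arbitrary):

# Keep only enough PCs to cover 80% of the variance during PCA pre-processing
control80 <- trainControl("repeatedcv", index = myfolds,
 selectionFunction = "oneSE",
 preProcOptions = list(thresh = 0.8))
# mod2 could then be re-trained with trControl = control80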

The PLS-DA ViP plot above clearly distinguishes V1184 from all other proteins. This could be an interesting cancer biomarker. Of course, many other tests and models must be conducted in order to provide a reliable diagnostic tool. My aim here is only to introduce PLS and PLS-DA.
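If you want to eyeball this candidate, a simple box plot per class will do (assuming the columns kept the default V1...V10000 names assigned by read.table, so that the protein is stored in arcene$V1184):

# Levels of the top PLS-DA candidate in cancer vs. non-cancer samples
boxplot(V1184 ~ class, data = arcene,
 xlab = "Class", ylab = "V1184 level")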

And that’s it for today; I hope you enjoyed it. Please drop me a comment or message if you have any suggestions for the next post. Cheers!

Principal Component Analysis in R

Principal component analysis (PCA) is routinely employed on a wide range of problems. From the detection of outliers to predictive modeling, PCA has the ability of projecting the observations described by p variables into a few orthogonal components, defined along the directions where the data ‘stretch’ the most, rendering a simplified overview. PCA is particularly powerful in dealing with multicollinearity and variables that outnumber the samples (p \gg n ).


It is an unsupervised method, meaning it will always look into the greatest sources of variation regardless of the data structure. Its counterpart, partial least squares (PLS), is a supervised method and will perform the same sort of covariance decomposition, albeit building a user-defined number of components (frequently designated latent variables) that minimize the SSE from predicting a specified outcome with an ordinary least squares (OLS) fit. PLS is worth an entire post, so I will refrain from casting a second spotlight on it here.

In case PCA is entirely new to you, there is an excellent Primer from Nature Biotechnology that I highly recommend. Notwithstanding the focus on life sciences, it should still be clear to others than biologists.

Mathematical foundation

There are numerous PCA formulations in the literature dating back as far as a century, but all in all PCA is pure linear algebra. One of the most popular methods is the singular value decomposition (SVD). The SVD algorithm breaks down a matrix X of size n \times p  into three pieces,

X = U \Sigma V^T

where U  is the matrix with the eigenvectors of X X^T , \Sigma  is the diagonal matrix with the singular values and V^T  is the matrix with the eigenvectors of X^T X . These matrices are of size n \times n , n \times p  and p \times p , respectively. The key difference of SVD compared to a matrix diagonalization (X = V \Sigma V^{-1} ) is that U  and V^T  are distinct orthonormal (orthogonal and unit-vector) matrices.

PCA reduces the p dimensions of your data set X down to k principal components (PCs). The scores from the first k PCs result from multiplying the first k columns of U with the k \times k  upper-left submatrix of \Sigma . The loading factors of the k^{th} PC are directly given in the k^{th}  row in V^T . Consequently, multiplying the scores by the transposed loadings recovers X (these relations are summarised right after the list below). You might as well keep in mind that:

  • PCs are ordered by the decreasing amount of variance explained
  • PCs are orthogonal i.e. uncorrelated to each other
  • The columns of X should be mean-centered, so that the covariance matrix \approx X^{T} X
  • SVD-based PCA does not tolerate missing values (but there are solutions we will cover shortly)
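In symbols, and keeping the notation above (with T_k  denoting the n \times k  score matrix, U_k  the first k columns of U , \Sigma_k  the k \times k  upper-left submatrix of \Sigma  and V_k  the first k columns of V ), the relations just described read

T_k = U_k \Sigma_k = X V_k , \qquad X = U \Sigma V^T = T V^T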

For a more elaborate explanation with introductory linear algebra, here is an excellent free SVD tutorial I found online. At any rate, I guarantee you can master PCA without fully understanding the process.

Let’s get started with R

Although there is a plethora of PCA methods available for R, I will only introduce two,

  • prcomp, a function from the stats package that ships with base R
  • pcaMethods, a Bioconductor package that I frequently use for my own PCAs

I will start by demonstrating that prcomp is based on the SVD algorithm, using the base svd function.

# Generate scaled 4*5 matrix with random std normal samples
set.seed(101)
mat <- scale(matrix(rnorm(20), 4, 5))
dimnames(mat) <- list(paste("Sample", 1:4), paste("Var", 1:5))

# Perform PCA
myPCA <- prcomp(mat, scale. = F, center = F)
myPCA$rotation # loadings
myPCA$x # scores

By default, prcomp will retrieve min(n, p) PCs. Therefore, in our setting we expect to have four PCs. The svd function will behave the same way:

# Perform SVD
mySVD <- svd(mat)
mySVD # the diagonal of Sigma (mySVD$d) is given as a vector
sigma <- matrix(0,4,4) # we have 4 PCs, no need for a 5th column
diag(sigma) <- mySVD$d # sigma is now our true sigma matrix

Now that we have the PCA and SVD objects, let us compare the respective scores and loadings. We will compare the scores from the PCA with the product of U and \Sigma  from the SVD. In R, matrix multiplication is performed with the operator %*%. Next, we will directly compare the loadings from the PCA with V from the SVD, and finally show that multiplying scores and transposed loadings recovers X . I rounded the results to five decimal digits since they are not exactly the same! The function t retrieves a transposed matrix.

# Compare PCA scores with the SVD's U*Sigma
theoreticalScores <- mySVD$u %*% sigma
all(round(myPCA$x,5) == round(theoreticalScores,5)) # TRUE

# Compare PCA loadings with the SVD's V
all(round(myPCA$rotation,5) == round(mySVD$v,5)) # TRUE

# Show that mat == U*Sigma*t(V)
recoverMatSVD <- theoreticalScores %*% t(mySVD$v)
all(round(mat,5) == round(recoverMatSVD,5)) # TRUE

#Show that mat == scores*t(loadings)
recoverMatPCA <- myPCA$x %*% t(myPCA$rotation)
all(round(mat,5) == round(recoverMatPCA,5)) # TRUE

PCA of the wine data set

Now that we have established the association between SVD and PCA, we will perform PCA on real data. I found a wine data set at the UCI Machine Learning Repository that might serve as a good starting example.

wine <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", sep=",")

According to the documentation, these data consist of 13 physicochemical parameters measured in 178 wine samples from three distinct cultivars grown in Italy. Let’s check patterns in pairs of variables, and then see what a PCA does about that by plotting PC1 against PC2.

# Name the variables
colnames(wine) <- c("Cvs","Alcohol","Malic acid","Ash","Alcalinity of ash", "Magnesium", "Total phenols", "Flavanoids", "Nonflavanoid phenols", "Proanthocyanins", "Color intensity", "Hue", "OD280/OD315 of diluted wines", "Proline")

# The first column corresponds to the classes
wineClasses <- factor(wine$Cvs)

# Use pairs
pairs(wine[,-1], col = wineClasses, upper.panel = NULL, pch = 16, cex = 0.5)
legend("topright", bty = "n", legend = c("Cv1","Cv2","Cv3"), pch = 16, col = c("black","red","green"),xpd = T, cex = 2, y.intersp = 0.5)

(Figure: pairwise scatter plots of the 13 wine variables, colored by cultivar)

Among other things, we observe correlations between variables (e.g. total phenols and flavonoids), and occasionally the two-dimensional separation of the three cultivars (e.g. using alcohol % and the OD ratio).

It is hard enough to look into all pairwise interactions in a set of 13 variables, let alone in sets of hundreds or thousands of variables. In these instances PCA is of great help. Let’s give it a try on this data set:

dev.off() # clear the format from the previous plot
winePCA <- prcomp(scale(wine[,-1]))
plot(winePCA$x[,1:2], col = wineClasses)

(Figure: PC1 vs. PC2 scores of the wine PCA, colored by cultivar)

Three lines of code and we see a clear separation among the grape vine cultivars. In addition, the data points are evenly scattered over relatively narrow ranges in both PCs. We could next investigate which parameters contribute the most to this separation and how much variance is explained by each PC, but I will leave that for pcaMethods.
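If you want a quick preview in the meantime, the base summary method on the prcomp object already reports the variance explained per PC:

# Standard deviation, proportion of variance and cumulative proportion per PC
summary(winePCA)

We will now repeat the procedure after introducing an outlier in place of the 10th observation.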

wineOutlier <- wine
wineOutlier[10,] <- wineOutlier[10,]*10 # change the 10th obs. into an extreme one by multiplying its profile by 10
outlierPCA <- prcomp(scale(wineOutlier[,-1]))
plot(outlierPCA$x[,1:2], col = wineClasses)

(Figure: PC1 vs. PC2 scores after replacing the 10th observation with an outlier)

As expected, the huge variance stemming from the separation of the 10th observation from the core of all other samples is fully absorbed by the first PC. The outlying sample becomes plainly evident.

PCA of the wine data set with pcaMethods

We will now turn to pcaMethods, a compact suite of PCA tools. First you will need to install it from Bioconductor:

source("https://bioconductor.org/biocLite.R")
biocLite("pcaMethods")
library(pcaMethods)

 

There are three main reasons why I use pcaMethods so extensively:

  • Besides SVD, it provides several other methods (Bayesian PCA, probabilistic PCA, robust PCA, to name a few)
  • Some of these algorithms tolerate and impute missing values
  • The object structure and plotting capabilities are user-friendly

All information available about the package can be found here. I will now simply show the joint scores-loadings plots, but still encourage you to explore it further.

I will select the default SVD method to reproduce our previous PCA result, with the same scaling strategy as before (UV, or unit variance, as executed by scale). The argument scoresLoadings gives you control over printing scores, loadings, or both jointly, as shown next. The standard graphical parameters (e.g. cex, pch, col) prefixed with the letter s or l control the aesthetics of the scores and loadings plots, respectively.

winePCAmethods <- pca(wine[,-1], scale = "uv", center = T, nPcs = 2, method = "svd")
slplot(winePCAmethods, scoresLoadings = c(T,T), scol = wineClasses)

(Figure: joint scores and loadings plot of the wine PCA from pcaMethods)

So firstly, we have a faithful reproduction of the previous PCA plot. Then, with the loadings panel on its right side, we can claim that

  • Wine from Cv2 (red) has a lighter color intensity, lower alcohol %, a greater OD ratio and hue, compared to the wine from Cv1 and Cv3.
  • Wine from Cv3 (green) has a higher content of malic acid and non-flavanoid phenols, and a higher alkalinity of ash compared to the wine from Cv1 (black).

Finally, although the variance jointly explained by the first two PCs is printed by default (55.41%), it might be more informative to consult the variance explained by individual PCs. We can call the structure of winePCAmethods, inspect the slots and print those of interest, since there is a lot of information packed in the object. The variance explained per component is stored in a slot named R2.

str(winePCAmethods) # slots are marked with @
winePCAmethods@R2

So PC1 and PC2 explain 36.2% and 19.2% of the variance in the wine data set, respectively. PCAs of data exhibiting strong effects (such as the outlier example given above) will likely result in the sequence of PCs showing an abrupt drop in the variance explained. Scree plots are helpful in that matter, and allow you to determine how much variance you can put into a principal component regression (PCR), for example, which is exactly what we will try next.
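As a quick illustration of such a scree plot, using the prcomp object fitted earlier:

# Scree plot: variance of each PC in the wine data
screeplot(winePCA, type = "lines", main = "Wine data scree plot")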

PCR with the housing data set

Now we will tackle a regression problem using PCR. I will use an old housing data set also deposited in the UCI MLR. Again according to its documentation, these data consist of 14 variables and 506 records from distinct towns in the Boston, MA area. To perform PCR all we need to do is conduct a PCA and feed the scores of the first k PCs to an OLS fit. Let’s try predicting the median value of owner-occupied houses in thousands of dollars (MEDV) using the first three PCs from a PCA.

houses <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data",header = F, na.string = "?")
colnames(houses) <- c("CRIM", "ZN", "INDUS","CHAS","NOX","RM","AGE","DIS","RAD","TAX","PTRATIO","B","LSTAT","MEDV")

# Perform PCA
pcaHouses <- prcomp(scale(houses[,-14]))
scoresHouses <- pcaHouses$x

# Fit lm using the first 3 PCs
modHouses <- lm(houses$MEDV ~ scoresHouses[,1:3])
summary(modHouses)

The printed summary shows two important pieces of information. First, the three estimated coefficients (plus the intercept) are all considered significant (rejecting H_0 : \beta = 0 ). Second, the predictability as defined by the R^2  (coefficient of determination, in most cases the same as the squared Pearson correlation coefficient) was 0.63.

Next we will compare this simple model to an OLS model featuring all 13 predictors, and finally compare the observed vs. predicted MEDV plots from both models. Note that in the lm syntax, the response Y is given to the left of the tilde and the set of predictors X to the right. Moreover, provided there is an argument for data, you can circumvent the need to type all variable names for a full model (x_1 + x_2 + x_3 + ... + x_p ) and simply use . instead.

# Fit lm using all 13 predictors
modHousesFull <- lm(MEDV ~ ., data = houses)
summary(modHousesFull) # R2 = 0.741

# Compare obs. vs. pred. plots
par(mfrow = c(1,2))
plot(houses$MEDV, predict(modHouses), xlab = "Observed MEDV", ylab = "Predicted MEDV", main = "PCR")
abline(a = 0, b = 1, col = "red")
plot(houses$MEDV, predict(modHousesFull), xlab = "Observed MEDV", ylab = "Predicted MEDV", main = "Full model")
abline(a = 0, b = 1, col = "red")

(Figure: observed vs. predicted MEDV for the PCR and full OLS models)

Here the full model displays a slight improvement in fit (R^2 = 0.74 ). The high significance of most coefficient estimates is suggestive of a well-designed experiment. Nevertheless, it is notable that reducing 13 covariates down to three still yields an accurate model.

Just as a side note, you probably noticed both models underestimated the MEDV in towns where MEDV is worth 50,000 dollars. My guess is that missing values were set to MEDV = 50.
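A quick way to check how many records sit exactly at that value:

# Number of towns recorded with MEDV exactly equal to 50
sum(houses$MEDV == 50)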

Concluding,

  • The SVD algorithm is founded on fundamental properties of linear algebra, including matrix diagonalization. SVD-based PCA takes part of this machinery to retain a reduced number of orthogonal covariates that explain as much variance as possible.
  • Use PCA when handling high-dimensional data. It is insensitive to correlation among variables and efficient in detecting sample outliers.
  • If you plan to use PCA results for subsequent analyses, all care should be taken in the process. Although typically outperformed by numerous methods, PCR still benefits from interpretability and can be effective in many settings.

All feedback on these tutorials is very welcome; please go to the Contact tab and leave your comments. I also appreciate suggestions. Enjoy!

 

 

Probability distributions in R

Some of the most fundamental functions in R, in my opinion, are those that deal with probability distributions. Whenever you compute a P-value you rely on a probability distribution, and there are many types out there. In this exercise I will cover four: Bernoulli, Binomial, Poisson, and Normal distributions. Let me begin with some theory first:

Bernoulli

Think of Bernoulli as a single coin flip, where the coin lands heads with probability of success p . Let X  be the random variable defining the outcome of the coin flip; it will follow a distribution of the form

X \sim \text{Bern} \left( p \right)

with mean p  and variance p(1-p) . Because there are only two possible outcomes (i.e. the coin lands either heads or tails), this distribution is necessarily characterised by a probability mass function (PMF), as is any other probability distribution dealing with discrete outcomes, such as the Binomial and Poisson we will discuss later on.
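A minimal sketch in R: the Bernoulli is covered by the Binomial functions with size = 1 (p = 0.3 is an arbitrary choice):

# Bernoulli PMF via dbinom with a single trial
dbinom(0:1, size = 1, prob = 0.3) # P(X = 0) = 0.7 and P(X = 1) = 0.3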

Binomial

Think of Binomial as multiple independent Bernoulli trials, each with probability of success p  that the coin lands heads. Let X  be the random variable defining the number of successes in n trials; it will follow a distribution of the form

X \sim \mathcal{B} \left({n,p} \right)

with mean np  and variance np(1-p) . Notice that the variance of the Bernoulli and Binomial distributions is largest when p = 0.5  (makes sense, right? You are less certain about a binary outcome with p = 0.5 than with any other value). There are n + 1  possible outcomes (0 to n successes), and the probability of each of them is defined as

P(X = x) = {n \choose x}p^x(1-p)^{n-x}

so the probability of landing heads only once in 20 fair coin flips, for example, would be

P(X = 1|p = 0.5) = {20 \choose 1}0.5^1(1-0.5)^{19} \approx 1.91 \times 10^{-5} 

turning out to be very, very unlikely.
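The same figure can be obtained directly from dbinom:

# Probability of exactly one success in 20 fair coin flips
dbinom(1, size = 20, prob = 0.5) # approx. 1.91e-05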

Poisson

Think of Poisson as the number of goals in a football match. Let X  be the random variable defining the number of occurrences in a given context with a rate of \lambda ; it will follow a distribution of the form

X \sim \text{Pois} \left( \lambda \right)

with \lambda  being both the mean and the variance. Note that unlike the Bernoulli and Binomial, we use the Poisson distribution to model counts with no fixed upper limit on the number of occurrences.
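A small sketch of the corresponding R functions (the rate of 2.5 goals per match is an arbitrary choice):

# Probability of exactly 2 goals when the rate is 2.5 goals per match
dpois(2, lambda = 2.5)
# Mean and variance of a large Poisson sample should both be close to lambda
set.seed(100)
x <- rpois(10000, lambda = 2.5)
c(mean(x), var(x))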

Normal

Think of the Normal, a.k.a. Gaussian, distribution as the daily revenue of a local store. Let X be the random variable describing such a quantity in a population; it will follow a distribution of the form

X \sim \mathcal{N} \left({\mu,\sigma ^2} \right)

with mean \mu  and variance \sigma ^2 . Because there are infinitely many possible outcomes, this distribution is necessarily characterised by a probability density function (PDF). The Normal distribution has interesting properties that underpin many statistical applications:

  1. The Central Limit Theorem (CLT) posits that no matter the shape of a particular distribution, the distribution of the sample mean (\overline{X} ) approaches a Normal distribution with mean \mu  and variance \sigma^2 / n  (estimated by SE^2 = \frac{S^2}{n} ). This is key for distinguishing populations.
  2. The 68-95-99.7 rule posits that 68%, 95% and 99.7% of the observations are contained in the intervals \mu \pm 1 \sigma , \mu \pm 2 \sigma  and \mu \pm 3 \sigma , respectively.
  3. The standard Normal X \sim \mathcal{N} \left({0, 1} \right) is also a cornerstone in statistics. It simplifies many calculations since \sigma^2 = \sigma = 1 . Variable standardization (Z-normalization) is nothing else but transforming a normally-distributed sample into a standard Normal using \frac{x - \overline{x}}{s} , which explicitly mean-centers the data and scales for unit-variance (UV). This is often one of the first steps in predictive and inference modeling.

Let’s get started with R

We will now explore these distributions in R. Functions dealing with probability distributions in R have a single-letter prefix that defines the type of function we want to use. These prefixes are d, p, q and r, referring to the density/mass, cumulative distribution, quantile and random sampling functions, respectively. We will combine these prefixes with the names of the distributions we are interested in, which are binom (Bernoulli and Binomial), pois (Poisson) and norm (Normal). Note that a Binomial distribution with n = 1 is equivalent to a Bernoulli distribution with the same value of p .
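As a quick tour of the four prefixes, using the standard Normal (the specific values are arbitrary):

# d, p, q and r illustrated with the standard Normal
dnorm(0) # density at x = 0
pnorm(1.96) # P(X <= 1.96), approx. 0.975
qnorm(0.975) # the reverse operation: the 97.5th percentile, approx. 1.96
set.seed(1)
rnorm(3) # three random draws
# Side note: the 68-95-99.7 rule from above, checked with pnorm
pnorm(1:3) - pnorm(-(1:3)) # approx. 0.683, 0.954, 0.997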

For starters, let us go back to the 20 coin flips example using p = 0.7 and plot the mass function of X \sim \mathcal{B} \left({20,0.7} \right) . We first generate a vector with the sequence of numbers 1,2,…20 and iterate the function over these values.

n <- 1:20
den <- dbinom(n, 20, 0.7)
plot(den, ylab = "Density", xlab = "Number of successes")
sum(den) # ~ 1 (only P(X = 0), which is negligible here, is left out)

(Figure: PMF of X ~ B(20, 0.7))

The probability is maximal at np = 14  successes. Note that the probabilities of a mass function must sum to one, just as the area under a density function must equal one.

Let us now turn to a different problem: the daily revenue of a local store follows a distribution X \sim \mathcal{N} \left({1000, 200^2} \right)  with \mu = \text{1000 EUR}  and \sigma = 200 . What is the probability that today's revenue will be at least 1200 EUR?

pnorm(1200,1000,200) # this gives us prob x smaller than 1200eur
1-pnorm(1200,1000,200) # this is the one, x greater than 1200eur

This probability is \approx 0.16 .

How about reversing the question? This is where quantiles become handy.

qnorm(1-0.16,1000,200) # = 1198.892

So, not surprisingly, the 84th percentile is \approx 1200 EUR.

Finally, a demonstration of the CLT. Let us start with a Poisson distribution X \sim \text{Pois} \left({3} \right) with a density that will look like this,

n <- 1:20
den <- dpois(n, 3)
plot(den, xlab = "Outcome", ylab = "Density")

(Figure: PMF of X ~ Pois(3))

and we will draw 100 samples of 10 observations each. From each sample we will keep only the average value. I will use set.seed to ensure you draw the same random samples I did.

myMeans <- vector()
for(i in 1:100){
   set.seed(i)
   myMeans <- c(myMeans, mean(rpois(10,3)))
}
hist(myMeans, main = NULL, xlab = expression(bar(x)))

 

(Figure: histogram of the 100 sample means, 10 observations per sample)

Does it look less normal than expected? That is because of the sample size. If you re-run the code above replacing 10 with 1000, you will obtain the following histogram.
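For reference, here is a compact sapply variant of the loop above with the larger sample size:

# Same experiment with samples of 1,000 observations each
myMeans1000 <- sapply(1:100, function(i) {
   set.seed(i)
   mean(rpois(1000, 3))
})
hist(myMeans1000, main = NULL, xlab = expression(bar(x)))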

(Figure: histogram of the 100 sample means, 1,000 observations per sample)

Not only does the increased sample size improve the bell-shape, it also reduces the variance. Using the CLT we could successfully approximate \lambda = 3 . As you might realize, the CLT is more easily observed when variables are continuous.

Concluding,

  • Model your data with the appropriate distribution according to the underlying assumptions. There is a tendency to disregard simple distributions when in fact they can help the most.
  • While the definition of the Binomial and Poisson distributions is relatively straightforward, it is not so easy to ascertain ‘normality’ in a distribution of a continuous variable. A Box-Cox transformation might be helpful in resolving distributions skewed to different extents (and sign) into a Normal one.
  • There are many other distributions out there – Exponential, Gamma, Beta, etc. The rule of the prefixes aforementioned for R still applies (e.g. pgamma, rbeta).

Enjoy!


Poisson means ‘fish’ in French

At least this is how I used to remember the guy and his distribution. In the period of my existence spanning 2008 to 2011, I graduated in biology and never heard so much about a class as I did about biostatistics. All of us aspiring biologists deemed it one of the toughest. I had prepared for the challenge with the same brisk readiness I did for all other classes. Promptly I realised, at the age of 20, that scholars often consider ‘tough’ what in reality is (and should not be) boring. A lot of things went wrong. At a particular practical class on basic probability theory, the lecturing professor called me out to solve an exercise on the blackboard and laughed at my hardship. His old-fashioned teaching method did well in keeping most of my peers away: black and white slides citing dull old manuscripts, speech in monotone mode and little development of the formulas bombarding the white screen in our seminar hall. I was doing t-tests and I did not know their purpose nor how they worked. More surprisingly, considering the state of the art in biostatistics at that time, we had only a five-minute-long experience with R, during which we learned to extract random numbers and plot shitty histograms. It turned out, at the end of 2009, that I failed biostatistics. The second attempt the year after, going through it all over again, felt like having to eat my own vomit. Twice as ‘tough’ as the first attempt, though I managed to survive with a very modest score. When it was over I swore I would never go back to it, and happily proceeded to a MSc in molecular biology (2011 to 2013). While I did not go back to it, it did come to me, slowly and gradually, following the boiling frog process. By the end of my master’s I noticed I had learned more theory and practice on my own than I had in those classes. As of today I am a PhD student working on the analysis of biological data.

I look back and wish I had had better luck in the beginning: statistics and mathematics are cool and you should know it. With the exponentially growing amount of data worldwide, there is a concomitant need for data analysis. With the existing online resources you can learn and master it all for free, using your own machine. This is my contribution for the people who, like me, felt incapable and frustrated, and I hope these R tutorials will be accessible to interested people from any background. Never give up!