Probability distributions in R

Some of the most fundamental functions in R, in my opinion, are those that deal with probability distributions. Whenever you compute a P-value you rely on a probability distribution, and there are many types out there. In this exercise I will cover four: Bernoulli, Binomial, Poisson, and Normal distributions. Let me begin with some theory first:

Bernoulli

Think of Bernoulli as a single coin flip, with probability of success p  the coin will land heads. Let X  be the random variable defining the outcome of the coin flip, and it will follow a distribution of the form

X \sim \text{Bern} \left({p,p(1-p)} \right)

with mean p   and variance p(1-p) .  Because there are only two possible outcomes (i.e. the coin lands either heads or tails) this distribution will necessarily be characterised by a probability mass function (PMF), as any other probability distribution dealing with discrete outcomes such as the Binomial and Poisson we will discuss later on.

Binomial

Think of Binomial as multiple independent Bernoulli trials, each with probabily of success p  the coin will land heads. Let X  be the random variable defining the number of successes in n trials, and it will follow a distribution of the form

X \sim \mathcal{B} \left({n,p} \right)

with mean np  (same as MLE) and variance np(1-p) . Notice that the variance of the Bernoulli and Binomial distributions is maximum when p = 0.5  (makes sense right? You are less certain about a binary outcome with 0.5 compared to any other value). There are n  possible outcomes, and the probability of each of of them is defined as

P(X = x) = {n \choose x}p^x(1-p)^{n-x}

so the probability of landing heads only once in 20 fair coin flips, for example, would be

P(X = 1|p = 0.5) = {20 \choose 1}0.5^1(1-0.5)^{19} \approx 1.91 \times 10^{-5} 

turning out to be very, very unlikely.

Poisson

Think of Poisson as a the number of goals in a football match. Let X  be the random variable defining the number of occurrences in a given context with a rate of \lambda , and it will follow a distribution of the form

X \sim \text{Pois} \left( \lambda \right)

with \lambda being both the mean and the variance. Note that unlike the Bernoulli and Binomial, we use the Poisson distribution to model data in the context of unlimited occurrences.

Normal

Think of the Normal a.k.a. Gaussian distribution as the daily revenue from a local store. Let X be the unbiased random variable defining a quantity in a population, and it will follow a distribution of the form

X \sim \mathcal{N} \left({\mu,\sigma ^2} \right)

with mean \mu and variance \sigma ^2 . Because there are infinite outcomes this distribution will necessarily be characterised by a probability density function (PDF). The Normal distribution has interesting properties that leverage many statistical applications:

  1. The Central Limit Theorem (CLT) posits that no matter the shape of a particular distribution, the distribution of the sample mean (\overline{X} ) will follow a Normal distribution with \mu = \mu_{\overline{X}} and \sigma^2 = SE^2 = \frac{S^2}{n} . This is key for distinguishing populations.
  2. The 68-95-99.7 rule posits that 68, 95 and 99.7% of the observations are contained in the intervals \mu \pm 1 \sigma \mu \pm 2 \sigma and \mu \pm 3 \sigma , respectively.
  3. The standard Normal X \sim \mathcal{N} \left({0, 1} \right) is also a cornerstone in statistics. It simplifies many calculations since \sigma^2 = \sigma = 1 . Variable standardization (Z-normalization) is nothing else but transforming a normally-distributed sample into a standard Normal using \frac{x - \overline{x}}{s} , which explicitly mean-centers the data and scales for unit-variance (UV). This is often one of the first steps in predictive and inference modeling.

Let’s get started with R

We will now explore these distributions in R. Functions dealing with probability distributions in R have a single-letter prefix that defines the type of function we want to use. These prefixes are d, p, q and r. They refer to density/mass, cumulative, quantile and sampling functions, respectively. We will combine these prefixes with the names of the distributions we are interested in, which are binom (Bernoulli and Binomial), pois (Poisson) and norm (Normal). Note that a Binomial distribution with n = 1 is equivalent to a Bernoulli distribution, for the same value of p .

For starters, let us go back to the 20 coin flips example using p = 0.7 and plot the mass function of X \sim \mathcal{B} \left({20,0.7} \right) . We first generate a vector with the sequence of numbers 1,2,…20 and iterate the function over these values.

n <- 1:20
den <- dbinom(n, 20, 0.7)
plot(den, ylab = "Density", xlab = "Number of successes")
sum(den) # = 1

rplot

The probability is maximum for pn = 14 . Note that the area under mass/density functions must be one.

Let us now turn to a different problem: the daily revenue of a local store follows a distribution X \sim \mathcal{N} \left({1000, 200} \right) with \mu = \text{1000EUR}  and \sigma^2 = 200 . What is the probability that the revenue of today will be at least 1200EUR?

pnorm(1200,1000,200) # this gives us prob x smaller than 1200eur
1-pnorm(1200,1000,200) # this is the one, x greater than 1200eur

This probability is \approx 0.16 .

How about reversing the question? This is where quantiles become handy.

qnorm(1-0.16,1000,200) # = 1198.892

So not surprisingly, the 84th quantile is \approx 1200EUR.

Finally, a demonstration of the CLT. Let us start with a Poisson distribution X \sim \text{Pois} \left({3} \right) with a density that will look like this,

n <- 1:20
den <- dpois(n, 3)
plot(den, xlab = "Outcome", ylab = "Density")

rplot03

and we will draw 100 samples of 10 observations each. In each sample we will take the average value only. I will use set.seed to ensure you will draw the same random samples I did.

myMeans <- vector()
for(i in 1:100){
   set.seed(i)
   myMeans <- c(myMeans, mean(rpois(10,3)))
}
hist(myMeans, main = NULL, xlab = expression(bar(x)))

 

Rplot02.png

Looks not as normal as expected? That is because of the sample size. If you re-run the code above and replace 10 with 1000, you will obtain the following histogram.

Rplot06.png

Not only does the increased sample size improve the bell-shape, it also reduces the variance. Using the CLT we could successfully approximate \lambda = 3 . As you might realize, the CLT is more easily observed when variables are continuous.

Concluding,

  • Model your data with the appropriate distribution according to the underlying assumptions. There is a tendency for disregarding simple distributions, when in fact they can help the most.
  • While the definition of the Binomial and Poisson distributions is relatively straightforward, it is not so easy to ascertain ‘normality’ in a distribution of a continuous variable. A Box-Cox transformation might be helpful in resolving distributions skewed to different extents (and sign) into a Normal one.
  • There are many other distributions out there – Exponential, Gamma, Beta, etc. The rule of the prefixes aforementioned for R still applies (e.g. pgamma, rbeta).

Enjoy!

Advertisements

One thought on “Probability distributions in R

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s