Chapter 1.2.5 Probability Distributions

Standard univariate probability distributions for discrete and continuous random variables

Probability distributions and random number generation

Quantiles and cumulative distribution values can be calculated easily within R. Random numbers are commonly needed for simulation and analysis, and they can be generated for a large number of distributions.
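As a minimal sketch of these calculations, the block below uses base R's prefix convention (p for cumulative probability, q for quantile) on a standard Normal and a Binomial distribution. The particular values (1.96, 0.975, a Binomial with size 10 and probability 0.5) are illustrative choices, not taken from the text.

```r
# Base R pairs each distribution with prefixed functions:
# d (density), p (cumulative probability), q (quantile), r (random draws).

# Cumulative probability: P(Z <= 1.96) for a standard Normal
p <- pnorm(1.96, mean = 0, sd = 1)

# Quantile: the value below which 97.5% of a standard Normal lies
q <- qnorm(0.975, mean = 0, sd = 1)

# The quantile function inverts the cumulative distribution function
round_trip <- qnorm(pnorm(1.5))

# The same pattern applies to discrete distributions, e.g. Binomial(10, 0.5)
p_binom <- pbinom(3, size = 10, prob = 0.5)    # P(X <= 3)
q_binom <- qbinom(0.5, size = 10, prob = 0.5)  # median

print(c(p = p, q = q, round_trip = round_trip))
print(c(p_binom = p_binom, q_binom = q_binom))
```

Swapping the suffix (norm, binom, unif, and so on) selects a different distribution while the prefixes keep the same meaning.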

A seed can be specified for the random number generator. This is important to allow replication of results (e.g., while testing and debugging). Information about random number seeds can be found in 3.1.3.
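A short sketch of how a seed allows replication, using set.seed() from base R. The seed value 42 and the variable names are arbitrary choices made for illustration.

```r
# Fixing the seed makes a sequence of random draws repeatable.
set.seed(42)        # 42 is an arbitrary illustrative seed
first <- runif(3)

set.seed(42)        # reset the generator to the same seed
second <- runif(3)

third <- runif(3)   # continues the stream, so these draws are new ones

identical(first, second)  # TRUE: same seed, same draws
identical(first, third)   # FALSE: the generator has moved on
```

Setting the seed once at the top of a script is usually enough to make a whole analysis repeatable from run to run.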

Table 3.1 summarizes support for quantiles, cumulative distribution functions, and random numbers. More information on probability distributions can be found in the CRAN probability distributions task view (http://cran.r-project.org/web/views/Distributions.html).

Sample statistics from random samples

Now that you know how to calculate summary statistics, let’s take a closer look at how R draws random samples using the rnorm() and runif() functions. In the next code chunk, I’ll draw a vector of 5 values from a Normal distribution with a mean of 10 and a standard deviation of 5. I’ll then calculate summary statistics from this sample using mean() and sd():

# A sample of 5 values from a Normal dist with mean = 10 and sd = 5
x <- rnorm(n = 5, mean = 10, sd = 5)

# What are the mean and standard deviation of the sample?
mean(x)
## [1] 11
sd(x)
## [1] 2.5

As you can see, the mean and standard deviation of our sample vector (11 and 2.5) are in the neighborhood of the population values of 10 and 5, but they aren’t the same: a sample only estimates the population values, and a sample of 5 values is a noisy estimate. If we take a much larger sample (say, 100,000), the sample statistics should get much closer to the population values:

# A sample of 100,000 values from a Normal dist with mean = 10, sd = 5
y <- rnorm(n = 100000, mean = 10, sd = 5)

mean(y)
## [1] 10
sd(y)
## [1] 5

Yep, sure enough our new sample y (containing 100,000 values) has a sample mean and standard deviation much closer (almost identical) to the population values than our sample x (containing only 5 values). This is an example of what is called the law of large numbers: as the sample size grows, sample averages settle ever closer to the corresponding population values.
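To see that pattern across more than two sample sizes, here is a rough sketch that repeats the draw for several values of n and records how far each sample mean lands from the population mean of 10. The seed, the particular sizes, and the variable names are my own illustrative choices. Any single run can bounce around at small n, so don't expect the error to shrink at every step, only on the whole.

```r
set.seed(1)  # arbitrary seed so the run is repeatable

# Sample sizes to compare (illustrative choices)
sizes <- c(5, 50, 500, 5000, 50000)

# One sample mean per sample size, all from Normal(mean = 10, sd = 5)
sample_means <- sapply(sizes, function(n) mean(rnorm(n, mean = 10, sd = 5)))

# Distance of each sample mean from the population mean
abs_error <- abs(sample_means - 10)

# Theoretical standard error of the mean: sd / sqrt(n)
std_error <- 5 / sqrt(sizes)

print(data.frame(n = sizes, mean = sample_means,
                 abs_error = abs_error, std_error = std_error))
```

The std_error column shows why bigger samples help: the typical distance between a sample mean and the population mean shrinks in proportion to 1 / sqrt(n), so multiplying the sample size by 100 cuts it by a factor of 10.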