Chapter 1.1.1 Simulations

One of the advantages of using R is that we can often estimate probabilities using simulations. That’s not to downplay the importance of being able to make computations regarding probabilities, but that is a topic for another course.

For the purposes of simulation, one of the most important ways of creating a vector is with the sample() command. sample() is one of the first functions that we have come across, so take a second and type ?sample on the command line to see the help box.

That’s a lot of information! Some important things to digest are:

  1. sample takes up to 4 arguments, though only the first argument x is required.
  2. the parameter x is the vector of elements from which you are sampling.
  3. size is the number of samples you wish to take.
  4. replace determines whether you are sampling with replacement or not. Sampling without replacement means that sample will not pick the same value twice, and this is the default behavior. Pass replace = TRUE to sample if you wish to sample with replacement.
  5. prob is a vector of probabilities or weights associated with x. It should be a vector of nonnegative numbers of the same length as x. If the sum of prob is not 1, it will be normalized.

Note: You don’t have to type replace = FALSE, because FALSE is the default value for whether sampling is done with replacement. You also don’t have to type x = ... or size = ... as long as these are the first and second arguments. However, it is sometimes clearer to explicitly name the arguments to complicated functions like sample. Use your best judgment, and include the parameter name if there is any doubt.

Using replicate to simulate experiments

The function replicate is an example of an implicit loop in R. The call

replicate(n, expr)

repeats the expression stored in expr n times, and stores the result as a vector.

Example

Estimate the probability that the sum of two dice is 8.

The plan is to use sample to simulate rolling two dice. We will say that success is a sum of 8. Then, we use the replicate function to repeat the experiment many times, and take the mean number of successes to estimate the probability. Since the final R command is complicated, we will build it up piece by piece.

You can use curly braces { and } to replicate more than one R command, separating each command with semicolons.

Generally, the more samples you take using replicate(), the more accurate you can expect your simulation to be.
It is often a good idea to repeat a simulation a couple of times to get an idea about how much variance there is in the results.

Example

A fair coin is repeatedly tossed. Estimate the probability that you observe Heads for the third time on the 10th toss.

NOTE: I strongly recommend that you follow the workflow as presented above; namely,

  1. Write code that performs the experiment a single time.
  2. Add semi-colons to the end of each line and place inside mean(replicate(1000, { HERE }))

It is much easier to trouble-shoot your code this way, as you can test each line of your simulation separately.

Example (The Birthday Problem)

Estimate the probability that out of 25 randomly selected people, no two will have the same birthday. Assume that all birthdays are equally likely, except that none are leapday babies.

In order to do this, we need to be able to determine whether all of the elements in a vector are unique. R has many, many functions that can be used with vectors. For most things that you want to do, there will be an R function that does it. In this case it is anyDuplicated(), which returns the location of the first duplicate if there are any, and zero otherwise.

The important thing to learn here isn’t necessarily this particular function, but rather the fact that most tasks are possible via some built in functionality.

Here, we have built up the simulation from the ground floor, first simulating the 25 birthdays, then determining if any two birthdays are the same, and then replicating the experiment. Note the use of mean in the last line to compute the proportion of successes. “Success” here is the event “no two people have the same birthday”, and the probability of this event is approximately 0.43. Interestingly, we see that it is actually quite likely (about 57%) that a group of 25 people will contain two with the same birthday.

Challenge

Modify the above code to take into account leap years. Is it reasonable to believe that birthdays throughout the year are equally likely? Later in this book, we will use birthday data to get a better approximation of the probability that out of 25 randomly selected people, no two will have the same birthday.

Example Three numbers are picked uniformly at random from the interval (0,1). What is the probability that a triangle can formed whose side-lengths are the three numbers that you chose?

Solution: We need to check whether the sum of the two smaller numbers is larger than the largest number. We use the sort command to sort the three numbers into increasing order.

Simulating Conditional Probability

Simulating conditional probabilities can be a bit more challenging. In order to estimate P(A|B), we will estimate P(AB) and P(B), and then divide the two answers. This is not the most efficient or best way to estimate P(A|B), but it is easy to do with the tools that we already have developed.

Example

Two dice are rolled. Estimate the conditional probability that the sum of the dice is at least 10, given that at least one of the dice is a 6.

First, we estimate the probability that the sum of the dice is at least 10 and at least one of the dice is a 6.