Chapter 1.1.1 Simulations
One of the advantages of using R is that we can often estimate probabilities using simulations. That’s not to downplay the importance of being able to make computations regarding probabilities, but that is a topic for another course.
For the purposes of simulation, one of the most important ways of creating a vector is with the sample() command. sample() is one of the first functions that we have come across, so take a second and type ?sample on the command line to see the help box.
That’s a lot of information! Some important things to digest are:
sampletakes up to 4 arguments, though only the first argumentxis required.- the parameter
xis the vector of elements from which you are sampling. sizeis the number of samples you wish to take.replacedetermines whether you are sampling with replacement or not. Sampling without replacement means thatsamplewill not pick the same value twice, and this is the default behavior. Passreplace = TRUEto sample if you wish to sample with replacement.-
probis a vector of probabilities or weights associated withx. It should be a vector of nonnegative numbers of the same length asx. If the sum ofprobis not 1, it will be normalized.
Note: You don’t have to type replace = FALSE, because FALSE is the default value for whether sampling is done with replacement. You also don’t have to type x = ... or size = ... as long as these are the first and second arguments. However, it is sometimes clearer to explicitly name the arguments to complicated functions like sample. Use your best judgment, and include the parameter name if there is any doubt.
Using replicate to simulate experiments
The function replicate is an example of an implicit loop in R. The call
replicate(n, expr)
repeats the expression stored in expr n times, and stores the result as a vector.
Example
Estimate the probability that the sum of two dice is 8.
The plan is to use sample to simulate rolling two dice. We will say that success is a sum of 8. Then, we use the replicate function to repeat the experiment many times, and take the mean number of successes to estimate the probability. Since the final R command is complicated, we will build it up piece by piece.
You can use curly braces { and } to replicate more than one R command, separating each command with semicolons.
Generally, the more samples you take using replicate(), the more accurate you can expect your simulation to be.
It is often a good idea to repeat a simulation a couple of times to get an idea about how much variance there is in the results.
Example
A fair coin is repeatedly tossed. Estimate the probability that you observe Heads for the third time on the 10th toss.
NOTE: I strongly recommend that you follow the workflow as presented above; namely,
- Write code that performs the experiment a single time.
- Add semi-colons to the end of each line and place inside
mean(replicate(1000, { HERE }))
It is much easier to trouble-shoot your code this way, as you can test each line of your simulation separately.
Example (The Birthday Problem)
Estimate the probability that out of 25 randomly selected people, no two will have the same birthday. Assume that all birthdays are equally likely, except that none are leapday babies.
In order to do this, we need to be able to determine whether all of the elements in a vector are unique. R has many, many functions that can be used with vectors. For most things that you want to do, there will be an R function that does it. In this case it is anyDuplicated(), which returns the location of the first duplicate if there are any, and zero otherwise.
The important thing to learn here isn’t necessarily this particular function, but rather the fact that most tasks are possible via some built in functionality.
Here, we have built up the simulation from the ground floor, first simulating the 25 birthdays, then determining if any two birthdays are the same, and then replicating the experiment. Note the use of mean in the last line to compute the proportion of successes. “Success” here is the event “no two people have the same birthday”, and the probability of this event is approximately 0.43. Interestingly, we see that it is actually quite likely (about 57%) that a group of 25 people will contain two with the same birthday.
Challenge
Modify the above code to take into account leap years. Is it reasonable to believe that birthdays throughout the year are equally likely? Later in this book, we will use birthday data to get a better approximation of the probability that out of 25 randomly selected people, no two will have the same birthday.
Example Three numbers are picked uniformly at random from the interval (0,1). What is the probability that a triangle can formed whose side-lengths are the three numbers that you chose?
Solution: We need to check whether the sum of the two smaller numbers is larger than the largest number. We use the sort command to sort the three numbers into increasing order.
Simulating Conditional Probability
Simulating conditional probabilities can be a bit more challenging. In order to estimate P(A|B), we will estimate P(A∩B) and P(B), and then divide the two answers. This is not the most efficient or best way to estimate P(A|B), but it is easy to do with the tools that we already have developed.
Example
Two dice are rolled. Estimate the conditional probability that the sum of the dice is at least 10, given that at least one of the dice is a 6.
First, we estimate the probability that the sum of the dice is at least 10 and at least one of the dice is a 6.