Chapter 6.1.3 Exercises
In [7]:
# If you want to save some time and get the APA-style conclusion quickly, just use the apa function. Here’s how it works:
# Consider the following two-sample t-test on the pirates dataset that compares whether or not there is a significant age difference between pirates who wear headbands and those who do not:
(test.result <- t.test(age ~ headband,
                       data = pirates))
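The APA-style summary itself comes from passing the stored htest object to apa(); this call appears to be what the following commentary refers to (a minimal sketch, assuming the yarrr package is installed):

```r
library(yarrr)
# Print an APA-style conclusion from the stored t-test object
yarrr::apa(test.result)
```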
In [10]:
# As you can see, the apa function got the values we wanted and reported them in proper APA style. The apa function will even automatically adapt the output for Chi-Square and correlation tests if you enter such a test object. Let’s see how it works on a correlation test where we correlate a pirate’s age with the number of parrots she has owned:
# Print an APA style conclusion of the correlation between a pirate's age and # of parrots
(age.parrots.ctest <- cor.test(formula = ~ age + parrots,
                               data = pirates))
# Print the apa style conclusion!
yarrr::apa(age.parrots.ctest)
# The apa function has a few optional arguments that control things like the number of significant digits in the output, and the number of tails in the test. Run ?apa to see all the options.
In [13]:
# You can read normtemp from a file as follows, assuming the file is stored in the same directory as is returned when you type getwd(), or get it from the UsingR package.
library(UsingR)
#normtemp <- read.csv("normtemp.csv")
str(normtemp)
In [23]:
# The R command that finds a confidence interval for the mean in this way is
library(UsingR)
t.test(normtemp$hr, conf.level = .98)
In [24]:
library(UsingR)
t.test(normtemp$temperature, mu = 98.6)
In [33]:
t.test(daily.intake, mu = 7725, conf.level = .99)
# We see that the 99% confidence interval is (5662,7845), and the p-value is .01814. So, there is about a 1.8% chance of getting data that is this compelling or more against H0 even if H0 is true.
In [30]:
anorexia <- MASS::anorexia
weightdifference <- anorexia$Postwt - anorexia$Prewt
t.test(weightdifference, mu = 0, alternative = "greater", conf.level = .98)
# There is pretty strong evidence that there is a positive mean difference in weight of patients.
In [3]:
library(yarrr)
# Let’s conduct a correlation test on the pirates dataset to see if there is a relationship between a pirate’s age and number of parrots they’ve had in their lifetime. Because the variables (age and parrots) are in a dataframe, we can do this in formula notation:
# Is there a correlation between a pirate's age and the number of parrots (s)he's owned?
# Method 1: Formula notation
age.parrots.ctest <- cor.test(formula = ~ age + parrots,
                              data = pirates)
# Print result
age.parrots.ctest
In [12]:
# Just like the t.test() function, we can use the subset argument in the cor.test() function to conduct a test on a subset of the entire dataframe. For example, to run the same correlation test between a pirate’s age and the number of parrots she’s owned, but only for female pirates, I can add the subset = sex == "female" argument:
# Is there a correlation between age and number of parrots ONLY for female pirates?
cor.test(formula = ~ age + parrots,
         data = pirates,
         subset = sex == "female")
# The results look nearly identical. In other words, the strength of the relationship between a pirate’s age and the number of parrots owned is about the same for female pirates as for pirates in general.
In [1]:
# If the results of the preceding section didn’t convince you that you need to understand your underlying population before applying a t-test, then this section will. A typical “heavy tail” distribution is the t random variable. Indeed, the t rv with 1 or 2 degrees of freedom doesn’t even have a finite standard deviation!
# Estimate the effective type I error rate in a t-test of H0:μ=0 versus Ha:μ≠0 at the α=0.1 level when the underlying population is t with 3 degrees of freedom when we take a sample of size n=30.
mean(replicate(20000, t.test(rt(30,3), mu = 0)$p.value < .1))
# Not too bad. The t-test's effective type I error rate comes out somewhat below the nominal level the test was designed for.
In [3]:
# Permutation test for two binary variables
library(dplyr)
library(infer)
mtcars %>%
  dplyr::mutate(am = factor(am), vs = factor(vs)) %>%
  specify(am ~ vs, success = "1") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 100, type = "permute") %>%
  calculate(stat = "diff in props", order = c("1", "0"))
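The pipeline above stops at the permuted statistics; to finish the test, compare them against the observed difference in proportions. A sketch using infer's get_p_value() (reps raised to 1000 here for a steadier estimate):

```r
library(dplyr)
library(infer)

# Observed difference in proportions (no permutation step)
obs_stat <- mtcars %>%
  dplyr::mutate(am = factor(am), vs = factor(vs)) %>%
  specify(am ~ vs, success = "1") %>%
  calculate(stat = "diff in props", order = c("1", "0"))

# Null distribution, then a two-sided p-value
mtcars %>%
  dplyr::mutate(am = factor(am), vs = factor(vs)) %>%
  specify(am ~ vs, success = "1") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in props", order = c("1", "0")) %>%
  get_p_value(obs_stat = obs_stat, direction = "both")
```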
In [5]:
# Permutation test similar to ANOVA
library(dplyr)
library(infer)
mtcars %>%
  dplyr::mutate(cyl = factor(cyl)) %>%
  specify(mpg ~ cyl) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 100, type = "permute") %>%
  calculate(stat = "F")
In [14]:
# Permutation test for two binary variables
mtcars %>%
  dplyr::mutate(am = factor(am), vs = factor(vs)) %>%
  specify(am ~ vs, success = "1") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 100, type = "permute")
In [5]:
library(yarrr)
# R stores hypothesis tests in special object classes called htest. htest objects contain all the major results from a hypothesis test, from the test statistic (e.g.; a t-statistic for a t-test, or a correlation coefficient for a correlation test), to the p-value, to a confidence interval.
# To show you how this works, let’s create an htest object called height.htest containing the results from a two-sample t-test comparing the heights of male and female pirates:
# T-test comparing male and female heights stored in a new htest object called height.htest
height.htest <- t.test(formula = height ~ sex,
                       data = pirates,
                       subset = sex %in% c("male", "female"))
In [6]:
# Once you’ve created an htest object, you can view a print-out of the main results by just evaluating the object name:
# Print main results from height.htest
height.htest
In [9]:
# Now, if we want to access the test statistic or p-value directly, we can just use $:
# Get the test statistic
height.htest$statistic
# Get the p-value
height.htest$p.value
# Get a confidence interval for the mean
height.htest$conf.int
str(height.htest$conf.int)
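Because an htest object is just a named list, those pieces can be glued into a report string. A minimal sketch with simulated data (the format string is my own, not an APA standard):

```r
set.seed(1)
# Toy two-sample t-test; any htest object works the same way
htest <- t.test(rnorm(50, mean = 1), rnorm(50))
# parameter holds the degrees of freedom, statistic the t value
with(htest, sprintf("t(%.1f) = %.2f, p = %.3f", parameter, statistic, p.value))
```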
In [2]:
str(t.test(c(1,2,3)))
In [3]:
# It’s a list of 9 things, and we can pull out the p-value by using
t.test(c(1,2,3))$p.value
In [4]:
# Symmetric, light tailed:
# Estimate the effective type I error rate in a t-test of H0:μ=0 versus Ha:μ≠0 at the α=0.1 level when the underlying population is uniform on the interval (−1,1) when we take a sample size of n=10. Note that H0 is definitely true in this case. So, we are going to estimate the probability of rejecting H0 under the scenario above, and compare it to 10%, which is what the test is designed to give. Let’s build up our replicate function step by step.
set.seed(1)
runif(10,-1,1) #Sample of size 10 from uniform rv
In [5]:
t.test(runif(10,-1,1))$p.value #p-value of test
In [7]:
mean(replicate(10000, t.test(runif(10,-1,1))$p.value < .1))
# Our result is 10.07%. That is really close to the designed value of 10%.
In [8]:
# Estimate the effective type I error rate in a t-test of H0:μ=1 versus Ha:μ≠1 at the α=0.05 level when the underlying population is exponential with rate 1 when we take a sample of size n=20.
set.seed(2)
mean(replicate(20000, t.test(rexp(20,1), mu = 1)$p.value < .05))
In [42]:
# Pretty similar each time. That’s reason enough for me to believe the true type I error rate is relatively close to what we computed. We summarize the simulated effective Type I error rates for H0:μ=1 versus Ha:μ≠1 at α=.1,.05,.01 and .001 levels for sample sizes N=10,20,50,100 and 200.
effectiveError <- sapply(c(.1, .05, .01, .001), function(y)
  sapply(c(10, 20, 50, 100, 200), function(x)
    mean(replicate(20000, t.test(rexp(x, 1), mu = 1)$p.value < y))))
colnames(effectiveError) <- c(".1", ".05", ".01", ".001")
rownames(effectiveError) <- c("10", "20", "50", "100", "200")
effectiveError
In [35]:
# An example selecting a sample of flights traveling to Boston and to San Francisco from New York City in the flights data frame in the nycflights13 package. (We will remove flights with missing data first using na.omit and then sample 100 flights going to each of the two airports.)
library(dplyr)
library(nycflights13)
bos_sfo <- flights %>%
  na.omit() %>%
  filter(dest %in% c("BOS", "SFO")) %>%
  group_by(dest) %>%
  sample_n(100)
bos_sfo
In [36]:
# Suppose we were interested in seeing if the air_time to SFO in San Francisco was statistically greater than the air_time to BOS in Boston. As suggested, let’s begin with some exploratory data analysis to get a sense for how the two variables of air_time and dest relate for these two destination airports:
bos_sfo_summary <- bos_sfo %>%
  group_by(dest) %>%
  summarize(mean_time = mean(air_time),
            sd_time = sd(air_time))
bos_sfo_summary
In [41]:
# Looking at these results, we can clearly see that SFO air_time is much larger than BOS air_time. The standard deviation is also extremely informative here. To further understand just how different the air_time variable is for BOS and SFO, let’s look at a boxplot:
library(ggplot2)
ggplot(data = bos_sfo, mapping = aes(x = dest, y = air_time)) +
  geom_boxplot()
# Since there is no overlap at all, we can conclude that the air_time for San Francisco flights is statistically greater (at any level of significance) than the air_time for Boston flights.
In [38]:
# The movies dataset in the ggplot2movies package contains information on a large number of movies that have been rated by users of IMDB.com (Wickham 2015). We are interested in the question here of whether Action movies are rated higher on IMDB than Romance movies. We will first need to do a little bit of data wrangling using the ideas from Chapter 5 to get the data in the form that we would like:
library(dplyr)
library(ggplot2movies)
movies_trimmed <- movies %>%
  select(title, year, rating, Action, Romance)
# Note that Action and Romance are binary variables here. To remove any overlap of movies (and potential confusion) that are both Action and Romance, we will remove them from our population:
movies_trimmed <- movies_trimmed %>%
  filter(!(Action == 1 & Romance == 1))
# We will now create a new variable called genre that specifies whether a movie in our movies_trimmed data frame is an "Action" movie, a "Romance" movie, or "Neither". We aren’t really interested in the "Neither" category here so we will exclude those rows as well. Lastly, the Action and Romance columns are not needed anymore since they are encoded in the genre column.
movies_trimmed <- movies_trimmed %>%
  mutate(genre = case_when(Action == 1 ~ "Action",
                           Romance == 1 ~ "Romance",
                           TRUE ~ "Neither")) %>%
  filter(genre != "Neither") %>%
  select(-Action, -Romance)
# The case_when function is useful for assigning values in a new variable based on the values of another variable. The last step of TRUE ~ "Neither" is used when a particular movie is not set to either Action or Romance.
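As a quick standalone illustration of case_when's top-to-bottom matching (toy vectors, not the movies data):

```r
library(dplyr)
x <- c(1, 0, 0)
y <- c(0, 1, 0)
# The first condition that is TRUE wins; TRUE ~ ... is the catch-all
case_when(x == 1 ~ "Action",
          y == 1 ~ "Romance",
          TRUE   ~ "Neither")
# "Action" "Romance" "Neither"
```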
# We are left with 8878 movies in our population dataset that focuses on only "Action" and "Romance" movies. Let’s now visualize the distributions of rating across both levels of genre. Think about what type(s) of plot is/are appropriate here before you proceed:
ggplot(data = movies_trimmed, aes(x = genre, y = rating)) +
  geom_boxplot()
In [39]:
# We can see that the middle 50% of ratings for "Action" movies is more spread out than that of "Romance" movies in the population, though "Romance" has outliers at both the top and bottom of the scale. We are initially interested in comparing the mean rating across these two groups, so a faceted histogram may also be useful:
ggplot(data = movies_trimmed, mapping = aes(x = rating)) +
  geom_histogram(binwidth = 1, color = "white") +
  facet_grid(genre ~ .)
In [17]:
# Let’s select a random sample of 34 action movies and a random sample of 34 romance movies. (The number 34 was chosen somewhat arbitrarily here.)
set.seed(2017)
movies_genre_sample <- movies_trimmed %>%
  group_by(genre) %>%
  sample_n(34) %>%
  ungroup()
# Note the addition of the ungroup() function here. This will be useful shortly in allowing us to permute the values of rating across genre. Our analysis does not work without this ungroup() function since the data stays grouped by the levels of genre without it.
In [18]:
# We can now observe the distributions of our two sample ratings for both groups. Remember that these plots should be rough approximations of our population distributions of movie ratings for "Action" and "Romance" in our population of all movies in the movies data frame.
ggplot(data = movies_genre_sample, aes(x = genre, y = rating)) +
  geom_boxplot()
In [19]:
ggplot(data = movies_genre_sample, mapping = aes(x = rating)) +
  geom_histogram(binwidth = 1, color = "white") +
  facet_grid(genre ~ .)
In [20]:
# it’s often useful to calculate the mean and standard deviation as well, conditioned on the two levels.
summary_ratings <- movies_genre_sample %>%
  group_by(genre) %>%
  summarize(mean = mean(rating),
            std_dev = sd(rating),
            n = n())
summary_ratings
In [21]:
obs_diff <- movies_genre_sample %>%
  specify(formula = rating ~ genre) %>%
  calculate(stat = "diff in means", order = c("Romance", "Action"))
obs_diff
# Our goal next is to figure out a random process with which to simulate the null hypothesis being true. Recall that H0:μr−μa=0 corresponds to us assuming that the population means are the same. We would like to assume this is true and perform a random process to generate() data in the model of the null hypothesis.
In [22]:
shuffled_ratings_old <- movies_genre_sample %>%  # (movies_trimmed would also work)
  mutate(genre = mosaic::shuffle(genre)) %>%
  group_by(genre) %>%
  summarize(mean = mean(rating))
diff(shuffled_ratings_old$mean)
In [43]:
generated_samples <- movies_genre_sample %>%
  specify(formula = rating ~ genre) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 5000)
# A null distribution of simulated differences in sample means is created with the specification of stat = "diff in means" for the calculate() step.
#The null distribution is similar to the bootstrap distribution we saw in Chapter 9, but remember that it consists of statistics generated assuming the null hypothesis is true.
null_distribution_two_means <- generated_samples %>%
  calculate(stat = "diff in means", order = c("Romance", "Action"))
# We can now plot the distribution of these simulated differences in means:
null_distribution_two_means %>% visualize()
In [44]:
# The p-value: Remember that we are interested in seeing where our observed sample mean difference of 0.95 falls on this null/randomization distribution. We are interested in simply a difference here so “more extreme” corresponds to values in both tails on the distribution. Let’s shade our null distribution to show a visual representation of our p-value:
null_distribution_two_means %>%
  visualize(obs_stat = obs_diff, direction = "both")
# Remember that the observed difference in means was 0.95. We have shaded red all values at or above that value, and also those at or below its negative (since this is a two-tailed test). Passing obs_stat = obs_diff also draws a darker vertical line at 0.95.
# Shaded histogram to show p-value:
In [46]:
# To better estimate how large the p-value will be, we also increase the number of bins to 100 here from 20:
null_distribution_two_means %>%
  visualize(bins = 100, obs_stat = obs_diff, direction = "both")
# Histogram with vertical lines corresponding to observed statistic.
In [51]:
# At this point, it is important to take a guess as to what the p-value may be. We can see that there are only a few permuted differences as extreme or more extreme than our observed effect (in both directions). Maybe we guess that this p-value is somewhere around 2%, or maybe 3%, but certainly not 30% or more. Lastly, we calculate the p-value directly using infer:
pvalue <- null_distribution_two_means %>%
  get_pvalue(obs_stat = obs_diff, direction = "both")
pvalue
# We have around 0.46% of values as extreme or more extreme than our observed statistic in both directions. Assuming we are using a 5% significance level for α, we have evidence supporting the conclusion that the mean rating for romance movies is different from that of action movies. The next important idea is to understand just how much higher a mean rating we can expect romance movies to have compared to action movies.
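Note that a permutation p-value is nothing more than the proportion of permuted statistics at least as extreme as the observed one, which can be computed by hand (toy values, not the movies data):

```r
# Two-sided p-value as a proportion, computed directly
null_stats <- c(-0.8, -0.3, 0.1, 0.2, 0.9)
obs <- 0.85
mean(abs(null_stats) >= abs(obs))  # 0.2: one of five permuted stats is as extreme
```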
In [49]:
# To get the corresponding bootstrap distribution with which we can compute a confidence interval, we can just remove or comment out the hypothesize() step since we are no longer assuming the null hypothesis is true when we bootstrap:
percentile_ci_two_means <- movies_genre_sample %>%
  specify(formula = rating ~ genre) %>%
  # hypothesize(null = "independence") %>%
  generate(reps = 5000) %>%
  calculate(stat = "diff in means", order = c("Romance", "Action")) %>%
  get_ci()
percentile_ci_two_means
# Thus, we can expect the true mean of Romance movies on IMDB to have a rating 0.333 to 1.593 points higher than that of Action movies. Remember that this is based on bootstrapping using movies_genre_sample as our original sample and the confidence interval process being 95% reliable.
In [29]:
# We have already built an approximation for what we think the distribution of δ=¯x1−¯x2 looks like using randomization above. Recall this distribution:
ggplot(data = null_distribution_two_means, aes(x = stat)) +
  geom_histogram(color = "white", bins = 20)
In [ ]:
# The infer package also includes some built-in theory-based statistics as well, so instead of going through the process of trying to transform the difference into a standardized form, we can just provide a different value for stat in calculate(). Recall the generated_samples data frame created via:
generated_samples <- movies_genre_sample %>%
  specify(formula = rating ~ genre) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 5000)
# We can now create a null distribution of t statistics:
null_distribution_t <- generated_samples %>%
  calculate(stat = "t", order = c("Romance", "Action"))
null_distribution_t %>% visualize()
In [31]:
# We see that the shape of this stat = "t" distribution is the same as that of stat = "diff in means". The scale has changed though with the t values having less spread than the difference in means.
# A traditional t-test doesn’t look at this simulated distribution, but instead it looks at the t-curve with degrees of freedom equal to 62.029. We can overlay this distribution over the top of our permuted t statistics using the method = "both" setting in visualize().
null_distribution_t %>%
  visualize(method = "both")
In [47]:
# We can see that the curve does a good job of approximating the randomization distribution here. (More on when to expect this to be the case when we discuss conditions for the t-test in a bit.) To calculate the p-value in this case, we need to figure out how much of the total area under the t-curve lies at or above our observed T-statistic, plus the area at or below the negative of the observed T-statistic. (Remember this is a two-tailed test, so we are looking for a difference in either direction.)
# Just as we converted all of the simulated values to T-statistics, we must also do so for our observed effect δ∗ :
obs_t <- movies_genre_sample %>%
  specify(formula = rating ~ genre) %>%
  calculate(stat = "t", order = c("Romance", "Action"))
# So graphically we are interested in finding the percentage of values that are at or above 2.945 or at or below -2.945.
null_distribution_t %>%
  visualize(method = "both", obs_stat = obs_t, direction = "both")
# As we might have expected with this just being a standardization of the difference in means statistic that produced a small p-value, we also have a very small one here.
In [11]:
# 100 Bernoulli trials -- no need for replace=TRUE
library(dplyr)
library(infer)
library(mosaic)
resample(0:1, 100)
tally(resample(0:1, 100))
if (require(mosaicData)) {
  Small <- sample(KidsFeet, 10)
  resample(Small)
  tally(~ sex, data = resample(Small))
  tally(~ sex, data = resample(Small))
  # shuffled can be used to reshuffle some variables within groups
  # orig.id shows where the values were in the original data frame
  Small <- mutate(Small,
                  id1 = paste(sex, 1:10, sep = ":"),
                  id2 = paste(sex, 1:10, sep = ":"))
  resample(Small, groups = sex, shuffled = c("id1", "id2"))
}
deal(Cards, 13) # A Bridge hand
shuffle(Cards)
In [12]:
# fixed marginals for sex
tally(~ sex, data=Small)
tally(~ sex, data=resample(Small, groups=sex))
In [3]:
library(yarrr)
# Here, I’ll conduct a t-test to see if the average number of tattoos owned by pirates is different from 5:
tattoo.ttest <- t.test(x = pirates$tattoos, # Vector of data
                       mu = 5)              # Null: Mean is 5
# Print the result
tattoo.ttest
# As you can see, the function printed lots of information: the sample mean was 9.43, the test statistic (41.59), and the p-value was less than 2.2e-16 (virtually 0). Because this p-value is less than 0.05, we would reject the null hypothesis that the true mean is equal to 5.
In [4]:
# Now, what happens if I change the null hypothesis to a mean of 9.5? Because the sample mean was 9.43, quite close to 9.5, the test statistic should decrease, and the p-value should increase:
tattoo.ttest <- t.test(x = pirates$tattoos,
                       mu = 9.5) # Null: Mean is 9.5
tattoo.ttest
# Just as we predicted! The test statistic decreased to just -0.67, and the p-value increased to 0.51. In other words, our sample mean of 9.43 is reasonably consistent with the hypothesis that the true population mean is 9.50.
In [6]:
# For example, let’s test a prediction that pirates who wear eye patches have fewer tattoos on average than those who don’t wear eye patches. Because the data are in the pirates dataframe, we can do this using the formula method:
# 2-sample t-test
# IV = eyepatch (0 or 1)
# DV = tattoos
(tat.patch.htest <- t.test(formula = tattoos ~ eyepatch,
                           data = pirates))
# This test gave us a test statistic of 1.22 and a p-value of 0.22. Because the p-value is greater than 0.05, we would fail to reject the null hypothesis.
# Show me all of the elements in the htest object
names(tat.patch.htest)
In [9]:
# To fix this, I need to tell the t.test() function which two values of age I want to test. To do this, use the subset argument and indicate which values of the IV you want to test using the %in% operator. For example, to compare the number of tattoos between pirates of age 29 and 30, I would add the subset = age %in% c(29, 30) argument like this:
# Compare the tattoos of pirates aged 29 and 30:
tatage.htest <- t.test(formula = tattoos ~ age,
                       data = pirates,
                       subset = age %in% c(29, 30)) # Compare age of 29 to 30
tatage.htest
# Looks like we got a p-value of 0.79 which is pretty high and suggests that we should fail to reject the null hypothesis.
In [10]:
# You can select any subset of data in the subset argument to the t.test() function – not just the primary independent variable. For example, if I wanted to compare the number of tattoos between pirates who went to college or not, but only for female pirates, I would do the following:
# Is there an effect of college on # of tattoos only for female pirates?
t.test(formula = tattoos ~ college,
       data = pirates,
       subset = sex == "female")
In [1]:
# Consider the igf1 levels of girls in tanner puberty levels I and V. Test whether the mean igf1 level is the same in both populations.
# Let μ1 be the mean igf1 level of all girls in tanner puberty level I, and let μ2 be the mean igf1 level of all girls in tanner puberty level V. We wish to test H0:μ1=μ2 versus Ha:μ1≠μ2.
# Let’s think about the data that we have for a minute. Do we expect the igf1 level of girls in the two puberty levels to be approximately normal? Do we expect symmetry? How many samples do we have?
library(ISwR)
juul <- ISwR::juul
# We need to clean up the juul data:
juul$sex <- factor(juul$sex, levels = c(1,2), labels = c("Boy", "Girl"))
juul$menarche <- factor(juul$menarche, levels = c(1,2), labels = c("No", "Yes"))
juul$tanner <- factor(juul$tanner, levels = c(1,2,3,4,5), labels = c("I", "II", "III", "IV", "V"))
# Now let’s see what our sample sizes are:
table(juul$sex, juul$tanner)
In [2]:
# And we can see that we have 224 girls in tanner I and 204 in tanner V. I doubt that the data would be so skewed that we would need to worry about using a t-test in this case, and I also don’t think there will be outliers. Let’s create a data frame consisting only of the girls in tanner I or tanner V, and boxplot their igf1 levels.
library(ggplot2)
smallJuul <- dplyr::filter(juul, tanner %in% c("I","V"), sex == "Girl")
ggplot(smallJuul, aes(x = tanner, y = igf1)) + geom_boxplot()
# Looks pretty symmetric, and the outliers aren’t too far away from the whiskers. It looks like we were correct to assume that a t-test will be applicable to this data.
# Moreover, we had no reason before collecting the data to think that the variances were equal, so we will not make that assumption in t.test.
# Finally, before collecting the data, we had no reason to believe that the igf1 levels of the two populations would differ in a particular way, so we do a 2-sided test.
In [3]:
# Now, we run a t-test:
t.test(igf1~tanner, data = smallJuul, var.equal = FALSE)
# OK, that’s crazy. We definitely reject H0. Looking back at the boxplots, notice that the 25th percentile of the tanner V girls is almost above the whisker of the tanner I girls. That’s a pretty huge difference, and we almost didn’t even need to run a t-test.
In [4]:
anorexia <- MASS::anorexia
weightdifference <- anorexia$Postwt - anorexia$Prewt
t.test(weightdifference, mu = 0, alternative = "greater", conf.level = .98)
In [12]:
# Looks good. Let’s run the test.
t.test(anorexia$Prewt,anorexia$Postwt, paired = TRUE, alternative = "less")
In [13]:
# We see that we get a p-value of 0.01751. That is evidence that the treatment has been effective. Note that if we didn’t pair the data, we would get
t.test(anorexia$Prewt,anorexia$Postwt,alternative = "less")
In [14]:
# which has a p-value of .05024. That is, the p-value would be nearly three times as large.
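The gain from pairing can be seen directly in the standard errors: pre- and post-treatment weights are positively correlated within patients, so the standard error of the within-patient differences is smaller than the unpaired standard error (a sketch; MASS ships with R):

```r
anorexia <- MASS::anorexia
d <- anorexia$Postwt - anorexia$Prewt
n <- nrow(anorexia)
sd(d) / sqrt(n)                                           # paired standard error
sqrt(var(anorexia$Prewt) / n + var(anorexia$Postwt) / n)  # unpaired standard error
```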
# Finally, let’s look at the data to make sure no outliers snuck in.
ggplot(anorexia, aes(x = "", y = Prewt)) + geom_boxplot()
In [16]:
# It is interesting that the pre-treatment weights are nice and symmetric, while the post-treatment weights are skewed. This leads me to wonder whether the treatment was effective for only some of the patients, and maybe there is some subset of patients for which this treatment tends to work. Without further information, we won’t be able to tell.
ggplot(anorexia, aes(x = Postwt - Prewt)) + geom_histogram(binwidth = 5)
# Thus it seems that the treatment was totally ineffective for some girls, but almost unrealistically effective for others. I would really like to understand which patients are more likely to be helped by this treatment. But we don’t have any data that would help us understand that.