Chapter 6.2.3.1 Exercises
In [1]:
# We introduce one-way analysis of variance (ANOVA) through the analysis of a motivating example.
# Consider the red.cell.folate data in ISwR. We have data on folate levels of patients under three different treatments. Suppose we want to know whether treatment has a significant effect on the folate level of the patient. How would we proceed? First, we would want to plot the data. A boxplot is a good idea.
library(ISwR)
library(ggplot2)
red.cell.folate <- ISwR::red.cell.folate; red.cell.folate
ggplot(red.cell.folate, aes(x = ventilation, y = folate)) +
geom_boxplot()
# Looking at this data, it is not clear whether there is a difference. It seems that the folate level under treatment 1 is perhaps higher. We could try to compare the means of each pair of ventilations using t.test(). The problem with that is we would be doing three tests, and we would need to adjust our p-values accordingly. We are going to take a different approach, which is ANOVA.
In [6]:
library(dplyr)
red.cell.folate %>% group_by(ventilation) %>% summarize(mean(folate))
mean(red.cell.folate$folate)
In [24]:
# Returning to red.cell.folate, we use R to do all of the computations for us. First, build a linear model for folate as explained by ventilation, and then apply the anova function to the model. anova produces a table which contains all of the computations from the discussion above.
rcf.mod <- lm(folate ~ ventilation, data = red.cell.folate)
anova(rcf.mod)
# The two rows, ventilation and Residuals, correspond to the between-group and within-group variation.
# The first column, Df, gives the degrees of freedom in each case. Since k=3, the between-group variation has k−1=2 degrees of freedom, and since N=22, the within-group variation (Residuals) has N−k=19 degrees of freedom.
# The Sum Sq column gives SSDB and SSDW.
# The Mean Sq column is the Sum Sq value divided by the degrees of freedom. These two numbers are the numerator and denominator of the test statistic F. So here, F=7757.9/2090.3=3.7113.
# To compute the p-value, we need the area under the tail of the F distribution above F=3.7113. Note that this is a one-tailed test, because small values of F would actually be evidence in favor of H0.
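To make the arithmetic concrete, here is a sketch that computes SSDB, SSDW, and F by hand on a small made-up dataset (the values are illustrative, not from red.cell.folate):

```r
# Toy data: three groups of three observations (illustrative values only)
y <- c(5, 7, 6, 9, 11, 10, 4, 5, 6)
g <- factor(rep(c("a", "b", "c"), each = 3))

grand.mean <- mean(y)
group.means <- tapply(y, g, mean)
group.n <- tapply(y, g, length)

# Between-group and within-group sums of squares
SSDB <- sum(group.n * (group.means - grand.mean)^2)
SSDW <- sum((y - group.means[as.character(g)])^2)

# F is the ratio of the two mean squares
k <- nlevels(g); N <- length(y)
F.stat <- (SSDB / (k - 1)) / (SSDW / (N - k))
F.stat
# anova(lm(y ~ g)) reports the same F value
```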
In [25]:
rcf.mod
library(moderndive)
get_regression_table(rcf.mod)
In [10]:
# The ANOVA table gives the p-value as Pr(>F), here P=0.04359. We could compute this from F using:
pf(3.7113, df1 = 2, df2 = 19, lower.tail = FALSE)
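Equivalently, we can compare F to the critical value of the F(2, 19) distribution at the 0.05 level using qf():

```r
# Critical value of F(2, 19) at alpha = 0.05
(crit <- qf(0.95, df1 = 2, df2 = 19))
# Our observed F = 3.7113 exceeds the critical value, so we reject H0
3.7113 > crit
```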
In [9]:
# According to the p-value, we would reject at the α=.05 level, but some caution is in order. Let’s continue by checking the assumptions in the model. As in regression, plotting the linear model object results in a series of residual plots.
plot(rcf.mod)
# The variances of each group did not look equal in the boxplot, and the Scale-Location plot suggests that the rightmost group does have somewhat higher variation than the others. The x-axis of these plots shows Fitted values, which in ANOVA are the group means. So the rightmost cluster of points corresponds to the group with mean 316, which is group 1, the N2O+O2,24h group.
# The Normal Q-Q plot suggests the data may be thick tailed. I would put a strongly worded caveat in any report claiming significance based on this data, but it is certainly evidence that perhaps a larger study would be useful.
In [10]:
# The built-in data set InsectSprays records the counts of insects in agricultural experimental units treated with different insecticides. Plotting the data reveals an obvious difference in effectiveness among the spray types:
ggplot(InsectSprays, aes(x=spray, y=count))+geom_boxplot()
In [11]:
# Not surprisingly, ANOVA strongly rejects the null hypothesis that all sprays have the same mean, with a p-value less than 2.2×10−16:
IS.mod <- lm(count ~ spray, data = InsectSprays)
anova(IS.mod)
In [14]:
# The residual plots for count data often show variance increasing with the mean; a square-root transformation tends to stabilize the variance of counts, so we refit on sqrt(count) and re-check the diagnostics:
IS.mod <- lm(sqrt(count) ~ spray, data = InsectSprays)
plot(IS.mod)
In [15]:
anova(IS.mod)
In [16]:
# Alternatively, if normality within each group is not in question but the equal-variance assumption is, oneway.test can be used: it performs Welch's ANOVA, which does not require equal variances.
oneway.test(count~spray, data = InsectSprays)
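By default oneway.test() applies Welch's correction for unequal variances; with var.equal = TRUE it reproduces the classical ANOVA F test. A quick check with the built-in InsectSprays data:

```r
# Welch's ANOVA (the default) vs. the classical equal-variance F test
oneway.test(count ~ spray, data = InsectSprays)                    # Welch
oneway.test(count ~ spray, data = InsectSprays, var.equal = TRUE)  # classical
# The var.equal = TRUE statistic matches anova(lm(count ~ spray, ...))
```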
In [18]:
# For example, returning to the red.cell.folate data, we could do a pairwise t test via the following command:
pairwise.t.test(red.cell.folate$folate, red.cell.folate$ventilation)
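pairwise.t.test adjusts its p-values with Holm's method by default. The same correction can be applied to any vector of raw p-values with p.adjust(); the raw p-values below are illustrative, not from red.cell.folate:

```r
# Holm's method (the pairwise.t.test default) vs. the simpler Bonferroni
raw.p <- c(0.01, 0.04, 0.20)   # illustrative raw p-values
p.adjust(raw.p, method = "holm")
p.adjust(raw.p, method = "bonferroni")
# Holm is uniformly at least as powerful as Bonferroni, so its adjusted
# p-values are never larger
```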
In [5]:
# You can get a lot of interesting information from ANOVA objects. To see everything that’s stored in one, run the names() function on it. For example, here’s what’s in our last ANOVA object:
# Show me what's in my aov object
library(yarrr)
cleaner.type.int.aov <- aov(formula = time ~ cleaner * type,
data = poopdeck)
names(cleaner.type.int.aov)
In [8]:
# For example, the "fitted.values" element contains the model fits for the dependent variable (time) for every observation in our dataset. We can add these fits back to the dataset with the $ operator and assignment. For example, let’s get the model fitted values from both the interaction model (cleaner.type.int.aov) and the main-effects model (cleaner.type.aov) and assign them to new columns in the dataframe:
# Add the fits for the interaction model to the dataframe as int.fit
poopdeck$int.fit <- cleaner.type.int.aov$fitted.values
# Add the fits for the main effects model to the dataframe as me.fit
cleaner.type.aov <- aov(formula = time ~ cleaner + type,
data = poopdeck)
poopdeck$me.fit <- cleaner.type.aov$fitted.values
In [9]:
# Now let’s look at the first few rows in the table to see the fits for the first few observations.
head(poopdeck)
In [15]:
# Calculate a mixed-effects regression on time with two fixed factors (cleaner and type) and one repeated measure (day)
library(lme4)
(my.mod <- lmer(formula = time ~ cleaner + type + (1|day),
data = poopdeck))
In [17]:
pirates
In [18]:
tatmov <- lm(formula = tattoos ~ fav.pixar,
data = pirates)
summary(tatmov)
In [19]:
with(pirates,
table(fav.pixar))
In [62]:
tatmov.aov <- aov(formula = tattoos ~ fav.pixar,
data = pirates)
summary(tatmov.aov)
# aov(tatmov) returns the same.
In [51]:
tatpir <- lm(formula = tattoos ~ favorite.pirate,
data = pirates)
summary(tatpir)
In [54]:
tatpir.aov <- aov(formula = tattoos ~ favorite.pirate,
                  data = pirates)
TukeyHSD(tatpir.aov)
In [55]:
tatmopi <- lm(formula = tattoos ~ favorite.pirate + fav.pixar,
data = pirates)
summary(tatmopi)
In [58]:
tatmopii <- lm(formula = tattoos ~ favorite.pirate * fav.pixar,
data = pirates)
summary(tatmopii)
In [ ]:
library(yarrr)
head(poopdeck)
yarrr::pirateplot(time ~ cleaner,
data = poopdeck,
theme = 2,
cap.beans = TRUE,
main = "formula = time ~ cleaner")
In [2]:
# From the plot, it looks like cleaners a and b are the same, and cleaner c is a bit faster. To test this, we’ll create an ANOVA object with aov. Because time is the dependent variable and cleaner is the independent variable, we’ll set the formula to formula = time ~ cleaner
# Step 1: aov object with time as DV and cleaner as IV
cleaner.aov <- aov(formula = time ~ cleaner,
data = poopdeck)
# Now, to see a full ANOVA summary table of the ANOVA object, apply the summary() to the ANOVA object from Step 1.
# Step 2: Look at the summary of the anova object
summary(cleaner.aov)
# The main result from our table is that we have a significant effect of cleaner on cleaning time (F(2, 597) = 5.29, p = 0.005). However, the ANOVA table does not tell us which levels of the independent variable differ. In other words, we don’t know which cleaner is better than which. To answer this, we need to conduct a post-hoc test.
In [3]:
# If you’ve found a significant effect of a factor, you can then do post-hoc tests to test the differences between all pairs of levels of the independent variable. There are many types of pairwise comparisons that make different assumptions. One of the most common post-hoc tests for standard ANOVAs is the Tukey Honestly Significant Difference (HSD) test. To do an HSD test, apply the TukeyHSD() function to your ANOVA object as follows:
# Step 3: Conduct post-hoc tests
TukeyHSD(cleaner.aov)
# This table shows us pairwise differences between each pair of groups. The diff column gives the mean difference between groups (which thankfully matches what we found in the summary of the regression object before), the lwr and upr columns give a confidence interval for the difference, and p adj is a p-value testing the null hypothesis that the group difference is 0.
In [4]:
# I almost always find it helpful to combine an ANOVA summary table with a regression summary table. Because ANOVA is just a special case of regression (where all the independent variables are factors), you’ll get the same results with a regression object as you will with an ANOVA object. However, the format of the results is different and frequently easier to interpret.
# To create a regression object, use the lm() function. Your inputs to this function will be identical to your inputs to the aov() function
# Step 4: Create a regression object
cleaner.lm <- lm(formula = time ~ cleaner,
data = poopdeck)
# Show summary
summary(cleaner.lm)
# As you can see, the regression table does not give us tests for each variable like the ANOVA table does. Instead, it tells us how different each level of an independent variable is from a default value. You can tell which level of an independent variable is the default just by seeing which level is missing from the table. In this case, I don’t see a coefficient for cleaner a, so that must be the default value.
# The intercept in the table tells us the mean of the default value. In this case, the mean time of cleaner a was 66.02. The coefficients for the other levels tell us that cleaner b is, on average 0.42 minutes faster than cleaner a, and cleaner c is on average 6.94 minutes faster than cleaner a. Not surprisingly, these are the same differences we saw in the Tukey HSD test!
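This correspondence between the intercept and the default group mean can be verified on the built-in InsectSprays data (a sketch, independent of the poopdeck example):

```r
# With dummy coding, the intercept is the mean of the default (first) level
is.lm <- lm(count ~ spray, data = InsectSprays)
coef(is.lm)["(Intercept)"]
# ...which equals the raw mean count for spray A, the default level
mean(InsectSprays$count[InsectSprays$spray == "A"])
```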
In [5]:
pirateplot(formula = time ~ cleaner + type,
data = poopdeck,
ylim = c(0, 150),
xlab = "Cleaner",
ylab = "Cleaning Time (minutes)",
main = "poopdeck data",
back.col = gray(.97),
cap.beans = TRUE,
theme = 2)
In [3]:
library(yarrr)
# To conduct a two-way ANOVA or a Menage a trois NOVA, just include additional independent variables in the regression model formula with the + sign. That’s it. All the steps are the same. Let’s conduct a two-way ANOVA with both cleaner and type as independent variables. To do this, we’ll set formula = time ~ cleaner + type.
# Step 1: Create ANOVA object with aov()
cleaner.type.aov <- aov(formula = time ~ cleaner + type,
data = poopdeck)
# Step 2: Get ANOVA table with summary()
summary(cleaner.type.aov)
In [4]:
# It looks like we found significant effects of both independent variables.
# Step 3: Conduct post-hoc tests
TukeyHSD(cleaner.type.aov)
In [5]:
# The only non-significant group difference we found is between cleaner b and cleaner a. All other comparisons were significant.
# Step 4: Look at regression coefficients
cleaner.type.lm <- lm(formula = time ~ cleaner + type,
data = poopdeck)
summary(cleaner.type.lm)
# Now we need to interpret the results in respect to two default values (here, cleaner = a and type = parrot). The intercept means that the average time for cleaner a on parrot poop was 54.36 minutes. Additionally, cleaning shark poop took, on average, 23.33 minutes longer than cleaning parrot poop.
In [6]:
# Let’s repeat our previous ANOVA with two independent variables, but now we’ll include the interaction between cleaner and type. To do this, we’ll set the formula to time ~ cleaner * type.
# Step 1: Create ANOVA object with interactions
cleaner.type.int.aov <- aov(formula = time ~ cleaner * type,
data = poopdeck)
# Step 2: Look at summary table
summary(cleaner.type.int.aov)
In [7]:
# Looks like we did indeed find a significant interaction between cleaner and type. In other words, the effectiveness of a cleaner depends on the type of poop it’s being applied to. This makes sense given our plot of the data at the beginning of the chapter.
# To understand the nature of the difference, we’ll look at the regression coefficients from a regression object:
# Step 4: Calculate regression coefficients
cleaner.type.int.lm <- lm(formula = time ~ cleaner * type,
data = poopdeck)
summary(cleaner.type.int.lm)
# Again, to interpret this table, we first need to know what the default values are. We can tell this from the coefficients that are ‘missing’ from the table. Because I don’t see terms for cleanera or typeparrot, this means that cleaner = "a" and type = "parrot" are the defaults. Again, we can interpret the coefficients as differences between a level and the default. It looks like for parrot poop, cleaners b and c both take more time than cleaner a (the default). Additionally, shark poop tends to take much longer than parrot poop to clean (the estimate for typeshark is positive).
# The interaction terms tell us how the effect of cleaners changes when one is cleaning shark poop. The negative estimate (-16.96) for cleanerb:typeshark means that cleaner b is, on average 16.96 minutes faster when cleaning shark poop compared to parrot poop. Because the previous estimate for cleaner b (for parrot poop) was just 8.06, this suggests that cleaner b is slower than cleaner a for parrot poop, but faster than cleaner a for shark poop. Same thing for cleaner c which simply has stronger effects in both directions.
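The same bookkeeping can be checked on the built-in warpbreaks data (wool by tension): in an interaction model, each cell mean is the intercept plus the relevant main-effect and interaction offsets. A sketch:

```r
# Reconstruct a cell mean from the interaction-model coefficients
wb.lm <- lm(breaks ~ wool * tension, data = warpbreaks)
cf <- coef(wb.lm)
# Predicted mean for wool B, tension M: intercept + both offsets + interaction
cf["(Intercept)"] + cf["woolB"] + cf["tensionM"] + cf["woolB:tensionM"]
# This matches the raw cell mean:
with(warpbreaks, mean(breaks[wool == "B" & tension == "M"]))
```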
In [8]:
# Caution: center your variables before interpreting main effects in models with interactions. To see why, first consider this regression with an interaction between the IVs weight and clarity:
# Create a regression model with interactions between IVS weight and clarity
diamonds.int.lm <- lm(formula = value ~ weight * clarity,
data = diamonds)
# Show summary statistics of model coefficients
summary(diamonds.int.lm)$coefficients
In [9]:
# Centering a variable means simply subtracting the mean of the variable from all observations. In the following code, I’ll repeat the previous regression, but first I’ll create new centered variables weight.c and clarity.c, and then run the regression on the interaction between these centered variables:
# Create centered versions of weight and clarity
diamonds$weight.c <- diamonds$weight - mean(diamonds$weight)
diamonds$clarity.c <- diamonds$clarity - mean(diamonds$clarity)
# Create a regression model with interactions of centered variables
diamonds.int.lm <- lm(formula = value ~ weight.c * clarity.c,
data = diamonds)
# Print summary
summary(diamonds.int.lm)$coefficients
# That looks much better! Now we see that the main effects are significant and the interaction is non-significant.
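One consequence worth knowing: centering changes the main-effect estimates (they become effects at the mean of the other IV) but leaves the interaction estimate itself untouched. A sketch with the built-in mtcars data:

```r
# Fit the same interaction model on raw and on centered predictors
raw.lm <- lm(mpg ~ wt * hp, data = mtcars)
mtcars$wt.c <- mtcars$wt - mean(mtcars$wt)
mtcars$hp.c <- mtcars$hp - mean(mtcars$hp)
cen.lm <- lm(mpg ~ wt.c * hp.c, data = mtcars)

# The main-effect coefficients differ, but the interaction term is identical
coef(raw.lm)["wt:hp"]
coef(cen.lm)["wt.c:hp.c"]
```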
In [3]:
library(yarrr)
# In the next code chunk, I’ll calculate 3 separate ANOVAs from the poopdeck data using the three different sum-of-squares types (I, II, and III). First, I’ll create a regression object with lm(). As you’ll see, the Anova() function requires you to enter a regression object as the main argument, and not a formula and dataset. That is, you need to first create a regression object from the data with lm() (or glm()), and then enter that object into the Anova() function. You can also do the same thing with the standard aov() function.
# Step 1: Calculate regression object with lm()
time.lm <- lm(formula = time ~ type + cleaner,
data = poopdeck)
In [4]:
# Now that I’ve created the regression object time.lm, I can calculate the three different types of ANOVAs by entering the object as the main argument to either aov() for a Type I ANOVA, or Anova() in the car package for a Type II or Type III ANOVA:
# Type I ANOVA - aov()
time.I.aov <- aov(time.lm)
# Type II ANOVA - Anova(type = 2)
time.II.aov <- car::Anova(time.lm, type = 2)
# Type III ANOVA - Anova(type = 3)
time.III.aov <- car::Anova(time.lm, type = 3)
# As it happens, the data in the poopdeck dataframe are perfectly balanced, so we’ll get exactly the same result for each ANOVA type. However, if they were not balanced, then we should not use the Type I ANOVA calculated with the aov() function.
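To see the order dependence of Type I sums of squares, we can deliberately unbalance a built-in dataset (warpbreaks here; dropping rows is purely for illustration) and fit the same two factors in both orders:

```r
# Unbalance the design by dropping the first five rows (illustration only)
wb <- warpbreaks[-(1:5), ]

# Type I (sequential) ANOVA: term order matters when the design is unbalanced
anova(lm(breaks ~ wool + tension, data = wb))
anova(lm(breaks ~ tension + wool, data = wb))
# The wool row differs between the two orderings; Type II/III tests
# (car::Anova) do not depend on the order of terms
```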
In [3]:
library(yarrr)
# Let’s do an example with the diamonds dataset. I’ll create three regression models that each predict a diamond’s value. The models will differ in their complexity – that is, the number of independent variables they use. diamonds.mod1 will be the simplest model with just one IV (weight), diamonds.mod2 will include 2 IVs (weight and clarity) while diamonds.mod3 will include three IVs (weight, clarity, and color).
# model 1: 1 IV (only weight)
diamonds.mod1 <- lm(value ~ weight, data = diamonds)
# Model 2: 2 IVs (weight AND clarity)
diamonds.mod2 <- lm(value ~ weight + clarity, data = diamonds)
# Model 3: 3 IVs (weight AND clarity AND color)
diamonds.mod3 <- lm(value ~ weight + clarity + color, data = diamonds)
In [4]:
# Now let’s use the anova() function to compare these models and see which one provides the best parsimonious fit of the data. First, we’ll compare the two simplest models: model 1 with model 2. Because these models differ in the use of the clarity IV (both models use weight), this ANOVA will test whether or not including the clarity IV leads to a significant improvement over using just the weight IV:
# Compare model 1 to model 2
anova(diamonds.mod1, diamonds.mod2)
# As you can see, the result shows a Df of 1 (indicating that the more complex model has one additional parameter), and a very small p-value (< .001). This means that adding the clarity IV to the model did lead to a significantly improved fit over model 1.
In [5]:
# Next, let’s use anova() to compare model 2 and model 3. This will tell us whether adding color (on top of weight and clarity) further improves the model:
# Compare model 2 to model 3
anova(diamonds.mod2, diamonds.mod3)
# The result is non-significant (p = 0.21). Adding color does not significantly improve the fit, so we should prefer the more parsimonious model 2 with only 2 IVs.
In [7]:
# You don’t need to compare models that differ by only one IV – you can also compare models that differ by multiple IVs. For example, here is a comparison of model 1 (with 1 IV) to model 3 (with 3 IVs):
# Compare model 1 to model 3
anova(diamonds.mod1, diamonds.mod3)
# The result shows that model 3 did indeed provide a significantly better fit to the data compared to model 1. However, as we know from our previous analysis, model 3 is not significantly better than model 2.
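Under the hood, the nested-model F statistic is the drop in residual sum of squares per added parameter, divided by the larger model's mean squared error. A sketch with the built-in mtcars data (any nested pair works):

```r
# Compute the nested-model F statistic by hand from the residual sums of squares
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)
rss1 <- sum(resid(m1)^2)
rss2 <- sum(resid(m2)^2)

# One extra parameter in m2, so the numerator df is 1
F.stat <- ((rss1 - rss2) / 1) / (rss2 / df.residual(m2))
F.stat
anova(m1, m2)   # the F column reports the same value
```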
In [22]:
# Plot the density of the F(2, 150) distribution, then overlay F(4, 10)
plot(function(x) df(x, df1 = 2, df2 = 150), xlim = c(-2, 10))
plot(function(x) df(x, df1 = 4, df2 = 10), xlim = c(-2, 10),
     lty = "dashed", add = TRUE)