Chapter 6.2.2.1 Exercises
In [4]:
library("HSAUR3")    # later editions of HSAUR supersede HSAUR/HSAUR2; one is enough
library("graphics")
data("clouds")
clouds
# bxpseeding and bxpecho are the list objects returned by the chapter's earlier
# boxplot() calls; recreate them here so their $out components are defined:
bxpseeding <- boxplot(rainfall ~ seeding, data = clouds,
                      ylab = "Rainfall", xlab = "Seeding")
bxpecho <- boxplot(rainfall ~ echomotion, data = clouds,
                   ylab = "Rainfall", xlab = "Echo motion")
rownames(clouds)[clouds$rainfall %in% c(bxpseeding$out, bxpecho$out)]
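The selection above works because boxplot() invisibly returns a list whose $out component holds the points drawn as outliers. A minimal base-R sketch on toy data (the values here are made up for illustration):

```r
# boxplot() returns its computed statistics; with plot = FALSE nothing is drawn.
x <- c(1:10, 100)                 # 100 lies far beyond the upper whisker
b <- boxplot(x, plot = FALSE)
b$out                             # values flagged as outliers -> 100
which(x %in% b$out)               # their positions in the data -> 11
```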
In [5]:
clouds_formula <- rainfall ~ seeding * (sne + cloudcover +
prewetness + echomotion) + time
clouds_formula
In [10]:
Xstar <- model.matrix(clouds_formula, data = clouds)
Xstar
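model.matrix() turns the formula into the numeric design matrix X*: each factor becomes dummy columns and `a * b` expands to `a + b + a:b`. A small sketch on toy data (the variable names are invented for illustration):

```r
# Toy data frame: one two-level factor and one numeric covariate
d <- data.frame(seed = factor(c("no", "yes", "no", "yes")),
                sne  = c(1.5, 2.0, 3.1, 4.2))
# seed * sne expands to both main effects plus their interaction column
X <- model.matrix(~ seed * sne, data = d)
colnames(X)   # "(Intercept)" "seedyes" "sne" "seedyes:sne"
```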
In [12]:
library(MASS)
library(ggplot2)
data(geyser)
fit.ols <- lm(waiting ~ duration, data = geyser)
fit.ols
In [13]:
ggplot(geyser, aes(x = duration, y = waiting)) +
geom_point() +
labs(x = "duration", y = "waiting",
title = "geyser") +
geom_smooth(method = "lm", se = TRUE)
# Note: duration (the predictor in fit.ols) goes on the x-axis and waiting
# (the response) on the y-axis, so the smoothed line matches fit.ols.
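Beyond printing fit.ols, coef() extracts the intercept and slope, and predict() applies the fitted line to new data. A short sketch, assuming the MASS package (which ships with most R installations) is available:

```r
library(MASS)                       # provides the geyser data
fit.ols <- lm(waiting ~ duration, data = geyser)
coef(fit.ols)                       # intercept and slope of the fitted line
# Predicted waiting time for an eruption lasting 3 minutes:
predict(fit.ols, newdata = data.frame(duration = 3))
```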
In [7]:
# In Figure 7.4, we plot a scatterplot of score over age.
# Since gender is a binary categorical variable in this study, we can make an interesting tweak: assign a color to points from each of its two levels, female and male.
# Because color = gender is set at the top level in ggplot(), every geometry that follows inherits the same aesthetic mapping, so the geom_smooth(method = "lm", se = FALSE) layer automatically fits a separate regression line for each gender.
ggplot(evals_ch7, aes(x = age, y = score, color = gender)) +
geom_jitter() +
labs(x = "Age", y = "Teaching Score", color = "Gender") +
geom_smooth(method = "lm", se = FALSE)
# We notice some interesting trends:
# - There are almost no women faculty over the age of 60 (note the lack of red dots above 60).
# - Fitting separate regression lines for men and women, we see they have different slopes.
# - The associated effect of increasing age appears much harsher for women: as women age, the drop in their teaching score appears to be faster.
In [13]:
# Parallel slopes model:
score_model_2 <- lm(score ~ age + gender, data = evals_ch7)
get_regression_table(score_model_2)
In [22]:
ggplot(evals_ch7, aes(x = age, y = score, color = gender)) +
geom_jitter() +
labs(x = "Age", y = "Teaching Score", color = "Gender") +
geom_smooth(method = "lm", se = FALSE)
# Mismatch: geom_smooth(method = "lm") refits a separate line per gender, so
# this plot does NOT show the parallel slopes model score_model_2. Replace
# geom_smooth() with moderndive's geom_parallel_slopes(se = FALSE) to
# visualize the parallel slopes fit.
In [17]:
score_model_interaction <- lm(score ~ age * gender, data = evals_ch7)
get_regression_table(score_model_interaction)
In [18]:
ggplot(evals_ch7, aes(x = age, y = score, color = gender)) +
geom_jitter() +
labs(x = "Age", y = "Teaching Score", color = "Gender") +
geom_smooth(method = "lm", se = FALSE)
# No mismatch here: with a full interaction (score ~ age * gender), the model's
# fitted lines coincide exactly with the separate per-gender lines that
# geom_smooth(method = "lm") draws.
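With a binary factor, the interaction model's four coefficients combine into two separate lines: the baseline level uses the intercept and the x slope, while the other level adds the factor offset and the interaction term. A self-contained sketch on simulated data (not the evals data; all numbers are made up):

```r
set.seed(1)
d <- data.frame(g = factor(rep(c("f", "m"), each = 50)),
                x = runif(100, 30, 70))
# Two groups with different intercepts and slopes, plus a little noise
d$y <- ifelse(d$g == "f", 5 - 0.02 * d$x, 4 + 0.01 * d$x) + rnorm(100, sd = 0.1)
fit <- lm(y ~ x * g, data = d)
b <- coef(fit)                 # (Intercept), x, gm, x:gm
# Line for baseline level "f": intercept b[1],           slope b["x"]
# Line for level "m":          intercept b[1] + b["gm"], slope b["x"] + b["x:gm"]
c(f_slope = unname(b["x"]), m_slope = unname(b["x"] + b["x:gm"]))
```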
In [3]:
library(ggplot2)
library(dplyr)
library(moderndive)
library(ISLR)
library(skimr)
evals_ch7 <- evals %>%
select(score, age, gender)
score_model_interaction <- lm(score ~ age * gender, data = evals_ch7)
In [4]:
# Now say we want to apply the above calculations to all 463 instructors in the
# evals_ch7 dataset, both male and female. As our multiple regression models
# get more complex, computing fitted values and residuals by hand gets more and
# more tedious. The get_regression_points() function spares us this tedium and
# returns all fitted values and all residuals. For simplicity, let's focus only
# on the fitted interaction model, saved in score_model_interaction.
regression_points <- get_regression_points(score_model_interaction)
regression_points
# Recall the format of the output:
# - score corresponds to y, the observed value.
# - score_hat corresponds to the fitted value y_hat (i.e., score_hat).
# - residual corresponds to the residual y - y_hat.
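These columns always satisfy the identity observed = fitted + residual, which is easy to check with base R alone (toy data here, not the evals set):

```r
set.seed(42)
x <- rnorm(30)
y <- 2 + 3 * x + rnorm(30)
fit <- lm(y ~ x)
# fitted() gives y_hat, residuals() gives y - y_hat, and they add back up to y
all.equal(y, as.numeric(fitted(fit) + residuals(fit)))   # TRUE
```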
In [5]:
# As always, let’s perform a residual analysis first with a histogram, which we can facet by gender:
ggplot(regression_points, aes(x = residual)) +
geom_histogram(binwidth = 0.25, color = "white") +
labs(x = "Residual") +
facet_wrap(~gender)
In [7]:
# Second, plot the residuals against the predictor variables:
# x1: numerical explanatory/predictor variable age
# x2: categorical explanatory/predictor variable gender
ggplot(regression_points, aes(x = age, y = residual)) +
geom_point() +
labs(x = "age", y = "Residual") +
geom_hline(yintercept = 0, col = "blue", linewidth = 1) +
facet_wrap(~ gender)
In [3]:
# Recall from Table 7.2 that the correlation coefficient between Income (in
# thousands of dollars) and credit card Balance was 0.464. What if, instead, we
# computed the correlation between Income and Balance with Income measured in
# dollars rather than thousands of dollars? This can be done by multiplying
# Income by 1000.
library(ISLR)
library(dplyr)
data(Credit)
Credit %>%
select(Balance, Income) %>%
mutate(Income = Income * 1000) %>%
cor()
# We see it is the same! The correlation coefficient is invariant to linear
# transformations: the correlation between x and y equals the correlation
# between a*x + b and y for any real numbers a and b with a > 0. (If a < 0,
# the magnitude is unchanged but the sign of the correlation flips.)
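The invariance is easy to verify directly on simulated data; only base R is needed:

```r
set.seed(7)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)
# Shifting and positively scaling x leaves the correlation unchanged
all.equal(cor(x, y), cor(1000 * x + 3, y))     # TRUE
# A negative scale factor flips only the sign
all.equal(cor(-2 * x, y), -cor(x, y))          # TRUE
```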
In [2]:
library(ggplot2)
library(dplyr)
library(moderndive)
library(ISLR)
library(skimr)
In [6]:
# Let’s load the Credit data and select() only the needed subset of variables.
(Credit <- Credit %>%
select(Balance, Limit, Income, Rating, Age))
In [5]:
# Let’s look at the raw data values, both with RStudio’s spreadsheet viewer
# (View()) and with the glimpse() function. Table 7.1 shows only 5 randomly
# selected credit card holders out of the 400:
View(Credit)
# Balance Limit Income Rating Age
# 141 1425 6045 39.8 459 32
# 9 279 3300 15.1 266 66
# 13 204 5308 80.6 394 57
# 262 1050 9310 180.4 665 67
# 267 15 4952 88.8 360 86
In [6]:
glimpse(Credit)
In [5]:
# Let’s look at some summary statistics, again using the skim() function from the skimr package:
Credit %>%
select(Balance, Limit, Income) %>%
skim()
# We observe for example:
# The mean and median credit card balance are $520.01 and $459.50 respectively.
# 25% of card holders had debts of $68.75 or less.
# The mean and median credit card limit are $4735.6 and $4622.50 respectively.
# 75% of these card holders had incomes of $57,470 or less.
In [9]:
# Since our outcome variable Balance and the explanatory variables Limit and Rating are numerical, we can compute the correlation coefficient between pairs of these variables. First, we could run the get_correlation() command as seen in Subsection 6.1.1 twice, once for each explanatory variable:
Credit %>%
get_correlation(Balance ~ Limit)
Credit %>%
get_correlation(Balance ~ Income)
In [9]:
# Or we can simultaneously compute them by returning a correlation matrix in Table 7.2. We can read off the correlation coefficient for any pair of variables by looking them up in the appropriate row/column combination.
Credit %>%
select(Balance, Limit, Income) %>%
cor()
In [10]:
# Let’s visualize the relationship of the outcome variable with each of the two explanatory variables in two separate plots:
ggplot(Credit, aes(x = Limit, y = Balance)) +
geom_point() +
labs(x = "Credit limit (in $)", y = "Credit card balance (in $)",
title = "Relationship between balance and credit limit") +
geom_smooth(method = "lm", se = FALSE)
ggplot(Credit, aes(x = Income, y = Balance)) +
geom_point() +
labs(x = "Income (in $1000)", y = "Credit card balance (in $)",
title = "Relationship between balance and income") +
geom_smooth(method = "lm", se = FALSE)
In [2]:
Balance_model <- lm(Balance ~ Limit + Income, data = Credit)
get_regression_table(Balance_model)
In [15]:
Balance_model <- lm(Balance ~ Limit + Income, data = Credit)
get_regression_table(Balance_model)
# Use distinct names for the one-predictor models so that Balance_model still
# refers to the two-predictor model used in the residual analysis below:
Balance_model_limit <- lm(Balance ~ Limit, data = Credit)
get_regression_table(Balance_model_limit)
Balance_model_income <- lm(Balance ~ Income, data = Credit)
get_regression_table(Balance_model_income)
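The Limit and Income coefficients differ between the one-predictor and two-predictor models because Limit and Income are themselves correlated. A simulated sketch of the phenomenon (synthetic data, not the Credit dataset; the sign flip is built in by construction):

```r
set.seed(123)
limit   <- rnorm(200)                    # stand-in predictor
income  <- limit + rnorm(200, sd = 0.3)  # strongly correlated with limit
balance <- 2 * limit - income + rnorm(200, sd = 0.1)
# Alone, income proxies for limit and its slope comes out positive...
coef(lm(balance ~ income))["income"]
# ...but holding limit constant reveals income's true negative effect.
coef(lm(balance ~ limit + income))["income"]
```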
In [21]:
# Recall in Section 6.1.4, our first residual analysis plot investigated the
# presence of any systematic pattern in the residuals when we had a single
# numerical predictor: bty_avg. For the Credit card dataset, since we have two
# numerical predictors, Limit and Income, we must perform this twice:
regression_points <- get_regression_points(Balance_model)
ggplot(regression_points, aes(x = Income, y = residual)) +
geom_point() +
labs(x = "Income (in $1000)", y = "Residual", title = "Residuals vs income")
In [20]:
ggplot(regression_points, aes(x = Limit, y = residual)) +
geom_point() +
labs(x = "Credit limit (in $)", y = "Residual", title = "Residuals vs credit limit")
In [5]:
# In this case, there does appear to be a systematic pattern to the residuals:
# the scatter of the residuals around the line y = 0 is not consistent. This
# behavior is further evidenced by the histogram of residuals in Figure 7.3.
# We observe that the residuals have a slight right skew (recall that data are
# right-skewed, or positively skewed, if there is a tail to the right).
# Ideally, these residuals should be bell-shaped around a residual value of 0.
ggplot(regression_points, aes(x = residual)) +
geom_histogram(color = "white") +
labs(x = "Residual")
# Another way to interpret this histogram: since the residual is computed as
# y - y_hat = balance - balance_hat, there are some observations where the
# fitted value y_hat is much lower than the observed value y. In other words,
# we are underestimating certain credit card holders' balances by a very
# large amount.
In [4]:
# We will examine the interest rate for four-year car loans, using data from
# the U.S. Federal Reserve's mean rates. We are looking at and plotting means.
# This, of course, is a very bad thing because it removes a lot of the
# variance and is misleading. The only reason we work with the data this way
# is to provide an example of linear regression that does not use too many
# data points.
# The first thing to do is to specify the data. Here there are only five pairs of numbers so we can enter them in manually. Each of the five pairs consists of a year and the mean interest rate:
year <- c(2000 , 2001 , 2002 , 2003 , 2004)
rate <- c(9.34 , 8.50 , 7.62 , 6.93 , 6.60)
In [3]:
# The next thing we do is take a look at the data. We first plot the data
# using a scatter plot and notice that it looks linear. To confirm our
# suspicions we then find the correlation between the year and the mean
# interest rates:
plot(year, rate,
main="Commercial Banks Interest Rate for 4 Year Car Loan",
sub="http://www.federalreserve.gov/releases/g19/20050805/")
cor(year, rate)
In [6]:
fit <- lm(rate ~ year); fit
# When you make the call to lm it returns a variable with a lot of information in it. If you are just learning about least squares regression you are probably only interested in two things at this point, the slope and the y-intercept. If you just type the name of the variable returned by lm it will print out this minimal information to the screen.
In [7]:
#If you would like to know what else is stored in the variable you can use the attributes command:
attributes(fit)
In [9]:
#One of the things you should notice is the coefficients variable within fit. You can print out the y-intercept and slope by accessing this part of the variable:
fit$coefficients[1]
fit$coefficients[[1]]
# Note: use double square brackets [[ ]] if you just want the bare number;
# single brackets return a named vector element (the name tags along).
In [12]:
# A better use for this formula would be to calculate the residuals and plot them:
res <- rate - (fit$coefficients[[2]]*year+fit$coefficients[[1]]); res
plot(year,res)
In [13]:
# That is a bit messy, but fortunately there are easier ways to get the residuals. Two other ways are shown below:
residuals(fit)
fit$residuals
plot(year,fit$residuals)
In [15]:
# If you want to plot the regression line on the same plot as your scatter plot you can use the abline function along with your variable fit:
plot(year,rate,
main="Commercial Banks Interest Rate for 4 Year Car Loan",
sub="http://www.federalreserve.gov/releases/g19/20050805/")
abline(fit)
In [16]:
# Finally, as a teaser for the kinds of analyses you might see later, you can get the results of an F-test by asking R for a summary of the fit variable:
summary(fit)
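summary() returns a list, so individual pieces such as R-squared and the F statistic can be pulled out programmatically. A short sketch using the same five data points:

```r
year <- c(2000, 2001, 2002, 2003, 2004)
rate <- c(9.34, 8.50, 7.62, 6.93, 6.60)
fit  <- lm(rate ~ year)
s <- summary(fit)
s$r.squared     # proportion of variance in rate explained by year
s$fstatistic    # F statistic with its numerator/denominator df
coef(s)         # estimates, standard errors, t values, p values
```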