Chapter 6.2.4.1 Exercises

LGR
In [2]:
library(yarrr)

# Create a binary variable indicating whether or not a diamond's value is greater than 190

diamonds$value.g190 <- diamonds$value > 190

# Conduct a logistic regression on the new binary variable

(diamond.glm <- glm(formula = value.g190 ~ weight + clarity + color,
                   data = diamonds,
                   family = binomial))
Call:  glm(formula = value.g190 ~ weight + clarity + color, family = binomial, 
    data = diamonds)

Coefficients:
(Intercept)       weight      clarity        color  
   -18.8009       1.1251       9.2910      -0.3836  

Degrees of Freedom: 149 Total (i.e. Null);  146 Residual
Null Deviance:	    207.5 
Residual Deviance: 106.7 	AIC: 114.7
In [3]:
# Here are the resulting coefficients:
# Print coefficients from logistic regression

summary(diamond.glm)$coefficients
EstimateStd. Errorz valuePr(>|z|)
(Intercept)-18.8009011 3.4634258 -5.428412 5.685775e-08
weight 1.1251118 0.1968203 5.716441 1.087783e-08
clarity 9.2909721 1.9629068 4.733272 2.209289e-06
color -0.3836406 0.2480698 -1.546503 1.219832e-01
In [4]:
# Just like with regular regression with lm(), we can get the fitted values from the model and put them back into our dataset to see how well the model fit the data:

# Add logistic fitted values back to dataframe as new column pred.g190

diamonds$pred.g190 <- diamond.glm$fitted.values

# Look at the first few rows (of the named columns)

head(diamonds[c("weight", "clarity", "color", "value", "pred.g190")])

# Looking at the first few observations, it looks like the probabilities match the data pretty well. For example, the first diamond with a value of 182.5 had a fitted probability of just 0.16 of being valued greater than 190. In contrast, the second diamond, which had a value of 191.2 had a much higher fitted probability of 0.82.
weightclaritycolorvaluepred.g190
9.35 0.88 4 182.5 0.16251772
11.10 1.05 5 191.2 0.82129665
8.65 0.85 6 175.7 0.03008442
10.43 1.15 5 195.2 0.84559083
10.62 0.92 5 181.6 0.44454833
12.35 0.44 4 182.9 0.08688269
In [3]:
# Just like we did with regular regression, you can use the predict() function along with the results of a glm() object to predict new data. Let’s use the diamond.glm object to predict the probability that the new diamonds will have a value greater than 190:

# Predict the 'probability' that the 3 new diamonds will have a value greater than 190

predict(object = diamond.glm,
        newdata = diamonds.new)
Error in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == : object 'diamonds.new' not found
Traceback:

1. predict(object = diamond.glm, newdata = diamonds.new)
2. predict.glm(object = diamond.glm, newdata = diamonds.new)
3. predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == 
 .     "link", "response", type), terms = terms, na.action = na.action)
In [4]:
library(stats)

# What the heck, these don’t look like probabilities! True, they’re not. They are logit-transformed probabilities. To turn them back into probabilities, we need to invert them by applying the inverse logit function:

# Get logit predictions of new diamonds

logit.predictions <- predict(object = diamond.glm,
                             newdata = diamonds.new
                             )

# Apply inverse logit to transform to probabilities (See Equation in the margin)

prob.predictions <- 1 / (1 + exp(-logit.predictions))

# Print final predictions!

prob.predictions

# So, the model predicts that the probability that the three new diamonds will be valued over 190 is 99.23%, 2.86%, and 40.38% respectively.
Error in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == : object 'diamonds.new' not found
Traceback:

1. predict(object = diamond.glm, newdata = diamonds.new)
2. predict.glm(object = diamond.glm, newdata = diamonds.new)
3. predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == 
 .     "link", "response", type), terms = terms, na.action = na.action)
In [5]:
# You can easily add a regression line to a scatterplot. To do this, just put the regression object you created with as the main argument to . For example, the following code will create the scatterplot on the right (Figure~) showing the relationship between a diamond’s weight and its value including a red regression line:

# Scatterplot of diamond weight and value

plot(x = diamonds$weight,
     y = diamonds$value,
     xlab = "Weight",
     ylab = "Value",
     main = "Adding a regression line with abline()"
     )

# Calculate regression model

(diamonds.lm <- lm(formula = value ~ weight,
                  data = diamonds))

# Add regression line

abline(diamonds.lm,
       col = "red", lwd = 2)
Call:
lm(formula = value ~ weight, data = diamonds)

Coefficients:
(Intercept)       weight  
    165.660        2.402  
In [6]:
# The distribution of movie revenues is highly skewed.

hist(movies$revenue.all, 
     main = "Movie revenue\nBefore log-transformation")
In [11]:
# For example, look at the distribution of movie revenues in the movies dataset in the margin Figure 15.5: As you can see, these data don’t look normally distributed at all. If we want to conduct a standard regression analysis on these data, we need to create a new log-transformed version of the variable. In the following code, I’ll create a new variable called revenue.all.log defined as the logarithm of revenue.all.
# Create a new log-transformed version of movie revenue

movies$revenue.all.log <- log(movies$revenue.all)
In [10]:
# In Figure 15.6 you can see a histogram of the new log-transformed variable. It’s still skewed, but not nearly as badly as before, so I would feel much better using this variable in a standard regression analysis with lm().
# Distribution of log-transformed revenue is much less skewed

hist(movies$revenue.all.log, 
     main = "Log-transformed Movie revenue")