Chapter 6.2.4.1 Exercises
In [2]:
library(yarrr)
# Create a binary variable indicating whether or not a diamond's value is greater than 190
diamonds$value.g190 <- diamonds$value > 190
# Conduct a logistic regression on the new binary variable
(diamond.glm <- glm(formula = value.g190 ~ weight + clarity + color,
                    data = diamonds,
                    family = binomial))
In [3]:
# Here are the resulting coefficients:
# Print coefficients from logistic regression
summary(diamond.glm)$coefficients
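To make logistic regression coefficients easier to interpret, you can exponentiate them: exp() of a coefficient gives an odds ratio, the multiplicative change in the odds of the outcome for a one-unit increase in that predictor. Here is a self-contained sketch with simulated data (the variables here are made up for illustration, not part of the diamonds example):

```r
# Hypothetical example: convert logistic regression coefficients
# to odds ratios with exp()
set.seed(100)
x <- rnorm(200)                                           # simulated predictor
y <- rbinom(200, size = 1, prob = plogis(0.5 + 1.2 * x))  # simulated binary outcome
toy.glm <- glm(y ~ x, family = binomial)

# Each exponentiated coefficient is the multiplicative change in
# the odds of y = 1 for a one-unit increase in that predictor
exp(coef(toy.glm))
```

The same call, exp(coef(diamond.glm)), would give odds ratios for the diamonds model.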
In [4]:
# Just like with regular regression with lm(), we can get the fitted values from the model and put them back into our dataset to see how well the model fit the data:
# Add logistic fitted values back to dataframe as new column pred.g190
diamonds$pred.g190 <- diamond.glm$fitted.values
# Look at the first few rows (of the named columns)
head(diamonds[c("weight", "clarity", "color", "value", "pred.g190")])
# Looking at the first few observations, the fitted probabilities match the data pretty well. For example, the first diamond, with a value of 182.5, had a fitted probability of just 0.16 of being valued greater than 190. In contrast, the second diamond, which had a value of 191.2, had a much higher fitted probability of 0.82.
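Beyond eyeballing the first few rows, you could threshold the fitted probabilities at 0.5 and cross-tabulate the resulting classifications against the true binary variable; the same idea applies directly to diamonds$pred.g190 and diamonds$value.g190. A self-contained sketch with simulated data (0.5 is just one possible cutoff):

```r
# Simulated stand-in for a fitted logistic regression
set.seed(42)
x <- rnorm(300)
truth <- rbinom(300, size = 1, prob = plogis(x)) == 1
toy.glm <- glm(truth ~ x, family = binomial)

# Classify each case as TRUE if its fitted probability exceeds 0.5
pred.class <- toy.glm$fitted.values > 0.5

# Confusion table: predicted class versus true class
table(predicted = pred.class, actual = truth)

# Overall proportion of correct classifications
mean(pred.class == truth)
```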
In [3]:
# Just like we did with regular regression, you can use the predict() function along with the results of a glm() object to predict new data. Let’s use the diamond.glm object to predict the probability that the new diamonds will have a value greater than 190:
# Predict the 'probability' that the 3 new diamonds will have a value greater than 190
predict(object = diamond.glm,
        newdata = diamonds.new)
In [4]:
# What the heck, these don’t look like probabilities! True, they’re not. They are logit-transformed probabilities. To turn them back into probabilities, we need to apply the inverse logit function:
# Get logit predictions of new diamonds
logit.predictions <- predict(object = diamond.glm,
                             newdata = diamonds.new)
# Apply the inverse logit, p = 1 / (1 + exp(-logit)), to transform to probabilities
prob.predictions <- 1 / (1 + exp(-logit.predictions))
# Print final predictions!
prob.predictions
# So, the model predicts that the probability that the three new diamonds will be valued over 190 is 99.23%, 2.86%, and 40.38% respectively.
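As a shortcut, you don't have to code the inverse logit yourself: base R's plogis() computes 1 / (1 + exp(-x)), and predict() for a glm object will apply the inverse link for you if you set type = "response". A minimal self-contained sketch:

```r
# plogis() is the built-in inverse logit
logits <- c(-2, 0, 2)
plogis(logits)
1 / (1 + exp(-logits))  # identical values

# For a fitted model, this returns probabilities directly, e.g.:
# predict(diamond.glm, newdata = diamonds.new, type = "response")
```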
In [5]:
# You can easily add a regression line to a scatterplot. To do this, just put the regression object you created with lm() as the main argument to abline(). For example, the following code will create a scatterplot showing the relationship between a diamond’s weight and its value, including a red regression line:
# Scatterplot of diamond weight and value
plot(x = diamonds$weight,
     y = diamonds$value,
     xlab = "Weight",
     ylab = "Value",
     main = "Adding a regression line with abline()")
# Calculate regression model
(diamonds.lm <- lm(formula = value ~ weight,
                   data = diamonds))
# Add regression line
abline(diamonds.lm,
col = "red", lwd = 2)
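abline() also accepts an intercept a and a slope b directly, which is handy when you have coefficients but not the lm object itself; coef() extracts them from a fitted model. A small self-contained sketch with simulated data:

```r
# abline(a, b) draws the line y = a + b * x
set.seed(7)
x <- 1:20
y <- 3 + 2 * x + rnorm(20)   # simulated linear data
toy.lm <- lm(y ~ x)

plot(x, y, main = "Regression line from coefficients")
abline(a = coef(toy.lm)[1], b = coef(toy.lm)[2],
       col = "blue", lty = 2, lwd = 2)
```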
In [6]:
# The distribution of movie revenues is highly skewed.
hist(movies$revenue.all,
main = "Movie revenue\nBefore log-transformation")
In [11]:
# For example, look at the distribution of movie revenues in the movies dataset, shown in the margin in Figure 15.5. As you can see, these data don’t look normally distributed at all. If we want to conduct a standard regression analysis on these data, we need to create a new log-transformed version of the variable. In the following code, I’ll create a new variable called revenue.all.log, defined as the logarithm of revenue.all.
# Create a new log-transformed version of movie revenue
movies$revenue.all.log <- log(movies$revenue.all)
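One caveat (my note, not something this chapter covers): log() of 0 is -Inf, so if any movie in the data earned nothing, revenue.all.log would contain -Inf values that would break lm(). A common workaround is log1p(), which computes log(1 + x) and keeps zeros finite:

```r
# log() blows up at zero; log1p() does not
revenues <- c(0, 10, 1000, 250000)   # hypothetical revenue values
log(revenues)    # first value is -Inf
log1p(revenues)  # log(1 + x): zero maps to 0, all values finite
```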
In [10]:
# In Figure 15.6 you can see a histogram of the new log-transformed variable. It’s still skewed, but not nearly as badly as before, so I would feel much better using this variable in a standard regression analysis with lm().
# Distribution of log-transformed revenue is much less skewed
hist(movies$revenue.all.log,
main = "Log-transformed Movie revenue")