Chapter 5.1.5 Exercises
In [4]:
setwd("./data")
load( "parenthood.Rdata" )
library(lsr)
who(TRUE)
In [6]:
# If I calculate my descriptive statistics using the describe() function
library(psych)
describe( parenthood2 )
# we can see from the n column that there are 9 missing values for dan.sleep, 11 missing values for baby.sleep and 8 missing values for dan.grump.
In [12]:
# Similar, but not quite the same. It’s also worth noting that the correlate() function (in the lsr package) automatically uses the “pairwise complete” method:
library(lsr)
correlate(parenthood2)
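For comparison, base R's cor() returns NA for any pair involving missing data unless you set its use argument; use = "pairwise.complete.obs" reproduces what correlate() does. A minimal sketch with a toy data frame (the column names here are made up, not taken from parenthood2):

```r
# cor() needs the "use" argument to cope with NAs; "pairwise.complete.obs"
# matches the behaviour of lsr's correlate()
toy <- data.frame(a = c(1, 2, NA, 4, 5),
                  b = c(2, 1, 4, NA, 5))
cor(toy)["a", "b"]                        # NA: no missing-data handling specified
cor(toy, use = "pairwise.complete.obs")   # uses rows 1, 2 and 5 for a vs b
```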
In [4]:
# we see that the file contains a single data frame called parenthood, which contains four variables: dan.sleep, baby.sleep, dan.grump and day. If we peek at the data using head(), here’s what we get:
head(parenthood,10)
In [6]:
# Next, I’ll calculate some basic descriptive statistics:
library(psych)
describe( parenthood )
In [10]:
library(ggplot2)
library(dplyr)
library(moderndive)
library(ISLR)
library(skimr)
Balance <- lm(dan.sleep ~ dan.grump, data = parenthood)
get_regression_table(Balance)
ggplot(parenthood, aes(x = dan.grump, y = dan.sleep)) +
  geom_point() +
  labs(x = "grump", y = "sleep",
       title = "parenthood") +
  geom_smooth(method = "lm")
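get_regression_table() is a moderndive convenience wrapper; the intercept and slope estimates come straight from coef() on the fitted lm object. A sketch using a small made-up data frame in place of parenthood:

```r
# coef() extracts the fitted intercept and slope from an lm object;
# the toy values below are illustrative, not the parenthood data
toy <- data.frame(grump = c(60, 70, 80, 90),
                  sleep = c(8, 7, 6, 5))
fit <- lm(sleep ~ grump, data = toy)
coef(fit)   # intercept 14, slope -0.1 for this exactly linear toy data
```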
In [12]:
# However, the cor() function is a bit more powerful than this simple example suggests. For example, you can also calculate a complete “correlation matrix” between all pairs of variables in the data frame.
# correlate all pairs of variables in "parenthood":
cor( x = parenthood )
correlate(parenthood)
In [4]:
require(graphics)
y <- rt(200, df = 5)
qqnorm(y); qqline(y, col = 2)
In [11]:
qqplot(y, rt(300, df = 5))
In [36]:
library(knitr)
knitr::kable(
  rbind(
    c("-1.0 to -0.9", "Very strong", "Negative"),
    c("-0.9 to -0.7", "Strong", "Negative"),
    c("-0.7 to -0.4", "Moderate", "Negative"),
    c("-0.4 to -0.2", "Weak", "Negative"),
    c("-0.2 to 0", "Negligible", "Negative"),
    c("0 to 0.2", "Negligible", "Positive"),
    c("0.2 to 0.4", "Weak", "Positive"),
    c("0.4 to 0.7", "Moderate", "Positive"),
    c("0.7 to 0.9", "Strong", "Positive"),
    c("0.9 to 1.0", "Very strong", "Positive")),
  col.names = c("Correlation", "Strength", "Direction"),
  booktabs = TRUE)
In [11]:
# You’d think that these four data sets would look pretty similar to one another. They do not.
cor( anscombe$x1, anscombe$y1 )
cor( anscombe$x2, anscombe$y2 )
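Since anscombe ships with base R, you can check all four pairs in one pass rather than writing four separate cor() calls:

```r
# all four Anscombe data sets give (nearly) the same correlation,
# even though their scatterplots look completely different
r <- sapply(1:4, function(i)
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]]))
round(r, 3)   # all four are about 0.816
```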
In [16]:
load( "effort.Rdata" )
who(TRUE)
# The raw data look like this:
effort
In [22]:
#We can get R to construct these rankings using the rank() function, like this:
hours.rank <- rank( effort$hours ) # rank students by hours worked
grade.rank <- rank( effort$grade ) # rank students by grade received
hours.rank
grade.rank
In [23]:
# It’s much easier to just specify the method argument of the cor() function.
cor( effort$hours, effort$grade, method = "spearman")
# The default value of the method argument is "pearson", which is why we didn’t have to specify it earlier on when we were doing Pearson correlations.
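Under the hood, Spearman’s correlation is just the Pearson correlation computed on the ranks. A quick check with made-up hours/grade values (illustrative only, not the effort.Rdata data):

```r
# Spearman = Pearson on the ranks; toy data, not effort.Rdata
hours <- c(2, 76, 40, 6, 16, 28, 27, 59, 46, 68)
grade <- c(13, 91, 79, 14, 21, 74, 47, 85, 84, 88)
r.ranks    <- cor(rank(hours), rank(grade))           # Pearson on the ranks
r.spearman <- cor(hours, grade, method = "spearman")  # built-in Spearman
c(r.ranks, r.spearman)   # identical
```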
In [24]:
load("work.Rdata")
who(TRUE)
In [30]:
correlate(work)
In [3]:
# let’s load the aflsmall.Rdata file, and use the who() function in the lsr package to see what variables are stored in the file:
setwd("./data")
load( "aflsmall.Rdata" )
library(lsr)
who( expand = TRUE )
In [5]:
print(afl.margins)
In [6]:
hist(afl.margins)
In [14]:
# The median is just the 50th percentile, so we can use the quantile() function to calculate the median AFL winning margin:
quantile( x = afl.margins, probs = .5)
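quantile() accepts a vector of probabilities, so the 25th and 75th percentiles (and hence the interquartile range) come from a single call; IQR() is the shortcut. Sketched on a small made-up vector, since afl.margins itself lives in aflsmall.Rdata:

```r
# the interquartile range is the 75th percentile minus the 25th
x <- c(56, 31, 56, 8, 32, 14, 36, 103, 44, 72)
quantile(x, probs = c(.25, .5, .75))
IQR(x)   # same as diff(quantile(x, c(.25, .75)))
```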
In [26]:
X <- c(56, 31,56,8,32) # enter the data
X.bar <- mean( X ) # step 1. the mean of the data
AD <- abs( X - X.bar ) # step 2. the absolute deviations from the mean
AAD <- mean( AD ) # step 3. the mean of the absolute deviations
print( AAD ) # print the results
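The three steps above collapse into a single expression, which is handy to remember (the lsr package also provides aad() for the same quantity):

```r
# mean absolute deviation in one line
X <- c(56, 31, 56, 8, 32)
mean(abs(X - mean(X)))   # 15.52, same as the step-by-step AAD above
```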
In [3]:
setwd("./data")
partial <- c(10, 20, NA, 30)
mean( x = partial )
mean( x = partial, na.rm = TRUE )
# Notice that the mean is 20 (i.e., 60 / 3) and not 15. When R ignores a NA value, it genuinely ignores it. In effect, the calculation above is identical to what you’d get if you asked for the mean of the three-element vector c(10, 20, 30).
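Setting na.rm = TRUE is equivalent to stripping the NAs out yourself before averaging:

```r
# na.rm = TRUE behaves like dropping the missing values first
partial <- c(10, 20, NA, 30)
mean(partial[!is.na(partial)])   # 20
mean(partial, na.rm = TRUE)      # 20, identical
```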
In [6]:
qqnorm(precip, ylab = "Precipitation [in/yr] for 70 US cities")
In [9]:
## "QQ-Chisquare" : --------------------------
y <- rchisq(500, df = 3)
## Q-Q plot for Chi^2 data against true theoretical distribution:
qqplot(qchisq(ppoints(500), df = 3), y,
       main = expression("Q-Q plot for" ~~ {chi^2}[nu == 3]))
qqline(y, distribution = function(p) qchisq(p, df = 3),
       prob = c(0.1, 0.6), col = 2)
mtext("qqline(*, dist = qchisq(., df=3), prob = c(0.1, 0.6))")
In [2]:
# show plot using runquantile
library(caTools)
k <- 31; n <- 200
x <- rnorm(n, sd = 30) + abs(seq(n) - n/4)
y <- runquantile(x, k, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))
col <- c("black", "red", "green", "blue", "magenta", "cyan")
plot(x, col=col[1], main = "Moving Window Quantiles")
lines(y[,1], col=col[2])
lines(y[,2], col=col[3])
lines(y[,3], col=col[4])
lines(y[,4], col=col[5])
lines(y[,5], col=col[6])
lab = c("data", "runquantile(.05)", "runquantile(.25)", "runquantile(0.5)",
"runquantile(.75)", "runquantile(.95)")
legend(0,230, lab, col=col, lty=1 )
In [29]:
setwd("./data")
load( "aflsmall.Rdata" )
summary( object = afl.margins )
In [34]:
# There’s a single data frame called clin.trial which contains three variables, drug, therapy and mood.gain. Presumably then, this data is from a clinical trial of some kind, in which people were administered different drugs; and the researchers looked to see what the drugs did to their mood. Let’s see if the summary() function sheds a little more light on this situation:
clin.trial
summary( clin.trial )
# Evidently there were three drugs: a placebo, something called “anxifree” and something called “joyzepam”; and there were 6 people administered each drug. There were 9 people treated using cognitive behavioural therapy (CBT) and 9 people who received no psychological treatment.
# And we can see from looking at the summary of the mood.gain variable that most people did show a mood gain (mean =.88), though without knowing what the scale is here it’s hard to say much more than that. Still, that’s not too bad. Overall, I feel that I learned something from that.
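summary() gets those level counts from the factors; table() produces them directly. A sketch with a toy factor standing in for clin.trial (which is loaded from its own .Rdata file):

```r
# table() counts observations per factor level, matching summary()'s
# output for a factor column; toy data, not the real clin.trial
drug <- factor(rep(c("placebo", "anxifree", "joyzepam"), each = 6))
table(drug)   # 6 people per drug
```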
In [26]:
# let’s use the describe() function to have a look at the clin.trial data frame. Here’s what we get:
library(psych)
describe( x = clin.trial )
# As you can see, the output for the asterisked variables is pretty meaningless, and should be ignored. However, for the mood.gain variable, there’s a lot of useful information.
In [31]:
# For instance, let’s say, I want to look at the descriptive statistics for the clin.trial data, broken down separately by therapy type. The command I would use here is:
describeBy( x=clin.trial, group=clin.trial$therapy )
In [32]:
# To give a sense of how powerful by() is, you can reproduce the describeBy() function by using a command like this:
by( data=clin.trial, INDICES=clin.trial$therapy, FUN=describe )
In [33]:
# This will produce the exact same output as the command shown earlier. However, there’s nothing special about the describe() function. You could just as easily use the by() function in conjunction with the summary() function. For example:
by( data=clin.trial, INDICES=clin.trial$therapy, FUN=summary )
In [35]:
# to obtain group means, use this command:
aggregate( formula = mood.gain ~ drug + therapy,  # mood.gain by drug/therapy combination
           data = clin.trial,                     # data is in the clin.trial data frame
           FUN = mean                             # print out group means
)
In [36]:
# or, alternatively, if you want to calculate the standard deviations for each group, you would use the following command (argument names omitted this time):
aggregate( mood.gain ~ drug + therapy, clin.trial, sd )
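Any function that maps a vector to a single number works in the FUN slot; length() gives the group sizes, for instance. Sketched with a toy data frame in place of clin.trial:

```r
# aggregate() with length() counts observations in each drug/therapy cell;
# toy data frame, not the real clin.trial
toy <- data.frame(
  drug      = rep(c("placebo", "anxifree"), each = 4),
  therapy   = rep(c("none", "CBT"), times = 4),
  mood.gain = c(0.5, 0.3, 0.1, 0.6, 0.4, 0.9, 0.6, 1.1)
)
aggregate(mood.gain ~ drug + therapy, toy, length)   # 2 observations per cell
```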