Load packages

library(tidyverse)
library(tidymodels)
library(patchwork)

Read in data

evals <- read.csv("data/evals.csv", row.names = 1)

Exercise 1

  1. Visualize the distribution of score in the dataframe evals.
ggplot(data = evals, mapping = aes(x = score)) +
  geom_histogram(binwidth = 0.2) + 
  labs(
    x = "Evaluation score", 
    y = "Frequency", 
    title = "Professor evaluation scores"
    )

The distribution is negatively (left) skewed: most professors receive scores near the upper end of the scale, with a tail of lower scores to the left. There appears to be a sharp peak around 4.4, but this may be an artefact of the data (evaluation scores, as we shall soon see, appear to be recorded in discrete steps of 0.1).
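
One way to check this claim (a quick sketch, not part of the original exercise) is to list the distinct score values:

# distinct evaluation scores, sorted; consecutive values differ by multiples of 0.1
evals %>%
  distinct(score) %>%
  arrange(score) %>%
  pull(score)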

  1. Visualize and describe the relationship between score and bty_avg using geom_point() to represent the data.
plot_geom_point  <- ggplot(data = evals, mapping = aes(x = bty_avg, y = score)) + 
  geom_point() + 
  labs(
    x = "Beauty score", 
    y = "Evaluation score", 
    title = "Course evaluation by beauty scores"
  )

plot_geom_jitter <- ggplot(data = evals, mapping = aes(x = bty_avg, y = score)) + 
  geom_jitter() + 
  labs(
    x = "Beauty score", 
    y = "Evaluation score", 
    title = "Course evaluation by beauty scores"
  )

# Note: this uses the patchwork package loaded above
# learn more about patchwork at https://patchwork.data-imaginist.com/
# it might be useful for your presentations!
plot_geom_point + plot_geom_jitter

Jittering randomly shifts the data points slightly so that they are not overplotted. In the initial scatterplot, observations (professors) with the same score and bty_avg were overplotted, making it impossible to see where the density of points is higher or lower.
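
To quantify the overplotting, one can count how many (bty_avg, score) combinations are shared by more than one professor (a quick sketch):

# combinations of bty_avg and score that occur more than once
evals %>%
  count(bty_avg, score) %>%
  filter(n > 1)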

Exercise 2: Simple linear regression with a numerical predictor

  1. Fit a linear model called score_bty_fit to predict average professor evaluation score from average beauty rating (bty_avg).
score_bty_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(score ~ bty_avg, data = evals)
tidy(score_bty_fit)
## # A tibble: 2 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)   3.88      0.0761     51.0  1.56e-191
## 2 bty_avg       0.0666    0.0163      4.09 5.08e-  5

The linear model is:

\[ \text{score-hat} = 3.88 + 0.0666 \times \text{bty_avg} \]
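
The fitted model can also be used for prediction; for example (a minimal sketch, with bty_avg = 5 chosen arbitrarily):

# predicted evaluation score for a professor with an average beauty rating of 5
predict(score_bty_fit, new_data = tibble(bty_avg = 5))
# corresponds to 3.88 + 0.0666 * 5, about 4.21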

  1. Plot the data again using geom_jitter(), and add the regression line.
ggplot(data = evals, mapping = aes(x = bty_avg, y = score)) + 
  geom_jitter() + 
  geom_smooth(method = "lm", se = FALSE, formula = "y ~ x") +
  labs(
    x = "Beauty score", 
    y = "Evaluation score", 
    title = "Evaluation vs. beauty scores"
    )

  1. Interpret the slope of the linear model in context of the data.

For each unit increase in the average beauty score, we expect the evaluation scores to be higher, on average, by 0.0666 points.

  1. Interpret the intercept of the linear model in context of the data. Comment on whether or not the intercept makes sense in this context.

Professors with an average beauty score of 0 are predicted to have an evaluation score of 3.88. The intercept doesn’t make sense in this context, as it isn’t possible for a professor to have an average beauty score of 0 (the lowest possible score a student can assign a professor is 1).
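
If an interpretable intercept is wanted, one option (not required by the exercise; the object names below are illustrative) is to mean-center the predictor, so that the intercept becomes the predicted score at the average beauty rating:

# refit with a mean-centered beauty score; the slope is unchanged,
# but the intercept is now the predicted score at the mean of bty_avg
evals_centered <- evals %>%
  mutate(bty_centered = bty_avg - mean(bty_avg))

score_bty_centered_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(score ~ bty_centered, data = evals_centered)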

  1. Determine the \(R^2\) of the model and interpret it in the context of the data.
glance(score_bty_fit)$r.squared
## [1] 0.03502226

The model has an R-squared value of 3.5%. This means that average beauty scores explain 3.5% of the variability in evaluation scores.

  1. Make a plot of residuals vs. predicted values for the model above.
score_bty_aug <- augment(score_bty_fit$fit)

ggplot(score_bty_aug, aes(x = .fitted, y = .resid)) +
  geom_jitter() +
  geom_hline(yintercept = 0, linetype = "dashed") + 
  labs( 
    x = "Predicted values",
    y = "Residuals"
  )

The model is probably reasonable, but could be better. There’s a slight “fan shape” in the residuals, or “heteroscedasticity”: the variation of the residuals differs across predicted values, and it appears to be larger on the left. There are also more large negative residuals than large positive ones, probably because values of the response variable were close to the rigid maximum limit of the scale.
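
A histogram of the residuals (an optional extra check; the binwidth of 0.25 is an arbitrary choice) shows this asymmetry directly:

# distribution of residuals; the longer left tail reflects the large negative residuals
ggplot(score_bty_aug, aes(x = .resid)) +
  geom_histogram(binwidth = 0.25) +
  labs(x = "Residuals", y = "Frequency")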

Exercise 3: Simple linear regression with a categorical predictor

  1. Look at the variable rank, and determine the frequency of each category level.
evals %>% count(rank)
##           rank   n
## 1     teaching 102
## 2 tenure track 108
## 3      tenured 253
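
The same frequencies can also be shown as a bar chart (an optional sketch):

# frequency of each rank, as a bar chart
ggplot(evals, aes(x = rank)) +
  geom_bar() +
  labs(x = "Rank", y = "Frequency")
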
  1. Fit a new linear model called score_rank_fit to predict average professor evaluation score based on rank of the professor.
score_rank_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(score ~ rank, data = evals)

score_rank_fit %>% tidy()
## # A tibble: 3 × 5
##   term             estimate std.error statistic   p.value
##   <chr>               <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)         4.28     0.0537     79.9  1.02e-271
## 2 ranktenure track   -0.130    0.0748     -1.73 8.37e-  2
## 3 ranktenured        -0.145    0.0636     -2.28 2.28e-  2
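
The baseline level here is teaching (the first level alphabetically). If a different baseline is preferred, the factor can be releveled before fitting (a sketch using base R’s relevel(); the object names are illustrative):

# refit with "tenured" as the baseline rank level
evals_releveled <- evals %>%
  mutate(rank = relevel(factor(rank), ref = "tenured"))

score_rank_releveled_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(score ~ rank, data = evals_releveled)
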
  1. Fit a new linear model called score_gender_fit to predict average professor evaluation score based on gender of the professor.
# fit model
score_gender_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(score ~ gender, data = evals)
# tidy model output
tidy(score_gender_fit)
## # A tibble: 2 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)    4.09     0.0387    106.   0      
## 2 gendermale     0.142    0.0508      2.78 0.00558
score_gender_intercept <- tidy(score_gender_fit) %>% 
  filter(term == "(Intercept)") %>%
  select(estimate) %>%
  pull()
score_gender_slope <- tidy(score_gender_fit) %>% 
  filter(term == "gendermale") %>%
  select(estimate) %>%
  pull()
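
These stored values are presumably intended for inline reporting; for example, the fitted equation can be assembled programmatically (a sketch):

# build the fitted equation string from the stored coefficients
paste0("score-hat = ", round(score_gender_intercept, 3),
       " + ", round(score_gender_slope, 3), " * gendermale")

Based on the model output, female professors (the baseline level of gender) are predicted to score 4.09 on average, with male professors scoring 0.142 points higher.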

Exercise 4: Multiple linear regression

  1. Fit a multiple linear regression model, predicting average professor evaluation score based on average beauty rating (bty_avg) and gender.
# fit model
score_bty_gender_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(score ~ bty_avg + gender, data = evals)
# tidy model output
tidy(score_bty_gender_fit)
## # A tibble: 3 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)   3.75      0.0847     44.3  6.23e-168
## 2 bty_avg       0.0742    0.0163      4.56 6.48e-  6
## 3 gendermale    0.172     0.0502      3.43 6.52e-  4
ggplot(data = evals, mapping = aes(x = bty_avg, y = score, color = gender)) + 
  geom_jitter() + 
  labs(
    x = "Beauty score", 
    y = "Evaluation score", 
    color = "Gender",
    title = "Course evaluation by beauty scores and gender"
  )
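
Because this is an additive model, the fitted lines for the two genders are parallel; overlaying the fitted values makes that visible (a sketch using augment(); score_bty_gender_aug is an illustrative name):

# overlay the fitted parallel-slopes lines on the jittered data
score_bty_gender_aug <- augment(score_bty_gender_fit$fit)

ggplot(score_bty_gender_aug, aes(x = bty_avg, y = score, color = gender)) +
  geom_jitter(alpha = 0.5) +
  geom_line(aes(y = .fitted)) +
  labs(
    x = "Beauty score",
    y = "Evaluation score",
    color = "Gender"
  )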

  1. What percent of the variability in score is explained by the model score_bty_gender_fit?
# glance model output
glance(score_bty_gender_fit)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic     p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>       <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1    0.0591        0.0550 0.529      14.5 0.000000818     2  -360.  729.  745.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# extract the R-squared value as a percentage
score_bty_gender_r <- glance(score_bty_gender_fit) %>%  # get model summaries
  pull("r.squared") %>%                        # pull out the r squared value
  signif(., 2)*100                             # round to 2 significant figures and convert to a percentage

The model has an R-squared value of 5.9%. This means that a model with both average beauty scores and gender can explain 5.9% of the variability in evaluation scores.

  1. What is the equation of the line corresponding to just male professors?

\[\text{score-hat} = 3.75 + 0.172 + 0.0742 \times \text{bty_avg} = 3.92 + 0.0742 \times \text{bty_avg}\]
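
As a check (a sketch, with bty_avg = 5 chosen arbitrarily), predict() should agree with this equation:

# predicted score for a male professor with an average beauty rating of 5
predict(score_bty_gender_fit, new_data = tibble(bty_avg = 5, gender = "male"))
# corresponds to 3.92 + 0.0742 * 5, about 4.29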

  1. For two professors who received the same beauty rating, which gender tends to have the higher course evaluation score?

Male professors tend to have higher course evaluation scores than female professors, assuming they have the same beauty score.

  1. How does the relationship between beauty and evaluation score vary between male and female professors?

In this model, it doesn’t: we haven’t fitted an interaction model, so the fitted increase in evaluation score for each unit increase in beauty score is the same for male and female professors (see the sketch below).
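
For contrast, an interaction model would allow the slope to differ by gender (a sketch, not fitted in this lab; the object name is illustrative):

# interaction model: separate bty_avg slopes for male and female professors
score_bty_gender_int_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(score ~ bty_avg * gender, data = evals)

tidy(score_bty_gender_int_fit)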

  1. How do the adjusted \(R^2\) values of score_bty_fit and score_bty_gender_fit compare?
glance(score_bty_fit)$adj.r.squared
## [1] 0.03292903
glance(score_bty_gender_fit)$adj.r.squared
## [1] 0.05503202

The adjusted R-squared is higher for the fit when gender is included, suggesting gender is useful for explaining the variability in evaluation scores when we already have information on the beauty scores.

  1. Compare the slopes of bty_avg under the two models (score_bty_fit and score_bty_gender_fit).
score_bty_fit %>%
  tidy() %>%
  filter(term == "bty_avg") %>%
  pull("estimate")
## [1] 0.06663704
score_bty_gender_fit %>%
  tidy() %>%
  filter(term == "bty_avg") %>%
  pull("estimate")
## [1] 0.07415537

The addition of gender has changed the slope estimate: it has increased it from around 0.067 to around 0.074.

Exercise 5: Interpretation of log-transformed response variables

  1. Using the two properties listed above, and starting from the hint provided below, prove why we can rewrite the interpretation of the slope using the “multiplicative factor of \(exp(b_1)\)” phrasing.

\[ log(\text{price for width x+1}) - log(\text{price for width x}) = 0.0192 \]

Using the “Subtraction of logs” property we get:

\[ log\left(\frac{\text{price for width x+1}}{\text{price for width x}}\right) = 0.0192 \]

We exponentiate both sides:

\[ e^{log\left(\frac{\text{price for width x+1}}{\text{price for width x}}\right)} = e^{0.0192} \]

Using the definition of the natural logarithm we get:

\[ \frac{\text{price for width x+1}}{\text{price for width x}} = e^{0.0192} = exp(b_1) \approx 1.02 \]

  1. Using the fact that \(exp(x)\approx 1+x\), and the definition of percentage change (\((b-a)/a = \%\text{-change}\)), prove that we can rewrite the interpretation of the slope in terms of a percentage change.

The definition of percentage change is given by:

\[ \frac{(\text{price for width x+1}) - (\text{price for width x})}{\text{price for width x}} \]

which can be rewritten as:

\[ \frac{\text{price for width x+1}}{\text{price for width x}} - 1 \]

Using the result from part 1 and the approximation \(exp(b_1) \approx 1 + b_1\), we have:

\[ \frac{(\text{price for width x+1}) - (\text{price for width x})}{\text{price for width x}} = \frac{\text{price for width x+1}}{\text{price for width x}} - 1 = exp(b_1) - 1 \approx 1 + b_1 - 1 = b_1 \]
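
Both results are easy to verify numerically (a quick sketch, using the slope b1 = 0.0192 from the hint):

# numeric check of the two interpretations of the slope
b1 <- 0.0192
exp(b1)      # multiplicative factor, about 1.02
exp(b1) - 1  # percentage change, about 0.019, close to b1 itself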