library(tidyverse)
library(tidymodels)
library(patchwork)
evals <- read.csv("data/evals.csv", row.names = 1)
Visualize the distribution of score in the dataframe evals.
ggplot(data = evals, mapping = aes(x = score)) +
geom_histogram(binwidth = 0.2) +
labs(
x = "Evaluation score",
y = "Frequency",
title = "Professor evaluation scores"
)
The distribution is negatively (left) skewed: most professors receive scores near the upper end of the scale, with a tail of lower scores stretching to the left. There appears to be a sharp peak around 4.4, but this may be an artefact of how the data are recorded (evaluation scores, as we shall soon see, appear to be recorded in discrete steps of 0.1).
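As a quick check of the claim about discrete steps (an illustrative aside, not part of the original exercise), we can list the distinct values of score; if scores really are recorded in steps of 0.1, neighbouring values in this sorted list should differ by multiples of 0.1.
# illustrative check: list the distinct evaluation scores in order
evals %>%
  distinct(score) %>%
  arrange(score) %>%
  pull(score)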
Visualize the relationship between score and bty_avg, using geom_point() to represent the data.
plot_geom_point <- ggplot(data = evals, mapping = aes(x = bty_avg, y = score)) +
geom_point() +
labs(
x = "Beauty score",
y = "Evaluation score",
title = "Course evaluation by beauty scores"
)
plot_geom_jitter <- ggplot(data = evals, mapping = aes(x = bty_avg, y = score)) +
geom_jitter() +
labs(
x = "Beauty score",
y = "Evaluation score",
title = "Course evaluation by beauty scores"
)
# Note: this uses the patchwork package loaded above
# learn more about patchwork at https://patchwork.data-imaginist.com/
# it might be useful for your presentations!
plot_geom_point + plot_geom_jitter
Jittering randomly shifts the data points slightly so that they are not overplotted. In the initial scatterplot, observations (professors) with the same score and bty_avg were drawn on top of one another, making it impossible to see where the density of points is higher or lower.
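To get a sense of how much overplotting there is (an illustrative check, not part of the exercise), we can count how many combinations of bty_avg and score occur more than once.
# illustrative check: how many (bty_avg, score) pairs are duplicated?
evals %>%
  count(bty_avg, score) %>%
  filter(n > 1) %>%
  arrange(desc(n))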
Fit a linear model called score_bty_fit to predict average professor evaluation score from average beauty rating (bty_avg).
score_bty_fit <- linear_reg() %>%
set_engine("lm") %>%
fit(score ~ bty_avg, data = evals)
tidy(score_bty_fit)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.88 0.0761 51.0 1.56e-191
## 2 bty_avg 0.0666 0.0163 4.09 5.08e- 5
The fitted linear model is \[\widehat{\text{score}} = 3.88 + 0.0666 \times \text{bty\_avg}\]
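As a worked example (the beauty rating of 5 used here is hypothetical, chosen only for illustration), a professor with an average beauty rating of 5 would have a predicted evaluation score of \[\widehat{\text{score}} = 3.88 + 0.0666 \times 5 \approx 4.21\]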
Recreate the scatterplot using geom_jitter(), and add the regression line.
ggplot(data = evals, mapping = aes(x = bty_avg, y = score)) +
geom_jitter() +
geom_smooth(method = "lm", se = FALSE, formula = "y ~ x") +
labs(
x = "Beauty score",
y = "Evaluation score",
title = "Evaluation vs. beauty scores"
)
For each unit increase in the average beauty score, we expect the evaluation scores to be higher, on average, by 0.0666 points.
Professors with an average beauty score of 0 are predicted to have an evaluation score of 3.88. The intercept is not meaningful in this context, since it is not possible for a professor to have an average beauty score of 0 (the lowest score a student can assign is 1).
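To see the slope interpretation in action (the beauty values 4 and 5 below are hypothetical, chosen only for illustration), we can predict scores for two professors whose average beauty ratings are one unit apart; the predictions should differ by about 0.0666.
# illustrative predictions one bty_avg unit apart
predict(score_bty_fit, new_data = tibble(bty_avg = c(4, 5)))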
glance(score_bty_fit)$r.squared
## [1] 0.03502226
The model has an R-squared value of 3.5%. This means that average beauty scores explain 3.5% of the variability in evaluation scores.
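As an illustrative cross-check (not part of the exercise), the same value can be computed by hand from the residuals, since R-squared is one minus the ratio of the residual sum of squares to the total sum of squares.
# illustrative check: compute R-squared from the residuals by hand
augment(score_bty_fit$fit) %>%
  summarise(r_squared = 1 - sum(.resid^2) / sum((score - mean(score))^2))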
score_bty_aug <- augment(score_bty_fit$fit)
ggplot(score_bty_aug, aes(x = .fitted, y = .resid)) +
geom_jitter() +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(
x = "Predicted values",
y = "Residuals"
)
The model is probably reasonable, but could be better. There is a slight “fan shape” in the residuals, i.e. heteroscedasticity: the variation of the residuals differs across predicted values, and it appears to be larger on the left. There are also more large negative residuals than large positive ones, which is probably because values of the response variable were close to the rigid maximum of the scale.
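One rough way to quantify the fan shape (an illustrative check; splitting at the median of the fitted values is an arbitrary choice) is to compare the spread of the residuals in the lower and upper halves of the predicted values.
# illustrative check: residual spread in the lower vs. upper half of fitted values
score_bty_aug %>%
  mutate(half = if_else(.fitted < median(.fitted), "lower", "upper")) %>%
  group_by(half) %>%
  summarise(sd_resid = sd(.resid))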
evals %>% count(rank)
## rank n
## 1 teaching 102
## 2 tenure track 108
## 3 tenured 253
Fit a new linear model called score_rank_fit to predict average professor evaluation score based on rank of the professor.
score_rank_fit <- linear_reg() %>%
set_engine("lm") %>%
fit(score ~ rank, data = evals)
score_rank_fit %>% tidy()
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 4.28 0.0537 79.9 1.02e-271
## 2 ranktenure track -0.130 0.0748 -1.73 8.37e- 2
## 3 ranktenured -0.145 0.0636 -2.28 2.28e- 2
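Because rank is the only predictor here, the fitted values are just the group means, which gives a way to sanity-check the dummy coding (an illustrative aside, not part of the exercise): the intercept should match the mean score for the baseline level (teaching), and the other two estimates should be the differences between that baseline and the tenure track and tenured means.
# illustrative check: mean evaluation score by rank
evals %>%
  group_by(rank) %>%
  summarise(mean_score = mean(score))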
Fit a new linear model called score_gender_fit to predict average professor evaluation score based on gender of the professor.
# fit model
score_gender_fit <- linear_reg() %>%
set_engine("lm") %>%
fit(score ~ gender, data = evals)
# tidy model output
tidy(score_gender_fit)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 4.09 0.0387 106. 0
## 2 gendermale 0.142 0.0508 2.78 0.00558
score_gender_intercept <- tidy(score_gender_fit) %>%
filter(term == "(Intercept)") %>%
select(estimate) %>%
pull()
score_gender_slope <- tidy(score_gender_fit) %>%
filter(term == "gendermale") %>%
select(estimate) %>%
pull()
Fit a multiple linear regression model predicting average professor evaluation score based on average beauty rating (bty_avg) and gender.
# fit model
score_bty_gender_fit <- linear_reg() %>%
set_engine("lm") %>%
fit(score ~ bty_avg + gender, data = evals)
# tidy model output
tidy(score_bty_gender_fit)
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.75 0.0847 44.3 6.23e-168
## 2 bty_avg 0.0742 0.0163 4.56 6.48e- 6
## 3 gendermale 0.172 0.0502 3.43 6.52e- 4
ggplot(data = evals, mapping = aes(x = bty_avg, y = score, color = gender)) +
geom_jitter() +
labs(
x = "Beauty score",
y = "Evaluation score",
title = "Course evaluation by beauty scores"
)
What percent of the variability in score is explained by the model score_bty_gender_fit?
# glance model output
glance(score_bty_gender_fit)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0591 0.0550 0.529 14.5 0.000000818 2 -360. 729. 745.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# extract the R-squared value as a percentage
score_bty_gender_r <- glance(score_bty_gender_fit) %>% # get model summaries
pull("r.squared") %>% # pull out the r squared value
signif(., 2) * 100 # round to 2 significant figures and multiply by 100 to get a percentage
The model has an R-squared value of 5.9%. This means that a model with both average beauty scores and gender can explain 5.9% of the variability in evaluation scores.
For male professors, the fitted model simplifies to \[\widehat{\text{score}} = 3.75 + 0.172 + 0.0742 \times \text{bty\_avg} = 3.92 + 0.0742 \times \text{bty\_avg}\]
Male professors tend to receive higher course evaluations than female professors with the same beauty score, by about 0.172 points on average.
In this model it does not, because we have not fitted an interaction effects model: the model assumes the same increase in evaluation score per unit increase in beauty score for both male and female professors.
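For contrast, here is a sketch of how an interaction model could be fitted (illustrative only, and the name score_bty_gender_int_fit is made up for this example); the bty_avg:gendermale term would allow the slope of bty_avg to differ between male and female professors.
# illustrative sketch: allow the bty_avg slope to differ by gender
score_bty_gender_int_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(score ~ bty_avg * gender, data = evals)
tidy(score_bty_gender_int_fit)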
How do the adjusted R-squared values of score_bty_fit and score_bty_gender_fit compare?
glance(score_bty_fit)$adj.r.squared
## [1] 0.03292903
glance(score_bty_gender_fit)$adj.r.squared
## [1] 0.05503202
The adjusted R-squared is higher for the fit when gender is included, suggesting gender is useful for explaining the variability in evaluation scores when we already have information on the beauty scores.
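For reference (an illustrative aside, not part of the exercise), the adjusted R-squared penalises the ordinary R-squared for the number of predictors, \(R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p-1}\), and can be checked by hand.
# illustrative check: adjusted R-squared for score_bty_gender_fit by hand
n <- nrow(evals) # number of observations
p <- 2 # number of predictor terms (bty_avg and the gender indicator)
r2 <- glance(score_bty_gender_fit)$r.squared
1 - (1 - r2) * (n - 1) / (n - p - 1) # should match glance(score_bty_gender_fit)$adj.r.squared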
Compare the slopes of bty_avg under the two models (score_bty_fit and score_bty_gender_fit).
score_bty_fit %>%
tidy() %>%
filter(term == "bty_avg") %>%
pull("estimate")
## [1] 0.06663704
score_bty_gender_fit %>%
tidy() %>%
filter(term == "bty_avg") %>%
pull("estimate")
## [1] 0.07415537
The addition of gender has changed the slope estimate: it has increased it from around 0.067 to around 0.074.
\[ \log(\text{price for width x+1}) - \log(\text{price for width x}) = 0.0192 \]
Using the “Subtraction of logs” property we get:
\[ \log\left(\frac{\text{price for width x+1}}{\text{price for width x}}\right) = 0.0192 \] We exponentiate both sides:
\[ e^{\log\left(\frac{\text{price for width x+1}}{\text{price for width x}}\right)} = e^{0.0192} \]
Using the definition of the natural logarithm we get:
\[ \frac{\text{price for width x+1}}{\text{price for width x}} = \exp(b_1) \approx 1.02 \]
The definition of percentage change is given by: \[ \frac{(\text{price for width x+1}) - (\text{price for width x})}{\text{price for width x}} \]
which can be rewritten as \[ \frac{\text{price for width x+1}}{\text{price for width x}} - 1 \]
Then we can use the result above, together with the approximation \(\exp(b_1) \approx 1 + b_1\), to get:
\[ \frac{(\text{price for width x+1}) - (\text{price for width x})}{\text{price for width x}} = \frac{\text{price for width x+1}}{\text{price for width x}} - 1 = \exp(b_1) - 1 \approx 1 + b_1 - 1 = b_1 \]
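As a quick numeric check of this approximation (illustrative only; 0.0192 is the slope quoted above), exponentiating the slope and subtracting one gives a value very close to the slope itself.
# illustrative check: exp(b1) - 1 is close to b1 for small slopes
b1 <- 0.0192
exp(b1) - 1 # approximately 0.0194, i.e. about a 1.9% increase in price per extra unit of width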