This lab is all about linear regression, so make sure you are up to date with the week 8 material!
The objective of this workshop is to practise fitting and interpreting linear regression models in R.
In this workshop you will be working individually. However, while working on your version control project, we suggest discussing with your group members to hear different opinions and angles. Start by doing the usual set-up steps:
Open lab-07.Rmd and change the author at the top of the file to your name.

During the workshop we'll use the following packages:

- tidyverse: for data wrangling and visualisation
- tidymodels: for modelling
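A minimal set-up chunk to load them (assuming both packages are already installed) might look like this:

```r
library(tidyverse)   # data wrangling and visualisation
library(tidymodels)  # modelling
```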
Many college courses conclude by giving students the opportunity to evaluate the course and the instructor anonymously. However, the use of these student evaluations as an indicator of course quality and teaching effectiveness is often criticized because these measures may reflect the influence of non-teaching related characteristics, such as the physical appearance of the instructor.
The article titled, “Beauty in the classroom: instructors’ pulchritude and putative pedagogical productivity” (Hamermesh and Parker, 2005) found that instructors who are viewed to be better looking receive higher instructional ratings.
For this assignment you will analyze the data from this study in order to learn what goes into a positive professor evaluation.
The data were gathered from end of semester student evaluations for a large sample of professors from the University of Texas at Austin. In addition, six students rated the professors' physical appearance. This is a slightly modified version of the original data set that was released as part of the replication data for "Data Analysis Using Regression and Multilevel/Hierarchical Models" (Gelman and Hill, 2007).
The result is a data frame where each row contains a different course and columns represent variables about the courses and professors.
To familiarise yourself with the data set, we suggest reading the data dictionary and reference. You should read the data into RStudio using the following code:
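The original code chunk isn't reproduced here; a typical version, assuming the data file is called evals.csv and lives in a data/ folder of your lab repo, would be:

```r
# Read the course evaluations data (file name and path are assumptions)
evals <- read_csv("data/evals.csv")
```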
Visualise the distribution of score in the dataframe evals.

Visualise the relationship between score and bty_avg using geom_point() to represent the data. Then, visualise again using geom_jitter() for the points. What does "jitter" mean? What was misleading about the initial scatter plot?
Hint: Linear models are in the form \(\hat{y} = b_0 + b_1 x\).
Fit a linear model called score_bty_fit to predict average professor evaluation score from average beauty rating (bty_avg). Print the regression output using tidy(). Based on the regression output, write down the linear model.

Plot the data again using geom_jitter(), and add the regression line.

Determine the \(R^2\) of the model. Hint: Use glance() to obtain \(R^2\) and other similar statistics for your model.
Next, we'll assess the model fit using a graphical diagnostic. To do so we need to calculate the predicted evaluation scores for each professor in the dataset, as well as the residuals for each observation. We use the augment() function for this:
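The code chunk itself isn't shown above; a minimal sketch, assuming score_bty_fit was fitted with tidymodels' linear_reg() (so the underlying lm object lives in score_bty_fit$fit), is:

```r
# Augment the fitted model with per-observation quantities
# such as .fitted and .resid (the object name is an assumption)
score_bty_aug <- augment(score_bty_fit$fit)
```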
Let’s take a look at what’s in this augmented dataset:
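The peeking code isn't shown either; one way to do it (assuming the augmented object created above) is to list its column names:

```r
names(score_bty_aug)
```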
## [1] "score" "bty_avg" ".fitted" ".resid" ".hat"
## [6] ".sigma" ".cooksd" ".std.resid"
First, we have the variables used to build the model: score and bty_avg. We also have the predicted values (.fitted) and the residuals (.resid).
We’ll talk about a few of the other variables in the augmented data frame later in the course, and some others you will encounter in future courses.
Hint: You can use geom_hline() with linetype = "dashed" for this.
Make a plot of the residuals (.resid) versus the predicted values (.fitted). Use geom_jitter() instead of geom_point(), and overlay a dashed horizontal line at y = 0. Then, comment on whether the linear model is appropriate for modeling the relationship between evaluation scores and beauty scores.

Look at the variable rank, and determine the frequency of each category level.

Fit a new linear model called score_rank_fit to predict average professor evaluation score based on rank of the professor, and print out the regression output using tidy(). Based on the regression output, interpret the slope and intercept in context of the data.

Fit a new linear model called score_gender_fit to predict average professor evaluation score based on gender of the professor. Print out the regression output using tidy() and interpret the slope and intercept in context of the data. But we'll do things a bit differently this time, using inline code for the values of the intercept and the slope! Below are some tips for extracting these values from the model output to use in your inline code.
Let’s start with the intercept. You have two options for extracting the value of the intercept from the regression output. Remember, the output looks like this:
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 4.09 0.0387 106. 0
## 2 gendermale 0.142 0.0508 2.78 0.00558
So the intercept is in the estimate column, and it's the first element in there.
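The chunk for the first option isn't shown; a sketch of what it likely looks like:

```r
# Option 1: index directly into the estimate column of the tidy output
tidy(score_gender_fit)$estimate[1]
```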
## [1] 4.092821
We can also extract it using a dplyr pipeline:
# Option 2
tidy(score_gender_fit) %>%
filter(term == "(Intercept)") %>% # filter for the intercept
select(estimate) %>% # select the estimate column
pull() # pull value out of data frame
## [1] 4.092821
Regardless of which option you use, you might consider storing the value in an object that you can easily refer to in your inline code, e.g.
You can hide the code for chunks like this, where you are simply preparing objects for later use, by adding echo = FALSE in the code chunk options, that is, where you label your code chunk, separated by a comma, i.e.
{r label, echo = FALSE}
score_gender_intercept <- tidy(score_gender_fit) %>%
filter(term == "(Intercept)") %>%
select(estimate) %>%
pull()
And then, you can use score_gender_intercept in inline code, e.g.
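The inline snippet itself isn't reproduced above; in your R Markdown source it would look something like this (a sketch, assuming the object created above):

```
The intercept of the model is `r score_gender_intercept`.
```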
which will render to
The intercept of the model is 4.0928205...
There is still one small issue here though: the number of decimal places reported.
It would be better to round the value in our narrative, for which we can use the round()
function.
This function takes two arguments: the first one is the value (or vector of values) you want to round, and the second one is the number of digits.
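Again a sketch of the inline code, this time rounding to two digits:

```
The intercept of the model is `r round(score_gender_intercept, 2)`.
```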
which will render to
The intercept of the model is 4.09...
Fit a multiple linear regression model to predict average professor evaluation score based on average beauty rating (bty_avg) and gender. Name the model score_bty_gender_fit. Interpret the intercept and the slopes of bty_avg and gender. Make a scatterplot (using jitter) of score by bty_avg and color the points by gender.

What percentage of the variability in score is explained by the model score_bty_gender_fit? Once again, use inline code in your answer, similar to the example above.

How do the \(R^2\) values of score_bty_fit and score_bty_gender_fit compare? What does this tell us about how useful gender is in explaining the variability in evaluation scores when we already have information on the beauty score of the professor?

Compare the slopes of bty_avg under the two models (score_bty_fit and score_bty_gender_fit). Has the addition of gender to the model changed the parameter estimate (slope) of bty_avg?

You saw in class that the interpretation for a slope when the response has been log-transformed is in terms of a multiplicative factor, or percentage change. You are now going to see why.
For example, in the model where we were trying to predict the log-price of Paris paintings using the width of the paintings, we fitted the following model:
\[ \widehat{log(price)} = 4.67 + 0.0192 \times Width \] To interpret the slope (\(b_1 = 0.0192\)) we could say
For each additional inch the painting is wider, the log price of the painting is expected to be higher, on average, by 0.0192 log-livres.
However, this is not a very useful statement.
We will be using two properties of \(log\) to "undo" the log transformation:

Subtraction and logs: \(log(a) - log(b) = log(a / b)\)

Natural logarithm: \(e^{log(x)} = x\)

1. Use these two properties to rewrite the interpretation as

For each additional inch the painting is wider, the price of the painting is expected to be higher, on average, by a (multiplicative) factor of 1.0192.

Hint: The slope coefficient for the log-transformed model is 0.0192, meaning the log price difference between paintings whose widths are one inch apart is predicted to be 0.0192 log-livres:

\[ log(\text{price for width } (x+1)) - log(\text{price for width } x) = 0.0192 \]

2. Using the fact that \(exp(x) \approx 1+x\) for small \(x\), and the definition of percentage change (\((b-a)/a = \%\text{-change}\)), prove that we can rewrite the interpretation of the slope as

For each additional inch the painting is wider, the price of the painting is expected to be higher, on average, by 1.92%.
Once you're done with this week's lab, you can spend some time thinking about your project in light of this first modelling experience.

In your data set, which variables could be considered as the response for a linear regression type model?

Which types of predictors do you have in your data, and how would you build a reasonable multiple linear regression model?
That’s the end of this lab!
Don’t worry if you did not reach the end of the worksheet today, but please try to go through any remaining exercises in your own time.
Remember to commit and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.