class: center, middle, inverse, title-slide

.title[
# Model checking and multiple linear regression
]
.subtitle[
## Introduction to Data Science
]
.author[
### University of Edinburgh
]
.date[
### 2024/2025
]

---

layout: true

<div class="my-footer">
<span>
University of Edinburgh
</span>
</div>

---

## Topics

- Assessing model fit
- Linearity and transformations
- Multiple linear regression
- Interaction effects
- Model comparison

---

class: middle

# Model checking

---

## "Linear" models

- We're fitting a "linear" model, which assumes a linear relationship between our explanatory and response variables.
- But how do we assess this?

---

## Data: Paris Paintings

``` r
pp <- read_csv("data/paris-paintings.csv", na = c("n/a", "", "NA"))
```

- Number of observations: 3393
- Number of variables: 61

---

## Graphical diagnostic: residuals plot

.panelset[
.panel[.panel-name[Plot]
<img src="w08-L16_files/figure-html/unnamed-chunk-2-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

``` r
ht_wt_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(Height_in ~ Width_in, data = pp)

*ht_wt_fit_aug <- augment(ht_wt_fit$fit)

ggplot(ht_wt_fit_aug, mapping = aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, color = "gray", lty = "dashed") +
  labs(x = "Predicted height", y = "Residuals")
```
]
.panel[.panel-name[Augment]

``` r
ht_wt_fit_aug
```

```
## # A tibble: 3,135 × 9
##   .rownames Height_in Width_in .fitted .resid     .hat .sigma
##   <chr>         <dbl>    <dbl>   <dbl>  <dbl>    <dbl>  <dbl>
## 1 1                37     29.5    26.7  10.3  0.000399   8.30
## 2 2                18     14      14.6   3.45 0.000396   8.31
## 3 3                13     16      16.1  -3.11 0.000361   8.31
## 4 4                14     18      17.7  -3.68 0.000337   8.31
## 5 5                14     18      17.7  -3.68 0.000337   8.31
## 6 6                 7     10      11.4  -4.43 0.000498   8.31
## # ℹ 3,129 more rows
## # ℹ 2 more variables: .cooksd <dbl>, .std.resid <dbl>
```
]
]

---

## Looking for...

- Residuals distributed randomly around 0
- With no visible pattern along the x or y axes

<img src="w08-L16_files/figure-html/unnamed-chunk-4-1.png" width="60%" style="display: block; margin: auto;" />

---

## Not looking for...

.large[
**Fan shapes**
]

<img src="w08-L16_files/figure-html/unnamed-chunk-5-1.png" width="60%" style="display: block; margin: auto;" />

---

## Not looking for...

.large[
**Groups of patterns**
]

<img src="w08-L16_files/figure-html/unnamed-chunk-6-1.png" width="60%" style="display: block; margin: auto;" />

---

## Not looking for...

.large[
**Residuals correlated with predicted values**
]

<img src="w08-L16_files/figure-html/unnamed-chunk-7-1.png" width="60%" style="display: block; margin: auto;" />

---

## Not looking for...

.large[
**Any patterns!**
]

<img src="w08-L16_files/figure-html/unnamed-chunk-8-1.png" width="60%" style="display: block; margin: auto;" />

---

.question[
What patterns does the residuals plot reveal that should make us question whether a linear model is a good fit for modeling the relationship between height and width of paintings?
]

<img src="w08-L16_files/figure-html/unnamed-chunk-9-1.png" width="60%" style="display: block; margin: auto;" />

---

class: middle

# Linearity and transformations

---

## Data: Paris Paintings

<img src="w08-L16_files/figure-html/unnamed-chunk-10-1.png" width="70%" style="display: block; margin: auto;" />

---

## Price vs. width

<img src="w08-L16_files/figure-html/unnamed-chunk-11-1.png" width="70%" style="display: block; margin: auto;" />

---

## Focus on paintings with `Width_in < 100`

That is, paintings with width less than 2.54 m.

``` r
pp_wt_lt_100 <- pp %>%
  filter(Width_in < 100)
```

---

## Price vs. width

.question[
Which plot shows a more linear relationship?
]
.small[
.pull-left[
<img src="w08-L16_files/figure-html/unnamed-chunk-13-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[
<img src="w08-L16_files/figure-html/unnamed-chunk-14-1.png" width="100%" style="display: block; margin: auto;" />
]
]

---

## Price vs. width, residuals

.question[
Which plot shows residuals that are uncorrelated with the predicted values from the model? Also, what is the unit of the residuals?
]

.pull-left[
<img src="w08-L16_files/figure-html/unnamed-chunk-15-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[
<img src="w08-L16_files/figure-html/unnamed-chunk-16-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Transforming the data

- We saw that `price` has a right-skewed distribution, and that the relationship between the price and width of paintings is non-linear.

--

- In these situations a transformation applied to the response variable may be useful.

--

- In order to decide which transformation to use, we should examine the distribution of the response variable.

--

- The extremely right-skewed distribution suggests that a log transformation may be useful.
  - log = natural log, `\(ln\)`
  - The default base of the `log` function in R is the natural log: <br>
    `log(x, base = exp(1))`

---

## Logged price vs. width

.question[
How do we interpret the slope of this model?
]

<img src="w08-L16_files/figure-html/unnamed-chunk-17-1.png" width="60%" style="display: block; margin: auto;" />

---

## Models with log transformation

``` r
linear_reg() %>%
  set_engine("lm") %>%
  fit(log(price) ~ Width_in, data = pp_wt_lt_100) %>%
  tidy()
```

```
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   4.67     0.0585      79.9  0       
## 2 Width_in      0.0192   0.00226      8.48 3.36e-17
```

---

## Interpreting the slope

$$ \widehat{log(price)} = 4.67 + 0.0192 \times Width $$

- For each additional inch the painting is wider, the log price of the painting is expected to be higher, on average, by 0.0192 log livres.

--

- which is not a very useful statement... Instead:

- For each additional inch the painting is wider, the price of the painting is expected to be higher, on average, *by a (multiplicative) factor of exp(0.0192) ≈ 1.0194*.

--

In other words:

- For each additional inch the painting is wider, the price of the painting is expected to be higher, on average, *by approximately 1.92%* (1.92 = 100 × 0.0192; for small coefficients this is a close approximation to the exact 1.94%).
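---

## Checking the interpretation with code

A quick sanity check of the multiplicative interpretation (a sketch: the refit, the model name, and the example widths below are ours, for illustration only). Predicted prices at two widths one inch apart should differ by a factor of exp(0.0192) ≈ 1.0194.

``` r
# refit the model from the previous slide and keep it around
log_width_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(log(price) ~ Width_in, data = pp_wt_lt_100)

# predicted log-prices at two illustrative widths, one inch apart
new_widths <- tibble(Width_in = c(20, 21))
preds <- predict(log_width_fit, new_data = new_widths)

# ratio of the predicted prices: exp(0.0192), i.e. roughly 1.0194
exp(preds$.pred[2]) / exp(preds$.pred[1])
```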
---

## Transform, or learn more?

- Data transformations may also be useful when the relationship is non-linear
- However, in those cases a polynomial regression may be more appropriate
  + This is beyond the scope of this course, but you're welcome to try it for your final project, and I'd be happy to provide further guidance

---

class: middle

# The linear model with multiple predictors

### Two numerical predictors

---

## The data

``` r
pp <- read_csv("data/paris-paintings.csv", na = c("n/a", "", "NA")) %>%
  mutate(log_price = log(price))
```

---

## Multiple predictors

- Response variable: `log_price`
- Explanatory variables: Width and height

.pull-left[
<img src="w08-L16_files/figure-html/logprice-vs-width-1.png" width="90%" style="display: block; margin: auto;" />
]
.pull-right[
<img src="w08-L16_files/figure-html/logprice-vs-height-1.png" width="90%" style="display: block; margin: auto;" />
]

---

## Linear model with multiple predictors

``` r
pp_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(log_price ~ Width_in + Height_in, data = pp)
tidy(pp_fit)
```

```
## # A tibble: 3 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   4.77     0.0579      82.4  0       
## 2 Width_in      0.0269   0.00373      7.22 6.58e-13
## 3 Height_in    -0.0133   0.00395     -3.36 7.93e- 4
```

`$$\widehat{log\_price} = 4.77 + 0.0269 \times width - 0.0133 \times height$$`

---

## Linear model with multiple predictors

.small[

``` r
tidy(pp_fit) %>%
  mutate(exp_estimate = exp(estimate)) %>%
  select(term, estimate, exp_estimate)
```

```
## # A tibble: 3 × 3
##   term        estimate exp_estimate
##   <chr>          <dbl>        <dbl>
## 1 (Intercept)   4.77         118.  
## 2 Width_in      0.0269         1.03
## 3 Height_in    -0.0133         0.987
```

`$$\widehat{log\_price} = 4.77 + 0.0269 \times width - 0.0133 \times height$$`
]

--

- **Slope - width:** *All else held constant*, for each additional inch the painting is wider, we would expect the log-price to be higher, on average, by 0.0269.

--

- **Slope - height:** *All else held constant*, for each additional inch the painting is taller, we would expect the log-price to be lower, on average, by 0.0133.

--

- **Intercept:** The log-price of paintings that are 0 inches wide and 0 inches high is expected to be 4.77, on average. (Doesn't make sense in context.)

---

## Linear model with multiple predictors

.small[

``` r
tidy(pp_fit) %>%
  mutate(exp_estimate = exp(estimate)) %>%
  select(term, estimate, exp_estimate)
```

```
## # A tibble: 3 × 3
##   term        estimate exp_estimate
##   <chr>          <dbl>        <dbl>
## 1 (Intercept)   4.77         118.  
## 2 Width_in      0.0269         1.03
## 3 Height_in    -0.0133         0.987
```

`$$\widehat{log\_price} = 4.77 + 0.0269 \times width - 0.0133 \times height$$`
]

.question[
Interpret the slope for width in terms of price (not log-price).
]

---

## Linear model with multiple predictors

.small[

``` r
tidy(pp_fit) %>%
  mutate(exp_estimate = exp(estimate)) %>%
  select(term, estimate, exp_estimate)
```

```
## # A tibble: 3 × 3
##   term        estimate exp_estimate
##   <chr>          <dbl>        <dbl>
## 1 (Intercept)   4.77         118.  
## 2 Width_in      0.0269         1.03
## 3 Height_in    -0.0133         0.987
```

`$$\widehat{log\_price} = 4.77 + 0.0269 \times width - 0.0133 \times height$$`
]

- **Slope - width:** *All else held constant*, for each additional inch the painting is wider, we would expect the *price* to be higher, on average, *by a factor of 1.03*.

- **Slope - height:** *All else held constant*, for each additional inch the painting is taller, we would expect the *price* to be lower, on average, *by a factor of 0.987*.

- **Intercept:** The *price* of paintings that are 0 inches wide and 0 inches high is expected to be *118 livres*, on average. (Doesn't make sense in context.)
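---

## "All else held constant" in code

A small sketch of what *all else held constant* means in practice (the two hypothetical paintings below are ours, for illustration): predict for two paintings of the same height whose widths differ by one inch, then compare the predicted prices.

``` r
# two paintings, same height, widths one inch apart (illustrative values)
two_paintings <- tibble(Width_in = c(20, 21), Height_in = c(20, 20))
preds <- predict(pp_fit, new_data = two_paintings)

# the difference in predicted log-price is the width slope;
# exponentiating recovers the multiplicative factor exp(0.0269) ≈ 1.03
exp(diff(preds$.pred))
```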
---

## Visualizing models with multiple predictors

.panelset[
.panel[.panel-name[Plot]
.pull-left-wide[
]
]
.panel[.panel-name[Code]

``` r
p <- plot_ly(pp,
             x = ~Width_in, y = ~Height_in, z = ~log_price,
             marker = list(size = 3,
                           color = "lightgray",
                           alpha = 0.5,
                           line = list(color = "gray", width = 2))) %>%
  add_markers() %>%
  plotly::layout(scene = list(
    xaxis = list(title = "Width (in)"),
    yaxis = list(title = "Height (in)"),
    zaxis = list(title = "log_price")
  )) %>%
  config(displayModeBar = FALSE)
frameWidget(p)
```
]
]

---

class: middle

# The linear model with multiple predictors

### Numerical and categorical predictors

---

## Data: Book weight and volume

The `allbacks` data frame gives measurements on the volume and weight of 15 books, some of which are paperback and some of which are hardback.

.pull-left[
- Volume - cubic centimetres
- Area - square centimetres
- Weight - grams
]
.pull-right[
.small[

```
## # A tibble: 15 × 4
##    volume  area weight cover
##     <dbl> <dbl>  <dbl> <fct>
##  1    885   382    800 hb   
##  2   1016   468    950 hb   
##  3   1125   387   1050 hb   
##  4    239   371    350 hb   
##  5    701   371    750 hb   
##  6    641   367    600 hb   
##  7   1228   396   1075 hb   
##  8    412     0    250 pb   
##  9    953     0    700 pb   
## 10    929     0    650 pb   
## 11   1492     0    975 pb   
## 12    419     0    350 pb   
## 13   1010     0    950 pb   
## 14    595     0    425 pb   
## 15   1034     0    725 pb
```
]
]

.footnote[
.small[
These books are from the bookshelf of J. H. Maindonald at Australian National University.
]
]

---

## Book weight vs. volume

.pull-left[

``` r
linear_reg() %>%
  set_engine("lm") %>%
  fit(weight ~ volume, data = allbacks) %>%
  tidy()
```

```
## # A tibble: 2 × 5
##   term        estimate std.error statistic    p.value
##   <chr>          <dbl>     <dbl>     <dbl>      <dbl>
## 1 (Intercept)  108.      88.4         1.22 0.245     
## 2 volume         0.709    0.0975      7.27 0.00000626
```
]
.pull-right[
<img src="w08-L16_files/figure-html/unnamed-chunk-21-1.png" width="75%" style="display: block; margin: auto 0 auto auto;" />
]

---

## Book weight vs. volume and cover

.pull-left[

``` r
allbacks %>%
  group_by(cover) %>%
  summarise(mean_weight = mean(weight))
```

```
## # A tibble: 2 × 2
##   cover mean_weight
##   <fct>       <dbl>
## 1 hb           796.
## 2 pb           628.
```
]
.pull-right[
<img src="w08-L16_files/figure-html/unnamed-chunk-23-1.png" width="75%" style="display: block; margin: auto 0 auto auto;" />
]

---

## Book weight vs. volume and cover

.pull-left[

``` r
linear_reg() %>%
  set_engine("lm") %>%
  fit(weight ~ volume + cover, data = allbacks) %>%
  tidy()
```

```
## # A tibble: 3 × 5
##   term        estimate std.error statistic      p.value
##   <chr>          <dbl>     <dbl>     <dbl>        <dbl>
## 1 (Intercept)  198.      59.2         3.34 0.00584     
## 2 volume         0.718    0.0615     11.7  0.0000000660
## 3 coverpb     -184.      40.5        -4.55 0.000672
```
]
.pull-right[
<img src="w08-L16_files/figure-html/unnamed-chunk-25-1.png" width="75%" style="display: block; margin: auto 0 auto auto;" />
]

---

## Interpretation of estimates

```
## # A tibble: 3 × 5
##   term        estimate std.error statistic      p.value
##   <chr>          <dbl>     <dbl>     <dbl>        <dbl>
## 1 (Intercept)  198.      59.2         3.34 0.00584     
## 2 volume         0.718    0.0615     11.7  0.0000000660
## 3 coverpb     -184.      40.5        -4.55 0.000672
```

- **Slope - volume:** *All else held constant*, for each additional cubic centimetre books are larger in volume, we would expect the weight to be higher, on average, by 0.718 grams.

--

.question[
What is the interpretation of the slope for cover?
]

---

## Interpretation of estimates

```
## # A tibble: 3 × 5
##   term        estimate std.error statistic      p.value
##   <chr>          <dbl>     <dbl>     <dbl>        <dbl>
## 1 (Intercept)  198.      59.2         3.34 0.00584     
## 2 volume         0.718    0.0615     11.7  0.0000000660
## 3 coverpb     -184.      40.5        -4.55 0.000672
```

- **Slope - volume:** *All else held constant*, for each additional cubic centimetre books are larger in volume, we would expect the weight to be higher, on average, by 0.718 grams.

- **Slope - cover:** *All else held constant*, paperback books weigh, on average, 184 grams less than hardback books.

--

- **Intercept:** Hardback books with 0 volume are expected to weigh 198 grams, on average. (Doesn't make sense in context.)
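---

## Main effects in code

The `coverpb` slope is easiest to see through predictions (a sketch: the model name `book_main_fit` anticipates the model-comparison slides later, and the 1000 cubic centimetre volume is an illustrative value of ours).

``` r
book_main_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(weight ~ volume + cover, data = allbacks)

# a hardback and a paperback of the same volume: the predicted
# weights differ by exactly the coverpb estimate, about -184 g
same_volume <- tibble(volume = c(1000, 1000),
                      cover = factor(c("hb", "pb")))
predict(book_main_fit, new_data = same_volume)
```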
---

## Parallel slopes

By adding the **main effect** for `cover` we obtain two parallel lines, one for each group.

<img src="w08-L16_files/figure-html/unnamed-chunk-28-1.png" width="50%" style="display: block; margin: auto;" />

Can we fit a model so that the fitted lines for the two categories of `cover` are not parallel, or in other words, have different slopes?

---

## Two ways to model

- **Main effects:** Assuming the relationship between volume and weight **does not vary** by whether the cover is hardback or paperback.
- **Interaction effects:** Assuming the relationship between volume and weight **varies** by whether the cover is hardback or paperback.

---

## Interacting explanatory variables

- Including an interaction effect in the model allows for different slopes, i.e. nonparallel lines.
- This implies that the regression coefficient for an explanatory variable would change as another explanatory variable changes.
- This can be accomplished by adding an interaction variable: the product of two explanatory variables, as sketched on the next slide.
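---

## Interaction effects in code

In R's formula interface the interaction model can be fit as below (a sketch: the name `book_int_fit` anticipates the model-comparison slides later).

``` r
# weight ~ volume * cover is shorthand for
# weight ~ volume + cover + volume:cover,
# where volume:cover is the interaction (product) term
book_int_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(weight ~ volume * cover, data = allbacks)
tidy(book_int_fit)
```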
---

## Two ways to model

.pull-left[
- **Main effects:** Assuming the relationship between volume and weight **does not vary** by whether the cover is hardback or paperback
- **Interaction effects:** Assuming the relationship between volume and weight **varies** by whether the cover is hardback or paperback
]
.pull-right[
<img src="w08-L16_files/figure-html/pp-main-int-viz0-1.png" width="90%" style="display: block; margin: auto;" />
]

In this example, the difference is small. Let's look at another example.

---

## Price, surface area, and living artist

- Explore the relationship between the price of paintings and their surface area, conditioned on whether or not the artist is still living
- First visualize and explore, then model

---

## Typical surface area

.panelset[
.panel[.panel-name[Plot]
.pull-left-narrow[
Typical surface area appears to be less than 1000 square inches (~ 80cm x 80cm). There are very few paintings that have surface area above 5000 square inches.
]
.pull-right-wide[
<img src="w08-L16_files/figure-html/unnamed-chunk-30-1.png" width="90%" style="display: block; margin: auto;" />
]
]
.panel[.panel-name[Code]

``` r
ggplot(data = pp, aes(x = Surface, fill = artistliving)) +
  geom_histogram(binwidth = 500) +
  facet_grid(artistliving ~ .) +
  scale_fill_manual(values = c("#E48957", "#071381")) +
  guides(fill = FALSE) +
  labs(x = "Surface area", y = NULL) +
  geom_vline(xintercept = 1000) +
  geom_vline(xintercept = 5000, linetype = "dashed", color = "gray")
```

```
## Warning: Removed 176 rows containing non-finite outside the scale range
## (`stat_bin()`).
```
]
]

---

## Narrowing the scope

For simplicity let's focus on the paintings with `Surface < 5000`:

.midi[

``` r
pp_Surf_lt_5000 <- pp %>%
  filter(Surface < 5000)
```
]

.panelset[
.panel[.panel-name[Plot]
<img src="w08-L16_files/figure-html/unnamed-chunk-31-1.png" width="40%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

``` r
ggplot(data = pp_Surf_lt_5000,
       aes(y = log_price, x = Surface,
           color = artistliving, shape = artistliving)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~artistliving) +
  scale_color_manual(values = c("#E48957", "#071381")) +
  labs(color = "Artist", shape = "Artist")
```
]
]

---

## Fit model with main effects

- Response variable: `log_price`
- Explanatory variables: `Surface` area and `artistliving`

.midi[

``` r
pp_main_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(log_price ~ Surface + artistliving, data = pp_Surf_lt_5000)
tidy(pp_main_fit)
```

```
## # A tibble: 3 × 5
##   term                estimate std.error statistic  p.value
##   <chr>                  <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)        4.88      0.0424       115.   0       
## 2 Surface            0.000265  0.0000415      6.39 1.85e-10
## 3 artistlivingLiving 0.137     0.0970         1.41 1.57e- 1
```
]

--

`$$\widehat{log\_price} = 4.88 + 0.000265 \times surface + 0.137 \times artistliving$$`

---

## Solving the model

`$$\widehat{log\_price} = 4.88 + 0.000265 \times surface + 0.137 \times artistliving$$`

- Non-living artist: Plug in 0 for `artistliving`

`\begin{align} \widehat{log\_price} =& \color{#E48957}{4.88} + 0.000265 \times surface + \color{#E48957}{0.137 \times 0}\\ =& (\color{#E48957}{4.88+0}) + 0.000265 \times surface\\ =& \color{#E48957}{4.88} + 0.000265 \times surface \end{align}`

--

- Living artist: Plug in 1 for `artistliving`

`\begin{align} \widehat{log\_price} =& \color{#2AA3BB}{4.88} + 0.000265 \times surface + \color{#2AA3BB}{0.137 \times 1}\\ =& (\color{#2AA3BB}{4.88+0.137}) + 0.000265 \times surface\\ =& \color{#2AA3BB}{5.017} + 0.000265 \times surface \end{align}`

---

## Visualizing main effects

.pull-left-narrow[
- **Same slope:** Rate of change in price as the surface area increases does not vary between paintings by living and non-living artists.
- **Different intercept:** Paintings by living artists are consistently more expensive than paintings by non-living artists.
]
.pull-right-wide[
<img src="w08-L16_files/figure-html/unnamed-chunk-32-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Interaction: `Surface * artistliving`

<img src="w08-L16_files/figure-html/unnamed-chunk-33-1.png" width="60%" style="display: block; margin: auto;" />

---

## Fit model with interaction effects

- Response variable: `log_price`
- Explanatory variables: `Surface` area, `artistliving`, and their interaction

.midi[

``` r
pp_int_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(log_price ~ Surface * artistliving, data = pp_Surf_lt_5000)
tidy(pp_int_fit)
```

```
## # A tibble: 4 × 5
##   term                        estimate std.error statistic p.value
##   <chr>                          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)                  4.91e+0 0.0432       114.   0      
## 2 Surface                      2.06e-4 0.0000442      4.65 3.37e-6
## 3 artistlivingLiving          -1.26e-1 0.119         -1.06 2.89e-1
## 4 Surface:artistlivingLiving   4.79e-4 0.000126       3.81 1.39e-4
```
]

---

## Fit model with interaction effects

- Response variable: `log_price`
- Explanatory variables: `Surface` area, `artistliving`, and their interaction (here written out with explicit main effects; R drops the duplicated terms, so the fit is identical to the previous one)

.midi[

``` r
pp_int_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(log_price ~ Surface + artistliving + Surface * artistliving,
      data = pp_Surf_lt_5000)
tidy(pp_int_fit)
```

```
## # A tibble: 4 × 5
##   term                        estimate std.error statistic p.value
##   <chr>                          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)                  4.91e+0 0.0432       114.   0      
## 2 Surface                      2.06e-4 0.0000442      4.65 3.37e-6
## 3 artistlivingLiving          -1.26e-1 0.119         -1.06 2.89e-1
## 4 Surface:artistlivingLiving   4.79e-4 0.000126       3.81 1.39e-4
```
]

---

## Linear model with interaction effects

.midi[

```
## # A tibble: 4 × 5
##   term                        estimate std.error statistic p.value
##   <chr>                          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)                  4.91e+0 0.0432       114.   0      
## 2 Surface                      2.06e-4 0.0000442      4.65 3.37e-6
## 3 artistlivingLiving          -1.26e-1 0.119         -1.06 2.89e-1
## 4 Surface:artistlivingLiving   4.79e-4 0.000126       3.81 1.39e-4
```
]

.small[
`\begin{align} \widehat{log\_price} = 4.91 &+ 0.00021 \times surface - 0.126 \times artistliving + 0.00048 \times surface * artistliving \end{align}`
]

---

## Solving the model

.small[
`\begin{align} \widehat{log\_price} = 4.91 &+ 0.00021 \times surface - 0.126 \times artistliving + 0.00048 \times surface * artistliving \end{align}`
]

- Non-living artist: Plug in 0 for `artistliving`

.small[
`\begin{align} \widehat{log\_price} =& \color{#E48957}{4.91} + \color{#FFCE7E}{0.00021} \times surface \color{#E48957}{- 0.126 \times 0} + \color{#FFCE7E}{0.00048 \times 0} \times surface\\ =& (\color{#E48957}{4.91+0}) + (\color{#FFCE7E}{0.00021+0}) \times surface\\ =& \color{#E48957}{4.91} + \color{#FFCE7E}{0.00021} \times surface \end{align}`
]

--

- Living artist: Plug in 1 for `artistliving`

.small[
`\begin{align} \widehat{log\_price} =& \color{#2AA3BB}{4.91} + \color{#ABC8C0}{0.00021} \times surface \color{#2AA3BB}{- 0.126 \times 1} + \color{#ABC8C0}{0.00048 \times 1} \times surface\\ =& (\color{#2AA3BB}{4.91-0.126}) + (\color{#ABC8C0}{0.00021+0.00048}) \times surface\\ =& \color{#2AA3BB}{4.784} + \color{#ABC8C0}{0.00069} \times surface \end{align}`
]
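---

## Solving the model in code

The same bookkeeping can be done from the stored coefficients (a sketch; the indexing assumes the term order shown in `tidy(pp_int_fit)` earlier).

``` r
coefs <- tidy(pp_int_fit)$estimate

# non-living artists: intercept and slope come straight from rows 1-2
c(intercept = coefs[1], slope = coefs[2])

# living artists: add the artistliving and interaction adjustments,
# giving 4.91 - 0.126 = 4.784 and 0.00021 + 0.00048 = 0.00069
c(intercept = coefs[1] + coefs[3], slope = coefs[2] + coefs[4])
```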
---

## Interpretation of interaction effects

- Rate of change in price as the surface area of the painting increases does vary between paintings by living and non-living artists (different slopes).
- Some paintings by living artists are more expensive than paintings by non-living artists, and some are not (different intercepts).

---

class: middle

# Comparing models

---

## Book weight by volume and cover

<img src="w08-L16_files/figure-html/book-main-int-1.png" width="55%" style="display: block; margin: auto;" />

---

## In pursuit of Occam's razor

- Occam's razor states that among competing hypotheses that predict equally well, the one with the fewest assumptions should be selected.

--

- Model selection follows this principle.

--

- We only want to add another variable to the model if the addition of that variable brings something valuable in terms of predictive power to the model.

--

- In other words, we prefer the simplest best model, i.e. the most **parsimonious** model.

---

## Comparing models

.pull-left[
<img src="w08-L16_files/figure-html/unnamed-chunk-34-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[
.question[
Visually, which of the two models is preferable under Occam's razor?
]
]
---

## R-squared

`\(R^2\)` is the proportion of variability in the response variable explained by the regression model.

``` r
glance(book_main_fit)$r.squared
```

```
## [1] 0.9274776
```

``` r
glance(book_int_fit)$r.squared
```

```
## [1] 0.9297137
```

--

- Clearly the model with interactions has a higher `\(R^2\)`.

--

- However, using `\(R^2\)` for model selection in models with multiple explanatory variables is not a good idea, as `\(R^2\)` increases (it never decreases) when **any** variable is added to the model.

---

## Adjusted R-squared

... a (more) objective measure for model selection

- Adjusted `\(R^2\)` doesn't increase if the new variable does not provide any new information or is completely unrelated, as it applies a penalty for the number of variables included in the model: for `\(n\)` observations and `\(k\)` predictors, `\(R^2_{adj} = 1 - (1 - R^2) \times \frac{n-1}{n-k-1}\)`.
- This makes adjusted `\(R^2\)` a preferable metric for model selection in multiple regression models.

---

## Comparing models with adjusted R-squared

.pull-left[

``` r
glance(book_main_fit)$r.squared
```

```
## [1] 0.9274776
```

``` r
glance(book_int_fit)$r.squared
```

```
## [1] 0.9297137
```
]
.pull-right[

``` r
glance(book_main_fit)$adj.r.squared
```

```
## [1] 0.9153905
```

``` r
glance(book_int_fit)$adj.r.squared
```

```
## [1] 0.9105447
```
]

--

.question[
Using adjusted R-squared, which of the two models is preferable?
]

---

## Painting price by surface area and living artist

.pull-left[
<img src="w08-L16_files/figure-html/unnamed-chunk-38-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[

``` r
glance(pp_main_fit)$adj.r.squared
```

```
## [1] 0.01258977
```

``` r
glance(pp_int_fit)$adj.r.squared
```

```
## [1] 0.01676753
```

.question[
Which of the two models is preferable?
]
]

--

It appears that adding the interaction actually increased adjusted `\(R^2\)`, so we should indeed use the model with the interactions.

---

## Third order interactions

- Can you? Yes
- Should you? Probably not if you want to interpret these interactions in the context of the data.

---

## Back to log-price vs. width and height

- Response variable: `log_price`
- Explanatory variables: Width and height

.pull-left[
<img src="w08-L16_files/figure-html/logprice-vs-width2-1.png" width="90%" style="display: block; margin: auto;" />
]
.pull-right[
<img src="w08-L16_files/figure-html/logprice-vs-height2-1.png" width="90%" style="display: block; margin: auto;" />
]

---

## Comparing models with numerical predictors

``` r
single_reg_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(log(price) ~ Width_in, data = pp)
glance(single_reg_fit)$adj.r.squared
```

```
## [1] 0.01908747
```

``` r
mult_reg_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(log(price) ~ Width_in + Height_in, data = pp)
glance(mult_reg_fit)$adj.r.squared
```

```
## [1] 0.02231559
```

---

## Recap

- The plot of residuals vs. predicted/fitted values can be used as a diagnostic plot.

--

- Transformations might help, and `\(\log(y)\)` is useful when the response is right-skewed and for variance stabilization.
  - In this case, the coefficients are interpreted in terms of percentage change.

--

- We learnt how to fit and interpret models with multiple predictors.
  - When working with categorical predictors, main effects mean parallel lines, while interaction effects mean different slopes.

--

- We learnt about Occam's razor and how to use adjusted `\(R^2\)` for model comparison.
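---

## Appendix: adjusted R-squared by hand

A small sketch verifying the adjusted `\(R^2\)` formula against `glance()` for the books interaction model (`n` and `k` below are ours: 15 books, and 3 predictors, namely `volume`, `coverpb`, and their interaction).

``` r
n <- 15
k <- 3
r2 <- glance(book_int_fit)$r.squared

# 1 - (1 - 0.9297137) * 14 / 11 = 0.9105447, matching adj.r.squared
1 - (1 - r2) * (n - 1) / (n - k - 1)
```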