class: center, middle, inverse, title-slide

.title[
# Model checking and multiple linear regression
]
.subtitle[
## Introduction to Data Science
]
.author[
### University of Edinburgh
]
.date[
### 2024/2025
]

---

layout: true

<div class="my-footer">
<span>
University of Edinburgh
</span>
</div>

---

## Topics

- Assessing model fit
- Linearity and transformations
- Multiple linear regression
- Interaction effects
- Model comparison

---

class: middle

# Model checking

---

## "Linear" models

- We're fitting a "linear" model, which assumes a linear relationship between our explanatory and response variables.
- But how do we assess this?

---

## Data: Paris Paintings

``` r
pp <- read_csv("data/paris-paintings.csv", na = c("n/a", "", "NA"))
```

- Number of observations: 3393
- Number of variables: 61

---

## Graphical diagnostic: residuals plot

.panelset[
.panel[.panel-name[Plot]
<img src="w08-L16_files/figure-html/unnamed-chunk-2-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

``` r
ht_wt_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(Height_in ~ Width_in, data = pp)

*ht_wt_fit_aug <- augment(ht_wt_fit$fit)

ggplot(ht_wt_fit_aug, mapping = aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, color = "gray", lty = "dashed") +
  labs(x = "Predicted height", y = "Residuals")
```
]
.panel[.panel-name[Augment]

``` r
ht_wt_fit_aug
```

```
## # A tibble: 3,135 × 9
##   .rownames Height_in Width_in .fitted .resid     .hat .sigma
##   <chr>         <dbl>    <dbl>   <dbl>  <dbl>    <dbl>  <dbl>
## 1 1                37     29.5    26.7  10.3  0.000399   8.30
## 2 2                18     14      14.6   3.45 0.000396   8.31
## 3 3                13     16      16.1  -3.11 0.000361   8.31
## 4 4                14     18      17.7  -3.68 0.000337   8.31
## 5 5                14     18      17.7  -3.68 0.000337   8.31
## 6 6                 7     10      11.4  -4.43 0.000498   8.31
## # ℹ 3,129 more rows
## # ℹ 2 more variables: .cooksd <dbl>, .std.resid <dbl>
```
]
]

---

## Looking for...

- Residuals distributed randomly around 0
- With no visible pattern along the x or y axes

<img src="w08-L16_files/figure-html/unnamed-chunk-4-1.png" width="60%" style="display: block; margin: auto;" />

---

## Not looking for...

.large[
**Fan shapes**
]

<img src="w08-L16_files/figure-html/unnamed-chunk-5-1.png" width="60%" style="display: block; margin: auto;" />

---

## Not looking for...

.large[
**Groups of patterns**
]

<img src="w08-L16_files/figure-html/unnamed-chunk-6-1.png" width="60%" style="display: block; margin: auto;" />

---

## Not looking for...

.large[
**Residuals correlated with predicted values**
]

<img src="w08-L16_files/figure-html/unnamed-chunk-7-1.png" width="60%" style="display: block; margin: auto;" />

---

## Not looking for...

.large[
**Any patterns!**
]

<img src="w08-L16_files/figure-html/unnamed-chunk-8-1.png" width="60%" style="display: block; margin: auto;" />

---

.question[
What patterns does the residuals plot reveal that should make us question whether a linear model is a good fit for modeling the relationship between height and width of paintings?
]

<img src="w08-L16_files/figure-html/unnamed-chunk-9-1.png" width="60%" style="display: block; margin: auto;" />

---

class: middle

# Linearity and transformations

---

## Data: Paris Paintings

<img src="w08-L16_files/figure-html/unnamed-chunk-10-1.png" width="70%" style="display: block; margin: auto;" />

---

## Price vs. width

<img src="w08-L16_files/figure-html/unnamed-chunk-11-1.png" width="70%" style="display: block; margin: auto;" />

---

## Focus on paintings with `Width_in < 100`

That is, paintings with width less than 2.54 m.

``` r
pp_wt_lt_100 <- pp %>%
  filter(Width_in < 100)
```

---

## Price vs. width

.question[
Which plot shows a more linear relationship?
]
.small[
.pull-left[
<img src="w08-L16_files/figure-html/unnamed-chunk-13-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[
<img src="w08-L16_files/figure-html/unnamed-chunk-14-1.png" width="100%" style="display: block; margin: auto;" />
]
]

---

## Price vs. width, residuals

.question[
Which plot shows residuals that are uncorrelated with the predicted values from the model? Also, what is the unit of the residuals?
]

.pull-left[
<img src="w08-L16_files/figure-html/unnamed-chunk-15-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[
<img src="w08-L16_files/figure-html/unnamed-chunk-16-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Transforming the data

- We saw that `price` has a right-skewed distribution, and that the relationship between the price and width of paintings is non-linear.

--

- In these situations a transformation applied to the response variable may be useful.

--

- In order to decide which transformation to use, we should examine the distribution of the response variable.

--

- The extremely right-skewed distribution suggests that a log transformation may be useful.
  - log = natural log, `\(ln\)`
  - The default base of the `log` function in R is the natural log: <br>
    `log(x, base = exp(1))`

---

## Logged price vs. width

.question[
How do we interpret the slope of this model?
]

<img src="w08-L16_files/figure-html/unnamed-chunk-17-1.png" width="60%" style="display: block; margin: auto;" />

---

## Models with log transformation

``` r
linear_reg() %>%
  set_engine("lm") %>%
  fit(log(price) ~ Width_in, data = pp_wt_lt_100) %>%
  tidy()
```

```
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   4.67     0.0585      79.9  0       
## 2 Width_in      0.0192   0.00226      8.48 3.36e-17
```

---

## Interpreting the slope

$$ \widehat{log(price)} = 4.67 + 0.0192 \times Width $$

- For each additional inch the painting is wider, the log price of the painting is expected to be higher, on average, by 0.0192 log livres.

--

- which is not a very useful statement... Instead:

- For each additional inch the painting is wider, the price of the painting is expected to be higher, on average, *by a (multiplicative) factor of exp(0.0192) ≈ 1.0194*.

--

In other words:

- For each additional inch the painting is wider, the price of the painting is expected to be higher, on average, *by approximately 1.92%* (1.92 = 100 × 0.0192; for small coefficients this is a close approximation to the exact 1.94%).
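---

## Checking the interpretation with code

A quick sanity check of the multiplicative interpretation (a sketch: the refit, the model name, and the example widths below are ours, for illustration only). Predicted prices at two widths one inch apart should differ by a factor of exp(0.0192) ≈ 1.0194.

``` r
# refit the model from the previous slide and keep it around
log_width_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(log(price) ~ Width_in, data = pp_wt_lt_100)

# predicted log-prices at two illustrative widths, one inch apart
new_widths <- tibble(Width_in = c(20, 21))
preds <- predict(log_width_fit, new_data = new_widths)

# ratio of the predicted prices: exp(0.0192), i.e. roughly 1.0194
exp(preds$.pred[2]) / exp(preds$.pred[1])
```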
---

## Transform, or learn more?

- Data transformations may also be useful when the relationship is non-linear
- However, in those cases a polynomial regression may be more appropriate
  + This is beyond the scope of this course, but you're welcome to try it for your final project, and I'd be happy to provide further guidance

---

class: middle

# The linear model with multiple predictors

### Two numerical predictors

---

## The data

``` r
pp <- read_csv("data/paris-paintings.csv", na = c("n/a", "", "NA")) %>%
  mutate(log_price = log(price))
```

---

## Multiple predictors

- Response variable: `log_price`
- Explanatory variables: Width and height

.pull-left[
<img src="w08-L16_files/figure-html/logprice-vs-width-1.png" width="90%" style="display: block; margin: auto;" />
]
.pull-right[
<img src="w08-L16_files/figure-html/logprice-vs-height-1.png" width="90%" style="display: block; margin: auto;" />
]

---

## Linear model with multiple predictors

``` r
pp_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(log_price ~ Width_in + Height_in, data = pp)
tidy(pp_fit)
```

```
## # A tibble: 3 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   4.77     0.0579      82.4  0       
## 2 Width_in      0.0269   0.00373      7.22 6.58e-13
## 3 Height_in    -0.0133   0.00395     -3.36 7.93e- 4
```

`$$\widehat{log\_price} = 4.77 + 0.0269 \times width - 0.0133 \times height$$`

---

## Linear model with multiple predictors

.small[

``` r
tidy(pp_fit) %>%
  mutate(exp_estimate = exp(estimate)) %>%
  select(term, estimate, exp_estimate)
```

```
## # A tibble: 3 × 3
##   term        estimate exp_estimate
##   <chr>          <dbl>        <dbl>
## 1 (Intercept)   4.77         118.  
## 2 Width_in      0.0269         1.03
## 3 Height_in    -0.0133         0.987
```

`$$\widehat{log\_price} = 4.77 + 0.0269 \times width - 0.0133 \times height$$`
]

--

- **Slope - width:** *All else held constant*, for each additional inch the painting is wider, we would expect the log-price to be higher, on average, by 0.0269.

--

- **Slope - height:** *All else held constant*, for each additional inch the painting is taller, we would expect the log-price to be lower, on average, by 0.0133.

--

- **Intercept:** The log-price of paintings that are 0 inches wide and 0 inches high is expected to be 4.77, on average. (Doesn't make sense in context.)

---

## Linear model with multiple predictors

.small[

``` r
tidy(pp_fit) %>%
  mutate(exp_estimate = exp(estimate)) %>%
  select(term, estimate, exp_estimate)
```

```
## # A tibble: 3 × 3
##   term        estimate exp_estimate
##   <chr>          <dbl>        <dbl>
## 1 (Intercept)   4.77         118.  
## 2 Width_in      0.0269         1.03
## 3 Height_in    -0.0133         0.987
```

`$$\widehat{log\_price} = 4.77 + 0.0269 \times width - 0.0133 \times height$$`
]

.question[
Interpret the slope for width in terms of price (not log-price).
]

---

## Linear model with multiple predictors

.small[

``` r
tidy(pp_fit) %>%
  mutate(exp_estimate = exp(estimate)) %>%
  select(term, estimate, exp_estimate)
```

```
## # A tibble: 3 × 3
##   term        estimate exp_estimate
##   <chr>          <dbl>        <dbl>
## 1 (Intercept)   4.77         118.  
## 2 Width_in      0.0269         1.03
## 3 Height_in    -0.0133         0.987
```

`$$\widehat{log\_price} = 4.77 + 0.0269 \times width - 0.0133 \times height$$`
]

- **Slope - width:** *All else held constant*, for each additional inch the painting is wider, we would expect the *price* to be higher, on average, *by a factor of 1.03*.

- **Slope - height:** *All else held constant*, for each additional inch the painting is taller, we would expect the *price* to be lower, on average, *by a factor of 0.987*.

- **Intercept:** The *price* of paintings that are 0 inches wide and 0 inches high is expected to be *118 livres*, on average. (Doesn't make sense in context.)
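---

## "All else held constant" in code

A small sketch of what *all else held constant* means in practice (the two hypothetical paintings below are ours, for illustration): predict for two paintings of the same height whose widths differ by one inch, then compare the predicted prices.

``` r
# two paintings, same height, widths one inch apart (illustrative values)
two_paintings <- tibble(Width_in = c(20, 21), Height_in = c(20, 20))
preds <- predict(pp_fit, new_data = two_paintings)

# the difference in predicted log-price is the width slope;
# exponentiating recovers the multiplicative factor exp(0.0269) ≈ 1.03
exp(diff(preds$.pred))
```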
---

## Visualizing models with multiple predictors

.panelset[
.panel[.panel-name[Plot]
.pull-left-wide[
]
]
.panel[.panel-name[Code]

``` r
p <- plot_ly(pp,
             x = ~Width_in, y = ~Height_in, z = ~log_price,
             marker = list(size = 3,
                           color = "lightgray",
                           alpha = 0.5,
                           line = list(color = "gray", width = 2))) %>%
  add_markers() %>%
  plotly::layout(scene = list(
    xaxis = list(title = "Width (in)"),
    yaxis = list(title = "Height (in)"),
    zaxis = list(title = "log_price")
  )) %>%
  config(displayModeBar = FALSE)
frameWidget(p)
```
]
]

---

class: middle

# The linear model with multiple predictors

### Numerical and categorical predictors

---

## Data: Book weight and volume

The `allbacks` data frame gives measurements on the volume and weight of 15 books, some of which are paperback and some of which are hardback.

.pull-left[
- Volume - cubic centimetres
- Area - square centimetres
- Weight - grams
]
.pull-right[
.small[

```
## # A tibble: 15 × 4
##    volume  area weight cover
##     <dbl> <dbl>  <dbl> <fct>
##  1    885   382    800 hb   
##  2   1016   468    950 hb   
##  3   1125   387   1050 hb   
##  4    239   371    350 hb   
##  5    701   371    750 hb   
##  6    641   367    600 hb   
##  7   1228   396   1075 hb   
##  8    412     0    250 pb   
##  9    953     0    700 pb   
## 10    929     0    650 pb   
## 11   1492     0    975 pb   
## 12    419     0    350 pb   
## 13   1010     0    950 pb   
## 14    595     0    425 pb   
## 15   1034     0    725 pb
```
]
]

.footnote[
.small[
These books are from the bookshelf of J. H. Maindonald at Australian National University.
]
]

---

## Book weight vs. volume

.pull-left[

``` r
linear_reg() %>%
  set_engine("lm") %>%
  fit(weight ~ volume, data = allbacks) %>%
  tidy()
```

```
## # A tibble: 2 × 5
##   term        estimate std.error statistic    p.value
##   <chr>          <dbl>     <dbl>     <dbl>      <dbl>
## 1 (Intercept)  108.      88.4         1.22 0.245     
## 2 volume         0.709    0.0975      7.27 0.00000626
```
]
.pull-right[
<img src="w08-L16_files/figure-html/unnamed-chunk-21-1.png" width="75%" style="display: block; margin: auto 0 auto auto;" />
]

---

## Book weight vs. volume and cover

.pull-left[

``` r
allbacks %>%
  group_by(cover) %>%
  summarise(mean_weight = mean(weight))
```

```
## # A tibble: 2 × 2
##   cover mean_weight
##   <fct>       <dbl>
## 1 hb           796.
## 2 pb           628.
```
]
.pull-right[
<img src="w08-L16_files/figure-html/unnamed-chunk-23-1.png" width="75%" style="display: block; margin: auto 0 auto auto;" />
]

---

## Book weight vs. volume and cover

.pull-left[

``` r
linear_reg() %>%
  set_engine("lm") %>%
  fit(weight ~ volume + cover, data = allbacks) %>%
  tidy()
```

```
## # A tibble: 3 × 5
##   term        estimate std.error statistic      p.value
##   <chr>          <dbl>     <dbl>     <dbl>        <dbl>
## 1 (Intercept)  198.      59.2         3.34 0.00584     
## 2 volume         0.718    0.0615     11.7  0.0000000660
## 3 coverpb     -184.      40.5        -4.55 0.000672
```
]
.pull-right[
<img src="w08-L16_files/figure-html/unnamed-chunk-25-1.png" width="75%" style="display: block; margin: auto 0 auto auto;" />
]

---

## Interpretation of estimates

```
## # A tibble: 3 × 5
##   term        estimate std.error statistic      p.value
##   <chr>          <dbl>     <dbl>     <dbl>        <dbl>
## 1 (Intercept)  198.      59.2         3.34 0.00584     
## 2 volume         0.718    0.0615     11.7  0.0000000660
## 3 coverpb     -184.      40.5        -4.55 0.000672
```

- **Slope - volume:** *All else held constant*, for each additional cubic centimetre books are larger in volume, we would expect the weight to be higher, on average, by 0.718 grams.

--

.question[
What is the interpretation of the slope for cover?
]

---

## Interpretation of estimates

```
## # A tibble: 3 × 5
##   term        estimate std.error statistic      p.value
##   <chr>          <dbl>     <dbl>     <dbl>        <dbl>
## 1 (Intercept)  198.      59.2         3.34 0.00584     
## 2 volume         0.718    0.0615     11.7  0.0000000660
## 3 coverpb     -184.      40.5        -4.55 0.000672
```

- **Slope - volume:** *All else held constant*, for each additional cubic centimetre books are larger in volume, we would expect the weight to be higher, on average, by 0.718 grams.

- **Slope - cover:** *All else held constant*, paperback books weigh, on average, 184 grams less than hardback books.

--

- **Intercept:** Hardback books with 0 volume are expected to weigh 198 grams, on average. (Doesn't make sense in context.)
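---

## Main effects in code

The `coverpb` slope is easiest to see through predictions (a sketch: the model name `book_main_fit` anticipates the model-comparison slides later, and the 1000 cubic centimetre volume is an illustrative value of ours).

``` r
book_main_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(weight ~ volume + cover, data = allbacks)

# a hardback and a paperback of the same volume: the predicted
# weights differ by exactly the coverpb estimate, about -184 g
same_volume <- tibble(volume = c(1000, 1000),
                      cover = factor(c("hb", "pb")))
predict(book_main_fit, new_data = same_volume)
```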
---

## Parallel slopes

By adding the **main effect** for `cover` we obtain two parallel lines, one for each group.

<img src="w08-L16_files/figure-html/unnamed-chunk-28-1.png" width="50%" style="display: block; margin: auto;" />

Can we fit a model so that the fitted lines for the two categories of `cover` are not parallel, or in other words, have different slopes?

---

## Two ways to model

- **Main effects:** Assuming the relationship between volume and weight **does not vary** by whether the cover is hardback or paperback.
- **Interaction effects:** Assuming the relationship between volume and weight **varies** by whether the cover is hardback or paperback.

---

## Interacting explanatory variables

- Including an interaction effect in the model allows for different slopes, i.e. nonparallel lines.
- This implies that the regression coefficient for an explanatory variable would change as another explanatory variable changes.
- This can be accomplished by adding an interaction variable: the product of two explanatory variables, as sketched on the next slide.
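---

## Interaction effects in code

In R's formula interface the interaction model can be fit as below (a sketch: the name `book_int_fit` anticipates the model-comparison slides later).

``` r
# weight ~ volume * cover is shorthand for
# weight ~ volume + cover + volume:cover,
# where volume:cover is the interaction (product) term
book_int_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(weight ~ volume * cover, data = allbacks)
tidy(book_int_fit)
```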
---

## Two ways to model

.pull-left[
- **Main effects:** Assuming the relationship between volume and weight **does not vary** by whether the cover is hardback or paperback
- **Interaction effects:** Assuming the relationship between volume and weight **varies** by whether the cover is hardback or paperback
]
.pull-right[
<img src="w08-L16_files/figure-html/pp-main-int-viz0-1.png" width="90%" style="display: block; margin: auto;" />
]

In this example, the difference is small. Let's look at another example.

---

## Price, surface area, and living artist

- Explore the relationship between the price of paintings and their surface area, conditioned on whether or not the artist is still living
- First visualize and explore, then model

---

## Typical surface area

.panelset[
.panel[.panel-name[Plot]
.pull-left-narrow[
Typical surface area appears to be less than 1000 square inches (~ 80cm x 80cm). There are very few paintings that have surface area above 5000 square inches.
]
.pull-right-wide[
<img src="w08-L16_files/figure-html/unnamed-chunk-30-1.png" width="90%" style="display: block; margin: auto;" />
]
]
.panel[.panel-name[Code]

``` r
ggplot(data = pp, aes(x = Surface, fill = artistliving)) +
  geom_histogram(binwidth = 500) +
  facet_grid(artistliving ~ .) +
  scale_fill_manual(values = c("#E48957", "#071381")) +
  guides(fill = FALSE) +
  labs(x = "Surface area", y = NULL) +
  geom_vline(xintercept = 1000) +
  geom_vline(xintercept = 5000, linetype = "dashed", color = "gray")
```

```
## Warning: Removed 176 rows containing non-finite outside the scale range
## (`stat_bin()`).
```
]
]

---

## Narrowing the scope

For simplicity let's focus on the paintings with `Surface < 5000`:

.midi[

``` r
pp_Surf_lt_5000 <- pp %>%
  filter(Surface < 5000)
```
]

.panelset[
.panel[.panel-name[Plot]
<img src="w08-L16_files/figure-html/unnamed-chunk-31-1.png" width="40%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

``` r
ggplot(data = pp_Surf_lt_5000,
       aes(y = log_price, x = Surface,
           color = artistliving, shape = artistliving)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~artistliving) +
  scale_color_manual(values = c("#E48957", "#071381")) +
  labs(color = "Artist", shape = "Artist")
```
]
]

---

## Fit model with main effects

- Response variable: `log_price`
- Explanatory variables: `Surface` area and `artistliving`

.midi[

``` r
pp_main_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(log_price ~ Surface + artistliving, data = pp_Surf_lt_5000)
tidy(pp_main_fit)
```

```
## # A tibble: 3 × 5
##   term                estimate std.error statistic  p.value
##   <chr>                  <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)        4.88      0.0424       115.   0       
## 2 Surface            0.000265  0.0000415      6.39 1.85e-10
## 3 artistlivingLiving 0.137     0.0970         1.41 1.57e- 1
```
]

--

`$$\widehat{log\_price} = 4.88 + 0.000265 \times surface + 0.137 \times artistliving$$`

---

## Solving the model

`$$\widehat{log\_price} = 4.88 + 0.000265 \times surface + 0.137 \times artistliving$$`

- Non-living artist: Plug in 0 for `artistliving`

`\begin{align} \widehat{log\_price} =& \color{#E48957}{4.88} + 0.000265 \times surface + \color{#E48957}{0.137 \times 0}\\ =& (\color{#E48957}{4.88+0}) + 0.000265 \times surface\\ =& \color{#E48957}{4.88} + 0.000265 \times surface \end{align}`

--

- Living artist: Plug in 1 for `artistliving`

`\begin{align} \widehat{log\_price} =& \color{#2AA3BB}{4.88} + 0.000265 \times surface + \color{#2AA3BB}{0.137 \times 1}\\ =& (\color{#2AA3BB}{4.88+0.137}) + 0.000265 \times surface\\ =& \color{#2AA3BB}{5.017} + 0.000265 \times surface \end{align}`

---

## Visualizing main effects

.pull-left-narrow[
- **Same slope:** Rate of change in price as the surface area increases does not vary between paintings by living and non-living artists.
- **Different intercept:** Paintings by living artists are consistently more expensive than paintings by non-living artists.
]
.pull-right-wide[
<img src="w08-L16_files/figure-html/unnamed-chunk-32-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Interaction: `Surface * artistliving`

<img src="w08-L16_files/figure-html/unnamed-chunk-33-1.png" width="60%" style="display: block; margin: auto;" />

---

## Fit model with interaction effects

- Response variable: `log_price`
- Explanatory variables: `Surface` area, `artistliving`, and their interaction

.midi[

``` r
pp_int_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(log_price ~ Surface * artistliving, data = pp_Surf_lt_5000)
tidy(pp_int_fit)
```

```
## # A tibble: 4 × 5
##   term                        estimate std.error statistic p.value
##   <chr>                          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)                  4.91e+0 0.0432       114.   0      
## 2 Surface                      2.06e-4 0.0000442      4.65 3.37e-6
## 3 artistlivingLiving          -1.26e-1 0.119         -1.06 2.89e-1
## 4 Surface:artistlivingLiving   4.79e-4 0.000126       3.81 1.39e-4
```
]

---

## Fit model with interaction effects

- Response variable: `log_price`
- Explanatory variables: `Surface` area, `artistliving`, and their interaction (here written out with explicit main effects; R drops the duplicated terms, so the fit is identical to the previous one)

.midi[

``` r
pp_int_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(log_price ~ Surface + artistliving + Surface * artistliving,
      data = pp_Surf_lt_5000)
tidy(pp_int_fit)
```

```
## # A tibble: 4 × 5
##   term                        estimate std.error statistic p.value
##   <chr>                          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)                  4.91e+0 0.0432       114.   0      
## 2 Surface                      2.06e-4 0.0000442      4.65 3.37e-6
## 3 artistlivingLiving          -1.26e-1 0.119         -1.06 2.89e-1
## 4 Surface:artistlivingLiving   4.79e-4 0.000126       3.81 1.39e-4
```
]

---

## Linear model with interaction effects

.midi[

```
## # A tibble: 4 × 5
##   term                        estimate std.error statistic p.value
##   <chr>                          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)                  4.91e+0 0.0432       114.   0      
## 2 Surface                      2.06e-4 0.0000442      4.65 3.37e-6
## 3 artistlivingLiving          -1.26e-1 0.119         -1.06 2.89e-1
## 4 Surface:artistlivingLiving   4.79e-4 0.000126       3.81 1.39e-4
```
]

.small[
`\begin{align} \widehat{log\_price} = 4.91 &+ 0.00021 \times surface - 0.126 \times artistliving + 0.00048 \times surface * artistliving \end{align}`
]

---

## Solving the model

.small[
`\begin{align} \widehat{log\_price} = 4.91 &+ 0.00021 \times surface - 0.126 \times artistliving + 0.00048 \times surface * artistliving \end{align}`
]

- Non-living artist: Plug in 0 for `artistliving`

.small[
`\begin{align} \widehat{log\_price} =& \color{#E48957}{4.91} + \color{#FFCE7E}{0.00021} \times surface \color{#E48957}{- 0.126 \times 0} + \color{#FFCE7E}{0.00048 \times 0} \times surface\\ =& (\color{#E48957}{4.91+0}) + (\color{#FFCE7E}{0.00021+0}) \times surface\\ =& \color{#E48957}{4.91} + \color{#FFCE7E}{0.00021} \times surface \end{align}`
]

--

- Living artist: Plug in 1 for `artistliving`

.small[
`\begin{align} \widehat{log\_price} =& \color{#2AA3BB}{4.91} + \color{#ABC8C0}{0.00021} \times surface \color{#2AA3BB}{- 0.126 \times 1} + \color{#ABC8C0}{0.00048 \times 1} \times surface\\ =& (\color{#2AA3BB}{4.91-0.126}) + (\color{#ABC8C0}{0.00021+0.00048}) \times surface\\ =& \color{#2AA3BB}{4.784} + \color{#ABC8C0}{0.00069} \times surface \end{align}`
]
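---

## Solving the model in code

The same bookkeeping can be done from the stored coefficients (a sketch; the indexing assumes the term order shown in `tidy(pp_int_fit)` earlier).

``` r
coefs <- tidy(pp_int_fit)$estimate

# non-living artists: intercept and slope come straight from rows 1-2
c(intercept = coefs[1], slope = coefs[2])

# living artists: add the artistliving and interaction adjustments,
# giving 4.91 - 0.126 = 4.784 and 0.00021 + 0.00048 = 0.00069
c(intercept = coefs[1] + coefs[3], slope = coefs[2] + coefs[4])
```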
---

## Interpretation of interaction effects

- Rate of change in price as the surface area of the painting increases does vary between paintings by living and non-living artists (different slopes).
- Some paintings by living artists are more expensive than paintings by non-living artists, and some are not (different intercepts).

---

class: middle

# Comparing models

---

## Book weight by volume and cover

<img src="w08-L16_files/figure-html/book-main-int-1.png" width="55%" style="display: block; margin: auto;" />

---

## In pursuit of Occam's razor

- Occam's razor states that among competing hypotheses that predict equally well, the one with the fewest assumptions should be selected.

--

- Model selection follows this principle.

--

- We only want to add another variable to the model if the addition of that variable brings something valuable in terms of predictive power to the model.

--

- In other words, we prefer the simplest best model, i.e. the most **parsimonious** model.

---

## Comparing models

.pull-left[
<img src="w08-L16_files/figure-html/unnamed-chunk-34-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[
.question[
Visually, which of the two models is preferable under Occam's razor?
]
]
---

## R-squared

`\(R^2\)` is the proportion of variability in the response variable explained by the regression model.

``` r
glance(book_main_fit)$r.squared
```

```
## [1] 0.9274776
```

``` r
glance(book_int_fit)$r.squared
```

```
## [1] 0.9297137
```

--

- Clearly the model with interactions has a higher `\(R^2\)`.

--

- However, using `\(R^2\)` for model selection in models with multiple explanatory variables is not a good idea, as `\(R^2\)` increases (it never decreases) when **any** variable is added to the model.

---

## Adjusted R-squared

... a (more) objective measure for model selection

- Adjusted `\(R^2\)` doesn't increase if the new variable does not provide any new information or is completely unrelated, as it applies a penalty for the number of variables included in the model: for `\(n\)` observations and `\(k\)` predictors, `\(R^2_{adj} = 1 - (1 - R^2) \times \frac{n-1}{n-k-1}\)`.
- This makes adjusted `\(R^2\)` a preferable metric for model selection in multiple regression models.

---

## Comparing models with adjusted R-squared

.pull-left[

``` r
glance(book_main_fit)$r.squared
```

```
## [1] 0.9274776
```

``` r
glance(book_int_fit)$r.squared
```

```
## [1] 0.9297137
```
]
.pull-right[

``` r
glance(book_main_fit)$adj.r.squared
```

```
## [1] 0.9153905
```

``` r
glance(book_int_fit)$adj.r.squared
```

```
## [1] 0.9105447
```
]

--

.question[
Using adjusted R-squared, which of the two models is preferable?
]

---

## Painting price by surface area and living artist

.pull-left[
<img src="w08-L16_files/figure-html/unnamed-chunk-38-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[

``` r
glance(pp_main_fit)$adj.r.squared
```

```
## [1] 0.01258977
```

``` r
glance(pp_int_fit)$adj.r.squared
```

```
## [1] 0.01676753
```

.question[
Which of the two models is preferable?
]
]

--

It appears that adding the interaction actually increased adjusted `\(R^2\)`, so we should indeed use the model with the interactions.

---

## Third order interactions

- Can you? Yes
- Should you? Probably not if you want to interpret these interactions in the context of the data.

---

## Back to log-price vs. width and height

- Response variable: `log_price`
- Explanatory variables: Width and height

.pull-left[
<img src="w08-L16_files/figure-html/logprice-vs-width2-1.png" width="90%" style="display: block; margin: auto;" />
]
.pull-right[
<img src="w08-L16_files/figure-html/logprice-vs-height2-1.png" width="90%" style="display: block; margin: auto;" />
]

---

## Comparing models with numerical predictors

``` r
single_reg_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(log(price) ~ Width_in, data = pp)
glance(single_reg_fit)$adj.r.squared
```

```
## [1] 0.01908747
```

``` r
mult_reg_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(log(price) ~ Width_in + Height_in, data = pp)
glance(mult_reg_fit)$adj.r.squared
```

```
## [1] 0.02231559
```

---

## Recap

- The plot of residuals vs. predicted/fitted values can be used as a diagnostic plot.

--

- Transformations might help, and `\(\log(y)\)` is useful when the response is right-skewed and for variance stabilization.
  - In this case, the coefficients are interpreted in terms of percentage change.

--

- We learnt how to fit and interpret models with multiple predictors.
  - When working with categorical predictors, main effects mean parallel lines, while interaction effects mean different slopes.

--

- We learnt about Occam's razor and how to use adjusted `\(R^2\)` for model comparison.
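---

## Appendix: adjusted R-squared by hand

A small sketch verifying the adjusted `\(R^2\)` formula against `glance()` for the books interaction model (`n` and `k` below are ours: 15 books, and 3 predictors, namely `volume`, `coverpb`, and their interaction).

``` r
n <- 15
k <- 3
r2 <- glance(book_int_fit)$r.squared

# 1 - (1 - 0.9297137) * 14 / 11 = 0.9105447, matching adj.r.squared
1 - (1 - r2) * (n - 1) / (n - k - 1)
```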