class: center, middle, inverse, title-slide .title[ # Introducing linear models ] .subtitle[ ##
Introduction to Data Science ] .author[ ### University of Edinburgh ] .date[ ###
2024/2025 ] --- layout: true <div class="my-footer"> <span> University of Edinburgh </span> </div> --- ## Topics - What is a model? - How can we use models? - How can we fit linear models in R? - How should fitted linear models be interpreted? - Differences between models using numerical input data and categorical input data. --- class: middle # What is a model? --- ## Modelling - Use models to explain the relationship between variables and to make predictions - For now we will focus on **linear** models (but remember there are *many* *many* other types of models too!) .pull-left[ <img src="w08-L15_files/figure-html/chunk1-1.png" width="90%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="w08-L15_files/figure-html/chunk2-1.png" width="90%" style="display: block; margin: auto;" /> ] --- class: middle # Data: Paris Paintings --- ## Paris Paintings ``` r pp <- read_csv("data/paris-paintings.csv", na = c("n/a", "", "NA")) ``` - Source: Printed catalogues of 28 auction sales in Paris, 1764 - 1780 - Data curators Sandra van Ginhoven and Hilary Coe Cronheim (who were PhD students in the Duke Art, Law, and Markets Initiative at the time of putting together this dataset) translated and tabulated the catalogues - 3393 paintings, their prices, and descriptive details from sales catalogues over 60 variables --- ## Paris auction market <img src="img/auction-trend-paris.png" width="60%" style="display: block; margin: auto;" /> .footnote[ .small[ Plot credit: Sandra van Ginhoven ] ] --- ## Auctions today .center[ <iframe width="840" height="473" src="https://www.youtube.com/embed/apaE1Q7r4so" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> ] --- ## Auctions back in the day <img src="img/old-auction.png" width="65%" style="display: block; margin: auto;" /> .footnote[ .small[ Pierre-Antoine de Machy, Public Sale at the Hôtel Bullion, Musée Carnavalet, Paris (18th century) ] ] --- ## Départ pour la chasse <img src="img/depart-pour-la-chasse.png" width="65%" style="display: block; margin: auto;" /> --- ## Auction catalog text .pull-left[ <img src="img/auction-catalogue.png" width="60%" style="display: block; margin: auto;" /> ] .pull-right[ .small[ Two paintings very rich in composition, of a beautiful execution, and whose merit is very remarkable, each 17 inches 3 lines high, 23 inches wide; the first, painted on wood, comes from the Cabinet of Madame la Comtesse de Verrue; it represents a departure for the hunt: it shows in the front a child on a white horse, a man who gives the horn to gather the dogs, a falconer and other figures nicely distributed across the width of the painting; two horses drinking from a fountain; on the right in the corner a lovely country house topped by a terrace, on which people are at the table, others who play instruments; trees and fabriques pleasantly enrich the background. ] ] --- <img src="img/painting1.png" width="60%" style="display: block; margin: auto;" /><img src="img/painting2.png" width="60%" style="display: block; margin: auto;" /><img src="img/painting3.png" width="60%" style="display: block; margin: auto;" /> --- ``` r pp %>% filter(name == "R1777-89a") %>% glimpse() ``` .small[ .pull-left[ ``` ## Rows: 1 ## Columns: 61 ## $ name <chr> "R1777-89a" ## $ sale <chr> "R1777" ## $ lot <chr> "89" ## $ position <dbl> 0.3755274 ## $ dealer <chr> "R" ## $ year <dbl> 1777 ## $ origin_author <chr> "D/FL" ## $ origin_cat <chr> "D/FL" ## $ school_pntg <chr> "D/FL" ## $ diff_origin <dbl> 0 ## $ logprice <dbl> 8.575462 ## $ price <dbl> 5300 ## $ count <dbl> 1 ## $ subject <chr> "D\x8epart pour la chasse" ## $ authorstandard <chr> "Wouwerman, Philips" ## $ artistliving <dbl> 0 ## $ authorstyle <chr> NA ## $ author <chr> "Philippe Wouwermans" ## $ winningbidder <chr> "Langlier, Jacques for Poullain, Anto… ## $ winningbiddertype <chr> "DC" ## $ endbuyer <chr> "C" ... ``` ] .pull-right[ ``` ... ## $ Interm <dbl> 1 ## $ type_intermed <chr> "D" ## $ Height_in <dbl> 17.25 ## $ Width_in <dbl> 23 ## $ Surface_Rect <dbl> 396.75 ## $ Diam_in <dbl> NA ## $ Surface_Rnd <dbl> NA ## $ Shape <chr> "squ_rect" ## $ Surface <dbl> 396.75 ## $ material <chr> "bois" ## $ mat <chr> "b" ## $ materialCat <chr> "wood" ## $ quantity <dbl> 1 ## $ nfigures <dbl> 0 ## $ engraved <dbl> 0 ## $ original <dbl> 0 ## $ prevcoll <dbl> 1 ## $ othartist <dbl> 0 ## $ paired <dbl> 1 ## $ figures <dbl> 0 ## $ finished <dbl> 0 ... ``` ] ] --- class: middle # Modeling the relationship between variables --- ## Models as functions - We can represent relationships between variables using **functions** - A function is a mathematical concept: the relationship between an output and one or more inputs - Plug in the inputs and receive back the output - Example: The formula `\(y = 3x + 7\)` is a function with input `\(x\)` and output `\(y\)`. If `\(x\)` is `\(5\)`, `\(y\)` is `\(22\)`, `\(y = 3 \times 5 + 7 = 22\)` --- ## Heights .small[ ``` r ggplot(data = pp, aes(x = Height_in)) + geom_histogram(binwidth = 5) + labs(x = "Height, in inches", y = NULL) ``` <img src="w08-L15_files/figure-html/height-dist-1.png" width="60%" style="display: block; margin: auto;" /> ] --- ## Widths .small[ ``` r ggplot(data = pp, aes(x = Width_in)) + geom_histogram(binwidth = 5) + labs(x = "Width, in inches", y = NULL) ``` <img src="w08-L15_files/figure-html/width-dist-1.png" width="60%" style="display: block; margin: auto;" /> ] --- ## Height as a function of width .panelset[ .panel[.panel-name[Plot] <img src="w08-L15_files/figure-html/chunk13-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ``` r ggplot(data = pp, aes(x = Width_in, y = Height_in)) + geom_point() + * geom_smooth(method = "lm") + labs( title = "Height vs. width of paintings", subtitle = "Paris auctions, 1764 - 1780", x = "Width (inches)", y = "Height (inches)" ) ``` ``` ## Warning: Removed 258 rows containing non-finite outside the scale range ## (`stat_smooth()`). ``` ``` ## Warning: Removed 258 rows containing missing values or values outside the ## scale range (`geom_point()`). ``` ] ] --- ## ... without the measure of uncertainty .panelset[ .panel[.panel-name[Plot] <img src="w08-L15_files/figure-html/chunk15-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ``` r ggplot(data = pp, aes(x = Width_in, y = Height_in)) + geom_point() + geom_smooth(method = "lm", * se = FALSE) + labs( title = "Height vs. width of paintings", subtitle = "Paris auctions, 1764 - 1780", x = "Width (inches)", y = "Height (inches)" ) ``` ] ] --- ## Other smoothing methods: gam .panelset[ .panel[.panel-name[Plot] <img src="w08-L15_files/figure-html/chunk17-1.png" width="50%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ``` r ggplot(data = pp, aes(x = Width_in, y = Height_in)) + geom_point() + * geom_smooth(method = "gam", se = FALSE, color = "#8E2C90") + labs( title = "Height vs. width of paintings", subtitle = "Paris auctions, 1764 - 1780", x = "Width (inches)", y = "Height (inches)" ) ``` ] ] .footnote[GAM = Generalized additive models] --- ## Other smoothing methods: loess .panelset[ .panel[.panel-name[Plot] <img src="w08-L15_files/figure-html/chunk20-1.png" width="50%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ``` r ggplot(data = pp, aes(x = Width_in, y = Height_in)) + geom_point() + * geom_smooth(method = "loess", se = FALSE, color = "#8E2C90") + labs( title = "Height vs. width of paintings", subtitle = "Paris auctions, 1764 - 1780", x = "Width (inches)", y = "Height (inches)" ) ``` ] ] .footnote[LOESS = Locally Estimated Scatterplot Smoothing] --- ## Vocabulary - **Response variable:** Variable whose behavior or variation you are trying to understand, on the y-axis -- - **Explanatory variables:** Other variables that you want to use to explain the variation in the response, on the x-axis --- ## Vocabulary (..continued) - **Predicted value:** Output of the **model function** - The model function gives the typical (expected) value of the response variable *conditioning* on the explanatory variables <img src="w08-L15_files/figure-html/height-width-plot-02-1.png" width="50%" style="display: block; margin: auto;" /> --- ## Vocabulary (..continued) - **Residuals:** A measure of how far each case is from its predicted value (based on a particular model) - Residual = Observed value - Predicted value - Tells how far above/below the expected value each case is <img src="w08-L15_files/figure-html/height-width-plot-03-1.png" width="50%" style="display: block; margin: auto;" /> --- ## Residuals .panelset[ .panel[.panel-name[Plot] <img src="w08-L15_files/figure-html/chunk24-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] .small[ ``` r ht_wt_fit <- linear_reg() %>% set_engine("lm") %>% fit(Height_in ~ Width_in, data = pp) ht_wt_fit_tidy <- tidy(ht_wt_fit$fit) ht_wt_fit_aug <- augment(ht_wt_fit$fit) %>% mutate(res_cat = ifelse(.resid > 0, TRUE, FALSE)) ggplot(data = ht_wt_fit_aug) + geom_point(aes(x = Width_in, y = Height_in, color = res_cat)) + geom_line(aes(x = Width_in, y = .fitted), linewidth = 0.75) + geom_point(mapping = aes(x = Width_in,y = .fitted), color = "#E48957")+ geom_segment(mapping = aes(x = Width_in, y = Height_in, xend = Width_in, yend = .fitted), color = "#E48957", alpha = 0.4) + labs( title = "Height vs. width of paintings", subtitle = "Paris auctions, 1764 - 1780", x = "Width (inches)", y = "Height (inches)" ) + guides(color = FALSE) + scale_color_manual(values = c("maroon", "forestgreen")) + geom_text(aes(x = 0, y = 150), label = "Positive residual", color = "forestgreen", hjust = 0, size = 8) + geom_text(aes(x = 150, y = 25), label = "Negative residual", color = "maroon", hjust = 0, size = 8) ``` ] ] ] --- ## Extrapolation (..or extending regression lines) Please don't do it. <img src="img/xkcd_extrapolating.png" width="50%" style="display: block; margin: auto;" /> .footnote[ Source: XKCD, [Extrapolating](https://xkcd.com/605/) ] --- ## Models - upsides and downsides - Models can sometimes reveal patterns that are not evident in a graph of the data. This is a great advantage of modeling over simple visual inspection of data. - There is a real risk, however, that a model is imposing structure that is not really there on the scatter plot of the data, just as people imagine animal shapes in the stars. A skeptical approach is always warranted. --- ## Variation around the model... is just as important as the model, if not more! *Statistics is the explanation of variation in the context of what remains unexplained.* - The scatter plot of the residuals suggests that there might be other factors that account for large parts of painting-to-painting variability, or perhaps just that randomness plays a big role. - Adding more explanatory variables to a model can sometimes usefully reduce the size of the scatter plot of the residuals around the predicted values. (We'll talk more about this later.) --- class: middle # Fitting and interpreting models --- class: middle # Models with numerical explanatory variables --- ## Paris Paintings: Predict height from width Linear model with one predictor: `$$\widehat{height}_{i} = \beta_0 + \beta_1 \times width_{i}$$` <img src="w08-L15_files/figure-html/height-width-plot-1.png" width="60%" style="display: block; margin: auto;" /> --- <img src="img/tidymodels.png" width="98%" style="display: block; margin: auto;" /> --- ## Step 1: Specify model ``` r linear_reg() ``` ``` ## Linear Regression Model Specification (regression) ## ## Computational engine: lm ``` --- ## Step 2: Set model fitting *engine* ``` r linear_reg() %>% set_engine("lm") # lm: linear model ``` ``` ## Linear Regression Model Specification (regression) ## ## Computational engine: lm ``` --- ## Step 3: Fit model & estimate parameters ... using **formula syntax** ``` r linear_reg() %>% set_engine("lm") %>% fit(Height_in ~ Width_in, data = pp) ``` ``` ## parsnip model object ## ## ## Call: ## stats::lm(formula = Height_in ~ Width_in, data = data) ## ## Coefficients: ## (Intercept) Width_in ## 3.6214 0.7808 ``` --- ## A closer look at model output ``` ## parsnip model object ## ## ## Call: ## stats::lm(formula = Height_in ~ Width_in, data = data) ## ## Coefficients: ## (Intercept) Width_in ## 3.6214 0.7808 ``` .large[ `$$\widehat{height}_{i} = 3.6214 + 0.7808 \times width_{i}$$` ] --- ## A tidy look at model output ``` r linear_reg() %>% set_engine("lm") %>% fit(Height_in ~ Width_in, data = pp) %>% tidy() ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 3.62 0.254 14.3 8.82e-45 ## 2 Width_in 0.781 0.00950 82.1 0 ``` .large[ `$$\widehat{height}_{i} = 3.62 + 0.781 \times width_{i}$$` ] --- ## Interpretation: slope and intercept .large[ `$$\widehat{height}_{i} = 3.62 + 0.781 \times width_{i}$$` ] -- - **Slope:** For each additional inch the painting is wider, the height is *expected* to be higher, *on average*, by 0.781 inches. -- - **Intercept:** Paintings that are 0 inches wide are *expected* to be 3.62 inches high, *on average*. (Does this make sense?) --- ## Correlation does not imply causation Remember this when interpreting model coefficients <img src="img/cell_phones.png" width="90%" style="display: block; margin: auto;" /> .footnote[ Source: XKCD, [Cell phones](https://xkcd.com/925/) ] --- class: middle # Parameter estimation --- ## Linear model with a single predictor - We're interested in `\(\beta_0\)` (population parameter for the intercept) and `\(\beta_1\)` (population parameter for the slope) in the following model: `$$\hat{y}_{i} = \beta_0 + \beta_1~x_{i}$$` -- - Tough luck, you can't have them... -- - So we use sample statistics to estimate them: `$$\hat{y}_{i} = b_0 + b_1~x_{i}$$` --- ## Least squares regression - The regression line minimizes the sum of squared residuals. -- - If `\(e_i = y_i - \hat{y}_i\)`, then, the regression line minimizes `\(\sum_{i = 1}^n e_i^2\)`. --- ## Visualizing residuals <img src="w08-L15_files/figure-html/vis-res-1-1.png" width="70%" style="display: block; margin: auto;" /> --- ## Visualizing residuals (cont.) <img src="w08-L15_files/figure-html/vis-res-2-1.png" width="70%" style="display: block; margin: auto;" /> --- ## Visualizing residuals (cont.) <img src="w08-L15_files/figure-html/vis-res-3-1.png" width="70%" style="display: block; margin: auto;" /> --- ## Properties of least squares regression - The regression line goes through the center of mass point, the coordinates corresponding to average `\(x\)` and average `\(y\)`, `\((\bar{x}, \bar{y})\)`: `$$\bar{y} = b_0 + b_1 \bar{x} ~ \rightarrow ~ b_0 = \bar{y} - b_1 \bar{x}$$` -- - The slope has the same sign as the correlation coefficient: `\(b_1 = r \frac{s_y}{s_x}\)` -- - The sum of the residuals is zero: `\(\sum_{i = 1}^n e_i = 0\)` -- - The residuals and `\(x\)` values are uncorrelated --- class: middle # Models with categorical explanatory variables --- ## Categorical predictor with 2 levels .pull-left-narrow[ .small[ ``` ## # A tibble: 3,393 × 3 ## name Height_in landsALL ## <chr> <dbl> <dbl> ## 1 L1764-2 37 0 ## 2 L1764-3 18 0 ## 3 L1764-4 13 1 ## 4 L1764-5a 14 1 ## 5 L1764-5b 14 1 ## 6 L1764-6 7 0 ## 7 L1764-7a 6 0 ## 8 L1764-7b 6 0 ## 9 L1764-8 15 0 ## 10 L1764-9a 9 0 ## 11 L1764-9b 9 0 ## 12 L1764-10a 16 1 ## 13 L1764-10b 16 1 ## 14 L1764-10c 16 1 ## 15 L1764-11 20 0 ## 16 L1764-12a 14 1 ## 17 L1764-12b 14 1 ## 18 L1764-13a 15 1 ## 19 L1764-13b 15 1 ## 20 L1764-14 37 0 ## # ℹ 3,373 more rows ``` ] ] .pull-right-wide[ - `landsALL = 0`: No landscape features - `landsALL = 1`: Some landscape features ] --- ## Height & landscape features ``` r linear_reg() %>% set_engine("lm") %>% fit(Height_in ~ factor(landsALL), data = pp) %>% tidy() ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 22.7 0.328 69.1 0 ## 2 factor(landsALL)1 -5.65 0.532 -10.6 7.97e-26 ``` --- ## Height & landscape features `$$\widehat{Height_{in}} = 22.7 - 5.645~landsALL$$` - **Slope:** Paintings with landscape features are expected, on average, to be 5.645 inches shorter than paintings that without landscape features - Compares baseline level (`landsALL = 0`) to the other level (`landsALL = 1`) - **Intercept:** Paintings that don't have landscape features are expected, on average, to be 22.7 inches tall --- ## Relationship between height and school ``` r linear_reg() %>% set_engine("lm") %>% fit(Height_in ~ school_pntg, data = pp) %>% tidy() ``` ``` ## # A tibble: 7 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 14.0 10.0 1.40 0.162 ## 2 school_pntgD/FL 2.33 10.0 0.232 0.816 ## 3 school_pntgF 10.2 10.0 1.02 0.309 ## 4 school_pntgG 1.65 11.9 0.139 0.889 ## 5 school_pntgI 10.3 10.0 1.02 0.306 ## 6 school_pntgS 30.4 11.4 2.68 0.00744 ## 7 school_pntgX 2.87 10.3 0.279 0.780 ``` --- ## Dummy variables ``` ## # A tibble: 7 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 14.0 10.0 1.40 0.162 ## 2 school_pntgD/FL 2.33 10.0 0.232 0.816 ## 3 school_pntgF 10.2 10.0 1.02 0.309 ## 4 school_pntgG 1.65 11.9 0.139 0.889 ## 5 school_pntgI 10.3 10.0 1.02 0.306 ## 6 school_pntgS 30.4 11.4 2.68 0.00744 ## 7 school_pntgX 2.87 10.3 0.279 0.780 ``` - When the categorical explanatory variable has many levels, they're encoded to **dummy variables** - Each coefficient describes the expected difference between heights in that particular school compared to the baseline level --- ## Categorical predictor with 3+ levels .pull-left-wide[ <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> school_pntg </th> <th style="text-align:center;"> D_FL </th> <th style="text-align:center;"> F </th> <th style="text-align:center;"> G </th> <th style="text-align:center;"> I </th> <th style="text-align:center;"> S </th> <th style="text-align:center;"> X </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> A </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 255) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 255) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 255) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 255) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 255) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 255) !important;"> 0 </td> </tr> <tr> <td style="text-align:left;"> D/FL </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(122, 209, 81, 255) !important;"> 1 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 255) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 255) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 255) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 255) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 255) !important;"> 0 </td> </tr> <tr> <td style="text-align:left;"> F </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 255) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(122, 209, 81, 255) !important;"> 1 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 255) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 255) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 255) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 255) !important;"> 0 </td> </tr> <tr> <td style="text-align:left;"> G </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 255) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 255) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(122, 209, 81, 255) !important;"> 1 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 255) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 255) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 255) !important;"> 0 </td> </tr> <tr> <td style="text-align:left;"> I </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 255) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 255) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 255) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(122, 209, 81, 255) !important;"> 1 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 255) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 255) !important;"> 0 </td> </tr> <tr> <td style="text-align:left;"> S </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 255) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 255) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 255) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 255) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(122, 209, 81, 255) !important;"> 1 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 255) !important;"> 0 </td> </tr> <tr> <td style="text-align:left;"> X </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 255) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 255) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 255) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 255) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 255) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(122, 209, 81, 255) !important;"> 1 </td> </tr> </tbody> </table> ] .pull-right-narrow[ .small[ ``` ## # A tibble: 3,393 × 3 ## name Height_in school_pntg ## <chr> <dbl> <chr> ## 1 L1764-2 37 F ## 2 L1764-3 18 I ## 3 L1764-4 13 D/FL ## 4 L1764-5a 14 F ## 5 L1764-5b 14 F ## 6 L1764-6 7 I ## 7 L1764-7a 6 F ## 8 L1764-7b 6 F ## 9 L1764-8 15 I ## 10 L1764-9a 9 D/FL ## 11 L1764-9b 9 D/FL ## 12 L1764-10a 16 X ## 13 L1764-10b 16 X ## 14 L1764-10c 16 X ## 15 L1764-11 20 D/FL ## 16 L1764-12a 14 D/FL ## 17 L1764-12b 14 D/FL ## 18 L1764-13a 15 D/FL ## 19 L1764-13b 15 D/FL ## 20 L1764-14 37 F ## # ℹ 3,373 more rows ``` ] ] `\begin{align} \widehat{ Height_{in}} = 14 &+ 2.32 \times {school\_pntgD/FL} + 10.2 \times { school\_pntgF} \\ &+ 1.65 \times { school\_pntgG} + 10.3 \times { school\_pntgI} \\ &+ 30.4 \times { school\_pntgS} + 2.87 \times { school\_pntgX} \end{align}` --- ## Relationship between height and school .small[ ``` ## # A tibble: 7 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 14.0 10.0 1.40 0.162 ## 2 school_pntgD/FL 2.33 10.0 0.232 0.816 ## 3 school_pntgF 10.2 10.0 1.02 0.309 ## 4 school_pntgG 1.65 11.9 0.139 0.889 ## 5 school_pntgI 10.3 10.0 1.02 0.306 ## 6 school_pntgS 30.4 11.4 2.68 0.00744 ## 7 school_pntgX 2.87 10.3 0.279 0.780 ``` - **Austrian school (A)** paintings are expected, on average, to be **14 inches** tall. - **Dutch/Flemish school (D/FL)** paintings are expected, on average, to be **2.33 inches taller** than *Austrian school* paintings. - **French school (F)** paintings are expected, on average, to be **10.2 inches taller** than *Austrian school* paintings. - **German school (G)** paintings are expected, on average, to be **1.65 inches taller** than *Austrian school* paintings. - **Italian school (I)** paintings are expected, on average, to be **10.3 inches taller** than *Austrian school* paintings. - **Spanish school (S)** paintings are expected, on average, to be **30.4 inches taller** than *Austrian school* paintings. - Paintings whose school is **unknown (X)** are expected, on average, to be **2.87 inches taller** than *Austrian school* paintings. ] --- ## Recap: How do we use models? - Explanation: Characterize the relationship between `\(y\)` and `\(x\)` via - *slopes* for numerical explanatory variables or - *differences* for categorical explanatory variables - Prediction: Plug in `\(x\)`, get the predicted `\(y\)`