class: center, middle, inverse, title-slide .title[ # Quantifying uncertainty ] .subtitle[ ##
Introduction to Data Science ] .author[ ### University of Edinburgh ] .date[ ###
2024/2025 ] --- layout: true <div class="my-footer"> <span> University of Edinburgh </span> </div> --- ## Topics - Why is quantification of uncertainty important? - Confidence intervals - Bootstrapping for confidence intervals - Bootstrapping with Tidymodels - Precision vs Accuracy --- class: middle # Recap and motivation --- ## Data - Family income and gift aid data from a random sample of fifty students in the freshman class of Elmhurst College in Illinois, USA - Gift aid is financial aid that does not need to be paid back, as opposed to a loan <img src="w10-L20_files/figure-html/unnamed-chunk-1-1.png" width="50%" style="display: block; margin: auto;" /> .footnote[ .small[ The data come from the openintro package: [`elmhurst`](http://openintrostat.github.io/openintro/reference/elmhurst.html). ] ] --- ## Linear model .pull-left[ .small[ ``` r linear_reg() %>% set_engine("lm") %>% fit(gift_aid ~ family_income, data = elmhurst) %>% tidy() ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 24.3 1.29 18.8 8.28e-24 ## 2 family_income -0.0431 0.0108 -3.98 2.29e- 4 ``` ] ] .pull-right[ <img src="w10-L20_files/figure-html/unnamed-chunk-3-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Interpreting the slope .pull-left-wide[ ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 24.3 1.29 18.8 8.28e-24 ## 2 family_income -0.0431 0.0108 -3.98 2.29e- 4 ``` ] .pull-right-narrow[ <img src="w10-L20_files/figure-html/unnamed-chunk-4-1.png" width="100%" style="display: block; margin: auto;" /> ] -- For each additional $1,000 of family income, we would expect students to receive a net difference of 1,000 * (-0.0431) = -$43.10 in aid on average, i.e. $43.10 less in gift aid, on average. --- class: middle .hand[ .light-blue[ exactly $43.10 for all students at this school?! ] ] --- class: middle # Inference --- ## Statistical inference ... is the process of using sample data to make conclusions about the underlying population the sample came from <img src="img/photo-1571942676516-bcab84649e44.png" width="60%" style="display: block; margin: auto;" /> --- ## Estimation So far we have done lots of estimation (mean, median, slope, etc.), i.e. - used data from samples to calculate sample statistics - which can then be used as estimates for population parameters --- .question[ If you want to catch a fish, do you prefer a spear or a net? ] <br> .pull-left[ <img src="img/spear.png" width="80%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="img/net.png" width="80%" style="display: block; margin: auto;" /> ] --- .question[ If you want to estimate a population parameter, do you prefer to report a range of values the parameter might be in, or a single value? ] <br> -- - If we report a point estimate, we probably won’t hit the exact population parameter - If we report a range of plausible values we have a good shot at capturing the parameter --- .center[ <img src="img/gallup_nhs.jpeg" width="80%" style="display: block; margin: auto;" /> ] .footnote[ .midi[ Source: Gallup. [Britons' Satisfaction With Healthcare, Transportation Falters](https://news.gallup.com/poll/505682/britons-satisfaction-healthcare-transportation-falters.aspx), 23/11/23. ] ] --- class: middle # Confidence intervals --- ## Confidence intervals A plausible range of values for the population parameter is a **confidence interval**. -- - In order to construct a confidence interval we need to quantify the variability of our sample statistic -- - For example, if we want to construct a confidence interval for a population slope, we need to come up with a plausible range of values around our observed sample slope -- - This range will depend on how precise and how accurate our sample mean is as an estimate of the population mean -- - Quantifying this requires a measurement of how much we would expect the sample population to vary from sample to sample --- .question[ Suppose we split the class in half down the middle of the classroom and ask each student their heights. Then, we calculate the mean height of students on each side of the classroom. Would you expect these two means to be exactly equal, close but not equal, or wildly different? ] -- <br> .question[ Suppose you randomly sample 50 students and 5 of them are left handed. If you were to take another random sample of 50 students, how many would you expect to be left handed? Would you be surprised if only 3 of them were left handed? Would you be surprised if 40 of them were left handed? ] --- ## Quantifying the variability of slopes We can quantify the variability of sample statistics using - simulation: via bootstrapping (now) or - theory: via Central Limit Theorem (future stat courses!) ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 24.3 1.29 18.8 8.28e-24 ## 2 family_income -0.0431 0.0108 -3.98 2.29e- 4 ``` --- class: middle # Bootstrapping --- ## Bootstrapping .pull-left-wide[ - _"pulling oneself up by one’s bootstraps"_: accomplishing an impossible task without any outside help - **Impossible task:** estimating a population parameter using data from only the given sample - **Note:** Notion of saying something about a population parameter using only information from an observed sample is the crux of statistical inference ] .pull-right-narrow[ .huge[ 🥾 ] ] --- ## Observed sample <img src="w10-L20_files/figure-html/unnamed-chunk-11-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Bootstrap population Generated assuming there are more students like the ones in the observed sample... <img src="w10-L20_files/figure-html/unnamed-chunk-12-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Bootstrapping scheme 1. Take a bootstrap sample - a random sample taken **with replacement** from the original sample, of the same size as the original sample -- 2. Calculate the bootstrap statistic - a statistic such as mean, median, proportion, slope, etc. computed on the bootstrap samples -- 3. Repeat steps (1) and (2) many times to create a bootstrap distribution - a distribution of bootstrap statistics -- 4. Calculate the bounds of the XX% confidence interval as the middle XX% of the bootstrap distribution --- ## Bootstrap sample 1 ``` r elmhurtst_boot_1 <- elmhurst %>% slice_sample(n = 50, replace = TRUE) ``` <img src="w10-L20_files/figure-html/unnamed-chunk-14-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Bootstrap sample 2 ``` r elmhurtst_boot_2 <- elmhurst %>% slice_sample(n = 50, replace = TRUE) ``` <img src="w10-L20_files/figure-html/unnamed-chunk-16-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Bootstrap sample 3 ``` r elmhurtst_boot_3 <- elmhurst %>% slice_sample(n = 50, replace = TRUE) ``` <img src="w10-L20_files/figure-html/unnamed-chunk-18-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Bootstrap sample 4 ``` r elmhurtst_boot_4 <- elmhurst %>% slice_sample(n = 50, replace = TRUE) ``` <img src="w10-L20_files/figure-html/unnamed-chunk-20-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Bootstrap samples 1 - 4 <img src="w10-L20_files/figure-html/unnamed-chunk-21-1.png" width="60%" style="display: block; margin: auto;" /> --- class: middle .hand[ .light-blue[ we could keep going... ] ] --- ## Many many samples... <img src="w10-L20_files/figure-html/unnamed-chunk-22-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Slopes of bootstrap samples <img src="w10-L20_files/figure-html/unnamed-chunk-23-1.png" width="60%" style="display: block; margin: auto;" /> --- ## 95% confidence interval <img src="w10-L20_files/figure-html/unnamed-chunk-24-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Interpreting the slope, take two ``` ## # A tibble: 2 × 6 ## term .lower .estimate .upper .alpha .method ## <chr> <dbl> <dbl> <dbl> <dbl> <chr> ## 1 (Intercept) 21.8 24.4 26.8 0.05 percentile ## 2 family_income -0.0695 -0.0445 -0.0219 0.05 percentile ``` **We are 95% confident that** for each additional $1,000 of family income, we would expect students to receive $69.50 to $21.90 less in gift aid, on average. --- ## Code ``` r # set a seed set.seed(1234) # take 1000 bootstrap samples elmhurst_boot <- bootstraps(elmhurst, times = 1000) # for each sample # fit a model and save output in model column # tidy model output and save in coef_info column elmhurst_models <- elmhurst_boot %>% mutate( model = map(splits, ~ lm(gift_aid ~ family_income, data = .)), coef_info = map(model, tidy) ) # unnest coef_info (for intercept and slope) elmhurst_coefs <- elmhurst_models %>% unnest(coef_info) # calculate 95% (default) percentile interval int_pctl(elmhurst_models, coef_info) ``` --- class: middle # Rent in Edinburgh --- ## Rent in Edinburgh .question[ Take a guess! How much did a typical 3 BR flat in Edinburgh rent for in 2020? ] --- ## Sample Fifteen 3 BR flats in Edinburgh were randomly selected on rightmove.co.uk. ``` r library(tidyverse) edi_3br <- read_csv2("data/edi-3br.csv") # ; separated ``` .small[ ``` ## # A tibble: 15 × 4 ## flat_id rent title address ## <chr> <dbl> <chr> <chr> ## 1 flat_01 825 3 bedroom apartment to rent Burnhead Grove, Edin… ## 2 flat_02 2400 3 bedroom flat to rent Simpson Loan, Quarte… ## 3 flat_03 1900 3 bedroom flat to rent FETTES ROW, NEW TOWN… ## 4 flat_04 1500 3 bedroom apartment to rent Eyre Crescent, Edinb… ## 5 flat_05 3250 3 bedroom flat to rent Walker Street, Edinb… ## 6 flat_06 2145 3 bedroom flat to rent George Street, City … ## # ℹ 9 more rows ``` ] --- ## Observed sample <img src="w10-L20_files/figure-html/unnamed-chunk-29-1.png" width="80%" style="display: block; margin: auto;" /> --- ## Observed sample Sample mean ≈ £1895. <br> <img src="img/rent-bootsamp.png" width="90%" style="display: block; margin: auto;" /> --- ## Bootstrap population Generated assuming there are more flats like the ones in the observed sample... Population mean = ?? <img src="img/rent-bootpop.png" width="65%" style="display: block; margin: auto;" /> --- ## Bootstrapping scheme 1. Take a bootstrap sample - a random sample taken **with replacement** from the original sample, of the same size as the original sample 2. Calculate the bootstrap statistic - a statistic such as mean, median, proportion, slope, etc. computed on the bootstrap samples 3. Repeat steps (1) and (2) many times to create a bootstrap distribution - a distribution of bootstrap statistics 4. Calculate the bounds of the XX% confidence interval as the middle XX% of the bootstrap distribution --- class: middle # Bootstrapping with tidymodels --- ## Generate bootstrap means ``` r edi_3br %>% # specify the variable of interest specify(response = rent) ``` --- ## Generate bootstrap means ``` r edi_3br %>% # specify the variable of interest specify(response = rent) # generate 15000 bootstrap samples generate(reps = 15000, type = "bootstrap") ``` --- ## Generate bootstrap means ``` r edi_3br %>% # specify the variable of interest specify(response = rent) # generate 15000 bootstrap samples generate(reps = 15000, type = "bootstrap") # calculate the mean of each bootstrap sample calculate(stat = "mean") ``` --- ## Generate bootstrap means ``` r # save resulting bootstrap distribution boot_df <- edi_3br %>% # specify the variable of interest specify(response = rent) %>% # generate 15000 bootstrap samples generate(reps = 15000, type = "bootstrap") %>% # calculate the mean of each bootstrap sample calculate(stat = "mean") ``` --- ## The bootstrap sample .question[ How many observations are there in `boot_df`? What does each observation represent? ] ``` r boot_df ``` ``` ## Response: rent (numeric) ## # A tibble: 15,000 × 2 ## replicate stat ## <int> <dbl> ## 1 1 1793. ## 2 2 1938. ## 3 3 2175 ## 4 4 2159. ## 5 5 2084 ## 6 6 1761 ## # ℹ 14,994 more rows ``` --- ## Visualize the bootstrap distribution ``` r ggplot(data = boot_df, mapping = aes(x = stat)) + geom_histogram(binwidth = 100) + labs(title = "Bootstrap distribution of means") ``` <img src="w10-L20_files/figure-html/unnamed-chunk-38-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Calculate the confidence interval A 95% confidence interval is bounded by the middle 95% of the bootstrap distribution. ``` r boot_df %>% summarize(lower = quantile(stat, 0.025), upper = quantile(stat, 0.975)) ``` ``` ## # A tibble: 1 × 2 ## lower upper ## <dbl> <dbl> ## 1 1603. 2213. ``` --- ## Visualize the confidence interval <img src="w10-L20_files/figure-html/unnamed-chunk-41-1.png" width="70%" style="display: block; margin: auto;" /> --- ## Interpret the confidence interval .question[ The 95% confidence interval for the mean rent of three bedroom flats in Edinburgh was calculated as (1603, 2213). Which of the following is the correct interpretation of this interval? **(a)** 95% of the time the mean rent of three bedroom flats in this sample is between £1603 and £2213. **(b)** 95% of all three bedroom flats in Edinburgh have rents between £1603 and £2213. **(c)** We are 95% confident that the mean rent of all three bedroom flats is between £1603 and £2213. **(d)** We are 95% confident that the mean rent of three bedroom flats in this sample is between £1603 and £2213. ] --- class: middle # Accuracy vs. precision --- ## Confidence level **We are 95% confident that ...** - Suppose we took many samples from the original population and built a 95% confidence interval based on each sample. - Then about 95% of those intervals would contain the true population parameter. -- <img src="img/ci-1.gif" width="30%" style="display: block; margin: auto;" /> .footnote[ .small[ Source: Minitab blog [Understanding Hypothesis Tests: Confidence Intervals and Confidence Levels](https://blog.minitab.com/en/adventures-in-statistics-2/understanding-hypothesis-tests-confidence-intervals-and-confidence-levels), 22 Nov 2023. ] ] --- ## Commonly used confidence levels .question[ Which line (orange dash/dot, blue dash, green dot) represents which confidence level? ] <img src="w10-L20_files/figure-html/unnamed-chunk-43-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Precision vs. accuracy .question[ If we want to be very certain that we capture the population parameter, should we use a wider or a narrower interval? What drawbacks are associated with using a wider interval? ] -- <img src="img/garfield.png" width="60%" style="display: block; margin: auto;" /> -- .question[ How can we get best of both worlds -- high precision and high accuracy? ] --- ## Changing confidence level .question[ How would you modify the following code to calculate a 90% confidence interval? How would you modify it for a 99% confidence interval? ] ``` r edi_3br %>% specify(response = rent) %>% generate(reps = 15000, type = "bootstrap") %>% calculate(stat = "mean") %>% summarize(lower = quantile(stat, 0.025), upper = quantile(stat, 0.975)) ``` --- ## Recap - Sample statistic `\(\ne\)` population parameter, but if the sample is good, it can be a good estimate - We report the estimate with a confidence interval, and the width of this interval depends on the variability of sample statistics from different samples from the population - Since we can't continue sampling from the population, we bootstrap from the one sample we have to estimate sampling variability - We can do this for any sample statistic: - For a mean: `calculate(stat = "mean")` - Learn about calculating bootstrap intervals for other statistics in the quiz