class: center, middle, inverse, title-slide .title[ # Predicting categorical outcomes ] .subtitle[ ##
Introduction to Data Science ] .author[ ### University of Edinburgh ] .date[ ###
2024/2025 ] --- layout: true <div class="my-footer"> <span> University of Edinburgh </span> </div> --- ## Topics - Modelling categorical outcomes - Brief introduction to GLMs - Logistic regression - Sensitivity and specificity - Using models for prediction --- class: middle # Modelling categorical outcomes --- ## Spam filters .pull-left-narrow[ - Data from 3921 emails and 21 variables on them - Outcome: whether the email is spam or not - Predictors: number of characters, whether the email had "Re:" in the subject, time at which email was sent, number of times the word "inherit" shows up in the email, etc. ] .pull-right-wide[ .small[ ``` r library(openintro) glimpse(email) ``` ``` ## Rows: 3,921 ## Columns: 21 ## $ spam <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ to_multiple <fct> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, … ## $ from <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … ## $ cc <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, … ## $ sent_email <fct> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, … ## $ time <dttm> 2012-01-01 06:16:41, 2012-01-01 07:03:59,… ## $ image <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, … ## $ attach <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, … ## $ dollar <dbl> 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ winner <fct> no, no, no, no, no, no, no, no, no, no, no… ## $ inherit <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ viagra <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ password <dbl> 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ num_char <dbl> 11.370, 10.504, 7.773, 13.256, 1.231, 1.09… ## $ line_breaks <int> 202, 202, 192, 255, 29, 25, 193, 237, 69, … ## $ format <fct> 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, … ## $ re_subj <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, … ## $ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ urgent_subj <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ exclaim_mess <dbl> 0, 1, 6, 48, 1, 1, 1, 18, 1, 0, 2, 1, 0, 1… ## $ number <fct> big, small, small, small, none, none, big,… ``` ] ] --- .question[ Would you expect longer or shorter emails to be spam? ] --- .question[ Would you expect longer or shorter emails to be spam? ] .pull-left[ <img src="w09-L17_files/figure-html/unnamed-chunk-2-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ ``` ## # A tibble: 2 × 2 ## spam mean_num_char ## <fct> <dbl> ## 1 0 11.3 ## 2 1 5.44 ``` ] --- .question[ Would you expect emails that have subjects starting with "Re:", "RE:", "re:", or "rE:" to be spam or not? ] --- .question[ Would you expect emails that have subjects starting with "Re:", "RE:", "re:", or "rE:" to be spam or not? ] <img src="w09-L17_files/figure-html/unnamed-chunk-4-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Modelling spam - Both number of characters and whether the message has "re:" in the subject might be related to whether the email is spam. How do we come up with a model that will let us explore this relationship? -- - For simplicity, we'll focus on the number of characters (`num_char`) as predictor, but the model we describe can be expanded to take multiple predictors as well. --- ## Modelling spam This isn't something we can reasonably fit a linear model to -- we need something different! 
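To see why, here is a sketch (for illustration only) of what forcing an ordinary linear model onto the 0/1 outcome would look like; nothing constrains its fitted values to the 0 to 1 range a probability requires:

``` r
library(tidymodels)

# For illustration only: recode the spam factor as numeric 0/1 and fit a line
email_numeric <- email %>%
  mutate(spam_num = as.numeric(spam) - 1)  # factor levels "0"/"1" become 0/1

linear_reg() %>%
  set_engine("lm") %>%
  fit(spam_num ~ num_char, data = email_numeric) %>%
  tidy()
```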
<img src="w09-L17_files/figure-html/unnamed-chunk-5-1.png" width="70%" style="display: block; margin: auto;" /> --- ## Framing the problem - We can treat each outcome (spam and not) as successes and failures arising from separate Bernoulli trials - Bernoulli trial: a random experiment with exactly two possible outcomes, "success" and "failure", in which the probability of success is the same every time the experiment is conducted -- - Each Bernoulli trial can have a separate probability of success $$ y_i ∼ Bern(p_i) $$ -- - We can then use the predictor variables to model that probability of success, `\(p_i\)` -- - We can't just use a linear model for `\(p_i\)` (since `\(p_i\)` must be between 0 and 1) but we can transform the linear model to have the appropriate range --- ## Generalized linear models - This is a very general way of addressing many problems in regression and the resulting models are called **generalized linear models (GLMs)** -- - Logistic regression is just one example --- ## Three characteristics of GLMs All GLMs have the following three characteristics: 1. A probability distribution describing a generative model for the outcome variable -- 2. A linear model: `$$\eta = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k$$` -- 3. A link function that relates the linear model to the parameter of the outcome distribution --- class: middle # Logistic regression --- ## Logistic regression - Logistic regression is a GLM used to model a binary categorical outcome using numerical and categorical predictors -- - To finish specifying the Logistic model we just need to define a reasonable link function that connects `\(\eta_i\)` to `\(p_i\)`: logit function -- - **Logit function:** For `\(0\le p \le 1\)` `$$logit(p) = \log\left(\frac{p}{1-p}\right)$$` --- ## Logit function, visualised <img src="w09-L17_files/figure-html/unnamed-chunk-6-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Properties of the logit - The logit function takes a value between 0 and 1 and maps it to a value between `\(-\infty\)` and `\(\infty\)` -- - Inverse logit (logistic) function: `$$g^{-1}(x) = \frac{\exp(x)}{1+\exp(x)} = \frac{1}{1+\exp(-x)}$$` -- - The inverse logit function takes a value between `\(-\infty\)` and `\(\infty\)` and maps it to a value between 0 and 1 -- - This formulation is also useful for interpreting the model, since the logit can be interpreted as the log odds of a success -- more on this later --- ## Logistic function, visualised <img src="w09-L17_files/figure-html/unnamed-chunk-7-1.png" width="60%" style="display: block; margin: auto;" /> --- ## The logistic regression model - Based on the three GLM criteria we have - `\(y_i \sim \text{Bern}(p_i)\)` - `\(\eta_i = \beta_0+\beta_1 x_{1,i} + \cdots + \beta_n x_{n,i}\)` - `\(\text{logit}(p_i) = \eta_i\)` -- - From which we get `$$p_i = \frac{\exp(\beta_0+\beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}{1+\exp(\beta_0+\beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}$$` --- ## Modeling spam In R we fit a GLM in the same way as a linear model except we - specify the model with `logistic_reg()` - use `"glm"` instead of `"lm"` as the engine - define `family = "binomial"` for the link function to be used in the model -- ``` r spam_fit <- logistic_reg() %>% set_engine("glm") %>% fit(spam ~ num_char, data = email, family = "binomial") tidy(spam_fit) ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -1.80 0.0716 -25.1 2.04e-139 ## 2 num_char -0.0621 0.00801 -7.75 
---

## Modelling spam

In R we fit a GLM in the same way as a linear model except we

- specify the model with `logistic_reg()`
- use `"glm"` instead of `"lm"` as the engine
- define `family = "binomial"` for the link function to be used in the model

--

``` r
spam_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(spam ~ num_char, data = email, family = "binomial")

tidy(spam_fit)
```

```
## # A tibble: 2 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  -1.80     0.0716     -25.1  2.04e-139
## 2 num_char     -0.0621   0.00801     -7.75 9.50e- 15
```

---

## Spam model

``` r
tidy(spam_fit)
```

```
## # A tibble: 2 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  -1.80     0.0716     -25.1  2.04e-139
## 2 num_char     -0.0621   0.00801     -7.75 9.50e- 15
```

--

Model:
`$$\log\left(\frac{p}{1-p}\right) = -1.80-0.0621\times \text{num_char}$$`

---

## P(spam) for an email with 2000 characters

(`num_char` is recorded in thousands of characters, so an email with 2000 characters has `num_char` = 2.)

`$$\log\left(\frac{p}{1-p}\right) = -1.80-0.0621\times 2$$`

--

`$$\frac{p}{1-p} = \exp(-1.9242) = 0.15 \rightarrow p = 0.15 \times (1 - p)$$`

--

`$$p = 0.15 - 0.15p \rightarrow 1.15p = 0.15$$`

--

`$$p = 0.15 / 1.15 = 0.13$$`

---

.question[
What is the probability that an email with 15000 characters is spam? What about an email with 40000 characters?
]

--

.pull-left[
<img src="w09-L17_files/figure-html/spam-predict-viz-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[
- .light-blue[2K chars: P(spam) = 0.13]
- .yellow[15K chars: P(spam) = 0.06]
- .green[40K chars: P(spam) = 0.01]
]

---

.question[
Would you prefer an email with 2000 characters to be labelled as spam or not? How about 40,000 characters?
]

<img src="w09-L17_files/figure-html/unnamed-chunk-10-1.png" width="60%" style="display: block; margin: auto;" />

---

class: middle

# Sensitivity and specificity

---

## False positive and negative

|                         | Email is spam                 | Email is not spam             |
|-------------------------|-------------------------------|-------------------------------|
| Email labelled spam     | True positive                 | False positive (Type 1 error) |
| Email labelled not spam | False negative (Type 2 error) | True negative                 |

--

- False negative rate = P(Labelled not spam | Email spam) = FN / (TP + FN)
- False positive rate = P(Labelled spam | Email not spam) = FP / (FP + TN)

---

## Sensitivity and specificity

|                         | Email is spam                 | Email is not spam             |
|-------------------------|-------------------------------|-------------------------------|
| Email labelled spam     | True positive                 | False positive (Type 1 error) |
| Email labelled not spam | False negative (Type 2 error) | True negative                 |

--

- Sensitivity = P(Labelled spam | Email spam) = TP / (TP + FN)
  - Sensitivity = 1 − False negative rate
- Specificity = P(Labelled not spam | Email not spam) = TN / (FP + TN)
  - Specificity = 1 − False positive rate

---

.question[
If you were designing a spam filter, would you want sensitivity and specificity to be high or low? What are the trade-offs associated with each decision?
]

---

class: middle

# Prediction and classification

---

## Goal: Building a spam filter

- Data: a set of emails for which we know whether each one is spam, along with a number of other features
- Use logistic regression to predict the probability that an incoming email is spam
- Use model selection to pick the model with the best predictive performance

--

- Building a model to predict the probability that an email is spam is only half of the battle! We also need a decision rule about which emails get flagged as spam (e.g. what probability should we use as our cutoff?)
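For instance, flagging anything above a hypothetical 0.5 cutoff might look like this sketch (reusing the single-predictor `spam_fit` from earlier):

``` r
# Predicted probabilities from the fitted model; .pred_1 is P(spam)
predict(spam_fit, new_data = email, type = "prob") %>%
  # Flag emails whose predicted probability exceeds the chosen cutoff
  mutate(label = if_else(.pred_1 > 0.5, "spam", "not spam"))
```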
-- - A simple approach: choose a single threshold probability and any email that exceeds that probability is flagged as spam --- ## A multiple regression approach .panelset[ .panel[.panel-name[Output] .small[ ``` ## # A tibble: 22 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -9.09e+1 9.80e+3 -0.00928 9.93e- 1 ## 2 to_multiple1 -2.68e+0 3.27e-1 -8.21 2.25e-16 ## 3 from1 -2.19e+1 9.80e+3 -0.00224 9.98e- 1 ## 4 cc 1.88e-2 2.20e-2 0.855 3.93e- 1 ## 5 sent_email1 -2.07e+1 3.87e+2 -0.0536 9.57e- 1 ## 6 time 8.48e-8 2.85e-8 2.98 2.92e- 3 ## 7 image -1.78e+0 5.95e-1 -3.00 2.73e- 3 ## 8 attach 7.35e-1 1.44e-1 5.09 3.61e- 7 ## 9 dollar -6.85e-2 2.64e-2 -2.59 9.64e- 3 ## 10 winneryes 2.07e+0 3.65e-1 5.67 1.41e- 8 ## 11 inherit 3.15e-1 1.56e-1 2.02 4.32e- 2 ## 12 viagra 2.84e+0 2.22e+3 0.00128 9.99e- 1 ## 13 password -8.54e-1 2.97e-1 -2.88 4.03e- 3 ## 14 num_char 5.06e-2 2.38e-2 2.13 3.35e- 2 ## 15 line_breaks -5.49e-3 1.35e-3 -4.06 4.91e- 5 ## 16 format1 -6.14e-1 1.49e-1 -4.14 3.53e- 5 ## 17 re_subj1 -1.64e+0 3.86e-1 -4.25 2.16e- 5 ## 18 exclaim_subj 1.42e-1 2.43e-1 0.585 5.58e- 1 ## 19 urgent_subj1 3.88e+0 1.32e+0 2.95 3.18e- 3 ## 20 exclaim_mess 1.08e-2 1.81e-3 5.98 2.23e- 9 ## 21 numbersmall -1.19e+0 1.54e-1 -7.74 9.62e-15 ## 22 numberbig -2.95e-1 2.20e-1 -1.34 1.79e- 1 ``` ] ] .panel[.panel-name[Code] ``` r logistic_reg() %>% set_engine("glm") %>% fit(spam ~ ., data = email, family = "binomial") %>% tidy() %>% print(n = 22) ``` ``` ## Warning: glm.fit: fitted probabilities numerically 0 or 1 ## occurred ``` ] ] --- ## Prediction - The mechanics of prediction is **easy**: - Plug in values of predictors to the model equation - Calculate the predicted value of the response variable, `\(\hat{y}\)` -- - Getting it right is **hard**! - There is no guarantee the model estimates you have are correct - Or that your model will perform as well with new data as it did with your sample data --- ## Underfitting and overfitting <img src="w09-L17_files/figure-html/unnamed-chunk-12-1.png" width="70%" style="display: block; margin: auto;" /> --- ## Spending our data - Several steps to create a useful model: parameter estimation, model selection, performance assessment, etc. 
- Doing all of this on the entire data we have available can lead to **overfitting** - Allocate specific subsets of data for different tasks, as opposed to allocating the largest possible amount to the model parameter estimation only (which is what we've done so far) --- class: middle # Splitting data --- ## Splitting data - **Training set:** - Sandbox for model building - Spend most of your time using the training set to develop the model - Majority of the data (usually 80%) - **Testing set:** - Held in reserve to determine efficacy of one or two chosen models - Critical to look at it once, otherwise it becomes part of the modeling process - Remainder of the data (usually 20%) --- ## Performing the split ``` r # Fix random numbers by setting the seed # Enables analysis to be reproducible when random numbers are used set.seed(1114) # Put 80% of the data into the training set email_split <- initial_split(email, prop = 0.80) # Create data frames for the two sets: train_data <- training(email_split) test_data <- testing(email_split) ``` --- ## Peek at the split .small[ .pull-left[ ``` r glimpse(train_data) ``` ``` ## Rows: 3,136 ## Columns: 21 ## $ spam <fct> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ to_multiple <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ from <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … ## $ cc <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 8, 0, … ## $ sent_email <fct> 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, … ## $ time <dttm> 2012-03-11 18:56:18, 2012-01-18 16:03:54,… ## $ image <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ attach <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ dollar <dbl> 0, 14, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10, 4, 0… ## $ winner <fct> no, no, no, no, no, no, yes, no, no, no, n… ## $ inherit <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ viagra <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ password <dbl> 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, … ## $ num_char <dbl> 10.508, 13.684, 21.031, 0.110, 0.833, 0.12… ## $ line_breaks <int> 234, 305, 282, 6, 28, 7, 479, 167, 13, 643… ## $ format <fct> 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, … ## $ re_subj <fct> 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, … ## $ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, … ## $ urgent_subj <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ exclaim_mess <dbl> 2, 10, 0, 2, 0, 0, 5, 1, 0, 1, 0, 8, 0, 12… ## $ number <fct> big, small, small, none, small, small, big… ``` ] .pull-right[ ``` r glimpse(test_data) ``` ``` ## Rows: 785 ## Columns: 21 ## $ spam <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ to_multiple <fct> 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, … ## $ from <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … ## $ cc <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 2, 2, 1, … ## $ sent_email <fct> 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, … ## $ time <dttm> 2012-01-01 06:16:41, 2012-01-01 07:03:59,… ## $ image <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ attach <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, … ## $ dollar <dbl> 0, 0, 4, 0, 0, 0, 0, 21, 0, 0, 0, 0, 0, 0,… ## $ winner <fct> no, no, no, no, no, no, no, no, no, no, no… ## $ inherit <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, … ## $ viagra <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ password <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ num_char <dbl> 11.370, 10.504, 7.773, 13.256, 4.837, 7.42… ## $ line_breaks <int> 202, 202, 192, 255, 193, 237, 68, 560, 69,… ## $ format <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 
0, 0, 0, 1, …
## $ re_subj      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, …
## $ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ urgent_subj  <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ exclaim_mess <dbl> 0, 1, 6, 48, 1, 18, 0, 3, 0, 1, 1, 1, 0, 6…
## $ number       <fct> big, small, small, small, big, small, smal…
```
]
]

---

## Recap

- How to model categorical outcomes
- Introduction to GLMs
- Logistic regression
- Sensitivity and specificity
- Splitting the data into training and testing sets
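---

## Looking ahead: using the split

As a pointer to where this is going, here is a minimal sketch that combines the pieces above, assuming the `train_data`/`test_data` split and a hypothetical 0.5 cutoff:

``` r
# Fit the single-predictor model on the training set only
spam_fit_train <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(spam ~ num_char, data = train_data, family = "binomial")

# Predict P(spam) for the held-out test set and apply the cutoff
test_pred <- predict(spam_fit_train, new_data = test_data, type = "prob") %>%
  bind_cols(test_data %>% select(spam)) %>%
  mutate(label = factor(if_else(.pred_1 > 0.5, "1", "0"), levels = c("0", "1")))

# Sensitivity and specificity on the test set; spam ("1") is the event
# of interest, which is the second level of the factor
sens(test_pred, truth = spam, estimate = label, event_level = "second")
spec(test_pred, truth = spam, estimate = label, event_level = "second")
```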