class: center, middle, inverse, title-slide .title[ # Model building ] .subtitle[ ##
Introduction to Data Science ] .author[ ### University of Edinburgh ] .date[ ###
2024/2025 ] --- layout: true <div class="my-footer"> <span> University of Edinburgh </span> </div> --- ## Topics - Prediction (including ROC curves) - Feature engineering and recipes. - Workflows - combining recipes and models. - More on prediction (making decisions). --- class: middle # Prediction --- ## Continued - Building a spam filter - Data: Set of emails. We know if each email is spam/not and other features. - Use logistic regression to predict the probability that an incoming email is spam. - Use model selection to pick the model with the best predictive performance. -- - Reminder: split the data into a training dataset and a test dataset. - Use the training dataset to investigate and fit models. - Choose 1 or 2 models and test how well they predict the test dataset. - Helps us avoid overfitting. --- ## Spam filter dataset .pull-left-narrow[ - Data from 3921 emails and 21 variables on them - Outcome: whether the email is spam or not - Predictors: number of characters, whether the email had "Re:" in the subject, time at which email was sent, number of times the word "inherit" shows up in the email, etc. ] .pull-right-wide[ .small[ ``` r library(openintro) glimpse(email) ``` ``` ## Rows: 3,921 ## Columns: 21 ## $ spam <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ to_multiple <fct> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, … ## $ from <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … ## $ cc <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, … ## $ sent_email <fct> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, … ## $ time <dttm> 2012-01-01 06:16:41, 2012-01-01 07:03:59,… ## $ image <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, … ## $ attach <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, … ## $ dollar <dbl> 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ winner <fct> no, no, no, no, no, no, no, no, no, no, no… ## $ inherit <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ viagra <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ password <dbl> 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ num_char <dbl> 11.370, 10.504, 7.773, 13.256, 1.231, 1.09… ## $ line_breaks <int> 202, 202, 192, 255, 29, 25, 193, 237, 69, … ## $ format <fct> 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, … ## $ re_subj <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, … ## $ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ urgent_subj <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ exclaim_mess <dbl> 0, 1, 6, 48, 1, 1, 1, 18, 1, 0, 2, 1, 0, 1… ## $ number <fct> big, small, small, small, none, none, big,… ``` ] ] --- ## Performing the split ``` r # Fix random numbers by setting the seed # Enables analysis to be reproducible when random numbers are used set.seed(1116) # Put 80% of the data into the training set email_split <- initial_split(email, prop = 0.80) # Create data frames for the two sets: train_data <- training(email_split) test_data <- testing(email_split) ``` --- ## Fit a model to the training dataset ``` r email_fit <- logistic_reg() %>% set_engine("glm") %>% fit(spam ~ ., data = train_data, family = "binomial") ``` ``` ## Warning: glm.fit: fitted probabilities numerically 0 or 1 ## occurred ``` --- ## Categorical predictors <img src="w09-L18_files/figure-html/unnamed-chunk-4-1.png" width="75%" style="display: block; margin: auto;" /> --- ## `from` and `sent_email` .pull-left[ - `from`: Whether the message was listed as from anyone (this is usually set by default for regular outgoing email) ``` r train_data %>% count(spam, from) ``` ``` ## # A tibble: 3 × 3 ## spam from n ## <fct> <fct> <int> ## 1 0 1 2837 ## 2 1 0 3 ## 3 1 1 296 ``` ] .pull-right[ - `sent_email`: Indicator for whether the sender had been sent an email in the last 30 days ``` r train_data %>% count(spam, sent_email) ``` ``` ## # A tibble: 3 × 3 ## spam sent_email n ## <fct> <fct> <int> ## 1 0 0 1972 ## 2 0 1 865 ## 3 1 0 299 ``` ] --- ## `from` and `sent_email` .question[ If we predict "spam" using only "sent_email", what do you think is the prediction for an email with *sent_email = 1*? ] --- ## Numerical predictors .small[ ``` ## ## ── Variable type: numeric ────────────────────────────────────────────────────────────────────────── ## skim_variable spam n_missing complete_rate mean sd p0 p25 p50 p75 p100 ## 1 cc 0 0 1 0.393 2.62 0 0 0 0 68 ## 2 cc 1 0 1 0.388 3.25 0 0 0 0 50 ## 3 image 0 0 1 0.0536 0.503 0 0 0 0 20 *## 4 image 1 0 1 0.00334 0.0578 0 0 0 0 1 ## 5 attach 0 0 1 0.124 0.775 0 0 0 0 21 ## 6 attach 1 0 1 0.227 0.620 0 0 0 0 2 ## 7 dollar 0 0 1 1.56 5.33 0 0 0 0 64 ## 8 dollar 1 0 1 0.779 3.01 0 0 0 0 36 ## 9 inherit 0 0 1 0.0352 0.216 0 0 0 0 6 ## 10 inherit 1 0 1 0.0702 0.554 0 0 0 0 9 *## 11 viagra 0 0 1 0 0 0 0 0 0 0 ## 12 viagra 1 0 1 0.0268 0.463 0 0 0 0 8 ## 13 password 0 0 1 0.112 0.938 0 0 0 0 22 ## 14 password 1 0 1 0.0201 0.182 0 0 0 0 2 ## 15 num_char 0 0 1 11.4 14.9 0.003 1.97 6.83 15.7 190. ## 16 num_char 1 0 1 5.63 15.7 0.001 0.468 0.999 3.55 174. ## 17 line_breaks 0 0 1 247. 326. 2 42 138 318 4022 ## 18 line_breaks 1 0 1 108. 321. 1 14 23 66.5 3729 ## 19 exclaim_subj 0 0 1 0.0783 0.269 0 0 0 0 1 ## 20 exclaim_subj 1 0 1 0.0769 0.267 0 0 0 0 1 ## 21 exclaim_mess 0 0 1 6.68 50.2 0 0 1 5 1236 ## 22 exclaim_mess 1 0 1 8.75 88.4 0 0 0 1 1209 ``` ] --- ## Numerical predictors Closer look at `"image"`: number of images attached. ``` r train_data %>% count(spam, image) %>% print(n=12) ``` ``` ## # A tibble: 10 × 3 ## spam image n ## <fct> <dbl> <int> ## 1 0 0 2753 ## 2 0 1 56 ## 3 0 2 15 ## 4 0 3 8 ## 5 0 4 2 ## 6 0 5 1 ## 7 0 9 1 ## 8 0 20 1 ## 9 1 0 298 ## 10 1 1 1 ``` --- ## Fit a model to the training dataset ``` r email_fit <- logistic_reg() %>% set_engine("glm") %>% * fit(spam ~ . - from - sent_email - viagra - image, data = train_data, family = "binomial") ``` .small[ ``` r email_fit ``` ``` ## parsnip model object ## ## ## Call: stats::glm(formula = spam ~ . - from - sent_email - viagra - ## image, family = stats::binomial, data = data) ## ## Coefficients: ## (Intercept) to_multiple1 cc time attach dollar winneryes ## -9.818e+01 -2.537e+00 2.264e-02 7.363e-08 2.269e-01 -6.260e-02 2.159e+00 ## inherit password num_char line_breaks format1 re_subj1 exclaim_subj ## 4.559e-01 -7.173e-01 5.991e-02 -5.604e-03 -9.328e-01 -3.008e+00 1.173e-01 ## urgent_subj1 exclaim_mess numbersmall numberbig ## 3.584e+00 1.041e-02 -8.795e-01 -1.658e-01 ## ## Degrees of Freedom: 3135 Total (i.e. Null); 3118 Residual ## Null Deviance: 1974 ## Residual Deviance: 1460 AIC: 1496 ``` ] --- ## Predict outcome on the testing dataset ``` r predict(email_fit, test_data) ``` ``` ## # A tibble: 785 × 1 ## .pred_class ## <fct> ## 1 0 ## 2 0 ## 3 0 ## 4 0 ## 5 0 ## 6 0 ## # ℹ 779 more rows ``` --- ## Predict probabilities on the testing dataset ``` r email_pred <- predict(email_fit, test_data, type = "prob") %>% bind_cols(test_data %>% select(spam, time)) email_pred ``` ``` ## # A tibble: 785 × 4 ## .pred_0 .pred_1 spam time ## <dbl> <dbl> <fct> <dttm> ## 1 0.993 0.00665 0 2012-01-01 17:55:06 ## 2 0.998 0.00172 0 2012-01-01 19:38:32 ## 3 0.981 0.0188 0 2012-01-02 05:42:16 ## 4 0.999 0.00110 0 2012-01-02 15:12:51 ## 5 0.988 0.0124 0 2012-01-02 16:45:36 ## 6 0.826 0.174 0 2012-01-02 21:55:03 ## # ℹ 779 more rows ``` --- ## A closer look at predictions .pull-left-wide[ ``` r email_pred %>% arrange(desc(.pred_1)) %>% print(n = 10) ``` ``` ## # A tibble: 785 × 4 ## .pred_0 .pred_1 spam time ## <dbl> <dbl> <fct> <dttm> ## 1 0.0939 0.906 1 2012-02-13 12:15:00 *## 2 0.161 0.839 0 2012-01-27 20:05:06 ## 3 0.175 0.825 1 2012-03-01 05:40:27 ## 4 0.264 0.736 1 2012-03-17 10:13:27 ## 5 0.381 0.619 1 2012-03-21 12:33:12 ## 6 0.446 0.554 1 2012-03-06 11:46:20 ## 7 0.481 0.519 1 2012-02-10 12:53:18 *## 8 0.486 0.514 0 2012-01-22 21:23:41 ## 9 0.497 0.503 1 2012-02-08 08:00:05 *## 10 0.509 0.491 0 2012-02-17 01:57:16 ## # ℹ 775 more rows ``` ] --- ## Evaluate the performance **Receiver operating characteristic (ROC) curve**<sup>+</sup> which plots true positive rate vs. false positive rate (1 - specificity) .pull-left[ ``` r email_pred %>% roc_curve( truth = spam, .pred_1, event_level = "second" ) %>% autoplot() ``` ] .pull-right[ <img src="w09-L18_files/figure-html/unnamed-chunk-14-1.png" width="100%" style="display: block; margin: auto;" /> ] .footnote[ .small[ <sup>+</sup>Originally developed for operators of military radar receivers, hence the name. ] ] --- ## Evaluate the performance .question[ In this plot, what is the ideal point = (1-specificity, sensitivity) that we would like our curve to go through? ] --- ## Evaluate the performance Find the **area under the curve**: .pull-left[ ``` r email_pred %>% roc_auc( truth = spam, .pred_1, event_level = "second" ) ``` ``` ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 roc_auc binary 0.860 ``` ] .pull-right[ <img src="w09-L18_files/figure-html/unnamed-chunk-16-1.png" width="100%" style="display: block; margin: auto;" /> ] --- class: middle # Feature engineering --- ## Feature engineering - We prefer simple models when possible, but **parsimony** does not mean sacrificing accuracy (or predictive performance) in the interest of simplicity -- - Variables that go into the model and how they are represented are just as critical to success of the model -- - **Feature engineering** allows us to get creative with our predictors in an effort to make them more useful for our model (to increase its predictive performance) --- ## Same training and testing sets as before ``` r # Fix random numbers by setting the seed # Enables analysis to be reproducible when random numbers are used set.seed(1116) # Put 80% of the data into the training set email_split <- initial_split(email, prop = 0.80) # Create data frames for the two sets: train_data <- training(email_split) test_data <- testing(email_split) ``` --- ## A simple approach: `mutate()` ``` r train_data %>% mutate( date = date(time), dow = wday(time), month = month(time) ) %>% select(time, date, dow, month) %>% sample_n(size = 5) # shuffle to show a variety ``` ``` ## # A tibble: 5 × 4 ## time date dow month ## <dttm> <date> <dbl> <dbl> ## 1 2012-03-15 18:51:35 2012-03-15 5 3 ## 2 2012-03-03 14:24:02 2012-03-03 7 3 ## 3 2012-01-18 16:55:23 2012-01-18 4 1 ## 4 2012-02-25 04:08:59 2012-02-25 7 2 ## 5 2012-01-11 13:18:51 2012-01-11 4 1 ``` --- ## Modeling workflow, revisited - Create a **recipe** for feature engineering steps to be applied to the training data -- - Fit the model to the training data after these steps have been applied -- - Using the model estimates from the training data, predict outcomes for the test data -- - Evaluate the performance of the model on the test data --- class: middle # Building recipes --- ## Initiate a recipe ``` r email_rec <- recipe( spam ~ ., # formula data = train_data # data to use for cataloguing names and types of variables ) summary(email_rec) ``` .xsmall[ ``` ## # A tibble: 21 × 4 ## variable type role source ## <chr> <list> <chr> <chr> ## 1 to_multiple <chr [3]> predictor original ## 2 from <chr [3]> predictor original ## 3 cc <chr [2]> predictor original ## 4 sent_email <chr [3]> predictor original ## 5 time <chr [1]> predictor original ## 6 image <chr [2]> predictor original ## 7 attach <chr [2]> predictor original ## 8 dollar <chr [2]> predictor original ## 9 winner <chr [3]> predictor original ## 10 inherit <chr [2]> predictor original ## 11 viagra <chr [2]> predictor original ## 12 password <chr [2]> predictor original ## 13 num_char <chr [2]> predictor original ## 14 line_breaks <chr [2]> predictor original ## 15 format <chr [3]> predictor original ## 16 re_subj <chr [3]> predictor original ## 17 exclaim_subj <chr [2]> predictor original ## 18 urgent_subj <chr [3]> predictor original ## 19 exclaim_mess <chr [2]> predictor original ## 20 number <chr [3]> predictor original ## 21 spam <chr [3]> outcome original ``` ] --- ## Remove certain variables ``` r email_rec <- email_rec %>% step_rm(from, sent_email, image, viagra) ``` .small[ ] --- ## Feature engineer date ``` r email_rec <- email_rec %>% step_date(time, features = c("dow", "month")) %>% step_rm(time) ``` .small[ ] --- ## Discretize numeric variables ``` r email_rec <- email_rec %>% step_cut(cc, attach, dollar, breaks = c(0, 1)) %>% step_cut(inherit, password, breaks = c(0, 1, 5, 10, 20)) ``` .small[ ] --- ## Create dummy variables ``` r email_rec <- email_rec %>% step_dummy(all_nominal(), -all_outcomes()) ``` .small[ ] --- ## Remove zero variance variables Variables that contain only a single value ``` r email_rec <- email_rec %>% step_zv(all_predictors()) ``` .small[ ] --- ## All in one place ``` r email_rec <- recipe(spam ~ ., data = email) %>% step_rm(from, sent_email, image, viagra) %>% step_date(time, features = c("dow", "month")) %>% step_rm(time) %>% step_cut(cc, attach, dollar, breaks = c(0, 1)) %>% step_cut(inherit, password, breaks = c(0, 1, 5, 10, 20, 100)) %>% step_dummy(all_nominal(), -all_outcomes()) %>% step_zv(all_predictors()) ``` --- class: middle # Building workflows --- ## Define model ``` r email_mod <- logistic_reg() %>% set_engine("glm") email_mod ``` ``` ## Logistic Regression Model Specification (classification) ## ## Computational engine: glm ``` --- ## Define workflow **Workflows** bring together models and recipes so that they can be easily applied to both the training and test data. ``` r email_wflow <- workflow() %>% add_model(email_mod) %>% add_recipe(email_rec) ``` .small[ ``` ## ══ Workflow ════════════════════════════════════════════════════════════════════════════════════════ ## Preprocessor: Recipe ## Model: logistic_reg() ## ## ── Preprocessor ──────────────────────────────────────────────────────────────────────────────────── ## 7 Recipe Steps ## ## • step_rm() ## • step_date() ## • step_rm() ## • step_cut() ## • step_cut() ## • step_dummy() ## • step_zv() ## ## ── Model ─────────────────────────────────────────────────────────────────────────────────────────── ## Logistic Regression Model Specification (classification) ## ## Computational engine: glm ``` ] --- ## Fit model to training data ``` r email_fit <- email_wflow %>% fit(data = train_data) ``` --- .small[ ``` r tidy(email_fit) %>% print(n = 30) ``` ``` ## # A tibble: 30 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -0.809 0.255 -3.17 1.53e- 3 ## 2 num_char 0.0497 0.0246 2.02 4.38e- 2 ## 3 line_breaks -0.00544 0.00141 -3.86 1.13e- 4 ## 4 exclaim_subj -0.163 0.272 -0.599 5.49e- 1 ## 5 exclaim_mess 0.00943 0.00191 4.93 8.02e- 7 ## 6 to_multiple_X1 -2.60 0.348 -7.48 7.37e-14 ## 7 cc_X.1.68. -0.186 0.461 -0.404 6.86e- 1 ## 8 attach_X.1.21. 1.70 0.329 5.19 2.14e- 7 ## 9 dollar_X.1.64. 0.271 0.215 1.26 2.07e- 1 ## 10 winner_yes 2.13 0.427 4.98 6.30e- 7 ## 11 inherit_X.1.5. -9.29 753. -0.0123 9.90e- 1 ## 12 inherit_X.5.10. 2.52 1.45 1.73 8.28e- 2 ## 13 password_X.1.5. -1.74 0.748 -2.32 2.01e- 2 ## 14 password_X.5.10. -12.6 499. -0.0253 9.80e- 1 ## 15 password_X.10.20. -13.9 803. -0.0173 9.86e- 1 ## 16 password_X.20.100. -13.9 1029. -0.0135 9.89e- 1 ## 17 format_X1 -0.918 0.158 -5.82 5.91e- 9 ## 18 re_subj_X1 -2.91 0.437 -6.67 2.55e-11 ## 19 urgent_subj_X1 3.50 1.07 3.25 1.14e- 3 ## 20 number_small -0.896 0.166 -5.40 6.67e- 8 ## 21 number_big -0.193 0.248 -0.775 4.38e- 1 ## 22 time_dow_Mon -0.0956 0.305 -0.314 7.54e- 1 ## 23 time_dow_Tue 0.197 0.276 0.714 4.75e- 1 ## 24 time_dow_Wed -0.0726 0.275 -0.264 7.92e- 1 ## 25 time_dow_Thu -0.0777 0.281 -0.277 7.82e- 1 ## 26 time_dow_Fri 0.00328 0.281 0.0117 9.91e- 1 ## 27 time_dow_Sat 0.251 0.295 0.848 3.96e- 1 ## 28 time_month_Feb 0.740 0.178 4.15 3.27e- 5 ## 29 time_month_Mar 0.543 0.179 3.03 2.41e- 3 ## 30 time_month_Apr -13.2 628. -0.0211 9.83e- 1 ``` ] --- ## Make predictions for test data ``` r email_pred <- predict(email_fit, test_data, type = "prob") %>% bind_cols(test_data) email_pred ``` ``` ## # A tibble: 785 × 23 ## .pred_0 .pred_1 spam to_multiple from cc sent_email ## <dbl> <dbl> <fct> <fct> <fct> <int> <fct> ## 1 0.995 0.00486 0 1 1 0 1 ## 2 0.999 0.00133 0 0 1 1 1 ## 3 0.968 0.0322 0 0 1 0 0 ## 4 0.999 0.000613 0 0 1 1 0 ## 5 0.993 0.00674 0 0 1 4 0 ## 6 0.868 0.132 0 0 1 0 0 ## # ℹ 779 more rows ## # ℹ 16 more variables: time <dttm>, image <dbl>, attach <dbl>, ## # dollar <dbl>, winner <fct>, inherit <dbl>, viagra <dbl>, ## # password <dbl>, num_char <dbl>, line_breaks <int>, ## # format <fct>, re_subj <fct>, exclaim_subj <dbl>, ## # urgent_subj <fct>, exclaim_mess <dbl>, number <fct> ``` --- ## Evaluate the performance .pull-left[ ``` r email_pred %>% roc_curve( truth = spam, .pred_1, event_level = "second" ) %>% autoplot() ``` ] .pull-right[ <img src="w09-L18_files/figure-html/unnamed-chunk-37-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Evaluate the performance .pull-left[ ``` r email_pred %>% roc_auc( truth = spam, .pred_1, event_level = "second" ) ``` ``` ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 roc_auc binary 0.858 ``` ] .pull-right[ <img src="w09-L18_files/figure-html/unnamed-chunk-39-1.png" width="100%" style="display: block; margin: auto;" /> ] --- class: middle # Making decisions --- ## Cutoff probability: 0.5 .panelset[ .panel[.panel-name[Output] Suppose we decide to label an email as spam if the model predicts the probability of spam to be **more than 0.5**. | | Email is not spam| Email is spam| |:-----------------------|-----------------:|-------------:| |Email labelled not spam | 711| 60| |Email labelled spam | 6| 8| ] .panel[.panel-name[Code] ``` r cutoff_prob <- 0.5 email_pred %>% mutate( spam = if_else(spam == 1, "Email is spam", "Email is not spam"), spam_pred = if_else(.pred_1 > cutoff_prob, "Email labelled spam", "Email labelled not spam") ) %>% count(spam_pred, spam) %>% pivot_wider(names_from = spam, values_from = n) %>% kable(col.names = c("", "Email is not spam", "Email is spam")) ``` ] ] --- ## Cutoff probability: 0.25 Suppose we decide to label an email as spam if the model predicts the probability of spam to be **more than 0.25**. | | Email is not spam| Email is spam| |:-----------------------|-----------------:|-------------:| |Email labelled not spam | 666| 35| |Email labelled spam | 51| 33| --- ## Cutoff probability: 0.75 Suppose we decide to label an email as spam if the model predicts the probability of spam to be **more than 0.75**. | | Email is not spam| Email is spam| |:-----------------------|-----------------:|-------------:| |Email labelled not spam | 716| 65| |Email labelled spam | 1| 3| --- ## Recap - ROC curve and AUC - Feature engineering - Workflows and recipes - Making decisions