Model building

.title[
# Model building
]
.subtitle[
## <br><br> Introduction to Data Science
]
.author[
### University of Edinburgh
]
.date[
### <br> 2024/2025
]

---

layout: true
  
<div class="my-footer">
<span>
University of Edinburgh
</span>
</div>

---
## Topics

- Prediction (including ROC curves) 
- Feature engineering and recipes.
- Workflows - combining recipes and models.
- More on prediction (making decisions).

---

# Prediction

---

## Continued - Building a spam filter

- Data: Set of emails. We know if each email is spam/not and other features.

- Use logistic regression to predict the probability that an incoming email is spam.

- Use model selection to pick the model with the best predictive performance.

- Reminder: split the data into a training dataset and a test dataset.

- Use the training dataset to investigate and fit models.

- Choose 1 or 2 models and test how well they predict the test dataset.

- Helps us avoid overfitting.

---

## Spam filter dataset

.pull-left-narrow[
- Data from 3921 emails and 21 variables on them
- Outcome: whether the email is spam or not
- Predictors: number of characters, whether the email had "Re:" in the subject, time at which email was sent, number of times the word "inherit" shows up in the email, etc.
]
.pull-right-wide[
.small[

``` r
library(openintro)
glimpse(email)
```

```
## Rows: 3,921
## Columns: 21
## $ spam         <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ to_multiple  <fct> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, …
## $ from         <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ cc           <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, …
## $ sent_email   <fct> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, …
## $ time         <dttm> 2012-01-01 06:16:41, 2012-01-01 07:03:59,…
## $ image        <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ attach       <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ dollar       <dbl> 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ winner       <fct> no, no, no, no, no, no, no, no, no, no, no…
## $ inherit      <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ viagra       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ password     <dbl> 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ num_char     <dbl> 11.370, 10.504, 7.773, 13.256, 1.231, 1.09…
## $ line_breaks  <int> 202, 202, 192, 255, 29, 25, 193, 237, 69, …
## $ format       <fct> 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, …
## $ re_subj      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, …
## $ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ urgent_subj  <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ exclaim_mess <dbl> 0, 1, 6, 48, 1, 1, 1, 18, 1, 0, 2, 1, 0, 1…
## $ number       <fct> big, small, small, small, none, none, big,…
```
]
]

---

## Performing the split

``` r
# Fix random numbers by setting the seed 
# Enables analysis to be reproducible when random numbers are used 
set.seed(1116)

# Put 80% of the data into the training set 
email_split <- initial_split(email, prop = 0.80)

# Create data frames for the two sets:
train_data <- training(email_split)
test_data  <- testing(email_split)
```

---

## Fit a model to the training dataset

``` r
email_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(spam ~ ., data = train_data, family = "binomial")
```

```
## Warning: glm.fit: fitted probabilities numerically 0 or 1
## occurred
```

---

## Categorical predictors

---

## `from` and `sent_email`

.pull-left[
- `from`: Whether the message was listed as from anyone (this is usually set by default for regular outgoing email)

``` r
train_data %>%
  count(spam, from)
```

```
## # A tibble: 3 × 3
##   spam  from      n
##   <fct> <fct> <int>
## 1 0     1      2837
## 2 1     0         3
## 3 1     1       296
```
]
.pull-right[
- `sent_email`: Indicator for whether the sender had been sent an email in the last 30 days

``` r
train_data %>%
  count(spam, sent_email)
```

```
## # A tibble: 3 × 3
##   spam  sent_email     n
##   <fct> <fct>      <int>
## 1 0     0           1972
## 2 0     1            865
## 3 1     0            299
```
]

---

## `from` and `sent_email`

.question[
If we predict "spam" using only "sent_email", what do you think is the prediction for an email with *sent_email = 1*?
]

---

## Numerical predictors

```
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
##    skim_variable spam n_missing complete_rate      mean       sd    p0    p25     p50    p75  p100
##  1 cc            0            0             1   0.393     2.62   0      0       0       0      68 
##  2 cc            1            0             1   0.388     3.25   0      0       0       0      50 
##  3 image         0            0             1   0.0536    0.503  0      0       0       0      20 
*##  4 image         1            0             1   0.00334   0.0578 0      0       0       0       1 
##  5 attach        0            0             1   0.124     0.775  0      0       0       0      21 
##  6 attach        1            0             1   0.227     0.620  0      0       0       0       2 
##  7 dollar        0            0             1   1.56      5.33   0      0       0       0      64 
##  8 dollar        1            0             1   0.779     3.01   0      0       0       0      36 
##  9 inherit       0            0             1   0.0352    0.216  0      0       0       0       6 
## 10 inherit       1            0             1   0.0702    0.554  0      0       0       0       9 
*## 11 viagra        0            0             1   0         0      0      0       0       0       0 
## 12 viagra        1            0             1   0.0268    0.463  0      0       0       0       8 
## 13 password      0            0             1   0.112     0.938  0      0       0       0      22 
## 14 password      1            0             1   0.0201    0.182  0      0       0       0       2 
## 15 num_char      0            0             1  11.4      14.9    0.003  1.97    6.83   15.7   190.
## 16 num_char      1            0             1   5.63     15.7    0.001  0.468   0.999   3.55  174.
## 17 line_breaks   0            0             1 247.      326.     2     42     138     318    4022 
## 18 line_breaks   1            0             1 108.      321.     1     14      23      66.5  3729 
## 19 exclaim_subj  0            0             1   0.0783    0.269  0      0       0       0       1 
## 20 exclaim_subj  1            0             1   0.0769    0.267  0      0       0       0       1 
## 21 exclaim_mess  0            0             1   6.68     50.2    0      0       1       5    1236 
## 22 exclaim_mess  1            0             1   8.75     88.4    0      0       0       1    1209
```
]

---
## Numerical predictors

Closer look at `"image"`: number of images attached.

``` r
 train_data %>%
     count(spam, image) %>%
 print(n=12)
```

```
## # A tibble: 10 × 3
##    spam  image     n
##    <fct> <dbl> <int>
##  1 0         0  2753
##  2 0         1    56
##  3 0         2    15
##  4 0         3     8
##  5 0         4     2
##  6 0         5     1
##  7 0         9     1
##  8 0        20     1
##  9 1         0   298
## 10 1         1     1
```

---

## Fit a model to the training dataset

``` r
email_fit <- logistic_reg() %>%
  set_engine("glm") %>%
* fit(spam ~ . - from - sent_email - viagra - image, data = train_data, family = "binomial")
```

``` r
email_fit
```

```
## parsnip model object
## 
## 
## Call:  stats::glm(formula = spam ~ . - from - sent_email - viagra - 
##     image, family = stats::binomial, data = data)
## 
## Coefficients:
##  (Intercept)  to_multiple1            cc          time        attach        dollar     winneryes  
##   -9.818e+01    -2.537e+00     2.264e-02     7.363e-08     2.269e-01    -6.260e-02     2.159e+00  
##      inherit      password      num_char   line_breaks       format1      re_subj1  exclaim_subj  
##    4.559e-01    -7.173e-01     5.991e-02    -5.604e-03    -9.328e-01    -3.008e+00     1.173e-01  
## urgent_subj1  exclaim_mess   numbersmall     numberbig  
##    3.584e+00     1.041e-02    -8.795e-01    -1.658e-01  
## 
## Degrees of Freedom: 3135 Total (i.e. Null);  3118 Residual
## Null Deviance:	    1974 
## Residual Deviance: 1460 	AIC: 1496
```
]

---

## Predict outcome on the testing dataset

``` r
predict(email_fit, test_data)
```

```
## # A tibble: 785 × 1
##   .pred_class
##   <fct>      
## 1 0          
## 2 0          
## 3 0          
## 4 0          
## 5 0          
## 6 0          
## # ℹ 779 more rows
```

---

## Predict probabilities on the testing dataset

``` r
email_pred <- predict(email_fit, test_data, type = "prob") %>%
  bind_cols(test_data %>% select(spam, time))

email_pred
```

```
## # A tibble: 785 × 4
##   .pred_0 .pred_1 spam  time               
##     <dbl>   <dbl> <fct> <dttm>             
## 1   0.993 0.00665 0     2012-01-01 17:55:06
## 2   0.998 0.00172 0     2012-01-01 19:38:32
## 3   0.981 0.0188  0     2012-01-02 05:42:16
## 4   0.999 0.00110 0     2012-01-02 15:12:51
## 5   0.988 0.0124  0     2012-01-02 16:45:36
## 6   0.826 0.174   0     2012-01-02 21:55:03
## # ℹ 779 more rows
```

---

## A closer look at predictions

``` r
email_pred %>%
  arrange(desc(.pred_1)) %>%
  print(n = 10)
```

```
## # A tibble: 785 × 4
##    .pred_0 .pred_1 spam  time               
##      <dbl>   <dbl> <fct> <dttm>             
##  1  0.0939   0.906 1     2012-02-13 12:15:00
*##  2  0.161    0.839 0     2012-01-27 20:05:06
##  3  0.175    0.825 1     2012-03-01 05:40:27
##  4  0.264    0.736 1     2012-03-17 10:13:27
##  5  0.381    0.619 1     2012-03-21 12:33:12
##  6  0.446    0.554 1     2012-03-06 11:46:20
##  7  0.481    0.519 1     2012-02-10 12:53:18
*##  8  0.486    0.514 0     2012-01-22 21:23:41
##  9  0.497    0.503 1     2012-02-08 08:00:05
*## 10  0.509    0.491 0     2012-02-17 01:57:16
## # ℹ 775 more rows
```
]

---

## Evaluate the performance

**Receiver operating characteristic (ROC) curve**<sup>+</sup> which plots true positive rate vs. false positive rate (1 - specificity)

``` r
email_pred %>%
  roc_curve(
    truth = spam,
    .pred_1,
    event_level = "second"
  ) %>%
  autoplot()
```
]
.pull-right[
<img src="w09-L18_files/figure-html/unnamed-chunk-14-1.png" width="100%" style="display: block; margin: auto;" />
]

.footnote[
.small[
<sup>+</sup>Originally developed for operators of military radar receivers, hence the name.
]
]

---

## Evaluate the performance

.question[
In this plot, what is the ideal point = (1-specificity, sensitivity) that we would like our curve to go through?
]

---

## Evaluate the performance

Find the **area under the curve**:

``` r
email_pred %>%
  roc_auc(
    truth = spam,
    .pred_1,
    event_level = "second"
  )
```

```
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.860
```
]
.pull-right[
<img src="w09-L18_files/figure-html/unnamed-chunk-16-1.png" width="100%" style="display: block; margin: auto;" />
]

---
class: middle

# Feature engineering

---

## Feature engineering

- We prefer simple models when possible, but **parsimony** does not mean sacrificing accuracy (or predictive performance) in the interest of simplicity

--
- Variables that go into the model and how they are represented are just as critical to success of the model

--
- **Feature engineering** allows us to get creative with our predictors in an effort to make them more useful for our model (to increase its predictive performance)

---

## Same training and testing sets as before

``` r
# Fix random numbers by setting the seed 
# Enables analysis to be reproducible when random numbers are used 
set.seed(1116)

# Put 80% of the data into the training set 
email_split <- initial_split(email, prop = 0.80)

# Create data frames for the two sets:
train_data <- training(email_split)
test_data  <- testing(email_split)
```

---

## A simple approach: `mutate()`

``` r
train_data %>%
  mutate(
    date = date(time),
    dow  = wday(time),
    month = month(time)
    ) %>%
  select(time, date, dow, month) %>%
  sample_n(size = 5) # shuffle to show a variety
```

```
## # A tibble: 5 × 4
##   time                date         dow month
##   <dttm>              <date>     <dbl> <dbl>
## 1 2012-03-15 18:51:35 2012-03-15     5     3
## 2 2012-03-03 14:24:02 2012-03-03     7     3
## 3 2012-01-18 16:55:23 2012-01-18     4     1
## 4 2012-02-25 04:08:59 2012-02-25     7     2
## 5 2012-01-11 13:18:51 2012-01-11     4     1
```

---

## Modeling workflow, revisited

- Create a **recipe** for feature engineering steps to be applied to the training data

--
- Fit the model to the training data after these steps have been applied

--
- Using the model estimates from the training data, predict outcomes for the test data

--
- Evaluate the performance of the model on the test data

---

# Building recipes

---

## Initiate a recipe

``` r
email_rec <- recipe(
  spam ~ .,          # formula
  data = train_data  # data to use for cataloguing names and types of variables
  )

summary(email_rec)
```

```
## # A tibble: 21 × 4
##    variable     type      role      source  
##    <chr>        <list>    <chr>     <chr>   
##  1 to_multiple  <chr [3]> predictor original
##  2 from         <chr [3]> predictor original
##  3 cc           <chr [2]> predictor original
##  4 sent_email   <chr [3]> predictor original
##  5 time         <chr [1]> predictor original
##  6 image        <chr [2]> predictor original
##  7 attach       <chr [2]> predictor original
##  8 dollar       <chr [2]> predictor original
##  9 winner       <chr [3]> predictor original
## 10 inherit      <chr [2]> predictor original
## 11 viagra       <chr [2]> predictor original
## 12 password     <chr [2]> predictor original
## 13 num_char     <chr [2]> predictor original
## 14 line_breaks  <chr [2]> predictor original
## 15 format       <chr [3]> predictor original
## 16 re_subj      <chr [3]> predictor original
## 17 exclaim_subj <chr [2]> predictor original
## 18 urgent_subj  <chr [3]> predictor original
## 19 exclaim_mess <chr [2]> predictor original
## 20 number       <chr [3]> predictor original
## 21 spam         <chr [3]> outcome   original
```
]

---

## Remove certain variables

``` r
email_rec <- email_rec %>%
  step_rm(from, sent_email, image, viagra)
```

]

---

## Feature engineer date

``` r
email_rec <- email_rec %>%
  step_date(time, features = c("dow", "month")) %>%
  step_rm(time)
```

]

---

## Discretize numeric variables

``` r
email_rec <- email_rec %>%
  step_cut(cc, attach, dollar, breaks = c(0, 1)) %>%
  step_cut(inherit, password, breaks = c(0, 1, 5, 10, 20))
```

]

---

## Create dummy variables

``` r
email_rec <- email_rec %>%
  step_dummy(all_nominal(), -all_outcomes())
```

]

---

## Remove zero variance variables

Variables that contain only a single value

``` r
email_rec <- email_rec %>%
  step_zv(all_predictors())
```

]

---

## All in one place

``` r
email_rec <- recipe(spam ~ ., data = email) %>%
  step_rm(from, sent_email, image, viagra) %>%
  step_date(time, features = c("dow", "month")) %>%               
  step_rm(time) %>%
  step_cut(cc, attach, dollar, breaks = c(0, 1)) %>%
  step_cut(inherit, password, breaks = c(0, 1, 5, 10, 20, 100)) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_predictors())
```

---

# Building workflows

---

## Define model

``` r
email_mod <- logistic_reg() %>% 
  set_engine("glm")

email_mod
```

```
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm
```

---

## Define workflow

**Workflows** bring together models and recipes so that they can be easily applied to both the training and test data.

``` r
email_wflow <- workflow() %>% 
  add_model(email_mod) %>% 
  add_recipe(email_rec)
```

```
## ══ Workflow ════════════════════════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────────────────────────
## 7 Recipe Steps
## 
## • step_rm()
## • step_date()
## • step_rm()
## • step_cut()
## • step_cut()
## • step_dummy()
## • step_zv()
## 
## ── Model ───────────────────────────────────────────────────────────────────────────────────────────
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm
```
]

---

## Fit model to training data

``` r
email_fit <- email_wflow %>% 
  fit(data = train_data)
```

---

``` r
tidy(email_fit) %>% print(n = 30)
```

```
## # A tibble: 30 × 5
##    term                estimate  std.error statistic  p.value
##    <chr>                  <dbl>      <dbl>     <dbl>    <dbl>
##  1 (Intercept)         -0.809      0.255     -3.17   1.53e- 3
##  2 num_char             0.0497     0.0246     2.02   4.38e- 2
##  3 line_breaks         -0.00544    0.00141   -3.86   1.13e- 4
##  4 exclaim_subj        -0.163      0.272     -0.599  5.49e- 1
##  5 exclaim_mess         0.00943    0.00191    4.93   8.02e- 7
##  6 to_multiple_X1      -2.60       0.348     -7.48   7.37e-14
##  7 cc_X.1.68.          -0.186      0.461     -0.404  6.86e- 1
##  8 attach_X.1.21.       1.70       0.329      5.19   2.14e- 7
##  9 dollar_X.1.64.       0.271      0.215      1.26   2.07e- 1
## 10 winner_yes           2.13       0.427      4.98   6.30e- 7
## 11 inherit_X.1.5.      -9.29     753.        -0.0123 9.90e- 1
## 12 inherit_X.5.10.      2.52       1.45       1.73   8.28e- 2
## 13 password_X.1.5.     -1.74       0.748     -2.32   2.01e- 2
## 14 password_X.5.10.   -12.6      499.        -0.0253 9.80e- 1
## 15 password_X.10.20.  -13.9      803.        -0.0173 9.86e- 1
## 16 password_X.20.100. -13.9     1029.        -0.0135 9.89e- 1
## 17 format_X1           -0.918      0.158     -5.82   5.91e- 9
## 18 re_subj_X1          -2.91       0.437     -6.67   2.55e-11
## 19 urgent_subj_X1       3.50       1.07       3.25   1.14e- 3
## 20 number_small        -0.896      0.166     -5.40   6.67e- 8
## 21 number_big          -0.193      0.248     -0.775  4.38e- 1
## 22 time_dow_Mon        -0.0956     0.305     -0.314  7.54e- 1
## 23 time_dow_Tue         0.197      0.276      0.714  4.75e- 1
## 24 time_dow_Wed        -0.0726     0.275     -0.264  7.92e- 1
## 25 time_dow_Thu        -0.0777     0.281     -0.277  7.82e- 1
## 26 time_dow_Fri         0.00328    0.281      0.0117 9.91e- 1
## 27 time_dow_Sat         0.251      0.295      0.848  3.96e- 1
## 28 time_month_Feb       0.740      0.178      4.15   3.27e- 5
## 29 time_month_Mar       0.543      0.179      3.03   2.41e- 3
## 30 time_month_Apr     -13.2      628.        -0.0211 9.83e- 1
```
]

---

## Make predictions for test data

``` r
email_pred <- predict(email_fit, test_data, type = "prob") %>% 
  bind_cols(test_data)

email_pred
```

```
## # A tibble: 785 × 23
##   .pred_0  .pred_1 spam  to_multiple from     cc sent_email
##     <dbl>    <dbl> <fct> <fct>       <fct> <int> <fct>     
## 1   0.995 0.00486  0     1           1         0 1         
## 2   0.999 0.00133  0     0           1         1 1         
## 3   0.968 0.0322   0     0           1         0 0         
## 4   0.999 0.000613 0     0           1         1 0         
## 5   0.993 0.00674  0     0           1         4 0         
## 6   0.868 0.132    0     0           1         0 0         
## # ℹ 779 more rows
## # ℹ 16 more variables: time <dttm>, image <dbl>, attach <dbl>,
## #   dollar <dbl>, winner <fct>, inherit <dbl>, viagra <dbl>,
## #   password <dbl>, num_char <dbl>, line_breaks <int>,
## #   format <fct>, re_subj <fct>, exclaim_subj <dbl>,
## #   urgent_subj <fct>, exclaim_mess <dbl>, number <fct>
```

---

## Evaluate the performance

``` r
email_pred %>%
  roc_curve(
    truth = spam,
    .pred_1,
    event_level = "second"
  ) %>%
  autoplot()
```
]
.pull-right[
<img src="w09-L18_files/figure-html/unnamed-chunk-37-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Evaluate the performance

``` r
email_pred %>%
  roc_auc(
    truth = spam,
    .pred_1,
    event_level = "second"
  )
```

```
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.858
```
]
.pull-right[
<img src="w09-L18_files/figure-html/unnamed-chunk-39-1.png" width="100%" style="display: block; margin: auto;" />
]

---

# Making decisions

---

## Cutoff probability: 0.5

Suppose we decide to label an email as spam if the model predicts the probability of spam to be **more than 0.5**.

|                        | Email is not spam| Email is spam|
|:-----------------------|-----------------:|-------------:|
|Email labelled not spam |               711|            60|
|Email labelled spam     |                 6|             8|
]
.panel[.panel-name[Code]

``` r
cutoff_prob <- 0.5
email_pred %>%
  mutate(
    spam      = if_else(spam == 1, "Email is spam", "Email is not spam"),
    spam_pred = if_else(.pred_1 > cutoff_prob, "Email labelled spam", "Email labelled not spam")
    ) %>%
  count(spam_pred, spam) %>%
  pivot_wider(names_from = spam, values_from = n) %>%
  kable(col.names = c("", "Email is not spam", "Email is spam"))
```
]
]

---

## Cutoff probability: 0.25

Suppose we decide to label an email as spam if the model predicts the probability of spam to be **more than 0.25**.

|                        | Email is not spam| Email is spam|
|:-----------------------|-----------------:|-------------:|
|Email labelled not spam |               666|            35|
|Email labelled spam     |                51|            33|

---

## Cutoff probability: 0.75

Suppose we decide to label an email as spam if the model predicts the probability of spam to be **more than 0.75**.

|                        | Email is not spam| Email is spam|
|:-----------------------|-----------------:|-------------:|
|Email labelled not spam |               716|            65|
|Email labelled spam     |                 1|             3|

---

## Recap

- ROC curve and AUC

- Feature engineering

- Workflows and recipes

- Making decisions