This lab is all about logistic regression and using recipes and workflows, so make sure you are up to date with the week 9 material!
The objective of this workshop is to practise fitting logistic regression models using tidymodels recipes and workflows, and to evaluate their predictive performance.
In this workshop you will be working individually. However, while working on your version control project, we suggest discussing with your group members to hear different opinions and angles. Start by doing the usual set-up steps:
Open lab-08.Rmd and change the author at the top of the file to your name. During the workshop we'll use the following packages:
tidyverse: for data wrangling and visualisation
tidymodels: for modelling

Who does not want to know if tomorrow is going to rain? Well, if you live in Scotland the prediction is probably easy. But for people living in Australia, it might actually be more interesting to know whether you'll need to carry an umbrella (or a "brolly", as they like to say) or not.
In this lab we will be using a dataset containing about 10 years of daily weather observations from many locations across Australia, with information relating to temperature, wind, humidity, cloud cover and pressure. The variable we want to predict is whether it will rain on the following day, RainTomorrow.
To familiarise yourself with the data set, we suggest reading the data dictionary and reference here. You should read the data into RStudio using the following code:
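A minimal sketch of the import step; the file name and path are assumptions, so adjust them to match the file provided with the lab:

library(tidyverse)
library(tidymodels)

# File name and location are assumptions; use the file provided with the lab
weather <- read_csv("data/weatherAUS.csv")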
We will start by transforming any character variables that need to be treated as categorical. Use the following code to identify them and transform them into factors.
# Identify the character variables (excluding Date) ...
variables_to_transform <- weather %>%
  select(where(is.character), -Date) %>%
  names()

# ... and convert them all to factors
weather <- weather %>%
  mutate(across(all_of(variables_to_transform), factor))
To simplify things, today we will not be using some categorical explanatory variables, because they have a very large number of categories and might make our model interpretation more complex. Specifically, we will exclude WindGustDir, WindDir9am and WindDir3pm.
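One way to drop them (a sketch):

# Exclude the three wind-direction variables with many categories
weather <- weather %>%
  select(-WindGustDir, -WindDir9am, -WindDir3pm)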
Note that some of these variables have a large number of missing values:
# Proportion of missing values for each variable that has any NAs
weather %>%
  select(where(~ any(is.na(.)))) %>%
  summarise(across(everything(), ~ mean(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Prop_NA") %>%
  arrange(desc(Prop_NA))
## # A tibble: 18 × 2
## Variable Prop_NA
## <chr> <dbl>
## 1 Sunshine 0.480
## 2 Evaporation 0.432
## 3 Cloud3pm 0.408
## 4 Cloud9am 0.384
## 5 Pressure9am 0.104
## 6 Pressure3pm 0.103
## 7 WindGustSpeed 0.0706
## 8 Humidity3pm 0.0310
## 9 Temp3pm 0.0248
## 10 RainTomorrow 0.0225
## 11 Rainfall 0.0224
## 12 RainToday 0.0224
## 13 WindSpeed3pm 0.0211
## 14 Humidity9am 0.0182
## 15 WindSpeed9am 0.0121
## 16 Temp9am 0.0121
## 17 MinTemp 0.0102
## 18 MaxTemp 0.00867
Are there any missing values in our variable of interest, RainTomorrow? If so, filter them out and save the new dataset as weather_noNA.
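A possible way to do this (sketch):

# Keep only rows where the outcome RainTomorrow is observed
weather_noNA <- weather %>%
  filter(!is.na(RainTomorrow))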
Which cities have the most rainy days? To find out, let's analyse the RainToday variable.
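For example, you could count rainy days per location (a sketch):

# Number of days with recorded rain, by city
weather_noNA %>%
  filter(RainToday == "Yes") %>%
  count(Location, sort = TRUE)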
We will focus our analysis on the city of Portland.
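A sketch of the filtering step; the name weather_Portland matches the recipes used later in this lab:

weather_Portland <- weather_noNA %>%
  filter(Location == "Portland")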
Try to predict RainTomorrow by fitting a logistic regression using the variable RainToday, and print the output using tidy().
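A minimal sketch using parsnip directly; the object name weather_fit0 is an arbitrary choice:

weather_fit0 <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(RainTomorrow ~ RainToday, data = weather_Portland)

tidy(weather_fit0)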
For each point in our dataset, what are the fitted probabilities that tomorrow it’s going to rain?
Hint: how many unique values do the predicted probabilities take? What do these values correspond to?
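To inspect the fitted probabilities, you could predict on the same data (a sketch, reusing the hypothetical weather_fit0 from above):

# Predict on the rows used for fitting; how many unique probability values appear?
predict(weather_fit0,
        new_data = weather_Portland %>% filter(!is.na(RainToday)),
        type = "prob") %>%
  distinct()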
Let us set a seed and perform a split of our data.
Split the data into a training set (80% of your Portland data) and a testing set.
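A sketch of the split; the seed value is an arbitrary assumption:

set.seed(1234) # any fixed seed makes the split reproducible
weather_split <- initial_split(weather_Portland, prop = 0.80)
train_data <- training(weather_split)
test_data  <- testing(weather_split)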
Refit the simple logistic regression using RainToday as predictor on this training data, using tidymodels recipes and workflows.
Use step_naomit() to exclude cases with missing values, and finally use step_dummy() to convert categorical variables to dummy variables.
weather_rec1 <- recipe(
  RainTomorrow ~ RainToday,
  data = weather_Portland
) %>%
  step_naomit(all_predictors()) %>%
  step_dummy(all_nominal(), -all_outcomes())
weather_mod1 <- ___ %>%
set_engine(___)
weather_wflow1 <- workflow() %>% # initiate workflow
___(___) %>% # add model
___(___) # add recipe
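One possible way to fill in the blanks (a sketch, not the only valid answer):

weather_mod1 <- logistic_reg() %>%
  set_engine("glm")

weather_wflow1 <- workflow() %>% # initiate workflow
  add_model(weather_mod1) %>%    # add model
  add_recipe(weather_rec1)       # add recipe

# Fit the workflow on the training data
weather_fit1 <- weather_wflow1 %>%
  fit(data = train_data)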
Now fit a logistic regression model with multiple predictors for RainTomorrow. We will use as predictors the variables MinTemp, MaxTemp, RainToday and Rainfall. Similarly to question 2, use workflows to fit this model on our training data.
weather_rec2 <- recipe(
  RainTomorrow ~ ___ + ___ + ___ + ___, # include your formula
  data = weather_Portland
) %>%
  step_naomit(___) %>% # exclude cases with missing values in all predictors
  step_dummy(all_nominal(), ___) # exclude all outcomes
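A possible completion (a sketch; it reuses the weather_mod1 specification suggested above):

weather_rec2 <- recipe(
  RainTomorrow ~ MinTemp + MaxTemp + RainToday + Rainfall,
  data = weather_Portland
) %>%
  step_naomit(all_predictors()) %>%
  step_dummy(all_nominal(), -all_outcomes())

weather_wflow2 <- workflow() %>%
  add_model(weather_mod1) %>%
  add_recipe(weather_rec2)

weather_fit2 <- weather_wflow2 %>%
  fit(data = train_data)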
weather_pred2 <- predict(___, test_data, type = "prob") %>%
bind_cols(test_data)
weather_pred2 %>%
___( # plot ROC curve
truth = RainTomorrow,
.pred_Yes,
event_level = "second"
) %>%
autoplot()
weather_pred2 %>%
___( # get AUC value
truth = RainTomorrow,
.pred_Yes,
event_level = "second"
)
Now create the ROC curve and get the AUC (area under the curve) value for your second model.
Which model seems to have the better performance?
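Possible completions for the blanks above (a sketch, assuming weather_fit2 is your fitted workflow): predict() fills the first blank, and the yardstick functions roc_curve() and roc_auc() fill the others.

weather_pred2 <- predict(weather_fit2, test_data, type = "prob") %>%
  bind_cols(test_data)

weather_pred2 %>%
  roc_curve(truth = RainTomorrow, .pred_Yes, event_level = "second") %>%
  autoplot()

weather_pred2 %>%
  roc_auc(truth = RainTomorrow, .pred_Yes, event_level = "second")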
Pick a cutoff probability for classifying RainTomorrow and look at the number of false positives and false negatives. For example:
cutoff_prob <- 0.5
weather_pred2 %>%
mutate(
RainTomorrow = if_else(RainTomorrow == "Yes", "It rains", "It does not rain"),
RainTomorrow_pred = if_else(.pred_Yes > cutoff_prob, "Predicted rain", "Predicted no rain")
) %>%
na.omit() %>%
count(RainTomorrow_pred, RainTomorrow)
What is the false positive rate with cutoff_prob = 0.3?
What about the false negative rate?
(Recall: the false positive rate is FP / (FP + TN), the proportion of actual negatives incorrectly predicted as positive, and the false negative rate is FN / (FN + TP).)
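A sketch of how you could compute both rates directly, using the definitions above:

cutoff_prob <- 0.3
weather_pred2 %>%
  na.omit() %>%
  mutate(pred_rain = .pred_Yes > cutoff_prob) %>%
  summarise(
    false_positive_rate = sum(pred_rain & RainTomorrow == "No") / sum(RainTomorrow == "No"),
    false_negative_rate = sum(!pred_rain & RainTomorrow == "Yes") / sum(RainTomorrow == "Yes")
  )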
We will now try to improve our fit by building a model using additional explanatory variables.
Is there any categorical variable which is very unbalanced? If so, remove it.
Is there any numerical variable that has a very small standard deviation for one of the two categories of RainTomorrow? If so, remove it.
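Two quick checks you could run (sketches):

# Class balance of a factor variable; repeat for the other factors
weather_Portland %>%
  count(RainToday)

# Standard deviation of each numeric variable within each RainTomorrow category
weather_Portland %>%
  group_by(RainTomorrow) %>%
  summarise(across(where(is.numeric), ~ sd(.x, na.rm = TRUE)))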
Let’s do some feature engineering: let us transform the variable Date
. We will use Ludbridate
again: extract the month and year.
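A sketch using lubridate (loaded with recent versions of the tidyverse); the new column names are arbitrary choices:

library(lubridate)

weather_Portland <- weather_Portland %>%
  mutate(
    Date  = as_date(Date), # in case Date was read in as character
    Month = month(Date),
    Year  = year(Date)
  )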
Let’s now combine everything into recipes and workflows. Then fit the model on the training data and use the test data for calculating the AUC and plotting the ROC curve.
Is this model better than the one we fitted in Exercise 3?