Lab 08 - Predicting rain

This lab is all about logistic regression and using recipes and workflows, so make sure you are up to date with the week 9 material!

Learning Goals

The objective of this workshop is to:

Set-up

In this workshop you will be working individually. However, while working on your version control project, we suggest to discuss with your group members to hear different opinions and angles. Start by doing the usual set-up steps:

Packages & Data

During the workshop we’ll use the following packages:

Data set: Rain in Australia

Who does not want to know if tomorrow is going to rain? Well if you live in Scotland the prediction is probably easy. But for people living in Australia, it might be actually more interesting to know if you’ll need to carry an umbrella (or a “brolly” as they like to say) or not.

In this lab we will be using a dataset containing about 10 years of daily weather observations from many locations across Australia, with information relating to temperature, wind, humidity, cloud cover and pressure. The variable we want to predict is whether it will rain on the following day, RainTomorrow.

To familiarise yourself with the data set, reading the data dictionary and reference here is suggested. You should read the data into RStudio using the following code:

weather <- read.csv("data/weatherAUS.csv", header = TRUE)

Exercise 1: Exploratory Data Analysis

We will start by transform any character variables that need to be transformed into categorical. Use the following code to identify them and trasform them into factors.

variables_to_transform = weather %>% 
  select(where(is.character),-Date) %>% names()
weather <- weather %>% 
  mutate_at(vars(all_of(variables_to_transform)),factor)

To simplify things, today we will not be using some categorical explanatory variables, because they have a very large number of categories and might make our model interpretation more complex. Specifically we will exclude WindGustDir, WindDir9am and WindDir3pm.

weather <- weather %>%
  select(-WindGustDir,-WindDir9am,-WindDir3pm)

Note that some of these variables have a large number of missing values:

weather %>% 
  select(where(~ any(is.na(.)))) %>% 
  summarise(across(everything(), ~mean(is.na(.)))) %>%
  pivot_longer(col = everything(), names_to = "Variable", values_to = "Prop_NA") %>%
  arrange(desc(Prop_NA))
## # A tibble: 18 × 2
##    Variable      Prop_NA
##    <chr>           <dbl>
##  1 Sunshine      0.480  
##  2 Evaporation   0.432  
##  3 Cloud3pm      0.408  
##  4 Cloud9am      0.384  
##  5 Pressure9am   0.104  
##  6 Pressure3pm   0.103  
##  7 WindGustSpeed 0.0706 
##  8 Humidity3pm   0.0310 
##  9 Temp3pm       0.0248 
## 10 RainTomorrow  0.0225 
## 11 Rainfall      0.0224 
## 12 RainToday     0.0224 
## 13 WindSpeed3pm  0.0211 
## 14 Humidity9am   0.0182 
## 15 WindSpeed9am  0.0121 
## 16 Temp9am       0.0121 
## 17 MinTemp       0.0102 
## 18 MaxTemp       0.00867
  1. Are there any missing values in our variable of interest RainTomorrow? If so, we filter them out and save the new dataset as weather_noNA.

  2. Which cities are the ones with more rain days? To do this, let’s analyze the RainToday variable.

Exercise 2: Logistic regression

We will focus our analysis on the city of Portland.

weather_Portland <- weather_noNA %>%
  filter(Location == "Portland")
  1. Try to predict RainTomorrow by fitting a linear regression using the variable RainToday and print the output using tidy().

  2. For each point in our dataset, what are the fitted probabilities that tomorrow it’s going to rain?

Hint: how many unique values do the predicted probabilities take? What do these value correspond to?

Exercise 3: Split the data and build workflows

Let us set a seed and perform a split of our data.

set.seed(111723)
  1. Split the data into a training set (80% of your Portland data) and a testing set.

  2. Refit the simple logistic regression using RainToday as predictor on this training data, using tidymodels recipes and workflows.

weather_rec1 <- recipe(
  RainTomorrow ~ RainToday, 
  data = weather_Portland
  ) %>%
  step_naomit(all_predictors()) %>%
  step_dummy(all_nominal(), -all_outcomes())
weather_mod1 <- ___ %>% 
  set_engine(___)
weather_wflow1 <- workflow() %>% # initiate workflow
  ___(___) %>%                   # add model
  ___(___)                       # add recipe
weather_fit1 <- weather_wflow1 %>% 
  fit(data = ___)
tidy(weather_fit1)
  1. Fit now a multiple logistic regression, i.e. using multiple explanatory variables, to predict RainTomorrow. We will use as predictors the variables MinTemp, MaxTemp, RainToday and Rainfall. Similarly to question 2, use workflows to fit this model on our training data.
weather_rec2 <- recipe(
  RainTomorrow ~ ___+___+___+___, # include your formula
  data = weather_Portland
  ) %>%
  step_naomit(___) %>%.          # exclude cases with missing values in all predictors
  step_dummy(all_nominal(), ___) # exclude all outcomes
  1. Now let’s evaluate the predictive performance of these two models on our test set.
weather_pred2 <- predict(___, test_data, type = "prob") %>%
  bind_cols(test_data)
weather_pred2 %>%
  ___(                      # plot ROC curve
    truth = RainTomorrow,
    .pred_Yes,
    event_level = "second"
  ) %>%
  autoplot()

weather_pred2 %>%
  ___(                  # get AUC value
    truth = RainTomorrow,
    .pred_Yes,
    event_level = "second"
  )
  1. Now focus on the second model. Consider several thresholds for predicting RainTomorrow and look at the number of false positives and false negatives. For example:
cutoff_prob <- 0.5
weather_pred2 %>%
  mutate(
    RainTomorrow      = if_else(RainTomorrow == "Yes", "It rains", "It does not rain"),
    RainTomorrow_pred = if_else(.pred_Yes > cutoff_prob, "Predicted rain", "Predicted no rain")
    ) %>%
  na.omit() %>%
  count(RainTomorrow_pred, RainTomorrow)

Exercise 4: Extend our model [OPTIONAL]

We will now try to improve our fit by building a model using additional explanatory variables.

  1. Let us analyze the various variables in our dataset.
  1. Let’s do some feature engineering: let us transform the variable Date. We will use Ludbridate again: extract the month and year.

  2. Let’s now combine everything into recipes and workflows. Then fit the model on the training data and use the test data for calculating the AUC and plotting the ROC curve.

  3. Is this model better than the one we fitted in Exercise 3?