This lab is all about logistic regression and using recipes and workflows, so make sure you are up to date with the week 9 material!
The objective of this workshop is to practise fitting logistic regression models using tidymodels recipes and workflows, and to evaluate their predictive performance.
In this workshop you will be working individually. However, while working on your version control project, we suggest discussing with your group members to hear different opinions and angles. Start by doing the usual set-up steps:
Open lab-08.Rmd and change the author at the top of the file to your name. During the workshop we'll use the following packages:
tidyverse: for data wrangling and visualisation
tidymodels: for modelling

Who does not want to know if tomorrow is going to rain? Well, if you live in Scotland the prediction is probably easy. But for people living in Australia, it might actually be more interesting to know whether you'll need to carry an umbrella (or a "brolly", as they like to say) or not.
In this lab we will be using a dataset containing about 10 years of daily weather observations from many locations across Australia, with information relating to temperature, wind, humidity, cloud cover and pressure. The variable we want to predict is whether it will rain on the following day, RainTomorrow.
To familiarise yourself with the data set, we suggest reading the data dictionary and reference here. You should read the data into RStudio using the following code:
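A minimal sketch of the import step; the file name and path are assumptions, so adjust them to match the file provided with the lab:

library(tidyverse)
library(tidymodels)

# File name and location are assumptions; use the file provided with the lab
weather <- read_csv("data/weatherAUS.csv")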
We will start by transforming any character variables that need to be treated as categorical. Use the following code to identify them and transform them into factors.
# Identify the character variables (excluding Date) ...
variables_to_transform <- weather %>%
  select(where(is.character), -Date) %>%
  names()

# ... and convert them all to factors
weather <- weather %>%
  mutate(across(all_of(variables_to_transform), factor))
To simplify things, today we will not be using some categorical explanatory variables, because they have a very large number of categories and might make our model interpretation more complex. Specifically, we will exclude WindGustDir, WindDir9am and WindDir3pm.
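One way to drop them (a sketch):

# Exclude the three wind-direction variables with many categories
weather <- weather %>%
  select(-WindGustDir, -WindDir9am, -WindDir3pm)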
Note that some of these variables have a large number of missing values:
# Proportion of missing values for each variable that has any NAs
weather %>%
  select(where(~ any(is.na(.)))) %>%
  summarise(across(everything(), ~ mean(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Prop_NA") %>%
  arrange(desc(Prop_NA))
## # A tibble: 18 × 2
## Variable Prop_NA
## <chr> <dbl>
## 1 Sunshine 0.480
## 2 Evaporation 0.432
## 3 Cloud3pm 0.408
## 4 Cloud9am 0.384
## 5 Pressure9am 0.104
## 6 Pressure3pm 0.103
## 7 WindGustSpeed 0.0706
## 8 Humidity3pm 0.0310
## 9 Temp3pm 0.0248
## 10 RainTomorrow 0.0225
## 11 Rainfall 0.0224
## 12 RainToday 0.0224
## 13 WindSpeed3pm 0.0211
## 14 Humidity9am 0.0182
## 15 WindSpeed9am 0.0121
## 16 Temp9am 0.0121
## 17 MinTemp 0.0102
## 18 MaxTemp 0.00867
Are there any missing values in our variable of interest, RainTomorrow? If so, filter them out and save the new dataset as weather_noNA.
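A possible way to do this (sketch):

# Keep only rows where the outcome RainTomorrow is observed
weather_noNA <- weather %>%
  filter(!is.na(RainTomorrow))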
Which cities have the most rainy days? To find out, let's analyse the RainToday variable.
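For example, you could count rainy days per location (a sketch):

# Number of days with recorded rain, by city
weather_noNA %>%
  filter(RainToday == "Yes") %>%
  count(Location, sort = TRUE)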
We will focus our analysis on the city of Portland.
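A sketch of the filtering step; the name weather_Portland matches the recipes used later in this lab:

weather_Portland <- weather_noNA %>%
  filter(Location == "Portland")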
Try to predict RainTomorrow by fitting a logistic regression using the variable RainToday, and print the output using tidy().
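A minimal sketch using parsnip directly; the object name weather_fit0 is an arbitrary choice:

weather_fit0 <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(RainTomorrow ~ RainToday, data = weather_Portland)

tidy(weather_fit0)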
For each point in our dataset, what are the fitted probabilities that tomorrow it’s going to rain?
Hint: how many unique values do the predicted probabilities take? What do these values correspond to?
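To inspect the fitted probabilities, you could predict on the same data (a sketch, reusing the hypothetical weather_fit0 from above):

# Predict on the rows used for fitting; how many unique probability values appear?
predict(weather_fit0,
        new_data = weather_Portland %>% filter(!is.na(RainToday)),
        type = "prob") %>%
  distinct()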
Let us set a seed and perform a split of our data.
Split the data into a training set (80% of your Portland data) and a testing set.
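A sketch of the split; the seed value is an arbitrary assumption:

set.seed(1234) # any fixed seed makes the split reproducible
weather_split <- initial_split(weather_Portland, prop = 0.80)
train_data <- training(weather_split)
test_data  <- testing(weather_split)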
Refit the simple logistic regression using RainToday as predictor on this training data, using tidymodels recipes and workflows.
Use step_naomit() to exclude cases with missing values, and finally use step_dummy() to convert categorical variables to dummy variables.
weather_rec1 <- recipe(
  RainTomorrow ~ RainToday,
  data = weather_Portland
) %>%
  step_naomit(all_predictors()) %>%
  step_dummy(all_nominal(), -all_outcomes())
weather_mod1 <- ___ %>%
set_engine(___)
weather_wflow1 <- workflow() %>% # initiate workflow
___(___) %>% # add model
___(___) # add recipe
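One possible way to fill in the blanks (a sketch, not the only valid answer):

weather_mod1 <- logistic_reg() %>%
  set_engine("glm")

weather_wflow1 <- workflow() %>% # initiate workflow
  add_model(weather_mod1) %>%    # add model
  add_recipe(weather_rec1)       # add recipe

# Fit the workflow on the training data
weather_fit1 <- weather_wflow1 %>%
  fit(data = train_data)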
Now fit a logistic regression model with multiple predictors for RainTomorrow. We will use as predictors the variables MinTemp, MaxTemp, RainToday and Rainfall. Similarly to question 2, use workflows to fit this model on our training data.
weather_rec2 <- recipe(
  RainTomorrow ~ ___ + ___ + ___ + ___, # include your formula
  data = weather_Portland
) %>%
  step_naomit(___) %>% # exclude cases with missing values in all predictors
  step_dummy(all_nominal(), ___) # exclude all outcomes
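A possible completion (a sketch; it reuses the weather_mod1 specification suggested above):

weather_rec2 <- recipe(
  RainTomorrow ~ MinTemp + MaxTemp + RainToday + Rainfall,
  data = weather_Portland
) %>%
  step_naomit(all_predictors()) %>%
  step_dummy(all_nominal(), -all_outcomes())

weather_wflow2 <- workflow() %>%
  add_model(weather_mod1) %>%
  add_recipe(weather_rec2)

weather_fit2 <- weather_wflow2 %>%
  fit(data = train_data)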
weather_pred2 <- predict(___, test_data, type = "prob") %>%
bind_cols(test_data)
weather_pred2 %>%
___( # plot ROC curve
truth = RainTomorrow,
.pred_Yes,
event_level = "second"
) %>%
autoplot()
weather_pred2 %>%
___( # get AUC value
truth = RainTomorrow,
.pred_Yes,
event_level = "second"
)
Now create the ROC curve and get the AUC (area under the curve) value for your second model.
Which model seems to have the better performance?
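Possible completions for the blanks above (a sketch, assuming weather_fit2 is your fitted workflow): predict() fills the first blank, and the yardstick functions roc_curve() and roc_auc() fill the others.

weather_pred2 <- predict(weather_fit2, test_data, type = "prob") %>%
  bind_cols(test_data)

weather_pred2 %>%
  roc_curve(truth = RainTomorrow, .pred_Yes, event_level = "second") %>%
  autoplot()

weather_pred2 %>%
  roc_auc(truth = RainTomorrow, .pred_Yes, event_level = "second")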
Pick a cutoff probability for classifying RainTomorrow and look at the number of false positives and false negatives. For example:
cutoff_prob <- 0.5
weather_pred2 %>%
mutate(
RainTomorrow = if_else(RainTomorrow == "Yes", "It rains", "It does not rain"),
RainTomorrow_pred = if_else(.pred_Yes > cutoff_prob, "Predicted rain", "Predicted no rain")
) %>%
na.omit() %>%
count(RainTomorrow_pred, RainTomorrow)
What is the false positive rate with cutoff_prob = 0.3?
What about the false negative rate?
(Recall: the false positive rate is FP / (FP + TN), the proportion of actual negatives incorrectly predicted as positive, and the false negative rate is FN / (FN + TP).)
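A sketch of how you could compute both rates directly, using the definitions above:

cutoff_prob <- 0.3
weather_pred2 %>%
  na.omit() %>%
  mutate(pred_rain = .pred_Yes > cutoff_prob) %>%
  summarise(
    false_positive_rate = sum(pred_rain & RainTomorrow == "No") / sum(RainTomorrow == "No"),
    false_negative_rate = sum(!pred_rain & RainTomorrow == "Yes") / sum(RainTomorrow == "Yes")
  )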
We will now try to improve our fit by building a model using additional explanatory variables.
Is there any categorical variable which is very unbalanced? If so, remove it.
Is there any numerical variable that has a very small standard deviation for one of the two categories of RainTomorrow? If so, remove it.
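Two quick checks you could run (sketches):

# Class balance of a factor variable; repeat for the other factors
weather_Portland %>%
  count(RainToday)

# Standard deviation of each numeric variable within each RainTomorrow category
weather_Portland %>%
  group_by(RainTomorrow) %>%
  summarise(across(where(is.numeric), ~ sd(.x, na.rm = TRUE)))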
Let’s do some feature engineering: let us transform the variable Date
. We will use Ludbridate
again: extract the month and year.
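A sketch using lubridate (loaded with recent versions of the tidyverse); the new column names are arbitrary choices:

library(lubridate)

weather_Portland <- weather_Portland %>%
  mutate(
    Date  = as_date(Date), # in case Date was read in as character
    Month = month(Date),
    Year  = year(Date)
  )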
Let’s now combine everything into recipes and workflows. Then fit the model on the training data and use the test data for calculating the AUC and plotting the ROC curve.
Is this model better than the one we fitted in Exercise 3?