class: center, middle, inverse, title-slide

.title[
# Data Types and Classes
]
.subtitle[
## Introduction to Data Science
]
.author[
### University of Edinburgh
]
.date[
### 2024/2025
]

---
class: middle

# Why should you care about data types?

---

## Example: Cat lovers

A survey asked respondents their name and number of cats. The instructions said to enter the number of cats as a numerical value.

.pull-left[
``` r
cat_lovers <- read_csv("data/cat-lovers.csv")
```

```
## # A tibble: 60 × 3
##    name           number_of_cats handedness
##    <chr>          <chr>          <chr>
##  1 Bernice Warren 0              left
##  2 Woodrow Stone  0              left
##  3 Willie Bass    1              left
##  4 Tyrone Estrada 3              left
##  5 Alex Daniels   3              left
##  6 Jane Bates     2              left
##  7 Latoya Simpson 1              left
##  8 Darin Woods    1              left
##  9 Agnes Cobb     0              left
## 10 Tabitha Grant  0              left
## # ℹ 50 more rows
```
]
.pull-right[
``` r
cat_lovers %>%
  summarise(mean_cats = mean(number_of_cats))
```

```
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `mean_cats =
##   mean(number_of_cats)`.
## Caused by warning in `mean.default()`:
## ! argument is not numeric or logical: returning NA
```

```
## # A tibble: 1 × 1
##   mean_cats
##       <dbl>
## 1        NA
```

.question[Why won't this work?]
]

---

## Closer look at the `mean` function

.pull-left-narrow[
``` r
help(mean)
```
]
.pull-right-wide[
<img src="img/mean-help-v2.png" width="125%" style="display: block; margin: auto;" />
]

``` r
cat_lovers %>%
  summarise(mean_cats = mean(number_of_cats, na.rm = TRUE))
```

```
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `mean_cats = mean(number_of_cats, na.rm = TRUE)`.
## Caused by warning in `mean.default()`:
## ! argument is not numeric or logical: returning NA
```

```
## # A tibble: 1 × 1
##   mean_cats
##       <dbl>
## 1        NA
```

---

## Take a breath and look at your data

.question[
What is the type of the `number_of_cats` variable?
]

``` r
glimpse(cat_lovers)
```

```
## Rows: 60
## Columns: 3
## $ name           <chr> "Bernice Warren", "Woodrow Stone", "Willie Bass", "Tyro…
## $ number_of_cats <chr> "0", "0", "1", "3", "3", "2", "1", "1", "0", "0", "0", …
## $ handedness     <chr> "left", "left", "left", "left", "left", "left", "left",…
```

``` r
cat_lovers %>%
  slice(c(48, 54))
```

```
## # A tibble: 2 × 3
##   name         number_of_cats                                      handedness
##   <chr>        <chr>                                               <chr>
## 1 Ginger Clark 1.5 - honestly I think one of my cats is half human right
## 2 Doug Bass    three                                               right
```

---

## Might need to babysit your respondents

Need to _mutate_ the `number_of_cats` variable:

- to fix unusual data values, and
- to coerce the characters to numbers.

``` r
cat_lovers %>%
  mutate(
    # replace the two free-text answers with numeric strings
    number_of_cats = case_when(
      name == "Ginger Clark" ~ "2",
      name == "Doug Bass" ~ "3",
      TRUE ~ number_of_cats
    ),
    # coerce the character column to numbers
    number_of_cats = as.numeric(number_of_cats)
  ) %>%
  summarise(mean_cats = mean(number_of_cats))
```

```
## # A tibble: 1 × 1
##   mean_cats
##       <dbl>
## 1     0.833
```

---

## Moral of the story

- If your data does not behave how you expect it to, type coercion upon reading in the data might be the reason.
- Go in and investigate your data, apply the fix, *save your data*, live happily ever after.
- You can coerce explicitly using functions of the form `as.???()`
  - There are different options for the `???` part of the code

---
class: middle

# Data types

---

## Four key data types in R

.pull-left[
**logical** - boolean values `TRUE` and `FALSE`

``` r
typeof(TRUE)
```

```
## [1] "logical"
```

**double** - floating point numerical values (default numerical type)

``` r
typeof(1.335)
```

```
## [1] "double"
```

``` r
typeof(7)
```

```
## [1] "double"
```
]
.pull-right[
**character** - character strings

``` r
typeof("hello")
```

```
## [1] "character"
```

**integer** - integer numerical values (indicated with an `L`)

``` r
typeof(7L)
```

```
## [1] "integer"
```

``` r
typeof(1:3)
```

```
## [1] "integer"
```
]

---

## Concatenation

Vectors can be constructed using the `c()` function.

``` r
c(1, 2, 3)
```

```
## [1] 1 2 3
```

``` r
c("Hello", "World!")
```

```
## [1] "Hello"  "World!"
```

``` r
c(c("hi", "hello"), c("bye", "jello"))
```

```
## [1] "hi"    "hello" "bye"   "jello"
```

---

## Explicit coercion between types

`as.character()`, `as.numeric()`, `as.integer()` and `as.logical()`

.pull-left[
.small[
``` r
x <- 1:3
x
```

```
## [1] 1 2 3
```

``` r
typeof(x)
```

```
## [1] "integer"
```

<br>

``` r
y <- as.character(x)
y
```

```
## [1] "1" "2" "3"
```

``` r
typeof(y)
```

```
## [1] "character"
```
]
]

--

.pull-right[
.small[
``` r
z <- c(TRUE, FALSE)
z
```

```
## [1]  TRUE FALSE
```

``` r
typeof(z)
```

```
## [1] "logical"
```

<br>

``` r
w <- as.numeric(z)
w
```

```
## [1] 1 0
```

``` r
typeof(w)
```

```
## [1] "double"
```
]
]

---

## Implicit coercion between types

R will happily convert between various types without complaint when different types of data are concatenated in a vector, and that's not always a great thing!
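The cat survey is an instance of this: a single text answer in a column is enough to turn every value into a character. A minimal sketch with made-up answers (the names `answers` and `fixed` and their values are hypothetical, not taken from `cat-lovers.csv`):

``` r
# Hypothetical survey answers: one respondent typed a word instead of a number
answers <- c(0, 1, "three", 2)

# The single string drags every element up the hierarchy to character
typeof(answers)

# Fix the offending value by hand, then coerce explicitly back to numbers
fixed <- as.numeric(replace(answers, answers == "three", "3"))
mean(fixed)
```

Here `typeof(answers)` reports `"character"`, and the mean of the repaired vector is 1.5.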
.pull-left[
.small[
``` r
x <- c(1, "Hello")
x
```

```
## [1] "1"     "Hello"
```

``` r
typeof(x)
```

```
## [1] "character"
```

``` r
y <- c(FALSE, 3L)
y
```

```
## [1] 0 3
```

``` r
typeof(y)
```

```
## [1] "integer"
```
]
]
.pull-right[
.small[
``` r
z <- c(1.2, 3L)
z
```

```
## [1] 1.2 3.0
```

``` r
typeof(z)
```

```
## [1] "double"
```

``` r
w <- c(2L, "two")
w
```

```
## [1] "2"   "two"
```

``` r
typeof(w)
```

```
## [1] "character"
```
]
]

Data type hierarchy: **character** > **double** > **integer** > **logical**

---
class: middle

# Special values

---

## Special values

.pull-left[
- `NaN`: Not a number

.small[
``` r
0/0
```

```
## [1] NaN
```
]

- `Inf`: Positive infinity

.small[
``` r
pi/0
```

```
## [1] Inf
```
]

- `-Inf`: Negative infinity

.small[
``` r
log(0)
```

```
## [1] -Inf
```
]
]

--

.pull-right[
- `NA`: Not available, represents missing values in the data structure

.small[
``` r
x <- c(1, 2, 3, 4, NA)
mean(x)
```

```
## [1] NA
```

``` r
mean(x, na.rm = TRUE)
```

```
## [1] 2.5
```

``` r
summary(x)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##    1.00    1.75    2.50    2.50    3.25    4.00       1
```

``` r
typeof(NA)
```

```
## [1] "logical"
```

(i.e. `NA` is logical by default, the lowest type in the hierarchy, so it can be coerced to work with every other data type)
]
]

---

## Working with `NA`s

What is happening in the following examples?

.pull-left[
``` r
# TRUE or NA
TRUE | NA
```

```
## [1] TRUE
```

``` r
# TRUE and NA
TRUE & NA
```

```
## [1] NA
```

``` r
# FALSE or NA
FALSE | NA
```

```
## [1] NA
```
]
.pull-right[
``` r
# FALSE and NA
FALSE & NA
```

```
## [1] FALSE
```

.small[
| First | Second | First OR ( \| ) Second | First AND ( & ) Second |
|:-----:|:------:|:----------------------:|:----------------------:|
| TRUE  | TRUE   | TRUE                   | TRUE                   |
| TRUE  | FALSE  | TRUE                   | FALSE                  |
| FALSE | TRUE   | TRUE                   | FALSE                  |
| FALSE | FALSE  | FALSE                  | FALSE                  |
]
]

---
class: middle

# Data classes

---

## Data classes

So far we have talked about data *types*; next we'll introduce *classes*.
- Classes are types of structures in R that have certain useful properties.
- Like Lego, we can stick different classes together to build more complicated ones.
- Analogy:
  - Sticking two numbers together makes a vector.
  - Gluing two vectors together creates a matrix.
- Class examples: **matrices**, **arrays**, **lists**, **data frames**, **factors** and **dates**

---

## Matrices/Arrays

Matrices are rectangular arrays of numbers with two dimensions.

.pull-left[
``` r
M1 <- matrix(c(1, 4, 3, 6, 2, 3), nrow = 2, ncol = 3)
M1
```

```
##      [,1] [,2] [,3]
## [1,]    1    3    2
## [2,]    4    6    3
```

``` r
M2 <- matrix(c(6, 2, 5, 1), nrow = 2, ncol = 2)
M2
```

```
##      [,1] [,2]
## [1,]    6    5
## [2,]    2    1
```
]
.pull-right[
``` r
M2 %*% M1 # Matrix multiplication
```

```
##      [,1] [,2] [,3]
## [1,]   26   48   27
## [2,]    6   12    7
```

``` r
det(M2)
```

```
## [1] -4
```

<br>

.small[
.question[
Can you find a use for this in ILA? 😉
]
]
]

---

## Arrays

An array can be thought of as a stack of matrices, here with three dimensions. Arrays can store only one data type!

.pull-left[
``` r
M1 <- matrix(c(1, 4, 3, 6, 2, 3), nrow = 2, ncol = 3)
M1
```

```
##      [,1] [,2] [,3]
## [1,]    1    3    2
## [2,]    4    6    3
```

``` r
M2 <- matrix(c(6, 2, 5, 1, 4, 3), nrow = 2, ncol = 3)
M2
```

```
##      [,1] [,2] [,3]
## [1,]    6    5    4
## [2,]    2    1    3
```
]
.pull-right[
``` r
# Take these matrices as input to the array.
array_matrix <- array(c(M1, M2), dim = c(2, 3, 2))
array_matrix
```

```
## , , 1
##
##      [,1] [,2] [,3]
## [1,]    1    3    2
## [2,]    4    6    3
##
## , , 2
##
##      [,1] [,2] [,3]
## [1,]    6    5    4
## [2,]    2    1    3
```
]

---

## Lists

Lists are generic containers that can hold items of different data types and classes. You can even have lists of lists!
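A small sketch of the different ways to reach into a list; the `pet` list and its element names are made up for illustration.

``` r
# Hypothetical list mixing a string, a numeric vector and a nested list
pet <- list(
  name  = "Tabby",
  ages  = c(2, 5),
  flags = list(indoor = TRUE)
)

pet$name            # extract one element by name
pet[["name"]]       # double brackets extract the same element
class(pet["name"])  # single brackets keep the list wrapper
pet$flags$indoor    # chain `$` to reach into a nested list
```

Single brackets return a shorter list, while `$` and `[[ ]]` return the element itself, which is usually what a function such as `mean()` needs.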
.pull-left[
``` r
l <- list(
  item1 = 1:4,
  item2 = c("hi", "hello", "jello"),
  item3 = c(TRUE, FALSE),
  item4 = list(
    item4_1 = diag(2)
  )
)
```
]
.pull-right[
``` r
l$item2
```

```
## [1] "hi"    "hello" "jello"
```

``` r
l$item4$item4_1
```

```
##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1
```
]

---

## Data frames

Data frames are special types of lists that contain vectors of any data type but all of equal length.

``` r
df <- data.frame(x = c(1, 2), y = c("me", "you"), z = c(TRUE, FALSE))
df
```

```
##   x   y     z
## 1 1  me  TRUE
## 2 2 you FALSE
```

.pull-left[
``` r
typeof(df)
```

```
## [1] "list"
```
]
.pull-right[
``` r
class(df)
```

```
## [1] "data.frame"
```
]

---

## Factors

R uses **factors** to handle categorical variables. Factors are like a character (level labels) and an integer (level numbers) glued together.

``` r
x <- factor(c("BS", "MS", "PhD", "MS"))
x
```

```
## [1] BS  MS  PhD MS
## Levels: BS MS PhD
```

.pull-left[
``` r
typeof(x)
```

```
## [1] "integer"
```

``` r
class(x)
```

```
## [1] "factor"
```
]
.pull-right[
``` r
glimpse(x)
```

```
##  Factor w/ 3 levels "BS","MS","PhD": 1 2 3 2
```

``` r
as.integer(x)
```

```
## [1] 1 2 3 2
```
]

---

## The `forcats` package

.pull-left[
<br>

- Factors are useful when you have true categorical data and you want to override the ordering of character vectors to improve display
- They are also useful in modeling scenarios
- The **forcats** package provides a suite of useful tools that solve common problems with factors
- More info [HERE](https://forcats.tidyverse.org/)
]
.pull-right[
<img src="img/forcats-part-of-tidyverse.png" width="60%" style="display: block; margin: auto;" />

``` r
cat_lovers %>%
  group_by(handedness) %>%
  tally()
```

```
## # A tibble: 3 × 2
##   handedness       n
##   <chr>        <int>
## 1 ambidextrous     5
## 2 left            13
## 3 right           42
```
]

---

## Working with factors

.pull-left[
.small[
<img src="w02-L04_files/figure-html/unnamed-chunk-43-1.png" width="75%" />
]
]
.pull-right[
.small[
<img
src="w02-L04_files/figure-html/unnamed-chunk-44-1.png" width="70%" />
]
]

Do you recall how these plots can be created from lab-00?

---

## Dates

We can think of dates like an integer (the number of days since the origin, 1 Jan 1970) presented as a character in the form `YYYY-MM-DD`.

``` r
y <- as.Date("2020-01-01")
y
```

```
## [1] "2020-01-01"
```

.pull-left[
``` r
typeof(y)
```

```
## [1] "double"
```

``` r
class(y)
```

```
## [1] "Date"
```
]
.pull-right[
``` r
as.integer(y)
```

```
## [1] 18262
```

``` r
as.integer(y) / 365 # roughly 50 yrs
```

```
## [1] 50.03288
```
]

---
class: middle

.hand[.light-blue[
we're just going to scratch the surface of working with dates in R here...
]]

YouTube: Tom Scott, The problem with time and timezones, Computerphile - [
](https://www.youtube.com/watch?v=-5wpm-gesOY)

<img src="img/pinch-point-hazard-label.png" width="10%" style="display: block; margin: auto;" />

---

## Make a date

.pull-left[
- **lubridate** is the tidyverse-friendly package that makes dealing with dates a little easier
- It's _not_ one of the core tidyverse packages!
- Functions to read different date orders: `ymd()`, `dmy()`, `mdy()`
- Recognises numerical and text months.

<img src="img/lubridate-not-part-of-tidyverse.png" width="30%" style="display: block; margin: auto;" />
]
.pull-right[
``` r
#install.packages("lubridate")
library(lubridate)
```

``` r
ymd("2022 October 11")
```

```
## [1] "2022-10-11"
```

``` r
mdy("10oct2022")
```

```
## [1] "2022-10-20"
```

``` r
dmy("11/10/2022")
```

```
## [1] "2022-10-11"
```
]

<!--
## Visualising poll-tracker data

Data from YouGov's voting intention tracker ([source](https://yougov.co.uk/topics/politics/articles-reports/2022/09/30/voting-intention-con-21-lab-54-28-29-sep-2022))

.pull-left[
How do you go from this...

``` r
poll <- read_csv("data/voting_intension.csv")
glimpse(poll)
```

```
## Rows: 129
## Columns: 9
## $ Date          <chr> "26/01/2020", "0…
## $ Con           <dbl> 49, 49, 48, 52, …
## $ Lab           <dbl> 29, 30, 28, 28, …
## $ `Lib Dem`     <dbl> 10, 8, 10, 8, 5,…
## $ SNP           <dbl> 5, 4, 4, 5, 4, 4…
## $ `Plaid Cymru` <dbl> 1, 1, 1, 1, 1, 1…
## $ `Reform UK`   <dbl> 2, 2, 2, 1, 1, 3…
## $ Green         <dbl> 4, 5, 6, 5, 3, 5…
## $ Other         <dbl> 0, 1, 1, 1, 1, 1…
```
]
.pull-right[
...to this...

<img src="img/poll_tracker.png" width="100%" style="display: block; margin: auto;" />
]
-->

<!--
## Steps to take

.pull-left[
1. Pivot data from wide format to long
2. Convert character `Date` variable to a date class
3. Convert `Party` from character to factor - choose an appropriate category ordering (see [here](https://forcats.tidyverse.org/reference/fct_reorder.html) for more info)
4. Create line plot of `Share` against `Date` per `Party`
5. Add title and axis labels
6. 
Change line colours to party-affiliated colours
]
.pull-right[
.small[
``` r
library(tidyverse)
library(lubridate)

poll <- read_csv("data/voting_intension.csv")

poll %>%
  pivot_longer(cols = -Date, names_to = "Party", values_to = "Share") %>%
  mutate(Date = dmy(Date),
         Party = fct_reorder(Party, Share, first, .desc = TRUE)) %>%
  ggplot(mapping = aes(x = Date, y = Share, colour = Party)) +
  geom_line() +
  labs(x = "Date", y = "Vote Share (%)", colour = "Political Party",
       title = "YouGov, Voting Intention Tracker") +
  scale_color_manual(values = c("#003cab", "#c20600", "#ffb922", "#ffe046",
                                "#149678", "#6ac9f5", "#006b46", "#bfbdbd"))
```
]
]
-->

<!--
## The final result!

<img src="w02-L04_files/figure-html/unnamed-chunk-55-1.png" style="display: block; margin: auto;" />
-->
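When every date string follows one known order, base R's `as.Date()` with an explicit `format` is a possible stand-in for `dmy()`. A minimal sketch with made-up strings (`raw_dates` and `dates` are illustrative names, and this is base R rather than lubridate):

``` r
# Hypothetical date strings written in day/month/year order
raw_dates <- c("26/01/2020", "11/10/2022")

# Tell as.Date() exactly how the strings are laid out
dates <- as.Date(raw_dates, format = "%d/%m/%Y")

class(dates)    # the result is a Date, no longer a character
format(dates)   # printed in the standard YYYY-MM-DD form
diff(dates)     # dates are numbers underneath, so they can be subtracted
```

`as.Date()` returns `NA` for any string that does not match the given format, so it is worth checking for missing values after parsing.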