Data Types and Classes

.title[
# Data Types and Classes
]
.subtitle[
## <br><br> Introduction to Data Science
]
.author[
### University of Edinburgh
]
.date[
### <br> 2024/2025
]

---

# Why should you care about data types?

---

## Example: Cat lovers

A survey asked respondents their name and number of cats. The instructions said to enter the number of cats as a numerical value.

``` r
cat_lovers <- read_csv("data/cat-lovers.csv")
```

```
## # A tibble: 60 × 3
##    name           number_of_cats handedness
##    <chr>          <chr>          <chr>     
##  1 Bernice Warren 0              left      
##  2 Woodrow Stone  0              left      
##  3 Willie Bass    1              left      
##  4 Tyrone Estrada 3              left      
##  5 Alex Daniels   3              left      
##  6 Jane Bates     2              left      
##  7 Latoya Simpson 1              left      
##  8 Darin Woods    1              left      
##  9 Agnes Cobb     0              left      
## 10 Tabitha Grant  0              left      
## # ℹ 50 more rows
```
]

``` r
cat_lovers %>%
  summarise(mean_cats = mean(number_of_cats))
```

```
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `mean_cats =
##   mean(number_of_cats)`.
## Caused by warning in `mean.default()`:
## ! argument is not numeric or logical: returning NA
```

```
## # A tibble: 1 × 1
##   mean_cats
##       <dbl>
## 1        NA
```

]

---

## Closer look at the `mean` function

``` r
help(mean)
```
]
.pull-right-wide[
<img src="img/mean-help-v2.png" width="125%" style="display: block; margin: auto;" />

]

``` r
cat_lovers %>%
  summarise(mean_cats = mean(number_of_cats, na.rm = TRUE))
```

```
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `mean_cats = mean(number_of_cats, na.rm = TRUE)`.
## Caused by warning in `mean.default()`:
## ! argument is not numeric or logical: returning NA
```

```
## # A tibble: 1 × 1
##   mean_cats
##       <dbl>
## 1        NA
```

---

## Take a breath and look at your data

``` r
glimpse(cat_lovers)
```

```
## Rows: 60
## Columns: 3
## $ name           <chr> "Bernice Warren", "Woodrow Stone", "Willie Bass", "Tyro…
## $ number_of_cats <chr> "0", "0", "1", "3", "3", "2", "1", "1", "0", "0", "0", …
## $ handedness     <chr> "left", "left", "left", "left", "left", "left", "left",…
```

``` r
cat_lovers %>% slice( c(48, 54) )
```

```
## # A tibble: 2 × 3
##   name         number_of_cats                                      handedness
##   <chr>        <chr>                                               <chr>     
## 1 Ginger Clark 1.5 - honestly I think one of my cats is half human right     
## 2 Doug Bass    three                                               right
```

---

## Might need to babysit your respondents

Need to _mutate_ the `number_of_cats` variable:
-   to fix unusual data values, and
-   to coarse the characters to numbers.

``` r
cat_lovers %>%
  mutate(
    number_of_cats = case_when(
      name == "Ginger Clark" ~ "2",
      name == "Doug Bass"    ~ "3",
      TRUE                   ~ number_of_cats
      ),
    number_of_cats = as.numeric(number_of_cats)
    ) %>%
  summarise(mean_cats = mean(number_of_cats))
```

```
## # A tibble: 1 × 1
##   mean_cats
##       <dbl>
## 1     0.833
```

---

## Moral of the story

- If your data does not behave how you expect it to, type coercion upon reading in the data might be the reason.

- Go in and investigate your data, apply the fix, *save your data*, live happily ever after.

- You can make explicit coercion by using suitable functions starting with `as.???()`

- Different options for `???` part of the code

---

# Data types

---

## Four key data types in R

``` r
typeof(TRUE)
```

```
## [1] "logical"
```

**double** - floating point numerical values (default numerical type)

``` r
typeof(1.335)
```

```
## [1] "double"
```

``` r
typeof(7)
```

```
## [1] "double"
```
]
.pull-right[

**character** - character strings

``` r
typeof("hello")
```

```
## [1] "character"
```

**integer** - integer numerical values (indicated with an `L`)

``` r
typeof(7L)
```

```
## [1] "integer"
```

``` r
typeof(1:3)
```

```
## [1] "integer"
```
]

---

## Concatenation

Vectors can be constructed using the `c()` function.

``` r
c(1, 2, 3)
```

```
## [1] 1 2 3
```

``` r
c("Hello", "World!")
```

```
## [1] "Hello"  "World!"
```

``` r
c(c("hi", "hello"), c("bye", "jello"))
```

```
## [1] "hi"    "hello" "bye"   "jello"
```

---

## Explicit coercion between types

`as.character()`, `as.numeric()`, `as.integer()` and `as.logical()`

``` r
x <- 1:3
x
```

```
## [1] 1 2 3
```

``` r
typeof(x)
```

```
## [1] "integer"
```

<br>

``` r
y <- as.character(x)
y
```

```
## [1] "1" "2" "3"
```

``` r
typeof(y)
```

```
## [1] "character"
```
]
]

``` r
z <- c(TRUE, FALSE)
z
```

```
## [1]  TRUE FALSE
```

``` r
typeof(z)
```

```
## [1] "logical"
```

<br>

``` r
w <- as.numeric(z)
w
```

```
## [1] 1 0
```

``` r
typeof(w)
```

```
## [1] "double"
```
]
]

---

## Implicit coercion between types

R will happily convert between various types without complaint when different types of data are concatenated in a vector, and that's not always a great thing!

``` r
x <- c(1, "Hello")
x
```

```
## [1] "1"     "Hello"
```

``` r
typeof(x)
```

```
## [1] "character"
```

``` r
y <- c(FALSE, 3L)
y
```

```
## [1] 0 3
```

``` r
typeof(y)
```

```
## [1] "integer"
```
]
]

``` r
z <- c(1.2, 3L)
z
```

```
## [1] 1.2 3.0
```

``` r
typeof(z)
```

```
## [1] "double"
```

``` r
w <- c(2L, "two")
w
```

```
## [1] "2"   "two"
```

``` r
typeof(w)
```

```
## [1] "character"
```
]
]

Data type hierarchy: **character** > **double** > **integer** > **logical**

---

# Special values

---

## Special values

``` r
0/0
```

```
## [1] NaN
```
]
- `Inf`: Positive infinity
.small[

``` r
pi/0
```

```
## [1] Inf
```
]
- `-Inf`: Negative infinity
.small[

``` r
log(0)
```

```
## [1] -Inf
```
]
]

``` r
x <- c(1, 2, 3, 4, NA)
mean(x)
```

```
## [1] NA
```

``` r
mean(x, na.rm = TRUE)
```

```
## [1] 2.5
```

``` r
summary(x)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    1.75    2.50    2.50    3.25    4.00       1
```

``` r
typeof(NA)
```

```
## [1] "logical"
```
(ie `NA`s can work with all data types in the hierarchy)
]

]

---

## Working with `NA`s

What is happening in the following examples?
.pull-left[

``` r
# TRUE or NA
TRUE | NA
```

```
## [1] TRUE
```

``` r
# TRUE and NA
TRUE & NA
```

```
## [1] NA
```

``` r
# FALSE or NA
FALSE | NA
```

```
## [1] NA
```
]
.pull-right[

``` r
# FALSE and NA
FALSE & NA
```

```
## [1] FALSE
```

---

# Data classes

---

## Data classes

So far we talked about data *types*, next we'll introduce *classes*.

- Classes are types of structures in R that have certain useful properties.

- Like Lego, we can stick different classes together to build more complicated ones.

- Analogy: 
  -   Stick two numbers together makes a vector. 
  -   Two vectors glued together creates a matrix.

- Class examples: **matrices**, **arrays**, **lists**, **data frames**, **factors** and **dates**

---

## Matrices/Arrays

Matrices are rectangular arrays of numbers, having two dimensions.

``` r
M1 <- matrix(c(1, 4, 3, 6, 2, 3), 
             nrow = 2, ncol = 3)
M1
```

```
##      [,1] [,2] [,3]
## [1,]    1    3    2
## [2,]    4    6    3
```

``` r
M2 <- matrix(c(6, 2, 5, 1), 
             nrow = 2, ncol = 2)
M2
```

```
##      [,1] [,2]
## [1,]    6    5
## [2,]    2    1
```
]
.pull-right[

``` r
M2 %*% M1   #Matrix multiplication
```

```
##      [,1] [,2] [,3]
## [1,]   26   48   27
## [2,]    6   12    7
```

``` r
det(M2)
```

```
## [1] -4
```

<br>

---

## Arrays

Arrays can be considered as collection of matrices having three dimensions. Arrays can store only data type!

``` r
M1 <- matrix(c(1, 4, 3, 6, 2, 3), 
             nrow = 2, ncol = 3)
M1
```

```
##      [,1] [,2] [,3]
## [1,]    1    3    2
## [2,]    4    6    3
```

``` r
M2 <- matrix(c(6, 2, 5, 1, 4, 3), 
             nrow = 2, ncol = 3)
M2
```

```
##      [,1] [,2] [,3]
## [1,]    6    5    4
## [2,]    2    1    3
```
]

``` r
# Take these matrices as input to the array.
array_matrix <- array(c(M1, M2), 
                      dim = c(2, 3, 2))
array_matrix
```

```
## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    3    2
## [2,]    4    6    3
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]    6    5    4
## [2,]    2    1    3
```
]

---

## Lists

Lists are a generic container containing items of different data type and class.

You can even have lists of lists!

``` r
l <- list(
  item1 = 1:4,
  item2 = c("hi", "hello", "jello"),
  itme3 = c(TRUE, FALSE),
  item4 = list(
    item4_1 = diag(2)
  )
)
```
]
.pull-right[

``` r
l$item2
```

```
## [1] "hi"    "hello" "jello"
```

``` r
l$item4$item4_1
```

```
##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1
```
]

---

## Data frames

Data frames are special types of lists that contain vectors of any data type but all of equal length.

``` r
df <- data.frame(x = c(1, 2), 
                 y = c("me","you"), 
                 z = c(TRUE, FALSE))
df
```

```
##   x   y     z
## 1 1  me  TRUE
## 2 2 you FALSE
```

``` r
typeof(df)
```

```
## [1] "list"
```
]
.pull-right[

``` r
class(df)
```

```
## [1] "data.frame"
```
]

---

## Factors

R uses **factors** to handle categorical variables.

Factors like character (level labels) and an integer (level numbers) glued together.

``` r
x <- factor(c("BS", "MS", "PhD", "MS"))
x
```

```
## [1] BS  MS  PhD MS 
## Levels: BS MS PhD
```

``` r
typeof(x)
```

```
## [1] "integer"
```

``` r
class(x)
```

```
## [1] "factor"
```
]
.pull-right[

``` r
glimpse(x)
```

```
##  Factor w/ 3 levels "BS","MS","PhD": 1 2 3 2
```

``` r
as.integer(x)
```

```
## [1] 1 2 3 2
```
]

---

## The `forcats` package

.pull-left[
<br>
- Factors are useful when you have true categorical data and you want to override the ordering of character vectors to improve display
- They are also useful in modeling scenarios
- The **forcats** package provides a suite of useful tools that solve common problems with factors
- More info [HERE](https://forcats.tidyverse.org/)
]
.pull-right[
<img src="img/forcats-part-of-tidyverse.png" width="60%" style="display: block; margin: auto;" />

``` r
cat_lovers %>%
  group_by(handedness) %>%
  tally()
```

```
## # A tibble: 3 × 2
##   handedness       n
##   <chr>        <int>
## 1 ambidextrous     5
## 2 left            13
## 3 right           42
```

]

---

## Working with factors

Do you recall how these plots can be created from lab-00?

---

## Dates

We can think of dates like an integer (the number of days since the origin, 1 Jan 1970) presented as a character in the form `YYYY-MM-DD`.

``` r
y <- as.Date("2020-01-01")
y
```

```
## [1] "2020-01-01"
```

``` r
typeof(y)
```

```
## [1] "double"
```

``` r
class(y)
```

```
## [1] "Date"
```
]
.pull-right[

``` r
as.integer(y)
```

```
## [1] 18262
```

``` r
as.integer(y) / 365 # roughly 50 yrs
```

```
## [1] 50.03288
```
]

---

YouTube: Tom Scott, The problem with time and timezones, Computerphile - [<svg aria-hidden="true" role="img" viewBox="0 0 576 512" style="height:1em;width:1.12em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:red;overflow:visible;position:relative;"><path d="M549.655 124.083c-6.281-23.65-24.787-42.276-48.284-48.597C458.781 64 288 64 288 64S117.22 64 74.629 75.486c-23.497 6.322-42.003 24.947-48.284 48.597-11.412 42.867-11.412 132.305-11.412 132.305s0 89.438 11.412 132.305c6.281 23.65 24.787 41.5 48.284 47.821C117.22 448 288 448 288 448s170.78 0 213.371-11.486c23.497-6.321 42.003-24.171 48.284-47.821 11.412-42.867 11.412-132.305 11.412-132.305s0-89.438-11.412-132.305zm-317.51 213.508V175.185l142.739 81.205-142.739 81.201z"/></svg>](https://www.youtube.com/watch?v=-5wpm-gesOY)

---

## Make a date

.pull-left[
- **lubridate** is the tidyverse-friendly package that makes dealing with dates a little easier
- It's _not_ one of the core tidyverse packages!
- Functions to read different date orders: `ymd()`, `dmy()`, `mdy()`
- Recognises numerical and text months.

<img src="img/lubridate-not-part-of-tidyverse.png" width="30%" style="display: block; margin: auto;" />
]
.pull-right[

``` r
#install.packages("lubridate")
library(lubridate)
```

``` r
ymd("2022 October 11")
```

```
## [1] "2022-10-11"
```

``` r
mdy("10oct2022")
```

```
## [1] "2022-10-20"
```

``` r
dmy("11/10/2022")
```

```
## [1] "2022-10-11"
```
]

<!--

## Visualising poll-tracker data

Data from YouGov's voting intention tracer ([source](https://yougov.co.uk/topics/politics/articles-reports/2022/09/30/voting-intention-con-21-lab-54-28-29-sep-2022))

``` r
poll <- read_csv("data/voting_intension.csv")
glimpse(poll)
```

```
## Rows: 129
## Columns: 9
## $ Date          <chr> "26/01/2020", "0…
## $ Con           <dbl> 49, 49, 48, 52, …
## $ Lab           <dbl> 29, 30, 28, 28, …
## $ `Lib Dem`     <dbl> 10, 8, 10, 8, 5,…
## $ SNP           <dbl> 5, 4, 4, 5, 4, 4…
## $ `Plaid Cymru` <dbl> 1, 1, 1, 1, 1, 1…
## $ `Reform UK`   <dbl> 2, 2, 2, 1, 1, 3…
## $ Green         <dbl> 4, 5, 6, 5, 3, 5…
## $ Other         <dbl> 0, 1, 1, 1, 1, 1…
```
]

]

-->

<!--
## Step to take

.pull-left[
1.   Pivot data from wide format to long
2.   Convert character `Date` variable to a date class
3.   Convert `Parties` from character to factor
  -   choose an appropriate category ordering (see [here](https://forcats.tidyverse.org/reference/fct_reorder.html) for more info)
4.   Create line plot of `Share` against `Date` per `Party`
5.   Add title and axis labels
6.   Change line colours to party affiliated colours
]
.pull-right[
.small[

``` r
library(tidyverse)
library(lubridate)

poll <- read_csv("data/voting_intension.csv")

poll %>% 
  pivot_longer(cols = -Date, 
               names_to = "Party", 
               values_to = "Share") %>%
  mutate(Date = dmy(Date),
         Party = fct_reorder(Party, Share, 
                             first, .desc=TRUE)) %>%
  ggplot(mapping = aes(x = Date, 
                       y = Share, 
                       colour = Party)) +
    geom_line() +
    labs(x = "Date",
         y = "Vote Share (%)",
         colour = "Political Party",
        title = "YouGov, Voting Intension Tracker") +
    scale_color_manual(values = c("#003cab", "#c20600", 
                                  "#ffb922", "#ffe046",
                                  "#149678", "#6ac9f5", 
                                  "#006b46", "#bfbdbd"))
```
]
]
-->

<!--
## The final result!

-->