Tidy Data and Data Wrangling

.title[
# Tidy Data and Data Wrangling
]
.subtitle[
## <br><br> Introduction to Data Science
]
.author[
### University of Edinburgh
]
.date[
### <br> 2024/2025
]

---

## Tidy data

>Happy families are all alike; every unhappy family is unhappy in its own way. 
>
>Leo Tolstoy

- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
]

- Merged cells.
- Repeated or non-informative variable/column names.
- Cell comments or superfluous information.
- Empty or total/subtotal rows.
- ... 
]

---

.footnote[
Source: [Army Air Forces Statistical Digest, WW II](https://www.ibiblio.org/hyperwar/AAF/StatDigest/aafsd-3.html)
]

---

<br>

Source: [Gapminder, Estimated HIV prevalence among 15-49 year olds](https://www.gapminder.org/data)

---

---

## Displaying vs. summarising data

| Name | Species | Height | Mass |
|:----|----:|----:|---:|
| Luke Skywalker      | Human |  172 |    77 |
| C-3PO               | Droid |  167 |    75 |
| R2-D2               | Droid |   96 |    32 |
| Darth Vader         | Human |  202 |   136 |
| Leia Organa         | Human |  150 |    49 |
| ... | ... | ... | ... |
]

| Species | Count | Average Height |
|:----|----:|----:|
| Droid | 6   | 131 |
| Human | 35  | 177 |
| Other | 46  | 178 |

]

``` r
library(tidyverse)
```

``` r
starwars %>%
  select(species, height, mass)
```
]

``` r
starwars %>%
  select(species, height) %>%
  mutate(species_group = case_when(
    species == "Droid" ~ "Droid",
    species == "Human" ~ "Human",
    TRUE ~ "Other")) %>%
  group_by(species_group) %>%
  summarise(
    count = n(),
    avg_height = mean(height, na.rm = TRUE)
  )
```
]

]

---

# Grammar of data wrangling

---

## A grammar of data wrangling...

... based on the concepts of functions as verbs that manipulate data frames

.pull-left[
<img src="img/dplyr-part-of-tidyverse.png" width="70%" />
]
.pull-right[
.midi[
- `select`: pick columns by name
- `arrange`: reorder rows
- `slice`: pick rows using index(es)
- `filter`: pick rows matching criteria
- `distinct`: filter for unique rows
- `mutate`: add new variables
- `summarise`: reduce variables to values
- `group_by`: for grouped operations
- ... (many more)
]

``` r
<function_name>(<dataset_name>, ...)
```

``` r
help(select)
```
]

---

## `select` specific variables

``` r
select(starwars, name, height, mass)
```

```
## # A tibble: 87 × 3
##    name               height  mass
##    <chr>               <int> <dbl>
##  1 Luke Skywalker        172    77
##  2 C-3PO                 167    75
##  3 R2-D2                  96    32
##  4 Darth Vader           202   136
##  5 Leia Organa           150    49
##  6 Owen Lars             178   120
##  7 Beru Whitesun Lars    165    75
##  8 R5-D4                  97    32
##  9 Biggs Darklighter     183    84
## 10 Obi-Wan Kenobi        182    77
## # ℹ 77 more rows
```
]

``` r
select(starwars, -name)
```

```
## # A tibble: 87 × 13
##    height  mass hair_color    skin_color
##     <int> <dbl> <chr>         <chr>     
##  1    172    77 blond         fair      
##  2    167    75 <NA>          gold      
##  3     96    32 <NA>          white, bl…
##  4    202   136 none          white     
##  5    150    49 brown         light     
##  6    178   120 brown, grey   light     
##  7    165    75 brown         light     
##  8     97    32 <NA>          white, red
##  9    183    84 black         light     
## 10    182    77 auburn, white fair      
## # ℹ 77 more rows
## # ℹ 9 more variables: eye_color <chr>,
## #   birth_year <dbl>, sex <chr>,
## #   gender <chr>, homeworld <chr>,
## #   species <chr>, films <list>,
## #   vehicles <list>, starships <list>
```
]

---

## Select helper functions

``` r
select(starwars, ends_with("color"))
```

```
## # A tibble: 87 × 3
##    hair_color    skin_color  eye_color
##    <chr>         <chr>       <chr>    
##  1 blond         fair        blue     
##  2 <NA>          gold        yellow   
##  3 <NA>          white, blue red      
##  4 none          white       yellow   
##  5 brown         light       brown    
##  6 brown, grey   light       blue     
##  7 brown         light       blue     
##  8 <NA>          white, red  red      
##  9 black         light       brown    
## 10 auburn, white fair        blue-gray
## # ℹ 77 more rows
```
]

.pull-right[
<br>
.small[
*   `starts_with()`: Starts with a prefix
*   `ends_with()`: Ends with a suffix
*   `contains()`: Contains a literal string
*   `num_range()`: Match a numerical range like x01, x02, x03
*   `one_of()`: Matches variable names in a character vector
*   `everything()`: Matches all variables
*   `last_col()`: Select last variable, possibly with an offset
*   `matches()`: Matches a regular expression (a sequence of symbols/characters expressing a string/pattern to be searched for within text)

See the help page for more info on these functions.
]
]

---

## `arrange` mass in numerical order

``` r
arrange(starwars, mass)
```

```
## # A tibble: 87 × 14
##    name          height  mass hair_color
##    <chr>          <int> <dbl> <chr>     
##  1 Ratts Tyerel      79    15 none      
##  2 Yoda              66    17 white     
##  3 Wicket Systr…     88    20 brown     
##  4 R2-D2             96    32 <NA>      
##  5 R5-D4             97    32 <NA>      
##  6 Sebulba          112    40 none      
##  7 Padmé Amidala    185    45 brown     
...
```
]

``` r
arrange(starwars, desc(mass))
```

```
## # A tibble: 87 × 14
##    name          height  mass hair_color
##    <chr>          <int> <dbl> <chr>     
##  1 Jabba Desili…    175  1358 <NA>      
##  2 Grievous         216   159 none      
##  3 IG-88            200   140 none      
##  4 Darth Vader      202   136 none      
##  5 Tarfful          234   136 brown     
##  6 Owen Lars        178   120 brown, gr…
##  7 Bossk            190   113 none      
...
```
]

---

## `slice` for certain row numbers

``` r
slice(starwars, 1:4)
```

```
## # A tibble: 4 × 14
##   name           height  mass hair_color
##   <chr>           <int> <dbl> <chr>     
## 1 Luke Skywalker    172    77 blond     
## 2 C-3PO             167    75 <NA>      
## 3 R2-D2              96    32 <NA>      
## 4 Darth Vader       202   136 none      
## # ℹ 10 more variables:
## #   skin_color <chr>, eye_color <chr>,
## #   birth_year <dbl>, sex <chr>,
## #   gender <chr>, homeworld <chr>,
## #   species <chr>, films <list>,
## #   vehicles <list>, starships <list>
```
]

``` r
slice(starwars, c(2,4,6,8))
```

```
## # A tibble: 4 × 14
##   name        height  mass hair_color 
##   <chr>        <int> <dbl> <chr>      
## 1 C-3PO          167    75 <NA>       
## 2 Darth Vader    202   136 none       
## 3 Owen Lars      178   120 brown, grey
## 4 R5-D4           97    32 <NA>       
## # ℹ 10 more variables:
## #   skin_color <chr>, eye_color <chr>,
## #   birth_year <dbl>, sex <chr>,
## #   gender <chr>, homeworld <chr>,
## #   species <chr>, films <list>,
## #   vehicles <list>, starships <list>
```
]
---

## `filter` for the droids

``` r
filter(starwars, species == "Droid")
```

```
## # A tibble: 6 × 14
##   name   height  mass hair_color skin_color  eye_color birth_year sex   gender  
##   <chr>   <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr>   
## 1 C-3PO     167    75 <NA>       gold        yellow           112 none  masculi…
## 2 R2-D2      96    32 <NA>       white, blue red               33 none  masculi…
## 3 R5-D4      97    32 <NA>       white, red  red               NA none  masculi…
## 4 IG-88     200   140 none       metal       red               15 none  masculi…
## 5 R4-P17     96    NA none       silver, red red, blue         NA none  feminine
## 6 BB8        NA    NA none       none        black             NA none  masculi…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
```

<br>

"These aren't the droids you're looking for!"<img style='vertical-align:middle;height:36px;' src='img/C3PO.webp'><img style='vertical-align:middle;height:36px;' src='img/R2D2.webp'>

---

## Logical operators in R

| Code    | Description           | Example        | Output  |
|:-------:|:----------------------|:---------------|:--------|
| `<`     | Strictly less than    | `1 < 2`        | `TRUE`  |
| `<=`    | Less than or equal to | `2 <= 2`       | `TRUE`  | 
| `>`     | Strictly greater than | `-3 > 3`       | `FALSE` |
| `>=`    | Greater than or equal to | `10 >= 0`      | `TRUE`  |
| `==`    | Exactly equal to      | `pi == 3.14`   | `FALSE` |
| `!=`    | Not equal to          | `pi != 3.14`   | `TRUE`  |
| ` `     | ` `                   | ` `            | ` `     |
| ` `     | ` `                   | ` `            | ` `     |
| ` `     | ` `                   | ` `            | ` `     |
| `x & y` | Logical AND operator  | `TRUE & FALSE` | `FALSE` |
| `x `<span, style="color:#a7a798;">&nbsp;&nbsp;&#124;&nbsp;&nbsp;</span>` y` | Logical OR operator   | `TRUE `<span, style="color:#a7a798;">&nbsp;&nbsp;&#124;&nbsp;&nbsp;</span>` FALSE` | `TRUE`  |
| `!x`    | Logical NOT operator  | `!TRUE `       | `FALSE` |

``` r
filter(starwars, species == "Droid" & height < 100)
```

---

### `mutate` to compute BMI and species group

``` r
mutate(starwars, 
       bmi = 10000 * mass / height^2,
       species_grp = 
         case_when(
           species == "Human" ~ "Human",
           species == "Droid" ~ "Droid",
           TRUE ~ "Other"
         ))
```
]

```
## # A tibble: 87 × 16
##    name     bmi species_grp height  mass
##    <chr>  <dbl> <chr>        <int> <dbl>
##  1 Jabba… 443.  Other          175  1358
##  2 Dud B…  50.9 Other           94    45
##  3 Yoda    39.0 Other           66    17
##  4 Owen …  37.9 Human          178   120
##  5 IG-88   35   Droid          200   140
##  6 R2-D2   34.7 Droid           96    32
##  7 Griev…  34.1 Other          216   159
##  8 R5-D4   34.0 Droid           97    32
##  9 Jek T…  34.0 Other          180   110
## 10 Darth…  33.3 Human          202   136
## # ℹ 77 more rows
## # ℹ 11 more variables:
## #   hair_color <chr>, skin_color <chr>,
## #   eye_color <chr>, birth_year <dbl>,
## #   sex <chr>, gender <chr>,
## #   homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, …
```
]

---

## `count` to make a frequency table

``` r
count(starwars, eye_color, sort = TRUE)
```

```
## # A tibble: 15 × 2
##    eye_color         n
##    <chr>         <int>
##  1 brown            21
##  2 blue             19
##  3 yellow           11
##  4 black            10
##  5 orange            8
##  6 red               5
##  7 hazel             3
##  8 unknown           3
##  9 blue-gray         1
## 10 dark              1
## 11 gold              1
## 12 green, yellow     1
## 13 pink              1
## 14 red, blue         1
## 15 white             1
```
]

---

## `summerise` to compute the average height

``` r
summarise(starwars, avg_ht = mean(height))
```

```
## # A tibble: 1 × 1
##   avg_ht
##    <dbl>
## 1   179.
```

``` r
summarise(starwars, 
          min_ht = min(height),
          avg_ht = mean(height),
          med_ht = median(height),
          max_ht = max(height),
          sd_ht = sd(height))
```

```
## # A tibble: 1 × 5
##   min_ht avg_ht med_ht max_ht sd_ht
##    <int>  <dbl>  <int>  <int> <dbl>
## 1     88   179.    180    228  22.4
```

]

``` r
summarise(starwars, mean(height))
```

```
## # A tibble: 1 × 1
##   `mean(height)`
##            <dbl>
## 1           179.
```

.tip[
`summarise()` changes the data frame entirely, it collapses rows down to a single summary statistic, and removes all columns that are irrelevant to the calculation.
]

]

---

# .hand[We...]

---

## How to do multiple data wrangling steps?

``` r
summarise( 
  filter( 
    starwars, 
    species == "Human"
    ), 
  avg_mass = mean(mass, na.rm = TRUE)
  )
```

*ISSUE*: Where do you begin reading the code?!

]

``` r
starwars_humans <-  
  filter(starwars, 
         species == "Human")

summarise(starwars_humans, 
          av_mass = mean(mass, 
                         na.rm = TRUE))
```

*ISSUE*: How good is *your* memory of each object in the computer's memory?!
]

.question[
Which method (if either) produces `tidy` code that is easy to read? Can we do any better?
]

---

# Pipes

---

## What is a pipe?

In programming, a pipe is a technique for passing information from one process to another.

The pipe operator is implemented in the package **magrittr**, though we don't need to load this package explicitly since **tidyverse** does this for us.

``` r
data %>% 
  select(variable1, variable2) %>%
  filter(variable1 <= variable2)
```

.center[
<svg aria-hidden="true" role="img" viewBox="0 0 384 512" style="height:1em;width:0.75em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:white;overflow:visible;position:relative;"><path d="M169.4 470.6c12.5 12.5 32.8 12.5 45.3 0l160-160c12.5-12.5 12.5-32.8 0-45.3s-32.8-12.5-45.3 0L224 370.8 224 64c0-17.7-14.3-32-32-32s-32 14.3-32 32l0 306.7L54.6 265.4c-12.5-12.5-32.8-12.5-45.3 0s-12.5 32.8 0 45.3l160 160z"/></svg>
]

]

---

## The piping narrative

* **Take the `starwars` data set and ...**

]

``` r
*starwars
```

```
## # A tibble: 87 × 14
##    name         height  mass hair_color skin_color eye_color
##    <chr>         <int> <dbl> <chr>      <chr>      <chr>    
##  1 Luke Skywal…    172    77 blond      fair       blue     
##  2 C-3PO           167    75 <NA>       gold       yellow   
##  3 R2-D2            96    32 <NA>       white, bl… red      
##  4 Darth Vader     202   136 none       white      yellow   
##  5 Leia Organa     150    49 brown      light      brown    
##  6 Owen Lars       178   120 brown, gr… light      blue     
##  7 Beru Whites…    165    75 brown      light      blue     
##  8 R5-D4            97    32 <NA>       white, red red      
##  9 Biggs Darkl…    183    84 black      light      brown    
## 10 Obi-Wan Ken…    182    77 auburn, w… fair       blue-gray
## # ℹ 77 more rows
## # ℹ 8 more variables: birth_year <dbl>, sex <chr>,
## #   gender <chr>, homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>
```
]
]

---

## The piping narrative

* Take the `starwars` data set and ...
* **... select the `species` and `height` variables. **

]

``` r
starwars %>%
* select(species, height)
```

```
## # A tibble: 87 × 2
##    species height
##    <chr>    <int>
##  1 Human      172
##  2 Droid      167
##  3 Droid       96
##  4 Human      202
##  5 Human      150
##  6 Human      178
##  7 Human      165
##  8 Droid       97
##  9 Human      183
## 10 Human      182
## # ℹ 77 more rows
```
]
]

---
## The piping narrative

* Take the `starwars` data set and ...
* ... select the `species` and `height` variables. 
* **Mutate the data to add a new categorical variable called `species_group` that classifies a character's `species` as either "Droid", "Human" or "Other".**

]

``` r
starwars %>%
  select(species, height) %>%
* mutate(species_group = case_when(
*   species == "Droid" ~ "Droid",
*   species == "Human" ~ "Human",
*   TRUE ~ "Other"))
```

```
## # A tibble: 87 × 3
##    species height species_group
##    <chr>    <int> <chr>        
##  1 Human      172 Human        
##  2 Droid      167 Droid        
##  3 Droid       96 Droid        
##  4 Human      202 Human        
##  5 Human      150 Human        
##  6 Human      178 Human        
##  7 Human      165 Human        
##  8 Droid       97 Droid        
##  9 Human      183 Human        
## 10 Human      182 Human        
## # ℹ 77 more rows
```
]
]

---
## The piping narrative

]

``` r
starwars %>%
  select(species, height) %>%
  mutate(species_group = case_when(
    species == "Droid" ~ "Droid",
    species == "Human" ~ "Human",
    TRUE ~ "Other")) %>%
* group_by(species_group)
```

```
## # A tibble: 87 × 3
## # Groups:   species_group [3]
##    species height species_group
##    <chr>    <int> <chr>        
##  1 Human      172 Human        
##  2 Droid      167 Droid        
##  3 Droid       96 Droid        
##  4 Human      202 Human        
##  5 Human      150 Human        
##  6 Human      178 Human        
##  7 Human      165 Human        
##  8 Droid       97 Droid        
##  9 Human      183 Human        
## 10 Human      182 Human        
## # ℹ 77 more rows
```
]
]

---
## The piping narrative

* Take the `starwars` data set and ...
* ... select the `species` and `height` variables. 
* Mutate the data to add a new categorical variable called `species_group` that classifies a character's `species` as either "Droid", "Human" or "Other".
* Next, group the data according to their `species_group`, and ...
* **... summarise the data to return ...**

]

```
## # A tibble: 3 × 1
##   species_group
##   <chr>        
## 1 Droid        
## 2 Human        
## 3 Other
```
]
]

---
## The piping narrative

]

```
## # A tibble: 3 × 2
##   species_group count
##   <chr>         <int>
## 1 Droid             6
## 2 Human            35
## 3 Other            46
```
]
]

---

## The piping narrative

]

```
## # A tibble: 3 × 3
##   species_group count avg_height
##   <chr>         <int>      <dbl>
## 1 Droid             6        NA 
## 2 Human            35        NA 
## 3 Other            46       177.
```
]
]

---

## The piping narrative

]

``` r
starwars %>%
  select(species, height) %>%
  mutate(species_group = case_when(
    species == "Droid" ~ "Droid",
    species == "Human" ~ "Human",
    TRUE ~ "Other")) %>%
  group_by(species_group) %>%
  summarise(
    count = n(),
*   avg_height = mean(height, na.rm = TRUE)
  )
```

```
## # A tibble: 3 × 3
##   species_group count avg_height
##   <chr>         <int>      <dbl>
## 1 Droid             6       131.
## 2 Human            35       178 
## 3 Other            46       177.
```
]
]

---

## A note on piping and layering

- `%>%` is used in data wrangling pipelines. We _pipe_ the output of the previous line of code as the first input of the next line of code.

- Later on you will see another simple command to collocate lines of codes under data visualization

- Namely, `+` is used in **ggplot2** plots for "layering". We will create the plot by _adding_ layers (in Week 4)

- Be aware of how `%>%` and `+` are used differently!

- They can be combined for certain tasks! But we are not allowed to use them interchangeably.