Joining and Organising Data Sets

.title[
# Joining and Organising Data Sets
]
.subtitle[
## <br><br> Introduction to Data Science
]
.author[
### University of Edinburgh
]
.date[
### <br> 2024/2025
]

---

# .hand[We...]

---

## Reminder

The **grammar** of data wrangling
- `select`: pick columns by name
- `arrange`: reorder rows
- `slice`: pick rows using index(es)
- `filter`: pick rows matching criteria
- `distinct`: filter for unique rows
- `mutate`: add new variables
- `summarise`: reduce variables to values
- `group_by`: for grouped operations
- ... (many more)
]

``` r
starwars %>%
  select(species, height) %>%
  mutate(species_group = case_when(
    species == "Droid" ~ "Droid",
    species == "Human" ~ "Human",
    TRUE ~ "Other")) %>%
  group_by(species_group) %>%
  summarise(
    count = n(),
    avg_height = mean(height, na.rm = TRUE)
  )
```
]

---

# .hand[We...]

---

## Data: Women in science

Information on 10 women in science who changed the world

|name               |
|:------------------|
|Ada Lovelace       |
|Marie Curie        |
|Janaki Ammal       |
|Chien-Shiung Wu    |
|Katherine Johnson  |
|Rosalind Franklin  |
|Vera Rubin         |
|Gladys West        |
|Flossie Wong-Staal |
|Jennifer Doudna    |
]

.footnote[
Source: [Discover Magazine](https://www.discovermagazine.com/the-sciences/meet-10-women-in-science-who-changed-the-world)
]

---

## Reading the data*

Project directory structure

<ul>
  <li> Lecture6.Rmd </li>
  <li> Lecture6.html </li>
  <li> data/ </li>
  <ul>
    <li> scientists/ </li>
    <ul>
      <li> professions.csv </li>
      <li> dates.csv </li>
      <li> works.csv </li>
    </ul>
  </ul>
</ul>

``` r
professions <- read_csv("data/scientists/professions.csv")
dates <- read_csv("data/scientists/dates.csv")
works <- read_csv("data/scientists/works.csv")
```

---

## A closer look

``` r
professions
```

```
## # A tibble: 10 × 2
##    name               profession                        
##    <chr>              <chr>                             
##  1 Ada Lovelace       Mathematician                     
##  2 Marie Curie        Physicist and Chemist             
##  3 Janaki Ammal       Botanist                          
##  4 Chien-Shiung Wu    Physicist                         
##  5 Katherine Johnson  Mathematician                     
##  6 Rosalind Franklin  Chemist                           
##  7 Vera Rubin         Astronomer                        
##  8 Gladys West        Mathematician                     
##  9 Flossie Wong-Staal Virologist and Molecular Biologist
## 10 Jennifer Doudna    Biochemist
```
]

``` r
dates
```

```
## # A tibble: 8 × 3
##   name               birth_year death_year
##   <chr>                   <dbl>      <dbl>
## 1 Janaki Ammal             1897       1984
## 2 Chien-Shiung Wu          1912       1997
## 3 Katherine Johnson        1918       2020
## 4 Rosalind Franklin        1920       1958
## 5 Vera Rubin               1928       2016
## 6 Gladys West              1930         NA
## 7 Flossie Wong-Staal       1947       2020
## 8 Jennifer Doudna          1964         NA
```
]

``` r
works
```

```
## # A tibble: 9 × 2
##   name               known_for                                                  
##   <chr>              <chr>                                                      
## 1 Ada Lovelace       first computer algorithm                                   
## 2 Marie Curie        theory of radioactivity,  discovery of elements polonium a…
## 3 Janaki Ammal       hybrid species, biodiversity protection                    
## 4 Chien-Shiung Wu    confim and refine theory of radioactive beta decy, Wu expe…
## 5 Katherine Johnson  calculations of orbital mechanics critical to sending the …
## 6 Vera Rubin         existence of dark matter                                   
## 7 Gladys West        mathematical modeling of the shape of the Earth which serv…
## 8 Flossie Wong-Staal first scientist to clone HIV and create a map of its genes…
## 9 Jennifer Doudna    one of the primary developers of CRISPR, a ground-breaking…
```
]

]

---

## Desired joined output

```
## # A tibble: 10 × 5
##    name               profession                 birth_year death_year known_for
##    <chr>              <chr>                           <dbl>      <dbl> <chr>    
##  1 Ada Lovelace       Mathematician                      NA         NA first co…
##  2 Marie Curie        Physicist and Chemist              NA         NA theory o…
##  3 Janaki Ammal       Botanist                         1897       1984 hybrid s…
##  4 Chien-Shiung Wu    Physicist                        1912       1997 confim a…
##  5 Katherine Johnson  Mathematician                    1918       2020 calculat…
##  6 Rosalind Franklin  Chemist                          1920       1958 <NA>     
##  7 Vera Rubin         Astronomer                       1928       2016 existenc…
##  8 Gladys West        Mathematician                    1930         NA mathemat…
##  9 Flossie Wong-Staal Virologist and Molecular …       1947       2020 first sc…
## 10 Jennifer Doudna    Biochemist                       1964         NA one of t…
```

---

## Inputs, reminder

``` r
names(professions)
```

```
## [1] "name"       "profession"
```

``` r
names(dates)
```

```
## [1] "name"       "birth_year" "death_year"
```

``` r
names(works)
```

```
## [1] "name"      "known_for"
```
]
.pull-right[

``` r
nrow(professions)
```

```
## [1] 10
```

``` r
nrow(dates)
```

```
## [1] 8
```

``` r
nrow(works)
```

```
## [1] 9
```
]

Observations:
* `dates` and `works` have missing rows
* variable `name` is common for all data sets

---

# Joining data frames

---

## Joining data frames

``` r
something_join(x, y)
```

- `left_join()`: all rows from x
- `right_join()`: all rows from y
- `full_join()`: all rows from both x and y
- `semi_join()`: all rows from x where there are matching values in y, keeping just columns from x
- `inner_join()`: all rows from x where there are matching values in y, return 
all combination of multiple matches in the case of multiple matches
- `anti_join()`: return all rows from x where there are not matching values in y, never duplicate rows of x
- ...
 
---

## Setup

Let's keep things simple. For the next few slides...

``` r
x
```

```
## # A tibble: 3 × 2
##      id value_x
##   <dbl> <chr>  
## 1     1 x1     
## 2     2 x2     
## 3     3 x3
```
]
.pull-right[

``` r
y
```

```
## # A tibble: 3 × 2
##      id value_y
##   <dbl> <chr>  
## 1     1 y1     
## 2     2 y2     
## 3     4 y4
```
]

---

## `left_join()`

.pull-left[
<img src="img/left-join.gif" width="80%" style="background-color: #FDF6E3" />
]
.pull-right[

``` r
left_join(x, y, by = "id")
```

```
## # A tibble: 3 × 3
##      id value_x value_y
##   <dbl> <chr>   <chr>  
## 1     1 x1      y1     
## 2     2 x2      y2     
## 3     3 x3      <NA>
```

``` r
x %>% left_join(y, by = "id")
```

```
## # A tibble: 3 × 3
##      id value_x value_y
##   <dbl> <chr>   <chr>  
## 1     1 x1      y1     
## 2     2 x2      y2     
## 3     3 x3      <NA>
```
]

---

## `left_join()`

``` r
professions %>%
* left_join(dates, by = "name")
```

```
## # A tibble: 10 × 4
##    name               profession                         birth_year death_year
##    <chr>              <chr>                                   <dbl>      <dbl>
##  1 Ada Lovelace       Mathematician                              NA         NA
##  2 Marie Curie        Physicist and Chemist                      NA         NA
##  3 Janaki Ammal       Botanist                                 1897       1984
##  4 Chien-Shiung Wu    Physicist                                1912       1997
##  5 Katherine Johnson  Mathematician                            1918       2020
##  6 Rosalind Franklin  Chemist                                  1920       1958
##  7 Vera Rubin         Astronomer                               1928       2016
##  8 Gladys West        Mathematician                            1930         NA
##  9 Flossie Wong-Staal Virologist and Molecular Biologist       1947       2020
## 10 Jennifer Doudna    Biochemist                               1964         NA
```

---

## `right_join()`

.pull-left[
<img src="img/right-join.gif" width="80%" style="background-color: #FDF6E3" />
]
.pull-right[

``` r
right_join(x, y, by = "id")
```

```
## # A tibble: 3 × 3
##      id value_x value_y
##   <dbl> <chr>   <chr>  
## 1     1 x1      y1     
## 2     2 x2      y2     
## 3     4 <NA>    y4
```

``` r
x %>% right_join(y, by = "id")
```

```
## # A tibble: 3 × 3
##      id value_x value_y
##   <dbl> <chr>   <chr>  
## 1     1 x1      y1     
## 2     2 x2      y2     
## 3     4 <NA>    y4
```

]

---

## `right_join()`

``` r
professions %>%
* right_join(dates, by = "name")
```

```
## # A tibble: 8 × 4
##   name               profession                         birth_year death_year
##   <chr>              <chr>                                   <dbl>      <dbl>
## 1 Janaki Ammal       Botanist                                 1897       1984
## 2 Chien-Shiung Wu    Physicist                                1912       1997
## 3 Katherine Johnson  Mathematician                            1918       2020
## 4 Rosalind Franklin  Chemist                                  1920       1958
## 5 Vera Rubin         Astronomer                               1928       2016
## 6 Gladys West        Mathematician                            1930         NA
## 7 Flossie Wong-Staal Virologist and Molecular Biologist       1947       2020
## 8 Jennifer Doudna    Biochemist                               1964         NA
```

.question[
Would `left_join(x, y, by="id")` and `right_join(y, x, by = "id")` produce the same result?
]

---

## `full_join()`

.pull-left[
<img src="img/full-join.gif" width="80%" style="background-color: #FDF6E3" />
]
.pull-right[

``` r
full_join(x, y, by = "id")  
```

```
## # A tibble: 4 × 3
##      id value_x value_y
##   <dbl> <chr>   <chr>  
## 1     1 x1      y1     
## 2     2 x2      y2     
## 3     3 x3      <NA>   
## 4     4 <NA>    y4
```

]

---

## `full_join()`

``` r
dates %>%
* full_join(works, by = "name")
```

```
## # A tibble: 10 × 4
##    name               birth_year death_year known_for                           
##    <chr>                   <dbl>      <dbl> <chr>                               
##  1 Janaki Ammal             1897       1984 hybrid species, biodiversity protec…
##  2 Chien-Shiung Wu          1912       1997 confim and refine theory of radioac…
##  3 Katherine Johnson        1918       2020 calculations of orbital mechanics c…
##  4 Rosalind Franklin        1920       1958 <NA>                                
##  5 Vera Rubin               1928       2016 existence of dark matter            
##  6 Gladys West              1930         NA mathematical modeling of the shape …
##  7 Flossie Wong-Staal       1947       2020 first scientist to clone HIV and cr…
##  8 Jennifer Doudna          1964         NA one of the primary developers of CR…
##  9 Ada Lovelace               NA         NA first computer algorithm            
## 10 Marie Curie                NA         NA theory of radioactivity,  discovery…
```

---

## `inner_join()`

.pull-left[
<img src="img/inner-join.gif" width="80%" style="background-color: #FDF6E3" />
]
.pull-right[

``` r
inner_join(x, y, by = "id")
```

```
## # A tibble: 2 × 3
##      id value_x value_y
##   <dbl> <chr>   <chr>  
## 1     1 x1      y1     
## 2     2 x2      y2
```
]

---

## `inner_join()`

``` r
dates %>%
* inner_join(works, by = "name")
```

```
## # A tibble: 7 × 4
##   name               birth_year death_year known_for                            
##   <chr>                   <dbl>      <dbl> <chr>                                
## 1 Janaki Ammal             1897       1984 hybrid species, biodiversity protect…
## 2 Chien-Shiung Wu          1912       1997 confim and refine theory of radioact…
## 3 Katherine Johnson        1918       2020 calculations of orbital mechanics cr…
## 4 Vera Rubin               1928       2016 existence of dark matter             
## 5 Gladys West              1930         NA mathematical modeling of the shape o…
## 6 Flossie Wong-Staal       1947       2020 first scientist to clone HIV and cre…
## 7 Jennifer Doudna          1964         NA one of the primary developers of CRI…
```

---

## `semi_join()`

.pull-left[
<img src="img/semi-join.gif" width="80%" style="background-color: #FDF6E3" />
]
.pull-right[

``` r
semi_join(x, y, by = "id")
```

```
## # A tibble: 2 × 2
##      id value_x
##   <dbl> <chr>  
## 1     1 x1     
## 2     2 x2
```
]

---

## `semi_join()`

``` r
dates %>%
* semi_join(works, by = "name")
```

```
## # A tibble: 7 × 3
##   name               birth_year death_year
##   <chr>                   <dbl>      <dbl>
## 1 Janaki Ammal             1897       1984
## 2 Chien-Shiung Wu          1912       1997
## 3 Katherine Johnson        1918       2020
## 4 Vera Rubin               1928       2016
## 5 Gladys West              1930         NA
## 6 Flossie Wong-Staal       1947       2020
## 7 Jennifer Doudna          1964         NA
```

---

## `anti_join()`

.pull-left[
<img src="img/anti-join.gif" width="80%" style="background-color: #FDF6E3" />
]
.pull-right[

``` r
anti_join(x, y, by = "id")
```

```
## # A tibble: 1 × 2
##      id value_x
##   <dbl> <chr>  
## 1     3 x3
```
]

---

## `anti_join()`

``` r
dates %>%
* anti_join(works, by = "name")
```

```
## # A tibble: 1 × 3
##   name              birth_year death_year
##   <chr>                  <dbl>      <dbl>
## 1 Rosalind Franklin       1920       1958
```

---

## Putting it altogether

``` r
professions %>%
  left_join(dates, by = "name") %>%
  left_join(works, by = "name")
```

---

# Case study: Student records

---

## Student records

- Have:
  - Enrolment: official university enrolment records
  - Survey: Student provided info missing students who never filled it out and including students who filled it out but dropped the class
- Want: Survey info for all enrolled in class

.pull-left-narrow[
Directory structure:
.small[
<ul>
  <li> Lecture6.Rmd </li>
  <li> Lecture6.html </li>
  <li> data/ </li>
  <ul>
    <li> scientists/ </li>
    <li> students/ </li>
    <ul>
      <li> enrolment.csv </li>
      <li> survey.csv </li>
    </ul>
  </ul>
</ul>
]
]
.pull-right-wide[
Importing the data:

``` r
enrolment <- read_csv("data/students/enrolment.csv")
survey <- read_csv("data/students/survey.csv")
```

]

---

## Student records

``` r
enrolment
```

```
## # A tibble: 3 × 2
##      id name           
##   <dbl> <chr>          
## 1     1 Dave Friday    
## 2     2 Hermine        
## 3     3 Sura Selvarajah
```
]
.pull-right[

``` r
survey
```

```
## # A tibble: 4 × 3
##      id name    username            
##   <dbl> <chr>   <chr>               
## 1     2 Hermine bakealongwithhermine
## 2     3 Sura    surasbakes          
## 3     4 Peter   peter_bakes         
## 4     5 Mark    thebakingbuddha
```
]

* Common variable: `id`

---

## Student records

``` r
enrolment %>% 
* left_join(survey, by = "id")
```

```
## # A tibble: 3 × 4
##      id name.x          name.y  username            
##   <dbl> <chr>           <chr>   <chr>               
## 1     1 Dave Friday     <NA>    <NA>                
## 2     2 Hermine         Hermine bakealongwithhermine
## 3     3 Sura Selvarajah Sura    surasbakes
```
]

``` r
enrolment %>% 
* anti_join(survey, by = "id")
```

```
## # A tibble: 1 × 2
##      id name       
##   <dbl> <chr>      
## 1     1 Dave Friday
```
]

``` r
survey %>% 
* anti_join(enrolment, by = "id")
```

```
## # A tibble: 2 × 3
##      id name  username       
##   <dbl> <chr> <chr>          
## 1     4 Peter peter_bakes    
## 2     5 Mark  thebakingbuddha
```
]

]

---

# Case study: Grocery sales

---

## Grocery sales

- Have:
  - Purchases: One row per customer per item, listing purchases they made
  - Prices: One row per item in the store, listing their prices
- Want: Total revenue

.pull-left-narrow[
Directory structure:
.small[
<ul>
  <li> Lecture6.Rmd </li>
  <li> Lecture6.html </li>
  <li> data/ </li>
  <ul>
    <li> scientists/ </li>
    <li> students/ </li>
    <li> sales/ </li>
    <ul>
      <li> purchases.csv </li>
      <li> prices.csv </li>
      <li> customers.csv </li>
    </ul>
  </ul>
</ul>
]
]
.pull-right-wide[
Importing the data:

``` r
purchases <- read_csv("data/sales/purchases.csv")
prices <- read_csv("data/sales/prices.csv")
customers <- read_csv("data/sales/customers.csv")  #(for later)
```

]

---

## Grocery sales

``` r
purchases
```

```
## # A tibble: 5 × 2
##   customer_id item        
##         <dbl> <chr>       
## 1           1 bread       
## 2           1 milk        
## 3           1 banana      
## 4           2 milk        
## 5           2 toilet paper
```
]
.pull-right[

``` r
prices
```

```
## # A tibble: 5 × 2
##   item         price
##   <chr>        <dbl>
## 1 avocado       0.5 
## 2 banana        0.15
## 3 bread         1   
## 4 milk          0.8 
## 5 toilet paper  3
```
]

* Common variable: `item`

---

## Grocery sales

``` r
purchases %>% 
* left_join(prices, by = "item")
```

```
## # A tibble: 5 × 3
##   customer_id item         price
##         <dbl> <chr>        <dbl>
## 1           1 bread         1   
## 2           1 milk          0.8 
## 3           1 banana        0.15
## 4           2 milk          0.8 
## 5           2 toilet paper  3
```
]
.pull-right[

``` r
purchases %>% 
  left_join(prices, by = "item") %>%
* summarise(total_revenue = sum(price))
```

```
## # A tibble: 1 × 1
##   total_revenue
##           <dbl>
## 1          5.75
```
]
]

``` r
purchases %>% 
  left_join(prices, by = "item")
```

``` r
purchases %>% 
  left_join(prices, by = "item") %>%
* group_by(customer_id) %>%
  summarise(total_revenue = sum(price))
```

```
## # A tibble: 2 × 2
##   customer_id total_revenue
##         <dbl>         <dbl>
## 1           1          1.95
## 2           2          3.8
```
]

]

---

# .hand[We...]

---

## Data: Sales

<br>

One row per customer (.green[wide format])

```
## # A tibble: 2 × 4
##   customer_id item_1 item_2       item_3
##         <dbl> <chr>  <chr>        <chr> 
## 1           1 bread  milk         banana
## 2           2 milk   toilet paper <NA>
```
]

--
.pull-right[
### .pink[We want...]

One row per item sold (.green[long format])

```
## # A tibble: 6 × 3
##   customer_id item_no item        
##         <dbl> <chr>   <chr>       
## 1           1 item_1  bread       
## 2           1 item_2  milk        
## 3           1 item_3  banana      
## 4           2 item_1  milk        
## 5           2 item_2  toilet paper
## 6           2 item_3  <NA>
```
]

---

## A grammar of data tidying

.pull-left[
<img src="img/tidyr-part-of-tidyverse.png" width="60%" />
]
.pull-right[
The goal of tidyr is to help you tidy your data via

- pivoting for going between wide and long data
- splitting and combining character columns
- nesting and unnesting columns
- clarifying how `NA`s should be treated
]

---

## Pivoting data

---

## `pivot_longer()` `\(~~~~~~~~~~~~\)` `pivot_wider()`

``` r
pivot_longer(
  data, 
  cols, 
  names_to = "name", 
  values_to = "value"
  )
```

- `data` (as usual)
- `cols`: columns to pivot into longer format 
- `names_to`: name of the column where column names of pivoted variables go (character string)
- `values_to`: name of the column where data in pivoted variables go (character string)
]
.pull-right[

``` r
pivot_wider(
  data, 
  names_from = "name", 
  values_from = "value"
  )
```

- `data` (as usual)
- `names_from`: which column in the long format contains the what should be column names in the wide format
- `values_from`: which column in the long format contains the what should be values in the new columns in the wide format
]

---

## Customers `\(\rightarrow\)` purchases

``` r
purchases <- customers %>%
* pivot_longer(
*   cols = item_1:item_3,  # variables item_1 to item_3
*   names_to = "item_no",  # column names -> new column called item_no
*   values_to = "item"     # values in columns -> new column called item
*   )

purchases
```

---

## Why pivot?

Most likely, because the next step of your analysis needs it

``` r
prices
```

```
## # A tibble: 5 × 2
##   item         price
##   <chr>        <dbl>
## 1 avocado       0.5 
## 2 banana        0.15
## 3 bread         1   
## 4 milk          0.8 
## 5 toilet paper  3
```
]
.pull-right[

``` r
purchases %>%
* left_join(prices, by = "item")
```

```
## # A tibble: 6 × 4
##   customer_id item_no item         price
##         <dbl> <chr>   <chr>        <dbl>
## 1           1 item_1  bread         1   
## 2           1 item_2  milk          0.8 
## 3           1 item_3  banana        0.15
## 4           2 item_1  milk          0.8 
## 5           2 item_2  toilet paper  3   
## 6           2 item_3  <NA>         NA
```
]

---

## Purchases `\(\rightarrow\)` customers

``` r
purchases %>%
* pivot_wider(
*   names_from = item_no,
*   values_from = item
* )
```

```
## # A tibble: 2 × 4
##   customer_id item_1 item_2       item_3
##         <dbl> <chr>  <chr>        <chr> 
## 1           1 bread  milk         banana
## 2           2 milk   toilet paper <NA>
```