class: center, middle, inverse, title-slide .title[ # Joining and Organising Data Sets ] .subtitle[ ##
Introduction to Data Science ] .author[ ### University of Edinburgh ] .date[ ###
2024/2025 ] --- class: middle # .hand[We...] .huge[.green[have]] .hand[a single data frame] .huge[.pink[want]] .hand[to slice it, and dice it, and juice it, and process it] --- ## Reminder .pull-left[ The **grammar** of data wrangling - `select`: pick columns by name - `arrange`: reorder rows - `slice`: pick rows using index(es) - `filter`: pick rows matching criteria - `distinct`: filter for unique rows - `mutate`: add new variables - `summarise`: reduce variables to values - `group_by`: for grouped operations - ... (many more) ] -- .pull-right[ **Pipe** the steps together ``` r starwars %>% select(species, height) %>% mutate(species_group = case_when( species == "Droid" ~ "Droid", species == "Human" ~ "Human", TRUE ~ "Other")) %>% group_by(species_group) %>% summarise( count = n(), avg_height = mean(height, na.rm = TRUE) ) ``` ] --- class: middle # .hand[We...] .huge[.green[have]] .hand[multiple data frames] .huge[.pink[want]] .hand[to bring them together] --- ## Data: Women in science Information on 10 women in science who changed the world .small[ |name | |:------------------| |Ada Lovelace | |Marie Curie | |Janaki Ammal | |Chien-Shiung Wu | |Katherine Johnson | |Rosalind Franklin | |Vera Rubin | |Gladys West | |Flossie Wong-Staal | |Jennifer Doudna | ] .footnote[ Source: [Discover Magazine](https://www.discovermagazine.com/the-sciences/meet-10-women-in-science-who-changed-the-world) ] --- ## Reading the data* Project directory structure <ul> <li> Lecture6.Rmd </li> <li> Lecture6.html </li> <li> data/ </li> <ul> <li> scientists/ </li> <ul> <li> professions.csv </li> <li> dates.csv </li> <li> works.csv </li> </ul> </ul> </ul> ``` r professions <- read_csv("data/scientists/professions.csv") dates <- read_csv("data/scientists/dates.csv") works <- read_csv("data/scientists/works.csv") ``` .small[ *Check w03-L05 slides about data importing! ] --- ## A closer look .panelset[ .panel[.panel-name[professions] ``` r professions ``` ``` ## # A tibble: 10 × 2 ## name profession ## <chr> <chr> ## 1 Ada Lovelace Mathematician ## 2 Marie Curie Physicist and Chemist ## 3 Janaki Ammal Botanist ## 4 Chien-Shiung Wu Physicist ## 5 Katherine Johnson Mathematician ## 6 Rosalind Franklin Chemist ## 7 Vera Rubin Astronomer ## 8 Gladys West Mathematician ## 9 Flossie Wong-Staal Virologist and Molecular Biologist ## 10 Jennifer Doudna Biochemist ``` ] .panel[.panel-name[dates] ``` r dates ``` ``` ## # A tibble: 8 × 3 ## name birth_year death_year ## <chr> <dbl> <dbl> ## 1 Janaki Ammal 1897 1984 ## 2 Chien-Shiung Wu 1912 1997 ## 3 Katherine Johnson 1918 2020 ## 4 Rosalind Franklin 1920 1958 ## 5 Vera Rubin 1928 2016 ## 6 Gladys West 1930 NA ## 7 Flossie Wong-Staal 1947 2020 ## 8 Jennifer Doudna 1964 NA ``` ] .panel[.panel-name[works] ``` r works ``` ``` ## # A tibble: 9 × 2 ## name known_for ## <chr> <chr> ## 1 Ada Lovelace first computer algorithm ## 2 Marie Curie theory of radioactivity, discovery of elements polonium a… ## 3 Janaki Ammal hybrid species, biodiversity protection ## 4 Chien-Shiung Wu confim and refine theory of radioactive beta decy, Wu expe… ## 5 Katherine Johnson calculations of orbital mechanics critical to sending the … ## 6 Vera Rubin existence of dark matter ## 7 Gladys West mathematical modeling of the shape of the Earth which serv… ## 8 Flossie Wong-Staal first scientist to clone HIV and create a map of its genes… ## 9 Jennifer Doudna one of the primary developers of CRISPR, a ground-breaking… ``` ] ] --- ## Desired joined output ``` ## # A tibble: 10 × 5 ## name profession birth_year death_year known_for ## <chr> <chr> <dbl> <dbl> <chr> ## 1 Ada Lovelace Mathematician NA NA first co… ## 2 Marie Curie Physicist and Chemist NA NA theory o… ## 3 Janaki Ammal Botanist 1897 1984 hybrid s… ## 4 Chien-Shiung Wu Physicist 1912 1997 confim a… ## 5 Katherine Johnson Mathematician 1918 2020 calculat… ## 6 Rosalind Franklin Chemist 1920 1958 <NA> ## 7 Vera Rubin Astronomer 1928 2016 existenc… ## 8 Gladys West Mathematician 1930 NA mathemat… ## 9 Flossie Wong-Staal Virologist and Molecular … 1947 2020 first sc… ## 10 Jennifer Doudna Biochemist 1964 NA one of t… ``` --- ## Inputs, reminder .pull-left[ ``` r names(professions) ``` ``` ## [1] "name" "profession" ``` ``` r names(dates) ``` ``` ## [1] "name" "birth_year" "death_year" ``` ``` r names(works) ``` ``` ## [1] "name" "known_for" ``` ] .pull-right[ ``` r nrow(professions) ``` ``` ## [1] 10 ``` ``` r nrow(dates) ``` ``` ## [1] 8 ``` ``` r nrow(works) ``` ``` ## [1] 9 ``` ] Observations: * `dates` and `works` have missing rows * variable `name` is common for all data sets --- class: middle # Joining data frames --- ## Joining data frames ``` r something_join(x, y) ``` - `left_join()`: all rows from x - `right_join()`: all rows from y - `full_join()`: all rows from both x and y - `semi_join()`: all rows from x where there are matching values in y, keeping just columns from x - `inner_join()`: all rows from x where there are matching values in y, return all combination of multiple matches in the case of multiple matches - `anti_join()`: return all rows from x where there are not matching values in y, never duplicate rows of x - ... --- ## Setup Let's keep things simple. For the next few slides... .pull-left[ ``` r x ``` ``` ## # A tibble: 3 × 2 ## id value_x ## <dbl> <chr> ## 1 1 x1 ## 2 2 x2 ## 3 3 x3 ``` ] .pull-right[ ``` r y ``` ``` ## # A tibble: 3 × 2 ## id value_y ## <dbl> <chr> ## 1 1 y1 ## 2 2 y2 ## 3 4 y4 ``` ] --- ## `left_join()` .pull-left[ <img src="img/left-join.gif" width="80%" style="background-color: #FDF6E3" /> ] .pull-right[ ``` r left_join(x, y, by = "id") ``` ``` ## # A tibble: 3 × 3 ## id value_x value_y ## <dbl> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ## 3 3 x3 <NA> ``` ``` r x %>% left_join(y, by = "id") ``` ``` ## # A tibble: 3 × 3 ## id value_x value_y ## <dbl> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ## 3 3 x3 <NA> ``` ] --- ## `left_join()` ``` r professions %>% * left_join(dates, by = "name") ``` ``` ## # A tibble: 10 × 4 ## name profession birth_year death_year ## <chr> <chr> <dbl> <dbl> ## 1 Ada Lovelace Mathematician NA NA ## 2 Marie Curie Physicist and Chemist NA NA ## 3 Janaki Ammal Botanist 1897 1984 ## 4 Chien-Shiung Wu Physicist 1912 1997 ## 5 Katherine Johnson Mathematician 1918 2020 ## 6 Rosalind Franklin Chemist 1920 1958 ## 7 Vera Rubin Astronomer 1928 2016 ## 8 Gladys West Mathematician 1930 NA ## 9 Flossie Wong-Staal Virologist and Molecular Biologist 1947 2020 ## 10 Jennifer Doudna Biochemist 1964 NA ``` --- ## `right_join()` .pull-left[ <img src="img/right-join.gif" width="80%" style="background-color: #FDF6E3" /> ] .pull-right[ ``` r right_join(x, y, by = "id") ``` ``` ## # A tibble: 3 × 3 ## id value_x value_y ## <dbl> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ## 3 4 <NA> y4 ``` ``` r x %>% right_join(y, by = "id") ``` ``` ## # A tibble: 3 × 3 ## id value_x value_y ## <dbl> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ## 3 4 <NA> y4 ``` ] --- ## `right_join()` ``` r professions %>% * right_join(dates, by = "name") ``` ``` ## # A tibble: 8 × 4 ## name profession birth_year death_year ## <chr> <chr> <dbl> <dbl> ## 1 Janaki Ammal Botanist 1897 1984 ## 2 Chien-Shiung Wu Physicist 1912 1997 ## 3 Katherine Johnson Mathematician 1918 2020 ## 4 Rosalind Franklin Chemist 1920 1958 ## 5 Vera Rubin Astronomer 1928 2016 ## 6 Gladys West Mathematician 1930 NA ## 7 Flossie Wong-Staal Virologist and Molecular Biologist 1947 2020 ## 8 Jennifer Doudna Biochemist 1964 NA ``` .question[ Would `left_join(x, y, by="id")` and `right_join(y, x, by = "id")` produce the same result? ] --- ## `full_join()` .pull-left[ <img src="img/full-join.gif" width="80%" style="background-color: #FDF6E3" /> ] .pull-right[ ``` r full_join(x, y, by = "id") ``` ``` ## # A tibble: 4 × 3 ## id value_x value_y ## <dbl> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ## 3 3 x3 <NA> ## 4 4 <NA> y4 ``` ] --- ## `full_join()` ``` r dates %>% * full_join(works, by = "name") ``` ``` ## # A tibble: 10 × 4 ## name birth_year death_year known_for ## <chr> <dbl> <dbl> <chr> ## 1 Janaki Ammal 1897 1984 hybrid species, biodiversity protec… ## 2 Chien-Shiung Wu 1912 1997 confim and refine theory of radioac… ## 3 Katherine Johnson 1918 2020 calculations of orbital mechanics c… ## 4 Rosalind Franklin 1920 1958 <NA> ## 5 Vera Rubin 1928 2016 existence of dark matter ## 6 Gladys West 1930 NA mathematical modeling of the shape … ## 7 Flossie Wong-Staal 1947 2020 first scientist to clone HIV and cr… ## 8 Jennifer Doudna 1964 NA one of the primary developers of CR… ## 9 Ada Lovelace NA NA first computer algorithm ## 10 Marie Curie NA NA theory of radioactivity, discovery… ``` --- ## `inner_join()` .pull-left[ <img src="img/inner-join.gif" width="80%" style="background-color: #FDF6E3" /> ] .pull-right[ ``` r inner_join(x, y, by = "id") ``` ``` ## # A tibble: 2 × 3 ## id value_x value_y ## <dbl> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ``` ] --- ## `inner_join()` ``` r dates %>% * inner_join(works, by = "name") ``` ``` ## # A tibble: 7 × 4 ## name birth_year death_year known_for ## <chr> <dbl> <dbl> <chr> ## 1 Janaki Ammal 1897 1984 hybrid species, biodiversity protect… ## 2 Chien-Shiung Wu 1912 1997 confim and refine theory of radioact… ## 3 Katherine Johnson 1918 2020 calculations of orbital mechanics cr… ## 4 Vera Rubin 1928 2016 existence of dark matter ## 5 Gladys West 1930 NA mathematical modeling of the shape o… ## 6 Flossie Wong-Staal 1947 2020 first scientist to clone HIV and cre… ## 7 Jennifer Doudna 1964 NA one of the primary developers of CRI… ``` .question[ For `full_join()` and `inner_join()`, does the order of `x` and `y` matter? ] --- ## `semi_join()` .pull-left[ <img src="img/semi-join.gif" width="80%" style="background-color: #FDF6E3" /> ] .pull-right[ ``` r semi_join(x, y, by = "id") ``` ``` ## # A tibble: 2 × 2 ## id value_x ## <dbl> <chr> ## 1 1 x1 ## 2 2 x2 ``` ] --- ## `semi_join()` ``` r dates %>% * semi_join(works, by = "name") ``` ``` ## # A tibble: 7 × 3 ## name birth_year death_year ## <chr> <dbl> <dbl> ## 1 Janaki Ammal 1897 1984 ## 2 Chien-Shiung Wu 1912 1997 ## 3 Katherine Johnson 1918 2020 ## 4 Vera Rubin 1928 2016 ## 5 Gladys West 1930 NA ## 6 Flossie Wong-Staal 1947 2020 ## 7 Jennifer Doudna 1964 NA ``` --- ## `anti_join()` .pull-left[ <img src="img/anti-join.gif" width="80%" style="background-color: #FDF6E3" /> ] .pull-right[ ``` r anti_join(x, y, by = "id") ``` ``` ## # A tibble: 1 × 2 ## id value_x ## <dbl> <chr> ## 1 3 x3 ``` ] --- ## `anti_join()` ``` r dates %>% * anti_join(works, by = "name") ``` ``` ## # A tibble: 1 × 3 ## name birth_year death_year ## <chr> <dbl> <dbl> ## 1 Rosalind Franklin 1920 1958 ``` --- ## Putting it altogether ``` r professions %>% left_join(dates, by = "name") %>% left_join(works, by = "name") ``` ``` ## # A tibble: 10 × 5 ## name profession birth_year death_year known_for ## <chr> <chr> <dbl> <dbl> <chr> ## 1 Ada Lovelace Mathematician NA NA first co… ## 2 Marie Curie Physicist and Chemist NA NA theory o… ## 3 Janaki Ammal Botanist 1897 1984 hybrid s… ## 4 Chien-Shiung Wu Physicist 1912 1997 confim a… ## 5 Katherine Johnson Mathematician 1918 2020 calculat… ## 6 Rosalind Franklin Chemist 1920 1958 <NA> ## 7 Vera Rubin Astronomer 1928 2016 existenc… ## 8 Gladys West Mathematician 1930 NA mathemat… ## 9 Flossie Wong-Staal Virologist and Molecular … 1947 2020 first sc… ## 10 Jennifer Doudna Biochemist 1964 NA one of t… ``` --- class: middle # Case study: Student records --- ## Student records - Have: - Enrolment: official university enrolment records - Survey: Student provided info missing students who never filled it out and including students who filled it out but dropped the class - Want: Survey info for all enrolled in class .pull-left-narrow[ Directory structure: .small[ <ul> <li> Lecture6.Rmd </li> <li> Lecture6.html </li> <li> data/ </li> <ul> <li> scientists/ </li> <li> students/ </li> <ul> <li> enrolment.csv </li> <li> survey.csv </li> </ul> </ul> </ul> ] ] .pull-right-wide[ Importing the data: ``` r enrolment <- read_csv("data/students/enrolment.csv") survey <- read_csv("data/students/survey.csv") ``` ] --- ## Student records .pull-left[ ``` r enrolment ``` ``` ## # A tibble: 3 × 2 ## id name ## <dbl> <chr> ## 1 1 Dave Friday ## 2 2 Hermine ## 3 3 Sura Selvarajah ``` ] .pull-right[ ``` r survey ``` ``` ## # A tibble: 4 × 3 ## id name username ## <dbl> <chr> <chr> ## 1 2 Hermine bakealongwithhermine ## 2 3 Sura surasbakes ## 3 4 Peter peter_bakes ## 4 5 Mark thebakingbuddha ``` ] * Common variable: `id` --- ## Student records .panelset[ .panel[.panel-name[In class] ``` r enrolment %>% * left_join(survey, by = "id") ``` ``` ## # A tibble: 3 × 4 ## id name.x name.y username ## <dbl> <chr> <chr> <chr> ## 1 1 Dave Friday <NA> <NA> ## 2 2 Hermine Hermine bakealongwithhermine ## 3 3 Sura Selvarajah Sura surasbakes ``` ] .panel[.panel-name[Survey missing] ``` r enrolment %>% * anti_join(survey, by = "id") ``` ``` ## # A tibble: 1 × 2 ## id name ## <dbl> <chr> ## 1 1 Dave Friday ``` ] .panel[.panel-name[Dropped] ``` r survey %>% * anti_join(enrolment, by = "id") ``` ``` ## # A tibble: 2 × 3 ## id name username ## <dbl> <chr> <chr> ## 1 4 Peter peter_bakes ## 2 5 Mark thebakingbuddha ``` ] ] --- class: middle # Case study: Grocery sales --- ## Grocery sales - Have: - Purchases: One row per customer per item, listing purchases they made - Prices: One row per item in the store, listing their prices - Want: Total revenue .pull-left-narrow[ Directory structure: .small[ <ul> <li> Lecture6.Rmd </li> <li> Lecture6.html </li> <li> data/ </li> <ul> <li> scientists/ </li> <li> students/ </li> <li> sales/ </li> <ul> <li> purchases.csv </li> <li> prices.csv </li> <li> customers.csv </li> </ul> </ul> </ul> ] ] .pull-right-wide[ Importing the data: ``` r purchases <- read_csv("data/sales/purchases.csv") prices <- read_csv("data/sales/prices.csv") customers <- read_csv("data/sales/customers.csv") #(for later) ``` ] --- ## Grocery sales .pull-left[ ``` r purchases ``` ``` ## # A tibble: 5 × 2 ## customer_id item ## <dbl> <chr> ## 1 1 bread ## 2 1 milk ## 3 1 banana ## 4 2 milk ## 5 2 toilet paper ``` ] .pull-right[ ``` r prices ``` ``` ## # A tibble: 5 × 2 ## item price ## <chr> <dbl> ## 1 avocado 0.5 ## 2 banana 0.15 ## 3 bread 1 ## 4 milk 0.8 ## 5 toilet paper 3 ``` ] * Common variable: `item` --- ## Grocery sales .panelset[ .panel[.panel-name[Total revenue] .pull-left[ ``` r purchases %>% * left_join(prices, by = "item") ``` ``` ## # A tibble: 5 × 3 ## customer_id item price ## <dbl> <chr> <dbl> ## 1 1 bread 1 ## 2 1 milk 0.8 ## 3 1 banana 0.15 ## 4 2 milk 0.8 ## 5 2 toilet paper 3 ``` ] .pull-right[ ``` r purchases %>% left_join(prices, by = "item") %>% * summarise(total_revenue = sum(price)) ``` ``` ## # A tibble: 1 × 1 ## total_revenue ## <dbl> ## 1 5.75 ``` ] ] .panel[.panel-name[Revenue per customer] .pull-left[ ``` r purchases %>% left_join(prices, by = "item") ``` ``` ## # A tibble: 5 × 3 ## customer_id item price ## <dbl> <chr> <dbl> ## 1 1 bread 1 ## 2 1 milk 0.8 ## 3 1 banana 0.15 ## 4 2 milk 0.8 ## 5 2 toilet paper 3 ``` ] .pull-right[ ``` r purchases %>% left_join(prices, by = "item") %>% * group_by(customer_id) %>% summarise(total_revenue = sum(price)) ``` ``` ## # A tibble: 2 × 2 ## customer_id total_revenue ## <dbl> <dbl> ## 1 1 1.95 ## 2 2 3.8 ``` ] ] ] --- class: middle # .hand[We...] .huge[.green[have]] .hand[data organised in an unideal way for our analysis] .huge[.pink[want]] .hand[to reorganise the data to carry on with our analysis] --- ## Data: Sales <br> .pull-left[ ### .green[We have...] One row per customer (.green[wide format]) ``` ## # A tibble: 2 × 4 ## customer_id item_1 item_2 item_3 ## <dbl> <chr> <chr> <chr> ## 1 1 bread milk banana ## 2 2 milk toilet paper <NA> ``` ] -- .pull-right[ ### .pink[We want...] One row per item sold (.green[long format]) ``` ## # A tibble: 6 × 3 ## customer_id item_no item ## <dbl> <chr> <chr> ## 1 1 item_1 bread ## 2 1 item_2 milk ## 3 1 item_3 banana ## 4 2 item_1 milk ## 5 2 item_2 toilet paper ## 6 2 item_3 <NA> ``` ] --- ## A grammar of data tidying .pull-left[ <img src="img/tidyr-part-of-tidyverse.png" width="60%" /> ] .pull-right[ The goal of tidyr is to help you tidy your data via - pivoting for going between wide and long data - splitting and combining character columns - nesting and unnesting columns - clarifying how `NA`s should be treated ] --- ## Pivoting data .center[ <img src="img/tidyr-longer-wider.gif" width="45%" style="background-color: #FDF6E3" /> ] --- ## `pivot_longer()` `\(~~~~~~~~~~~~\)` `pivot_wider()` .pull-left[ ``` r pivot_longer( data, cols, names_to = "name", values_to = "value" ) ``` - `data` (as usual) - `cols`: columns to pivot into longer format - `names_to`: name of the column where column names of pivoted variables go (character string) - `values_to`: name of the column where data in pivoted variables go (character string) ] .pull-right[ ``` r pivot_wider( data, names_from = "name", values_from = "value" ) ``` - `data` (as usual) - `names_from`: which column in the long format contains the what should be column names in the wide format - `values_from`: which column in the long format contains the what should be values in the new columns in the wide format ] --- ## Customers `\(\rightarrow\)` purchases ``` r purchases <- customers %>% * pivot_longer( * cols = item_1:item_3, # variables item_1 to item_3 * names_to = "item_no", # column names -> new column called item_no * values_to = "item" # values in columns -> new column called item * ) purchases ``` ``` ## # A tibble: 6 × 3 ## customer_id item_no item ## <dbl> <chr> <chr> ## 1 1 item_1 bread ## 2 1 item_2 milk ## 3 1 item_3 banana ## 4 2 item_1 milk ## 5 2 item_2 toilet paper ## 6 2 item_3 <NA> ``` --- ## Why pivot? Most likely, because the next step of your analysis needs it .pull-left[ ``` r prices ``` ``` ## # A tibble: 5 × 2 ## item price ## <chr> <dbl> ## 1 avocado 0.5 ## 2 banana 0.15 ## 3 bread 1 ## 4 milk 0.8 ## 5 toilet paper 3 ``` ] .pull-right[ ``` r purchases %>% * left_join(prices, by = "item") ``` ``` ## # A tibble: 6 × 4 ## customer_id item_no item price ## <dbl> <chr> <chr> <dbl> ## 1 1 item_1 bread 1 ## 2 1 item_2 milk 0.8 ## 3 1 item_3 banana 0.15 ## 4 2 item_1 milk 0.8 ## 5 2 item_2 toilet paper 3 ## 6 2 item_3 <NA> NA ``` ] --- ## Purchases `\(\rightarrow\)` customers ``` r purchases %>% * pivot_wider( * names_from = item_no, * values_from = item * ) ``` ``` ## # A tibble: 2 × 4 ## customer_id item_1 item_2 item_3 ## <dbl> <chr> <chr> <chr> ## 1 1 bread milk banana ## 2 2 milk toilet paper <NA> ```