class: center, middle, inverse, title-slide .title[ # Tidy Data and Data Wrangling ] .subtitle[ ##
Introduction to Data Science ] .author[ ### University of Edinburgh ] .date[ ###
2024/2025 ] --- ## Tidy data >Happy families are all alike; every unhappy family is unhappy in its own way. > >Leo Tolstoy .pull-left[ **Characteristics of tidy data:** - Each variable forms a column. - Each observation forms a row. - Each type of observational unit forms a table. ] .pull-right[ **Characteristics of untidy data:** - Merged cells. - Repeated or non-informative variable/column names. - Cell comments or superfluous information. - Empty or total/subtotal rows. - ... ] --- ## .question[ What makes this data not tidy? ] <img src="img/hyperwar-airplanes-on-hand.png" width="70%" style="display: block; margin: auto;" /> .footnote[ Source: [Army Air Forces Statistical Digest, WW II](https://www.ibiblio.org/hyperwar/AAF/StatDigest/aafsd-3.html) ] --- .question[ What makes this data not tidy? ] <br> <img src="img/hiv-est-prevalence-15-49.png" width="70%" style="display: block; margin: auto;" /> <!-- .footnote[ Source: [Gapminder, Estimated HIV prevalence among 15-49 year olds](https://www.gapminder.org/data) ] --> Source: [Gapminder, Estimated HIV prevalence among 15-49 year olds](https://www.gapminder.org/data) --- .question[ Is this version tidy? ] <img src="img/hiv-est-prevalence-15-49-tidy.PNG" width="35%" style="display: block; margin: auto;" /> --- ## Displaying vs. summarising data .panelset[ .panel[ .panel-name[Tables] .pull-left[ | Name | Species | Height | Mass | |:----|----:|----:|---:| | Luke Skywalker | Human | 172 | 77 | | C-3PO | Droid | 167 | 75 | | R2-D2 | Droid | 96 | 32 | | Darth Vader | Human | 202 | 136 | | Leia Organa | Human | 150 | 49 | | ... | ... | ... | ... | ] .pull-right[ | Species | Count | Average Height | |:----|----:|----:| | Droid | 6 | 131 | | Human | 35 | 177 | | Other | 46 | 178 |
] ] .panel[ .panel-name[Code] ``` r library(tidyverse) ``` .pull-left[ ``` r starwars %>% select(species, height, mass) ``` ] .pull-right[ ``` r starwars %>% select(species, height) %>% mutate(species_group = case_when( species == "Droid" ~ "Droid", species == "Human" ~ "Human", TRUE ~ "Other")) %>% group_by(species_group) %>% summarise( count = n(), avg_height = mean(height, na.rm = TRUE) ) ``` ] ] ] --- class: middle # Grammar of data wrangling --- ## A grammar of data wrangling... ... based on the concepts of functions as verbs that manipulate data frames .pull-left[ <img src="img/dplyr-part-of-tidyverse.png" width="70%" /> ] .pull-right[ .midi[ - `select`: pick columns by name - `arrange`: reorder rows - `slice`: pick rows using index(es) - `filter`: pick rows matching criteria - `distinct`: filter for unique rows - `mutate`: add new variables - `summarise`: reduce variables to values - `group_by`: for grouped operations - ... (many more) ] ``` r <function_name>(<dataset_name>, ...) ``` ``` r help(select) ``` ] --- ## `select` specific variables .pull-left[ ``` r select(starwars, name, height, mass) ``` ``` ## # A tibble: 87 × 3 ## name height mass ## <chr> <int> <dbl> ## 1 Luke Skywalker 172 77 ## 2 C-3PO 167 75 ## 3 R2-D2 96 32 ## 4 Darth Vader 202 136 ## 5 Leia Organa 150 49 ## 6 Owen Lars 178 120 ## 7 Beru Whitesun Lars 165 75 ## 8 R5-D4 97 32 ## 9 Biggs Darklighter 183 84 ## 10 Obi-Wan Kenobi 182 77 ## # ℹ 77 more rows ``` ] .pull-right[ ``` r select(starwars, -name) ``` ``` ## # A tibble: 87 × 13 ## height mass hair_color skin_color ## <int> <dbl> <chr> <chr> ## 1 172 77 blond fair ## 2 167 75 <NA> gold ## 3 96 32 <NA> white, bl… ## 4 202 136 none white ## 5 150 49 brown light ## 6 178 120 brown, grey light ## 7 165 75 brown light ## 8 97 32 <NA> white, red ## 9 183 84 black light ## 10 182 77 auburn, white fair ## # ℹ 77 more rows ## # ℹ 9 more variables: eye_color <chr>, ## # birth_year <dbl>, sex <chr>, ## # gender <chr>, homeworld <chr>, ## # species <chr>, films <list>, ## # vehicles <list>, starships <list> ``` ] --- ## Select helper functions .pull-left[ ``` r select(starwars, ends_with("color")) ``` ``` ## # A tibble: 87 × 3 ## hair_color skin_color eye_color ## <chr> <chr> <chr> ## 1 blond fair blue ## 2 <NA> gold yellow ## 3 <NA> white, blue red ## 4 none white yellow ## 5 brown light brown ## 6 brown, grey light blue ## 7 brown light blue ## 8 <NA> white, red red ## 9 black light brown ## 10 auburn, white fair blue-gray ## # ℹ 77 more rows ``` ] .pull-right[ <br> .small[ * `starts_with()`: Starts with a prefix * `ends_with()`: Ends with a suffix * `contains()`: Contains a literal string * `num_range()`: Match a numerical range like x01, x02, x03 * `one_of()`: Matches variable names in a character vector * `everything()`: Matches all variables * `last_col()`: Select last variable, possibly with an offset * `matches()`: Matches a regular expression (a sequence of symbols/characters expressing a string/pattern to be searched for within text) See the help page for more info on these functions. ] ] --- ## `arrange` mass in numerical order .pull-left[ Ascending order: ``` r arrange(starwars, mass) ``` ``` ## # A tibble: 87 × 14 ## name height mass hair_color ## <chr> <int> <dbl> <chr> ## 1 Ratts Tyerel 79 15 none ## 2 Yoda 66 17 white ## 3 Wicket Systr… 88 20 brown ## 4 R2-D2 96 32 <NA> ## 5 R5-D4 97 32 <NA> ## 6 Sebulba 112 40 none ## 7 Padmé Amidala 185 45 brown ... ``` ] .pull-right[ Descending order: ``` r arrange(starwars, desc(mass)) ``` ``` ## # A tibble: 87 × 14 ## name height mass hair_color ## <chr> <int> <dbl> <chr> ## 1 Jabba Desili… 175 1358 <NA> ## 2 Grievous 216 159 none ## 3 IG-88 200 140 none ## 4 Darth Vader 202 136 none ## 5 Tarfful 234 136 brown ## 6 Owen Lars 178 120 brown, gr… ## 7 Bossk 190 113 none ... ``` ] --- ## `slice` for certain row numbers .pull-left[ ``` r slice(starwars, 1:4) ``` ``` ## # A tibble: 4 × 14 ## name height mass hair_color ## <chr> <int> <dbl> <chr> ## 1 Luke Skywalker 172 77 blond ## 2 C-3PO 167 75 <NA> ## 3 R2-D2 96 32 <NA> ## 4 Darth Vader 202 136 none ## # ℹ 10 more variables: ## # skin_color <chr>, eye_color <chr>, ## # birth_year <dbl>, sex <chr>, ## # gender <chr>, homeworld <chr>, ## # species <chr>, films <list>, ## # vehicles <list>, starships <list> ``` ] .pull-right[ ``` r slice(starwars, c(2,4,6,8)) ``` ``` ## # A tibble: 4 × 14 ## name height mass hair_color ## <chr> <int> <dbl> <chr> ## 1 C-3PO 167 75 <NA> ## 2 Darth Vader 202 136 none ## 3 Owen Lars 178 120 brown, grey ## 4 R5-D4 97 32 <NA> ## # ℹ 10 more variables: ## # skin_color <chr>, eye_color <chr>, ## # birth_year <dbl>, sex <chr>, ## # gender <chr>, homeworld <chr>, ## # species <chr>, films <list>, ## # vehicles <list>, starships <list> ``` ] --- ## `filter` for the droids ``` r filter(starwars, species == "Droid") ``` ``` ## # A tibble: 6 × 14 ## name height mass hair_color skin_color eye_color birth_year sex gender ## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> ## 1 C-3PO 167 75 <NA> gold yellow 112 none masculi… ## 2 R2-D2 96 32 <NA> white, blue red 33 none masculi… ## 3 R5-D4 97 32 <NA> white, red red NA none masculi… ## 4 IG-88 200 140 none metal red 15 none masculi… ## 5 R4-P17 96 NA none silver, red red, blue NA none feminine ## 6 BB8 NA NA none none black NA none masculi… ## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>, ## # vehicles <list>, starships <list> ``` <br> "These aren't the droids you're looking for!"<img style='vertical-align:middle;height:36px;' src='img/C3PO.webp'><img style='vertical-align:middle;height:36px;' src='img/R2D2.webp'> --- ## Logical operators in R | Code | Description | Example | Output | |:-------:|:----------------------|:---------------|:--------| | `<` | Strictly less than | `1 < 2` | `TRUE` | | `<=` | Less than or equal to | `2 <= 2` | `TRUE` | | `>` | Strictly greater than | `-3 > 3` | `FALSE` | | `>=` | Greater than or equal to | `10 >= 0` | `TRUE` | | `==` | Exactly equal to | `pi == 3.14` | `FALSE` | | `!=` | Not equal to | `pi != 3.14` | `TRUE` | | ` ` | ` ` | ` ` | ` ` | | ` ` | ` ` | ` ` | ` ` | | ` ` | ` ` | ` ` | ` ` | | `x & y` | Logical AND operator | `TRUE & FALSE` | `FALSE` | | `x `<span, style="color:#a7a798;"> | </span>` y` | Logical OR operator | `TRUE `<span, style="color:#a7a798;"> | </span>` FALSE` | `TRUE` | | `!x` | Logical NOT operator | `!TRUE ` | `FALSE` | ``` r filter(starwars, species == "Droid" & height < 100) ``` --- ### `mutate` to compute BMI and species group .pull-left[ ``` r mutate(starwars, bmi = 10000 * mass / height^2, species_grp = case_when( species == "Human" ~ "Human", species == "Droid" ~ "Droid", TRUE ~ "Other" )) ``` ] .pull-right[ ``` ## # A tibble: 87 × 16 ## name bmi species_grp height mass ## <chr> <dbl> <chr> <int> <dbl> ## 1 Jabba… 443. Other 175 1358 ## 2 Dud B… 50.9 Other 94 45 ## 3 Yoda 39.0 Other 66 17 ## 4 Owen … 37.9 Human 178 120 ## 5 IG-88 35 Droid 200 140 ## 6 R2-D2 34.7 Droid 96 32 ## 7 Griev… 34.1 Other 216 159 ## 8 R5-D4 34.0 Droid 97 32 ## 9 Jek T… 34.0 Other 180 110 ## 10 Darth… 33.3 Human 202 136 ## # ℹ 77 more rows ## # ℹ 11 more variables: ## # hair_color <chr>, skin_color <chr>, ## # eye_color <chr>, birth_year <dbl>, ## # sex <chr>, gender <chr>, ## # homeworld <chr>, species <chr>, ## # films <list>, vehicles <list>, … ``` ] --- ## `count` to make a frequency table .pull-left[ ``` r count(starwars, eye_color, sort = TRUE) ``` ``` ## # A tibble: 15 × 2 ## eye_color n ## <chr> <int> ## 1 brown 21 ## 2 blue 19 ## 3 yellow 11 ## 4 black 10 ## 5 orange 8 ## 6 red 5 ## 7 hazel 3 ## 8 unknown 3 ## 9 blue-gray 1 ## 10 dark 1 ## 11 gold 1 ## 12 green, yellow 1 ## 13 pink 1 ## 14 red, blue 1 ## 15 white 1 ``` ] --- ## `summerise` to compute the average height .pull-left[ ✅ Informative variable name ``` r summarise(starwars, avg_ht = mean(height)) ``` ``` ## # A tibble: 1 × 1 ## avg_ht ## <dbl> ## 1 179. ``` ``` r summarise(starwars, min_ht = min(height), avg_ht = mean(height), med_ht = median(height), max_ht = max(height), sd_ht = sd(height)) ``` ``` ## # A tibble: 1 × 5 ## min_ht avg_ht med_ht max_ht sd_ht ## <int> <dbl> <int> <int> <dbl> ## 1 88 179. 180 228 22.4 ``` ] .pull-right[ ❌ Bad/sloppy variable name ``` r summarise(starwars, mean(height)) ``` ``` ## # A tibble: 1 × 1 ## `mean(height)` ## <dbl> ## 1 179. ``` .tip[ `summarise()` changes the data frame entirely, it collapses rows down to a single summary statistic, and removes all columns that are irrelevant to the calculation. ] ] --- class: middle # .hand[We...] .huge[.green[have]] .hand[a single data frame] .huge[.pink[want]] .hand[to slice it, and dice it, and juice it, and process it] --- ## How to do multiple data wrangling steps? .pull-left[ **Method 1**: Nesting ``` r summarise( filter( starwars, species == "Human" ), avg_mass = mean(mass, na.rm = TRUE) ) ``` *ISSUE*: Where do you begin reading the code?! ] .pull-right[ **Method 2**: Assign to Memory ``` r starwars_humans <- filter(starwars, species == "Human") summarise(starwars_humans, av_mass = mean(mass, na.rm = TRUE)) ``` *ISSUE*: How good is *your* memory of each object in the computer's memory?! ] .question[ Which method (if either) produces `tidy` code that is easy to read? Can we do any better? ] --- class: middle # Pipes --- ## What is a pipe? In programming, a pipe is a technique for passing information from one process to another. The pipe operator is implemented in the package **magrittr**, though we don't need to load this package explicitly since **tidyverse** does this for us. .pull-left[ <img src="img/magrittr_hex.png" width="50%" style="display: block; margin: auto;" /> ] .pull-right[ ``` r data %>% select(variable1, variable2) %>% filter(variable1 <= variable2) ``` .center[
] <img src="img/piping_code.png" width="100%" /> ] --- ## The piping narrative .pull-left[ * **Take the `starwars` data set and ...** ] .pull-right[ .small[ ``` r *starwars ``` ``` ## # A tibble: 87 × 14 ## name height mass hair_color skin_color eye_color ## <chr> <int> <dbl> <chr> <chr> <chr> ## 1 Luke Skywal… 172 77 blond fair blue ## 2 C-3PO 167 75 <NA> gold yellow ## 3 R2-D2 96 32 <NA> white, bl… red ## 4 Darth Vader 202 136 none white yellow ## 5 Leia Organa 150 49 brown light brown ## 6 Owen Lars 178 120 brown, gr… light blue ## 7 Beru Whites… 165 75 brown light blue ## 8 R5-D4 97 32 <NA> white, red red ## 9 Biggs Darkl… 183 84 black light brown ## 10 Obi-Wan Ken… 182 77 auburn, w… fair blue-gray ## # ℹ 77 more rows ## # ℹ 8 more variables: birth_year <dbl>, sex <chr>, ## # gender <chr>, homeworld <chr>, species <chr>, ## # films <list>, vehicles <list>, starships <list> ``` ] ] --- ## The piping narrative .pull-left[ * Take the `starwars` data set and ... * **... select the `species` and `height` variables. ** ] .pull-right[ .small[ ``` r starwars %>% * select(species, height) ``` ``` ## # A tibble: 87 × 2 ## species height ## <chr> <int> ## 1 Human 172 ## 2 Droid 167 ## 3 Droid 96 ## 4 Human 202 ## 5 Human 150 ## 6 Human 178 ## 7 Human 165 ## 8 Droid 97 ## 9 Human 183 ## 10 Human 182 ## # ℹ 77 more rows ``` ] ] --- ## The piping narrative .pull-left[ * Take the `starwars` data set and ... * ... select the `species` and `height` variables. * **Mutate the data to add a new categorical variable called `species_group` that classifies a character's `species` as either "Droid", "Human" or "Other".** ] .pull-right[ .small[ ``` r starwars %>% select(species, height) %>% * mutate(species_group = case_when( * species == "Droid" ~ "Droid", * species == "Human" ~ "Human", * TRUE ~ "Other")) ``` ``` ## # A tibble: 87 × 3 ## species height species_group ## <chr> <int> <chr> ## 1 Human 172 Human ## 2 Droid 167 Droid ## 3 Droid 96 Droid ## 4 Human 202 Human ## 5 Human 150 Human ## 6 Human 178 Human ## 7 Human 165 Human ## 8 Droid 97 Droid ## 9 Human 183 Human ## 10 Human 182 Human ## # ℹ 77 more rows ``` ] ] --- ## The piping narrative .pull-left[ * Take the `starwars` data set and ... * ... select the `species` and `height` variables. * Mutate the data to add a new categorical variable called `species_group` that classifies a character's `species` as either "Droid", "Human" or "Other". * **Next, group the data according to their `species_group`, and ...** ] .pull-right[ .small[ ``` r starwars %>% select(species, height) %>% mutate(species_group = case_when( species == "Droid" ~ "Droid", species == "Human" ~ "Human", TRUE ~ "Other")) %>% * group_by(species_group) ``` ``` ## # A tibble: 87 × 3 ## # Groups: species_group [3] ## species height species_group ## <chr> <int> <chr> ## 1 Human 172 Human ## 2 Droid 167 Droid ## 3 Droid 96 Droid ## 4 Human 202 Human ## 5 Human 150 Human ## 6 Human 178 Human ## 7 Human 165 Human ## 8 Droid 97 Droid ## 9 Human 183 Human ## 10 Human 182 Human ## # ℹ 77 more rows ``` ] ] --- ## The piping narrative .pull-left[ * Take the `starwars` data set and ... * ... select the `species` and `height` variables. * Mutate the data to add a new categorical variable called `species_group` that classifies a character's `species` as either "Droid", "Human" or "Other". * Next, group the data according to their `species_group`, and ... * **... summarise the data to return ...** ] .pull-right[ .small[ ``` r starwars %>% select(species, height) %>% mutate(species_group = case_when( species == "Droid" ~ "Droid", species == "Human" ~ "Human", TRUE ~ "Other")) %>% group_by(species_group) %>% * summarise() ``` ``` ## # A tibble: 3 × 1 ## species_group ## <chr> ## 1 Droid ## 2 Human ## 3 Other ``` ] ] --- ## The piping narrative .pull-left[ * Take the `starwars` data set and ... * ... select the `species` and `height` variables. * Mutate the data to add a new categorical variable called `species_group` that classifies a character's `species` as either "Droid", "Human" or "Other". * Next, group the data according to their `species_group`, and ... * ... summarise the data to return ... * **... the group's sample size and ...** ] .pull-right[ .small[ ``` r starwars %>% select(species, height) %>% mutate(species_group = case_when( species == "Droid" ~ "Droid", species == "Human" ~ "Human", TRUE ~ "Other")) %>% group_by(species_group) %>% summarise( * count = n() ) ``` ``` ## # A tibble: 3 × 2 ## species_group count ## <chr> <int> ## 1 Droid 6 ## 2 Human 35 ## 3 Other 46 ``` ] ] --- ## The piping narrative .pull-left[ * Take the `starwars` data set and ... * ... select the `species` and `height` variables. * Mutate the data to add a new categorical variable called `species_group` that classifies a character's `species` as either "Droid", "Human" or "Other". * Next, group the data according to their `species_group`, and ... * ... summarise the data to return ... * ... the group's sample size and ... * **... the group's sample mean** ] .pull-right[ .small[ ``` r starwars %>% select(species, height) %>% mutate(species_group = case_when( species == "Droid" ~ "Droid", species == "Human" ~ "Human", TRUE ~ "Other")) %>% group_by(species_group) %>% summarise( count = n(), * avg_height = mean(height) ) ``` ``` ## # A tibble: 3 × 3 ## species_group count avg_height ## <chr> <int> <dbl> ## 1 Droid 6 NA ## 2 Human 35 NA ## 3 Other 46 177. ``` ] ] --- ## The piping narrative .pull-left[ * Take the `starwars` data set and ... * ... select the `species` and `height` variables. * Mutate the data to add a new categorical variable called `species_group` that classifies a character's `species` as either "Droid", "Human" or "Other". * Next, group the data according to their `species_group`, and ... * ... summarise the data to return ... * ... the group's sample size and ... * ... the group's sample mean **(excluding the rows where data is not available, `NA`).** ] .pull-right[ .small[ ``` r starwars %>% select(species, height) %>% mutate(species_group = case_when( species == "Droid" ~ "Droid", species == "Human" ~ "Human", TRUE ~ "Other")) %>% group_by(species_group) %>% summarise( count = n(), * avg_height = mean(height, na.rm = TRUE) ) ``` ``` ## # A tibble: 3 × 3 ## species_group count avg_height ## <chr> <int> <dbl> ## 1 Droid 6 131. ## 2 Human 35 178 ## 3 Other 46 177. ``` ] ] --- ## A note on piping and layering - `%>%` is used in data wrangling pipelines. We _pipe_ the output of the previous line of code as the first input of the next line of code. - Later on you will see another simple command to collocate lines of codes under data visualization - Namely, `+` is used in **ggplot2** plots for "layering". We will create the plot by _adding_ layers (in Week 4) - Be aware of how `%>%` and `+` are used differently! - They can be combined for certain tasks! But we are not allowed to use them interchangeably.