class: center, middle, inverse, title-slide .title[ # Programming ] .subtitle[ ##
Introduction to Data Science ] .author[ ### University of Edinburgh ] .date[ ###
2024/2025 ] --- class: middle # R environment --- ## 🏁 Start with `sessionInfo()` <img src="img/sessioninfo24.png" width="70%" /> - `sessionInfo()` provides essential information about the current session environment (such as R version, operating system, and loaded packages). - Knowing the configuration of the R environment is vital when developing and sharing code, to ensure compatibility and proper functioning across different systems. --- ## For more information Use `search()` to list all the attached packages and environments on the R search path. .pull-left[ ``` r search() ``` ``` ## [1] ".GlobalEnv" "package:stats" ## [3] "package:graphics" "package:grDevices" ## [5] "package:utils" "package:datasets" ## [7] "package:methods" "Autoloads" ## [9] "package:base" ``` ] .pull-right[ ``` r library(tidyverse) search() ``` ``` ## [1] ".GlobalEnv" "package:lubridate" ## [3] "package:forcats" "package:stringr" ## [5] "package:dplyr" "package:purrr" ## [7] "package:readr" "package:tidyr" ## [9] "package:tibble" "package:ggplot2" ## [11] "package:tidyverse" "package:stats" ## [13] "package:graphics" "package:grDevices" ## [15] "package:utils" "package:datasets" ## [17] "package:methods" "Autoloads" ## [19] "package:base" ``` ] --- Use `ls()` to see all the variables and functions that are defined in the current environment (by default, the global environment). It returns a vector of character strings. ``` r ls() ``` ``` ## character(0) ``` -- If we add variables or a function to the global environment, `ls()` lists them. ``` r a <- 1; b <- 10; f <- function(x) x <- 1 ls() ``` ``` ## [1] "a" "b" "f" ``` -- Remove a variable or function from the global environment with `rm()`. ``` r rm(a, f) ls() ``` ``` ## [1] "b" ``` --- Use `ls()` to see what is defined in a specific package. 
``` r search() ``` ``` ## [1] ".GlobalEnv" "package:lubridate" "package:forcats" ## [4] "package:stringr" "package:dplyr" "package:purrr" ## [7] "package:readr" "package:tidyr" "package:tibble" ## [10] "package:ggplot2" "package:tidyverse" "package:stats" ## [13] "package:graphics" "package:grDevices" "package:utils" ## [16] "package:datasets" "package:methods" "Autoloads" ## [19] "package:base" ``` Let's see what is in package `datasets` of the list. ``` r ls(16) ``` ``` ## [1] "ability.cov" "airmiles" "AirPassengers" "airquality" ## [5] "anscombe" "attenu" "attitude" "austres" ## [9] "beaver1" "beaver2" "BJsales" "BJsales.lead" ## [13] "BOD" "cars" "ChickWeight" ## [ reached getOption("max.print") -- omitted 89 entries ] ``` --- Use `ls()` to list all the objects in a package. ``` r ls("package:datasets") ``` ``` ## [1] "ability.cov" "airmiles" "AirPassengers" "airquality" ## [5] "anscombe" "attenu" "attitude" "austres" ## [9] "beaver1" "beaver2" "BJsales" "BJsales.lead" ## [13] "BOD" "cars" "ChickWeight" "chickwts" ## [17] "co2" "CO2" "crimtab" "discoveries" ## [21] "DNase" "esoph" "euro" "euro.cross" ## [25] "eurodist" "EuStockMarkets" "faithful" "fdeaths" ## [29] "Formaldehyde" "freeny" "freeny.x" "freeny.y" ## [33] "HairEyeColor" "Harman23.cor" "Harman74.cor" "Indometh" ## [37] "infert" "InsectSprays" "iris" "iris3" ## [41] "islands" "JohnsonJohnson" "LakeHuron" "ldeaths" ## [45] "lh" ## [ reached getOption("max.print") -- omitted 59 entries ] ``` --- class: middle # Functions --- ## When should you write a function? .pull-left[ <img src="img/funct-all-things.png" width="100%" /> ] .pull-right[ When you’ve copied and pasted a block of code more than twice. ] --- ## Why functions? 
- Automate common tasks in a more powerful and general way than copy-and-pasting: - Choose an informative name for your function to make code easier to understand - As requirements change, you only need to update code in one place, instead of many - Eliminate the chance of making incidental mistakes when you copy and paste (e.g. updating a variable name in one place, but not in another) -- - Down the line: Improve your reach as a data scientist by writing functions (and packages!) that others can use --- class: middle # Writing functions --- ## Turn your code into a function - Pick a short but informative **name**, preferably a verb. <br> <br> <br> ``` r name_function <- .. .. ``` --- ## Turn your code into a function - Pick a short but informative **name**, preferably a verb. - List the function inputs, or **arguments**, inside `function( )`. If there is more than one input, the syntax looks like `function(x, y, z)`. <br> ``` r name_function <- function(x){ } ``` --- ## Turn your code into a function - Pick a short but informative **name**, preferably a verb. - List the function inputs, or **arguments**, inside `function( )`. If there is more than one input, the syntax looks like `function(x, y, z)`. - Place the **code** you have developed in the body of the function, a `{ .. }` block that immediately follows `function(...)`. ``` r name_function <- function(x){ # code you want the function to do } ``` --- ## What goes in / what comes out? .pull-left-wide[ - Functions take the input(s) listed in the function definition ``` r name_function <- function([inputs separated by commas]){ # what to do with those inputs } ``` - By default, they return the last value computed in the function ``` r name_function <- function(x){ # do a bunch of stuff with the input... # return the final object return(...) } ``` - You can return several outputs in a list, and define nice print methods for them (but we won't go there for now...) ] --- .question[ What is going on here? 
] ``` r add_2 <- function(x){ x + 2 1000 } ``` ``` r add_2(3) ``` ``` ## [1] 1000 ``` ``` r add_2(10) ``` ``` ## [1] 1000 ``` --- ## R passes arguments by value This means that an R function cannot change the variable that you pass to it. ``` r triple <- function(x) { x <- 3*x # this change is only local to the function environment x } a <- 5 ``` ``` r triple(a) ``` ``` ## [1] 15 ``` ``` r a ``` ``` ## [1] 5 ``` --- ## Naming functions > "There are only two hard things in Computer Science: cache invalidation and naming things." - Phil Karlton --- ## Naming functions - Names should be short but clearly inform on what the function does - Names should be verbs, not nouns - Multi-word names should be separated by underscores (`snake_case` as opposed to `camelCase`) - A family of functions should be named similarly (`scrape_page()`, `scrape_speech()` OR `str_remove()`, `str_replace()` etc.) --- ## Overwriting existing functions ``` r mean ``` ``` ## function (x, ...) ## UseMethod("mean") ## <bytecode: 0x110bfb1a8> ## <environment: namespace:base> ``` ``` r mean(1:10) ``` ``` ## [1] 5.5 ``` What happens if we write a function with the same name? ``` r mean <- function(x){ x+2 } mean(1:10) ``` ``` ## [1] 3 4 5 6 7 8 9 10 11 12 ``` --- ## Overwriting existing functions Now the new `mean` is: ``` r mean ``` ``` ## function (x) ## { ## x + 2 ## } ``` How can we go back? ``` r mean <- base::mean ``` - Avoid overwriting existing (especially widely used) functions! <!--- ``` r # JUST DON'T mean <- function(x){ return(x+2) } rm(mean) mean(1:10) ``` ``` ## [1] 5.5 ``` ---> --- ## Functions in action Create a function that calculates the area of a circle. - Input: The function needs the `r`adius of the circle as input. - Output: Return the area of the circle, `\(\pi \cdot r^2\)` .pull-left[ ``` r area <- function(r) { return(pi*r^2) } area(2) ``` ``` ## [1] 12.56637 ``` ] -- .pull-right[ What if the input is negative? 
``` r area(-2) ``` ``` ## [1] 12.56637 ``` **Be careful!** ] --- ## Functions in action If the input is negative, the function should throw an error. Let's add a condition. * `message` prints a message but does not stop execution. * `warning` prints a warning message but does not stop execution. * `stop` stops execution of the current expression and executes an error action. .pull-left[ ``` r area <- function(r) { if (r < 0) stop("Provide non-negative radius.") return(pi*r^2) } area(2) ``` ``` ## [1] 12.56637 ``` ] -- .pull-right[ What if the input is negative? ``` r area(-2) ``` ## Error in area(-2) : Provide non-negative radius. ] --- ## Functions in action If the input is negative, the function should throw an error. Let's add a condition. * `message` prints a message but does not stop execution. * `warning` prints a warning message but does not stop execution. * `stop` stops execution of the current expression and executes an error action. .pull-left[ ``` r area <- function(r) { if (r < 0) stop("Provide non-negative radius.") return(pi*r^2) } area(2) ``` ``` ## [1] 12.56637 ``` ] .pull-right[ What if the input is logical? ``` r area(TRUE) ``` ``` ## [1] 3.141593 ``` ] --- ## Functions in action If the input is negative, the function should throw an error. Let's add a condition. * `message` prints a message but does not stop execution. * `warning` prints a warning message but does not stop execution. * `stop` stops execution of the current expression and executes an error action. .pull-left[ ``` r area <- function(r) { if (r < 0) stop("Provide non-negative radius.") return(pi*r^2) } area(2) ``` ``` ## [1] 12.56637 ``` ] .pull-right[ What if the input is logical? ``` r area(TRUE) ``` ``` ## [1] 3.141593 ``` `TRUE` (logical) is converted to `1` (numeric). ] --- ## Functions in action Add the condition that the radius `r` should be a numeric value. Check the type first, so that `r < 0` is only evaluated for numeric input. ``` r area <- function(r) { if (!is.numeric(r) || r < 0) stop("Provide non-negative numeric radius.") return(pi*r^2) } ``` Try again. 
``` r area(TRUE) ``` ## Error in area(TRUE) : Provide non-negative numeric radius. --- ## Anonymous functions Sometimes we want to define a function without giving it a name, such as when we need to apply a function to multiple columns of a data frame. The `airquality` dataset (in the `datasets` package) contains daily air quality measurements in New York, May to September 1973. ``` r head(airquality) ``` ``` ## Ozone Solar.R Wind Temp Month Day ## 1 41 190 7.4 67 5 1 ## 2 36 118 8.0 72 5 2 ## 3 12 149 12.6 74 5 3 ## 4 18 313 11.5 62 5 4 ## 5 NA NA 14.3 56 5 5 ## 6 28 NA 14.9 66 5 6 ``` --- ## Anonymous functions Let's find the mean of the first four columns: ``` r airquality %>% summarize( across(Ozone:Temp, mean) ) ``` ``` ## Ozone Solar.R Wind Temp ## 1 NA NA 9.957516 77.88235 ``` -- Can we remove missing values? ``` r airquality %>% summarize( across(Ozone:Temp, function(x) mean(x, na.rm = TRUE) # anonymous or lambda function ) ) ``` ``` ## Ozone Solar.R Wind Temp ## 1 42.12931 185.9315 9.957516 77.88235 ``` --- ## Anonymous functions Many approaches to do the same thing: .panelset[ .panel[.panel-name[Version 1] ``` r airquality %>% summarize( across(Ozone:Temp, function(x) mean(x, na.rm = TRUE) ) ) ``` ``` ## Ozone Solar.R Wind Temp ## 1 42.12931 185.9315 9.957516 77.88235 ``` ] .panel[.panel-name[Version 2] ``` r airquality %>% summarize( across(Ozone:Temp, \(x) mean(x, na.rm = TRUE) # Base R style ) ) ``` ``` ## Ozone Solar.R Wind Temp ## 1 42.12931 185.9315 9.957516 77.88235 ``` ] .panel[.panel-name[Version 3] ``` r airquality %>% summarize( across(Ozone:Temp, ~ mean(.x, na.rm = TRUE) # formula style ) ) ``` ``` ## Ozone Solar.R Wind Temp ## 1 42.12931 185.9315 9.957516 77.88235 ``` ] ] --- ## Passing column names as input What if you want to apply a function to different columns (and different datasets)? 
Suppose we keep writing code that keeps the rows where a column is above a given threshold: .panelset[ .panel[.panel-name[Temperature] ``` r airquality %>% filter(Temp > 95) # 95 °F = 35 °C ``` ``` ## Ozone Solar.R Wind Temp Month Day ## 1 76 203 9.7 97 8 28 ## 2 84 237 6.3 96 8 30 ``` ] .panel[.panel-name[Wind] ``` r airquality %>% filter(Wind > 15) # (miles per hour) ``` ``` ## Ozone Solar.R Wind Temp Month Day ## 1 8 19 20.1 61 5 9 ## 2 6 78 18.4 57 5 18 ## 3 11 320 16.6 73 5 22 ## 4 NA 66 16.6 57 5 25 ## 5 NA 242 16.1 67 6 3 ## 6 37 284 20.7 72 6 17 ## 7 21 259 15.5 77 8 21 ## 8 32 92 15.5 84 9 6 ## 9 21 259 15.5 76 9 12 ## 10 14 20 16.6 63 9 25 ``` ] .panel[.panel-name[Ozone] ``` r airquality %>% filter(Ozone > 100) # (parts per billion) ``` ``` ## Ozone Solar.R Wind Temp Month Day ## 1 115 223 5.7 79 5 30 ## 2 135 269 4.1 84 7 1 ## 3 108 223 8.0 85 7 25 ## 4 122 255 4.0 89 8 7 ## 5 110 207 8.0 90 8 9 ## 6 168 238 3.4 81 8 25 ## 7 118 225 2.3 94 8 29 ``` ] ] --- ## Passing column names as input We can define a new function `filter_above` that we can use on any dataset, by providing the column name and the threshold: ``` r filter_above <- function(data, column, threshold) { data %>% filter({{ column }} > threshold) } ``` The `{{ }}` allows your function to work similarly to other `dplyr` functions. You can pass the name of the variable you'd like to filter, without using quotes. ``` r filter_above(airquality, Temp, 95) ``` ``` ## Ozone Solar.R Wind Temp Month Day ## 1 76 203 9.7 97 8 28 ## 2 84 237 6.3 96 8 30 ``` --- ## Passing column names as input We can define a new function `filter_above` that we can use on any dataset, by providing the column name and the threshold: ``` r filter_above <- function(data, column, threshold) { data %>% filter({{ column }} > threshold) } ``` You can even use it with the pipe `%>%`! ``` r airquality %>% filter_above(Temp, 95) ``` ``` ## Ozone Solar.R Wind Temp Month Day ## 1 76 203 9.7 97 8 28 ## 2 84 237 6.3 96 8 30 ```
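---

## Passing column names as input

A minimal sketch (not part of the original slides) of how the input check from the `area()` example could be combined with `{{ }}`. The name `filter_above_checked` and its error message are hypothetical, chosen only for illustration, and the code assumes `dplyr` is installed.

``` r
library(dplyr)

# Hypothetical variant of filter_above() with an input check (illustration only)
filter_above_checked <- function(data, column, threshold) {
  # Validate the threshold first, in the same spirit as the area() example
  if (!is.numeric(threshold) || length(threshold) != 1) {
    stop("Provide a single numeric threshold.")
  }
  # {{ column }} lets the caller pass an unquoted column name
  data %>% filter({{ column }} > threshold)
}

# The two rows with Temp above 95, as with filter_above()
filter_above_checked(airquality, Temp, 95)

# The same function applied to a different data frame and column
mtcars %>% filter_above_checked(mpg, 30)
```

Checking the threshold before filtering means that a call such as `filter_above_checked(airquality, Temp, "95")` stops with an error rather than silently comparing numbers with text.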