class: center, middle, inverse, title-slide .title[ # Iteration ] .subtitle[ ##
Introduction to Data Science ] .author[ ### University of Edinburgh ] .date[ ###
2024/2025 ] --- class: middle # Iterations - For loop - While loop - Mapping --- ## For loops in R - The `for` loop iterates over each element in **values** (which can be a vector, list, or any other iterable object), executing the specified block of code once for each element. - It is an entry controlled loop, i.e. first the expression in **values** is evaluated, then the **body of the loop is executed** for each of the values. For loop Syntax in R: ``` r for (<index> in <values>) { # do something } ``` --- ## Example 1 - with a sequence .pull-left[ ``` r for (i in 1:10){ print(5*i) } ``` ``` ## [1] 5 ## [1] 10 ## [1] 15 ## [1] 20 ## [1] 25 ## [1] 30 ## [1] 35 ## [1] 40 ## [1] 45 ## [1] 50 ``` ] .pull-right[ Narrative: - START. - Set the loop values as the sequence<br> `1 2 3 4 ... 10` - Set index `i` to the 1st value, `1` - Compute `5*i` and print. - Set `index` to the next value, `2` - Compute `5*i` and print. - Set `index` to the next value, `3` - ... (continue) ... - Set `index` to the next value, `10` - Compute `5*i` and print. - Reached end of the values. - END. ] --- ## Example 2 - with a general vector ``` r # Create a vector with the days of the week week <- c('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday') # using for loop to iterate over the values in the vector for (day in week) { # displaying each string in the vector print(day) } ``` ``` ## [1] "Sunday" ## [1] "Monday" ## [1] "Tuesday" ## [1] "Wednesday" ## [1] "Thursday" ## [1] "Friday" ## [1] "Saturday" ``` --- ## Generalising - `seq_len(n)` .pull-left[ * Creates a sequence from `1` to `n`. * An alternative method to `1:n`. * Example: ``` r seq_len(6) # same as 1:6 ``` ``` ## [1] 1 2 3 4 5 6 ``` ``` r n <- 5 for(i in seq_len(n)){ print(LETTERS[i]) } ``` ``` ## [1] "A" ## [1] "B" ## [1] "C" ## [1] "D" ## [1] "E" ``` ] .pull-right[ - Useful for catching unexpected behaviour ``` r seq_len(0) # an empty vector ``` ``` ## integer(0) ``` ``` r # Bad example n <- 0 for(i in 1:0) print(LETTERS[i]) ``` ``` ## [1] "A" ## character(0) ``` ``` r # Good example n <- 0 for(i in seq_len(n)) print(LETTERS[i]) ``` ] --- ## Generalising - `seq_along(x)` .pull-left[ * Creates an index sequence based on the input `x` * Example: ``` r week <- c('Sun', 'Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat') seq_along(week) ``` ``` ## [1] 1 2 3 4 5 6 7 ``` ] .pull-right[ ``` r week <- c('Sun', 'Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat') #Compute number of characters in each string #Set-up blank vector to store calculations num_char <- rep(0L, length(week)) for(i in seq_along(week)){ #Do calc. for i-th weekday num_char[i] <- nchar(week[i]) } num_char ``` ``` ## [1] 3 3 4 3 5 3 3 ``` ``` r nchar(week) ``` ``` ## [1] 3 3 4 3 5 3 3 ``` ] --- class: middle # While loop --- ## While loops in R - The `while` loop in R repeatedly executes a block of code as long as the specified **condition** evaluates to `TRUE`. - In this loop, the **test condition** is evaluated at the beginning of each iteration; if it evaluates to `TRUE`, the **body of the loop** is executed. While loop Syntax in R: ``` r while (<condition>) { #statement } ``` --- ## Example 1 - for loop as while loop .pull-left[ ``` r i <- 0 while (i < 10){ print(5 * i) i <- i + 1 # IMPORTANT! } ``` ``` ## [1] 0 ## [1] 5 ## [1] 10 ## [1] 15 ## [1] 20 ## [1] 25 ## [1] 30 ## [1] 35 ## [1] 40 ## [1] 45 ``` ] .pull-right[ Narrative: * Initialise `i` with `0` * Enter while loop * TEST: `i < 10`, i.e. `0 < 10` -> `TRUE` * Compute `5*i` and print * Add `1` to `i` & re-assign (`i == 1`) * TEST: `i < 10`, i.e. `1 < 10` -> `TRUE` * Compute `5*i` and print * Add `1` to `i` & re-assign (`i == 2`) * ... (continue) ... * TEST: `i < 10`, i.e. `10 < 10` -> `FALSE` * Exit while loop ] --- ## Caution! Infinite loops .pull-left[ ``` r i <- 0 while (i < 10){ print(5 * i) * i <- i - 1 } ``` ``` ## [1] 0 ## [1] -5 ## [1] -10 ## [1] -15 ## [1] -20 ## [1] -25 ## [1] -30 ## [1] -35 ## [1] -40 ## [1] -45 ## [1] -50 ## [1] -55 ## [1] -60 ## [1] -65 ## [1] -70 ## [1] -75 ## [1] -80 ## [1] -85 ## [1] -90 ## [1] -95 ## [1] -100 ``` ] .pull-right[ Narrative: * Initialise `i` with `0` * Enter while loop * TEST: `i < 10`, i.e. `0 < 10` -> `TRUE` * Compute `5*i` and print * Update `i` (`i = -1`) * TEST: `i < 10`, i.e. `-1 < 10` -> `TRUE` * Compute `5*i` and print * Update `i` (`i = -2`) * ... (continue) ... * ... (FOR EVER!) ... * .small[... (AND EVER!) ...] * .small[... (AND EVER!) ...] * .tiny[... (AND EVER!) ...] * .tiny[... (AND EVER!) ...] ] --- ## Main differences between for and while loop. - A `for` loop is used when the number of iterations a code should be run is **known**; a `while` loop is used when the number of iterations is **not known**. - `while` loops can potentially result in infinite loops if not written properly. - In `for` loops the evaluation of the iteration <tt>values</tt> is done once and never repeated. - In `while` loops the test condition is evaluated each time the loop iterates. - If you accidental create an infinite `while` loop, either: - Press the `Esc` key to stop all computation, or - Click on the stop button, 🟥 (for code chunk) or 🛑 (for console/viewer panels) --- ## Example 2 - convergence of algorithm .panelset[ .panel[.panel-name[Gradient descent algorithm] Use gradient descent algorithm to minimize a function. - Initialize the starting point `\(x_1\)` - Set the learning rate `\(\alpha>0\)` - `\(x_{n+1} = x_n - \alpha \cdot f'(x_n)\)` - Stop when `\(\vert x_{n+1}-x_n \vert < \epsilon\)` For example, let's find the minimum of `\(f(x) = x^2 -4x+4\)`. ] .panel[.panel-name[Code] ``` r f <- function(x) { x^2 - 4*x + 4 } df <- function(x) { 2*x - 4 } epsilon <- 1e-5 # Convergence criterion (tolerance) alpha <- 0.3 # Learning rate x <- 3 # Initial guess error <- 1 # For initial step iter <- 1 while(error > epsilon) { # Gradient descent algorithm gradient <- df(x) new_x <- x - alpha * gradient error <- abs(new_x - x) # Update error x <- new_x # Update the guess cat("Iteration", iter, ": x =", x, "error =", error, "f(x) =", f(x), "\n") iter <- iter + 1 # Update the iteration } cat("Minimum occurs at x =", x, "\n") ``` ] .panel[.panel-name[Output] ``` ## Iteration 1 : x = 2.4 error = 0.6 f(x) = 0.16 ## Iteration 2 : x = 2.16 error = 0.24 f(x) = 0.0256 ## Iteration 3 : x = 2.064 error = 0.096 f(x) = 0.004096 ## Iteration 4 : x = 2.0256 error = 0.0384 f(x) = 0.00065536 ## Iteration 5 : x = 2.01024 error = 0.01536 f(x) = 0.0001048576 ## Iteration 6 : x = 2.004096 error = 0.006144 f(x) = 1.677722e-05 ## Iteration 7 : x = 2.001638 error = 0.0024576 f(x) = 2.684355e-06 ## Iteration 8 : x = 2.000655 error = 0.00098304 f(x) = 4.294967e-07 ## Iteration 9 : x = 2.000262 error = 0.000393216 f(x) = 6.871948e-08 ## Iteration 10 : x = 2.000105 error = 0.0001572864 f(x) = 1.099512e-08 ## Iteration 11 : x = 2.000042 error = 6.291456e-05 f(x) = 1.759219e-09 ## Iteration 12 : x = 2.000017 error = 2.516582e-05 f(x) = 2.814753e-10 ## Iteration 13 : x = 2.000007 error = 1.006633e-05 f(x) = 4.503597e-11 ## Iteration 14 : x = 2.000003 error = 4.026532e-06 f(x) = 7.205792e-12 ``` ``` ## Minimum occurs at x = 2.000003 ``` ] ] --- ## Example 2 - convergence of algorithm Note that I used ``` r while(error > epsilon){ # ... } ``` instead of ``` r while(error != 0){ # ... } ``` Why? -- Floating point calculations! --- ## Unexpected behavour with floating point numbers .pull-left[ ``` r #EXAMPLE 1 2/7 + 3/7 + 2/7 ``` ``` ## [1] 1 ``` ``` r 2/7 + 3/7 + 2/7 == 1 ``` ``` ## [1] FALSE ``` ``` r 2/7 + 3/7 + 2/7 - 1 ``` ``` ## [1] -1.110223e-16 ``` ] -- .pull-right[ ``` r #EXAMPLE 2 1 + 10^(-16) == 1 ``` ``` ## [1] TRUE ``` ``` r 10^(-16) ``` ``` ## [1] 1e-16 ``` ``` r .Machine$double.eps ``` ``` ## [1] 2.220446e-16 ``` ] --- ## Solutions: use a **tolerance** value, or `near`. ``` r #EXAMPLE 1 2/7 + 3/7 + 2/7 == 1 ``` ``` ## [1] FALSE ``` .pull-left[ ``` r tolerance <- 10^-16 # Something small 2/7 + 3/7 + 2/7 - 1 < tolerance ``` ``` ## [1] TRUE ``` ] -- .pull-right[ ``` r # default tol = .Machine$double.eps^0.5 # which approx is 1.5e-8 near(2/7 + 3/7 + 2/7,1) ``` ``` ## [1] TRUE ``` ] --- ## `next` and `break` statements - The `next` statement terminates the current iteration early and immediately proceeds to the next iteration of the loop. - The `break` statement exits the entire loop prematurely, halting any further iterations. .pull-left[ ``` r for(i in 1:6){ if(i %% 2 == 0){ next } print(i) } ``` ``` ## [1] 1 ## [1] 3 ## [1] 5 ``` ] .pull-right[ ``` r i <- 1 while(i < 6){ if(i == 3){ break } print(i) i <- i + 1 } ``` ``` ## [1] 1 ## [1] 2 ``` ] --- class: middle # Mapping: ## To loop or not to loop, that is the question! --- ## Avoiding Explicit Loops Suppose we have exam 1 and exam 2 scores of 10 students stored in a list... ``` r exam_scores <- list( exam1 <- c(80, 90, 70, 50, 80, 40, 90, 95, 80, 90), exam2 <- c(85, 83, 45, 60, 85, 95, 40, 75, 90, 85) ) ``` Calculate the mean of exam 1 and exam 2 using a loop: ``` r m <- numeric(length(exam_scores)) for (i in seq_along(exam_scores)){ m[i] <- mean(exam_scores[[i]]) } m ``` ``` ## [1] 76.5 74.3 ``` Can we make the R code both simpler and more efficient? --- ## `map_something` Functions for looping over an object and returning a value (of a specific type): * `map()` - returns a list * `map_lgl()` - returns a logical vector * `map_int()` - returns a integer vector * `map_dbl()` - returns a double vector * `map_chr()` - returns a character vector * `map_df()` / `map_dfr()` - returns a data frame by row binding * `map_dfc()` - returns a data frame by column binding * ... --- ## How does mapping work? Suppose we have `exam1` and `exam2` scores of 10 students stored in a list... ``` r exam_scores <- list( exam1 <- c(80, 90, 70, 50, 80, 40, 90, 95, 80, 90), exam2 <- c(85, 83, 45, 60, 85, 95, 40, 75, 90, 85) ) ``` ...and we find the mean score in each exam .pull-left[ ``` r map(exam_scores, mean) ``` ``` ## [[1]] ## [1] 76.5 ## ## [[2]] ## [1] 74.3 ``` ] .pull-right[ ``` r map_dbl(exam_scores, mean) ``` ``` ## [1] 76.5 74.3 ``` ] --- ## Comparing execution times ⏱ .pull-left[ * Via looping ``` r start.time <- Sys.time() m <- numeric(length(exam_scores)) for (i in seq_along(exam_scores)){ m[i] <- mean(exam_scores[[i]]) } end.time <- Sys.time() m ``` ``` ## [1] 76.5 74.3 ``` ``` r end.time - start.time ``` ``` ## Time difference of 0.001788139 secs ``` ] .pull-right[ * Via mapping ``` r start.time <- Sys.time() out <- map_dbl(exam_scores, mean) end.time <- Sys.time() out ``` ``` ## [1] 76.5 74.3 ``` ``` r end.time - start.time ``` ``` ## Time difference of 0.0004599094 secs ``` ] --- ## Examples ``` r map_dbl(mtcars, median) ``` ``` ## mpg cyl disp hp drat wt qsec vs am gear ## 19.200 6.000 196.300 123.000 3.695 3.325 17.710 0.000 0.000 4.000 ## carb ## 2.000 ``` ``` r map_lgl(mtcars,~any(is.na(.x))) ``` ``` ## mpg cyl disp hp drat wt qsec vs am gear carb ## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ``` ``` r week <- c('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday') map_chr(week, \(word) paste(word,"has",nchar(word),"letters.")) ``` ``` ## [1] "Sunday has 6 letters." "Monday has 6 letters." ## [3] "Tuesday has 7 letters." "Wednesday has 9 letters." ## [5] "Thursday has 8 letters." "Friday has 6 letters." ## [7] "Saturday has 8 letters." ``` --- ## Chat-GPT, give examples of `for` loops I asked Chat-GPT (actually, ELM) to give me examples of uses of for-loops in R for students taking an Introduction to Data Science class. .panelset[ .panel[.panel-name[Example 1] Summarize each column of a dataset .pull-left[ ``` r mean_values <- numeric(ncol(mtcars)) for (i in 1:ncol(mtcars)) { mean_values[i] <- mean(mtcars[[i]]) } ``` ] .pull-right[ ``` r mtcars %>% summarise(across(everything(), mean)) ``` ] ] .panel[.panel-name[Example 2] Modify the value of a variable .pull-left[ ``` r for (i in 1:nrow(mtcars)) { if(mtcars$am[i] == 1){ mtcars$am[i] <- "manual" } else { mtcars$am[i] <- "automatic" } } ``` ] .pull-right[ ``` r mtcars %>% mutate(am=ifelse(am == 1, "manual","automatic")) ``` ] ] .panel[.panel-name[Example 3] Bootstrap (come back in w10L20!) .pull-left[ ``` r n_bootstraps <- 1000 bootstrap_means <- numeric(n_bootstraps) for (i in 1:n_bootstraps) { resample <- sample(mtcars$mpg, replace = TRUE) bootstrap_means[i] <- mean(resample) } ``` ] .pull-right[ ``` r replicate(1000, mtcars$mpg %>% sample(size = length(.), replace = TRUE) %>% mean()) ``` ] ] .panel[.panel-name[Example 4] Processing multiple files .pull-left[ ``` r # files <- ... # provide list of files data_list <- list() for (file in files) { dat <- read.csv(file) # user-defined function to process data cleaned <- clean_data(dat) data_list[[file]] <- cleaned } ``` ] .pull-right[ ``` r data_list <- map(files, ~ read_csv(.x) %>% clean_data) ``` ] ] ]