Lab 06 - Functions and iterations

Photo by Joan Gamell on Unsplash Photo by Joan Gamell on Unsplash

Learning goals

The objective of this workshop is to:


Set-up

In this workshop you will be working individually. Begin by doing the following set-up steps:


Packages & Data

In today’s lab we will using the tidyverse and ggplot2 packages and their dependencies. At the top of your template you will find the following code that will load this library whenever you knit your document.

library(tidyverse)
library(ggplot2)

The diamonds data set part of the ggplot2 package and is readily available once you have loaded the package. You can inspect the content of the data set by running:

View(diamonds)

You can read the data dictionary for the diamonds data set by opening its manual page:

help(diamonds)

Further information about what each of the variables mean can be found in this RPubs article by David Curtis.

Some of the variable names are generic, so the following code can be used to rename the variables so that they are more informative.

diamonds <- diamonds %>% rename(
    depth_pct = depth,
    length_mm = x,
    width_mm = y,
    depth_mm = z,
    table_pct = table
  )

Exercise 1

Perform a preliminary exploratory data analysis (EDA) of the diamonds data set. Create some frequency tables, summary statistics and/or data visualisations to explore the relationship between the variables.

In particular, make appropriate decisions to clean the data by removing rows containing problematic observations

Hint:

  • Can a diamond have a dimension length of 0mm?

Exercise 2

The table_pct variable is a percentage between the width of the table (the diamond’s top face) and the diamond’s overall width. Copy the following code to calculate the table width in millimetres.

diamonds <- diamonds %>%
  mutate(table_mm = table_pct * width_mm / 100)

In the course we have discussed the three different ways of measuring the location of data: the (arithmetic) mean, median and mode. However, these are only three of the main ways of measuring central tendancy but there are many more variations, each with their own applications and uses.

In this lab we are going to take a closer look at some different ‘mean’ values.

The Geometric Mean

When thinking about the ‘mean’ you usually think about adding all of the data values together and then divide by the overall sample size. Instead of adding the points together, why not multiply? In this case we obtain the definition of the geometric mean:

\[\bar{x}_g = \sqrt[n]{\prod_{i=1}^n x_i}\]

When calculating the geometric mean, we often prefer to use an equivalent formula: \[\bar{x}_g = \exp\left( \frac{1}{n}\sum_{i=1}^n \ln x_i\right)\]

See exercise 9 (optional) to show that the two expressions are equivalent.

Exercise 3

Create a function called gmean that computes the geometric mean statistic. The function must do the following:

Exercise 4

Now that you have created your gmean() command, it is now good to test that it works as you expect it to.

Below is a list containing 6 small data sets to test your function.

test_data <- list(
  test1 = c(2.1, 3.8, 4.2),
  test2 = c(1, 10, 100, 1000),
  test3 = c(0, 1, 4),
  test4 = c(0.38,  0.94, -1.56),
  test5 = c(TRUE, TRUE),
  test6 = c("6", "7", "8")
)

Create a for loop where in each iteration you:

Does your gmean() function work on all test data sets? If not, then why not?

Exercise 5

Hopefully you should notice that some of the test data sets produce some problematic results. This is because the geometric mean is only valid for strictly positive data values.

To make the command gmean() robust to potentially invalid input data, a check needs to be performed and an informative message needs to be returned.

if( any(x <= 0) ){
  warning("<ERROR MESSAGE>")
  return(NaN)
}

As an optional extension to this exercise, what other conditions do we require on the input x in order for the gmean() command to work correctly? Add extra if statements to the start of your function to check for any other potential bad inputs.

Exercise 6

How can you rewrite the for loop of Exercise 5 using the map function instead, or another function in the same map family?

Exercise 7

With the diamonds data set, summarise the table_mm variable by computing the (arithmetic) mean, median and geometric mean for each clarity category. Caution: if you have not cleaned the data in exercise 1 then you may need to filter your data first.

Comment on the relationship between table_mm and clarity using the summarised values.

Exercise 8 (optional)

In the construction of the command gmean() you may have questioned why do we need to perform the logarithm and exponential transformation? Why cannot we compute the geometric mean based on the original definition, using prod(x)^(1/length(x))?

Exercise 9 (optional)

An alternative measurement of central tendency is the Harmonic Mean, which is calculated by the reciprocal of the arithmetic mean of the reciprocals:

\[\bar{x}_h = \frac{1}{\frac{1}{n}\sum_{i=1}^n \frac{1}{x_i}}\] As with the geometric mean, this is only valid for positive value data.

Exercise 10 (Optional)

Hint:

  • \(x = \exp(\log(x))\)
  • \(\sqrt[n]{x} = x^{1/n}\)
  • \(\log(x^a) = a\log(x)\)
  • \(\log(xy) = \log(x)+\log(y)\)
  • \(\log(\prod_{i=1}^n x_i) = \sum_{i=1}^n \log(x_i)\)


I have time left …

Once you’re done with this week’s lab, you can spend some time working on your projects. Reflect on the comments you received from your check-in last week and discuss with your team what (if anything) you might want to change about your approach. Is there anything we learnt this week that you might apply to your project?


Wrapping up

That’s the end of this lab!

Remember to commit and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.

Don’t worry if you did not reach the end of the worksheet today, but please try to go through any remaining exercises in your own time.