Set-up

In this workshop you will be working individually. Begin by doing the following set-up steps:

Create your own copy of today’s template repository from https://github.com/uoeIDS/lab-06-template.
Create a new version control project in RStudio that links to your copy of today’s template.
Open lab-06.Rmd and change the author at the top of the file to your name.
Verify that everything is set-up correctly: 🧶 knit your document, ✅ commit your changes and ⬆️ push the commits to GitHub.

Packages & Data

In today’s lab we will using the tidyverse and ggplot2 packages and their dependencies. At the top of your template you will find the following code that will load this library whenever you knit your document.

library(tidyverse)
library(ggplot2)

The diamonds data set part of the ggplot2 package and is readily available once you have loaded the package. You can inspect the content of the data set by running:

View(diamonds)

You can read the data dictionary for the diamonds data set by opening its manual page:

help(diamonds)

Further information about what each of the variables mean can be found in this RPubs article by David Curtis.

Some of the variable names are generic, so the following code can be used to rename the variables so that they are more informative.

diamonds <- diamonds %>% rename(
    depth_pct = depth,
    length_mm = x,
    width_mm = y,
    depth_mm = z,
    table_pct = table
  )

Exercise 1

Perform a preliminary exploratory data analysis (EDA) of the diamonds data set. Create some frequency tables, summary statistics and/or data visualisations to explore the relationship between the variables.

In particular, make appropriate decisions to clean the data by removing rows containing problematic observations

Hint:

Can a diamond have a dimension length of 0mm?

Exercise 2

The table_pct variable is a percentage between the width of the table (the diamond’s top face) and the diamond’s overall width. Copy the following code to calculate the table width in millimetres.

diamonds <- diamonds %>%
  mutate(table_mm = table_pct * width_mm / 100)

Calculate the arithmetic mean of the table length variable table_mm across all diamonds in the data set.
Calculate the arithmetic mean for table_mm within each of the 8 clarity categories.
Comment on the relationship between clarity and average table length.
Calculate the arithmetic mean across all the numeric variables in the dataset. Specify informative names for each of these summary statistics, for example my adding avg_ in front of each variable name (Hint: check the help for the across function and see some of the examples provided).

In the course we have discussed the three different ways of measuring the location of data: the (arithmetic) mean, median and mode. However, these are only three of the main ways of measuring central tendancy but there are many more variations, each with their own applications and uses.

In this lab we are going to take a closer look at some different ‘mean’ values.

The Geometric Mean

When thinking about the ‘mean’ you usually think about adding all of the data values together and then divide by the overall sample size. Instead of adding the points together, why not multiply? In this case we obtain the definition of the geometric mean:

\[\bar{x}_g = \sqrt[n]{\prod_{i=1}^n x_i}\]

When calculating the geometric mean, we often prefer to use an equivalent formula: \[\bar{x}_g = \exp\left( \frac{1}{n}\sum_{i=1}^n \ln x_i\right)\]

See exercise 9 (optional) to show that the two expressions are equivalent.

Exercise 3

Create a function called gmean that computes the geometric mean statistic. The function must do the following:

Take a single input called x representing the data vector.
Computes the geometric mean using the equivalent formula (based on logarithms) provided above. (Hint: the command log() computes the natural logarithm).
Returns a single value representing the geometric mean of the input data.

Exercise 4

Now that you have created your gmean() command, it is now good to test that it works as you expect it to.

Below is a list containing 6 small data sets to test your function.

test_data <- list(
  test1 = c(2.1, 3.8, 4.2),
  test2 = c(1, 10, 100, 1000),
  test3 = c(0, 1, 4),
  test4 = c(0.38,  0.94, -1.56),
  test5 = c(TRUE, TRUE),
  test6 = c("6", "7", "8")
)

Create a for loop where in each iteration you:

apply the gmean() to the test data
print the output.

Does your gmean() function work on all test data sets? If not, then why not?

Exercise 5

Hopefully you should notice that some of the test data sets produce some problematic results. This is because the geometric mean is only valid for strictly positive data values.

To make the command gmean() robust to potentially invalid input data, a check needs to be performed and an informative message needs to be returned.

Edit your gmean() command to include the following check at an appropriate location.

if( any(x <= 0) ){
  warning("<ERROR MESSAGE>")
  return(NaN)
}

Ensure that you know what the above code is doing.
Change the error message to something concise and informative.
Re-test the gmean() command using the examples in exercise 4.

As an optional extension to this exercise, what other conditions do we require on the input x in order for the gmean() command to work correctly? Add extra if statements to the start of your function to check for any other potential bad inputs.

Exercise 6

How can you rewrite the for loop of Exercise 5 using the map function instead, or another function in the same map family?

Exercise 7

With the diamonds data set, summarise the table_mm variable by computing the (arithmetic) mean, median and geometric mean for each clarity category. Caution: if you have not cleaned the data in exercise 1 then you may need to filter your data first.

Comment on the relationship between table_mm and clarity using the summarised values.

Exercise 8 (optional)

In the construction of the command gmean() you may have questioned why do we need to perform the logarithm and exponential transformation? Why cannot we compute the geometric mean based on the original definition, using prod(x)^(1/length(x))?

Create a function called gmean2() that computes the geometric mean using the original product definition.
Try gmean2() on the test examples. Do they work and produce the same result?
Try gmean2() with the table_mm variable.
Hopefully, you will see a problem with the calculation. Why do you think gmean2() did not work in this case?

Exercise 9 (optional)

An alternative measurement of central tendency is the Harmonic Mean, which is calculated by the reciprocal of the arithmetic mean of the reciprocals:

\[\bar{x}_h = \frac{1}{\frac{1}{n}\sum_{i=1}^n \frac{1}{x_i}}\] As with the geometric mean, this is only valid for positive value data.

Create a new function called hmean() that takes a single numerical vector x as input that computes and returns the harmonic mean statistic.
Test your function on the 4 test examples and add appropriate input checks where necessary.
Apply the hmean() command to the table_mm variable from the diamonds data set.

Exercise 10 (Optional)

Use exponent and logarithm rules to prove that the geometric mean is equal to the exponential of the arithmetic mean of the natural logarithm transformed data: \[\bar{x}_g = \exp\left( \frac{1}{n}\sum_{i=1}^n \ln x_i\right)\]