Plastic pollution is a major and growing problem, negatively affecting oceans and wildlife health. Our World in Data has a lot of great data at various levels including globally, per country, and over time. For this lab we focus on data from 2019.

Additionally, National Geographic ran a data visualization communication contest on plastic waste as seen here. The winners, Perpetual Plastic, created a physical data visualisation sculpture on Bali’s beaches out of washed up flip-flops and other plastic debris.

Learning goals

Load data from different file types.
Join multiple data frames.
Practise more data wrangling and summarising.
Working collaboratively via GitHub.

Before we get started, please ensure that you have RStudio installed, a GitHub account and are able to push and pull correctly. If not, please follow the set-up instructions here. Ask a tutor for help if you have any problems.

In today’s lab you will be working collaboratively from the same repository in GitHub. This will be extremely useful in your projects for sharing out the workload among the team members.

GitHub is very good at merging files when you Push committed changes from RStudio.
Typically, each team member works in a different location in the repository, so there should not be any major merge issues.
However, if two or more team members make different changes at the same location at the same time, GitHub cannot identify which version is correct.
This issue is called a merge conflict (ask a tutor for help if this happens to your team).

❗ For today, it is very important that you follow the instructions carefully to avoid creating any merge conflicts. Ask a tutor for help if this happens to your team. We will discuss how to resolve merge conflicts in the next lab.

Getting started

Note: If your group consists of fewer than 6 people, then take turns in doing the tasks.

Today you will be working in a team of no more than 6 people. It is important that you take it in turns to work on each activity one person at a time.

You will also get started with pair programming techniques. What is pair programming? From Codecademy.com:

In pair programming, one person is the “driver,” and the other is the “navigator.” The driver is the person at the keyboard who’s actively writing code. The navigator observes, checks code for accuracy, and keeps an eye on the bigger picture.

Pair programmers switch roles regularly, so both pairs stay engaged. They also work collaboratively, determining which tasks need to be done.

Note: If your group has an odd number of people, one pair can be replaced by a group of three people, and you can alternate the 🚘 driver and 🧭 navigator roles.

We will be using the 🚘 emoji to describe the “driver”, and the 🧭 emoji to describe the “navigator.” In your group, form pairs with each person sitting next to their partner. Each member of the pair will alternate between the 🚘 driver role and the 🧭 navigator role.

Give each member of your team a number and look out for the following emoji sequences, which indicate who should be completing the activity:

🚘🚘🚘🚘🚘🚘 - All members do the activity.
🚘🧭😐😐😐😐 - Member 1 is the driver, member 2 is the navigator.
🧭🚘😐😐😐😐 - Member 2 is the driver, member 1 is the navigator.
😐😐🚘🧭😐😐 - Member 3 is the driver, member 4 is the navigator.
😐😐🧭🚘😐😐 - Member 4 is the driver, member 3 is the navigator.
😐😐😐😐🚘🧭 - Member 5 is the driver, member 6 is the navigator.
😐😐😐😐🧭🚘 - Member 6 is the driver, member 5 is the navigator.

If it’s your turn to be the navigator 🧭, then guide your partner in doing their task. Remember: anyone who is not the driver 🚘 should not make any changes to their work and should not push to or pull from GitHub. Keep your hands off the keyboard!

Register your team on Wooclap

This week we will start to finalize the teams that will later work on the group projects. To collect the initial information on teams and team members, it is important that one of you fills out this Wooclap form:

Go to the main page of Wooclap
Use the event code to enter: JBJYER
Follow the instructions and the given example to write your team name and team members (name and student ID).
Submit your answer in Wooclap.

Creating a collaborative repository

Let’s first set up GitHub.

🚘🧭😐😐😐😐 (Member 1 only - pair 1)

You are the maintainer of the GitHub repository for today’s lab worksheet. This means that you will need to clone today’s lab template and add your team members as collaborators so that they can contribute.

First, log onto GitHub and create a new repository by cloning today’s lab template project. To remind you of the steps:

Go to Your repositories in your GitHub account and then click on the green New button.
Click on Import a repository and type/copy the URL of today’s lab template project: https://github.com/uoeIDS/lab-02-template
Add an appropriate name to your repository, say lab-02, and click on Begin import.

Next, to add your team members as collaborators:

Navigate to your version of the repository you have just cloned.
Click on ‘Settings’ along the top and then ‘Collaborators’ in the sidebar.
Add each of your team members as collaborators.

(😐🚘🚘🚘🚘🚘 - You should receive a collaboration invitation via email. Accept it.)

Version control R project

🚘🚘🚘🚘🚘🚘 (For all)

Once everyone has been added to the collaborative repository, open RStudio and create a new version control project using the GitHub repository you have just made. To remind you of the steps:

Open RStudio and go to File > New Project…
Select Version Control and then Git. Type/paste the URL of the repository you have just created.
Browse to an appropriate location for the project and then click on Create Project.

PAUSE: Ensure that all team members have successfully created an R project and have pulled the current content from GitHub. Everyone, hands off the computer unless it is your turn!

Adding your own name

Note: MERGE CONFLICT: A git merge conflict is an event that takes place when Git is unable to automatically resolve differences in code between two commits. Git can merge the changes automatically only if the commits are on different lines or branches. For general instructions on how to resolve merge conflicts, have a look at this page.

You are going to take it in turns to add your own name to the author string at the top of lab-02.Rmd.

🚘🧭😐😐😐😐 (Member 1 only - pair 1)

Hands on the computer.
Open lab-02.Rmd.
At the top, replace the author text User1 and User2 with your name and your partner’s name.
🧶 Knit the document and ensure that both names appear in the html file.
✅ Commit all of the changes you have made, with an informative message (e.g. Add names of users 1 and 2).
⬆️ Push the changes to GitHub. Verify on GitHub that the repository has been updated.
Hands off your computer.

😐😐🚘🧭😐😐 (Member 3 only - pair 2)

Hands on the computer.
Click on ⬇️ Pull in the Git tab to download the latest updates from the shared repository.
Open lab-02.Rmd. You should see the names of team members 1 and 2.
Replace the author text User3 and User4 with your name and your partner’s name.
🧶 Knit the document and ensure that both names appear in the html file.
✅ Commit all of the changes you have made, with an informative message (e.g. Add names of users 3 and 4).
⬆️ Push the changes to GitHub. Verify on GitHub that the repository has been updated.
Hands off your computer.

😐😐😐😐🚘🧭 (Member 5 only - pair 3)

Hands on the computer.
⬇️ Pull the latest changes from the shared repository.
Replace the author text User5 and User6 with your name and your partner’s name.
🧶 ✅ ⬆️ Knit, Commit and Push the changes.
Hands off the computer.

🚘🚘🚘🚘🚘🚘 (For all)

Everybody, ⬇️ Pull the latest changes from the shared repository so that the version you have has everyone’s name.

Congratulations! - You have now started working collaboratively from the same repository in GitHub. Now let’s do some data science…

Packages

Before getting started with the Exercises, run the following code in the Console to load the packages you will need for today’s lab.

library(tidyverse)
library(readxl)

Loading the data

The data for this lab is contained within 4 different files, each saved as a different file type within the data folder, as detailed below.

File 1: mismanaged-plastic-waste-per-capita.csv

Data about the amount of mismanaged plastic waste for 159 coastal countries/territories.
From Our World in Data
Data stored with comma delimitation encoding.
Variable information:
- name: Country or territory name.
- code: ISO3 Alpha-code.
- mismanaged_plastic: 2019 mismanaged plastic estimates (kg per capita)

File 2: per-capita-ocean-plastic-waste.txt

Data on the amount of mismanaged plastic that escapes into the ocean from 159 coastal countries/territories.
From Our World in Data
Data stored with semicolon delimitation encoding.
Variable information:
- name: Country or territory name.
- code: ISO3 Alpha-code.
- mismanaged_plastic_ocean: 2019 estimates of ocean deposited mismanaged plastic (kg per capita)

File 3: UN_country_population.tsv

2019 population estimates from 237 countries/territories (includes coastal and landlocked regions).
From the United Nations
Data stored with tab delimitation encoding.
Variable information:
- name: Country or territory name.
- code: ISO3 Alpha-code.
- population: 2019 population estimates.

File 4: UN_country_region.xlsx

UN region identification for 237 countries/territories.
From the United Nations
Data stored with Excel encoding.
Variable information:
- name: Country or territory name (may differ from the above 2 data sets).
- code: ISO3 Alpha-code.
- region: The continent/region where the country/territory is located.
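
The three text files differ mainly in how values are separated, and each separator has a matching reader function. Here is a minimal sketch of that idea, using made-up inline data rather than the lab files. The codes AAA and BBB and all values are hypothetical, and readr 2.0 or later is assumed for I() and show_col_types.

```r
library(readr)

# Hypothetical two-row extracts written inline; I() marks a string as literal data
csv_text  <- I("code,mismanaged_plastic\nAAA,1.5\nBBB,2.5")
semi_text <- I("code;mismanaged_plastic_ocean\nAAA;0.1\nBBB;0.2")
tsv_text  <- I("code\tpopulation\nAAA\t1000\nBBB\t2000")

# Comma-separated values
from_csv <- read_csv(csv_text, show_col_types = FALSE)

# Semicolon-separated values, with the delimiter stated explicitly
from_semi <- read_delim(semi_text, delim = ";", show_col_types = FALSE)

# Tab-separated values
from_tsv <- read_tsv(tsv_text, show_col_types = FALSE)

from_csv
from_semi
from_tsv
```

read_csv2() also splits on semicolons, but it additionally treats a comma as the decimal mark, so the sketch uses read_delim() to keep the toy decimal points unambiguous.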

Note: The two data sources use different text for name, like “Turkey” or “Turkiye”, so joining of data should be based on the ISO3 Alpha-code.

Also, given that all data sets have a name variable, I suggest using select() prior to joining the data sets in order to remove all but one of the name columns. Otherwise, the resulting joined data will have variables name.x and name.y but not name, which may be confusing.
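
To see why dropping a name column first helps, here is a small sketch on toy data. The two tibbles and their values are invented for illustration and only mimic the structure of the lab files.

```r
library(tidyverse)

# Toy data (hypothetical values): two sources that spell one name differently
waste <- tibble(
  name = c("Turkey", "Fiji"),
  code = c("TUR", "FJI"),
  mismanaged_plastic = c(1.5, 2.5)
)
regions <- tibble(
  name = c("Turkiye", "Fiji"),
  code = c("TUR", "FJI"),
  region = c("Asia", "Oceania")
)

# Joining directly keeps both name columns, suffixed .x and .y
joined_raw <- left_join(waste, regions, by = "code")
names(joined_raw)

# Dropping one name column first leaves a single column called name
joined_clean <- regions %>%
  select(-name) %>%
  right_join(waste, by = "code")
names(joined_clean)
```

Joining on code rather than name also means the differing spellings do not stop the rows from matching.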

Load & join each data set.

Let’s do more collaborative work.

🧭🚘😐😐😐😐 (Member 2 only - pair 1)

Hands on the computer and ⬇️ Pull the latest changes.
You will be loading and printing the content of File 1. The data is encoded with comma separated delimitation, so we use the read_csv() function to load the data.
In the code chunk labelled load-data, write the following code:

data1 <- read_csv("data/mismanaged-plastic-waste-per-capita.csv")

In the code chunk labelled join-data, write the following code. Renaming may seem pointless, but your team members will be joining their data to this object.

plastic_data_all <- data1

In the code chunk labelled print-data, write:

plastic_data_all %>% head(n = 10)

🧶 ✅ ⬆️ Knit your work and check the output. Commit your changes with an informative message and Push them to the shared repository on GitHub.
Hands off your computer.

😐😐🧭🚘😐😐 (Member 4 only - pair 2)

Hands on the computer and ⬇️ Pull the latest changes.
You will be loading and printing the content of File 2. The data is encoded with semicolon separated delimitation, so we use the read_csv2() function to load the data.
Add the following code into the code chunk labelled load-data:

data2 <- read_csv2("data/per-capita-ocean-plastic-waste.txt")

Write the following code in the join-data code chunk. Ensure that you and your team members understand what this code is doing.

plastic_data_all <- data2 %>%
  select(-name) %>%
  left_join(plastic_data_all, by = "code")

🧶 ✅ ⬆️ Knit your work and check that plastic_data_all contains the data from both files. Commit your changes with an informative message and Push them to the shared repository on GitHub.
Hands off your computer.

😐😐😐😐🧭🚘 (Member 6 only - pair 3)

Hands on the computer and ⬇️ Pull the latest changes.
You will be loading and printing the content of File 3. The data is encoded with tab separated delimitation, so we use the read_tsv() function to load the data.
Add the following code into the code chunk labelled load-data:

data3 <- read_tsv("data/UN_country_population.tsv")

Write the following code in the join-data code chunk.

plastic_data_all <- data3 %>%
  select(-name) %>%
  right_join(plastic_data_all, by = "code")

Question: The above code does a right join. What would happen if you did a left join instead?
🧶 ✅ ⬆️ Knit your work and check that plastic_data_all contains the data from all three files. Commit your changes with an informative message and Push them to the shared repository on GitHub.
Hands off your computer.

Show answer to question

The data from the UN contains more rows (run nrow(data3)) than the plastic waste data set (run nrow(data1)). This is because the plastic waste data set only contains data from countries/territories with a coastline, whilst the UN data contains population data on all countries/territories whether they are coastal nations or landlocked. If we instead ran the code data3 %>% left_join(data1, by = "code") then the plastic waste data is added to the UN data, but there are no plastic waste data for landlocked countries. Consequently, the missing entries will be filled with NAs. This can be resolved by using drop_na() to remove all rows that contain at least one NA. Therefore, the following code should produce the same result:

data3 %>% select(-name) %>% left_join(plastic_data_all, by = "code") %>% drop_na()
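
The difference between the two joins can be checked on a small made-up example. The codes and values below are hypothetical, with CCC standing in for a landlocked country that has no plastic waste data.

```r
library(tidyverse)

# Hypothetical data: the population table has one extra row (CCC)
population <- tibble(
  code = c("AAA", "BBB", "CCC"),
  population = c(100, 200, 300)
)
plastic <- tibble(
  code = c("AAA", "BBB"),
  mismanaged_plastic = c(1.5, 2.5)
)

# Right join: keep only the rows of the right-hand table (plastic)
right_version <- population %>% right_join(plastic, by = "code")

# Left join: keep every row of the left-hand table, filling gaps with NA
left_version <- population %>% left_join(plastic, by = "code")

# Dropping incomplete rows brings the left join back to the coastal rows only
left_dropped <- left_version %>% drop_na()

nrow(right_version)
nrow(left_version)
nrow(left_dropped)
```

Note that drop_na() removes a row if any column is missing, so the two approaches only agree when the plastic waste data has no missing values of its own.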

🚘🧭😐😐😐😐 (Member 1 only - pair 1)

Hands on the computer and ⬇️ Pull the latest changes.
You will be loading and printing the content of File 4. The data is saved using Excel encoding. To load this file you will need to use the read_excel() function from the readxl package.
Add the following code into the code chunk labelled load-data:

data4 <- read_excel("data/UN_country_region.xlsx")

Write the following code in the join-data code chunk to join the final data set into plastic_data_all.

plastic_data_all <- data4 %>%
  select(-name) %>%
  right_join(plastic_data_all, by = "code")

🧶 ✅ ⬆️ Knit your work and check that plastic_data_all contains the data from all four files. Commit your changes with an informative message and Push them to the shared repository on GitHub.
Hands off your computer.

🚘🚘🚘🚘🚘🚘 (For all)

Everybody, ⬇️ Pull the latest changes from the shared repository. Check that all 4 files are loaded and that they are joined into a single data frame.

Exercises

Now that you have loaded and joined the data, let’s do some investigations.

Please continue to work collaboratively, using pair programming and taking turns in contributing to the questions. When the team member changes, remember to begin with a ⬇️ Pull from GitHub and then finish with a 🧶 ✅ ⬆️ Knit, Commit and Push.

Only one person should have their hands on their computer at any one time to minimise the chance of merge conflicts.

Exercise 1

Create a frequency table of coastal countries/territories by region.
Which region has the largest number of coastal countries/territories?
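
As a reminder of how a frequency table can be built, here is a sketch using count() on a toy tibble. The codes and the region assignments are invented and are not the lab data.

```r
library(dplyr)
library(tibble)

# Hypothetical countries and regions
toy <- tibble(
  code = c("AAA", "BBB", "CCC", "DDD"),
  region = c("Europe", "Asia", "Europe", "Oceania")
)

# count() returns one row per region, with the frequency in column n;
# sort = TRUE puts the most frequent region first
region_counts <- toy %>% count(region, sort = TRUE)
region_counts
```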

Exercise 2

The mismanaged plastic waste is measured in kg per capita. Add a new variable to plastic_data_all called total_mismanage_plastic by multiplying mismanaged_plastic by population.
What is the mean total of mismanaged plastic waste per region?
Which region has the highest total and which has the lowest total?
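
The general pattern of turning a per-capita figure into a total and then summarising by group can be sketched as follows. The data and the column names total_kg and mean_total_kg are hypothetical, chosen only for this illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical per-capita values (kg per person) and populations
toy <- tibble(
  region = c("Europe", "Europe", "Asia"),
  mismanaged_plastic = c(2, 4, 1),
  population = c(10, 20, 100)
)

totals <- toy %>%
  # per-capita amount times number of people gives a total in kg
  mutate(total_kg = mismanaged_plastic * population) %>%
  group_by(region) %>%
  summarise(mean_total_kg = mean(total_kg))

totals
```

In this toy data the two European rows give totals of 20 and 80 kg, so the regional mean is 50 kg.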

Exercise 3

Add a new variable called pct_mismanaged_plastic_ocean to plastic_data_all that represents the amount of ocean emitted mismanaged plastic waste as a percentage of all mismanaged plastic waste.
Calculate the median pct_mismanaged_plastic_ocean for each region.
Your answer for Africa should be NaN. What does this mean and what is causing this issue? Hint: filter your data using code == "SOM".
Add the command drop_na() after computing the variable pct_mismanaged_plastic_ocean and re-evaluate the median estimates. Which region has the lowest median?
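
One way a NaN can arise, which may or may not be what happens in the lab data, is dividing zero by zero. The sketch below uses invented values and a hypothetical column name pct_ocean to show the effect and how drop_na() removes it.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical data: BBB has zero in both columns
toy <- tibble(
  code = c("AAA", "BBB", "CCC"),
  mismanaged_plastic = c(2, 0, 4),
  mismanaged_plastic_ocean = c(0.5, 0, 1)
)

with_pct <- toy %>%
  mutate(pct_ocean = 100 * mismanaged_plastic_ocean / mismanaged_plastic)

# 0 / 0 is undefined, so R stores NaN ("not a number") for BBB
is.nan(with_pct$pct_ocean)

# A single undefined value is enough to make the median unusable
median(with_pct$pct_ocean)

# drop_na() removes the incomplete row, after which the median is defined
cleaned <- with_pct %>% drop_na()
median(cleaned$pct_ocean)
```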

Exercise 4

The variable names in plastic_data_all are quite long, so let’s do something about that.

Use the rename() command to replace the existing variable names with something that is more concise. For example, we can contract population to pop with rename(pop = population), where the new name goes on the left of the equals sign.
Examine the names of the other variables in plastic_data_all and rename them to something that is concise yet still informative about what the variable contains. Discuss and agree in your group how best to simplify the variable names.

In addition, the region name "Latin America and The Caribbean" is much longer than the other regions, so we can consider replacing this text with a suitable acronym.

Mutate the region variable to replace the text "Latin America and The Caribbean" with "LAC". Hint: have a look at how to use the str_replace_all() command.
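
Both steps, renaming a column and replacing text inside a column, can be tried on a toy tibble first. The data below is invented for illustration. Only the population to pop rename and the "LAC" replacement come from the exercise text.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical two-row data set
toy <- tibble(
  region = c("Latin America and The Caribbean", "Europe"),
  population = c(100, 200)
)

tidy <- toy %>%
  # rename() takes new_name = old_name
  rename(pop = population) %>%
  # str_replace_all() swaps every match of the pattern for the replacement
  mutate(region = str_replace_all(region, "Latin America and The Caribbean", "LAC"))

tidy
```

Rows whose region does not contain the pattern, such as "Europe" here, are left unchanged.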

Finally, create a frequency table or calculate an interesting statistic per region that uses the renamed variables.

Lab 02 - Global plastic waste
