Image by Willfried Wende from Pixabay
Plastic pollution is a major and growing problem, negatively affecting oceans and wildlife health. Our World in Data has a lot of great data at various levels including globally, per country, and over time. For this lab we focus on data from 2019.
Additionally, National Geographic ran a data visualization communication contest on plastic waste as seen here. The winners, Perpetual Plastic, created a physical data visualisation sculpture on Bali’s beaches out of washed up flip-flops and other plastic debris.
Before we get started, please ensure that you have RStudio installed, a GitHub account and are able to push and pull correctly. If not, please follow the set-up instructions here. Ask a tutor for help if you have any problems.
In today’s lab you will be working collaboratively from the same repository in GitHub. This will be extremely useful in your projects for sharing out the workload among the team members.
❗ For today, it is very important that you follow the instructions carefully to avoid creating any merge conflicts. Ask a tutor for help if this happens to your team. We will discuss how to resolve merge conflicts in the next lab.
Today you will be working in a team of no more than 6 people. It is important that you take turns to work on each activity, one person at a time.
If your group consists of fewer than 6 people, then take turns in doing the tasks.
You will also get started with pair programming techniques. What is pair programming? From Codecademy.com:
In pair programming, one person is the “driver,” and the other is the “navigator.” The driver is the person at the keyboard who’s actively writing code. The navigator observes, checks code for accuracy, and keeps an eye on the bigger picture.
Pair programmers switch roles regularly, so both pairs stay engaged. They also work collaboratively, determining which tasks need to be done.
We will be using the 🚘 emoji to describe the “driver” and the 🧭 emoji to describe the “navigator.” In your group, form pairs, with each person sitting next to their partner. Each member of the pair will alternate between the 🚘 driver role and the 🧭 navigator role.
If your group has an odd number of people, one pair can be replaced by a group of three people who alternate the 🚘 driver and 🧭 navigator roles.
Give each member of your team a number and look out for the following emoji sequence to indicate who should be completing the activity:
If it’s your turn to be the navigator 🧭, then guide your partner in doing their task. Remember: anyone who is not the driver 🚘 should not make any changes to the work and should not push to or pull from GitHub. Keep your hands off the keyboard!
This week we will start to finalize the teams which will later work on the group projects. To collect the initial information on teams and team members, it is important that one of you fills out this Wooclap form:
Go to the main page of Wooclap
Use the event code to enter: JBJYER
Follow the instructions and the given example to write your team name and team members (name and student ID).
Submit your answer in Wooclap.
Let’s first set up GitHub.
🚘🧭😐😐😐😐 (Member 1 only - pair 1)
You are the maintainer of the GitHub repository for today’s lab worksheet. This means that you will need to make a copy of today’s lab template and add your team members as collaborators so that they can contribute.
First, log onto GitHub and create a new repository by copying today’s lab template project. To remind you of the steps:
Go to Your repositories in your GitHub account and then click on the green New button.
Click on Import a repository and type/copy the URL of today’s lab template project: https://github.com/uoeIDS/lab-02-template
Add an appropriate name to your repository, say lab-02, and click on Begin import.
Next, to add your team members as collaborators:
(😐🚘🚘🚘🚘🚘 - You should receive a collaboration invitation via email, accept this.)
🚘🚘🚘🚘🚘🚘 (For all)
Once everyone has been added to the collaborative repository, open RStudio and create a new version control project using the GitHub repository you have just made. To remind you of the steps:
Open RStudio and go to File > New Project…
Select Version Control and then Git. Type/paste the URL of the repository you have just created.
Browse an appropriate location for the project and then click on Create Project.
PAUSE: Ensure that all team members have successfully created an R project and have pulled the current content from GitHub. Everyone, hands off the computer unless it is your turn!
MERGE CONFLICT: A Git merge conflict occurs when Git is unable to automatically resolve differences between two commits. Git can merge changes automatically only when they affect different files or different lines of the same file. For some general instructions on how to resolve a merge conflict, have a look at this page.
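You are going to take it in turns to add your own name to the author string at the top of lab-02.Rmd.
🚘🧭😐😐😐😐 (Member 1 only - pair 1)
Open lab-02.Rmd and add your names to the author string, then 🧶 ✅ ⬆️ Knit, Commit (with a message such as "Add name of user 1") and Push.
😐😐🚘🧭😐😐 (Member 3 only - pair 2)
⬇️ Pull, then open lab-02.Rmd. You should see the names of team members 1 and 2. Add your names, then 🧶 ✅ ⬆️ Knit, Commit (with a message such as "Add name of user 2") and Push.
😐😐😐😐🚘🧭 (Member 5 only - pair 3)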
🚘🚘🚘🚘🚘🚘 (For all)
Everybody, ⬇️ Pull the latest changes from the shared repository so that the version you have has everyone’s name.
Congratulations! - You have now started working collaboratively from the same repository in GitHub. Now let’s do some data science…
Before getting started with the Exercises, run the following code in the Console to load the packages you will need for today’s lab.
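The original code chunk is not reproduced here. As a minimal sketch, assuming the lab relies on the tidyverse (for read_csv(), the join functions, drop_na() and str_replace_all()) and on readxl (named later for read_excel()), loading the packages could look like this:

```r
# Assumed package set, inferred from the functions named in this worksheet.
library(tidyverse)  # attaches readr, dplyr, tidyr, stringr and others
library(readxl)     # installed with the tidyverse but must be attached separately
```

If either call fails, the package probably needs installing first with install.packages().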
The data for this lab are contained in 4 different files, each saved as a different file type, within the data folder. They are detailed below.
File 1: mismanaged-plastic-waste-per-capita.csv
name: Country or territory name.
code: ISO3 Alpha-code.
mismanaged_plastic: 2019 mismanaged plastic estimates (kg per capita).
File 2: per-capicta-ocean-plastic-waste.txt
name: Country or territory name.
code: ISO3 Alpha-code.
mismanaged_plastic_ocean: 2019 estimates of ocean deposited mismanaged plastic (kg per capita).
File 3: UN_country_population.tsv
name: Country or territory name.
code: ISO3 Alpha-code.
population: 2019 population estimates.
File 4: UN_country_region.xlsx
name: Country or territory name (may differ from the above 2 data sets).
code: ISO3 Alpha-code.
region: The continent/region where the country/territory is located.
Note: The two data sources use different text for name, like “Turkey” or “Turkiye”, so joining of data should be based on the ISO3 Alpha-code.
Also, given that all data sets have a name variable, I suggest using select() prior to joining the data sets in order to remove all but one of the name columns. Otherwise, the resulting joined data will have variables name.x and name.y but not name, which may be confusing.
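The effect of the duplicated name column can be seen on a small example. This is only a sketch: the two tables are invented stand-ins for the lab files, and the object names waste and population_tbl are hypothetical.

```r
library(dplyr)

# Toy stand-ins for two of the lab files; all values are invented for illustration.
waste <- tibble(
  name = c("Turkey", "Kenya"),
  code = c("TUR", "KEN"),
  mismanaged_plastic = c(1.5, 2.0)
)
population_tbl <- tibble(
  name = c("Turkiye", "Kenya"),
  code = c("TUR", "KEN"),
  population = c(100, 50)
)

# Joining as-is keeps both name columns, suffixed .x and .y.
joined_raw <- inner_join(waste, population_tbl, by = "code")
names(joined_raw)

# Dropping name from one table first leaves a single name column.
joined_clean <- waste %>%
  inner_join(select(population_tbl, -name), by = "code")
names(joined_clean)
```

Joining on code rather than name also means the "Turkey" and "Turkiye" rows still match.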
Let’s do more collaborative work.
🧭🚘😐😐😐😐 (Member 2 only - pair 1)
Use the read_csv() function to load the data.
In the code chunk load-data, write the following code:
In the code chunk join-data, write the following code. Renaming may seem pointless, but your team members will be joining their data to this object.
In the code chunk print-data, write:
😐😐🧭🚘😐😐 (Member 4 only - pair 2)
Use the read_csv2() function to load the data.
Add the following to the code chunk load-data:
Add the following to the join-data code chunk. Ensure that you and your team members understand what this code is doing.
Check that plastic_data_all contains the data from both files. Commit your changes with an informative message and Push them to the shared repository on GitHub.
😐😐😐😐🧭🚘 (Member 6 only - pair 3)
Use the read_tsv() function to load the data.
Add the following to the code chunk load-data:
Add the following to the join-data code chunk.
Check that plastic_data_all contains the data from both files. Commit your changes with an informative message and Push them to the shared repository on GitHub.
Show answer to question
The data from the UN contain more rows (run nrow(data3)) than the plastic waste data set (run nrow(data1)). This is because the plastic waste data set only contains data from countries/territories with a coastline, whilst the UN data contain population data on all countries/territories, whether they are coastal nations or landlocked. If we instead ran the code data3 %>% left_join(data1, by = "code"), then the plastic waste data are added to the UN data, but there are no plastic waste data for landlocked countries. Consequently, the missing entries will be filled with NAs. This can be resolved by using drop_na() to remove all rows that contain at least one NA. Therefore, the following code should produce the same result:
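The original chunk is not reproduced here, so the sketch below only illustrates the idea on invented toy tables. It assumes that the join in the worksheet starts from the plastic waste table (data1) and that every coastal code also appears in the UN table (data3); the object names via_first, via_left and via_left_aligned are hypothetical.

```r
library(dplyr)
library(tidyr)

# Toy stand-ins: data1 covers coastal countries only, data3 covers all countries.
# All values are invented for illustration.
data1 <- tibble(code = c("KEN", "TUR"), mismanaged_plastic = c(2.0, 1.5))
data3 <- tibble(code = c("AUT", "KEN", "TUR"), population = c(9, 50, 100))

# Route 1: start from the plastic waste data and add the population.
via_first <- data1 %>%
  left_join(data3, by = "code")

# Route 2: start from the UN data, then drop rows where the join left an NA.
via_left <- data3 %>%
  left_join(data1, by = "code") %>%
  drop_na()

# The column order differs between the two routes, so align before comparing.
via_left_aligned <- via_left %>%
  select(all_of(names(via_first))) %>%
  arrange(code)

all.equal(as.data.frame(arrange(via_first, code)), as.data.frame(via_left_aligned))
```

Note that drop_na() with no arguments removes a row if any column is missing, so it would also discard rows with gaps that have nothing to do with the join.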
🚘🧭😐😐😐😐 (Member 1 only - pair 1)
Use the read_excel() function from the readxl package to load the data.
Add the following to the code chunk load-data:
Edit the join-data code chunk to join the final data set into data_all.
Check that plastic_data_all contains the data from both files. Commit your changes with an informative message and Push them to the shared repository on GitHub.
🚘🚘🚘🚘🚘🚘 (For all)
Everybody, ⬇️ Pull the latest changes from the shared repository. Check that all 4 files are loaded and that they are joined into a single data frame.
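To recap how the three readr functions used above differ, here is a small sketch that reads invented sample text rather than the lab files. It assumes readr 2.0 or later (for I() and show_col_types) and that the .txt file is semicolon-separated with comma decimal marks, which is what read_csv2() expects.

```r
library(readr)

# Invented two-row samples; I() tells readr to treat the string as file contents, not a path.
csv_text  <- I("name,code,mismanaged_plastic\nKenya,KEN,2.0\nTurkey,TUR,1.5\n")
csv2_text <- I("name;code;mismanaged_plastic_ocean\nKenya;KEN;0,1\nTurkey;TUR;0,2\n")
tsv_text  <- I("name\tcode\tpopulation\nKenya\tKEN\t50\nTurkiye\tTUR\t100\n")

data1 <- read_csv(csv_text, show_col_types = FALSE)    # comma-separated, "." decimal mark
data2 <- read_csv2(csv2_text, show_col_types = FALSE)  # semicolon-separated, "," decimal mark
data3 <- read_tsv(tsv_text, show_col_types = FALSE)    # tab-separated

# read_excel() from readxl needs a path to a real workbook,
# for example read_excel("data/UN_country_region.xlsx").
```

In the lab itself, each function would be given a path inside the data folder instead of literal text.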
Now that you have loaded and joined the data, let’s now do some investigations.
Please continue to work collaboratively, using pair programming and taking turns in contributing to the questions. When the team member changes, remember to begin with a ⬇️ Pull from GitHub and then finish with a 🧶 ✅ ⬆️ Knit, Commit and Push.
Only one person should have their hands on their computer at any one time to minimise the chance of merge conflicts.
Add a new variable to plastic_data_all called total_mismanage_plastic by multiplying mismanaged_plastic by population.
… region?
Add a new variable pct_mismanaged_plastic_ocean to plastic_data_all that represents the amount of ocean emitted mismanaged plastic waste as a percentage of all mismanaged plastic waste.
Calculate the median of pct_mismanaged_plastic_ocean for each region.
The median for Africa should be NaN. What does this mean and what is causing this issue? Hint: filter your data using code == "SOM".
Use drop_na() after computing the variable pct_mismanaged_plastic_ocean and re-evaluate the median estimates. Which region has the lowest median?
The variable names in plastic_data_all are quite long, so let’s do something about that.
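Looking back at the exercises above, the general pattern of adding new columns and then summarising by group can be sketched on a toy table. The rows below are invented and the object names toy and toy_summary are hypothetical; only the column names follow the lab's variables.

```r
library(dplyr)

# Invented toy rows; the column names follow the lab's variables.
toy <- tibble(
  region = c("Asia", "Asia", "Europe"),
  population = c(100, 50, 20),
  mismanaged_plastic = c(2, 4, 1),
  mismanaged_plastic_ocean = c(0.2, 1.0, 0.05)
)

toy_summary <- toy %>%
  mutate(
    # kg per capita multiplied by the number of people gives total kg
    total_mismanage_plastic = mismanaged_plastic * population,
    # ocean-deposited plastic as a percentage of all mismanaged plastic
    pct_mismanaged_plastic_ocean = 100 * mismanaged_plastic_ocean / mismanaged_plastic
  ) %>%
  group_by(region) %>%
  summarise(median_pct = median(pct_mismanaged_plastic_ocean))

toy_summary
```

The same pipeline shape should carry over to plastic_data_all, although the real data contain cases that this toy table does not.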
Use the rename() command to replace the existing variable names with something that is more concise. For example, we can contract population to pop using the following code:
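The original chunk is not reproduced here. A minimal sketch on an invented two-row table (the object names toy and toy_renamed are hypothetical) might be:

```r
library(dplyr)

# Invented values; only the column names matter for this illustration.
toy <- tibble(code = c("KEN", "TUR"), population = c(50, 100))

# rename() takes new_name = old_name pairs.
toy_renamed <- toy %>%
  rename(pop = population)

names(toy_renamed)
```

Several columns can be renamed in one call by supplying more new_name = old_name pairs, separated by commas.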
Examine the names of the other variables in plastic_data_all and rename them to something that is concise yet still informative about what the variable contains. Discuss and agree in your group how best to simplify the variable names.
In addition, the region name "Latin America and The Caribbean" is much longer than the other regions, so we can consider replacing this text with a suitable acronym.
Modify the region variable to replace the text "Latin America and The Caribbean" with "LAC". Hint: have a look at how to use the str_replace_all() command.
Finally, create a frequency table or calculate an interesting statistic per region that uses the renamed variables.
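As a rough illustration of the hint, str_replace_all() can be used inside mutate() to rewrite matching text in a column, and count() then gives a simple frequency table. The table below is an invented stand-in and the object names toy, toy_short and region_counts are hypothetical.

```r
library(dplyr)
library(stringr)

# Invented three-row stand-in for the joined data.
toy <- tibble(
  code = c("BRA", "CHL", "KEN"),
  region = c("Latin America and The Caribbean", "Latin America and The Caribbean", "Africa")
)

# Replace every occurrence of the long region label with the acronym.
toy_short <- toy %>%
  mutate(region = str_replace_all(region, "Latin America and The Caribbean", "LAC"))

# count() returns one row per region with the number of rows in column n.
region_counts <- toy_short %>%
  count(region)

region_counts
```

The pattern argument is treated as a regular expression, so labels containing characters such as brackets or full stops would need escaping or wrapping in fixed().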
At the end of the lab, you need to ensure that you have your own personal copy of today’s work. Please follow these instructions carefully:
🚘🚘🚘🚘🚘🚘 (For all)
Everybody, ⬇️ Pull the latest changes from the shared repository.
😐🚘🚘🚘🚘🚘 (All except member 1)
On GitHub, create your own copy of the shared repository. You can do this using the same instructions as at the start when copying today’s template repository, but instead importing from member 1’s GitHub account rather than the course account.
If you want to continue to work on today’s lab after the workshop, then you will need to create a new version control project with your personal copy of the repository that you have just created.
🚘😐😐😐😐😐 (Member 1 only)
At the end of the workshop, you want to ensure that only you can make further changes to the shared repository. This means removing the collaboration permissions of your team members. To do this:
That’s all for today. In next week’s lab we will continue to work collaboratively, but we will look at how to resolve merge conflicts.