Doing Data Science



Introduction to Data Science

University of Edinburgh


2024/2025

What's in a data analysis?

Five core activities of data analysis

  1. Stating and refining the question
  2. Exploring the data
  3. Building formal statistical models
  4. Interpreting the results
  5. Communicating the results

Roger D. Peng and Elizabeth Matsui. "The Art of Data Science: A Guide for Anyone Who Works with Data." Skybrude Consulting, LLC (2015).

Stating and refining the question

Six types of questions

  1. Descriptive: summarize a characteristic of a set of data
  2. Exploratory: analyse to see if there are patterns, trends, or relationships between variables (hypothesis generating)
  3. Inferential: analyse patterns, trends, or relationships in representative data from a population
  4. Predictive: make predictions for individuals or groups of individuals
  5. Causal: whether changing one factor will change another factor, on average, in a population
  6. Mechanistic: explore "how" as opposed to whether

Jeffrey T. Leek and Roger D. Peng. "What is the question?" Science 347.6228 (2015): 1314-1315.

Eg: COVID-19 and Vitamin D

  1. Descriptive: frequency of hospitalisations due to COVID-19 in a set of data collected from a group of individuals

  2. Exploratory: examine relationships between a range of dietary factors and COVID-19 hospitalisations

  3. Inferential: examine whether any relationship between taking Vitamin D supplements and COVID-19 hospitalisations found in the sample holds for the population at large

  4. Predictive: what types of people will take Vitamin D supplements during the next year

  5. Causal: whether people with COVID-19 who were randomly assigned to take Vitamin D supplements are hospitalised at a different rate from those who were not

  6. Mechanistic: how increased vitamin D intake leads to a reduction in the number of viral illnesses

Questions to data science problems

  • Do you have appropriate data to answer your question?


  • Do you have information on confounding variables?
    • Other variables that may explain the relationship seen between the two key variables.


  • Was the data you're working with collected in a way that introduces bias?
    • For example, drawing conclusions about a wider group from data collected on a non-representative cohort.

Is my data biased?

Suppose I want to estimate the average number of children in households in Edinburgh. I conduct a survey at an elementary school in Edinburgh and ask students at this elementary school how many children, including themselves, live in their house. Then, I take the average of the responses.

  • Target: average number of children in households in Edinburgh
  • Data: responses from students at one elementary school
  • Potential biases:
    • Do all households have a child at elementary school?
    • Will a single elementary school be representative of the whole of Edinburgh?
    • Could there be double counting? "...how many children, including themselves, ..."
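
The missing-household and double-counting problems can be illustrated with a small simulation. This is only a sketch on made-up data: the distribution of children per household below is hypothetical (not Edinburgh data), and it simplifies the survey by assuming that every child in the population answers.

```r
set.seed(1)

# Hypothetical population of households (assumed distribution, not real data)
n_households <- 10000
children <- sample(0:4, size = n_households, replace = TRUE,
                   prob = c(0.40, 0.25, 0.20, 0.10, 0.05))

# Target quantity: average number of children per household
true_mean <- mean(children)

# Asking children means a household with k children answers k times,
# and a household with no children never answers at all
responses <- rep(children, times = children)
survey_mean <- mean(responses)

c(true_mean = true_mean, survey_mean = survey_mean)
```

Under these assumptions the survey average comes out above the household average, because larger households are over-represented and childless households are absent.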

Exploratory data analysis

Checklist:

  • Formulate your question
  • Read in your data
  • Check the dimensions
  • Look at the data:
    • The top few rows
    • The bottom few rows
    • Some random rows in the middle
  • Validate with at least one external data source
  • Make a plot
  • Try the easy solution first
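
The checklist could look roughly like this in base R. This is a sketch only: it uses the `airquality` dataset that ships with R as a stand-in for "your data", and the question ("is ozone higher on hotter days?") is a hypothetical example rather than one taken from the course.

```r
# Read in your data (here: a dataset that ships with R)
data(airquality)

# Check the dimensions: rows x columns
dim(airquality)

# Look at the top, the bottom, and some random rows in the middle
head(airquality)
tail(airquality)
set.seed(42)
airquality[sort(sample(nrow(airquality), 5)), ]

# Make a plot
plot(airquality$Temp, airquality$Ozone,
     xlab = "Temperature (degrees F)", ylab = "Ozone (ppb)")

# Try the easy solution first: a simple summary before any model
# (Ozone has missing values, so only complete pairs are used)
cor(airquality$Temp, airquality$Ozone, use = "complete.obs")
```

Validating against an external source is not shown, since it depends on what independent data are available for the question at hand.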

Formulate your question/hypothesis

  • Examples of questions/hypotheses:
    • Are air pollution levels higher in Scotland than elsewhere in the United Kingdom?
    • Are daily average temperatures in Edinburgh higher than those in Glasgow? Does this relationship hold throughout the year?
    • Is people's opinion about political statement X the same across the country, or does it differ between regions or between socio-economic groups?
  • Most importantly: "Do I have the right data to answer this question?"
    • What is the format of the data?
    • How do I import it into RStudio?
    • Does it need cleaning? (Almost certainly yes!)

Final advice

When doing the data analysis:

  • Find some data, any data!
  • Look at the data!
  • Do some simple exploratory data analysis (data visualisations/summary statistics)
  • Formulate your investigation question/hypothesis
  • Iterate, iterate and iterate
  • Let the data speak for itself!

When communicating with your audience:

  • Avoid jargon, uninterpreted results and lengthy output
  • Pay attention to organization, presentation and flow
  • Don't forget about coding best practices (make it readable)
  • Be open to suggestions and feedback
  • Write as you go, don't leave it for the end!

Scientific studies

Observational

  • Collect data in a way that does not interfere with how the data arise ("observe")
  • Establish associations/correlations

Experimental

  • Randomly assign subjects to treatments
  • Establish causal connections
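
Random assignment can be sketched in a few lines of R. The subject IDs, group sizes and group labels below are hypothetical and only serve to show the mechanics.

```r
set.seed(2024)

# Hypothetical study with 20 subjects and two arms
subjects <- paste0("subject_", 1:20)

# Shuffle a balanced vector of labels so that chance alone decides
# which subject ends up in which group
group <- sample(rep(c("treatment", "control"), each = 10))

assignment <- data.frame(subject = subjects, group = group)
table(assignment$group)
```

Because the allocation ignores every characteristic of the subjects, potential confounders are balanced across the two groups on average, which is what allows an experiment to support causal conclusions.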

Case study: Breakfast cereal keeps girls slim

Girls who ate breakfast of any type had a lower average body mass index (BMI), a common obesity gauge, than those who said they didn't. The index was even lower for girls who said they ate cereal for breakfast, according to findings of the study conducted by the Maryland Medical Research Institute with funding from the National Institutes of Health (NIH) and cereal-maker General Mills. [...]

The results were gleaned from a larger NIH survey of 2,379 girls in California, Ohio, and Maryland who were tracked between the ages of 9 and 19. [...]

As part of the survey, the girls were asked once a year what they had eaten during the previous three days. [...]

Source: Study: Cereal Keeps Girls Slim, Retrieved Sep 13, 2018.

Explanatory and response variables

  • Response variable
    • The key variable of interest
    • Example: BMI of the participant


  • Explanatory variable(s)
    • The variable(s) that may help to describe or explain the response variable
    • Example: Whether the participant ate breakfast or not

Three possible explanations

  1. Eating breakfast causes girls to be slimmer
  2. Being slim causes girls to eat breakfast
  3. A third variable is responsible for both -- a confounding variable: an extraneous variable that affects both the explanatory and the response variable, and that makes it seem like there is a relationship between them
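
The third explanation can be made concrete with simulated data. In this sketch the variables `x`, `y` and `z` are hypothetical (they are not the cereal study data): `z` drives both `x` and `y`, while `x` has no effect on `y` at all.

```r
set.seed(123)
n <- 1000

z <- rnorm(n)       # confounding variable
x <- z + rnorm(n)   # explanatory variable: depends on z only
y <- z + rnorm(n)   # response variable: depends on z only, not on x

# x and y look related even though neither influences the other
cor(x, y)

# Removing the known contribution of z leaves only independent noise,
# and the apparent relationship disappears
cor(x - z, y - z)
```

In real data the contribution of the confounder is not known exactly, so it has to be measured and accounted for in the analysis, or balanced out by random assignment.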

Correlation != Causation

Studies and conclusions

Case study: Climate change survey

The survey

A July 2019 YouGov survey asked 1633 randomly selected adults in Great Britain (GB) and 1333 in the United States (US) which of the following statements about the global environment best describes their view:

  • The climate is changing and human activity is mainly responsible
  • The climate is changing and human activity is partly responsible, together with other factors
  • The climate is changing but human activity is not responsible at all
  • The climate is not changing

Responses by country (counts):

    GB    US   Sum  Response
   833   507  1340  The climate is changing and human activity is mainly responsible
   604   493  1097  The climate is changing and human activity is partly responsible, together with other factors
    49   120   169  The climate is changing but human activity is not responsible at all
    33    80   113  The climate is not changing
   114   133   247  Don't know
  1633  1333  2966  Sum

Investigation question

Is the proportion of respondents who said "The climate is changing and human activity is mainly responsible" the same in Great Britain and the United States?

Probability 101

  • What is the probability that a respondent says "The climate is changing and human activity is mainly responsible"?

    • Unconditional probability, P(A)
  • What is the probability that a respondent says "The climate is changing and human activity is mainly responsible", given that the respondent is from Great Britain?

    • Conditional probability, P(A|B)
  • Does knowing that event B happened change the probability of event A happening?

    • If yes, then events A and B are said to be dependent
    • If not, then the two events are said to be independent, P(A|B)=P(A)
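
As a sketch of how these quantities relate, the standard definition P(A|B) = P(A and B) / P(B) can be applied to the survey counts, treating sample proportions as estimates of the probabilities. The variable names are illustrative only.

```r
# Counts from the survey table
n_total     <- 2966   # all respondents
n_gb        <- 1633   # B: respondent is from Great Britain
n_mainly    <- 1340   # A: answered "mainly responsible"
n_mainly_gb <- 833    # A and B together

p_a       <- n_mainly / n_total
p_b       <- n_gb / n_total
p_a_and_b <- n_mainly_gb / n_total

# Definition of conditional probability: P(A|B) = P(A and B) / P(B)
p_a_given_b <- p_a_and_b / p_b

# Independence would require P(A|B) = P(A)
c(p_a = p_a, p_a_given_b = p_a_given_b)
```

The total cancels in the ratio, so P(A|B) reduces to the count of GB respondents giving that answer divided by the number of GB respondents.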

Simple calculations

Based on the response proportions for all respondents and within each country, does there appear to be a relationship between country and beliefs about climate change?

all <- 1340 / 2966
all
## [1] 0.4517869
gb <- 833 / 1633
gb
## [1] 0.5101041
us <- 507 / 1333
us
## [1] 0.3803451
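
The same comparison can be extended to every response option at once. The sketch below enters the survey counts as a matrix; the short column labels are abbreviations introduced here, not names used in the survey.

```r
counts <- matrix(
  c(833, 604,  49, 33, 114,
    507, 493, 120, 80, 133),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    country  = c("GB", "US"),
    response = c("mainly", "partly", "not_responsible",
                 "not_changing", "dont_know")
  )
)

# Unconditional proportions: P(response)
overall <- colSums(counts) / sum(counts)

# Conditional proportions: P(response | country), each row sums to 1
by_country <- prop.table(counts, margin = 1)

round(rbind(by_country, All = overall), 3)
```

If country and response were independent, the GB and US rows would both be close to the "All" row.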

Is the relationship causal or associative? Can it be explained via another variable?

Case study: Berkeley admission data

Berkeley admission data

  • Study carried out by the Graduate Division of the University of California, Berkeley in the early 1970s to evaluate whether there was a gender bias in graduate admissions.
  • The data come from six departments (labelled A to F).
  • We have information on whether the applicant was male or female and whether they were admitted or rejected.
  • First, we will evaluate whether the percentage of males admitted is indeed higher than that of females, overall. Next, we will calculate the same percentages for each department.
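
The slides do not show how `ucbadmit` is loaded. As a sketch, a data frame with the same counts can be built from the `UCBAdmissions` table in base R's datasets package. The name `ucbadmit_sketch` is hypothetical, and the factor level order may differ from the `ucbadmit` object used in the slides.

```r
# UCBAdmissions is a 3-way table of counts: Admit x Gender x Dept
counts <- as.data.frame(UCBAdmissions)

# Repeat each row Freq times to get one row per applicant
ucbadmit_sketch <- counts[rep(seq_len(nrow(counts)), times = counts$Freq),
                          c("Admit", "Gender", "Dept")]
names(ucbadmit_sketch) <- c("admit", "gender", "dept")
rownames(ucbadmit_sketch) <- NULL

nrow(ucbadmit_sketch)
table(ucbadmit_sketch$gender)
```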

Data

## # A tibble: 4,526 × 3
## admit gender dept
## <fct> <fct> <ord>
## 1 Admitted Male A
## 2 Admitted Male A
## 3 Admitted Male A
## 4 Admitted Male A
## 5 Admitted Male A
## 6 Admitted Male A
## 7 Admitted Male A
## 8 Admitted Male A
## 9 Admitted Male A
## 10 Admitted Male A
## 11 Admitted Male A
## 12 Admitted Male A
## 13 Admitted Male A
## 14 Admitted Male A
## 15 Admitted Male A
## # ℹ 4,511 more rows
## # A tibble: 2 × 2
## gender n
## <fct> <int>
## 1 Female 1835
## 2 Male 2691
## # A tibble: 6 × 2
## dept n
## <ord> <int>
## 1 A 933
## 2 B 585
## 3 C 918
## 4 D 792
## 5 E 584
## 6 F 714
## # A tibble: 2 × 2
## admit n
## <fct> <int>
## 1 Rejected 2771
## 2 Admitted 1755

What can you say about the overall gender distribution? Hint: Calculate the following probabilities: P(Admit|Male) and P(Admit|Female).

library(tidyverse)  # provides %>%, count(), mutate(), pivot_wider() and ggplot()

ucbadmit %>%
  count(gender, admit)
## # A tibble: 4 × 3
## gender admit n
## <fct> <fct> <int>
## 1 Female Rejected 1278
## 2 Female Admitted 557
## 3 Male Rejected 1493
## 4 Male Admitted 1198
ucbadmit %>%
  count(gender, admit) %>%
  group_by(gender) %>%
  mutate(prop_admit = n / sum(n))
## # A tibble: 4 × 4
## # Groups: gender [2]
## gender admit n prop_admit
## <fct> <fct> <int> <dbl>
## 1 Female Rejected 1278 0.696
## 2 Female Admitted 557 0.304
## 3 Male Rejected 1493 0.555
## 4 Male Admitted 1198 0.445
  • P(Admit|Female)=0.304
  • P(Admit|Male)=0.445

Overall gender distribution

ggplot(ucbadmit, aes(y = gender, fill = admit)) +
  geom_bar(position = "fill") +
  labs(title = "Admit proportion by gender",
       x = "Proportion",
       y = NULL,
       fill = "Admission")

What can you say about the gender distribution by department?

ucbadmit %>%
  count(dept, gender, admit)
## # A tibble: 24 × 4
## dept gender admit n
## <ord> <fct> <fct> <int>
## 1 A Female Rejected 19
## 2 A Female Admitted 89
## 3 A Male Rejected 313
## 4 A Male Admitted 512
## 5 B Female Rejected 8
## 6 B Female Admitted 17
## 7 B Male Rejected 207
## 8 B Male Admitted 353
## 9 C Female Rejected 391
## 10 C Female Admitted 202
## # ℹ 14 more rows
ucbadmit %>%
  count(dept, gender, admit) %>%
  pivot_wider(names_from = dept, values_from = n)
## # A tibble: 4 × 8
## gender admit A B C D E F
## <fct> <fct> <int> <int> <int> <int> <int> <int>
## 1 Female Rejected 19 8 391 244 299 317
## 2 Female Admitted 89 17 202 131 94 24
## 3 Male Rejected 313 207 205 279 138 351
## 4 Male Admitted 512 353 120 138 53 22
ucbadmit %>%
  count(dept, gender, admit) %>%
  group_by(dept, gender) %>%
  mutate(prop_admit = n / sum(n)) %>%
  select(-n) %>%
  pivot_wider(names_from = dept, values_from = prop_admit)
## # A tibble: 4 × 8
## # Groups: gender [2]
## gender admit A B C D E F
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Female Rejected 0.176 0.32 0.659 0.651 0.761 0.930
## 2 Female Admitted 0.824 0.68 0.341 0.349 0.239 0.0704
## 3 Male Rejected 0.379 0.370 0.631 0.669 0.723 0.941
## 4 Male Admitted 0.621 0.630 0.369 0.331 0.277 0.0590

Gender distribution, by department

ggplot(ucbadmit, aes(y = gender, fill = admit)) +
  geom_bar(position = "fill") +
  facet_wrap(. ~ dept, nrow = 1) +
  labs(title = "Admissions by gender and department",
       x = "Proportion",
       y = NULL,
       fill = "Admission")

Case for gender discrimination?

Overall:

  • P(Admit|Female)=0.304 and P(Admit|Male)=0.445
  • The proportion of applicants being admitted is higher given that the applicant is male than given that the applicant is female.

Within departments:

  • Department A:
    • P(Admit|A,Female)=0.824 and P(Admit|A,Male)=0.621
  • Department E:
    • P(Admit|E,Female)=0.239 and P(Admit|E,Male)=0.277
  • 4 out of 6 departments have higher admission proportions given the applicant is female.

Simpson's paradox

  • Not considering an important variable when studying a relationship can result in Simpson's paradox
  • Simpson's paradox illustrates the effect that omission of an explanatory variable can have on the measure of association between another explanatory variable and a response variable
  • The inclusion of a third variable in the analysis can change the apparent relationship between the other two variables
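
A sketch of why the overall comparison can point the other way in the Berkeley data: by the law of total probability, each overall admission rate is a weighted average of the department rates, with weights given by where that gender's applications went. The counts are copied from the department-level tables; the variable names are illustrative.

```r
# Counts from the department-level tables (departments A to F)
dept            <- c("A", "B", "C", "D", "E", "F")
female_admitted <- c(89, 17, 202, 131, 94, 24)
female_rejected <- c(19, 8, 391, 244, 299, 317)
male_admitted   <- c(512, 353, 120, 138, 53, 22)
male_rejected   <- c(313, 207, 205, 279, 138, 351)

female_applied <- female_admitted + female_rejected
male_applied   <- male_admitted + male_rejected

# P(Admit | dept, gender): admission rate within each department
rate_f <- female_admitted / female_applied
rate_m <- male_admitted / male_applied

# P(dept | gender): share of each gender's applications per department
share_f <- female_applied / sum(female_applied)
share_m <- male_applied / sum(male_applied)

# Overall rate = sum over departments of (rate within dept) x (share of applications)
overall_f <- sum(rate_f * share_f)
overall_m <- sum(rate_m * share_m)

data.frame(dept, round(cbind(rate_f, rate_m, share_f, share_m), 3))
round(c(overall_f = overall_f, overall_m = overall_m), 3)
```

Departments A and B have the highest admission rates and receive about half of the male applications but under a tenth of the female ones, which pulls the overall male rate up even though the within-department rates are similar or favour women.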

Relationship between two variables

df
## # A tibble: 8 × 3
## x y z
## <dbl> <dbl> <chr>
## 1 2 4.1 A
## 2 3 2.9 A
## 3 4 2.1 A
## 4 5 0.9 A
## 5 6 10.9 B
## 6 7 10.1 B
## 7 8 8.9 B
## 8 9 8.1 B
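
The slides do not show how `df` was created. A minimal sketch that rebuilds the same eight rows as a base R data frame (rather than a tibble) and plots them with one colour per group; the colours are an arbitrary choice.

```r
df <- data.frame(
  x = as.numeric(2:9),
  y = c(4.1, 2.9, 2.1, 0.9, 10.9, 10.1, 8.9, 8.1),
  z = rep(c("A", "B"), each = 4)
)

# Colour the points by z so that the two groups are visible
plot(df$x, df$y,
     col = ifelse(df$z == "A", "steelblue", "tomato"),
     pch = 19, xlab = "x", ylab = "y")
legend("topleft", legend = c("A", "B"),
       col = c("steelblue", "tomato"), pch = 19, title = "z")
```

Ignoring the colours, y appears to increase with x; within each colour it decreases.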

Considering a third variable

df %>%
  summarise(cor_x_y = cor(x, y))
## # A tibble: 1 × 1
## cor_x_y
## <dbl>
## 1 0.683
df %>%
  group_by(z) %>%
  summarise(cor_x_y = cor(x, y))
## # A tibble: 2 × 2
## z cor_x_y
## <chr> <dbl>
## 1 A -0.997
## 2 B -0.997
