Roger D. Peng and Elizabeth Matsui. "The Art of Data Science." A Guide for Anyone Who Works with Data. Skybrude Consulting, LLC (2015).
Six types of questions
Jeffrey T. Leek and Roger D. Peng. "What is the question?" Science 347.6228 (2015): 1314-1315.
Descriptive: frequency of hospitalisations due to COVID-19 in a set of data collected from a group of individuals
Exploratory: examine relationships between a range of dietary factors and COVID-19 hospitalisations
Inferential: examine whether any relationship between taking Vitamin D supplements and COVID-19 hospitalisations found in the sample holds for the population at large
Predictive: what types of people will take Vitamin D supplements during the next year
Causal: whether people with COVID-19 who were randomly assigned to take Vitamin D supplements are hospitalised at a different rate than those who were not
Mechanistic: how increased vitamin D intake leads to a reduction in the number of viral illnesses
Suppose I want to estimate the average number of children in households in Edinburgh. I conduct a survey at an elementary school in Edinburgh and ask students at this elementary school how many children, including themselves, live in their house. Then, I take the average of the responses.
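To see why this survey design can mislead, here is a minimal sketch in base R. The household distribution is entirely hypothetical (made-up shares, not Edinburgh data); the point is only that asking children means a household with k children contributes k responses, and households with no children contribute none.

```r
# Hypothetical share of households with 0, 1, 2, 3, 4 children (illustrative only)
children <- 0:4
share_of_households <- c(0.50, 0.20, 0.20, 0.08, 0.02)

# What we actually want: the average over households
true_mean <- sum(children * share_of_households)

# What the school survey measures: each child answers once, so a household
# with k children is counted k times and childless households never appear
weights <- children * share_of_households
share_of_responses <- weights / sum(weights)
school_mean <- sum(children * share_of_responses)

c(true_mean = true_mean, school_mean = school_mean)
```

Under these assumed shares the school-based average is well above the household average, which is the direction of bias the example is meant to illustrate. The sketch ignores the further restriction to elementary-school ages.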
Checklist:
When doing the data analysis:
When communicating with your audience:
Observational
Experimental
Girls who ate breakfast of any type had a lower average body mass index (BMI), a common obesity gauge, than those who said they didn't. The index was even lower for girls who said they ate cereal for breakfast, according to findings of the study conducted by the Maryland Medical Research Institute with funding from the National Institutes of Health (NIH) and cereal-maker General Mills. [...]
The results were gleaned from a larger NIH survey of 2,379 girls in California, Ohio, and Maryland who were tracked between the ages of 9 and 19. [...]
As part of the survey, the girls were asked once a year what they had eaten during the previous three days. [...]
Source: Study: Cereal Keeps Girls Slim, Retrieved Sep 13, 2018.
A July 2019 YouGov survey asked 1,633 randomly selected adults in Great Britain (GB) and 1,333 in the USA which of the following statements about the global environment best describes their view:
- The climate is changing and human activity is mainly responsible
- The climate is changing and human activity is partly responsible, together with other factors
- The climate is changing but human activity is not responsible at all
- The climate is not changing
Country | The climate is changing and human activity is mainly responsible | The climate is changing and human activity is partly responsible, together with other factors | The climate is changing but human activity is not responsible at all | The climate is not changing | Don't know | Sum |
---|---|---|---|---|---|---|
GB | 833 | 604 | 49 | 33 | 114 | 1633 |
US | 507 | 493 | 120 | 80 | 133 | 1333 |
Sum | 1340 | 1097 | 169 | 113 | 247 | 2966 |
Is the proportion of respondents who said "The climate is changing and human activity is mainly responsible" the same in Great Britain and the United States?
What is the probability that a respondent says "The climate is changing and human activity is mainly responsible"?
What is the probability that a respondent says "The climate is changing and human activity is mainly responsible", given that the respondent is from Great Britain?
Does knowing that event B happened change the probability of event A happening?
Based on the response proportions for all respondents and within each country, does there appear to be a relationship between country and beliefs about climate change?
all <- 1340 / 2966
all
## [1] 0.4517869
gb <- 833 / 1633
gb
## [1] 0.5101041
us <- 507 / 1333
us
## [1] 0.3803451
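The same comparison can be made for every response at once. The sketch below enters the survey counts from the table as a matrix and uses base R's `prop.table()`; the object name `survey` and the short response labels are illustrative choices, not part of the original analysis.

```r
# Counts copied from the YouGov table; short labels are made up for readability
responses <- c("mainly", "partly", "not_responsible", "not_changing", "dont_know")
survey <- matrix(
  c(833, 604,  49, 33, 114,
    507, 493, 120, 80, 133),
  nrow = 2, byrow = TRUE,
  dimnames = list(country = c("GB", "US"), response = responses)
)

# Marginal probability of each response, P(response)
p_marginal <- colSums(survey) / sum(survey)

# Conditional probability of each response given country, P(response | country)
p_given_country <- prop.table(survey, margin = 1)

round(p_marginal, 3)
round(p_given_country, 3)
```

If country and response were independent, each row of `p_given_country` would be close to `p_marginal`.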
Is the relationship causal or associative? Can it be explained via another variable?
## # A tibble: 4,526 × 3
##    admit    gender dept
##    <fct>    <fct>  <ord>
##  1 Admitted Male   A
##  2 Admitted Male   A
##  3 Admitted Male   A
##  4 Admitted Male   A
##  5 Admitted Male   A
##  6 Admitted Male   A
##  7 Admitted Male   A
##  8 Admitted Male   A
##  9 Admitted Male   A
## 10 Admitted Male   A
## 11 Admitted Male   A
## 12 Admitted Male   A
## 13 Admitted Male   A
## 14 Admitted Male   A
## 15 Admitted Male   A
## # ℹ 4,511 more rows

## # A tibble: 2 × 2
##   gender     n
##   <fct>  <int>
## 1 Female  1835
## 2 Male    2691

## # A tibble: 6 × 2
##   dept      n
##   <ord> <int>
## 1 A       933
## 2 B       585
## 3 C       918
## 4 D       792
## 5 E       584
## 6 F       714

## # A tibble: 2 × 2
##   admit        n
##   <fct>    <int>
## 1 Rejected  2771
## 2 Admitted  1755
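The `ucbadmit` data frame itself is not defined in these slides. As a sketch, a one-row-per-applicant version can be rebuilt from the `UCBAdmissions` table that ships with R, assuming (as the printed counts suggest) that `ucbadmit` was derived from it. The name `ucbadmit_sketch` is hypothetical, and factor level order may differ from the original.

```r
# UCBAdmissions is a 3-way table of counts in R's built-in datasets package
counts <- as.data.frame(UCBAdmissions)  # columns: Admit, Gender, Dept, Freq

# Repeat each combination Freq times to get one row per applicant
rows <- rep(seq_len(nrow(counts)), counts$Freq)
ucbadmit_sketch <- counts[rows, c("Admit", "Gender", "Dept")]
names(ucbadmit_sketch) <- c("admit", "gender", "dept")
rownames(ucbadmit_sketch) <- NULL

nrow(ucbadmit_sketch)
table(ucbadmit_sketch$gender)
table(ucbadmit_sketch$admit)
```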
What can you say about the overall gender distribution? Hint: Calculate the following probabilities: P(Admit|Male) and P(Admit|Female).
ucbadmit %>% count(gender, admit)
## # A tibble: 4 × 3
##   gender admit        n
##   <fct>  <fct>    <int>
## 1 Female Rejected  1278
## 2 Female Admitted   557
## 3 Male   Rejected  1493
## 4 Male   Admitted  1198
ucbadmit %>%
  count(gender, admit) %>%
  group_by(gender) %>%
  mutate(prop_admit = n / sum(n))
## # A tibble: 4 × 4
## # Groups:   gender [2]
##   gender admit        n prop_admit
##   <fct>  <fct>    <int>      <dbl>
## 1 Female Rejected  1278      0.696
## 2 Female Admitted   557      0.304
## 3 Male   Rejected  1493      0.555
## 4 Male   Admitted  1198      0.445
ggplot(ucbadmit, aes(y = gender, fill = admit)) +
  geom_bar(position = "fill") +
  labs(
    title = "Admit proportion by gender",
    x = "Proportion", y = NULL, fill = "Admission"
  )
What can you say about the gender distribution by department?
ucbadmit %>% count(dept, gender, admit)
## # A tibble: 24 × 4
##    dept  gender admit        n
##    <ord> <fct>  <fct>    <int>
##  1 A     Female Rejected    19
##  2 A     Female Admitted    89
##  3 A     Male   Rejected   313
##  4 A     Male   Admitted   512
##  5 B     Female Rejected     8
##  6 B     Female Admitted    17
##  7 B     Male   Rejected   207
##  8 B     Male   Admitted   353
##  9 C     Female Rejected   391
## 10 C     Female Admitted   202
## # ℹ 14 more rows
ucbadmit %>%
  count(dept, gender, admit) %>%
  pivot_wider(names_from = dept, values_from = n)
## # A tibble: 4 × 8
##   gender admit        A     B     C     D     E     F
##   <fct>  <fct>    <int> <int> <int> <int> <int> <int>
## 1 Female Rejected    19     8   391   244   299   317
## 2 Female Admitted    89    17   202   131    94    24
## 3 Male   Rejected   313   207   205   279   138   351
## 4 Male   Admitted   512   353   120   138    53    22
ucbadmit %>%
  count(dept, gender, admit) %>%
  group_by(dept, gender) %>%
  mutate(prop_admit = n / sum(n)) %>%
  select(-n) %>%
  pivot_wider(names_from = dept, values_from = prop_admit)
## # A tibble: 4 × 8
## # Groups:   gender [2]
##   gender admit        A     B     C     D     E      F
##   <fct>  <fct>    <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 Female Rejected 0.176 0.32  0.659 0.651 0.761 0.930
## 2 Female Admitted 0.824 0.68  0.341 0.349 0.239 0.0704
## 3 Male   Rejected 0.379 0.370 0.631 0.669 0.723 0.941
## 4 Male   Admitted 0.621 0.630 0.369 0.331 0.277 0.0590
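To put the overall and within-department admission rates side by side, here is a base R sketch. It works on the built-in `UCBAdmissions` table rather than on `ucbadmit`, on the assumption that the two hold the same counts; the object names are illustrative.

```r
adm <- UCBAdmissions  # 3-way table with dimensions Admit x Gender x Dept

# Overall: collapse over departments, then take proportions within gender
overall <- prop.table(margin.table(adm, c(1, 2)), margin = 2)["Admitted", ]

# Within departments: proportions within each gender-by-department cell
by_dept <- prop.table(adm, margin = c(2, 3))["Admitted", , ]

# In how many departments is the female admission rate the higher one?
female_higher <- by_dept["Female", ] > by_dept["Male", ]

# Where did each gender apply? Share of each gender's applications by department
applied <- prop.table(margin.table(adm, c(2, 3)), margin = 1)

round(overall, 3)
round(by_dept, 3)
sum(female_higher)
round(applied, 3)
```

Comparing `applied` with `by_dept` shows which departments each gender mostly applied to and how selective those departments were, which is the kind of third variable the question about causal versus associative relationships points at.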
Overall:
Within departments:
df
## # A tibble: 8 × 3
##       x     y z
##   <dbl> <dbl> <chr>
## 1     2   4.1 A
## 2     3   2.9 A
## 3     4   2.1 A
## 4     5   0.9 A
## 5     6  10.9 B
## 6     7  10.1 B
## 7     8   8.9 B
## 8     9   8.1 B
df %>%
  summarise(cor_x_y = cor(x, y))
## # A tibble: 1 × 1
##   cor_x_y
##     <dbl>
## 1   0.683
df %>%
  group_by(z) %>%
  summarise(cor_x_y = cor(x, y))
## # A tibble: 2 × 2
##   z     cor_x_y
##   <chr>   <dbl>
## 1 A      -0.997
## 2 B      -0.997
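The slides print `df` without showing how it was created. A minimal base R sketch that rebuilds it from the eight printed rows and repeats both correlations follows; it uses a plain `data.frame` rather than a tibble, which is an assumption made to keep the example dependency-free.

```r
# Values copied from the printed rows of df
df <- data.frame(
  x = c(2, 3, 4, 5, 6, 7, 8, 9),
  y = c(4.1, 2.9, 2.1, 0.9, 10.9, 10.1, 8.9, 8.1),
  z = rep(c("A", "B"), each = 4)
)

# Correlation ignoring the grouping variable z
cor_overall <- cor(df$x, df$y)

# Correlation computed separately within each level of z
cor_by_group <- sapply(split(df, df$z), function(g) cor(g$x, g$y))

round(cor_overall, 3)
round(cor_by_group, 3)
```

The sign of the association flips once `z` is taken into account: positive overall, strongly negative within each group.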