library(natality2014) library(statisticalModeling) library(mosaic)
The Natality_2014
package provides support for teaching introductory statistics. It contains information from each of the almost 4 million births registered in the US in 2014. These offer many diverse possibilities for framing research questions ranging from the very simple to the very complex, as well as opportunities to construct narrative and graphical reports.
This document describes some of the research questions that can be suitable for an introductory statistics class or a class on statistical modeling. The research questions are intentionally stated in a simple and general way and not in terms of the numerical result from a statistical procedure. This gives students a chance to explore an important aspect of research: reframing the question with more nuance to address not just the question but what motivates the question and what factors might be brought into play.
We'll start with an example many parts of which are suitable for the very start of an introductory statistics course.
Research question: How long is a normal gestation?
This is a familiar topic and almost everyone will have a ready answer: 9 months.
"OK. So you think it's nine months. Here are some data. Are they consistent with that?"
How to answer this question? Put discussion notes here.
What fraction last exactly 36 weeks?
tally( ~ combgest == 36, data = Natality_2014_1k) tally( ~ combgest == 36, data = Natality_2014_1k, format = "proportion")
What fraction last approximately 36 weeks.
tally( ~ combgest %in% c(35, 36, 37), data = Natality_2014_1k, format = "proportion") tally( ~ combgest %in% c(34, 35, 36, 37, 38), data = Natality_2014_1k, format = "proportion")
Let the data tell us what approximately means.
tally( ~ combgest, data = Natality_2014_1k)
We can see that 39 weeks is the most common length for gestation.
tally( ~ combgest %in% c(37, 38, 39, 40, 41), data = Natality_2014_1k, format = "proportion")
84% sounds like a good definition of normal. - Let's turn the question inside out: what are the bounds on an interval that contains, say, 80% of all births?
qdata( ~ combgest, data = Natality_2014_1k, p = c(.10, .90))
* Graphic: Remember it's an iterative process, where you successively add in or refine components to address shortcomings. ```r gf_density( ~ combgest + fill:"blue" + alpha:0.5, data = Natality_2014_1k) + geom_vline(aes(xintercept = 35, color = "red")) + geom_vline(aes(xintercept = 41, color = "red")) + ggtitle("Defining normal") + xlab("Gestation (weeks)") + guides(color = FALSE)
Questions prompted by the graphic?
bw
parameter. 0.7 will be reasonable, but try several."When will my baby arrive?"
See Path One.
Suppose I'm a 28-year old weighing 125 pounds before pregnancy. I'm pregnant with a baby girl? How long will pregnancy last for me?
Natality_2014_100k %>% filter(mager == 28, pwgt_r > 120, pwgt_r < 130, sex == "F") %>% qdata( ~ combgest, data = ., p = c(.05, .95))
How reliable is this estimate? - How many people is it based on? - What happens if we try it with more data?
gest_mod_1 <- lm(combgest ~ mager + pwgt_r + sex, data = Natality_2014_1k) me <- list(mager = 28, sex = "F", pwgt_r = 125) evaluate_model(gest_mod_1, at = me)
This gives us a central estimate. How much variation is there around this?
Look at prediction error ...
Preds <- evaluate_model(gest_mod_1, data = Natality_2014_1k) qdata( ~ I(combgest - model_output), data = Preds, p = c(.05, .95))
But do these variables really have a connection to gestation time?
effect_size(gest_mod_1, ~ mager, at = me) effect_size(gest_mod_1, ~ pwgt_r, at = me) effect_size(gest_mod_1, ~ sex, at = me)
BE <- ensemble(gest_mod_1, nreps = 100) effect_size(BE, ~ mager, at = me ) %>% ggplot(aes(x = slope)) + geom_density() + ggtitle("Age and gestation") effect_size(BE, ~ pwgt_r, at = me ) %>% ggplot(aes(x = slope)) + geom_density() + ggtitle("Weight and gestation") effect_size(BE, ~ sex, at = me ) %>% ggplot(aes(x = change)) + geom_density() + ggtitle("Baby's sex and gestation")
Many labors are induced artificially.
tally( ~ uop_induc, data = Natality_2014_1k)
When are such inductions undertaken?
Each of these graphics highlights a different aspect of the story. Put them together to tell the story of when inductions are done. Try them with larger data sets e.g. _10k
or _100k
and see what details stem from the small data set used.
gf_histogram( ~ combgest + fill:uop_induc + alpha:0.7 + position:"dodge", data = Natality_2014_1k) + scale_fill_discrete(guide_legend(title = "Labor induced")) + xlab("Gestation")
gf_histogram( ~ combgest + fill:uop_induc + alpha:0.7 + position:"stack", data = Natality_2014_1k) + scale_fill_discrete(guide_legend(title = "Labor induced")) + xlab("Gestation")
gf_histogram( ~ combgest + fill:uop_induc + alpha:0.7 + position:"fill", data = Natality_2014_1k) + scale_fill_discrete(guide_legend(title = "Labor induced")) + xlab("Gestation")
There are two variables that relate to prenatal care:
precare
- the month when prenatal care started. (15 has a special meaning.)previs
- the number of prenatal care visits.There are several variables that might reasonably be used to measure outcome.
apgar5
The APGAR score at five minutes dbwt
Baby's weight in gramsab_aven1
Whether the baby was put on a mechanical respirator immediately.ab_aven6
Whether the baby was still on a mechanical respirator at six hours.ab_seiz
, ab_anti
, ab_surf
, ab_nicu
See the documentation for Natality_2014_100k
.Using any of the variables above, or any others you think are relevant, construct an index of the birth outcome. You can do this by converting each variable to a score and adding up the scores. Aim that the total possible range for the scores is 10 points. (This is arbitrary, but convenient.) The first step is to construct a table that turns each of the variables you will use into a simple numerical score, like this one for APGAR:
APGAR | score ------|------- 8 and higher | 3 6 to 7 | 2 4 to 5 | 1 less than 4 | 0
As a matter of computer technique, here's a way to translate APGAR score according to the table:
apgar_to_score <- function(apgar) { (apgar >= 4) + (apgar >= 6) + (apgar >= 8) }
Make sure that each component of the score is justifiable.
We'll make a score by adding together twice the apgar score plus a weight score from 0 to 3, plus 1 if the baby was not transferred to the neonatal intensive care unit.
nicu_to_score <- function(nicu) { nicu == "n" } weight_to_score <- function(weight) { (weight > 2700) + (weight > 2300) + (weight > 2000) + (weight > 1500) }
Here's a statement to create the score:
For_analysis <- Natality_2014_100k %>% mutate(outcome_score = 2 + apgar_to_score(apgar5) + weight_to_score(dbwt) + nicu_to_score(ab_nicu)) gf_counts( ~ outcome_score, data = For_analysis) tally( ~ outcome_score <= 7, data = For_analysis)
Data is missing for about 5% of babies.
ggplot(For_analysis, aes(x = precare, fill = outcome_score > 9)) + geom_histogram(position = "fill")
tally(outcome_score >= 9 ~ precare, data = For_analysis, format = "proportion")
Mothers who had no prenatal care (that is, precare == 15
) had babies much more likely to fall below 10.
mod_1 <- lm(outcome_score ~ previs * precare, dat = For_analysis) summary(mod_1)
library(rpart) library(rpart.plot) mod_2 <- rpart(outcome_score ~ previs + precare + priorlive + mager, data = For_analysis, cp = 0.001) prp(mod_2, type = 3) fmodel(mod_2, ~ previs + precare)
The dominant theme here is that the ratio doesn't vary with any of the explanatory variables.
Can be used with
mager >= 35
For_analysis <- Natality_2014_100k %>% mutate(male = 0 + (sex == "M")) mod_0 <- glm(male ~ 1, data = For_analysis, family = "binomial") mod_1 <- glm(male ~ mager, data = For_analysis, family = "binomial") mod_2 <- glm(male ~ mager + pwgt_r + urf_diab + fagecomb, data = For_analysis, family = "binomial") mod_3 <- glm(male ~ mager > 35, data = For_analysis)
Describe the probability of survival for babies as a function of gestation time.
ggplot(Natality_2014_100k, aes(x = combgest, fill = ilive)) + geom_bar() ggplot(Natality_2014_100k, aes(x = combgest, fill = ilive)) + geom_bar(position = "fill")
ggplot(Natality_2014_100k, aes(x = ilive, fill = as.factor(combgest))) + geom_bar(position = "fill")
mod_1 <- glm(ilive == "y" ~ combgest * mager, data = Natality_2014_100k, family = "binomial") fmodel(mod_1, ~ combgest + mager) library(splines) mod_2 <- glm(ilive == "y" ~ ns(combgest, 2) * mager, data = Natality_2014_100k, family = "binomial") fmodel(mod_2, ~ combgest + mager)
Survival is very low for babies born at or before 21 weeks, very high near 38 weeks and still high but decreasing after 41 weeks. You can see the last part of this most easily in a table of numbers rather than a visualization, since the survival is still practically
tally(ilive ~ combgest, data = Natality_2014_100k, format = "proportion", useNA = "no")
pay
, parity, age, race, educationResearch question:
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.