library(natality2014)
library(statisticalModeling)
library(mosaic)

The Natality_2014 package provides support for teaching introductory statistics. It contains information from each of the almost 4 million births registered in the US in 2014. These offer many diverse possibilities for framing research questions ranging from the very simple to the very complex, as well as opportunities to construct narrative and graphical reports.

This document describes some of the research questions that can be suitable for an introductory statistics class or a class on statistical modeling. The research questions are intentionally stated in a simple and general way and not in terms of the numerical result from a statistical procedure. This gives students a chance to explore an important aspect of research: reframing the question with more nuance to address not just the question but what motivates the question and what factors might be brought into play.

We'll start with an example many parts of which are suitable for the very start of an introductory statistics course.

Introductory Example: Normal Gestation Period {.tabset .tabset-pills .tabset-fade}

Research question: How long is a normal gestation?

This is a familiar topic and almost everyone will have a ready answer: 9 months.

Guide

  1. Are data consistent with 9 months?
  2. When will my baby arrive?

Path One {.tabset}

"OK. So you think it's nine months. Here are some data. Are they consistent with that?"

Discussion

How to answer this question? Put discussion notes here.

Some refinements

  1. The data are provided in units of weeks. What fraction of all gestations last exactly 36 weeks?
  2. What fraction of all gestations last approximately 36 weeks? What's a good definition of "approximately"?
  3. How can the data be used to tell us what "approximately" means?

Exploring One

What fraction last exactly 36 weeks?

tally( ~ combgest == 36, data = Natality_2014_1k)
tally( ~ combgest == 36, data = Natality_2014_1k, format = "proportion")

Exploring Two

What fraction last approximately 36 weeks.

tally( ~ combgest %in% c(35, 36, 37), data = Natality_2014_1k, format = "proportion")
tally( ~ combgest %in% c(34, 35, 36, 37, 38), data = Natality_2014_1k, format = "proportion")

Exploring Three

Let the data tell us what approximately means.

tally( ~ combgest, data = Natality_2014_1k)

We can see that 39 weeks is the most common length for gestation.

tally( ~ combgest %in% c(37, 38, 39, 40, 41), data = Natality_2014_1k, format = "proportion")

84% sounds like a good definition of normal. - Let's turn the question inside out: what are the bounds on an interval that contains, say, 80% of all births?

qdata( ~ combgest, data = Natality_2014_1k, p = c(.10, .90))
* Graphic: Remember it's an iterative process, where you successively add in or refine components to address shortcomings.
```r
gf_density( ~ combgest + fill:"blue" + alpha:0.5, data = Natality_2014_1k) +
  geom_vline(aes(xintercept = 35, color = "red")) +
  geom_vline(aes(xintercept = 41, color = "red")) +
  ggtitle("Defining normal") +
  xlab("Gestation (weeks)") +
  guides(color = FALSE)

Path Two {.tabset}

"When will my baby arrive?"

Some refinements:

  1. Give a general description from the whole population.
  2. "My" means literally about me. Customize the answer based on individual factors.
  3. Do pregnancies end after a natural time? How often is the duration determined by artificial factors such as medical intervention? If we separate these out those that end naturally, is the result different? Is it important to give an answer that includes reference to artificial induction of labor.

Exploring One

See Path One.

Exploring Two {.tabset}

Suppose I'm a 28-year old weighing 125 pounds before pregnancy. I'm pregnant with a baby girl? How long will pregnancy last for me?

One method: selecting matching cases:
Natality_2014_100k %>%
  filter(mager == 28, pwgt_r > 120, pwgt_r < 130, sex == "F") %>%
  qdata( ~ combgest, data = ., p = c(.05, .95))

How reliable is this estimate? - How many people is it based on? - What happens if we try it with more data?

Another method: Building and evaluating a model of combgest
gest_mod_1 <- lm(combgest ~ mager + pwgt_r + sex, data = Natality_2014_1k)
me <- list(mager = 28, sex = "F", pwgt_r = 125)
evaluate_model(gest_mod_1, at = me)

This gives us a central estimate. How much variation is there around this?

Look at prediction error ...

Preds <- evaluate_model(gest_mod_1, data = Natality_2014_1k)
qdata( ~ I(combgest - model_output), data = Preds, p = c(.05, .95))

But do these variables really have a connection to gestation time?

effect_size(gest_mod_1, ~ mager, at = me)
effect_size(gest_mod_1, ~ pwgt_r, at = me)
effect_size(gest_mod_1, ~ sex, at = me)
BE <- ensemble(gest_mod_1, nreps = 100)
effect_size(BE, ~ mager, at = me ) %>%
  ggplot(aes(x = slope)) + geom_density() + ggtitle("Age and gestation")
effect_size(BE, ~ pwgt_r, at = me ) %>%
  ggplot(aes(x = slope)) + geom_density() + ggtitle("Weight and gestation")
effect_size(BE, ~ sex, at = me ) %>%
  ggplot(aes(x = change)) + geom_density() + ggtitle("Baby's sex and gestation")

Exploring Three {.tabset}

Many labors are induced artificially.

tally( ~ uop_induc, data = Natality_2014_1k)

When are such inductions undertaken?

Each of these graphics highlights a different aspect of the story. Put them together to tell the story of when inductions are done. Try them with larger data sets e.g. _10k or _100k and see what details stem from the small data set used.

gf_histogram( ~ combgest + fill:uop_induc + alpha:0.7 + position:"dodge", 
              data = Natality_2014_1k) + 
  scale_fill_discrete(guide_legend(title = "Labor induced")) + 
  xlab("Gestation") 
gf_histogram( ~ combgest + fill:uop_induc + alpha:0.7 + position:"stack", 
              data = Natality_2014_1k) +
  scale_fill_discrete(guide_legend(title = "Labor induced")) + 
  xlab("Gestation") 
gf_histogram( ~ combgest + fill:uop_induc + alpha:0.7 + position:"fill", 
              data = Natality_2014_1k) + 
  scale_fill_discrete(guide_legend(title = "Labor induced")) + 
  xlab("Gestation") 

Does prenatal care improve outcomes?

There are two variables that relate to prenatal care:

There are several variables that might reasonably be used to measure outcome.

Making an index of "outcome" {.tabset}

Introduction

Using any of the variables above, or any others you think are relevant, construct an index of the birth outcome. You can do this by converting each variable to a score and adding up the scores. Aim that the total possible range for the scores is 10 points. (This is arbitrary, but convenient.) The first step is to construct a table that turns each of the variables you will use into a simple numerical score, like this one for APGAR:

APGAR | score ------|------- 8 and higher | 3 6 to 7 | 2 4 to 5 | 1 less than 4 | 0

As a matter of computer technique, here's a way to translate APGAR score according to the table:

apgar_to_score <- function(apgar) {
  (apgar >= 4) + (apgar >= 6) + (apgar >= 8)
}

Make sure that each component of the score is justifiable.

An example index

We'll make a score by adding together twice the apgar score plus a weight score from 0 to 3, plus 1 if the baby was not transferred to the neonatal intensive care unit.

nicu_to_score <- function(nicu) {
  nicu == "n"
}
weight_to_score <- function(weight) {
  (weight > 2700) + (weight > 2300) + (weight > 2000) + (weight > 1500)
}

Here's a statement to create the score:

For_analysis <-
  Natality_2014_100k %>%
  mutate(outcome_score = 2 + apgar_to_score(apgar5) + 
           weight_to_score(dbwt) + nicu_to_score(ab_nicu))
gf_counts( ~ outcome_score, data = For_analysis)
tally( ~ outcome_score <= 7, data = For_analysis)

Data is missing for about 5% of babies.

Relationship of score to prenatal care {.tabset}

A graphical approach

ggplot(For_analysis, aes(x = precare, fill = outcome_score > 9)) +
  geom_histogram(position = "fill")

A tabular approach

tally(outcome_score >= 9 ~ precare, data = For_analysis, format = "proportion")

Mothers who had no prenatal care (that is, precare == 15) had babies much more likely to fall below 10.

A modeling approach

mod_1 <- lm(outcome_score ~ previs * precare, dat = For_analysis)
summary(mod_1)
library(rpart)
library(rpart.plot)
mod_2 <- rpart(outcome_score ~ previs + precare + priorlive + mager, 
               data = For_analysis, cp = 0.001)
prp(mod_2, type = 3)
fmodel(mod_2, ~ previs + precare)

Maternal outcome

Ratio of male to female births {.tabset}

Discussion

Instructor notes

The dominant theme here is that the ratio doesn't vary with any of the explanatory variables.

Can be used with

Calculating the ratio

For_analysis <- Natality_2014_100k %>% mutate(male = 0 + (sex == "M"))
mod_0 <- glm(male ~ 1, data = For_analysis, family = "binomial")
mod_1 <- glm(male ~ mager, data = For_analysis, family = "binomial")
mod_2 <- glm(male ~ mager + pwgt_r + urf_diab + fagecomb, 
             data = For_analysis, family = "binomial")
mod_3 <- glm(male ~ mager > 35, data = For_analysis)

Survival and gestation

Description

Describe the probability of survival for babies as a function of gestation time.

Visualization

ggplot(Natality_2014_100k, aes(x = combgest, fill = ilive)) + geom_bar()
ggplot(Natality_2014_100k, aes(x = combgest, fill = ilive)) + geom_bar(position = "fill")
ggplot(Natality_2014_100k, aes(x = ilive, fill = as.factor(combgest))) + geom_bar(position = "fill")

Modeling

mod_1 <- glm(ilive == "y" ~ combgest * mager, 
           data = Natality_2014_100k,
           family = "binomial")
fmodel(mod_1, ~ combgest + mager)
library(splines)
mod_2 <- glm(ilive == "y" ~ ns(combgest, 2) * mager,
             data = Natality_2014_100k,
             family = "binomial")
fmodel(mod_2, ~ combgest + mager) 

Instructor

Survival is very low for babies born at or before 21 weeks, very high near 38 weeks and still high but decreasing after 41 weeks. You can see the last part of this most easily in a table of numbers rather than a visualization, since the survival is still practically

tally(ilive ~ combgest, data = Natality_2014_100k, format = "proportion", useNA = "no")

Other models

  1. Cigarettes
    • and baby health
    • who quits smoking (education, pay, wic?)
  2. Plurality
  3. Morbidity
    • risk factors for eclampsia, perineal rupture, historectomy
  4. Inheritance
    • weight of baby vs weight, age, height of mother
  5. Home births
    • who pays?
    • outcomes for mothers and baby
  6. Sex ratios
    • number of previous children
  7. Gestation as a function of number of previous children
  8. Prenatal care as a function of pay, parity, age, race, education

Modeling Risk

Research question:

Meta stuff ...



dtkaplan/natality2014 documentation built on May 15, 2019, 5:22 p.m.