## Do not delete this! ## It loads the s20x library for you. If you delete it ## your document may not compile it. require(s20x) knitr::opts_chunk$set( dev = "png", fig.ext = "png", dpi = 96 )
Let's examine what affects the birth weight of babies.
bwt: birth weight in ouncesgestation: length of pregnancy in daysnot.first.born: 0=first born, 1=not first-bornage: mother's age in yearsheight: mother's height in inchesweight: mother's pre-pregnancy weight in poundssmoke: smoking status of mother 0=not now, 1=yes.This dataset was obtained from \url{ http://www.stat.berkeley.edu/users/statlabs/labs.html}.
It accompanies the excellent text Stat Labs: Mathematical Statistics through Applications Springer-Verlag (2001) by Deborah Nolan and Terry Speed.
We want to build a model to explain the birth weight of babies.
load(system.file("extdata", "Babies.df.rda", package = "s20x"))
par(mar=c(0,0,0,0)) Babies.df = read.table("babies_data.txt", header = T) pairs20x(Babies.df[, c(1, 2, 4, 5, 6)])
par(mar=c(0,0,0,0)) pairs20x(Babies.df[, c(1, 2, 4, 5, 6)])
Looking at the pairs plot, we see a somewhat weak relationship between bwt and mother's height and weight.
There is a stronger relationship between the gestation time (gestation) for the babies and it's bwt which is not surprising, as the longer the child is in the mother's womb the longer the child has had time to have nutrition and grow --- up to a point --- then it `flattens out' somewhat.
There doesn't seem to be any relationship between a mother's age and her child's bwt.
Let us look deeper into the relationship between bwt and gestation.
plot(bwt ~ gestation, data = Babies.df, col = "gray60") lines(lowess(Babies.df$gestation,Babies.df$bwt), col = "tomato", lwd = 2) text(152, 120, "?") text(185, 115, "?") abline(v = 294, col = "steelblue", lwd = 2)
Note also that there seems to be some `weird' data points in these plots. There does not appear to be much of a relationship between the $X$s. That is, the explanatory variables do not seem to have any strong relationships between them.
Most babes are born before 42 weeks = $42*7=294$ days\footnote{"American College of Obstetricians and Gynaecologists - How Your Baby Grows During Pregnancy". See \url{https://www.acog.org/-/media/For-Patients/faq156.pdf?dmc=1&ts=20150329T2112264959}.}. It seems that beyond this point babies cease to grow and hence the 'flattening out' and/or decrease. We'll create a dummy variable OD (for overdue) for this time point.
Let's look at the categorical (factor) data variables against the baby's birth weight (bwt).
They are not.first.born and smoke.
pairs20x(Babies.df[, c(1, 3, 7)])
Here, we only see a slight relationship between whether the mother smokes (smoke) and bwt. There is a slight decrease in babies bwt if the mother smokes. This increases the chance of a mother having a low birth weight baby if she smokes -- perhaps another reason to avoid tobacco!
The variable not.first.born does not appear to have too much of an effect --- which is perhaps not a surprise given that this variable may not be as important as it once was as family size has deceased markedly in the developed world (this is US data). We'll check this out later.
# Let's create OD as mentioned earlier. Babies.df$OD = 1 * (Babies.df$gestation > 294) range(Babies.df$gestation[Babies.df$OD == 0]) # Check range(Babies.df$gestation[Babies.df$OD == 1]) # Check bwt.fit = lm(bwt ~ gestation * OD, data = Babies.df) eovcheck(bwt.fit) cooks20x(bwt.fit) bwt.fit2 = lm(bwt ~ gestation * OD, data = Babies.df[-239, ]) cooks20x(bwt.fit2) bwt.fit3 = lm(bwt ~ gestation * OD, data = Babies.df[-c(239, 820), ]) cooks20x(bwt.fit3) bwt.fit4 = lm(bwt ~ gestation * OD + weight, data = Babies.df[-c(239, 820), ]) summary(bwt.fit4) bwt.fit5 = lm(bwt ~ gestation * OD + weight + height, data = Babies.df[-c(239, 820), ]) summary(bwt.fit5) # Let's create BMI from both of these measurements Babies.df$bmi = with(Babies.df, weight/(height^2) * 703) bwt.fit6 = lm(bwt ~ gestation * OD + weight + height + bmi, data = Babies.df[-c(239, 820), ]) summary(bwt.fit6) bwt.fit7 = lm(bwt ~ gestation * OD + height + bmi + not.first.born, data = Babies.df[-c(239, 820), ]) summary(bwt.fit7) bwt.fit8 = lm(bwt ~ gestation * OD + height + bmi + not.first.born + smokes, data = Babies.df[-c(239, 820), ]) summary(bwt.fit8) cooks20x(bwt.fit8) normcheck(bwt.fit8) confint(bwt.fit8)
conf1 = as.data.frame(t(abs(confint(bwt.fit8)[2,]))) resultStr1 = paste0(sprintf("%.2f", conf1$`2.5 %`), " to ", sprintf("%.2f", conf1$`97.5 %`)) conf2 = as.data.frame(t(confint(bwt.fit8)[8,])) resultStr2 = paste0(sprintf("%.2f", conf2$`97.5 %`), " to ", sprintf("%.2f", conf2$`2.5 %`)) conf3 = as.data.frame(t(abs(confint(bwt.fit8)[4,]))) resultStr3 = paste0(sprintf("%.2f", conf3$`2.5 %`), " to ", sprintf("%.2f", conf3$`97.5 %`)) conf4 = as.data.frame(t(abs(confint(bwt.fit8)[5,]))) resultStr4 = paste0(sprintf("%.2f", conf4$`2.5 %`), " to ", sprintf("%.2f", conf4$`97.5 %`)) conf5 = as.data.frame(t(abs(confint(bwt.fit8)[7,]))) resultStr5 = paste0(sprintf("%.2f", conf5$`97.5 %`), " to ", sprintf("%.2f", conf5$`2.5 %`)) conf6 = as.data.frame(t(abs(confint(bwt.fit8)[6,]))) resultStr6 = paste0(sprintf("%.2f", conf6$`97.5 %`), " to ", sprintf("%.2f", conf6$`2.5 %`))
Looking at the pairs plot, we saw that birthweight was related to a number of our explanatory variables. We will construct a multiple linear regression model with a suitable selection of the explanatory variables.
Observations 239 and 820 were found to be highly influential. They were deemed to be anomolous and were removed from the dataset.
The hockey stick relationship between gestational age and birthweight required allowing the age effect to differ depending on whether the baby was overdue, and was fitted by including an interaction term between age and overdue status. Moreover, we also decided to include body mass weight as an explanatory variable, but had to remore weight as an explanatory due to multicollinearity. All model assumptions were satisfied by our final model.
Using forward model selection (i.e., adding the most promising explanatory variables in turn), our final model is \begin{align} bwt_i = &~\beta_0 + \beta_1 \times gestation_i + \beta_2 \times OD_i + \beta_3 \times height_i~+ \beta_4 \times bmi_i + \beta_5 \times not.first.born_i +\ &~\beta_6 \times smokes_i + \beta_7 \times gestation_i \times OD_i + \epsilon_i, \end{align} where $\epsilon_i~iid \sim N(0, \sigma)$. Here our three indicator variables take the value 1 if the baby was overdue, not the first born, and the mother smokes, respectively.
Our model only explains about 31% of the variability in a baby's birthweight.
We wanted to build a model to explain the birth weight of babies.
Keeping all other varaiables constant:
r resultStr1[1] ounces per gestation day. After 42 weeks this will decrease by about r resultStr2[1]3 ounces per gestational day [NOTE: it might have been better had we changed the OD baseline].r resultStr3[1] ounces, on average.r resultStr4[1] ounces, on average.r resultStr5[1] ounces, on average.r resultStr6[1] ounces, on average.Is there significant evidence that expected birthweight decreases with increasing gestational age for babies that are overdue? Provide a confidence interval for this effect.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.