library("learnr") load("Data.RData") knitr::opts_chunk$set(echo = FALSE) options(tutorial.exercise.timelimit = 1200)
The following questions test your knowledge in the introductory chapter of the principle of maximum likelihood, hypothesis testing, confidence intervals and regression models.
Three statements about p-values are:
quiz( question("Which of the following is correct?", answer("Statement 1 is correct, and the others wrong."), answer("Statement 2 is correct, and the others wrong."), answer("Statement 3 is correct, and the others wrong."), answer("Only Statements 1 and 3 are correct."), answer("None of the statments is correct.", correct = TRUE), allow_retry = TRUE, random_answer_order = FALSE ) )
The maximum likelihood estimation methods gives us consistent estimates.
quiz( question("What this means?", answer("The esimates we obtain are unbiased."), answer("The estimates we obtain equal the true value in the population."), answer("The estimates we obtain comes closer to the true value in the population as the sample size increases.", correct = TRUE), answer("The estimates we obtain are relatively close to the true value in the population."), allow_retry = TRUE, random_answer_order = TRUE ) )
quiz( question("What happens as sample size increases?", answer("P-values become smaller and confidence intervals narrower.", correct = TRUE), answer("P-values become larger and confidence intervals wider."), answer("P-values become smaller and confidence intervals wider."), answer("P-values become larger and confidence intervals narrower."), allow_retry = TRUE, random_answer_order = TRUE ) )
quiz( question("In the context of hypothesis testing for regression models, which of the following statements is correct?", answer("With the Wald test we can only test one coefficient at a time."), answer("With the likelihood ratio test we can compare nested models.", correct = TRUE), answer("With the score test we fit the model under the null hypothesis, and hence we can only conclude if the null is correct."), answer("With the Wald test we fit the model under the alternative hypothesis, and hence we can only conclude if the alternative hypothesis is correct."), allow_retry = TRUE, random_answer_order = FALSE ) )
The purpose of this practical is to illustrate how standard linear regression models can be performed in R.
The following questions are based on the PBC dataset. This dataset is available as the
object pbc2.id
and is already loaded in this session. From this dataset we will use the
following variables:
serBilir
: baseline serum bilirubin in mg/dl.
drug
: the treatment indicator with values 'placebo' and 'D-penicil'.
age
: baseline age in years.
sex
: the sex indicator with values 'male' and 'female'.
serChol
: baseline serum cholesterol in mg/dl.
prothrombin
: baseline prothrombin time in sec.
We want to see how the log-transformed serum bilirubin is related with the rest of
the variables. Start by fitting an additive linear regression model, using function
lm()
, and interpret the results you obtain from the summary()
method. Name the fitted
model fit1
.
# Linear regression models are fitted with function lm(). # As first argument you need to provide a formula of the model you want to fit. # Specify also the 'data' argument that is the database containing the variables you # want to use - here is 'pbc2.id'
# The code for the additive model is: fit1 <- lm(log(serBilir) ~ drug + age + sex + serChol + prothrombin, data = pbc2.id) summary(fit1)
Check the residuals of model fit1
using the plot()
method; before calling plot()
,
set par(mfrow = c(2, 2))
in order to obtain the four basic plots in one figure.
fit1 <- lm(log(serBilir) ~ drug + age + sex + serChol + prothrombin, data = pbc2.id)
# Just first execute par(mfrow = c(2, 2)) and then use plot() in the model from the # previous question
# The code is: par(mfrow = c(2, 2)) plot(fit1)
We believe that the association between serBilir
and each one of age
, serChol
and
prothrombin
could be different between males and females. Extend the previous model
fit1
to accommodate this. Use again the summary()
method to get a detailed output and
interpret the results. Name the fitted model fit2
.
# We will actually need to include the interaction terms between 'sex' and each one of # 'age', 'serChol' and 'prothrombin' # An interaction term between two variables is included using the ':' symbol, e.g., 'sex:age'
# The code is: fit2 <- lm(log(serBilir) ~ drug + age + sex + serChol + prothrombin + sex:age + sex:serChol + sex:prothrombin, data = pbc2.id) # To write exactly the same model but in a shorter syntax, we can use the '*' symbol # This symbol includes both the main effects and interaction terms between the two variables. # Hence, a shorter code is: fit2 <- lm(log(serBilir) ~ drug + sex * (age + serChol + prothrombin), data = pbc2.id) summary(fit2)
We would like to statistically test whether the extra interaction terms improve the
fit of the model. Using an F-test (the analogous of the likelihood ratio test for linear
regression models), compare the interaction model fit2
with the additive model fit1
.
In R this is done using the anova()
function.
fit1 <- lm(log(serBilir) ~ drug + age + sex + serChol + prothrombin, data = pbc2.id) fit2 <- lm(log(serBilir) ~ drug + sex * (age + serChol + prothrombin), data = pbc2.id)
# Just use the anova() function giving as arguments the two models
# The code is: anova(fit1, fit2)
Given that the F-test suggests that some of the interaction terms are significant, we
proceed to see which of the interactions terms seem to improve the model. To find these, we could
directly look at the individual p-values from the summary()
output of the last model
fit2
. However, one issue is that these p-values are not corrected for multiple testing.
Using the p.adjust()
function obtain the adjusted p-values.
fit2 <- lm(log(serBilir) ~ drug + sex * (age + serChol + prothrombin), data = pbc2.id)
# First save the output of summary() in an object, e.g., 'summ_fit2'
# The code is (this code just saves the output of summary() in the object 'summ_fit2' and # does not produce any output): summ_fit2 <- summary(fit2)
# Use function coef() to extract the coefficients' table from 'summ_fit2'
# The code is: coef(summ_fit2)
# Hence, the p-values are in the 4th (last) column of the table. Extract the # p-values of the interaction tersms from the table using the '[' for indexing
# The code is: coef(summ_fit2)[7:9, 4]
# Finally, give these p-values to p.adjust()
# The code is: p.adjust(coef(summ_fit2)[7:9, 4])
In a second analysis, the researchers are interested in studying the relationship between
the natural logarithm of serum bilirubin and serum cholesterol, corrected for age and sex.
It is believed that the relationship may be nonlinear. Use a 3rd degree polynomial of
serum cholesterol to explore this. Name the fitted model fit1
.
# To include in an R formula polynomials use either the I() or the poly() functions; # preferable is the latter. For example, to include the cubic effect of variable 'x', # use poly(x, 3) # A feature when using poly() is that you need first to exlude and missing data there are # This can be done with the complete.cases() function, e.g., data[complete.cases(data$x), ] # removes the rows of 'data' for which 'x' is missing. # With this information, proceed to fit the model described in the question.
# The code is: fit1 <- lm(log(serBilir) ~ poly(serChol, 3) + age + sex, data = pbc2.id[complete.cases(pbc2.id$serChol), ]) summary(fit1)
Investigate whether the relationship is truly nonlinear by first fitting the model that
assumes linearity (null hypothesis), and following comparing this model with the previous
model fit1
(alternative hypothesis) using an F-test via the anova()
function.
fit1 <- lm(log(serBilir) ~ poly(serChol, 3) + age + sex, data = pbc2.id[complete.cases(pbc2.id$serChol), ])
# Simply fit the linear model that contains 'serChol', 'age' and 'sex', and compare this # with the previous model # Note: this linear model needs to be fitted in the same version of the dataset as the # nonlinear model, i.e., excluding the rows with missing serum cholesterol values
# The code is: fit2 <- lm(log(serBilir) ~ serChol + age + sex, data = pbc2.id[complete.cases(pbc2.id$serChol), ]) anova(fit2, fit1)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.