library(learnr)
require(JM)
require(memisc)
library(foreign)
pregdata <- read.spss('www/pregdata.sav',to.data.frame = TRUE)
knitr::opts_chunk$set(echo = FALSE)

Quiz

Functions

The questions below test your knowledge on the creation of functions.

quiz(
  question("How can a function with two parameters that calculates the sum of those parameters be defined?",
    answer("`vector fn(x, y){x + y}`"),
    answer("`function fn(x, y){return(x + y)}`"),
    answer("`fn<-function(x, y){x + y}`", correct=TRUE),
    answer("`fn(x, y)<-function{return(x + y)}`"),
    answer("None of the above.")
    )
)

Consider the following code:

x<-3
fn <- function(y=3){
  return(x/y)
  x<-9
}
x<- 6
fn()
quiz(
  question("What is the return value of the function call?",
    answer("1"),
    answer("2", correct=TRUE),
    answer("3"),
    answer("9"),
    answer("Some other number"),
    answer("The function does not run because it has an error.")
    )
)

Writing functions

We will now write our own functions that help in the analysis of the pregdata dataset.

Writing a function to analyse data

Again the data is available as the data.frame pregdata. Use the summary function to get a quick overview of the variables in the dataset. Also use the summary function to just look at the Age of the Mothers.


summary(...)
summary(...$...)
summary(pregdata)
summary(pregdata$AgeMother)

We will try and make a function that gives a simular overview of a data set. Like the summary function of R we want to make the type of output dependant on the type of the input. When we use a categorical variable we see a frequency table of the different categories and when we use a contineous variable we see the mean and some quantiles. When a whole data.frame is used we get information about all variables in the data.frame.

Types of variable

We can see whether an object is a data.frame using the function is.data.frame. To check whether an object is a factor we can use is.factor and with is.numeric we check if we are dealing with a numeric variable. (Be carefull, the function is.vector does not behave in the same way it checks if an object is only a vector and does not have any other attributes such as a class, so is.vector(as.factor(c(1,2,3))) returns FALSE.) Try out these functions on some of the variables in the data.frame.


# Try out these functions on some of the variables in the data.frame
is.data.frame(...)
is.factor(...)
is.numeric(...)
is.data.frame(pregdata)
is.data.frame(pregdata$BMI_Mo)
is.factor(pregdata$Sex)
is.factor(pregdata$BMI_Mo)
is.numeric(pregdata$Sex)

Writing our own function

To start with our own summary variant, we define a function with one parameter (let's call it x) that checks if it is a data.frame a factor or a numeric variable. Use the if statement for this. When you want to display results from within the function you have to use the print function explicitly.

mysummary <- function(x){

}
mysummary(c(1,2))
mysummary(as.factor(c('a', 'b')))
mysummary <- function(x){
  if(...){
    print('A data.frame')
  }else if(...){
    print('This is a factor')
  }else if(...){
    print('This variabele is numeric')
  }
}
mysummary(c(1,2))
mysummary(as.factor(c('a', 'b')))
mysummary <- function(x){
  if(is.data.frame(x)){
    print('A data.frame')
  }else if(is.factor(x)){
    print('This is a factor')
  }else if(is.numeric(x)){
    print('This variabele is numeric')
  }
}
mysummary(c(1,2))
mysummary(as.factor(c('a', 'b')))

Descriptive statistics categorical and continuous variables

We now have a function that tells us what kind of input variable we have. However, we wanted to make a function that gives us some descriptive statistics. We therefore modify the function in such a way that a frequency table is given for factors and the mean and sd are given for numeric data. Use the functions table(.), mean(.) and sd(.)for this.

mysummary <- function(x){
  if(is.data.frame(x)){
    print('A data.frame')
  }else if(is.factor(x)){
    print('This is a factor')
  }else if(is.numeric(x)){
    print('This variabele is numeric')
  }
}
# Complete the syntax below
mysummary <- function(x){
  if(is.data.frame(x)){
    print('A data.frame')
    print('No summary yet')
  }else if(is.factor(x)){
    # calculate summary for categorical variable
    print('This is a factor')
    print(the_summary_cat)
  }else if(is.numeric(x)){
    #calculate the summary for the numeric variable
    print(the_summary_cont)
  }
}
mysummary <- function(x){
  if(is.data.frame(x)){
    print('A data.frame')
    print('Not yet implemented')
  }else if(is.factor(x)){
    print('This is afactor')
    print(table(x))
  }else if(is.numeric(x)){
    print('This variabele is numeric')
    print('mean')
    print(mean(x, na.rm = TRUE))
    print('sd')
    print(sd(x[!is.na(x)]))
  }
}
mysummary(pregdata$CRL.1)
**Note:** There are other ways of making functions do different things depending on the class of the arguments using object oriented programming we will not discuss that here.

Descriptive statistics for a data.frame

Our function still does not do anything usefull when we apply it on the whole data.frame. This can be solved in the following way: whenever the function is called using a data.frame as argument we use a for loop to loop over its columns. Within this loop we let the function call itself but now using the column as argument. Try this below:

mysummary <- function(x){
  if(is.data.frame(x)){
    print('Een data.frame')

  }else if(is.factor(x)){
    print('dit is een factor')
    print(table(x))
  }else if(is.numeric(x)){
    print('deze variabele is numeriek')
    print('mean')
    print(mean(x, na.rm = TRUE))
    print('sd')
    print(sd(x[!is.na(x)]))
  }

}

mysummary(pregdata)
**Note:** Functions that call themselves are called recursive functions.
mysummary <- function(x){
  if(is.data.frame(x)){
    print('A data.frame')
    for( )){

    }     
  }else if(is.factor(x)){
    print('A factor')
    print(table(x))
  }else if(is.numeric(x)){
    print('This variable is numeric')
    print('mean')
    print(mean(x, na.rm = TRUE))
    print('sd')
    print(sd(x[!is.na(x)]))
  }
}

mysummary(pregdata)
mysummary <- function(x){
  if(is.data.frame(x)){
    print('A data.frame')
    for(i in 1:NCOL(x)){
      print(names(x)[i])
      mysummary(x[,i])
    }     
  }else if(is.factor(x)){
    print('A factor')
    print(table(x))
  }else if(is.numeric(x)){
    print('This variabele is numeric')
    print(mean(x, na.rm = TRUE))
    print('sd')
    print(sd(x[!is.na(x)]))
  }
}
mysummary(pregdata)

Visualisation

Here we will learn more about making graphics.

Histograms and boxplots

One way to look at the distribution of a variable is by making a histigram. You can do this using the function hist.

Make a histogram of the BMI of the mothers in pregdata (BMI_Mo)


# Complete the syntax below
hist(...)
hist(pregdata$BMI_Mo)

We can make the graph more beautifull by adding a title and replacing the x-axis by something more usefull. For this you can use the parameters main and xlab. In addition make the color of the bars of the histogram blue by using col='blue' Try it out below.


hist(pregdata$BMI_Mo,
     main=,
     xlab= ,
     col=)
hist(pregdata$BMI_Mo,
     main='Histogram of the mothers BMI',
     xlab='BMI',
     col='blue')

When you have categorical data you can make a bar plot. In R this can be done by the function barplot. It's easiests to make a frequency table first and use that as the argument to the barplot function. Please try it out yourself using the variable that contains the sex of the children.


tab <- table(...)
barplot(...)
tab <- table(pregdata$Sex)
barplot(tab)

Scatterplots

When you want to look at the relation between two continuous variables we can make a scatter plot. In R this is done with the function plot. This function can be used in several ways but here we simply use the parameters x and y for to specify the variables we want to plot on the x and y axis respectively. Try it out and make a scatterplot with the BMI of the mother on the x-axis and the birthweight on the y-axis.

plot(x=, y=)
plot(x = pregdata$BMI_Mo, y = pregdata$BW)

An other way in which we can use the plot function is by using a formula. This is a frequently used notation common to many R-functions. First the dependant variable is specified, followed by a ~ sign, finally the independant variable are specified. For example BW ~ BMI_Mo. Now a separate parameter is used to specify the data.frame that is used. Try it yourself.

plot(,,data=)
plot(BW ~ BMI_Mo, data = pregdata)

It is possible to specify your own labels for the x and y-axis using the parameters xlab en ylab. The title can be specified using main. The colors and symbols that are used to plot the observations can be changed using col (try 'orange' ) and pch (try and set this to 2).


plot(BW ~ BMI_Mo, data = pregdata,
     xlab="BMI Mother",
     ylab="Birth Weight",
     col="orange",
     pch=2,
     main="Relation between birth weight and BMI Mother")

Now change the color of the symbols to lightblue for boys and pink for girls. Also use different plotting symbols.


the_pch <- ifelse(pregdata$Sex=='boy', 16, 15)
the_col <- ifelse(pregdata$Sex=='boy', 'lightblue', 'pink')

plot(BW ~ BMI_Mo, data = pregdata,
     xlab="BMI Mother",
     ylab="Birth Weight",
     col=the_col,
     pch=the_pch,
     main="Relation between birth weight and BMI Mother")

If BMI is above 30 the women are classified as Obese. Draw a reference line in the plot at this value using line type lty=2 and line width lwd=2. In addition make the labels of the plot a little bigger using cex.lab=1.2.


the_pch <- ifelse(pregdata$Sex=='boy', 16, 15)
the_col <- ifelse(pregdata$Sex=='boy', 'lightblue', 'pink')

plot(BW ~ BMI_Mo, data = pregdata,
     xlab="BMI Mother",
     ylab="Birth Weight",
     col=the_col,
     pch=the_pch,
     main="Relation between birth weight and BMI Mother",
     cex.lab=1.2)

abline(v=30, lty=2, lwd=2)

Statistical Tests and Models

Commonly used statistical tests and models.

Statistical tests {data-progressive=TRUE}

Investigate the distributions of the variables serum bilirubin and serum cholesterol with histograms using the pbc2.id data set. Then perform the appropriate test(s) to investigate the differences between the groups placebo and D-penicil. Finally, extract the p-value of the tests:


hist(pbc2.id$serBilir)
hist(pbc2.id$serChol)
test_serBilir <- wilcox.test(pbc2.id$serBilir ~ pbc2.id$drug)
test_serChol <- t.test(pbc2.id$serChol ~ pbc2.id$drug)
test_serBilir$p.value
test_serChol$p.value

Using the same data set, investigate the association between sex and drug. Extract the statistic:


tbl <- table(pbc2.id$sex, pbc2.id$drug)
sexDrug_test <- chisq.test(tbl)
sexDrug_test$statistic

Regression models {data-progressive=TRUE}

Using the same data set, perform a univariable and multivariable regression analysis using as outcome the variable serum bilirubin and predictions the variables sex, drug, age, ascites, hepatomegaly, spiders, edema. Print the summary of the models:


fm1a <- lm(serBilir ~ sex, data = pbc2.id)
fm1b <- lm(serBilir ~ drug, data = pbc2.id)
fm1c <- lm(serBilir ~ age, data = pbc2.id)
fm1d <- lm(serBilir ~ ascites, data = pbc2.id)
fm1e <- lm(serBilir ~ hepatomegaly, data = pbc2.id)
fm1f <- lm(serBilir ~ spiders, data = pbc2.id)
fm2 <- lm(serBilir ~ sex + drug + age + ascites + hepatomegaly + spiders, data = pbc2.id)
lapply(list(fm1a, fm1b, fm1c, fm1d, fm1e, fm1f, fm2), function(x) summary(x))

Use a scatterplot to investigate the association between serBilir and age (pbc2.id data set). Assuming the multivariable model from the previous question, investigate whether a linear or a nonlinear age is needed. For the nonlinear assume a quadratic and a cubic term:


plot(pbc2.id$serBilir, pbc2.id$age)
fm2 <- lm(serBilir ~ age + drug + sex + ascites + hepatomegaly + spiders, data = pbc2.id)
fm2_nonlinear <- lm(serBilir ~ age + I(age^2) + I(age^3) + drug + sex + ascites + hepatomegaly + spiders, data = pbc2.id)
anova(fm2, fm2_nonlinear)
**Hint:** To compare models use the function `anova(...)`.

Assuming the pbc2.id data set, create a new categorical variable (serCholCat) for serum cholesterol as low = [minimum serum cholesterol until median serum cholesterol) and high = [median serum cholesterol until maximum serum cholesterol]. Fit a logistic regression assuming serCholCat as the outcome and age, sex and their interaction as the covariates. Extract the coeficients and p-values:


pbc2.id$serCholCat <- as.numeric(pbc2.id$serChol >= median(pbc2.id$serChol, na.rm = T))
pbc2.id$serCholCat <- factor(pbc2.id$serCholCat, levels = c(0, 1), labels = c("low", "high"))
gm1 <- glm(serBilir ~ age * sex, data = pbc2.id)
res <- summary(gm1)
res$coefficients[, c(1, 4)]
**Hint:** To investigate the options you can extract, save the summary on a new object and use `$`.

Perform a multivariable regression analysis using as outcome the variable serum bilirubin and predictions the variables sex, drug, age (pbc2.id data set). Obtain the sum of squared errors:


fm3 <- lm(serBilir ~ age + sex + drug, data = pbc2.id)
sum(resid(fm3)^2)
**Hint:** Sum of squared errors is $\sum_{i=1}^n \epsilon_i^2$.

Use linear regression to examine the relation between the BMI of mothers and birthweight assuming the pregdata:


lm(BW ~ BMI_Mo, data = pregdata)

R only gives you the parameter estimates of the intercept and the slope by default. Often we also want to see the p-values and confidence intervals. For this we first store the result of the regression analysis performed previously in a variable. Now we can use the functions confint and summary on thus newly created variable.


lm1 <- lm(BW ~ BMI_Mo, data = pregdata)
confint()
summary()
lm1 <- lm(BW ~ BMI_Mo, data = pregdata)
confint(lm1)
summary(lm1)

QUIZ - Statistical Tests and Models

Some questions have more than one correct answers.

quiz(
  question("Which of the following excludes the intercept?",
    answer("lm(y ~ x1 + x2, data = data)"),
    answer("lm(y ~ x1 * x2 - 1, data = data)", correct = TRUE),
    answer("lm(y ~ -1 + x1 * x2, data = data)", correct = TRUE),
    answer("lm(y ~ -1 + x1 + x2, data = data)", correct = TRUE)
  ),
  question("Which of the following codes will investigate the association of the explanatory variable age (as a continuous variable) with sex (as response variable) accounting for drug from the pbc2.id data set?",
    answer("glm(sex ~ age + drug, data = pbc2.id)", correct = TRUE),
    answer("glm(sex ~ age, data = pbc2.id)"),
    answer("lm(age ~ sex + drug, data = pbc2.id)"),
    answer("glm(sex ~ age + drug + sex, data = pbc2.id)")
  ),
  question("We want to fit a linear model in which we model y with the main effects and the interaction of treatment arm and time. All variables are included in the data.frame df. Which of the following codes does this correctly?",
    answer("lm(y ~ time - arm,  data = df)"),
    answer("lm(y ~ time * arm, data = df)", correct = TRUE),
    answer("lm(y ~ time + arm, data = df)"),
    answer("lm(y ~ time + arm + time : arm, data = df)", correct = TRUE)
  ),
  question("Examine the following code in R: 'fm1 <- lm(height ~ age:sex, data = dat)'. What does this code do?",
    answer("Performs a linear regression including the main effects of age and sex"),
    answer("Performs a linear regression including the interaction term of age and sex", correct = TRUE),
    answer("Performs a linear regression including the main effects of age and sex and their interaction term"),
    answer("Performs a logistic regression including the interaction term of age and sex")
  )
)


stenw/BST02R documentation built on May 26, 2019, 4:35 a.m.