library(learnr)
library(JM)
library(memisc)
library(foreign)
#data("pain", package = "BST02R")
pain <- readRDS('www/pain.Rds' )
pregdata <- read.spss('www/pregdata.sav',to.data.frame = TRUE, max.value.labels=8, stringsAsFactors=FALSE)
knitr::opts_chunk$set(echo = FALSE)

Descriptive statistics

In the next couple of question we will examine a data set called pregdata. It is loaded and available at the beginning of this excersize. To get to know the contents of this data set. Look at the first couple of observations (using e.g. pregdata[1:4,] or the head function).

cat(
# <p style="border:1px; border-style:solid; padding: 1em; border-color:#D11450">
# **NB:**
'- In practice we have to load a data set before we can use it. The way in which we can do this depends on the format of the data set. It it is an SPSS data set we can use the `read.spss()` function from the `foreign` package.
 - In RStudio we can also inspect the contents of a data set using the`View()` function
 - Press the hint button if you need help. The last hint reveals the answer.'
# </p>
)
# Inspect the first few observations
# Read the question again and fill in the dots of either line below.
pregdata[...,]
head(...)
pregdata[1:4,]
head(pregdata)

If we just want to know the variable names of the variables in the data.frame we can also use the function names(). If we also want to know about the variables types we can use the function str(). PLease try this out yourself. What is the class of the variable MoWestern and what us the mode? (First try to answer this question with the results you already have otherwise use the mode and class function)

# which variables are in the data set?
# Read the question again and fill in the dots of either line below.names(...)
str(...)
names(pregdata)
str(pregdata)

Mean and standard deviation

We will now examine the BMI of the mothers (the variable is called BMI_Mo). Let's store the BMIs in a separate variable using the statement bmi <- pregdata$BMI_Mo. We can check the number of observations in this vector using the function length().

# Try it out yourself
bmi <- pregdata$BMI_Mo
length(bmi)

Using the function mean() we can compute the mean of the vector (i.e. of the BMIs). In the same way the function sd() gives us the standard deviation. Try it out.

bmi <- pregdata$BMI_Mo
# Compute the mean and the standard deviation of the BMIs
# Read the question again and fill in the dots of either line below.names(...)
mean(...) # compute the mean
sd(...) # compute the standard deviation
mean(bmi) # compute the mean
sd(bmi) # compute the standard deviation

Remember that you can read the documentation on any function Using the function help. For example help('sd') gives you the documentation on the sd function.

Actually it is not really needed to make a new variable we also could have directly used mean(pregdata$BMI_Mo) and sd(pregdata$BMI_Mo). Do this yourself below. Also do this for the age of the mother (AgeMother).

bmi <- pregdata$BMI_Mo
# compute the mean and sd of bmi and age without making extra variables
mean(pregdata$BMI_Mo) # mean
sd(pregdata$BMI_Mo) # sd
mean(...)
sd(...)
mean(pregdata$BMI_Mo) # mean
sd(pregdata$BMI_Mo) # sd
mean(pregdata$AgeMother)
sd(pregdata$AgeMother)

A selection

We will now focus on mothers that give birth to a boy. Make the selection using: sel <- pregdata[pregdata$Sex=='boy',].


sel <- pregdata[pregdata$Sex=='boy',]

How many mothers are included in this selection. Wat is their average age?

bmi <- pregdata$BMI_Mo
sel <- pregdata[pregdata$Sex=='boy',]

length(...) # or use nrow or dim()
mean(...) 
length(sel$AgeMother) # or use nrow or dim()
mean(sel$AgeMother) 

Minimum, Maximum and median

What are the minimum, the maximum and the median BMI of the women that give birth to a girl? Use the functions min, max and median.


#sel <- pregdata[pregdata$Sex==...,]
#min(...)
#max(...)
#median(...)
sel <- pregdata[pregdata$Sex=='girl',]
min(sel[,'BMI_Mo'])
max(sel[,'BMI_Mo'])
median(sel[,'BMI_Mo'])

Sorting

Sorting a vector

To sort a vector in R you can use the function sort. Use decreasing = TRUE to sort in reverse order. Try sorting the BMIs in pregdata.


sort(pregdata$BMI_Mo)

Sorting a data.frame

Sorting a data.frame is slightly more complicated. We need to use the function order. The result of this function is a permutation that, when used as an index, puts the data.frame in the correct order. Try out what the function order does using pregdata below

o <- order(pregdata$BMI_Mo)
print(o)
sorted <- pregdata[o, ]
head(sorted)
tail(sorted)

The function sort can also be used on multiple columns. Try sorting pregdata on ethnicity (MoWestern) and Sex.


o <- order(pregdata$MoWestern, pregdata$Sex)
sorted <- pregdata[o, ]
head(sorted)
tail(sorted)

ranking

A function that is related to order is rank. This function gives the sample ranks of an input vector. So 1 for the smallest element, 2 for the next smallest and so on. See if you understand the next example

rank(c(8,5,4,2))

Now use the rank function to find the sample rank of the age of the first mother in the pregdata data set.


rank(pregdata$AgeMother)[1]

Duplicates and unique values

When you want to see if a value is duplicated in a vector you can use the function duplicated. Look up the documentation of this function and use it to select the first measuremnt of every subject in the pbc2 data set.Print the data for the first 10 patients.


pbc2[!duplicated(pbc2$id), 1:10]

A related function is the function unique(). Use it to select the unique patient numbers


unique(pbc2$id)

Sequences

If you have to enter a vector with many elements working with c becomes tiresome. However if the values follow a (regular) sequence it is often eaasy to create them by using the function seq.

seq(1, 10)
seq(1, 10, by = 2)
seq(1, 10, length.out = 5)
2:5   # same as seq(2,5)

Let's use a sequence to plot a function. create a sequence of length 100 from -5 to to 5 and call this x. Now calculate sin(x) and call this y. Finally plot x vs y.

# Calculate x and y as described above
# plot(x=x, y=y)
# Calculate x and y as described above
x <- seq(-5, 5, length.out = 100)
y <- sin(x)
plot(x=x, y=y)

We will discuss plots in more detail later on in the course.

When you want to repeat a vector multiple times the rep function can help. Try it out below.

rep(c('Active', 'Control'), times=4)
rep(c('Active', 'Control'), length.out=6)
rep(c('Active', 'Control'), each=5)
rep(c('Active', 'Control'), c(1,3))

Working with text data

In this course we mainly work with numerical data (or factors). Sometimes however, we also have to work with text data (type character in R, also sometimes called 'strings'). When you want to combine character variables you can use the paste or paste0 functions.

paste(c('foo', 'bar'), c('A', 'B'))
paste(c('foo', 'bar'), c('A', 'B'), sep='-')
paste(c('foo', 'bar'), c('A', 'B'), sep='-', collapse = '_')
paste0(c('foo', 'bar'), c('A', 'B'))

When you want to extract some of the characters of a character variable you can use the substr or strsplit functions.

substr(c('foo-delete this', 'bar-this should be removed'), start=1, stop=3)
strsplit(c('foo-delete this', 'bar-this should be removed'),split = '-')

Now use the substr function together with as.numeric to convert the patient id variable of pregdata to numeric format.


pregdata$newid <- as.numeric(substr(pregdata$id,3,5))
pregdata$newid[1:10]
mode(pregdata$newid)

If you need to work a lot with character data, take a look at the stringr library.

Statistical distributions

If you want to generate random normally distributed numbers (for example for a statistical simulation) you can use the function rnorm. Try it out yourself.

rnorm(5)
x <- rnorm(1000)
mean(x)
sd(x)
x <- rnorm(1000, mean = 8, sd = 2)
mean(x)
sd(x)

In the same way we can generate data from other statistical distributions using e.g. rt (t-distribution, like standard normal but with thinker tails), rbinom (for binary outcomes), rpois (for counts), rchisq (often used for sums of squares in statistical tests), etc...

rt(5, df=3)
rbinom(5, size = 1, prob = 0.4)
rbinom(5, size = 5, prob = 0.4)

Besides these function for generating random numbers. R also contains functions for the, density cumulative distribution and quantile functions for these distributions. Here the r at the beginning of the function names needs to be replaced by 'd', 'p' and 'q' respectively.

Use the qnorm function to find the 95th percentile of the normal distribution


qnorm(p=)
qnorm(p=0.95) 

The probability of the number of 'heads' on n coin tosses with a fair coin can be calculated using the binomial probability withparameter prob set to 0.5. Now try to use the dbinom function to find the probability of 2 heads in 4 tosses. Use ?binom if needed.


dbinom(x = , prob = , size = )
dbinom(x =2 , prob =0.5 , size = 4)

Creating your own functions {data-progressive=TRUE}

Function making a selection {data-progressive=TRUE}

We want to create a function that takes a data frame as input and selects all rows for which a certain variable ('AgeMother' by default) inside the data set is above a cutoff value (35 by default).

Complete the function:

myfn <-
function(dataset, varname='AgeMother', cutoff=35){
# complete the function
  return(rv)  
}
myfn(pregdata, 'AgeMother')
myfn <-
function(dataset, varname='AgeMother', cutoff=35){
  rv <- dataset[..., ]  # figure out what should be on the ...
  return(rv)
}
myfn(pregdata, 'AgeMother')
myfn <-
function(dataset, varname='AgeMother', cutoff=35){
  rv <- dataset[dataset[,...]>..., ]  
  # figure out what should be on the ...
  return(rv)
}
myfn(pregdata, 'AgeMother')
# Use:
myfn <- 
function(dataset, varname='AgeMother', cutoff=35){
  rv <- dataset[dataset[,varname] > cutoff , ]
  return(rv)
}
myfn(pregdata, 'AgeMother')

Extending the function {data-progressive=TRUE}

Now extend the function so it no longer gives the selection as return value but make it generate a frequency table of Sex for the selection.

# Change the function below
myfn <- 
function(dataset, varname='AgeMother', cutoff=35){
  rv <- dataset[dataset[,varname] > cutoff , ]
  return(rv)
}
myfn(pregdata, 'AgeMother')
myfn <-
function(dataset, varname='AgeMother', cutoff=35){
  rv <- dataset[dataset[,varname] > cutoff , ]
  rv <- table(rv[ , ...] )
  return(rv)
}
myfn(pregdata, 'AgeMother')
myfn <-
function(dataset, varname='AgeMother', cutoff=35){
  rv <- dataset[dataset[,varname] > cutoff , ]
  rv<-table(rv[ , 'Sex'] )
  return(rv)
}
myfn(pregdata, 'AgeMother')

Make the function more general {data-progressive=TRUE}

As a final exersize change the function in such a way so we do not always display the frequencies for Sex. Instead we want to choose the variable of which to make frequencies

# Change the function below
myfn <-
function(dataset, varname='age', cutoff=35){
  rv <- dataset[dataset[,varname] > cutoff , ]
  rv<-table(rv[ , 'Sex'] )
  return(rv)
}
myfn(pregdata, 'AgeMother')
myfn(pregdata, 'AgeMother', varname2='Death')
myfn <-
function(dataset, varname='age', cutoff=35, varname2= 'Sex'){
  rv <- dataset[dataset[,varname] > cutoff , ]
  rv<-table(rv[ , varname2] )
  return(rv)
}
myfn(pregdata, 'AgeMother')
myfn(pregdata, 'AgeMother', varname2='Death')

Results of a regression analysis

Using the pregdata data set we have caried out linear regression analysis (discussed later in the course in more detail) were we have estimated the coefficients in the equation: $BW = \beta_0 +\beta_1 Sex + \beta_2 Age$. A summary of the results is stored in the object slm1.

lm1<-lm(BW~ Sex + AgeMother, data=pregdata)
slm1 <- summary(lm1)

What is the class of slm1?


class(slm1)

Extract the coefficients {data-progressive=TRUE}

The coefficients (and their SE and p-values) are stored in the element named coefficients? What is its class? Can you extract the coefficient of the age and store it in a separate variable?


class(slm1$coefficients)
beta2 <- slm1$coefficients['AgeMother','Estimate']

QUIZ - Functions

quiz(
  question("How do we denine a function fn that takes two vectors as arguments and returns the sum:",
    answer(" vector fn(x, y){x + y}"),
    answer(" function fn(x, y){return(x + y)}"),
    answer("fn<-function(x, y){x + y}", correct = TRUE),
    answer(" fn(x, y)<-function{return(x + y)}")
  ),
  question("We have the following code:
x<-3 \n
fn <- function(y=3){ \n
  return(x/y) \n
  x<-9 \n
} \n
x<- 6 \n
fn() \n

What if anything is returned by the function call?
",
    answer("1"),
    answer("2", correct = TRUE),
    answer("3"),
    answer("9")
  ),
  question("


A matrix M is defined as: \n

M <- matrix(1:12, 3) \n

What is returned by:

length(M)
",
    answer("3"),
    answer("4"),
    answer("12", correct = TRUE),
    answer("NULL"),
    answer("we get an error")
    ),
  question("How can we remove all duplicated values from a vector x?",
    answer(" x[unique(x)]"),
    answer("!duplicated(x)"),
    answer(" x[!duplicated(x)]", correct = TRUE),
    answer(" complete.cases(x)")
  )
)

Loop, Apply, Merge

Understanding how loops, apply, merge work.

Basic loops {data-progressive=TRUE}

Implement a simple version of guess the number game using a while loop. The user should guess a number between 1 and 10. The loop should stop if the user guesses 5. Print all the numbers:


x <- sample(1:10, 1)
while (x != 5){
  x <- sample(1:10, 1)
  print(x)
}
**Hint:** Use the function `while(...)`. To print use the function `print(...)`.

Use a for loop to simulate the flip a coin twenty times, keeping track of the individual outcomes (1 = heads, 0 = tails) in a vector that you preallocate (y):


y <- numeric(20)
for (i in 1:20) {
  y[i] <- sample(0:1, 1)
}
**Hint:** To preallocate use the function `numeric(...)`. To sample use the function `sample(...)`.

Write a while loop that prints out standard random normal numbers but stops if you get a number bigger than 1. Print all the numbers:


x <- rnorm(1)
while(x <= 1) {
  x <- rnorm(1)
  print(x)
}
**Hint:** To sample a normal number use the function `rnorm(...)`. Use the function `print(...)` to print.

Apply family {data-progressive=TRUE}

Create the matrix dataset1 <- cbind(A = 1:30, B = sample(1:100, 30)) and find the row means of dataset1:


dataset1 <- cbind(A = 1:30, B = sample(1:100, 30))
apply(dataset1, 1, mean)
**Hint:** Use the `apply(...)` function.

Create the matrix dataset2 <- data.frame(A = 1:30, B = sample(1:100, 30), C = sample(0:1, 30, replace = T)) and find the means per group (indicated in the C column) of dataset2:


dataset2 <- data.frame(A = 1:30, B = sample(1:100, 30), C = sample(0:1, 30, replace = T))
tapply(dataset2$A, dataset2$C, mean)
**Hint:** Use the `tapply(...)` function.

Create a list with the name myList2 which consists of four matrices as given below. Extract the second column from each element (from each single matrix) :
first = matrix(38:67, 3)
second = matrix(56:91, 3)
third = matrix(82:147, 3)
fourth = matrix(46:95, 5)
myList2 = list(first, second, third, fourth)


first = matrix(38:67, 3)
second = matrix(56:91, 3)
third = matrix(82:147, 3)
fourth = matrix(46:95, 5)
myList2 = list(first, second, third, fourth)
lapply(myList2,"[", , 2)
**Hint:** Use the `lapply(...)` function.

Create the following function DerivativeFunction <- function(x) {log10(x) + 10}. Apply the "DerivativeFunction" to dataset1 <- cbind(A = 1:30, B = sample(1:100, 30)), with simplified output:


DerivativeFunction <- function(x) {log10(x) + 10}
dataset1 <- cbind(A = 1:30, B = sample(1:100, 30))
sapply(dataset1, DerivativeFunction)
**Hint:** Use the `sapply(...)` function.

Create the following function DerivativeFunction <- function(x) {log10(x) + 10}. Apply the "DerivativeFunction" to dataset1 <- cbind(A = 1:30, B = sample(1:100, 30)). The output should be a list:


DerivativeFunction <- function(x) {log10(x) + 10}
dataset1 <- cbind(A = 1:30, B = sample(1:100, 30))
lapply(dataset1, DerivativeFunction)
**Hint:** Use the `lapply(...)` function.

Merge examples {data-progressive=TRUE}

Assuming the following data sets:
Patients <- data.frame(id = c(1, 1, 1, 2, 2, 3), time = c(0, 1, 2, 0, 1, 0), blood_value = c(30, 55, 34, 13, 62, 87))
Hospital <- data.frame(id = c(1, 2, 3), hosp = c("EMC", "LUMC", "VUMC"))

Merge the columns of the two data frames by the id variable:


Patients <- data.frame(id = c(1, 1, 1, 2, 2, 3), time = c(0, 1, 2, 0, 1, 0), blood_value = c(30, 55, 34, 13, 62, 87))
Hospital <- data.frame(id = c(1, 2, 3), hosp = c("EMC", "LUMC", "VUMC"))
merge(Patients, Hospital)
**Hint:** Use the `merge(...)` function.

Assuming the following datasets:
Patients <- data.frame(id = c(1, 1, 1, 2, 2, 3), time = c(0, 1, 2, 0, 1, 0), blood_value = c(30, 55, 34, 13, 62, 87))
Hospitals <- data.frame(id = c(1, 1, 3, 3), hosp = c("EMC","EMC", "VUMC", "VUMC"), department = c("Cardiology", "Thoracic surgery", "Cardiology", "Thoracic surgery"))

Merge the columns of the two data frames that are only available in the data set Hospitals:


Patients <- data.frame(id = c(1, 1, 1, 2, 2, 3), time = c(0, 1, 2, 0, 1, 0), blood_value = c(30, 55, 34, 13, 62, 87))
Hospitals <- data.frame(id = c(1, 1, 3, 3), hosp = c("EMC","EMC", "VUMC", "VUMC"), department = c("Cardiology", "Thoracic surgery","Cardiology", "Thoracic surgery")) 
merge(Patients, Hospitals, all.y = TRUE)
**Hint:** To merge the rows that are only available in the data set Hospital use `merge(..., all.y = TRUE)`.

Assuming the following datasets:
Patients <- data.frame(id = c(1, 1, 1, 2, 2, 3), time = c(0, 1, 2, 0, 1, 0), blood_value = c(30, 55, 34, 13, 62, 87))
Hospitals <- data.frame(Id = c(1, 1, 3, 3), hosp = c("EMC","EMC", "VUMC", "VUMC"), department = c("Cardiology", "Thoracic surgery", "Cardiology", "Thoracic surgery"))

Merge the columns of the two data frames by the first variable of the data sets:


Patients <- data.frame(id = c(1, 1, 1, 2, 2, 3), time = c(0, 1, 2, 0, 1, 0), blood_value = c(30, 55, 34, 13, 62, 87))
Hospitals <- data.frame(Id = c(1, 1, 3, 3), hosp = c("EMC","EMC", "VUMC", "VUMC"), department = c("Cardiology", "Thoracic surgery","Cardiology", "Thoracic surgery")) 
merge(Patients, Hospitals, by.x = "id", by.y = "Id")

QUIZ - Loop, Apply, Merge

Some questions have more than one correct answers.

quiz(
  question("What is the output of the following program? for (i in seq(1, 10, by = 2)) print(i + 1)",
    answer("1, 3, 5, 7, 9"),
    answer("2, 4, 6, 8, 10", correct = TRUE),
    answer("1, 2, 3, 4, 5, 6, 7, 8, 9, 10"),
    answer("2, 3, 4, 5, 6, 9")
  ),
  question("What is the output of the following program? for (i in seq(1, 10, by = 2)) if (i > 5) print(2*i + 1)",
    answer("15, 19", correct = TRUE),
    answer("3, 7, 11, 15, 19"),
    answer("Na"),
    answer("8, 10")
  ),
  question("What is the output of the following code? lapply(list(1:20, 20:40, 40:80), mean)",
    answer("Output 1: <br> Na"),
    answer("Output 2: <br> [[1]]<br>
    [1] 10.5<br>
    [[2]]<br>
    [1] 30<br>
    [[3]]<br>
    [1] 60", correct = TRUE),
    answer("Output 3: <br> [1] 10.5 30.0 60.0"),
    answer("Output 4: <br> [[1]]<br>
    [1] 210<br>
    [[2]]<br>
    [1] 630<br>
    [[3]]<br>
    [1] 2460")
  ),
  question("Which of the following statements is wrong",
    answer("The sapply() function behaves similarly to lapply()"),
    answer("lapply() returns a list"),
    answer("apply() can be used to apply a function in subgroups", correct = TRUE),
    answer("All of the mentioned")
  ),
  question("Suppose there are 2 dataframes data1 and data2. data1 has blood information about 3 patients and data2 has hospital information of 2 other patients. What will the number of rows be in the resultant data frame after running the following command? merge(data1 ,data2 , by = 'id')",
    answer("<0 rows>", correct = TRUE),
    answer("3"),
    answer("2"),
    answer("5")
  )
)


stenw/BST02R documentation built on May 26, 2019, 4:35 a.m.