library(learnr) library(JM) library(memisc) library(foreign) #data("pain", package = "BST02R") pain <- readRDS('www/pain.Rds' ) pregdata <- read.spss('www/pregdata.sav',to.data.frame = TRUE, max.value.labels=8, stringsAsFactors=FALSE) knitr::opts_chunk$set(echo = FALSE)
In the next couple of question we will examine a data set called pregdata
. It is loaded and available at the beginning of this excersize. To get to know the contents of this data set. Look at the first couple of observations (using e.g. pregdata[1:4,] or the head function).
cat( # <p style="border:1px; border-style:solid; padding: 1em; border-color:#D11450"> # **NB:** '- In practice we have to load a data set before we can use it. The way in which we can do this depends on the format of the data set. It it is an SPSS data set we can use the `read.spss()` function from the `foreign` package. - In RStudio we can also inspect the contents of a data set using the`View()` function - Press the hint button if you need help. The last hint reveals the answer.' # </p> )
# Inspect the first few observations
# Read the question again and fill in the dots of either line below. pregdata[...,] head(...)
pregdata[1:4,] head(pregdata)
If we just want to know the variable names of the variables in the data.frame
we can also use the function names()
. If we also want to know about the variables types we can use the function str()
. PLease try this out yourself. What is the class of the variable MoWestern
and what us the mode? (First try to answer this question with the results you already have otherwise use the mode
and class
function)
# which variables are in the data set?
# Read the question again and fill in the dots of either line below.names(...) str(...)
names(pregdata) str(pregdata)
We will now examine the BMI of the mothers (the variable is called BMI_Mo
). Let's store the BMIs in a separate variable using the statement bmi <- pregdata$BMI_Mo
. We can check the number of observations in this vector using the function length()
.
# Try it out yourself
bmi <- pregdata$BMI_Mo length(bmi)
Using the function mean()
we can compute the mean of the vector (i.e. of the BMIs). In the same way the function sd()
gives us the standard deviation. Try it out.
bmi <- pregdata$BMI_Mo
# Compute the mean and the standard deviation of the BMIs
# Read the question again and fill in the dots of either line below.names(...) mean(...) # compute the mean sd(...) # compute the standard deviation
mean(bmi) # compute the mean sd(bmi) # compute the standard deviation
Remember that you can read the documentation on any function Using the function help
. For example help('sd')
gives you the documentation on the sd
function.
Actually it is not really needed to make a new variable we also could have directly used mean(pregdata$BMI_Mo)
and sd(pregdata$BMI_Mo)
. Do this yourself below. Also do this for the age of the mother (AgeMother
).
bmi <- pregdata$BMI_Mo
# compute the mean and sd of bmi and age without making extra variables
mean(pregdata$BMI_Mo) # mean sd(pregdata$BMI_Mo) # sd mean(...) sd(...)
mean(pregdata$BMI_Mo) # mean sd(pregdata$BMI_Mo) # sd mean(pregdata$AgeMother) sd(pregdata$AgeMother)
We will now focus on mothers that give birth to a boy. Make the selection using:
sel <- pregdata[pregdata$Sex=='boy',].
sel <- pregdata[pregdata$Sex=='boy',]
How many mothers are included in this selection. Wat is their average age?
bmi <- pregdata$BMI_Mo sel <- pregdata[pregdata$Sex=='boy',]
length(...) # or use nrow or dim() mean(...)
length(sel$AgeMother) # or use nrow or dim() mean(sel$AgeMother)
What are the minimum, the maximum and the median BMI of the women that give birth to a girl? Use the functions min
, max
and median
.
#sel <- pregdata[pregdata$Sex==...,] #min(...) #max(...) #median(...)
sel <- pregdata[pregdata$Sex=='girl',] min(sel[,'BMI_Mo']) max(sel[,'BMI_Mo']) median(sel[,'BMI_Mo'])
To sort a vector in R you can use the function sort
. Use decreasing = TRUE
to sort in reverse order. Try sorting the BMIs in pregdata
.
sort(pregdata$BMI_Mo)
Sorting a data.frame
is slightly more complicated. We need to use the function order
. The result of this function is a permutation that, when used as an index, puts the data.frame in the correct order. Try out what the function order
does using pregdata
below
o <- order(pregdata$BMI_Mo) print(o) sorted <- pregdata[o, ] head(sorted) tail(sorted)
The function sort can also be used on multiple columns. Try sorting pregdata
on ethnicity (MoWestern
) and Sex.
o <- order(pregdata$MoWestern, pregdata$Sex) sorted <- pregdata[o, ] head(sorted) tail(sorted)
A function that is related to order
is rank
. This function gives the sample ranks of an input vector. So 1 for the smallest element, 2 for the next smallest and so on. See if you understand the next example
rank(c(8,5,4,2))
Now use the rank
function to find the sample rank of the age of the first mother in the pregdata
data set.
rank(pregdata$AgeMother)[1]
When you want to see if a value is duplicated in a vector you can use the function duplicated
. Look up the documentation of this function and use it to select the first measuremnt of every subject in the pbc2
data set.Print the data for the first 10 patients.
pbc2[!duplicated(pbc2$id), 1:10]
A related function is the function unique()
. Use it to select the unique patient numbers
unique(pbc2$id)
If you have to enter a vector with many elements working with c
becomes tiresome. However if the values follow a (regular) sequence it is often eaasy to create them by using the function seq
.
seq(1, 10) seq(1, 10, by = 2) seq(1, 10, length.out = 5) 2:5 # same as seq(2,5)
Let's use a sequence to plot a function. create a sequence of length 100 from -5 to to 5 and call this x. Now calculate sin(x) and call this y. Finally plot x vs y.
# Calculate x and y as described above # plot(x=x, y=y)
# Calculate x and y as described above x <- seq(-5, 5, length.out = 100) y <- sin(x) plot(x=x, y=y)
We will discuss plots in more detail later on in the course.
When you want to repeat a vector multiple times the rep
function can help. Try it out below.
rep(c('Active', 'Control'), times=4) rep(c('Active', 'Control'), length.out=6) rep(c('Active', 'Control'), each=5) rep(c('Active', 'Control'), c(1,3))
In this course we mainly work with numerical data (or factors). Sometimes however, we also have to work with text data (type character
in R, also sometimes called 'strings'). When you want to combine character
variables you can use the paste
or paste0
functions.
paste(c('foo', 'bar'), c('A', 'B')) paste(c('foo', 'bar'), c('A', 'B'), sep='-') paste(c('foo', 'bar'), c('A', 'B'), sep='-', collapse = '_') paste0(c('foo', 'bar'), c('A', 'B'))
When you want to extract some of the characters of a character
variable you can use the substr
or strsplit
functions.
substr(c('foo-delete this', 'bar-this should be removed'), start=1, stop=3) strsplit(c('foo-delete this', 'bar-this should be removed'),split = '-')
Now use the substr
function together with as.numeric
to convert the patient id variable of pregdata
to numeric format.
pregdata$newid <- as.numeric(substr(pregdata$id,3,5)) pregdata$newid[1:10] mode(pregdata$newid)
If you need to work a lot with character
data, take a look at the stringr
library.
If you want to generate random normally distributed numbers (for example for a statistical simulation) you can use the function rnorm
. Try it out yourself.
rnorm(5) x <- rnorm(1000) mean(x) sd(x) x <- rnorm(1000, mean = 8, sd = 2) mean(x) sd(x)
In the same way we can generate data from other statistical distributions using e.g. rt
(t-distribution, like standard normal but with thinker tails), rbinom
(for binary outcomes), rpois
(for counts), rchisq
(often used for sums of squares in statistical tests), etc...
rt(5, df=3) rbinom(5, size = 1, prob = 0.4) rbinom(5, size = 5, prob = 0.4)
Besides these function for generating random numbers. R also contains functions for the, density cumulative distribution and quantile functions for these distributions. Here the r at the beginning of the function names needs to be replaced by 'd', 'p' and 'q' respectively.
Use the qnorm function to find the 95th percentile of the normal distribution
qnorm(p=)
qnorm(p=0.95)
The probability of the number of 'heads' on n coin tosses with a fair coin can be calculated using the binomial probability withparameter prob
set to 0.5.
Now try to use the dbinom
function to find the probability of 2 heads in 4 tosses. Use ?binom
if needed.
dbinom(x = , prob = , size = )
dbinom(x =2 , prob =0.5 , size = 4)
We want to create a function that takes a data frame as input and selects all rows for which a certain variable ('AgeMother' by default) inside the data set is above a cutoff value (35 by default).
Complete the function:
myfn <- function(dataset, varname='AgeMother', cutoff=35){ # complete the function return(rv) } myfn(pregdata, 'AgeMother')
myfn <- function(dataset, varname='AgeMother', cutoff=35){ rv <- dataset[..., ] # figure out what should be on the ... return(rv) } myfn(pregdata, 'AgeMother')
myfn <- function(dataset, varname='AgeMother', cutoff=35){ rv <- dataset[dataset[,...]>..., ] # figure out what should be on the ... return(rv) } myfn(pregdata, 'AgeMother')
# Use: myfn <- function(dataset, varname='AgeMother', cutoff=35){ rv <- dataset[dataset[,varname] > cutoff , ] return(rv) } myfn(pregdata, 'AgeMother')
Now extend the function so it no longer gives the selection as return value but make it generate a frequency table of Sex for the selection.
# Change the function below myfn <- function(dataset, varname='AgeMother', cutoff=35){ rv <- dataset[dataset[,varname] > cutoff , ] return(rv) } myfn(pregdata, 'AgeMother')
myfn <- function(dataset, varname='AgeMother', cutoff=35){ rv <- dataset[dataset[,varname] > cutoff , ] rv <- table(rv[ , ...] ) return(rv) } myfn(pregdata, 'AgeMother')
myfn <- function(dataset, varname='AgeMother', cutoff=35){ rv <- dataset[dataset[,varname] > cutoff , ] rv<-table(rv[ , 'Sex'] ) return(rv) } myfn(pregdata, 'AgeMother')
As a final exersize change the function in such a way so we do not always display the frequencies for Sex. Instead we want to choose the variable of which to make frequencies
# Change the function below myfn <- function(dataset, varname='age', cutoff=35){ rv <- dataset[dataset[,varname] > cutoff , ] rv<-table(rv[ , 'Sex'] ) return(rv) } myfn(pregdata, 'AgeMother') myfn(pregdata, 'AgeMother', varname2='Death')
myfn <- function(dataset, varname='age', cutoff=35, varname2= 'Sex'){ rv <- dataset[dataset[,varname] > cutoff , ] rv<-table(rv[ , varname2] ) return(rv) } myfn(pregdata, 'AgeMother') myfn(pregdata, 'AgeMother', varname2='Death')
Using the pregdata
data set we have caried out linear regression analysis (discussed later in the course in more detail) were we have estimated the coefficients in the equation: $BW = \beta_0 +\beta_1 Sex + \beta_2 Age$.
A summary of the results is stored in the object slm1
.
lm1<-lm(BW~ Sex + AgeMother, data=pregdata) slm1 <- summary(lm1)
What is the class
of slm1?
class(slm1)
The coefficients (and their SE and p-values) are stored in the element named coefficients
? What is its class? Can you extract the coefficient of the age and store it in a separate variable?
class(slm1$coefficients) beta2 <- slm1$coefficients['AgeMother','Estimate']
quiz( question("How do we denine a function fn that takes two vectors as arguments and returns the sum:", answer(" vector fn(x, y){x + y}"), answer(" function fn(x, y){return(x + y)}"), answer("fn<-function(x, y){x + y}", correct = TRUE), answer(" fn(x, y)<-function{return(x + y)}") ), question("We have the following code: x<-3 \n fn <- function(y=3){ \n return(x/y) \n x<-9 \n } \n x<- 6 \n fn() \n What if anything is returned by the function call? ", answer("1"), answer("2", correct = TRUE), answer("3"), answer("9") ), question(" A matrix M is defined as: \n M <- matrix(1:12, 3) \n What is returned by: length(M) ", answer("3"), answer("4"), answer("12", correct = TRUE), answer("NULL"), answer("we get an error") ), question("How can we remove all duplicated values from a vector x?", answer(" x[unique(x)]"), answer("!duplicated(x)"), answer(" x[!duplicated(x)]", correct = TRUE), answer(" complete.cases(x)") ) )
Understanding how loops, apply, merge work.
Implement a simple version of guess the number game using a while loop. The user should guess a number between 1 and 10. The loop should stop if the user guesses 5. Print all the numbers:
x <- sample(1:10, 1) while (x != 5){ x <- sample(1:10, 1) print(x) }
Use a for loop to simulate the flip a coin twenty times, keeping track of the individual outcomes (1 = heads, 0 = tails) in a vector that you preallocate (y):
y <- numeric(20) for (i in 1:20) { y[i] <- sample(0:1, 1) }
Write a while loop that prints out standard random normal numbers but stops if you get a number bigger than 1. Print all the numbers:
x <- rnorm(1) while(x <= 1) { x <- rnorm(1) print(x) }
Create the matrix dataset1 <- cbind(A = 1:30, B = sample(1:100, 30))
and find the row means of dataset1:
dataset1 <- cbind(A = 1:30, B = sample(1:100, 30)) apply(dataset1, 1, mean)
Create the matrix dataset2 <- data.frame(A = 1:30, B = sample(1:100, 30), C = sample(0:1, 30, replace = T))
and find the means per group (indicated in the C column) of dataset2:
dataset2 <- data.frame(A = 1:30, B = sample(1:100, 30), C = sample(0:1, 30, replace = T)) tapply(dataset2$A, dataset2$C, mean)
Create a list with the name myList2 which consists of four matrices as given below. Extract the second column from each element (from each single matrix) :
first = matrix(38:67, 3)
second = matrix(56:91, 3)
third = matrix(82:147, 3)
fourth = matrix(46:95, 5)
myList2 = list(first, second, third, fourth)
first = matrix(38:67, 3) second = matrix(56:91, 3) third = matrix(82:147, 3) fourth = matrix(46:95, 5) myList2 = list(first, second, third, fourth) lapply(myList2,"[", , 2)
Create the following function DerivativeFunction <- function(x) {log10(x) + 10}
. Apply the "DerivativeFunction" to dataset1 <- cbind(A = 1:30, B = sample(1:100, 30))
, with simplified output:
DerivativeFunction <- function(x) {log10(x) + 10} dataset1 <- cbind(A = 1:30, B = sample(1:100, 30)) sapply(dataset1, DerivativeFunction)
Create the following function DerivativeFunction <- function(x) {log10(x) + 10}
. Apply the "DerivativeFunction" to dataset1 <- cbind(A = 1:30, B = sample(1:100, 30))
. The output should be a list:
DerivativeFunction <- function(x) {log10(x) + 10} dataset1 <- cbind(A = 1:30, B = sample(1:100, 30)) lapply(dataset1, DerivativeFunction)
Assuming the following data sets:
Patients <- data.frame(id = c(1, 1, 1, 2, 2, 3), time = c(0, 1, 2, 0, 1, 0), blood_value = c(30, 55, 34, 13, 62, 87))
Hospital <- data.frame(id = c(1, 2, 3), hosp = c("EMC", "LUMC", "VUMC"))
Merge the columns of the two data frames by the id variable:
Patients <- data.frame(id = c(1, 1, 1, 2, 2, 3), time = c(0, 1, 2, 0, 1, 0), blood_value = c(30, 55, 34, 13, 62, 87)) Hospital <- data.frame(id = c(1, 2, 3), hosp = c("EMC", "LUMC", "VUMC")) merge(Patients, Hospital)
Assuming the following datasets:
Patients <- data.frame(id = c(1, 1, 1, 2, 2, 3), time = c(0, 1, 2, 0, 1, 0), blood_value = c(30, 55, 34, 13, 62, 87))
Hospitals <- data.frame(id = c(1, 1, 3, 3), hosp = c("EMC","EMC", "VUMC", "VUMC"), department = c("Cardiology", "Thoracic surgery", "Cardiology", "Thoracic surgery"))
Merge the columns of the two data frames that are only available in the data set Hospitals:
Patients <- data.frame(id = c(1, 1, 1, 2, 2, 3), time = c(0, 1, 2, 0, 1, 0), blood_value = c(30, 55, 34, 13, 62, 87)) Hospitals <- data.frame(id = c(1, 1, 3, 3), hosp = c("EMC","EMC", "VUMC", "VUMC"), department = c("Cardiology", "Thoracic surgery","Cardiology", "Thoracic surgery")) merge(Patients, Hospitals, all.y = TRUE)
Assuming the following datasets:
Patients <- data.frame(id = c(1, 1, 1, 2, 2, 3), time = c(0, 1, 2, 0, 1, 0), blood_value = c(30, 55, 34, 13, 62, 87))
Hospitals <- data.frame(Id = c(1, 1, 3, 3), hosp = c("EMC","EMC", "VUMC", "VUMC"), department = c("Cardiology", "Thoracic surgery", "Cardiology", "Thoracic surgery"))
Merge the columns of the two data frames by the first variable of the data sets:
Patients <- data.frame(id = c(1, 1, 1, 2, 2, 3), time = c(0, 1, 2, 0, 1, 0), blood_value = c(30, 55, 34, 13, 62, 87)) Hospitals <- data.frame(Id = c(1, 1, 3, 3), hosp = c("EMC","EMC", "VUMC", "VUMC"), department = c("Cardiology", "Thoracic surgery","Cardiology", "Thoracic surgery")) merge(Patients, Hospitals, by.x = "id", by.y = "Id")
Some questions have more than one correct answers.
quiz( question("What is the output of the following program? for (i in seq(1, 10, by = 2)) print(i + 1)", answer("1, 3, 5, 7, 9"), answer("2, 4, 6, 8, 10", correct = TRUE), answer("1, 2, 3, 4, 5, 6, 7, 8, 9, 10"), answer("2, 3, 4, 5, 6, 9") ), question("What is the output of the following program? for (i in seq(1, 10, by = 2)) if (i > 5) print(2*i + 1)", answer("15, 19", correct = TRUE), answer("3, 7, 11, 15, 19"), answer("Na"), answer("8, 10") ), question("What is the output of the following code? lapply(list(1:20, 20:40, 40:80), mean)", answer("Output 1: <br> Na"), answer("Output 2: <br> [[1]]<br> [1] 10.5<br> [[2]]<br> [1] 30<br> [[3]]<br> [1] 60", correct = TRUE), answer("Output 3: <br> [1] 10.5 30.0 60.0"), answer("Output 4: <br> [[1]]<br> [1] 210<br> [[2]]<br> [1] 630<br> [[3]]<br> [1] 2460") ), question("Which of the following statements is wrong", answer("The sapply() function behaves similarly to lapply()"), answer("lapply() returns a list"), answer("apply() can be used to apply a function in subgroups", correct = TRUE), answer("All of the mentioned") ), question("Suppose there are 2 dataframes data1 and data2. data1 has blood information about 3 patients and data2 has hospital information of 2 other patients. What will the number of rows be in the resultant data frame after running the following command? merge(data1 ,data2 , by = 'id')", answer("<0 rows>", correct = TRUE), answer("3"), answer("2"), answer("5") ) )
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.