We start by installing the packages that we will need throughout this notebook.
# install.packages("tidyverse")
# install.packages("mlbench")
Installing the packages is not enough: they also have to be loaded before their functions can be used.
library(learnr)
library(tidyverse)
library(mlbench)
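Because library() stops with an error when a package is not installed, it can help to check availability first. The following is a minimal sketch and not part of the original notebook; the names pkgs, available and missing_pkgs are illustrative assumptions.

```r
# Packages this notebook relies on (learnr is left out here for brevity)
pkgs <- c("tidyverse", "mlbench")

# requireNamespace() returns TRUE or FALSE instead of raising an error
available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!available]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install what is missing
}
```

Packages reported as missing can then be installed in one call before the library() lines are run.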
This section lists some functions that are useful when working with R. First of all, it is good practice to cite R whenever it was used in the research process. citation() displays the proper way to cite R, whereas citation("packagename") can be used to cite R packages.
citation()
citation("ggplot2")
Typically, one of the first things to do is to specify your working directory. The following functions can be used to display (getwd()) and set (setwd()) the working directory and to list its contents (dir()). Keep in mind that a single backslash is an escape character in R strings, so paths should be written with forward slashes (or with doubled backslashes on Windows).
getwd()
# setwd("path")
dir()
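As a small illustration of the path conventions mentioned above, the sketch below builds a path with file.path(), which joins its arguments with forward slashes on every platform. The folder names are hypothetical and do not refer to anything in this notebook.

```r
# file.path() joins path components with "/" regardless of the operating system
data_dir <- file.path("projects", "boston", "data")  # hypothetical folder names
print(data_dir)

# normalizePath() returns the absolute path; winslash = "/" asks for
# forward slashes on Windows as well
wd <- normalizePath(getwd(), winslash = "/")
print(wd)
```

Building paths this way avoids typing separators by hand, which is where backslash mistakes usually come from.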
To get familiar with R's help system, we can explore the documentation for the function help(). This is equivalent to help(help).
# help()
The following call opens the documentation for R's global options.
# help(options)
Use help.search() to search the help system.
# help.search("glm")
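When the exact function name is not known, apropos() can complement help.search(): it matches a pattern against the names of objects on the search path rather than against the help pages. The sketch below is an addition to the notebook, and the search pattern is only an example.

```r
# apropos() returns the names of objects on the search path that match a pattern
read_funs <- apropos("^read\\.csv")
print(read_funs)

# Interactive helpers, commented out like the calls above:
# ??glm          # shorthand for help.search("glm")
# example(mean)  # runs the examples section of a help page
```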
In this notebook, we use the Boston Housing data set. "This dataset contains information collected by the U.S Census Service concerning housing in the area of Boston Mass. It was obtained from the StatLib archive (http://lib.stat.cmu.edu/datasets/boston), and has been used extensively throughout the literature to benchmark algorithms."
Source: https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html
data(BostonHousing2)
boston <- BostonHousing2
As a shortcut for help(), we can use ? to get some information about this dataset.
?BostonHousing2
The following functions can be used to get a first impression of the data.
str(boston)
head(boston)
Using index notation to access only specific variables or observations is an important tool as it can be used in conjunction with many different functions. It is therefore worthwhile to consider some basic examples.
boston[, 1]
boston[, 1:5]
boston[1:10, c(1:2, 5)]
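A few further indexing variants are sketched below on a small made-up data frame, so that the results are easy to verify by eye. The object toy and its columns are hypothetical and not part of the Boston Housing data.

```r
# Hypothetical toy data, used only to illustrate index notation
toy <- data.frame(
  id    = 1:5,
  value = c(10, 20, 30, 40, 50),
  flag  = c("a", "b", "a", "b", "a")
)

col_vec <- toy[, 2]                  # a single column is returned as a vector
col_df  <- toy[, 2, drop = FALSE]    # drop = FALSE keeps a one-column data frame
no_first <- toy[-1, ]                # negative indices exclude rows
large <- toy[toy$value > 25, c("id", "value")]  # logical rows, named columns

str(col_vec)
str(col_df)
print(no_first)
print(large)
```

The same patterns apply unchanged to boston, for example boston[, 1, drop = FALSE].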
List all variable names of the Boston Housing data.
names(boston)
Now we can access variables by using their names and the $-notation. This can be combined with conditional statements regarding rows to also filter specific observations.
boston$medv
boston$medv[1:10]
boston$medv[boston$chas == 1]
We can also draw random samples from our data set and store those in new objects.
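index <- sample(1:nrow(boston), floor(0.75 * nrow(boston)))  # random 75% of the row indices
boston_sub <- boston[index, ]  # named boston_sub to avoid masking base::subset()
nrow(boston_sub)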
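Random draws differ between runs unless a seed is set, and the rows that were not sampled are often needed as well. The sketch below shows both ideas on a hypothetical data frame; the seed value and the names toy, train and holdout are assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed; any fixed integer makes the draw repeatable

# Hypothetical data with 20 rows
toy <- data.frame(id = 1:20, value = seq(5, 100, by = 5))

n_train <- floor(0.75 * nrow(toy))                    # 15 of 20 rows
index <- sample(seq_len(nrow(toy)), size = n_train)   # sampling without replacement

train <- toy[index, ]
holdout <- toy[-index, ]  # negative indices select the rows not sampled

nrow(train)    # 15
nrow(holdout)  # 5
```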
Finally, here is a dplyr approach to selecting rows and columns of the Boston Housing dataset.
boston %>%
  select(medv, chas) %>%
  filter(chas == 1)
Basic descriptive statistics can be computed using summary().
summary(boston$medv)
Note that this function is class-sensitive, i.e. here we get a different output depending on the class of the respective object.
class(boston$medv)
summary(boston$town)
class(boston$town)
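The class sensitivity of summary() can be seen in isolation with two short vectors. This is a minimal sketch with made-up values; x_num and x_fac are hypothetical names.

```r
x_num <- c(1.5, 2.5, 4.0)          # numeric vector
x_fac <- factor(c("a", "b", "a"))  # factor with two levels

num_summary <- summary(x_num)  # minimum, quartiles, mean and maximum
fac_summary <- summary(x_fac)  # number of observations per level

print(num_summary)
print(fac_summary)
```

For the numeric vector the result describes the distribution, whereas for the factor it is a frequency count, which matches the difference between medv and town above.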
Some summary statistics for the value of owner-occupied homes, grouped by the chas river indicator, now using dplyr.
boston %>%
  group_by(chas) %>%
  summarise(mean(medv), var(medv), min(medv), max(medv))
Summary statistics again, now for selected towns.
boston %>%
  filter(town %in% c("Cambridge", "Boston South Boston")) %>%
  group_by(town) %>%
  summarise(mean = mean(medv), variance = var(medv), IQR = IQR(medv), n = n())
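For comparison, grouped statistics of this kind can also be computed in base R. The sketch below uses tapply() and aggregate() on a hypothetical data frame; it is offered as a rough counterpart to group_by() and summarise(), not as the approach used in this notebook.

```r
# Hypothetical toy data with a two-level grouping factor
toy <- data.frame(
  group = factor(c(0, 0, 1, 1, 1)),
  value = c(10, 20, 30, 40, 50)
)

# tapply() applies a function to value within each level of group
group_means <- tapply(toy$value, toy$group, mean)
print(group_means)

# aggregate() returns the result as a data frame, one row per group
group_stats <- aggregate(value ~ group, data = toy, FUN = mean)
print(group_stats)
```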
A boxplot via qplot(), separated by the chas dummy variable. Note that qplot() is deprecated as of ggplot2 3.4.0, so the call may emit a deprecation warning.
qplot(chas, medv, data = boston, geom = "boxplot", fill = chas)
The previous boxplot with better labels, now using the ggplot() function.
ggplot(boston) +
  geom_boxplot(aes(x = chas, y = medv, fill = chas)) +
  labs(x = "Charles River dummy", y = "Median home value") +
  guides(fill = "none") +  # guides(fill = FALSE) is deprecated in recent ggplot2
  theme_light()
A density plot of the median value of owner-occupied homes, faceted by the river dummy.
ggplot(boston) +
  geom_density(aes(x = medv), color = "red") +
  geom_rug(aes(x = medv, y = 0), position = position_jitter(height = 0)) +
  facet_grid(. ~ chas)
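Scatterplots of median home values (medv) against the share of lower-status population (lstat), faceted by the river dummy, with overlaid smoothing curves (geom_smooth() defaults to loess for groups of this size).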
ggplot(boston) +
  geom_point(aes(x = lstat, y = medv)) +
  geom_smooth(aes(x = lstat, y = medv)) +
  facet_grid(. ~ chas)