In beep-boopR/ubcBIOL548L: Data Visualization in R for UBC Graduate Course BIO548L

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

  file.copy(system.file("extdata", "gapminder.csv",
              package = "ubcBIOL548L"),
            "./gapminder.csv")
  file.copy(system.file("extdata", "gapminder.xlsx",
                        package = "ubcBIOL548L"),
            "./gapminder.xlsx")

Introduction

This vignette is Part 1 of 3 for an R workshop created for BIOL 548L, a graduate-level course on data visualization taught at the University of British Columbia.

When the workshop runs, we split students into three groups with successively increasing levels of difficulty. We recommend everyone starts here and works through the code that follows. This vignette shows users how to load a clean data file into R and make boxplots and scatterplots. Once you are comfortable with the contents of this page, please feel free to move on to Part 2 and ultimately to Part 3 (which we recommend going through only after completing Part 2) here.

All code and contents of this vignette were written together by Vikram B. Baliga, Andrea Gaede and Shreeram Senthivasan.

Learning Objectives:

Determine how to import and control data stored in different filetypes
Understand the best practices for structuring data in tidyverse/ggplot
Construct basic plots using ggplot

Load or install & load necessary packages

Before running the code below, make sure you have the necessary packages loaded. If you do not have the packages listed below installed on your computer, then download them from the CRAN site using install.packages() and then load them with library() as shown below.

library(ubcBIOL548L)
library(gapminder)
library(ggplot2)
library(tidyr)
library(dplyr)
library(tibble)
library(readr)
library(readxl)

Data sets

You can get all data used in this vignette (and the other two!) by downloading this zip file.

Importing and peeking at data:

Import data using base R command, and give it the name my_data

my_data <- read.csv("gapminder.csv")
# In practise, read_csv() is often better

Take a look at your data

# Same as print(my_data)
my_data

# Structure
str(my_data)
# Summarize columns
summary(my_data)
# Get column names (good for wide datasets)
names(my_data)
# Get first 6 lines
head(my_data)
# Get last 6 lines
tail(my_data)

Arguments can be added to a function using commas
Note: arguments with the default setting are hidden, unless specified. Here n changes the default from 6 to 10 lines

head(my_data, n = 10)

The helpfile lists what arguments are available

?head

A better import option using Tidyverse

my_data <- read_csv("gapminder.csv")
# Cleaner import and print with read_csv, don't need head()
my_data
# Note that words read in as `chr` not `factors`, this is good!
str(my_data)
# But underlying data is the same
summary(my_data)

Other formats for import

my_data <- read_delim("gapminder.csv", ',')
# Looks like a weird error
my_data <- read_excel("gapminder.xlsx")

Ways to clean up your data during import

# Inspect with head, or excel. We see two junk rows:
head(my_data)

# This can be solved by adding an argument `skip` is the number of rows to skip
my_data <- read_excel("gapminder.xlsx",
                   skip = 2)

my_data <- read_csv("gapminder.csv",col_names = FALSE)
# Setting `col_names` to false made the column headers row one and added dummy column names
my_data

# We're now going to import the gapminder dataset using the preferred read_csv() function
my_data <- read_csv("gapminder.csv",col_names = TRUE)

# This looks correct. Note: TRUE is the default so was not needed above
my_data

Using qplot to make a histogram, scatter plot, or dot plot

# This command makes a histogram of the `lifeExp` column of the `my_data` dataset
qplot(x = lifeExp, data = my_data)

# The same function here makes a scatter plot
qplot(x = gdpPercap, y = lifeExp, data = my_data)

# The same function here makes a dot plot because the x axis is categorical
qplot(x = continent, y = lifeExp, data = my_data)

How can the same function make three different classes of plots?
One of the hidden arguments is geom which specifies the type of plot. The default is auto which leads to a guess of the plot type based on the data type(s) in the column(s) you specify.

Type ?qplot in the console to read the qplot documentation

Now let's specify the type of plot explicitly

qplot(x = lifeExp, data = my_data, geom = 'histogram')
qplot(x = gdpPercap, y = lifeExp, data = my_data, geom = 'point')

# Note that we are now specifying boxplot instead of point plot
qplot(x = continent, y = lifeExp, data = my_data, geom = 'boxplot')

How to quickly make plots with `qplot()` that are easy to interpret

Now let's change the number of bins in a histogram and make the plot prettier

# The hidden argument `bins` has a default valute of 30
qplot(x = lifeExp, data = my_data, geom = 'histogram')

# This changes the number of bins to 10
qplot(x = lifeExp, bins = 10, data = my_data, geom = 'histogram')

# Alternatively you can choose the width you want the bins to have
qplot(x = lifeExp, binwidth = 5, data = my_data, geom = 'histogram')

# Let's add a title
qplot(x = lifeExp, binwidth = 5, main = "Histogram of life expectancy", data = my_data, geom = 'histogram')

# Let's add an x axis label
qplot(x = lifeExp, binwidth = 5, main = "Histogram of life expectancy", xlab = "Life expectancy (years)", data = my_data, geom = 'histogram')

# Let's add a y axis label
qplot(x = lifeExp, binwidth = 5, main = "Histogram of life expectancy", xlab = "Life expectancy (years)", ylab = "Count", data = my_data, geom = 'histogram')

# This format is easier to read, but otherwise exactly the same.
# The convention is to break lines after commas.
qplot(x = lifeExp,
      binwidth = 5,
      main = "Histogram of life expectancy",
      xlab = "Life expectancy (years)",
      ylab = "Count",
      data = my_data,
      geom = 'histogram')

Log scale & trendline

Let's apply a log scale and add a trendline to a scatter plot

# Note that x axis is compressed
qplot(x = gdpPercap, y = lifeExp, data = my_data, geom = 'point')
# Here the x axis is log transformed
qplot(x = gdpPercap, 
      y = lifeExp, 
      log = 'x',
      data = my_data, 
      geom = 'point')

# Let's add a trendline to the data as well.
# The linear regression model (`lm`) will be added on top of our previous plot
qplot(x = gdpPercap,
      y = lifeExp,
      log = 'x',
      main = "Scatter plot of life expectancy versus GDP per capita",
      xlab = "Log-transformed GDP per capita ($)",
      ylab = "Life expectancy (years)",
      data = my_data,
      # The following line adds a `smooth` trendline
      # We want our regression to be a linear model, or `lm`
      method = 'lm',
      # the `c()` function allows us to pass multiple variables
      # to the `geom` argument
      geom = c('point','smooth'))

Building a boxplot using gapminder data

qplot(x = continent,
      y = lifeExp,
      main = "Boxplot of life expectancy by continent",
      xlab = "Continent",
      ylab = "Life expectancy (years)",
      data = my_data,
      geom = 'boxplot')