knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
file.copy(system.file("extdata", "gapminder.csv", package = "MultiPanelPlotsWithR"), "./gapminder.csv")
file.copy(system.file("extdata", "gapminder.xlsx", package = "MultiPanelPlotsWithR"), "./gapminder.xlsx")
library(MultiPanelPlotsWithR)
This vignette is Part 1 of 3 for an R workshop created for BIOL 548L, a graduate-level course taught at the University of British Columbia.
When the workshop runs, we split students into three groups of successively increasing difficulty. We recommend that everyone start here and work through the code that follows. This vignette shows how to load a clean data file into R and make boxplots and scatterplots. Once you are comfortable with the contents of this page, feel free to move on to Part 2 and then to Part 3, which we recommend going through only after completing Part 2.
All code and contents of this vignette were written together by Vikram B. Baliga, Andrea Gaede, Shreeram Senthivasan, and Eric R. Press.
tidyverse/ggplot2
ggplot2
Before running the code below, make sure you have the necessary packages loaded.
If you do not have the packages listed below installed on your computer, then download them from the CRAN site using install.packages() and then load them with library() as shown below.
library(gapminder)
library(ggplot2)
library(tidyr)
library(dplyr)
library(tibble)
library(readr)
library(readxl)
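The install-then-load step described above can be sketched as follows. The names `needed`, `is_installed` and `missing_pkgs` are illustrative assumptions rather than part of the workshop code, and the install call is left commented out so that nothing is downloaded unless you choose to run it.

```r
# Packages used in this vignette (names taken from the library() calls)
needed <- c("gapminder", "ggplot2", "tidyr", "dplyr",
            "tibble", "readr", "readxl")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment the next line to download the missing packages from CRAN
  # install.packages(missing_pkgs)
} else {
  message("All packages are installed; load them with library()")
}
```

Installing only needs to happen once per computer, whereas library() must be called in every new R session.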
You can get all data used in this vignette (and the other two!) by downloading this zip file.
Import data using a base R command, and give it the name my.data
my.data <- read.csv("gapminder.csv") # In practice, the function read_csv() from the 'readr' package is often better
Take a look at your data
# Same as print(my.data)
my.data
# Inspect the structure of the data
str(my.data)
# Summarize column information
summary(my.data)
# Get column names (variables). This is handy for wide datasets i.e. many variables
names(my.data)
# Get first 6 lines/rows (observations)
head(my.data)
# Get last 6 lines/rows
tail(my.data)
Simply explore the entire data frame
View(my.data)
Arguments can be added to a function using commas
Note: arguments with the default setting are hidden, unless specified. Here n changes the default from 6 to 10 lines
head(my.data, n = 10)
The helpfile lists what arguments are available
?head
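To see the effect of a default argument without needing the gapminder file, here is a minimal sketch on a made-up data frame. The object `demo` is hypothetical and stands in for any data frame with more than 10 rows.

```r
# A small hypothetical data frame with 20 rows (not the gapminder data)
demo <- data.frame(id = 1:20, value = (1:20)^2)

# With no `n` supplied, head() falls back to its default of 6 rows
default_rows <- head(demo)
nrow(default_rows)   # 6

# Supplying `n` after a comma overrides the default
ten_rows <- head(demo, n = 10)
nrow(ten_rows)       # 10
```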
A better import option using Tidyverse
my_data <- read_csv("gapminder.csv")
# Cleaner import and print with read_csv, don't need head()
str(my_data)
# But underlying data is the same
summary(my.data)
summary(my_data)
Other formats for import
my_data_c <- read_delim("gapminder.csv", ',')
my_data_x <- read_excel("gapminder.xlsx")
Ways to clean up your data during import
# Inspect with head. We see two junk rows:
head(my_data_x)
# This can be solved by adding the argument `skip`,
# which is the number of rows to skip
my_data_x <- read_excel("gapminder.xlsx", skip = 2)
my_data <- read_csv("gapminder.csv", col_names = FALSE)
# Setting `col_names` to FALSE turned the column headers into
# row one and added dummy column names
my_data
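The same clean-up ideas can be tried without the workshop files. The sketch below uses base R's read.csv(), whose `skip` and `header` arguments play roles similar to `skip` and `col_names` above; the temporary file and its contents are made up purely for illustration.

```r
# Write a small, made-up CSV with two junk rows above the real header
tmp <- tempfile(fileext = ".csv")
writeLines(c("exported from a spreadsheet",
             "junk row",
             "country,year,lifeExp",
             "Canada,2007,80.7",
             "Japan,2007,82.6"),
           tmp)

# Skipping the two junk rows lets the real header become the column names
clean <- read.csv(tmp, skip = 2)
names(clean)

# header = FALSE treats the header line as data and adds
# dummy column names (V1, V2, V3)
no_header <- read.csv(tmp, skip = 2, header = FALSE)
no_header
```

Because the header line ends up in row one of `no_header`, every column is read as text, which is one reason to inspect a data frame right after importing it.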
# We're now going to import the gapminder dataset
# using the preferred read_csv() function
my_data <- read_csv("gapminder.csv", col_names = TRUE)
# This looks correct. Note: TRUE is the default so it was not needed above
my_data
# This command makes a histogram of the `lifeExp` column of the `my_data` dataset
qplot(x = lifeExp, data = my_data)
# The same function here makes a scatter plot
qplot(x = gdpPercap, y = lifeExp, data = my_data)
# The same function here makes a dot plot because
# the x axis is categorical
qplot(x = continent, y = lifeExp, data = my_data)
How can the same function make three different classes of plots?
One of the hidden arguments is geom, which specifies the type of plot. The default is auto, which leads to a guess of the plot type based on the data type(s) in the column(s) you specify.
Type ?qplot in the console to read the qplot documentation.
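To make the idea of "guessing" concrete, here is a simplified imitation of that decision written in base R. The function `guess_geom()` is hypothetical and is not the ggplot2 source code; it only mirrors the behaviour seen in the three plots above, using made-up vectors in place of the gapminder columns.

```r
# Hypothetical helper: a simplified imitation of the `geom = "auto"` rule,
# not the actual ggplot2 implementation
guess_geom <- function(x, y = NULL) {
  if (is.null(y)) {
    # Only x supplied: count the values
    if (is.numeric(x)) "histogram" else "bar"
  } else {
    # Both x and y supplied: plot one against the other
    "point"
  }
}

life_exp  <- c(43.8, 76.4, 72.3)             # made-up numeric column
gdp       <- c(974.6, 5937.0, 6223.4)        # made-up numeric column
continent <- c("Asia", "Europe", "Africa")   # made-up categorical column

guess_geom(x = life_exp)                 # "histogram"
guess_geom(x = gdp, y = life_exp)        # "point"
guess_geom(x = continent, y = life_exp)  # "point"
```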
Now let's specify the type of plot explicitly
qplot(x = lifeExp, data = my_data, geom = 'histogram')
qplot(x = gdpPercap, y = lifeExp, data = my_data, geom = 'point')
# Note that we are now specifying boxplot instead of point plot
qplot(x = continent, y = lifeExp, data = my_data, geom = 'boxplot')
Plots from qplot() that are easy to interpret
Now let's change the number of bins in a histogram and make the plot prettier
# The hidden argument `bins` has a default value of 30
qplot(x = lifeExp, data = my_data, geom = 'histogram')
# This changes the number of bins to 10
qplot(x = lifeExp, bins = 10, data = my_data, geom = 'histogram')
# Alternatively you can choose the width you want the bins to have
qplot(x = lifeExp, binwidth = 5, data = my_data, geom = 'histogram')
# Let's add a title
qplot(x = lifeExp, binwidth = 5, main = "Histogram of life expectancy", data = my_data, geom = 'histogram')
# Let's add x and y axes labels
qplot(x = lifeExp, binwidth = 5, main = "Histogram of life expectancy", xlab = "Life expectancy (years)", ylab = "Count", data = my_data, geom = 'histogram')
# This format is easier to read, but otherwise exactly the same.
# The convention is to break lines after commas.
qplot(x = lifeExp, binwidth = 5,
      main = "Histogram of life expectancy",
      xlab = "Life expectancy (years)",
      ylab = "Count",
      data = my_data,
      geom = 'histogram')
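The difference between choosing a number of bins and choosing a bin width can be checked numerically. This is a rough sketch with base R's hist() on made-up values; ggplot2 positions its bin edges by its own rules, so exact counts in a qplot() histogram may differ.

```r
# Made-up life expectancy values (not the gapminder data)
life_exp <- c(41, 44, 52, 58, 63, 67, 71, 74, 78, 82)

# Fixed width: bin edges every 5 years, spanning the data
edges_by_width <- seq(40, 85, by = 5)
counts_by_width <- hist(life_exp, breaks = edges_by_width, plot = FALSE)$counts
length(counts_by_width)   # 9 bins, each 5 years wide

# Fixed number: the same range cut into 3 equally wide bins
edges_by_number <- seq(40, 85, length.out = 4)
counts_by_number <- hist(life_exp, breaks = edges_by_number, plot = FALSE)$counts
diff(edges_by_number)     # each bin is 15 years wide
```

Either way every observation is counted once; only the grouping changes.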
Let's apply a log scale and add a trendline to a scatter plot
# Note that data points on the x axis are compressed with a linear scale
qplot(x = gdpPercap, y = lifeExp, data = my_data, geom = 'point')
# Here the x axis is log transformed
qplot(x = gdpPercap, y = lifeExp, log = 'x', data = my_data, geom = 'point')
# Let's add a trendline to the data as well.
# The linear regression model 'lm' will be added on top of our previous plot
qplot(x = gdpPercap, y = lifeExp, log = 'x',
      main = "Scatter plot of life expectancy versus GDP per capita",
      xlab = "Log-transformed GDP per capita ($)",
      ylab = "Life expectancy (years)",
      data = my_data,
      # The following line sets the method for the `smooth` trendline
      # We want our regression to be a linear model, or `lm`
      method = lm,
      # the `c()` function allows us to pass multiple values
      # to the `geom` argument
      geom = c('point','smooth'))
## Ignore warning message
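Because the x axis is log-transformed before the trendline is computed, the straight line in the plot corresponds to a linear model of life expectancy against log10 of GDP per capita. A minimal sketch of that model in base R follows; the two vectors are made-up values, not the gapminder data.

```r
# Made-up values that loosely resemble GDP per capita and life expectancy
gdp_percap <- c(500, 1000, 2000, 5000, 10000, 20000, 40000)
life_exp   <- c(45, 50, 56, 63, 69, 75, 80)

# A straight line on a log-scaled x axis is a linear model
# of life expectancy against log10(GDP per capita)
fit <- lm(life_exp ~ log10(gdp_percap))
coef(fit)

# The slope is the change in life expectancy for a tenfold increase in GDP
slope <- unname(coef(fit)[2])
slope
```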
qplot(x = continent, y = lifeExp,
      main = "Boxplot of life expectancy by continent",
      xlab = "Continent",
      ylab = "Life expectancy (years)",
      data = my_data,
      geom = 'boxplot')
These plots (or anything else really) can be assigned to an object using the <- symbol so that they are stored in your "global environment" and can be recalled, modified or worked with elsewhere in the script.
my_boxplot <- qplot(x = continent, y = lifeExp,
                    main = "Boxplot of life expectancy by continent",
                    xlab = "Continent",
                    ylab = "Life expectancy (years)",
                    data = my_data,
                    geom = 'boxplot')
# Now displaying your plot is as simple as printing the original dataset
my_boxplot
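Assignment works the same way for any kind of object, not only plots. As a small illustration with made-up numbers (the object names here are hypothetical):

```r
# Assign a result to a name with `<-`; nothing is printed at this point
life_exp <- c(43.8, 76.4, 72.3, 80.7)   # made-up values
life_exp_mean <- mean(life_exp)

# The object now exists and can be recalled by typing its name
exists("life_exp_mean")
life_exp_mean

# It can also be reused later in the script
round(life_exp_mean)
```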