Close down R, say 'No' to saving the workspace. In fact, I recommend you change your settings so that
The typical workflow in R is
If you are using RStudio, I highly recommend you look into setting up projects.
Start R and then choose your working directory.
setwd(choose.dir())
Now start the workshop to get your data files and scripts set up. Typically you would choose your working directory to be where you actually have your data and scripts.
library('ISDSWorkshop') workshop(launch_index = FALSE) # If you need the outline, set this to TRUE
Open the 03_advanced_graphics.R
script.
A convention is to put library()
calls for all packages that you will use in
the script at the top of the script.
library('dplyr') library('ggplot2')
Typically, you will have a script that reads in the data, converts variables, and then performs statistical analyses.
This time we will use the mutate()
function from the
dplyr package.
This function allows you to create new columns in a data.frame
.
# Read in csv files GI = read.csv('GI.csv') icd9df = read.csv("icd9.csv") # Add columns to the data.frame GI <- GI %>% mutate( date = as.Date(date), weekC = cut(date, breaks="weeks"), week = as.numeric(weekC), facility = as.factor(facility), icd9class = factor(cut(icd9, breaks = icd9df$code_cutpoint, labels = icd9df$classification[-nrow(icd9df)], right = TRUE)), ageC = cut(age, breaks = c(-Inf, 5, 18, 45 ,60, Inf)), zip3 = trunc(zipcode/100))
If the script to read in the data is extensive, it may be better to separate
your scripts into two different files: one for reading in the data and
converting variables and another (or more) to perform statistical analyses.
Then, in the second file, i.e. the one that does an analysis, you can source()
the one that just reads in the data, e.g.
source("read_data.R")
At this point, I typically check the data to make sure the data.frame
is what
I think it is.
I usually do this in interactive mode, i.e. not in a script.
dim(GI) str(GI) summary(GI)
Create a new variable in the GI data set called weekD
that is a Date
object
with each observation having the Monday of the week for that observation.
Check that it is actually a date using the str()
function.
# Create weekD variable in GI data set
When you have completed the activity, compare your results to the solutions.
Now to construct a graph, the workflow is
Suppose we want to plot the number of weekly GI cases by facility. We need to summarize the data by week and facility.
GI_wf <- GI %>% group_by(week, facility) %>% summarize(count = n())
In interactive mode, you should verify that this data set is correct, e.g.
nrow(GI_wf) # Should have number of weeks times number of facilities rows ncol(GI_wf) # Should have 3 columns: week, facility, count dim(GI_wf) head(GI_wf) tail(GI_wf) summary(GI_wf) summary(GI_wf$facility)
Now, we would like week on the x-axis and count on the y-axis.
ggplot(GI_wf, aes(x = week, y = count)) + geom_point()
But, clearly we need to distinguish the facilities.
Colors:
ggplot(GI_wf, aes(x = week, y = count, color = facility)) + geom_point()
Shapes:
ggplot(GI_wf, aes(x = week, y = count, shape = facility)) + geom_point()
ggplot2
only plots 6 different shapes and provides a warning.
Construct a data set and then a plot to look at weekly GI cases by age category.
# Construct a data set aggregated by week and age category # Construct a plot to look at weekly GI cases by age category.
When you have completed the activity, compare your results to the solutions.
Faceting is a technique to view many small graphs on a single page.
In ggplot, there are two different ways to construct facets:
facet_wrap()
and facet_grid()
.
The former performs the faceting for a single variable and constructs the
matrix of plots automatically.
The latter performs faceting for two variables putting one vertically and the
other horizontally.
When there are many levels of a single variable,
facet_wrap()
can often allow comparisons across that variable.
ggplot(GI_wf, aes(x = week, y = count)) + geom_point() + facet_wrap(~ facility)
By default, faceting forces all axes to be exactly the same.
If you want the axis to automatically scale for each facet use scales="free"
.
ggplot(GI_wf, aes(x = week, y = count)) + geom_point() + facet_wrap(~ facility, scales = "free")
When there are two variables (with not very many levels in each),
facet_grid()
can allow relevant comparisons.
Suppose we are interested in weekly GI counts by gender and age category. First, construct the data set
GI_sa <- GI %>% group_by(week, gender, ageC) %>% summarize(count = n())
Next, construct the plot using facet_grid(row ~ column)
syntax.
ggplot(GI_sa, aes(x = week, y = count)) + geom_point() + facet_grid(gender ~ ageC)
In addition, we could use faceting together with colors/shapes
ggplot(GI_sa, aes(x = week, y = count, shape = gender, color = gender)) + geom_point() + facet_wrap(~ ageC)
Construct a plot of weekly GI counts by zip3 and ageC.
# Construct a plot of weekly GI counts by zip3 and ageC.
When you have completed the activity, compare your results to the solutions.
Often our data set is large and we would like to filter the data to focus on a particular group or time frame. You may want to filter by a
Date
variable.For a categorical variable, you want to filter by
Suppose we are only interested in counts for infectious and parasitic diseases (IPD).
IPD_w <- GI %>% filter(icd9class == "infectious and parasitic disease") %>% group_by(week) %>% summarize(count = n()) ggplot(IPD_w, aes(x = week, y = count)) + geom_point()
Suppose we are interested in looking at both IPD and ill-defined conditions
conditions = levels(GI$icd9class) conditions
So we want levels 1 and 3.
IPDp <- GI %>% filter(icd9class %in% levels(icd9class)[c(1,3)]) summary(IPDp$icd9class)
Now construct the graph
IPDp_w <- IPDp %>% group_by(week, icd9class) %>% summarize(count = n()) ggplot(IPDp_w, aes(x = week, y = count)) + geom_point() + facet_wrap(~ icd9class, scales = "free")
Suppose we are interested in all icd9 codes except IPD.
# Combine filtering and summarizing nIPD_w <- GI %>% filter(icd9class != "infectious and parasitic disease") %>% group_by(week, icd9class) %>% summarize(count = n()) ggplot(nIPD_w, aes(x = week, y = count)) + geom_point() + facet_wrap(~ icd9class, scales = "free")
Suppose we are interested in all icd9 codes except IPD and ill-defined conditions.
nIPDp_w <- GI %>% filter(!(icd9class %in% levels(icd9class)[c(1,3)])) %>% group_by(week, icd9class) %>% summarize(count = n()) ggplot(nIPDp_w, aes(x = week, y = count)) + geom_point() + facet_wrap(~ icd9class, scales = "free")
We may want to subset a continuous (numeric) variable by
Age60p_w <- GI %>% filter(age > 60) %>% group_by(week) %>% summarize(count = n()) ggplot(Age60p_w, aes(x = week, y = count)) + geom_point()
Age_lte30_w <- GI %>% filter(age <= 30) %>% group_by(week) %>% summarize(count = n()) ggplot(Age_lte30_w, aes(x = week, y = count)) + geom_point()
Age30to60_w <- GI %>% filter(age > 30, age <= 60) %>% # This includes those who are 60, but not those who are 30 group_by(week) %>% summarize(count = n()) ggplot(Age30to60_w, aes(x = week, y = count)) + geom_point()
Filtering a date works similar to filtering a continuous (numeric) variable
once you convert a comparison value to a Date
using the as.Date()
function.
Let's create a Date
vector containing the two values we will use for
filtering.
two_dates = as.Date(c("2007-01-01", "2007-12-31"))
early <- GI %>% filter(date < two_dates[1]) late <- GI %>% filter(date >= two_dates[2]) mid <- GI %>% filter(date >= two_dates[1], date < two_dates[2])
Now make plots
early_w <- early %>% group_by(week) %>% summarize(count = n()) late_w <- late %>% group_by(week) %>% summarize(count = n()) mid_w <- mid %>% group_by(week) %>% summarize(count = n()) g1 = ggplot(early_w, aes(x = week, y = count)) + geom_point() g2 = ggplot(late_w, aes(x = week, y = count)) + geom_point() g3 = ggplot(mid_w, aes(x = week, y = count)) + geom_point()
To arrange multiple plots, use the grid.arrange
function in the gridExtra
package.
# you may need to run install.packages("gridExtra") if (require('gridExtra')) gridExtra::grid.arrange(g1,g3,g2)
This set of figures would have been a lot easier to construct if we had simply created a new variable that had values early, mid, and late and then used faceting to create the plot, e.g.
cut_dates <- c(as.Date("1900-01-01"), two_dates, Sys.Date()) GI_time <- GI %>% mutate(time = cut(date, breaks = cut_dates, labels = c("early","mid","late"))) %>% group_by(week, time) %>% summarize(count = n()) ggplot(GI_time, aes(x = week, y = count)) + geom_point() + facet_wrap(~ time, ncol = 1, scales = "free")
Construct a plot for those in zipcode 206xx between Jan 1, 2008 and Dec 31, 2008.
# Filter the data to zipcode 206xx between Jan 1, 2008 and Dec 31, 2008 # Aggregate the date for each week in this time frame # Construct the plot of weekly GI counts in zipcode 206xx.
When you have completed the activity, compare your results to the solutions.
To make the graphics we are producing look professional, we will need to
customize the graph according to the final
production platform, e.g. website, word document, pdf document, etc.
This will be highly specific to your use.
Nonetheless, likely you will use the following ggplot2
functions:
scale_color_manual()
labs()
theme()
But there are many others. One thing you will notice is that we can keep updating the graph by updating the R object that holds the graph.
Let's start with the following graph
GI$weekD = as.Date(GI$weekC) GI_sum <- GI %>% group_by(weekD, gender, ageC) %>% summarize(count = n()) ggplot(GI_sum, aes(x = weekD, y = count, color = gender)) + geom_point(size = 3) + facet_grid(ageC ~ ., scales = "free_y")
The easiest way to change some of the labels on the graph is to change the
underlying data. Here we change the levels associated with the ageC
variable.
levels(GI_sum$ageC) levels(GI_sum$ageC) = paste("Age:", c("<5","5-18","18-45","45-60",">60")) table(GI_sum$ageC)
g = ggplot(GI_sum, aes(x = weekD, y = count, color = gender)) + geom_point(size = 3) + facet_grid(ageC ~ ., scales = "free_y") g
g
is the R object that contains the graph
Intuitive colors can make a graph easier to read.
g = g + scale_color_manual(values=c("hotpink","blue"), name='Gender') g
Use proper names and capitalization in title and axis labels.
g = g + labs(title = "Weekly GI cases", x = "Year", y = "Weekly count") + theme(legend.position = "bottom") g
Try alternate themes:
g = g + theme_bw() g
To see a list of possibilities:
?theme_bw ?theme
Depending on the final platform, font sizes may need to be adjusted.
g = g + theme(title = element_text(size=rel(2)), text = element_text(size=16), legend.background = element_rect(fill = "white", size = .5, color = "black")) g
To export the graphs, use the ggsave()
function which saves the last plot that
you displayed.
As a default, it uses the size of the current graphics device,
but you will likely want to modify this.
g ggsave("plot.pdf")
If you look in your current working directory, you will see the plot.pdf file.
ggsave("plot.pdf", width=14, height=8)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.