library(learnr)
library(tidyverse)
library(knitr)
library(here)
library(twitterwidget)
library(rlang)
library(patchwork)
library(ggrepel)
knitr::opts_chunk$set(echo = FALSE, fig.align = "center")
source(here::here("z_plusDS/code", "bespoke.R")) # loads custom objects
url <- "https://github.com/tidyverse/ggplot2/raw/master/man/figures/logo.png"
knitr::include_graphics(url)
ggplot in R
We ended the last class with:
ggplot(heart, aes(x = sex, y = chol)) + geom_point(color = "darkblue", size = 3) + labs(x = "sex", y = "Cholesterol", title = "Cholesterol values from the heart disease dataset", caption = "Data from Kaggle | Plot from @matthewhirschey") + theme_minimal() + NULL
Any questions from last week?
Below is an example of the most basic form of the ggplot code
ggplot(data = dataframe) +
geom(mapping = aes(x, y))
Take a moment to look back at the code template. You can see that in that code we assigned a dataset and the information we needed to map it to a type of plot
ggplot(data = `r dataframe_name`) +
  geom_point(mapping = aes(x = `r df_numeric1_name`, y = `r df_numeric2_name`))
df_input %>% ggplot(aes(!!sym(df_numeric1_name), !!sym(df_numeric2_name))) + geom_point()
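To make the template concrete, here is a minimal sketch that fills it in with `mtcars`, a dataset that ships with R. The dataset and the columns `wt` and `mpg` are stand-ins chosen for illustration; they are not the class dataset.

```r
library(ggplot2)

# mtcars is built into R; wt and mpg are both numeric columns
p <- ggplot(data = mtcars) +
  geom_point(mapping = aes(x = wt, y = mpg))

p  # printing the object draws the scatter plot
```

The same three pieces appear as in the template: a data frame, a geom, and an `aes()` mapping of columns to x and y.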
ggplot {.build}
ggplot()
ggplot(data = `r dataframe_name`) +
geom_functions
  geom_point(mapping = aes(x = `r df_numeric1_name`, y = `r df_numeric2_name`))
ggplot style {.build}
Generally, different people have strong opinions about style and data visualization
Data visualization is a rich and complex area of study and is beyond the scope of this introductory course
That being said, here are a few style tips:
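- Put the + at the end of each line rather than at the beginning of the next one. R treats a line that already forms a complete expression as finished, so a + at the start of the next line is not attached to the plot
ggplot(`r dataframe_name`) +
  geom_point(aes(x = `r df_numeric1_name`, y = `r df_numeric2_name`))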
plot1 <- df_input %>% ggplot(aes(!!sym(df_numeric1_name), !!sym(df_numeric2_name))) + geom_point()
plot2 <- df_input %>% ggplot(aes(!!sym(df_numeric1_name), !!sym(df_numeric2_name))) + geom_smooth()
plot1 + plot2
geom is different between these plots
geom is short for geometric object, which is the visual object used to represent the data
plot1 <- ggplot(`r dataframe_name`) +
  geom_point(aes(`r df_numeric1_name`, `r df_numeric2_name`))
plot2 <- ggplot(`r dataframe_name`) +
  geom_smooth(aes(`r df_numeric1_name`, `r df_numeric2_name`))
Different data types require different plot types.
When plotting your data, it is often helpful to take a glimpse at the data you intend to plot to know what kinds of variables you will be working with
glimpse(`r dataframe_name`)
So now that you know your variable types, how do you know which geoms to use?
Use the following resources to match your data type to the appropriate geoms
https://rstudio.com/resources/cheatsheets/
ggplot(`r dataframe_name`) +
  geom_point(aes(x = `r df_char1_name`, y = `r df_numeric1_name`))
Use the cheatsheet. Try your best guess.
ggplot(`r dataframe_name`) +
  geom_boxplot(aes(x = `r df_char1_name`, y = `r df_numeric1_name`)) +
  geom_point(aes(x = `r df_char1_name`, y = `r df_numeric1_name`))
geoms for yourself
Each new geom adds a new layer
Everything up to this point gets you a basic graph; but what about colors, shapes and overall style?
You can change 5 basic aesthetics
1. Color - changes the outline color of your datapoints
2. Size - choose the size of the datapoint
3. Shape - choose a pre-defined shape
4. Alpha - changes the transparency of each point
5. Fill - changes the fill color of your points
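The five aesthetics above can be set to fixed values inside a geom. This is a minimal sketch using the built-in `mtcars` data as a stand-in for the class dataset; the specific colors and sizes are arbitrary choices.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(
    color = "darkblue",  # outline color
    fill  = "skyblue",   # interior color (only shapes 21 to 25 have a fill)
    shape = 21,          # a circle with a separate outline and fill
    size  = 3,           # point size
    alpha = 0.6          # transparency: 0 is invisible, 1 is opaque
  )

p
```

Because these values sit outside `aes()`, every point gets the same look regardless of the data.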
Go to code/
Open 04_ggplot2.Rmd
Complete the exercise.
Beyond simply changing the size or color of the variables in your plot, you can encode more information by mapping these values to data in your data set.
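As a sketch of mapping aesthetics to data, the example below colors points by one variable and sizes them by another. It assumes the built-in `mtcars` data; `cyl` and `hp` are stand-in columns, not the class variables.

```r
library(ggplot2)

# cyl is stored as a number, so factor() turns it into discrete groups
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = factor(cyl), size = hp))

p
```

Placing `color` and `size` inside `aes()` is what ties them to columns; ggplot2 then builds the legends automatically.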
Go to code/
Open 05_aes.Rmd
Complete the exercise.
In ggplot2, we have the option to set mappings globally or locally. Setting a mapping globally means setting those values in the original ggplot function.
Example: Earlier in class you made this graph:
ggplot(`r dataframe_name`) +
  geom_jitter(aes(x = `r df_char1_name`, y = `r df_numeric1_name`)) +
  geom_boxplot(aes(x = `r df_char1_name`, y = `r df_numeric1_name`))
ggplot(df_input) + geom_jitter(aes(!!sym(df_char1_name), !!sym(df_numeric1_name))) + geom_boxplot(aes(!!sym(df_char1_name), !!sym(df_numeric1_name)))
However, if we map our x and y values in the ggplot function we find that we generate the same graph
ggplot(`r dataframe_name`, aes(x = `r df_char1_name`, y = `r df_numeric1_name`)) +
  geom_jitter() +
  geom_boxplot()
ggplot(df_input, aes(!!sym(df_char1_name), !!sym(df_numeric1_name))) + geom_jitter() + geom_boxplot()
This is because when you set the aes mappings in the original ggplot function you are setting the aes globally.
This means all the functions afterwards will inherit that mapping. So in our example, this means that both the jitter and boxplot geoms know to graph the same information
You can also set aes values locally within the geom function. Doing so will only change the values in that geom
ggplot(`r dataframe_name`, aes(x = `r df_char1_name`, y = `r df_numeric1_name`)) +
  geom_jitter() +
  geom_boxplot(aes(color = `r df_char1_name`))
ggplot(df_input, aes(!!sym(df_char1_name), !!sym(df_numeric1_name))) + geom_jitter() + geom_boxplot(aes(color = !!sym(df_char1_name)))
mean <- mean(df_numeric2_vec)
sd <- sd(df_numeric2_vec)
Data can also be set locally or globally. For this example, let's filter our original data first using the dplyr::filter function
df_filter <- `r dataframe_name` %>% filter(`r df_numeric2_name` > `r round(mean + 2*sd)`)
*this number is two standard deviations above the mean value of `r df_numeric2_name`
Now, let's identify only the `r dataframe_about` in our data that are outliers, more than 2SD above the mean, by setting data locally in a new geom
ggplot(`r dataframe_name`, aes(x = `r df_char1_name`, y = `r df_numeric1_name`)) +
  geom_jitter() +
  geom_boxplot(aes(color = `r df_char1_name`)) +
  geom_label(data = df_filter, aes(label = `r df_id_name`))
df_filter <- df_input %>% filter(!!sym(df_numeric2_name) > round(mean + 2*sd))
ggplot(df_input, aes(!!sym(df_char1_name), !!sym(df_numeric1_name))) + geom_jitter() + geom_boxplot(aes(color = !!sym(df_char1_name))) + geom_label_repel(data = df_filter, aes(label = !!sym(df_id_name)))
Notice that we have to indicate the new dataset, but because it uses the same x and y variables, we did not need to set those mappings again
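The same pattern can be sketched with built-in data. This example assumes `mtcars` as a stand-in dataset and uses the hypothetical names `car_data`, `cutoff`, and `car_outliers`; it labels only the rows whose horsepower is more than two standard deviations above the mean.

```r
library(ggplot2)

car_data <- mtcars
car_data$model <- rownames(mtcars)      # keep the car names as an ID column
car_data$cyl <- factor(car_data$cyl)    # treat cylinder count as a category

# rows more than 2 SD above the mean horsepower
cutoff <- mean(car_data$hp) + 2 * sd(car_data$hp)
car_outliers <- car_data[car_data$hp > cutoff, ]

p <- ggplot(car_data, aes(x = cyl, y = mpg)) +            # global data and mapping
  geom_boxplot() +
  geom_jitter(width = 0.2) +
  geom_label(data = car_outliers, aes(label = model))     # local data, inherited x and y

p
```

Only `geom_label()` sees the filtered data; the boxplot and jitter layers still draw every row.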
Go to code/
Open 06_global_v_local.Rmd
Complete the exercise to practice mapping locally and globally.
Several options exist to change the default labels and legends. Recall, this code:
ggplot(`r dataframe_name`, aes(x = `r df_char1_name`, y = `r df_numeric1_name`)) +
  geom_jitter() +
  geom_boxplot(aes(color = `r df_char1_name`))
original_plot <- ggplot(df_input, aes(!!sym(df_char1_name), !!sym(df_numeric1_name))) + geom_jitter() + geom_boxplot(aes(color = !!sym(df_char1_name)))
original_plot
But it has two problems:
1. The x-axis label is redundant
2. The figure legend is also redundant
labs
ggplot(`r dataframe_name`, aes(x = `r df_char1_name`, y = `r df_numeric1_name`)) +
  geom_jitter() +
  geom_boxplot(aes(color = `r df_char1_name`)) +
  labs(x = "") # blank quotes remove the label
labs
Gave us this plot:
ggplot(df_input, aes(!!sym(df_char1_name), !!sym(df_numeric1_name))) + geom_jitter() + geom_boxplot(aes(color = !!sym(df_char1_name))) + labs(x = "")
guides
ggplot(`r dataframe_name`, aes(x = `r df_char1_name`, y = `r df_numeric1_name`)) +
  geom_jitter() +
  geom_boxplot(aes(color = `r df_char1_name`)) +
  labs(x = "") + # blank quotes remove the label
  guides(color = "none")
guides
lab_plot <- ggplot(df_input, aes(!!sym(df_char1_name), !!sym(df_numeric1_name))) + geom_jitter() + geom_boxplot(aes(color = !!sym(df_char1_name))) + labs(x = "") + guides(color = "none")
original_plot + lab_plot
Faceting allows you to create multiple graphs side by side in one panel. This is especially useful when you want to see the data together, but not on top of each other
For example:
ggplot(`r dataframe_name`) +
  geom_point(aes(x = `r df_char1_name`, y = `r df_numeric1_name`)) +
  facet_grid(cols = vars(`r df_char2_name`))
ggplot(df_input, aes(!!sym(df_char1_name), !!sym(df_numeric1_name))) + geom_point() + facet_grid(cols = vars(!!sym(df_char2_name)))
*This is especially useful for exploratory data analysis
You can change almost everything you see on your chart, but a lot of the things you may look to change are part of the "theme"
Here we are going to change some features about our title text:
ggplot(`r dataframe_name`, aes(x = `r df_char1_name`, y = `r df_numeric1_name`)) +
  geom_jitter() +
  geom_boxplot(aes(color = `r df_char1_name`)) +
  labs(title = "My first plot") +
  theme(plot.title = element_text(face = "bold", size = 12))
theme1 <- ggplot(df_input, aes(!!sym(df_char1_name), !!sym(df_numeric1_name))) + geom_jitter() + geom_boxplot(aes(color = !!sym(df_char1_name))) + labs(title = "My first plot") + theme(plot.title = element_text(face = "bold", size = 12))
original_plot + theme1
Next, let's change the aesthetics of our legend box
ggplot(`r dataframe_name`, aes(x = `r df_char1_name`, y = `r df_numeric1_name`)) +
  geom_jitter() +
  geom_boxplot(aes(color = `r df_char1_name`)) +
  labs(title = "My first plot") +
  theme(plot.title = element_text(face = "bold", size = 12),
        legend.background = element_rect(fill = "gray", colour = "black"))
theme2 <- ggplot(df_input, aes(!!sym(df_char1_name), !!sym(df_numeric1_name))) + geom_jitter() + geom_boxplot(aes(color = !!sym(df_char1_name))) + labs(title = "My first plot") + theme(plot.title = element_text(face = "bold", size = 12), legend.background = element_rect(fill = "gray", colour = "black"))
theme1 + theme2
Finally, let's change the legend position
ggplot(`r dataframe_name`, aes(x = `r df_char1_name`, y = `r df_numeric1_name`)) +
  geom_jitter() +
  geom_boxplot(aes(color = `r df_char1_name`)) +
  labs(title = "My first plot") +
  theme(plot.title = element_text(face = "bold", size = 12),
        legend.background = element_rect(fill = "gray", colour = "black"),
        legend.position = "bottom")
theme3 <- ggplot(df_input, aes(!!sym(df_char1_name), !!sym(df_numeric1_name))) + geom_jitter() + geom_boxplot(aes(color = !!sym(df_char1_name))) + labs(title = "My first plot") + theme(plot.title = element_text(face = "bold", size = 12), legend.background = element_rect(fill = "gray", colour = "black"), legend.position = "bottom")
theme2 + theme3
Pre-set themes also exist as an easy way to change the entire theme of your graph quickly. They can also be combined with custom theme settings
ggplot(`r dataframe_name`, aes(x = `r df_char1_name`, y = `r df_numeric1_name`)) +
  geom_jitter() +
  geom_boxplot(aes(color = `r df_char1_name`)) +
  labs(title = "My first plot") +
  theme_minimal()
theme4 <- ggplot(df_input, aes(!!sym(df_char1_name), !!sym(df_numeric1_name))) + geom_jitter() + geom_boxplot(aes(color = !!sym(df_char1_name))) + labs(title = "My first plot") + theme_minimal()
theme1 + theme4
ggsave {.build}
If you make a plot, there are a few ways to save it, though the simplest is to use ggsave
ggsave("ggsaveexample.png", plot = last_plot())
You can change the type of file you save or the size.
ggsave("ggsaveexample.pdf", plot = my_awesome_object, width = 6, height = 6, units = "cm")
Where does it save?
getwd()
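A relative file name is written to the working directory that `getwd()` reports. To control the location, pass a full path instead. The sketch below assumes the built-in `mtcars` data and writes to R's temporary directory, which stands in for a project folder; `out_file` is a hypothetical name.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# tempdir() stands in for wherever you want the file to go
out_file <- file.path(tempdir(), "ggsave_example.pdf")

# the file type is inferred from the extension
ggsave(out_file, plot = p, width = 6, height = 4, units = "in")

file.exists(out_file)
```

Passing the plot object explicitly avoids relying on `last_plot()`, which only tracks the most recently displayed plot.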
Go to code/
Open 07_ggplot_together.Rmd
Complete the exercise to put all these ggplot skills to work.
Any questions?
The readr package (found in the tidyverse collection) contains a number of useful functions of the form read_* to import data. For example, if you have a .csv file, you would use the read_csv function
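As a minimal sketch of `read_csv`, the example below first writes a tiny CSV to a temporary file and then reads it back. The file contents and the names `csv_path` and `patients` are made up for illustration.

```r
library(readr)

# create a small CSV file to import (illustrative values only)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,age,chol", "1,63,233", "2,37,250", "3,41,204"), csv_path)

# read_csv() guesses the column types and returns a tibble
patients <- read_csv(csv_path, show_col_types = FALSE)

patients
```

Setting `show_col_types = FALSE` only silences the column-type message; the import itself is unchanged.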
The dataset provided to you is a cleaned R-specific document. But you will never find this in 'the wild'.
Most often, you will need to find a data file (such as csv), and import it
For the purpose of this class, we have generated a simulated dataset of `r dataframe_join_about` to accompany the `r dataframe_name` dataset.
To import the `r dataframe_join_file_name` file into RStudio, run the following:
`r dataframe_join_name` <- read_csv(here::here("data", "`r dataframe_join_file_name`"))
Go to code/
Open 08_import_and_join.Rmd
Complete the exercise to import this new dataset.
You can also use the readr package to import data from a URL
For example, to load a dataset from a URL, run the following
path <- here::here("data", "phx.csv")
url <- "https://raw.githubusercontent.com/matthewhirschey/tidybiology-plusds/master/data/phx.csv"
patient_hx <- read_csv(url)
There are many times when you have two or more overlapping datasets that you would like to combine
The dplyr package has a number of *_join functions for this purpose
url <- "https://github.com/matthewhirschey/tidybiology-plusds/raw/master/media/dplyr-joins.png"
knitr::include_graphics(url)
left_join {.build}
Returns all rows from a, and all columns from a and b
Rows in a with no match in b will have NA values in the new columns
If there are multiple matches between a and b, all combinations of the matches are returned
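These three rules can be seen in a small sketch. The tables `a` and `b` and their columns are invented for illustration: id 2 has no match in `b`, and id 1 matches twice.

```r
library(dplyr)

a <- tibble(id = c(1, 2, 3), chol = c(233, 250, 204))
b <- tibble(id = c(1, 1, 3), visit = c("2019", "2020", "2020"))

joined <- left_join(a, b, by = "id")

joined
# id 1 appears twice (one row per matching visit),
# id 2 is kept with visit = NA, and id 3 appears once
```

Every row of `a` survives the join, which is why the result has four rows even though `a` has three.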
left_join example {.build}
Take a look at the variables in each dataset - `r dataframe_name` and `r dataframe_join_name`
You will notice that both datasets contain a common variable - `r df_id_name`. This can therefore serve as a common variable to join on. Let's join on this:
left_join `r dataframe_name` with `r dataframe_join_name` and assign the output to a new object called `r dataframe_name`_join_left
Go to code/
Open 08_import_and_join.Rmd
Complete the exercise to join the two datasets.
Now you have one dataset with additional useful information
right_join
url <- "https://github.com/matthewhirschey/tidybiology-plusds/raw/master/media/dplyr-joins.png"
knitr::include_graphics(url)
right_join {.build}
Returns all rows from b, and all columns from a and b
Rows in b with no match in a will have NA values in the new columns
If there are multiple matches between a and b, all combinations of the matches are returned
This is conceptually equivalent to a left_join with a and b swapped, but can be useful when stringing together multiple steps using %>%
inner_join
url <- "https://github.com/matthewhirschey/tidybiology-plusds/raw/master/media/dplyr-joins.png"
knitr::include_graphics(url)
inner_join {.build}
Returns all rows from a where there are matching values in b, and all columns from a and b
If there are multiple matches between a and b, all combinations of the matches are returned
full_join
url <- "https://github.com/matthewhirschey/tidybiology-plusds/raw/master/media/dplyr-joins.png"
knitr::include_graphics(url)
full_join {.build}
Returns all rows and all columns from both a and b
Where there are no matching values, returns NA for the one missing
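To compare the join types side by side, here is a small sketch with two invented tables. Only ids 2 and 3 appear in both; id 1 is only in `a` and id 4 is only in `b`.

```r
library(dplyr)

a <- tibble(id = c(1, 2, 3), chol = c(233, 250, 204))
b <- tibble(id = c(2, 3, 4),
            status = c("Diabetic", "High Cholesterol", "Diabetic"))

right <- right_join(a, b, by = "id")  # keeps ids 2, 3, 4 (all rows of b)
inner <- inner_join(a, b, by = "id")  # keeps ids 2, 3 (matches only)
full  <- full_join(a, b, by = "id")   # keeps ids 1, 2, 3, 4 (everything)

nrow(right)
nrow(inner)
nrow(full)
```

Rows without a partner get `NA` in the columns that come from the other table, for example `chol` for id 4 in `right` and `full`.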
Any questions?
A string is what we store text within. It can be either:
- A single word: "awesome"
- A sentence: "this class is awesome"
- A combination: c("blue", "is my favorite", "color")
Any of these can be stored as an object, which we call a string.
Dealing with character strings is a bit different than dealing with numbers in R.
Fortunately, the tidyverse has a package called stringr for dealing with them.
df_input2 %>% count(health_status, sort = TRUE) %>% slice(2) %>% select(1) %>% pull() #restart here
str_count() is a function we can use to count how many times a pattern appears in each string, which lets us count the rows that match a particular pattern.
Here the output for each row will be either 1 (match) or 0 (no match)
In this code:
- the string we want to evaluate is `r dataframe_joined_name`$`r df_joined_string3_name`
- the pattern we want to count is "High Cholesterol"
str_count(heart_joined$health_status, "High Cholesterol")
A bunch of 0 and 1 are not incredibly useful.
But since R is good at adding, we can simply wrap the previous expression in sum()
Try it below:
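The idea can be sketched on a small made-up vector before trying it on the class data. The vector `health_status` here is a hypothetical stand-in for the real column.

```r
library(stringr)

health_status <- c("High Cholesterol", "Diabetic", "High Cholesterol",
                   "Normal blood sugar and cholesterol")

# one count per element: 1 where the pattern is found, 0 where it is not
matches <- str_count(health_status, "High Cholesterol")
matches

# adding up the 1s gives the number of matching rows
total <- sum(matches)
total
```

Note that the last element does not match: it contains "cholesterol" but not the exact text "High Cholesterol".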
We previously matched the entire string "High Cholesterol"
But we can use the same function to detect patterns within longer strings.
Let's look for how many patients take a statin of any kind using
str_count(heart_joined$medication_hx, "statin")
What does the output mean?
When using a stringr function, you may get an output saying a string pattern doesn't exist. If you know for sure it does, check your pattern: the string must match exactly, including capitalization, or it will not be found!
How many people have an "auntie" or "aunt" in their health history?
Go to code/
Open 09_stringr.Rmd
Complete the exercise.
That solution worked in this case, but was not very elegant, and might not work for all cases (what if there was a 'great aunt' in the list?)
Or here is a more specific case for this data set.
How many patients have a father with a history of disease? But we don't want to include grandfathers in the results.
We can use something called Regular Expressions, aka Regex, to solve this
Think of regex as a separate language, with its own code, syntax, and rules.
Regex rules allow complex matching patterns for strings, to ensure matching exactly the content desired
It is far too complex to cover in its entirety here, but here is one specific example.
GOAL: identify all of the patients that have a father with a history of disease, but excluding grandfathers in the results.
We start with the pattern father.
But then we want to make sure that we capture both Father and father. To accept either case of f in the first spot we add (F|f), so now our regex looks like (F|f)ather
Lastly, we want this pattern to appear at the beginning of the string, so we add the regex ^ symbol.
Our completed regex looks like:
str_count(heart_joined$family_history, "^(F|f)ather")
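To see what this regex accepts and rejects, here is a sketch on a made-up vector. The entries in `family_history` are invented and may not resemble the real column.

```r
library(stringr)

family_history <- c("Father had heart disease",
                    "father: diabetes",
                    "Grandfather had a stroke",
                    "Mother and father",
                    "None")

# ^ anchors the match to the start of each string,
# (F|f) accepts either capitalization of the first letter
father_counts <- str_count(family_history, "^(F|f)ather")
father_counts

sum(father_counts)
```

The anchor is what excludes "Grandfather", but it also excludes "Mother and father", because that entry does not begin with the pattern. Whether that is desirable depends on how the real data is written.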
Go to code/
Open 09_stringr.Rmd
Complete the exercise to count mothers.
In addition to counting, we can use another function, str_detect(), to logically evaluate a character string.
Because this logically evaluates an expression, the output is either TRUE or FALSE
Practically, str_detect is used to detect the presence or absence of a pattern in a string
str_detect(heart_joined$health_status, "Diabetic")
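A short sketch on an invented vector shows the logical output and one common use for it. The name `is_diabetic` is hypothetical.

```r
library(stringr)

health_status <- c("Diabetic", "High Cholesterol", "Diabetic")

# TRUE where the pattern is present, FALSE where it is absent
is_diabetic <- str_detect(health_status, "Diabetic")
is_diabetic

# TRUE counts as 1, so sum() gives the number of matches
sum(is_diabetic)

# a logical vector can also be used to subset the original strings
health_status[is_diabetic]
```

Unlike `str_count()`, the result is always TRUE or FALSE per string, no matter how many times the pattern occurs.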
str_replace() {.build}
In the health_status column we have:
- "Diabetic"
- "High Cholesterol"
- "Normal blood sugar and cholesterol"
But let's say we want to simplify healthy individuals to "normal"
str_replace(heart_joined$health_status, "Normal blood sugar and cholesterol", "normal")
str_replace() {.build}
We use this same code to modify the health_status column by assigning it to the same variable
heart_joined$health_status <-
  str_replace(heart_joined$health_status, "Normal blood sugar and cholesterol", "normal")
heart_joined$health_status <- str_replace(heart_joined$health_status, "Normal blood sugar and cholesterol", "normal")
head(heart_joined$health_status, n = 10)
stringr with dplyr {.build}
We can use stringr functions in tandem with dplyr functions.
We want to make a logical variable (TRUE/FALSE) that tells us if a patient has a normal health history using
heart_joined2 <- heart_joined %>% mutate(healthy = str_detect(health_status, "normal"))
heart_joined2 <- mutate(heart_joined, healthy = str_detect(health_status, "normal"))
head(heart_joined2$healthy, n = 10)
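The replace and detect steps can also be combined in a single `mutate()` call. This is a sketch on an invented three-row table; `patients` and `patients2` are hypothetical names.

```r
library(dplyr)
library(stringr)

patients <- tibble(
  id = 1:3,
  health_status = c("Diabetic",
                    "Normal blood sugar and cholesterol",
                    "High Cholesterol")
)

patients2 <- patients %>%
  mutate(
    # shorten the long label first
    health_status = str_replace(health_status,
                                "Normal blood sugar and cholesterol",
                                "normal"),
    # mutate() works top to bottom, so this sees the shortened label
    healthy = str_detect(health_status, "normal")
  )

patients2
```

The pattern in `str_detect()` is case sensitive, so it has to match the lowercase "normal" written by `str_replace()`.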
Go to code/
Open 09_stringr.Rmd
Complete the exercise to count mothers.
Any questions?
url <- "https://github.com/rstudio/hex-stickers/raw/master/PNG/rmarkdown.png"
knitr::include_graphics(url)
Plain text file with 3 types of content
url <- "https://github.com/matthewhirschey/tidybiology-plusds/raw/master/media/Rmarkdown.png"
knitr::include_graphics(url)
url <- "https://d33wubrfki0l68.cloudfront.net/96ec0c54c6d64ea2ec3665db9b3b781962ff6339/5cee1/lesson-images/how-3-output.png"
knitr::include_graphics(url)
url <- "https://d33wubrfki0l68.cloudfront.net/61d189fd9cdf955058415d3e1b28dd60e1bd7c9b/b739c/lesson-images/rmarkdownflow.png"
knitr::include_graphics(url)
When you run render, R Markdown feeds the .Rmd file to knitr, which executes all of the code chunks and creates a new markdown (.md) document which includes the code and its output.
The markdown file generated by knitr is then processed by pandoc, which is responsible for creating the finished format.
This may sound complicated, but R Markdown makes it extremely simple by encapsulating all of the above processing into a single render function.
knitr points {.build}
Knitr runs the document in a fresh R session, which means you need to load the libraries that the document uses in the document
Objects made in one code chunk will be available to code in later code chunks, but not before
For example, if you first create `r dataframe_name` and then use dplyr::left_join to create `r dataframe_joined_name`, `r dataframe_name` will be available later on in the document to do this. However, you cannot use `r dataframe_joined_name` in a code chunk before you make it, even if it is available in your environment
To keep this straight, just think (and code) in sequential chunks
https://bookdown.org/yihui/rmarkdown/
Any questions?
url <- "https://d33wubrfki0l68.cloudfront.net/59f29676ef5e4d74685e14f801bbc10c2dbd3cef/c0688/lesson-images/markdown-1-markup.png"
knitr::include_graphics(url)
code
Any questions?
url <- "https://github.com/matthewhirschey/tidybiology-plusds/raw/master/media/Rmarkdown-code.png"
knitr::include_graphics(url)
url <- "https://github.com/matthewhirschey/tidybiology-plusds/raw/master/media/Rmarkdown-code-shortcut.png"
knitr::include_graphics(url)
url <- "https://github.com/matthewhirschey/tidybiology-plusds/raw/master/media/Rmarkdown-chunk1.png"
knitr::include_graphics(url)
url <- "https://github.com/matthewhirschey/tidybiology-plusds/raw/master/media/Rmarkdown-chunk2.png"
knitr::include_graphics(url)
url <- "https://github.com/matthewhirschey/tidybiology-plusds/raw/master/media/Rmarkdown-chunk3.png"
knitr::include_graphics(url)
url <- "https://github.com/matthewhirschey/tidybiology-plusds/raw/master/media/Rmarkdown-chunk4.png"
knitr::include_graphics(url)
url <- "https://github.com/matthewhirschey/tidybiology-plusds/raw/master/media/Rmarkdown-chunk5.png"
knitr::include_graphics(url)
url <- "https://github.com/matthewhirschey/tidybiology-plusds/raw/master/media/Rmarkdown-chunk6.png"
knitr::include_graphics(url)
url <- "https://github.com/matthewhirschey/tidybiology-plusds/raw/master/media/Rmarkdown-chunk7.png"
knitr::include_graphics(url)
Any questions?
url <- "https://github.com/matthewhirschey/tidybiology-plusds/raw/master/media/Rmarkdown-yaml.png"
knitr::include_graphics(url)
url <- "https://github.com/matthewhirschey/tidybiology-plusds/raw/master/media/Rmarkdown-output.png"
knitr::include_graphics(url)
Recall that Rmarkdown documents can be rendered into several different output file types
Parameters of a document are defined in the YAML header, and can pre-populate an Rmarkdown document. To see this in action,
url <- "https://github.com/matthewhirschey/tidybiology-plusds/raw/master/media/Rmarkdown-parameters.png"
knitr::include_graphics(url)
url <- "https://github.com/matthewhirschey/tidybiology-plusds/raw/master/media/Rmarkdown-using-parameters.png"
knitr::include_graphics(url)
Data science enables data-driven information gathering and hypothesis generation
-- Scientific Research
-- Reviews
Data science enables the ability to ask new types of questions
Process-centric, not necessarily question-centric
Making things computable makes them actionable at zero marginal cost.
Workflows save time, achieve reproducibility
Teaching Assistants
- Allie Mills, Ph.D.
- Akshay Bareja, D.Phil.
Inspiration, ideas, packages, code
- R4DS (Garrett Grolemund and Hadley Wickham)
- Mine Çetinkaya-Rundel (datasciencebox.org)
- Chester Ismay and Albert Y. Kim (Modern Dive)
- Garrett Grolemund (Remastering the Tidyverse)
- Tidyverse devs and community
- Rstudio
Any questions?