knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.align = "center" )
library(tidyverse) library(pbda)
NA
, NaN
and NULL
Don't use use zeroes to represent missing data. 0
is valid observed value.
NA
(Not Available) is most often use to represent missing data.
NaN
(Not a Number) is the result of an undefined operation, e.g. 0 / 0
.
NULL
means "undefined" and is only used in a programming context (i.e., a function that returns NULL
). You can't put NULL
values in a data frame.
Let's examine a data frame with some missing data.
missing_ex
filter
with is.na()
You can identify variables with NA
values by combining filter()
and is.na()
.
# find rows where name is NA missing_ex %>% filter(is.na(name)) # find rows where id is *not* NA missing_ex %>% filter(!is.na(id))
na.omit()
You can remove all rows containing NA
values with na.omit()
.
missing_ex %>% na.omit()
NA
valuesExclude NA
values from operations with na.rm = TRUE
.
missing_ex$value1 # if NAs are present, the result is NA sum(missing_ex$value1) # solution: drop NAs from the calculation sum(missing_ex$value1, na.rm = TRUE)
How many rows of missing_ex
have NA
values for either value1
or value2
(27)?
How many rows of missing_ex
remain after filtering out id
values of NA
(25)?
Calculate the mean for values in each group in missing_ex
. Use summarize()
, then try summarize_at()
.
The brauer_gene_exp
data contains a data set from a manuscript describing how gene expression changes in yeast under several nutrient limitation conditions.
We'll explore this data in broad strokes to get you thinking about how to ask simple biological questions with a complex data set. We'll draw from a more thorough examination by David Robinson in a blog post---this is a good read if you're interested in digging in more deeply.
sorted <- brauer_gene_exp %>% select(systematic_name, rate, expression) %>% arrange(rate, desc(expression)) sorted %>% head(10) sorted %>% tail(10)
You can also use the View()
function to examine the data in a spreadsheet-like viewer.
sorted %>% View()
Use sample_n()
and sample_frac()
to randomly select rows from a large data set. Get a reproducible sample using set.seed()
brauer_gene_exp # sample 1e5 rows set.seed(1234) brauer_gene_exp %>% sample_n(1e3) # sample 10% of the rows set.seed(1234) brauer_gene_exp %>% sample_frac(0.1)
Many biological databases store information in multiple tables. These tables are related by a common identifier: a gene name, or possibly a numeric identifer.
We will use two datasets brauer_gene_exp
and yeast_prot_prop
to illustrate how data from related tables can be combined using joining operations.
brauer_gene_exp
contains gene expression data from yeast grown under different nutrient limitation conditions.
yeast_prot_prop
contains descriptive data for yeast proteins.
Examine the two tables and identify common variables for linking. If the variables have different names, Use the by
argument to left_join()
to specify common variables.
gene_exp_rate <- brauer_gene_exp %>% select(systematic_name, rate, expression) gene_mw <- yeast_prot_prop %>% select(ORF, Mw) # Find rows from gene_exp_rate with a match in gene_mw and return all variables left_join(gene_exp_rate, gene_mw, by = c("systematic_name" = "ORF")) # Find rows from gene_exp_rate with a match in gene_mw and return only the gene_exp_rate variables semi_join(gene_exp_rate, gene_mw, by = c("systematic_name" = "ORF")) # Find rows from gene_exp_rate *without* a match in gene_mw anti_join(gene_exp_rate, gene_mw, by = c("systematic_name" = "ORF"))
Calucate mean expression values for each systematic_name
in brauer_gene_exp
. How do the numbers change after sampling 1% of the data?
How many unique gene names (systematic_name
) are shared between the brauer_gene_exp
and yeast_prot_prop
tables? (5,536).
We'll use the brauer_gene_exp
data set to illustrate how you can use summary statistics to capture relationships between many variables.
This plot illustrates the relationship between gene expression and growth rate for the Grr1/YJR090C gene under glucose limitation.
library(cowplot) exp_data <- brauer_gene_exp %>% filter(nutrient == "Glucose" & systematic_name == "YJR090C") exp_data %>% ggplot(aes(x = rate, y = expression)) + geom_point() + geom_smooth(method = 'lm', se = FALSE) + ggtitle("Gene expression vs. growth rate for Grr3 under glucose limitation")
You can also use lm()
to model this relationship.
lm(expression ~ rate, exp_data)
The results of many statistical operations (lm()
, t.test()
, cor.test()
) are not tidy and relevant parameters are not easily assessible. The broom
package provides a function called tidy()
that tidies the results of these operations.
library(broom) fit <- exp_data %>% lm(expression ~ rate, data = .) tidy(fit)
Can we capture linear relationships between expression and rate for all combinations of variables? This is surprisingly easy.
# this takes ~60 seconds to run models <- brauer_gene_exp %>% select(systematic_name:expression) %>% group_by(systematic_name, nutrient) %>% nest() %>% # remove `head` to generate the full data set head() %>% mutate( model = map(data, ~ tidy(lm(rate ~ expression, data = .x))) ) %>% select(-data)
Plot the expression / rate relationship for the top-ten most significant genes. Color lines by nutrient.
topten <- models %>% unnest() %>% arrange(p.value) %>% head(10) %>% select(systematic_name) topten_exp <- brauer_gene_exp %>% semi_join(topten) topten_exp %>% ggplot(aes(x = rate, y = expression, color = nutrient)) + geom_point() + geom_smooth(method = 'lm', se = FALSE) + facet_wrap(~systematic_name + name) + scale_color_brewer(palette = "Set1")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.