RStudio

knitr::opts_chunk$set(echo = TRUE)

Analyze a qPCR experiment

Here is a link to the Rmarkdown for this assignment. Download that and update your name at the top.

You are investigating how the transcription factor MAGIC regulates the expression of two genes WAND and POOF. You use wild-type and mutant lines with different forms of the MAGIC gene and examine the expression of MAGIC, WAND, and POOF. The experiment is a time-course after you add an activator of MAGIC. You collect 4 time points (0, 12, 24, 48 hour) with three biological replicates.

The cells have three mutant forms of MAGIC:

a hypomorphic allele (MAGIC-hypo) that is known to reduce its activity
an siRNA-mediated knockdown of MAGIC (MAGIC-siRNA)
a deletion of the MAGIC gene (MAGIC-null)

You collect expression data in a qPCR experiment in a 384-well plate (a common format for these experiments). The data in CSV format are available in the class package. One CSV contains expression levels and the other contains sample names coded as cell_time_gene_rt_rep.

This code will give you the path and file names of the CSVs.

# devtools::install_github('rnabioco/eda')
qpcr384_data_csv <- system.file("extdata", "qpcr-data-384.csv", package = 'pbda')
qpcr384_names_csv <- system.file("extdata", "qpcr-names-384.csv", package = 'pbda')

Exercise 1

Use a function from the readr library--part of the tidyverse--to load the CSV data. You should have two new tibbles.

Exercise 2

Inspect these tibbles and make note of their resemblance to a 384 plate. Tidy these tibbles into this format:

# A tibble: 128 x 6
cell  time  gene    rt   exp_mean      exp_var
<chr> <chr> <chr> <chr>      <dbl>        <dbl>
1 MAGIC-hypo     0 ACTIN     -   1.000000 0.000000e+00
2 MAGIC-hypo     0 ACTIN     +   2.333333 4.133333e-01
3 MAGIC-hypo     0 MAGIC     -   1.000000 0.000000e+00
4 MAGIC-hypo     0 MAGIC     +  13.000000 9.750000e+00
5 MAGIC-hypo     0  POOF     -   1.000000 0.000000e+00
6 MAGIC-hypo     0  POOF     + 981.333333 5.717973e+05
7 MAGIC-hypo     0  WAND     -   1.000000 0.000000e+00
8 MAGIC-hypo     0  WAND     +   1.000000 1.900000e-01
9 MAGIC-hypo    12 ACTIN     -   1.000000 0.000000e+00
10 MAGIC-hypo    12 ACTIN     +  10.000000 1.600000e-01
# ... with 118 more rows

Note that this table does not have row and col (they have already been dropped) and that the replicates have been grouped and summarized by their mean (exp_mean) and variance (exp_var).

Exercise 3

Thare are two sets of qPCR reactions: one where reverse transcriptase was added to the RNA sample, and one where it was not. The rt variable reflects this by noting samples with + and -.

Make two plots of the distribution of expression values for all sample values. In one plot, use geom_histogram() and facet by rt. In the other plot, change the x-axis scale using scale_x_log10().

What do these plots tell you about the values from the rt == "-" samples?

Exercise 4

Create a plot of expression versus time for each of the MAGIC cell types. At this point you can remove the rt == "-" controls. You will need to plot expression value on a log-scale to see differences.

In which cell lines is the expression of WAND affected? In what way? Can you make a statement about this (e.g., XXX is required for YYY expression)?
Compare the data per-cell and per-gene by creating two separate plots by grouping (e.g., using facets) one or the other of those variables. Which of these is more useful to see differences in gene expression?

Exercise 5

Normalize the expression data dividing each of the MAGIC, WAND, and POOF value by the ACTIN values. You will need to use spread() to rearrange the data for calculation, and then gather() to reformat for plotting. Re-create the plots from question 4 with this normalized data. Did your interpretation change?

Exploratory analysis of biological data

Explore the brauer_gene_exp and yeast_protein_prop data sets to address your own hypothesis. Here is an example.

Hypothesis

Sulfate limitation of yeast causes downregulation of genes containing high levels of cysteine and methionine residues.

Approach

Calculate cysteine / methionine content for each gene using the yeast_protein_prop table.

library(tidyverse)
library(pbda)
sulfurs <- yeast_prot_prop %>%
  select(ORF, Cys, Met) %>%
  group_by(ORF) %>%
  mutate(n_sulfur = sum(Cys, Met)) %>%
  ungroup() %>%
  rename(systematic_name = ORF) %>%
  select(systematic_name, n_sulfur)

Test whether gene expression and n_sulfur are correlated.

library(broom)

brauer_gene_exp %>%
  select(systematic_name, nutrient, expression) %>%
  left_join(sulfurs) %>%
  select(-systematic_name) %>%
  group_by(nutrient) %>%
  nest() %>%
  mutate(
    cor = map(data, ~ tidy(cor.test(.$n_sulfur, .$expression)))
    ) %>%
  select(nutrient, cor) %>%
  unnest() %>%
  arrange(-estimate)

Identify genes that are repressed under sulfate starvation conditions. Use the linear modeling approach to identify genes with negative slopes.

library(broom)
models <- brauer_gene_exp %>%
  select(systematic_name:expression) %>%
  group_by(systematic_name, nutrient) %>%
  nest() %>%
  # head() %>% 
  mutate(
    model = map(data, ~ tidy(lm(expression ~ rate, data = .x)))
  ) %>%
  select(-data)

# Now unnest and grab relevant data, i.e. genes with negative slopes
models <- models %>%
  unnest() %>%
  filter(term == "rate" & estimate < 0 & !is.na(nutrient)) %>%
  select(systematic_name, nutrient, estimate)

Cross-reference the two tables above to count the overlaps. What is a good control here? Need to ask whether the downregulation of these genes is specific to sulfate limitation. Are these same genes affected by limitation of other nutrients?

combined <- left_join(sulfurs, models) %>%
  select(-systematic_name) %>%
  filter(!is.na(nutrient))

library(cowplot)
ggplot(combined, aes(x = n_sulfur, y = estimate)) +
  geom_point() +
  scale_x_log10() +
  facet_grid(~nutrient)