knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Install and Load Package

usually this should work:

library(devtools)
install_github('manschmi/RMetaTools') #full install including dependencies

Note, if the above fails, you can also try to load RMetatools as Project in R and the run 'Build and Install'. Finally, if both fail you can try to source relevant .R files. You will then have manually load dependencies (these should be pretty common R packages tidyverse, broom, magrittr, rtracklayer and jsonlite).

and then this should not throw an error

library(RMetaTools)

INSTALL on ie genome.dk

Handling R packages at genome.dk can be tricky, ie installing devtools which will be needed failed for me so better stick to 'conda' for handling such annoyancies. Ups: you will have to install conda as described in the genome.au.dk helppages.

Run this in the Terminal, front-end: (ups can take a long time each step...). Just say yes to everything when asked. ```{bash, eval=F} conda create -n RMetaTools_env conda activate RMetaTools_env conda install -c r r conda install -c r rstudio conda install -c r r-devtools conda install -c bioconda bioconductor-rtracklayer

now go to backend activate the environment again and start R
```{bash, eval=FALSE}
srun --pty bash
conda activate RMetaTools_env
rstudio #or R if you prefer that, rstudio requires that you logged in with ssh -Y

install RMetaTools for use in RMetaTools_env at genome.dk:

library(devtools)
library(httr)
set_config( config( ssl_verifypeer = 0L ) )
options(unzip = "internal")

install_github('manschmi/RMetaTools') #full install including dependencies
install_github('manschmi/RMetaTools', upgrade=FALSE) #only RMetaTools but no dependencies are updated

check by trying to load RMetaTools

library('RMetaTools')

Hopefully this does not throw and error. Now you should be all set to try and runt he examples below and/or test with your own data!

Loading and visualizing deeptools matrices

This should work with any deeptools matrix. From deeptools v2 to v3 some things, a version for deeptools 2 is posted at the very end of this document.

?load_deeptoolsmatrix3_new

for deeptools v3

Load an example matrix from RMetatools (if package installed, otherwise you can find this file in subfolder RMetatools/inst/extdata). For your own data simply specify full path and filename of your matrix.

fname <- system.file("extdata", "deeptools3_matrix.gz", package = "RMetaTools")

fname
m <- load_deeptoolsmatrix3_new(fname)

This loads an object that contains all info from the computeMatrix output file, with slots R names(m). Subplot in the output from plotHeatmap are separated, that is, each samle and group are present several times if more than 1x group or samples, respectively, are present.

We can easily get information about samples and groups (subgroups of regions).

unique(m$samples)
unique(m$groups)

To modify the relevant names in the object you can directly modify samples or groups and regions. Note: groups and region names must be consistent, as shown below.

change sample names

ie remove everything after _hg38 since this is irrelevant. This can be done like this:

m$samples <- sub('_hg38.*', '', m$samples)

change group and region names

ie rename, not relevant in this example but name could be changed to somethin informative to be displayed in plots below. In any case if you change, the names must also be used as names of m$regions. But m$groups can have many more entries than m$regions. So this has to be done for each of those separate.

ie

m$groups <- sub('genes', 'Gene', m$groups)
names(m$regions) <- sub('genes', 'Gene', names(m$regions))

aggregate for metagene plots

Easiest is to create metagene profile plots using the metagene_aggregate function in the tool.

You can get help on this function using the usual:

?metagene_aggregate

a usual example would be to add pseudocount and log2 transform the data first and then run a t.test for each position(bin) like this:

agg <- metagene_aggregate(m, aggregate_fun = 'ttest', transform_fun = function(x){log2(x+1)})

the output from this is pretty self-explanatory, but depends on aggregate fun. For the call above we get from the t.test the mean of log2(value+pseudocount) please lower and higher confidence interval.

agg

simple metagene plot therefrom

ggplot(agg, aes(x=rel_pos, y=mean, color=sample)) +
  geom_line() +
  geom_ribbon(aes(x=rel_pos, ymin=conf.low, ymax=conf.high, fill=sample), alpha=.2, color=NA) +
  theme(axis.text.x = element_text(angle=45, hjust=1))

you can see this is pretty messy. But can easily be modified using base R or tidyverse functionality.

For example we are only interested in ctrl, Z1 vs Z8 but the replicates should be distinguished more readily

agg %>%
  dplyr::filter(grepl('^ZCCHC8', sample) | grepl('^ZFC3H1', sample) | grepl('^EGFP', sample)) %>%
  tidyr::separate(sample, c('siRNA', 'rep'), sep='_') %>%
ggplot(., aes(x=rel_pos, y=mean, color=siRNA, linetype=rep)) +
  geom_line() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle=45, hjust=1))

--> much easier to appreciate that both Z8 and Z1 have higher signal close to 0.

heatmaps

heatmap display in R is somewhat "heavy". In most cases its probably best to use deeptools plotHeatmap, perhaps modify the input matrix before using computeMatrixOperations from deeptools or the script computeMatrixOperationsMS.py from Manfreds Software folder.

In any case, if you rather use R, you will first need to convert m into a tidy object. RMetatools has a built-in function for this.

m_tidy <- to_tibble(m)

m_tidy

There is small issue that rel_pos is not considered to be numeric, need to be fixed for most use-cases below.

m_tidy %<>%
  mutate(rel_pos = as.numeric(rel_pos))

Also: a pretty slowing-down feature is actually the y axis labels, usually best to simply suppress those with theme(axis.text.y = element_blank())

Basic heatmap can be created like this:

m_tidy %>%
  ggplot(., aes(x=rel_pos, y=name, fill=log2(value+1))) +
  geom_tile() +
  facet_grid(.~sample) +
  theme(axis.text.y = element_blank(),
        axis.text.x = element_text(angle=45, hjust=1))

This is pretty useless as there is way too much information. Preferably take as above only ctrl, Z8 and Z1 and use the average of the values like that.

First subset the tidy object and compute the average, not the NA values that may exist (depending on whether you chose to convert NA to 0 earlier or not).

m_tidy_simple <- m_tidy %>%
  dplyr::filter(grepl('^ZCCHC8', sample) | grepl('^ZFC3H1', sample) | grepl('^EGFP', sample)) %>%
  mutate(value = ifelse(is.na(value), 0, value)) %>%
  tidyr::separate(sample, c('siRNA', 'rep'), sep='_') %>%
  group_by(name, siRNA, rel_pos) %>%
  summarize(mean_log2_value = mean(log2(value+1)))

create the much more comprehensive heatmap

m_tidy_simple %>%
  ggplot(., aes(x=rel_pos, y=name, fill=mean_log2_value)) +
  geom_tile() +
  facet_grid(.~siRNA) +
  theme(axis.text.y = element_blank(),
        axis.text.x = element_text(angle=45, hjust=1))

a heatmap sorted by signal

It can happen that heatmaps actually contain very little data, ie heavily inflated by 0 values. To get nicer plots, remove 'all-0' rows and sort by signal.

ids_sum_sorted <- m_tidy_simple %>%
  group_by(name) %>%
  summarize(total_value = sum(mean_log2_value)) %>%
  dplyr::filter(total_value > 0) %>%
  arrange(total_value) %$%
  name
m_tidy_simple %>%
  dplyr::filter(name %in% ids_sum_sorted) %>%
  mutate(name = factor(name, levels=ids_sum_sorted)) %>%
  ggplot(., aes(x=rel_pos, y=name, fill=mean_log2_value)) +
  geom_tile() +
  facet_grid(.~siRNA) +
  theme_minimal() +
  theme(axis.text.y = element_blank(),
        axis.text.x = element_text(angle=45, hjust=1))

slightly better but the color scheme also not helping much. This can still be improved a bit more

m_tidy_simple %>%
  dplyr::filter(name %in% ids_sum_sorted) %>%
  mutate(name = factor(name, levels=ids_sum_sorted)) %>%
  ggplot(., aes(x=rel_pos, y=name, fill=mean_log2_value)) +
  geom_tile() +
  scale_fill_gradient(low='white', high='firebrick4') +
  facet_grid(.~siRNA) +
  theme_minimal() +
  theme(axis.text.y = element_blank(),
        axis.text.x = element_text(angle=45, hjust=1))

Metagene Analysis Directly with RMetaTools

You can also totally avoid using deeptools. But this was a hobby project hardly ever used in practice, even though it works. ie The part of the vignette below was mostly created for my own record.

Load metagene values

First step is to define the regions of interest.

Using a bed file this tool will query a range around an anchor point TSS, TES or center. Note, under the hood this uses rtracklayer to load, so gtfs etc should work, but non-conventional columns in your files may cause this to fail.

bedfile <- system.file("extdata", "Chen_PROMPT_TSSs_liftedTohg38.bed", package = "RMetaTools")

anchor = 'TSS'
upstream = 1000
downstream = 1000
window_size = 50

Track files, this are usually 2 bigwig files for stranded data. If you are assaying unstranded data use the same bigwig file for bw_plus and bw_minus.

bw_plus <- system.file("extdata", "GSM1573841_mNET_8WG16_siLuc_plus_hg38.bw", package = "RMetaTools")
bw_minus <- system.file("extdata", "GSM1573841_mNET_8WG16_siLuc_minus_hg38.bw", package = "RMetaTools")

Compute matrices and create metagene object directly in R

this can take some time and make downstream processing slow

mR <- metagene_rel_anchor(bw_plus, bw_minus, bedfile, anchor, upstream, downstream, window_size)

names(mR)
mR$header
mR$groups
mR$samples
mRo <- modify(mR, select_samples =c('mNET'), rename_samples = '.*\\/')
mRo$samples
mRo <- modify(mRo, rename_samples = 'GSM......._')
mRo <- modify(mRo, rename_samples = '__.*')
mRo$samples

R create metagene values directly to tibble

this can take some time and make downstream processing slow

meta_tbl <- RMetaTools::metagene_matrix(bw_plus, bw_minus, bedfile, anchor, upstream, downstream, window_size)

meta_tbl

Plot metagene values

There is also a built-in heatmap plotter for the tidy data

plot_htmp(meta_tbl)

May be useful to represent as log2 values

(pseudocount <- min(meta_tbl$value[meta_tbl$value > 0]))
log2_meta_tbl <- mutate(meta_tbl, value=log2(value+pseudocount))
plot_htmp(log2_meta_tbl) +
  scale_fill_gradient(low='white', high='navyblue') + #custom color scheme
  coord_fixed(ratio = 4) +#y/x aspect ratio of output plot
  theme(axis.text.x = element_text(angle=45, hjust=1), #rotate x axis labels
        axis.title.y = element_blank()) +#remove y axis title
  xlab('bp to PROMPT TSS')

this is fine, the only caveat here is that negatives are mentally a bit challenging, ie they still refer to "positive signal".

so its much more paedogical to use pseudocount = 1

log2_meta_tbl <- mutate(meta_tbl, value=log2(value+1))
plot_htmp(log2_meta_tbl) +
  scale_fill_gradient(low='white', high='navyblue') + #custom color scheme
  coord_fixed(ratio = 4) +#y/x aspect ratio of output plot
  theme(axis.text.x = element_text(angle=45, hjust=1), #rotate x axis labels
        axis.title.y = element_blank()) +#remove y axis title
  xlab('bp to PROMPT TSS')

but obviously this also changes the appearance quite a bit, since low values are now very close to 0... matter of choice imho.

metagene profiles

So far this is simply doing a one-sample t.test which computes mean and parametric confidence intervals.

log2_meta_tbl_avgs <- meta_average(log2_meta_tbl)

This can be plotted easily

plot_profile(log2_meta_tbl_avgs)

Recipes

using Rscript metageneR.R

Instead of running the metagene matrix generation interactively, it can alos be run using Rscript. The easiest is to get this from github at \link{https://raw.githubusercontent.com/manschmi/RMetaTools/master/scripts/metageneR.R}.

or on genome.au.dk in the Terminal: ```{bash, eval=FALSE} wget https://raw.githubusercontent.com/manschmi/RMetaTools/master/scripts/metageneR.R

which will add metageneR.R to the current location.

in a Terminal you can now do:
```{bash, eval=FALSE}
Rscript metageneR.R -h

Recipe: order heatmap rows

using sum of values per row

To order heatmap rows in ggplot2 requires a bit of work. ie to order according to the sum in each row, we first compute and order 'by-hand'.

value_order <- log2_meta_tbl %>%
  filter(rel_pos == 0 | rel_pos==50) %>% #optional: select ie first 2 50bp intervals from TSS
  group_by(name) %>%
  summarize(total_value = sum(value)) %>%
  arrange(total_value) %$%
  name

We can apply this ordering to the tbl using factor().

log2_meta_tbl %>%
  mutate(name = factor(name, levels=value_order)) %>%
  plot_htmp(., do_interpolate = TRUE) +
  scale_fill_gradient(low='white', high='navyblue')

using region size

ups for PROMPT annotation used above we only have TSS position but no actual size from start. For this better to use something else, ie snRNA.

snRNA_bedfile <- system.file("extdata", "snRNA_MS31_hg38.bed", package = "RMetaTools")
snRNA_meta_tbl <- RMetaTools::metagene_matrix(bw_plus, bw_minus, snRNA_bedfile, anchor, upstream, downstream, window_size) %>%
  mutate(value = log2(value+1))

snRNA_meta_tbl

without ordering

plot_htmp(snRNA_meta_tbl, do_interpolate = TRUE)

The actual annotations are never used in the default pipeline but you can get the info from the name.

size_order <- snRNA_meta_tbl %>%
  distinct(name) %>%
  separate(name, c('chr', 'start', 'end'), sep='\t', extra='drop', remove = FALSE) %>%
  mutate(region_size = as.integer(end) - as.integer(start)) %>%
  dplyr::select(name, region_size) %>%
  arrange(-region_size)

Then apply the order as factor levels as above.

p <- snRNA_meta_tbl %>%
  mutate(name = factor(name, levels=size_order$name)) %>%
  plot_htmp(., do_interpolate = TRUE)

p

and add lines for anchor and regions size (ups: not tested rigorously):

p +
  geom_vline(xintercept = 0, size=.1, color='black', alpha=.5) +
  geom_point(data = mutate(size_order, value=0), aes(x=region_size, y=name), size=1, color='black', alpha=.5, stroke=0)

Recipe: add another matrix side-by-side

First create the new matrix.

bw_plus2 <- system.file("extdata", "GSM1573839_mNET_8WG16_siCPSF73_plus_hg38.bw", package = "RMetaTools")
bw_minus2 <- system.file("extdata", "GSM1573839_mNET_8WG16_siCPSF73_minus_hg38.bw", package = "RMetaTools")

meta_tbl2 <- metagene_matrix(bw_plus2, bw_minus2, bedfile, anchor, upstream, downstream, window_size)

Before we combine, better add some identifier column, ie 'condition'.

meta_tbl$condition <- 'ctrl'
meta_tbl2$condition <- 'siCPSF73'

simply bind_rows then

both_meta <- bind_rows(meta_tbl, meta_tbl2)
log2_both_meta <- mutate(both_meta, value=log2(value+1))

heatmap side-by-side

log2_both_meta %>%
  mutate(name = factor(name, levels=value_order)) %>%
plot_htmp(.) +
  scale_fill_gradient(low='white', high='navyblue') +
  facet_grid(.~condition)

profiles of combine dataset

average of a combined dataset, should work

log2_both_meta_avg <- meta_average(log2_both_meta)

can be visualized as side-by-side plot:

plot_profile(log2_both_meta_avg) +
  facet_grid(.~condition)

but probably often more informative using overlay:

plot_profile(log2_both_meta_avg, color_by='condition')

Recipe: Add antisense coverage for comparison "on the bottom"

For this example we simply invert plus_bw and minus_bw --> ie get the antisense coverage on same annotations. Alternatively you can use other bedfile but same bws as above...

First create the new matrix to be put on the bottom. We use the set with already 2 columns ;-)

bottom1 <- metagene_matrix(bw_minus, bw_plus, bedfile, anchor, upstream, downstream, window_size)  %>%
  mutate(condition='ctrl')

bottom2 <- metagene_matrix(bw_minus2, bw_plus2, bedfile, anchor, upstream, downstream, window_size) %>%
  mutate(condition='siCPSF73')

bottom <- bind_rows(bottom1, bottom2) %>%
  mutate(strand = 'antisense')
log2_meta_all <- both_meta %>%
  mutate(strand = 'sense') %>%
  bind_rows(., bottom) %>%
  mutate(value=log2(value+1),
         strand=factor(strand, levels=c('sense', 'antisense')))

heatmap 2x2 matrix

log2_meta_all %>%
  mutate(name = factor(name, levels=value_order),
         value = ifelse(strand == 'sense', value, -value)) %>%
plot_htmp(.) +
  scale_fill_gradient2(low='firebrick', mid='white', high='navyblue') +
  facet_grid(strand~condition) +
  theme_minimal() +
  theme(axis.text.y=element_blank(),
        axis.title.y = element_blank())

heatmap sense antisense color overlay

this is a bit iffy but can be faked by duplicate each rowname and offsetting accordingly

rows of the heatmap are ordered by levels of 'name'. So we simply duplicate each name and order so that the positive strand values for each name are always followed by the negative strand values for the same name .

This is the order we apply now:

head(value_order)

lets make a value order for antisense

put this 2 together zipped way

zipped_value_order <- as.character(rbind(value_order,paste0(value_order, 'antisense')))

head(zipped_value_order)
log2_meta_all %>%
  mutate(name = ifelse(strand == 'antisense', name, paste0(value_order, 'antisense')),
         name = factor(name, levels=zipped_value_order),
         value = ifelse(strand == 'sense', value, -value)) %>%
plot_htmp(.) +
  scale_fill_gradient2(low='firebrick', mid='white', high='navyblue') +
  facet_grid(.~condition) +
  theme_minimal() +
  theme(axis.text.y=element_blank(),
        axis.title.y = element_blank())

another version would simply use a majority vote for each tile

log2_meta_all %>%
  spread(strand, value) %>%
  mutate(value = ifelse(sense > antisense, sense, -antisense),
         name = factor(name, levels=zipped_value_order)) %>%
plot_htmp(.) +
  scale_fill_gradient2(low='firebrick', mid='white', high='navyblue') +
  facet_grid(.~condition) +
  theme_minimal() +
  theme(axis.text.y=element_blank(),
        axis.title.y = element_blank())

profiles of 2x2 dataset

average of a combined dataset, should work

log2_meta_all_avg <- meta_average(log2_meta_all)

can be visualized as side-by-side plot:

plot_profile(log2_meta_all_avg) +
  facet_grid(strand~condition)

2x2 Strand vs antisense compound plot:

log2_meta_all_avg %>%
  mutate(estimate = ifelse(strand == 'antisense', -estimate, estimate),
         conf.low = ifelse(strand == 'antisense', -conf.low, conf.low),
         conf.high = ifelse(strand == 'antisense', -conf.high, conf.high)) %>%
plot_profile(., color_by='condition') +
  facet_grid(strand~., scales='free') +
  theme_minimal()

Recipe: log2 ratio between conditions

using the tidy tbl this is pretty straightforward:

(LFCs <- log2_both_meta %>%
  spread(condition, value) %>%
  mutate(value = siCPSF73-ctrl) %>%
  dplyr::select(-siCPSF73, -ctrl))
plot_htmp(LFCs, color_by='value') +
  scale_fill_gradient2(low='firebrick4', mid='white', high='navyblue', name="log2(siCPSF73/ctrl)") +
  coord_fixed(ratio = 4) 
ratio_profile <- meta_average(LFCs)
plot_profile(ratio_profile) +
  ylab('mean log2FC siCPSF73/ctrl')

Recipe: sensitivity (Sandelin lab style) ratio between conditions instead of log2FC

using the tidy tbl this is pretty straightforward:

(pseudocount <- min(both_meta$value[both_meta$value > 0]))
(sensitivities <- both_meta %>%
  mutate(value = value + pseudocount) %>%
  spread(condition, value) %>%
  mutate(value = (siCPSF73-ctrl)/max(siCPSF73, ctrl)) %>%
  dplyr::select(-siCPSF73, -ctrl))
plot_htmp(sensitivities, color_by='value') +
  scale_fill_gradient2(low='firebrick4', mid='white', high='navyblue', name="sensitivity = (siCPSF73-ctrl)/max(siCPSF73, ctrl)") +
  coord_fixed(ratio = 4) 
sensitivity_profile <- meta_average(sensitivities)
plot_profile(sensitivity_profile) +
  ylab('mean sensitibity siCPSF73/ctrl')

Recipe: p value of log2 ratio between conditions

using a t.test would require roughly normal distributed data. This is often not the case due to highly 0-inflated data. As shown here for the density of signals in the central bins. The TSS and 50bp bin are close to normal, but the others are highly skewed towards 0s.

log2_both_meta %>%
  filter(rel_pos > -100 & rel_pos < 200) %>%
  mutate(rel_pos = factor(rel_pos)) %>%
  ggplot(., aes(x=value, color=rel_pos)) +
  geom_density() +
  facet_grid(.~condition) +
  theme_minimal()

So its probably more appropriate to do a non-parametric test ie wilcoxon rank sum test. Together with the tidy function from R package broom.

suppressWarnings(library(broom))
(LFC_wilcox_p <- log2_both_meta %>%
  spread(condition, value) %>%
  group_by(rel_pos) %>%
  do(tidy(wilcox.test(.$siCPSF73, .$ctrl))))

This applies a significance test for every position. Hence, p-values need to be corrected for multiple testing:

LFC_wilcox_p %<>%
  mutate(padj = p.adjust(p.value))

plot the adjusted p values, ie visually better as -log10(p.value)

ggplot(LFC_wilcox_p,
       aes(x=rel_pos, y=-log10(padj))) +
  geom_line()

Recipe: bigwigs where minus bigwig values are represented as negatives

Some people present values of minus strand data as negatives. This can help visualization in Genome Browser, etc.. To make such datasets directly accessible I added an option to negate negative strand bigwigs:

meta_tbl <- RMetaTools::metagene_matrix(bw_plus, bw_minus, bedfile, anchor, upstream, downstream, window_size, negate_neg_strand_values = TRUE)

Individual Steps in the Pipeline

Loading Regions

Using a bed file this will create a range around an anchor point TSS, TES or center. Note, under the hood this uses rtracklayer to load, so gtfs etc should work, but non-conventional columns in your files may cause this to fail.

bedfile <- system.file("extdata", "Chen_PROMPT_TSSs_liftedTohg38.bed", package = "RMetaTools")

anchor = 'TSS'
upstream = 1000
downstream = 1000
window_size = 50
regions <- meta_regions(bedfile, 'TSS', upstream, downstream)

regions

Create a metagene matrix

On top of the general method one can also get the intermediate non-tidy matrix separately:

bw_plus <- system.file("extdata", "GSM1573841_mNET_8WG16_siLuc_plus_hg38.bw", package = "RMetaTools")
bw_minus <- system.file("extdata", "GSM1573841_mNET_8WG16_siLuc_minus_hg38.bw", package = "RMetaTools")

mat <- get_matrix(bw_plus, bw_minus, regions, upstream, downstream, window_size)

head(mat)

this matrix can be directly plotted either as heatmap or colMeans, etc...

heatmap(as.matrix(mat), Rowv = NA, Colv = NA, labRow = NA, labCol = NA)
plot(colSums(mat), type='l')

Conversion to tidy format is also implemented as function. This should return exactly same data as the metagene_matrix functioncreated using the same settings.

(tidy_meta <- mat_to_tbl(mat))

DEPRECATED

for deeptools v2

probably never used anymore

fname <- system.file("extdata", "matrix.gz", package = "RMetaTools")
df <- load_deeptoolsmatrix(fname)

df

older version for deeptools v3

fname <- system.file("extdata", "deeptools3_matrix.gz", package = "RMetaTools")

df <- load_deeptoolsmatrix3(fname)

simple metagene plot therefrom

ups: this example has way too many samples --> simplify before plotting

mini_df <- df %>%
  filter(grepl('^EGFP', sample_name) | grepl('^RRP40', sample_name)) %>%
  mutate(sample_name = sub('_hg38.*', '', sample_name))
mini_df %>%
  group_by(sample_name, rel_pos) %>%
  summarize(mean_log2_value = mean(log2(value+1))) %>%
  ggplot(., aes(x=rel_pos, y=mean_log2_value, color=sample_name)) +
  geom_line() +
  theme(axis.text.x = element_text(angle=45, hjust=1))

Session

sessionInfo()


manschmi/RMetaTools documentation built on Dec. 14, 2021, 4:33 a.m.