3. PAT-Seq alternative polyadenylation example

APA sites can be detected using the PAT-Seq protocol. This protocol produces 3'-end focussed reads. We examine GSE83162. This is a time-series experiment in which two strains of yeast are released into synchronized cell cycling and observed through two cycles. Yeast are treated with $\alpha$-factor, which causes them to stop dividing in antici... pation of a chance to mate. When the $\alpha$-factor is washed away, they resume cycling.

Shift score definition

Each gene has several APA sites, ordered from furthest upstrand to furthest downstrand. For each sample, we have a read count at each site.

For each gene:

We define the "shift" of a particular sample relative to all reads from all samples.

A "shift" score is first assigned to each site, being the proportion of all reads upstrand of the site minus the proportion of all reads downstrand of it (i.e. an average over all reads where upstrand reads are 1, downstrand reads are -1 and reads from the site itself are 0). The measurement for each sample is then an average over the site score for each read. The weight is the number of reads.

Shifts scores range from -1 to 1, and summarize whether upstrand (more negative) or downstrand (more positive) sites are being favoured. The weighted average score is zero.

The weights are the number of reads, however for a randomly chosen read we can estimate its variance based on the number of reads at each site and the site scores. (This estimate of variance includes any biological signal, so it's not exactly a residual variance.) This is stored in the rowData() of the weitrix, and can be used to further calibrate weights. We prefer to defer this sort of calibration until after we've discoverd components of variation, as it tends to give high weight to genes with little APA. There are clearly some alternative choices to how weighting could be performed, and we hope the weitrix package gives you basic building blocks with which you can experiment!

Load files

knitr::opts_chunk$set(cache=TRUE, autodep=TRUE)

# To examine objects:
# devtools::load_all(".", export_all=F) ; qwraps2::lazyload_cache_dir("vignettes/3_shift_cache/html")
library(tidyverse)    # ggplot2, dplyr, etc
library(patchwork)    # side-by-side ggplots
library(reshape2)     # melt()
library(weitrix)      # Matrices with precision weights

# Produce consistent results
set.seed(12345)

# BiocParallel supports multiple backends. 
# If the default hangs or errors, try others.
# The most reliable backed is to use serial processing
BiocParallel::register( BiocParallel::SerialParam() )
peaks <- system.file("GSE83162", "peaks.csv.gz", package="weitrix") %>%
    read_csv()

counts <- system.file("GSE83162", "peak_count.csv.gz", package="weitrix") %>%
    read_csv() %>%
    column_to_rownames("name") %>%
    as.matrix()

genes <- system.file("GSE83162", "genes.csv.gz", package="weitrix") %>%
    read_csv() %>%
    column_to_rownames("name")

samples <- data.frame(sample=I(colnames(counts))) %>%
    extract(sample, c("strain","time"), c("(.+)-(.+)"), remove=FALSE) %>%
    mutate(
        strain=factor(strain,unique(strain)), 
        time=factor(time,unique(time)))
rownames(samples) <- samples$sample

groups <- dplyr::select(peaks, group=gene_name, name=name)
# Note the order of this data frame is important
samples

head(groups, 10)

counts[1:10,1:5]

A "shift" weitrix is constructed based on a matrix of site-level counts, plus a data frame grouping sites into genes. The order of this data frame is important, earlier sites are considered upstrand of later sites.

wei <- counts_shift(counts, groups)

colData(wei) <- cbind(colData(wei), samples)
rowData(wei)$symbol <- genes$symbol[match(rownames(wei), rownames(genes))]

Having obtained a weitrix, everthing discussed for the poly(A) tail length example is applicable here as well. We will only perform a brief exploratory analysis here.

Exploratory analysis

Components of variation

We can look for components of variation.

comp_seq <- weitrix_components_seq(wei, p=6, design=~0)
components_seq_screeplot(comp_seq)

Pushing a point somewhat, we examine four components.

comp <- comp_seq[[4]]

matrix_long(comp$col, row_info=samples, varnames=c("sample","component")) %>%
    ggplot(aes(x=time, y=value, color=strain, group=strain)) + 
    geom_hline(yintercept=0) + 
    geom_line() + 
    geom_point(alpha=0.75, size=3) + 
    facet_grid(component ~ .) +
    labs(title="Sample scores for each component", y="Sample score", x="Time", color="Strain")

Calibration

Weights can be calibrated. Ideally the weights should be 1 over the residual variance. We will fit a gamma GLM with log link function to the squared residuals, and use the predictions from this to produce better weights. This model uses the per_read_var information present in the rowData, as well as applying a non-linear transformation to the existing weights.

cal <- weitrix_calibrate_all(
    wei, 
    design = comp, 
    trend_formula = ~log(per_read_var)+well_knotted_spline(log2(weight),3))

metadata(cal)$weitrix$all_coef

A mean-variance trend is visible in the uncalibrated weighted residuals. Calibration has removed this, by making use of the information in per_read_var.

(weitrix_calplot(wei, comp, covar=mu) + labs(title="Before")) +
(weitrix_calplot(cal, comp, covar=mu) + labs(title="After"))

We can look at the calibration to per_read_var directly.

(weitrix_calplot(wei, comp, covar=per_read_var) + labs(title="Before")) +
(weitrix_calplot(cal, comp, covar=per_read_var) + labs(title="After"))

Weights were too large for large read counts. Calibration has applied a non-linear transformation to the weights that fixes this.

(weitrix_calplot(wei, comp, covar=log(weitrix_weights(wei))) + labs(title="Before")) +
(weitrix_calplot(cal, comp, covar=log(weitrix_weights(wei))) + labs(title="After"))

Another way of looking at this is with a "funnel" plot of 1/sqrt(weight) vs residuals. This should make a cone shape. Note how before calibration the pointy end of the cone is too wide.

(weitrix_calplot(wei, comp, funnel=TRUE, guides=FALSE) + labs(title="Before")) +
(weitrix_calplot(cal, comp, funnel=TRUE, guides=FALSE) + labs(title="After"))

Using the calibrated weitrix with weitrix_confects

We will estimate confident loadings for the different components using weitrix_confects.

Treat these results with caution. Confindence bounds take into account uncertainty in the loadings but not in the scores! What follows is best regarded as exploratory rather than a final result.

Gene loadings for C1

weitrix_confects(cal, comp$col, "C1", fdr=0.05)

Gene loadings for C2

weitrix_confects(cal, comp$col, "C2", fdr=0.05)

Gene loadings for C3

weitrix_confects(cal, comp$col, "C3", fdr=0.05)

Gene loadings for C4

weitrix_confects(cal, comp$col, "C4", fdr=0.05)

Genes with high variability

Instead of looking for genes following some particular pattern, we can look for genes that simply have surprisingly high variability with weitrix_sd_confects.

weitrix_sd_confects(cal, step=0.01)

Examine individual genes

Let's examine peak-level read counts for some genes we've identified.

examine <- function(gene_wanted, title) {
    peak_names <- filter(peaks, gene_name==gene_wanted)$name

    counts[peak_names,] %>% melt(value.name="reads", varnames=c("peak","sample")) %>%
        left_join(samples, by="sample") %>%
        ggplot(aes(x=factor(as.integer(peak)), y=reads)) + 
        facet_grid(strain ~ time) + geom_col() +
        labs(x="Peak",y="Reads",title=title)
}

examine("YLR058C", "SHM2, from C1")
examine("YLR333C", "RPS25B, from C2")
examine("YDR077W", "SED1, from C3")
examine("YIL015W", "BAR1, from C4")
examine("tK(CUU)M", "tK(CUU)M, from C4")
examine("YKL081W", "TEF4, from weitrix_sd_confects")
examine("YPR080W", "TEF1, from weitrix_sd_confects")

Alternative calibration method

For larger datasets, weitrix_calibrate_all may use a lot of memory. The weitrix packages also has an older version of calibration, weitrix_calibrate_trend. Rather than operating on weights individually, it applies a scaling to the weights of each row.

cal_trend <- weitrix_calibrate_trend(
    wei, 
    design = comp, 
    trend_formula = ~log(per_read_var)+well_knotted_spline(log(total_reads),3))

A mean-variance trend is visible in the uncalibrated weighted residuals. Calibration removes this, by making use of the information in per_read_var.

(weitrix_calplot(wei, comp, covar=mu) + labs(title="Before")) +
(weitrix_calplot(cal_trend, comp, covar=mu) + labs(title="After"))

We can look at the calibration to per_read_var directly.

(weitrix_calplot(wei, comp, covar=per_read_var) + labs(title="Before")) +
(weitrix_calplot(cal_trend, comp, covar=per_read_var) + labs(title="After"))

Weights were too large for large read counts. Calibration has applied a row-level adjustment for this based on the total number of reads for each row.

(weitrix_calplot(wei, comp, covar=log(weitrix_weights(wei))) + labs(title="Before")) +
(weitrix_calplot(cal_trend, comp, covar=log(weitrix_weights(wei))) + labs(title="After"))


Try the weitrix package in your browser

Any scripts or data that you put into this service are public.

weitrix documentation built on Nov. 8, 2020, 8:10 p.m.