# 3. PAT-Seq alternative polyadenylation example In weitrix: Tools for matrices with precision weights, test and explore weighted or sparse data

APA sites can be detected using the PAT-Seq protocol. This protocol produces 3'-end focussed reads. We examine GSE83162. This is a time-series experiment in which two strains of yeast are released into synchronized cell cycling and observed through two cycles. Yeast are treated with $\alpha$-factor, which causes them to stop dividing in antici... pation of a chance to mate. When the $\alpha$-factor is washed away, they resume cycling.

# Shift score definition

Each gene has several APA sites, ordered from furthest upstrand to furthest downstrand. For each sample, we have a read count at each site.

For each gene:

We define the "shift" of a particular sample relative to all reads from all samples.

A "shift" score is first assigned to each site, being the proportion of all reads upstrand of the site minus the proportion of all reads downstrand of it (i.e. an average over all reads where upstrand reads are 1, downstrand reads are -1 and reads from the site itself are 0). The measurement for each sample is then an average over the site score for each read. The weight is the number of reads.

Shifts scores range from -1 to 1, and summarize whether upstrand (more negative) or downstrand (more positive) sites are being favoured. The weighted average score is zero.

The weights are the number of reads, however for a randomly chosen read we can estimate its variance based on the number of reads at each site and the site scores. (This estimate of variance includes any biological signal, so it's not exactly a residual variance.) This is stored in the rowData() of the weitrix, and can be used to further calibrate weights. We prefer to defer this sort of calibration until after we've discoverd components of variation, as it tends to give high weight to genes with little APA. There are clearly some alternative choices to how weighting could be performed, and we hope the weitrix package gives you basic building blocks with which you can experiment!

# Load files

knitr::opts_chunk$set(cache=TRUE, autodep=TRUE) # To examine objects: # devtools::load_all(".", export_all=F) ; qwraps2::lazyload_cache_dir("vignettes/3_shift_cache/html")  library(tidyverse) # ggplot2, dplyr, etc library(patchwork) # side-by-side ggplots library(reshape2) # melt() library(weitrix) # Matrices with precision weights # Produce consistent results set.seed(12345) # BiocParallel supports multiple backends. # If the default hangs or errors, try others. # The most reliable backed is to use serial processing BiocParallel::register( BiocParallel::SerialParam() )  peaks <- system.file("GSE83162", "peaks.csv.gz", package="weitrix") %>% read_csv() counts <- system.file("GSE83162", "peak_count.csv.gz", package="weitrix") %>% read_csv() %>% column_to_rownames("name") %>% as.matrix() genes <- system.file("GSE83162", "genes.csv.gz", package="weitrix") %>% read_csv() %>% column_to_rownames("name") samples <- data.frame(sample=I(colnames(counts))) %>% extract(sample, c("strain","time"), c("(.+)-(.+)"), remove=FALSE) %>% mutate( strain=factor(strain,unique(strain)), time=factor(time,unique(time))) rownames(samples) <- samples$sample

groups <- dplyr::select(peaks, group=gene_name, name=name)
# Note the order of this data frame is important

samples

head(groups, 10)

counts[1:10,1:5]


A "shift" weitrix is constructed based on a matrix of site-level counts, plus a data frame grouping sites into genes. The order of this data frame is important, earlier sites are considered upstrand of later sites.

wei <- counts_shift(counts, groups)

colData(wei) <- cbind(colData(wei), samples)
rowData(wei)$symbol <- genes$symbol[match(rownames(wei), rownames(genes))]


Having obtained a weitrix, everthing discussed for the poly(A) tail length example is applicable here as well. We will only perform a brief exploratory analysis here.

# Exploratory analysis

## Components of variation

We can look for components of variation.

comp_seq <- weitrix_components_seq(wei, p=6, design=~0)

components_seq_screeplot(comp_seq)


Pushing a point somewhat, we examine four components.

comp <- comp_seq[[4]]





weitrix_confects(cal, comp$col, "C4", fdr=0.05)  ## Genes with high variability Instead of looking for genes following some particular pattern, we can look for genes that simply have surprisingly high variability with weitrix_sd_confects. weitrix_sd_confects(cal, step=0.01)  ## Examine individual genes Let's examine peak-level read counts for some genes we've identified. examine <- function(gene_wanted, title) { peak_names <- filter(peaks, gene_name==gene_wanted)$name

counts[peak_names,] %>% melt(value.name="reads", varnames=c("peak","sample")) %>%
left_join(samples, by="sample") %>%
ggplot(aes(x=factor(as.integer(peak)), y=reads)) +
facet_grid(strain ~ time) + geom_col() +
labs(x="Peak",y="Reads",title=title)
}

examine("YLR058C", "SHM2, from C1")
examine("YLR333C", "RPS25B, from C2")
examine("YDR077W", "SED1, from C3")
examine("YIL015W", "BAR1, from C4")
examine("tK(CUU)M", "tK(CUU)M, from C4")
examine("YKL081W", "TEF4, from weitrix_sd_confects")
examine("YPR080W", "TEF1, from weitrix_sd_confects")


# Alternative calibration method

For larger datasets, weitrix_calibrate_all may use a lot of memory. The weitrix packages also has an older version of calibration, weitrix_calibrate_trend. Rather than operating on weights individually, it applies a scaling to the weights of each row.

cal_trend <- weitrix_calibrate_trend(
wei,
design = comp,
trend_formula = ~log(per_read_var)+well_knotted_spline(log(total_reads),3))


A mean-variance trend is visible in the uncalibrated weighted residuals. Calibration removes this, by making use of the information in per_read_var.

(weitrix_calplot(wei, comp, covar=mu) + labs(title="Before")) +
(weitrix_calplot(cal_trend, comp, covar=mu) + labs(title="After"))


We can look at the calibration to per_read_var directly.

(weitrix_calplot(wei, comp, covar=per_read_var) + labs(title="Before")) +
(weitrix_calplot(cal_trend, comp, covar=per_read_var) + labs(title="After"))


Weights were too large for large read counts. Calibration has applied a row-level adjustment for this based on the total number of reads for each row.

(weitrix_calplot(wei, comp, covar=log(weitrix_weights(wei))) + labs(title="Before")) +
(weitrix_calplot(cal_trend, comp, covar=log(weitrix_weights(wei))) + labs(title="After"))


## Try the weitrix package in your browser

Any scripts or data that you put into this service are public.

weitrix documentation built on Nov. 8, 2020, 8:10 p.m.