%\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{Vignette Title} %\VignetteEncoding{UTF-8}


title: "ASEP: Allele-Specific Expression in a Population" author: "Jiaxin Fan" output: html_vignette: number_sections: TRUE self_contained: yes toc: yes


library(knitr)
library(DT)

Introduction


This vignette provides a walk through tutorial on how to use ASEP to perform cross-individual gene-based ASE detection under one condition and differential ASE detection between two conditions (e.g., pre- vs post-treatment) using bulk RNA sequencing data as shown in ASEP paper.

Installation

# install devtools if necessary
if (!"devtools" %in% rownames(installed.packages())) {
  install.packages('devtools')
}
# install the ASEP package
if (!"ASEP" %in% rownames(installed.packages())) {
  devtools::install_github('Jiaxin-Fan/ASEP')
}
# load
library(ASEP)

Data

ASEP uses the allele-specific read counts obtained from RNA sequencing. Homozygous transcribed SNPs (tSNPs) are excluded from the analysis since they do not provide information on allelic exmpression.

Figure 1 illustrates how the input data should look like for one condition analysis detecting ASE under both phase known and unknown scenarios. Human is diploid organism. Ideally, for a given gene g, we want to have data that indicates, i.e., for each heterozygous tSNP, which of the two alleles is from the paternal haplotype (p) and which is from the maternal haplotype (m). However, what we observed in RNA-seq data is only the read counts for two alleles, reference allele and non-reference allele, of each tSNP, and don’t known exactly which alleles at different tSNPs reside on the same haplotype, which in this case is called haplotype phase unknown.

Under phase unknown, the input data should contain columns:

If DNA genotype data is also available, it is possible to infer back the haplotype phase of the RNA-seq data. Therefore, with other columns being the same as phase unknown, under phase known, for a single individual across all tSNPs of a given gene, the ref should be the read counts of alleles aligned on the same haplotype, i.e. paternal or maternal.

Figure 1: Input data for one condition analysis

Paired RNA-seq data is used for two conditions analysis detecting differential ASE, where paired means the same individual is sequenced under both conditions (i.e., pre- vs post-treatment). In addition to the similar columns as in one condition analysis, gene, id, snp, ref and total, the input for two conditions analysis requires two more columns:

Figure 2 illustrates how the input data should look like for two conditions analysis both under phase known and unknown scenarios. One thing to notice is that when phase known, the haplotype used should be the same for both conditions. If, for a given individual, ref contains the read counts of alleles from maternal haplotype for condition A, then ref should also contain the read counts of alleles from maternal haplotype for condition B, and vice versa.

Figure 2: Input data for two conditions analysis

Tutorial


This vignette uses the data of an example gene, PIEZO1, obtained from a paired macrophage RNA-seq dataset (M0 macrophages vs. M1 macrophages) generated from 48 healthy individuals, to illustrate, step by step, how to use ASEP for ASE and differential ASE detection when haplotype phase is unknown. For phase known, the procedure is identical except that the input parameter phased for each function should set to be TRUE.

dat = read.table("real_data_example.txt", header=T)
dat = dat[order(dat$id,dat$snp,dat$group),]
dat = dat[,c("gene","id","group","snp","ref","total")]

dat_M0 = dat[dat$group=='M0',]
dat_M0 = dat_M0[,c("gene","id","snp","ref","total","group")]
dat_M1 = dat[dat$group=='M1',]
dat_M1 = dat_M1[,c("gene","id","snp","ref","total","group")]

dat$ref_condition = 'M1'
set.seed(6)

One Condition Analysis

In order to detect gene-level ASE effect in the population, the first step is to align the major alleles across tSNPs and individuals through pseudo haplotype phasing. Next, given the phased data, a generalized linear mixed-effects model is employed for ASE detection and the estimated p-value is obtained through resampling. The whole procedure is taken care of by the function ASE_detection. The essential inputs for ASE_detection are:

The tables below show the input RNA-seq data of gene PIEZO1 under each condition (Left: M0 macrophages; Right: M1 macrophages).

wzxhzdk:3
wzxhzdk:4

The outputs of ASE_detection is a matrix with 2 columns:

We performed ASE detection on both M0 and M1 macrophages samples. We use the function ASE_detection to get the estimated ASE effect and its estimated p-value under each condition. For presentation purpose and to save running time, we decrease the number of resampling to $10^3$, and perform the analysis without parallel computing.

# result for M0 macrophages
ASE_detection(dat_all = dat_M0, phased=FALSE, varList=NULL, adaptive=TRUE, n_resample=10^3, parallel=FALSE, save_out=FALSE)
# result for M1 macrophages
ASE_detection(dat_all = dat_M1, phased=FALSE, varList=NULL, adaptive=TRUE, n_resample=10^3, parallel=FALSE, save_out=FALSE)

We do not output the data after pseudo alignment when using function ASE_detection. However, function phasing can be used to obtain the phased data for a given gene. The essential inputs for phasing are:

The outputs of phasing is a data frame with one additional column as compared to the original input:

The tables below show the RNA-seq data under each condition after pseudo phasing (Left: M0 macrophages; Right: M1 macrophages).

dat_M0_phased = phasing(dat_M0, phased=FALSE, n_condition="one")
dat_M1_phased = phasing(dat_M1, phased=FALSE, n_condition="one")
wzxhzdk:7
wzxhzdk:8

In addition, we provide function plot_ASE to show the estimated SNP-level ASE, i.e. major allele proportion, for each individual using boxplot. The individuals are sorted by their median of estimated ASE among heterozygous SNPs. The essential inputs for plot_ASE are:

The boxplot below shows the the estimated SNP-level ASE for M1 macrophages across subjects.

plot_ASE(dat_M1, phased=FALSE)

Two Conditions Analysis

Similar as one condition analysis, the first step for detecting gene-level differential ASE is to align the major alleles across tSNPs and individuals for both conditions. To ensure that major alelles are identical for the same individual under different conditions, we choose one condition as the "reference" (ref_condition), obtain its phasing information, and phase the data from the other condition accordingly. Since gene PIEZO1 only showed ASE effect under M1 macrophages, we chose 'M1' as the ref_condition. The table below shows the input RNA-seq data for two conditions analysis.

wzxhzdk:10

Next, a generalized linear mixed-effects model is also employed for differential ASE detection and the estimated p-value is obtained through resampling. The whole procedure of pseudo alignement and model fitting is taken care of by the function differential_ASE_detection. The essential inputs and outputs of differential_ASE_detection are the same as ASE_detection.

Below, we performed differential ASE detection on the example gene PIEZO1. For presentation purpose and to save running time, we decrease the number of resampling to $10^3$, and perform the analysis without parallel computing.

differential_ASE_detection(dat, phased=FALSE, varList=NULL, adaptive=TRUE, n_resample=10^3, parallel=FALSE, save_out=FALSE)

Again, we do not output the pseudo aligned data when using function differential_ASE_detection. However, function phasing can be used to obtain the phased data for a given gene. The table below show the RNA-seq data after pseudo alignment.

dat_phased = phasing(dat, phased=FALSE, n_condition="two")
wzxhzdk:13

We also provide function plot_ASE_diff to show the estimated SNP-level ASE difference, i.e. major allele proportion difference, between two-condition samples for each individual using boxplot. The individuals are sorted by their median of estimated ASE difference among heterozygous SNPs. The essential inputs for plot_ASE_diff are:

The boxplot below shows the the estimated SNP-level ASE difference between M1 and M0 macrophages for each subject.

plot_ASE_diff(dat, phased=FALSE, minu_ref=TRUE)


Jiaxin-Fan/ASEP documentation built on Aug. 9, 2021, 6:39 a.m.