knitr::opts_chunk$set(echo = TRUE) library(Olivia)
Here we demonstrate how to perform association analyses of continuous phenotypes using the Olivia package with RNA-seq data based on the pipeline proposed in the manuscript Benchmarking Association Analyses of Continuous Exposures with RNA-seq in Observational Studies
\url{https://www.biorxiv.org/content/10.1101/2021.02.12.430989v1.abstract}.
To install, open R and type:
library("devtools") install_github("nkurniansyah/Olivia") library(Olivia)
Olivia requires external packages from CRAN (dplyr, ggplot2, tableone, reshape, and ggrepel) and Bioconductor(qvalue)
install.packages("dplyr")
Load packages
library(dplyr) library(ggplot2) library(reshape2) library(tableone) library(ggrepel) library(EnsDb.Hsapiens.v86)
First we load the transcripts. The transcripts were obtained from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE151243. Note: we reformatted the transcript matrix into desired form and embedded them into the Olivia package.
data(rnaseq_count_matrix) rnaseq_count_matrix[1:5,1:5]
We simulated in advance a data.frame of phenotypes.
data(phenotype) head(phenotype)
We create a table summarizing the phenotypes.
summary_phen<- summarize_phenotypes(pheno = phenotype, categorical_variables = c("Sex"), numeric_variables = c("Age","Trait.1","Trait.2"), strata = "Race") summary_phen
We define the trait of interest to study as an exposure associated with genes. The trait/phenotype has to correspond to a column name in the phenotype data.frame.
trait <- "Trait.1"
We will adjust our analysis to the simulated covariates Age and Sex. The covariates have to correspond to column names in the phenotype data.frame. In the analysis, we will use a string defining the regression model (just the covariates part of it), so we define it here:
covariates_string <- "Age + Sex"
Note that we can also define the string to be "Age + as.factor(Sex)", or use interaction terms, like one would use in regression functions in R.
Match the (simulated) individuals between the phenotype and the RNA-seq count matrix. Make sure the there are matching IDs.
IDs_both <- intersect(rownames(phenotype), colnames(rnaseq_count_matrix)) rnaseq_matrix <- rnaseq_count_matrix[, IDs_both] phenotypes <- phenotype[match(IDs_both,rownames(phenotype)),]
After defining the trait of interest and covariates to adjust to the model, it is helpful to look at the trait's residual distribution.
resid_plot<- residual_plot(pheno = phenotype, traits = trait, covariates_string = covariates_string) resid_plot
Here, the residuals' distribution has short tails.
We use median normalization in Olivia to reduce package dependencies. However, users can use different normalization method using different packages, for example: estimateSizeFator(DESeq2) or TMM(edgeR). There are no downstream differences in how the methods are applied once the data is normalized.
After median normalization, the sum of the gene expression values over all transcripts is the same across individuals.
median_norm <- median_normalization(rnaseq_matrix) xrange <- range(colSums(rnaseq_matrix)) par(mfrow = c(1,2)) hist(colSums(rnaseq_matrix), xlim = xrange, main = "", xlab = "Before median normalization", ylab = "Library size distribution") hist(colSums(median_norm), xlim = xrange, main = "", xlab = "After median normalization", ylab = "Library size distribution")
Remove lowly expressed genes
clean_count_matrix <- apply_filters(count_matrix = median_norm, median_min = 1, expression_sum_min = 10, max_min = 10, range_min = 5, prop_zero_max = 0.5)
After filtering genes, there are r nrow(clean_count_matrix)
remaining for differential expression analysis. The plot below illustrates the proportion of transcripts/genes in the “12N_S27_L006_R1_001” sample (selected randomly) before and after filtering.
par(mfrow=c(1,2)) plot(density(log_replace_half_min(median_norm)[,"12N_S27_L006_R1_001"]), main="Before filtering", xlab="log trascript") plot(density(log_replace_half_min(clean_count_matrix)[,"12N_S27_L006_R1_001"]), main="After filtering", xlab="log trascript")
The figure below shows the distribution of transcript expressions (counts) in the sample in boxplots after filtering and after log transformation.
log_counts<- log_replace_half_min(clean_count_matrix) log_counts<-melt(log_counts) box_plot<- ggplot(log_counts, aes(x = Var2, y = value)) + stat_boxplot(aes(Var2, value), geom='errorbar', linetype=1, width=0.5)+ xlab("Sample")+ ylab("log(Transcripts)")+ geom_boxplot( aes(Var2, value),outlier.shape=1)+ stat_summary(fun = mean, geom = "errorbar", aes(ymax = ..y.., ymin = ..y..), width = .75, linetype = "dotted") + theme_bw()+ theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) box_plot
We show how we perform differential expression analysis (Transcriptome-wide association study) on all transcripts using empirical p-value (quantile empirical p-values). To generate p-values under the null, we create a residual permuted
trait 100 times, perform differential expression analysis, and use the resulting p-values as our null p-values. However, users also can implement Storey empirical p-value (as these are referred to in the manuscript) using test statistics.
set.seed(12) quantile_emp_trascript<-lm_count_mat_emp_pval(clean_count_matrix, pheno=phenotypes, trait=trait, covariates_string=covariates_string, n_permute=100, log_transform = "log_replace_half_min", outcome_type ="continuous", gene_IDs=NULL)
We do not implement the annotation feature into the Olvia package, to limit chances to run into compatibility issues as packages update. We here demonstrate how to create a function to add an annotation in a transcriptome-wide association study. We use EnsDb.Hsapiens.v86
add_annotation<-function(deg_res){ gene_symbol<- select(EnsDb.Hsapiens.v86, keys =as.character(deg_res$geneID) , keytype = "GENEID", columns = c("GENEID", "GENENAME")) colnames(gene_symbol)<- c("geneID","geneName") annot_deg<-left_join(deg_res,gene_symbol, by="geneID") annot_deg<- annot_deg %>% dplyr::rename(IDs=geneID, geneID= geneName) return(annot_deg) }
quantile_emp_trascript <- add_annotation(quantile_emp_trascript) head(quantile_emp_trascript)
Now, we can obtain significant genes (the genes which have bh_emp_pvals < 0.05)
tophits <- quantile_emp_trascript[which(quantile_emp_trascript$bh_emp_pvals< 0.05),] head(tophits)
After completing the transcriptome-wide association study, now we can visualize up-regulated and down-regulated genes using a volcano plot.
volcano <- volcano_plot(deg_res = quantile_emp_trascript, significant_threshold = 0.05 ) volcano
Looking at the results, we may want to see how a transcript of interest (e.g. the most significantly-associated gene) distributes across population strata. We here visualize this using violin plots. The row names in the matrix of transcript counts and in the phenotype matrix have to match.
top_gene<- "ENSG00000000005" violin_plot(pheno = phenotypes, strata = "Race", norm_count_matrix = clean_count_matrix, selected_transcript = top_gene)
When testing only a handful of genes, we may not want to perform a transcriptome-wide association analysis. Therefore, empirical p-values using the quantile or Storey's approach cannot be computed (not enough tests to generate the null distribution). Instead, we permute specific genes many times. Here we show how to perform differential expression analysis on selected transcripts when computing a permutation p-value for each gene based on permutations for this gene only. We suggest running 100000 permutations, but more permutation are needed if higher precision in p-value computation is needed.
set.seed(12) gene_names<-sample(rownames(clean_count_matrix),3) perm_res<- lm_count_mat_perm_pval(count_matrix=clean_count_matrix, pheno=phenotypes, trait=trait, covariates_string=covariates_string, n_permute=100000, gene_IDs=gene_names, seed = NULL, log_transform = "log_replace_half_min", outcome_type ="continuous") perm_res<-add_annotation(perm_res) head(perm_res)
We show how we perform differential expression analysis on all transcripts using emprical p-value(quantile empirical p-values) when testing association using multiple exposure at the same time.
set.seed(12) quantile_emp_multi<-lm_mult_count_mat_emp_pval(clean_count_matrix, pheno=phenotypes, traits=c("Trait.1","Trait.2") , covariates_string=covariates_string, n_permute=100, gene_IDs=NULL, log_transform = "log_replace_half_min", outcome_type="continuous") quantile_emp_multi<-add_annotation(quantile_emp_multi) head(quantile_emp_multi)
Now, we can identify significantly-associated genes based on transcriptome-wide associations using multiple exposures (the genes which have bh_emp_pvals < 0.05)
top_emp_multi<-quantile_emp_multi[quantile_emp_multi$bh_emp_pvals< 0.05,] rownames(top_emp_multi)<-NULL head(top_emp_multi)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.