knitr::opts_chunk$set(echo = TRUE)

library(Olivia)

Introduction

Here we demonstrate how to perform association analyses of continuous phenotypes using the Olivia package with RNA-seq data based on the pipeline proposed in the manuscript Benchmarking Association Analyses of Continuous Exposures with RNA-seq in Observational Studies \url{https://www.biorxiv.org/content/10.1101/2021.02.12.430989v1.abstract}.

Installation and required packages

To install, open R and type:

library("devtools")
install_github("nkurniansyah/Olivia")
library(Olivia)

Olivia requires external packages from CRAN (dplyr, ggplot2, tableone, reshape, and ggrepel) and Bioconductor(qvalue)

install.packages("dplyr")

Load packages

library(dplyr)
library(ggplot2)
library(reshape2)
library(tableone)
library(ggrepel)
library(EnsDb.Hsapiens.v86)

Load example data

Load raw gene counts matrix

First we load the transcripts. The transcripts were obtained from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE151243. Note: we reformatted the transcript matrix into desired form and embedded them into the Olivia package.

data(rnaseq_count_matrix)
rnaseq_count_matrix[1:5,1:5]

Load simulated phenotypes

We simulated in advance a data.frame of phenotypes.

data(phenotype)
head(phenotype)

Summarize phenotypes

We create a table summarizing the phenotypes.

summary_phen<- summarize_phenotypes(pheno = phenotype,
                                    categorical_variables = c("Sex"),
                                    numeric_variables = c("Age","Trait.1","Trait.2"),
                                    strata = "Race")
summary_phen

We define the trait of interest to study as an exposure associated with genes. The trait/phenotype has to correspond to a column name in the phenotype data.frame.

trait <- "Trait.1"

We will adjust our analysis to the simulated covariates Age and Sex. The covariates have to correspond to column names in the phenotype data.frame. In the analysis, we will use a string defining the regression model (just the covariates part of it), so we define it here:

covariates_string <- "Age + Sex"

Note that we can also define the string to be "Age + as.factor(Sex)", or use interaction terms, like one would use in regression functions in R.

Match the (simulated) individuals between the phenotype and the RNA-seq count matrix. Make sure the there are matching IDs.

IDs_both <- intersect(rownames(phenotype), colnames(rnaseq_count_matrix))
rnaseq_matrix <- rnaseq_count_matrix[, IDs_both] 
phenotypes <- phenotype[match(IDs_both,rownames(phenotype)),]

Residual plot of the trait

After defining the trait of interest and covariates to adjust to the model, it is helpful to look at the trait's residual distribution.

resid_plot<- residual_plot(pheno = phenotype, 
                           traits = trait,
                           covariates_string = covariates_string)

resid_plot

Here, the residuals' distribution has short tails.

Normalize the RNA-seq dataset

We use median normalization in Olivia to reduce package dependencies. However, users can use different normalization method using different packages, for example: estimateSizeFator(DESeq2) or TMM(edgeR). There are no downstream differences in how the methods are applied once the data is normalized.

Median normalization

After median normalization, the sum of the gene expression values over all transcripts is the same across individuals.

median_norm <- median_normalization(rnaseq_matrix)

xrange <- range(colSums(rnaseq_matrix))
par(mfrow = c(1,2))
hist(colSums(rnaseq_matrix), 
     xlim = xrange, 
     main = "",
     xlab = "Before median normalization",
     ylab = "Library size distribution")
hist(colSums(median_norm), 
     xlim = xrange, 
     main = "",
     xlab = "After median normalization",
     ylab = "Library size distribution")

Filtering transcripts

Remove lowly expressed genes

clean_count_matrix <- apply_filters(count_matrix = median_norm, 
                                    median_min = 1, 
                                    expression_sum_min = 10, 
                                    max_min = 10, 
                                    range_min = 5, 
                                    prop_zero_max = 0.5)

After filtering genes, there are r nrow(clean_count_matrix) remaining for differential expression analysis. The plot below illustrates the proportion of transcripts/genes in the “12N_S27_L006_R1_001” sample (selected randomly) before and after filtering.

par(mfrow=c(1,2))
plot(density(log_replace_half_min(median_norm)[,"12N_S27_L006_R1_001"]),
     main="Before filtering", xlab="log trascript")
plot(density(log_replace_half_min(clean_count_matrix)[,"12N_S27_L006_R1_001"]), 
     main="After filtering", xlab="log trascript")

The figure below shows the distribution of transcript expressions (counts) in the sample in boxplots after filtering and after log transformation.

log_counts<- log_replace_half_min(clean_count_matrix)
log_counts<-melt(log_counts)

box_plot<- ggplot(log_counts, aes(x = Var2, y = value)) + 
                  stat_boxplot(aes(Var2, value), 
                  geom='errorbar', linetype=1, width=0.5)+ 
                  xlab("Sample")+ 
                  ylab("log(Transcripts)")+ 
                  geom_boxplot( aes(Var2, value),outlier.shape=1)+
                  stat_summary(fun = mean, geom = "errorbar", 
                                aes(ymax = ..y.., ymin = ..y..),
                                width = .75, linetype = "dotted") +
                 theme_bw()+
                 theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

box_plot

Perform transcriptome-wide association study

We show how we perform differential expression analysis (Transcriptome-wide association study) on all transcripts using empirical p-value (quantile empirical p-values). To generate p-values under the null, we create a residual permuted trait 100 times, perform differential expression analysis, and use the resulting p-values as our null p-values. However, users also can implement Storey empirical p-value (as these are referred to in the manuscript) using test statistics.

set.seed(12)

quantile_emp_trascript<-lm_count_mat_emp_pval(clean_count_matrix, 
                                              pheno=phenotypes, 
                                              trait=trait,
                                              covariates_string=covariates_string,
                                              n_permute=100,
                                              log_transform = "log_replace_half_min",
                                              outcome_type ="continuous",
                                              gene_IDs=NULL)

Add annotation

We do not implement the annotation feature into the Olvia package, to limit chances to run into compatibility issues as packages update. We here demonstrate how to create a function to add an annotation in a transcriptome-wide association study. We use EnsDb.Hsapiens.v86

add_annotation<-function(deg_res){

   gene_symbol<- select(EnsDb.Hsapiens.v86, 
                        keys =as.character(deg_res$geneID) ,
                        keytype = "GENEID",
                        columns = c("GENEID", "GENENAME"))

   colnames(gene_symbol)<- c("geneID","geneName")

   annot_deg<-left_join(deg_res,gene_symbol, by="geneID")
   annot_deg<- annot_deg %>% dplyr::rename(IDs=geneID,
                                           geneID= geneName)
   return(annot_deg)

}
quantile_emp_trascript <- add_annotation(quantile_emp_trascript)
head(quantile_emp_trascript)

Now, we can obtain significant genes (the genes which have bh_emp_pvals < 0.05)

tophits <- quantile_emp_trascript[which(quantile_emp_trascript$bh_emp_pvals< 0.05),]
head(tophits)

Visualize up-regulated and down-regulated transcripts

After completing the transcriptome-wide association study, now we can visualize up-regulated and down-regulated genes using a volcano plot.

volcano <- volcano_plot(deg_res = quantile_emp_trascript,
                        significant_threshold = 0.05 )

volcano

Violin plot

Looking at the results, we may want to see how a transcript of interest (e.g. the most significantly-associated gene) distributes across population strata. We here visualize this using violin plots. The row names in the matrix of transcript counts and in the phenotype matrix have to match.

top_gene<- "ENSG00000000005" 
violin_plot(pheno = phenotypes,
            strata = "Race",
            norm_count_matrix  = clean_count_matrix,
            selected_transcript = top_gene)

Perform association analysis for selected gene/s

When testing only a handful of genes, we may not want to perform a transcriptome-wide association analysis. Therefore, empirical p-values using the quantile or Storey's approach cannot be computed (not enough tests to generate the null distribution). Instead, we permute specific genes many times. Here we show how to perform differential expression analysis on selected transcripts when computing a permutation p-value for each gene based on permutations for this gene only. We suggest running 100000 permutations, but more permutation are needed if higher precision in p-value computation is needed.

set.seed(12)

gene_names<-sample(rownames(clean_count_matrix),3)

perm_res<- lm_count_mat_perm_pval(count_matrix=clean_count_matrix,
                                  pheno=phenotypes,
                                  trait=trait, 
                                  covariates_string=covariates_string,
                                  n_permute=100000,
                                  gene_IDs=gene_names,
                                  seed = NULL,
                                  log_transform = "log_replace_half_min",
                                  outcome_type ="continuous")

perm_res<-add_annotation(perm_res)

head(perm_res)

Perform transcriptome-wide association study for multiple exposures

We show how we perform differential expression analysis on all transcripts using emprical p-value(quantile empirical p-values) when testing association using multiple exposure at the same time.

set.seed(12)

quantile_emp_multi<-lm_mult_count_mat_emp_pval(clean_count_matrix, 
                                               pheno=phenotypes, 
                                               traits=c("Trait.1","Trait.2") ,
                                               covariates_string=covariates_string,
                                               n_permute=100,
                                               gene_IDs=NULL,
                                               log_transform = "log_replace_half_min", 
                                               outcome_type="continuous")


quantile_emp_multi<-add_annotation(quantile_emp_multi)
head(quantile_emp_multi)

Now, we can identify significantly-associated genes based on transcriptome-wide associations using multiple exposures (the genes which have bh_emp_pvals < 0.05)

top_emp_multi<-quantile_emp_multi[quantile_emp_multi$bh_emp_pvals< 0.05,]
rownames(top_emp_multi)<-NULL

head(top_emp_multi)


nkurniansyah/Olivia documentation built on July 29, 2023, 9:10 a.m.