Introduction to rnaEditr

knitr::opts_chunk$set(echo = TRUE)

rnaEditr is an R package that identifies genomic sites and genomic regions that are differentially edited in RNA-seq datasets. rnaEditr can analyze studies with continuous, binary, or survival phenotypes, along with multiple covariates and/or interaction effects. To identify hyper-edited regions, rnaEditr first determines co-edited sub-regions without using any phenotype information. Next, rnaEditr tests association between RNA editing levels within the co-edited regions with binary, continuous or survival phenotypes.

1 Installation

The latest version can be installed by:

if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("rnaEditr")

After installation, the rnaEditr package can be loaded into R using:

library(rnaEditr)

2 Datasets

The input of rnaEditr are: (1) an RNA editing dataset, with rows corresponding to the edited sites and columns corresponding to the samples; (2) a phenotype dataset, with rows corresponding to different samples, ordered in the same way as columns of dataset (1); (3) a character string of genomic regions that users are interested in. This is only required for region-based analysis and will be explained in section 4.1.

The first dataset includes RNA editing levels for each sample at each edited site. RNA editing levels are typically defined as the total number of edited reads (i.e. reads with G nucleotides) divided by the total number of reads covering the site (i.e. reads with A and G nucleotides) (Breen et al. 2019 PMID: 31455887). They range from 0 (un-edited site) to 1 (completely edited site).

We assume quality control and normalization of the RNA editing dataset have been performed, and that each site is edited in at least 5% of the samples. For illustration, we use a subset of the TCGA breast cancer RNA editing dataset (syn 2374375). This example dataset includes RNA editing levels for 272 edited sites mapped to genes PHACTR4, CCR5, METTL7A (using hg19 reference), along with a few randomly sampled sites, for 221 subjects.

data(rnaedit_df)
rnaedit_df[1:3, 1:3]

In order to link the RNA editing dataset with phenotype data, the phenotype dataset needs to have a column called sample, whose values must be a exact match to the column names in the above RNA editing dataset.

pheno_df <- readRDS(
  system.file(
    "extdata",
    "pheno_df.RDS",
    package = 'rnaEditr',
    mustWork = TRUE
  )
)
pheno_df[1:3, 1:3]

Here we are using the command below to make sure sample variable in phenotype dataset has a exact match to the column names in the RNA editing dataset.

identical(pheno_df$sample, colnames(rnaedit_df))

rnaEditr can perform both site-specific and region-based analysis. In Section 3, we illustrate testing associations between cancer status and RNA editing levels at individual sites. In Section 4, we illustrate identifying cancer associated co-edited regions.

3 Site-specific analysis

A quick workflow for site-specific analysis{ width=100%}

3.1 Testing all edited sites

Before testing the associations, we use function CreateEditingTable() to turn RNA editing matrix into a dataframe with special class rnaEdit_df, which is a required format of input dataset for function TestAssociations().

rnaedit2_df <- CreateEditingTable(
  rnaEditMatrix = rnaedit_df
)

In this example, we will test the association between cancer status ( sample_type variable from pheno_df dataset) and all edited sites. Because the outcome variable "cancer status" is a binary variable, rnaEditr will apply a logistic regression model for each site: $$logit(Pr(cancer\;status = “yes”)) \sim RNA\;editing\;level$$

First, we use the command below to check the distribution of variable sample_type. Please note that rnaEditr will fit firth corrected logistic regression instead of regular logistic regression models for binary outcomes, when the minimum sample size per group is less than 5.

table(pheno_df$sample_type)
tumor_single_df <- TestAssociations(
  # an RNA editing dataframe with special class "rnaEdit_df" from function
  #   CreateEditingTable() if site-specific analysis, from function 
  #   SummarizeAllRegions() if region-based analysis.
  rnaEdit_df = rnaedit2_df,
  # a phenotype dataset that must have variable "sample" whose values are a 
  #   exact match to the colnames of "rnaEdit_df".
  pheno_df = pheno_df,
  # name of outcome variable in phenotype dataset "pheno_df" that you want to 
  #   test.
  responses_char = "sample_type",
  # names of covariate variables in phenotype dataset "pheno_df" that you want
  #   to add into the model.
  covariates_char = NULL,
  # type of outcome variable that you input in argument "responses_char".
  respType = "binary",
  # order the final results by p-values or not.
  orderByPval = TRUE
)
tumor_single_df[1:3, ]

Here tumor_single_df is a data frame of all edited sites, with corresponding p-values and false discovery rate (FDRs) from the logistic regression model.

3.2 Annotate results

We next annotate these results by adding gene names mapped to the RNA editing sites.

tumor_annot_df <- AnnotateResults(
  # the output dataset from function TestAssociations().
  results_df = tumor_single_df,
  # close-by regions, since this is site-specific analysis, set to NULL.
  closeByRegions_gr = NULL,
  # input regions, since this is site-specific analysis, set to NULL.
  inputRegions_gr = NULL,
  genome = "hg19",
  # the type of analysis result from function TestAssociations(), since we are 
  #   running site-specific analysis, set to "site-specific".
  analysis = "site-specific"
)
tumor_annot_df[1:3, ]

4 Region-based analysis

A quick workflow for region-based analysis{ width=100%}

The region-based analysis consists of several steps: (1) make GRanges object for the genomic regions, (2) determine RNA editing sites that are located closely within the genomic regions, (3) identify co-edited regions, and (4) test the association between cancer status and co-edited regions.

4.1 Input regions

First, we make GRanges object for the genomic regions that we are interested in . We saved both hg19 and hg38 gene references in the package. The following command retrieves regions associated with 20,314 pre-processed hg19 genes. To retrieve hg38 gene reference, easily change "hg19_annoGene_gr.RDS" into "hg38_annoGene_gr.RDS" in the command below.

allGenes_gr <- readRDS(
  system.file(
    "extdata",
    "hg19_annoGene_gr.RDS",
    package = 'rnaEditr',
    mustWork = TRUE
  )
)
allGenes_gr[1:3]

For candidate gene studies, we are often only interested in a few specific genes. We can use the following commands to extract GRanges associated with the specific genes:

# If input is gene symbol
inputGenes_gr <- TransformToGR(
  # input a character vector of gene symbols
  genes_char = c("PHACTR4", "CCR5", "METTL7A"),
  # the type of "gene_char". As we input gene symbols above, set to "symbol"
  type = "symbol",
  genome = "hg19"
)
inputGenes_gr

If we're interested in particular regions, we can use the following commands to make GRanges.

# If input is region ranges
inputRegions_gr <- TransformToGR(
  # input a character vector of region ranges.
  genes_char = c("chr22:18555686-18573797", "chr22:36883233-36908148"),
  # the type of "gene_char". As we input region ranges above, set to "region".
  type = "region",
  genome = "hg19"
)

# Here we use AddMetaData() to find the gene symbols for inputRegions_gr.
AddMetaData(target_gr = inputRegions_gr, genome = "hg19")

4.2 Find close-by regions

Next, within each input candidate region, we identify the sub-regions that contain closely located RNA editing sites. At default, we identify sub-regions with at least 3 edited sites, for which maximum distance between two neighboring sites is 50 bp.

closeByRegions_gr <- AllCloseByRegions(
  # a GRanges object of genomic regions retrieved or created in section 4.1.
  regions_gr = inputGenes_gr,
  # an RNA editing matrix.
  rnaEditMatrix = rnaedit_df,
  maxGap = 50,
  minSites = 3
)
closeByRegions_gr

closeByRegions_gr is a GRanges object that includes sub-regions that contain closely located RNA editing sites (close-by regions). In this example, within the three input candidate regions, 8 close-by regions were found.

4.3 Find co-edited regions

Next, within each close-by region, we identify co-edited regions based on RNA editing levels. At default, a co-edited region needs to satisfy the following requirements: (1) with at least 3 edited sites; (2) the minimum correlation between RNA editing levels of one site and the mean RNA editing levels of the rest of the sites is at least 0.4; (3) the minimum pairwise correlation of sites within the selected cluster is at least 0.1.

closeByCoeditedRegions_gr <- AllCoeditedRegions(
  # a GRanges object of close-by regions created by AllCloseByRegions().
  regions_gr = closeByRegions_gr,
  # an RNA editing matrix.
  rnaEditMatrix = rnaedit_df,
  # type of output data.
  output = "GRanges",
  rDropThresh_num = 0.4,
  minPairCorr = 0.1,
  minSites = 3,
  # the method for computing correlations.
  method = "spearman",
  # When no co-edited regions are found in an input genomic region, you want to
  #   output the whole region (when set to TRUE) or NULL (when set to FALSE).
  returnAllSites = FALSE
)
closeByCoeditedRegions_gr

closeByCoeditedRegions_gr is a GRanges that contains co-edited regions within the close-by regions we found from last step. We identified 14 co-edited regions from the 8 close-by regions.

Let's take a look at correlations of RNA editing levels within the first co-edited region:

PlotEditingCorrelations(
  region_gr = closeByCoeditedRegions_gr[1],
  rnaEditMatrix = rnaedit_df
)

4.4 Summarize all regions

Next, we summarize RNA editing levels from multiple edited sites within each co-edited region using medians.

summarizedRegions_df <- SummarizeAllRegions(
  # a GRanges object of close-by regions created by AllCoeditedRegions().
  regions_gr = closeByCoeditedRegions_gr,
  # an RNA editing matrix.
  rnaEditMatrix = rnaedit_df,
  # available methods: "MaxSites", "MeanSites", "MedianSites", and "PC1Sites".
  selectMethod = MedianSites
)
summarizedRegions_df[1:3, 1:5]

summarizedRegions_df is a data frame with each row corresponding to one co-edited region identified from function AllCoeditedRegions(). Since there are 14 co-edited regions, this data frame has 14 rows. In column 5 to the end, each column includes median editing levels (over all sites within each co-edited region) for one sample.

4.5 Test all regions

Next, we test the association between cancer status and co-edited regions using logistic regression model: $$logit (Pr(sample\;type = “cancer”)) \sim median\;RNA\;editing\;levels$$

tumor_region_df <- TestAssociations(
  # an RNA editing dataframe with special class "rnaEdit_df" from function
  #   CreateEditingTable() if site-specific analysis, from function 
  #   SummarizeAllRegions() if region-based analysis.
  rnaEdit_df = summarizedRegions_df,
  # a phenotype dataset that must have variable "sample" whose values are a 
  #   exact match to the colnames of "rnaEdit_df".
  pheno_df = pheno_df,
  # name of outcome variable in phenotype dataset "pheno_df" that you want to 
  #   test.
  responses_char = "sample_type",
  # names of covariate variables in phenotype dataset "pheno_df" that you want
  #   to add into the model.
  covariates_char = NULL,
  # type of outcome variable that you input in argument "responses_char".
  respType = "binary",
  # order the final results by p-values or not.
  orderByPval = TRUE
)
tumor_region_df[1:3, ]

Here tumor_region_df contains results for testing each co-edited region using logistic regression model.

4.6 Annotate results

We next annotate results by adding the corresponding input regions, close-by regions and genes mapped to these genomic regions.

tumor_annot_df <- AnnotateResults(
  # the output dataset from function TestAssociations().
  results_df = tumor_region_df,
  # close-by regions which is a output of AllCloseByRegions().
  closeByRegions_gr = closeByRegions_gr,
  # input regions, which are created in section 4.1.
  inputRegions_gr = inputGenes_gr,
  genome = "hg19",
  # the type of analysis result from function TestAssociations(), since we are 
  #   doing region-based analysis, use default here.
  analysis = "region-based"
)
tumor_annot_df[1:3, ]

To summarize this final dataset, values of column inputRegion are the user-specified candidate genomic regions, values of column closeByRegion are the regions that contain closely located RNA editing sites, and columns seqnames, start, end are the co-edited regions.

5 Further examples of function TestAssociations in rnaEditr

rnaEditr can analyze different types of phenotypes, including binary, continuous and survival outcomes. In the last section, we analyzed a binary phenotype sample_type. We next illustrate analysis for continuous and survival phenotypes.

For continuous outcome, as an example, to identify co-edited regions associated age, we use the following commands:

tumor_region_df <- TestAssociations(
  # an RNA editing dataframe with special class "rnaEdit_df" from function
  #   CreateEditingTable() if site-specific analysis, from function 
  #   SummarizeAllRegions() if region-based analysis.
  rnaEdit_df = summarizedRegions_df,
  # a phenotype dataset that must have variable "sample" whose values are a 
  #   exact match to the colnames of "rnaEdit_df".
  pheno_df = pheno_df,
  # name of outcome variable in phenotype dataset "pheno_df" that you want to 
  #   test.
  responses_char = "age_at_diagnosis",
  # names of covariate variables in phenotype dataset "pheno_df" that you want
  #   to add into the model.
  covariates_char = NULL,
  # type of outcome variable that you input in argument "responses_char".
  respType = "continuous",
  # order the final results by p-values or not.
  orderByPval = TRUE
)
tumor_region_df[1:3, ]

For survival outcome, for example, we use the following commands:

tumor_region_df <- TestAssociations(
  # an RNA editing dataframe with special class "rnaEdit_df" from function
  #   CreateEditingTable() if site-specific analysis, from function 
  #   SummarizeAllRegions() if region-based analysis.
  rnaEdit_df = summarizedRegions_df,
  # a phenotype dataset that must have variable "sample" whose values are a 
  #   exact match to the colnames of "rnaEdit_df".
  pheno_df = pheno_df,
  # name of outcome variable in phenotype dataset "pheno_df" that you want to 
  #   test.
  responses_char = c("OS.time", "OS"),
  # names of covariate variables in phenotype dataset "pheno_df" that you want
  #   to add into the model.
  covariates_char = NULL,
  # type of outcome variable that you input in argument "responses_char".
  respType = "survival",
  # order the final results by p-values or not.
  orderByPval = TRUE
)
tumor_region_df[1:3, ]

6 Session information

Here is the R session information for this vignette:

sessionInfo()


Try the rnaEditr package in your browser

Any scripts or data that you put into this service are public.

rnaEditr documentation built on Nov. 8, 2020, 8:26 p.m.