knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
Trans2Kegg annotates differential expression (DE) results to KEGG [@kanehisa_new_2019; @kanehisa_kegg:_2017; @kanehisa_kegg:_2000] orthologs and pathways requiring only two inputs - a DESeqDataSet [@love_moderated_2014], and the transcriptome FASTA file to which RNA-Seq reads were aligned. Trans2Kegg extracts all comparisons from the DESeqDataSet and is therefore suitable for single-factor or multi-factor experiments. Two BLAST [@altschul_basic_1990; @noauthor_database_2016] options are available - BLAST on NCBI servers, or BLAST on AWS using an NCBI BLAST AMI. The AWS interface speeds up the annotation process approximately 14-fold over the NCBI servers. NCBI BLAST AMI requires the user to have an AWS account and launch an EC2 instance using the latest NCBI BLAST image. Two parameters are required from the running instance - the instance ID and the DNS.
Trans2Kegg merges all comparisons from the DESeqDataSet into a single dataframe
with an added column to indicate the comparison associated with each DE row.
This dataframe is saved as
dfAll.csv. The IDs of DE genes (rownames of
dfAll.csv) are passed to the
annotateAWS function, which extracts IDs of DE
genes from the DESeqDataSet, extracts the corresponding sequences from the
transcriptome FASTA file, then aligns them to SwissProt
[@uniprotconsortium_uniprot:_2018] using the blastSequences function of the
BioConductor annotate [@gentleman_annotate:_2018] package. Trans2Kegg then
queries the KEGG API using the KEGGREST [@tenenbaum_keggrest:_2018] package to
find KEGG ortholog matches for each of the SwissProt BLAST matches, and the
results are written to the user-specified output file. We chose KEGG orthologs
as the common set of gene identifiers to facilitate cross-species comparisons
of DE results. The BLAST and KEGG steps return a list of multiple SwissProt
species matches which will generally be associated with one to three KEGG
To filter these matches, Trans2Kegg first applies e-value and query coverage
cutoffs. The default values are e-value less than 1e-10 and query coverage
greater than 50%, but the user may optionally pass different cutoff parameters.
After applying the e-value and coverage filters, Trans2Kegg then selects the
KEGG ortholog with the greatest number of BLAST hits for each transcript, and
for each KEGG ortholog the transcript with the greatest number of BLAST hits.
Where the number of BLAST hits is equal, for each transcript the KEGG ortholog
with the highest average query coverage is selected, and the transcript with
the highest average coverage for each ortholog is selected. This reciprocal
filtering process eliminates the duplicate DE counting that can occur with
simple e-value BLAST cutoffs. The filtered BLAST hits are then merged with the
DE data to produce
deCovAndCountDesc.csv, a table of 1:1 best matches between
transcripts and KEGG orthologs.
To associate orthologs with KEGG pathways, the
getPathways function queries
the KEGG API using KEGGREST for each of the DE orthologs, producing
dfPathsKos.csv. To provide higher-level groupings of pathways like "Immune
system" or "Signal transduction" getPathways also queries KEGG for each pathway
with DE genes, producing
mergePaths function merges and
summarizes DE, ortholog, pathway, and pathway class information to provide
up-regulated/down-regulated DE counts by pathway and pathway class, as well as
producing a detail table of all merged results.
Aligning transcripts via BLAST is time-consuming, so we allowed for the
possibility that R scripts using this package may time-out or experience other
network connectivity errors during the annotation process. Each BLAST hit is
annot.csv as soon as it is received. If the script times out or
errors out and is restarted, it will start BLASTing where it left off. The
restart feature also ensures that if you change parameters - for example you
start with a very restrictive DE p-value and fold change cutoff, then make it
less stringent, the BLAST step only needs to annotate the additional DE genes,
not re-annotate the entire list.
annotateDE function is a helper function that accepts a DESeqDataSet as
input, and combines all comparisons into a single dataframe by adding a
"Factor" column to indicate the comparison applicable to each row. The output
dfAll.csv includes the ID number associated with the gene, the
log~2~ fold change, the adjusted p-value, and the comparison.
# Load the Trans2Kegg library library(dplyr) library(Trans2Kegg) library(knitr) library(tidyr)
The annotateTranscript function requires three parameters - 1) a vector of accession numbers from the FASTA file, 2) the path to the FASTA file and 3) the output filename.
dfAll <- system.file("extdata", "dfAll.csv", package="Trans2Kegg") aiptasia <- system.file("extdata", "aiptasia.fa", package="Trans2Kegg") deAll <- read.csv(dfAll, row.names=1) # Get the IDs (rownames) from deAll. ids <- rownames(deAll) # Select a subset of rows for testing purposes ids2 <- head(ids, n=10) #annotateNCBI(ids2, aiptasia)
The annotation process can be sped up dramatically using NCBI BLAST AMI. To setup the AMI, first create an AWS account and login, select the EC2 link (Fig. 2), search for NCBI within the public AMIs (Fig. 3). Check the box to the left of the most recent AMI (as of this writing dated April 11, 2018) and select launch from the Actions drop-down. Select the desired instance type. The type used for the performance tests was m4.2xlarge. Select defaults for other settings, with the exception of security group. Click "Add Rule" and select HTTP (Fig. 4). After launching the instance copy and paste the instance ID and DNS from the EC2 console (Fig. 1) to the instance and dns variables passed to annotateAws. The annotateAws function requires three parameters - IDs to annotate (ids2), FASTA file of transcripts (fastaFile), instance ID, and DNS ID. Optionally pass the threads parameter to change from the default of four threads.
deAll <- read.csv(dfAll, row.names=1) # Get the IDs (rownames) from deAll. ids <- rownames(deAll) # Select a subset of rows for testing purposes ids2 <- head(ids, n=200) # NOTE: instance and DNS show below are examples only. # Users must launch their own NCBI BLAST AMI instance <- 'i-07da948c2d85b7388' dns <- 'ec2-54-175-9-203.compute-1.amazonaws.com' annotateAWS(ids2, aiptasia, instance=instance, dns=dns, threads=2)
Merge the BLAST and KEGG annotations with DE results, and filter to obtain 1:1 associations between transcript IDs and KEGG orthologs (Table 3).
annot <- system.file("extdata", "annot.csv", package="Trans2Kegg") mergeAnnotations(dfAll, annot)
Query the KEGG API to obtain pathway information, and produce a table of pathway to ortholog mappings (Table 4), and pathway ID to pathway detail mappings (Table 5).
cvCnt <- system.file("extdata", "cvCnt.csv", package="Trans2Kegg") getPathways(cvCnt)
Merge the DE, KEGG ortholog, and pathway and provide summary counts.
prefix <- system.file("extdata", package="Trans2Kegg") mergePaths(prefix=prefix)
library(knitr) dfAll <- system.file("extdata", "dfAll.csv", package="Trans2Kegg") deAll <- read.csv(dfAll, row.names=1) kable(head(deAll), caption = "Sample of deAll.csv, the output of function AnnotateDE, which combines all comparisons from a DESeqDataSet into a single file to facilitate annotation.", booktabs=TRUE)
annot <- system.file("extdata", "annot.csv", package="Trans2Kegg") dfAnnot <- read.csv(annot, stringsAsFactors=FALSE) kable(head(dfAnnot, n= 20L), caption="Sample annot.csv output of annotateTranscript and annotateAws functions.", booktab=TRUE)
dfCovCount <- read.csv(cvCnt, stringsAsFactors=FALSE) dfCovCount$Factor <- gsub("_", " ", dfCovCount$Factor) kable(head(dfCovCount, n=20L), caption="Sample cvCnt.csv output of mergePathways function. This table summarizes the annotation results. qCov is the mean query coverage for the species BLAST hits matching the KEGG ortholog. n is the number of species BLAST hits matching the KEGG ortholog.", booktab=TRUE, row.names=FALSE)
dfPathsKos <- read.csv("pthKo.csv", stringsAsFactors=FALSE) kable(head(dfPathsKos, n=20L), caption="Sample pthKo.csv output of getPathways function.", booktab=TRUE, row.names=FALSE)
dfPaths <- read.csv("path.csv", stringsAsFactors=FALSE) kable(head(dfPaths, n=20L), caption="Sample path.csv output of getPathways function.", booktab=TRUE, row.names=FALSE)
countByPath <- read.csv("countByPath.csv", stringsAsFactors=FALSE) # Replace _ with spaces in Factor to allow word-wrap within the column. countByPath$Factor <- gsub("_", " ", countByPath$Factor) # Subset to include only "Organismal Systems" category countByPathOrg <- subset(countByPath, category %in% c("Organismal Systems")) countByPathOrg <- arrange(countByPathOrg, class, path, direction) kable(head(countByPathOrg, n=15L), caption="Sample countByPath.csv output of mergePathways function.", booktab=TRUE, row.names=FALSE)
# Read in countByClass.csv countByClass <- read.csv("countByClass.csv", stringsAsFactors=FALSE) # Replace _ with spaces in Factor to allow word-wrap within the column. countByClass$Factor <- gsub("_", " ", countByClass$Factor) # Subset to include only "Organismal Systems" category countByClassOrg <- subset(countByClass, category %in% c("Organismal Systems")) countByClassOrg <- arrange(countByClassOrg, class, direction) kable(head(countByClassOrg, n=20L), caption="Sample countByClass.csv output of mergePathways function.", booktab=TRUE, row.names=FALSE)
# Read in dePathsDetails.csv deDetails <- read.csv("dePathsDetails.csv", stringsAsFactors=FALSE) # Replace _ with spaces in Factor to allow word-wrap within the column. deDetails$Factor <- gsub("_", " ", deDetails$Factor) # Subset genes with Immune system pathways deDetailsImmune <- subset(deDetails, class %in% c(" Immune system")) kable(head(deDetailsImmune, n=10L), caption="Sample dePathsDetails.csv output of mergePathways function, subset to show the first ten genes within Immune system. The combinations of DE and KEGG information in dePathsDetails.csv facilitates subsetting and summarizing DE results", booktab=TRUE, row.names=FALSE)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.