extractSequence | R Documentation |
This function extracts the nucleotide (NT) sequences of transcripts by extracting and concatenating the reference genome sequences that correspond to the genomic coordinates of the isoforms. If ORFs are annotated (e.g. via analyzeORF), the function can also translate the ORF NT sequence into an amino acid (AA) sequence (via the Biostrings::translate() function with if.fuzzy.codon='solve'). The sequences (both NT and AA) can be written to fasta file(s) and/or added to the switchAnalyzeRlist.
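The extract-and-concatenate step described above can be illustrated with a toy example in base R. This is only a sketch of the idea, not the package's implementation: the chromosome string, the exon coordinates, and the helper reverse_complement() are hypothetical, and a real analysis would read the sequence from a BSgenome object.

```r
# Toy "chromosome"; a real analysis would use a BSgenome object instead.
chrom <- "AAATGGCCGGGTTTCATTAGCCC"

# Hypothetical exon coordinates (1-based, inclusive) of one isoform.
exons <- data.frame(start = c(3, 15), end = c(8, 20))

# Extract each exon and concatenate the pieces in genomic order.
exon_seqs <- substring(chrom, exons$start, exons$end)
transcript_plus <- paste0(exon_seqs, collapse = "")

# Hypothetical helper: for a minus-strand isoform the concatenated
# sequence is reverse-complemented.
reverse_complement <- function(x) {
  bases <- rev(strsplit(x, "")[[1]])
  chartr("ACGT", "TGCA", paste0(bases, collapse = ""))
}
transcript_minus <- reverse_complement(transcript_plus)

print(transcript_plus)
print(transcript_minus)
```

The intron between the two exons (positions 9 to 14) is skipped, so only exonic bases end up in the transcript sequence.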
extractSequence(
    switchAnalyzeRlist,
    genomeObject = NULL,
    onlySwitchingGenes = TRUE,
    alpha = 0.05,
    dIFcutoff = 0.1,
    extractNTseq = TRUE,
    extractAAseq = TRUE,
    removeShortAAseq = TRUE,
    removeLongAAseq = FALSE,
    alsoSplitFastaFile = FALSE,
    removeORFwithStop = TRUE,
    addToSwitchAnalyzeRlist = TRUE,
    writeToFile = TRUE,
    pathToOutput = getwd(),
    outputPrefix = 'isoformSwitchAnalyzeR_isoform',
    forceReExtraction = FALSE,
    quiet = FALSE
)
switchAnalyzeRlist |
A |
genomeObject |
A |
onlySwitchingGenes |
A logical indicating whether only sequences from transcripts in genes with significant switching isoforms (as indicated by the |
alpha |
The cutoff that the FDR-corrected p-values must be below for a switch to be called significant. Default is 0.05. |
dIFcutoff |
The cutoff that the (absolute) change in isoform usage must exceed before an isoform is considered switching. This cutoff can remove cases where isoforms with (very) low dIF values are deemed significant and thereby included in the downstream analysis. It is analogous to a cutoff on log2 fold change in a normal differential expression analysis of genes, which ensures the genes have a certain effect size. Default is 0.1 (10%). |
extractNTseq |
A logical indicating whether the nucleotide sequence of the transcripts should be extracted (necessary for CPAT analysis). Default is TRUE. |
extractAAseq |
A logical indicating whether the amino acid (AA) sequence of the annotated open reading frames (ORF) should be extracted (necessary for Pfam and SignalP analysis). The ORF can be annotated with the |
removeShortAAseq |
A logical indicating whether to remove sequences based on their length. This option exists to allow easier use of the Pfam and SignalP web servers, which both currently have restrictions on allowed sequence lengths. If enabled, AA sequences are filtered to be > 5 AA. This will only affect the sequences written to the fasta file (if |
removeLongAAseq |
A logical indicating whether to remove sequences based on their length. This option exists to allow easier use of the Pfam and SignalP web servers, which both currently have restrictions on allowed sequence lengths. If enabled, AA sequences are filtered to be < 1000 AA. This will only affect the sequences written to the fasta file (if |
alsoSplitFastaFile |
A subset of the web-based analysis tools currently supported by IsoformSwitchAnalyzeR restrict the number of sequences in each submission (currently Pfam and, to a lesser extent, SignalP). This parameter was implemented to make those web tools easier to use. Setting it to TRUE ALSO generates a number of amino acid FASTA files, each containing at most the allowed number of sequences (currently max 500 for some tools), thereby enabling easy analysis of the data in multiple web-based submissions. Only considered (if |
removeORFwithStop |
A logical indicating whether ORFs containing stop codons, defined as * when the ORF nucleotide sequence is translated to the amino acid sequence, should be A) removed from the ORF annotation in the switchAnalyzeRlist and B) removed from the sequences added to the switchAnalyzeRlist and/or written to fasta files. This is only necessary if you are analyzing quantified known annotated data where you supplied a GTF file to the import function. If you have used |
addToSwitchAnalyzeRlist |
A logical indicating whether the extracted sequences should be added to the |
writeToFile |
A logical indicating whether the extracted sequence(s) should be exported to (separate) fasta files (thereby enabling analysis with external software such as CPAT, Pfam and SignalP). Default is TRUE. |
pathToOutput |
If |
outputPrefix |
If |
forceReExtraction |
A logical indicating whether to force re-extraction of the biological sequences. Otherwise, sequences already stored in the switchAnalyzeRlist (because this function has already been run once) are used if available. Default is FALSE. |
quiet |
A logical indicating whether to avoid printing progress messages. Default is FALSE. |
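The AA sequence filters described for removeORFwithStop, removeShortAAseq and removeLongAAseq can be sketched in base R as follows. The sequences and isoform ids are made up, and the sketch applies all three filters at once purely for illustration (removeLongAAseq is FALSE by default); it is not the package's own code.

```r
# Hypothetical AA sequences keyed by isoform id ('*' marks a stop codon).
aa <- c(
  iso_A = "MKT",              # 3 AA: too short
  iso_B = "MKTAYIAKQR",       # 10 AA: kept
  iso_C = "MKT*AYIAKQR",      # contains a stop codon
  iso_D = strrep("M", 1200)   # 1200 AA: too long
)

has_stop  <- grepl("*", aa, fixed = TRUE)  # cf. removeORFwithStop
too_short <- nchar(aa) <= 5                # cf. removeShortAAseq (keep > 5 AA)
too_long  <- nchar(aa) >= 1000             # cf. removeLongAAseq (keep < 1000 AA)

kept <- aa[!has_stop & !too_short & !too_long]
print(names(kept))
```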
Changes in isoform usage are measured as the difference in isoform fraction (dIF) values, where isoform fraction (IF) values are calculated as <isoform_exp> / <gene_exp>.
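As a worked example of these quantities, the sketch below computes IF and dIF for two isoforms of one gene and then applies the alpha and dIFcutoff thresholds. The expression values and q-values are invented, and two points are assumptions made for illustration: gene expression is taken as the sum over the gene's isoforms, and dIF is taken as condition 2 minus condition 1.

```r
# Hypothetical mean expression of two isoforms of one gene in two conditions.
iso_exp <- data.frame(
  isoform_id = c("iso_1", "iso_2"),
  cond_1 = c(80, 20),
  cond_2 = c(30, 70)
)

# Assumption: gene expression is the sum of its isoforms' expression.
gene_exp <- colSums(iso_exp[, c("cond_1", "cond_2")])

# IF = <isoform_exp> / <gene_exp>, per condition.
IF_1 <- iso_exp$cond_1 / gene_exp[["cond_1"]]
IF_2 <- iso_exp$cond_2 / gene_exp[["cond_2"]]

# Assumption: dIF is the IF in condition 2 minus the IF in condition 1.
dIF <- IF_2 - IF_1

# Hypothetical FDR-corrected p-values for the two isoforms.
q_value <- c(0.01, 0.20)

alpha <- 0.05
dIFcutoff <- 0.1
is_switching <- abs(dIF) > dIFcutoff & q_value < alpha

print(data.frame(isoform_id = iso_exp$isoform_id, IF_1, IF_2, dIF, is_switching))
```

Both isoforms pass the effect-size cutoff here, but only the first also passes the significance cutoff.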
BSgenome objects are loaded as separate packages. For example, use library(BSgenome.Hsapiens.UCSC.hg19) to load the human genome assembly hg19, which is then available as the object Hsapiens (which should be supplied to the genomeObject argument). It is essential that the chromosome names of the annotation match those of the genome object. The extractSequence function automatically takes the most common ambiguity into account: whether 'chr' is used in front of the chromosome name (UCSC style, e.g. 'chr1') or not (Ensembl style, e.g. '1').
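The 'chr' prefix ambiguity can be pictured with a small base R sketch. The helper harmonize_chr_names() is hypothetical and is not part of IsoformSwitchAnalyzeR; it only handles the prefix itself, so other naming differences (for example 'MT' versus 'chrM') would still have to be resolved by hand.

```r
# Hypothetical helper: make annotation chromosome names follow the
# naming style (with or without 'chr') used by the genome object.
harmonize_chr_names <- function(annotation_chr, genome_chr) {
  genome_has_prefix <- any(grepl("^chr", genome_chr))
  annot_has_prefix  <- any(grepl("^chr", annotation_chr))

  if (genome_has_prefix && !annot_has_prefix) {
    # Ensembl style annotation, UCSC style genome: add the prefix.
    annotation_chr <- paste0("chr", annotation_chr)
  } else if (!genome_has_prefix && annot_has_prefix) {
    # UCSC style annotation, Ensembl style genome: strip the prefix.
    annotation_chr <- sub("^chr", "", annotation_chr)
  }
  annotation_chr
}

genome_chr <- c("chr1", "chr2", "chrX")
fixed <- harmonize_chr_names(c("1", "X"), genome_chr)
print(fixed)
```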
The two fasta files written by this function (if writeToFile=TRUE) can be used as input to, among others:
CPAT: The Coding-Potential Assessment Tool, which can be run either locally or via their webserver http://lilab.research.bcm.edu/cpat/
Pfam: Prediction of protein domains, which can be run either locally or via their webserver http://pfam.xfam.org/search#tabview=tab1
SignalP: Prediction of Signal Peptide, which can be run either locally or via their webserver http://www.cbs.dtu.dk/services/SignalP/
See ?analyzeCPAT, ?analyzePFAM or ?analyzeSignalP (under details) for suggested ways of running these tools.
If writeToFile=TRUE, one fasta file per sequence type (controlled via extractNTseq and extractAAseq) is written to the folder indicated by pathToOutput. If alsoSplitFastaFile=TRUE, both a fasta file containing all isoforms (denoted '_complete' in the file name) and a number of fasta files containing subsets of the entire file are created. The subset fasta files have the indication "subset_X_of_Y" in their file names.
If addToSwitchAnalyzeRlist=TRUE, the sequences are added to the switchAnalyzeRlist as DNAStringSet and AAStringSet objects under the names 'ntSequence' and 'aaSequence', respectively. The names of these sequences match the 'isoform_id' entry in the 'isoformFeatures' entry of the switchAnalyzeRlist. The switchAnalyzeRlist is returned whether or not it was modified.
Kristoffer Vitting-Seerup
For this function: Vitting-Seerup et al. The Landscape of Isoform Switches in Human Cancers. Mol. Cancer Res. (2017).
switchAnalyzeRlist
isoformSwitchTestDEXSeq
isoformSwitchTestSatuRn
analyzeORF
### Prepare for sequence extraction
library(IsoformSwitchAnalyzeR)

# Load example data and prefilter
data("exampleSwitchList")
exampleSwitchList <- preFilter(exampleSwitchList)

# Perform test (high dIF cutoff for fast runtime)
exampleSwitchListAnalyzed <- isoformSwitchTestDEXSeq(exampleSwitchList, dIFcutoff = 0.3)

# Annotate ORFs with analyzeORF (requires the genome sequence)
library(BSgenome.Hsapiens.UCSC.hg19)
exampleSwitchListAnalyzed <- analyzeORF(exampleSwitchListAnalyzed, genomeObject = Hsapiens)

### Extract sequences
exampleSwitchListAnalyzed <- extractSequence(
    exampleSwitchListAnalyzed,
    genomeObject = Hsapiens,
    writeToFile = FALSE # to avoid output when running example data
)

### Explore result
head(exampleSwitchListAnalyzed$ntSequence, 2)
head(exampleSwitchListAnalyzed$aaSequence, 2)