defineDetectedTx | R Documentation |
Define detected transcripts
defineDetectedTx(
iMatrixTx = NULL,
iMatrixTxGrp = NULL,
iMatrixTxTPM = NULL,
iMatrixTxTPMGrp = NULL,
groups = NULL,
tx2geneDF = NULL,
cutoffTxPctMax = 10,
cutoffTxExpr = 5,
cutoffTxTPMExpr = 0.1,
txColname = "transcript_id",
geneColname = "gene_name",
zeroAsNA = TRUE,
applyTxPctTo = c("TPM", "counts", "both", "either"),
useMedian = FALSE,
floorTPM = 0.001,
floorCounts = 0.001,
verbose = FALSE,
...
)
iMatrixTx |
numeric matrix of read counts (or pseudocounts) with
transcript rows and sample columns. This data is assumed to be
log2-transformed, and if any value is higher than 50, it will
be log2-transformed with |
iMatrixTxGrp |
numeric matrix of read counts averaged by sample
group. If this matrix is not provided, it will be calculated
from |
iMatrixTxTPM |
numeric matrix of TPM values, with sample columns
and transcript rows. Note that if this parameter is not supplied,
the counts in |
iMatrixTxTPMGrp |
numeric matrix of TPM values averaged by sample
group. If this matrix is not provided, it will be calculated
from |
groups |
vector of group labels, either as character vector
or factor. It should be named by |
tx2geneDF |
data.frame with colnames including
|
cutoffTxPctMax |
numeric value scaled from 0 to 100 indicating the percentage of the maximum isoform expression per gene, for an alternate isoform to be considered for detection. |
cutoffTxExpr |
numeric value indicating the minimum group mean
counts in |
cutoffTxTPMExpr |
numeric value indicating the minimum group mean
TPM in |
txColname , geneColname |
the |
zeroAsNA |
logical indicating whether values of zero
(or less than zero) should
be treated as |
applyTxPctTo |
|
useMedian |
logical indicating whether to use group median values instead of group mean values. |
floorTPM , floorCounts |
|
verbose |
logical indicating whether to print verbose output. |
This function aims to combine evidence from RNA-seq sequence read counts (or pseudocounts from a kmer tool such as Salmon or Kallisto), along with alternative TPM quantitation, to determine the observed "detected" transcript space for a given experiment.
Each input data matrix is assumed to be appropriately log-transformed, typically using log2(1+x). If any value is >= 50 then the data matrix will be log2-transformed using log2(1+x).
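The transformation rule above can be sketched in a few lines of base R. This is only an illustration of the stated behavior, not the package's internal code: the helper name `ensureLog2` and its `maxLog` argument are hypothetical, and the threshold follows the `>= 50` wording used here.

```r
# Hypothetical helper (not part of the package): apply log2(1+x)
# only when the matrix still appears to be on the linear scale.
ensureLog2 <- function(x, maxLog = 50) {
   if (any(x >= maxLog, na.rm = TRUE)) {
      x <- log2(1 + x)
   }
   x
}

# Toy counts matrix with transcript rows and sample columns
rawCounts <- matrix(c(0, 31, 1023, 7),
   nrow = 2,
   dimnames = list(c("tx1", "tx2"), c("s1", "s2")))

# The value 1023 triggers the transformation
logCounts <- ensureLog2(rawCounts)

# A second pass leaves the matrix unchanged, since no value is >= 50
logCounts2 <- ensureLog2(logCounts)
```

One consequence of a rule like this is that a linear-scale matrix whose values all fall below 50 would be treated as already log-transformed, so it may be safer to transform the data explicitly before calling the function.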
The criteria must be met in at least one sample group, but all criteria must be met in the same sample group for an isoform to be considered "detected".
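The "same sample group" rule can be illustrated with toy logical matrices. This is a sketch under assumptions: the object names (`passPct`, `passCounts`, `passTPM`) are hypothetical and are not the names used inside the function.

```r
# Toy logical matrices: transcript rows, sample-group columns.
# TRUE means the criterion is met for that transcript in that group.
dn <- list(c("txA", "txB"), c("grp1", "grp2"))
passPct    <- matrix(c(TRUE,  FALSE, FALSE, TRUE), nrow = 2, dimnames = dn)
passCounts <- matrix(c(FALSE, TRUE,  TRUE,  TRUE), nrow = 2, dimnames = dn)
passTPM    <- matrix(c(TRUE,  TRUE,  TRUE,  TRUE), nrow = 2, dimnames = dn)

# All criteria must hold within the same group ...
allInGroup <- passPct & passCounts & passTPM
# ... and at least one group must satisfy them.
detected <- rowSums(allInGroup) > 0

# For contrast: testing each criterion across groups independently
# would accept txA, which never meets all criteria in a single group.
naive <- rowSums(passPct) > 0 &
   rowSums(passCounts) > 0 &
   rowSums(passTPM) > 0
```

Here `txA` meets every criterion somewhere but never all of them in one group, so it is not detected, while `txB` meets all of them in `grp2`.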
In our experience the use of TPM values appears more robust and is conceptually the best approach for comparing the relative quantity of one transcript isoform to another. Our reasoning is that TPM is intended to be roughly a molar quantity of transcript molecules, independent of the transcript length, and the potential for overlapping regions between isoforms. We also recommend the use of a kmer quantitation method, such as Salmon or Kallisto, which estimates isoform abundances not by specific read counts, but by quantifying kmers unique to particular isoforms for a given gene.
In all cases, the thresholds for detection can be modified. In our experience thus far, however, the default values perform reasonably well at identifying expressed isoforms while filtering out isoforms that we considered to be spuriously expressed.
There are three default requirements for a transcript to be considered "detected".
An isoform must be expressed at 10% or more of the most abundant isoform for a given gene, using TPM values.
An isoform must have at least log2(32) pseudocounts to be considered detected, based upon our view of Salmon pseudocount data using MA-plots.
An isoform must have at least log2(2) TPM units to be considered detected, based upon our view of Salmon TPM values using MA-plots.
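The three requirements listed above can be applied to a toy dataset as follows. This is a minimal sketch, not the function's implementation: all object names are hypothetical, a single sample group is used, the thresholds are the ones quoted in the list, and computing percent-of-max on the linear scale is an assumption.

```r
# Toy annotation: three isoforms of geneA and one isoform of geneB
tx2gene <- data.frame(
   transcript_id = c("tx1", "tx2", "tx3", "tx4"),
   gene_name = c("geneA", "geneA", "geneA", "geneB"))

# Group mean values on the linear scale, then log2(1+x) transformed
tpmLinear    <- c(tx1 = 100,  tx2 = 20,  tx3 = 3,  tx4 = 0.5)
countsLinear <- c(tx1 = 2000, tx2 = 400, tx3 = 40, tx4 = 10)
grpTPM    <- log2(1 + tpmLinear)
grpCounts <- log2(1 + countsLinear)

# Criterion 1: percent of the max isoform per gene, using TPM
# (linear scale here, which is an assumption)
geneMax <- ave(tpmLinear, tx2gene$gene_name, FUN = max)
pctMax <- 100 * tpmLinear / geneMax

# Criteria 2 and 3: minimum pseudocounts and minimum TPM, log2 scale
detected <- pctMax >= 10 &
   grpCounts >= log2(32) &
   grpTPM >= log2(2)

detectedTx <- names(detected)[detected]
```

In this toy case `tx3` passes both abundance floors but is only 3% of the top isoform of its gene, while `tx4` is the top isoform of its gene but falls below both abundance floors.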
Each experiment is likely to be different in terms of total sequenced reads, quality of read alignment or quantitation to the transcriptome, etc. We suggest inspecting MA-plots for the counts and TPM values to find the point at which the signal substantially increases from baseline zero. We also plotted the TPM versus count per sample, and noted the point at which the two signals began to correlate. These observations, along with careful review of numerous gene model transcript isoforms, supported our selection of these criteria.
Lastly, the requirement for 10 percent of max isoform expression was motivated by observing highly expressed genes, which sometimes had alternative isoforms with extremely low abundance compared to the most abundant isoform, but which were notably higher than the minimum for detection. For example, Gapdh expression above 100,000 pseudocounts may have an isoform with 120 pseudocounts. When we reviewed the sequence coverage, we could find no compelling evidence to support the minor isoform, and theorized that the pseudocounts arose from the stochastic nature of rebalancing relative expression among isoforms.
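The arithmetic behind the Gapdh example can be checked directly. In this sketch the value 100000 stands in for "above 100,000", the variable names are hypothetical, and the percentage is computed on pseudocounts only to mirror the example in the text.

```r
# Values taken from the Gapdh example above (linear scale)
majorCounts <- 100000
minorCounts <- 120

# The minor isoform clears the log2(32) pseudocount floor ...
passesCountFloor <- log2(1 + minorCounts) >= log2(32)

# ... but is a small fraction of the major isoform
pctOfMax <- 100 * minorCounts / majorCounts
passesPctMax <- pctOfMax >= 10
```

The minor isoform is 0.12% of the major isoform, so the percent-of-max filter removes it even though it exceeds the minimum pseudocount threshold.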
Note the argument zeroAsNA=TRUE, which by default treats any expression value of zero (or less than zero) as NA, thus removing them from group mean calculations. When iMatrixTxGrp and iMatrixTxTPMGrp are not supplied, this option is helpful in calculating a more appropriate group mean expression value, notably when a value of zero represents absence of data. Any group mean that is NA as a result is converted to zero for the purpose of applying filters.
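The effect of treating zeros as NA on group means can be seen in a small base R sketch. The helper `groupMeans` and the object names are hypothetical, and this is not the code used by the function.

```r
# Toy log2 matrix: transcript rows, sample columns
x <- matrix(c(6, 0, 4, 8,
              0, 0, 2, 0),
   nrow = 2, byrow = TRUE,
   dimnames = list(c("tx1", "tx2"), c("s1", "s2", "s3", "s4")))
groups <- c(s1 = "grpA", s2 = "grpA", s3 = "grpB", s4 = "grpB")

# Hypothetical helper: row means within each sample group
groupMeans <- function(m, groups) {
   sapply(split(names(groups), groups), function(cols) {
      rowMeans(m[, cols, drop = FALSE], na.rm = TRUE)
   })
}

# Zeros included in the mean
plainMeans <- groupMeans(x, groups)

# Zeros (or negative values) treated as NA and dropped from the mean
xNA <- x
xNA[xNA <= 0] <- NA
naMeans <- groupMeans(xNA, groups)

# A group with no remaining values yields NaN; convert it to zero
naMeans[is.na(naMeans)] <- 0
```

For `tx1` in `grpA` the mean rises from 3 to 6 once the zero is excluded, and `tx2` in `grpA`, which has no non-zero values, ends up with a group mean of zero.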
List with the following elements:
Numeric matrix representing the expression counts per transcript, grouped by "gene_name".
Numeric matrix representing the percent expression of each transcript isoform per gene, as compared to the highest expression of isoforms for that gene, using iMatrixTxGrp data. (New to version 0.0.61.900.)
Numeric matrix representing the percent expression of each transcript isoform per gene, as compared to the highest expression of isoforms for that gene, using iMatrixTxTPMGrp data. This data is returned only if iMatrixTxTPM or iMatrixTxTPMGrp was supplied. (New to version 0.0.61.900.)
Numeric matrix representing the percent max expression used for filtering, after applying applyTxPctTo: "counts" uses txPctMaxTxGrpAll; "TPM" uses txPctMaxTxTPMGrpAll; "both" uses the higher of txPctMaxTxGrpAll and txPctMaxTxTPMGrpAll; "either" uses the lower of txPctMaxTxGrpAll and txPctMaxTxTPMGrpAll.
Numeric matrix of sample group counts, exponentiated and rounded to integer values.
Numeric matrix of sample group TPM values, exponentiated and rounded to integer values.
Numeric matrix indicating whether each isoform met the criteria to be considered detected. The criteria must be met in the same group for an isoform to be considered detected.
Character vector of transcripts, as defined by the rownames(iMatrixTx).
Other jam RNA-seq functions: assignGRLexonNames(), closestExonToJunctions(), combineGRcoverage(), detectedTxInfo(), exoncov2polygon(), flattenExonsBy(), getGRcoverageFromBw(), groups2contrasts(), internal_junc_score(), makeTx2geneFromGtf(), make_ref2compressed(), prepareSashimi(), runDiffSplice(), sortSamples(), spliceGR2junctionDF()