View source: R/export_stats_genesummary.R
export_stats_genesummary | R Documentation |
To expedite downstream analysis, this function maps the proteingroup results from your differential expression analysis to Human gene identifiers and exports these as Excel tables that are ready for use with GO tools. Optionally, one can also include results from differential detection analysis. Multiple proteingroups that map to the same unique gene are merged and there are various options for dealing with ambiguous proteingroups.
The resulting output tables can be used as input with the GOAT online geneset analysis tool @ https://ftwkoopmans.github.io/goat/
combining differential expression and differential detection stats
If differential detection (DD) was performed on this dataset and the diffdetect_zscore_threshold
parameter is not NA,
the DD z-scores (per contrast) are converted to p-values by pnorm(abs(zscore), lower.tail = F)
,
assuming these are approximately normally distributed (!), and then multiple testing correction by FDR is applied.
However, this is just an approximation; the differential detection scores are a very simplistic approach and should
be considered inferior to DEA analyses. Double-check the zscore distributions and treat these results with care !
Plot your DD results with plot_differential_detect(dataset, zscore_threshold = 6)
.
Stringent cutoffs (absolute diffdetect_zscore_threshold of at least 5~6, or higher) are encouraged when using differential detection results.
If DD is enabled (diffdetect_zscore_threshold
parameter is not NA), the effectsizes from DEA and DD are both standardized to make
their distributions somewhat comparable. Note that the downside is that the effectsizes returned by this function
no longer will be the exact effectsizes returned by the DEA model (those can be then found @ output column effectsize_dea
).
dealing with overlapping gene symbols between proteingroups
Importantly, there may be multiple proteingroups (isoforms) in any proteomics dataset that match the same gene symbol.
This function will return the proteingroup/row with the lowest p-value for each unique gene (per contrast, per DEA algorithm).
Your selected option for gene_ambiguity
will affect this;
if for example "leading_gene" is selected as an option then a "GRIA1;GRIA2" proteingroup with pvalue=0.01 will be
chosen as representative value for "GRIA1" over a "GRIA1" proteingroup with pvalue=0.5 (see further below parameter documentation)
export_stats_genesummary(
dataset,
gene_ambiguity = "prio_specific",
diffdetect_zscore_threshold = NA,
diffdetect_type = "auto",
dea_logfc_instead_of_effectsize = FALSE,
output_dir = NA,
hgnc,
xref = NULL,
remove_nohgnc = FALSE
)
dataset |
to use this function, your dataset should meet these requirements:
|
gene_ambiguity |
options: leading_gene, prio_specific, only_specific (default: prio_specific). example table from GOAT online websitethis list describes various scenarios; gene symbol(s) that represent some proteingroup together with a description of presented information
options and their result when applied to above examples
|
diffdetect_zscore_threshold |
Optionally, a differential detection absolute z-score cutoff. Set to NA to disable diffdetect (default) to return only DEA results. When using this option, we strongly recommend to first review the zscore distributions using MS-DAP function: |
diffdetect_type |
type of differential detect scores. options: 'auto' = set to 'detect' if this score is available, 'quant' otherwise 'detect' = differential detection z-scores computed from only "detected" peptides (no MBR) 'quant' = differential detection z-scores computed from all quantified peptides (uses MBR) |
dea_logfc_instead_of_effectsize |
boolean parameter: should log2fc values be returned in the effectsize column? In some edge cases this might be helpful for downstream analysis but should generally be left at default value (FALSE) |
output_dir |
where to write the Excel output tables (e.g. "C:/temp"). Set to NA to disable writing tables |
hgnc |
HGNC lookup table, the output from function |
xref |
For Mouse and Rat datasets we recommend to provide MGI/RGD database files to increase ID mapping accuracy. For Mouse species, this should be the MGI lookup table from function |
remove_nohgnc |
set to TRUE to remove all rows that could not be matched to a HGNC ID. default; FALSE |
the table(s) produced by this function are gene-level summaries of your statistical analyses. The R data.table returned by this function is in long-format (i.e. results from multiple DEA algorithms and contrast are appended). The Excel output files are split by contrast and DEA algorithm for convenience: these can be directly used as input for gene set analysis tools. Proteingroups that failed to map to human genes are included in the output table but will have NA values in the hgnc_id column.
A description of columns:
'protein_id' contains the proteingroup identifier (concatenation of protein identifiers)
'contrast' describes the statistical contrast (e.g. WT vs KO)
'dea_algorithm' describes the DEA algorithm that configured. Note that if you select multiple algorithms while running analysis_quickstart()
, there will be multiple rows in this table for each gene*contrast !
'score_type' is either 'dea', 'dd' or 'dea,dd' indicating whether the effectsize/pvalue/pvalue_adjust shown for this row originates from differential expression analysis (DEA) or differential detection (DD), or both (i.e. statistics were available for both, in which case the metric with lowest pvalue was used to populate the effectsize/pvalue/pvalue_adjust columns)
'peptide_count' describes the number of peptides that was used for DEA. So this is NOT the total number of detected peptides, rather this is the subset of peptides that passed your specified filtering rules (and were thus used for DEA)
'log2fc' log2 foldchanges; for DEA results this value is taken straight from the DEA result table. For DD results this is the 'diffdetect zscore'
'pvalue' analogous to log2fc
'pvalue_adjust' analogous to log2fc
'signif' boolean flag indicating significance. For DEA, the criteria for significance that were previously configured when performing DEA using dea()
or
analysis_quickstart()
functions are reused here (i.e. cutoffs for adjusted p-value and log2 foldchange). For DD results, any included protein (that passes the diffdetect_zscore_threshold
cutoff) is set to TRUE.
The 'effectsize' column data depends on the user parameters/configuration;
When only differential detect is used, this contains the z-scores. When only DEA is used, this contains effectsizes from DEA as-is.
However, when DEA and differential detect are combined, the effectsizes from DEA are standardized (divided by std) such that effectsizes from
both statistics can be integrated into 1 distribution (and thus these can be compared/ranked in downstream analyses).
For rows that have both DD and DEA results, see item score_type
Not a perfect solution, but in combination with stringent diffdetect_zscore_threshold
parameter this seems to work ok in our hands.
'effectsize_dea' is only included when DD is enabled; it shows the DEA effectsizes as-is without any rescaling (and rescaled values merged with rescaled DD values are shown in 'effectsize')
'gene_symbols_or_id' contains the uniprot gene symbols for respective accessions in the proteingroup (protein_id column) as per usual in MS-DAP.
'hgnc_id' the mapped Human gene ID (can be NA in case of failed mappings)
'symbol' the gene symbol of the respective 'hgnc_id' (can be NA in case of failed mappings)
'ensembl_id' the Ensembl gene ID of the respective 'hgnc_id' (can be NA in case of failed mappings)
'entrez_id' the NCBI Entrez gene ID of the respective 'hgnc_id' (can be NA in case of failed mappings)
'gene' same as the 'entrez_id' column. Provided for drag-and-drop compatibility with the GOAT online tool
## Not run:
## note that in this example, we have renamed the downloaded files
## to include a timestamp; useful for tracking versions
# load the HGNC data table you previously downloaded
hgnc_table = hgnc_lookuptable("C:/DATA/hgnc_complete_set__2024-01-01.txt")
# MGI Mouse database, only needed for Mouse datasets
#mgi_table = mgi_lookuptable("C:/DATA/MRK_SwissProt_TrEMBL__2024-01-01.rpt")
# RGD Rat database, only needed for Rat datasets
#rgd_table = rgd_lookuptable("C:/DATA/GENES_RAT__2024-01-01.rpt")
export_stats_genesummary(
dataset,
# set NA to ignore differential detection (default), or define an absolute
# zscore threshold (e.g. 6). ! when using diffdetect, first review the
# zscore distributions using MS-DAP function: plot_differential_detect()
diffdetect_zscore_threshold = NA,
# options for dealing with proteingroups that map to multiple genes:
# leading_gene, prio_specific, only_specific
gene_ambiguity = "prio_specific", # prio_specific and only_specific are recommended
# always need the HGNC data table
hgnc = hgnc_table,
# example: for Mouse datasets, provide the MGI database for more accurate ID mapping
# xref = mgi_table,
# example: for Rat datasets, provide the RGD database for more accurate ID mapping
# xref = rgd_table,
# write the files to the same directory as main MS-DAP results
output_dir = dataset$output_dir
)
## End(Not run)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.