export_stats_genesummary: Create gene-level summary tables for your DEA and...
In ftwkoopmans/msdap: Mass Spectrometry Downstream Analysis Pipeline

View source: R/export_stats_genesummary.R

export_stats_genesummary

R Documentation

Create gene-level summary tables for your DEA and differential detection results

Description

To expedite downstream analysis, this function maps the proteingroup results from your differential expression analysis to Human gene identifiers and exports these as Excel tables that are ready for use with GO tools. Optionally, one can also include results from differential detection analysis. Multiple proteingroups that map to the same unique gene are merged and there are various options for dealing with ambiguous proteingroups.

The resulting output tables can be used as input with the GOAT online geneset analysis tool @ https://ftwkoopmans.github.io/goat/

combining differential expression and differential detection stats

If differential detection (DD) was performed on this dataset and the diffdetect_zscore_threshold parameter is not NA, the DD z-scores (per contrast) are converted to p-values by pnorm(abs(zscore), lower.tail = F), assuming these are approximately normally distributed (!), and then multiple testing correction by FDR is applied. However, this is just an approximation; the differential detection scores are a very simplistic approach and should be considered inferior to DEA analyses. Double-check the zscore distributions and treat these results with care ! Plot your DD results with plot_differential_detect(dataset, zscore_threshold = 6). Stringent cutoffs (absolute diffdetect_zscore_threshold of at least 5~6, or higher) are encouraged when using differential detection results.

If DD is enabled (diffdetect_zscore_threshold parameter is not NA), the effectsizes from DEA and DD are both standardized to make their distributions somewhat comparable. Note that the downside is that the effectsizes returned by this function no longer will be the exact effectsizes returned by the DEA model (those can be then found @ output column effectsize_dea).

dealing with overlapping gene symbols between proteingroups

Importantly, there may be multiple proteingroups (isoforms) in any proteomics dataset that match the same gene symbol. This function will return the proteingroup/row with the lowest p-value for each unique gene (per contrast, per DEA algorithm). Your selected option for gene_ambiguity will affect this; if for example "leading_gene" is selected as an option then a "GRIA1;GRIA2" proteingroup with pvalue=0.01 will be chosen as representative value for "GRIA1" over a "GRIA1" proteingroup with pvalue=0.5 (see further below parameter documentation)

Usage

export_stats_genesummary(
  dataset,
  gene_ambiguity = "prio_specific",
  diffdetect_zscore_threshold = NA,
  diffdetect_type = "auto",
  dea_logfc_instead_of_effectsize = FALSE,
  output_dir = NA,
  hgnc,
  xref = NULL,
  remove_nohgnc = FALSE
)

Arguments

`dataset`	to use this function, your dataset should meet these requirements: dataset was searched against a uniprot FASTA in MaxQuant/Spectronaut/DIA-NN/etc. (i.e. MS-DAP can only extract gene symbol info from from uniprot.org FASTA databases) `import_fasta()` has been applied to this dataset differential expression analysis has been applied (e.g. typical `analysis_quickstart()` workflow, or custom code that applied `filter_dataset()` and `dea()`)
`gene_ambiguity`	options: leading_gene, prio_specific, only_specific (default: prio_specific). example table from GOAT online website this list describes various scenarios; gene symbol(s) that represent some proteingroup together with a description of presented information GRIA1 = protein group maps to exactly 1 gene GRIA2 = protein group maps to exactly 1 gene GRIA1;GRIA2 = ambiguous, one might want to use only the first entry ('leading' gene) GRIA3;GRIA4 = ambiguous, but this row contributes a new gene (GRIA3) tr\|A8K0K0\|A8K0K0_HUMAN;GRIA2;GRIA3 = ambiguous, but the first entry has no gene symbol only an accession options and their result when applied to above examples leading_gene = use only the leading/first gene per proteingroup. e.g. "GRIA1;GRIA2" is mapped to "GRIA1" and will overwrite a specific "GRIA1" proteingroup if (and only if) the former has a lower p-value. all rows in this example will be mapped to a gene ID (respectively, GRIA1, GRIA2, GRIA1, GRIA3, GRIA2). prio_specific = discard ambiguous proteingroups if their leading/first gene overlaps with a specific proteingroup. e.g. "GRIA1;GRIA2" is mapped to "GRIA1" only if there is no specific/unambiguous "GRIA1" proteingroup. rows 3 and 5 in this example are discarded because there exist specific entries for GRIA1 and GRIA2 (respectively, GRIA1, GRIA2, -, GRIA3, -). This approach favors rows that are specific and supplements the results only with ambiguous rows that contribute new information (leading gene is not in the results yet). only_specific = all ambiguous proteingroups are discarded (included in output table, but human gene ID is set to NA). only the first 2 rows in this example are mapped (respectively, GRIA1, GRIA2, -, -, -).
`diffdetect_zscore_threshold`	Optionally, a differential detection absolute z-score cutoff. Set to NA to disable diffdetect (default) to return only DEA results. When using this option, we strongly recommend to first review the zscore distributions using MS-DAP function: `plot_differential_detect()` -> double-check that your selected absolute zscore threshold cuts at intended positions in the distribution !
`diffdetect_type`	type of differential detect scores. options: 'auto' = set to 'detect' if this score is available, 'quant' otherwise 'detect' = differential detection z-scores computed from only "detected" peptides (no MBR) 'quant' = differential detection z-scores computed from all quantified peptides (uses MBR)
`dea_logfc_instead_of_effectsize`	boolean parameter: should log2fc values be returned in the effectsize column? In some edge cases this might be helpful for downstream analysis but should generally be left at default value (FALSE)
`output_dir`	where to write the Excel output tables (e.g. "C:/temp"). Set to NA to disable writing tables
`hgnc`	HGNC lookup table, the output from function `hgnc_lookuptable()`
`xref`	For Mouse and Rat datasets we recommend to provide MGI/RGD database files to increase ID mapping accuracy. For Mouse species, this should be the MGI lookup table from function `mgi_lookuptable()`. For Rat, the RGD lookup table from function `rgd_lookuptable()`
`remove_nohgnc`	set to TRUE to remove all rows that could not be matched to a HGNC ID. default; FALSE

Value

the table(s) produced by this function are gene-level summaries of your statistical analyses. The R data.table returned by this function is in long-format (i.e. results from multiple DEA algorithms and contrast are appended). The Excel output files are split by contrast and DEA algorithm for convenience: these can be directly used as input for gene set analysis tools. Proteingroups that failed to map to human genes are included in the output table but will have NA values in the hgnc_id column.

A description of columns:

'protein_id' contains the proteingroup identifier (concatenation of protein identifiers)
'contrast' describes the statistical contrast (e.g. WT vs KO)
'dea_algorithm' describes the DEA algorithm that configured. Note that if you select multiple algorithms while running analysis_quickstart(), there will be multiple rows in this table for each gene*contrast !
'score_type' is either 'dea', 'dd' or 'dea,dd' indicating whether the effectsize/pvalue/pvalue_adjust shown for this row originates from differential expression analysis (DEA) or differential detection (DD), or both (i.e. statistics were available for both, in which case the metric with lowest pvalue was used to populate the effectsize/pvalue/pvalue_adjust columns)
'peptide_count' describes the number of peptides that was used for DEA. So this is NOT the total number of detected peptides, rather this is the subset of peptides that passed your specified filtering rules (and were thus used for DEA)
'log2fc' log2 foldchanges; for DEA results this value is taken straight from the DEA result table. For DD results this is the 'diffdetect zscore'
'pvalue' analogous to log2fc
'pvalue_adjust' analogous to log2fc
'signif' boolean flag indicating significance. For DEA, the criteria for significance that were previously configured when performing DEA using dea() or analysis_quickstart() functions are reused here (i.e. cutoffs for adjusted p-value and log2 foldchange). For DD results, any included protein (that passes the diffdetect_zscore_threshold cutoff) is set to TRUE.
The 'effectsize' column data depends on the user parameters/configuration; When only differential detect is used, this contains the z-scores. When only DEA is used, this contains effectsizes from DEA as-is. However, when DEA and differential detect are combined, the effectsizes from DEA are standardized (divided by std) such that effectsizes from both statistics can be integrated into 1 distribution (and thus these can be compared/ranked in downstream analyses). For rows that have both DD and DEA results, see item score_type Not a perfect solution, but in combination with stringent diffdetect_zscore_threshold parameter this seems to work ok in our hands.
'effectsize_dea' is only included when DD is enabled; it shows the DEA effectsizes as-is without any rescaling (and rescaled values merged with rescaled DD values are shown in 'effectsize')
'gene_symbols_or_id' contains the uniprot gene symbols for respective accessions in the proteingroup (protein_id column) as per usual in MS-DAP.
'hgnc_id' the mapped Human gene ID (can be NA in case of failed mappings)
'symbol' the gene symbol of the respective 'hgnc_id' (can be NA in case of failed mappings)
'ensembl_id' the Ensembl gene ID of the respective 'hgnc_id' (can be NA in case of failed mappings)
'entrez_id' the NCBI Entrez gene ID of the respective 'hgnc_id' (can be NA in case of failed mappings)
'gene' same as the 'entrez_id' column. Provided for drag-and-drop compatibility with the GOAT online tool

Examples

## Not run: 

## note that in this example, we have renamed the downloaded files
## to include a timestamp; useful for tracking versions
# load the HGNC data table you previously downloaded
hgnc_table = hgnc_lookuptable("C:/DATA/hgnc_complete_set__2024-01-01.txt")

# MGI Mouse database, only needed for Mouse datasets
#mgi_table = mgi_lookuptable("C:/DATA/MRK_SwissProt_TrEMBL__2024-01-01.rpt")
# RGD Rat database, only needed for Rat datasets
#rgd_table = rgd_lookuptable("C:/DATA/GENES_RAT__2024-01-01.rpt")

export_stats_genesummary(
  dataset,
  # set NA to ignore differential detection (default), or define an absolute
  # zscore threshold (e.g. 6). ! when using diffdetect, first review the
  # zscore distributions using MS-DAP function: plot_differential_detect()
  diffdetect_zscore_threshold = NA,
  # options for dealing with proteingroups that map to multiple genes:
  # leading_gene, prio_specific, only_specific
  gene_ambiguity = "prio_specific", # prio_specific and only_specific are recommended
  # always need the HGNC data table
  hgnc = hgnc_table,
  # example: for Mouse datasets, provide the MGI database for more accurate ID mapping
  # xref = mgi_table,
  # example: for Rat datasets, provide the RGD database for more accurate ID mapping
  # xref = rgd_table,
  # write the files to the same directory as main MS-DAP results
  output_dir = dataset$output_dir
)

## End(Not run)

ftwkoopmans/msdap documentation built on March 5, 2025, 12:15 a.m.