drop_uninformative_genes: Drop uninformative genes
In NathanSkene/EWCE: Expression Weighted Celltype Enrichment

View source: R/drop_uninformative_genes.r

Description

drop_uninformative_genes drops uninformative genes to reduce compute time and noise in subsequent steps. It does so in several steps, each of which is optional:

Drop non-1:1 orthologs:
Removes genes that don't have 1:1 orthologs with the output_species ("human" by default).
Drop non-varying genes:
Removes genes that don't vary across cells based on variance deciles.
Drop non-differentially expressed genes (non-DEGs):
Removes genes that are not significantly differentially expressed across cell-types (multiple DEG methods available).
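
The last two steps can be illustrated with a minimal base-R sketch. This is not EWCE's internal code: the toy matrix, the choice to drop only the lowest variance decile, the hypothetical raw p-values, and the strict `<` comparison against the threshold are all assumptions made for illustration.

```r
# Toy data: 20 genes x 12 cells (hypothetical values, not from EWCE).
set.seed(1)
exp <- matrix(
  rpois(20 * 12, lambda = 5),
  nrow = 20,
  dimnames = list(paste0("gene", 1:20), paste0("cell", 1:12))
)
exp[1:2, ] <- 3  # make two genes constant across all cells

# Variance step (sketch): rank genes by variance and bin them into deciles.
gene_var <- apply(exp, 1, var)
decile <- ceiling(10 * rank(gene_var, ties.method = "first") / length(gene_var))
# Assumption: only the lowest decile is treated as "non-varying".
exp_var <- exp[decile > 1, , drop = FALSE]

# DEG step (sketch): keep genes whose adjusted p-value passes the threshold.
# Hypothetical raw p-values from some per-gene test across cell types.
pvals <- c(geneA = 1e-12, geneB = 2e-07, geneC = 3e-04, geneD = 0.2)
adj_pvals <- p.adjust(pvals, method = "BH")       # cf. mtc_method = "BH"
deg_genes <- names(adj_pvals)[adj_pvals < 1e-05]  # cf. adj_pval_thresh = 1e-05
```

In this sketch the two constant genes fall into the lowest variance decile and are removed, and only the genes whose Benjamini-Hochberg-adjusted p-values fall below the threshold are retained.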

Usage

drop_uninformative_genes(
  exp,
  level2annot,
  mtc_method = "BH",
  adj_pval_thresh = 1e-05,
  convert_orths = FALSE,
  input_species = NULL,
  output_species = "human",
  non121_strategy = "drop_both_species",
  method = "homologene",
  as_sparse = TRUE,
  as_DelayedArray = FALSE,
  return_sce = FALSE,
  no_cores = 1,
  verbose = TRUE,
  ...
)

Arguments

`exp`	Expression matrix with gene names as rownames.
`level2annot`	Array of cell types, with each element corresponding, in order, to a column in the expression matrix.
`mtc_method`	Multiple-testing correction method used by the differential gene expression (DGE) step. See p.adjust for more details.
`adj_pval_thresh`	Adjusted p-value threshold: the minimum differential expression significance that a gene must demonstrate across `level2annot` (i.e. cell types).
`convert_orths`	If `input_species!=output_species` and `convert_orths=TRUE`, will drop genes without 1:1 `output_species` orthologs and then convert `exp` gene names to those of `output_species`.
`input_species`	Which species the gene names in `exp` come from. See list_species for all available species.
`output_species`	Which species' gene names to convert `exp` to. See list_species for all available species.
`non121_strategy`	How to handle genes that don't have 1:1 mappings between `input_species`:`output_species`. Options include:
	`"drop_both_species" or "dbs" or 1` : Drop genes that have duplicate mappings in either the `input_species` or `output_species` (DEFAULT).
	`"drop_input_species" or "dis" or 2` : Only drop genes that have duplicate mappings in the `input_species`.
	`"drop_output_species" or "dos" or 3` : Only drop genes that have duplicate mappings in the `output_species`.
	`"keep_both_species" or "kbs" or 4` : Keep all genes regardless of whether they have duplicate mappings in either species.
	`"keep_popular" or "kp" or 5` : Return only the most "popular" interspecies ortholog mappings. This procedure tends to yield a greater number of returned genes, but at the cost of many of them not being true biological 1:1 orthologs.
	`"sum","mean","median","min" or "max"` : When `gene_df` is a matrix and `gene_output="rownames"`, these options will aggregate many-to-one gene mappings (`input_species`-to-`output_species`) after dropping any duplicate genes in the `output_species`.
`method`	R package to use for gene mapping:
	`"gprofiler"` : Slower but more species and genes.
	`"homologene"` : Faster but fewer species and genes.
	`"babelgene"` : Faster but fewer species and genes. Also gives consensus scores for each gene mapping based on several different data sources.
`as_sparse`	Convert `exp` to sparse matrix.
`as_DelayedArray`	Convert `exp` to `DelayedArray` for scalable processing.
`return_sce`	Whether to return the filtered results as an expression matrix or a SingleCellExperiment.
`no_cores`	Number of cores to parallelise across. Set to `NULL` to automatically optimise.
`verbose`	Print messages.
`...`	Arguments passed on to `orthogene::convert_orthologs`:
	`gene_df` : Data object containing the genes (see `gene_input` for options on how the genes can be stored within the object). Can be one of the following formats:
		`matrix` : A sparse or dense matrix.
		`data.frame` : A `data.frame`, `data.table`, or `tibble`.
		`list` : A `list` or character `vector`.
		Genes, transcripts, proteins, SNPs, or genomic ranges can be provided in any format (HGNC, Ensembl, RefSeq, UniProt, etc.) and will be automatically converted to gene symbols unless specified otherwise with the `...` arguments. Note: If you set `method="homologene"`, you must either supply genes in gene symbol format (e.g. "Sox2") OR set `standardise_genes=TRUE`.
	`gene_input` : Which aspect of `gene_df` to get gene names from:
		`"rownames"` : From row names of data.frame/matrix.
		`"colnames"` : From column names of data.frame/matrix.
		`<column name>` : From a column in `gene_df`, e.g. `"gene_names"`.
	`gene_output` : How to return genes. Options include:
		`"rownames"` : As row names of `gene_df`.
		`"colnames"` : As column names of `gene_df`.
		`"columns"` : As new columns "input_gene", "ortholog_gene" (and "input_gene_standard" if `standardise_genes=TRUE`) in `gene_df`.
		`"dict"` : As a dictionary (named list) where the names are input_gene and the values are ortholog_gene.
		`"dict_rev"` : As a reversed dictionary (named list) where the names are ortholog_gene and the values are input_gene.
	`standardise_genes` : If `TRUE` AND `gene_output="columns"`, a new column "input_gene_standard" will be added to `gene_df` containing standardised HGNC symbols identified by gorth.
	`drop_nonorths` : Drop genes that don't have an ortholog in the `output_species`.
	`agg_fun` : Aggregation function passed to aggregate_mapped_genes. Set to `NULL` to skip aggregation step (default).
	`mthreshold` : Maximum number of ortholog names per gene to show. Passed to gorth. Only used when `method="gprofiler"` (DEFAULT : `Inf`).
	`sort_rows` : Sort `gene_df` rows alphanumerically.
	`gene_map` : A data.frame that maps the current gene names to new gene names. This function's behaviour will adapt to different situations as follows:
		`gene_map=<data.frame>` : When a data.frame containing the gene key:value columns (specified by `input_col` and `output_col`, respectively) is provided, this will be used to perform aggregation/expansion.
		`gene_map=NULL` and `input_species!=output_species` : A `gene_map` is automatically generated by map_orthologs to perform inter-species gene aggregation/expansion.
		`gene_map=NULL` and `input_species==output_species` : A `gene_map` is automatically generated by map_genes to perform within-species gene symbol standardization and aggregation/expansion.
	`input_col` : Column name within `gene_map` with gene names matching the row names of `X`.
	`output_col` : Column name within `gene_map` with gene names that you wish to map the row names of `X` onto.
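
To make the `non121_strategy` options more concrete, the following base-R sketch applies the three "drop" strategies to a small, made-up mapping table. It is only an illustration of the idea described above, not the code used by `orthogene::convert_orthologs`; the gene names are hypothetical, and the column names simply mirror the "input_gene" and "ortholog_gene" columns mentioned under `gene_output`.

```r
# Hypothetical input-species -> output-species mapping table (toy data).
gene_map <- data.frame(
  input_gene    = c("Aa", "Bb",  "Bb",  "Cc", "Dd"),
  ortholog_gene = c("AA", "BB1", "BB2", "CD", "CD"),
  stringsAsFactors = FALSE
)

# Flag every row whose gene appears more than once on either side.
dup_in <- gene_map$input_gene %in%
  gene_map$input_gene[duplicated(gene_map$input_gene)]
dup_out <- gene_map$ortholog_gene %in%
  gene_map$ortholog_gene[duplicated(gene_map$ortholog_gene)]

# "drop_both_species": keep only strict 1:1 mappings.
drop_both <- gene_map[!dup_in & !dup_out, ]
# "drop_input_species": drop input genes that map to several orthologs.
drop_input <- gene_map[!dup_in, ]
# "drop_output_species": drop orthologs that several input genes map to.
drop_output <- gene_map[!dup_out, ]
```

Here only "Aa" survives the strictest strategy, because "Bb" maps to two orthologs and both "Cc" and "Dd" map to the same ortholog.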

Value

exp Expression matrix with gene names as row names.

Examples

cortex_mrna <- ewceData::cortex_mrna()
# Use only a subset of genes to keep the example quick
cortex_mrna$exp <- cortex_mrna$exp[1:300, ]

## Drop uninformative genes from the mouse data.
## Note: orthologs are only converted if convert_orths = TRUE (default is FALSE).
exp2_orth <- drop_uninformative_genes(
    exp = cortex_mrna$exp,
    level2annot = cortex_mrna$annot$level2class,
    input_species = "mouse"
)

NathanSkene/EWCE documentation built on Feb. 17, 2025, 7:52 a.m.