drop_uninformative_genes: Drop uninformative genes

View source: R/drop_uninformative_genes.r

drop_uninformative_genes R Documentation

Drop uninformative genes


drop_uninformative_genes drops uninformative genes in order to reduce compute time and noise in subsequent steps. It achieves this through several steps, each of which is optional:

  • Drop non-1:1 orthologs:
    Removes genes that don't have 1:1 orthologs with the output_species ("human" by default).

  • Drop non-varying genes:
    Removes genes that don't vary across cells based on variance deciles.

  • Drop non-differentially expressed genes (DEGs):
    Removes genes that are not significantly differentially expressed across cell-types (multiple DEG methods available).
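
As a rough illustration of the "drop non-varying genes" step, the following base-R sketch assigns each gene to a variance decile and drops the lowest one. It is not the package's implementation: the toy matrix, the rank-based decile rule and the cutoff name min_decile are all assumptions made for this example.

```r
# Hypothetical toy data: 20 genes x 6 cells (not taken from ewceData)
set.seed(1)
toy_exp <- matrix(
    rpois(120, lambda = 5),
    nrow = 20,
    dimnames = list(paste0("gene", 1:20), paste0("cell", 1:6))
)
toy_exp[1:2, ] <- 3 # two genes with identical expression in every cell

# Per-gene variance across cells
gene_var <- apply(toy_exp, 1, stats::var)

# Assign each gene to a variance decile (1 = least variable).
# Assumption: deciles are derived from ranks; the package may compute them differently.
decile <- ceiling(10 * rank(gene_var, ties.method = "first") / length(gene_var))

# Keep genes at or above a (hypothetical) minimum decile
min_decile <- 2
toy_filtered <- toy_exp[decile >= min_decile, , drop = FALSE]
dim(toy_filtered)
```

With 20 genes, each decile holds two genes, so the two constant genes fall in decile 1 and are removed.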


  mtc_method = "BH",
  adj_pval_thresh = 1e-05,
  convert_orths = FALSE,
  input_species = NULL,
  output_species = "human",
  non121_strategy = "drop_both_species",
  method = "homologene",
  as_sparse = TRUE,
  as_DelayedArray = FALSE,
  return_sce = FALSE,
  no_cores = 1,
  verbose = TRUE,



exp: Expression matrix with gene names as row names.


level2annot: Array of cell types, with each element corresponding, in order, to a column in the expression matrix.


mtc_method: Multiple-testing correction method used by the DGE step. See p.adjust for more details.


adj_pval_thresh: Minimum differential expression significance (expressed as an adjusted p-value threshold) that a gene must demonstrate across level2annot (i.e. cell types).
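
To make the interaction between the multiple-testing correction and the significance threshold concrete, here is a minimal sketch using stats::p.adjust. The p-values and gene names are invented for illustration, and the strict "<" comparison is an assumption; the source does not state how the package compares against the threshold.

```r
# Hypothetical raw p-values for five genes (illustrative only)
raw_p <- c(geneA = 1e-09, geneB = 2e-07, geneC = 4e-04, geneD = 0.03, geneE = 0.6)

# Benjamini-Hochberg correction, i.e. the method named by the "BH" default
adj_p <- stats::p.adjust(raw_p, method = "BH")

# Keep genes whose adjusted p-value passes the default threshold of 1e-05.
# Assumption: strict inequality; the package may use <= instead.
thresh <- 1e-05
keep <- names(adj_p)[adj_p < thresh]
keep
```

Only geneA and geneB survive here, which shows how strict the default threshold is compared with a conventional 0.05 cutoff.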


convert_orths: If input_species != output_species and convert_orths=TRUE, drops genes without 1:1 output_species orthologs and then converts the gene names in exp to those of output_species.


input_species: Which species the gene names in exp come from. See list_species for all available species.


output_species: Which species' gene names to convert exp to. See list_species for all available species.


non121_strategy: How to handle genes that don't have 1:1 mappings between input_species:output_species. Options include:

  • "drop_both_species" or "dbs" or 1 :
    Drop genes that have duplicate mappings in either the input_species or output_species

  • "drop_input_species" or "dis" or 2 :
    Only drop genes that have duplicate mappings in the input_species.

  • "drop_output_species" or "dos" or 3 :
    Only drop genes that have duplicate mappings in the output_species.

  • "keep_both_species" or "kbs" or 4 :
    Keep all genes regardless of whether they have duplicate mappings in either species.

  • "keep_popular" or "kp" or 5 :
    Return only the most "popular" interspecies ortholog mappings. This procedure tends to yield a greater number of returned genes but at the cost of many of them not being true biological 1:1 orthologs.

  • "sum","mean","median","min" or "max" :
    When gene_df is a matrix and gene_output="rownames", these options will aggregate many-to-one gene mappings (input_species-to-output_species) after dropping any duplicate genes in the output_species.
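
The aggregation options in the last item can be pictured with base R's rowsum, which sums matrix rows that share a group label. This is only a sketch of the "sum" idea under assumed inputs; the gene names and the input-to-ortholog mapping below are hypothetical, and the actual aggregation inside orthogene may be implemented differently.

```r
# Hypothetical expression matrix: 4 input-species genes x 2 cells
toy_exp <- matrix(
    1:8,
    nrow = 4,
    dimnames = list(c("Gene1a", "Gene1b", "Gene2", "Gene3"), c("cell1", "cell2"))
)

# Hypothetical many-to-one mapping: two input genes map to the same ortholog
ortholog <- c("GENE1", "GENE1", "GENE2", "GENE3")

# "sum" strategy: add up rows that map to the same output_species gene
toy_agg <- rowsum(toy_exp, group = ortholog)
toy_agg
```

The result has one row per output_species gene, with the GENE1 row holding the summed counts of Gene1a and Gene1b.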


method: R package to use for gene mapping:

  • "gprofiler" : Slower but more species and genes.

  • "homologene" : Faster but fewer species and genes.

  • "babelgene" : Faster but fewer species and genes. Also gives consensus scores for each gene mapping based on several different data sources.


as_sparse: Convert exp to a sparse matrix.


as_DelayedArray: Convert exp to a DelayedArray for scalable processing.


return_sce: Whether to return the filtered results as an expression matrix or a SingleCellExperiment.


no_cores: Number of cores to parallelise across. Set to NULL to automatically optimise.


verbose: Print messages.


...: Arguments passed on to orthogene::convert_orthologs


gene_df: Data object containing the genes (see gene_input for options on how the genes can be stored within the object).
Can be one of the following formats:

  • matrix :
    A sparse or dense matrix.

  • data.frame :
    A data.frame, data.table, or tibble.

  • list :
    A list or character vector.

Genes, transcripts, proteins, SNPs, or genomic ranges can be provided in any format (HGNC, Ensembl, RefSeq, UniProt, etc.) and will be automatically converted to gene symbols unless specified otherwise with the ... arguments.
Note: If you set method="homologene", you must either supply genes in gene symbol format (e.g. "Sox2") OR set standardise_genes=TRUE.


gene_input: Which aspect of gene_df to get gene names from:

  • "rownames" :
    From row names of data.frame/matrix.

  • "colnames" :
    From column names of data.frame/matrix.

  • <column name> :
    From a column in gene_df, e.g. "gene_names".


gene_output: How to return genes. Options include:

  • "rownames" :
    As row names of gene_df.

  • "colnames" :
    As column names of gene_df.

  • "columns" :
    As new columns "input_gene", "ortholog_gene" (and "input_gene_standard" if standardise_genes=TRUE) in gene_df.

  • "dict" :
    As a dictionary (named list) where the names are input_gene and the values are ortholog_gene.

  • "dict_rev" :
    As a reversed dictionary (named list) where the names are ortholog_gene and the values are input_gene.


standardise_genes: If TRUE and gene_output="columns", a new column "input_gene_standard" will be added to gene_df containing standardised HGNC symbols identified by gorth.


drop_nonorths: Drop genes that don't have an ortholog in the output_species.


agg_fun: Aggregation function passed to aggregate_mapped_genes. Set to NULL to skip the aggregation step (default).


mthreshold: Maximum number of ortholog names per gene to show. Passed to gorth. Only used when method="gprofiler" (default: Inf).


sort_rows: Sort gene_df rows alphanumerically.


Returns exp: Expression matrix with gene names as row names.


cortex_mrna <- ewceData::cortex_mrna()
# Use only a subset of genes to keep the example quick
cortex_mrna$exp <- cortex_mrna$exp[1:300, ]

## Convert orthologs at the same time
exp2_orth <- drop_uninformative_genes(
    exp = cortex_mrna$exp,
    level2annot = cortex_mrna$annot$level2class,
    input_species = "mouse"
)

NathanSkene/EWCE documentation built on May 25, 2023, 8:30 a.m.