preprocess: Pre-process ST pixel gene count matrices to construct corpus...

View source: R/functions.R

preprocess    R Documentation

Pre-process ST pixel gene count matrices to construct corpus for input into LDA

Description

Takes a pixel (rows) x gene (columns) count matrix and filters out poor-quality genes and pixels, then selects the genes to include in the final corpus for input into LDA. If the pixel IDs encode their positions in the form "XxY" (a characteristic of Stahl datasets), these can be extracted as the pixel position coordinates.
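As a minimal sketch of the "XxY" convention described above, the following base R snippet splits pixel IDs into numeric coordinates. The pixel IDs are made-up values, and this is an illustration of the idea rather than the package's own extraction code.

```r
# Toy pixel IDs in "XxY" form (made-up values, for illustration only)
pixel.ids <- c("16.92x9.015", "16.945x11.075", "18.01x10.9")

# Split each ID on the literal "x" and convert both halves to numbers
parts <- strsplit(pixel.ids, "x", fixed = TRUE)
pos <- do.call(rbind, lapply(parts, as.numeric))

# Label rows with the pixel IDs and columns with "x" and "y"
dimnames(pos) <- list(pixel.ids, c("x", "y"))
pos
```

The result has the same shape as the positions element described under Value: one row per pixel, with columns "x" and "y".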

         Order of filtering options:
         1. Selection to use specific genes only
         2. `cleanCounts` to remove poor pixels and genes
         3. Remove top expressed genes in matrix
         4. Remove specific genes based on grepl pattern matching
         5. Remove genes that appear in more/less than a percentage of pixels
         6. Use the overdispersed genes computed from the genes remaining
            after filtering steps 1-5 (if selected)
         7. Optionally keep only the top overdispersed genes, ranked by -log10(p.adj)
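Steps 3 to 5 of the order above can be sketched in base R on a toy matrix. The gene names, counts, and thresholds are invented for illustration, and the code only approximates the documented behaviour; it is not the package's implementation.

```r
# Toy pixel (row) x gene (column) count matrix; all values are made up
counts <- matrix(
  c(50, 0, 3, 1, 0,
    60, 2, 0, 1, 0,
    55, 0, 4, 1, 0,
    70, 1, 5, 1, 9),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("pixel", 1:4),
                  c("Actb", "mt-Nd1", "Gad1", "Rpl3", "Rare1"))
)

# Step 3: drop the n most highly expressed genes (by total counts)
n.top <- 1
top.genes <- names(sort(colSums(counts), decreasing = TRUE))[seq_len(n.top)]
counts <- counts[, !colnames(counts) %in% top.genes, drop = FALSE]

# Step 4: drop genes whose names match any of the grepl patterns
patterns <- c("^mt-")
matched <- Reduce(`|`, lapply(patterns, grepl, x = colnames(counts)))
counts <- counts[, !matched, drop = FALSE]

# Step 5: drop genes detected in too many or too few pixels
# (removed if fraction >= upper bound or <= lower bound)
frac.detected <- colMeans(counts > 0)
keep <- frac.detected < 0.95 & frac.detected > 0.25
counts <- counts[, keep, drop = FALSE]
colnames(counts)
```

In this toy case only "Gad1" survives: "Actb" is the top expressed gene, "mt-Nd1" matches the pattern, "Rpl3" is detected in every pixel, and "Rare1" is detected in exactly the lower-bound fraction of pixels.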

Usage

preprocess(
  dat,
  extractPos = FALSE,
  selected.genes = NA,
  nTopGenes = NA,
  genes.to.remove = NA,
  removeAbove = NA,
  removeBelow = NA,
  min.reads = 1,
  min.lib.size = 1,
  min.detected = 1,
  ODgenes = TRUE,
  nTopOD = 1000,
  od.genes.alpha = 0.05,
  gam.k = 5,
  verbose = TRUE,
  plot = TRUE
)

Arguments

dat

pixel (rows) x gene (columns) matrix of gene counts, or a path to such a matrix

extractPos

Boolean to extract pixel positional coordinates from pixel names (default: FALSE)

selected.genes

vector of gene names to use specifically for the corpus (default: NA)

nTopGenes

integer for number of top expressed genes to remove (default: NA)

genes.to.remove

vector of gene names or patterns for matching to genes to remove (default: NA). ex: c("^mt-") or c("^MT", "^RPL", "^MRPL")

removeAbove

non-negative numeric <= 1, interpreted as a fraction of pixels. Removes genes present in this fraction or more of pixels (default: NA)

removeBelow

non-negative numeric <= 1, interpreted as a fraction of pixels. Removes genes present in this fraction or less of pixels (default: NA)

min.reads

cleanCounts() param; minimum number of reads to keep a gene (default: 1)

min.lib.size

cleanCounts() param; minimum number of counts a pixel needs in order to be kept (default: 1)

min.detected

cleanCounts() param; minimum number of pixels a gene needs to have been detected in to be kept (default: 1)

ODgenes

Boolean to use getOverdispersedGenes() for the corpus genes (default: TRUE)

nTopOD

integer number of top overdispersed genes to use (default: 1000). If fewer overdispersed genes than this are found, all of them are used. Set to NA to use all overdispersed genes.

od.genes.alpha

alpha parameter for getOverdispersedGenes(). Higher = less stringent, so more overdispersed genes are returned (default: 0.05)

gam.k

gam.k parameter for getOverdispersedGenes(). Dimension of the "basis" functions in the GAM used to fit, higher = "smoother" (default: 5)

verbose

control verbosity (default: TRUE)

plot

control if plots are returned (default: TRUE)
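The three cleanCounts() thresholds above (min.reads, min.lib.size, min.detected) can be illustrated with a base R sketch on a toy matrix. This is an assumption-laden illustration: it applies all thresholds to the original matrix at once and compares with >=, whereas the package may filter sequentially or use a strict comparison.

```r
# Toy pixel (row) x gene (column) count matrix; all values are made up
counts <- matrix(
  c(0, 0, 0,
    4, 0, 1,
    2, 0, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("p1", "p2", "p3"), c("g1", "g2", "g3"))
)

# Thresholds named after the documented arguments (defaults shown)
min.reads <- 1      # total reads a gene needs
min.lib.size <- 1   # total counts a pixel needs
min.detected <- 1   # number of pixels a gene must be detected in

# Assumption: thresholds are inclusive (>=) and computed on the input matrix
keep.pixels <- rowSums(counts) >= min.lib.size
keep.genes <- colSums(counts) >= min.reads &
  colSums(counts > 0) >= min.detected

cleaned <- counts[keep.pixels, keep.genes, drop = FALSE]
cleaned
```

Here the empty pixel "p1" and the undetected gene "g2" are dropped, leaving a 2 x 2 matrix.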

Value

A list that contains

  • corpus: (pixels x genes) matrix of the counts of the selected genes

  • slm: slam::as.simple_triplet_matrix(corpus); required format for topicmodels::LDA input

  • positions: matrix of x and y coordinates of pixels; rownames = pixels, colnames = "x", "y"
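To show what the slm element holds relative to corpus, the sketch below builds a simple triplet (row index, column index, value) representation by hand in base R. The list layout imitates the fields of a slam simple triplet matrix (i, j, v, nrow, ncol, dimnames) but is a plain list, not the real class; the toy corpus is invented.

```r
# Toy corpus: pixels (rows) x genes (columns); values are made up
corpus <- matrix(
  c(0, 3, 0,
    1, 0, 2),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("p1", "p2"), c("g1", "g2", "g3"))
)

# Locate the non-zero entries as (row, col) index pairs
nz <- which(corpus != 0, arr.ind = TRUE)

# Hand-built triplet form: only non-zero counts are stored
triplet <- list(
  i = unname(nz[, "row"]),
  j = unname(nz[, "col"]),
  v = corpus[nz],
  nrow = nrow(corpus),
  ncol = ncol(corpus),
  dimnames = dimnames(corpus)
)

# Rebuild the dense matrix to confirm nothing was lost
dense <- matrix(0, triplet$nrow, triplet$ncol, dimnames = triplet$dimnames)
dense[cbind(triplet$i, triplet$j)] <- triplet$v

# With slam installed, the real object would come from:
# slm <- slam::as.simple_triplet_matrix(corpus)
```

Storing only the non-zero entries is what makes this format compact for sparse count matrices such as ST data.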

Examples

data(mOB)
cd <- mOB$counts
corpus <- preprocess(t(cd), removeAbove = 0.95, removeBelow = 0.05)


JEFworks-Lab/STdeconvolve documentation built on Nov. 14, 2024, 7:24 p.m.