preprocess | R Documentation |
Takes a pixel (rows) x gene (columns) matrix and filters out poor genes and pixels, then selects the genes to include in the final corpus used as input to LDA. If the pixel IDs encode their positions as "XxY" (a characteristic of Stahl datasets), these can be extracted as the pixel position coordinates.
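As an illustration of what extracting positions from "XxY" pixel IDs involves, here is a minimal base R sketch. The pixel IDs are made up, and this is not necessarily how `preprocess` does it internally; it only shows the idea of splitting each ID on the literal "x".

```r
# Hypothetical pixel IDs in the "XxY" form (values are made up)
pixel.ids <- c("16.92x9.015", "16.945x11.075", "18.01x10.04")

# Split each ID on the literal "x"; rbind gives one row per pixel
xy <- do.call(rbind, strsplit(pixel.ids, "x", fixed = TRUE))

# Convert to a numeric matrix with pixels as rownames and "x", "y" as colnames
positions <- matrix(as.numeric(xy), ncol = 2,
                    dimnames = list(pixel.ids, c("x", "y")))
positions
```

The resulting matrix has the same shape as the `positions` element described in the return value: one row per pixel and columns "x" and "y".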
Order of filtering options:
1. Selection to use specific genes only
2. `cleanCounts` to remove poor pixels and genes
3. Remove top expressed genes in matrix
4. Remove specific genes based on grepl pattern matching
5. Remove genes that appear in more/less than a percentage of pixels
6. Use the over dispersed genes computed from the remaining genes after filtering steps 1-5 (if selected)
7. Choice to use the top over dispersed genes based on -log10(p.adj)
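Step 4 relies on `grepl` pattern matching. The sketch below shows, on a made-up gene vector, how a set of regular expressions can flag genes for removal. The gene names and the way the patterns are combined are assumptions for illustration, not the function's actual internals.

```r
# Hypothetical gene names and removal patterns
genes <- c("mt-Co1", "Rpl13", "Gad1", "Mrpl15", "Camk4")
patterns <- c("^mt-", "^Rpl", "^Mrpl")

# Combine the patterns into one alternation; a gene is dropped if it matches any
to.drop <- grepl(paste(patterns, collapse = "|"), genes)
kept <- genes[!to.drop]
kept
```

Because the patterns are anchored with `^`, only genes whose names start with the given prefix are removed; matching is case sensitive, so "^mt-" and "^MT" target different naming conventions.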
preprocess(
dat,
extractPos = FALSE,
selected.genes = NA,
nTopGenes = NA,
genes.to.remove = NA,
removeAbove = NA,
removeBelow = NA,
min.reads = 1,
min.lib.size = 1,
min.detected = 1,
ODgenes = TRUE,
nTopOD = 1000,
od.genes.alpha = 0.05,
gam.k = 5,
verbose = TRUE,
plot = TRUE
)
dat |
pixel (rows) x gene (columns) matrix of gene counts, or a path to it |
extractPos |
Boolean to extract pixel positional coordinates from pixel names (default: FALSE) |
selected.genes |
vector of gene names to use specifically for the corpus (default: NA) |
nTopGenes |
integer for number of top expressed genes to remove (default: NA) |
genes.to.remove |
vector of gene names or patterns for matching to genes to remove (default: NA). ex: c("^mt-") or c("^MT", "^RPL", "^MRPL") |
removeAbove |
non-negative numeric <=1 to use as a percentage. Removes genes present in this fraction or more of pixels (default: NA) |
removeBelow |
non-negative numeric <=1 to use as a percentage. Removes genes present in this fraction or less of pixels (default: NA) |
min.reads |
cleanCounts() param; minimum number of reads to keep a gene (default: 1) |
min.lib.size |
cleanCounts() param; minimum number of counts a pixel needs to be kept (default: 1) |
min.detected |
cleanCounts() param; minimum number of pixels a gene needs to be detected in to be kept (default: 1) |
ODgenes |
Boolean to use getOverdispersedGenes() for the corpus genes (default: TRUE) |
nTopOD |
integer number of top over dispersed genes to use (default: 1000). If there are fewer overdispersed genes than this number, all of them are used. Set to NA to use all overdispersed genes. |
od.genes.alpha |
alpha parameter for getOverdispersedGenes(). Higher = less stringent and more over dispersed genes returned (default: 0.05) |
gam.k |
gam.k parameter for getOverdispersedGenes(). Dimension of the "basis" functions in the GAM used to fit, higher = "smoother" (default: 5) |
verbose |
control verbosity (default: TRUE) |
plot |
control if plots are returned (default: TRUE) |
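To make the `removeAbove` and `removeBelow` thresholds concrete, the following base R sketch applies them to a toy count matrix. The matrix values and threshold choices are made up, and the code follows the argument descriptions above (genes at or above / at or below the fraction are removed) rather than the package source.

```r
# Toy pixel (rows) x gene (columns) count matrix; values are made up
counts <- matrix(c(5, 0, 1,
                   3, 0, 0,
                   2, 0, 4,
                   7, 1, 0),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("pixel", 1:4),
                                 c("geneA", "geneB", "geneC")))

# Fraction of pixels in which each gene has at least one count
frac <- colSums(counts > 0) / nrow(counts)

removeAbove <- 0.95  # drop genes present in this fraction or more of pixels
removeBelow <- 0.25  # drop genes present in this fraction or less of pixels

keep <- frac < removeAbove & frac > removeBelow
filtered <- counts[, keep, drop = FALSE]
colnames(filtered)
```

Here geneA is detected in every pixel and geneB in only one of four, so both are removed and only geneC remains.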
A list that contains
corpus: (pixels x genes) matrix of the counts of the selected genes
slm: slam::as.simple_triplet_matrix(corpus); required format for topicmodels::LDA input
positions: matrix of x and y coordinates of pixels; rownames = pixels, colnames = "x", "y"
data(mOB)
cd <- mOB$counts
corpus <- preprocess(t(cd), removeAbove = 0.95, removeBelow = 0.05)