select_counts: Subsample the rows and columns of a count matrix.




It is a good idea to subsample the genes and samples from a real RNA-seq dataset at each iteration before applying thin_diff (and related functions), so that your conclusions do not depend on the specific structure of your dataset. This function is designed to do that subsampling efficiently.


select_counts(
  mat,
  nsamp = ncol(mat),
  ngene = nrow(mat),
  gselect = c("random", "max", "mean_max", "custom"),
  gvec = NULL,
  filter_first = FALSE,
  nskip = 0L
)



mat: A numeric matrix of RNA-seq counts. The rows index the genes and the columns index the samples.


nsamp: The number of samples (columns) to select from mat.


ngene: The number of genes (rows) to select from mat.


gselect: How should we select the subset of genes? Options include:


"random": Randomly select the genes, with each gene having an equal probability of being included in the subsampled matrix.


"max": Choose the ngene most median-expressed genes. Ties are broken by mean expression.


"mean_max": Choose the ngene most mean-expressed genes.


"custom": A user-specified list of genes. If gselect = "custom", then gvec must be non-NULL.


gvec: A logical vector of length nrow(mat). A TRUE in position i indicates that gene i is included in the smaller dataset. Hence, sum(gvec) should equal ngene.


filter_first: Should we first filter genes by the method of Chen et al. (2016) (TRUE) or not (FALSE)? If TRUE, then the edgeR package should be installed.


nskip: The number of median-maximally expressed genes to skip. Not used if gselect = "custom".
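
To make the gvec requirement concrete, here is a minimal base-R sketch that builds a logical inclusion vector and confirms that sum(gvec) matches the intended ngene. The toy matrix and the index set keep_idx are hypothetical and chosen only for illustration. The final call is left as a comment because it needs seqgendiff to be installed.

```r
## Toy count matrix: 20 genes (rows) by 6 samples (columns).
set.seed(1)
mat <- matrix(stats::rpois(20 * 6, lambda = 50), nrow = 20)

## Hypothetical choice of genes to keep (illustrative only).
keep_idx <- c(2, 5, 11, 17)

## Logical vector of length nrow(mat); TRUE marks an included gene.
gvec <- rep(FALSE, nrow(mat))
gvec[keep_idx] <- TRUE

## ngene must agree with the number of TRUE entries.
ngene <- sum(gvec)

## With seqgendiff installed, the call would then look like:
## submat <- select_counts(mat = mat, ngene = ngene,
##                         gselect = "custom", gvec = gvec)
```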


The samples (columns) are chosen randomly, with each sample having an equal probability of being in the sub-matrix. The genes are selected according to one of four schemes (see the description of the gselect argument).
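
The selection schemes can be illustrated with a short base-R sketch. This is not the package's implementation, only one plausible reading of the descriptions above. In particular, it ranks genes on the full matrix, which is an assumption: the text does not say whether ranking happens before or after the columns are subsampled. All variable names are illustrative.

```r
set.seed(1)
mat   <- matrix(stats::rpois(50 * 8, lambda = 20), nrow = 50)
nsamp <- 4
ngene <- 10

## Samples: uniform sampling of columns without replacement.
samp_idx <- sort(sample(seq_len(ncol(mat)), size = nsamp))

## "random": uniform sampling of rows without replacement.
gene_random <- sort(sample(seq_len(nrow(mat)), size = ngene))

## "max": rank by median expression, breaking ties by mean expression.
med <- apply(mat, 1, stats::median)
mu  <- rowMeans(mat)
gene_max <- order(med, mu, decreasing = TRUE)[seq_len(ngene)]

## "mean_max": rank by mean expression only.
gene_mean_max <- order(mu, decreasing = TRUE)[seq_len(ngene)]

## An ngene by nsamp sub-matrix under the "max" scheme.
submat <- mat[gene_max, samp_idx, drop = FALSE]
```

Passing both med and mu to order() is what implements the tie-breaking: the second key only matters when two genes share the same median.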

If you have edgeR installed, then some functionality is provided for filtering out the lowest expressed genes prior to applying subsampling (see the filter_first argument). This filtering scheme is described in Chen et al. (2016). If you want more control over this filtering, you should use the filterByExpr function from edgeR directly. You can install edgeR by following instructions at doi: 10.18129/B9.bioc.edgeR.
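
For readers who want to do the filtering themselves, the sketch below calls filterByExpr directly on a count matrix and then subsets the rows. It assumes default settings and a single group of samples; the simulated matrix and the fallback branch (which keeps every gene when edgeR is not installed) are illustrative additions, not part of either package.

```r
set.seed(1)
## 200 genes by 6 samples: the first 100 genes are lowly expressed,
## the last 100 are highly expressed.
mat <- matrix(stats::rpois(200 * 6, lambda = rep(c(2, 50), each = 100)),
              nrow = 200)

if (requireNamespace("edgeR", quietly = TRUE)) {
  ## Logical vector, TRUE for genes with enough counts to keep.
  keep <- edgeR::filterByExpr(mat)
} else {
  ## Fallback so the sketch still runs without edgeR: keep everything.
  keep <- rep(TRUE, nrow(mat))
}

## Filtered matrix, ready to pass on as mat with filter_first = FALSE.
mat_filt <- mat[keep, , drop = FALSE]
```

Calling filterByExpr yourself lets you adjust its thresholds (for example, its min.count argument) before subsampling.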


A numeric matrix, which is an ngene by nsamp sub-matrix of mat. If rownames(mat) is NULL, then the row names of the returned matrix are the indices in mat of the selected genes. If colnames(mat) is NULL, then the column names of the returned matrix are the indices in mat of the selected samples.
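
Because the returned dimnames are index labels when mat has none, they can be converted back to integer positions. The sketch below mimics that naming convention in base R with a hand-built stand-in for the returned matrix (it does not call select_counts), and then recovers the indices. The variable names are hypothetical, and the stand-in sorts its indices, which the real function may not do.

```r
set.seed(1)
mat <- matrix(stats::rpois(30 * 5, lambda = 50), nrow = 30)  # no dimnames

## Stand-in for a select_counts() result: subset, then label with indices.
gene_idx <- sort(sample(seq_len(nrow(mat)), size = 6))
samp_idx <- sort(sample(seq_len(ncol(mat)), size = 3))
submat   <- mat[gene_idx, samp_idx, drop = FALSE]
dimnames(submat) <- list(as.character(gene_idx), as.character(samp_idx))

## Recover the positions in mat from the (character) dimnames.
rec_genes <- as.integer(rownames(submat))
rec_samps <- as.integer(colnames(submat))
```

With the recovered indices, mat[rec_genes, rec_samps] reproduces the values of submat, which is a convenient sanity check when tracing results back to the original dataset.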


David Gerard


  • Chen, Yunshun, Aaron TL Lun, and Gordon K. Smyth. "From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline." F1000Research 5 (2016). doi: 10.12688/f1000research.8987.2.


library(seqgendiff)

## Simulate a matrix of counts to stand in for real data.
## In practice, you would obtain mat from a real dataset, not simulate it.
n   <- 100
p   <- 1000
mat <- matrix(stats::rpois(n * p, lambda = 50), nrow = p)

## Subsample the matrix, then feed it into a thinning function
submat <- select_counts(mat = mat, nsamp = 10, ngene = 100)
thout  <- thin_2group(mat = submat, prop_null = 0.5)

## The rownames and colnames (if NULL in mat) tell you which genes/samples
## were selected.
rownames(submat)
colnames(submat)

seqgendiff documentation built on March 18, 2022, 5:21 p.m.