estimate_saturation: Estimate saturation of genes based on rarefaction of reads


estimate_saturation    R Documentation

Estimate saturation of genes based on rarefaction of reads

Description

Estimate the saturation of gene detection based on rarefaction of the mapped read counts from each library in a read counts object. This function takes the read counts for each library and sequentially rarefies them at different levels to determine how thoroughly genes are being sampled. Optional settings include the number of intermediate points to sample (default=6), the number of times to sample at each depth (default=5), and the minimum number of counts for a gene to be counted as "detected" (default=1).

Usage

estimate_saturation(
  counts,
  max_reads = Inf,
  method = "sampling",
  ndepths = 6,
  nreps = 5,
  min_counts = 1,
  min_cpm = NULL,
  verbose = FALSE
)

Arguments

counts

a numeric matrix (or object that can be coerced to a matrix) containing read counts, or an object from which counts can be extracted. Should have genes in rows and samples in columns.

max_reads

the maximum number of reads to sample at. With the default (Inf), the maximum of the total read counts across all libraries is used.

method

character, either "division" or "sampling". Method "sampling" is slower but more realistic, and yields smoother curves. Method "division" is faster but coarser and less realistic. See Details for a more complete description.

ndepths

the number of depths to sample at. A depth of 0 is always included.

nreps

the number of samples to take for each library at each depth. With well-sampled libraries, 1 should be sufficient. With poorly-sampled libraries, sampling variance may be substantial, requiring higher values.

min_counts

the minimum number of counts for a gene to be counted as detected. Genes with sample counts >= this value are considered detected. Defaults to 1. Set to NULL to use min_cpm.

min_cpm

the minimum counts per million for a gene to be counted as detected. Genes with counts per million >= this value are considered detected. Either this or min_counts should be specified, but not both; specifying both yields an error. Defaults to NULL.
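
To make the two detection thresholds concrete, the following sketch applies each one to a single toy library. It is an illustration only: the vector lib and its values are made up, and it assumes counts per million are computed as the gene count divided by the library total, times one million. The package's own calculation may differ.

```r
# Toy read counts for one library (hypothetical values, not real data)
lib <- c(geneA = 0, geneB = 1, geneC = 5, geneD = 994)

# Threshold on raw counts: a gene is detected when its count >= min_counts
min_counts <- 1
detected_counts <- lib >= min_counts

# Threshold on counts per million
# (assumed formula: count / library total * 1e6)
min_cpm <- 2000
cpm <- lib / sum(lib) * 1e6
detected_cpm <- cpm >= min_cpm

n_detected_counts <- sum(detected_counts)  # genes passing the raw-count threshold
n_detected_cpm <- sum(detected_cpm)        # genes passing the CPM threshold
```

In this toy library the raw-count threshold and the CPM threshold mark different sets of genes as detected, which is why only one of the two can be specified in a single call.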

verbose

logical, whether to output the status of the estimation.

Details

The method parameter determines the approach used to estimate the number of genes detected at different sequencing depths. Method "division" simply divides the counts for each gene by a series of scaling factors, then counts the genes whose adjusted counts meet the detection threshold. Method "sampling" generates a number of sets (nreps) of simulated counts for each library at each sequencing depth, by probabilistically simulating counts using the observed proportions. It then counts the number of genes that meet the detection threshold in each simulation, and takes the arithmetic mean of the values for each library at each depth.
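
The difference between the two methods can be sketched in base R for one library at one depth. This is not the package's code: the toy vector lib, the chosen depth, and the use of rmultinom() for the probabilistic simulation are all assumptions made for illustration.

```r
set.seed(1)  # make the simulated draws reproducible

# Toy counts for one library (hypothetical values, not real data)
lib <- c(g1 = 0, g2 = 1, g3 = 3, g4 = 16, g5 = 80, g6 = 900)
total <- sum(lib)   # library size
depth <- 250        # target sequencing depth to rarefy to
min_counts <- 1     # detection threshold on raw counts

# "division"-style estimate: scale every count down by total / depth,
# then count the genes whose scaled count still meets the threshold
scaled <- lib / (total / depth)
sat_division <- sum(scaled >= min_counts)

# "sampling"-style estimate: draw nreps simulated libraries of size `depth`
# from the observed proportions (the multinomial draw is an assumption),
# count the detected genes in each draw, then average across draws
nreps <- 5
sims <- rmultinom(nreps, size = depth, prob = lib / total)  # genes x nreps
detected_per_rep <- colSums(sims >= min_counts)
sat_sampling <- mean(detected_per_rep)
sat_sampling_var <- var(detected_per_rep)  # spread across replicates
```

The division estimate is deterministic, so low-count genes drop out abruptly once their scaled count falls below the threshold. The sampling estimate lets such genes appear in some replicates and not in others, which is consistent with the smoother curves described for method "sampling".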

Value

A data frame with one row for each sample at each depth (that is, the number of samples times the number of depths). Columns include "sample" (the sample identifier), "depth" (the depth value for that iteration), and "sat" (the number of genes detected at that depth for that sample). For method "sampling", it includes an additional column with the variance of the number of genes detected across all replicates of each sample at each depth.
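
A minimal usage sketch follows, assuming the RNAseQC package is installed and exports estimate_saturation() as documented here. The count matrix is made up for illustration, and the call is skipped when the package is not available.

```r
# Toy count matrix: 4 genes (rows) x 2 samples (columns); values are made up
counts <- matrix(
  c(0, 10, 250, 740,
    3,  0, 120, 877),
  nrow = 4,
  dimnames = list(paste0("gene", 1:4), c("lib1", "lib2"))
)

# Run the estimation only if the package is installed
if (requireNamespace("RNAseQC", quietly = TRUE)) {
  sat <- RNAseQC::estimate_saturation(
    counts,
    method = "sampling",
    ndepths = 6,
    nreps = 5,
    min_counts = 1
  )
  # One row per sample per depth; "sat" holds the number of genes detected
  print(head(sat))
}
```

Plotting "sat" against "depth" for each "sample" gives a saturation curve; a curve that flattens before the full depth suggests that additional reads would detect few new genes.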


BenaroyaResearch/RNAseQC documentation built on April 19, 2024, 7:38 p.m.