Filter samples with too many missing entries across multiple data sets

Description

This function checks the data for missing entries and identifies, in each set, the samples that pass two criteria on the maximum number of missing values: the fraction of missing values must be below a given threshold, and the total number of missing genes must be below a given threshold.

Usage

goodSamplesMS(multiExpr, 
      useSamples = NULL,
      useGenes = NULL,
      minFraction = 1/2,
      minNSamples = ..minNSamples,
      minNGenes = ..minNGenes,
      verbose = 1, indent = 0)

Arguments

multiExpr

expression data in the multi-set format (see checkSets). A vector of lists, one per set. Each set must contain a component data that contains the expression data, with rows corresponding to samples and columns to genes or probes.
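As an illustration of the multi-set format described above, the following minimal sketch builds a two-set multiExpr object in base R. The object names (expr1, expr2), the dimensions, and the random values are assumptions made for the example and do not come from the package documentation.

```r
# Toy data: two sets sharing the same 10 genes but with different samples.
set.seed(1)
nGenes <- 10
geneNames <- paste0("gene", seq_len(nGenes))

# Rows are samples, columns are genes (or probes).
expr1 <- matrix(rnorm(6 * nGenes), nrow = 6,
                dimnames = list(paste0("set1_sample", 1:6), geneNames))
expr2 <- matrix(rnorm(8 * nGenes), nrow = 8,
                dimnames = list(paste0("set2_sample", 1:8), geneNames))

# One list per set, each with a component named `data`.
multiExpr <- list(list(data = expr1), list(data = expr2))

str(multiExpr, max.level = 2)
```

The sets may contain different numbers of samples, but the columns (genes) are expected to correspond across sets.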

useSamples

optional specifications of which samples to use for the check. Should be a logical vector; samples whose entries are FALSE will be ignored for the missing value counts. Defaults to using all samples.

useGenes

optional specifications of genes for which to perform the check. Should be a logical vector; genes whose entries are FALSE will be ignored. Defaults to using all genes.

minFraction

minimum fraction of non-missing genes for a sample to be considered good.

minNSamples

minimum number of good samples for the data set to be considered fit for analysis. If the actual number of good samples falls below this threshold, an error will be issued.

minNGenes

minimum number of non-missing genes for a sample to be considered good.

verbose

integer level of verbosity. Zero means silent; higher values make the output progressively more verbose.

indent

indentation for diagnostic messages. Zero means no indentation, each unit adds two spaces.

Details

The constants ..minNSamples and ..minNGenes are both set to the value 4. For most data sets, the criterion on the fraction of missing values will be much more stringent than the criterion on the absolute number of missing values.
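The two per-sample criteria can be sketched in base R as follows. This is a simplified illustration, not the package's implementation: the helper name sketchGoodSamples is hypothetical, the useSamples and useGenes arguments are ignored, and the use of >= (rather than >) in the comparisons is an assumption.

```r
# Hypothetical helper illustrating the two criteria; not part of WGCNA.
sketchGoodSamples <- function(multiExpr, minFraction = 1/2, minNGenes = 4) {
  lapply(multiExpr, function(set) {
    x <- as.matrix(set$data)
    nPresent <- rowSums(!is.na(x))            # non-missing genes per sample
    fractionOK <- nPresent >= minFraction * ncol(x)
    countOK <- nPresent >= minNGenes
    fractionOK & countOK
  })
}

# Toy data: 6 samples x 10 genes in each of two sets.
set.seed(2)
a <- matrix(rnorm(60), nrow = 6)
b <- matrix(rnorm(60), nrow = 6)
a[2, 1:8] <- NA   # sample 2 of set 1 keeps only 2 genes: fails both criteria
b[5, 1:4] <- NA   # sample 5 of set 2 keeps 6 genes: passes both criteria

res <- sketchGoodSamples(list(list(data = a), list(data = b)))
print(res)
```

With 10 genes and minFraction = 1/2, the fraction criterion requires at least 5 non-missing genes, which is already stricter than the absolute minimum of 4. The gap widens as the number of genes grows.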

Value

A list with one component per input set. Each component is a logical vector with one entry per sample in the corresponding set, indicating whether the sample passed the missing value criteria.
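A minimal usage sketch, assuming the WGCNA package is installed, shows how the returned list might be used to drop flagged samples. The toy data and the names x1, x2, and cleaned are assumptions made for the example; the call is guarded so that the snippet also runs where WGCNA is unavailable.

```r
# Toy data: two sets of 6 samples x 10 genes; values are arbitrary.
set.seed(3)
x1 <- matrix(rnorm(60), nrow = 6)
x2 <- matrix(rnorm(60), nrow = 6)
x1[2, 1:8] <- NA   # make one sample in set 1 mostly missing
multiExpr <- list(list(data = x1), list(data = x2))

if (requireNamespace("WGCNA", quietly = TRUE)) {
  good <- WGCNA::goodSamplesMS(multiExpr, verbose = 0)

  # Keep only the samples flagged TRUE in each set.
  cleaned <- lapply(seq_along(multiExpr), function(i) {
    list(data = multiExpr[[i]]$data[good[[i]], , drop = FALSE])
  })
  print(sapply(cleaned, function(s) nrow(s$data)))
}
```

In this toy input, one would expect the second sample of the first set to be the one flagged, since it has only 2 of 10 genes present.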

Author(s)

Peter Langfelder and Steve Horvath

See Also

goodGenes, goodSamples, goodSamplesGenes for cleaning individual sets separately;

goodGenesMS, goodSamplesGenesMS for additional cleaning of multiple data sets together.
