prepanel: Preliminary Data Cleaning and Preparation

View source: R/BiomeHorizon.R

prepanel R Documentation

Preliminary Data Cleaning and Preparation

Description

This function prepares the OTU table and additional datasets for analysis with the horizonplot() function.

Usage

prepanel(
  otudata,
  metadata = NA,
  taxonomydata = NA,
  thresh_prevalence = 80,
  thresh_abundance = 0.5,
  thresh_abundance_override = NA,
  thresh_NA = 5,
  regularInterval = NA,
  maxGap = NA,
  minSamplesPerFacet = 2,
  otulist = NA,
  subj = NA,
  singleVarOTU = NA,
  band.thickness = NA,
  origin = NA,
  facetLabelsByTaxonomy = FALSE,
  customFacetLabels = NA,
  interpolate_NA = TRUE,
  formatStep = FALSE,
  nbands = 4
)

Arguments

otudata

Data frame representing OTU Table. Assumes first column contains OTU IDs, and all other columns are numeric vectors containing the number of sample reads for each OTU. Values can also be represented as proportions or percentages of the total sample for each OTU.

metadata

Data frame representing the metadata table; matches samples to collection dates, and to subject names if applicable. If this data frame is supplied, the columns with sample IDs, collection dates and subject names must be named "sample", "collection_date" and "subject", respectively. collection_date must be of class numeric or Date.
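
The layout described for otudata and metadata can be illustrated with toy data. This is only a sketch: the names otu_toy and meta_toy and all values are made up for illustration, and the checks at the end are a hypothetical way to verify the documented format, not code from the package.

```r
# Toy OTU table: first column holds OTU IDs, every other column is one sample.
otu_toy <- data.frame(
  otuid = c("taxon 1", "taxon 2", "taxon 3"),
  s1 = c(120, 30, 0),
  s2 = c(100, 45, 5),
  s3 = c(90, 60, 0),
  s4 = c(110, 20, 10),
  stringsAsFactors = FALSE
)

# Toy metadata: one row per sample, using the documented column names.
meta_toy <- data.frame(
  sample = c("s1", "s2", "s3", "s4"),
  collection_date = as.Date(c("2020-01-01", "2020-01-03",
                              "2020-01-08", "2020-01-10")),
  subject = c("A", "A", "B", "B"),
  stringsAsFactors = FALSE
)

# Hypothetical format checks mirroring the requirements stated above.
all_numeric   <- all(vapply(otu_toy[-1], is.numeric, logical(1)))
names_ok      <- all(c("sample", "collection_date", "subject") %in% names(meta_toy))
dates_ok      <- is.numeric(meta_toy$collection_date) ||
                 inherits(meta_toy$collection_date, "Date")
samples_match <- setequal(names(otu_toy)[-1], meta_toy$sample)
```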

taxonomydata

Taxonomy information for OTUs, used for labeling facets. There are two options:

  • A data frame with columns as taxonomic levels, plus a column for OTU IDs. Assumes first column contains OTU IDs, and all other columns are character vectors. If OTUs are classified to different taxonomic depths (e.g. one is specified down to Genus and another only to Phylum), use NA for the unspecified levels.

  • A vector where each element contains the entire taxonomy for an OTU, with taxonomic levels separated by semicolons. The order of this vector should match the order of OTU IDs specified in otudata.

Taxonomic levels should start from Kingdom and can go as far as Subspecies. Defaults to NA (do not label by taxonomy).
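
The two accepted forms carry the same information. The sketch below, on made-up taxonomy strings, converts the semicolon-separated vector form into the NA-padded data frame form and extracts the most specific level available for each OTU. The object names and the choice of six levels are assumptions for illustration only.

```r
# Vector form: one string per OTU, levels separated by semicolons (toy data).
tax_vec <- c(
  "Bacteria;Firmicutes;Clostridia;Clostridiales;Lachnospiraceae;Blautia",
  "Bacteria;Bacteroidetes"
)
level_names <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus")

# Split each string into its taxonomic levels.
levels_split <- strsplit(tax_vec, ";", fixed = TRUE)

# Most specific classification available for each OTU.
most_specific <- vapply(levels_split, function(x) x[length(x)], character(1))

# Data frame form: indexing past the end of a vector yields NA,
# which pads the levels that are not specified.
padded <- t(vapply(levels_split,
                   function(x) x[seq_along(level_names)],
                   character(length(level_names))))
colnames(padded) <- level_names
tax_df <- data.frame(otuid = c("taxon 1", "taxon 2"), padded,
                     stringsAsFactors = FALSE)
```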

thresh_prevalence

numeric threshold for OTU filtering. Minimum % of total samples in which OTU must be present to be included in analysis (defaults to 80).

thresh_abundance

numeric threshold for OTU filtering. Minimum % of total sample reads the OTU must constitute to be included in analysis (defaults to 0.5).

thresh_abundance_override

numeric threshold for OTU filtering. Minimum % of total sample reads the OTU must constitute to override all other standards, and be included in analysis (defaults to NA: disabled).

thresh_NA

numeric threshold for OTU filtering. Maximum % of samples with missing data (defaults to 5).

regularInterval

integer. For regularized data, this specifies the fixed interval of days separating each sample timepoint. If this value is 20, for example, new timepoints will be created at 1, 21, 41, 61, etc. To leave data irregularly spaced, do not specify a number here. Defaults to NA (do not regularize).

maxGap

numeric specifying the maximum number of days between the previous and subsequent irregular timepoints in order to interpolate a new timepoint. If the distance between the nearest time points exceeds the threshold specified by maxGap, all OTU values for that time point will be set to NA, and a scale break in the time axis will appear on the horizon plot. Must be an integer > 0.
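
How regularInterval and maxGap interact can be sketched in base R on toy data. The use of linear interpolation via approx() and the exact gap rule (distance between the nearest observed days on either side of a new timepoint) are assumptions made for this illustration; the package may compute these differently.

```r
observed_days <- c(1, 18, 25, 58, 75)   # irregular sampling days (toy data)
values        <- c(10, 12, 14, 20, 22)  # one OTU's values on those days
regular_interval <- 20
max_gap <- 20

# Regular timepoints at 1, 21, 41, 61, as in the regularInterval description.
new_days <- seq(from = 1, to = max(observed_days), by = regular_interval)

# Nearest observed day at or before, and at or after, each new timepoint.
prev_day <- vapply(new_days, function(d) max(observed_days[observed_days <= d]),
                   numeric(1))
next_day <- vapply(new_days, function(d) min(observed_days[observed_days >= d]),
                   numeric(1))
gap <- next_day - prev_day

# Assumed linear interpolation; timepoints whose gap exceeds max_gap become NA.
interpolated <- approx(observed_days, values, xout = new_days)$y
interpolated[gap > max_gap] <- NA
```

In this toy series only day 41 falls in a gap wider than max_gap (between days 25 and 58), so it is the only timepoint set to NA.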

minSamplesPerFacet

numeric. For regularized data with breaks in the time axis, specifies the minimum number of samples required of each facet time interval. Facets without this many timepoints will be removed. Defaults to 2.

otulist

character vector specifying OTU IDs for manual selection. Also determines the order from top to bottom of OTU panels displayed on the horizon plot. Defaults to NA (use filtering thresholds). In this case, OTU panels will be ordered alphabetically by OTU ID.

subj

character, used for datasets with multiple individual microbiomes. Filter samples to this subject or subjects. In most cases, you should specify just one subject, but if single OTU analysis is enabled you can select multiple subjects. Subject names should be described in metadata under the variable "subject". Defaults to NA (assume all samples are from one individual; do not filter by subject name).

singleVarOTU

character string specifying an OTU ID for facetting by subject. Facetting by subject requires metadata with columns on sample and subject, with an equal number of samples for each subject. If collection dates are provided, they must be identical for each subject. If they are not provided, the function assumes samples are ordered chronologically. A subset of subjects may be selected for analysis by supplying a vector of multiple subjects to subj.
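
Conceptually, single OTU analysis reshapes one OTU's values into one series per subject. The following base R sketch shows that reshaping on made-up data; the object names are hypothetical and the package's internal representation may differ.

```r
# One OTU's values across six samples (toy data).
otu_row <- c(s1 = 10, s2 = 12, s3 = 7, s4 = 9, s5 = 11, s6 = 8)

# Metadata mapping each sample to a subject; samples are assumed to be
# in chronological order within each subject.
meta <- data.frame(
  sample  = paste0("s", 1:6),
  subject = rep(c("A", "B"), each = 3),
  stringsAsFactors = FALSE
)

# The documented requirement: an equal number of samples per subject.
equal_counts <- length(unique(as.vector(table(meta$subject)))) == 1

# One row per subject, one column per timepoint.
by_subject <- do.call(rbind, split(unname(otu_row[meta$sample]), meta$subject))
```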

band.thickness

The height of each horizontal band (denoted by a unique color), i.e. the size of the scale of a horizon subplot. There are three options:

  • If NA, the default, the band thickness is computed with function(y) {max((abs(y - origin(y))), na.rm=TRUE) / nbands}, i.e. the largest absolute deviation of an OTU's sample values from its origin, divided by the number of bands.

  • A function will be called with a single argument, the sample values for one OTU, to evaluate a unique band thickness for each panel based on its sample values. The return value must be numeric.

  • A numeric constant, providing a fixed band thickness for all OTUs. This should be expressed as a percentage (0-100).
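
As a worked example of the default formula, the sketch below applies it to a made-up vector of sample values, using the median as the origin (the documented default for origin). The variable y and its values are illustrative only.

```r
nbands <- 4

# Default origin: median of the sample values for one OTU.
origin <- function(y) median(y, na.rm = TRUE)

# Default band thickness, as given in the first option above.
band_thickness <- function(y) {
  max(abs(y - origin(y)), na.rm = TRUE) / nbands
}

y <- c(2, 4, 5, 6, 13)           # toy sample values for one OTU
thickness <- band_thickness(y)   # median is 5, largest deviation is 8, 8 / 4 = 2
```

With this thickness, the most extreme value (13) sits exactly at the top of the fourth positive band.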

origin

The baseline (value=0, the base of the first positive band) for horizon subplots. There are three options:

  • If NA, the default, the origin will be evaluated separately for each OTU using the median of the sample values.

  • A function will be called with a single argument, a numeric vector representing the sample values for one OTU, to evaluate a unique origin for each panel. The return value must be numeric.

  • A numeric constant, providing a fixed origin value for all OTUs. This should be expressed as a percentage (0-100).
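
One way to think about the three options is that each is turned into a function of the sample values. The helper resolve_origin() below is hypothetical, written only to illustrate that idea; it is not part of the package.

```r
# Hypothetical helper: turn any of the three accepted forms into a function.
resolve_origin <- function(origin) {
  if (is.function(origin)) {
    return(origin)                               # custom function, used as is
  }
  if (length(origin) == 1 && is.na(origin)) {
    return(function(y) median(y, na.rm = TRUE))  # default: per-OTU median
  }
  if (is.numeric(origin) && length(origin) == 1) {
    return(function(y) origin)                   # fixed constant for all OTUs
  }
  stop("origin must be NA, a single number, or a function")
}

y <- c(1, 2, 9)  # toy sample values for one OTU
origin_default  <- resolve_origin(NA)(y)
origin_constant <- resolve_origin(5)(y)
origin_custom   <- resolve_origin(function(y) min(y, na.rm = TRUE))(y)
```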

facetLabelsByTaxonomy

If TRUE, label facets by taxonomy, using taxonomydata. Facets will be labelled using the most specific classification available for each OTU. If FALSE (default), label facets by OTU ID.

customFacetLabels

Use a custom character vector to label facets. Must be the same length as the number of OTUs post-filtering, or the number of subjects if single OTU analysis is enabled. Overrides facetLabelsByTaxonomy, but if set to NA (the default), facetLabelsByTaxonomy is used instead.

interpolate_NA

logical. How should NA values be dealt with? If TRUE (default), NA values are interpolated using previous and subsequent OTU values. If FALSE, they are set to value=0. Note that this only applies to sample timepoints that contain values for some OTUs; if a sample consists entirely of NAs, it will be treated as a break in the timescale (see maxGap).
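
The difference between the two settings can be seen on a toy series. The sketch assumes linear interpolation between neighbouring values, using approx() from the stats package; the documentation above does not state which interpolation method the package uses.

```r
y   <- c(4, NA, 8, 10, NA, NA, 16)  # one OTU's values over seven timepoints
idx <- seq_along(y)

# interpolate_NA = TRUE (assumed linear): fill NAs from neighbouring values.
filled <- approx(idx[!is.na(y)], y[!is.na(y)], xout = idx)$y

# interpolate_NA = FALSE: replace NAs with 0.
zeroed <- ifelse(is.na(y), 0, y)
```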

formatStep

If FALSE (default), horizon plot is a line graph. If TRUE, horizon plot is formatted as a step graph, with steps horizontal and then vertical.

nbands

integer specifying the number of positive bands (each denoted by a unique color) on each horizon subplot. For example, if you set nbands=4, there will be four positive bands and four negative bands, with 8 total colors. Must be an integer >=3. If nbands > 5, you must supply your own color palette of length 2 * nbands.
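
For nbands > 5, a palette of the required length can be generated with grDevices::colorRampPalette(). The sketch below only builds the colour vector; the three anchor colours are arbitrary choices, and how the palette is then supplied for plotting is not covered on this page.

```r
nbands <- 6

# Diverging ramp with 2 * nbands colours: nbands for the negative bands,
# nbands for the positive bands. The anchor colours are illustrative.
make_palette <- grDevices::colorRampPalette(c("#B2182B", "#F7F7F7", "#2166AC"))
band_colors  <- make_palette(2 * nbands)
```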

Details

The prepanel() function has 6 main purposes in preparing data sets and other parameters for the main horizonplot() function:

1) Filter the OTU table to the OTUs displayed on the final horizon plot, and to the samples of just one individual (for datasets with multiple subjects). By default, the "most important" OTUs are selected using four filtering thresholds: thresh_prevalence, thresh_abundance, thresh_abundance_override, and thresh_NA. They can also be manually specified as a vector of OTU IDs using otulist.

2) If single OTU analysis is enabled, convert the OTU table to values by subject for the OTU being analyzed.

3) Ensure data sets are formatted correctly.

4) Set the functions for finding the origin and horizon band thickness (band.thickness) of each OTU panel, if the default (NA) or a constant is entered.

5) Set other parameters to their defaults, and ensure correct data types are entered. For boolean values, NA is converted to FALSE.

6) Check for common user errors, such as entering ".8" rather than "80" as a percentage filtering threshold (this produces a warning message).

By default, OTUs are filtered automatically using two thresholds. An abundance threshold (thresh_abundance) sets the minimum average proportion an OTU must represent across all samples, and a prevalence threshold (thresh_prevalence) sets the minimum proportion of all samples where this OTU must be present (at least 1 sample read). These thresholds can be used in combination, or alone by setting one of them to 0 or NA.

In addition, you can set a second abundance threshold that overrides the prevalence threshold if it is reached, using thresh_abundance_override. This is useful for catching OTUs that are abundant for a brief period of time, but are absent from most of the samples, and are nevertheless important to include in analysis. This is disabled by default (thresh_abundance_override == NA).

Finally, a fourth filtering threshold, thresh_NA, filters out OTUs with missing data in a substantial fraction of the samples. This defaults to eliminating OTUs missing data in >5% of samples.
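
The four thresholds can be illustrated on a toy count matrix in base R. This is a sketch of the filtering logic as described above, not the package's code: the way the thresholds are combined (in particular that the override does not bypass thresh_NA) and the use of >= and <= for comparisons are assumptions.

```r
# Toy counts: rows are OTUs, columns are samples (each column sums to 100).
counts <- matrix(c(
  50, 40, 60, 55, 45,   # taxon 1: abundant and always present
   0,  0, 30,  0,  0,   # taxon 2: brief bloom in a single sample
   1,  0,  1,  0,  1,   # taxon 3: moderately prevalent but not in 80% of samples
  49, 60,  9, 45, 54    # taxon 4: abundant and always present
), nrow = 4, byrow = TRUE,
dimnames = list(paste("taxon", 1:4), paste0("s", 1:5)))

thresh_prevalence         <- 80
thresh_abundance          <- 0.5
thresh_abundance_override <- 5
thresh_NA                 <- 5

# Convert reads to percentages of each sample's total.
pct <- sweep(counts, 2, colSums(counts, na.rm = TRUE), "/") * 100

prevalence <- rowMeans(counts > 0, na.rm = TRUE) * 100  # % of samples present
abundance  <- rowMeans(pct, na.rm = TRUE)               # mean % of reads
pct_na     <- rowMeans(is.na(counts)) * 100             # % of samples missing

# Assumed combination of the thresholds.
keep <- ((prevalence >= thresh_prevalence & abundance >= thresh_abundance) |
           abundance >= thresh_abundance_override) &
        pct_na <= thresh_NA
kept_otus <- rownames(counts)[keep]
```

Here taxon 2 is present in only 20% of samples but averages 6% of reads, so the override threshold retains it, while taxon 3 passes the abundance threshold (0.6%) but fails prevalence (60%) and is dropped.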

Alternatively, OTUs can be manually specified in otulist as a vector of OTU IDs. The order in which these are specified will also determine the arrangement of OTU panels on the horizon plot.

You can also compare a single OTU across multiple subjects, by specifying the OTU ID in singleVarOTU. This is useful for comparing the same timepoint across multiple individuals, rather than multiple OTUs or taxa.

Value

Returns a list containing the appropriate arguments for the horizonplot() function. Pass this list to horizonplot() to produce the graph. You should not need to alter any parameters in the list before using them in horizonplot(), but this preliminary function lets you inspect the refined parameters if horizonplot() raises an error.

Examples

# Pass just the OTU table to prepanel, and it will assume all samples belong
# to the same subject.
prepanel(otudata = otusample_diet)

# Supply metadata and a subject name, and it will select samples from
# just one subject (this is what you should do with more than one subject).
prepanel(otudata = otusample_diet, metadata = metadatasample_diet, subj="MCTs01")

# Pass taxonomydata to prepanel if you want to label facets by taxonomy
# rather than by OTU ID.
prepanel(otudata = otusample_diet, metadata = metadatasample_diet,
         taxonomydata = taxonomysample_diet, subj="MCTs01", facetLabelsByTaxonomy=TRUE)

# OTU filtering using both a prevalence and an abundance standard (default)
prepanel(otudata = otusample_diet, metadata = metadatasample_diet, subj="MCTs01",
         thresh_prevalence=75, thresh_abundance=0.75)

# OTU filtering using just an abundance standard
prepanel(otudata = otusample_diet, metadata = metadatasample_diet, subj="MCTs01",
         thresh_prevalence=NA, thresh_abundance=0.75)

# If an OTU's average abundance reaches a high enough threshold, override
# other standards and include it in analysis
prepanel(otudata = otusample_diet, metadata = metadatasample_diet, subj="MCTs01",
         thresh_prevalence=90, thresh_abundance=0.75, thresh_abundance_override=1.5)

# Filter out OTUs where >2% of samples are NA values
prepanel(otudata = otusample_diet, metadata = metadatasample_diet, subj="MCTs01",
         thresh_NA=2)

# You can also manually select OTUs by OTU ID
prepanel(otudata = otusample_diet, metadata = metadatasample_diet, subj="MCTs01",
         otulist=c("taxon 1", "taxon 2", "taxon 10", "taxon 14"))

# Manual selection can be used to specify the order OTUs will appear on
# the horizon plot. For example, these two datasets have identical OTUs, but
# they are ordered differently.
params <- prepanel(otudata = otusample_diet, metadata = metadatasample_diet,
                   subj="MCTs01", thresh_prevalence=95, thresh_abundance=1.5,
                   otulist=c("taxon 1", "taxon 2", "taxon 10", "taxon 14"))
params[[1]]$otuid
params <- prepanel(otudata = otusample_diet, metadata = metadatasample_diet,
                   subj="MCTs01", otulist=c("taxon 10", "taxon 2", "taxon 1", "taxon 14"))
params[[1]]$otuid

# The origin and band.thickness variables can be set to either a numeric
# constant or a function that evaluates separately for every OTU subpanel based
# on its sample values.

# Use a fixed origin of 5% for all OTU subpanels
prepanel(otudata = otusample_diet, metadata = metadatasample_diet,
         subj="MCTs01", origin=5)

# Evaluate a different origin for each OTU subpanel using a custom function
prepanel(otudata = otusample_diet, metadata = metadatasample_diet,
         subj="MCTs01", origin=function(y){mad(y, na.rm=TRUE)})


blekhmanlab/biomehorizon documentation built on Nov. 8, 2023, 12:16 a.m.