data_sanity_check: Data Sanity and Integrity Check


Determine whether the input data are in the correct format.


data_sanity_check(
  data,
  taxa_are_rows = TRUE,
  assay.type = assay_name,
  assay_name = "counts",
  rank = tax_level,
  tax_level = NULL,
  aggregate_data = NULL,
  meta_data = NULL,
  fix_formula,
  group = NULL,
  struc_zero = FALSE,
  global = FALSE,
  pairwise = FALSE,
  dunnet = FALSE,
  mdfdr_control = list(fwer_ctrl_method = "holm", B = 100),
  trend = FALSE,
  trend_control = list(contrast = NULL, node = NULL, solver = "ECOS", B = 100),
  verbose = TRUE
)



data: the input data. The data parameter should be a matrix, data.frame, phyloseq, or TreeSummarizedExperiment object. Both phyloseq and TreeSummarizedExperiment objects consist of a feature table (microbial count table), a sample metadata table, a taxonomy table (optional), and a phylogenetic tree (optional). If a matrix or data.frame is provided, ensure that the row names of the metadata match the sample names in data (column names if taxa_are_rows is TRUE, and row names otherwise). If a phyloseq or TreeSummarizedExperiment object is used, this standard has already been enforced. For detailed information, refer to ?phyloseq::phyloseq or ?TreeSummarizedExperiment::TreeSummarizedExperiment. It is recommended to use low taxonomic levels, such as OTU or species level, because the estimation of sampling fractions requires a large number of taxa.
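
The name-matching requirement for matrix or data.frame input can be checked before calling the function. The following is a minimal base-R sketch, not the package's own check: `counts`, `meta`, and the helper `sample_names_match()` are hypothetical names introduced for illustration, and the sketch only tests that the two sets of names are equal (whether their order also matters is not stated here).

```r
# Hypothetical toy data: 3 taxa (rows) x 4 samples (columns)
counts <- matrix(c(10, 0, 3, 5,
                    2, 8, 0, 1,
                    0, 4, 7, 9),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("taxon", 1:3), paste0("S", 1:4)))

# Hypothetical sample metadata whose row names are the sample names
meta <- data.frame(age = c(31, 45, 52, 28),
                   row.names = paste0("S", 1:4))

# Hypothetical helper: TRUE when the metadata row names equal the sample names
sample_names_match <- function(x, meta, taxa_are_rows = TRUE) {
  samples <- if (taxa_are_rows) colnames(x) else rownames(x)
  setequal(samples, rownames(meta))
}

sample_names_match(counts, meta)                            # taxa in rows
sample_names_match(t(counts), meta, taxa_are_rows = FALSE)  # taxa in columns
```

Passing the wrong orientation flag makes the helper compare taxon names against the metadata, which is a quick way to spot a transposed table.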


taxa_are_rows: logical. Whether taxa are positioned in the rows of the feature table. Default is TRUE.


assay.type: alias for assay_name.


assay_name: character. Name of the count table in the data object (only applicable if the data object is a (Tree)SummarizedExperiment). Default is "counts". See ?SummarizedExperiment::assay for more details.


rank: alias for tax_level.


tax_level: character. The taxonomic or non-taxonomic (rowData) level of interest. The input data can be analyzed at any taxonomic or rowData level without prior agglomeration. Note that tax_level must be a value from taxonomyRanks or rowData, which includes "Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species", etc. See ?mia::taxonomyRanks for more details. Default is NULL, i.e., no agglomeration is performed and the ANCOM-BC2 analysis is carried out at the lowest taxonomic level of the input data.


aggregate_data: the abundance data that has been aggregated to the desired taxonomic level. This parameter is required only when the input data is in matrix or data.frame format. For phyloseq or TreeSummarizedExperiment data, aggregation is performed by specifying the tax_level parameter.
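
For matrix input, the aggregated table has to be prepared by the caller. One way to do this in base R is sketched below, using `rowsum()` to add up the counts of all rows that share a label. The objects `otu_counts` and `family` are hypothetical, and the sketch assumes that the aggregated table keeps the same orientation (taxa in rows) as the original data.

```r
# Hypothetical OTU-level counts: 4 OTUs (rows) x 3 samples (columns)
otu_counts <- matrix(c(5, 0, 2,
                       1, 3, 0,
                       0, 7, 4,
                       2, 2, 2),
                     nrow = 4, byrow = TRUE,
                     dimnames = list(paste0("OTU", 1:4), paste0("S", 1:3)))

# Hypothetical family assignment for each OTU, in row order
family <- c("Bacteroidaceae", "Bacteroidaceae",
            "Lachnospiraceae", "Lachnospiraceae")

# rowsum() sums the rows that share the same group label
family_counts <- rowsum(otu_counts, group = family)
family_counts
```

Per-sample totals are unchanged by the aggregation, which gives a simple consistency check: `colSums(family_counts)` should equal `colSums(otu_counts)`.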


meta_data: a data.frame containing sample metadata. This parameter is mandatory when the input data is a generic matrix or data.frame. Ensure that the row names of the metadata match the sample names in data (column names if taxa_are_rows is TRUE, and row names otherwise).


fix_formula: a character string expressing how the microbial absolute abundances for each taxon depend on the fixed effects in the metadata. When specifying fix_formula, make sure to include the group variable in the formula if group is not NULL.
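
Whether the group variable appears in the formula can be verified with base R before the call. This is a minimal sketch rather than the package's internal check: `formula_vars()` is a hypothetical helper, and the formula and group values are the ones used in the example call on this page.

```r
# Values taken from the example call on this page
fix_formula <- "age + sex + bmi_group"
group <- "bmi_group"

# Hypothetical helper: parse the right-hand side and list its variables
formula_vars <- function(fix_formula) {
  all.vars(stats::as.formula(paste("~", fix_formula)))
}

vars <- formula_vars(fix_formula)

# The requirement only applies when group is not NULL
group_in_formula <- is.null(group) || group %in% vars
group_in_formula
```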


group: character. The name of the group variable in the metadata. The group variable should be discrete, i.e., it should consist of categorical values. Specifying group is required if you are interested in detecting structural zeros or performing multi-group comparisons (global test, pairwise directional test, Dunnett's type of test, and trend test). If these analyses are not of interest, you can leave group as NULL. If the group variable of interest contains only two categories, you can also leave group as NULL. Default is NULL.
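
A candidate group column can be inspected for discreteness and number of categories before it is passed in. The sketch below uses hypothetical metadata and a hypothetical helper `is_discrete()`; treating character, factor, and logical columns as discrete is an assumption made for illustration, not the package's definition.

```r
# Hypothetical metadata with one numeric covariate and one categorical variable
meta <- data.frame(
  age = c(31, 45, 52, 28, 60, 39),
  bmi_group = c("lean", "obese", "overweight", "lean", "obese", "overweight"),
  stringsAsFactors = FALSE
)
group <- "bmi_group"

# Hypothetical helper: character, factor, and logical columns count as discrete
is_discrete <- function(x) is.character(x) || is.factor(x) || is.logical(x)

# Number of distinct categories, ignoring missing values
n_levels <- length(unique(stats::na.omit(meta[[group]])))

is_discrete(meta[[group]])  # the group column is categorical
is_discrete(meta$age)       # a numeric covariate is not
n_levels                    # more than two categories here
```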


struc_zero: logical. Whether to detect structural zeros based on group. Default is FALSE. See Details for a more comprehensive discussion on structural zeros.


global: logical. Whether to perform the global test. Default is FALSE.


pairwise: logical. Whether to perform the pairwise directional test. Default is FALSE.


dunnet: logical. Whether to perform Dunnett's type of test. Default is FALSE.


mdfdr_control: a named list of control parameters for the mixed directional false discovery rate (mdFDR), including 1) fwer_ctrl_method: the family-wise error rate (FWER) controlling procedure, such as "holm", "hochberg", "bonferroni", etc. (default is "holm"), and 2) B: the number of bootstrap samples (default is 100). Increasing B leads to more accurate p-values. See Details for a more comprehensive discussion on mdFDR.
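
The control list is an ordinary named list. The sketch below builds one with a larger B and illustrates what a Holm adjustment does to a few hypothetical p-values. It assumes that the accepted method names are those of `stats::p.adjust`, which the listed examples ("holm", "hochberg", "bonferroni") suggest but this page does not state outright.

```r
# Control list with the two documented fields; B raised above the default of 100
mdfdr_control <- list(fwer_ctrl_method = "holm", B = 1000)

# Assumption: valid method names are those accepted by stats::p.adjust
method_ok <- mdfdr_control$fwer_ctrl_method %in% stats::p.adjust.methods

# Illustration only: Holm adjustment of three hypothetical p-values
p_raw <- c(0.01, 0.04, 0.03)
p_holm <- stats::p.adjust(p_raw, method = mdfdr_control$fwer_ctrl_method)
p_holm
```

A larger B increases run time roughly in proportion, so it may be worth starting with the default and raising it for the final analysis.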


trend: logical. Whether to perform the trend test. Default is FALSE.


trend_control: a named list of control parameters for the trend test, including 1) contrast: the list of contrast matrices for constructing inequalities, 2) node: the list of positions for the nodal parameter, 3) solver: a string indicating the solver to use (default is "ECOS"), and 4) B: the number of bootstrap samples (default is 100). Increasing B leads to more accurate p-values. See the vignette for the corresponding trend test examples.
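
As a rough illustration of how a contrast matrix can encode inequalities, the sketch below assumes a group variable with three ordered categories, so that two coefficients (written b1 and b2 here) describe the non-reference groups. The reading of the rows as inequalities, the variable names, and the choice `node = list(2)` are assumptions made for this sketch; the vignette remains the authoritative source for trend test settings.

```r
# Assumed setting: three ordered groups, hence two coefficients (b1, b2)
# relative to the reference group.
# Row 1 encodes b1 >= 0, row 2 encodes b2 - b1 >= 0 (monotonic increase).
increasing <- matrix(c( 1, 0,
                       -1, 1),
                     nrow = 2, byrow = TRUE)

# node = list(2) is an assumed nodal position for this single contrast
trend_control <- list(contrast = list(increasing),
                      node = list(2),
                      solver = "ECOS",
                      B = 100)

# Hypothetical coefficients: check whether they satisfy both inequalities
beta_up   <- c(0.5, 1.2)
beta_down <- c(1.2, 0.5)
all(increasing %*% beta_up >= 0)    # increasing pattern satisfies the contrast
all(increasing %*% beta_down >= 0)  # non-monotone pattern does not
```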


verbose: logical. Whether to display detailed progress messages. Default is TRUE.


a list containing the outputs formatted appropriately for downstream analysis.


Huang Lin


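# Assumes the ANCOMBC and microbiome packages are installed
library(ANCOMBC)

# atlas1006 is an example data set shipped with the microbiome package
data(atlas1006, package = "microbiome")

# Check the input at the Family level; group is included in fix_formula
check_results = data_sanity_check(data = atlas1006,
                                  tax_level = "Family",
                                  fix_formula = "age + sex + bmi_group",
                                  group = "bmi_group",
                                  struc_zero = TRUE,
                                  global = TRUE,
                                  verbose = TRUE)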

