ps_dedupe: De-duplicate phyloseq samples
In david-barnett/microViz: Microbiome Data Analysis and Visualization

View source: R/ps_dedupe.R

Description

Use one or more variables in the sample_data to identify and remove duplicate samples (leaving one sample per group).

methods:

method = "readcount" keeps the one sample in each duplicate group with the highest total number of reads (phyloseq::sample_sums)
method = "first" keeps the first sample in each duplicate group encountered in the row order of the sample_data
method = "last" keeps the last sample in each duplicate group encountered in the row order of the sample_data
method = "random" keeps a random sample from each duplicate group (call set.seed() beforehand for reproducibility)

More than one "duplicate" sample can be kept per group by setting n > 1.
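
The selection rules above can be sketched in base R on a plain data frame. This is only an illustration of the logic, not the package's implementation: the `samples` table, its column names, and the `keep_ids()` helper are all hypothetical, and the sketch returns ids grouped by subject rather than in original row order.

```r
# Toy sample table: hypothetical data, one row per sample.
samples <- data.frame(
  id      = c("s1", "s2", "s3", "s4", "s5"),
  subject = c("A",  "A",  "A",  "B",  "B"),
  reads   = c(100,  300,  200,  50,   80),
  stringsAsFactors = FALSE
)

# Hypothetical helper: ids of the samples kept in each group.
keep_ids <- function(df, method = "readcount", n = 1) {
  kept <- lapply(split(df, df$subject), function(g) {
    idx <- switch(method,
      readcount = order(-g$reads),        # highest read count first; ties stay in row order
      first     = seq_len(nrow(g)),       # row order as given
      last      = rev(seq_len(nrow(g))),  # reversed row order
      random    = sample(nrow(g)),        # random permutation of the rows
      stop("unknown method: ", method)
    )
    g$id[head(idx, n)]                    # keep at most n samples per group
  })
  unlist(kept, use.names = FALSE)
}

keep_ids(samples, "readcount")         # "s2" "s5"
keep_ids(samples, "first")             # "s1" "s4"
keep_ids(samples, "last")              # "s3" "s5"
keep_ids(samples, "readcount", n = 2)  # "s2" "s3" "s5" "s4"

set.seed(1)                            # makes the random choice repeatable
keep_ids(samples, "random")
```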

Usage

ps_dedupe(
  ps,
  vars,
  method = "readcount",
  verbose = TRUE,
  n = 1,
  .keep_group_var = FALSE,
  .keep_readcount = FALSE,
  .message_IDs = FALSE,
  .label_only = FALSE,
  .keep_all_taxa = FALSE
)

Arguments

`ps`	phyloseq object
`vars`	names of variables whose (combined) levels identify the groups from which only one sample (or n samples) should be kept
`method`	which sample to keep from each duplicate group: the one with the highest "readcount", the "first" or "last" encountered in sample_data row order, or a "random" one
`verbose`	print a message reporting the number of groups and the number of samples dropped?
`n`	number of 'duplicates' to keep per group, defaults to 1
`.keep_group_var`	keep grouping variable .GROUP. in phyloseq object?
`.keep_readcount`	keep readcount variable .READCOUNT. in phyloseq object?
`.message_IDs`	message the sample names of the dropped samples?
`.label_only`	if TRUE, the samples will NOT be filtered, just labelled with a new logical variable .KEEP_SAMPLE.
`.keep_all_taxa`	keep all taxa after removing duplicates? If FALSE, the default, taxa are removed if they never occur in any of the retained samples
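
As a rough illustration of the default `.keep_all_taxa = FALSE` behaviour, the base R sketch below drops taxa that have zero counts across the retained samples. The `counts` matrix and the choice of retained samples are hypothetical; the package itself works on phyloseq objects, and its internal approach may differ.

```r
# Hypothetical count matrix: taxa in rows, samples in columns.
counts <- matrix(
  c(5, 0, 0,
    0, 0, 7,
    2, 3, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("taxonA", "taxonB", "taxonC"), c("s1", "s2", "s3"))
)

# Pretend de-duplication retained s1 and s2, and dropped s3.
retained <- c("s1", "s2")
kept_counts <- counts[, retained, drop = FALSE]

# taxonB occurs only in the dropped sample s3, so it is removed.
pruned <- kept_counts[rowSums(kept_counts) > 0, , drop = FALSE]
rownames(pruned)  # "taxonA" "taxonC"
```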

Details

What happens when duplicated samples have exactly equal readcounts with method = "readcount"? The first encountered maximum is kept (in sample_data row order, like method = "first").
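
The tie-breaking rule matches how base R's `which.max()` behaves, which returns the position of the first maximum. The snippet below uses hypothetical read counts to show the idea; it is not a claim about how the function is written internally.

```r
# Hypothetical read counts for three replicates of one subject; s2 and s3 tie.
reads <- c(s1 = 120, s2 = 450, s3 = 450)

# which.max() returns the index of the first maximum, so the earlier row wins.
kept <- names(reads)[which.max(reads)]
print(kept)  # "s2"
```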

Value

phyloseq object

Examples

data("dietswap", package = "microbiome")

dietswap
# let's pretend the dietswap data contains technical replicates from each subject
# we want to keep only one of them
ps_dedupe(dietswap, vars = "subject", method = "readcount", verbose = TRUE)

# contrived example to show identifying "duplicates" via the interaction of multiple columns
ps1 <- ps_dedupe(
  ps = dietswap, method = "readcount", verbose = TRUE,
  vars = c("timepoint", "group", "bmi_group")
)
phyloseq::sample_data(ps1)

ps2 <- ps_dedupe(
  ps = dietswap, method = "first", verbose = TRUE,
  vars = c("timepoint", "group", "bmi_group")
)
phyloseq::sample_data(ps2)
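
The arguments not covered above could be exercised along the following lines. This is a sketch that uses only the arguments documented on this page; the object names `ps3` and `ps4` are arbitrary, the outputs are not shown because they have not been verified here, and the block is skipped when the microViz or microbiome packages are not installed.

```r
# Skip quietly when the required packages are not installed.
if (requireNamespace("microViz", quietly = TRUE) &&
    requireNamespace("microbiome", quietly = TRUE)) {
  data("dietswap", package = "microbiome")

  # keep up to two randomly chosen samples per subject;
  # setting the seed first makes the draw repeatable
  set.seed(123)
  ps3 <- microViz::ps_dedupe(dietswap, vars = "subject", method = "random", n = 2)

  # label instead of filter: all samples stay, and the logical variable
  # .KEEP_SAMPLE. marks the ones that would have been kept
  ps4 <- microViz::ps_dedupe(dietswap, vars = "subject", .label_only = TRUE)
  print(table(phyloseq::sample_data(ps4)$.KEEP_SAMPLE.))
}
```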

david-barnett/microViz documentation built on April 17, 2025, 4:25 a.m.