ps_dedupe: De-duplicate phyloseq samples

View source: R/ps_dedupe.R

ps_dedupe    R Documentation

De-duplicate phyloseq samples

Description

Use one or more variables in the sample_data to identify and remove duplicate samples (leaving one sample per group).

Methods:

  • method = "readcount" keeps the one sample in each duplicate group with the highest total number of reads (phyloseq::sample_sums)

  • method = "first" keeps the first sample in each duplicate group encountered in the row order of the sample_data

  • method = "last" keeps the last sample in each duplicate group encountered in the row order of the sample_data

  • method = "random" keeps a random sample from each duplicate group (call set.seed() beforehand for reproducibility)

More than one "duplicate" sample can be kept per group by setting n > 1.
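
The selection rules above can be sketched in base R on a toy data frame. This is an illustration under stated assumptions, not the package's implementation: the data frame, its reads column, and the helper keep_ids() are hypothetical stand-ins for the sample_data, phyloseq::sample_sums, and ps_dedupe itself.

```r
# Toy stand-in for sample_data plus per-sample read totals (all values invented)
samples <- data.frame(
  sample_id = c("s1", "s2", "s3", "s4", "s5"),
  subject   = c("A", "A", "A", "B", "B"),
  reads     = c(1200, 3400, 2100, 500, 900),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns the sample IDs that would be kept per group
keep_ids <- function(df, group_col, method = "readcount", n = 1) {
  groups <- split(df, df[[group_col]])
  kept <- lapply(groups, function(g) {
    # Rank the rows of this group according to the chosen method
    idx <- switch(method,
      readcount = order(g$reads, decreasing = TRUE),
      first     = seq_len(nrow(g)),
      last      = rev(seq_len(nrow(g))),
      random    = sample.int(nrow(g)),
      stop("unknown method: ", method)
    )
    # Keep the top n rows of the ranking
    g$sample_id[head(idx, n)]
  })
  unlist(kept, use.names = FALSE)
}

keep_ids(samples, "subject", method = "readcount")        # "s2" "s5"
keep_ids(samples, "subject", method = "first")            # "s1" "s4"
keep_ids(samples, "subject", method = "last")             # "s3" "s5"
keep_ids(samples, "subject", method = "readcount", n = 2) # "s2" "s3" "s5" "s4"

set.seed(1)  # makes the random choice reproducible
keep_ids(samples, "subject", method = "random")
```

With n = 2 the sketch keeps the two highest-ranked samples in each group, which mirrors how n is described above.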

Usage

ps_dedupe(
  ps,
  vars,
  method = "readcount",
  verbose = TRUE,
  n = 1,
  .keep_group_var = FALSE,
  .keep_readcount = FALSE,
  .message_IDs = FALSE,
  .label_only = FALSE,
  .keep_all_taxa = FALSE
)

Arguments

ps

phyloseq object

vars

names of variables whose (combined) levels identify groups from which only one sample (or n samples) is desired

method

how to choose the sample(s) kept in each duplicate group: "readcount" (highest total reads), "first" or "last" (by sample_data row order), or "random"

verbose

print a message reporting the number of groups and the number of samples dropped?

n

number of 'duplicates' to keep per group, defaults to 1

.keep_group_var

keep grouping variable .GROUP. in phyloseq object?

.keep_readcount

keep readcount variable .READCOUNT. in phyloseq object?

.message_IDs

message sample names of dropped samples?

.label_only

if TRUE, the samples will NOT be filtered, just labelled with a new logical variable .KEEP_SAMPLE.

.keep_all_taxa

keep all taxa after removing duplicates? If FALSE, the default, taxa are removed if they never occur in any of the retained samples
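
The effect of .keep_all_taxa = FALSE can be pictured with a plain count matrix. The sketch below is a base R analogue only: the matrix, the taxon names, and the choice of retained samples are invented for illustration, and the package may prune taxa differently internally.

```r
# Toy taxa-by-sample count matrix (values invented for illustration)
otu <- matrix(
  c(10, 0, 3,
     0, 5, 0,
     7, 0, 2),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("taxonA", "taxonB", "taxonC"), c("s1", "s2", "s3"))
)

# Pretend s2 was dropped as a duplicate
kept_samples <- c("s1", "s3")
otu_kept <- otu[, kept_samples, drop = FALSE]

# Analogue of .keep_all_taxa = FALSE:
# drop taxa that never occur in any retained sample
otu_pruned <- otu_kept[rowSums(otu_kept) > 0, , drop = FALSE]

rownames(otu_kept)    # "taxonA" "taxonB" "taxonC"  (analogue of .keep_all_taxa = TRUE)
rownames(otu_pruned)  # "taxonA" "taxonC"
```

Here taxonB only occurred in the dropped sample s2, so it disappears once empty taxa are pruned.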

Details

What happens when duplicated samples have exactly equal read counts with method = "readcount"? The first encountered maximum is kept (in sample_data row order, as with method = "first").
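
The tie-breaking rule can be illustrated with base R's which.max(), which also returns the first maximum it encounters. The read counts below are invented, and this is an analogy rather than a claim about how ps_dedupe is implemented.

```r
# Per-sample read totals for one duplicate group, in sample_data row order
reads <- c(s1 = 500, s2 = 900, s3 = 900)

# s2 and s3 tie for the maximum; the earlier row wins
kept <- names(reads)[which.max(reads)]
kept  # "s2"
```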

Value

phyloseq object

See Also

ps_filter for filtering samples by sample_data variables

Examples

data("dietswap", package = "microbiome")

dietswap
# let's pretend the dietswap data contains technical replicates from each subject
# we want to keep only one of them
ps_dedupe(dietswap, vars = "subject", method = "readcount", verbose = TRUE)

# contrived example to show identifying "duplicates" via the interaction of multiple columns
ps1 <- ps_dedupe(
  ps = dietswap, method = "readcount", verbose = TRUE,
  vars = c("timepoint", "group", "bmi_group")
)
phyloseq::sample_data(ps1)

ps2 <- ps_dedupe(
  ps = dietswap, method = "first", verbose = TRUE,
  vars = c("timepoint", "group", "bmi_group")
)
phyloseq::sample_data(ps2)
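
A further example, sketched from the argument descriptions above rather than taken from the package's own examples, shows .label_only = TRUE. It assumes the microViz, phyloseq, and microbiome packages are installed; the object name labelled is arbitrary.

```r
library(microViz)
data("dietswap", package = "microbiome")

# Label, rather than remove, the samples that de-duplication would drop
labelled <- ps_dedupe(
  ps = dietswap, vars = "subject", method = "readcount",
  verbose = FALSE, .label_only = TRUE
)

# No samples are removed when only labelling
phyloseq::nsamples(labelled) == phyloseq::nsamples(dietswap)

# The new logical variable marks which samples would be kept
keep <- phyloseq::sample_data(labelled)[[".KEEP_SAMPLE."]]
table(keep)
```

Labelling first can be a convenient way to inspect which samples would be dropped before committing to the filter.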

david-barnett/microViz documentation built on April 17, 2025, 4:25 a.m.