ps_join: Join a dataframe to phyloseq sample data

ps_join

R Documentation

Description

You can use most types of join from the dplyr::*_join function family, e.g. "inner", "left", "semi" and "anti" (see Details below). The default is type = "left", which calls left_join() and expects x to be a phyloseq and y to be a dataframe. Most of the time you'll want "left" (adds variables without filtering samples) or "inner" (adds variables and filters samples). This function simply:

extracts the sample_data from the phyloseq as a dataframe
performs the chosen type of join (with the given arguments)
filters the phyloseq samples if type is "inner", "semi" or "anti"
reattaches the modified sample_data to the phyloseq and returns the phyloseq
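
The four steps above can be sketched on toy data. This is only an illustration in base R, not the package's implementation: the objects counts, sam and extra are hypothetical stand-ins for a phyloseq otu_table, its sample_data and the additional dataframe, and merge() stands in for dplyr's inner_join().

```r
# Hypothetical toy data: taxa as rows, samples as columns
counts <- matrix(
  c(1, 0, 2, 0, 5, 0, 3, 0, 4, 0, 6, 1),
  nrow = 3,
  dimnames = list(paste0("taxon", 1:3), paste0("S", 1:4))
)
sam <- data.frame(
  group = c("a", "a", "b", "b"),
  row.names = paste0("S", 1:4), stringsAsFactors = FALSE
)
extra <- data.frame(
  ID_var = c("S1", "S3"), info = c("x", "y"), stringsAsFactors = FALSE
)

# 1. extract the sample table, keeping the sample names as a key column
sam_df <- data.frame(.name = rownames(sam), sam, stringsAsFactors = FALSE)

# 2. perform the join (an inner join here, via base R merge())
joined <- merge(sam_df, extra, by.x = ".name", by.y = "ID_var")

# 3. filter the counts to the samples that survived the join,
#    then drop taxa with no remaining counts (like .keep_all_taxa = FALSE)
counts_kept <- counts[, joined$.name, drop = FALSE]
counts_kept <- counts_kept[rowSums(counts_kept) > 0, , drop = FALSE]

# 4. reattach: restore the sample names as row names of the sample table
rownames(joined) <- joined$.name
joined$.name <- NULL
```

Here only samples S1 and S3 have a match in extra, so the inner join keeps those two samples and taxon2, which has no counts in either, is dropped.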

Usage

ps_join(
  x,
  y,
  by = NULL,
  match_sample_names = NULL,
  keep_sample_name_col = TRUE,
  sample_name_natural_join = FALSE,
  type = "left",
  .keep_all_taxa = FALSE
)

Arguments

`x`	phyloseq (or dataframe)
`y`	dataframe (or phyloseq for e.g. type = "right")
`by`	A character vector of variables to join by (each column must be present in both x and y, or be paired via a named vector, e.g. c("xname" = "yname"))
`match_sample_names`	match against the phyloseq sample_names by naming a variable in the additional dataframe (this is in addition to any variables named in by)
`keep_sample_name_col`	should the column named in match_sample_names be kept in the returned phyloseq's sample_data? (only relevant if match_sample_names is not NULL)
`sample_name_natural_join`	if TRUE, use sample_name AND all shared colnames to match rows (only relevant if match_sample_names is not NULL; this arg takes precedence over anything also entered in the `by` arg)
`type`	name of type of join e.g. "left", "right", "inner", "semi" (see dplyr help pages)
`.keep_all_taxa`	if FALSE (the default), remove taxa which are no longer present in the dataset after filtering
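
As an illustration of how a `by` vector with both named and unnamed entries pairs up columns, the base R sketch below resolves such a vector into x-side and y-side column names, following the dplyr convention that an unnamed entry uses the same column name on both sides. The variables x_cols and y_cols are only for illustration and are not part of ps_join.

```r
# A `by` vector mixing a named pair and a shared column name
by <- c("Sample_ID" = "ID_var", "SeqTech")

# Unnamed entries have an empty name, so the value is used for both sides
x_cols <- ifelse(names(by) == "", by, names(by))
y_cols <- unname(by)

x_cols
y_cols
```

With this vector, Sample_ID in x is matched against ID_var in y, and SeqTech is matched against the column of the same name.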

Details

Mutating joins add columns from a dataframe to the phyloseq sample data, matching rows based on the key columns named in the by argument:

"inner": includes all rows present in both x and y.
"left": includes all rows in x. (so x must be the phyloseq)
"right": includes all rows in y. (so y must be the phyloseq)
"full": includes all rows present in x or y. (will likely NOT work, as additional rows cannot be added to sample_data!)

If a row in x matches multiple rows in y (based on the variables named in the by argument), the joined result contains one row for each match, so that row of x is duplicated. This will cause this function to fail, as additional rows cannot be added to the phyloseq sample_data!
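
One way to guard against this is to check the key column of y for duplicates before joining. The sketch below uses base R only; the dataframe y and its ID_var column are hypothetical, and keeping the first row per key is just one possible choice.

```r
# Hypothetical dataframe in which the key "S1" appears twice
y <- data.frame(
  ID_var = c("S1", "S1", "S2"),
  info = c("x", "y", "z"),
  stringsAsFactors = FALSE
)

# Keys that occur more than once would duplicate sample rows in a join
dup_keys <- unique(y$ID_var[duplicated(y$ID_var)])
dup_keys

# One possible fix: keep only the first row for each key
y_unique <- y[!duplicated(y$ID_var), ]
```

Whether to keep the first row, aggregate the duplicates, or add further key columns to `by` depends on what the duplicated rows represent.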

Filtering joins filter rows from x based on the presence or absence of matches in y:

"semi": return all rows from x with a match in y.
"anti": return all rows from x without a match in y.

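The two filtering joins split the samples of x into complementary sets. A minimal base R sketch of that idea, using hypothetical sample names and ID values rather than a real phyloseq:

```r
# Hypothetical sample names in x and ID values in y
x_names <- c("S1", "S2", "S3", "S4")
y_ids <- c("S2", "S4", "S9")

# "semi" keeps samples with a match in y, "anti" keeps those without
semi_kept <- x_names[x_names %in% y_ids]
anti_kept <- x_names[!x_names %in% y_ids]

# The two sets never overlap and together cover all samples in x
intersect(semi_kept, anti_kept)
setequal(union(semi_kept, anti_kept), x_names)
```

Values of y with no counterpart in x ("S9" here) are ignored by both joins.
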
Value

phyloseq with modified sample_data (and possibly filtered)

Examples

library(phyloseq)
data("enterotype", package = "phyloseq")

x <- enterotype
y <- data.frame(
  ID_var = sample_names(enterotype)[c(1:50, 101:150)],
  SeqTech = sample_data(enterotype)[c(1:50, 101:150), "SeqTech"],
  arbitrary_info = rep(c("A", "B"), 50)
)

# simply match the new data to samples that exist in x, as default is a left_join
# where some sample names of x are expected to match variable ID_var in dataframe y
out1A <- ps_join(x = x, y = y, match_sample_names = "ID_var")
out1A
sample_data(out1A)[1:6, ]


# use sample_name and all shared variables to join
# (a natural join is not a type of join per se,
# but it indicates that all shared variables should be used for matching)
out1B <- ps_join(
  x = x, y = y, match_sample_names = "ID_var",
  sample_name_natural_join = TRUE, keep_sample_name_col = FALSE
)
out1B
sample_data(out1B)[1:6, ]

# if you only want to keep phyloseq samples that exist in the new data, try an inner join
# this will add the new variables AND filter the phyloseq
# this example matches sample names to ID_var and by matching the shared SeqTech variable
out1C <- ps_join(x = x, y = y, type = "inner", by = "SeqTech", match_sample_names = "ID_var")
out1C
sample_data(out1C)[1:6, ]

# the id variable is named Sample_ID in x and ID_var in y
# semi_join is only a filtering join (doesn't add new variables but just filters samples in x)
out2A <- ps_join(x = x, y = y, by = c("Sample_ID" = "ID_var"), type = "semi")
out2A
sample_data(out2A)[1:6, ]

# anti_join is another type of filtering join
out2B <- ps_join(x = x, y = y, by = c("Sample_ID" = "ID_var"), type = "anti")
out2B
sample_data(out2B)[1:6, ]

# semi and anti joins keep opposite sets of samples
intersect(sample_names(out2A), sample_names(out2B))

# you can mix and match named and unnamed values in the `by` vector
# inner is like a combination of left join and semi join
out3 <- ps_join(x = x, y = y, by = c("Sample_ID" = "ID_var", "SeqTech"), type = "inner")
out3
sample_data(out3)[1:6, ]

david-barnett/microViz documentation built on April 17, 2025, 4:25 a.m.