ps_join: Join a dataframe to phyloseq sample data

ps_join R Documentation

Description

You can use most types of join from the dplyr::*_join function family, including "inner", "left", "semi" and "anti" (see Details below). The default is type = "left", which calls left_join() and expects x to be a phyloseq and y to be a dataframe. Most of the time you'll want "left" (adds variables with no sample filtering) or "inner" (adds variables and filters samples). This function simply:

  1. extracts the sample_data from the phyloseq as a dataframe

  2. performs the chosen type of join (with the given arguments)

  3. filters the phyloseq if type is "inner", "semi" or "anti"

  4. reattaches the modified sample_data to the phyloseq and returns the phyloseq
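
The four steps above can be sketched by hand with phyloseq and dplyr. This is only an illustration of the idea under stated assumptions, not the actual ps_join source: the helper column .sample_name and the dataframe extra are hypothetical names introduced here, and an inner join is used so that step 3 has something to filter.

```r
library(phyloseq)
library(dplyr)
data("enterotype", package = "phyloseq")

# hypothetical extra data covering only the first 50 samples
extra <- data.frame(
  ID_var = sample_names(enterotype)[1:50],
  arbitrary_info = rep(c("A", "B"), 25)
)

# 1. extract the sample_data as a plain dataframe,
#    storing the sample names in a helper column (assumed name)
df <- data.frame(sample_data(enterotype))
df$.sample_name <- sample_names(enterotype)

# 2. perform the chosen join (inner join here)
joined <- inner_join(df, extra, by = c(".sample_name" = "ID_var"))

# 3. filter the phyloseq to the samples that survived the join
ps <- prune_samples(joined$.sample_name, enterotype)

# 4. reattach the modified sample data, with rows in phyloseq sample order
rownames(joined) <- joined$.sample_name
joined <- joined[sample_names(ps), , drop = FALSE]
sample_data(ps) <- sample_data(joined)

nsamples(ps)
```

With a left join, step 3 would be unnecessary, because every row of the original sample data is kept.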

Usage

ps_join(
  x,
  y,
  by = NULL,
  match_sample_names = NULL,
  keep_sample_name_col = TRUE,
  sample_name_natural_join = FALSE,
  type = "left",
  .keep_all_taxa = FALSE
)

Arguments

x

phyloseq (or dataframe)

y

dataframe (or phyloseq for e.g. type = "right")

by

A character vector of variables to join by (each column must be present in both x and y, or be paired via a named vector like c("xname" = "yname", etc.))

match_sample_names

match against the phyloseq sample_names by naming a variable in the additional dataframe (this is in addition to any variables named in by)

keep_sample_name_col

should the column named in match_sample_names be kept in the returned phyloseq's sample_data? (only relevant if match_sample_names is not NULL)

sample_name_natural_join

if TRUE, use sample_name AND all shared colnames to match rows (only relevant if match_sample_names is not NULL; this argument takes precedence over anything also entered in the by argument)

type

name of type of join e.g. "left", "right", "inner", "semi" (see dplyr help pages)

.keep_all_taxa

if FALSE (the default), remove taxa which are no longer present in the dataset after filtering
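
To illustrate how a by vector pairs columns, here is a small sketch using dplyr directly on plain dataframes. The dataframes x_df and y_df are made-up stand-ins for the phyloseq sample data and the additional dataframe; the sketch assumes ps_join passes by to the dplyr join in the same form.

```r
library(dplyr)

# toy stand-ins: x_df plays the role of sample data, y_df the extra dataframe
x_df <- data.frame(
  Sample_ID = c("s1", "s2", "s3"),
  SeqTech = c("Illumina", "Sanger", "Sanger")
)
y_df <- data.frame(
  ID_var = c("s1", "s2", "s3"),
  SeqTech = c("Illumina", "Sanger", "Illumina"),
  arbitrary_info = c("A", "B", "C")
)

# the named element pairs Sample_ID (in x) with ID_var (in y);
# the unnamed element "SeqTech" must exist under that name in both
by_both <- inner_join(x_df, y_df, by = c("Sample_ID" = "ID_var", "SeqTech"))

# s3 is dropped because its SeqTech differs between x_df and y_df
by_both$Sample_ID
```

The key columns keep their names from x, so the result has Sample_ID rather than ID_var.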

Details

Mutating joins, which will add columns from a dataframe to phyloseq sample data, matching rows based on the key columns named in the by argument:

  • "inner": includes all rows present in both x and y.

  • "left": includes all rows in x. (so x must be the phyloseq)

  • "right": includes all rows in y. (so y must be the phyloseq)

  • "full": includes all rows present in x or y. (will likely NOT work, as additional rows cannot be added to sample_data!)

If a row in x matches multiple rows in y (based on the variables named in the by argument), the joined result contains one row per match, so it has more rows than x. This will cause this function to fail, as additional rows cannot be added to the phyloseq sample_data!
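
The row duplication described above can be reproduced with dplyr alone. The following sketch uses made-up dataframes (x_df, y_df) and shows one possible way to check for, and remove, duplicated keys in y before joining. Keeping the first row per key with distinct() is an arbitrary choice for illustration; which row to keep depends on your data.

```r
library(dplyr)

x_df <- data.frame(id = c("s1", "s2", "s3"), depth = c(100, 200, 300))
y_df <- data.frame(id = c("s1", "s1", "s2"), info = c("A", "B", "C"))

# "s1" matches two rows in y_df, so the result has 4 rows instead of 3
joined <- left_join(x_df, y_df, by = "id")
nrow(joined)

# check for duplicated keys in y before joining
any(duplicated(y_df$id))

# keep one row per key (here simply the first), then join again
y_unique <- distinct(y_df, id, .keep_all = TRUE)
safe <- left_join(x_df, y_unique, by = "id")
nrow(safe)
```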

Filtering joins filter rows from x based on the presence or absence of matches in y:

  • "semi": return all rows from x with a match in y.

  • "anti": return all rows from x without a match in y.
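
As a minimal sketch of the two filtering joins, again on made-up plain dataframes rather than a phyloseq, note that neither join adds columns from y and that together they partition the rows of x:

```r
library(dplyr)

x_df <- data.frame(id = c("s1", "s2", "s3"), depth = c(100, 200, 300))
y_df <- data.frame(id = c("s2", "s3", "s4"), info = c("A", "B", "C"))

# rows of x_df with a match in y_df
kept <- semi_join(x_df, y_df, by = "id")

# rows of x_df without a match in y_df
dropped <- anti_join(x_df, y_df, by = "id")

kept$id
dropped$id
names(kept) # only the columns of x_df
```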

Value

phyloseq with modified sample_data (and possibly filtered)

See Also

ps_mutate for computing new variables from existing sample data

ps_select for selecting only some sample_data variables

https://www.garrickadenbuie.com/project/tidyexplain/ for an animated introduction to joining dataframes

Examples

library(phyloseq)
data("enterotype", package = "phyloseq")

x <- enterotype
y <- data.frame(
  ID_var = sample_names(enterotype)[c(1:50, 101:150)],
  SeqTech = sample_data(enterotype)[c(1:50, 101:150), "SeqTech"],
  arbitrary_info = rep(c("A", "B"), 50)
)

# simply match the new data to samples that exist in x, as default is a left_join
# where some sample names of x are expected to match variable ID_var in dataframe y
out1A <- ps_join(x = x, y = y, match_sample_names = "ID_var")
out1A
sample_data(out1A)[1:6, ]


# use sample_name and all shared variables to join
# (a natural join is not a type of join per se,
# but it indicates that all shared variables should be used for matching)
out1B <- ps_join(
  x = x, y = y, match_sample_names = "ID_var",
  sample_name_natural_join = TRUE, keep_sample_name_col = FALSE
)
out1B
sample_data(out1B)[1:6, ]

# if you only want to keep phyloseq samples that exist in the new data, try an inner join
# this will add the new variables AND filter the phyloseq
# this example matches sample names to ID_var and also matches on the shared SeqTech variable
out1C <- ps_join(x = x, y = y, type = "inner", by = "SeqTech", match_sample_names = "ID_var")
out1C
sample_data(out1C)[1:6, ]

# the id variable is named Sample_ID in x and ID_var in y
# semi_join is only a filtering join (doesn't add new variables but just filters samples in x)
out2A <- ps_join(x = x, y = y, by = c("Sample_ID" = "ID_var"), type = "semi")
out2A
sample_data(out2A)[1:6, ]

# anti_join is another type of filtering join
out2B <- ps_join(x = x, y = y, by = c("Sample_ID" = "ID_var"), type = "anti")
out2B
sample_data(out2B)[1:6, ]

# semi and anti joins keep opposite sets of samples
intersect(sample_names(out2A), sample_names(out2B))

# you can mix and match named and unnamed values in the `by` vector
# inner is like a combination of left join and semi join
out3 <- ps_join(x = x, y = y, by = c("Sample_ID" = "ID_var", "SeqTech"), type = "inner")
out3
sample_data(out3)[1:6, ]

david-barnett/microViz documentation built on April 17, 2025, 4:25 a.m.