ps_join | R Documentation |
You can use most types of join from the dplyr::*_join function family, including e.g. "inner", "left", "semi", "anti" (see details below). Defaults to type = "left", which calls left_join() and supports x as a phyloseq and y as a dataframe. Most of the time you'll want "left" (adds variables with no sample filtering) or "inner" (adds variables and filters samples). This function simply:
extracts the sample_data from the phyloseq as a dataframe
performs the chosen type of join (with the given arguments)
filters the phyloseq if type is "inner", "semi" or "anti"
reattaches the modified sample_data to the phyloseq and returns the phyloseq
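The four steps above can be sketched by hand with phyloseq and dplyr. This is only an illustration of the idea, not the actual ps_join() source: the helper column .sample_name and the small dataframe y are made up for the example, and an inner join is used so that the filtering step has something to do.

```r
library(phyloseq)
library(dplyr)

data("enterotype", package = "phyloseq")

# Hypothetical extra data covering only the first 50 samples
y <- data.frame(
  ID_var = sample_names(enterotype)[1:50],
  arbitrary_info = rep(c("A", "B"), 25),
  stringsAsFactors = FALSE
)

# 1. extract the sample_data as a dataframe, storing sample names in a column
df <- as(sample_data(enterotype), "data.frame")
df$.sample_name <- sample_names(enterotype)

# 2. perform the chosen join (inner here, so unmatched samples are dropped)
joined <- inner_join(df, y, by = c(".sample_name" = "ID_var"))

# 3. filter the phyloseq to the samples that survived the join
ps <- prune_samples(joined$.sample_name, enterotype)

# 4. reattach the modified sample data, with rows in the phyloseq's sample order
rownames(joined) <- joined$.sample_name
joined <- joined[sample_names(ps), setdiff(colnames(joined), ".sample_name"), drop = FALSE]
sample_data(ps) <- sample_data(joined)

ps
```

In practice ps_join() does this bookkeeping for you; the sketch is mainly useful for seeing why the join must not add rows.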
ps_join(
  x,
  y,
  by = NULL,
  match_sample_names = NULL,
  keep_sample_name_col = TRUE,
  sample_name_natural_join = FALSE,
  type = "left",
  .keep_all_taxa = FALSE
)
x | phyloseq (or dataframe) |
y | dataframe (or phyloseq for e.g. type = "right") |
by | A character vector of variables to join by (each column must be present in both x and y, or paired via a named vector like c("xname" = "yname", etc.)) |
match_sample_names | match against the phyloseq sample_names by naming a variable in the additional dataframe (this is in addition to any variables named in by) |
keep_sample_name_col | should the column named in match_sample_names be kept in the returned phyloseq's sample_data? (only relevant if match_sample_names is not NULL) |
sample_name_natural_join | if TRUE, use sample_name AND all shared colnames to match rows (only relevant if match_sample_names is not NULL; this arg takes precedence over anything also entered in by) |
type | name of type of join e.g. "left", "right", "inner", "semi" (see dplyr help pages) |
.keep_all_taxa | if FALSE (the default), remove taxa which are no longer present in the dataset after filtering |
Mutating joins add columns from a dataframe to the phyloseq sample data, matching rows based on the key columns named in the by argument:
"inner": includes all rows present in both x and y.
"left": includes all rows in x. (so x must be the phyloseq)
"right": includes all rows in y. (so y must be the phyloseq)
"full": includes all rows present in x or y. (will likely NOT work, as additional rows cannot be added to sample_data!)
If a row in x matches multiple rows in y (based on variables named in the by argument), all the rows in y will be added once for each matching row in x. This will cause this function to fail, as additional rows cannot be added to the phyloseq sample_data!
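To see why duplicated keys in y are a problem, here is a small sketch using plain dataframes and dplyr only. The dataframes x_df and y_df are hypothetical stand-ins for the sample data and the extra dataframe, and keeping the first row per key is just one possible fix, which may or may not suit your data.

```r
library(dplyr)

# Hypothetical stand-ins: x_df mimics sample data, y_df is the extra dataframe
x_df <- data.frame(
  Sample_ID = c("s1", "s2", "s3"),
  SeqTech = c("Sanger", "Sanger", "Illumina"),
  stringsAsFactors = FALSE
)
y_df <- data.frame(
  Sample_ID = c("s1", "s1", "s2"), # "s1" appears twice
  arbitrary_info = c("A", "B", "A"),
  stringsAsFactors = FALSE
)

# A left join repeats the x row for "s1", so the result has more rows than x
# (suppressWarnings() because some dplyr versions warn about multiple matches)
joined <- suppressWarnings(left_join(x_df, y_df, by = "Sample_ID"))
nrow(joined) > nrow(x_df)

# Find the offending keys in y before joining
dup_keys <- y_df %>% count(Sample_ID) %>% filter(n > 1)
dup_keys$Sample_ID

# One possible fix: keep only the first row per key
y_unique <- distinct(y_df, Sample_ID, .keep_all = TRUE)
joined_ok <- left_join(x_df, y_unique, by = "Sample_ID")
nrow(joined_ok) == nrow(x_df)
```

Checking y for duplicated keys like this before calling ps_join() avoids the failure described above.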
Filtering joins filter rows from x based on the presence or absence of matches in y:
"semi": return all rows from x with a match in y.
"anti": return all rows from x without a match in y.
phyloseq with modified sample_data (and possibly filtered)
ps_mutate
for computing new variables from existing sample data
ps_select
for selecting only some sample_data variables
https://www.garrickadenbuie.com/project/tidyexplain/ for an animated introduction to joining dataframes
library(phyloseq)
data("enterotype", package = "phyloseq")
x <- enterotype
y <- data.frame(
  ID_var = sample_names(enterotype)[c(1:50, 101:150)],
  SeqTech = sample_data(enterotype)[c(1:50, 101:150), "SeqTech"],
  arbitrary_info = rep(c("A", "B"), 50)
)
# simply match the new data to samples that exist in x, as default is a left_join
# where some sample names of x are expected to match variable ID_var in dataframe y
out1A <- ps_join(x = x, y = y, match_sample_names = "ID_var")
out1A
sample_data(out1A)[1:6, ]
# use sample_name and all shared variables to join
# (a natural join is not a type of join per se,
# but it indicates that all shared variables should be used for matching)
out1B <- ps_join(
  x = x, y = y, match_sample_names = "ID_var",
  sample_name_natural_join = TRUE, keep_sample_name_col = FALSE
)
out1B
sample_data(out1B)[1:6, ]
# if you only want to keep phyloseq samples that exist in the new data, try an inner join
# this will add the new variables AND filter the phyloseq
# this example matches sample names to ID_var and by matching the shared SeqTech variable
out1C <- ps_join(x = x, y = y, type = "inner", by = "SeqTech", match_sample_names = "ID_var")
out1C
sample_data(out1C)[1:6, ]
# the id variable is named Sample_ID in x and ID_var in y
# semi_join is only a filtering join (doesn't add new variables but just filters samples in x)
out2A <- ps_join(x = x, y = y, by = c("Sample_ID" = "ID_var"), type = "semi")
out2A
sample_data(out2A)[1:6, ]
# anti_join is another type of filtering join
out2B <- ps_join(x = x, y = y, by = c("Sample_ID" = "ID_var"), type = "anti")
out2B
sample_data(out2B)[1:6, ]
# semi and anti joins keep opposite sets of samples
intersect(sample_names(out2A), sample_names(out2B))
# you can mix and match named and unnamed values in the `by` vector
# inner is like a combination of left join and semi join
out3 <- ps_join(x = x, y = y, by = c("Sample_ID" = "ID_var", "SeqTech"), type = "inner")
out3
sample_data(out3)[1:6, ]