View source: R/io_repertoires_read.R
| read_repertoires | R Documentation |
This is the main function for reading immune repertoire data into the
immundata framework. It reads one or more repertoire files (AIRR TSV,
10X CSV, Parquet), performs optional preprocessing and column renaming,
aggregates sequences into receptors based on a provided schema, optionally
joins external metadata, performs optional postprocessing, and returns
an ImmunData object.
The function handles different data types (bulk, single-cell) based on
the presence of barcode_col and count_col. For efficiency with large
datasets, it processes the data and saves intermediate results (annotations)
as a Parquet file before loading them back into the final ImmunData object.
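The rule above (barcode_col selects single-cell handling, count_col selects bulk handling) can be pictured with a small base R sketch. The helper detect_mode() is hypothetical and is not part of immundata; the behaviour when both or neither argument is supplied is an assumption made for illustration only, as are the column names used in the calls.

```r
# Hypothetical helper, NOT an immundata function: it only mirrors the
# documented rule that barcode_col / count_col select the processing mode.
detect_mode <- function(barcode_col = NULL, count_col = NULL) {
  has_barcode <- !is.null(barcode_col)
  has_count <- !is.null(count_col)
  if (has_barcode && has_count) {
    # Assumption: supplying both columns is treated as a conflict
    stop("Provide either barcode_col or count_col, not both.")
  }
  if (has_barcode) return("single-cell")
  if (has_count) return("bulk")
  # Assumption: with neither column, each row counts as one observation
  "simple"
}

detect_mode(barcode_col = "barcode")  # "single-cell"
detect_mode(count_col = "umi_count")  # "bulk"
detect_mode()                         # "simple"
```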
read_repertoires(
  path,
  schema,
  metadata = NULL,
  barcode_col = NULL,
  count_col = NULL,
  locus_col = NULL,
  umi_col = NULL,
  preprocess = make_default_preprocessing(),
  postprocess = make_default_postprocessing(),
  rename_columns = imd_rename_cols("10x"),
  enforce_schema = TRUE,
  metadata_file_col = "File",
  output_folder = NULL,
  repertoire_schema = NULL
)
path |
Character vector. Path(s) to input repertoire files (e.g.,
|
schema |
Defines how unique receptors are identified. Can be:
|
metadata |
Optional. A data frame containing
metadata to be joined with the repertoire data, read by
|
barcode_col |
Character(1). Name of the column containing cell barcodes
or other unique cell/clone identifiers for single-cell data. Triggers
single-cell processing logic in |
count_col |
Character(1). Name of the column containing UMI counts or
frequency counts for bulk sequencing data. Triggers bulk processing logic
in |
locus_col |
Character(1). Name of the column specifying the receptor chain
locus (e.g., "TRA", "TRB", "IGH", "IGK", "IGL"). Required if |
umi_col |
Character(1). Name of the column containing UMI counts for
single-cell data. Required when |
preprocess |
List. A named list of functions to apply sequentially to the
raw data before receptor aggregation. Each function should accept a
data frame (or duckplyr_df) as its first argument. See
|
postprocess |
List. A named list of functions to apply sequentially to the
annotation data after receptor aggregation and metadata joining. Each
function should accept a data frame (or duckplyr_df) as its first argument.
See |
rename_columns |
Named character vector. Optional mapping to rename columns
in the input files using |
enforce_schema |
Logical(1). If |
metadata_file_col |
Character(1). The name of the column in the |
output_folder |
Character(1). Path to a directory where intermediate
processed annotation data will be saved as |
repertoire_schema |
Character vector or Function. Defines columns used to
group annotations into distinct repertoires (e.g., by sample or donor).
If provided, |
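The preprocess and postprocess arguments are described as named lists of functions applied sequentially, each taking a data frame as its first argument. A minimal base R sketch of that pattern follows. The step names, the toy data, and the apply_steps() helper are hypothetical. This is not the package's internal code, and it uses a plain data frame rather than a duckplyr_df.

```r
# Toy input resembling a few AIRR columns (illustrative values only)
seqs <- data.frame(
  v_call = c("TRBV1", "TRBV2", "TRBV3"),
  junction_aa = c("CASSL", "CASSD", NA),
  productive = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Hypothetical steps: each takes a data frame first and returns a data frame
my_steps <- list(
  keep_productive = function(df, ...) {
    df[df$productive %in% TRUE, , drop = FALSE]
  },
  drop_missing_cdr3 = function(df, ...) {
    df[!is.na(df$junction_aa), , drop = FALSE]
  }
)

# Hypothetical helper: apply the steps in list order, feeding each
# result into the next step
apply_steps <- function(df, steps) {
  Reduce(function(acc, step) step(acc), steps, df)
}

cleaned <- apply_steps(seqs, my_steps)
print(cleaned)
```

Because each step receives the output of the previous one, the order of the list matters: a filter placed early reduces the work done by later steps.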
The function executes the following steps:
1. Validates inputs.
2. Determines the list of input files based on path and metadata. Checks file extensions.
3. Reads data using duckplyr (read_parquet_duckdb or read_csv_duckdb). Handles .gz.
4. Applies column renaming if rename_columns is provided.
5. Applies preprocessing steps sequentially if preprocess is provided.
6. Aggregates sequences into receptors using agg_receptors(), based on schema, barcode_col, count_col, locus_col, and umi_col. This creates the core annotation table.
7. Joins the metadata table if provided.
8. Applies postprocessing steps sequentially if postprocess is provided.
9. Creates a temporary ImmunData object in memory.
10. Determines the output_folder path.
11. If repertoire_schema is provided, calls agg_repertoires() to define and summarize repertoires.
12. Saves the processed annotation table and metadata using write_immundata() to the output_folder.
13. Loads the data back from the saved Parquet files using read_immundata() to create the final ImmunData object. This ensures the returned object is backed by efficient storage.
14. Returns the final ImmunData object.
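To make the receptor aggregation and metadata join steps above more concrete, here is a base R sketch of the general idea: rows that share the same values in the schema columns receive the same receptor identifier, and metadata is then attached by file name. It is an illustration under stated assumptions, not the implementation of agg_receptors(). The column name receptor_id, the toy data, and the use of merge() are all hypothetical.

```r
# Toy sequence table; "File" records which input file each row came from
seqs <- data.frame(
  sequence_id = paste0("seq", 1:5),
  v_call = c("TRBV1", "TRBV1", "TRBV2", "TRBV1", "TRBV3"),
  j_call = c("TRBJ1", "TRBJ1", "TRBJ2", "TRBJ1", "TRBJ1"),
  junction_aa = c("CASSL", "CASSL", "CASSD", "CASSL", "CASSF"),
  File = c("sample1.tsv", "sample1.tsv", "sample2.tsv",
           "sample2.tsv", "sample2.tsv"),
  stringsAsFactors = FALSE
)

# Columns that define a unique receptor
schema <- c("v_call", "j_call", "junction_aa")

# Build one key per row from the schema columns
key <- do.call(paste, c(seqs[schema], sep = "|"))

# Rows with the same key share a receptor id (ids follow first appearance);
# "receptor_id" is a hypothetical column name
seqs$receptor_id <- match(key, unique(key))

# Hypothetical metadata table keyed by file name
meta <- data.frame(
  File = c("sample1.tsv", "sample2.tsv"),
  Tissue = c("PBMC", "Tumor"),
  stringsAsFactors = FALSE
)

# Left join: every sequence row is kept and gains its sample's metadata
annotations <- merge(seqs, meta, by = "File", all.x = TRUE)
annotations <- annotations[order(annotations$sequence_id), ]
print(annotations)
```

In this toy table, seq1, seq2, and seq4 collapse into one receptor even though seq4 comes from a different file, which is why a separate repertoire grouping (repertoire_schema) is useful when per-sample summaries are needed.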
An ImmunData object containing the processed receptor annotations.
If repertoire_schema was provided, the object will also contain repertoire
definitions and summaries calculated by agg_repertoires().
ImmunData, read_immundata(), write_immundata(), read_metadata(),
agg_receptors(), agg_repertoires(), make_receptor_schema(),
make_default_preprocessing(), make_default_postprocessing()
## Not run:
#
# Example 1: single-chain, one file
#
# Read a single AIRR TSV file, defining receptors by V/J/CDR3_aa
# In practice "my_sample.tsv" would be an existing AIRR-format file;
# here we create a dummy file for illustration
airr_data <- data.frame(
  sequence_id = paste0("seq", 1:5),
  v_call = c("TRBV1", "TRBV1", "TRBV2", "TRBV1", "TRBV3"),
  j_call = c("TRBJ1", "TRBJ1", "TRBJ2", "TRBJ1", "TRBJ1"),
  junction_aa = c("CASSL...", "CASSL...", "CASSD...", "CASSL...", "CASSF..."),
  productive = c(TRUE, TRUE, TRUE, FALSE, TRUE),
  locus = c("TRB", "TRB", "TRB", "TRB", "TRB")
)
readr::write_tsv(airr_data, "my_sample.tsv")
# Define receptor schema
receptor_def <- c("v_call", "j_call", "junction_aa")
# Specify output folder
out_dir <- tempfile("immundata_output_")
# Read the data (disabling default preprocessing for this simple example)
idata <- read_repertoires(
  path = "my_sample.tsv",
  schema = receptor_def,
  output_folder = out_dir,
  preprocess = NULL, # Disable default productive filter for demo
  postprocess = NULL # Disable default barcode prefixing
)
print(idata)
print(idata$annotations)
#
# Example 2: single-chain, multiple files
#
# Read multiple files using metadata
# Create dummy files and metadata
readr::write_tsv(airr_data[1:2, ], "sample1.tsv")
readr::write_tsv(airr_data[3:5, ], "sample2.tsv")
meta <- data.frame(
  SampleID = c("S1", "S2"),
  Tissue = c("PBMC", "Tumor"),
  FilePath = c(normalizePath("sample1.tsv"), normalizePath("sample2.tsv"))
)
readr::write_tsv(meta, "metadata.tsv")
# Keep the output path in a variable so it can be removed during cleanup
out_dir_multi <- tempfile("immundata_multi_")
idata_multi <- read_repertoires(
  path = "<metadata>",
  metadata = meta,
  metadata_file_col = "FilePath",
  schema = receptor_def,
  repertoire_schema = "SampleID", # Aggregate by SampleID
  output_folder = out_dir_multi,
  preprocess = make_default_preprocessing("airr"), # Use default AIRR filters
  postprocess = NULL
)
print(idata_multi)
print(idata_multi$repertoires) # Check repertoire summary
# Clean up dummy files and output folders
file.remove("my_sample.tsv", "sample1.tsv", "sample2.tsv", "metadata.tsv")
unlink(out_dir, recursive = TRUE)
unlink(out_dir_multi, recursive = TRUE)
## End(Not run)