read_repertoires: Read and process immune repertoire files to immundata
In immundata: A Unified Data Layer for Large-Scale Single-Cell, Spatial and Bulk Immunomics

read_repertoires

R Documentation

Read and process immune repertoire files to immundata

Description

This is the main function for reading immune repertoire data into the immundata framework. It reads one or more repertoire files (AIRR TSV, 10X CSV, Parquet), performs optional preprocessing and column renaming, aggregates sequences into receptors based on a provided schema, optionally joins external metadata, performs optional postprocessing, and returns an ImmunData object.

The function handles different data types (bulk, single-cell) based on the presence of barcode_col and count_col. For efficiency with large datasets, it processes the data and saves intermediate results (annotations) as a Parquet file before loading them back into the final ImmunData object.
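
The way the mode follows from the supplied column arguments can be sketched in base R. The helper `detect_mode()` below is hypothetical and not part of immundata. It only mirrors the rules stated in the Arguments section for `barcode_col`, `count_col`, and `umi_col`. The return value for the case where neither column is given is a placeholder, because the documentation does not describe that case.

```r
# Hypothetical helper, not part of immundata: mirrors the documented argument rules.
detect_mode <- function(barcode_col = NULL, count_col = NULL, umi_col = NULL) {
  # count_col cannot be specified together with barcode_col.
  if (!is.null(barcode_col) && !is.null(count_col)) {
    stop("`count_col` cannot be combined with `barcode_col`")
  }
  if (!is.null(barcode_col)) {
    # umi_col is documented as required whenever barcode_col is used.
    if (is.null(umi_col)) {
      stop("`umi_col` is required when `barcode_col` is set")
    }
    return("single-cell")
  }
  if (!is.null(count_col)) {
    return("bulk")
  }
  # Neither column given: placeholder value, this case is not described in the source.
  "unspecified"
}

# The column names used here are illustrative assumptions.
mode_sc <- detect_mode(barcode_col = "barcode", umi_col = "umis")
mode_bulk <- detect_mode(count_col = "duplicate_count")
print(c(mode_sc, mode_bulk))
```

Failing early on conflicting arguments keeps the error close to the call rather than deep inside receptor aggregation.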

Usage

read_repertoires(
  path,
  schema,
  metadata = NULL,
  barcode_col = NULL,
  count_col = NULL,
  locus_col = NULL,
  umi_col = NULL,
  preprocess = make_default_preprocessing(),
  postprocess = make_default_postprocessing(),
  rename_columns = imd_rename_cols("10x"),
  enforce_schema = TRUE,
  metadata_file_col = "File",
  output_folder = NULL,
  repertoire_schema = NULL
)

Arguments

`path`	Character vector. Path(s) to input repertoire files (e.g., `"/path/to/data/*.tsv.gz"`). Supports glob patterns via `Sys.glob()`. Files can be Parquet, CSV, TSV, or gzipped versions thereof. All files must be of the same type. Alternatively, pass the special string `"<metadata>"` to read file paths from the `metadata` table (see `metadata` and `metadata_file_col` params).
`schema`	Defines how unique receptors are identified. Either a character vector of column names (e.g., `c("v_call", "j_call", "junction_aa")`) or a schema object created by `make_receptor_schema()`, which allows specifying chains for pairing (e.g., `make_receptor_schema(features = c("v_call", "junction_aa"), chains = c("TRA", "TRB"))`).
`metadata`	Optional. A data frame containing metadata to be joined with the repertoire data, for example one loaded with `read_metadata()`. If `path = "<metadata>"`, this table must be provided and must contain the file paths column specified by `metadata_file_col`. Default: `NULL`.
`barcode_col`	Character(1). Name of the column containing cell barcodes or other unique cell/clone identifiers for single-cell data. Triggers single-cell processing logic in `agg_receptors()`. Default: `NULL`.
`count_col`	Character(1). Name of the column containing UMI counts or frequency counts for bulk sequencing data. Triggers bulk processing logic in `agg_receptors()`. Default: `NULL`. Cannot be specified if `barcode_col` is also specified.
`locus_col`	Character(1). Name of the column specifying the receptor chain locus (e.g., "TRA", "TRB", "IGH", "IGK", "IGL"). Required if `schema` specifies chains for pairing. Default: `NULL`.
`umi_col`	Character(1). Name of the column containing UMI counts for single-cell data. Required when `barcode_col` is used. It is used to select the most abundant chain within a barcode (and within a locus for paired-chain schemas). Default: `NULL`.
`preprocess`	List. A named list of functions to apply sequentially to the raw data before receptor aggregation. Each function should accept a data frame (or duckplyr_df) as its first argument. See `make_default_preprocessing()` for examples. Default: `make_default_preprocessing()`. Set to `NULL` or `list()` to disable.
`postprocess`	List. A named list of functions to apply sequentially to the annotation data after receptor aggregation and metadata joining. Each function should accept a data frame (or duckplyr_df) as its first argument. See `make_default_postprocessing()` for examples. Default: `make_default_postprocessing()`. Set to `NULL` or `list()` to disable.
`rename_columns`	Named character vector. Optional mapping to rename columns in the input files using `dplyr::rename()` syntax (e.g., `c(new_name = "old_name", barcode = "cell_id")`). Renaming happens before preprocessing and schema application. See `imd_rename_cols()` for presets. Default: `imd_rename_cols("10x")`.
`enforce_schema`	Logical(1). If `TRUE`, reading multiple files requires them to have exactly the same columns and types. If `FALSE`, columns are unioned across files (potentially slower and more memory-intensive). Default: `TRUE`.
`metadata_file_col`	Character(1). The name of the column in the `metadata` table that contains the full paths to the repertoire files. Only used when `path = "<metadata>"`. Default: `"File"`.
`output_folder`	Character(1). Path to a directory where intermediate processed annotation data will be saved as `annotations.parquet` and `metadata.json`. If `NULL` (default), a folder named `immundata-<basename_without_ext>` is created in the same directory as the first input file specified in `path`. The final `ImmunData` object reads from these saved files. Default: `NULL`.
`repertoire_schema`	Character vector or Function. Defines columns used to group annotations into distinct repertoires (e.g., by sample or donor). If provided, `agg_repertoires()` is called after loading to add repertoire-level summaries and metrics. Default: `NULL`.
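
As an illustration of the `rename_columns` convention (`c(new_name = "old_name")`), the following base-R sketch applies such a mapping to a toy data frame. The column names `cell_id`, `cdr3`, and `umis`, and the choice to skip old names that are absent from the data, are assumptions of this sketch. `read_repertoires()` performs renaming internally, and its handling of missing columns is not described here.

```r
# Mapping in dplyr::rename() style: names are the new columns, values the old ones.
rename_map <- c(barcode = "cell_id", junction_aa = "cdr3")

# Toy input with hypothetical column names.
df <- data.frame(
  cell_id = c("AAAC", "AAAG"),
  cdr3 = c("CASSLG", "CASSDA"),
  umis = c(12L, 7L),
  stringsAsFactors = FALSE
)

# Keep only the mappings whose old name exists in the data (a choice of this sketch).
present <- rename_map[rename_map %in% names(df)]

# Replace each matched old name with its new name.
names(df)[match(present, names(df))] <- names(present)

print(names(df))
```

Because renaming happens before preprocessing and schema application, the names on the left-hand side of the mapping are the ones that `schema`, `barcode_col`, and the other column arguments should refer to.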

Details

The function executes the following steps:

1. Validates inputs.
2. Determines the list of input files based on path and metadata. Checks file extensions.
3. Reads data using duckplyr (read_parquet_duckdb or read_csv_duckdb). Handles .gz.
4. Applies column renaming if rename_columns is provided.
5. Applies preprocessing steps sequentially if preprocess is provided.
6. Aggregates sequences into receptors using agg_receptors(), based on schema, barcode_col, count_col, locus_col, and umi_col. This creates the core annotation table.
7. Joins the metadata table if provided.
8. Applies postprocessing steps sequentially if postprocess is provided.
9. Creates a temporary ImmunData object in memory.
10. Determines the output_folder path.
11. If repertoire_schema is provided, calls agg_repertoires() to define and summarize repertoires.
12. Saves the processed annotation table and metadata using write_immundata() to the output_folder.
13. Loads the data back from the saved Parquet files using read_immundata() to create the final ImmunData object. This ensures the returned object is backed by efficient storage.
14. Returns the final ImmunData object.
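
The first half of this pipeline (file resolution, reading, sequential preprocessing, and receptor aggregation) can be sketched conceptually in base R. This is a toy stand-in under stated assumptions, not the package's implementation: base R replaces duckplyr, `unique()` replaces `agg_receptors()`, and the `filename` and `receptor_id` columns as well as the sequence values are invented for illustration.

```r
# Toy data standing in for two repertoire files (values are made up).
tmp <- tempfile("repertoires_")
dir.create(tmp)
s1 <- data.frame(
  v_call = c("TRBV1", "TRBV1"),
  j_call = c("TRBJ1", "TRBJ1"),
  junction_aa = c("CASSLG", "CASSLG"),
  productive = c(TRUE, TRUE)
)
s2 <- data.frame(
  v_call = c("TRBV2", "TRBV1", "TRBV3"),
  j_call = c("TRBJ2", "TRBJ1", "TRBJ1"),
  junction_aa = c("CASSDA", "CASSLG", "CASSFQ"),
  productive = c(TRUE, FALSE, TRUE)
)
write.table(s1, file.path(tmp, "sample1.tsv"), sep = "\t", quote = FALSE, row.names = FALSE)
write.table(s2, file.path(tmp, "sample2.tsv"), sep = "\t", quote = FALSE, row.names = FALSE)

# Expand the glob pattern into concrete file paths.
files <- sort(Sys.glob(file.path(tmp, "*.tsv")))

# Read every file and stack the rows; `filename` is a hypothetical provenance column.
raw <- do.call(rbind, lapply(files, function(f) {
  df <- read.delim(f, stringsAsFactors = FALSE)
  df$filename <- basename(f)
  df
}))

# Apply the preprocessing functions one after another, each taking a data frame.
preprocess <- list(
  keep_productive = function(df) df[df$productive %in% TRUE, , drop = FALSE]
)
clean <- Reduce(function(df, step) step(df), preprocess, raw)

# Collapse rows into receptors: one receptor per unique combination of schema columns.
schema <- c("v_call", "j_call", "junction_aa")
receptors <- unique(clean[, schema, drop = FALSE])
receptors$receptor_id <- seq_len(nrow(receptors))

# Annotation table: every remaining row labelled with its receptor.
annotations <- merge(clean, receptors, by = schema)
print(annotations)

# Remove the temporary files.
unlink(tmp, recursive = TRUE)
```

In this toy input, five rows are read, one non-productive row is filtered out, and the remaining four rows map to three distinct receptors.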

Value

An ImmunData object containing the processed receptor annotations. If repertoire_schema was provided, the object will also contain repertoire definitions and summaries calculated by agg_repertoires().

Examples

## Not run: 
#
# Example 1: single-chain, one file
#
# Read a single AIRR TSV file, defining receptors by V/J/CDR3_aa
# Assume "my_sample.tsv" exists and follows AIRR format

# Create a dummy file for illustration
airr_data <- data.frame(
  sequence_id = paste0("seq", 1:5),
  v_call = c("TRBV1", "TRBV1", "TRBV2", "TRBV1", "TRBV3"),
  j_call = c("TRBJ1", "TRBJ1", "TRBJ2", "TRBJ1", "TRBJ1"),
  junction_aa = c("CASSL...", "CASSL...", "CASSD...", "CASSL...", "CASSF..."),
  productive = c(TRUE, TRUE, TRUE, FALSE, TRUE),
  locus = c("TRB", "TRB", "TRB", "TRB", "TRB")
)
readr::write_tsv(airr_data, "my_sample.tsv")

# Define receptor schema
receptor_def <- c("v_call", "j_call", "junction_aa")

# Specify output folder
out_dir <- tempfile("immundata_output_")

# Read the data (disabling default preprocessing for this simple example)
idata <- read_repertoires(
  path = "my_sample.tsv",
  schema = receptor_def,
  output_folder = out_dir,
  preprocess = NULL, # Disable default productive filter for demo
  postprocess = NULL # Disable default barcode prefixing
)

print(idata)
print(idata$annotations)

#
# Example 2: single-chain, multiple files
#
# Read multiple files using metadata
# Create dummy files and metadata
readr::write_tsv(airr_data[1:2, ], "sample1.tsv")
readr::write_tsv(airr_data[3:5, ], "sample2.tsv")
meta <- data.frame(
  SampleID = c("S1", "S2"),
  Tissue = c("PBMC", "Tumor"),
  FilePath = c(normalizePath("sample1.tsv"), normalizePath("sample2.tsv"))
)
readr::write_tsv(meta, "metadata.tsv")

# Keep the output folder path in a variable so it can be removed afterwards
out_dir_multi <- tempfile("immundata_multi_")

idata_multi <- read_repertoires(
  path = "<metadata>",
  metadata = meta,
  metadata_file_col = "FilePath",
  schema = receptor_def,
  repertoire_schema = "SampleID", # Aggregate by SampleID
  output_folder = out_dir_multi,
  preprocess = make_default_preprocessing("airr"), # Use default AIRR filters
  postprocess = NULL
)

print(idata_multi)
print(idata_multi$repertoires) # Check repertoire summary

# Clean up dummy files and output folders
file.remove("my_sample.tsv", "sample1.tsv", "sample2.tsv", "metadata.tsv")
unlink(out_dir, recursive = TRUE)
unlink(out_dir_multi, recursive = TRUE)

## End(Not run)

immundata documentation built on April 4, 2026, 9:09 a.m.
