agg_receptors: Aggregates AIRR data into receptors
In immundata: A Unified Data Layer for Large-Scale Single-Cell, Spatial and Bulk Immunomics

View source: R/operations_agg_receptors.R

Description

Processes a table of immune receptor sequences (chains or clonotypes) to identify unique receptors based on a specified schema. It assigns a unique identifier (imd_receptor_id) to each distinct receptor signature and returns an annotated table linking the original sequence data to these receptor IDs.

This function is a core component used within read_repertoires() and handles different input data structures:

Simple tables (no counts, no cell IDs).
Bulk sequencing data (using a count column).
Single-cell data (using a barcode/cell ID column). For single-cell data, it can perform chain pairing if the schema specifies multiple chains (e.g., TRA and TRB).
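The ID assignment described above can be illustrated with a minimal base R sketch for the simple-table case. It is not the package's implementation: the toy data, the `|` separator, and numbering receptors in order of first appearance are all assumptions made for illustration.

```r
# Toy chain table; the values are invented for this example.
chains <- data.frame(
  v_call      = c("TRBV5-1", "TRBV5-1", "TRBV7-2", "TRBV5-1"),
  j_call      = c("TRBJ2-7", "TRBJ2-7", "TRBJ1-1", "TRBJ2-7"),
  junction_aa = c("CASSLGQYF", "CASSLGQYF", "CASSPTGEAFF", "CASSLGQYF"),
  stringsAsFactors = FALSE
)

# Receptor features, as in a character-vector schema.
features <- c("v_call", "j_call", "junction_aa")

# One signature string per row, built from the feature columns.
# Assumption: "|" never occurs inside the feature values.
signature <- do.call(paste, c(chains[features], sep = "|"))

# Distinct signatures are numbered in order of first appearance.
chains$imd_receptor_id <- match(signature, unique(signature))

# Every input row (chain) gets its own ID.
chains$imd_chain_id <- seq_len(nrow(chains))

print(chains)
```

Rows 1, 2 and 4 share one signature and therefore one receptor ID, while each row keeps a distinct chain ID.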

Usage

agg_receptors(
  dataset,
  schema,
  barcode_col = NULL,
  count_col = NULL,
  locus_col = NULL,
  umi_col = NULL
)

Arguments

`dataset`	A `duckplyr_df` containing AIRR data. Must include the columns named in `schema` and, where supplied, in `barcode_col`, `count_col`, `locus_col`, and `umi_col`. The function currently expects the annotations table (`idata$annotations`); direct support for `ImmunData` objects may be added later.
`schema`	Defines how a unique receptor is identified. Can be either (1) a character vector of column names representing receptor features (e.g., `c("v_call", "j_call", "junction_aa")`), or (2) a list created by `make_receptor_schema()`, specifying `features` (character vector) and optionally `chains` (character vector of at most two locus names, such as `"TRA"`, `"TRB"`, `"IGH"`, `"IGK"`, `"IGL"`). Specifying `chains` triggers filtering by locus and, if two chains are given, enables the pairing logic.
`barcode_col`	Character(1). The name of the column containing cell identifiers (barcodes). Required for single-cell processing and chain pairing. Default: `NULL`.
`count_col`	Character(1). The name of the column containing counts (e.g., UMI counts for bulk, clonotype frequency). Used for bulk data processing. Default: `NULL`. Cannot be specified if `barcode_col` is set.
`locus_col`	Character(1). The name of the column specifying the chain locus (e.g., "TRA", "TRB"). Required if `schema` includes `chains` for filtering or pairing. Default: `NULL`.
`umi_col`	Character(1). The name of the column containing UMI counts. Required for single-cell data (`barcode_col` is set). Used to select the most abundant chain within each barcode and, for paired schemas, within each barcode/locus group when multiple chains are present. Default: `NULL`.
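The constraints between these arguments can be summarised in a small base R sketch. `check_agg_args()` is a hypothetical helper written for this page, not a function exported by immundata, and the package's own validation may differ in details and in how it reports errors.

```r
# Hypothetical helper: collects violations of the argument rules listed above.
check_agg_args <- function(schema, barcode_col = NULL, count_col = NULL,
                           locus_col = NULL, umi_col = NULL) {
  # A schema is either a character vector of features or a list with
  # `features` and optional `chains`.
  chains <- if (is.list(schema)) schema$chains else NULL
  problems <- character(0)

  if (length(chains) > 2) {
    problems <- c(problems, "schema$chains may name at most two loci")
  }
  if (!is.null(barcode_col) && !is.null(count_col)) {
    problems <- c(problems, "count_col cannot be combined with barcode_col")
  }
  if (!is.null(barcode_col) && is.null(umi_col)) {
    problems <- c(problems, "umi_col is required when barcode_col is set")
  }
  if (length(chains) > 0 && is.null(locus_col)) {
    problems <- c(problems, "locus_col is required when schema has chains")
  }
  problems
}

# A feature-only schema with no extra columns raises no problems.
ok <- check_agg_args(c("v_call", "junction_aa"))

# A paired schema with a barcode column but no UMI or locus column.
bad <- check_agg_args(
  list(features = "junction_aa", chains = c("TRA", "TRB")),
  barcode_col = "barcode"
)
print(bad)
```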

Details

The function performs the following main steps:

1. Validation: Checks inputs, schema validity, and existence of required columns.
2. Schema Parsing: Determines receptor features and target chains from schema.
3. Locus Filtering: If schema$chains is provided, filters the dataset to include only rows matching the specified locus/loci.
4. Processing Logic (based on barcode_col and count_col):
   - Simple Table/Bulk (No Barcodes): Assigns unique internal barcode/chain IDs. Identifies unique receptors based on schema$features. Calculates imd_chain_count (1 for simple table, from count_col for bulk).
   - Single-Cell (Barcodes Provided): Uses barcode_col for imd_barcode_id.
     - Single Chain (length(schema$chains) <= 1): Identifies unique receptors based on schema$features. Uses umi_col to keep one chain per barcode when needed. imd_chain_count is 1.
     - Paired Chain (length(schema$chains) == 2): Requires locus_col and umi_col. Filters chains within each cell/locus group based on max umi_col. Creates paired receptors by joining the two specified loci for each cell based on schema$features from both. Assigns a unique imd_receptor_id to each pair. imd_chain_count is 1 (representing the chain record).
5. Output: Returns an annotated data frame containing original columns plus internal identifiers (imd_receptor_id, imd_barcode_id, imd_chain_id) and counts (imd_chain_count).
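The locus filtering and paired-chain steps above can be sketched in base R on a toy single-cell table. This is an illustration under stated assumptions, not the package's code: the column names (`barcode`, `locus`, `umis`), the tie-breaking (first row in sort order wins), and the pair-level result table are all invented here, whereas the function itself returns chain-level rows.

```r
# Toy single-cell chains; all values are invented.
sc <- data.frame(
  barcode     = c("c1", "c1", "c1", "c2", "c2", "c3"),
  locus       = c("TRA", "TRA", "TRB", "TRA", "TRB", "IGH"),
  junction_aa = c("CAVA", "CAVB", "CASSA", "CAVA", "CASSA", "CARX"),
  umis        = c(5, 2, 7, 3, 4, 9),
  stringsAsFactors = FALSE
)
target_chains <- c("TRA", "TRB")  # plays the role of schema$chains

# Locus filtering: keep only rows from the requested loci.
sc <- sc[sc$locus %in% target_chains, ]

# Within each barcode/locus group, keep the chain with the most UMIs.
sc <- sc[order(sc$barcode, sc$locus, -sc$umis), ]
top <- sc[!duplicated(sc[c("barcode", "locus")]), ]

# Pair the two loci per cell; cells missing either locus drop out.
first  <- top[top$locus == target_chains[1], c("barcode", "junction_aa")]
second <- top[top$locus == target_chains[2], c("barcode", "junction_aa")]
pairs  <- merge(first, second, by = "barcode", suffixes = c("_1", "_2"))

# One receptor ID per distinct pair of feature values.
pair_signature <- paste(pairs$junction_aa_1, pairs$junction_aa_2, sep = "|")
pairs$imd_receptor_id <- match(pair_signature, unique(pair_signature))

print(pairs)
```

In this toy input, cell `c1` has two TRA chains and only the one with 5 UMIs survives. Cells `c1` and `c2` then carry the same TRA/TRB pair and share a receptor ID.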

Internal column names are typically managed by immundata:::imd_schema().

Value

A duckplyr_df (or data frame) representing the annotated sequences. This table links each original sequence record (chain) to a defined receptor and includes standardized columns:

imd_receptor_id: Integer ID unique to each distinct receptor signature.
imd_barcode_id: Integer ID unique to each cell/barcode (or row if no barcode).
imd_chain_id: Integer ID unique to each input row (chain).
imd_chain_count: Integer count associated with the chain (1 for SC/simple, from count_col for bulk).

This output is typically assigned to the $annotations field of an ImmunData object.
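As a rough illustration of how the standardized columns can be used downstream, the base R sketch below summarises a hand-made annotation table per receptor. The table is invented and only mimics the column names listed above. It is not output produced by the function.

```r
# Invented annotation table using the standardized column names.
annotations <- data.frame(
  imd_receptor_id = c(1L, 1L, 2L, 3L),
  imd_barcode_id  = c(1L, 2L, 3L, 4L),
  imd_chain_id    = 1:4,
  imd_chain_count = c(10L, 5L, 1L, 4L)
)

# Total chain count per receptor (the natural abundance for bulk data).
receptor_counts <- aggregate(
  imd_chain_count ~ imd_receptor_id,
  data = annotations, FUN = sum
)

# Number of distinct barcodes per receptor (the natural abundance for
# single-cell data).
receptor_barcodes <- aggregate(
  imd_barcode_id ~ imd_receptor_id,
  data = annotations, FUN = function(x) length(unique(x))
)

print(receptor_counts)
print(receptor_barcodes)
```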
