process_antigenic_data: Process Raw Antigenic Assay Data

View source: R/data_preprocessing.R

process_antigenic_data    R Documentation

Process Raw Antigenic Assay Data

Description

Converts raw antigenic assay data held in a data frame into a standardized long format and a distance matrix. Handles both similarity measurements (such as titers, which must be converted to distances) and direct dissimilarity measurements (such as IC50). Threshold indicators (< and >) are preserved, and repeated measurements are averaged.

Usage

process_antigenic_data(
  data,
  antigen_col,
  serum_col,
  value_col,
  is_similarity = FALSE,
  metadata_cols = NULL,
  base = NULL,
  scale_factor = 1
)

Arguments

data

Data frame containing raw data.

antigen_col

Character. Name of column containing virus/antigen identifiers.

serum_col

Character. Name of column containing serum/antibody identifiers.

value_col

Character. Name of column containing measurements (titers or distances).

is_similarity

Logical. Whether values are measures of similarity such as titers or binding affinities (TRUE) or dissimilarities like IC50 (FALSE). Default: FALSE.

metadata_cols

Character vector. Names of additional columns to preserve. Default: NULL.

base

Numeric. Base of the logarithm transformation. Default: NULL, which selects 2 for similarities and e for dissimilarities.

scale_factor

Numeric. Scale factor for similarities: the base value of which all other dilutions are multiples, e.g. 10 for an HI assay with titers 10, 20, 40, ... Default: 1.
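The interplay of scale_factor and base can be illustrated with a short sketch. It assumes the log transformation has the form log(titer / scale_factor, base), which is the usual convention for HI titers but is not confirmed by this page; log_titer is a hypothetical helper, not a topolow function.

```r
# Hypothetical helper (not part of topolow): divide by the scale factor,
# then take the logarithm in the requested base.
log_titer <- function(titer, scale_factor = 10, base = 2) {
  log(titer / scale_factor, base = base)
}

# HI titers that are multiples of 10 map onto whole dilution steps.
titers <- c(10, 20, 40, 1280)
log_titer(titers)
# 0 1 2 7
```

Under this assumption, a titer equal to the scale factor maps to 0 and every doubling adds one unit.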

Details

The function handles these key steps:

  1. Validates input data and required columns

  2. Transforms values to log scale

  3. Converts similarities to distances using Smith's method if needed

  4. Averages repeated measurements

  5. Creates standardized long format

  6. Creates symmetric distance matrix

  7. Preserves metadata and threshold indicators

  8. Preserves virusYear and serumYear columns if present
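
Step 3 names Smith's method without spelling it out. The following is a minimal sketch, assuming it refers to the antigenic cartography convention in which each distance is the maximum log titer observed for a serum minus the log titer of the antigen against that serum. The table of log titers and all object names are hypothetical, and the package's own implementation may differ.

```r
# Hypothetical log2 titers: rows are antigens, columns are sera.
# matrix() fills column by column, so S1 = (7, 6, 2) and S2 = (8, 7, NA).
log_titers <- matrix(
  c(7, 6, 2,
    8, 7, NA),
  nrow = 3,
  dimnames = list(c("V1", "V2", "V3"), c("S1", "S2"))
)

# Highest log titer seen for each serum, ignoring missing measurements.
col_max <- apply(log_titers, 2, max, na.rm = TRUE)

# Distance = serum maximum minus observed log titer (assumed convention).
distances <- sweep(log_titers, 2, col_max, FUN = function(x, m) m - x)
print(distances)
```

In this sketch the best-reacting antigen for each serum sits at distance 0, and missing measurements stay NA.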

Input requirements and constraints:

  • Data frame must contain required columns

  • Column names must match specified parameters

  • Values can include threshold indicators (< or >)

  • Metadata columns must exist if specified

  • Allowed Year-related column names are "virusYear" and "serumYear"
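
Values carrying threshold indicators arrive as character strings such as "<40" or ">10". A minimal base R sketch of separating the indicator from the number is shown here; the variable names are hypothetical, and this is not necessarily how the function parses them internally.

```r
# Hypothetical raw measurements, stored as character because of "<" and ">".
raw <- c("1280", "<40", ">10", "0.05")

# Keep the leading "<" or ">" if present, otherwise an empty string.
threshold <- ifelse(grepl("^[<>]", raw), substr(raw, 1, 1), "")

# Strip the indicator and convert the remainder to numeric.
value <- as.numeric(sub("^[<>]", "", raw))

parsed <- data.frame(raw, threshold, value)
print(parsed)
```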

Value

A list containing two elements:

long

A data.frame in long format with standardized columns, including the original identifiers, processed values, and calculated distances. Any specified metadata is also included.

matrix

A numeric matrix holding the processed symmetric distances, with antigens and sera labelling both the rows and the columns.
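
To make the relationship between the two elements concrete, the sketch below averages repeated measurements in a small long-format table and fills a symmetric matrix from the result. The column names, the NA fill for unmeasured pairs, and the zero diagonal are assumptions for illustration; the actual layout returned by the function may differ.

```r
# Hypothetical long-format distances with one repeated antigen-serum pair.
long <- data.frame(
  antigen = c("V1", "V1", "V2"),
  serum = c("S1", "S1", "S1"),
  distance = c(1, 3, 4)
)

# Average repeated measurements of the same antigen-serum pair.
avg <- aggregate(distance ~ antigen + serum, data = long, FUN = mean)
ag <- as.character(avg$antigen)
sr <- as.character(avg$serum)

# Antigens and sera together index both rows and columns.
points <- c(unique(ag), unique(sr))
m <- matrix(NA_real_, length(points), length(points),
            dimnames = list(points, points))

# Fill both triangles so the matrix is symmetric; unmeasured pairs stay NA.
m[cbind(ag, sr)] <- avg$distance
m[cbind(sr, ag)] <- avg$distance
diag(m) <- 0  # assumed: each point is at distance 0 from itself
print(m)
```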

Examples

# Example 1: Processing HI titer data (similarities)
antigen_data <- data.frame(
  virus = c("A/H1N1/2009", "A/H1N1/2010", "A/H1N1/2011", "A/H1N1/2009", "A/H1N1/2010"),
  serum = c("anti-2009", "anti-2009", "anti-2009", "anti-2010", "anti-2010"),
  titer = c(1280, 640, "<40", 2560, 1280),  # "<40" is below the detection limit; c() coerces all values to character
  cluster = c("A", "A", "B", "A", "A"),
  color = c("red", "red", "blue", "red", "red")
)

# Process HI titer data (similarities -> distances)
results <- process_antigenic_data(
  data = antigen_data,
  antigen_col = "virus",
  serum_col = "serum",
  value_col = "titer",
  is_similarity = TRUE,  # Titers are similarities
  metadata_cols = c("cluster", "color"),
  scale_factor = 10  # Base dilution factor
)

# View the long format data
print(results$long)
# View the distance matrix
print(results$matrix)

# Example 2: Processing IC50 data (already dissimilarities)
ic50_data <- data.frame(
  virus = c("HIV-1", "HIV-2", "HIV-3"),
  antibody = c("mAb1", "mAb1", "mAb2"),
  ic50 = c(0.05, ">10", 0.2)
)

results_ic50 <- process_antigenic_data(
  data = ic50_data,
  antigen_col = "virus",
  serum_col = "antibody",
  value_col = "ic50",
  is_similarity = FALSE  # IC50 values are dissimilarities
)


topolow documentation built on Aug. 31, 2025, 1:07 a.m.