View source: R/data_preprocessing.R
process_antigenic_data | R Documentation |
Processes raw antigenic assay data from data frames into standardized long and matrix formats. Handles both similarity data (like titers, which need conversion to distances) and direct dissimilarity measurements like IC50. Preserves threshold indicators (<, >) and handles repeated measurements by averaging.
process_antigenic_data(
  data,
  antigen_col,
  serum_col,
  value_col,
  is_similarity = FALSE,
  metadata_cols = NULL,
  base = NULL,
  scale_factor = 1
)
data | Data frame containing raw data. |
antigen_col | Character. Name of column containing virus/antigen identifiers. |
serum_col | Character. Name of column containing serum/antibody identifiers. |
value_col | Character. Name of column containing measurements (titers or distances). |
is_similarity | Logical. Whether values are measures of similarity such as titers or binding affinities (TRUE) or dissimilarities like IC50 (FALSE). Default: FALSE. |
metadata_cols | Character vector. Names of additional columns to preserve. |
base | Numeric. Base for logarithm transformation (default: 2 for similarities, e for dissimilarities). |
scale_factor | Numeric. Scale factor for similarities. This is the base value that all other dilutions are multiples of. E.g., 10 for HI assay where titers are 10, 20, 40,... Default: 1. |
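To make the roles of scale_factor and base concrete, here is a minimal sketch of a log transform that first divides by the base dilution. The helper name log_titer is hypothetical and is not part of the package; the package's internal transformation may differ in detail.

```r
# Hypothetical helper (not a package function): divide each titer by the
# base dilution, then take the logarithm in the requested base.
log_titer <- function(titer, scale_factor = 1, base = 2) {
  log(titer / scale_factor, base = base)
}

# HI titers are multiples of 10, so scale_factor = 10 maps 10, 20, 40, ...
# onto 0, 1, 2, ... on the log2 scale.
hi_titers <- c(10, 20, 40, 1280)
log_titer(hi_titers, scale_factor = 10)
# 0 1 2 7
```

With scale_factor = 1 the same titers would be shifted by a constant log2(10), which is why setting the base dilution explicitly gives whole-number log titers.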
The function handles these key steps:
Validates input data and required columns
Transforms values to log scale
Converts similarities to distances using Smith's method if needed
Averages repeated measurements
Creates standardized long format
Creates symmetric distance matrix
Preserves metadata and threshold indicators
Preserves virusYear and serumYear columns if present
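The log transform, averaging, and similarity-to-distance steps above can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: it assumes a base dilution of 10, averages repeats on the log scale, and takes Smith's method to mean the highest log titer observed for a serum minus the observed log titer. All object names are made up for the example.

```r
# Illustrative data; V2 against S2 was measured twice.
titers <- data.frame(
  virus = c("V1", "V2", "V3", "V1", "V2", "V2"),
  serum = c("S1", "S1", "S1", "S2", "S2", "S2"),
  titer = c(1280, 640, 80, 320, 640, 160),
  stringsAsFactors = FALSE
)

# Step 1: log2 scale relative to the assumed base dilution of 10.
titers$log_titer <- log2(titers$titer / 10)

# Step 2: average repeated virus-serum pairs (assumed to happen on the log scale).
avg <- aggregate(log_titer ~ virus + serum, data = titers, FUN = mean)

# Step 3: distance = maximum log titer for that serum minus the observed log titer.
avg$distance <- ave(avg$log_titer, avg$serum, FUN = max) - avg$log_titer

print(avg)
```

The best-reacting virus for each serum ends up at distance 0, and every twofold drop in titer adds one unit of distance.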
Input requirements and constraints:
Data frame must contain required columns
Column names must match specified parameters
Values can include threshold indicators (< or >)
Metadata columns must exist if specified
Allowed Year-related column names are "virusYear" and "serumYear"
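A rough sketch of how these requirements could be checked before calling the function is shown below. Both helpers, parse_thresholded and check_columns, are hypothetical names introduced for illustration and are not exported by the package.

```r
# Hypothetical helper: split values such as "<40" or ">10" into a
# threshold indicator and a numeric part.
parse_thresholded <- function(x) {
  x <- trimws(as.character(x))
  indicator <- ifelse(grepl("^[<>]", x), substr(x, 1, 1), "")
  value <- as.numeric(sub("^[<>]", "", x))
  data.frame(indicator = indicator, value = value, stringsAsFactors = FALSE)
}

# Hypothetical helper: stop with a clear message if any column is absent.
check_columns <- function(data, cols) {
  missing_cols <- setdiff(cols, names(data))
  if (length(missing_cols) > 0) {
    stop("Missing required columns: ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

parsed <- parse_thresholded(c("1280", "<40", ">10"))
print(parsed)
```

Running check_columns on the antigen, serum, value, and metadata column names before processing surfaces a misspelled column name early.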
A list containing two elements:
long | A data frame in standardized long format. |
matrix | A numeric symmetric distance matrix. |
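As a rough picture of how the two elements relate, the sketch below builds a symmetric matrix from a small long-format table. The column names and the layout (antigens and sera sharing both rows and columns, with unmeasured pairs left as NA) are assumptions made for illustration; the actual structure returned by the function may differ.

```r
# Illustrative long-format distances (column names are assumed).
long <- data.frame(
  antigen = c("V1", "V2", "V1"),
  serum = c("S1", "S1", "S2"),
  distance = c(0, 1, 2),
  stringsAsFactors = FALSE
)

# Every antigen and serum becomes both a row and a column.
point_names <- unique(c(long$antigen, long$serum))
m <- matrix(NA_real_, nrow = length(point_names), ncol = length(point_names),
            dimnames = list(point_names, point_names))

# Fill each measured pair in both directions so the matrix is symmetric.
idx <- cbind(long$antigen, long$serum)
m[idx] <- long$distance
m[idx[, 2:1, drop = FALSE]] <- long$distance

print(m)
```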
# Example 1: Processing HI titer data (similarities)
antigen_data <- data.frame(
  virus = c("A/H1N1/2009", "A/H1N1/2010", "A/H1N1/2011", "A/H1N1/2009", "A/H1N1/2010"),
  serum = c("anti-2009", "anti-2009", "anti-2009", "anti-2010", "anti-2010"),
  titer = c(1280, 640, "<40", 2560, 1280), # Some below detection limit
  cluster = c("A", "A", "B", "A", "A"),
  color = c("red", "red", "blue", "red", "red")
)
# Process HI titer data (similarities -> distances)
results <- process_antigenic_data(
  data = antigen_data,
  antigen_col = "virus",
  serum_col = "serum",
  value_col = "titer",
  is_similarity = TRUE, # Titers are similarities
  metadata_cols = c("cluster", "color"),
  scale_factor = 10 # Base dilution factor
)
# View the long format data
print(results$long)
# View the distance matrix
print(results$matrix)
# Example 2: Processing IC50 data (already dissimilarities)
ic50_data <- data.frame(
  virus = c("HIV-1", "HIV-2", "HIV-3"),
  antibody = c("mAb1", "mAb1", "mAb2"),
  ic50 = c(0.05, ">10", 0.2)
)
results_ic50 <- process_antigenic_data(
  data = ic50_data,
  antigen_col = "virus",
  serum_col = "antibody",
  value_col = "ic50",
  is_similarity = FALSE # IC50 values are dissimilarities
)