standardize: Standardize Allelic Ratio Data and Compute BAF and Z-Scores
In Qploidy: Estimation of Ploidy and Detection of Aneuploidy Using Genotyping Data

Description

This function performs signal standardization of genotype data by aligning 'theta' values (allelic ratios or normalized intensities) to expected genotype clusters. It outputs standardized BAF (B-allele frequency) and Z-scores per sample and marker.

Usage

standardize(
  data = NULL,
  genos = NULL,
  geno.pos = NULL,
  threshold.missing.geno = 0.9,
  threshold.geno.prob = 0.8,
  ploidy.standardization = NULL,
  threshold.n.clusters = NULL,
  n.cores = 1,
  out_filename = NULL,
  type = "intensities",
  multidog_obj = NULL,
  parallel.type = "PSOCK",
  verbose = TRUE,
  rm_outlier = TRUE,
  cluster_median = TRUE
)

Arguments

`data`	A 'data.frame' containing the full dataset with the following columns:
	MarkerName: Marker identifiers.
	SampleName: Sample identifiers.
	X: Reference allele intensity or count.
	Y: Alternative allele intensity or count.
	R: Total signal intensity or read depth (X + Y).
	ratio: Allelic ratio, typically Y / (X + Y).
`genos`	A 'data.frame' containing genotype dosage information for the reference panel. This should include samples of known ploidy and ideally euploid individuals. Required columns:
	MarkerName: Marker identifiers.
	SampleName: Sample identifiers.
	geno: Estimated dosage (0, 1, 2, ...).
	prob: Genotype call probability (used for filtering low-confidence genotypes).
`geno.pos`	A 'data.frame' with marker position metadata. Required columns:
	MarkerName: Marker identifiers.
	Chromosome: Chromosome names.
	Position: Base-pair positions on the genome.
`threshold.missing.geno`	Numeric (0–1). Maximum fraction of missing genotype data allowed per marker. Markers with a higher fraction will be removed.
`threshold.geno.prob`	Numeric (0–1). Minimum genotype call probability threshold. Genotypes with lower probability will be treated as missing.
`ploidy.standardization`	Integer. The ploidy level of the reference panel used for standardization.
`threshold.n.clusters`	Integer. Minimum number of expected dosage clusters per marker. For diploid data, this is typically 3 (corresponding to genotypes 0, 1, and 2).
`n.cores`	Integer. Number of cores to use in parallel computations (e.g., for cluster center estimation and BAF generation).
`out_filename`	Optional. Path to save the final standardized dataset to disk as a CSV file (suitable for Qploidy).
`type`	Character. Type of data used for clustering:
	"intensities": For array-based allele intensity data.
	"counts": For sequencing data.
	"updog": Automatically set when 'multidog_obj' is provided.
`multidog_obj`	Optional. An object of class 'multidog' from the 'updog' package, containing model fits and estimated biases. If provided, this overrides the 'type' parameter and the expected cluster positions from 'updog' are used.
`parallel.type`	Character. Parallel backend to use ('"FORK"' or '"PSOCK"'). '"FORK"' is faster but only works on Unix-like systems.
`verbose`	Logical. If 'TRUE', prints progress and filtering information to the console.
`rm_outlier`	Logical. If 'TRUE', uses Bonferroni-Holm corrected residuals to remove outliers before estimating cluster centers.
`cluster_median`	Logical. If 'TRUE', uses the median of theta values to estimate cluster centers. If 'FALSE', uses the mean.
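
The input layout and the two genotype thresholds described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's internal code: all values are invented toy data, 'threshold.missing.geno' is set to 0.5 here (the documented default is 0.9) so that one marker is actually dropped, and the object names 'miss_frac', 'keep' and 'genos_filtered' are hypothetical.

```r
# Toy allele signals for 2 markers x 3 samples (invented values).
data <- data.frame(
  MarkerName = rep(c("m1", "m2"), each = 3),
  SampleName = rep(c("s1", "s2", "s3"), times = 2),
  X = c(90, 50, 10, 80, 40, 5),
  Y = c(10, 50, 90, 20, 60, 95)
)
data$R     <- data$X + data$Y   # total signal or read depth
data$ratio <- data$Y / data$R   # allelic ratio, Y / (X + Y)

# Toy reference genotypes with call probabilities.
genos <- data.frame(
  MarkerName = data$MarkerName,
  SampleName = data$SampleName,
  geno = c(0, 1, 2, 0, 1, 2),
  prob = c(0.99, 0.95, 0.60, 0.50, 0.70, 0.99)
)

threshold.geno.prob    <- 0.8
threshold.missing.geno <- 0.5   # assumption for this toy example only

# Genotypes called with a probability below the threshold become missing.
genos$geno[genos$prob < threshold.geno.prob] <- NA

# Fraction of missing genotypes per marker; markers above the limit are dropped.
miss_frac <- tapply(is.na(genos$geno), genos$MarkerName, mean)
keep <- names(miss_frac)[miss_frac <= threshold.missing.geno]
genos_filtered <- genos[genos$MarkerName %in% keep, ]
```

In this toy case marker "m1" has one of three genotypes missing and is kept, while "m2" has two of three missing and is removed.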

Details

Reference genotypes are used to estimate cluster centers either from dosage data (e.g., via 'fitpoly' or 'updog') or using an 'updog' 'multidog' object directly. This function supports both array-based (intensity) and sequencing-based (count) data.
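
As a rough sketch of what estimating cluster centers from dosage data can look like, the base R snippet below summarizes 'theta' per marker and dosage class with either the median or the mean, mirroring the 'cluster_median' switch, and then counts the observed dosage clusters per marker in the spirit of 'threshold.n.clusters'. The helper 'cluster_centers()' and the toy values are hypothetical and are not taken from the package.

```r
# Toy reference data for one marker: three samples per dosage class.
# The third dosage-0 sample (theta = 0.30) is a deliberate outlier.
ref <- data.frame(
  MarkerName = "m1",
  geno  = c(0, 0, 0, 1, 1, 1, 2, 2, 2),
  theta = c(0.02, 0.04, 0.30, 0.48, 0.50, 0.55, 0.95, 0.97, 0.99)
)

# Hypothetical helper: one center per marker and dosage class.
cluster_centers <- function(ref, cluster_median = TRUE) {
  center_fun <- if (cluster_median) median else mean
  aggregate(theta ~ MarkerName + geno, data = ref, FUN = center_fun)
}

centers_median <- cluster_centers(ref, cluster_median = TRUE)
centers_mean   <- cluster_centers(ref, cluster_median = FALSE)

# Number of dosage clusters observed per marker, compared with a minimum.
threshold.n.clusters <- 3
n_clusters <- tapply(centers_median$geno, centers_median$MarkerName, length)
markers_ok <- names(n_clusters)[n_clusters >= threshold.n.clusters]
```

With these values the outlier pulls the mean center of dosage 0 to 0.12, while the median stays at 0.04, which shows why a median-based center is less sensitive to stray points.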

It applies marker and genotype-level quality filters, uses parallel computing to estimate BAF, and generates a final annotated output suitable for CNV or dosage variation analyses.
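
This page does not spell out the formulas behind the BAF and Z-score columns. One common way to derive a BAF from cluster centers in array-based analyses is piecewise-linear interpolation: a 'theta' value lying between the centers of dosage k and k + 1 is mapped between k / ploidy and (k + 1) / ploidy, and values outside the outermost centers are clamped to 0 or 1. The sketch below shows that idea together with a plain per-marker Z-score of total signal. Both are assumptions for illustration, the function 'theta_to_baf()' is hypothetical, and neither should be read as the package's exact computation.

```r
# Hypothetical helper: map theta to BAF by interpolating between cluster centers.
# 'centers' holds one center per dosage 0..ploidy, in strictly increasing order.
theta_to_baf <- function(theta, centers, ploidy) {
  stopifnot(length(centers) == ploidy + 1,
            !is.unsorted(centers, strictly = TRUE))
  expected <- (0:ploidy) / ploidy   # expected BAF of each dosage class
  # rule = 2 clamps values outside the outermost centers to 0 or 1.
  approx(x = centers, y = expected, xout = theta, rule = 2)$y
}

centers <- c(0.04, 0.50, 0.97)      # toy diploid centers for one marker
theta   <- c(0.01, 0.04, 0.27, 0.50, 0.735, 0.99)
baf     <- theta_to_baf(theta, centers, ploidy = 2)

# Assumed Z-score: total signal R of each sample, centered and scaled
# across the samples of a single marker.
R <- c(100, 110, 90, 150, 105)
z <- (R - mean(R)) / sd(R)
```

A 'theta' of 0.27 sits halfway between the dosage-0 and dosage-1 centers and therefore maps to a BAF of 0.25.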

Value

An object of class '"qploidy_standardization"' (list) with the following components:

info: Named vector of standardization parameters.
filters: Named vector summarizing how many markers were removed at each filtering step.
data: A data.frame containing merged BAF, Z-score, and genotype information by marker and sample.
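
Because the returned object is a list, its components can be reached with '$'. The snippet below builds a small mock object with the same three component names to show the access pattern. Everything inside the components is hypothetical, including the element names of 'info' and 'filters' and the column names of 'data'; inspect a real result with 'str()' or 'names()' to see the actual fields.

```r
# Mock object mimicking the documented structure (contents are invented).
res <- structure(
  list(
    info    = c(ploidy.standardization = "2", type = "intensities"),
    filters = c(n.markers.start = 1000, n.markers.removed = 120),
    data    = data.frame(
      MarkerName = c("m1", "m1"),
      SampleName = c("s1", "s2"),
      baf  = c(0.02, 0.51),
      z    = c(-0.3, 1.2),
      geno = c(0, 1)
    )
  ),
  class = "qploidy_standardization"
)

is_std     <- inherits(res, "qploidy_standardization")
data_type  <- res$info[["type"]]                      # a standardization parameter
one_sample <- res$data[res$data$SampleName == "s1", ] # rows for a single sample
```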