colocboost_validate_input_data: Validate and Process All Input Data for ColocBoost
In colocboost: Multi-Context Colocalization Analysis for QTL and GWAS Studies

colocboost_validate_input_data

R Documentation

Description

Internal function that validates and processes both individual-level and summary-level input data.

Usage

colocboost_validate_input_data(
  X = NULL,
  Y = NULL,
  sumstat = NULL,
  LD = NULL,
  X_ref = NULL,
  dict_YX = NULL,
  dict_sumstatLD = NULL,
  effect_est = NULL,
  effect_se = NULL,
  effect_n = NULL,
  overlap_variables = FALSE,
  M = 500,
  min_abs_corr = 0.5,
  jk_equiv_corr = 0.8,
  jk_equiv_loglik = 1,
  func_simplex = "LD_z2z",
  cos_npc_cutoff = 0.2,
  npc_outcome_cutoff = 0.2
)

Arguments

`X`	A list of genotype matrices for different outcomes, or a single matrix if all outcomes share the same genotypes.
`Y`	A list of outcome vectors, or an N by L matrix when multiple outcomes are measured on the same X.
`sumstat`	A list of data.frames of summary statistics.
`LD`	A list of correlation matrices, each giving the LD matrix for one genotype. It can also be a single matrix if all sumstats were obtained from the same genotypes. Provide either `LD` or `X_ref`, not both. If neither is provided, LD-free mode is used.
`X_ref`	A reference panel genotype matrix (N_ref x P) or a list of matrices, as an alternative to providing a precomputed `LD` matrix. Column names must include variant names matching those in `sumstat`. When the number of reference panel samples is less than the number of variants (N_ref < P), this avoids storing the full P x P LD matrix and reduces memory usage. When N_ref >= P, LD is precomputed from `X_ref` internally. Provide either `LD` or `X_ref`, not both. If neither is provided, LD-free mode is used.
`dict_YX`	An L by 2 dictionary matrix mapping Y to X, used when subsets of outcomes correspond to the same X matrix.
`dict_sumstatLD`	An L by 2 dictionary matrix mapping sumstat to LD, used when subsets of summary statistics correspond to the same LD matrix.
`effect_est`	Matrix of variable regression coefficients (i.e., regression beta values) in the genomic region.
`effect_se`	Matrix of standard errors associated with the beta values.
`effect_n`	A scalar or a vector of sample sizes for estimating regression coefficients.
`overlap_variables`	If overlap_variables = TRUE, only perform colocalization in the overlapped region.
`M`	The maximum number of gradient boosting rounds for each outcome (default is 500).
`min_abs_corr`	Minimum absolute correlation allowed in a confidence set.
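
The relationship between `X`, `Y`, and `dict_YX` can be illustrated with a small base-R sketch. It does not call colocboost and is not the package's own validation code. The data are simulated, `check_dict` is a hypothetical helper, and the column layout of `dict_YX` (outcome index in column 1, index into `X` in column 2) is an assumption based on the description above.

```r
set.seed(1)
N <- 100; P <- 20; L <- 3
snps <- paste0("snp", seq_len(P))

# Two genotype matrices: the first is shared by outcomes 1 and 2,
# so it is stored only once in the list.
X <- list(
  matrix(rnorm(N * P), N, P, dimnames = list(NULL, snps)),
  matrix(rnorm(N * P), N, P, dimnames = list(NULL, snps))
)
Y <- list(rnorm(N), rnorm(N), rnorm(N))

# Assumed layout: column 1 = outcome index (1..L), column 2 = index into X.
dict_YX <- cbind(seq_len(L), c(1, 1, 2))

# Hypothetical helper (not part of colocboost): basic consistency checks.
check_dict <- function(X, Y, dict) {
  stopifnot(is.matrix(dict), ncol(dict) == 2, nrow(dict) == length(Y))
  # Every referenced genotype matrix must exist in X.
  stopifnot(all(dict[, 2] %in% seq_along(X)))
  for (i in seq_len(nrow(dict))) {
    # Each outcome must have as many samples as its genotype matrix has rows.
    stopifnot(length(Y[[dict[i, 1]]]) == nrow(X[[dict[i, 2]]]))
  }
  TRUE
}

check_dict(X, Y, dict_YX)
```

A `dict_sumstatLD` matrix would presumably follow the same pattern, with column 2 indexing the `LD` (or `X_ref`) list instead of `X`.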

Value

A list containing:

`X`	Processed list of genotype matrices
`Y`	Processed list of phenotype vectors
`yx_dict`	Dictionary mapping Y to X
`keep_variable_individual`	List of variable names for each X matrix
`sumstat`	Processed list of summary statistics data.frames
`LD`	Processed list of LD matrices
`X_ref`	Processed list of reference genotype matrices
`ref_label`	Style of reference matrices
`sumstatLD_dict`	Dictionary mapping sumstat to LD
`keep_variable_sumstat`	List of variant names for each sumstat
`Z`	List of z-scores for each outcome
`N_sumstat`	List of sample sizes for each outcome
`Var_y`	List of phenotype variances for each outcome
`SeBhat`	List of standard errors for each outcome
`M_updated`	Updated M value (may be changed if LD not provided)
`min_abs_corr_updated`	Updated min_abs_corr value (may be changed if LD not provided)
`jk_equiv_corr_updated`	Updated jk_equiv_corr value
`jk_equiv_loglik_updated`	Updated jk_equiv_loglik value
`func_simplex_updated`	Updated func_simplex value
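
The memory argument behind `X_ref` can be sketched in base R. This is an illustration under stated assumptions, not the package's internal implementation: the reference panel, effect estimates, and standard errors are simulated, and z-scores are taken as `effect_est / effect_se`. When N_ref < P, products with the LD matrix can be computed from the standardized reference panel without forming the P x P matrix.

```r
set.seed(2)
N_ref <- 50; P <- 200
snps <- paste0("snp", seq_len(P))
X_ref <- matrix(rnorm(N_ref * P), N_ref, P, dimnames = list(NULL, snps))

# Explicit P x P LD matrix (what a precomputed `LD` argument would hold).
LD <- cor(X_ref)

# Center and scale the columns so that crossprod(Xs) equals cor(X_ref).
Xs <- scale(X_ref) / sqrt(N_ref - 1)

# Simulated effect estimates and standard errors for a single outcome.
effect_est <- rnorm(P, sd = 0.1)
effect_se <- rep(0.05, P)
z <- effect_est / effect_se

# LD %*% z computed two ways: with and without the full LD matrix.
via_LD <- as.vector(LD %*% z)
via_ref <- as.vector(crossprod(Xs, Xs %*% z))

all.equal(via_LD, via_ref)

# Number of stored entries: P * P for LD versus N_ref * P for X_ref.
c(LD = length(LD), X_ref = length(X_ref))
```

With these dimensions the reference panel holds a quarter as many entries as the LD matrix; the saving grows as P increases relative to N_ref.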