knitr::opts_chunk$set( collapse = TRUE, comment = "#>", dpi = 80 )
This vignette demonstrates how to diagnose LD mismatch in the colocboost package and how to perform colocalization analysis with a mismatched LD matrix or without LD,
for example when some traits completely lack LD information or share only partial variant coverage with other traits.
library(colocboost)
ColocBoost assumes that the LD matrix accurately estimates the correlations among variants in the original GWAS genotype data.
Typically, the LD matrix comes from a public genotype database for a suitable reference population.
An inaccurate LD matrix may lead to unreliable colocalization results, especially if it differs substantially from the one estimated from the original genotype data.
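The effect of a mismatch can be illustrated with a small base-R simulation. The sketch below rests on a common heuristic that is an assumption here, not ColocBoost's own diagnostic: when one variant drives the signal, the Z-score of a correlated variant is roughly its LD with the lead variant times the lead Z-score. All names (`latent`, `lead`, `flip`) are hypothetical, and the data are simulated.

```r
set.seed(1)
n <- 500
p <- 20

# Simulated genotype-like matrix: a shared latent factor induces LD (r ~ 0.64)
latent <- rnorm(n)
X <- sapply(seq_len(p), function(j) 0.8 * latent + 0.6 * rnorm(n))
X <- scale(X)

# One trait driven by the first variant
y <- 0.3 * X[, 1] + rnorm(n)

# Marginal Z-scores from simple linear regressions
z <- apply(X, 2, function(x) summary(lm(y ~ x))$coefficients["x", "t value"])

# In-sample LD matrix and the Z-scores it implies given the lead variant
R <- cor(X)
lead <- which.max(abs(z))
expected <- R[, lead] * z[lead]

# Flip the sign of one non-lead variant to mimic an allele or LD mismatch
flip <- setdiff(seq_len(p), lead)[1]
z_bad <- z
z_bad[flip] <- -z_bad[flip]

# Discrepancy between observed and LD-implied Z-scores, before and after
resid_ok <- z - expected
resid_bad <- z_bad - expected
print(c(before = abs(resid_ok[flip]), after = abs(resid_bad[flip])))
```

The flipped variant ends up far from the value implied by the LD matrix, which is the kind of inconsistency between summary statistics and reference LD that the diagnostics are meant to flag.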
An inaccurate LD matrix can also cause inconsistencies between the summary statistics and the reference LD matrix.
ColocBoost provides diagnostic warnings for assessing the consistency of the summary statistics with the reference LD matrix.
One such warning is raised when rtr < 0 or rtr > var_y (see details in Supplementary Note S3.5.2).
In this example, we create a simulated dataset with LD mismatch by flipping the sign of the Z-scores for about 0.5% of variants (miss_prop = 0.005) in each trait.
# Create a simulated dataset with LD mismatch
data("Sumstat_5traits")
data("Ind_5traits")
LD <- get_cormat(Ind_5traits$X[[1]])

# Flip the sign of the Z-score for 0.5% of variants in each trait
# so that the summary statistics no longer match the LD matrix
set.seed(123)
miss_prop <- 0.005
sumstat <- lapply(Sumstat_5traits$sumstat, function(ss) {
  p <- nrow(ss)
  pos_miss <- sample(seq_len(p), ceiling(miss_prop * p))
  ss$z[pos_miss] <- -ss$z[pos_miss]
  return(ss)
})
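To see how many variants this scheme actually flips, the same logic can be applied to toy data. This is a minimal sketch using only base R; `toy_sumstat` and its sizes are made up for illustration and are not the package data.

```r
set.seed(123)
miss_prop <- 0.005

# Two toy summary-statistics tables with different numbers of variants
toy_sumstat <- list(
  data.frame(z = rnorm(1000)),
  data.frame(z = rnorm(150))
)

# Same sign-flipping step as in the vignette code
flipped <- lapply(toy_sumstat, function(ss) {
  p <- nrow(ss)
  pos_miss <- sample(seq_len(p), ceiling(miss_prop * p))
  ss$z[pos_miss] <- -ss$z[pos_miss]
  ss
})

# Count the variants whose Z-score changed in each table
n_flipped <- mapply(
  function(before, after) sum(before$z != after$z),
  toy_sumstat, flipped
)
print(n_flipped)
```

Because of `ceiling()`, at least one variant is flipped per trait even when `miss_prop * p` is below 1, so small regions see a proportionally larger perturbation.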
When running colocboost with an LD mismatch, you may encounter diagnostic warnings.
These warnings are not errors, and the analysis will still proceed.
However, the results may be less reliable due to the mismatch, and the computational time may increase as the algorithm takes longer to converge.
res <- colocboost(sumstat = sumstat, LD = LD)
res$cos_details$cos$cos_index
These warnings serve as diagnostic tools to alert users about potential inconsistencies in the input data.
res$cos_details$cos_outcomes_npc
Note: In the above example, the normalized probability of trait 2 is 0, indicating that colocalization with trait 2 may be less reliable due to the LD mismatch. The analysis still proceeds, but we suggest treating the colocalization of trait 2 with caution.
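A simple way to act on such values is to flag traits whose normalized probability falls below a chosen cutoff. The sketch below uses a toy table rather than real output: the column names `outcome` and `npc`, the values, and the treatment of the cutoff as a lower bound are all assumptions for illustration, not the documented structure of `res$cos_details$cos_outcomes_npc`.

```r
# Toy stand-in for the per-trait values of one CoS (hypothetical layout and values)
npc_table <- data.frame(
  outcome = paste0("Y", 1:4),
  npc = c(0.95, 0.00, 0.88, 0.91),
  stringsAsFactors = FALSE
)

# Hypothetical cutoff, chosen to mirror npc_outcome_cutoff = 0.2
npc_outcome_cutoff <- 0.2

# Traits below the cutoff deserve a closer look; the rest are kept
flagged <- npc_table$outcome[npc_table$npc < npc_outcome_cutoff]
kept <- npc_table$outcome[npc_table$npc >= npc_outcome_cutoff]

print(flagged)
print(kept)
```

In this toy table only the second trait is flagged, matching the situation described in the note.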
Potential solutions include:
- Inspect the signals in this region using colocboost_plot(res).
- Remove potentially spurious signals when an LD mismatch is detected, using get_robust_colocalization(res, cos_npc_cutoff = 0.5, npc_outcome_cutoff = 0.2) to exclude trait 2 in the above example if its signals are not reasonable.
- Use get_robust_colocalization(res, pvalue_cutoff = 1e-5, cos_npc_cutoff = 0, npc_outcome_cutoff = 0) to keep all colocalized traits with stronger marginal evidence, even though a mismatch is detected.

When there is substantial discordance between the LD matrix and summary statistics, the reliability of colocalization analysis may be compromised. Such discordance can arise when the LD matrix and summary statistics are derived from different populations or when the LD matrix is estimated from a smaller or less representative reference sample. This can lead to unexpected results, such as biased causal variant identification or reduced accuracy in the analysis.
To address these challenges, ColocBoost provides two alternative approaches for colocalization analysis under the assumption of one causal variant per trait per region:
One-iteration approach (recommended): performing only one iteration of gradient boosting with the LD matrix ensures that the LD matrix is only used to check the equivalence among trait-specific best update variants.
This method is particularly useful when the LD matrix is mismatched but still provides valuable insights into variant correlations.
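The idea of using LD only to compare the top variants across traits can be sketched in base R. This is an illustration under stated assumptions, not ColocBoost's internal rule: the "best" variant is taken as the one with the largest absolute Z-score, and two variants count as equivalent when their absolute LD reaches a hypothetical cutoff `ld_cutoff`. The LD matrix and Z-scores are toy values.

```r
# Toy LD matrix for four variants (values chosen for illustration)
R <- matrix(c(
  1.0, 0.2, 0.1, 0.0,
  0.2, 1.0, 0.9, 0.1,
  0.1, 0.9, 1.0, 0.2,
  0.0, 0.1, 0.2, 1.0
), nrow = 4, byrow = TRUE)

# Toy Z-scores for two traits
z_trait1 <- c(0.5, 6.1, 5.4, 0.3)
z_trait2 <- c(0.2, 4.8, 5.2, -0.4)

# Best variant per trait under a single-causal-variant assumption
top1 <- which.max(abs(z_trait1))
top2 <- which.max(abs(z_trait2))

# LD is consulted only to decide whether the two top variants are equivalent
ld_cutoff <- 0.8 # hypothetical threshold
equivalent <- abs(R[top1, top2]) >= ld_cutoff

print(c(top1 = top1, top2 = top2))
print(equivalent)
```

Here the two traits peak at different variants that are in strong LD, so they are treated as pointing to the same signal. A moderately inaccurate LD matrix would usually still support that call, which is why this approach tolerates some mismatch.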
# Perform only 1 iteration of gradient boosting with the LD matrix
res_mismatch <- colocboost(sumstat = sumstat, LD = LD, M = 1)

# LD-free analysis: run colocboost without an LD matrix
res_free <- colocboost(sumstat = sumstat)
While this method is computationally efficient, it has limitations due to the strong assumption of a single causal variant per trait per region. Users should interpret the results with caution, especially in regions with complex LD structures or multiple causal variants.
ColocBoost also accepts summary statistics in a HyPrColoc-compatible format, without an LD matrix.
# Load the dataset
data(Ind_5traits)
X <- Ind_5traits$X
Y <- Ind_5traits$Y

# Convert to HyPrColoc-compatible format
effect_est <- effect_se <- effect_n <- c()
for (i in seq_along(X)) {
  x <- X[[i]]
  y <- Y[[i]]
  effect_n[i] <- length(y)
  output <- susieR::univariate_regression(X = x, y = y)
  effect_est <- cbind(effect_est, output$beta)
  effect_se <- cbind(effect_se, output$sebeta)
}
colnames(effect_est) <- colnames(effect_se) <- c("Y1", "Y2", "Y3", "Y4", "Y5")
rownames(effect_est) <- rownames(effect_se) <- colnames(X[[1]])

# Run colocboost
res <- colocboost(effect_est = effect_est, effect_se = effect_se, effect_n = effect_n)

# Identified CoS
res$cos_details$cos$cos_index
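The shape of these inputs may be easier to see on simulated data. The following sketch builds variant-by-trait matrices of effect estimates and standard errors using only base R, with `lm()` standing in for `susieR::univariate_regression`. The simulated genotypes, the variant names, and the shared `X` across traits are assumptions made for illustration.

```r
set.seed(42)
n <- 200
p <- 10
n_traits <- 3

# Simulated genotype-like matrix shared by all traits (illustration only)
X <- matrix(rnorm(n * p), n, p,
  dimnames = list(NULL, paste0("SNP", seq_len(p)))
)

# Each trait is driven by the second variant plus noise
Y <- lapply(seq_len(n_traits), function(i) 0.4 * X[, 2] + rnorm(n))

# Effect estimate and standard error from one simple regression
fit_one <- function(x, y) {
  summary(lm(y ~ x))$coefficients["x", c("Estimate", "Std. Error")]
}

# Variants in rows, traits in columns
effect_est <- effect_se <- matrix(NA_real_, p, n_traits,
  dimnames = list(colnames(X), paste0("Y", seq_len(n_traits)))
)
for (i in seq_len(n_traits)) {
  stats <- apply(X, 2, fit_one, y = Y[[i]]) # 2 x p matrix
  effect_est[, i] <- stats["Estimate", ]
  effect_se[, i] <- stats["Std. Error", ]
}

# One sample size per trait
effect_n <- rep(n, n_traits)

print(dim(effect_est))
```

The resulting `effect_est` and `effect_se` share row names (variants) and column names (traits), and `effect_n` has one entry per trait, which is the layout passed to `colocboost()` above.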