knitr::opts_chunk$set( collapse = TRUE, comment = "#>", dpi = 80 )
This vignette demonstrates how to perform multi-trait colocalization analysis using summary statistics data, specifically focusing on the Sumstat_5traits dataset included in the package.
library(colocboost)
Sumstat_5traits Dataset

The Sumstat_5traits dataset contains summary statistics for 5 simulated traits, derived directly from the Ind_5traits dataset using marginal association.
The dataset is specifically designed to evaluate and demonstrate the capabilities of ColocBoost in multi-trait colocalization analysis with summary association data.
- sumstat: A list of data.frames of summary statistics for different traits.
- true_effect_variants: True effect variant indices for each trait.

LD could be calculated from the X data in the Ind_5traits dataset, but it is not included in the Sumstat_5traits dataset.

The dataset features two causal variants with indices 194 and 589.
This structure creates a realistic scenario in which multiple traits are influenced by different but overlapping sets of genetic variants.
# Loading the Dataset
data("Sumstat_5traits")
names(Sumstat_5traits)
Sumstat_5traits$true_effect_variants
Due to the file size limits of the CRAN release, this is a subset of the simulated data. See the full dataset in the colocboost paper repo.
Each data.frame in sumstat must include the following columns:

- z or (beta, sebeta): either the z-score or the effect size and its standard error.
- n: sample size for the summary statistics. Providing the sample size, or even a rough estimate of n, is highly recommended. Without n, the implicit assumption is that n is large (Inf) and the effect sizes are small (close to zero).
- variant: required if sumstat for different outcomes do not have the same number of variables (multiple sumstat and multiple LD).

class(Sumstat_5traits$sumstat[[1]])
head(Sumstat_5traits$sumstat[[1]])
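As a minimal sketch of the column layout described above, the toy data frame below carries the beta, sebeta, n, and variant columns and derives the z-score as beta divided by sebeta. The values and the name toy_sumstat are hypothetical and are not taken from the package data.

```r
# Toy summary statistics for three variants (hypothetical values).
toy_sumstat <- data.frame(
  variant = c("SNP1", "SNP2", "SNP3"),
  beta    = c(0.10, -0.20, 0.05),
  sebeta  = c(0.05, 0.04, 0.05),
  n       = 1000,
  stringsAsFactors = FALSE
)

# The z-score is the effect size divided by its standard error.
toy_sumstat$z <- toy_sumstat$beta / toy_sumstat$sebeta
toy_sumstat
```

Either the z column alone or the (beta, sebeta) pair is enough according to the list above; the sketch keeps both only to show how they relate.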
The preferred format for colocalization analysis in ColocBoost using summary statistics data is one LD matrix shared by all traits, with the summary statistics organized in a list. The basic format is:

- sumstat is organized as a list of data.frames for all traits.
- LD is a matrix of linkage disequilibrium (LD) information for all variants across all traits.

Running colocboost requires specifying the summary statistics sumstat and the LD matrix LD from the dataset:
# Extract genotype (X) and calculate LD matrix
data("Ind_5traits")
LD <- get_cormat(Ind_5traits$X[[1]])

# Run colocboost
res <- colocboost(sumstat = Sumstat_5traits$sumstat, LD = LD)

# Identified CoS
res$cos_details$cos$cos_index

# Plotting the results
colocboost_plot(res)
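Before running an analysis with a single shared LD matrix, it can help to sanity-check the inputs. The sketch below uses toy data and assumes that the LD matrix carries variant names as its row and column names. All values and the names toy_LD, toy_sumstat, ld_ok, and covered are hypothetical; this is an illustration in base R, not part of the colocboost API.

```r
# Toy LD matrix for three variants (hypothetical values).
variants <- c("SNP1", "SNP2", "SNP3")
toy_LD <- matrix(
  c(1.0, 0.3, 0.1,
    0.3, 1.0, 0.2,
    0.1, 0.2, 1.0),
  nrow = 3, dimnames = list(variants, variants)
)

# Toy list of summary statistics, one data.frame per trait.
toy_sumstat <- list(
  data.frame(variant = variants, z = c(2.1, -0.4, 0.9), n = 1000,
             stringsAsFactors = FALSE),
  data.frame(variant = variants, z = c(0.3, 3.2, -1.1), n = 1200,
             stringsAsFactors = FALSE)
)

# A correlation matrix should be symmetric with a unit diagonal.
ld_ok <- isSymmetric(toy_LD) && all(diag(toy_LD) == 1)

# Every variant in every sumstat should appear among the LD row names.
covered <- vapply(
  toy_sumstat,
  function(s) all(s$variant %in% rownames(toy_LD)),
  logical(1)
)
```

If ld_ok is FALSE or any element of covered is FALSE, the mismatch is worth resolving before the inputs are passed on.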
For comprehensive tutorials on result interpretation and advanced visualization techniques, please visit our tutorials portal at Visualization of ColocBoost Results and Interpret ColocBoost Output.
When studying multiple traits with their own trait-specific LD matrices, you could provide a list of LD matrices matched with a list of summary statistics.
- sumstat and LD are organized as lists, matched by trait index,
- (sumstat[1], LD[1]) contains information for trait 1,
- (sumstat[2], LD[2]) contains information for trait 2,
- and so on for each remaining trait.

# Duplicate LD with matched summary statistics
LD_multiple <- lapply(seq_along(Sumstat_5traits$sumstat), function(i) LD)

# Run colocboost
res <- colocboost(sumstat = Sumstat_5traits$sumstat, LD = LD_multiple)

# Identified CoS
res$cos_details$cos$cos_index
When the LD matrix includes a superset of the variants across the different summary statistics, the input format is:

- sumstat is a list of data.frames for all traits.
- LD is a matrix of linkage disequilibrium (LD) information for all variants across all traits.

# Create sumstat with different numbers of variants - remove 20 random variants from each sumstat
LD_superset <- LD
sumstat <- lapply(Sumstat_5traits$sumstat, function(x) x[-sample(1:nrow(x), 20), , drop = FALSE])

# Run colocboost
res <- colocboost(sumstat = sumstat, LD = LD_superset)

# Identified CoS
res$cos_details$cos$cos_index
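To illustrate what a superset LD matrix means in practice, the sketch below subsets a toy LD matrix by variant name so that it lines up with one trait's summary statistics. It assumes variant names are available both in the variant column and as the LD row and column names. The data and the names toy_LD, toy_sumstat, and LD_sub are hypothetical, and the sketch makes no claim about how colocboost performs this matching internally.

```r
# Toy LD matrix covering four variants (identity matrix for simplicity).
variants <- paste0("SNP", 1:4)
toy_LD <- diag(4)
dimnames(toy_LD) <- list(variants, variants)

# Summary statistics that cover only a subset of those variants.
toy_sumstat <- data.frame(
  variant = c("SNP1", "SNP3"),
  z       = c(1.5, -2.0),
  stringsAsFactors = FALSE
)

# Subset LD by name so its rows and columns follow the sumstat order.
LD_sub <- toy_LD[toy_sumstat$variant, toy_sumstat$variant, drop = FALSE]
dim(LD_sub)
```

Name-based indexing is used here because positional indexing would silently misalign variants once rows have been dropped from a sumstat.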
When studying multiple traits with arbitrary LD matrices for different summary statistics, we also provide the interface for arbitrary LD matrices with multiple sumstat. This particularly benefits meta-analysis across heterogeneous datasets where, for different subsets of summary statistics, LD comes from different populations.
- sumstat = list(sumstat1, sumstat2, sumstat3, sumstat4, sumstat5) is a list of data.frames for all traits.
- LD = list(LD1, LD2) is a list of LD matrices.
- dict_sumstatLD is a dictionary matrix that maps each index of sumstat to an index of LD.

# Create a simple dictionary for demonstration purposes
LD_arbitrary <- list(LD, LD)

# Traits 1 and 2 are matched to the first LD matrix; traits 3, 4, and 5 are matched to the second LD matrix.
dict_sumstatLD <- cbind(c(1:5), c(1, 1, 2, 2, 2))

# Display the dictionary
dict_sumstatLD

# Run colocboost
res <- colocboost(sumstat = Sumstat_5traits$sumstat, LD = LD_arbitrary, dict_sumstatLD = dict_sumstatLD)

# Identified CoS
res$cos_details$cos$cos_index
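The dictionary is a two-column matrix: the first column holds the sumstat index and the second holds the LD index it maps to. As a rough sketch of how such a mapping can be read, the base R code below looks up the LD index for one trait and groups traits by the LD matrix they share. The helper ld_index_for and the variable groups are hypothetical names introduced for illustration and are not part of colocboost.

```r
# Same dictionary as in the example: traits 1-2 use LD 1, traits 3-5 use LD 2.
dict_sumstatLD <- cbind(1:5, c(1, 1, 2, 2, 2))

# Hypothetical helper: return the LD index matched to a given trait index.
ld_index_for <- function(trait, dict) {
  dict[dict[, 1] == trait, 2]
}
ld_index_for(4, dict_sumstatLD)

# Group trait indices by the LD matrix they are matched to.
groups <- split(dict_sumstatLD[, 1], dict_sumstatLD[, 2])
groups
```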
ColocBoost also provides the flexibility to use the HyPrColoc-compatible format for summary statistics, with or without an LD matrix.
# Loading the Dataset
data(Ind_5traits)
X <- Ind_5traits$X
Y <- Ind_5traits$Y

# Converting to HyPrColoc compatible format
effect_est <- effect_se <- effect_n <- c()
for (i in seq_along(X)) {
  x <- X[[i]]
  y <- Y[[i]]
  effect_n[i] <- length(y)
  output <- susieR::univariate_regression(X = x, y = y)
  effect_est <- cbind(effect_est, output$beta)
  effect_se <- cbind(effect_se, output$sebeta)
}
colnames(effect_est) <- colnames(effect_se) <- c("Y1", "Y2", "Y3", "Y4", "Y5")
rownames(effect_est) <- rownames(effect_se) <- colnames(X[[1]])

# Run colocboost
LD <- get_cormat(Ind_5traits$X[[1]])
res <- colocboost(effect_est = effect_est, effect_se = effect_se, effect_n = effect_n, LD = LD)

# Identified CoS
res$cos_details$cos$cos_index
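When summary statistics are already available as a list of data.frames, the same variant-by-trait matrices can be assembled without rerunning the regressions. The sketch below assumes every trait reports the same variants in the same order and that the columns are named beta and sebeta. The toy values and the name toy_sumstat are hypothetical.

```r
# Toy list of per-trait summary statistics (hypothetical values).
toy_sumstat <- list(
  Y1 = data.frame(variant = c("SNP1", "SNP2"), beta = c(0.1, 0.2),
                  sebeta = c(0.05, 0.05), stringsAsFactors = FALSE),
  Y2 = data.frame(variant = c("SNP1", "SNP2"), beta = c(-0.3, 0.0),
                  sebeta = c(0.1, 0.1), stringsAsFactors = FALSE)
)

# One column per trait, one row per variant.
effect_est <- sapply(toy_sumstat, function(s) s$beta)
effect_se  <- sapply(toy_sumstat, function(s) s$sebeta)
rownames(effect_est) <- rownames(effect_se) <- toy_sumstat[[1]]$variant

dim(effect_est)
```

If the traits do not share an identical variant order, the rows would need to be aligned by variant name before binding the columns together.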
For more details about the data formats for LD-free ColocBoost and LD-mismatch diagnosis, see LD mismatch and LD-free Colocalization.