Summary Statistics Data Colocalization

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  dpi = 80
)

This vignette demonstrates how to perform multi-trait colocalization analysis using summary statistics data, specifically focusing on the Sumstat_5traits dataset included in the package.

library(colocboost)

1. The Sumstat_5traits Dataset

The Sumstat_5traits dataset contains simulated summary statistics for 5 traits, derived directly from the Ind_5traits dataset by marginal association. It is designed to evaluate and demonstrate the capabilities of ColocBoost in multi-trait colocalization analysis with summary association data.
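
To make "marginal association" concrete, here is a minimal base-R sketch that regresses a trait on one variant at a time and returns an effect size, standard error, and z-score per variant. The simulated data, the helper name marginal_assoc, and the output column names are assumptions for illustration only; this is not necessarily the procedure used to build Sumstat_5traits.

```r
# Hypothetical individual-level data (not the package dataset)
set.seed(1)
n <- 200
p <- 10
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("SNP", 1:p)))
y <- 0.5 * X[, 3] + rnorm(n)

# Simple linear regression (with intercept) of y on each column of X
marginal_assoc <- function(X, y) {
  Xc <- scale(X, center = TRUE, scale = FALSE)  # center each variant
  yc <- y - mean(y)                             # center the trait
  sxx <- unname(colSums(Xc^2))
  beta <- as.vector(crossprod(Xc, yc)) / sxx
  rss <- sum(yc^2) - beta^2 * sxx               # residual sum of squares
  sebeta <- sqrt(rss / (length(y) - 2) / sxx)   # n - 2 degrees of freedom
  data.frame(variant = colnames(X), beta = beta,
             sebeta = sebeta, z = beta / sebeta)
}

ss <- marginal_assoc(X, y)
head(ss)
```

Each row matches what lm(y ~ X[, j]) would report for the slope of variant j, computed here without fitting p separate models.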

Causal variant structure

The dataset features two causal variants with indices 194 and 589.

This structure creates a realistic scenario in which multiple traits are influenced by different but overlapping sets of genetic variants.

# Loading the Dataset
data("Sumstat_5traits")
names(Sumstat_5traits)
Sumstat_5traits$true_effect_variants

Because of the CRAN file size limit, this is a subset of the simulated data. The full dataset is available in the colocboost paper repo.

Important data format for summary data

Each element of sumstat must include a specific set of columns, which can be inspected as follows:

class(Sumstat_5traits$sumstat[[1]])
head(Sumstat_5traits$sumstat[[1]])

2. Multiple summary statistics data with shared LD reference

The preferred format for colocalization analysis in ColocBoost using summary statistics data is one LD matrix shared by all traits, with the summary statistics organized in a list. In this basic format, the colocboost function takes the summary statistics as sumstat and the LD matrix as LD:

# Extract genotype (X) and calculate LD matrix
data("Ind_5traits")
LD <- get_cormat(Ind_5traits$X[[1]])

# Run colocboost
res <- colocboost(sumstat = Sumstat_5traits$sumstat, LD = LD)

# Identified CoS
res$cos_details$cos$cos_index

# Plotting the results
colocboost_plot(res)
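
The LD matrix above is computed from a genotype matrix. As a rough illustration of what such a matrix looks like, the sketch below builds a variant-by-variant Pearson correlation matrix with base R's cor() on simulated genotypes. Whether get_cormat computes exactly this is an assumption; the genotype matrix and variant names here are hypothetical.

```r
# Hypothetical genotype matrix: 100 samples, 5 variants, 0/1/2 allele counts
set.seed(2)
n <- 100
p <- 5
X <- matrix(rbinom(n * p, size = 2, prob = 0.3), n, p,
            dimnames = list(NULL, paste0("SNP", 1:p)))

# Pairwise Pearson correlation between variants (columns of X)
LD <- cor(X)

# Basic sanity checks one would expect of an LD matrix
isSymmetric(LD)                        # symmetric
all(abs(diag(LD) - 1) < 1e-12)         # unit diagonal
identical(rownames(LD), colnames(X))   # labeled by variant ID
```

Keeping variant IDs as row and column names is what lets an LD matrix be matched against the variants listed in the summary statistics.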

Results Interpretation

For comprehensive tutorials on result interpretation and advanced visualization techniques, see Visualization of ColocBoost Results and Interpret ColocBoost Output on our tutorials portal.

3. Other summary statistics and LD input combinations

3.1. Matched LD with multiple sumstat (Trait-specific LD)

When studying multiple traits with their own trait-specific LD matrices, you can provide a list of LD matrices matched to a list of summary statistics.

# Duplicate LD with matched summary statistics
LD_multiple <- lapply(seq_along(Sumstat_5traits$sumstat), function(i) LD)

# Run colocboost
res <- colocboost(sumstat = Sumstat_5traits$sumstat, LD = LD_multiple)

# Identified CoS
res$cos_details$cos$cos_index

3.2. LD matrix is a superset of variants across different summary statistics

The LD matrix may include a superset of the variants found across the different summary statistics. In that case, provide the list of summary statistics together with the single LD matrix:

# Create sumstat with different numbers of variants - remove 20 random variants from each sumstat
LD_superset <- LD
sumstat <- lapply(Sumstat_5traits$sumstat, function(x) x[-sample(1:nrow(x), 20), , drop = FALSE])

# Run colocboost
res <- colocboost(sumstat = sumstat, LD = LD_superset)

# Identified CoS
res$cos_details$cos$cos_index
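
To illustrate the idea of an LD matrix that covers more variants than a given set of summary statistics, the following base-R sketch subsets a labeled LD matrix to the variants a sumstat actually contains. This is only a sketch of name-based matching; it is not a description of how colocboost aligns the inputs internally, and the toy data and the column names z and variant are assumptions.

```r
# Hypothetical LD matrix over 6 variants (identity matrix for simplicity)
variants <- paste0("SNP", 1:6)
LD_toy <- diag(6)
dimnames(LD_toy) <- list(variants, variants)

# Hypothetical sumstat covering only 4 of the 6 variants, in a different order
sumstat_toy <- data.frame(z = c(1.2, -0.4, 3.1, 0.7),
                          variant = c("SNP5", "SNP2", "SNP3", "SNP6"))

# Variants present in both inputs, kept in the sumstat order
shared <- intersect(sumstat_toy$variant, rownames(LD_toy))

# Subset both inputs so rows of the sumstat line up with rows/columns of LD
LD_sub <- LD_toy[shared, shared, drop = FALSE]
sumstat_sub <- sumstat_toy[match(shared, sumstat_toy$variant), , drop = FALSE]

identical(rownames(LD_sub), sumstat_sub$variant)
```

Matching by variant ID rather than by position is what makes differing variant sets across traits manageable.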

3.3. Arbitrary LD and sumstat with dictionary provided

When studying multiple traits with arbitrary LD matrices for different summary statistics, we also provide the interface for arbitrary LD matrices with multiple sumstat. This particularly benefits meta-analysis across heterogeneous datasets where, for different subsets of summary statistics, LD comes from different populations.

# Create a simple dictionary for demonstration purposes
# Traits 1 and 2 are matched to the first LD matrix; traits 3, 4, and 5 are matched to the second LD matrix.
LD_arbitrary <- list(LD, LD)
dict_sumstatLD <- cbind(1:5, c(1, 1, 2, 2, 2))

# Display the dictionary
dict_sumstatLD

# Run colocboost
res <- colocboost(sumstat = Sumstat_5traits$sumstat, LD = LD_arbitrary, dict_sumstatLD = dict_sumstatLD)

# Identified CoS
res$cos_details$cos$cos_index
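
The dictionary can be read as a two-column lookup table. The base-R sketch below assumes, in line with the example above, that the first column holds the index of a summary statistic and the second column the index of the LD matrix it uses; the helper name ld_for is hypothetical.

```r
# Same dictionary as in the example: sumstat index -> LD index
dict_sumstatLD <- cbind(1:5, c(1, 1, 2, 2, 2))

# Look up the LD index assigned to the i-th summary statistic
ld_for <- function(i, dict) dict[dict[, 1] == i, 2]
ld_index <- vapply(1:5, ld_for, numeric(1), dict = dict_sumstatLD)
ld_index

# Group summary statistics by the LD matrix they share
groups <- split(dict_sumstatLD[, 1], dict_sumstatLD[, 2])
groups
```

Grouping by the second column shows at a glance which traits share an LD reference, which is a quick check that the dictionary encodes the intended mapping before running the analysis.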

3.4. HyPrColoc compatible format: effect size and standard error matrices

ColocBoost also accepts summary statistics in a HyPrColoc-compatible format, either with or without an LD matrix.

# Loading the Dataset
data(Ind_5traits)
X <- Ind_5traits$X
Y <- Ind_5traits$Y

# Converting to HyPrColoc compatible format
effect_est <- effect_se <- effect_n <- c()
for (i in 1:length(X)){
  x <- X[[i]]
  y <- Y[[i]]
  effect_n[i] <- length(y)
  output <- susieR::univariate_regression(X = x, y = y)
  effect_est <- cbind(effect_est, output$beta)
  effect_se <- cbind(effect_se, output$sebeta)
}
colnames(effect_est) <- colnames(effect_se) <- c("Y1", "Y2", "Y3", "Y4", "Y5")
rownames(effect_est) <- rownames(effect_se) <- colnames(X[[1]])

# Run colocboost
LD <- get_cormat(Ind_5traits$X[[1]])
res <- colocboost(effect_est = effect_est, effect_se = effect_se, effect_n = effect_n, LD = LD)

# Identified CoS
res$cos_details$cos$cos_index
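
The matrix format and the list format carry the same information. As a sketch, the base-R code below turns effect size and standard error matrices into z-scores and then into one data frame per trait. The toy values are made up, and the per-trait column names z, n, and variant are assumptions for illustration rather than a statement of what colocboost requires.

```r
# Hypothetical effect size and standard error matrices (variants x traits)
variants <- paste0("SNP", 1:3)
traits <- c("Y1", "Y2")
effect_est_toy <- matrix(c(0.10, -0.20, 0.05, 0.30, 0.00, -0.15), 3, 2,
                         dimnames = list(variants, traits))
effect_se_toy <- matrix(0.05, 3, 2, dimnames = list(variants, traits))
effect_n_toy <- c(1000, 800)  # sample size per trait

# z-score = effect size / standard error, element by element
z <- effect_est_toy / effect_se_toy

# One data frame per trait (column names are assumed, see above)
sumstat_list <- lapply(seq_along(traits), function(j) {
  data.frame(z = z[, j], n = effect_n_toy[j], variant = rownames(z),
             row.names = NULL)
})
names(sumstat_list) <- traits
sumstat_list$Y1
```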

For more details on the data format needed to run LD-free ColocBoost and on LD-mismatch diagnosis, see LD mismatch and LD-free Colocalization.




colocboost documentation built on June 8, 2025, 11:07 a.m.