harmonize_sumstats: Harmonizing GWAS summary to reference data
In snpsettest: A Set-Based Association Test using GWAS Summary Statistics

Description

Finds an intersection of variants between GWAS summary and reference data.

Usage

harmonize_sumstats(
  sumstats,
  x,
  match_by_id = TRUE,
  check_strand_flip = FALSE,
  return_indice = FALSE
)

Arguments

`sumstats`	A data frame with two columns: "id" (SNP ID, e.g., rs numbers) and "pvalue" (SNP-level p value). If `match_by_id = FALSE`, it requires additional columns: "chr" (chromosome), "pos" (base-pair position; must be integer), and "A1" and "A2" (allele codes; allele order is not important).
`x`	A `bed.matrix` object created using the reference data.
`match_by_id`	If `TRUE`, SNP matching will be performed by SNP IDs instead of genomic position and allele codes. Default is `TRUE`.
`check_strand_flip`	Only applies when `match_by_id = FALSE`. If `TRUE`, the function 1) removes ambiguous A/T and G/C SNPs for which the strand is not obvious, and 2) attempts to find additional matching entries by flipping allele codes (i.e., A->T, T->A, C->G, G->C). If the GWAS genotype data itself is used as the reference data, it is safe to leave this set to `FALSE`. Default is `FALSE`.
`return_indice`	Only applies when `match_by_id = FALSE`. If `TRUE`, the function provides an additional column indicating whether each match was made with swapped alleles. If `check_strand_flip = TRUE`, the function also provides an additional column indicating whether each match was made with a flipped strand. Unnecessary for gene-based tests in this package, but may be useful for other purposes (e.g., harmonization for a meta-analysis that needs to flip the sign of beta for a match with swapped alleles).
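
The strand handling described for `check_strand_flip` can be illustrated with a small base R sketch. The helper names `flip_strand` and `is_ambiguous` are hypothetical and are not part of snpsettest. The code only shows the idea of complementing allele codes and spotting A/T and G/C pairs; it is not the package's implementation.

```r
# Hypothetical helpers for illustration only; not part of snpsettest.

# Complement allele codes: A <-> T, C <-> G.
flip_strand <- function(allele) {
  chartr("ACGT", "TGCA", toupper(allele))
}

# A SNP is strand-ambiguous when its two alleles are complements of
# each other (A/T or C/G): flipping the strand yields the same pair,
# so the strand cannot be inferred from the allele codes alone.
is_ambiguous <- function(a1, a2) {
  flip_strand(a1) == toupper(a2)
}

# Toy allele codes invented for this example.
a1 <- c("A", "A", "C", "G")
a2 <- c("G", "T", "G", "T")

flipped_a1 <- flip_strand(a1)
flipped_a2 <- flip_strand(a2)
ambiguous <- is_ambiguous(a1, a2)
```

In this toy input the second and fourth columns of logic are easy to check by hand: the A/T and C/G pairs are flagged as ambiguous, while A/G and G/T are not.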

Details

Pre-processing of GWAS summary data is required because the set of variants available in a particular GWAS might be poorly matched to the variants in the reference data. SNP matching can be performed either 1) by SNP ID or 2) by chromosome code, base-pair position, and allele codes, while taking into account possible strand flips and reference allele swaps. For matched entries, the SNP IDs in the GWAS summary data are replaced with the ones in the reference data.
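
As a rough illustration of matching by chromosome, position, and allele codes with allele swaps taken into account, the base R sketch below builds "chr_pos_A1_A2" keys and looks them up in both allele orders. All data values and the helper `make_key` are invented for this example, strand flips are ignored, and the output is simplified (allele columns are omitted); the package's actual matching logic may differ.

```r
# Toy data invented for illustration; column names follow the
# `sumstats` argument described above.
sumstats <- data.frame(
  id = c("snpA", "snpB", "snpC"),
  chr = c(1L, 1L, 2L),
  pos = c(100L, 200L, 300L),
  A1 = c("A", "T", "C"),
  A2 = c("G", "C", "T"),
  pvalue = c(0.01, 0.20, 0.50),
  stringsAsFactors = FALSE
)
# Stand-in for the variant table of the reference data.
ref <- data.frame(
  id = c("rs1", "rs2", "rs3"),
  chr = c(1L, 1L, 2L),
  pos = c(100L, 200L, 999L),
  A1 = c("A", "C", "C"),
  A2 = c("G", "T", "T"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build a "chr_pos_A1_A2" key.
make_key <- function(chr, pos, a1, a2) paste(chr, pos, a1, a2, sep = "_")

key_ref <- make_key(ref$chr, ref$pos, ref$A1, ref$A2)
key_same <- make_key(sumstats$chr, sumstats$pos, sumstats$A1, sumstats$A2)
key_swap <- make_key(sumstats$chr, sumstats$pos, sumstats$A2, sumstats$A1)

# Try the original allele order first, then the swapped order.
idx_same <- match(key_same, key_ref)
idx_swap <- match(key_swap, key_ref)
swapped <- is.na(idx_same) & !is.na(idx_swap)
idx <- ifelse(is.na(idx_same), idx_swap, idx_same)

# Keep matched entries and take the SNP ID from the reference data.
matched <- !is.na(idx)
harmonized <- data.frame(
  id = ref$id[idx[matched]],
  chr = sumstats$chr[matched],
  pos = sumstats$pos[matched],
  pvalue = sumstats$pvalue[matched],
  swapped_ = swapped[matched],
  stringsAsFactors = FALSE
)
```

Here the third toy variant has no counterpart at the same position in `ref`, so it is dropped, and the second variant matches only after its alleles are swapped.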

Value

A data frame with columns: "id", "chr", "pos", "A1", "A2" and "pvalue". If `return_indice = TRUE`, the data frame includes additional columns `key_`, `swapped_`, and (when `check_strand_flip = TRUE`) `flipped_`. `key_` is "chr_pos_A1_A2" in `sumstats` (the original input before harmonization). `swapped_` contains a logical vector indicating reference allele swap. `flipped_` contains a logical vector indicating strand flip.
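
One possible use of the indicator columns is aligning effect sizes for a meta-analysis. The sketch below assumes a toy data frame shaped like the `return_indice = TRUE` output and a separate, hypothetical `effects` table that stores a `beta` per original "chr_pos_A1_A2" key; neither `effects` nor `beta_aligned` comes from snpsettest, and all values are invented.

```r
# Toy stand-in for output obtained with return_indice = TRUE.
harmonized <- data.frame(
  id = c("rs1", "rs2"),
  key_ = c("1_100_A_G", "1_200_T_C"),
  swapped_ = c(FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Hypothetical effect sizes from the original GWAS, keyed the same
# way as key_ ("chr_pos_A1_A2" before harmonization).
effects <- data.frame(
  key_ = c("1_200_T_C", "1_100_A_G", "2_300_C_T"),
  beta = c(-0.30, 0.12, 0.05),
  stringsAsFactors = FALSE
)

# Look up each harmonized entry's beta through its original key.
beta <- effects$beta[match(harmonized$key_, effects$key_)]

# Flip the sign of beta where the match used swapped alleles.
harmonized$beta_aligned <- ifelse(harmonized$swapped_, -beta, beta)
```

Because `key_` preserves the pre-harmonization identity of each variant, it can serve as the join column even though the "id" values have been replaced with those of the reference data.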

Examples


## GWAS summary statistics
head(exGWAS)

## Load reference genotype data
bfile <- system.file("extdata", "example.bed", package = "snpsettest")
x <- read_reference_bed(path = bfile)

## Harmonize by SNP IDs
hsumstats1 <- harmonize_sumstats(exGWAS, x)

## Harmonize by genomic position and allele codes
## Reference allele swap will be taken into account
hsumstats2 <- harmonize_sumstats(exGWAS, x, match_by_id = FALSE)

## Check matching entries by flipping allele codes
## Ambiguous SNPs will be excluded from harmonization
hsumstats3 <- harmonize_sumstats(exGWAS, x, match_by_id = FALSE,
                                 check_strand_flip = TRUE)

snpsettest documentation built on Sept. 10, 2023, 1:08 a.m.