getInput: Preprocess GWAS summary statistics datasets

View source: R/preprocess.R

getInput    R Documentation

Preprocess GWAS summary statistics datasets

Description

This function takes GWAS summary statistics data files as input, performs genetic instrument selection, and returns matrices that are ready to use with GRAPPLE.

Usage

getInput(
  sel.files,
  exp.files,
  out.files,
  plink_refdat,
  max.p.thres = 0.01,
  cal.cor = T,
  p.thres.cor = 0.5,
  get.marker.candidates = T,
  marker.p.thres = 1e-05,
  marker.p.source = "exposure",
  clump_r2 = 0.001,
  clump_r2_formarkers = 0.05,
  plink_exe = NULL
)

Arguments

sel.files

A vector of the GWAS summary statistics file names used for SNP selection for the risk factors. Each GWAS file is a ".csv" or ".txt" file containing a data frame that has at least a column "SNP" for the SNP ids and a column "pval" for the p-values. The length of sel.files is not required to be the same as that of exp.files, and the order of the files does not matter, though we strongly suggest having one selection file for each risk factor.
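As a minimal sketch of the expected file layout, the following base R snippet writes a tiny selection file and checks it for the two required columns. The SNP ids, p-values, and temporary file are made up for illustration and are not part of the package.

```r
# Hypothetical toy selection data: only "SNP" and "pval" are required.
sel <- data.frame(
  SNP  = c("rs1", "rs2", "rs3"),
  pval = c(1e-8, 0.03, 0.7)
)

# Write it as a ".csv" file in a temporary location.
sel_path <- tempfile(fileext = ".csv")
write.csv(sel, sel_path, row.names = FALSE)

# Read it back and confirm the required columns are present.
sel_check <- read.csv(sel_path)
has_required <- all(c("SNP", "pval") %in% colnames(sel_check))
```

A file that passes this check has the minimum structure described above; extra columns are allowed.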

exp.files

A vector of length k of the GWAS summary statistics file names for the k risk factors, used to obtain the effect sizes and standard errors. Each GWAS file should have a column "SNP" for the SNP ids, "beta" for the effect sizes, "se" for the standard errors, "effect_allele" for the effect allele and "other_allele" for the other allele of the SNP.

out.files

The GWAS summary statistics file name for the disease data. It can be a vector of length m to allow preprocessing m diseases simultaneously. Each GWAS file should have a column "SNP" for the SNP ids, "beta" for the effect sizes, "se" for the standard errors, "effect_allele" for the effect allele and "other_allele" for the other allele of the SNP.
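The column requirements for exposure and outcome files can be checked before calling getInput. The sketch below uses a hypothetical helper, missing_cols, which is not part of GRAPPLE, and a made-up two-SNP data frame.

```r
# Columns the documentation lists for exposure and outcome files.
required_cols <- c("SNP", "beta", "se", "effect_allele", "other_allele")

# Hypothetical helper (not part of GRAPPLE): returns the missing column names.
missing_cols <- function(df, required = required_cols) {
  setdiff(required, colnames(df))
}

# Made-up toy data with every required column.
toy <- data.frame(
  SNP = c("rs1", "rs2"),
  beta = c(0.12, -0.05),
  se = c(0.02, 0.03),
  effect_allele = c("A", "G"),
  other_allele = c("C", "T")
)

complete_result <- missing_cols(toy)                    # nothing missing
partial_result  <- missing_cols(toy[, c("SNP", "beta")]) # three columns missing
```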

plink_refdat

The reference genotype files (.bed, .bim, .fam) for clumping using PLINK (loaded with --bfile).

max.p.thres

The upper threshold of the selection p-values for a SNP to be selected before clumping. A SNP qualifies if at least one of its p-values across the risk factors is below the threshold. Default is 0.01.
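The "at least one p-value below the threshold" rule can be sketched in base R as follows. The p-value matrix is made up, and the strict inequality is an assumption; the documentation does not say how ties at the threshold are handled.

```r
max.p.thres <- 0.01

# Hypothetical selection p-values: rows are SNPs, columns are two risk factors.
pvals <- matrix(
  c(0.001, 0.40,
    0.20,  0.005,
    0.30,  0.60),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("rs1", "rs2", "rs3"), c("trait1", "trait2"))
)

# Keep a SNP when its smallest p-value across risk factors is below the threshold
# (strict "<" is an assumption).
keep <- apply(pvals, 1, min) < max.p.thres
candidates <- rownames(pvals)[keep]
```

Here rs1 and rs2 pass because each is below 0.01 for one trait, while rs3 is dropped.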

cal.cor

Whether to calculate the (k + 1) by (k + 1) correlation matrix between the k risk factors and the outcome. Default is TRUE.

p.thres.cor

The lower threshold of the p-values for a SNP to be used in calculating the correlation matrix. Only SNPs whose p-values are above the threshold for all risk factors are selected. Default is 0.5.
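In contrast to max.p.thres, this filter requires every p-value of a SNP to clear the threshold. A minimal base R sketch of that filter, using made-up p-values and assuming a strict inequality, might look like this. It shows only the SNP filter, not how the package computes the correlation matrix itself.

```r
p.thres.cor <- 0.5

# Hypothetical p-values: rows are SNPs, columns are two risk factors.
pvals <- matrix(
  c(0.90, 0.70,
    0.60, 0.20,
    0.80, 0.55,
    0.01, 0.95),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("rs1", "rs2", "rs3", "rs4"), c("trait1", "trait2"))
)

# Keep a SNP only when all of its p-values are above the threshold,
# i.e. when its smallest p-value is above it (strict ">" is an assumption).
use_for_cor <- apply(pvals, 1, min) > p.thres.cor
cor_snps <- rownames(pvals)[use_for_cor]
```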

get.marker.candidates

Whether to get the SNPs that are used for mode marker selection. Only applies when the number of risk factors k = 1. Default is TRUE for k = 1.

marker.p.thres

The p-value threshold for mode markers, applied to the p-values in the exposure files. Default is 1e-5.

marker.p.source

The source of the p-values for mode markers: a string, either "exposure" or "selection". Default is "exposure", which yields more markers.

clump_r2

The clumping r2 threshold in PLINK for genetic instrument selection. Default is set to 0.001 for selection of independent SNPs.

clump_r2_formarkers

The clumping r2 threshold in PLINK for selecting candidates for the marker SNPs. Default is 0.05.

plink_exe

The name of the PLINK executable. Default is NULL, which uses "plink". Linux users may need a different name, such as "./plink", depending on where PLINK is installed.

Value

A list of selected summary statistics, which includes

data

A data frame of size p by (3 + 2k + 2m + 1) holding the effects of the p selected independent SNPs (instruments) on the k risk factors (exposures) and the m diseases (outcomes). The first three columns are the SNP rsID, the effect allele and the other allele after harmonizing. The next 2k columns are the estimated effect sizes and standard errors for the k risk factors stored in exp.files. The next 2m columns are the estimated effect sizes and standard errors for the m diseases stored in out.files. The last column contains the selection p-values obtained from sel.files.
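The column count follows directly from the formula above. The helper below, n_data_cols, is a hypothetical illustration of that arithmetic and not a package function.

```r
# Hypothetical helper: number of columns in `data` for k exposures and m outcomes.
n_data_cols <- function(k, m) {
  3 +      # SNP rsID, effect allele, other allele
  2 * k +  # effect size and "se" for each risk factor
  2 * m +  # effect size and "se" for each disease
  1        # selection p-values
}

one_exposure  <- n_data_cols(k = 1, m = 1)  # 3 + 2 + 2 + 1
two_exposures <- n_data_cols(k = 2, m = 1)  # 3 + 4 + 2 + 1
```

With one risk factor and one disease the data frame has 8 columns; adding a second risk factor gives 10.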

marker.data

A data frame for marker candidate SNPs, which has the same columns as data.

cor.mat

The estimated (k + m) by (k + m) correlation matrix between the k risk factors and the disease (outcome) shared by SNPs. The last column is for the outcome trait.
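A call for a single risk factor and a single disease might be sketched as follows. All file names are hypothetical placeholders, and treating plink_refdat as the shared prefix of the .bed/.bim/.fam files is an assumption based on the --bfile note above. The call is guarded so that it only runs when the package and the input files are actually available.

```r
# Hypothetical file paths; replace them with your own files.
args <- list(
  sel.files = "bmi_selection.csv",
  exp.files = "bmi_exposure.csv",
  out.files = "t2d_outcome.csv",
  plink_refdat = "reference/eur",  # assumed: prefix of the .bed/.bim/.fam files
  max.p.thres = 0.01,
  plink_exe = "./plink"
)

# Only attempt the call when GRAPPLE is installed and the input files exist.
input_files <- unlist(args[c("sel.files", "exp.files", "out.files")])
can_run <- requireNamespace("GRAPPLE", quietly = TRUE) &&
  all(file.exists(input_files))

if (can_run) {
  res <- do.call(GRAPPLE::getInput, args)
  str(res$data)  # inspect the selected instruments
}
```

Arguments left out of the list keep the defaults shown in Usage.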


jingshuw/GRAPPLE-beta- documentation built on March 29, 2024, 1:26 p.m.