knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  dev = c("png")
)

Introduction to MTMC-SKAT

Travis build status

Multi-Threaded Monte Carlo Sequence Kernel Association Test (MTMC-SKAT) provides high-level R and command line interfaces for SKAT (Wu et al., 2011; Lee et al., 2016), most notably with support for adaptive resampling with multi-threading using the future parallel framework to rapidly calculate empirical p-values. These features can facilitate high-powered and accurate GWAS for traits that do not satisfy linear model assumptions.

Resampling is performed using an algorithm that allows calculation of empirical p-values to a user-specified number of significant figures. The efficiency of parallelization over SNP windows or permutations depends on a given dataset and resources of a given computer. In the automated workflow, permutations are continually generated and batches of SNP windows are tested to calculate empirical p-values as sufficient permutations become available for each batch. The appropriate mode (multithreading over windows or permutations) is automatically detected for each batch of SNP windows. See the MTMC-SKAT Workflows vignette for details on this automatic workflow and information for designing custom workflows.

Running over a single scaffold

Workflows are applied over scaffolds independently, with parallelization over scaffolds or sub-scaffolds on high-performance clusters as described later. Each scaffold is submitted as a .traw file, which can be prepared from more common SNP data formats using PLINK. The standard workflow can be run over a given scaffold using the following command.

MTMCSKAT_workflow(phenodata = "poplar_shoot_sample.csv",
                  covariates = "poplar_PCs_covariates.csv",
                  raw_file_path = "poplar_Chr10_portion.traw",
                  window_size = 3000,
                  window_shift = 1000,
                  output_dir = "Results/",
                  job_id = "my_sample_analysis",
                  ncore = "AllCores",
                  max_accuracy = 5,
                  sig_figs = 2)

Inputs include file paths to phenotype, covariate and genotype file paths (in standard PLINK formats for phenotypes and covariates and .traw for genotypes data), the sizes of overlapping SNP windows to be tested and the shift in position between windows, and the limit for accuracy (recommended to be set beyond the FDR or Bonferroni correction threshold) given as 10^-x.

This function is also accesible from the command line, thus providing a one-liner that can be easily submitted as a job to any batch query system. Jobs are submitted for each scaffold or sub-scaffold to be analyzed.

Rscript MTMCSKAT_workflow.R --phenodata="poplar_shoot_sample.csv" \
--covariates="poplar_PCs_covariates.csv" \
--raw_file_path="poplar_Chr10_portion.traw" \
--window_size=3000 \
--window_shift=1000 \
--output_dir="Results/" \
--job_id="my_sample_analysis" \
--ncore="AllCores" \
--max_accuracy=5 \
--sig_figs=2

Running over an array of scaffolds using SLURMS

The following bash command can be used to generate a bash script to submit an array of jobs (each for a given scaffold or sub-scaffold) to be submitted to to the SGE or SLURMS batch query system...

prepare_SLURMS_batch.sh \
-d inputs/wholeChr \# Folder containing scaffolds in `.traw` format
-p poplar_shoot_sample.csv \# Phenotype file
-c poplar_PCs_covariates.csv \# Covariate file
-i 3000 \# Window size
-h 1000 \# Window shift
-o Results/ \# Directory to save results in
-j my_sample_analysis \# Prefix to name jobs with
-n AllCores \# Maximum number of threads (use string `AllCores` for maximum threads)
-m 5 \# Maximum accuracy for empirical p-values
-s 2 \# Desired number of significant figures
-t 01:00:00# Maximum runtime after which a job will terminate

...

Wu, M.C., Lee, S., Cai, T., Li, Y., Boehnke, M. and Lin, X., 2011. Rare-variant association testing for sequencing data with the sequence kernel association test. The American Journal of Human Genetics, 89(1), pp.82-93.

Lee, S., Fuchsberger, C., Kim, S. and Scott, L., 2016. An efficient resampling method for calibrating single and gene-based rare variant association analysis in case–control studies. Biostatistics, 17(1), pp.1-15.



naglemi/mtmcskat documentation built on Aug. 23, 2023, 5:35 p.m.