```{css, echo=FALSE}
.courier-font {font-family: Courier New, Courier, monospace;}
```

```r
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", eval = TRUE, class.source = "courier-font")
library(GUSLD)
```

This tutorial provides an example of how to use GUS-LD on a high performance computing (HPC) cluster. The code is written for the Slurm Workload Manager (as this is what is used on many HPCs in New Zealand), but the ideas presented can easily be extended to other workload managers.

This tutorial assumes that you are familiar with GUSLD; if not, have a read through the introductory GUSLD tutorial first. We will use a dataset of 704 deer genotyped at 38 SNPs using genotyping-by-sequencing (GBS). Although the dataset is very small, we use it to illustrate the procedure for running GUSLD on an HPC cluster. The data accompany the GUSLD package and are stored as a Variant Call Format (VCF) file. The path to the VCF file can be obtained using the deerData() function.

```r
vcffile <- deerData() # path to the VCF file that ships with GUSLD
vcffile               # filename stored in object vcffile

rafile <- VCFtoRA(vcffile)   # convert the VCF file to an RA (reference/alternate) file
RAdata <- readRA(rafile)     # read the RA data into R

## create an unrelated (UR) population, filtering the SNPs in the process
deer <- makeUR(RAdata, filter=list(MAF=0.01, MISS=0.5, HW = c(-0.05,Inf), MAXDEPTH=500))
deer$writeRA(file="deer_filter")   # write the filtered data to a new RA file
```

Now, to parallelize LD estimation on a Slurm cluster, we first need to write a Slurm script. The idea is to split the SNPs into blocks and have each Slurm array task estimate LD for one pair of blocks. An example script for the deer data is shown below.

```{bash slurm_script.sl, eval=FALSE}
#!/bin/bash -e

#SBATCH --job-name=GUSLD
#SBATCH --cpus-per-task=2
#SBATCH --partition=inv-blade-g8,inv-iranui
#SBATCH --mem-per-cpu=4G
#SBATCH --output=log/%x_%A_%a.out
#SBATCH --error=log/%x_%A_%a.err
# one array task per pair of SNP blocks: NO_BLOCKS*(NO_BLOCKS+1)/2 = 6
#SBATCH --array=1-6

## number of SNP blocks and total number of SNPs, passed to the R script
export NO_BLOCKS=3
export NSNPS=38

source activate GUSverse   # activate the conda environment containing GUSLD
srun GUSLD_est_LD.R $NO_BLOCKS $NSNPS
```

The file **GUSLD_est_LD.R** is given below. Since it is run directly by srun, it needs to be executable (e.g., chmod +x GUSLD_est_LD.R) and to begin with an Rscript shebang line.

```r
#!/usr/bin/env Rscript
#### file: GUSLD_est_LD.R
library(GUSLD)

## Read the command line arguments passed in by the Slurm script
args    <- as.integer(commandArgs(trailingOnly=TRUE))
nblocks <- args[1]  # number of blocks to break the SNPs into
nSnps   <- args[2]  # number of SNPs in the dataset

## Determine how to break up the data set:
## block_int holds the start/stop SNP index of each block and
## blocks holds the pair of blocks processed by each array task
cuts <- floor(seq(0,nSnps,length.out=nblocks+1))
block_int <- cbind(cuts[-(nblocks+1)]+1,cuts[-1])
colnames(block_int) <- c("Start","Stop")
blocks <- matrix(c(block_int[rep(1:nblocks,nblocks:1),],
                   block_int[sequence(nblocks:1) + rep(0:nblocks,nblocks:0),]),ncol=4)

## determine which pair of blocks this array task will process (via the Slurm array task ID)
iter <- as.numeric(Sys.getenv("SLURM_ARRAY_TASK_ID"))
block1 <- blocks[iter,1]:blocks[iter,2]
block2 <-  blocks[iter,3]:blocks[iter,4]
subset <- unique(c(block1, block2))

## read in the required subset of the RA data
RAdata <- readRA("deer_filter.ra.tab", snpsubset = subset)

## create the UR population
deer <- makeUR(RAdata, filter = NULL, mafEst = TRUE)   # no extra filtering: the data were already filtered before being written to file

## Estimate LD
outfile <- paste0("deer_r2_",iter)   # one output file per array task
nClust <- as.numeric(Sys.getenv("SLURM_CPUS_PER_TASK"))   # parallelize over the CPUs allocated to this task
if(identical(block1, block2)){
  ## diagonal task: estimate LD between all pairs of SNPs within the block
  GUSLD(deer, nClust=nClust, filename=outfile, LDmeasure = "r2")
} else {
  ## off-diagonal task: only estimate LD between SNPs in block 1 and SNPs in block 2
  pairs <- as.matrix(expand.grid(1:length(block1), length(block1) + 1:length(block2)))
  GUSLD(deer, SNPpairs = pairs, nClust=nClust, filename=outfile, LDmeasure = "r2")
}
```
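
To see how many array tasks are required, the block construction code above can be run interactively. With `NO_BLOCKS=3` there are 3×(3+1)/2 = 6 block pairs, one per array task, which is why the Slurm script uses `--array=1-6`.

```r
## Reproduce the block pairing for NO_BLOCKS = 3 and NSNPS = 38
## (same code as in GUSLD_est_LD.R, run interactively as a check)
nblocks <- 3
nSnps   <- 38
cuts <- floor(seq(0, nSnps, length.out = nblocks + 1))
block_int <- cbind(cuts[-(nblocks + 1)] + 1, cuts[-1])
colnames(block_int) <- c("Start","Stop")
blocks <- matrix(c(block_int[rep(1:nblocks, nblocks:1), ],
                   block_int[sequence(nblocks:1) + rep(0:nblocks, nblocks:0), ]), ncol = 4)
blocks         # rows correspond to the block pairs (1,1), (1,2), (1,3), (2,2), (2,3), (3,3)
nrow(blocks)   # nblocks*(nblocks+1)/2 = 6 array tasks are needed
```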
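Once both files are in place (and the `log/` directory referenced by the `--output` and `--error` options exists), the array is submitted with `sbatch slurm_script.sl`. After all tasks have finished, the per-task results can be combined into a single table. The sketch below is only a guide: it assumes that `GUSLD()` wrote one plain-text table per task whose name begins with the `deer_r2_` prefix used above, so check the file names and format that your version of GUSLD actually produces before using it.

```r
## Minimal sketch for combining the per-task LD results (see assumptions above)
res_files <- list.files(pattern = "^deer_r2_")              # one results file per array task
res_list  <- lapply(res_files, read.table, header = TRUE)   # adjust sep/header to match the actual output format
deer_r2   <- do.call(rbind, res_list)                       # stack the LD estimates into one data frame
write.table(deer_r2, file = "deer_r2_all.txt", quote = FALSE, row.names = FALSE)
```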