```{css, echo=FALSE}
.courier-font {
  font-family: Courier New, Courier, monospace;
}
```
```{r}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = TRUE,
  class.source = "courier-font"
)
library(GUSLD)
```
This tutorial provides an example of how to use GUS-LD on a high-performance computing (HPC) cluster. The code in this tutorial is written for the Slurm Workload Manager (as this is what is used on many HPC clusters in New Zealand); however, the ideas presented can easily be extended to other workload managers.
This tutorial assumes that you are familiar with GUSLD. If not, have a read through the introduction tutorial for GUSLD first. We will use a dataset of 704 deer individuals genotyped at 38 SNPs using genotyping-by-sequencing (GBS). Although the dataset is very small, we use it to illustrate the procedure for running GUSLD on an HPC cluster. The data accompany the GUSLD package and are stored as a Variant Call Format (VCF) file. The path to the VCF file can be obtained using the `deerData()` function.
```{r}
vcffile <- deerData()              # path to the VCF file supplied with GUSLD
vcffile                            # filename stored in object vcffile
rafile <- VCFtoRA(vcffile)         # convert the VCF file to RA format
RAdata <- readRA(rafile)           # read in the RA data
deer <- makeUR(RAdata, filter = list(MAF = 0.01, MISS = 0.5,
                                     HW = c(-0.05, Inf), MAXDEPTH = 500))
deer$writeRA(file = "deer_filter") # write the filtered data for the cluster jobs
```
Now, to parallelize LD estimation on a Slurm cluster, we first need to write a Slurm script. An example for the deer data is shown below.
```{bash slurm_script.sl, eval=FALSE}
export NO_BLOCKS=3
export NSNPS=38

source activate GUSverse
srun GUSLD_est_LD.R $NO_BLOCKS $NSNPS
```
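The script needs to be submitted as a Slurm array job, with one task for every unique pair of SNP blocks, of which there are NO_BLOCKS × (NO_BLOCKS + 1)/2. A minimal sketch of the submission follows; the `--cpus-per-task` value is an assumption, and the exact resource options will depend on your cluster.

```shell
# One array task per unique pair of blocks: NO_BLOCKS*(NO_BLOCKS+1)/2
NO_BLOCKS=3
NTASKS=$(( NO_BLOCKS * (NO_BLOCKS + 1) / 2 ))
echo "$NTASKS"    # 6 array tasks for 3 blocks

# Hypothetical submission; make the R script executable first:
# chmod +x GUSLD_est_LD.R
# sbatch --array=1-$NTASKS --cpus-per-task=4 slurm_script.sl
```

Each array task picks up its block pair via the `SLURM_ARRAY_TASK_ID` environment variable, and the number of cores to use via `SLURM_CPUS_PER_TASK`, as shown in the R script below.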
The file **GUSLD_est_LD.R** is:

```r
#!/usr/bin/env Rscript
#### file: GUSLD_est_LD.R
library(GUSLD)

## Determine how to break up the data set
args <- as.integer(commandArgs(trailingOnly = TRUE))
nblocks <- args[1]   # number of blocks to break up the SNPs
nSnps   <- args[2]   # number of SNPs in the dataset
cuts <- floor(seq(0, nSnps, length.out = nblocks + 1))
block_int <- cbind(cuts[-(nblocks + 1)] + 1, cuts[-1])
colnames(block_int) <- c("Start", "Stop")
## All unique pairs of block intervals (e.g. nblocks = 3 and nSnps = 38 gives
## the intervals 1-12, 13-25 and 26-38, and hence 6 block pairs)
blocks <- matrix(c(block_int[rep(1:nblocks, nblocks:1), ],
                   block_int[sequence(nblocks:1) + rep(0:nblocks, nblocks:0), ]),
                 ncol = 4)

## Determine which blocks this array task is going to use
iter <- as.numeric(Sys.getenv("SLURM_ARRAY_TASK_ID"))
block1 <- blocks[iter, 1]:blocks[iter, 2]
block2 <- blocks[iter, 3]:blocks[iter, 4]
subset <- unique(c(block1, block2))

## Read in the required subset of the RA data
RAdata <- readRA("deer_filter.ra.tab", snpsubset = subset)

## Create the UR population
deer <- makeUR(RAdata, filter = NULL, mafEst = TRUE)

## Estimate LD
outfile <- paste0("deer_r2_", iter)
nClust <- as.numeric(Sys.getenv("SLURM_CPUS_PER_TASK"))
if (identical(block1, block2)) {
  ## diagonal block: estimate LD for all pairs within the block
  GUSLD(deer, nClust = nClust, filename = outfile, LDmeasure = "r2")
} else {
  ## off-diagonal block: only pairs between the two blocks
  pairs <- as.matrix(expand.grid(1:length(block1),
                                 length(block1) + 1:length(block2)))
  GUSLD(deer, SNPpairs = pairs, nClust = nClust, filename = outfile,
        LDmeasure = "r2")
}
```
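Once all the array tasks have completed, the per-task results can be combined into a single file. The sketch below assumes each task writes a plain-text table whose name starts with the `deer_r2_` prefix used above and whose first line is a header; the exact extension GUSLD appends to `filename`, and the name of the merged file, are assumptions here.

```shell
# Combine the per-task LD results, keeping a single header line
out=deer_r2_combined.txt           # hypothetical name for the merged file
first=1
for f in deer_r2_[0-9]*; do
  [ -e "$f" ] || continue          # nothing to do if no results exist yet
  if [ "$first" -eq 1 ]; then
    cat "$f" > "$out"              # first file: keep its header
    first=0
  else
    tail -n +2 "$f" >> "$out"      # later files: drop the repeated header
  fi
done
```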