run_GATK: 'Run GATK array jobs on HPC'
In yangjl/huskeR: Genomic and Genetic analysis Pipeline on HPC


GATK Best Practices: recommended workflows for variant discovery analysis.

run_GATK(inputdf, runbwa = TRUE, markDup = TRUE, addRG = FALSE,
  rungatk = FALSE,
  ref.fa = "~/dbcenter/Ecoli/reference/Ecoli_k12_MG1655.fasta",
  gatkpwd = "$HOME/bin/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar",
  picardpwd = "$HOME/bin/picard-tools-2.1.1/picard.jar", minscore = 5,
  realignInDels = FALSE, indels.vcf = "indels.vcf",
  recalBases = FALSE, dbsnp.vcf = "dbsnp.vcf", shbase = NULL,
  jobid = "runarray", email = NULL, runinfo = c(FALSE, "batch", 1,
  "1.5", "10:00:00"))

`inputdf`	An input data.frame describing the fastq files. Must contain the columns fq1, fq2 and out (and/or bam). If inputdf contains a bam column, the BWA alignment step is skipped. Additional columns: group (group id), sample (sample id), PL (platform, e.g. illumina), LB (library id), PU (platform unit, e.g. unit1). These strings are passed to bwa mem through -R.
`runbwa`	Set up BWA-mem, default=TRUE.
`markDup`	Mark Duplicates, default=TRUE.
`addRG`	Add or replace Read Groups using Picard AddOrReplaceReadGroups, default=FALSE.
`rungatk`	Setup GATK, default=FALSE.
`ref.fa`	The full path to the reference genome fasta file, which must already be indexed with bwa.
`gatkpwd`	The absolute path of GenomeAnalysisTK.jar.
`picardpwd`	The absolute path of picard.jar.
`minscore`	Minimum alignment score to output, default=5 (bwa's own default is 30). Passed to bwa mem as -T INT.
`realignInDels`	Realign indels, default=FALSE. If TRUE, a gold-standard indels.vcf file must be provided.
`indels.vcf`	The full path of indels.vcf.
`recalBases`	Recalibrate bases, default=FALSE. If TRUE, a gold-standard SNP vcf file must be provided (see dbsnp.vcf).
`dbsnp.vcf`	The full path of dbsnp.vcf.
`shbase`	Base name for the generated shell scripts, e.g. "slurm-script/run_gatk_". [chr]
`jobid`	Job ID, default="runarray". [chr]
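`email`	The email address that the cluster will notify once the jobs finish or fail.
`runinfo`	Parameters specifying the array job's partition information. A vector such as c(FALSE, "batch", 1, "1.5", "10:00:00"): 1) whether to run the job, default=FALSE; 2) -p partition name (this entry previously listed bigmemh as the default, while the usage above defaults to "batch"); 3) --cpus, default=1; 4) mem in Gb, default=1.5. The fifth element in the usage default, "10:00:00", is not described here and is presumably a wall-time limit. The vector is passed to `set_array_job`.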
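The read-group columns of `inputdf` can be illustrated with a small sketch. It assumes the standard SAM `@RG` header layout that `bwa mem -R` accepts; the helper name `make_read_group` is hypothetical and is not part of huskeR, and the exact string that `run_GATK` builds internally may differ.

```r
# Hypothetical helper (not part of huskeR): build one SAM read-group
# string per row of inputdf, in the form accepted by `bwa mem -R`.
rg_cols <- c("group", "sample", "PL", "LB", "PU")

make_read_group <- function(inputdf) {
  missing_cols <- setdiff(rg_cols, names(inputdf))
  if (length(missing_cols) > 0) {
    stop("inputdf is missing columns: ", paste(missing_cols, collapse = ", "))
  }
  # "\\t" keeps a literal backslash-t, which bwa mem converts to a tab.
  sprintf("@RG\\tID:%s\\tSM:%s\\tPL:%s\\tLB:%s\\tPU:%s",
          inputdf$group, inputdf$sample, inputdf$PL, inputdf$LB, inputdf$PU)
}

inputdf <- data.frame(fq1 = "fq_1.fq", fq2 = "fq_2.fq", out = "mysample",
                      group = "g1", sample = "s1", PL = "illumina",
                      LB = "lib1", PU = "unit1", stringsAsFactors = FALSE)
rg <- make_read_group(inputdf)
```

Checking for the columns up front gives a clear error before any job script is written, rather than a failure inside an array task.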

For more detail about GATK, see: https://www.broadinstitute.org/gatk/guide/bp_step.php?p=1

Indexing the reference: bwa index Zea_mays.AGPv2.14.dna.toplevel.fa

module load java/1.8
module load bwa/0.7.9a

Local programs: bwa (version 0.7.5a-r405), picard-tools-2.1.1, GenomeAnalysisTK-3.5/
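Because `ref.fa` must be indexed before alignment, it can help to check for the index files first. The sketch below relies on the five files that `bwa index` normally writes next to the reference (.amb, .ann, .bwt, .pac, .sa); the helper name `missing_bwa_index` is hypothetical and not part of huskeR.

```r
# Hypothetical helper (not part of huskeR): list the index files that
# `bwa index` normally writes next to the reference but that are absent.
missing_bwa_index <- function(ref.fa) {
  idx <- paste0(path.expand(ref.fa), c(".amb", ".ann", ".bwt", ".pac", ".sa"))
  idx[!file.exists(idx)]
}

ref.fa <- "~/dbcenter/Ecoli/reference/Ecoli_k12_MG1655.fasta"
absent <- missing_bwa_index(ref.fa)
if (length(absent) > 0) {
  message("Run `bwa index ", ref.fa, "` first; missing: ",
          paste(basename(absent), collapse = ", "))
}
```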

Returns a batch of shell scripts.

inputdf <- data.frame(fq1="fq_1.fq", fq2="fq_2.fq", out="mysample",
                 group="g1", sample="s1", PL="illumina", LB="lib1", PU="unit1")

run_GATK(inputdf, runbwa=TRUE, markDup=TRUE, addRG=FALSE, rungatk=FALSE,
         ref.fa="~/dbcenter/Ecoli/reference/Ecoli_k12_MG1655.fasta",
         gatkpwd="$HOME/bin/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar",
         picardpwd="$HOME/bin/picard-tools-2.1.1/picard.jar",
         minscore=5,
         realignInDels=FALSE, indels.vcf="indels.vcf",
         recalBases=FALSE, dbsnp.vcf="dbsnp.vcf",
         shbase=NULL, jobid="runarray",
         email=NULL, runinfo = c(FALSE, "batch", 1, "1.5", "10:00:00"))
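For more than one sample, the same columns can be filled from a vector of sample names. This is a minimal sketch: the sample names and the `<sample>_1.fq` / `<sample>_2.fq` file naming are assumptions made for illustration, and it is assumed (not confirmed here) that each row of `inputdf` corresponds to one array task.

```r
# Illustrative sample names; replace with your own.
samples <- c("s1", "s2", "s3")

# One row per sample; the fastq naming scheme is an assumption.
inputdf <- data.frame(
  fq1 = paste0(samples, "_1.fq"),
  fq2 = paste0(samples, "_2.fq"),
  out = samples,
  group = "g1", sample = samples,
  PL = "illumina", LB = "lib1", PU = "unit1",
  stringsAsFactors = FALSE
)

# The data.frame can then be passed on as in the example above, e.g.
# run_GATK(inputdf, runbwa = TRUE, markDup = TRUE, addRG = FALSE, rungatk = FALSE)
```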