run_gstacks: Run STACKS new module called gstacks
In thierrygosselin/stackr: Run stacks pipeline for RADseq analysis inside R

run_gstacks

R Documentation

Run STACKS new module called gstacks

Description

Run STACKS gstacks module inside R!

Usage

run_gstacks(
  denovo = TRUE,
  P = "06_ustacks_2_gstacks",
  M = "02_project_info/population.map.tsv2bam.tsv",
  I = NULL,
  B = NULL,
  O = NULL,
  parallel.core = parallel::detectCores() - 1,
  model = "marukilow",
  var.alpha = 0.01,
  gt.alpha = 0.05,
  rm.pcr.duplicates = FALSE,
  rm.unpaired.reads = FALSE,
  ignore.pe.reads = TRUE,
  unpaired = TRUE,
  kmer.length = 31,
  max.debruijn.reads = 1000,
  min.kmer.cov = 2,
  write.alignments = FALSE,
  min.mapq = 10,
  max.clipped = 0.2,
  max.insert.len = 1000,
  details = FALSE,
  phasing.cooccurrences.thr.range = c(1, 2),
  phasing.dont.prune.hets = FALSE,
  h = FALSE
)

Arguments

`denovo`	(logical) de novo vs reference-based mode. Default: `denovo = TRUE`.
`P`	(path, character) De novo mode. Path to the directory containing STACKS files. Default: `P = "06_ustacks_2_gstacks"`. Inside the folder, you should have: the catalog files: starting with `batch_` and ending with `.alleles.tsv.gz, .snps.tsv.gz, .tags.tsv.gz and .bam`; 5 files for each samples: The sample name is the prefix for the files ending with: `.alleles.tsv.gz, .models.tsv.gz, .snps.tsv.gz, .tags.tsv.gz and .bam`. Those files are created in the ustacks, sstacks, cxstacks, tsv2bam modules .
`M`	(path, character) Path to a population map giving the list of samples.
`I`	(character, path) Reference-based mode. Input directory containing BAM files. Default: `I = NULL`.
`B`	(character, path) Reference-based mode. Path to input BAM files. The input BAM file(s) must be sorted by coordinate. With `B`, records must be assigned to samples using BAM "reads groups" (gstacks uses the ID/identifier and SM/sample name fields). Read groups must be consistent if repeated different files. With `I`, read groups are unneeded and ignored. Please refer to the gstacks manual page for information. stacks also provide information about how to generate such a BAM file with Samtools or Sambamba, and examples. Default: `B = NULL`.
`O`	(character, path) Path to output directory. `stackr` by default will output in the input directory. Default: `O = NULL` (same as input).
`parallel.core`	(integer) Enable parallel execution with the number of threads. Default: `parallel.core = parallel::detectCores() - 1`.
`model`	(character) The model to use to call variants and genotypes; one of `"marukilow"`, `"marukihigh"`, or `"snp"`. See ref for more details on algorithms. Default: `model = "marukilow"`.
`var.alpha`	(double) Alpha threshold for discovering SNPs. Default: `var.alpha = 0.01`.
`gt.alpha`	(double) Alpha threshold for calling genotypes. Default: `gt.alpha = 0.05`.
`rm.pcr.duplicates`	(logical) remove all but one set of read pairs of the same sample that have the same insert length (implies `rm.unpaired.reads = TRUE`.). Default: `rm.pcr.duplicates = FALSE`.
`rm.unpaired.reads`	(logical) Discard unpaired reads. Default: `rm.unpaired.reads = FALSE`.
`ignore.pe.reads`	(logical) With default the function will ignore paired-end reads even if present in the input. Default: `ignore.pe.reads = TRUE`.
`unpaired`	(logical) Ignore read pairing (only for paired-end GBS; treat READ2's as if they were READ1's). Default: `unpaired = TRUE`.
`kmer.length`	(integer) De novo mode. kmer length for the de Bruijn graph. For expert. Default: `kmer.length = 31`.
`max.debruijn.reads`	(integer) Maximum number of reads to use in the de Bruijn graph. For expert. Default: `max.debruijn.reads = 1000`.
`min.kmer.cov`	(integer) De novo mode. Minimum coverage to consider a kmer. For expert. Default: `min.kmer.cov = 2`.
`write.alignments`	(logical) Save read alignments (heavy BAM files). Default: `write.alignments = FALSE`.
`min.mapq`	(double) Reference-based mode. Minimum PHRED-scaled mapping quality to consider a read. Default: `min.mapq = 10`.
`max.clipped`	(double) Reference-based mode. Maximum soft-clipping level, in fraction of read length. Default: `max.clipped = 0.20`.
`max.insert.len`	(integer) Reference-based mode. Maximum allowed sequencing insert length Default: `max.insert.len = 1000`.
`details`	(logical) With default the function will write a more detailed output. Default: `details = FALSE`.
`phasing.cooccurrences.thr.range`	(integer) range of edge coverage thresholds to iterate over when building the graph of allele cooccurrences for SNP phasing. Default: `phasing.cooccurrences.thr.range = c(1,2)`.
`phasing.dont.prune.hets`	(logical) Don't try to ignore dubious heterozygote genotypes during phasing. By default, during phasing, dubious het are ignored. Default: `phasing.dont.prune.hets = FALSE`.
`h`	Display this help messsage. Default: `h = FALSE`

Value

tsv2bam returns a set of .matches.bam files.

The function run_gstacks returns a list with the number of individuals, the batch ID number, a summary data frame and a plot containing:

INDIVIDUALS: the sample id
ALL_LOCUS: the total number of locus for the individual (shown in subplot A)
LOCUS: the number of locus with a one-to-one relationship (shown in subplot B) with the catalog
MATCH_PERCENT: the percentage of locus with a one-to-one relationship with the catalog (shown in subplot C)

Addtionally, the function returns a batch_X.catalog.bam file that was generated by merging all the individual BAM files in parallel.

References

Catchen JM, Amores A, Hohenlohe PA et al. (2011) Stacks: Building and Genotyping Loci De Novo From Short-Read Sequences. G3, 1, 171-182.

Catchen JM, Hohenlohe PA, Bassham S, Amores A, Cresko WA (2013) Stacks: an analysis tool set for population genomics. Molecular Ecology, 22, 3124-3140.

Maruki T, Lynch M (2017) Genotype Calling from Population-Genomic Sequencing Data. G3, 7, 1393-1404.

Examples

## Not run: 
# The simplest form of the function with De novo data:
bam.sum <- stackr::run_gstacks() # that's it !
# This will use, by default, the same population map used in run_tsv2bam

## End(Not run)

thierrygosselin/stackr documentation built on April 13, 2025, 10:28 a.m.