extract_random_seqs_from_multiple_genomes: Extract random loci from a set of genomes
In HajkD/metablastr: Perform Massive Local BLAST Searches

extract_random_seqs_from_multiple_genomes

R Documentation

Description

In some cases, users may wish to extract sequences of a particular length from randomly sampled loci in a set of genomes. This function lets users specify the number of sequences (`sample_size`) and their length (`interval_width`) that shall be randomly sampled from the genomes. The sampling rule is as follows. For each locus, independently:

1) choose randomly (equal probability: see `sample.int` for details) the chromosome from which the locus shall be sampled (replace = TRUE).
2) choose randomly (equal probability: see `sample.int` for details) the strand (plus or minus) from which the locus shall be sampled (replace = TRUE).
3) choose randomly (equal probability: see `sample.int` for details) the starting position of the locus in the sampled chromosome and strand (replace = TRUE).
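
The three-step rule above can be sketched in base R. This is only an illustration of the sampling logic, not the package's implementation: the helper `sample_random_loci()` and the toy chromosome lengths are hypothetical, and the requirement that a locus fits entirely inside its chromosome is an assumption made here.

```r
# Toy chromosome lengths (hypothetical values, not taken from a real genome).
chr_lengths <- c(chr1 = 5000L, chr2 = 3000L, chr3 = 1200L)

# Hypothetical helper illustrating the sampling rule; not part of metablastr.
sample_random_loci <- function(chr_lengths, sample_size, interval_width) {
  # Assumption: every chromosome is at least as long as the locus.
  stopifnot(all(chr_lengths >= interval_width))

  # 1) Chromosome: equal probability, with replacement.
  chr_idx <- sample.int(length(chr_lengths), sample_size, replace = TRUE)

  # 2) Strand: equal probability, with replacement.
  strand <- c("plus", "minus")[sample.int(2L, sample_size, replace = TRUE)]

  # 3) Start position: uniform over all starts that keep the locus
  #    inside the sampled chromosome (assumed boundary handling).
  max_start <- unname(chr_lengths[chr_idx]) - interval_width + 1L
  start <- vapply(max_start, function(m) sample.int(m, 1L), integer(1))

  data.frame(
    chromosome = names(chr_lengths)[chr_idx],
    strand = strand,
    start = start,
    end = start + interval_width - 1L,
    stringsAsFactors = FALSE
  )
}

set.seed(42)  # only to make the illustration reproducible
loci <- sample_random_loci(chr_lengths, sample_size = 5L, interval_width = 500L)
print(loci)
```

Each row of `loci` describes one sampled locus. Because every step samples with replacement, the same chromosome, strand, or start position can occur more than once.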

Usage

extract_random_seqs_from_multiple_genomes(
  sample_size,
  replace = TRUE,
  prob = NULL,
  interval_width,
  subject_genomes,
  file_name = NULL,
  separated_by_genome = FALSE,
  update = TRUE,
  path = NULL
)

Arguments

`sample_size`	a non-negative integer giving the number of loci that shall be sampled.
`replace`	logical value indicating whether sampling should be with replacement. Default: `replace = TRUE`.
`prob`	a vector of probability weights for obtaining the elements of the vector being sampled. Default is `prob = NULL`.
`interval_width`	the length of the locus that shall be sampled.
`subject_genomes`	a vector containing file paths to the reference genomes that shall be queried (e.g. file paths returned by `meta.retrieval`).
`file_name`	name of the fasta file that stores the sampled sequences. This name will only be used when `separated_by_genome = FALSE`.
`separated_by_genome`	a logical value indicating whether sequences from different genomes should be stored in the same output `fasta` file (`separated_by_genome = FALSE`; default) or in separate `fasta` files (`separated_by_genome = TRUE`).
`update`	a logical value indicating whether an existing `file_name` file shall be overwritten (`update = TRUE`; default) or whether the sampled sequences shall be appended to the existing file (`update = FALSE`).
`path`	a folder path in which corresponding `fasta` output files shall be stored.
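
A call might look like the following sketch, which uses only the arguments listed above. The genome file paths, the output file name, and the output folder are hypothetical placeholders. Creating the output folder beforehand is a precaution taken here, not a documented requirement. The call is guarded so that it only runs when the package is installed and the genome files exist.

```r
# Hypothetical genome FASTA files; replace with real paths,
# e.g. file paths returned by `meta.retrieval`.
genomes <- c("genomes/species_a.fa", "genomes/species_b.fa")

# Hypothetical output folder, created up front as a precaution.
out_dir <- file.path(tempdir(), "random_loci")
dir.create(out_dir, showWarnings = FALSE)

if (requireNamespace("metablastr", quietly = TRUE) && all(file.exists(genomes))) {
  metablastr::extract_random_seqs_from_multiple_genomes(
    sample_size = 100,              # number of loci to sample
    interval_width = 1000,          # length of each sampled locus
    subject_genomes = genomes,
    file_name = "random_loci.fa",   # used because separated_by_genome = FALSE
    separated_by_genome = FALSE,    # one combined output fasta file
    update = TRUE,                  # overwrite an existing output file
    path = out_dir
  )
}
```

With `separated_by_genome = TRUE`, `file_name` would not be used, since sequences from each genome are then stored in separate `fasta` files.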