generative.prob: Compute generative probabilities from BLAST output
In metaMix: Bayesian Mixture Analysis for Metagenomic Community Profiling

Description Usage Arguments Value Examples

generative.prob() computes the probability for a read to be generated by a certain species, given that it originates from this species. The input for this function is the tabular BLAST output format, either default or custom. The function uses the recorded mismatches to produce a Poisson probability.

generative.prob.nucl() for when we have nucleotide similarity, i.e we have performed BLASTn.

generative.prob(blast.output.file = NULL, read.length.file = 80,
  contig.weight.file = 1, gi.taxon.file = NULL,
  protaccession.taxon.file = NULL, gi.or.prot = "prot",
  gen.prob.unknown = 1e-06, outDir = NULL, blast.default = TRUE)

generative.prob.nucl(blast.output.file = NULL, read.length.file = 80,
  contig.weight.file = 1, gi.taxon.file, gen.prob.unknown = 1e-20,
  outDir = NULL, genomeLength = NULL, blast.default = TRUE)

`blast.output.file`	This is the tabular BLASTx output format for generative.prob(), while it is the tabular BLASTn output format for generative.prob.nucl(). It can either be the default output format or a specific custom output format, incorporating read length and taxon identifier. Please see the vignette for column order and the exact BLAST command to use. You can also use DIAMOND instead of BLASTx which is much faster and produces default format according to BLAST default output specifications.
`read.length.file`	This argument can either be a file mapping each read to its length or a numerical value, representing the average read length.
`contig.weight.file`	This argument can either be a file where weights are assigned to reads and contigs. For unassembled reads the weight is equal to 1 while for contigs the weight should reflect the number of reads that assembled it.
`gi.taxon.file`	For generative.prob() this would be the 'gi_taxid_prot.dmp' taxonomy file, mapping each protein gi identifier to the corresponding taxon identifier. It can be downloaded from ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_prot.dmp.gz . For generative.prob.nucl() this would be the 'gi_taxid_nucl.dmp' taxonomy file, mapping each nucleotide gi identifier to the corresponding taxon identifier. It can be downloaded from ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz.
`protaccession.taxon.file`	This parameter has been added as NCBI is phasing out the usage of GI identifiers. For generative.prob() this would be the prot.accession2taxid taxonomy file, mapping each protein accession identifier to the corresponding taxon identifier. It can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz. I have found that it is useful to concatenate it with ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/dead_prot.accession2taxid.gz so you can search in both files for the protein identifier (sometimes obsolete sequences can still be present in latest RefSeq releases but not in taxonomy files and vice versa and these mismatches can cause loss of information). TODO add support for nucleotides as well.
`gi.or.prot`	This parameter specifies whether the user is using the GI identifiers or protein accession identifiers to map to taxon identifiers. Values are 'gi' or 'prot'. The default value is 'prot'.
`gen.prob.unknown`	User-defined generative probability for unknown category. Default value for generative.prob() is 1e-06, while for generative.prob.nucl() is 1e-20.
`outDir`	Output directory.
`blast.default`	logical. Is the input the default blast output tabular format? Default value is TRUE. That means that the BLAST output file needs to have the following fields:Query id, Subject id, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, e-value, bit score. Alternatively we can use the 'blast.default=FALSE' option, providing a custom blast output that has been produced using the option -outfmt '6 qacc qlen sacc slen stitle bitscore length pident evalue staxids'.
`genomeLength`	This is applicable only for generative.prob.nucl() . It is a file mapping each genome/nucleotide to its respective length. The file must be tab seperated and the first column the nucleotide gi identifier (integer) and the second the corresponding sequence length (integer). It will be used to correct the Poisson probabilities between each read and genome.

step1: A list with five elements. The first one (pij.sparse.mat) is a sparse matrix with the generative probability between each read and each species. The second (ordered.species) is matrix containing all the potential species as recorded by BLAST, ordered by the number of reads. The third one (read.weights) is a data.frame mapping each contig to a weight which is essentially the number of reads that were used to assemble it. For unassembled reads the weight is equal to one. The fourth one is the generative probability for the unknown category (gen.prob.unknown), to be used in all subsequent steps. Finally we also record the output directory (outDir) where the results will be stored.

# See vignette for more details

## Not run: 
# When using custom BLAST output file
step1 <-generative.prob(blast.output.file = "pathtoFile/blastOut.custom", blast.default=FALSE)

# When using default BLAST output file
step1 <-generative.prob(blast.output.file = "pathtoFile/blastOut.default",
                        read.length.file="pathtoFile/read.lengths",
                        contig.weight.file="pathtoFile/read.weights",
                        gi.taxon.file = "pathtoFile/taxon.file")

## End(Not run)