blast_18S_reformat: Process a tabular output from blastn (BLAST+)


View source: R/blast.R

Description

The BLAST file must originate from blastn with the following output format option:

-outfmt "6 qseqid sseqid sacc stitle sscinames staxids sskingdoms sblastnames pident slen length mismatch gapopen qstart qend sstart send evalue bitscore"

It is very important that the columns are in this precise order and no column is missing.
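Because the column order matters, it can help to check a file before processing it. The following is a minimal sketch in base R, not part of dvutils: `outfmt_fields`, `outfmt_value` and `has_expected_fields` are hypothetical names, and the check only counts fields per line (it cannot tell whether the columns are in the right order).

```r
# Field specifiers copied from the -outfmt string above, in order.
outfmt_fields <- c(
  "qseqid", "sseqid", "sacc", "stitle", "sscinames", "staxids",
  "sskingdoms", "sblastnames", "pident", "slen", "length", "mismatch",
  "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore"
)

# Rebuild the value to pass to blastn after -outfmt.
outfmt_value <- paste("6", paste(outfmt_fields, collapse = " "))

# Hypothetical helper (not part of dvutils): TRUE when every line of the
# tab-separated file has exactly as many fields as the format string.
has_expected_fields <- function(file_name) {
  n <- count.fields(file_name, sep = "\t", quote = "", comment.char = "")
  length(n) > 0 && all(!is.na(n)) && all(n == length(outfmt_fields))
}
```

A file that fails this check was most likely produced with a different -outfmt string and should be regenerated rather than patched.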

For the formatting options see:

What the function does:

  0. Remove any self hit

  1. Group all GenBank accessions

  2. Obtain taxonomy from GenBank (note the GenBank taxonomy is now in the PR2 database after downloading from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/)

  3. Check if the sequence is in PR2 and get the PR2 taxo (this is done with the pr2 database package)

  4. Detect whether the Subject sequence is uncultured or not

  5. Merge back into the BLAST file

  6. Compute a summary with best hit, best hit to PR2, best hit to cultured, taxo consensus (identity>96%), contradiction at division level (identity>90%)
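Some of these steps can be sketched in base R on a toy table. This is only an illustration under stated assumptions, not the package's implementation: the rows are invented, a self hit is assumed to be a row whose hit accession equals the query id, a subject is assumed to be uncultured when its title contains "uncultured", and the best hit is assumed to be the one with the highest bit score.

```r
# Toy rows, invented for illustration (column names follow this page).
blast <- data.frame(
  query_id     = c("AB000001", "AB000001", "AB000001", "otu_2"),
  hit_acc      = c("AB000001", "KX000010", "MN000020", "KX000010"),
  hit_title    = c("Micromonas sp. 18S",
                   "Uncultured eukaryote clone X",
                   "Micromonas pusilla strain Y",
                   "Uncultured eukaryote clone X"),
  pct_identity = c(100, 99.2, 98.7, 97.1),
  bit_score    = c(950, 900, 880, 700),
  stringsAsFactors = FALSE
)

# Step 0 (assumed rule): drop rows where the hit is the query itself.
blast <- blast[blast$query_id != blast$hit_acc, ]

# Step 4 (assumed rule): flag subjects whose title mentions "uncultured".
blast$uncultured <- grepl("uncultured", blast$hit_title, ignore.case = TRUE)

# Step 6, best hit only: sort by decreasing bit score within each query,
# then keep the first row per query.
blast <- blast[order(blast$query_id, -blast$bit_score), ]
best_hit <- blast[!duplicated(blast$query_id), ]

# Step 6, best cultured hit: same idea restricted to cultured subjects.
cultured <- blast[!blast$uncultured, ]
best_cultured <- cultured[!duplicated(cultured$query_id), ]
```

In this toy table the query otu_2 has no cultured hit, so it is simply absent from `best_cultured`.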

The modified BLAST output file includes additional columns.

The summary file contains several sets of columns:

  1. The top hit (columns with prefix hit_top_)

  2. The top hit for which a PR2 sequence is available (columns starting with hit_pr2_)

  3. The top hit corresponding to a culture or an isolate (columns starting with hit_cult_)

  4. A "consensus" taxonomy based on all the hits with more than 98% identity

  5. Contradiction between hits with >90% identity
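The last two items can be illustrated with a small sketch. The exact rules used by the package are not given here, so this assumes that a consensus exists when all hits above the identity threshold share the same value, and that a contradiction exists when hits above the threshold disagree. The `division` column, the toy values and both helper names are hypothetical.

```r
# Toy hits for a single query; the "division" column is hypothetical.
hits <- data.frame(
  query_id     = "otu_1",
  pct_identity = c(99.5, 98.4, 93.0, 91.2),
  division     = c("Chlorophyta", "Chlorophyta",
                   "Chlorophyta", "Dinoflagellata"),
  stringsAsFactors = FALSE
)

# Return the single value shared by all hits above min_identity,
# or NA when there is no such hit or the hits disagree.
consensus_value <- function(values, identity, min_identity) {
  kept <- unique(values[identity > min_identity])
  if (length(kept) == 1) kept else NA_character_
}

# TRUE when hits above min_identity carry more than one distinct value.
has_contradiction <- function(values, identity, min_identity) {
  length(unique(values[identity > min_identity])) > 1
}

consensus_division <- consensus_value(hits$division, hits$pct_identity, 98)
contradiction_90 <- has_contradiction(hits$division, hits$pct_identity, 90)
```

The threshold is passed as an argument on purpose, so the same helpers can be tried with any identity cutoff.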

Usage

blast_18S_reformat(file_name)

Arguments

file_name

The name of the BLAST file with full path

Value

TRUE if the function has been successful.

The modified table is saved under the name of the input file, with its extension replaced by _pr2.tsv.

The summary table is saved under the name of the input file, with its extension replaced by _summary.tsv.
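As an illustration of this naming rule, the output paths can be derived as follows. This is a sketch, not the package's code: `replace_extension` is a hypothetical helper, built on `tools::file_path_sans_ext()` from the tools package shipped with R.

```r
# Hypothetical helper: strip the final extension of a path and append
# the given suffix.
replace_extension <- function(file_name, suffix) {
  paste0(tools::file_path_sans_ext(file_name), suffix)
}

pr2_file     <- replace_extension("C:/BLAST_output.tsv", "_pr2.tsv")
summary_file <- replace_extension("C:/BLAST_output.tsv", "_summary.tsv")
# pr2_file     is "C:/BLAST_output_pr2.tsv"
# summary_file is "C:/BLAST_output_summary.tsv"
```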

Column names

The columns of the BLAST table are named as follows. For the summary, a prefix is added.

  hit_top_ / hit_pr2_ / hit_cult_

  query_id, hit_id, hit_acc, hit_title, hit_sci_names

  hit_tax_ids, hit_super_kingdoms, hit_blast_names,

  pct_identity, hit_length, alignment_length, mismatches,

  gap_opens, query_start, query_end, hit_start, hit_end,

  evalue, bit_score
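The sketch below shows one way to apply these names to a table read without a header, and to build prefixed names for the summary. How the package combines a prefix with names that already start with hit_ is not stated here, so the simple prepending in `add_prefix` (a hypothetical helper that leaves query_id untouched) is an assumption, and the toy row is invented.

```r
blast_cols <- c(
  "query_id", "hit_id", "hit_acc", "hit_title", "hit_sci_names",
  "hit_tax_ids", "hit_super_kingdoms", "hit_blast_names",
  "pct_identity", "hit_length", "alignment_length", "mismatches",
  "gap_opens", "query_start", "query_end", "hit_start", "hit_end",
  "evalue", "bit_score"
)

# One toy line (values invented); blastn tabular output has no header.
toy_line <- paste(c(
  "otu_1", "gi|1", "KX000010", "Uncultured eukaryote clone X",
  "uncultured eukaryote", "100272", "Eukaryota", "eukaryotes",
  "99.2", "1750", "380", "3", "0", "1", "380", "520", "899",
  "0.0", "680"
), collapse = "\t")

blast <- read.delim(text = toy_line, header = FALSE, sep = "\t",
                    quote = "", stringsAsFactors = FALSE)
names(blast) <- blast_cols

# Hypothetical helper: prepend a summary prefix to every column name
# except query_id (assumed behaviour, for illustration only).
add_prefix <- function(cols, prefix) {
  ifelse(cols == "query_id", cols, paste0(prefix, cols))
}

top_cols <- add_prefix(blast_cols, "hit_top_")
```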

Programming notes

The following functions must be used with the library qualifier dplyr:: because they are also in the plyr library

Uses the pr2database package for faster access (much faster).
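As a hedged illustration of the qualifier, summarise and mutate are two functions exported by both plyr and dplyr, so an unqualified call may pick the wrong one depending on which package was attached last. The toy data and variable names below are invented, and the example only runs when dplyr is installed.

```r
# Toy table, invented for illustration.
hits <- data.frame(
  query_id  = c("a", "a", "b"),
  bit_score = c(10, 30, 20)
)

if (requireNamespace("dplyr", quietly = TRUE)) {
  # Explicit dplyr:: calls are unaffected by plyr being attached.
  with_flag <- dplyr::mutate(hits, strong = bit_score >= 20)
  overall   <- dplyr::summarise(hits, best_score = max(bit_score))
}
```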

Examples

blast_18S_reformat("C:/BLAST_output.tsv")

vaulot/dvutils documentation built on Nov. 20, 2021, 11:01 a.m.