metascope_blast: Blast reads from MetaScope aligned files

View source: R/metascope_blast.R

metascope_blastR Documentation

Blast reads from MetaScope aligned files

Description

This function allows the user to check a subset of identified reads against NCBI BLAST and the nucleotide database to confirm or contradict results provided by MetaScope. It aligns the top 'metascope_id()' results to NCBI BLAST database. It REQUIRES that command-line BLAST and a separate nucleotide database have already been installed on the host machine. It returns a csv file updated with BLAST result metrics.

Usage

metascope_blast(
  metascope_id_path,
  bam_file_path = list.files(tmp_dir, ".updated.bam$", full.names = TRUE)[1],
  tmp_dir,
  out_dir,
  sample_name,
  fasta_dir = NULL,
  num_results = 10,
  num_reads = 100,
  hit_list = 10,
  num_threads = 1,
  db_path,
  quiet = FALSE,
  NCBI_key = NULL,
  db = NULL,
  accessions_path = NULL
)

Arguments

metascope_id_path

Character string; path to a csv file output by 'metascope_id()'.

bam_file_path

Character string; full path to bam file for the same sample processed by 'metascope_id'. Note that the 'filter_bam' function must have output a bam file, and not a .csv.gz file. See '?filter_bam_bowtie' for more details. Defaults to list.files(file_temp, ".updated.bam$")[1].

tmp_dir

Character string, a temporary directory in which to host files.

out_dir

Character string, path to output directory.

sample_name

Character string, sample name for output files.

fasta_dir

Directory where fasta files for blast will be stored.

num_results

Integer, the maximum number of taxa from the metascope_id output to check reads. Takes the top n taxa, i.e. those with largest abundance. Default is 10.

num_reads

Integer, the maximum number of reads to blast per microbe. If the true number of reads assigned to a given taxon is smaller, then the smaller number will be chosen. Default is 100. Too many reads will involve more processing time.

hit_list

Integer, number of blast hit results to keep. Default is 10.

num_threads

Integer, number of threads if running in parallel (recommended). Default is 1.

db_path

Character string. The database file to be searched (including basename, but without file extension). For example, if the nt database is in the nt folder, this would be /filepath/nt/nt assuming that the database files have the nt basename. Check this path if you get an error message stating "No alias or index file found".

quiet

Logical, whether to print out more informative messages. Default is FALSE.

NCBI_key

(character) NCBI Entrez API key. optional. See taxize::use_entrez(). Due to the high number of requests made to NCBI, this function will be less prone to errors if you obtain an NCBI key.

db

Currently accepts one of c("ncbi", "silva", "other") Default is "ncbi", appropriate for samples aligned against indices compiled from NCBI whole genome databases. Alternatively, usage of an alternate database (like Greengenes2) should be specified with "other".

accessions_path

Directory where accession files for blast are stored.

Details

This function assumes that you used the NCBI nucleotide database to process samples, with a download date of 2021 or later. This is necessary for compatibility with the bam file headers.

This is highly computationally intensive and should be ran with multiple cores, submitted as a multi-threaded computing job if possible.

Note, if best_hit_strain is FALSE, then no strain was observed more often among the BLAST results.

Value

This function writes an updated csv file with metrics.

Examples

## Not run: 
### Create temporary directory
file_temp <- tempfile()
dir.create(file_temp)

bamPath <- system.file("extdata", "bowtie_target.bam",
                       package = "MetaScope")
file.copy(bamPath, file_temp)

metascope_id(file.path(file_temp, "bowtie_target.bam"), aligner = "bowtie2",
             input_type = "bam", out_dir = file_temp, num_species_plot = 0,
             update_bam = TRUE)

## Run metascope blast
### Get export name and metascope id results
out_base <- bamPath |> basename() |> tools::file_path_sans_ext() |>
  tools::file_path_sans_ext()
metascope_id_path <- file.path(file_temp, paste0(out_base, ".metascope_id.csv"))

# NOTE: change db_path to the location where your BLAST database is stored!
db <- "/restricted/projectnb/pathoscope/data/blastdb/nt/nt"

Sys.setenv(ENTREZ_KEY = "<your id here>")

metascope_blast(metascope_id_path,
                bam_file_path = file.path(file_temp, "bowtie_target.bam"),
                tmp_dir = file_temp,
                out_dir = file_temp,
                sample_name = out_base,
                db_path = db,
                num_results = 10,
                num_reads = 5,
                hit_list = 10,
                num_threads = 3,
                db = "ncbi",
                quiet = FALSE,
                NCBI_key = NULL,
                fasta_dir = NULL,
                accessions_path = NULL)

## Remove temporary directory
unlink(file_temp, recursive = TRUE)

## End(Not run)


compbiomed/MetaScope documentation built on Nov. 20, 2024, 8 p.m.