In selkamand/utilitybeltrefseq: What the Package Does (One Line, Title Case)

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

utilitybeltrefseq

The goal of utilitybeltrefseq is to make it easy to find and download the best refseq assemblies available for any given species. All you need to know is the species NCBI taxonomy ID

Installation

You can install the development version of utilitybeltrefseq from GitHub with:

# install.packages("devtools")
devtools::install_github("selkamand/utilitybeltrefseq")

Example

Downloading the best assembly

Say you want to find the best reference genome for Escherichia coli.

First we load the library then download a list of available assemblies from refseq.

library(utilitybeltrefseq)

update_refseq_data_cache()

For best results, run the above command every couple of months to stay up to date with whats currently in refseq.

Next, we need to find the NCBI taxonomy ID of our species of interest. We can find that here (taxid: 562). We could also have used the taxize package if we wanted to stay within R.

Now that we have the species level taxid, we can run choose_best_assembly

choose_best_assembly(taxid_of_interest = 562)

Note that we get a warning that there are multiple 'best' assemblies. If there are multiple strains in a species with complete assemblies - it his hard to know which to return. We choose the most recently added assembly. To return all of the high quality assemblies you can run:

 choose_best_assembly(
   taxid_of_interest = 562, 
   break_ties_based_on_newest_sequence_added = FALSE,
   return_accession_only = TRUE # set this to false to get more info about each assembly
   )

Keep an eye on the intraspecific name column. Often can use this to choose which strain you're after. See ?choose_best_assembly for details.

Once we have the assembly accession we're interested in, we can download it:

 download_assembly(
   target_assembly_accession = GCF_000005845.2
   )

If we just want to quickly download a ref-genome of our species of interest we can just run

best_assembly = download_best_assembly(taxid_of_interest = 562)

Aligning reads to assembly

If you have bwa-mem2 installed, this R package can also be used to align paired reads to the best assembly. This can be installed with brew install bwa2. Can also find linux binaries here. To make a bam index file you will also need samtools installed and available on PATH

forward_reads = "path/to/forward_reads.fq"
reverse_reads = "path/to/reverse_reads.fq"
taxid_of_interest = 562
outfile_prefix = "E. coli"

taxid_of_interest |>
  download_best_assembly() |>
  prep_for_alignment() %>%
  align_reads(forward_reads = forward_reads, reverse_reads = reverse_reads, outfile_prefix = outfile_prefix) |>
  bam_index() # remove this line and |> line if you don't want an indexed bam / don't have samtools installed