In schyen/isolationWorkflow: Workflow for building strain libraries during isolation experiments

This outlines how to use the isolationWorkflow:

Prepare sequences for Blastn
Blast your fasta file
Read in blast results, and compile into table
Generate/update strain library and fasta file
- Creates/updates strain library with unique isolates
- Creates fasta file by species for ones that need sequence alignment (amalgamates strains from current blast result and strain library) 5) Align redundant species and update strain library

Getting started

Once you have downloaded the isolationgWorkflow package, you need to load the package before using it.

library(isolationWorkflow)

Once loaded, you can use ? to find out more about each function in the package.

?abif_fasta()
?parse_blast()

1) Prepare sequences for Blast

This step includes: Check if sequences were successful, and allows you to exlude failed sequences Trim ends off of sequences, as they are usually strings of Ns * Make a fasta file containing all sequences from your Big Dye plate so can send in one Blastn query

All of this is done using the abif_fasta(). I recommend running this command twice. The first time I use to evaluate all of my sequences, and the second time I generate a "clean" fasta file.

This will give you: a file containing sequences before and after trimming a fasta file containing trimmed sequences (if output = FALSE) Note: all output files saved in same location as sanger files

When evaluating sequences, look for the following: sequence length -- too short means the sequence failed (i.e. DNA not released from boiling, no PCR product or BigDye failed) distribution of Ns -- Ns throughout the majority of the read means the isolate was not pure * satisfaction with the trimming algorithm -- compare the raw vs. trimmmed version of the sequence to ensure you agree with the trimming. If you don't, manually edit the final FASTA file before Blasting

# Check for sequence quality, and trimming
abif_fasta(folder = "/path/to/sanger/files/",
           exclude = NULL,
           trim=TRUE, trim.check=FALSE,
           export.check=TRUE, show.prog=TRUE, 
           output=FALSE)

# prepare sequences for Blast
## exclude parameter accepts multiple entries to exclude multiple samples
abif_fasta(folder = "/path/to/sanger/files/",
           exclude = c('sample1.ab1','sample2.ab1'),
           trim=TRUE, trim.check=FALSE,
           export.check=TRUE, show.prog=TRUE, 
           output='myIsolates-clean.FASTA')

2) Blast your sequences

Use [Blastn](https://blast.ncbi.nlm.nih.gov/Blast.cgi](https://blast.ncbi.nlm.nih.gov/Blast.cgi) to classify your sequences. Setting the database to 16S ribosomal RNA sequences (Bacteria and Archaea) helps with specificity of the search. Download your blast result as an XML file and save in same location as sanger files. Then using the parse_blast(), we can read our results and save as a csv table. Output is saved in same location as your downloaded xml file.

3) Read in blast results

# include path to blast result if necessary
parse_blast(filename='/path/to/downloaded/blast/result/######-Alignment.xml',
            ngroups=1, tophit=FALSE,
            output='blast_result.csv')

4) Building a strain library

You can use gen_strainlib() to both generate and update your strain library, depending on how the parameters are used. Regardless, when you use gen_strainlib() the following happens:

reads in blast result file from parse blast
if no strain library supplied, creates a new one
Species that are only found once and not already in strain library are added to library
Generates strain library as csv
Generates strain library as fasta
Generates fasta file for each species when more than one isolate is assigned that species

Regarding the last point, the blast result being examined will be cross referenced with the strain library. So you can compare and align your blast results with strains already in your library. These species-based fasta files allow you to align the sequences that are classified to the same species (see next step)

Note: a fasta file of your library strains is also generated because that is a useful thing to have.

``` {r, echo = TRUE, eval = FALSE}

generating strain library

gen _strainlib(blast_file = "/include/path/if/necessary/blast_result.csv", lib_path = "/path/to/strain/library", lib_file = NULL, lib_name = 'mystrainlibrary')

updating strain library

gen _strainlib(blast_file = "/include/path/if/necessary/blast_result.csv", lib_path = "/path/to/strain/library", lib_file = "mystrainlibrary.csv", lib_name = NULL)

*****************
# 5) Align redundant species and update strain library
This steps aligns the sequences of isolates that were classified as the same species, and allows you to pick which isolates you want to add to the library. Alignment is done using `AlignSeqs()` in the R package, `DECIPHER`.

The alignment will automatically pop up as an HTML file in your web browser, with the nucleotides colour-coded to help identify any mis-matches.The alignemnt file is automatically saved in the same location as the fasta file that was aligned. When choosing which strains to keep, consider the following:
* is the difference due to an N? if so, then you don't have enough information to say the sequences are different at that base
* a strain that is different from others, ONLY because of Ns may have poor sequence quality and you should consider redoing the sequencing for that isolate.

```r
# supply more than one fasta to align to save time
choose_alignment(lib_path = "/path/to/strain/library", 
                 lib_file = 'mystrainlibrary.csv', 
                 blast_file = "/include/path/if/necessary/blast_result.csv",
                 fasta_path = "/path/to/species/fasta/files",
                 fastatoalign = c("species1.fasta",
                               "species2.fasta",
                               "species3.fasta"))

Concluding remarks

Since isolation results usually comes back in waves, it is helpful to be checking your results as you go, and freezing strains down as needed (rather than waiting for all results to come back, then deciding whichs strains to keep). This workflow is designed so that you are able to examine sanger sequences as they come in, and easily cross-reference them with the strains you have already decided to keep in your strain library.

Note, the final strain library file is in the same format as the blast result. This may have extra info you may not find useful in your final strain library. I recommend you don't modify the format, but rather, copy and paste into a seperate file to edit as you please.

schyen/isolationWorkflow documentation built on May 24, 2019, 2:03 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com