align_to_ref: Align and Trim MSA Against a Reference

View source: R/align_to_ref.R

align_to_ref    R Documentation

Align and Trim MSA Against a Reference

Description

Takes FASTA files with target sequences and aligns them against a reference sequence supplied by the user. The output is an aligned FASTA file trimmed to the length of the reference sequence. Records without full coverage (sequences with leading or trailing gaps) are removed, as are records containing non-IUPAC characters. Finally, internal gaps are removed based on the percent coverage of each character position in the multiple sequence alignment, using the threshold supplied in the pigl argument.
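The two record-level filters described above can be illustrated on a toy alignment. This is a minimal base R sketch, not MACER's actual implementation: the sequences, the object names, and the choice of the IUPAC nucleotide alphabet (including N) are assumptions made for illustration.

```r
# Toy aligned records (hypothetical data, all the same length).
aln <- c(
  full_ok   = "ACGTRYACGT",  # full coverage, IUPAC characters only
  lead_gap  = "--GTACACGT",  # leading gap: incomplete coverage
  trail_gap = "ACGTACAC--",  # trailing gap: incomplete coverage
  bad_char  = "ACGTXZACGT",  # X and Z are not IUPAC nucleotide codes
  internal  = "ACG--TACGT"   # internal gaps only; handled later via pigl
)

# Filter 1: a record lacks full coverage if it starts or ends with a gap.
has_end_gap <- grepl("^-|-$", aln)

# Filter 2: a record is rejected if it contains anything other than
# IUPAC nucleotide codes or the gap character (assumed alphabet).
iupac <- "ACGTURYSWKMBDHVN"
non_iupac <- !grepl(paste0("^[", iupac, "-]+$"), toupper(aln))

kept <- aln[!has_end_gap & !non_iupac]
print(names(kept))
```

Records with only internal gaps pass both filters here; whether they are kept in the final output depends on the pigl argument.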

Usage

align_to_ref(
  data_folder = NULL,
  ref_seq_file = NULL,
  MAFFT_loc = NULL,
  output_file = NULL,
  pigl = 0.95,
  op = 1.53
)

Arguments

data_folder

The location of the folder containing the FASTA files to be aligned. The default value is NULL, in which case the program prompts the user to select the folder through point-and-click.

ref_seq_file

The location of the reference sequence file. The default value is NULL, in which case the program prompts the user to select it through point-and-click.

MAFFT_loc

The location of the MAFFT program. The default value is NULL, in which case the program prompts the user to select it through point-and-click.

output_file

The location for the output files from the program. The default value is NULL, in which case the program places the output files in the same location as the target files.

pigl

This is the percent internal gap loop argument. It sets a threshold, given as a proportion (for example, 0.95 for 95 percent): records causing internal gaps are removed when the threshold is exceeded. If this value is set to 0, internal gaps are not removed. The default is 0.95.
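The description of pigl leaves some room for interpretation. The sketch below shows one possible reading, in which a record is flagged when it carries a base at a column where more than pigl of the records have a gap. The function name flag_gap_causing and the toy data are hypothetical, and this is not confirmed to be how MACER evaluates the threshold.

```r
# Toy alignment (hypothetical): seq4 is the only record with a base
# at column 4, so it "causes" the gap seen in the other three records.
aln <- c(
  seq1 = "ACG-TAC",
  seq2 = "ACG-TAC",
  seq3 = "ACG-TAC",
  seq4 = "ACGATAC"
)

flag_gap_causing <- function(aln, pigl = 0.95) {
  # pigl = 0 switches the internal gap step off entirely.
  if (pigl == 0) return(character(0))
  # One row per record, one column per alignment position.
  mat <- do.call(rbind, strsplit(aln, ""))
  # Proportion of records with a gap at each position.
  gap_prop <- colMeans(mat == "-")
  # Columns that are mostly, but not entirely, gaps (assumed rule).
  gappy <- which(gap_prop > pigl & gap_prop < 1)
  if (length(gappy) == 0) return(character(0))
  # Records holding a base in any such column are flagged.
  has_base <- mat[, gappy, drop = FALSE] != "-"
  names(aln)[rowSums(has_base) > 0]
}

print(flag_gap_causing(aln, pigl = 0.5))
print(flag_gap_causing(aln, pigl = 0.95))
```

In this toy case 75 percent of the records have a gap at column 4, so seq4 is flagged with pigl = 0.5 but not with the default of 0.95.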

op

This is the gap opening penalty passed to MAFFT. The higher the value, the larger the penalty for opening a gap in the alignment. The default is 1.53, which is also the default value in the MAFFT program. For alignment of highly conserved regions where no gaps are expected, this should be set to a much higher number; 10 is recommended for coding regions such as COI-5P.
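To show where a gap opening penalty fits in a MAFFT command line, the sketch below assembles an argument vector using MAFFT's --op, --add, and --keeplength options. The helper build_mafft_args and the file names are hypothetical, and the source does not show the command MACER actually issues, so treat this only as an illustration of the op value being handed to MAFFT.

```r
# Hypothetical helper: builds MAFFT arguments for adding target sequences
# to a reference while keeping the reference length.
build_mafft_args <- function(target_file, ref_seq_file, op = 1.53) {
  c("--op", as.character(op),  # gap opening penalty
    "--keeplength",            # do not lengthen the reference alignment
    "--add", target_file,      # sequences to add
    ref_seq_file)              # existing reference sequence or MSA
}

mafft_args <- build_mafft_args("targets.fas", "reference.fas", op = 10)
cat("mafft", mafft_args, "\n")

# A real run would hand these to the executable, for example:
# system2("path/to/mafft", mafft_args, stdout = "aligned.fas")
```

The system2() call is left commented out because it needs a local MAFFT installation and real input files.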

Details

User Input:
1. A folder containing the FASTA files to be aligned and trimmed using the supplied reference sequence. Please note that every FASTA file (named *.fas) in this folder will be analyzed.
2. A reference sequence file containing a single sequence, or an MSA in which all sequences have the same length.
3. The location of the MAFFT executable file <https://mafft.cbrc.jp/alignment/software/>
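Before running the function it can help to check the first two inputs. The sketch below, in base R, lists the *.fas files in a folder and verifies that all sequences in a reference file have the same length. The temporary folder, file names, and sequences are made up for illustration, and the checks are not part of MACER.

```r
# Build a throwaway folder with two .fas files and one unrelated file.
data_folder <- file.path(tempdir(), "macer_input_demo")
dir.create(data_folder, showWarnings = FALSE)
file.create(file.path(data_folder, c("genusA.fas", "genusB.fas", "notes.txt")))

# Only files ending in .fas would be picked up for analysis.
fas_files <- list.files(data_folder, pattern = "\\.fas$", full.names = TRUE)
print(basename(fas_files))

# Write a small reference MSA and check that all sequences share one length.
ref_seq_file <- file.path(tempdir(), "reference_demo.fas")
writeLines(c(">ref1", "ACGTAC", "GTAC", ">ref2", "ACGTACGTAC"), ref_seq_file)

ref_lines <- readLines(ref_seq_file)
is_header <- startsWith(ref_lines, ">")
record_id <- cumsum(is_header)  # which record each line belongs to
# Join wrapped sequence lines so each record becomes one string.
ref_seqs <- tapply(ref_lines[!is_header], record_id[!is_header],
                   paste, collapse = "")
same_length <- length(unique(nchar(ref_seqs))) == 1
print(same_length)
```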

Value

Output:
1. A log file titled MAFFT_log, written to the submitted folder.
2. Sequence output files, placed in two subfolders created inside the submitted folder containing the FASTA files of interest: MAFFT and MAFFT_trimmed. The MAFFT folder contains files named after the files in the submitted folder with "_MAFFT" appended. The MAFFT_trimmed folder contains files with the same naming convention with "_MAFFT_trimmed" appended.
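After a run, the presence of the expected outputs can be checked with a few lines of base R. This is a sketch only: the helper check_macer_outputs is hypothetical, the log file's extension is not stated in the source (so any name starting with MAFFT_log is accepted), and the demo folder is populated by hand rather than by MACER.

```r
# Hypothetical helper: reports which expected outputs exist in a folder.
check_macer_outputs <- function(data_folder) {
  c(
    log_file    = length(list.files(data_folder, pattern = "^MAFFT_log")) > 0,
    mafft_dir   = dir.exists(file.path(data_folder, "MAFFT")),
    trimmed_dir = dir.exists(file.path(data_folder, "MAFFT_trimmed"))
  )
}

# Demo on a throwaway folder that mimics the described layout.
demo_folder <- file.path(tempdir(), "macer_output_demo")
dir.create(file.path(demo_folder, "MAFFT"), recursive = TRUE,
           showWarnings = FALSE)
dir.create(file.path(demo_folder, "MAFFT_trimmed"), showWarnings = FALSE)
file.create(file.path(demo_folder, "MAFFT_log.txt"))  # extension is assumed

output_status <- check_macer_outputs(demo_folder)
print(output_status)
```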

Author(s)

Robert G. Young

References

<https://github.com/rgyoung6/MACER>

Young RG, Gill R, Gillis D, Hanner RH (2021) Molecular Acquisition, Cleaning and Evaluation in R (MACER) - A tool to assemble molecular marker datasets from BOLD and GenBank. Biodiversity Data Journal 9: e71378. <https://doi.org/10.3897/BDJ.9.e71378>

See Also

auto_seq_download() create_fastas() barcode_clean()

Examples

## Not run: 
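# Lower internal gap threshold than the default of 0.95
align_to_ref(pigl=0.75)
# Default threshold with the higher gap opening penalty suggested for coding regions
align_to_ref(pigl=0.95, op=10)
# Internal gaps are not removed
align_to_ref(pigl=0)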

## End(Not run)
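The examples above leave the path arguments at NULL, so the locations are chosen interactively. A non-interactive call might look like the sketch below. All three paths are placeholders to be replaced, and the guard is an addition for illustration: it only calls the function when MACER is installed and the inputs exist.

```r
# Placeholder locations (hypothetical); replace with real paths.
data_folder  <- "path/to/fasta_folder"
ref_seq_file <- "path/to/reference.fas"
MAFFT_loc    <- "path/to/mafft"

# Only attempt the run when the package and all inputs are available.
inputs_ready <- requireNamespace("MACER", quietly = TRUE) &&
  dir.exists(data_folder) &&
  file.exists(ref_seq_file) &&
  file.exists(MAFFT_loc)

if (inputs_ready) {
  MACER::align_to_ref(
    data_folder  = data_folder,
    ref_seq_file = ref_seq_file,
    MAFFT_loc    = MAFFT_loc,
    pigl         = 0.95,
    op           = 10
  )
} else {
  message("MACER or one of the input paths is missing; nothing was run.")
}
```

Leaving output_file unset keeps the documented default, where the output is written alongside the target files.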


MACER documentation built on Dec. 3, 2022, 1:10 a.m.