make.gene.matrix: Linkage of Specimens to Sequences for Multiple Loci

View source: R/make.gene.matrix.R

make.gene.matrix    R Documentation

Description

Creates a matrix which links gene sequences from multiple loci to their associated specimen, determined by assigning a voucher to each specimen using parsed metadata for each sequence.

Usage

make.gene.matrix(metadata, locusCol = "cleanedGeneRegion", vouchersCol = "newLabels", ncbiCol = "NCBI_accession", orgsCol = "organism", logerrors = TRUE, verbose = FALSE)

Arguments

metadata

The output of parse.INSDSeq, containing all parsed metadata and sequences from the raw XML data. Must contain a column for vouchers, which can be made with make.unique.vouchers and added to the metadata data frame using cbind.

locusCol

An optional string, the name of the column in the metadata which contains the name of the gene region.

vouchersCol

An optional string, the name of the column in the metadata which contains the voucher.

ncbiCol

An optional string, the name of the column in the metadata which contains the NCBI accession number.

orgsCol

An optional string, the name of the column in the metadata which contains the name of the taxon.

logerrors

An optional logical value indicating whether the function should export a csv file listing the sequences which did not have a voucher. These sequences are automatically excluded from the output matrix.

verbose

An optional logical value indicating whether the function should print every row of the metadata it successfully incorporates into the matrix.
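As the metadata argument notes, the voucher column has to be attached to the data frame before this function is called. The following is a minimal sketch of that step using cbind. The toy data frame, the accession strings, and the voucher labels are all made up for illustration; in practice the data frame would come from parse.INSDSeq and the labels from make.unique.vouchers.

```r
# Toy stand-in for the output of parse.INSDSeq (all values are made up).
metadata <- data.frame(
  NCBI_accession    = c("ACC0001", "ACC0002", "ACC0003"),
  organism          = c("Quercus alba", "Quercus alba", "Quercus rubra"),
  cleanedGeneRegion = c("ITS", "rbcL", "ITS"),
  stringsAsFactors  = FALSE
)

# Hypothetical voucher labels, one per row of metadata.
# In a real workflow these would be produced by make.unique.vouchers.
voucher_labels <- c("voucher_A", "voucher_A", "voucher_B")

# Attach the labels under the default column name expected by vouchersCol.
metadata <- cbind(metadata, newLabels = voucher_labels, stringsAsFactors = FALSE)

str(metadata)
```

Naming the new column newLabels matches the default for vouchersCol, so no extra argument would be needed; any other name would have to be passed through vouchersCol.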

Details

locusCol, vouchersCol, ncbiCol, and orgsCol all have default values that correspond to the default names of those columns from other functions in the morton package. They are cleanedGeneRegion, newLabels, NCBI_accession, and organism, respectively.

The default value for logerrors is TRUE. For verbose, it is FALSE.

Setting verbose to TRUE can be useful when troubleshooting: because every successfully incorporated row is printed, the last row printed shows where the function stopped.
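The exclusion that logerrors reports on can be pictured with a short base R sketch. This is not the package's code. It assumes that "no voucher" means an NA or empty label, and the log file name is hypothetical, since neither detail is stated on this page.

```r
# Toy metadata with two sequences that lack a voucher (values are made up).
metadata <- data.frame(
  NCBI_accession    = c("ACC0001", "ACC0002", "ACC0003", "ACC0004"),
  cleanedGeneRegion = c("ITS", "rbcL", "ITS", "rbcL"),
  newLabels         = c("voucher_A", "voucher_A", NA, ""),
  stringsAsFactors  = FALSE
)

# Assumption: a sequence has no voucher when its label is NA or an empty string.
has_voucher <- !is.na(metadata$newLabels) & nzchar(metadata$newLabels)

kept    <- metadata[has_voucher, ]   # rows that could go into the matrix
dropped <- metadata[!has_voucher, ]  # rows that would be excluded and logged

# Hypothetical log file name, written to a temporary directory.
log_file <- file.path(tempdir(), "sequences_without_vouchers.csv")
write.csv(dropped, log_file, row.names = FALSE)

print(dropped$NCBI_accession)
```

Reviewing such a file before building downstream alignments is one way to check whether voucher parsing missed sequences that should have been linked to a specimen.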

This function traverses the metadata data table from parse.INSDSeq and generates a matrix where unique vouchers are the rows and gene loci are the columns. Each cell represents a sequence: its row identifies the voucher and its column identifies the gene locus it belongs to. Cells therefore contain the NCBI accession number of the sequence they are associated with. If there are multiple sequences for a single voucher and gene locus, all of their NCBI accession numbers are entered into the cell, delimited by a pipe (|).

Value

A matrix which connects NCBI gene sequences to their associated loci and vouchers.

Author(s)

Andrew Hipp and Kasey Pham

See Also

parse.INSDSeq, make.unique.vouchers, make.fasta.files, make.shared.gene.matrix, cbind

Keywords

manip, methods
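The voucher-by-locus layout that this function returns can be illustrated with a small base R sketch. It uses tapply as a stand-in technique and is not the implementation in the morton package. The data are made up, and empty cells come out as NA here, which may differ from what make.gene.matrix produces.

```r
# Toy metadata using the default column names (all values are made up).
metadata <- data.frame(
  NCBI_accession    = c("ACC0001", "ACC0002", "ACC0003", "ACC0004"),
  organism          = c("Quercus alba", "Quercus alba", "Quercus alba", "Quercus rubra"),
  cleanedGeneRegion = c("ITS", "ITS", "rbcL", "ITS"),
  newLabels         = c("voucher_A", "voucher_A", "voucher_A", "voucher_B"),
  stringsAsFactors  = FALSE
)

# Rows are vouchers, columns are loci. Accessions that share a voucher and a
# locus are pasted into one cell, separated by a pipe.
gene_matrix <- tapply(
  metadata$NCBI_accession,
  list(voucher = metadata$newLabels, locus = metadata$cleanedGeneRegion),
  paste, collapse = "|"
)

print(gene_matrix)
```

In this sketch voucher_A has two ITS sequences, so that cell holds both accessions joined by a pipe, while voucher_B has no rbcL sequence and its cell is left missing.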


andrew-hipp/morton documentation built on April 7, 2024, 12:15 p.m.