DB2Seqs: Export Database Sequences to a FASTA or FASTQ File

Description

Exports a database containing sequences to a FASTA or FASTQ formatted file of sequence records.

Usage

DB2Seqs(file,
         dbFile,
         tblName = "Seqs",
         identifier = "",
         type = "BStringSet",
         limit = -1,
         replaceChar = NA,
         nameBy = "description",
         orderBy = "row_names",
         removeGaps = "none",
         append = FALSE,
         width = 80,
         compress = FALSE,
         chunkSize = 1e5,
         sep = "::",
         clause = "",
         verbose = TRUE)

Arguments

file

Character string giving the location where the file should be written.

dbFile

A SQLite connection object or a character string specifying the path to the database file.

tblName

Character string specifying the table from which to extract the data.

identifier

Optional character string used to narrow the search results to those matching a specific identifier. If "" then all identifiers are selected.

type

The type of XStringSet (sequences) to export to a FASTA formatted file or QualityScaledXStringSet to export to a FASTQ formatted file. This should be (an unambiguous abbreviation of) one of "DNAStringSet", "RNAStringSet", "AAStringSet", "BStringSet", "QualityScaledDNAStringSet", "QualityScaledRNAStringSet", "QualityScaledAAStringSet", or "QualityScaledBStringSet". (See details section below.)

limit

Maximum number of sequence records to export. The default (-1) does not limit the number of records.

replaceChar

Optional character used to replace any characters of the sequence that are not present in the XStringSet's alphabet. Not applicable if type=="BStringSet". The default (NA) results in an error if an incompatible character exists. (See details section below.)

nameBy

Character string giving the column name(s) for identifying each sequence record. If more than one column name is provided, the information in each column is concatenated, separated by sep, in the order specified.

orderBy

Character string giving the column name for sorting the results. Defaults to the order of entries in the database. Optionally can be followed by " ASC" or " DESC" to specify ascending (the default) or descending order.

removeGaps

Determines how gaps ("-" or "." characters) are removed in the sequences. This should be (an unambiguous abbreviation of) one of "none", "all" or "common".

append

Logical indicating whether to append the output to the existing file.

width

Integer specifying the maximum number of characters per line of sequence. Not applicable when exporting to a FASTQ formatted file.

compress

Logical specifying whether to compress the output file using gzip compression.

chunkSize

Number of sequences to write to the file at a time. Cannot be less than the total number of sequences if removeGaps is "common".

sep

Character string providing the separator between fields in each sequence's name, by default pairs of colons (“::”).

clause

An optional character string to append to the query as part of a “where clause”.

verbose

Logical indicating whether to display status.
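
As a rough illustration of how nameBy and sep combine into a record name, the sketch below concatenates columns of a small data frame in base R. The data frame is a hypothetical stand-in for rows of the "Seqs" table, and the code only mimics the behavior described above; it is not DECIPHER's implementation.

```r
# Hypothetical stand-in for a few rows of the "Seqs" table (not read from a database)
seqs <- data.frame(
  row_names   = 1:3,
  identifier  = c("Bacteria", "Bacteria", "Archaea"),
  description = c("seq A", "seq B", "seq C"),
  stringsAsFactors = FALSE
)

nameBy <- c("identifier", "description")  # columns used to build each name
sep    <- "::"                            # separator between fields

# Concatenate the chosen columns, in the order given, separated by sep
record_names <- do.call(paste, c(as.list(seqs[nameBy]), list(sep = sep)))
print(record_names)
```

With a single column in nameBy (the default, "description"), no separator appears in the name.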

Details

Sequences are exported into either a FASTA or FASTQ file as determined by the type of sequences. If type is an XStringSet then sequences are exported to FASTA format. Quality information for QualityScaledXStringSets is interpreted as PhredQuality scores before export to FASTQ format.

If type is "BStringSet" (the default) then sequences are exported to a FASTA file exactly as they were when imported. If type is "DNAStringSet" then all U's are converted to T's before export, and vice versa if type is "RNAStringSet". All remaining characters not in the XStringSet's alphabet are converted to replaceChar or removed if replaceChar is "". Note that if replaceChar is NA (the default), an error results when an unexpected character is found.
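
The conversion rules for type = "DNAStringSet" can be sketched in base R as follows. The helper to_dna is hypothetical (it is not part of DECIPHER), it assumes upper-case input, and it uses a simplified alphabet of IUPAC nucleotide codes plus the "-", "." and "+" characters; it is meant only to make the three replaceChar cases concrete.

```r
# Hypothetical helper mimicking the described behavior; not DECIPHER's code
to_dna <- function(x, replaceChar = NA) {
  # Convert every U to T, as described for type = "DNAStringSet"
  x <- chartr("U", "T", x)
  # Assumed alphabet: IUPAC nucleotide codes plus "+", "." and "-"
  invalid <- "[^ACGTMRWSYKVHDBN+.-]"
  if (is.na(replaceChar)) {
    # Default: any character outside the alphabet is an error
    if (any(grepl(invalid, x))) stop("incompatible character found")
    return(x)
  }
  # Otherwise substitute replaceChar, or drop the character if replaceChar is ""
  gsub(invalid, replaceChar, x)
}

clean    <- to_dna("ACGU-N")                     # only U needs converting
replaced <- to_dna("ACGU*X", replaceChar = "N")  # "*" and "X" become "N"
removed  <- to_dna("ACGU*X", replaceChar = "")   # "*" and "X" are dropped
failed   <- tryCatch(to_dna("ACGU*X"), error = function(e) "error")
print(c(clean, replaced, removed, failed))
```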

Value

Writes a FASTA or FASTQ formatted file containing the sequence records in the database.

Returns the number of sequence records written to the file.

Author(s)

Erik Wright eswright@pitt.edu

References

ES Wright (2016) "Using DECIPHER v2.0 to Analyze Big Biological Sequence Data in R". The R Journal, 8(1), 352-359.

Examples

db <- system.file("extdata", "Bacteria_175seqs.sqlite", package="DECIPHER")
tf <- tempfile()
DB2Seqs(tf, db, limit=10)
file.show(tf) # press 'q' to exit
unlink(tf)
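
A further sketch, combining several of the arguments documented above: it exports the first 10 records as DNA with gaps removed, names each record by two columns, and writes a gzip-compressed file. It assumes that DECIPHER (which attaches Biostrings) is installed and that the "Seqs" table of the example database has identifier and description columns; the ".fas.gz" file extension is an arbitrary choice.

```r
library(DECIPHER)  # assumed to be installed; attaches Biostrings as well

db <- system.file("extdata", "Bacteria_175seqs.sqlite", package="DECIPHER")
tf <- tempfile(fileext=".fas.gz")  # arbitrary extension for a gzipped FASTA file

# Export 10 records as DNA, named "identifier::description", without gaps
n <- DB2Seqs(tf, db,
             type="DNAStringSet",
             limit=10,
             nameBy=c("identifier", "description"),
             removeGaps="all",
             compress=TRUE,
             verbose=FALSE)
n # number of sequence records written

# Read the compressed file back in to inspect the record names
dna <- readDNAStringSet(tf)
head(names(dna))
unlink(tf)
```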

DECIPHER documentation built on Nov. 8, 2020, 8:30 p.m.