R/701-readFASTA.R

Defines functions readFASTA

Documented in readFASTA

#' Read Protein/DNA Sequences in FASTA Format
#'
#' Read Protein/DNA Sequences in FASTA Format
#' 
#' This function reads protein sequences in FASTA format.
#' 
#' @param file   The name of the file which the sequences in fasta format are 
#'               to be read from. If it does not contain an absolute or 
#'               relative path, the file name is relative to the current 
#'               working directory, \code{\link{getwd}}. 
#'               The default here is to read the \code{P00750.fasta} file which 
#'               is present in the \code{protseq} directory of the BioMedR package.
#' 
#' @param legacy.mode If set to \code{TRUE}, lines starting with a semicolon ';'
#'                    are ignored. Default value is \code{TRUE}.
#' @param seqonly     If set to \code{TRUE}, only sequences as returned without 
#'                    attempt to modify them or to get their names and 
#'                    annotations (execution time is divided approximately
#'                    by a factor 3). Default value is \code{FALSE}.
#' 
#' @return The result character vector
#' 
#' @keywords BioMedR FASTA readFASTA
#'
#' @aliases readFASTA
#' 
#' @note Note that any different sets of instances (chunklets),
#'       e.g. {1, 3, 7} and {4, 6}, might belong to the 
#'       same class and might belong to different classes.
#' 
#' @author Min-feng Zhu <\email{wind2zhu@@163.com}>, 
#'         Nan Xiao <\url{http://r2s.name}>
#' 
#' @seealso See \code{\link{readPDB}} for reading protein sequences 
#' in PDB format.
#' 
#' @export readFASTA
#' 
#' @references
#' Pearson, W.R. and Lipman, D.J. (1988) 
#' Improved tools for biological sequence comparison. 
#' \emph{Proceedings of the National Academy of Sciences 
#' of the United States of America}, \bold{85}: 2444-2448
#' 
#' @examples
#' P00750 = readFASTA(system.file('protseq/P00750.fasta', package = 'BioMedR'))
#' 

readFASTA = function (file = system.file('protseq/P00750.fasta', 
                                         package = 'BioMedR'), 
                      legacy.mode = TRUE, seqonly = FALSE) {

    # Read the fasta file as a vector of strings

    lines = readLines(file)

    # Remove comment lines starting with a semicolon ';'

    if (legacy.mode) {
        comments = grep("^;", lines)
        if (length(comments) > 0) {
            lines = lines[-comments]
        }
    }

    # Get the line numbers where sequences names are

    ind = which(substr(lines, 1L, 1L) == ">")

    # Compute the total number of sequences

    nseq = length(ind)

    if (nseq == 0) stop("no line starting with a > character found")

    # Localize sequence data

    start = ind + 1
    end = ind - 1
    end = c(end[-1], length(lines))

    # Read in sequences

    sequences = lapply(seq_len(nseq), 
                       function(i) paste(lines[start[i]:end[i]], collapse = ""))

    if (seqonly) return(sequences)

    # Read in sequence names

    nomseq = lapply(seq_len(nseq), function (i) {
        firstword = strsplit(lines[ind[i]], " ")[[1]][1]
        substr(firstword, 2, nchar(firstword))
        })

    # Give the sequences names to the list elements

    names(sequences) = nomseq
    
    return(sequences)

}

Try the BioMedR package in your browser

Any scripts or data that you put into this service are public.

BioMedR documentation built on Nov. 17, 2017, 10:08 a.m.