R/readUc.R
In packFinder: de novo Annotation of Pack-TYPE Transposable Elements

Documented in readUc

#' @title
#' Convert .uc Files to Dataframe
#'
#' @description
#' Reads .uc files (USEARCH Cluster Format) generated by 
#' the VSEARCH clustering and alignment algorithms.
#'
#' @param file
#' The file path of the .uc file.
#'
#' @param output
#' The type of analysis that was carried out to produce the 
#' .uc file.
#' \itemize{
#'     \item If output is specified as "cluster", VSEARCH 
#'     clustering was carried out.
#'     \item If output is specified as "alignment", 
#'     VSEARCH pairwise global alignment was carried out.
#' }
#' 
#' Note that clustering produces one "H" record 
#' for each sequence, and one "C" record for 
#' each cluster, while an alignment produces an 
#' "H" record for each alignment (see details).
#'
#' @return
#' A dataframe containing the converted .uc file. The fields 
#' contained within are as follows:
#' \itemize{
#'     \item Record type - "H, C or N", see details for 
#'     further information.
#'     \item Cluster designation 
#'     (\code{output = "cluster"} only)
#'     \item Sequence length, or cluster size
#'     \item Percent identity to target
#'     \item The nucleotide strand 
#'     (\code{output = "cluster"} only)
#'     \item A compressed alignment - see details for 
#'     further information.
#'     \item ID of query sequence
#'     \item ID of target sequence ("H" records only)
#' }
#'
#' @details
#' USEARCH cluster format is a tab separated text file that 
#' contains clustering and/or alignment information for a 
#' set of sequences. For each sequence a record type, 
#' "H, C or N", is provided providing information about the 
#' type of "hit" in the dataframe. These refer to:
#' 
#' \itemize{
#'     \item H - Hit - for alignments, indicates an 
#'     identified alignment of two supplied sequences. For 
#'     clustering, indicates the cluster assignment for a query.
#'     \item C - Cluster record - a record for each cluster generated.
#'     \item N - No hit - indicates that no cluster was 
#'     assigned or no alignment was found with a target 
#'     sequence. For clustering, a query with no hits becomes 
#'     the centroid of a new cluster.
#' }
#' 
#' Additionally, for each record a "compressed alignment" 
#' is generated. This is the alignment represented in a 
#' compact format including the letters "M", "D", and "I". 
#' Before each letter, the number of consecutive columns 
#' of the given letter type is also given. The letter types 
#' are as follows:
#' 
#' \itemize{
#'     \item "M" - Match - Identical bases between the query 
#'     and target sequence
#'     \item "D" - Deletion - A gap in the target sequence
#'     \item "I" - Insertion - A gap in the query sequence
#' }
#' 
#' An example of this would be "13M", referring to 13 
#' consecutive matches between the query and target sequence.
#' 
#' @seealso
#' code{\link{tirClust}}, code{\link{packAlign}}, 
#' code{\link{readBlast}}, code{\link{packClust}}
#'
#' @references
#' VSEARCH may be downloaded from 
#' \url{https://github.com/torognes/vsearch}. See 
#' \url{https://www.ncbi.nlm.nih.gov/pubmed/27781170} 
#' for further information.
#'
#' @examples 
#' readUc(system.file(
#'     "extdata", 
#'     "packMatches.uc", 
#'     package = "packFinder"
#' ))
#' 
#' @author
#' Jack Gisby
#' 
#' @export

readUc <- function(file, output = "cluster") {
    checkPermissions(file)

    if (output != "cluster" & output != "alignment") {
        stop("Argument 'output' must be specified as 'cluster' or 'alignment'")
    }

    packClusts <- utils::read.table(file, sep = "\t")
    colnames(packClusts) <- c(
        "type",
        "cluster",
        "width",
        "identity",
        "strand",
        "6",
        "7",
        "cigarAlignment",
        "query",
        "target"
    )

    cluster <- packClusts$cluster
    strand <- packClusts$strand

    # different parts of the uc file are required for different purposes
    if (output == "cluster") {
        return(subset(packClusts, select = -c(6, 7)))
    } else if (output == "alignment") {
        return(subset(packClusts, select = -c(cluster, strand, 6, 7)))
    }
}