#' toGRanges
#'
#' @description
#' Transforms a file or an object containing a region set into a
#' \code{\link{GRanges}} object.
#'
#' @details
#' If A is already a \code{\link{GRanges}} object, it will be returned untouched.
#'
#' If A is a data frame, the function will assume the first three columns are
#' chromosome, start and end and create a \code{\link{GRanges}} object. Any
#' additional column will be considered metadata and stored as such in the
#' \code{\link{GRanges}} object. There are 2 special cases: 1) if A is a
#' data.frame with only 2 columns, it will assume the first one is the
#' chromosome and the second one the position and it will create a GRanges with
#' single base regions and 2) if the data.frame has the first 3 columns named
#' "SNP", "CHR" and "BP" it will shuffle the columns and repeat "BP" to build
#' a GRanges of single base regions (this is the standard ouput format of plink).
#'
#' If A is not a data.frame and there are more parameters, it will try to build
#' a data.frame with all parameters and use that data.frame to build the
#' GRanges. This allows the user to call it like
#' \code{toGRanges("chr1", 10, 20)}.
#'
#' If A is a character or a character vector and it's not a file or a URL, it
#' assumes it's a genomic position description in the form used by UCSC or
#' IGV, "chr2:1000-2000". It will try to parse the character strings into
#' chromosome, start and end and create a GRanges. The parser can deal with
#' commas separating thousands (e.g. "chr2:1,000-2,000") and with the comma
#' used as a start/end separator (e.g. "chr2:1000,2000"). These different
#' variants can be mixed in the same character vector.
#'
#' If A is a "SimpleRleList" it will be interpreted as the result from
#' GenomicRanges::coverage and the function will return a GRanges with a
#' single metadata column named "coverage".
#'
#' If A is a file name (local or remote) or a connection to a file, it will try
#' to load it in different ways:
#' * BED files (identified by a "bed" extension): will be loaded using
#' \code{rtracklayer::import} function. Coordinates are 0 based as
#' described in the BED specification (https://genome.ucsc.edu/FAQ/FAQformat.html#format1).
#' * PLINK assoc files (identified by ".assoc", ".assoc.fisher",
#' ".assoc.dosage", ".assoc.linear", ".assoc.logistic"): will be loaded
#' as single-base ranges with all original columns present and the SNPs ids
#' as the ranges names
#' * Any other file: It assumes the file is a "generic" tabular file. To load
#' it it will ignore any header line starting with \code{comment.char},
#' autodetect the field separator (if not provided by the user),
#' autodetect if it has a header and read it accordingly.
#'
#' The \code{genome} parameter can be used to set the genome information of
#' the created GRanges. It can be either a \code{\link{BSgenome}} object or a
#' character string defining a genome (e.g. "hg19", "mm10"...) as accepted
#' by the \code{BSgenome::getBSgenome} function. If a valid genome is
#' given and the corresponding BSgenome package is installed, the genome
#' information will be attached to the GRanges. If the chromosome naming style
#' from the GRanges and the genome object are different, it will try to change
#' the GRanges styles to match those of the genome using
#' \code{GenomeInfoDb::seqlevelsStyle}.
#'
#' @note **IMPORTANT:** Regarding the coordinates, BED files are 0 based
#' while \code{data.frames} and generic files are treated as 1 based. Therefore
#' reading a line "chr9 100 200" from a BED file will create a 99 bases wide
#' interval starting at base 101 and ending at 200 but reading it from a txt
#' file or from a \code{data.frame} will create a 100 bases wide interval
#' starting at 100 and ending at 200. This is specially relevant in 1bp
#' intervals. For example, the 10th base of chromosome 1 would be
#' "chr1 9 10" in a BED file and "chr1 10 10" in a txt file.
#'
#' @usage toGRanges(A, ..., genome=NULL, sep=NULL, comment.char="#")
#'
#' @param A a \code{\link{data.frame}} containing a region set, a \code{\link{GRanges}} object, a BED file, any type of file supported by \code{rtracklayer::import} or a \code{"SimpleRleList"} returned by \code{GenomicRanges::coverage}. If there are more than 1 argument, it will build a dataframe out ouf them and process it as usual. If there's only a single argument and it's a character, if it's not an existing file name it will be treated as the definition of a genomic region in the UCSC/IGV format (i.e. "chr9:34229289-34982376") and parsed.
#' @param ... further arguments to be passed to other methods.
#' @param genome (character or BSgenome) The genome info to be attached to the created GRanges. If NULL no genome info will be attached. (defaults to NULL)
#' @param sep (character) The field separator in the text file. If NULL it will be automatically guessed. Only used when reading some file formats. (Defaults to NULL)
#' @param comment.char (character) The character marking comment lines. Only used when reading some file formats. (Defaults to "#")
#'
#' @return
#' A \code{\link{GRanges}} object with the regions in A
#'
#' @seealso \code{\link{toDataframe}}
#'
#' @examples
#' A <- data.frame(chr=1, start=c(1, 15, 24), end=c(10, 20, 30), x=c(1,2,3), y=c("a", "b", "c"))
#' gr1 <- toGRanges(A)
#'
#' #No need to give the data.frame columns any specific name
#' A <- data.frame(1, c(1, 15, 24), c(10, 20, 30), x=c(1,2,3), y=c("a", "b", "c"))
#' gr2 <- toGRanges(A)
#'
#' #We can pass the data without building the data.frame
#' gr3 <- toGRanges("chr9", 34229289, 34982376, x="X")
#'
#' #And each argument can be a vector (they will be recycled as needed)
#' gr4 <- toGRanges("chr9", c(34229289, 40000000), c(34982376, 50000000), x="X", y=c("a", "b"))
#'
#' #toGRanges will automatically convert the second and third argument into numerics
#' gr5 <- toGRanges("chr9", "34229289", "34982376")
#'
#' #It can be a file from disk
#' bed.file <- system.file("extdata", "my.special.genes.txt", package="regioneR")
#' gr6 <- toGRanges(bed.file)
#'
#' #Or a URL to a valid file
#' #gr7 <- toGRanges("http://path.to/myfile.bed")
#'
#' #It can also parse genomic location strings
#' gr8 <- toGRanges("chr9:34229289-34982376")
#'
#' #more than one
#' gr9 <- toGRanges(c("chr9:34229289-34982376", "chr10:1000-2000"))
#'
#' #even with strange and mixed syntaxes
#' gr10 <- toGRanges(c("chr4:3873-92928", "chr4:3873,92928", "chr5:33,444-45,555"))
#'
#' #if the genome is given it is used to annotate the resulting GRanges
#' gr11 <- toGRanges(c("chr9:34229289-34982376", "chr10:1000-2000"), genome="hg19")
#'
#'
#' #and the genome is added to the GRanges even if A is a GRanges
#' gr12 <- toGRanges(gr6, genome="hg19")
#'
#' #And it will change the chromosome naming of the GRanges to match that of the
#' #genome if it is possible (using GenomeInfoDb::seqlevelsStyle)
#' gr2
#' gr13 <- toGRanges(gr2, genome="hg19")
#'
#' #in addition, it can convert other objects into GRanges such as the
#' #result of GenomicRanges::coverage
#'
#' gr14 <- toGRanges(c("1:1-20", "1:5-25", "1:18-40"))
#' cover <- GenomicRanges::coverage(gr14)
#' gr15 <- toGRanges(cover)
#'
#'
#'
#' @export toGRanges
#'
#' @importFrom GenomicRanges GRanges elementMetadata GRangesList
#' @import BSgenome
#' @importFrom rtracklayer import
#' @importFrom tools file_ext
#' @importFrom IRanges tolower
#' @importFrom utils head
#' @import parallel
#' @import GenomeInfoDb
#'
#'
#'
toGRanges <- function(A, ..., genome=NULL, sep=NULL, comment.char="#") {
if(!hasArg(A)) stop("A is missing")
#If it's a GRanges, set the correct genome if needed and return
if(methods::is(A, "GRanges")) {
return(setGenomeToGRanges(A, genome))
}
#If it's a SimpleRleList, assume it's a coverage object as the one generated
#by GenomicRanges::coverage
if(methods::is(A, "SimpleRleList")) {
return(setGenomeToGRanges(coverageToGRanges(A), genome))
}
#if there are more than one parameters, try to build a data.frame from them
if(length(list(...))>0) {
tryCatch(A <- do.call(data.frame, c(list(A), list(...), list(stringsAsFactors=FALSE))),
error=function(e){}, warning=function(e){})
} else { #If it's single parameter
#If a character, it's not a file on disk and does not contain "://" (so it's not a URL), try to parse it as a genome coordinates string
if(all(is.character(A)) && !any(file.exists(A)) && !any(grepl(pattern = "://", A))) {
#If there are no "-" and only one "," may be in this format "chr:start,end". Replace the "," with a "-" and proceed.
A2 <- A
comma.sep <- !grepl("-", A2) & (nchar(A2) - nchar(gsub(",", "", A2))==1)
A2[comma.sep] <- gsub(",", "-", A2[comma.sep])
#Remove the "," used as thousand separators
A2 <- gsub(pattern = ",", replacement = "", x = A2)
#Split on : and -
A2 <- strsplit(A2, "[:-]+")
#And count them
if(!all(lapply(A2, length)==3)) {
failing <- A[which(lapply(A2, length)!=3)]
stop("Error when parsing the genomic region definition strings. There are ", length(failing), " malformed strings: ", paste0(paste0(head(failing, n=5), collapse = ","), ifelse(length(failing)>5, paste0(" and ", length(failing)-5, " more."), ".")))
} else {
A <- data.frame(do.call(rbind, A2))
A[,2] <- as.numeric(as.character(A[,2])) #we need the as.character because they are factors
A[,3] <- as.numeric(as.character(A[,3]))
}
}
}
#if it's a dataframe, assume the three first columns are: chr, start and end
if(methods::is(A, "data.frame")) {
if(length(A)==0) stop("In GRanges: A cannot be a data.frame with 0 columns.")
if(length(A)==1) { #If the data.frame has only one column, treat it as a character vector
return(toGRanges(A[,1]), genome=genome)
}
#If the data.frame has only two columns, repeat the second column to create "one-base-wide" regions
if(length(A)==2) {
A <- cbind(A, A[,2], stringsAsFactors=FALSE)
}
#Assoc files fom plink and the like: they have specific column names
if(ncol(A)>=3 && all(names(A)[1:3] %in% c("CHR", "BP", "SNP"))) {
if(ncol(A)>3) {
A <- A[, c("CHR", "BP", "BP", "SNP", names(A)[4:ncol(A)])]
} else {
A <- A[, c("CHR", "BP", "BP", "SNP")]
}
if(!any(duplicated(A[,4]))) {
row.names(A) <- A[,4]
}
}
chrs <- as.character(A[,1]) #Transform the first column into a character. It does not work with factors.
#and transform the second and third into numerics
if(is.factor(A[,2])) A[,2] <- as.character(A[,2])
if(is.factor(A[,3])) A[,3] <- as.character(A[,3])
start <- as.numeric(A[,2])
end <- as.numeric(A[,3])
gr <- GenomicRanges::GRanges(seqnames=chrs, ranges=IRanges::IRanges(start=start, end=end))
#assign the rownames of A to the GRanges too (if not duplicated!)
if(length(gr)>0 && !any(duplicated(row.names(A)))) {
tryCatch(names(gr) <- rownames(A), error=function(e){})
}
#We cannot assign the metadata in a single line because when only one metadata column was requested,
#it was automatically transformed into a vector and it lost its name.
if(ncol(A)>3) { #if there's metadata or strand information
#Store the metadata
metadata.columns <- c(4:ncol(A))
if(length(metadata.columns>0)) {
original.metadata <- as.data.frame(A[,metadata.columns])
names(original.metadata) <- names(A)[metadata.columns]
GenomicRanges::elementMetadata(gr) <- original.metadata
}
}
} else { #It's not a data.frame
#If we are here, assume it's a file
gr <- fileToGRanges(A, ..., sep=sep, genome=genome)
}
return(setGenomeToGRanges(gr, genome))
}
setGenomeToGRanges <- function(gr, genome) {
if(!is.null(genome) && is.character(genome)) genome <- characterToBSGenome(genome[1])
if(!is.null(genome)) {
if(methods::is(genome, "BSgenome")) {
#Seems weird, but at least the "BSgenome.Hsapiens.1000genomes.hs37d5" returns two styles. Use only the first one
GenomeInfoDb::seqlevelsStyle(gr) <- GenomeInfoDb::seqlevelsStyle(genome)[1]
GenomeInfoDb::seqlevels(gr, pruning.mode="coarse") <- GenomeInfoDb::seqlevels(genome)
GenomeInfoDb::seqinfo(gr) <- GenomeInfoDb::seqinfo(genome)
}
if(methods::is(genome, "SeqInfo")) {
GenomeInfoDb::seqlevelsStyle(gr) <- GenomeInfoDb::seqlevelsStyle(genome)[1]
GenomeInfoDb::seqlevels(gr, pruning.mode="coarse") <- GenomeInfoDb::seqlevels(genome)
GenomeInfoDb::seqinfo(gr) <- genome
}
}
return(gr)
}
#Transform the SimpleRleList generated by GenomicRanges::coverage into a
#Granges with the coverage as an mcol
coverageToGRanges <- function(coverage) {
ends <- cumsum(S4Vectors::runLength(coverage))
valid.chrs <- lapply(ends, length)>0 #remove the chromosomes with no data
ends <- ends[valid.chrs]
coverage.lvl <- S4Vectors::runValue(coverage)[valid.chrs]
starts <- lapply(ends, function(x) {return(c(1, (x[-length(x)]+1)))})
#convert into a GRanges
coverage.gr <- lapply(names(ends),
function(chr) {
return(toGRanges(
data.frame(chr, starts[[chr]], ends[[chr]],
coverage=coverage.lvl[[chr]])
)
)})
return(unlist(GenomicRanges::GRangesList(coverage.gr)))
}
#Read a file into a GRanges
fileToGRanges <- function(A, ..., genome=NULL, sep=NULL, comment.char="#") {
#BED
if(IRanges::tolower(tools::file_ext(A))=="bed") {
#If it's a bed file, try to load it using rtracklayer::import
gr <- tryCatch(
expr = {rtracklayer::import(con=A, format = "BED")},
error = function(e) {
#If it has failed, it might be because it's a malformed bed file with a
#header. Raise an informtative error
#read the first line of the file
ll <- readLines(A, n=1)
if(hasHeader(ll, sep="\t")) {
stop("Error in toGRanges when trying to read the BED file \"", A, "\". BED files do can NOT have headers.\n", e)
} else {
stop("Error in toGRanges when trying to read the BED file \"", A, "\". Is it a valid BED file?\n", e)
}
})
return(gr)
}
# #PLINK files
# NO special treatment is needed. They will be read as data.frame and the
# data.frame processing code will identify them.
#
# if(IRanges::tolower(tools::file_ext(A)) %in%
# c(".assoc", ".assoc.fisher", ".assoc.dosage",
# ".assoc.linear", ".assoc.logistic")) {
#
# }
#
#GFF
#TODO
#VCF
#TODO
#Other files with custom code?
#TODO
#If we are here, we have a generic text file.
num.skip <- firstNonCommentLine(A, comment.char = comment.char) - 1
#check if there are availabe lines in addition to any comment line
skip.plus.one <- readLines(A, n=num.skip+1)
skip.plus.one <- skip.plus.one[skip.plus.one!=""] #Remove any empty lines
if(length(skip.plus.one)==num.skip) return(GRanges())
#Detect header and separator (if needed)
ll <- readLines(A, n = num.skip + 5)
ll <- ll[(num.skip+1):length(ll)]
if(is.null(sep)) sep <- getSeparator(ll)
has.head <- hasHeader(ll, sep=sep)
#Read the file into a data.frame
file.cont <- tryCatch(
expr = {utils::read.table(file = A, header = has.head, sep = sep, skip = num.skip, comment.char = comment.char, ...)},
error = function(e) {stop("Error in toGRanges when trying to read the file \"", A, "\": ", e)}
)
#and convert the data.frame into a GRanges using toGRanges itself
return(tryCatch(
expr = {toGRanges(file.cont)},
error = function(e) {stop("Error in toGRanges when building a GRanges with the content of file \"", A, "\": ", e)}
))
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.