toGRanges: toGRanges

View source: R/toGRanges.R

toGRangesR Documentation

toGRanges

Description

Transforms a file or an object containing a region set into a GRanges object.

Usage

toGRanges(A, ..., genome=NULL, sep=NULL, comment.char="#")

Arguments

A

a data.frame containing a region set, a GRanges object, a BED file, any type of file supported by rtracklayer::import or a "SimpleRleList" returned by GenomicRanges::coverage. If there are more than 1 argument, it will build a dataframe out ouf them and process it as usual. If there's only a single argument and it's a character, if it's not an existing file name it will be treated as the definition of a genomic region in the UCSC/IGV format (i.e. "chr9:34229289-34982376") and parsed.

...

further arguments to be passed to other methods.

genome

(character or BSgenome) The genome info to be attached to the created GRanges. If NULL no genome info will be attached. (defaults to NULL)

sep

(character) The field separator in the text file. If NULL it will be automatically guessed. Only used when reading some file formats. (Defaults to NULL)

comment.char

(character) The character marking comment lines. Only used when reading some file formats. (Defaults to "#")

Details

If A is already a GRanges object, it will be returned untouched.

If A is a data frame, the function will assume the first three columns are chromosome, start and end and create a GRanges object. Any additional column will be considered metadata and stored as such in the GRanges object. There are 2 special cases: 1) if A is a data.frame with only 2 columns, it will assume the first one is the chromosome and the second one the position and it will create a GRanges with single base regions and 2) if the data.frame has the first 3 columns named "SNP", "CHR" and "BP" it will shuffle the columns and repeat "BP" to build a GRanges of single base regions (this is the standard ouput format of plink).

If A is not a data.frame and there are more parameters, it will try to build a data.frame with all parameters and use that data.frame to build the GRanges. This allows the user to call it like toGRanges("chr1", 10, 20).

If A is a character or a character vector and it's not a file or a URL, it assumes it's a genomic position description in the form used by UCSC or IGV, "chr2:1000-2000". It will try to parse the character strings into chromosome, start and end and create a GRanges. The parser can deal with commas separating thousands (e.g. "chr2:1,000-2,000") and with the comma used as a start/end separator (e.g. "chr2:1000,2000"). These different variants can be mixed in the same character vector.

If A is a "SimpleRleList" it will be interpreted as the result from GenomicRanges::coverage and the function will return a GRanges with a single metadata column named "coverage".

If A is a file name (local or remote) or a connection to a file, it will try to load it in different ways: * BED files (identified by a "bed" extension): will be loaded using rtracklayer::import function. Coordinates are 0 based as described in the BED specification (https://genome.ucsc.edu/FAQ/FAQformat.html#format1). * PLINK assoc files (identified by ".assoc", ".assoc.fisher", ".assoc.dosage", ".assoc.linear", ".assoc.logistic"): will be loaded as single-base ranges with all original columns present and the SNPs ids as the ranges names * Any other file: It assumes the file is a "generic" tabular file. To load it it will ignore any header line starting with comment.char, autodetect the field separator (if not provided by the user), autodetect if it has a header and read it accordingly.

The genome parameter can be used to set the genome information of the created GRanges. It can be either a BSgenome object or a character string defining a genome (e.g. "hg19", "mm10"...) as accepted by the BSgenome::getBSgenome function. If a valid genome is given and the corresponding BSgenome package is installed, the genome information will be attached to the GRanges. If the chromosome naming style from the GRanges and the genome object are different, it will try to change the GRanges styles to match those of the genome using GenomeInfoDb::seqlevelsStyle.

Value

A GRanges object with the regions in A

Note

**IMPORTANT:** Regarding the coordinates, BED files are 0 based while data.frames and generic files are treated as 1 based. Therefore reading a line "chr9 100 200" from a BED file will create a 99 bases wide interval starting at base 101 and ending at 200 but reading it from a txt file or from a data.frame will create a 100 bases wide interval starting at 100 and ending at 200. This is specially relevant in 1bp intervals. For example, the 10th base of chromosome 1 would be "chr1 9 10" in a BED file and "chr1 10 10" in a txt file.

See Also

toDataframe

Examples

A <- data.frame(chr=1, start=c(1, 15, 24), end=c(10, 20, 30),  x=c(1,2,3), y=c("a", "b", "c"))
gr1 <- toGRanges(A)

#No need to give the data.frame columns any specific name
A <- data.frame(1, c(1, 15, 24), c(10, 20, 30),  x=c(1,2,3), y=c("a", "b", "c"))
gr2 <- toGRanges(A)

#We can pass the data without building the data.frame
gr3 <- toGRanges("chr9", 34229289, 34982376, x="X")

#And each argument can be a vector (they will be recycled as needed)
gr4 <- toGRanges("chr9", c(34229289, 40000000), c(34982376, 50000000), x="X", y=c("a", "b"))

#toGRanges will automatically convert the second and third argument into numerics
gr5 <- toGRanges("chr9", "34229289", "34982376") 

#It can be a file from disk
bed.file <- system.file("extdata", "my.special.genes.txt", package="regioneR")
gr6 <- toGRanges(bed.file)

#Or a URL to a valid file
#gr7 <- toGRanges("http://path.to/myfile.bed")

#It can also parse genomic location strings
gr8 <- toGRanges("chr9:34229289-34982376")

#more than one
gr9 <- toGRanges(c("chr9:34229289-34982376", "chr10:1000-2000"))

#even with strange and mixed syntaxes
gr10 <- toGRanges(c("chr4:3873-92928", "chr4:3873,92928", "chr5:33,444-45,555"))

#if the genome is given it is used to annotate the resulting GRanges
gr11 <- toGRanges(c("chr9:34229289-34982376", "chr10:1000-2000"), genome="hg19")


#and the genome is added to the GRanges even if A is a GRanges
gr12 <- toGRanges(gr6, genome="hg19")

#And it will change the chromosome naming of the GRanges to match that of the
#genome if it is possible (using GenomeInfoDb::seqlevelsStyle)
gr2
gr13 <- toGRanges(gr2, genome="hg19")

#in addition, it can convert other objects into GRanges such as the 
#result of GenomicRanges::coverage

gr14 <- toGRanges(c("1:1-20", "1:5-25", "1:18-40"))
cover <- GenomicRanges::coverage(gr14)
gr15 <- toGRanges(cover)




bernatgel/regioneR documentation built on Sept. 10, 2023, 12:03 a.m.