GenometriCorr evaluates the spatial correlation between two types of features (experimental or annotated) in genomic coordinates. Several different approaches are implemented, each intended to evaluate a different biologically relevant spatial relationship. The genomic features are split into intervals and correlations are evaluated as described in the package vignette.
Most standard genomic feature annotation file formats are accepted as input. Each feature type is given in a separate file, one used as a query and the other used as a reference. The statistical comparisons that are made are not symmetric; that is, they are sensitive to which file is the query and which is the reference, so we recommend running the comparisons twice, switching the status of the input files for the second run. This is intentional, as biological relationships can be asymmetric. Once the files are read, the annotations are stored as
GRanges objects from
There are two classes that are provided by the package. One of them,
GenometriCorrConfig-class provide serialization into configuration file that describe a full run of the package that is intended to explore spatial correlations between two markups. The class also provide run.config method that open the input data files, read them, make all the calculations by applying the central
GenometriCorrelation function to mapped or unmapped data (see below). The
GenometriCorrelation function returns an instance of
GenometriCorrResult-class that is based on a list with the results of the run and also contains the
GenometriCorrConfig-class instance that describes the run as a slot. The
show method and two graphical representations:
Below we describe the statistical comparisons implemented in this package. One set of features is termed the “query” and the other, the “reference”, throughout.
Query and reference intervals that are often separated by the same distance will have significant p-values; the magnitude of the p-value depends on the number of permutations performed. To determine whether a significant result indicates a small or large distance, we provide the
scaled.absolute.min.distance.sum.lower.tail, a boolean that is
TRUE when the actual absolute distances are smaller than expected and
FALSE if they are higher.
relative.distances.ks.p.value This p-value will be low when the query and reference intervals have similar spatial distributions. This test is hypersensitive and is most useful when the p-value is high.
relative.distances.ecdf.deviation.area.p.value If the query and reference reference intervals are closer or farther apart than expected, this p-value will be lo w.
relative.distances.ecdf.area.correlation This value will be negative when the query and reference intervals are anticorrelated and positive when they are positively correlated. This value has no relation to the p-value of correlation.
If overlaps between query characteristic points (by default, midpoints) and the reference features occur less often than expected, the
TRUE; if they are more common than expected, it is
FALSE. To measure the effect size, the observed to expected ratio is calculated as
jaccard.measure The Jaccard test compares the length of the union of all query and reference features with the length of the intersection of the query and reference features.
This is the permutation-based evaluation of the p-value for the Jaccard measure; the
TRUE if there are fewer overlaps than expected (less overlap) ond
When both query and reference features are restricted to genomic subsets, or when we want to compare their relationship only within smaller genomic intervals (e.g. genes), we can remap the intervals of interest onto pseudochromosomes and use the
MapRangesToGenomicIntervals provided by the package. Each mapping result is a
GRanges object. The pseudochromosomes can now be treated as usual by
GenometriCorrelation to test the correlations of interest.
The package also provides the
VisualiseTwoIRanges function that creates a very high-level graphic overview of a pair of annotations on a chromosome (space).
|Title:||Genometric Correlation package|
|biocViews:||Annotation, Genetics, Infrastructure, DataRepresentation, Bioinformatics, StatisticalMethod|
Alexander Favorov email@example.com, Loris Mularoni, Yulia Medvedeva, Harris A. Jaffee, Ekaterina V. Zhuravleva, Veronica Busa, Leslie M. Cope, Andrey A. Mironov, Vsevolod J. Makeev, Sarah J. Wheelan
See the GenometriCorr package vignette.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
library('rtracklayer') library('GenometriCorr') library('rtracklayer') library('TxDb.Hsapiens.UCSC.hg19.knownGene') refseq<-transcripts(TxDb.Hsapiens.UCSC.hg19.knownGene) cpgis<-import(system.file("extdata", "UCSCcpgis_hg19.bed", package = "GenometriCorr")) seqinfo(cpgis)<-seqinfo(TxDb.Hsapiens.UCSC.hg19.knownGene)[seqnames(seqinfo(cpgis))] permut.number<-0 #permut.number=0 means all the permutations are off #permut.number is the common default for ecdf.area.permut.number, mean.distance.permut.number, and jaccard.measure.permut.number #these three can be set separetely; explicit set overrloads the default cpgi_to_genes<-GenometriCorrelation(cpgis,refseq,chromosomes.to.proceed=c('chr1','chr2','chr3'),permut.number=permut.number,keep.distributions=FALSE,showProgressBar=FALSE) print(cpgi_to_genes) VisualiseTwoIRanges( ranges(cpgis[seqnames(cpgis)=='chr1']), ranges(refseq[seqnames(refseq)=='chr1']), nameA='CpG Islands',nameB='RefSeq Genes', chrom_length=seqlengths(TxDb.Hsapiens.UCSC.hg19.knownGene)['chr1'], title="CpGIslands and RefGenes on chr1 of Hg19 animal") #mapping example, same as in the vignette population<-1000 chromo.length<-c(3000000) names(chromo.length)<-c('the_chromosome') rquery<-GRanges(ranges=IRanges(start=runif(population,1000001,2000000-9),width=c(10)),seqnames='the_chromosome') rref<-GRanges(ranges=IRanges(start=runif(population,1000001,2000000-9),width=c(10)),seqnames='the_chromosome') #create two features, they are randomly scattered in 1 000 000...2 000 000 unmapped_result<-GenometriCorrelation(rquery,rref,chromosomes.length=chromo.length,permut.number=permut.number,keep.distributions=FALSE,showProgressBar=FALSE) #correlate them on the whole chromosome: 1...3 000 000 cat('Unmapped result:\n') print(unmapped_result) map_space<-GRanges(ranges=IRanges(start=c(1000001),end=c(2000000)),seqname='the_chromosome') mapped_rquery<-MapRangesToGenomicIntervals(what.to.map=rquery,where.to.map=map_space) mapped_rref<-MapRangesToGenomicIntervals(what.to.map=rref,where.to.map=map_space) #map them into 1 000 001...2 000 000 mapped_result<-GenometriCorrelation(mapped_rquery,mapped_rref,permut.number=permut.number,keep.distributions=FALSE,showProgressBar=FALSE) #then, correlate again cat('Mapped result:\n') print(mapped_result)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.