GScores-class | R Documentation |
The goal of the GenomicScores package is to provide support to store
and retrieve genomic scores associated with physical nucleotide positions along
a genome. This is achieved through the GScores class of objects, which
is a container for genomic score values.
The GScores class attempts to provide compact storage and efficient
retrieval of genomic score values that have typically been processed and
stored using some form of lossy compression. This class is currently based
on a former version of the SNPlocs class defined in the BSgenome
package, with the following slots:
provider (character), the data provider, such as UCSC.
provider_version (character), the version of the data as given by the data provider, typically a date in some compact format.
download_url (character), the URL of the data provider from where the original data were downloaded.
download_date (character), the date on which the data were downloaded.
reference_genome (GenomeDescription), object with information about the reference genome whose physical positions have the genomic scores.
data_pkgname (character), name given to the set of genomic scores associated with a particular genome. When the genomic scores are stored within an annotation package, this corresponds to the name of that package.
data_dirpath (character), absolute path to the local directory where the genomic scores are stored in one file per genome sequence.
data_serialized_objnames (character), named vector of filenames pointing to files containing the genomic scores in one file per genome sequence. The names of this vector correspond to the genome sequence names.
data_group (character), name denoting a category of genomic scores to which the scores stored in the object belong. Typical values are "Conservation", "MAF", "Pathogenicity", etc.
data_tag (character), name identifying the genomic scores stored in the object, which can be used, for instance, to name a column storing these scores.
data_pops (character), vector of character strings storing score population names. The term "default" is reserved to denote a score set that is not associated with a particular population name and is used by default.
data_nonsnrs (logical), flag indicating whether the object stores genomic scores associated with non-single nucleotide ranges.
data_nsites (integer), number of sites in the genome associated with the genomic scores stored in the object.
.data_cache (environment), data structure where objects storing genomic scores are cached in main memory.
The goal of the design behind the GScores class is to load into main
memory only the objects associated with the queried sequences, to minimize the
memory footprint. This may be advantageous in workflows that parallelize the
access to genomic scores by genome sequence.
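The per-sequence loading described above can be illustrated with a minimal base R sketch. It is not the package's implementation: the directory, the file names and the helper get_seq_scores() are hypothetical, and the only assumption taken from the slot descriptions is that scores live in one serialized file per genome sequence and are cached in an environment.

```r
## Toy setup: one serialized object per genome sequence in a temporary
## directory. All names here (toy_dir, objnames, get_seq_scores) are
## hypothetical and only mimic the data_dirpath, data_serialized_objnames
## and .data_cache slots.
toy_dir <- tempfile("toyscores")
dir.create(toy_dir)
objnames <- c(chr21="toy.chr21.rds", chr22="toy.chr22.rds")
for (sn in names(objnames))
  saveRDS(runif(10), file.path(toy_dir, objnames[[sn]]))

## An environment plays the role of the in-memory cache.
cache <- new.env(parent=emptyenv())

## Load the scores of one sequence on first access and reuse them afterwards.
get_seq_scores <- function(seqname) {
  if (!exists(seqname, envir=cache, inherits=FALSE))
    assign(seqname, readRDS(file.path(toy_dir, objnames[[seqname]])),
           envir=cache)
  get(seqname, envir=cache, inherits=FALSE)
}

x <- get_seq_scores("chr22")  ## read from disk
y <- get_seq_scores("chr22")  ## served from the cache
ls(cache)                     ## only chr22 has been loaded
```

In this sketch a query that touches only chr22 never reads the chr21 file, which is the memory-saving behavior the design aims for.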
GScores objects are created either from AnnotationHub resources
or when loading specific annotation packages that store genomic score values.
Two such annotation packages are:
phastCons100way.UCSC.hg19
Nucleotide-level phastCons conservation scores from the UCSC Genome Browser calculated from multiple genome alignments from the human genome version hg19 to 99 vertebrate species.
phastCons100way.UCSC.hg38
Nucleotide-level phastCons conservation scores from the UCSC Genome Browser calculated from multiple genome alignments from the human genome version hg38 to 99 vertebrate species.
GScores(provider, provider_version, download_url,
download_date, reference_genome, data_pkgname, data_dirpath,
data_serialized_objnames, default_pop, data_tag): creates a GScores
object. In principle, the end user does not need to call this function.
provider
character string, containing the data provider.
provider_version
character string, containing the version of the data as given by the data provider.
download_url
character string, containing the URL of the data provider from where the original data were downloaded.
reference_genome
GenomeDescription, storing the information about the associated reference genome.
data_pkgname
character string, name given to the set of genomic scores stored through this object.
data_dirpath
character string, absolute path to the local directory where the genomic scores are stored.
data_serialized_objnames
character string vector, containing filenames where the genomic scores are stored.
default_pop
character string, containing the name of the default scores population.
data_group
character string, containing a name that indicates a category of genomic scores to which the scores in the object belong. Typical names could be "Conservation", "MAF", etc.
data_tag
character string, containing a tag that succinctly labels genomic scores from a particular source. This can be used, for instance, to automatically name a column storing genomic scores in a data frame object. Its default value takes the prefix of the package name.
name(x): get the name of the set of genomic scores.
type(x): get the substring of the name of the set of genomic scores that runs from the first character up to the first period. This should typically match the type of genomic scores, such as phastCons, phyloP, etc.
provider(x): get the data provider.
providerVersion(x): get the provider version.
organism(x): get the organism associated with the genomic scores.
seqlevelsStyle(x): get the genome sequence style.
seqinfo(x): get the genome sequence information.
seqnames(x): get the genome sequence names.
seqlengths(x): get the genome sequence lengths.
populations(x): get the identifiers of the available score populations. If only one score population is available, then it shows only the term default.
defaultPopulation(x): get or set the default population of scores.
gscoresCategory(x): get or set the genomic scores category label.
gscoresTag(x): get or set the genomic scores tag label.
gscoresNonSNRs(x): get whether there are genomic scores associated with non-single nucleotide ranges.
nsites(x): get the number of sites in the genome with genomic scores.
qfun(x): get the quantizer function.
dqfun(x): get the dequantizer function.
citation(x): get citation information for the genomic scores data in the form of a bibentry object.
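To make the role of the quantizer and dequantizer functions more concrete, here is a minimal base R sketch of lossy compression of scores into raw bytes. The functions toy_qfun() and toy_dqfun() are hypothetical and are not the ones returned by qfun() and dqfun(); the one-decimal precision and the use of code 0 for missing values are assumptions made only for this illustration.

```r
## Hypothetical quantizer: map scores in [0, 1] to one of 11 raw codes
## (one decimal digit of precision), reserving code 0 for missing values.
toy_qfun <- function(x) {
  q <- as.integer(round(x * 10)) + 1L
  q[is.na(x)] <- 0L
  as.raw(q)
}

## Hypothetical dequantizer: map the raw codes back to approximate scores.
toy_dqfun <- function(q) {
  x <- (as.integer(q) - 1L) / 10
  x[q == as.raw(0)] <- NA_real_
  x
}

scores <- c(0.034, 0.518, 0.997, NA)
codes <- toy_qfun(scores)  ## one byte per score instead of one double
toy_dqfun(codes)           ## scores recovered up to one decimal digit
```

The round trip loses precision, which is the trade-off of lossy compression: each score occupies a single byte, and the dequantized values match the originals only up to the chosen number of digits.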
R. Castelo
Puigdevall, P. and Castelo, R. GenomicScores: seamless access to genomewide position-specific scores from R and Bioconductor. Bioinformatics, 34(18):3208-3210, 2018.
gscores()
score()
phastCons100way.UCSC.hg38
## GRanges() and IRanges() must be available before building the ranges
library(GenomicRanges)
## one genomic range of width 5
gr1 <- GRanges(seqnames="chr22", IRanges(start=50528591, width=5))
gr1
## five genomic ranges of width 1, covering the same positions as gr1
gr2 <- GRanges(seqnames="chr22", IRanges(start=50528591:50528595, width=1))
gr2
## supporting annotation packages with genomic scores
if (require(phastCons100way.UCSC.hg38)) {
  phast <- phastCons100way.UCSC.hg38
  phast
  ## scores for the whole range and for each position
  gscores(phast, gr1)
  score(phast, gr1)
  gscores(phast, gr2)
  ## available score populations and a query on a non-default one
  populations(phast)
  gscores(phast, gr2, pop="DP2")
}
## supporting AnnotationHub resources
## Not run:
availableGScores()
phast <- getGScores("phastCons100way.UCSC.hg38")
phast
gscores(phast, gr1)
## End(Not run)
## metadata from a GScores object
name(phast)
type(phast)
provider(phast)
providerVersion(phast)
organism(phast)
seqlevelsStyle(phast)
seqinfo(phast)
head(seqnames(phast))
head(seqlengths(phast))
gscoresTag(phast)
populations(phast)
defaultPopulation(phast)
qfun(phast)
dqfun(phast)
citation(phast)