GScores-class: GScores objects

Description

The goal of the GenomicScores package is to provide support to store and retrieve genomic scores associated with physical nucleotide positions along a genome. This is achieved through the GScores class of objects, which is a container for genomic score values.

Details

The GScores class attempts to provide compact storage and efficient retrieval of genomic score values, which have typically been processed and stored using some form of lossy compression. This class is currently based on a former version of the SNPlocs class defined in the BSgenome package, and has the following slots:

provider: (character), the data provider such as UCSC.
provider_version: (character), the version of the data as given by the data provider, typically a date in some compact format.
download_url: (character), the URL of the data provider from where the original data were downloaded.
download_date: (character), the date on which the data were downloaded.
reference_genome: (GenomeDescription), object with information about the reference genome whose physical positions have the genomic scores.
data_pkgname: (character), name given to the set of genomic scores associated with a particular genome. When the genomic scores are stored within an annotation package, this corresponds to the name of that package.
data_dirpath: (character), absolute path to the local directory where the genomic scores are stored in one file per genome sequence.
data_serialized_objnames: (character), named vector of filenames pointing to files containing the genomic scores in one file per genome sequence. The names of this vector correspond to the genome sequence names.
data_group: (character), name denoting the category of genomic scores to which the scores stored in the object belong. Typical values are "Conservation", "MAF", "Pathogenicity", etc.
data_tag: (character), name identifying the genomic scores stored in the object and which can be used, for instance, to assign a column name storing these scores.
data_pops: (character), vector of character strings storing score population names. The term "default" is reserved to denote a score set that is not associated with a particular population name and is used by default.
data_nonsnrs: (logical), flag indicating whether the object stores genomic scores associated with non-single nucleotide ranges.
data_nsites: (integer), number of sites in the genome associated with the genomic scores stored in the object.
.data_cache: (environment), data structure where objects storing genomic scores are cached into main memory.

The goal of the design behind the GScores class is to load into main memory only the objects associated with the queried sequences to minimize the memory footprint, which may be advantageous in workflows that parallelize the access to genomic scores by genome sequence.
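
The idea of loading only the queried sequences can be sketched in base R as follows. This is a hypothetical illustration, not the GenomicScores implementation: the names make_score_store, fake_loader and n_loads are invented for the example, and the loader stands in for reading one serialized file per genome sequence.

```r
## Hypothetical sketch (not the package's code): objects for a sequence are
## loaded on first access and then kept in an environment acting as a cache.
make_score_store <- function(loader) {
  cache <- new.env(parent = emptyenv())
  list(
    fetch = function(seqname) {
      if (!exists(seqname, envir = cache, inherits = FALSE)) {
        ## first access to this sequence: load it and remember it
        assign(seqname, loader(seqname), envir = cache)
      }
      get(seqname, envir = cache, inherits = FALSE)
    },
    cached = function() ls(cache)
  )
}

## Stand-in loader that counts how often it is called; a real loader would
## read the file holding the scores of the requested sequence.
n_loads <- 0L
fake_loader <- function(seqname) {
  n_loads <<- n_loads + 1L
  rep(0.5, 10)
}

store <- make_score_store(fake_loader)
invisible(store$fetch("chr22"))
invisible(store$fetch("chr22"))  # served from the cache, no second load
store$cached()                   # only the queried sequence is in memory
n_loads
```

Under this scheme a worker that only queries chr22 never pays the memory cost of the other sequences, which is the property the paragraph above refers to.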

GScores objects are created either from AnnotationHub resources or when loading specific annotation packages that store genomic score values. Two such annotation packages are:

phastCons100way.UCSC.hg19: Nucleotide-level phastCons conservation scores from the UCSC Genome Browser calculated from multiple genome alignments from the human genome version hg19 to 99 vertebrate species.
phastCons100way.UCSC.hg38: Nucleotide-level phastCons conservation scores from the UCSC Genome Browser calculated from multiple genome alignments from the human genome version hg38 to 99 vertebrate species.

Constructor

GScores(provider, provider_version, download_url, download_date, reference_genome, data_pkgname, data_dirpath, data_serialized_objnames, default_pop, data_group, data_tag):

Creates a GScores object. In principle, end users do not need to call this function directly.

provider: character string, containing the data provider.
provider_version: character string, containing the version of the data as given by the data provider.
download_url: character string, containing the URL of the data provider from where the original data were downloaded.
download_date: character string, containing the date on which the data were downloaded.
reference_genome: GenomeDescription, storing the information about the associated reference genome.
data_pkgname: character string, name given to the set of genomic scores stored through this object.
data_dirpath: character string, absolute path to the local directory where the genomic scores are stored.
data_serialized_objnames: character string vector, containing filenames where the genomic scores are stored.
default_pop: character string, containing the name of the default scores population.
data_group: character string, containing a name that indicates the category of genomic scores to which the scores in the object belong. Typical names could be "Conservation", "MAF", etc.
data_tag: character string, containing a tag that succinctly labels genomic scores from a particular source. This can be used, for instance, to automatically name a column storing genomic scores in a data frame object. Its default value takes the prefix of the package name.
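
As a rough illustration of what "the prefix of the package name" could mean for the default data_tag, the sketch below keeps the text before the first period. Reading "prefix" this way is an assumption, and pkg_prefix is a hypothetical helper, not a function of the package.

```r
## Hypothetical helper (an assumption, not the package's code):
## drop everything from the first period onwards.
pkg_prefix <- function(pkgname) sub("\\..*$", "", pkgname)

pkg_prefix("phastCons100way.UCSC.hg38")
```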

Accessors

name(x): get the name of the set of genomic scores.
type(x): get the substring of the name of the set of genomic scores that runs from the first character up to the first period. This should typically match the type of genomic scores, such as phastCons, phyloP, etc.
provider(x): get the data provider.
providerVersion(x): get the provider version.
organism(x): get the organism associated with the genomic scores.
seqlevelsStyle(x): get the genome sequence style.
seqinfo(x): get the genome sequence information.
seqnames(x): get the genome sequence names.
seqlengths(x): get the genome sequence lengths.
populations(x): get the identifiers of the available scores populations. If only one scores population is available, then it shows only the term default.
defaultPopulation(x): get or set the default population of scores.
gscoresCategory(x): get or set the genomic scores category label.
gscoresTag(x): get or set the genomic scores tag label.
gscoresNonSNRs(x): get whether there are genomic scores associated with non-single nucleotide ranges.
nsites(x): get the number of sites in the genome with genomic scores.
qfun(x): get the quantizer function.
dqfun(x): get the dequantizer function.
citation(x): get citation information for the genomic scores data in the form of a bibentry object.
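
To give an intuition for what a quantizer and dequantizer pair does under lossy compression, here is a minimal base R sketch. It is not the function returned by qfun() or dqfun() for any real GScores object: the names q and dq, the rounding to one decimal place, and the use of code 0 for missing values are all assumptions made for the illustration.

```r
## Hypothetical quantizer: scores in [0, 1] are rounded to one decimal place
## and stored as raw bytes with codes 1..11; code 0 is reserved for NA.
q <- function(x) {
  codes <- as.integer(round(x * 10)) + 1L
  codes[is.na(x)] <- 0L
  as.raw(codes)
}

## Hypothetical dequantizer: map the byte codes back to numeric scores.
dq <- function(r) {
  codes <- as.integer(r)
  codes[codes == 0L] <- NA_integer_
  (codes - 1L) / 10
}

scores <- c(0.03, 0.52, 0.987, NA)
stored <- q(scores)   # one byte per position
dq(stored)            # recovered values are rounded, hence "lossy"
```

The round trip does not return the original values exactly, which is the trade-off that allows one byte per position instead of an eight-byte double.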

Author(s)

R. Castelo

References

Puigdevall, P. and Castelo, R. GenomicScores: seamless access to genomewide position-specific scores from R and Bioconductor. Bioinformatics, 34(18):3208-3210, 2018.

Examples

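## load the packages that provide GRanges() and gscores()
library(GenomicScores)
library(GenomicRanges)

## one genomic range of width 5
gr1 <- GRanges(seqnames="chr22", IRanges(start=50528591, width=5))
gr1

## five genomic ranges of width 1, covering the same positions as gr1
gr2 <- GRanges(seqnames="chr22", IRanges(start=50528591:50528595, width=1))
gr2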

## supporting annotation packages with genomic scores
if (require(phastCons100way.UCSC.hg38)) {
  library(GenomicRanges)

  phast <- phastCons100way.UCSC.hg38
  phast
  gscores(phast, gr1)
  score(phast, gr1)
  gscores(phast, gr2)
  populations(phast)
  gscores(phast, gr2, pop="DP2")
}

## supporting AnnotationHub resources
## Not run: 
availableGScores()
phast <- getGScores("phastCons100way.UCSC.hg38")
phast
gscores(phast, gr1)

## End(Not run)

## metadata from a GScores object
name(phast)
type(phast)
provider(phast)
providerVersion(phast)
organism(phast)
seqlevelsStyle(phast)
seqinfo(phast)
head(seqnames(phast))
head(seqlengths(phast))
gscoresTag(phast)
populations(phast)
defaultPopulation(phast)
qfun(phast)
dqfun(phast)
citation(phast)

rcastelo/GenomicScores documentation built on July 5, 2025, 5:37 a.m.