SNPlocs-class: SNPlocs objects
In BSgenome: Software infrastructure for efficient representation of full genomes and their SNPs

The SNPlocs class is a container for storing known SNP locations (of class snp) for a given organism.

SNPlocs objects are usually made in advance by a volunteer and distributed to the Bioconductor community as SNPlocs data packages. See ?available.SNPs for how to get the list of SNPlocs and XtraSNPlocs data packages currently available.

The main focus of this man page is on how to extract SNPs from an SNPlocs object.

snpcount(x)

snpsBySeqname(x, seqnames, ...)
## S4 method for signature 'SNPlocs'
snpsBySeqname(x, seqnames, drop.rs.prefix=FALSE, genome=NULL)

snpsByOverlaps(x, ranges, ...)
## S4 method for signature 'SNPlocs'
snpsByOverlaps(x, ranges, drop.rs.prefix=FALSE, ..., genome=NULL)

snpsById(x, ids, ...)
## S4 method for signature 'SNPlocs'
snpsById(x, ids, ifnotfound=c("error", "warning", "drop"), genome=NULL)

inferRefAndAltAlleles(gpos, genome)

`x`	A SNPlocs object.
`seqnames`	The names of the sequences for which to get SNPs. Must be a subset of `seqlevels(x)`. NAs and duplicates are not allowed.
`...`	Additional arguments, for use in specific methods. Arguments passed to the `snpsByOverlaps` method for SNPlocs objects through `...` are used internally in the call to `subsetByOverlaps()`. See `?IRanges::subsetByOverlaps` in the IRanges package and `?GenomicRanges::subsetByOverlaps` in the GenomicRanges package for more information about the `subsetByOverlaps()` generic and its method for GenomicRanges objects.
`drop.rs.prefix`	Should the `rs` prefix be dropped from the returned RefSNP ids? (RefSNP ids are stored in the `RefSNP_id` metadata column of the returned object.)
`genome`	For `snpsBySeqname`, `snpsByOverlaps`, and `snpsById`: `NULL` (the default), or a BSgenome object containing the sequences of the reference genome that corresponds to the SNP positions. See `inferRefAndAltAlleles` below for an alternative way to specify `genome`. If `genome` is supplied, then `inferRefAndAltAlleles` is called internally by `snpsBySeqname`, `snpsByOverlaps`, or `snpsById` to infer the reference allele (a.k.a. ref allele) and alternate allele(s) (a.k.a. alt allele(s)) for each SNP in the returned GPos object. The inferred ref allele and alt allele(s) are returned in additional metadata columns `ref_allele` (character) and `alt_alleles` (CharacterList). For `inferRefAndAltAlleles`: A BSgenome object containing the sequences of the reference genome that corresponds to the SNP positions in `gpos`. Alternatively `genome` can be a single string containing the name of the reference genome, in which case it must be specified in a way that is accepted by the `getBSgenome` function (e.g. `"GRCh38"`) and the corresponding BSgenome data package needs to be already installed (see `?getBSgenome` for the details).
`ranges`	One or more genomic regions of interest specified as a GRanges or GPos object. A single region of interest can be specified as a character string of the form `"ch14:5201-5300"`.
`ids`	The RefSNP ids to look up (a.k.a. rs ids). Can be integer or character vector, with or without the `"rs"` prefix. NAs are not allowed.
`ifnotfound`	What to do if SNP ids are not found.
`gpos`	A GPos object containing SNPs. It must have a metadata column `alleles_as_ambig`, such as the one obtained when using any of the SNP extractors `snpsBySeqname`, `snpsByOverlaps`, or `snpsById` on a SNPlocs object.
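
The `ids` argument accepts RefSNP ids as an integer or character vector, with or without the `"rs"` prefix. As a rough illustration of what that means for callers, the base R sketch below brings a mixed vector of ids to a single prefixed form. The helper `normalize_rsids` is hypothetical and is not part of BSgenome; it only illustrates the accepted input forms and does not claim to mirror the package's internal handling.

```r
## Hypothetical helper (not part of BSgenome): bring RefSNP ids to the
## "rs"-prefixed character form, whatever form they were supplied in.
normalize_rsids <- function(ids) {
    ## Mirror the documented constraint that NAs are not allowed.
    if (anyNA(ids))
        stop("NAs are not allowed in 'ids'")
    ids <- as.character(ids)
    ## Add the prefix only where it is missing.
    ifelse(startsWith(ids, "rs"), ids, paste0("rs", ids))
}

mixed_ids   <- normalize_rsids(c("rs10458597", "12565286"))
integer_ids <- normalize_rsids(7553394L)
```

Here `mixed_ids` and `integer_ids` hold only prefixed character ids, regardless of how each id was originally written.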

When the reference genome is specified via the genome argument, SNP extractors snpsBySeqname, snpsByOverlaps, and snpsById call inferRefAndAltAlleles internally to infer the reference allele (a.k.a. ref allele) and alternate allele(s) (a.k.a. alt allele(s)) for each SNP.

For each SNP the ref allele is inferred from the actual nucleotide found in the reference genome at the SNP position. The alt alleles are inferred from metadata column alleles_as_ambig and the ref allele. More precisely for each SNP the alt alleles are considered to be the alleles in alleles_as_ambig minus the ref allele.
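
The rule above can be sketched in base R without any Bioconductor package. This is only an illustration under stated assumptions: the lookup table `iupac` is a hand-written copy of the standard IUPAC ambiguity codes (the full map is `IUPAC_CODE_MAP` in Biostrings), the helper `infer_alt` is hypothetical, and treating a SNP as genome-compatible when its ref allele is among the alleles encoded by `alleles_as_ambig` is an assumed reading of `genome_compat`. It is not the actual implementation of `inferRefAndAltAlleles`.

```r
## Hand-written IUPAC ambiguity map (assumption: mirrors the standard codes).
iupac <- c(A="A", C="C", G="G", T="T",
           M="AC", R="AG", W="AT", S="CG", Y="CT", K="GT",
           V="ACG", H="ACT", D="AGT", B="CGT", N="ACGT")

## Hypothetical helper: for each SNP, expand the ambiguity code into its
## nucleotides, then remove the ref allele to obtain the alt allele(s).
infer_alt <- function(alleles_as_ambig, ref_allele) {
    alleles <- strsplit(unname(iupac[alleles_as_ambig]), "", fixed=TRUE)
    ## Assumed meaning of "compatible": the ref allele is one of the alleles.
    genome_compat <- mapply(function(a, r) r %in% a, alleles, ref_allele)
    ## Alt alleles = alleles in the ambiguity code minus the ref allele.
    alt_alleles <- mapply(function(a, r) setdiff(a, r), alleles, ref_allele,
                          SIMPLIFY=FALSE)
    list(genome_compat=unname(genome_compat),
         alt_alleles=unname(alt_alleles))
}

## Three toy SNPs: "R" (A/G) with ref A, "Y" (C/T) with ref G,
## and "B" (C/G/T) with ref C.
res <- infer_alt(c("R", "Y", "B"), c("A", "G", "C"))
res
```

In this toy input the second SNP has a ref allele that is not among its reported alleles, so the set difference leaves all of its alleles as alternate alleles.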

snpcount returns a named integer vector containing the number of SNPs for each sequence in the reference genome.

snpsBySeqname, snpsByOverlaps, and snpsById return an unstranded GPos object with one element (genomic position) per SNP and the following metadata columns:

RefSNP_id: RefSNP ID (aka "rs id"). Character vector with no NAs and no duplicates.
alleles_as_ambig: A character vector with no NAs containing the alleles for each SNP represented by an IUPAC nucleotide ambiguity code. See ?IUPAC_CODE_MAP in the Biostrings package for more information.

If the reference genome was specified (via the genome argument), the following additional metadata columns are returned:

genome_compat: A logical vector indicating whether the alleles in alleles_as_ambig are consistent with the reference genome.
ref_allele: A character vector containing the inferred reference allele for each SNP.
alt_alleles: A CharacterList object where each list element is a character vector containing the inferred alternate allele(s) for the corresponding SNP.

Note that this GPos object is unstranded i.e. all the SNPs in it have their strand set to "*". Alleles are always reported with respect to the positive strand.

If ifnotfound="error", the object returned by snpsById is guaranteed to be parallel to ids, that is, the i-th element in the GPos object corresponds to the i-th element in ids.

inferRefAndAltAlleles returns a DataFrame with one row per SNP in gpos and with columns genome_compat (logical), ref_allele (character), and alt_alleles (CharacterList).

H. Pagès

available.SNPs
GPos and GRanges objects in the GenomicRanges package.
XtraSNPlocs packages and objects for molecular variations of a class other than snp, e.g. of class in-del, heterozygous, microsatellite, etc.
IRanges::subsetByOverlaps in the IRanges package and GenomicRanges::subsetByOverlaps in the GenomicRanges package for more information about the subsetByOverlaps() generic and its method for GenomicRanges objects.
injectSNPs
IUPAC_CODE_MAP in the Biostrings package.

library(SNPlocs.Hsapiens.dbSNP144.GRCh38)
snps <- SNPlocs.Hsapiens.dbSNP144.GRCh38
snpcount(snps)

## ---------------------------------------------------------------------
## snpsBySeqname()
## ---------------------------------------------------------------------

## Get all SNPs located on chromosome 22 or MT:
snpsBySeqname(snps, c("22", "MT"))

## ---------------------------------------------------------------------
## snpsByOverlaps()
## ---------------------------------------------------------------------

## Get all SNPs overlapping some genomic region of interest:
snpsByOverlaps(snps, "X:3e6-33e6")

## With the regions of interest being all the known CDS for hg38
## located on chromosome 22 or MT (except for the chromosome naming
## convention, hg38 is the same as GRCh38):
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
my_cds <- cds(txdb)
seqlevels(my_cds, pruning.mode="coarse") <- c("chr22", "chrM")
seqlevelsStyle(my_cds)  # UCSC
seqlevelsStyle(snps)    # NCBI
seqlevelsStyle(my_cds) <- seqlevelsStyle(snps)
genome(my_cds) <- genome(snps)
my_snps <- snpsByOverlaps(snps, my_cds)
my_snps
table(my_snps %within% my_cds)

## ---------------------------------------------------------------------
## snpsById()
## ---------------------------------------------------------------------

## Look up some RefSNP ids:
my_rsids <- c("rs10458597", "rs12565286", "rs7553394")
## Not run: 
  snpsById(snps, my_rsids)  # error, rs7553394 not found

## End(Not run)
## The following example uses more than 2GB of memory, which is more
## than what 32-bit Windows can handle:
is_32bit_windows <- .Platform$OS.type == "windows" &&
                    .Platform$r_arch == "i386"
if (!is_32bit_windows) {
    snpsById(snps, my_rsids, ifnotfound="drop")
}

## ---------------------------------------------------------------------
## Obtaining the ref allele and alt allele(s)
## ---------------------------------------------------------------------

## When the reference genome is specified (via the 'genome' argument),
## SNP extractors snpsBySeqname(), snpsByOverlaps(), and snpsById()
## call inferRefAndAltAlleles() internally to **infer** the ref allele
## and alt allele(s) for each SNP.
my_snps <- snpsByOverlaps(snps, "X:3e6-8e6", genome="GRCh38")
my_snps

## Most SNPs have only 1 alternate allele:
table(lengths(mcols(my_snps)$alt_alleles))

## SNPs with 2 alternate alleles:
my_snps[lengths(mcols(my_snps)$alt_alleles) == 2]

## SNPs with 3 alternate alleles:
my_snps[lengths(mcols(my_snps)$alt_alleles) == 3]

## Note that a small percentage of SNPs in dbSNP have alleles that
## are inconsistent with the reference genome (don't ask me why):
table(mcols(my_snps)$genome_compat)

## For the inconsistent SNPs, all the alleles reported by dbSNP
## are considered alternate alleles i.e. for each inconsistent SNP
## metadata columns "alleles_as_ambig" and "alt_alleles" represent
## the same set of nucleotides (the latter being just an expanded
## representation of the IUPAC ambiguity letter in the former):
my_snps[!mcols(my_snps)$genome_compat]