knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

This vignette describes how to use the variant endpoints, which store information about the SNPs used in the SMARTER genotype datasets.

library(smarterapi)
# required by this vignette
library(pander)

Assembly versions

One of the aims of this project is to manage genotypes across different assembly versions. This means collecting data from different assemblies (depending on when the data were generated), from different sources (Affymetrix, Illumina, WGS) and in different file formats. Genotypes are normalized so that they are consistent across data sources, and they are stored in one genotype file for each species.

Currently four assembly versions are managed: two for the sheep dataset and two for the goat dataset. Information about the assemblies and their data sources can be retrieved from the backend info endpoint through the get_smarter_info() function:

info <- get_smarter_info()
assemblies <- as.data.frame(t(as.data.frame(info$working_assemblies)))
names(assemblies) <- c("name", "source")
pander::pander(assemblies)
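
The reshaping step above can be tried offline on a mock object. The sketch below assumes that working_assemblies is a named list holding one two-element character vector (name, source) per assembly. Both that structure and the placeholder values are assumptions made for illustration, not the real API response.

```r
# Hypothetical stand-in for info$working_assemblies: the real values come
# from get_smarter_info(); these placeholders only mimic the assumed shape.
working_assemblies <- list(
  ASSEMBLY_A = c("ASSEMBLY_A", "source_a"),
  ASSEMBLY_B = c("ASSEMBLY_B", "source_b")
)

# as.data.frame() puts one assembly per column, t() flips the result so that
# each assembly becomes a row, and the outer as.data.frame() turns the
# transposed matrix back into a data frame.
assemblies <- as.data.frame(t(as.data.frame(working_assemblies)))
names(assemblies) <- c("name", "source")

print(assemblies)
```

If the real object has a different shape (for example nested lists instead of character vectors), the transpose step may need to be adapted.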

Collect data from an assembly

get_smarter_variants() has two mandatory parameters, species and assembly, and it accepts additional parameters (see the documentation of the variant endpoints for more information). For example, you can search for variants by SNP name or by rs ID (if the latter exists):

snp <- get_smarter_variants(
  species = "Goat", 
  assembly = "ARS1",
  query = list(
    name = "snp12965-scaffold1499-3295573"
  )
)
pander::pander(subset(snp, select = -c(`_id.$oid`, `sequence.IlluminaGoatSNP50`)))

Please refer to the working_assemblies element returned by get_smarter_info() for the assemblies supported by the SMARTER-database. Data that come from SNPchiMp v.3, like the Sheep OAR3 assembly, support the illumina forward attribute. For example, the following SNP:

snp <- get_smarter_variants(
  species = "Sheep", 
  assembly = "OAR3",
  query = list(
    rs_id = "rs10721092"
  )
)
pander::pander(subset(snp, select = -c(`_id.$oid`, `sequence.IlluminaOvineHDSNP`)))

is T/C on the forward strand of OAR3: this means that the reversed probe is aligned to the genome (as you can infer from the bottom illumina strand attribute of this SNP). Variants in the SMARTER-database are converted using the Illumina TOP coding convention, so you will find this SNP as A/G in the SMARTER-database, while on the reference sequence it is T/C.
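
The relationship between the forward alleles and the TOP-coded alleles of this SNP can be illustrated in base R. This is a minimal sketch that only complements each base, which is enough for a SNP whose probe aligns to the reverse strand as described above. It is not the conversion procedure used by the SMARTER-database, and complement_alleles() is a hypothetical helper introduced here for illustration.

```r
# Hypothetical helper: complement every base in an allele string such as "T/C".
# chartr() swaps A<->T and C<->G and leaves the "/" separator untouched.
complement_alleles <- function(alleles) {
  chartr("ACGT", "TGCA", alleles)
}

# Alleles as read on the forward strand of the reference
forward_alleles <- "T/C"

# Alleles as read on the opposite strand
top_alleles <- complement_alleles(forward_alleles)
print(top_alleles)
```

For this SNP the complement of T/C is A/G, which matches the coding reported in the SMARTER-database.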

Fetch Variants by region

The variants endpoint supports queries by region, using the <chromosome>:<start>-<end> format. For example:

variants <- get_smarter_variants(
  species = "Goat",
  assembly = "ARS1",
  query = list(region = "1:1-100000")
)
pander::pander(subset(variants, select = -c(`_id.$oid`, `sequence.IlluminaGoatSNP50`)))
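
When regions are built programmatically, it may help to assemble the <chromosome>:<start>-<end> string with a small function. The sketch below uses a hypothetical make_region() helper that is not part of smarterapi.

```r
# Hypothetical helper: build a "<chromosome>:<start>-<end>" region string.
make_region <- function(chromosome, start, end) {
  # Refuse inverted intervals before sending them to the API
  stopifnot(end >= start)
  # "%d" prints whole numbers in full, with no scientific notation
  sprintf("%s:%d-%d", chromosome, as.integer(start), as.integer(end))
}

region <- make_region("1", 1, 100000)
print(region)
```

The result can then be passed as query = list(region = region). sprintf() with "%d" is used instead of paste0() because pasting a plain numeric such as 100000 can render it in scientific notation.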

Fetch Variants by chip name

You can download all the variants for a given chip. Please consider that this requires a lot of time and memory, since more than 600K SNPs are stored in the SMARTER-database. First, collect the available chips from the SMARTER-database, for example for the Sheep species:

sheep_chips <- get_smarter_supportedchips(query = list(species = "Sheep"))
pander::pander(subset(sheep_chips, select = -c(`_id.$oid`)))

Then collect all the SNPs for a given chip by providing the SMARTER chip name. Please consider that you will download more than 50K SNPs for this chip, and that this will take a lot of time:

variants <- get_smarter_variants(
  species = "Sheep",
  assembly = "OAR3",
  query = list(
    chip_name = "IlluminaOvineSNP50"
  )
)
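
Since a full chip download is slow, one option is to store the result on disk and reuse it in later sessions. The sketch below is one possible approach rather than a feature of smarterapi: cached_fetch() is a hypothetical helper, and a mock function stands in for the real API call so that the example runs offline.

```r
# Hypothetical helper: return the cached object if `path` exists, otherwise
# call `fetch()`, save its result to `path` and return it.
cached_fetch <- function(path, fetch) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- fetch()
  saveRDS(result, path)
  result
}

# Mock download that counts how many times it is called. With the real
# package, `fetch` would instead wrap the get_smarter_variants() call above.
calls <- 0
mock_fetch <- function() {
  calls <<- calls + 1
  data.frame(name = c("snp_a", "snp_b"), stringsAsFactors = FALSE)
}

cache_file <- tempfile(fileext = ".rds")
first <- cached_fetch(cache_file, mock_fetch)   # runs the download
second <- cached_fetch(cache_file, mock_fetch)  # read back from disk
print(calls)
```

The cache is never refreshed automatically, so the file has to be deleted to pick up a newer database release.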


cnr-ibba/r-smarter-api documentation built on Nov. 1, 2022, 4:24 a.m.