knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

smarterapi is an R package that lets you browse the SMARTER backend API and collect SMARTER data. The package implements functions to deal with the different API endpoints. In this vignette, we assume you have already configured the package to use your SMARTER credentials.

Getting started

Simply import smarterapi like any other R package:

library(dplyr)
library(smarterapi)
# required by this vignette
library(pander)

Collect info about versions

The SMARTER database is continuously updated, for example when new samples, genotypes or metadata are received, or when something is fixed or changed, such as when a new assembly is added. To ensure reproducibility and to track changes, genotypes and metadata are released in versions: this means you have to make sure that the genotype dataset and the metadata refer to the same version. You can check the database version using the get_smarter_info() function:

info <- get_smarter_info()
info$version
info$last_updated

The same version number appears in the name of the genotype files you can download from the SMARTER FTP site:

$ tree -L 2 SHEEP/
SHEEP/
└── OAR3
    ├── archive
    ├── SMARTER-OA-OAR3-top-0.4.4.md5
    └── SMARTER-OA-OAR3-top-0.4.4.zip

You have to make sure that you are working with the same version for both metadata and genotypes: if not, you may not find samples in the genotype files, or you may be referring to an outdated assembly version. To see the list of changes in each version, please refer to the SMARTER-database HISTORY.rst file.
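A minimal sketch of such a consistency check, assuming the version reported by the API is a plain string such as "0.4.4" and that the archive name always ends with the version before the file extension. The helper version_from_archive() is a hypothetical name, not a function exported by smarterapi:

```r
# Hypothetical helper (not part of smarterapi): extract the version
# from an archive name such as "SMARTER-OA-OAR3-top-0.4.4.zip"
version_from_archive <- function(file_name) {
  # drop the directory and the file extension,
  # then keep what follows the last dash
  base <- sub("\\.(zip|md5)$", "", basename(file_name))
  sub("^.*-", "", base)
}

archive <- "SMARTER-OA-OAR3-top-0.4.4.zip"
archive_version <- version_from_archive(archive)

# In a real session this would be info$version from get_smarter_info();
# a literal is used here so the sketch runs offline
api_version <- "0.4.4"

if (!identical(archive_version, api_version)) {
  warning("Genotype archive and metadata versions differ")
}
```

Running a check like this before any analysis may help catch a mismatch early, rather than discovering missing samples later.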

Querying SMARTER API

The smarterapi package is structured to handle requests and responses to the SMARTER API backend using R packages like httr and jsonlite, and it handles token authentication through helper functions. In general, each function accepts a query = list() parameter, in which you can pass additional parameters to the API endpoint in order to filter the results. Some functions, like get_smarter_samples() or get_smarter_variants(), require additional parameters such as the species or the assembly version. In general, each function returns its results as a data.frame: you can filter the data using dplyr or standard R methods, or you can filter directly by submitting a proper query to the API. Please refer to the documentation of each function to understand which parameters it expects. See also the SMARTER-backend API web interface to get an idea of which parameters are allowed for each endpoint.

Collect samples

Here are some examples of collecting samples. The main function required to collect samples is get_smarter_samples(), a helper function which queries the /samples/goat and /samples/sheep endpoints of the SMARTER-backend API. We will use some other functions to get a better idea of which values to use to filter data. Remember that all the parameters you see in the different examples can be combined with each other, to restrict your query to only the samples you are looking for.

Select by datasets

Getting samples by dataset is quite easy: you have to provide the proper dataset_id to the get_smarter_samples() function. However, you first have to determine the proper dataset_id using the get_smarter_datasets() function, which models the /datasets endpoint. For example, to extract only background genotypes datasets (data generated before the SMARTER project) you have to pass the proper type to the query argument. Since the query argument is a list, you can pass multiple parameters at once. For parameters which support arrays, you can supply the same parameter multiple times: each value will be passed to the API endpoint:

background_genotypes <- get_smarter_datasets(
  query = list(type = "background", type = "genotypes"))

# same as before, but limiting to goat species
background_goat_genotypes <- get_smarter_datasets(
  query = list(type = "background", type = "genotypes", 
             species = "Goat"))
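To make the repeated-name convention concrete, the sketch below shows how a named list with duplicated names could be serialized into a URL query string. This is only an illustration in base R: the helper to_query_string() is hypothetical, and the real encoding is handled internally by the package and its HTTP layer.

```r
# Hypothetical helper: turn a named list into a URL query string,
# keeping repeated names as repeated parameters
to_query_string <- function(query) {
  keys <- names(query)
  values <- vapply(
    query,
    function(v) utils::URLencode(as.character(v), reserved = TRUE),
    character(1)
  )
  paste(paste0(keys, "=", values), collapse = "&")
}

query <- list(type = "background", type = "genotypes", species = "Goat")
to_query_string(query)
```

Note that R lists, unlike many dictionary types, allow duplicated names, which is what makes this style of query possible.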

Take some time to explore the dataframe columns. There are two important fields: the first is the _id.$oid column, which is the dataset_id we need to provide to collect the samples belonging to this dataset. The second is the file column, which is the name of the archive that was uploaded into the SMARTER database. For example, here is what the background_goat_genotypes table looks like:

pander::pander(background_goat_genotypes[, c("_id.$oid", "breed", "file")])

To collect the adaptmap samples, we can provide the proper dataset_id to the get_smarter_samples() function. We can add additional parameters, like country:

# use [[ ]] to extract the column as a vector, then take the first id
adaptmap_id <- background_goat_genotypes[["_id.$oid"]][1]
adaptmap_goats <- get_smarter_samples(
  species = "Goat", query = list(dataset = adaptmap_id, country = "Italy"))
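Selecting the identifier by position works when only one row is returned. A slightly more defensive sketch picks the row by matching the file column instead. The data frame below is a made-up stand-in for the API response (identifiers and file names are placeholders), so the example runs offline:

```r
# Mock of a datasets table; values are placeholders, not real records
datasets <- data.frame(
  "_id.$oid" = c("aaa111", "bbb222"),
  file = c("first_archive.zip", "adaptmap_example.zip"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# `[[` returns a plain character vector, while `[` returns a data.frame
ids <- datasets[["_id.$oid"]]

# keep the id whose file name contains "adaptmap" (case-insensitive)
is_adaptmap <- grepl("adaptmap", datasets$file, ignore.case = TRUE)
dataset_id <- ids[is_adaptmap]

# fail early if the match is not unique
stopifnot(length(dataset_id) == 1)
dataset_id
```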

The previous case is quite easy: there was only one dataset in the background_goat_genotypes dataframe, so we can simply pass this value to the get_smarter_samples() query. But how can we handle multiple datasets? We can transform the proper column into a list and then rename its elements:

# get more datasets
foreground_goat_genotypes <- get_smarter_datasets(
  query = list(type = "genotypes", type = "foreground", species = "Goat"))

# construct the query arguments
datasets <- as.list(foreground_goat_genotypes$"_id.$oid")
names(datasets) <- rep("dataset", length(datasets))
breeds <- list(breed_code = "LNR", breed_code = "SKO", breed_code = "FSS")
query <- append(datasets, breeds)

# select samples: subset by breed code and datasets
foreground_goat_samples <- get_smarter_samples(species = "Goat", query = query)
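The rename-and-append pattern above can be wrapped in a small helper. This is a sketch only: repeat_param() is a hypothetical name, not a function exported by smarterapi, and the dataset ids are placeholders.

```r
# Hypothetical helper: build list(name = v1, name = v2, ...) from a vector
repeat_param <- function(name, values) {
  stats::setNames(as.list(values), rep(name, length(values)))
}

# placeholder dataset ids; in practice take them from the datasets table
dataset_ids <- c("aaa111", "bbb222")

# c() concatenates the two lists and keeps the repeated names
query <- c(
  repeat_param("dataset", dataset_ids),
  repeat_param("breed_code", c("LNR", "SKO", "FSS"))
)

names(query)
```

The resulting list has the same shape as the query built by hand, so it could be passed as the query argument of get_smarter_samples().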

The last selection example relies on the dataset file contents: if you remember the name of the file submitted with the dataset, you can search datasets by content:

datasets <- get_smarter_datasets(query = list(search = "adaptmap"))
pander::pander(datasets[, c("_id.$oid", "breed", "file")])

This time two results are returned: one is a phenotypes dataset, while the other is a genotypes dataset. To select only genotypes, simply add type = "genotypes" to the query parameter.

Select by breed

You can select samples by breed name or breed code. Breed names are written in their original language, so to retrieve Île de France or Fjällnäs samples you have to specify the full breed name, or use the search parameter of get_smarter_breeds(), which models the /breeds endpoint:

breeds <- get_smarter_breeds(query = list(
  species = "Sheep", search = "de france"))
pander::pander(subset(breeds, select = c("name", "code")))

A breed search can return multiple results, for example:

breeds <- get_smarter_breeds(query = list(
  species = "Sheep", search = "merino"))
pander::pander(subset(breeds, select = c("name", "code")))

Names and codes can be used as they are to select samples, by passing multiple arguments to the query:

selected_samples <- get_smarter_samples(species = "Sheep", query = list(
  breed_code = "MER", breed_code = "AME"
))

Or, to get all the samples with merino in the breed name:

# construct the query arguments
query <- as.list(breeds$code)
names(query) <- rep("breed_code", length(query))

# execute query
merino_samples <- get_smarter_samples(species = "Sheep", query = query)

Select by country

You can retrieve samples by country. First search the available countries by name, then extract the samples using the correct country name:

italy <- get_smarter_countries(query = list(search = "italy"))

italian_background_sheeps <- get_smarter_samples(
  species = "Sheep",
  query = list(
    country = italy$name[1]
  )
)

Select by chip

You can select samples by the chip they were genotyped with. If you search for multiple chip types, you will collect all samples that belong to any of the specified chips. First, collect a list of the available chips for a certain species:

sheep_chips <- get_smarter_supportedchips(query = list(species = "Sheep"))
pander::pander(subset(sheep_chips, select = -c(`_id.$oid`)))

Then collect samples relying on chip name, for example:

selected_samples <- get_smarter_samples(
  species = "Sheep",
  query = list(
    chip_name = "IlluminaOvineHDSNP",
    chip_name = "AffymetrixAxiomOviCan"
  )
)

Select by metadata

Since metadata are not formatted in the same way for every sample, it is difficult to define a single query that applies to all samples. For the moment, the only queries you can apply to metadata are restricted to their presence or absence. For example, we can collect all samples which have both GPS coordinates and phenotypes (of any kind):

smarter_goats <- get_smarter_samples(
  species = "Goat",
  query = list(
    locations__exists = TRUE,
    phenotype__exists = TRUE
  )
)

After that, you have to filter the smarter_goats dataframe in order to keep only the samples you want.
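As an illustration of this local filtering step, here is a minimal base R sketch on a mock table. It assumes the returned data frame exposes breed_code and country columns; the sample ids and values are placeholders, not real records.

```r
# Mock of a samples table; ids and values are placeholders
smarter_goats <- data.frame(
  smarter_id = c("ID-0001", "ID-0002", "ID-0003", "ID-0004"),
  breed_code = c("LNR", "SKO", "LNR", "FSS"),
  country = c("Italy", "Sweden", "France", "Italy"),
  stringsAsFactors = FALSE
)

# keep only Italian samples from the breeds of interest
wanted_breeds <- c("LNR", "FSS")
selected <- smarter_goats[
  smarter_goats$country == "Italy" &
    smarter_goats$breed_code %in% wanted_breeds,
]

# dplyr equivalent:
# dplyr::filter(smarter_goats, country == "Italy",
#               breed_code %in% wanted_breeds)
selected$smarter_id
```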

Subset genotypes relying on samples

After you have identified the samples of interest, you can extract their genotypes from the proper file using plink. First, you have to write a tab-separated file with breed_code and smarter_id as columns. For example, using the samples selected above and dplyr:

selected_sheeps_ids <- italian_background_sheeps %>% dplyr::select(
  "breed_code", "smarter_id")

write.table(
  selected_sheeps_ids, 
  file = "selected_sheeps.txt", 
  quote = FALSE, 
  sep = "\t", 
  row.names = FALSE, 
  col.names = FALSE)
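Since plink reads the --keep file as two columns (family ID and within-family ID) without a header, it can help to read the file back and check its shape before running plink. A minimal sketch with placeholder sample ids and a temporary file:

```r
# Placeholder samples; real ids come from get_smarter_samples()
selected_ids <- data.frame(
  breed_code = c("TEX", "TEX", "MER"),
  smarter_id = c("ID-0001", "ID-0002", "ID-0003"),
  stringsAsFactors = FALSE
)

keep_file <- tempfile(fileext = ".txt")
write.table(
  selected_ids,
  file = keep_file,
  quote = FALSE,
  sep = "\t",
  row.names = FALSE,
  col.names = FALSE
)

# read the file back: expect two columns, one row per sample, no duplicates
check <- read.table(keep_file, sep = "\t", header = FALSE,
                    stringsAsFactors = FALSE)
stopifnot(
  ncol(check) == 2,
  nrow(check) == nrow(selected_ids),
  !any(duplicated(check))
)
```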

Next, you need to collect the proper plink options in order not to lose information from the plink file. The parameters used to generate the genotype files are tracked in the info endpoint. In this example, we get the parameters for the Sheep genotypes:

info <- get_smarter_info()
plink_opts <- paste0(info$plink_specie_opt$Sheep, collapse = " ")
plink_opts

Finally, you can call plink, providing your sample list:

plink --chr-set 26 no-xy no-mt --allow-no-sex \
  --bfile SMARTER-OA-OAR3-top-0.4.4 \
  --keep selected_sheeps.txt \
  --out selected_sheeps-OAR3-top-0.4.4 \
  --make-bed
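The same command can be assembled from R, so that the species options and the version are not typed by hand. This is a sketch under stated assumptions: plink_opts is hard-coded here to the options shown above so the example runs offline, whereas in a real session it would come from get_smarter_info(); the file prefixes follow the naming seen in this vignette.

```r
# In a real session: paste0(info$plink_specie_opt$Sheep, collapse = " ")
plink_opts <- "--chr-set 26 no-xy no-mt --allow-no-sex"

version <- "0.4.4"
prefix <- paste0("SMARTER-OA-OAR3-top-", version)

plink_args <- c(
  plink_opts,
  "--bfile", prefix,
  "--keep", "selected_sheeps.txt",
  "--out", paste0("selected_sheeps-OAR3-top-", version),
  "--make-bed"
)

plink_cmd <- paste("plink", paste(plink_args, collapse = " "))
plink_cmd

# To run it, uncomment the following line (requires plink in the PATH):
# system(plink_cmd)
```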


cnr-ibba/r-smarter-api documentation built on Nov. 1, 2022, 4:24 a.m.