knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
smarterapi is an R package that lets you browse the SMARTER backend API and collect SMARTER data. It implements functions for the different API endpoints. In this vignette, we assume you have already configured the package to use your SMARTER credentials.
Simply import smarterapi like any other R package:
library(dplyr)
library(smarterapi)

# required by this vignette
library(pander)
The SMARTER database is continuously updated, for example when new samples, genotypes or metadata are received, or when something is fixed or changed, such as when a new assembly is added. To ensure reproducibility and to track changes, genotypes and metadata are released in versions: you have to make sure that the genotype dataset and the metadata refer to the same version. You can check the database version using the get_smarter_info() function:
info <- get_smarter_info()
info$version
info$last_updated
The same version number appears in the name of the genotype files you can download from the SMARTER FTP site:
$ tree -L 2 SHEEP/
SHEEP/
└── OAR3
    ├── archive
    ├── SMARTER-OA-OAR3-top-0.4.4.md5
    └── SMARTER-OA-OAR3-top-0.4.4.zip
You have to ensure that you are working with the same version for both metadata and genotypes: if not, you may not find your samples in the genotype files, or you may be referring to an outdated assembly version. To see the list of changes in each version, please refer to the SMARTER-database HISTORY.rst file.
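A quick consistency check can catch a version mismatch before any analysis starts. The sketch below uses a hand-written info list in place of the get_smarter_info() result, and assumes that info$version holds the same dotted string that appears in the file name; both are assumptions made for illustration.

```r
# Hypothetical stand-in for the result of get_smarter_info()
info <- list(version = "0.4.4")

# Genotype archive downloaded from the FTP site
genotype_file <- "SMARTER-OA-OAR3-top-0.4.4.zip"

# Extract the dotted version number that precedes the .zip extension
file_version <- sub("^.*-([0-9]+(\\.[0-9]+)+)\\.zip$", "\\1", genotype_file)

# Stop early if metadata and genotypes come from different releases
if (!identical(file_version, info$version)) {
  stop("Genotype file version ", file_version,
       " does not match database version ", info$version)
}
```

In a real session, replace the mock list with the value returned by get_smarter_info().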
The smarterapi package is structured to handle requests to and responses from the SMARTER API backend using R packages like httr and jsonlite, and handles token authentication through helper functions. In general, each function accepts a query=list() parameter, in which you can pass additional parameters to the API endpoint in order to filter the results. Some functions, like get_smarter_samples() or get_smarter_variants(), require additional parameters such as the species or the assembly version. In general, each function returns its results in a data.frame: you can filter the data using dplyr or standard R methods, or you can filter directly by submitting a proper query to the API.
Please refer to each function's documentation to understand which parameters it expects. See also the SMARTER-backend API web interface to get an idea of which parameters are allowed for each endpoint.
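To make the two filtering options concrete, here is a minimal base R sketch. The samples data frame is a made-up stand-in for a get_smarter_samples() result (the sample IDs are invented), and the query list only shows what an equivalent server-side filter might look like; it is not sent anywhere.

```r
# Mock result table; the smarter_id values are invented for illustration
samples <- data.frame(
  smarter_id = c("sample-1", "sample-2", "sample-3"),
  breed_code = c("MER", "AME", "MER"),
  country = c("Italy", "Spain", "Spain"),
  stringsAsFactors = FALSE
)

# Option 1: filter locally, after the data has been downloaded
merino_italy <- subset(samples, breed_code == "MER" & country == "Italy")

# Option 2: express the same filter as a query list for the API
query <- list(breed_code = "MER", country = "Italy")
```

Filtering on the server side keeps the download small, while local filtering is handy for conditions the API does not support.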
Here are some examples of collecting samples. The main function for collecting samples is get_smarter_samples(), a helper function that queries the /samples/goat and /samples/sheep endpoints of the SMARTER-backend API. We will use some other functions to get a better idea of which values to use to filter the data. Remember that the parameters shown in the different examples can be combined to restrict your query to only the samples you are looking for.
Getting samples by dataset is quite easy: you have to provide the proper dataset_id to the get_smarter_samples() function. However, you first have to determine the proper dataset_id using the get_smarter_datasets() function, which models the /datasets endpoint. For example, to extract only background genotype datasets (data generated before the SMARTER project) you have to pass the proper type to the query argument. Since the query argument is a list, you can pass multiple parameters at once. For parameters that support arrays, you can supply the same parameter multiple times: each value will be passed to the API endpoint:
background_genotypes <- get_smarter_datasets(
  query = list(type = "background", type = "genotypes"))

# same as before, but limiting to goat species
background_goat_genotypes <- get_smarter_datasets(
  query = list(type = "background", type = "genotypes", species = "Goat"))
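To see why repeating a name in the query list is meaningful, the base R sketch below turns such a list into a URL query string by hand. It is only an illustration of the idea: the package delegates request building to httr, and the exact encoding it produces is assumed here, not verified.

```r
# Query list with a repeated `type` name, as in the example above
query <- list(type = "background", type = "genotypes", species = "Goat")

# Percent-encode each value, then join name=value pairs with "&"
encoded <- vapply(
  query,
  function(value) utils::URLencode(as.character(value), reserved = TRUE),
  character(1)
)
query_string <- paste(names(query), encoded, sep = "=", collapse = "&")
query_string
```

Each repeated name becomes its own name=value pair, which is how an endpoint can receive an array of values for a single parameter.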
Take some time to explore the columns of the returned data frame. There are two important fields: the first is the _id.$oid column, which is the dataset_id we want to provide to collect samples belonging to this dataset. The second is the file column, which is the name of the archive that was uploaded into the SMARTER database. For example, here is what the background_goat_genotypes table looks like:
pander::pander(background_goat_genotypes[, c("_id.$oid", "breed", "file")])
To collect the adaptmap samples, we can provide the proper dataset_id to the get_smarter_samples() function. We can add additional parameters, like country:
# extract the first dataset ID as a plain character value
adaptmap_id <- background_goat_genotypes[["_id.$oid"]][1]

adaptmap_goats <- get_smarter_samples(
  species = "Goat",
  query = list(dataset = adaptmap_id, country = "Italy"))
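When pulling an ID out of a data frame, it matters whether you index with single or double brackets. The sketch below uses a mock one-row table with an invented ID to show the difference; the column name follows the one used in this vignette.

```r
# Mock datasets table; the ID value is made up for illustration
datasets <- data.frame(file = "example.zip", stringsAsFactors = FALSE)
datasets[["_id.$oid"]] <- "0123456789abcdef"

as_frame <- datasets["_id.$oid"][1]     # still a one-column data.frame
as_value <- datasets[["_id.$oid"]][1]   # a length-one character vector

is.data.frame(as_frame)
is.character(as_value)
```

A plain character value is the safer choice when building a query list, since it does not depend on how a data frame happens to be converted to text.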
The adaptmap example is simple: there is only one dataset in the background_goat_genotypes data frame, so we can simply pass its ID to the get_smarter_samples() query. But how can we handle multiple datasets? We can convert the proper column into a list and then rename its elements:
# get more datasets
foreground_goat_genotypes <- get_smarter_datasets(
  query = list(type = "genotypes", type = "foreground", species = "Goat"))

# construct the query arguments
datasets <- as.list(foreground_goat_genotypes$"_id.$oid")
names(datasets) <- rep("dataset", length(datasets))
breeds <- list(breed_code = "LNR", breed_code = "SKO", breed_code = "FSS")
query <- append(datasets, breeds)

# select samples: subset by breed code and datasets
foreground_goat_samples <- get_smarter_samples(species = "Goat", query = query)
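If you build such lists often, the convert-and-rename step can be wrapped in a small function. The helper below, repeat_param(), is hypothetical and not part of smarterapi; the dataset IDs are placeholders.

```r
# Hypothetical helper: turn a vector of values into a list whose
# elements all share the same parameter name
repeat_param <- function(name, values) {
  stats::setNames(as.list(values), rep(name, length(values)))
}

# Combine several repeated parameters into a single query list
query <- c(
  repeat_param("dataset", c("id-1", "id-2")),
  repeat_param("breed_code", c("LNR", "SKO", "FSS"))
)
names(query)
```

The resulting list can be passed as the query argument in the same way as the hand-built one.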
The last selection example relies on dataset file contents: if you remember the name of a file submitted in a dataset, you can search datasets by content:
datasets <- get_smarter_datasets(query = list(search = "adaptmap"))
pander::pander(datasets[, c("_id.$oid", "breed", "file")])
This time two results are returned: one is a phenotypes dataset, while the other is a genotypes dataset. To select only genotypes, simply add type = "genotypes" to the query parameter.
You can select samples by breed name or breed code. Breed names are written in the language they come from, so in order to retrieve Île de France or Fjällnäs breed samples, you have to specify the full breed name or use the search parameter with the get_smarter_breeds() function, which models the /breeds endpoint:
breeds <- get_smarter_breeds(query = list( species = "Sheep", search = "de france"))
pander::pander(subset(breeds, select = c("name", "code")))
A breed search can return multiple results, for example:
breeds <- get_smarter_breeds(query = list( species = "Sheep", search = "merino"))
pander::pander(subset(breeds, select = c("name", "code")))
Names and codes can be used as they are to select samples, by passing multiple arguments to the query:
selected_samples <- get_smarter_samples(species = "Sheep", query = list( breed_code = "MER", breed_code = "AME" ))
or to get all the samples with merino in breed name:
# construct the query arguments
query <- as.list(breeds$code)
names(query) <- rep("breed_code", length(query))

# execute query
merino_samples <- get_smarter_samples(species = "Sheep", query = query)
You can retrieve samples by country. First get a list of the available countries by searching on the country name, then extract samples using the correct country name:
italy <- get_smarter_countries(query = list(search = "italy"))

italian_background_sheeps <- get_smarter_samples(
  species = "Sheep",
  query = list(country = italy$name[1]))
You can select samples by the chip they were genotyped with. If you search for multiple chip types, you will collect all the samples that belong to any of the specified chips. First, collect a list of the available chips for a certain species:
sheep_chips <- get_smarter_supportedchips(query = list(species = "Sheep"))
pander::pander(subset(sheep_chips, select = -c(`_id.$oid`)))
Then collect samples relying on chip name, for example:
selected_samples <- get_smarter_samples( species = "Sheep", query = list( chip_name="IlluminaOvineHDSNP", chip_name="AffymetrixAxiomOviCan" ) )
Since metadata are not formatted in the same way for every sample, it is difficult to define a single query that applies to all samples. For the moment, the only queries you can apply to metadata are restricted to their presence or absence. For example, we can collect all samples which have both GPS coordinates and phenotypes (of any kind):
smarter_goats <- get_smarter_samples( species = "Goat", query = list( locations__exists=TRUE, phenotype__exists=TRUE ) )
After that, you have to filter the smarter_goats data frame in order to keep only the samples you want.
After you have identified the samples of interest, you can extract their genotypes from the proper file using plink. First, you have to write a TSV file with breed_code and smarter_id as columns. For example, using the samples selected above and dplyr:
selected_sheeps_ids <- italian_background_sheeps %>%
  dplyr::select("breed_code", "smarter_id")

write.table(
  selected_sheeps_ids,
  file = "selected_sheeps.txt",
  quote = FALSE,
  sep = "\t",
  row.names = FALSE,
  col.names = FALSE)
Next, you need to collect the proper plink options in order not to lose information from the plink file. The parameters used to generate the genotype files are tracked in the info endpoint. In this example, we get the parameters for the Sheep genotypes:
info <- get_smarter_info()
plink_opts <- paste0(info$plink_specie_opt$Sheep, collapse = " ")
plink_opts
Finally, you can call plink, providing your sample list:
plink --chr-set 26 no-xy no-mt --allow-no-sex \
  --bfile SMARTER-OA-OAR3-top-0.4.4 \
  --keep selected_sheeps.txt \
  --out selected_sheeps-OAR3-top-0.4.4 \
  --make-bed
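If you prefer to stay in R, the same command can be assembled from the options string. This is a sketch only: the plink_opts value below is copied from the command shown above instead of being read from get_smarter_info(), and the actual call is left commented out because it requires plink on your PATH.

```r
# Assumed value of plink_opts, copied from the command shown above
plink_opts <- "--chr-set 26 no-xy no-mt --allow-no-sex"
prefix <- "SMARTER-OA-OAR3-top-0.4.4"

plink_args <- c(
  plink_opts,
  "--bfile", prefix,
  "--keep", "selected_sheeps.txt",
  "--out", "selected_sheeps-OAR3-top-0.4.4",
  "--make-bed"
)

# Print the full command for inspection before running it
plink_cmd <- paste("plink", paste(plink_args, collapse = " "))
cat(plink_cmd, "\n")

# To run it, uncomment the next line (requires plink on the PATH):
# system2("plink", args = plink_args)
```

Building the argument vector from plink_opts keeps the species-specific options in sync with whatever the info endpoint reports.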