knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
smarterapi is an R package that lets you browse the SMARTER backend API and collect SMARTER data. It implements functions for the different API endpoints.
Simply import smarterapi like any other R package:
library(dplyr)
library(smarterapi)

# required by this vignette
library(pander)
The SMARTER database is continuously updated, for example when new samples, genotypes, or metadata are received, or when something is fixed or changed, such as when a new assembly is added. To ensure reproducibility and keep track of changes, genotypes and metadata are released in versions: you have to make sure that the genotype dataset and the metadata refer to the same version. You can check the database version using the get_smarter_info() function:
info <- get_smarter_info()
info$version
info$last_updated
The same version number appears in the names of the genotype files you can download from the SMARTER FTP site:
$ tree -L 2 SHEEP/
SHEEP/
└── OAR3
    ├── archive
    ├── SMARTER-OA-OAR3-top-0.4.10.md5
    └── SMARTER-OA-OAR3-top-0.4.10.zip
You have to make sure that you are working with the same version for both metadata and genotypes: if not, you may not find your samples in the genotype files, or you may be referring to an outdated assembly version. To see the list of changes in each version, please refer to the SMARTER-database HISTORY.rst file.
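As a quick sanity check, the version embedded in an archive name can be compared with the version reported by the API. The sketch below is only an illustration: it uses a hard-coded string as a stand-in for info$version, and it assumes that archive names always end in -<version>.zip, as in the listing above.

```r
# stand-in for info$version as returned by get_smarter_info() (assumption)
api_version <- "0.4.10"

# archive name taken from the FTP listing above
archive <- "SMARTER-OA-OAR3-top-0.4.10.zip"

# extract the trailing "x.y.z" that precedes the .zip extension
archive_version <- sub(
  "^.*-([0-9]+\\.[0-9]+\\.[0-9]+)\\.zip$", "\\1", archive
)

versions_match <- identical(archive_version, api_version)

if (!versions_match) {
  warning(
    "genotypes are version ", archive_version,
    " but metadata are version ", api_version
  )
}
```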
The smarterapi package is structured to handle requests/responses to the SMARTER API backend using R packages like httr and jsonlite.
In general, each function accepts a query = list() parameter, in which you can pass additional parameters to the API endpoint in order to filter the results. Some functions, like get_smarter_samples() or get_smarter_variants(), require additional parameters such as the species or the assembly version. Each function generally returns its results in a data.frame: you can filter the data afterwards using dplyr or standard R methods, or you can filter it directly by submitting a proper query to the API. Please refer to the documentation of each function to understand which parameters it expects. See also the SMARTER-backend API web interface to get an idea of which parameters are allowed for each endpoint.
Here are some examples of collecting samples. The main function for collecting samples is get_smarter_samples(), a helper function that queries the /samples/goat and /samples/sheep endpoints of the SMARTER-backend API. We will use some other functions to get a better idea of which values to use when filtering data. Remember that the parameters shown in the different examples can be combined to restrict your query to only the samples you are looking for.
Getting samples by dataset is quite easy: you have to provide the proper dataset_id to the get_smarter_samples() function. However, you first have to determine the proper dataset_id using the get_smarter_datasets() function, which models the /datasets endpoint. For example, to extract only background genotype datasets (data generated before the SMARTER project), you have to pass the proper type to the query argument. Since the query argument is a list, you can pass multiple parameters at once. For parameters that support arrays, you can supply the same parameter name multiple times: each value will be passed to the API endpoint:
# select all genotypes datasets made before SMARTER project
background_genotypes <- get_smarter_datasets(
  query = list(type = "background", type = "genotypes")
)

# same as before, but limiting to goat species
background_goat_genotypes <- get_smarter_datasets(
  query = list(
    type = "background",
    type = "genotypes",
    species = "Goat"
  )
)
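To see what a list with repeated names amounts to, the following base R sketch turns such a list into a URL query string. This is not necessarily how smarterapi builds its requests internally; it is only meant to show that every repeated name becomes its own key=value pair.

```r
# a query list with a repeated parameter name, as in the example above
query <- list(type = "background", type = "genotypes", species = "Goat")

# percent-encode each value, then pair it with its name
encoded_values <- vapply(
  query,
  function(value) utils::URLencode(as.character(value), reserved = TRUE),
  character(1)
)
pairs <- paste0(names(query), "=", encoded_values)

# join the pairs as they would appear after the "?" in a URL
query_string <- paste(pairs, collapse = "&")
query_string
```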
Take some time to explore the dataframe columns. There are two important fields. The first is the _id.$oid column, which is the dataset_id we want to provide to collect the samples belonging to this dataset. The second is the file column, which is the name of the archive that was uploaded into the SMARTER database. For example, here is what the background_goat_genotypes table looks like:
pander::pander(background_goat_genotypes[, c("_id.$oid", "breed", "file")])
To collect the adaptmap samples, we can provide the proper dataset_id to the get_smarter_samples() function. We can add additional parameters, like country:
# select the adaptmap id, which is in the first row of the dataframe
adaptmap_id <- background_goat_genotypes["_id.$oid"][1, 1]

adaptmap_goats <- get_smarter_samples(
  species = "Goat",
  query = list(dataset = adaptmap_id, country = "Italy")
)
The previous case is quite easy: we want only one dataset from the background_goat_genotypes dataframe, so we can simply pass its value in the get_smarter_samples() query. But how can we handle multiple datasets? We can transform the proper column into a list and then rename its elements:
# get more datasets
foreground_goat_genotypes <- get_smarter_datasets(
  query = list(type = "genotypes", type = "foreground", species = "Goat")
)

# construct the query arguments
datasets <- as.list(foreground_goat_genotypes$"_id.$oid")
names(datasets) <- rep("dataset", length(datasets))
breeds <- list(breed_code = "LNR", breed_code = "SKO", breed_code = "FSS")
query <- append(datasets, breeds)

# select samples: subset by breed code and datasets
foreground_goat_samples <- get_smarter_samples(species = "Goat", query = query)
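Since the same "values to repeated names" step comes up whenever a query needs several values for one parameter, it can be wrapped in a small helper. The function repeat_param() below is hypothetical (it is not part of smarterapi), and the dataset ids are made-up placeholders used only to keep the sketch self-contained.

```r
# hypothetical helper: turn a vector of values into a list in which
# every element carries the same parameter name
repeat_param <- function(values, name) {
  stats::setNames(as.list(values), rep(name, length(values)))
}

# placeholder ids standing in for foreground_goat_genotypes$"_id.$oid"
dataset_ids <- c("dataset-id-1", "dataset-id-2")

# combine several repeated parameters in a single query list
query <- c(
  repeat_param(dataset_ids, "dataset"),
  repeat_param(c("LNR", "SKO", "FSS"), "breed_code")
)

names(query)
```

The resulting list has the same shape as the one built by hand above, so it could be passed as the query argument of get_smarter_samples().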
We can also use the get_smarter_datasets() function to do a reverse selection of our samples, for example to get all the samples which are not in one or more datasets. Suppose, for example, that we want to collect all the samples which are not in the isheep datasets:
# collect all foreground datasets having "isheep" in their name
isheep_datasets <- get_smarter_datasets(
  query = list(
    type = "foreground",
    type = "genotypes",
    search = "isheep"
  )
)

# collect ids for isheep datasets
isheep_ids <- isheep_datasets$"_id.$oid"

# collect all foreground samples
foreground_samples <- get_smarter_samples(
  "Sheep",
  query = list(type = "foreground")
)

# get rid of isheep_samples
filtered_samples <- foreground_samples %>%
  dplyr::filter(!`dataset_id.$oid` %in% isheep_ids)
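The same exclusion can be written with standard R subsetting instead of dplyr. The sketch below runs on a toy data frame with made-up ids, so it does not need a connection to the API; only the dataset_id.$oid column name is taken from the example above.

```r
# toy stand-in for the samples returned by get_smarter_samples()
foreground_samples <- data.frame(
  smarter_id = c("S1", "S2", "S3"),
  `dataset_id.$oid` = c("aaa", "bbb", "ccc"),
  check.names = FALSE  # keep the non-syntactic column name as it is
)

# made-up id standing in for the isheep dataset ids
isheep_ids <- c("bbb")

# keep only the rows whose dataset id is NOT among the excluded ids
keep <- !(foreground_samples[["dataset_id.$oid"]] %in% isheep_ids)
filtered_samples <- foreground_samples[keep, , drop = FALSE]

filtered_samples
```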
The last selection example relies on the dataset file contents: if you remember the name of a file submitted in the dataset, you can search datasets by content:
datasets <- get_smarter_datasets(query = list(search = "adaptmap"))
pander::pander(datasets[, c("_id.$oid", "breed", "file")])
This time two results are returned, since one is a phenotypes dataset while the other is a genotypes dataset. To select only genotypes, simply add type = "genotypes" to the query parameter.
You can select samples by breed name or breed code. Breed names are written in the language they come from, so in order to retrieve Île de France or Fjällnäs breed samples, you have to specify the full breed name, or use the search parameter of the get_smarter_breeds() function, which models the /breeds endpoint:
breeds <- get_smarter_breeds(
  query = list(
    species = "Sheep",
    search = "de france"
  )
)
pander::pander(subset(breeds, select = c("name", "code")))
A breed search can return multiple results, for example:
breeds <- get_smarter_breeds(
  query = list(
    species = "Sheep",
    search = "merino"
  )
)
pander::pander(subset(breeds, select = c("name", "code")))
Names and codes can be used as they are to select samples, by passing multiple arguments to the query:
selected_samples <- get_smarter_samples(
  species = "Sheep",
  query = list(
    breed_code = "MER",
    breed_code = "AME"
  )
)
or to get all the samples with merino in the breed name:
# construct the query arguments
query <- as.list(breeds$code)
names(query) <- rep("breed_code", length(query))

# execute query
merino_samples <- get_smarter_samples(species = "Sheep", query = query)
You can retrieve samples by country. First get the list of available countries by searching on the country name, then extract samples using the correct country name:
italy <- get_smarter_countries(query = list(search = "italy"))

italian_background_sheeps <- get_smarter_samples(
  species = "Sheep",
  query = list(
    country = italy$name[1]
  )
)
You can select samples according to the chip they were genotyped with. If you search for multiple chip types, you will collect all samples which belong to any of the specified chips. First, collect a list of the available chips for a certain species:
sheep_chips <- get_smarter_supportedchips(query = list(species = "Sheep"))
pander::pander(subset(sheep_chips, select = -c(`_id.$oid`)))
Then collect samples relying on chip name, for example:
selected_samples <- get_smarter_samples(
  species = "Sheep",
  query = list(
    chip_name = "IlluminaOvineHDSNP",
    chip_name = "AffymetrixAxiomOviCan"
  )
)
Since metadata aren't formatted in the same way across samples, it is difficult to define a single query that applies to every sample. For the moment, the only queries you can apply to metadata are restricted to their presence or absence. For example, we can collect all samples which have both GPS coordinates and phenotypes (of any kind):
smarter_goats <- get_smarter_samples(
  species = "Goat",
  query = list(
    locations__exists = TRUE,
    phenotype__exists = TRUE
  )
)
After that, you have to filter the smarter_goats dataframe in order to keep only the samples you want.
Suppose you want to link the samples to the original publication they come from: the datasets dataframe has a doi column in which the publication DOI is stored. You can use this information to collect publication information, but you need to merge the samples and datasets dataframes in order to properly link each sample to the publication it comes from. Here is an example of how to do it:
# collect all the genotypes datasets
datasets <- get_smarter_datasets(
  query = list(
    species = "Sheep",
    type = "genotypes"
  )
)

# collect all the samples
samples <- get_smarter_samples(
  species = "Sheep"
)

# merge datasets and samples using dplyr
samples_with_doi <- dplyr::inner_join(
  datasets,
  samples,
  by = dplyr::join_by(`_id.$oid` == `dataset_id.$oid`)
) %>%
  dplyr::select(smarter_id, breed_code, breed.y, doi) %>%
  dplyr::filter(!is.na(doi))

# count how many samples come from each publication
samples_with_doi %>%
  group_by(doi) %>%
  summarise(counts = n())
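For readers who prefer standard R methods, the same join can be sketched with merge(). The two data frames below are toy stand-ins with made-up ids and a placeholder DOI; only the _id.$oid, dataset_id.$oid, smarter_id and doi column names come from the example above.

```r
# toy stand-in for the datasets table (the DOI is a placeholder)
datasets <- data.frame(
  `_id.$oid` = c("aaa", "bbb"),
  doi = c("10.0000/placeholder", NA),
  check.names = FALSE
)

# toy stand-in for the samples table
samples <- data.frame(
  smarter_id = c("S1", "S2", "S3"),
  `dataset_id.$oid` = c("aaa", "aaa", "bbb"),
  check.names = FALSE
)

# inner join on the dataset id, which has a different name in each table
merged <- merge(
  datasets, samples,
  by.x = "_id.$oid", by.y = "dataset_id.$oid"
)

# drop samples whose dataset has no DOI
samples_with_doi <- merged[!is.na(merged$doi), c("smarter_id", "doi")]

# count how many samples come from each publication
counts <- as.data.frame(
  table(doi = samples_with_doi$doi),
  responseName = "counts"
)
counts
```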
The genotypes can't be retrieved through the SMARTER API: they are available from the public FTP site. You need to download the files using an FTP client like FileZilla or lftp and then extract the genotypes using plink.
Alternatively, you can download the genotypes using the get_smarter_genotypes() function of this package. This function downloads the genotypes from the FTP site for the current release of the selected species and assembly, and returns the destination path of the downloaded archive:
downloaded_archive <- get_smarter_genotypes( "Sheep", "OAR3" )
This function downloads the genotypes into the current working directory, or into the directory specified by the dest_path argument. The genotypes are stored in a compressed .zip file, which needs to be decompressed before it can be used with plink.
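The archive can be decompressed from R itself with utils::unzip(). The helper extract_genotypes() below is hypothetical (it is not part of smarterapi), and the choice of extracting into a directory named after the archive is only an assumption made for this sketch.

```r
# hypothetical helper: extract a downloaded genotype archive and
# return the paths of the extracted files
extract_genotypes <- function(
    archive,
    exdir = tools::file_path_sans_ext(basename(archive))) {
  if (!file.exists(archive)) {
    stop("archive not found: ", archive)
  }
  # utils::unzip() returns the extracted file paths
  utils::unzip(archive, exdir = exdir)
}

# usage, assuming downloaded_archive holds the path returned by
# get_smarter_genotypes():
# genotype_files <- extract_genotypes(downloaded_archive)
```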
After you have identified the samples you are interested in, you can extract their genotypes from the proper file using plink. First, you have to write a TSV file with breed_code and smarter_id as columns. For example, using the samples selected above and dplyr:
selected_sheeps_ids <- italian_background_sheeps %>%
  dplyr::select("breed_code", "smarter_id")

write.table(
  selected_sheeps_ids,
  file = "selected_sheeps.txt",
  quote = FALSE,
  sep = "\t",
  row.names = FALSE,
  col.names = FALSE
)
Next, you need to collect the proper plink options in order not to lose information from the plink file. The parameters used to generate the genotype files are tracked in the info endpoint. In this example, we get the parameters for the Sheep genotypes:
info <- get_smarter_info()
plink_opts <- paste0(info$plink_specie_opt$Sheep, collapse = " ")
plink_opts
Finally, you can call plink, providing your sample list:
plink --chr-set 26 no-xy no-mt --allow-no-sex \
  --bfile SMARTER-OA-OAR3-top-0.4.10 \
  --keep selected_sheeps.txt \
  --out selected_sheeps-OAR3-top-0.4.10 \
  --make-bed
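If you prefer to stay inside R, the same command can be assembled and launched with system2(). This is only a sketch: the plink_opts string is a hard-coded stand-in for the value built from get_smarter_info(), the file prefix is the one used in the command above, and plink is only called when the executable and the .bed file are actually found.

```r
# stand-in for the options collected from the info endpoint (assumption)
plink_opts <- "--chr-set 26 no-xy no-mt --allow-no-sex"

# prefix of the extracted genotype files, as in the command above
bfile_prefix <- "SMARTER-OA-OAR3-top-0.4.10"

# one element per command-line token
plink_args <- c(
  strsplit(plink_opts, " ", fixed = TRUE)[[1]],
  "--bfile", bfile_prefix,
  "--keep", "selected_sheeps.txt",
  "--out", "selected_sheeps-OAR3-top-0.4.10",
  "--make-bed"
)

# run plink only if it is on the PATH and the input file exists
plink_available <- nzchar(Sys.which("plink"))
if (plink_available && file.exists(paste0(bfile_prefix, ".bed"))) {
  exit_status <- system2("plink", args = plink_args)
}
```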