In this vignette we provide a set of advantages of using quincunx over directly accessing the PGS Catalog REST API.
quincunx automatically handles lower-level tasks related to the GET requests, such as pagination and error handling. When a user of quincunx calls a get function, e.g. `get_scores()`, this translates to GET requests on one of the following endpoints: `/rest/score/all`, `/rest/score/{pgs_id}`, or `/rest/score/search`. The first and last endpoints return paginated JSON responses, meaning that logic is needed on the client side to iterate over all the paginated resources, making the respective GET requests so that all the data is fetched. quincunx automatically detects whether the endpoint is returning a paginated response (it is not always paginated) and iterates accordingly. Also, quincunx's get functions gracefully handle HTTP errors that might arise along the process: execution is not interrupted, data from successful responses is collected, and warnings are generated for the requests that failed. Moreover, user control is offered over these warnings, and a verbose flag is also provided in our functions for easy debugging of the underlying requests. These are menial features that are nevertheless important for a smooth usage of the REST API.
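As an illustration, the pagination and error handling described above can be observed by toggling the `verbose` and `warnings` arguments of the get functions. This is a minimal sketch: it assumes quincunx is installed, the Catalog is reachable, and that `PGS999999` is an identifier that does not exist in the Catalog.

```r
library(quincunx)

# verbose = TRUE prints the underlying GET requests as they are made,
# including the iteration over paginated responses.
scores <- get_scores(pgs_id = "PGS000013", verbose = TRUE)

# A request for a non-existent identifier does not abort execution:
# successful responses are kept, and failures surface as warnings.
scores <- get_scores(pgs_id = c("PGS000013", "PGS999999"), warnings = TRUE)
```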
quincunx's set of `get_<entity>()` functions provides an API that abstracts away the REST API endpoint construction and that best matches user expectations, i.e., one get function for each type of Catalog entity. For example, a user of quincunx only needs to know `get_scores()` to retrieve metadata about Polygenic Scores; more direct access to the REST API implies knowing the three endpoints and how to pass them their parameters. Also, having one function for all related endpoints means that we can provide a single interface for all search criteria. For example, in the case of `get_scores()`, the user may search by PGS identifier (`pgs_id`), trait identifier (`efo_id`) or PubMed identifier (`pubmed_id`), or even get all polygenic scores by not supplying any criteria. quincunx accepts all these options in a single interface, and provides an extra argument (`set_operation`) to specify how to combine the results obtained with the different criteria. Without a client like quincunx, all of this logic would have to be implemented anew.
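A sketch of this single interface follows. The specific EFO and PubMed identifiers are illustrative assumptions, not results verified against the Catalog.

```r
library(quincunx)

# Search by different criteria in one call; set_operation controls how
# the per-criterion results are combined. 'union' keeps scores matching
# any of the criteria...
scores_union <- get_scores(efo_id = "EFO_0001645",
                           pubmed_id = "30554720",
                           set_operation = "union")

# ...whereas 'intersection' keeps only scores matching all criteria.
scores_both <- get_scores(efo_id = "EFO_0001645",
                          pubmed_id = "30554720",
                          set_operation = "intersection")

# Omitting all criteria retrieves all polygenic scores in the Catalog.
all_scores <- get_scores()
```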
Retrieving data from a REST API into an environment such as the R programming language requires the conversion of JSON text into R objects, i.e., JSON deserialization. Here, we hinge on the R package tidyjson. More direct access methods to the REST API would require users to either learn a similar tool or implement one themselves.
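To give a flavour of what tidyjson does, here is a small self-contained sketch; the JSON snippet is invented for illustration and is not an actual Catalog response.

```r
library(tidyjson)

# A toy JSON array, loosely shaped like a Catalog response.
json <- '[{"id": "PGS000001", "trait_reported": "Breast cancer"},
          {"id": "PGS000002", "trait_reported": "Breast cancer"}]'

# gather_array() iterates over the array elements, and spread_all()
# turns the scalar members of each object into tibble columns.
json |>
  gather_array() |>
  spread_all()
```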
Most importantly, besides JSON deserialization, we chose to represent each Catalog entity as a relational database (quincunx's S4 objects). This does not follow automatically from the data structure in the JSON responses. This process required careful analysis of the Catalog data and of the relationships between the different data structures, since this information is not explicit in the Catalog documentation. Specifically, we partly reverse engineered the data model by studying the JSON responses, and communicated frequently with the Catalog team during the development of our package.
The actual implementation of the in-memory relational databases based on lists of tibbles provides a commonly used interface in the R community nowadays, i.e., the use of so-called tidy data, which allows taking advantage of the tidyverse toolkit for direct data analysis/modeling and visualization, e.g. with ggplot2.
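For example, a tibble from one of quincunx's objects can be piped straight into ggplot2. This is a hedged sketch: the `scores` slot and its `n_variants` column are assumptions about the object layout, and the EFO identifier is purely illustrative.

```r
library(quincunx)
library(ggplot2)

scores <- get_scores(efo_id = "EFO_0001645")

# The scores slot is a tidy tibble: one row per polygenic score.
ggplot(scores@scores, aes(x = n_variants)) +
  geom_histogram() +
  labs(x = "Number of variants in the score", y = "Number of scores")
```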
Each table in quincunx's relational databases is not simply an automatic result of parsing the JSON responses. The majority of the column names have been unified to follow a common naming scheme and, whenever possible, simplified to better communicate their biological meaning. For example, in nearly all JSON responses there is an `id` element which, depending on the endpoint, can stand for PGS id, PGP id, PSS id, PPM id, Trait id, etc.; in quincunx, these have been aptly mapped to `pgs_id`, `pgp_id`, `pss_id`, `ppm_id` or `efo_id` for clarity and to prevent id mistakes. Moreover, new identifiers have been created in quincunx's objects, allowing the connection of information between tables. These new identifiers are indicated as having a "local" scope (see `vignette('identifiers')`).
Basic data types, such as booleans, strings, integers and doubles, are correctly mapped to R's equivalent basic types in quincunx's tables. Note that JSON makes no distinction between integers and doubles, an ambiguity that can only be resolved by analysing the meaning of the data variables. Also, various representations of missing data (e.g., `"NR"`, `"Not Reported"`, `null`, `""`, etc.) are correctly mapped to `NA` in R. In the case of `null` JSON objects that have a nested schema, the corresponding relational tables are recreated with the correct columns and types (albeit empty) so that scripts do not break. This data structure consistency is an important feature for subsequent data wrangling, such as combining results that share the same structure. So although the tables inside our S4 objects might seem raw, they are actually the result of data tidying and cleaning, and of reverse engineering the relationships between objects so that they can be made into relational tables.
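The kind of missing-data normalisation performed internally can be sketched with plain tidyverse tools; this is a conceptual illustration, not quincunx's actual implementation.

```r
library(dplyr)

# Different spellings of "missing" as they may appear in responses.
x <- c("NR", "Not Reported", "", "0.35")

# Collapse all of them into R's NA, then coerce to the proper type.
cleaned <- x |>
  na_if("NR") |>
  na_if("Not Reported") |>
  na_if("")
as.numeric(cleaned)  # NA NA NA 0.35
```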
A very useful feature of our S4 objects is subsetting based on indices or identifiers. Moreover, subsetting quincunx's objects conveniently propagates to all tables (slots).
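For instance (a sketch assuming quincunx is installed and the ids resolve in the Catalog):

```r
library(quincunx)

my_scores <- get_scores(pgs_id = c("PGS000013", "PGS000014", "PGS000015"))

# Subset by position: the first two scores, consistently across all slots.
my_scores[1:2]

# Subset by identifier: keyed on the PGS id, again across all slots.
my_scores["PGS000013"]
```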
quincunx is a sibling package of gwasrapidd (which provides access to the GWAS Catalog) and incorporates the same design principles, particularly when it comes to data representation and wrangling. Users working in this field who have already used one of the tools will thus find their knowledge easily transferred to the other.