In maialab/quincunx: REST API Client for the 'PGS' Catalog

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

PGS Catalog entities are represented in quincunx as S4 objects. In this article, we explain how to subset these objects using the [ operator. In a nutshell, we provide subsetting by either position or by the object's respective identifier. The main entities/objects are:

Scores
Publications
Traits
Sample Sets
Performance Metrics

The general approach to subset the various S4 objects is the same. Hence, to avoid repetition, we only provide a set of comprehensive examples for the scores object. Subsetting with the other objects is only illustrated when subsetting with identifiers to emphasise that different objects have different associated identifiers.

If you do not know how to subset the tables included in the S4 objects, please take a look at Subsetting tibbles.

Start by loading quincunx:

library(quincunx)

Subsetting scores

Subsetting scores by position

For illustrative purposes, let us get some arbitrary polygenic scores objects, say, the first 10 PGSs in the catalog:

pgs_ids <- sprintf('PGS%06d', 1:10)
my_scores <- get_scores(pgs_ids)

The object my_scores is an S4 object of class scores, see class?scores for details. In quincunx, each S4 object contains at least one (the first) table where each observation refers to an entity. To access tables in S4 objects you use the @ operator. The first table in my_scores is scores:

my_scores@scores

nrow(my_scores@scores)

This table has as many rows as polygenic scores. This is one way of knowing how many scores there are in the object. Alternatively, you can use the function n() on the object:

quincunx::n(my_scores)

It is important to know the number of scores if you plan to subset the my_scores object by position. In this case there are r quincunx::n(my_scores) scores. If you want to subset the first, fifth, and tenth score, then you could do:

my_scores[c(1, 5, 10)]@scores[1:2]

This returns a new object containing only the data for the scores "PGS000001", "PGS000005" and "PGS000010".

Notice that this operation automatically traverses all tables in the my_scores object and subsets all tables accordingly keeping only those rows corresponding to the first, fifth and tenth scores. For example, compare the table samples from the my_scores object before and after the subsetting.

Before subsetting:

my_scores@samples[1:4]

After subsetting with c(1, 5, 10):

my_scores[c(1, 5, 10)]@samples[1:4]

Subsetting scores by identifer

To subset by identifier you simply use a character vector with the identifiers of interest. Let us say now you want two identifiers: "PGS000002" and "PGS000008". Then only you need to do is:

my_scores[c('PGS000002', 'PGS000008')]@scores[1:2]

Subsetting using repeated positions or identifiers

Please note that if you repeat the same position or identifier, you will get that score repeated:

my_scores[c('PGS000003', 'PGS000003')]@scores[1:2]

quincunx::n(my_scores[c('PGS000003', 'PGS000003')])

Or using the third position twice:

my_scores[c(3, 3)]@scores[1:2]

quincunx::n(my_scores[c(3, 3)])

Subsetting using negative positions

Just like with basic R objects, we can also use negative indices to drop elements of an object. This is also supported with quincunx's S4 objects. For example, to drop now the first, fifth and tenth score:

# Notice the minus sign before c(1, 5, 10)
my_scores[-c(1, 5, 10)]@scores[1:2]

Subsetting with non-existing positions or identifiers

If you request a position or identifier that does not match in the object, the result is an empty object. For example, the 11th position is not present in my_scores so the returned object is empty:

my_scores[11]@scores[1:2]

quincunx::n(my_scores[11])

Please note that the returned object is still a valid scores object and that it contains all the expected tables of such an object. It is just that all tables have no rows. The same behaviour is to be expected if you try to subset with non-existing identifiers:

my_scores['PGS000011']@scores[1:2]

quincunx::n(my_scores['PGS000011'])

Subsetting Publications

Subsetting publications objects, or any other S4 object in quincunx, works exactly the same way as described for scores. The only difference is that identifiers have to be changed accordingly. So in the next sections we only show how to subset using the respective identifiers.

# Get all publications where Abraham G is an author
my_publ <- get_publications(author = 'Abraham G')

# Note that the column `author_fullname` corresponds to the first author.
my_publ@publications[c('pgp_id', 'pubmed_id', 'publication_date', 'author_fullname')]

By visual inspection we can see that Abraham G is the first author in r knitr::combine_words(my_publ@publications[my_publ@publications$author_fullname == 'Abraham G', 'pgp_id', drop = TRUE]).

To keep only those publications we subset the publication object my_publ by those PGP identifiers:

my_publ[c('PGP000005', 'PGP000027', 'PGP000028', 'PGP000029')]@publications[c('pgp_id', 'pubmed_id', 'publication_date', 'author_fullname')]

Subsetting Traits

To illustrate subsetting of a traits object with EFO identifiers, let us say you'd like to create a traits object with traits whose trait name contained the keyword "lymph". To do this, we will start by downloading all traits into a traits object. Then we look for the term "lymph" in the trait column, and find which EFO identifiers are matched. Finally, we will use those identifiers to create a traits object containing only those matched identifiers.

Get all traits:

all_traits <- get_traits(interactive = FALSE)

Find which traits have in their name (trait column of traits table) the term "lymph" (we use grep for this):

lymph_traits_positions <- grep('lymph', all_traits@traits$trait)

all_traits[lymph_traits_positions]@traits[c('efo_id', 'trait')]

Select only those EFO identifiers whose trait name contained "lymph":

my_efo_ids <- all_traits[lymph_traits_positions]@traits$efo_id
my_efo_ids

Finally, create a new traits object (traits_only_lymph) with only those traits matching "lymph" by subsetting by identifier:

traits_only_lymph <- all_traits[my_efo_ids]

Confirm that indeed only those traits with "lymph" in the name are present:

traits_only_lymph@traits[c(1, 4)]

You might have noticed that we could have used lymph_traits_positions to subset all_traits by position instead to the same effect. That would have been more straightforward, but the point here is to illustrate subsetting with EFO identifiers. Moreover, as an exercise, you might want to compare the results obtained with this example with:

# Get traits containing the term 'lymph' in the name or its description
get_traits(trait_term = 'lymph', exact_term = FALSE)

# Get traits whose name is exactly 'lymph'
get_traits(trait_term = 'lymph', exact_term = TRUE)

Subsetting Sample Sets

To subset PGS Sample Sets you use identifiers of the form: "PSS000000". Here's a simple example where we download two Sample Sets ("PSS000008" and "PSS000042"), and afterwards we take "PSS000008":

my_sample_sets <- get_sample_sets(pss_id = c('PSS000008', 'PSS000042'))

# Table `samples` contains the samples that comprise this Sample Set
my_sample_sets['PSS000008']@samples[1:6]

Subsetting Performance Metrics

Without much more creativity, you subset Performance Metrics objects with identifiers of the form: "PPM000000". Example:

my_perf_metrics <- get_performance_metrics(ppm_id = c('PPM000001', 'PPM000002'))

# Table `samples` contains the samples that comprise this Performance Metrics
my_perf_metrics['PPM000002']@samples[1:6]