knitr::opts_chunk$set(echo = TRUE)
One fundamental operation when working with pangenomic data is to subset it in order to better analyze it. pagoo
provides three convenient ways that can help in different scenarios. First we will see predefined subsets, then we will understand how to use the classic R's [
operator with pagoo
objects, and finally we will see how to temporarily remove one or more organisms from the dataset, and then to recover them.
Start the tutorial by loading the included Campylobacter pangenome.
library(pagoo) # Load package toy_rds <- system.file('extdata', 'campylobacter.RDS', package = 'pagoo') p <- load_pangenomeRDS(toy_rds)
Predefined subsets were already superficially seen in 3 - Querying Data tutorial. Here we will review them and understand its notation.
| - | $core_*
| $shell_*
| $cloud_*
|
| --- | --- | --- | --- |
| $*_genes
| $core_genes
| $shell_genes
| $cloud_genes
|
| $*_clusters
| $core_clusters
| $shell_clusters
| $cloud_clusters
|
| $*_sequences
| $core_sequences
| $shell_sequences
| $cloud_sequences
|
As seen in the above table, the notation is quite straightforward. Lets take clusters
for instance to illustrate it better.
p$clusters p$core_clusters p$shell_clusters p$cloud_clusters
[
operatorUsing [
for subset vectors, lists, or matrices in R is one of the most common operations we, as R users, do every day. Methods using [
for subsetting can be divided into 2 types:
In the following subsections we will see how to use them to subset $pan_matrix
, $genes
, $clusters
, $sequences
, or$organisms
fields.
When we subset a vector or list in R we use a single number or a vector of numbers, lets say from i
to j
, to pick those elements by indexes: x[i:j]
. In pagoo
we implement a method for this generic function to subset data fields, where indexes represent clusters. So instead of subsetting directly the final object (which you can of course) you apply this method directly to the object, and then select any of the data fields.
# From clusters 1 to 3 p[c(1, 5, 18)]$pan_matrix p[c(1, 5, 18)]$genes p[c(1, 5, 18)]$clusters p[c(1, 5, 18)]$sequences p[c(1, 5, 18)]$organisms
Note that in each case, indexes can be interpreted as clusters. In pan_matrix
they refer to columns, in genes
and sequences
to elements in a list, in clusters
rows in the dataframe, and in the case of $organisms
it returns the organisms (rows) where those clusters are present. See what happen for this last case when we query for organisms present for a shell cluster:
shell_clust <- p$shell_clusters$cluster[1] # [1] "OG0005" p[shell_clust]$organisms
Only (6) organisms which contain that shell cluster will be listed.
The other use of [
notation is when we subset a matrix-like object. In this case we provide 2 sets of indexes: the first for rows, and the second for columns, separated by a coma. pagoo
interprets these indexes as organisms (rows) and clusters (columns), so you are referencing a set of genes (cell value). Basically you are referencing cells in the pan_matrix.
p$pan_matrix[1:3, c(1, 5, 18)]
Here we are selecting the first 3 organisms, and the clusters indexed as 1, 5, and 18. Note that we see (sum columns) 3 genes in the first cluster, 3 in the second, and 0 in the third. Now we flip the notation and subset directly the pagoo
object, and then ask for a data field:
p[1:3, c(1, 5, 18)]$pan_matrix # The same as above p[1:3, c(1, 5, 18)]$genes p[1:3, c(1, 5, 18)]$clusters p[1:3, c(1, 5, 18)]$sequences p[1:3, c(1, 5, 18)]$organisms
So this selection returns a list of length 2, with 3 elements each for $genes
and $sequences
, a dataframe showing only the selected clusters in the $clusters
field, and a dataframe showing only the selected organisms in the $organisms
field.
We found this implementation quite useful for both data exploration and fine grained analysis.
One useful feature implemented in pagoo
is the possibility of easily removing (hiding) organisms from the dataset. This is useful if during the analysis we identify some genome with weird characteristics (i.e. potentially contaminated), or if we want to focus just in a subset of genomes of interest given any metadata value, or if we included an outgroup for phylogenetic purposes but we want to remove it from downstream analyses. Let's see how it works..
p$organisms p$summary_stats
We can see that we have 7 organisms, and we have the pangenome summary statistics. Let's say, for instance, that we want to exclude french isolates (because any reason) from the dataset, which are indexed 1 and 2 (rows) in the above dataframe.
p$drop(1:2) p$organisms # Updated organisms !! p$summary_stats # Updated stats !!
As you see, not only the french organisms have been removed from the dataset, but also summary statistics have been updated! This is also true for any other data field: $pan_matrix
, $genes
, $clusters
, $sequences
, and$organisms
fields are automatically updated. Also any embedded statistical or visualization methods will now only consider available organisms/clusters/genes. When we drop an organisms, we are hiding all features associated to that genomes including genes, sequences, gene clusters and metadata.
It's important to note that you don't have to reassign the object to a new one, it is self modified (in place modification). That's R6 reference semantics (see R6 documentation for details), use with caution.
You can recover dropped organisms. To see if we have any hidden organism, we use the $dropped
field, and then recover it using its index or its name.
p$dropped p$recover( p$dropped ) p$organisms
Now french isolates and all its features are again available.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.