View source: R/read_functions.R
pagoo | R Documentation |
This is the main function to load a pagoo object. It is safer and
friendlier than using pagoo's class constructors (PgR6, PgR6M, and PgR6MS).
This function returns either a PgR6M class object or a PgR6MS class object,
depending on the parameters provided. If sequences are provided, it returns
the latter. See below for more details.
pagoo( data, org_meta, cluster_meta, sequences, core_level = 95, sep = "__", verbose = TRUE )
data |
A |
org_meta |
(optional) A |
cluster_meta |
(optional) A |
sequences |
(optional) Can accept: 1) a named |
core_level |
The initial core_level (that is, the percentage of organisms a
core cluster must be in to be considered as part of the core genome). Must
be a number between 85 and 100 (default: 95). You can change it later by
using the |
sep |
A separator. The default is '__' (two underscores). It will be used
to create a unique |
verbose |
|
This package uses [R6](https://r6.r-lib.org/articles/Introduction.html) classes to provide a unified, comprehensive, and standardized, yet flexible, way to analyze a pangenome. The idea is to have a single object that contains both the data and the basic methods to analyze them, lets you manipulate and explore its fields, and works in harmony with the extensive list of existing R packages for comparative genomics and genetics.
For more information, tutorials, and resources, please visit https://iferres.github.io/pagoo/ .
$pan_matrix
The panmatrix. Rows are organisms, and
columns are groups of orthologs. Cells indicate the presence (>=1) or
absence (0) of a given gene in a given organism. Cells can have values
greater than 1 if they contain in-paralogs.
$organisms
A DataFrame with available organism names, and organism number identifiers as rownames().
(Dropped organisms will not be displayed in this field, see $dropped below.)
Additional metadata will be shown as additional columns, if provided.
$clusters
A DataFrame with the groups of orthologs (clusters), one row per cluster.
Additional metadata, if provided, will be shown as additional columns.
$genes
A SplitDataFrameList object with one entry per cluster. Each element contains
a DataFrame with gene ids (<gid>) and additional metadata, if provided. gids are
created by paste()ing organism and gene names, so duplicated gene names
are avoided.
$sequences
A DNAStringSetList with the set of sequences grouped by cluster. Each group
is accessible as in a list. All Biostrings methods are available.
$core_level
The percentage of organisms a gene must be in
to be considered part of the core genome. core_level = 95 by default.
It can't be set above 100, and setting it below 85 raises a warning.
$core_clusters
Like $clusters, but only showing core clusters.
$core_sequences
Like $sequences, but only showing core sequences.
$cloud_genes
Like $genes, but only showing cloud genes.
These are defined as those clusters which contain a single gene (singletons), plus
those which have more than one gene but whose organisms are probably clonal due to
identical general gene content. Colloquially defined as strain-specific genes.
$cloud_clusters
Like $clusters, but only showing cloud clusters as defined above.
$cloud_sequences
Like $sequences, but only showing cloud sequences as defined above.
$shell_genes
Like $genes, but only showing shell genes.
These are defined as those clusters that belong neither to the core genome
nor to the cloud genome. Colloquially defined as genes that are present in some but not
all strains, and that aren't strain-specific.
$shell_clusters
Like $clusters, but only showing shell clusters, as defined above.
$shell_sequences
Like $sequences, but only showing shell sequences, as defined above.
$summary_stats
A DataFrame with information about the number of core, shell, and cloud clusters,
as well as the total number of clusters.
$random_seed
The last .Random.seed. Used for reproducibility purposes only.
$dropped
A character vector with dropped organism names, and organism number identifiers as names().
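To make the relationship between $pan_matrix, $core_level, and the core/shell/cloud fields concrete, the following base-R sketch builds a toy panmatrix and labels each cluster. It is not pagoo's implementation: the input column names (gene, org, cluster) are assumptions, the core threshold is applied as a plain percentage, and cloud clusters are reduced to singletons only (the clonality check described above is omitted).

```r
# Toy long-format table; the column names are assumptions made for this sketch.
toy <- data.frame(
  gene    = c("g1", "g2", "g3", "g1", "g2", "g1", "g2"),
  org     = c("orgA", "orgA", "orgA", "orgB", "orgB", "orgC", "orgC"),
  cluster = c("c1", "c2", "c2", "c1", "c2", "c1", "c3"),
  stringsAsFactors = FALSE
)

# Rows are organisms, columns are clusters, cells are gene counts
# (orgA carries two genes in c2, i.e. in-paralogs).
pan_matrix <- unclass(table(toy$org, toy$cluster))

# Percentage of organisms in which each cluster is present.
core_level <- 95
pct_present <- colMeans(pan_matrix >= 1) * 100

# Simplified labelling: core by threshold, cloud = singletons, shell = the rest.
n_genes <- colSums(pan_matrix)
category <- ifelse(pct_present >= core_level, "core",
                   ifelse(n_genes == 1, "cloud", "shell"))
print(pan_matrix)
print(category)
```

In this toy dataset c1 is labelled core, c2 shell, and c3 cloud.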
Below is a comprehensive description of all the methods provided by the object.
Add metadata to the object. You can add metadata to each organism, to each
group of orthologs, or to each gene. Elements with missing data should be filled
with NA (the dimensions of the provided data.frame must be coherent with the
object data).
$add_metadata(map = 'org', df)
map: character identifying the metadata to map. Can be one of "org", "group", or "gid".
df: data.frame or DataFrame with the metadata to add. In each case, a column
named after the value of "map" must exist, containing the identifiers of each
element. When adding gene (gid) metadata, each gene should be referenced by the
name of the organism and the name of the gene as provided in the "data"
data.frame, separated by the "sep" argument.
self invisibly, but with additional metadata.
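As a minimal sketch of how gene-level identifiers could be assembled before calling $add_metadata(), the snippet below pastes organism and gene names with the separator. The table contents and the product column are made up for illustration, and pg in the commented call is a hypothetical, already created pagoo object.

```r
sep <- "__"  # same separator as the sep argument

# Organism and gene names as they would appear in the input data (toy values).
genes <- data.frame(
  org  = c("orgA", "orgA", "orgB"),
  gene = c("g1", "g2", "g1"),
  stringsAsFactors = FALSE
)

# Metadata table keyed by a "gid" column; missing values are filled with NA.
gene_meta <- data.frame(
  gid     = paste(genes$org, genes$gene, sep = sep),
  product = c("kinase", NA, "kinase"),
  stringsAsFactors = FALSE
)
print(gene_meta)

# Hypothetical call on an existing pagoo object named pg:
# pg$add_metadata(map = "gid", df = gene_meta)
```

Because the organism name is part of the identifier, the gene name g1 can appear in both orgA and orgB without clashing.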
Drop an organism from the dataset. This method hides an organism from
the real dataset, ignoring it in downstream analyses. All the fields and
methods will behave as if it didn't exist. For instance, if you decide to drop
organism 1, the $pan_matrix field (see above) would not show it when called.
$drop(x)
x: character or numeric. The name of the organism to be dropped, or its
numeric id as returned in the $organisms field (see above).
self invisibly, but with x dropped. It isn't necessary
to assign the function call to a new object, nor to re-write it, as R6 objects
are mutable.
Recover a previously $drop()ped organism (see above). All fields
and methods will start to behave considering this organism again.
$recover(x)
x: character or numeric. The name of the organism to be recovered, or its
numeric id as returned in the $dropped field (see above).
self invisibly, but with x recovered. It isn't necessary
to assign the function call to a new object, nor to re-write it, as R6 objects
are mutable.
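The in-place behaviour of $drop() and $recover() follows from reference semantics. The sketch below mimics it with a plain environment instead of an R6 class, so it runs in base R; make_toy_pg and its members are hypothetical names, not part of pagoo.

```r
# Environments, like R6 objects, are modified in place rather than copied.
make_toy_pg <- function(orgs) {
  self <- new.env()
  self$all_orgs <- orgs
  self$dropped <- character(0)
  # Visible organisms are all organisms minus the dropped ones.
  self$organisms <- function() setdiff(self$all_orgs, self$dropped)
  self$drop <- function(x) {
    self$dropped <- union(self$dropped, x)
    invisible(self)
  }
  self$recover <- function(x) {
    self$dropped <- setdiff(self$dropped, x)
    invisible(self)
  }
  self
}

toy <- make_toy_pg(c("orgA", "orgB", "orgC"))

toy$drop("orgB")                       # no reassignment needed
visible_after_drop <- toy$organisms()

toy$recover("orgB")
visible_after_recover <- toy$organisms()

print(visible_after_drop)
print(visible_after_recover)
```

Returning the object invisibly keeps the console quiet while still allowing calls to be chained.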
Write the pangenome data as flat tables (text). This is not the most recommended way
to save a pangenome, since you can lose information such as numeric precision,
column classes (factor, numeric, integer), and the state of the object itself
(i.e. dropped organisms, or core_level), losing reproducibility. Use
save_pangenomeRDS for a more precise way of saving a pagoo object.
Still, it is useful if you want to work with the data outside R; just keep
the above in mind.
$write_pangenome(dir = "pangenome", force = FALSE)
dir: The name of a non-existent directory where the data files will be written.
Default is "pangenome".
force: logical. Whether to overwrite the directory if it already exists. Default: FALSE.
A directory with at least 3 files. "data.tsv" contains the basic
pangenome data as provided to the data argument in the
initialization method ($new(...)). "clusters.tsv" contains any metadata
associated with the clusters. "organisms.tsv" contains any metadata associated with
the organisms. The latter 2 files will contain a single column if no metadata
was provided.
Save a pagoo pangenome object. This function provides a method for saving a pagoo
object and its state into an "RDS" file. To load the pangenome, use the
load_pangenomeRDS function in this package. It *should* be compatible between
pagoo versions, so you could update pagoo and still recover the same pangenome. Even
sep and core_level are restored unless the user provides those
arguments in load_pangenomeRDS. Dropped organisms are also kept hidden, as if
you were working with the original object.
$save_pangenomeRDS(file = "pangenome.rds")
file: The name of the file to save. Default: "pangenome.rds".
Writes a list with all the information needed to restore the object by using the load_pangenomeRDS function, into an RDS (binary) file.
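The trade-off between flat tables and RDS files described above can be reproduced with base R alone. This sketch uses a small made-up data.frame rather than a pagoo object, and only demonstrates the loss of column classes.

```r
# A toy table with a factor column, standing in for pangenome metadata.
df <- data.frame(
  cluster = factor(c("c1", "c2")),
  score   = c(1.5, 2.5)
)

tsv_file <- tempfile(fileext = ".tsv")
rds_file <- tempfile(fileext = ".rds")

# Flat text round trip: the factor comes back as plain character.
write.table(df, tsv_file, sep = "\t", quote = FALSE, row.names = FALSE)
from_tsv <- read.delim(tsv_file, stringsAsFactors = FALSE)

# RDS round trip: the object is restored as it was saved.
saveRDS(df, rds_file)
from_rds <- readRDS(rds_file)

print(class(from_tsv$cluster))
print(class(from_rds$cluster))

unlink(c(tsv_file, rds_file))
```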
The objects of this class are clonable with this method.
$clone(deep = FALSE)
deep: Whether to make a deep clone.
Compute distance between all pairs of genomes. The default dist method is
"bray" (Bray-Curtis distance). Another commonly used distance method is "jaccard",
but you should set binary = TRUE (see below) to obtain a meaningful result.
See vegdist for details; this is just a wrapper function.
$dist(
method = "bray",
binary = FALSE,
diag = FALSE,
upper = FALSE,
na.rm = FALSE,
...
)
method: The distance method to use. See vegdist for available methods, and details for each one.
binary: Transform abundance matrix into a presence/absence matrix before computing distance.
diag: Compute diagonals.
upper: Return only the upper diagonal.
na.rm: Pairwise deletion of missing observations when computing dissimilarities.
...: Other parameters. See vegdist for details.
A dist object containing all pairwise dissimilarities between genomes.
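For readers who want to see what these dissimilarities measure, here is a hand-rolled base-R sketch of the Bray-Curtis dissimilarity on gene counts and of a presence/absence Jaccard dissimilarity. It does not call vegdist; the toy matrix and the helper names bray and jaccard_binary are assumptions of this example.

```r
# Toy panmatrix: rows are organisms, columns are clusters, cells are gene counts.
pm <- matrix(
  c(1, 2, 0,
    1, 1, 0,
    1, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("orgA", "orgB", "orgC"), c("c1", "c2", "c3"))
)

# Bray-Curtis: summed absolute differences over summed abundances.
bray <- function(x, y) sum(abs(x - y)) / sum(x + y)

# Jaccard on presence/absence: 1 - shared clusters / clusters in either genome.
jaccard_binary <- function(x, y) {
  a <- x > 0
  b <- y > 0
  1 - sum(a & b) / sum(a | b)
}

# Fill a square matrix with all pairwise Bray-Curtis values, then convert it.
n <- nrow(pm)
d <- matrix(0, n, n, dimnames = list(rownames(pm), rownames(pm)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- bray(pm[i, ], pm[j, ])
  }
}
bray_dist <- as.dist(d)
jac_AC <- jaccard_binary(pm["orgA", ], pm["orgC", ])

print(bray_dist)
print(jac_AC)
```

With counts, orgA and orgB differ only by one extra paralog in c2 (Bray-Curtis 0.2), whereas on presence/absence they are identical.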
Performs a principal components analysis on the panmatrix.
$pan_pca( center = TRUE, scale. = FALSE, ...)
center: a logical value indicating whether the variables should be shifted
to be zero centered. Alternatively, a vector of length equal to the number of columns of x can be
supplied. The value is passed to scale().
scale.: a logical value indicating whether the variables should be scaled
to have unit variance before the analysis takes place. The default is FALSE.
...: Other arguments. See prcomp.
Returns a list with class "prcomp". See prcomp for more information.
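A minimal sketch of the same kind of analysis with stats::prcomp() on a toy panmatrix is shown below; the matrix is invented for illustration. It also shows one practical reason to leave scale. = FALSE: a cluster present once in every organism is a constant column, which prcomp() refuses to rescale.

```r
# Toy panmatrix: c1 is present once in every organism (a constant column).
pm <- matrix(
  c(1, 2, 0,
    1, 1, 0,
    1, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("orgA", "orgB", "orgC"), c("c1", "c2", "c3"))
)

# Centered, unscaled PCA, mirroring the defaults in the usage above.
pca <- prcomp(pm, center = TRUE, scale. = FALSE)
print(pca$x)  # organism coordinates on the principal components

# Scaling fails on the constant column; keep the error message for inspection.
scaled_attempt <- tryCatch(
  prcomp(pm, center = TRUE, scale. = TRUE),
  error = function(e) conditionMessage(e)
)
print(scaled_attempt)
```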
Fits a power law curve for the pangenome rarefaction simulation.
$pg_power_law_fit(raref, ...)
raref: (Optional) A rarefaction matrix, as returned by rarefact().
...: Further arguments to be passed to rarefact(). If raref
is missing, it will be computed with default arguments, or with the ones provided here.
A list of two elements: $formula with a fitted function, and $params
with fitted parameters. An attribute "alpha" is also returned (if
alpha>1, then the pangenome is closed, otherwise it is open).
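One common way to fit a power law is a linear regression on log-transformed values. The sketch below applies it to synthetic, noise-free data; it is not taken from pagoo's source, the names K and delta are assumptions of this example, and alpha is deliberately not computed because its definition is not given here.

```r
# Synthetic pangenome sizes following y = K * x^delta (values are made up).
n_genomes <- 1:10
pan_size <- 1000 * n_genomes^0.3

# log(y) = log(K) + delta * log(x), so a straight-line fit recovers both.
fit <- lm(log(pan_size) ~ log(n_genomes))
K <- exp(coef(fit)[[1]])
delta <- coef(fit)[[2]]

# A fitted function, analogous in spirit to the $formula element.
fitted_fun <- function(x) K * x^delta

print(c(K = K, delta = delta))
print(fitted_fun(20))  # extrapolated size at 20 genomes
```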
Fits an exponential decay curve for the coregenome rarefaction simulation.
$cg_exp_decay_fit(raref, pcounts = 10, ...)
raref: (Optional) A rarefaction matrix, as returned by rarefact().
pcounts: An integer of pseudo-counts. This is used to better fit the function
at small numbers, as the linearization method requires subtracting a constant C, which is the
coregenome size, from y. As y approaches the coregenome size, this difference
tends to 0 and its logarithm diverges to negative infinity. By default pcounts=10.
...: Further arguments to be passed to rarefact(). If raref
is missing, it will be computed with default arguments, or with the ones provided here.
A list of two elements: $formula with a fitted function, and $params
with fitted intercept and decay parameters.
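The role of pcounts can be illustrated with a base-R sketch of the linearization described above. It is an illustration under stated assumptions, not pagoo's code: the data are synthetic, the smallest observed value stands in for the coregenome size C, and pcounts is added before taking the logarithm.

```r
# Synthetic coregenome sizes decaying towards a plateau (values are made up).
x <- 1:10
y <- 1500 * exp(-0.5 * x) + 500

pcounts <- 10
core_size <- min(y)  # assumption: the smallest observed value stands in for C

# Without pseudo-counts the last point becomes log(0) = -Inf.
unshifted_log <- log(y - core_size)

# With pseudo-counts every value stays finite and a linear fit is possible.
shifted <- y - core_size + pcounts
fit <- lm(log(shifted) ~ x)
intercept <- coef(fit)[[1]]
decay <- coef(fit)[[2]]

# Back-transform the linear fit to the original scale.
fitted_fun <- function(x) exp(intercept + decay * x) + core_size - pcounts

print(unshifted_log)
print(c(intercept = intercept, decay = decay))
```

A larger pcounts stabilizes the logarithm further but flattens the fitted curve, so the value is a trade-off rather than a neutral constant.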
Computes the genomic fluidity, which is a measure of population
diversity. See fluidity for more details.
$fluidity(nsim = 10)
nsim: An integer specifying the number of random samples
to use in the computations.
A list with two elements, the mean fluidity and its sample standard deviation over the nsim computed values.
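As a rough illustration of the quantity involved, the sketch below computes the pairwise ratio of unique clusters to total clusters, (U_k + U_l) / (M_k + M_l), for every pair of genomes in a toy panmatrix and averages it. Unlike the method above, it enumerates all pairs instead of drawing nsim random samples, and the matrix is invented for the example.

```r
# Toy panmatrix converted to presence/absence.
pm <- matrix(
  c(1, 2, 0,
    1, 1, 0,
    1, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("orgA", "orgB", "orgC"), c("c1", "c2", "c3"))
)
pa <- pm > 0

# All unordered pairs of organisms, one pair per column.
org_pairs <- combn(nrow(pa), 2)

# For each pair: clusters unique to either genome over clusters in both genomes.
ratios <- apply(org_pairs, 2, function(p) {
  k <- pa[p[1], ]
  l <- pa[p[2], ]
  (sum(k & !l) + sum(l & !k)) / (sum(k) + sum(l))
})

fluidity_mean <- mean(ratios)
fluidity_sd <- sd(ratios)
print(list(Mean = fluidity_mean, Std = fluidity_sd))
```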
Plot a barplot with the frequency of genes within the total number of genomes.
$gg_barplot()
A barplot, and a gg object (ggplot2 package) invisibly.
Plot a heatmap showing the computed distance between all pairs of organisms.
$gg_dist(method = "bray", ...)
method: Distance method. Defaults to "bray", as shown in the usage; see $dist() above
for the available methods.
...: More arguments to be passed to $dist() (see above).
A heatmap (ggplot2::geom_tile()), and a gg object (ggplot2 package) invisibly.
Plot a pangenome binary map representing the presence/absence of each gene within each organism.
$gg_binmap()
A binary map (ggplot2::geom_raster()), and a gg object (ggplot2 package) invisibly.
Plot a scatter plot of a Principal Components Analysis.
$gg_pca(colour = NULL, ...)
colour: The name of the column in the $organisms field
from which points will take color (if provided). NULL (default) renders
black points.
...: More arguments to be passed to ggplot2::autoplot().
A scatter plot (ggplot2::autoplot()), and a gg object (ggplot2 package) invisibly.
Plot a pie chart showing the number of clusters of each pangenome category: core, shell, or cloud.
$gg_pie()
A pie chart (ggplot2::geom_bar() + coord_polar()), and a
gg object (ggplot2 package) invisibly.
Plot pangenome and/or coregenome curves with the fitted functions returned by
pg_power_law_fit() and cg_exp_decay_fit(). You can add points by
adding + geom_point(), from the ggplot2 package.
$gg_curves(what = c("pangenome", "coregenome"), ...)
what: "pangenome" and/or "coregenome".
...: ignored
A scatter plot, and a gg object (ggplot2 package) invisibly.
Launch an interactive shiny app. It contains a sidebar with controls and switches to interact with the pagoo object. You can drop/recover organisms from the dataset, modify the core_level, visualize statistics, plots, and browse cluster and gene information. In the main body, it contains 2 tabs to switch between summary statistics plots and core genome information on one side, and accessory genome plots and information on the other.
The lower part of each tab contains two tables, side by side. On the "Summary" tab, the left one contains information about core clusters, with one cluster per row. When one of them is selected (click), the one on the right is updated to show information about its genes (if provided), one gene per row. On the "Accessory" tab, a similar configuration is shown, but in this case only accessory clusters/genes are displayed. There is a slider on the sidebar where one can select the accessory frequency range to display.
Give it a try!
Take into account that big pangenomes can slow down the performance of the app. More than 50-70 organisms often lead to a delay in the update of the plots/tables.
$runShinyApp()
Opens a shiny app on the browser.
A field for obtaining core gene sequences is available (see above), but for creating a phylogeny with these sets it is useful to: 1) be able to extract just one sequence of each organism from each cluster, in case paralogues are present, and 2) fill gaps with empty sequences in case the core_level was set below 100%, allowing more genes (some not in 100% of organisms) to be incorporated into the phylogeny. That is the purpose of this special function.
$core_seqs_4_phylo(max_per_org = 1, fill = TRUE)
max_per_org: Maximum number of sequences of each organism
to be taken from each cluster.
fill: logical. Whether to fill the DNAStringSet with empty DNAString objects
in cases where core_level is set below 100% and some clusters with missing
organisms are also considered.
A DNAStringSetList with core genes. The order of organisms in each cluster
is conserved, so it is easier to concatenate them into a super-gene suitable
for phylogenetic inference.